mariadb cluster recovery

mariadb cluster recovery

列几个集群检查命令

1
2
3
4
5
6
7
MariaDB [(none)]> show status like 'wsrep_incoming_addresses';
+--------------------------+----------------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------------+
| wsrep_incoming_addresses | 10.0.0.51:3306,10.0.0.52:3306,10.0.0.53:3306 |
+--------------------------+----------------------------------------------+
1 row in set (0.01 sec)

1
2
3
4
5
6
7
MariaDB [(none)]> show status like 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name | Value |
+--------------------+-------+
| wsrep_cluster_size | 3 |
+--------------------+-------+
1 row in set (0.01 sec)
1
2
3
4
5
6
7
MariaDB [(none)]> show status like 'wsrep_cluster_state_uuid';
+--------------------------+-----------------------------------------+
| Variable_name | Value |
+--------------------------+-----------------------------------------+
| wsrep_cluster_state_uuid | 345098abd2-291a-9893-acbd3-30923abcdef9 |
+--------------------------+-----------------------------------------+
1 row in set (0.01 sec)
1
2
3
4
5
6
7
MariaDB [(none)]> show status like 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name | Value |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.01 sec)

场景一

只是很简单的某节点掉线,重启让其重新加入集群即可

1
2
3
4
5
6
7
MariaDB [(none)]> show status like 'wsrep_incoming_addresses';
+--------------------------+-------------------------------+
| Variable_name | Value |
+--------------------------+-------------------------------+
| wsrep_incoming_addresses | 10.0.0.51:3306,10.0.0.52:3306 |
+--------------------------+-------------------------------+
1 row in set (0.01 sec)

1
2
3
4
5
6
7
MariaDB [(none)]> show status like 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name | Value |
+--------------------+-------+
| wsrep_cluster_size | 2 |
+--------------------+-------+
1 row in set (0.01 sec)
1
# systemctl restart mariadb

场景二

唯一有一个幸存节点

1
2
3
4
5
6
7
MariaDB [(none)]> show status like 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name | Value |
+--------------------+-------+
| wsrep_cluster_size | 1 |
+--------------------+-------+
1 row in set (0.01 sec)

幸存节点刚好优雅关闭正确提交了数据,状态为Primary

1
2
3
4
5
6
7
MariaDB [(none)]> show status like 'wsrep_cluster_status';
+----------------------+---------+
| Variable_name | Value |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+
1 row in set (0.01 sec)

哪果是non-Primary状态还要单独恢复
幸存节点上执行

1
MariaDB [(none)]> set global wsrep_provider_options='pc.bootstrap=YES';

可是 我们一般不这样做,常规做法是直接指定这个幸存节点的主节点,让其它节点加入集群

1
2
3
# systemctl stop mariadb
# galera_new_cluster
# systemctl restart mariadb

场景三
集群崩溃完全不可用,这也分为2种情况,基于最高seqno值恢复
状态一是没有集群UUID, seqno: -1,数据定义语言DDL出错

1
2
3
4
5
6
# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 00000000-0000-0000-0000-000000000000
seqno: -1
safe_to_bootstrap: 0

状态2是有集群UUID,seqno: -1,事务处理数据时出错

1
2
3
4
5
6
# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 886dd8da-3d07-11e8-a109-8a3c80cebab4
seqno: -1
safe_to_bootstrap: 0

状态3是有集群UUID,seqno值最高正值

1
2
3
4
5
6
# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 886dd8da-3d07-11e8-a109-8a3c80cebab4
seqno: 31929
safe_to_bootstrap: 1

此时我们认为状态3是OK的,需要修改状态1、2数据库配置文件去掉IP地址配置项

1
wsrep_cluster_address="gcomm://"

这样节点就能正常启动了

1
systemctl restart mariadb

最后再修改配置文件后把这2个节点正常加入集群就好了

1
wsrep_cluster_address="gcomm://10.8.8.53,10.8.8.54,10.8.8.55"

最后是基于Last Committed恢复
先修改配置文件让服务启起来,入库查看wsrep_last_committed

1
2
3
#systemctl stop mariadb
编辑配置文件去掉IP地址wsrep_cluster_address="gcomm://"
#systemctl start mariadb

1
2
3
4
5
6
7
MariaDB [(none)]> show status like 'wsrep_last_committed';
+----------------------+---------+
| Variable_name | Value |
+----------------------+---------+
| wsrep_last_committed | 319589 |
+----------------------+---------+
1 row in set (0.01 sec)

找到节点最大值认为就是主节点,然后在这个主节点上

1
#  galera_new_cluster

其它节点加入集群即可