Basic Hands-On Operations for an MGR Cluster

Ⅰ. Basic MGR parameters

  • binlog/relaylog
server_id = 1
log_bin = bin
log_slave_updates = 1
relay_log = relay
binlog_format = row    # must be ROW; there are plenty of reasons to use ROW, not covered here
binlog_checksum = NONE    # 5.7.17 does not support binlog events that carry a checksum
master_info_repository = table    # GR uses multiple replication channels, so the slave's channel info must live in tables
relay_log_info_repository = table
  • gtid
gtid_mode = 1
enforce_gtid_consistency = 1
  • Parallel replication
slave_parallel_type = 'logical_clock'
slave_parallel_workers = N
slave_preserve_commit_order = 1
  • Write-set (primary key) extraction
transaction_write_set_extraction = XXHASH64    # the server layer extracts primary-key (write set) info and stores it hashed; must be identical on all nodes
  • GR plugin settings
plugin_load_add = 'group_replication.so'    # load the GR plugin
loose-group_replication_group_name = "745c2bc6-9fe4-11ea-8c89-fa163e98606f"    # group name (a UUID)
loose-group_replication_start_on_boot = 0    # do not start GR automatically when the server starts
loose-group_replication_local_address = "192.168.0.65:33061"    # this member's address for group communication
loose-group_replication_group_seeds = "192.168.0.32:33061,192.168.0.65:33061,192.168.0.185:33061"    # seed member addresses
loose-group_replication_bootstrap_group = 0    # members are non-bootstrapping by default; enable temporarily on a single node to initialize the group, then turn it off immediately
loose-group_replication_ip_whitelist = 'IP or CIDR'    # whitelist; left unset by default, which only lets hosts on the same subnets as this machine join

The 'loose-' prefix is mostly used for plugin parameters: it lets the MySQL server start normally, without an error, even when the parameter is not yet recognized (e.g. before the plugin is loaded).
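
Once the server is up with these settings, a quick sanity check, just a sketch using the standard catalogs, is to confirm that the plugin is loaded and ACTIVE and that the group_replication variables took effect:

select plugin_name, plugin_status from information_schema.plugins where plugin_name = 'group_replication';
show global variables like 'group_replication%';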

Ⅱ. Switching MGR modes

By default a freshly deployed MGR group runs in single-primary mode; here we switch it to multi-primary.

Whether the group is single-primary or multi-primary is controlled by a single parameter:

group_replication_single_primary_mode

We will not switch the mode online here. Instead, take every member out of the group, change the parameter above on all nodes, re-bootstrap the group in the desired mode, and then bring the members back in.
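
(For the record, MySQL 8.0.13 and later also ship functions that switch the mode online without tearing the group down. We don't use them in this walkthrough, but a sketch would look like this, reusing a member UUID from the output below as the new primary:)

select group_replication_switch_to_multi_primary_mode();
select group_replication_switch_to_single_primary_mode('a50df330-a4b0-11ea-bfd3-fa163ed6ef91');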

Let's try it.

(root@localhost) [(none)]> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | a50df330-a4b0-11ea-bfd3-fa163ed6ef91 | master      |        3306 | ONLINE       | PRIMARY     | 8.0.19         |
| group_replication_applier | a6a83b07-a4b0-11ea-ace2-fa163e6f3efc | slave1      |        3306 | ONLINE       | SECONDARY   | 8.0.19         |
| group_replication_applier | a7dbfa5f-a4b0-11ea-b21b-fa163e98606f | slave2      |        3306 | ONLINE       | SECONDARY   | 8.0.19         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)

The group is currently in single-primary mode: one PRIMARY and two SECONDARYs. Let's get to it.

Run the following on every node:

(root@localhost) [(none)]> stop group_replication;                                                                                                                                               
Query OK, 0 rows affected (4.41 sec)

(root@localhost) [(none)]> set global group_replication_single_primary_mode = 0;
Query OK, 0 rows affected (0.00 sec)

On any one node, re-bootstrap the group:

(root@localhost) [(none)]> SET GLOBAL group_replication_bootstrap_group=1;
Query OK, 0 rows affected (0.00 sec)

(root@localhost) [(none)]> start group_replication;
Query OK, 0 rows affected (3.12 sec)

(root@localhost) [(none)]> SET GLOBAL group_replication_bootstrap_group=0;
Query OK, 0 rows affected (0.00 sec)

Join the remaining nodes to the group:

(root@localhost) [(none)]> start group_replication;
Query OK, 0 rows affected (3.58 sec)

Check the current group status:

(root@localhost) [(none)]> select * from performance_schema.replication_group_members;                                                                                                           
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | a50df330-a4b0-11ea-bfd3-fa163ed6ef91 | master      |        3306 | ONLINE       | PRIMARY     | 8.0.19         |
| group_replication_applier | a6a83b07-a4b0-11ea-ace2-fa163e6f3efc | slave1      |        3306 | ONLINE       | PRIMARY     | 8.0.19         |
| group_replication_applier | a7dbfa5f-a4b0-11ea-b21b-fa163e98606f | slave2      |        3306 | ONLINE       | PRIMARY     | 8.0.19         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)

Three PRIMARYs, nice. Switching from multi-primary back to single-primary works exactly the same way.
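
For completeness, a compact sketch of that reverse switch, same sequence as above, just with the parameter flipped back:

-- on every member
stop group_replication;
set global group_replication_single_primary_mode = 1;
-- on exactly one member, re-bootstrap the group
set global group_replication_bootstrap_group = 1;
start group_replication;
set global group_replication_bootstrap_group = 0;
-- on the remaining members
start group_replication;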

Ⅲ. Handling group failures

Whether the group runs single-primary or multi-primary, once more than half of its members fail the whole group stops serving writes. That failure scenario is what this section covers, as opposed to the normal addition and removal of nodes; keep the distinction in mind.

We simulate the failure by kill -9'ing the mysqld process.

The group is in multi-primary mode at this point.

First take down one node, slave2:

[root@slave2 ~]# ps auxwf |grep mysql |grep -v grep |awk '{print $2}' |tail -n 1 |xargs kill -9

I kill the last mysql process in the list because mysqld was started via mysqld_safe, so there is a supervising process ahead of it that we leave alone.
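
An equivalent that targets only the mysqld binary and leaves the mysqld_safe wrapper alone, assuming pidof is available on the box, would be:

[root@slave2 ~]# pidof mysqld | xargs -r kill -9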

Check the group from the two remaining nodes:

(root@localhost) [(none)]> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | a50df330-a4b0-11ea-bfd3-fa163ed6ef91 | master      |        3306 | ONLINE       | PRIMARY     | 8.0.19         |
| group_replication_applier | a6a83b07-a4b0-11ea-ace2-fa163e6f3efc | slave1      |        3306 | ONLINE       | PRIMARY     | 8.0.19         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)

(root@localhost) [(none)]> insert into t.t values(6);
Query OK, 1 row affected (0.01 sec)

(root@localhost) [(none)]> select * from t.t;
+----+
| id |
+----+
|  1 |
|  2 |
|  3 |
|  4 |
|  5 |
|  6 |
+----+
6 rows in set (0.00 sec)

OK, slave2 is gone, but inserts still go through.

Now kill slave1 as well:

[root@slave1 ~]# ps auxwf |grep mysql |grep -v grep |awk '{print $2}' |tail -n 1 |xargs kill -9

Check the group status on master:

(root@localhost) [(none)]> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | a50df330-a4b0-11ea-bfd3-fa163ed6ef91 | master      |        3306 | ONLINE       | PRIMARY     | 8.0.19         |
| group_replication_applier | a6a83b07-a4b0-11ea-ace2-fa163e6f3efc | slave1      |        3306 | UNREACHABLE  | PRIMARY     | 8.0.19         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)

(root@localhost) [(none)]> insert into t.t values(7);
(hangs, never returns)

slave1 shows as UNREACHABLE, the insert no longer goes through, and the group can no longer serve writes.

During testing I noticed that if the two nodes are killed within a short interval of each other, both of them show up here as UNREACHABLE.

Open a new session on master and take a look.

innodb status
---TRANSACTION 3123, ACTIVE (PREPARED) 19 sec
mysql tables in use 1, locked 1
1 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 1
MySQL thread id 8, OS thread handle 139976409593600, query id 49 localhost root waiting for handler commit
insert into t.t values(7)

----------------

innodb_trx
(root@localhost) [(none)]> select * from information_schema.innodb_trx\G
*************************** 1. row ***************************
                    trx_id: 3123
                 trx_state: RUNNING
               trx_started: 2020-06-07 11:52:01
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 2
       trx_mysql_thread_id: 8
                 trx_query: insert into t.t values(7)
       trx_operation_state: NULL
         trx_tables_in_use: 1
         trx_tables_locked: 1
          trx_lock_structs: 1
     trx_lock_memory_bytes: 1136
           trx_rows_locked: 0
         trx_rows_modified: 1
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)

----------------

processlist
(root@localhost) [(none)]> show processlist;
+----+-----------------+-----------+------+---------+------+--------------------------------------------------------+----------------------------------+
| Id | User            | Host      | db   | Command | Time | State                                                  | Info                             |
+----+-----------------+-----------+------+---------+------+--------------------------------------------------------+----------------------------------+
|  4 | event_scheduler | localhost | NULL | Daemon  | 2162 | Waiting on empty queue                                 | NULL                             |
|  8 | root            | localhost | NULL | Query   |  106 | waiting for handler commit                             | insert into t.t values(7)        |
| 10 | system user     |           | NULL | Connect |  560 | waiting for handler commit                             | Group replication applier module |
| 13 | system user     |           | NULL | Query   |  560 | Slave has read all relay log; waiting for more updates | NULL                             |
| 22 | root            | localhost | NULL | Query   |    0 | starting                                               | show processlist                 |
+----+-----------------+-----------+------+---------+------+--------------------------------------------------------+----------------------------------+
5 rows in set (0.01 sec)

(root@localhost) [(none)]> kill 8;
Query OK, 0 rows affected (0.00 sec)

Back in the originally hung session:

(root@localhost) [(none)]> insert into t.t values(7);
ERROR 2013 (HY000): Lost connection to MySQL server during query

Observe from the new session:

(root@localhost) [(none)]> show processlist;
+----+-----------------+-----------+------+---------+------+--------------------------------------------------------+----------------------------------+
| Id | User            | Host      | db   | Command | Time | State                                                  | Info                             |
+----+-----------------+-----------+------+---------+------+--------------------------------------------------------+----------------------------------+
|  4 | event_scheduler | localhost | NULL | Daemon  | 2202 | Waiting on empty queue                                 | NULL                             |
|  8 | root            | localhost | NULL | Killed  |  146 | waiting for handler commit                             | insert into t.t values(7)        |
| 10 | system user     |           | NULL | Connect |  600 | waiting for handler commit                             | Group replication applier module |
| 13 | system user     |           | NULL | Query   |  600 | Slave has read all relay log; waiting for more updates | NULL                             |
| 22 | root            | localhost | NULL | Query   |    0 | starting                                               | show processlist                 |
+----+-----------------+-----------+------+---------+------+--------------------------------------------------------+----------------------------------+
5 rows in set (0.00 sec)

The transaction state in innodb status and innodb_trx is unchanged from before, so I won't paste it again.

Killing the hung session did not actually take effect; the thread just shows Killed. Why does it stay in the Killed state?

Let's set that question aside for now.

Look at master's error log:

2020-06-06T14:04:52.234039Z 212 [Warning] [MY-010056] [Server] Host name 'hn.kd.jz.adsl' could not be resolved: Name or service not known
2020-06-06T15:54:42.751689Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address slave2:3306 has become unreachable.'
2020-06-06T15:54:44.753896Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: slave2:3306'
2020-06-06T15:59:53.600136Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address slave1:3306 has become unreachable.'
2020-06-06T15:59:53.600173Z 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'

The last message says it all: this server can no longer reach a majority of the group's members, so it blocks all updates and stays blocked until contact with the majority is restored; group_replication_force_members can be used to force a new group membership.

In other words: take the surviving nodes and rebuild the membership around them.

So let's do exactly that:

(root@localhost) [(none)]> set global group_replication_force_members = "192.168.0.65:33061";
Query OK, 0 rows affected (9.74 sec)

(root@localhost) [(none)]> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | a50df330-a4b0-11ea-bfd3-fa163ed6ef91 | master      |        3306 | ONLINE       | PRIMARY     | 8.0.19         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
1 row in set (0.00 sec)

(root@localhost) [(none)]> insert into t.t values(7);
ERROR 1062 (23000): Duplicate entry '7' for key 't.PRIMARY'
(root@localhost) [(none)]> select * from t.t;
+----+
| id |
+----+
|  1 |
|  2 |
|  3 |
|  4 |
|  5 |
|  6 |
|  7 |
+----+
7 rows in set (0.00 sec)

(root@localhost) [(none)]> insert into t.t values(8);
Query OK, 1 row affected (0.01 sec)

Service is restored; all that remains is to add the other nodes back in.

A few things to note:

① When re-adding nodes, every node's mode must match the group's, single-primary or multi-primary. Here the group is multi-primary, so set group_replication_single_primary_mode to 0 on the joining node first; the default is 1 (single-primary). See the sketch after this list.
② Afterwards it is best to reset group_replication_force_members back to empty; leaving it set is not a big problem, though.
③ You cannot add nodes to a group that is unavailable; the group has to be made available again before new members can join.
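
Putting ① and ② together, re-adding a node to this (now multi-primary) group is only a couple of statements; a minimal sketch:

-- on the node being re-added: match the group's current mode, then join
set global group_replication_single_primary_mode = 0;
start group_replication;

-- on the surviving node, once everyone is back: clear the forced membership
set global group_replication_force_members = '';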

One odd observation here: the transaction we killed earlier actually committed; it was not rolled back.

But since I killed it, the transaction should have been rolled back once service was restored, so is something wrong here?

During testing I found that if you skip the kill and simply restart the server, the transaction does get rolled back.

This is most likely a gap in my own understanding of the Killed state; let's leave it, since it is not the focus of this section.

Ⅳ. MGR monitoring

Calling this monitoring is a bit of a stretch; it is really just the metadata you consult most often when operating the cluster.

  • Node status

This is the query I ran most while learning MGR; every column is self-explanatory, so no further commentary is needed.

(root@localhost) [(none)]> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | a50df330-a4b0-11ea-bfd3-fa163ed6ef91 | master      |        3306 | ONLINE       | PRIMARY     | 8.0.19         |
| group_replication_applier | a6a83b07-a4b0-11ea-ace2-fa163e6f3efc | slave1      |        3306 | ONLINE       | PRIMARY     | 8.0.19         |
| group_replication_applier | a7dbfa5f-a4b0-11ea-b21b-fa163e98606f | slave2      |        3306 | ONLINE       | PRIMARY     | 8.0.19         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)

Still, the MEMBER_STATE column deserves a word (a handy check query follows the list):

- OFFLINE: the GR plugin is not started
- RECOVERING: replicating the data the group produced before this member joined
- ONLINE: recovery is complete and the member can serve traffic
- ERROR: the member hit an error and GR cannot run normally
- UNREACHABLE: the member cannot communicate with the others, due to a network problem or an unclean exit of other members
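
A small query I keep around for alerting on anything that is not ONLINE, built on the same table:

select member_host, member_port, member_state, member_role
from performance_schema.replication_group_members
where member_state <> 'ONLINE';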
  • Detailed member statistics

The columns that need explaining are annotated inline below.

(root@localhost) [(none)]> select * from performance_schema.replication_group_member_stats limit 1\G
*************************** 1. row ***************************
                              CHANNEL_NAME: group_replication_applier	
                                   VIEW_ID: 15915014697349018:6	group view id; I'll explain it in a later post on the internals
                                 MEMBER_ID: a50df330-a4b0-11ea-bfd3-fa163ed6ef91
               COUNT_TRANSACTIONS_IN_QUEUE: 0	transactions queued, waiting for global certification
                COUNT_TRANSACTIONS_CHECKED: 2	total transactions certified since this member joined the group
                  COUNT_CONFLICTS_DETECTED: 0	total transactions that failed certification (conflicts)
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 0	rows available for conflict detection (size of the certification database)
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: 745c2bc6-9fe4-11ea-8c89-fa163e98606f:1-69:1000003-1000005:2000005-2000006	GTID set committed on every member, roughly the intersection of gtid_executed; refreshed periodically, not in real time
            LAST_CONFLICT_FREE_TRANSACTION: 745c2bc6-9fe4-11ea-8c89-fa163e98606f:67	GTID of the last conflict-free transaction
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0	transactions this member received from the group that are still waiting to be applied
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 5	transactions this member received from the group and has already applied
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 2	transactions that originated on this member and were sent to the group
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0	transactions that originated on this member and were rolled back by the group
1 row in set (0.00 sec)
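
The two queue counters are the ones worth watching: if COUNT_TRANSACTIONS_IN_QUEUE or COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE stays high, certification or the local applier is falling behind. A minimal check across the group:

select member_id,
       count_transactions_in_queue,
       count_transactions_remote_in_applier_queue
from performance_schema.replication_group_member_stats;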
  • Other useful tables

replication_connection_status

When a new node joins the group, it first catches up on the data produced before it joined via asynchronous replication; this table lets you monitor that recovery phase.

replication_applier_status

The group_replication_applier channel is what applies the group's binlog events; this table lets you monitor that applier.

threads

This table exposes the threads run by the GR components.
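
A minimal sketch of all three lookups; the GR channels are named group_replication_applier and group_replication_recovery, and GR threads are instrumented under thread/group_rpl/:

select channel_name, service_state, received_transaction_set
from performance_schema.replication_connection_status
where channel_name like 'group_replication%';

select channel_name, service_state, count_transactions_retries
from performance_schema.replication_applier_status
where channel_name like 'group_replication%';

select name, type, processlist_state
from performance_schema.threads
where name like 'thread/group_rpl/%';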
