生产环境新上线集群小概率gtid空洞问题
背景:生产环境新上集群会产生gtid空洞问题
例子:
群集名 |
服务器 |
端口 |
master UUID |
复制异常点 |
ibupub1group01 |
10.60.52.168 |
55944 |
9cd48d9c-6e29-11ec-92d8-98039ba567ea |
存在GAP: 间隔10秒两次动态GTID比较:1-402:404-5103#1-402:404-5109 |
线索1:查看binlog发现空洞gtid的binlog不存在,且问题发生于集群已经搭建完毕,已经开始写心跳的阶段后
[op1@SVR16599HP360 tmp]$ cat log.txt |grep "9cd48d9c-6e29-11ec-92d8-98039ba567ea:40" -A 10 | head -n 50
SET @@SESSION.GTID_NEXT= '9cd48d9c-6e29-11ec-92d8-98039ba567ea:402'/*!*/;
# at 159796
#220105 22:45:58 server id 52153 end_log_pos 159880 CRC32 0xdc731cf7 Query thread_id=6068 exec_time=0 error_code=0
SET TIMESTAMP=1641393958/*!*/;
BEGIN
/*!*/;
# at 159880
#220105 22:45:58 server id 52153 end_log_pos 159957 CRC32 0x49efcff4 Rows_query
# replace into repl_heartbeat
# values (@@hostname,now())
# at 159957
--
SET @@SESSION.GTID_NEXT= '9cd48d9c-6e29-11ec-92d8-98039ba567ea:404'/*!*/;
# at 160194
继续查:
线索2:线上已有集群写心跳不会出现问题,为什么新上集群才有小概率会发生?
猜测:和写心跳这个操作无关,写心跳只是gtid增长的手段,新上集群和已有集群的差别在于新上集群在搭建集群环节有reset master操作,怀疑到语句reset master
验证:
经过一系列测试,接下来的操作可100%稳定复现:
第一步:在测试集群上部署一个小脚本,while 1 不停的在master上写心跳,然后在master上手工reset master。然后观察Executed_Gtid_Set:
mysql> show master status;
+-------------------------+----------+--------------+------------------+-----------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+-------------------------+----------+--------------+------------------+-----------------------------------------------+
| FATM00003214-bin.000001 | 12771 | | | 744c6c9a-766d-11eb-9807-fa163e6a649a:1-31:135 |
+-------------------------+----------+--------------+------------------+-----------------------------------------------+
发现会稳定出现一个不存在的幽灵gtid 135,且binlog中搜不到这个gtid的相关信息:
mysqlbinlog -vvv FATM00003214-bin.000001|grep 744c6c9a-766d-11eb-9807-fa163e6a649a:135 ---》 返回空
第二步:模拟搭建集群操作,在slave上stop slave;reset master;change master to xxx;start slave;show slave status;
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.6.26.233
Master_User: usvr_replication
Master_Port: 55111
Connect_Retry: 60
Master_Log_File: FATM00003214-bin.000001
Read_Master_Log_Pos: 15585
Relay_Log_File: FATM00003215-relay-bin.000002
Relay_Log_Pos: 15812
Relay_Master_Log_File: FATM00003214-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 15585
Relay_Log_Space: 16082
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 10001
Master_UUID: 744c6c9a-766d-11eb-9807-fa163e6a649a
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-38
Executed_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-38
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
Slave在向master拉取gtid时,拉取到了理论上应该存在的744c6c9a-766d-11eb-9807-fa163e6a649a:1-38,以及幽灵gtid 135,然后因为binlog里没有gtid135,导致slave上不会执行,不会记载在Executed_Gtid_Set中。
此时集群已经搭建完成,在心跳不断的写入的情况下,master的gtid :1-38:135会被填平至例如1-169,但因为master的135是不存在的,slave上就会被跳过,至此产生空洞:
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.6.26.233
Master_User: usvr_replication
Master_Port: 55111
Connect_Retry: 60
Master_Log_File: FATM00003214-bin.000001
Read_Master_Log_Pos: 68475
Relay_Log_File: FATM00003215-relay-bin.000002
Relay_Log_Pos: 68702
Relay_Master_Log_File: FATM00003214-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 68475
Relay_Log_Space: 68972
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 10001
Master_UUID: 744c6c9a-766d-11eb-9807-fa163e6a649a
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-134:136-169
Executed_Gtid_Set: 744c6c9a-766d-11eb-9807-fa163e6a649a:1-134:136-169
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
结论:集群上线的过程中可能在mysql实例部署完毕后某个脚本就开始运行,然后这个采集脚本的insert动作和master的reset master操作撞车,产生那个幽灵gtid,导致后续slave/dr上产生gtid空洞。至于为什么撞车这个动作会生成一个幽灵gtid尚且不知
解决方法:集群搭建流程中master在reset master之前flush tables with read lock; 等最后再打开,杜绝撞车的可能性