Troubleshooting a Production MySQL Master-Slave Replication Lag
This morning when I got to work, Zabbix kept alerting on replication lag: mysql slave Seconds_Behind_Master (mysql.slave_status[Seconds_Behind_Master]): 69
I logged in to the MySQL slave and checked the slave status. Seconds_Behind_Master was 10562, so the lag was real:
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
......
Master_Log_File: mysql-bin.002582
Read_Master_Log_Pos: 60231375
Relay_Log_File: relay-log.000998
Relay_Log_Pos: 283
Relay_Master_Log_File: mysql-bin.002580
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1070638020
Relay_Log_Space: 29799422300
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_Key:
Seconds_Behind_Master: 10562
Master_SSL_Verify_Server_Cert: No
1 row in set (0.00 sec)
Note that the IO thread had already fetched events up to mysql-bin.002582, while the SQL thread was still replaying mysql-bin.002580 (Relay_Master_Log_File), and Relay_Log_Space was close to 28 GB, so the SQL thread was stuck far behind. I then logged in to the MySQL master to look into the possible causes.
1. Possible causes:
(1) The master and slave are on different networks and the lag is caused by network latency;
(2) The slave's hardware (memory, CPU, NIC, etc.) is far weaker than the master's;
(3) The master has a heavy write load, so the slave cannot replay the master's binlog in real time;
(4) A large transaction or slow SQL ran on the master, so applying its binlog on the slave is slow and lag builds up;
(5) Instance parameter configuration, e.g. the slave also writes its own binlog, or is configured to flush to disk on every transaction.
After checking, causes (1), (2) and (5) were ruled out, and the master's slow query log showed nothing, so the remaining suspects were a heavy write load or a large transaction on the master.
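For reference, the parameter checks behind ruling out cause (5) and locating the slow log can be done with a few SHOW statements. This is a minimal sketch rather than the exact commands run during the incident:

```sql
-- On the slave: flush- and binlog-related parameters relevant to cause (5).
SHOW GLOBAL VARIABLES WHERE Variable_name IN
    ('log_bin', 'log_slave_updates', 'sync_binlog', 'innodb_flush_log_at_trx_commit');

-- On the master: confirm the slow log is enabled and find the file to inspect.
SHOW GLOBAL VARIABLES WHERE Variable_name IN
    ('slow_query_log', 'slow_query_log_file', 'long_query_time');
```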
2. Checking the master's binlog files, there was a 28 GB mysql-bin file dated the 1st:
# ls -rtlsh
1.1G -rw-rw---- 1 mysql mysql 1.1G Mar 31 19:06 mysql-bin.002579
28G -rw-rw---- 1 mysql mysql 28G Apr 1 08:02 mysql-bin.002580
1.1G -rw-rw---- 1 mysql mysql 1.1G Apr 1 10:22 mysql-bin.002581
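The same thing can be confirmed from inside MySQL on the master; SHOW BINARY LOGS lists each binlog with its size, so mysql-bin.002580 stands out immediately (shown only as an alternative to ls, not output captured during the incident):

```sql
-- On the master: list binlog files and their sizes (File_size is in bytes).
SHOW BINARY LOGS;
```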
3. Given the above, I used binlog_rollback to analyze the MySQL binlog and find out what large DML the master had executed.
About binlog_rollback
binlog_rollback implements flashback/rollback based on row-format binlogs, so accidentally deleted or updated data can be rolled back quickly without downtime and without restoring from backup.
It can also decode binlogs (including non-row formats) into readable SQL, which makes questions like "when was this column of that table updated to 1" or "what deletes hit this table during a given time window" easy to answer.
binlog_rollback can also output per-table update/insert/delete statistics, report large and long-running transactions (so it is obvious at a glance whether an application misbehaved), and list all DDL.
By parsing mysql/mariadb binlogs/relay logs, binlog_rollback provides three main features:
1) Flashback/rollback: roll DML back to any point in time or binlog position.
The generated files are named rollback.xxx.sql or db.tb.rollback.xxx.sql.
The generated SQL looks like this:
```sql
begin
DELETE FROM `danny`.`emp` WHERE `id`=1
# datetime=2017-10-23_00:14:28 database=danny table=emp binlog=mysql-bin.000012 startpos=417 stoppos=575
commit
```
2) Forward parsing: decode the DML in a binlog/relaylog into readable SQL statements.
* Non-row-format binlogs are supported, but non-row DML is not decoded by default; the -stsql option must be specified.
The generated files are named forward.xxx.sql or db.tb.forward.xxx.sql.
The generated SQL looks like this:
```sql
begin
# datetime=2017-10-23_00:14:28 database=danny table=emp binlog=mysql-bin.000012 startpos=417 stoppos=575
INSERT INTO `danny`.`emp` (`id`,`name`,`sr`,`icon`,`points`,`sa`,`sex`) VALUES (1,'张三1','华南理工大学&SCUT',X'89504e47',1.1,1.1,1)
commit
```
3) Statistics: per-table DML counts, large and long-running transactions, and all DDL.
DML statistics file: binlog_stats.txt
Large/long transaction file: big_long_trx.txt
DDL file: ddl_info.txt
See https://github.com/GoDannyLai/binlog_rollback for details.
# /root/dba/mysql_dml/bin/binlog_rollback -m file -w stats -M mysql -i 20 -b 100 -l 10 -o /data/dml_binlog/log /data/mysql/var/mysql-bin.002580
# cat binlog_stats.txt |awk '{print $8}' |sort -nr|head -10
417936881
9718
7183
7183
7165
7162
7152
7150
7137
7119
The corresponding full line in binlog_stats.txt:
mysql-bin.002580 2020-03-31_18:33:00 2020-03-31_18:33:00 1070638106 3966168092 0 0 417936881 content testing
As the stats show, at 2020-03-31_18:33:00 a total of 417,936,881 rows were deleted from content.testing; this DELETE was the main cause of the replication lag.
Finally, we opened a chat group with the relevant developers to track the issue down and found that a developer had run `delete from testing;` directly, deleting more than 400 million rows in a single huge transaction, which is what caused the lag.
Solutions
(1) Since this instance group does not serve core business and its QPS/TPS are low, we temporarily pointed both reads and writes at the master; once the slave caught up, reads were switched back to the slave.
(2) Alternatively, take a full backup on the master, restore two new slaves from it, and re-establish replication once the restore is done.
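For option (2), after a new slave has been restored from the full backup, replication is re-pointed at the master using the binlog coordinates recorded by the backup. A minimal sketch; the host, credentials, and coordinates below are placeholders, not values from this incident:

```sql
-- On the freshly restored slave: point replication at the master.
-- Replace the file/position with the coordinates recorded by the backup tool.
CHANGE MASTER TO
    MASTER_HOST     = '192.168.1.10',    -- placeholder master address
    MASTER_USER     = 'repl',            -- placeholder replication account
    MASTER_PASSWORD = '******',
    MASTER_LOG_FILE = 'mysql-bin.0025xx',
    MASTER_LOG_POS  = 4;
START SLAVE;
SHOW SLAVE STATUS\G
```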
Checking which databases and tables had large DML operations (the totals from binlog_stats.txt were loaded into test.t_dml_count first; see the sketch below):
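The original DDL and load command for test.t_dml_count were not kept, so the following is only a reconstruction; it assumes the columns follow the order of the sample stats line shown earlier (binlog, start time, stop time, start pos, stop pos, inserts, updates, deletes, database, table) and that the file is tab-separated — adjust to whatever binlog_rollback actually emits:

```sql
-- Hypothetical reconstruction of the helper table; not the original definition.
CREATE TABLE test.t_dml_count (
    binlog        varchar(64),
    starttime     varchar(32),
    stoptime      varchar(32),
    startpos      bigint,
    stoppos       bigint,
    insert_totals bigint,
    update_totals bigint,
    delete_totals bigint,
    db_name       varchar(64),
    table_name    varchar(64)
);

-- Assumes a tab-separated file and that local_infile is enabled on both
-- client and server; the real format of binlog_stats.txt may differ.
LOAD DATA LOCAL INFILE '/data/dml_binlog/log/binlog_stats.txt'
INTO TABLE test.t_dml_count
FIELDS TERMINATED BY '\t';
```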
mysql> select delete_totals,db_name,table_name,starttime,stoptime from test.t_dml_count order by delete_totals desc limit 10;
+---------------+--------------------+------------------+---------------------+---------------------+
| delete_totals | db_name | table_name | starttime | stoptime |
+---------------+--------------------+------------------+---------------------+---------------------+
| 417936881 | content | testing | 2020-03-31_18:33:00 | 2020-03-31_18:33:00 |
| 9718 | content | data_count | 2020-04-01_00:15:01 | 2020-04-01_00:15:01 |
| 4831 | Log | soLog | 2020-04-01_05:10:40 | 2020-04-01_05:10:59 |
| 4813 | Log | soLog | 2020-04-01_05:21:20 | 2020-04-01_05:21:39 |
| 4811 | Log | soLog | 2020-04-01_05:34:00 | 2020-04-01_05:34:19 |
| 4806 | Log | soLog | 2020-04-01_05:11:40 | 2020-04-01_05:11:59 |
| 4802 | Log | soLog | 2020-04-01_05:33:40 | 2020-04-01_05:33:59 |
| 4800 | Log | soLog | 2020-04-01_05:34:40 | 2020-04-01_05:34:59 |
| 4779 | Log | soLog | 2020-04-01_05:35:20 | 2020-04-01_05:35:39 |
| 4776 | Log | soLog | 2020-04-01_05:19:40 | 2020-04-01_05:19:59 |
+---------------+--------------------+------------------+---------------------+---------------------+
10 rows in set (0.01 sec)
mysql> select insert_totals,db_name,table_name,starttime,stoptime from test.t_dml_count order by insert_totals desc limit 10;
+---------------+--------------------+-----------------------------+---------------------+---------------------+
| insert_totals | db_name | table_name | starttime | stoptime |
+---------------+--------------------+-----------------------------+---------------------+---------------------+
| 9722 | content | data_count | 2020-04-01_00:15:01 | 2020-04-01_00:15:01 |
| 4000 | Log | LogBackUp | 2020-04-01_05:05:40 | 2020-04-01_05:05:59 |
| 3500 | Log | LogBackUp | 2020-04-01_05:03:01 | 2020-04-01_05:03:19 |
| 3500 | Log | LogBackUp | 2020-04-01_05:02:40 | 2020-04-01_05:02:58 |
| 3500 | Log | LogBackUp | 2020-04-01_05:03:40 | 2020-04-01_05:03:57 |
| 3500 | Log | LogBackUp | 2020-04-01_05:01:40 | 2020-04-01_05:01:58 |
| 3500 | Log | LogBackUp | 2020-04-01_05:00:40 | 2020-04-01_05:00:58 |
| 3500 | Log | LogBackUp | 2020-04-01_05:04:20 | 2020-04-01_05:04:37 |
| 3500 | Log | LogBackUp | 2020-04-01_05:01:01 | 2020-04-01_05:01:19 |
| 3500 | Log | LogBackUp | 2020-04-01_05:02:01 | 2020-04-01_05:02:19 |
+---------------+--------------------+-----------------------------+---------------------+---------------------+
10 rows in set (0.01 sec)
mysql> select update_totals ,db_name,table_name,starttime,stoptime from test.t_dml_count order by update_totals desc limit 10;
+---------------+--------------------+--------------------+---------------------+---------------------+
| update_totals | db_name | table_name | starttime | stoptime |
+---------------+--------------------+--------------------+---------------------+---------------------+
| 2629 | dataanalytics | Code | 2020-04-01_04:10:56 | 2020-04-01_04:11:15 |
| 2493 | dataanalytics | Code | 2020-04-01_04:10:16 | 2020-04-01_04:10:35 |
| 2430 | dataanalytics | Code | 2020-04-01_04:11:16 | 2020-04-01_04:11:35 |
| 2377 | dataanalytics | Code | 2020-04-01_04:10:36 | 2020-04-01_04:10:55 |
| 1778 | dataanalytics | Code | 2020-04-01_04:10:01 | 2020-04-01_04:10:15 |
| 959 | dataanalytics | Code | 2020-04-01_04:11:36 | 2020-04-01_04:11:43 |
| 503 | Log | PointLog | 2020-04-01_08:03:09 | 2020-04-01_08:03:28 |
| 431 | content | push | 2020-04-01_15:15:13 | 2020-04-01_15:15:13 |
| 351 | Log | PointLog | 2020-04-01_11:43:09 | 2020-04-01_11:43:28 |
| 349 | Log | PointLog | 2020-04-01_11:51:09 | 2020-04-01_11:51:28 |
+---------------+--------------------+--------------------+---------------------+---------------------+
10 rows in set (0.02 sec)