MySQL 有关MHA搭建与切换的几个错误log
1:masterha_check_repl 副本集方面报错 replicates is not defined in the configuration file!
具体信息如下:
# /usr/local/bin/masterha_check_repl --conf=/etc/mha/app1.cnf
Thu Nov 21 15:33:15 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Thu Nov 21 15:33:15 2018 - [info] Reading application default configuration from /etc/mha/app1.cnf.. Thu Nov 21 15:33:15 2018 - [info] Reading server configuration from /etc/mha/app1.cnf.. Thu Nov 21 15:33:15 2018 - [info] MHA::MasterMonitor version 0.56. Thu Nov 21 15:33:16 2018- [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln671] Master 179.179.19.179:3306 from which slave 179.179.19.180(179.179.19.180:3306) replicates is not defined in the configuration file! Thu Nov 21 15:33:16 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations. at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326. Thu Nov 21 15:33:16 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers. Thu Nov 21 15:33:16 2018 - [info] Got exit code 1 (Not master dead). MySQL Replication Health is NOT OK!
分析:MHA 漂移过后,我们知道配置信息中 主节点的信息就不在了,我们需要及时维护,否则/usr/local/bin/masterha_check_repl --conf=/etc/mha/XXX.cnf 检查副本集状态报错。
2. masterha_master_switch 在线切换方面 报错 We should not start online master switch when one of connections are running long updates on the current master
具体信息如下:
# /usr/local/bin/masterha_master_switch --master_state=alive --conf=/etc/mha/app1.cnf It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 179.179.19.184(179.179.19.184:3306)? (YES/no): y Tue Nov 19 17:19:09 2018 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time.. Tue Nov 19 17:19:09 2018 - [info] ok. Tue Nov 19 17:19:09 2018 - [info] Checking MHA is not monitoring or doing failover.. Tue Nov 19 17:19:09 2018 - [info] Checking replication health on 179.179.19.185.. Tue Nov 19 17:19:09 2018 - [info] ok. Tue Nov 19 17:19:09 2018 - [error][/usr/local/share/perl5/MHA/MasterRotate.pm, ln161] We should not start online master switch when one of connections are running long updates on the current master(179.179.19.184(179.179.19.184:3306)). Currently 1 update thread(s) are running. Details: {'Time' => '12815','db' => undef,'Id' => '1','User' => 'event_scheduler','State' => 'Waiting on empty queue','Command' => 'Daemon','Info' => undef,'Host' => 'localhost'} Tue Nov 19 17:19:09 2018 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/local/bin/masterha_master_switch line 53.
分析:set global event_scheduler=off; 主从都要关闭. (特别提醒:从节点不关闭一样报错)
常用的切换命令补充如下:
masterha_master_switch --conf=/etc/mha配置文件 --master_state=alive --new_master_host=新主ServerIP --new_master_port=端口号 --orig_master_is_new_slave --running_updates_limit=60
3. masterha_master_switch 在线切换方面 报错 Got Error: DBI .....failed: Access denied for user
# /usr/local/bin/masterha_master_switch --master_state=alive --conf=/etc/mha/app1.cnf Starting master switch from 179.179.19.185(179.179.19:3306) to 179.179.19.184(179.179.19.184:3306)? (yes/NO): yes Tue Nov 19 18:52:04 2018 - [info] Checking whether 179.179.19.184(179.179.19.184:3306) is ok for the new master.. Tue Nov 19 18:52:04 2018 - [info] ok. Tue Nov 19 18:52:04 2018 - [info] ** Phase 1: Configuration Check Phase completed. Tue Nov 19 18:52:04 2018 - [info] Tue Nov 19 18:52:04 2018 - [info] * Phase 2: Rejecting updates Phase.. Tue Nov 19 18:52:04 2018 - [info] Tue Nov 19 18:52:04 2018 - [info] Executing master ip online change script to disable write on the current master: Tue Nov 19 18:52:04 2018 - [info] /usr/local/bin/master_ip_online_change_appuanalysis --command=stop --orig_master_host=179.179.19.185 --orig_master_ip=179.179.19.185 --orig_master_port=3306--orig_master_user='weixinLX391P_xldbmha' --orig_master_password='weixinLX391P_xldbmha\)qlk' --new_master_host=179.179.19.184 --new_master_ip=179.179.19.184 --new_master_port=55988 --new_master_user='us_mha' --new_master_password='weixinLX391P_xldbmha\)qlk' --orig_master_ssh_user=root --new_master_ssh_user=root Got Error: DBI connect(';host=179.179.19.184;port=3306;mysql_connect_timeout=4','weixinLX391P_xldbmha',...) failed: Access denied for user 'weixinLX391P_xldbmha'@'179.179.19.166' (using password: YES) at /usr/local/share/perl5/MHA/DBHelper.pm line 205. at /usr/local/bin/master_ip_online_change_app1 line 119. Tue Nov 19 18:52:04 2018 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/local/bin/masterha_master_switch line 53.
分析:账号密码有需要转移字符的。app1.cnf 文件中user账号相应的密码 password 不能有待转移的字符,例如本例中的')',但是 账号 repl_user 相应的密码repl_password 没有此限制。
4.如果使用的是xtrabackup,注意从节点会把event还原上去,可能会造成数据不一致,同步失败的问题。
如果主节点有event,需要手动关闭从节点的event。例如,主节点有归档删除数据的event,从节点需要关闭,否则报错。类似如下错误:
Could not execute Delete_rows event on table ????DB.*****table; Can't find record in '*****', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos XXXXXXX
5 .GTID 模式转换为传统模式后,MHA 机制下数据库主从检查报错。
检查的命令:
/usr/local/bin/masterha_check_repl --conf=/etc/mha/qqweixinordb.cnf
主要的报错信息
Can't exec "mysqlbinlog": No such file or directory at /usr/local/share/perl5/MHA/BinlogManager.pm line 106. mysqlbinlog version command failed with rc 1:0, please verify PATH, LD_LIBRARY_PATH, and client options at /usr/local/bin/apply_diff_relay_logs line 493. Fri Aug 28 04:38:22 2019 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln205] Slaves settings check failed! Fri Aug 28 04:38:22 2019 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln413] Slave configuration failed. Fri Aug 28 04:38:22 2019 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations. at /usr/local/bin/masterha_check_repl line 48. Fri Aug 28 04:38:22 2019 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers. Fri Aug 28 04:38:22 2019 - [info] Got exit code 1 (Not master dead). MySQL Replication Health is NOT OK!
解决方案--在每个DB节点执行以下命令
ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql
再次检查,报错信息消失,OK。
6.root 账号密码过期
以root账号设置的ssh免密登陆,而ssh有过期限制,则mha ssh检查时报错:
/usr/local/bin/masterha_check_ssh --conf=/etc/mha/qqorder.cnf Thu Nov 5 10:09:09 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Thu Nov 5 10:09:09 2018 - [info] Reading application default configuration from /etc/mha/pms20epime.cnf.. Thu Nov 5 10:09:09 2018 - [info] Reading server configuration from /etc/mha/pms20epime.cnf.. Thu Nov 5 10:09:09 2018 - [info] Starting SSH connection tests.. Thu Nov 5 10:09:09 2018 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln63] Thu Nov 5 10:09:09 2018 - [debug] Connecting via SSH from root@172.181.191.191(172.181.191.191:22) to root@172.181.191.192(172.181.191.192:22).. WARNING: Your password has expired. Password change required but no TTY available. Thu Nov 5 10:09:09 2018 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln111] SSH connection from root@172.181.191.191(172.181.191.191:22) to root@172.181.191.192(172.181.191.192:22) failed! Thu Nov 5 10:09:10 2018 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln63] Thu Nov 5 10:09:09 2018 - [debug] Connecting via SSH from root@172.181.191.192(172.181.191.192:22) to root@172.181.191.191(172.181.191.191:22).. WARNING: Your password has expired. Password change required but no TTY available. Thu Nov 5 10:09:10 2018 - [error][/usr/local/share/perl5/MHA/SSHCheck.pm, ln111] SSH connection from root@172.181.191.192(172.181.191.192:22) to root@172.181.191.191(172.181.191.191:22) failed! SSH Configuration Check Failed! at /usr/local/bin/masterha_check_ssh line 44.
另外一种表现,就是第二次执行账号切换时报错 sudo su -
解决方案在root账号下,执行以下命令:
chage -M 99999 root
7.FailOver 过程失败,uninitialized value $ssh_user错误
切换过程失败,还有更严重的现象,VIP丢失,此时直接影响业务访问。此外,主从状态也是异常(Slave_IO_Running: No )。
查看MHA log,异常报错信息如下:
Fri Jul 30 01:05:24 2019 - [info] Forcing shutdown so that applications never connect to the current master.. Fri Jul 30 01:05:24 2019 - [info] Executing master IP deactivation script: Fri Jul 30 01:05:24 2019 - [info] /data/mhascripts/master_ip_failover --orig_master_host=110.110.119.119 --orig_master_ip=110.110.119.119 --orig_master_port=3307 --command=stop Got Error: Use of uninitialized value $ssh_user in concatenation (.) or string at /data/mhascripts/master_ip_failover line 94. IN SCRIPT TEST====/sbin/ifconfig net120:1 down==/sbin/ifconfig net120:1 110.110.119.120/20=== *************************************************************** Disabling the VIP - 110.110.119.120/20 on old master: 110.110.119.119 *************************************************************** Fri Jul 30 01:05:24 2019 - [error][/usr/local/share/perl5/MHA/MasterFailover.pm, ln352] Failed to deactivate master IP with return code 1:0 Fri Jul 30 01:05:24 2019 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/local/bin/masterha_manager line 65. Fri Jul 30 01:05:24 2019 - [info]
当发生Master机器重启、关机或者网络异常时,可能会发生这种情况。再关闭MySQL服务时,很难复现这个异常。关闭MySQL服务,MHAFailOver正常。
解决方案:
修改MHA的配置文件(例如 app1.cnf),将
master_ip_failover_script=/data/mhascripts/master_ip_failover
修改为
master_ip_failover_script=/data/mhascripts/master_ip_failover --ssh_user=root
【failover时,代码中多次指定--ssh_user=root,不影响执行】
8.MHA中虚IP问题
在此处,我们分享一个因为MHA 文件配置文件问题导致的真实故障案例。
简单来讲,新搭建集群2时,copy的是集群1的配置文件(app1.cnf --基础项配置文件、 master_ip_failover ---自动failover时候的切换脚本、master_ip_online_change---手动切换时候的切换脚本),
但是在接下来在替换这些配置文件中的参数时,出现了纰漏,没有替换完全。在本案例中是 app1.cnf 中的 master_ip_failover_script= XXXXX,master_ip_online_change_script=YYYYY,参数没有替换,导致了将集群中的VIP再次绑在了集群2上,即集群1对应的VIP即在集群1上,又在集群2上。
这个错误还是相对少见的吧,大家经常看到的错误可能是 master_ip_failover 脚本或者 master_ip_online_change 脚本中的VIP没有替换,导致的VIP绑错。没关系,两者错误现象一样,报错原理一样。
为便于理解,增加以下故障描述:
时间点 | 操作或描述 |
18:17 |
DBA为启动集群2的MHA,执行masterha_check_repl --conf=/etc/mha/app1.cnf,但是此时MHA的cnf文件是有问题的,但是执行成功了,带来的后果是,集群1对应的VIP,绑定到了集群2上。绑定的VIP还立即生效了。(此时通过ip addr 查看,可能会看到两个集群配置的VIP都是集群1的VIP) |
18:23 | DBA 发现自己操作有问题,发现配置文件中的参数没有配置正确,所以,修改正确后,启动集群2的MHA。执行了nohup masterha_manager --conf=/etc/mha/app2.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/mha/app1/manager.log 2>&1 &
(此时通过ip addr 查看,集群1的VIP只在集群1上,集群2的VIP是集群2的VIP),此时正确的VIP覆盖了18:17的误操作,还立即生效了(因为MHA脚本中,我们包含了广播 /sbin/arping -q -c 2 -U -I SSSSS XXX.XXX.XXX.XXX 的脚本)。 |
18:28 | 因为业务感知的滞后性,此时开始收到,业务保障。技术复盘,不用太care此时间点,这个时间点完全和监控相关联,如果监控做的好,此时间,会大大提前。 |
18:36 | 此时ping 集群1的VIP还是不通的。 |
18:37 | 系统自动恢复了。此时ping集群1的VIP是通的。 |
18:17--18:25 这个时间段,虽然可用,但出现了脏写脏读,数据都到别的库上去了(还不如不可用呢),需要特别注意。18:23--18:37 整个时间段根本不可用,应用端的连接会报超时、连不上。
需要注意的是:
(1)MHA中触发绑VIP的脚本,要特别注意,即有 masterha_manager 还有 masterha_check_repl。记住是两个,执行都会绑吆。
(2).导致VIP绑错,最主要的、最经常发生问题的是 master_ip_failover 脚本、master_ip_online_change脚本中的VIP的写错,还应注意app.cnf调用的脚本对不对,也会张冠李戴(本例就是这个问题)。
(3)为什么 20分钟 自动恢复。这个和 交换机中ARP表老化时间有关。大部分Cisco交换机是5分钟,华为的设备一般是20分钟
(4)遇到类似问题,解决方案:及时广播下。在误伤的服务器上广播下理想的VIP ,例如本案例,即在集群1上广播下VIP;或者,重启下受伤害的集群的MHA,例如本案例,即,重新启动下集群1的MHA。
这类问题,多出现在新搭建集群的时候。如果是两个上线已久的集群,因为开始没有仔细测试,或者半途上MHA,这个问题再发生,那影响就更大了。两个VIP都会受影响,这个集群的业务都会报故障的。大家可以想想下。
9.MySQL 8.0 MHA 搭建检查复制时报错(8.0.29 版本)
在执行masterha_check_repl命令时报错,报错信息如下:
Thu May 11 14:41:48 2022 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Thu May 11 14:41:48 2022 - [info] Reading application default configuration from /data/mha/qq_test_mha.cnf.. Thu May 11 14:41:48 2022 - [info] Reading server configuration from /data/mha/qq_test_mha.cnf.. Thu May 11 14:41:48 2022 - [info] MHA::MasterMonitor version 0.56. Thu May 11 14:41:49 2022 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln188] There is no alive server. We can't do failover Thu May 11 14:41:49 2022 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations. at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326. Thu May 11 14:41:49 2022 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers. Thu May 11 14:41:49 2022 - [info] Got exit code 1 (Not master dead). MySQL Replication Health is NOT OK!
经过验证发现,是账户密码的加密算法的问题 即 caching_sha2_password 导致连接不上,修改为mysql_native_password即可解决。
执行的命令如下:
ALTER USER 'mha的账户'@'%' IDENTIFIED WITH mysql_native_password BY 'mha的密码';
10. 主机Down 机(掉线),MHA FailOver时间相对较长的分析
我们可以感受到,这种情况: MySQL service 故障导致的 MHA Failvoer 或者 DBA手动执行FailOver的 耗时 ,远远 少于 ,因为 Server 宕机导致的Failover 消耗的时间。
特别注意,我们此时说的 Server 宕机 是指 因为 Server 系统故障导致的掉线,并且是 没有快速维护的,即在MHA FailOver 的过程中,没有恢复上线的时间。
一般,这个FailOver过程都需要 两分钟左右。
那么,我们先看下此时的FailOver log
Tue Jun 13 17:03:23 2018 - [info] Forcing shutdown so that applications never connect to the current master.. Tue Jun 13 17:03:23 2018 - [info] Executing master IP deactivation script: Tue Jun 13 17:03:23 2018 - [info] /????/master_ip_failover --ssh_user=dblinuxuser --orig_master_host=171.171.171.99 --orig_master_ip=171.171.171.99 --orig_master_port=3306 --command=stop IN SCRIPT TEST====/sbin/ifconfig net110:1 down==/sbin/ifconfig net110:1 171.171.171.222/24=== *************************************************************** Disabling the VIP - 171.171.171.222/24 on old master: 171.171.171.99 *************************************************************** ssh: connect to host 171.171.171.99 port 22: Connection timed out Tue Jun 13 17:05:30 2018 - [info] done. Tue Jun 13 17:05:30 2018 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
我们看到,这个主要步骤,耗时2分钟7秒。
主要是移除原master(171.99)上面的虚拟集群IP(171.222)导致。
继续分析,为啥这个过程会慢的,我看看下关闭VIP的代码:
# A simple system call that enable the VIP on the new master sub start_vip() { `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`; } # A simple system call that disable the VIP on the old_master sub stop_vip() { `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`; }
定位到关键代码了,那么我们继续探究下:
为啥,此时会突然耗时,较长呢?
代码,就较简单,就一行代码,一条命令。
一条ssh的代码。此时,熟悉SSH命令的同学可能会想到,是不是,因为主机宕机,SSH连接不上,导致的FailOver变长了。
恭喜你,答对了,就是因为SSH超时时间问题。
SSH连接超时时间,可以通过 ConnectTimeout 设置。
例如
ssh -o ConnectTimeout=3 192.168.0.10
根据关于SSH的介绍,可以搜索学习相关文档。
再次,我想补充的一点是,MHA 是不是将这个超时时间设置的越小越好呢?当然不是,太小,脑裂的风险就越大。