搭建MHA环境
参考资料:https://www.cnblogs.com/gomysql/p/3675429.html
MHA软件分为两部分,MHA Manager(管理节点)和MHA Node(数据节点),
Manager节点可以独立部署在一台服务器,或和Mysql数据库部署在一起,
Node节点,每个Mysql数据库服务器都要部署。
MHA manager会定期探测master节点,当master节点故障时,自动进行故障切换,提升某一个slave节点为master节点。
在故障转移期间,mha manager会试图连接故障的master节点,保存二进制日志,最大程序保证数据不丢失,建议将集群环境配置为同步复制,保证其中一个slave节点已经收到了最新的二进制文件。
MHA软件由两部分组成,Manager工具包和Node工具包,具体的说明如下。
Manager工具包主要包括以下几个工具:
masterha_check_ssh 检查MHA的SSH配置状况 masterha_check_repl 检查MySQL复制状况 masterha_manger 启动MHA masterha_check_status 检测当前MHA运行状态 masterha_master_monitor 检测master是否宕机 masterha_master_switch 控制故障转移(自动或者手动) masterha_conf_host 添加或删除配置的server信息
Node工具包(这些工具通常由MHA Manager的脚本触发,无需人为操作)主要包括以下几个工具:
save_binary_logs 保存和复制master的二进制日志 apply_diff_relay_logs 识别差异的中继日志事件并将其差异的事件应用于其他的slave filter_mysqlbinlog 去除不必要的ROLLBACK事件(MHA已不再使用这个工具) purge_relay_logs 清除中继日志(不会阻塞SQL线程)
1、本地环境已经搭建完成,一主两备,使用GTID+半同步+多线程复制
全部Centos7
Mater 192.168.150.101 部署MHA Node
Slave1 192.168.150.102 部署MHA Node
Slave2 192.168.150.103 部署MHA Node 、MHA Manger
三台数据库relay_log_purge参数需要设置为OFF,MHA在补数据期间,会用到中断日志,如果中断日志自动清理了,就无法补数据了。
2、三台机器配置SSH互信
在MHA补数据时会自动ssh另一台服务器,因此三台服务器都需要配置SSH互信
配置方法,参考这个文档:
https://www.cnblogs.com/nanxiang/p/13839698.html
3、下载MHA Manager 和MHA Node
之前我下载的0.54版本,不支持GTID复制,记得0.55以上才开始支持。
目前最新的是0.56,我安装的是0.56版本,放在网盘了
链接:
链接:https://pan.baidu.com/s/1sjmGNPV6_X7BL--5A9Fjxw
提取码:1234
复制这段内容后打开百度网盘手机App,操作更方便哦
提取码:1234
4、安装
4.1 安装Node节点程序
解压缩Node包
[root@localhost ~]# unzip mha4mysql-node-master-201604-0.56.zip [root@localhost ~]# cd mha4mysql-node-master/
MHA依赖于perl语言,先要安装perl和相关包
yum install perl-DBD-MySQL -y yum install cpan -y yum install perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker -y cpan ExtUtils::Install cpan install Module::Install ---如有提示【YES】之类的信息,直接回车即可
编辑安装MHA
命令:perl Makefile.PL
[root@localhost mha4mysql-node-master]# perl Makefile.PL
[root@localhost mha4mysql-node-master]# make && make install
如果perl Makefile.PL命令报错:
Can't locate inc/Module/Install.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at Makefile.PL line 1.
BEGIN failed--compilation aborted at Makefile.PL line 1.
就退出当前界面,重新进入再安装。
安装完成后/usr/local/bin会出现如下文件
[root@host102 mha4mysql-node-master]# ll /usr/local/bin//*logs -r-xr-xr-x. 1 root root 16371 Oct 19 14:20 /usr/local/bin//apply_diff_relay_logs -r-xr-xr-x. 1 root root 8263 Oct 19 14:20 /usr/local/bin//purge_relay_logs -r-xr-xr-x. 1 root root 7525 Oct 19 14:20 /usr/local/bin//save_binary_logs
三台服务器需要全部安装Node节点。
4.2安装Manager节点程序
yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes -y
unzip mha4mysql-manager-master-201604-0.56.zip
cd mha4mysql-manager-master/
[root@host101 mha4mysql-manager-master]# perl Makefile.PL
include /root/mha4mysql-manager-master/inc/Module/Install.pm
include inc/Module/Install/Metadata.pm
include inc/Module/Install/Base.pm
include inc/Module/Install/Makefile.pm
include inc/Module/Install/Scripts.pm
include inc/Module/Install/AutoInstall.pm
include inc/Module/Install/Include.pm
include inc/Module/AutoInstall.pm
*** Module::AutoInstall version 1.19
*** Checking for Perl dependencies...
[Core Features]
- DBI ...loaded. (1.627)
- DBD::mysql ...loaded. (4.023)
- Time::HiRes ...loaded. (1.9725)
- Config::Tiny ...loaded. (2.14)
- Log::Dispatch ...missing.
- Parallel::ForkManager ...missing.
- MHA::NodeConst ...loaded. (0.57)
==> Auto-install the 2 mandatory module(s) from CPAN? [y] y
报错还缺少两个包,按y装不上,需要自己安装一下。
cpan install Log::Dispatch
cpan install Parallel::ForkManager
安装过程中一直超时,
修改cpan源再安装试下。
[root@host101 CPAN]# perl -MCPAN -e shell Terminal does not support AddHistory. cpan shell -- CPAN exploration and modules installation (v1.9800) Enter 'h' for help. cpan[1]> o conf init ---查看配置文件所在路径 CPAN.pm requires configuration, but most of it can be done automatically. If you answer 'no' below, you will enter an interactive dialog for each configuration option instead. Would you like to configure as much as possible automatically? [yes] yes ---没有配置文件,提示是否要配置,输入yes Your 'urllist' is already configured. Type 'o conf init urllist' to change it. Autoconfiguration complete. commit: wrote '/root/.cpan/CPAN/MyConfig.pm' ---这里就是配置文件 You can re-run configuration any time with 'o conf init' in the CPAN shell cpan[2]> exit
[root@host101 CPAN]# vi MyConfig.pm ----修改这个配置文件 'urllist' => [q[http://mirrors.aliyuncs.com/CPAN/]], 第一次改为阿里的安装一直卡住,才发现这个域名已经Ping不通了,网站也打不开了,好像阿里已经不再维护了, 果断换了sohu的。 'urllist' => [q[http://mirrors.sohu.com/CPAN/]], 再执行 cpan install Log::Dispatch cpan install Parallel::ForkManager 进度非常快
perl模块安装完成后,
重新编译
perl Makefile.PL
make && make install
4.3运行MHA脚本【masterha_check_ssh】检测SSH互信情况
[root@host101 ~]# masterha_check_ssh --conf=/etc/masterha/app1.cnf Mon Oct 19 17:53:08 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Mon Oct 19 17:53:08 2020 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Mon Oct 19 17:53:08 2020 - [info] Reading server configuration from /etc/masterha/app1.cnf.. Mon Oct 19 17:53:08 2020 - [info] Starting SSH connection tests.. Mon Oct 19 17:53:12 2020 - [debug] Mon Oct 19 17:53:08 2020 - [debug] Connecting via SSH from root@192.168.150.102(192.168.150.102:22) to root@192.168.150.101(192.168.150.101:22).. Mon Oct 19 17:53:10 2020 - [debug] ok. Mon Oct 19 17:53:10 2020 - [debug] Connecting via SSH from root@192.168.150.102(192.168.150.102:22) to root@192.168.150.103(192.168.150.103:22).. Mon Oct 19 17:53:12 2020 - [debug] ok. Mon Oct 19 17:53:17 2020 - [debug] Mon Oct 19 17:53:09 2020 - [debug] Connecting via SSH from root@192.168.150.103(192.168.150.103:22) to root@192.168.150.101(192.168.150.101:22).. Mon Oct 19 17:53:11 2020 - [debug] ok. Mon Oct 19 17:53:11 2020 - [debug] Connecting via SSH from root@192.168.150.103(192.168.150.103:22) to root@192.168.150.102(192.168.150.102:22).. Mon Oct 19 17:53:17 2020 - [debug] ok. Mon Oct 19 17:53:18 2020 - [debug] Mon Oct 19 17:53:08 2020 - [debug] Connecting via SSH from root@192.168.150.101(192.168.150.101:22) to root@192.168.150.102(192.168.150.102:22).. Mon Oct 19 17:53:10 2020 - [debug] ok. Mon Oct 19 17:53:10 2020 - [debug] Connecting via SSH from root@192.168.150.101(192.168.150.101:22) to root@192.168.150.103(192.168.150.103:22).. Mon Oct 19 17:53:18 2020 - [debug] ok. Mon Oct 19 17:53:18 2020 - [info] All SSH connection tests passed successfully. Use of uninitialized value in exit at /root/perl5/bin/masterha_check_ssh line 44.
4.4 运行【masterha_check_repl】,检测主备集环境
0.56版本中有bug,总是提示集群不OK,其实是OK的,
https://github.com/yoshinorim/mha4mysql-manager/issues/89
中途遇到bug,需要按照官方要求,修改下源文件,重新编译。
我上传的网盘附件,已经是正确的了,不会再有这个问题了。
[root@host101 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf Mon Oct 19 17:53:51 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Mon Oct 19 17:53:51 2020 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Mon Oct 19 17:53:51 2020 - [info] Reading server configuration from /etc/masterha/app1.cnf.. Mon Oct 19 17:53:51 2020 - [info] MHA::MasterMonitor version 0.57. Mon Oct 19 17:53:52 2020 - [info] GTID failover mode = 1 Mon Oct 19 17:53:52 2020 - [info] Dead Servers: Mon Oct 19 17:53:52 2020 - [info] Alive Servers: Mon Oct 19 17:53:52 2020 - [info] 192.168.150.101(192.168.150.101:3306) Mon Oct 19 17:53:52 2020 - [info] 192.168.150.102(192.168.150.102:3306) Mon Oct 19 17:53:52 2020 - [info] 192.168.150.103(192.168.150.103:3306) Mon Oct 19 17:53:52 2020 - [info] Alive Slaves: Mon Oct 19 17:53:52 2020 - [info] 192.168.150.102(192.168.150.102:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Mon Oct 19 17:53:52 2020 - [info] GTID ON Mon Oct 19 17:53:52 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Mon Oct 19 17:53:52 2020 - [info] Primary candidate for the new Master (candidate_master is set) Mon Oct 19 17:53:52 2020 - [info] 192.168.150.103(192.168.150.103:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Mon Oct 19 17:53:52 2020 - [info] GTID ON Mon Oct 19 17:53:52 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Mon Oct 19 17:53:52 2020 - [info] Current Alive Master: 192.168.150.101(192.168.150.101:3306) Mon Oct 19 17:53:52 2020 - [info] Checking slave configurations.. Mon Oct 19 17:53:52 2020 - [info] Checking replication filtering settings.. Mon Oct 19 17:53:52 2020 - [info] binlog_do_db= , binlog_ignore_db= Mon Oct 19 17:53:52 2020 - [info] Replication filtering check ok. Mon Oct 19 17:53:52 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking. Mon Oct 19 17:53:52 2020 - [info] Checking SSH publickey authentication settings on the current master.. Mon Oct 19 17:53:53 2020 - [info] HealthCheck: SSH to 192.168.150.101 is reachable. Mon Oct 19 17:53:53 2020 - [info] 192.168.150.101(192.168.150.101:3306) (current master) +--192.168.150.102(192.168.150.102:3306) +--192.168.150.103(192.168.150.103:3306) Mon Oct 19 17:53:53 2020 - [info] Checking replication health on 192.168.150.102.. Mon Oct 19 17:53:53 2020 - [info] ok. Mon Oct 19 17:53:53 2020 - [info] Checking replication health on 192.168.150.103.. Mon Oct 19 17:53:53 2020 - [info] ok. Mon Oct 19 17:53:53 2020 - [info] Checking master_ip_failover_script status: Mon Oct 19 17:53:53 2020 - [info] /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.150.101 --orig_master_ip=192.168.150.101 --orig_master_port=3306 Mon Oct 19 17:53:53 2020 - [error][/root/perl5/lib/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations. Can't exec "/usr/local/bin/master_ip_failover": No such file or directory at /root/perl5/lib/perl5/MHA/ManagerUtil.pm line 68. Mon Oct 19 17:53:53 2020 - [error][/root/perl5/lib/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers. Mon Oct 19 17:53:53 2020 - [info] Got exit code 1 (Not master dead). MySQL Replication Health is NOT OK! Mon Oct 19 17:53:53 2020 - [error][/root/perl5/lib/perl5/MHA/MasterMonitor.pm, ln229] Failed to get master_ip_failover_script status with return code 1:0. Mon Oct 19 17:53:53 2020 - [error][/root/perl5/lib/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations. at /root/perl5/bin/masterha_check_repl line 48. Mon Oct 19 17:53:53 2020 - [error][/root/perl5/lib/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers. Mon Oct 19 17:53:53 2020 - [info] Got exit code 1 (Not master dead). MySQL Replication Health is NOT OK!
4.5 从日志看集群环境已经可以正确识别到了,但是没有找到 master_ip_failover 文件。
创建此文件,并加执行权限 。
touch /usr/local/bin/master_ip_failover
chmod +x /usr/local/bin/master_ip_failover
文件内容如下
#!/usr/bin/env perl use strict; use warnings FATAL => 'all'; use Getopt::Long; my ( $command, $ssh_user, $orig_master_host, $orig_master_ip, $orig_master_port, $new_master_host, $new_master_ip, $new_master_port ); my $vip = '192.168.0.88/24'; ---要为网卡加哪个虚拟IP,就填写那个。 my $key = '1'; my $ssh_start_vip = "/sbin/ifconfig eth1:$key $vip"; ---vip这部分请注意,一会需要手动在网卡加虚拟IP,这里是eth1:1,如果不是,按实际需求改。 my $ssh_stop_vip = "/sbin/ifconfig eth1:$key down"; GetOptions( 'command=s' => \$command, 'ssh_user=s' => \$ssh_user, 'orig_master_host=s' => \$orig_master_host, 'orig_master_ip=s' => \$orig_master_ip, 'orig_master_port=i' => \$orig_master_port, 'new_master_host=s' => \$new_master_host, 'new_master_ip=s' => \$new_master_ip, 'new_master_port=i' => \$new_master_port, ); exit &main(); sub main { print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n"; if ( $command eq "stop" || $command eq "stopssh" ) { my $exit_code = 1; eval { print "Disabling the VIP on old master: $orig_master_host \n"; &stop_vip(); $exit_code = 0; }; if ($@) { warn "Got Error: $@\n"; exit $exit_code; } exit $exit_code; } elsif ( $command eq "start" ) { my $exit_code = 10; eval { print "Enabling the VIP - $vip on the new master - $new_master_host \n"; &start_vip(); $exit_code = 0; }; if ($@) { warn $@; exit $exit_code; } exit $exit_code; } elsif ( $command eq "status" ) { print "Checking the Status of the script.. OK \n"; exit 0; } else { &usage(); exit 1; } } sub start_vip() { `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`; } sub stop_vip() { return 0 unless ($ssh_user); `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`; } sub usage { print "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n"; }
再使用【masterha_check_repl 】检测一遍,就是OK的了
[root@host101 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf Mon Oct 19 18:02:04 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Mon Oct 19 18:02:04 2020 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Mon Oct 19 18:02:04 2020 - [info] Reading server configuration from /etc/masterha/app1.cnf.. Mon Oct 19 18:02:04 2020 - [info] MHA::MasterMonitor version 0.57. Mon Oct 19 18:02:05 2020 - [info] GTID failover mode = 1 Mon Oct 19 18:02:05 2020 - [info] Dead Servers: Mon Oct 19 18:02:05 2020 - [info] Alive Servers: Mon Oct 19 18:02:05 2020 - [info] 192.168.150.101(192.168.150.101:3306) Mon Oct 19 18:02:05 2020 - [info] 192.168.150.102(192.168.150.102:3306) Mon Oct 19 18:02:05 2020 - [info] 192.168.150.103(192.168.150.103:3306) Mon Oct 19 18:02:05 2020 - [info] Alive Slaves: Mon Oct 19 18:02:05 2020 - [info] 192.168.150.102(192.168.150.102:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Mon Oct 19 18:02:05 2020 - [info] GTID ON Mon Oct 19 18:02:05 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Mon Oct 19 18:02:05 2020 - [info] Primary candidate for the new Master (candidate_master is set) Mon Oct 19 18:02:05 2020 - [info] 192.168.150.103(192.168.150.103:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Mon Oct 19 18:02:05 2020 - [info] GTID ON Mon Oct 19 18:02:05 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Mon Oct 19 18:02:05 2020 - [info] Current Alive Master: 192.168.150.101(192.168.150.101:3306) Mon Oct 19 18:02:05 2020 - [info] Checking slave configurations.. Mon Oct 19 18:02:05 2020 - [info] Checking replication filtering settings.. Mon Oct 19 18:02:05 2020 - [info] binlog_do_db= , binlog_ignore_db= Mon Oct 19 18:02:05 2020 - [info] Replication filtering check ok. Mon Oct 19 18:02:05 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking. Mon Oct 19 18:02:05 2020 - [info] Checking SSH publickey authentication settings on the current master.. Mon Oct 19 18:02:10 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 192.168.150.101! at /root/perl5/lib/perl5/MHA/HealthCheck.pm line 343. Mon Oct 19 18:02:10 2020 - [info] 192.168.150.101(192.168.150.101:3306) (current master) +--192.168.150.102(192.168.150.102:3306) +--192.168.150.103(192.168.150.103:3306) Mon Oct 19 18:02:10 2020 - [info] Checking replication health on 192.168.150.102.. Mon Oct 19 18:02:10 2020 - [info] ok. Mon Oct 19 18:02:10 2020 - [info] Checking replication health on 192.168.150.103.. Mon Oct 19 18:02:10 2020 - [info] ok. Mon Oct 19 18:02:10 2020 - [info] Checking master_ip_failover_script status: Mon Oct 19 18:02:10 2020 - [info] /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.150.101 --orig_master_ip=192.168.150.101 --orig_master_port=3306 IN SCRIPT TEST====/sbin/ifconfig eth1:1 down==/sbin/ifconfig eth1:1 192.168.150.88/24=== Checking the Status of the script.. OK Mon Oct 19 18:02:10 2020 - [info] OK. Mon Oct 19 18:02:10 2020 - [warning] shutdown_script is not defined. Mon Oct 19 18:02:10 2020 - [info] Got exit code 0 (Not master dead). MySQL Replication Health is OK.
4.6 为101的机器加虚拟ip
ifconfig eth0:1 192.168.150.88 netmask 255.255.255.0 up
4.7 创建配置文件,实际配置中请将所有注释去掉,否则会报错。
mkdir -p /etc/masterha/
touch /etc/masterha/app1.cnf
mkdir /var/log/masterha/app1.log ----切换后这个目录 会保存一个空文件,【app1.failover.complete】,再次启动MHA时,必须手动将这个文件删掉。
mkdir /var/log/masterha/app1
touch /var/log/masterha/app1/manager.log
[server default] manager_workdir=/var/log/masterha/app1.log //设置manager的工作目录 manager_log=/var/log/masterha/app1/manager.log //设置manager的日志 master_binlog_dir=/usr/local/mysql/data //设置master 保存binlog的位置,以便MHA可以找到master的日志,我这里的也就是mysql的数据目录 master_ip_failover_script= /usr/local/bin/master_ip_failover //设置自动failover时候的切换脚本 master_ip_online_change_script= /usr/local/bin/master_ip_online_change //设置手动切换时候的切换脚本 password=123456 //设置mysql中root用户的密码,这个密码是前文中创建监控用户的那个密码 user=root 设置监控用户root ping_interval=1 //设置监控主库,发送ping包的时间间隔,默认是3秒,尝试三次没有回应的时候自动进行railover remote_workdir=/tmp //设置远端mysql在发生切换时binlog的保存位置 repl_password=123456 //设置复制用户的密码 repl_user=repl //设置复制环境中的复制用户名 report_script=/usr/local/send_report //设置发生切换后发送的报警的脚本 secondary_check_script= /usr/local/bin/masterha_secondary_check -s 192.168.150.103 -s 192.168.150.102 shutdown_script="" //设置故障发生后关闭故障主机脚本(该脚本的主要作用是关闭主机放在发生脑裂,这里没有使用) ssh_user=root //设置ssh的登录用户名 [server1] hostname=192.168.150.101 port=3306 [server2] hostname=192.168.150.102 port=3306 candidate_master=1 //设置为候选master,如果设置该参数以后,发生主从切换以后将会将此从库提升为主库,即使这个主库不是集群中事件最新的slave check_repl_delay=0 //默认情况下如果一个slave落后master 100M的relay logs的话,MHA将不会选择该slave作为一个新的master,因为对于这个slave的恢复需要花费很长时间,通过设置check_repl_delay=0,MHA触发切换在选择一个新的master的时候将会忽略复制延时,这个参数对于设置了candidate_master=1的主机非常有用,因为这个候选主在切换的过程中一定是新的master [server3] hostname=192.168.150.103 port=3306
4.8 启动MHA
nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
注意这两个参数:
--remove_dead_master_conf:如果设置此参数,当成功failover后,MHA manager将会自动删除配置文件中关于dead master的配置选项。
--ignore_last_failover:如果最近failover 失败,MHA 将不会再次开始failover机制,因为这个问题可能再次发生。常规步骤:手动清理failover 错误文件,此文件一般在manager_workdir/app_name.failover.error文件,然后在启动failover机制。
切换后,一定要去手动修改配置文件,因为集群主从环境已经发生了变化,再启动MHA.
4.9观察启动日志
cat /var/log/masterha/app1/manager.log
Mon Oct 19 22:55:00 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Mon Oct 19 22:55:00 2020 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Mon Oct 19 22:55:00 2020 - [info] Reading server configuration from /etc/masterha/app1.cnf.. Mon Oct 19 22:55:00 2020 - [info] MHA::MasterMonitor version 0.57. Mon Oct 19 22:55:01 2020 - [info] GTID failover mode = 1 Mon Oct 19 22:55:01 2020 - [info] Dead Servers: Mon Oct 19 22:55:01 2020 - [info] Alive Servers: Mon Oct 19 22:55:01 2020 - [info] 192.168.150.101(192.168.150.101:3306) Mon Oct 19 22:55:01 2020 - [info] 192.168.150.102(192.168.150.102:3306) Mon Oct 19 22:55:01 2020 - [info] 192.168.150.103(192.168.150.103:3306) Mon Oct 19 22:55:01 2020 - [info] Alive Slaves: Mon Oct 19 22:55:01 2020 - [info] 192.168.150.102(192.168.150.102:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Mon Oct 19 22:55:01 2020 - [info] GTID ON Mon Oct 19 22:55:01 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Mon Oct 19 22:55:01 2020 - [info] 192.168.150.103(192.168.150.103:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Mon Oct 19 22:55:01 2020 - [info] GTID ON Mon Oct 19 22:55:01 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Mon Oct 19 22:55:01 2020 - [info] Current Alive Master: 192.168.150.101(192.168.150.101:3306) Mon Oct 19 22:55:01 2020 - [info] Checking slave configurations.. Mon Oct 19 22:55:01 2020 - [info] read_only=1 is not set on slave 192.168.150.102(192.168.150.102:3306). Mon Oct 19 22:55:01 2020 - [info] Checking replication filtering settings.. Mon Oct 19 22:55:01 2020 - [info] binlog_do_db= , binlog_ignore_db= Mon Oct 19 22:55:01 2020 - [info] Replication filtering check ok. Mon Oct 19 22:55:01 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking. Mon Oct 19 22:55:01 2020 - [info] Checking SSH publickey authentication settings on the current master.. Mon Oct 19 22:55:06 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 192.168.150.101! at /root/perl5/lib/perl5/MHA/HealthCheck.pm line 343. Mon Oct 19 22:55:06 2020 - [info] 192.168.150.101(192.168.150.101:3306) (current master) +--192.168.150.102(192.168.150.102:3306) +--192.168.150.103(192.168.150.103:3306) Mon Oct 19 22:55:06 2020 - [info] Checking master_ip_failover_script status: Mon Oct 19 22:55:06 2020 - [info] /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.150.101 --orig_master_ip=192.168.150.101 --orig_master_port=3306 IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.150.88/24=== Checking the Status of the script.. OK Mon Oct 19 22:55:06 2020 - [info] OK. Mon Oct 19 22:55:06 2020 - [warning] shutdown_script is not defined. Mon Oct 19 22:55:06 2020 - [info] Set master ping interval 1 seconds. Mon Oct 19 22:55:06 2020 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s 192.168.150.103 -s 192.168.150.102 Mon Oct 19 22:55:06 2020 - [info] Starting ping health check on 192.168.150.101(192.168.150.101:3306).. Mon Oct 19 22:55:06 2020 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
4.9 现在101服务器还有一个虚拟IP就是192.168.150.88,现在手动将101数据库停掉,模拟宕机,看下MHA的切换日志。
service mysql stop
看 cat /var/log/masterha/app1/manager.log 最新日志
第45行开始是主库宕机后的日志。
从日志信息看,主分5部分,
MHA连接主库四次失败,就认为主库已经宕了,然后读诹/etc/masterha/app1.cnf 配置文件,获取了当前集群架构,哪些是在线的,哪些是下线的。
第1部分:是不是GTID faile over模式,打当前集群拓扑,在线和不在线的都有显示。
第2部分:主库下线阶段,这时会调用master_ip_failover脚本,拿走192.168.150.88 这个虚拟ip。
第3部分:查看哪个slave节点的日志是最新的。
选择日志最新的slave节点为将来的master节点。
最后应用日志。
新的master节点就产生了,此时会获取一些位点之类的信息,同时配置上虚拟IP。192.168.150.88
第4部分:修改备库的change master信息到新主库。
第5部分:failv over故障转移结束,打印一些详细信息。
36 Checking the Status of the script.. OK 37 Mon Oct 19 22:55:06 2020 - [info] OK. 38 Mon Oct 19 22:55:06 2020 - [warning] shutdown_script is not defined. 39 Mon Oct 19 22:55:06 2020 - [info] Set master ping interval 1 seconds. 40 Mon Oct 19 22:55:06 2020 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s 192.168.150.103 -s 192.168.150.102 41 Mon Oct 19 22:55:06 2020 - [info] Starting ping health check on 192.168.150.101(192.168.150.101:3306).. 42 Mon Oct 19 22:55:06 2020 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond.. 43 Mon Oct 19 22:58:12 2020 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away) 44 Mon Oct 19 22:58:12 2020 - [info] Executing secondary network check script: /usr/local/bin/masterha_secondary_check -s 192.168.150.103 -s 192.168.150.102 --user=root --master_host=192.168.150.101 --master_ip=192.168.150.101 --master_port=3306 --master_user=r oot --master_password=123456 --ping_type=SELECT 45 Mon Oct 19 22:58:12 2020 - [info] Executing SSH check script: exit 0 46 Monitoring server 192.168.150.103 is reachable, Master is not reachable from 192.168.150.103. OK. 47 Mon Oct 19 22:58:13 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.150.101' (111)) 48 Mon Oct 19 22:58:13 2020 - [warning] Connection failed 2 time(s).. 49 Monitoring server 192.168.150.102 is reachable, Master is not reachable from 192.168.150.102. OK. 50 Mon Oct 19 22:58:13 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start. 51 Mon Oct 19 22:58:14 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.150.101' (111)) 52 Mon Oct 19 22:58:14 2020 - [warning] Connection failed 3 time(s).. 53 Mon Oct 19 22:58:15 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.150.101' (111)) 54 Mon Oct 19 22:58:15 2020 - [warning] Connection failed 4 time(s).. 55 Mon Oct 19 22:58:17 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 192.168.150.101! at /root/perl5/lib/perl5/MHA/HealthCheck.pm line 343. 56 Mon Oct 19 22:58:17 2020 - [warning] Master is not reachable from health checker! 57 Mon Oct 19 22:58:17 2020 - [warning] Master 192.168.150.101(192.168.150.101:3306) is not reachable! 58 Mon Oct 19 22:58:17 2020 - [warning] SSH is NOT reachable. 59 Mon Oct 19 22:58:17 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.cnf again, and trying to connect to all servers to check server status.. 60 Mon Oct 19 22:58:17 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. 61 Mon Oct 19 22:58:17 2020 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. 62 Mon Oct 19 22:58:17 2020 - [info] Reading server configuration from /etc/masterha/app1.cnf.. 63 Mon Oct 19 22:58:18 2020 - [info] GTID failover mode = 1 64 Mon Oct 19 22:58:18 2020 - [info] Dead Servers: 65 Mon Oct 19 22:58:18 2020 - [info] 192.168.150.101(192.168.150.101:3306) 66 Mon Oct 19 22:58:18 2020 - [info] Alive Servers: 67 Mon Oct 19 22:58:18 2020 - [info] 192.168.150.102(192.168.150.102:3306) 68 Mon Oct 19 22:58:18 2020 - [info] 192.168.150.103(192.168.150.103:3306) 69 Mon Oct 19 22:58:18 2020 - [info] Alive Slaves: 70 Mon Oct 19 22:58:18 2020 - [info] 192.168.150.102(192.168.150.102:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled 71 Mon Oct 19 22:58:18 2020 - [info] GTID ON 72 Mon Oct 19 22:58:18 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) 73 Mon Oct 19 22:58:18 2020 - [info] 192.168.150.103(192.168.150.103:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled 74 Mon Oct 19 22:58:18 2020 - [info] GTID ON 75 Mon Oct 19 22:58:18 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) 76 Mon Oct 19 22:58:18 2020 - [info] Checking slave configurations.. 77 Mon Oct 19 22:58:18 2020 - [info] read_only=1 is not set on slave 192.168.150.102(192.168.150.102:3306). 78 Mon Oct 19 22:58:18 2020 - [info] Checking replication filtering settings.. 79 Mon Oct 19 22:58:18 2020 - [info] Replication filtering check ok. 80 Mon Oct 19 22:58:18 2020 - [info] Master is down! 81 Mon Oct 19 22:58:18 2020 - [info] Terminating monitoring script. 82 Mon Oct 19 22:58:18 2020 - [info] Got exit code 20 (Master dead). 83 Mon Oct 19 22:58:18 2020 - [info] MHA::MasterFailover version 0.57. 84 Mon Oct 19 22:58:18 2020 - [info] Starting master failover. 85 Mon Oct 19 22:58:18 2020 - [info] 86 Mon Oct 19 22:58:18 2020 - [info] * Phase 1: Configuration Check Phase.. 87 Mon Oct 19 22:58:18 2020 - [info] 88 Mon Oct 19 22:58:19 2020 - [info] GTID failover mode = 1 89 Mon Oct 19 22:58:19 2020 - [info] Dead Servers: 90 Mon Oct 19 22:58:19 2020 - [info] 192.168.150.101(192.168.150.101:3306) 91 Mon Oct 19 22:58:19 2020 - [info] Checking master reachability via MySQL(double check)... 92 Mon Oct 19 22:58:19 2020 - [info] ok. 93 Mon Oct 19 22:58:19 2020 - [info] Alive Servers: 94 Mon Oct 19 22:58:19 2020 - [info] 192.168.150.102(192.168.150.102:3306) 95 Mon Oct 19 22:58:19 2020 - [info] 192.168.150.103(192.168.150.103:3306) 96 Mon Oct 19 22:58:19 2020 - [info] Alive Slaves: 97 Mon Oct 19 22:58:19 2020 - [info] 192.168.150.102(192.168.150.102:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled 98 Mon Oct 19 22:58:19 2020 - [info] GTID ON 99 Mon Oct 19 22:58:19 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) 100 Mon Oct 19 22:58:19 2020 - [info] 192.168.150.103(192.168.150.103:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled 101 Mon Oct 19 22:58:19 2020 - [info] GTID ON 102 Mon Oct 19 22:58:19 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) 103 Mon Oct 19 22:58:19 2020 - [info] Starting GTID based failover. 104 Mon Oct 19 22:58:19 2020 - [info] 105 Mon Oct 19 22:58:19 2020 - [info] ** Phase 1: Configuration Check Phase completed. 106 Mon Oct 19 22:58:19 2020 - [info] 107 Mon Oct 19 22:58:19 2020 - [info] * Phase 2: Dead Master Shutdown Phase.. 108 Mon Oct 19 22:58:19 2020 - [info] 109 Mon Oct 19 22:58:19 2020 - [info] Forcing shutdown so that applications never connect to the current master.. 110 Mon Oct 19 22:58:19 2020 - [info] Executing master IP deactivation script: 111 Mon Oct 19 22:58:19 2020 - [info] /usr/local/bin/master_ip_failover --orig_master_host=192.168.150.101 --orig_master_ip=192.168.150.101 --orig_master_port=3306 --command=stop 112 113 114 IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.150.88/24=== 115 116 Disabling the VIP on old master: 192.168.150.101 117 Mon Oct 19 22:58:19 2020 - [info] done. 118 Mon Oct 19 22:58:19 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. 119 Mon Oct 19 22:58:20 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed. 120 Mon Oct 19 22:58:20 2020 - [info] 121 Mon Oct 19 22:58:20 2020 - [info] * Phase 3: Master Recovery Phase.. 122 Mon Oct 19 22:58:20 2020 - [info] 123 Mon Oct 19 22:58:20 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase.. 124 Mon Oct 19 22:58:20 2020 - [info] 125 Mon Oct 19 22:58:20 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000039:487 126 Mon Oct 19 22:58:20 2020 - [info] Latest slaves (Slaves that received relay log files to the latest): 127 Mon Oct 19 22:58:20 2020 - [info] 192.168.150.102(192.168.150.102:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled 128 Mon Oct 19 22:58:20 2020 - [info] GTID ON 129 Mon Oct 19 22:58:20 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) 130 Mon Oct 19 22:58:20 2020 - [info] 192.168.150.103(192.168.150.103:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled 131 Mon Oct 19 22:58:20 2020 - [info] GTID ON 132 Mon Oct 19 22:58:20 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) 133 Mon Oct 19 22:58:20 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000039:487 134 Mon Oct 19 22:58:20 2020 - [info] Oldest slaves: 135 Mon Oct 19 22:58:20 2020 - [info] 192.168.150.102(192.168.150.102:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled 136 Mon Oct 19 22:58:20 2020 - [info] GTID ON 137 Mon Oct 19 22:58:20 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) 138 Mon Oct 19 22:58:20 2020 - [info] 192.168.150.103(192.168.150.103:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled 139 Mon Oct 19 22:58:20 2020 - [info] GTID ON 140 Mon Oct 19 22:58:20 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) 141 Mon Oct 19 22:58:20 2020 - [info] 142 Mon Oct 19 22:58:20 2020 - [info] * Phase 3.3: Determining New Master Phase.. 143 Mon Oct 19 22:58:20 2020 - [info] 144 Mon Oct 19 22:58:20 2020 - [info] Searching new master from slaves.. 145 Mon Oct 19 22:58:20 2020 - [info] Candidate masters from the configuration file: 146 Mon Oct 19 22:58:20 2020 - [info] Non-candidate masters: 147 Mon Oct 19 22:58:20 2020 - [info] New master is 192.168.150.102(192.168.150.102:3306) 148 Mon Oct 19 22:58:20 2020 - [info] Starting master failover.. 149 Mon Oct 19 22:58:20 2020 - [info] 150 From: 151 192.168.150.101(192.168.150.101:3306) (current master) 152 +--192.168.150.102(192.168.150.102:3306) 153 +--192.168.150.103(192.168.150.103:3306) 154 155 To: 156 192.168.150.102(192.168.150.102:3306) (new master) 157 +--192.168.150.103(192.168.150.103:3306) 158 Mon Oct 19 22:58:20 2020 - [info] 159 Mon Oct 19 22:58:20 2020 - [info] * Phase 3.3: New Master Recovery Phase.. 160 Mon Oct 19 22:58:20 2020 - [info] 161 Mon Oct 19 22:58:20 2020 - [info] Waiting all logs to be applied.. 162 Mon Oct 19 22:58:20 2020 - [info] done. 163 Mon Oct 19 22:58:20 2020 - [info] Getting new master's binlog name and position.. 164 Mon Oct 19 22:58:20 2020 - [info] mysql-bin.000037:234 165 Mon Oct 19 22:58:20 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.150.102', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='root', MASTER_PASSWORD='xxx'; 166 Mon Oct 19 22:58:20 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000037, 234, 9fef2262-97b1-11ea-92b5-000c29cd3ff3:1-82, 167 adc4403d-97b2-11ea-b803-000c298076e0:1-66 168 Mon Oct 19 22:58:20 2020 - [info] Executing master IP activate script: 169 Mon Oct 19 22:58:20 2020 - [info] /usr/local/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.150.101 --orig_master_ip=192.168.150.101 --orig_master_port=3306 --new_master_host=192.168.150.102 --new_master_ip=192.168.150.102 -- new_master_port=3306 --new_master_user='root' --new_master_password=xxx 170 Unknown option: new_master_user 171 Unknown option: new_master_password 172 173 174 IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.150.88/24=== 175 176 Enabling the VIP - 192.168.150.88/24 on the new master - 192.168.150.102 177 Mon Oct 19 22:58:21 2020 - [info] OK. 178 Mon Oct 19 22:58:21 2020 - [info] ** Finished master recovery successfully. 179 Mon Oct 19 22:58:21 2020 - [info] * Phase 3: Master Recovery Phase completed. 180 Mon Oct 19 22:58:21 2020 - [info] 181 Mon Oct 19 22:58:21 2020 - [info] * Phase 4: Slaves Recovery Phase.. 182 Mon Oct 19 22:58:21 2020 - [info] 183 Mon Oct 19 22:58:21 2020 - [info] 184 Mon Oct 19 22:58:21 2020 - [info] * Phase 4.1: Starting Slaves in parallel.. 185 Mon Oct 19 22:58:21 2020 - [info] 186 Mon Oct 19 22:58:21 2020 - [info] -- Slave recovery on host 192.168.150.103(192.168.150.103:3306) started, pid: 100386. Check tmp log /var/log/masterha/app1.log/192.168.150.103_3306_20201019225818.log if it takes time.. 187 Mon Oct 19 22:58:23 2020 - [info] 188 Mon Oct 19 22:58:23 2020 - [info] Log messages from 192.168.150.103 ... 189 Mon Oct 19 22:58:23 2020 - [info] 190 Mon Oct 19 22:58:21 2020 - [info] Resetting slave 192.168.150.103(192.168.150.103:3306) and starting replication from the new master 192.168.150.102(192.168.150.102:3306).. 191 Mon Oct 19 22:58:21 2020 - [info] Executed CHANGE MASTER. 192 Mon Oct 19 22:58:22 2020 - [info] Slave started. 193 Mon Oct 19 22:58:22 2020 - [info] gtid_wait(9fef2262-97b1-11ea-92b5-000c29cd3ff3:1-82, 194 adc4403d-97b2-11ea-b803-000c298076e0:1-66) completed on 192.168.150.103(192.168.150.103:3306). Executed 0 events. 195 Mon Oct 19 22:58:23 2020 - [info] End of log messages from 192.168.150.103. 196 Mon Oct 19 22:58:23 2020 - [info] -- Slave on host 192.168.150.103(192.168.150.103:3306) started. 197 Mon Oct 19 22:58:23 2020 - [info] All new slave servers recovered successfully. 198 Mon Oct 19 22:58:23 2020 - [info] 199 Mon Oct 19 22:58:23 2020 - [info] * Phase 5: New master cleanup phase.. 200 Mon Oct 19 22:58:23 2020 - [info] 201 Mon Oct 19 22:58:23 2020 - [info] Resetting slave info on the new master.. 202 Mon Oct 19 22:58:23 2020 - [info] 192.168.150.102: Resetting slave info succeeded. 203 Mon Oct 19 22:58:23 2020 - [info] Master failover to 192.168.150.102(192.168.150.102:3306) completed successfully. 204 Mon Oct 19 22:58:23 2020 - [info] Deleted server1 entry from /etc/masterha/app1.cnf . 205 Mon Oct 19 22:58:23 2020 - [info] 206 207 ----- Failover Report ----- 208 209 app1: MySQL Master failover 192.168.150.101(192.168.150.101:3306) to 192.168.150.102(192.168.150.102:3306) succeeded 210 211 Master 192.168.150.101(192.168.150.101:3306) is down! 212 213 Check MHA Manager logs at host101:/var/log/masterha/app1/manager.log for details. 214 215 Started automated(non-interactive) failover. 216 Invalidated master IP address on 192.168.150.101(192.168.150.101:3306) 217 Selected 192.168.150.102(192.168.150.102:3306) as a new master. 218 192.168.150.102(192.168.150.102:3306): OK: Applying all logs succeeded. 219 192.168.150.102(192.168.150.102:3306): OK: Activated master IP address. 220 192.168.150.103(192.168.150.103:3306): OK: Slave started, replicating from 192.168.150.102(192.168.150.102:3306) 221 192.168.150.102(192.168.150.102:3306): Resetting slave info succeeded. 222 Master failover to 192.168.150.102(192.168.150.102:3306) completed successfully.
从日志看MHA没有从主实例拷贝binlog,原来在GTID模式下MHA 不会去拷贝binlog,需要搭建BINLOG server
http://www.mamicode.com/info-detail-3071910.html
https://www.cnblogs.com/tim1blog/p/9877164.html#_label2
https://www.jianshu.com/p/bc0d8183b3b0?utm_campaign=hugo
app1.cnf配置文件修改后如下,看最上边多了binlog1 信息,
在102服务器创建/root/binlog目录,启动binlog server,
nohup mysqlbinlog -R --host=192.168.150.101 --user=root --password=123456 --raw --stop-never mysql-bin.000038 &
[binlog1] hostname=192.168.150.102 master_binlog_dir=/root/binlog/ no_master=1 [server default] manager_log=/var/log/masterha/app1/manager.log manager_workdir=/var/log/masterha/app1.log master_binlog_dir=/usr/local/mysql/data master_ip_failover_script=/usr/local/bin/master_ip_failover master_ip_online_change_script=/usr/local/bin/master_ip_online_change password=123456 ping_interval=1 remote_workdir=/tmp repl_password=123456 repl_user=root secondary_check_script=/usr/local/bin/masterha_secondary_check -s 192.168.150.103 -s 192.168.150.102 shutdown_script="" ssh_user=root user=root xxx ....后续省略
做两个软件连接(三个节点都要操作)
ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql
主机宕机后再看最新manager.log日志
Tue Oct 20 10:58:06 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Tue Oct 20 10:58:06 2020 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Tue Oct 20 10:58:06 2020 - [info] Reading server configuration from /etc/masterha/app1.cnf.. Tue Oct 20 10:59:36 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Tue Oct 20 10:59:36 2020 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Tue Oct 20 10:59:36 2020 - [info] Reading server configuration from /etc/masterha/app1.cnf.. 58:07 2020 - [info] 192.168.150.102(192.168.150.102:3306) Tue Oct 20 10:58:07 2020 - [info] 192.168.150.103(192.168.150.103:3306) Tue Oct 20 10:58:07 2020 - [info] Alive Slaves: Tue Oct 20 10:58:07 2020 - [info] 192.168.150.102(192.168.150.102:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Tue Oct 20 10:58:07 2020 - [info] GTID ON Tue Oct 20 10:58:07 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Tue Oct 20 10:58:07 2020 - [info] 192.168.150.103(192.168.150.103:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Tue Oct 20 10:58:07 2020 - [info] GTID ON Tue Oct 20 10:58:07 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Tue Oct 20 10:58:07 2020 - [info] Current Alive Master: 192.168.150.101(192.168.150.101:3306) Tue Oct 20 10:58:07 2020 - [info] Checking slave configurations.. Tue Oct 20 10:58:07 2020 - [info] read_only=1 is not set on slave 192.168.150.102(192.168.150.102:3306). Tue Oct 20 10:58:07 2020 - [info] read_only=1 is not set on slave 192.168.150.103(192.168.150.103:3306). Tue Oct 20 10:58:07 2020 - [info] Checking replication filtering settings.. Tue Oct 20 10:58:07 2020 - [info] binlog_do_db= , binlog_ignore_db= Tue Oct 20 10:58:07 2020 - [info] Replication filtering check ok. Tue Oct 20 10:58:07 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking. Tue Oct 20 10:58:08 2020 - [info] HealthCheck: SSH to 192.168.150.102 is reachable. Tue Oct 20 10:58:08 2020 - [info] Binlog server 192.168.150.102 is reachable. Tue Oct 20 10:58:08 2020 - [info] Checking recovery script configurations on 192.168.150.102(192.168.150.102:3306).. Tue Oct 20 10:58:08 2020 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/root/binlog/ --output_file=/tmp/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000043 Tue Oct 20 10:58:08 2020 - [info] Connecting to root@192.168.150.102(192.168.150.102:22).. Creating /tmp if not exists.. ok. Checking output directory is accessible or not.. ok. Binlog found at /root/binlog/, up to mysql-bin.000043 Tue Oct 20 10:58:09 2020 - [info] Binlog setting check done. Tue Oct 20 10:58:09 2020 - [info] Checking SSH publickey authentication settings on the current master.. Tue Oct 20 10:58:11 2020 - [info] HealthCheck: SSH to 192.168.150.101 is reachable. Tue Oct 20 10:58:11 2020 - [info] 192.168.150.101(192.168.150.101:3306) (current master) +--192.168.150.102(192.168.150.102:3306) +--192.168.150.103(192.168.150.103:3306) Tue Oct 20 10:58:11 2020 - [info] Checking master_ip_failover_script status: Tue Oct 20 10:58:11 2020 - [info] /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.150.101 --orig_master_ip=192.168.150.101 --orig_master_port=3306 IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.150.88/24=== Checking the Status of the script.. OK Tue Oct 20 10:58:11 2020 - [info] OK. Tue Oct 20 10:58:11 2020 - [warning] shutdown_script is not defined. Tue Oct 20 10:58:11 2020 - [info] Set master ping interval 1 seconds. Tue Oct 20 10:58:11 2020 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s 192.168.150.103 -s 192.168.150.102 Tue Oct 20 10:58:11 2020 - [info] Starting ping health check on 192.168.150.101(192.168.150.101:3306).. Tue Oct 20 10:58:11 2020 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond.. Tue Oct 20 10:59:32 2020 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away) Tue Oct 20 10:59:32 2020 - [info] Executing SSH check script: exit 0 Tue Oct 20 10:59:32 2020 - [info] Executing secondary network check script: /usr/local/bin/masterha_secondary_check -s 192.168.150.103 -s 192.168.150.102 --user=root --master_host=192.168.150.101 --master_ip=192.168.150.101 --master_port=3306 --master_user=root --master_password=123456 --ping_type=SELECT Monitoring server 192.168.150.103 is reachable, Master is not reachable from 192.168.150.103. OK. Tue Oct 20 10:59:33 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.150.101' (111)) Tue Oct 20 10:59:33 2020 - [warning] Connection failed 2 time(s).. Monitoring server 192.168.150.102 is reachable, Master is not reachable from 192.168.150.102. OK. Tue Oct 20 10:59:34 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start. Tue Oct 20 10:59:34 2020 - [info] HealthCheck: SSH to 192.168.150.101 is reachable. Tue Oct 20 10:59:34 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.150.101' (111)) Tue Oct 20 10:59:34 2020 - [warning] Connection failed 3 time(s).. Tue Oct 20 10:59:35 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.150.101' (111)) Tue Oct 20 10:59:35 2020 - [warning] Connection failed 4 time(s).. Tue Oct 20 10:59:35 2020 - [warning] Master is not reachable from health checker! Tue Oct 20 10:59:35 2020 - [warning] Master 192.168.150.101(192.168.150.101:3306) is not reachable! Tue Oct 20 10:59:35 2020 - [warning] SSH is reachable. Tue Oct 20 10:59:35 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.cnf again, and trying to connect to all servers to check server status.. Tue Oct 20 10:59:35 2020 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping. Tue Oct 20 10:59:35 2020 - [info] Reading application default configuration from /etc/masterha/app1.cnf.. Tue Oct 20 10:59:35 2020 - [info] Reading server configuration from /etc/masterha/app1.cnf.. Tue Oct 20 10:59:36 2020 - [info] GTID failover mode = 1 Tue Oct 20 10:59:36 2020 - [info] Dead Servers: Tue Oct 20 10:59:36 2020 - [info] 192.168.150.101(192.168.150.101:3306) Tue Oct 20 10:59:36 2020 - [info] Alive Servers: Tue Oct 20 10:59:36 2020 - [info] 192.168.150.102(192.168.150.102:3306) Tue Oct 20 10:59:36 2020 - [info] 192.168.150.103(192.168.150.103:3306) Tue Oct 20 10:59:36 2020 - [info] Alive Slaves: Tue Oct 20 10:59:36 2020 - [info] 192.168.150.102(192.168.150.102:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Tue Oct 20 10:59:36 2020 - [info] GTID ON Tue Oct 20 10:59:36 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Tue Oct 20 10:59:36 2020 - [info] 192.168.150.103(192.168.150.103:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Tue Oct 20 10:59:36 2020 - [info] GTID ON Tue Oct 20 10:59:36 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Tue Oct 20 10:59:36 2020 - [info] Checking slave configurations.. Tue Oct 20 10:59:36 2020 - [info] read_only=1 is not set on slave 192.168.150.102(192.168.150.102:3306). Tue Oct 20 10:59:36 2020 - [info] read_only=1 is not set on slave 192.168.150.103(192.168.150.103:3306). Tue Oct 20 10:59:36 2020 - [info] Checking replication filtering settings.. Tue Oct 20 10:59:36 2020 - [info] Replication filtering check ok. Tue Oct 20 10:59:36 2020 - [info] Master is down! Tue Oct 20 10:59:36 2020 - [info] Terminating monitoring script. Tue Oct 20 10:59:36 2020 - [info] Got exit code 20 (Master dead). Tue Oct 20 10:59:36 2020 - [info] MHA::MasterFailover version 0.56. Tue Oct 20 10:59:36 2020 - [info] Starting master failover. Tue Oct 20 10:59:36 2020 - [info] Tue Oct 20 10:59:36 2020 - [info] * Phase 1: Configuration Check Phase.. Tue Oct 20 10:59:36 2020 - [info] Tue Oct 20 10:59:37 2020 - [info] HealthCheck: SSH to 192.168.150.102 is reachable. Tue Oct 20 10:59:38 2020 - [info] Binlog server 192.168.150.102 is reachable. Tue Oct 20 10:59:39 2020 - [info] GTID failover mode = 1 Tue Oct 20 10:59:39 2020 - [info] Dead Servers: Tue Oct 20 10:59:39 2020 - [info] 192.168.150.101(192.168.150.101:3306) Tue Oct 20 10:59:39 2020 - [info] Checking master reachability via MySQL(double check)... Tue Oct 20 10:59:39 2020 - [info] ok. Tue Oct 20 10:59:39 2020 - [info] Alive Servers: Tue Oct 20 10:59:39 2020 - [info] 192.168.150.102(192.168.150.102:3306) Tue Oct 20 10:59:39 2020 - [info] 192.168.150.103(192.168.150.103:3306) Tue Oct 20 10:59:39 2020 - [info] Alive Slaves: Tue Oct 20 10:59:39 2020 - [info] 192.168.150.102(192.168.150.102:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Tue Oct 20 10:59:39 2020 - [info] GTID ON Tue Oct 20 10:59:39 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Tue Oct 20 10:59:39 2020 - [info] 192.168.150.103(192.168.150.103:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Tue Oct 20 10:59:39 2020 - [info] GTID ON Tue Oct 20 10:59:39 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Tue Oct 20 10:59:39 2020 - [info] Starting GTID based failover. Tue Oct 20 10:59:39 2020 - [info] Tue Oct 20 10:59:39 2020 - [info] ** Phase 1: Configuration Check Phase completed. Tue Oct 20 10:59:39 2020 - [info] Tue Oct 20 10:59:39 2020 - [info] * Phase 2: Dead Master Shutdown Phase.. Tue Oct 20 10:59:39 2020 - [info] Tue Oct 20 10:59:39 2020 - [info] Forcing shutdown so that applications never connect to the current master.. Tue Oct 20 10:59:39 2020 - [info] Executing master IP deactivation script: Tue Oct 20 10:59:39 2020 - [info] /usr/local/bin/master_ip_failover --orig_master_host=192.168.150.101 --orig_master_ip=192.168.150.101 --orig_master_port=3306 --command=stopssh --ssh_user=root IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.150.88/24=== Disabling the VIP on old master: 192.168.150.101 SIOCSIFFLAGS: Cannot assign requested address Tue Oct 20 10:59:40 2020 - [info] done. Tue Oct 20 10:59:40 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. Tue Oct 20 10:59:40 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed. Tue Oct 20 10:59:40 2020 - [info] Tue Oct 20 10:59:40 2020 - [info] * Phase 3: Master Recovery Phase.. Tue Oct 20 10:59:40 2020 - [info] Tue Oct 20 10:59:40 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase.. Tue Oct 20 10:59:40 2020 - [info] Tue Oct 20 10:59:41 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000043:1278 Tue Oct 20 10:59:41 2020 - [info] Retrieved Gtid Set: 9fef2262-97b1-11ea-92b5-000c29cd3ff3:102-105 Tue Oct 20 10:59:41 2020 - [info] Latest slaves (Slaves that received relay log files to the latest): Tue Oct 20 10:59:41 2020 - [info] 192.168.150.102(192.168.150.102:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Tue Oct 20 10:59:41 2020 - [info] GTID ON Tue Oct 20 10:59:41 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Tue Oct 20 10:59:41 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000043:495 Tue Oct 20 10:59:41 2020 - [info] Retrieved Gtid Set: 9fef2262-97b1-11ea-92b5-000c29cd3ff3:102 Tue Oct 20 10:59:41 2020 - [info] Oldest slaves: Tue Oct 20 10:59:41 2020 - [info] 192.168.150.103(192.168.150.103:3306) Version=5.7.29-log (oldest major version between slaves) log-bin:enabled Tue Oct 20 10:59:41 2020 - [info] GTID ON Tue Oct 20 10:59:41 2020 - [info] Replicating from 192.168.150.101(192.168.150.101:3306) Tue Oct 20 10:59:41 2020 - [info] Tue Oct 20 10:59:41 2020 - [info] * Phase 3.3: Determining New Master Phase.. Tue Oct 20 10:59:41 2020 - [info] Tue Oct 20 10:59:41 2020 - [info] Searching new master from slaves.. Tue Oct 20 10:59:41 2020 - [info] Candidate masters from the configuration file: Tue Oct 20 10:59:41 2020 - [info] Non-candidate masters: Tue Oct 20 10:59:41 2020 - [info] New master is 192.168.150.102(192.168.150.102:3306) Tue Oct 20 10:59:41 2020 - [info] Starting master failover.. Tue Oct 20 10:59:41 2020 - [info] From: 192.168.150.101(192.168.150.101:3306) (current master) +--192.168.150.102(192.168.150.102:3306) +--192.168.150.103(192.168.150.103:3306) To: 192.168.150.102(192.168.150.102:3306) (new master) +--192.168.150.103(192.168.150.103:3306) Tue Oct 20 10:59:41 2020 - [info] Tue Oct 20 10:59:41 2020 - [info] * Phase 3.3: New Master Recovery Phase.. Tue Oct 20 10:59:41 2020 - [info] Tue Oct 20 10:59:41 2020 - [info] Waiting all logs to be applied.. Tue Oct 20 10:59:41 2020 - [info] done. Tue Oct 20 10:59:41 2020 - [info] -- Saving binlog from host 192.168.150.102 started, pid: 25954 Tue Oct 20 10:59:44 2020 - [info] Tue Oct 20 10:59:44 2020 - [info] Log messages from 192.168.150.102 ... Tue Oct 20 10:59:44 2020 - [info] Tue Oct 20 10:59:41 2020 - [info] Fetching binary logs from binlog server 192.168.150.102.. Tue Oct 20 10:59:41 2020 - [info] Executing binlog save command: save_binary_logs --command=save --start_file=mysql-bin.000043 --start_pos=1278 --output_file=/tmp/saved_binlog_binlog1_20201020105936.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.29-log --binlog_dir=/root/binlog/ Creating /tmp if not exists.. ok. Concat binary/relay logs from mysql-bin.000043 pos 1278 to mysql-bin.000043 EOF into /tmp/saved_binlog_binlog1_20201020105936.binlog .. Concat succeeded. Tue Oct 20 10:59:43 2020 - [info] scp from root@192.168.150.102:/tmp/saved_binlog_binlog1_20201020105936.binlog to local:/var/log/masterha/app1.log/saved_binlog_192.168.150.102_binlog1_20201020105936.binlog succeeded. Tue Oct 20 10:59:44 2020 - [info] End of log messages from 192.168.150.102. Tue Oct 20 10:59:44 2020 - [info] Saved mysqlbinlog size from 192.168.150.102 is 3910 bytes. Tue Oct 20 10:59:44 2020 - [info] Applying differential binlog /var/log/masterha/app1.log/saved_binlog_192.168.150.102_binlog1_20201020105936.binlog .. Tue Oct 20 10:59:44 2020 - [info] Differential log apply from binlog server succeeded. Tue Oct 20 10:59:44 2020 - [info] Getting new master's binlog name and position.. Tue Oct 20 10:59:44 2020 - [info] mysql-bin.000041:30316883 Tue Oct 20 10:59:44 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.150.102', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='root', MASTER_PASSWORD='xxx'; Tue Oct 20 10:59:44 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000041, 30316883, 9fef2262-97b1-11ea-92b5-000c29cd3ff3:1-108, adc4403d-97b2-11ea-b803-000c298076e0:1-97 Tue Oct 20 10:59:44 2020 - [info] Executing master IP activate script: Tue Oct 20 10:59:44 2020 - [info] /usr/local/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.150.101 --orig_master_ip=192.168.150.101 --orig_master_port=3306 --new_master_host=192.168.150.102 --new_master_ip=192.168.150.102 --new_master_port=3306 --new_master_user='root' --new_master_password='123456' Unknown option: new_master_user Unknown option: new_master_password IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 192.168.150.88/24=== Enabling the VIP - 192.168.150.88/24 on the new master - 192.168.150.102 Tue Oct 20 10:59:44 2020 - [info] OK. Tue Oct 20 10:59:44 2020 - [info] ** Finished master recovery successfully. Tue Oct 20 10:59:44 2020 - [info] * Phase 3: Master Recovery Phase completed. Tue Oct 20 10:59:44 2020 - [info] Tue Oct 20 10:59:44 2020 - [info] * Phase 4: Slaves Recovery Phase.. Tue Oct 20 10:59:44 2020 - [info] Tue Oct 20 10:59:44 2020 - [info] Tue Oct 20 10:59:44 2020 - [info] * Phase 4.1: Starting Slaves in parallel.. Tue Oct 20 10:59:44 2020 - [info] Tue Oct 20 10:59:44 2020 - [info] -- Slave recovery on host 192.168.150.103(192.168.150.103:3306) started, pid: 25981. Check tmp log /var/log/masterha/app1.log/192.168.150.103_3306_20201020105936.log if it takes time.. Tue Oct 20 10:59:45 2020 - [info] Tue Oct 20 10:59:45 2020 - [info] Log messages from 192.168.150.103 ... Tue Oct 20 10:59:45 2020 - [info] Tue Oct 20 10:59:44 2020 - [info] Resetting slave 192.168.150.103(192.168.150.103:3306) and starting replication from the new master 192.168.150.102(192.168.150.102:3306).. Tue Oct 20 10:59:44 2020 - [info] Executed CHANGE MASTER. Tue Oct 20 10:59:44 2020 - [info] Slave started. Tue Oct 20 10:59:45 2020 - [info] gtid_wait(9fef2262-97b1-11ea-92b5-000c29cd3ff3:1-108, adc4403d-97b2-11ea-b803-000c298076e0:1-97) completed on 192.168.150.103(192.168.150.103:3306). Executed 2 events. Tue Oct 20 10:59:45 2020 - [info] End of log messages from 192.168.150.103. Tue Oct 20 10:59:45 2020 - [info] -- Slave on host 192.168.150.103(192.168.150.103:3306) started. Tue Oct 20 10:59:45 2020 - [info] All new slave servers recovered successfully. Tue Oct 20 10:59:45 2020 - [info] Tue Oct 20 10:59:45 2020 - [info] * Phase 5: New master cleanup phase.. Tue Oct 20 10:59:45 2020 - [info] Tue Oct 20 10:59:45 2020 - [info] Resetting slave info on the new master.. Tue Oct 20 10:59:45 2020 - [info] 192.168.150.102: Resetting slave info succeeded. Tue Oct 20 10:59:45 2020 - [info] Master failover to 192.168.150.102(192.168.150.102:3306) completed successfully. Tue Oct 20 10:59:45 2020 - [info] Deleted server1 entry from /etc/masterha/app1.cnf . Tue Oct 20 10:59:45 2020 - [info] ----- Failover Report ----- app1: MySQL Master failover 192.168.150.101(192.168.150.101:3306) to 192.168.150.102(192.168.150.102:3306) succeeded Master 192.168.150.101(192.168.150.101:3306) is down! Check MHA Manager logs at host101:/var/log/masterha/app1/manager.log for details. Started automated(non-interactive) failover. Invalidated master IP address on 192.168.150.101(192.168.150.101:3306) Selected 192.168.150.102(192.168.150.102:3306) as a new master. 192.168.150.102(192.168.150.102:3306): OK: Applying all logs succeeded. 192.168.150.102(192.168.150.102:3306): OK: Activated master IP address. 192.168.150.103(192.168.150.103:3306): OK: Slave started, replicating from 192.168.150.102(192.168.150.102:3306) 192.168.150.102(192.168.150.102:3306): Resetting slave info succeeded. Master failover to 192.168.150.102(192.168.150.102:3306) completed successfully.
Phase 3.3: New Master Recovery Phase..
看这部分日志已经发生了变化,在从远端获取binlog,并进行应用,应用完成后新的Master就会挂上VIP了,
挂上VIP之后
4: Slaves Recovery Phase..
第4步才开始对slave节点进行数据恢复。
整个过程踩坑不少,配置MHA还是比较麻烦的,后续进行总结。
1、每次切换后都需要修改一下配置文件,因为集群主备环境拓扑发生了变化。
2、我的mysql集群环境是gtid+半同步,理论上可以不用配置binlog server,但在极端情况下会有丢数据风险,比如半同步复制退化为异常复制了。
3、之前有文章说需要关闭自动清理中继日志,在GTID复制模式下不用,MHA对GTID复制场景和非GTID复制场景有不同的补数据逻辑,在非GTID场景下会使用到中继日志。
4、如果MHA在切换过程中遇到问题,就是直接退出了,没有别的处理逻辑。
5、MHA不适用于自动化运维。
https://ke.qq.comcourse/430673 把MHA拉下神坛
值得学习。
自己做了一些测试102,103 sql_thread全停掉,101一直在写数据,
如果有binlogserver数据不会丢,
如果没有binlogserver数据会丢失。
102有最新的中断日志,但同步有大量延迟,103没有最新的中继日志,但同步位点比102靠前。
MHA选主时还是选102,因为102和103比,中继日志比较新。没有配置binlogserver,不会丢数据。