BenjaminYang In solitude, where we are least alone

mysql高可用架构之-MHA学习

此博文参考  博主:mysql高级DBA  yayun  完成

简介:

MHA(Master High Availability)目前在MySQL高可用方面是一个相对成熟的解决方案,它由日本DeNA公司youshimaton(现就职于Facebook公司)开发,是一套优秀的作为MySQL高可用性环境下故障切换和主从提升的高可用软件。在MySQL故障切换过程中,MHA能做到在0~30秒之内自动完成数据库的故障切换操作,并且在进行故障切换的过程中,MHA能在最大程度上保证数据的一致性,以达到真正意义上的高可用。

该软件由两部分组成:MHA Manager(管理节点)和MHA Node(数据节点)。MHA Manager可以单独部署在一台独立的机器上管理多个master-slave集群,也可以部署在一台slave节点上。MHA Node运行在每台MySQL服务器上,MHA Manager会定时探测集群中的master节点,当master出现故障时,它可以自动将最新数据的slave提升为新的master,然后将所有其他的slave重新指向新的master。整个故障转移过程对应用程序完全透明。

在MHA自动故障切换过程中,MHA试图从宕机的主服务器上保存二进制日志,最大程度的保证数据的不丢失,但这并不总是可行的。例如,如果主服务器硬件故障或无法通过ssh访问,MHA没法保存二进制日志,只进行故障转移而丢失了最新的数据。使用MySQL 5.5的半同步复制,可以大大降低数据丢失的风险。MHA可以与半同步复制结合起来。如果只有一个slave已经收到了最新的二进制日志,MHA可以将最新的二进制日志应用于其他所有的slave服务器上,因此可以保证所有节点的数据一致性。

目前MHA主要支持一主多从的架构,要搭建MHA,要求一个复制集群中必须最少有三台数据库服务器,一主二从,即一台充当master,一台充当备用master,另外一台充当从库,因为至少需要三台服务器,出于机器成本的考虑,淘宝也在该基础上进行了改造,目前淘宝TMHA已经支持一主一从。另外对于想快速搭建的可以参考:MHA快速搭建

我们自己使用其实也可以使用1主1从,但是master主机宕机后无法切换,以及无法补全binlog。master的mysqld进程crash后,还是可以切换成功,以及补全binlog的。

官方介绍:https://code.google.com/p/mysql-master-ha/

图01展示了如何通过MHA Manager管理多组主从复制。可以将MHA工作原理总结为如下:

每一组都是一组主从复制组

(1)从宕机崩溃的master保存二进制日志事件(binlog events);

(2)识别含有最新更新的slave;

(3)应用差异的中继日志(relay log)到其他的slave;

(4)应用从master保存的二进制日志事件(binlog events);

(5)提升一个slave为新的master;

(6)使其他的slave连接新的master进行复制;

1.编译安装mysql  参考地址:

①cmake的安装

yum install -y cmake
yum -y install make gcc-c++ cmake bison-devel  ncurses-devel gcc autoconf automake zlib* fiex* libxml* libmcrypt* libtool-ltdl-devel*

②mysql源码安装

采用源码安装方式安装,先将mysql-5.6.16.tar源码拷贝至服务器解压,解压后进入mysql-5.6.16这个目录中
使用cmake源码安装mysql(如果你打算安装到不同的路径,注意修改下面语句中/usr/local/mysql这个路径!)
cmake -DCMAKE_INSTALL_PREFIX=/usr/local/mysql \
-DMYSQL_UNIX_ADDR=/usr/local/mysql/mysql.sock \
-DDEFAULT_CHARSET=utf8 \
-DDEFAULT_COLLATION=utf8_general_ci \
-DWITH_MYISAM_STORAGE_ENGINE=1 \
-DWITH_INNOBASE_STORAGE_ENGINE=1 \
-DWITH_ARCHIVE_STORAGE_ENGINE=1 \
-DWITH_BLACKHOLE_STORAGE_ENGINE=1 \
-DWITH_MEMORY_STORAGE_ENGINE=1 \
-DWITH_READLINE=1 \
-DENABLED_LOCAL_INFILE=1 \
-DMYSQL_DATADIR=/mysql/data \
-DMYSQL_USER=mysql \
-DMYSQL_TCP_PORT=3306 \
-DENABLE_DOWNLOADS=1
上面的这些复制完,回车,然后就开始cmake的过程,一般时间不会很长。
cmake结束后开始编译源码,这一步时间会较长,请耐心等待。
make  (注:此步骤可以使用make -j多个CPU一起编译,提升速度,单核CPU不要加此参数)
安装编译好的程序
make install
注意:如果需要重装mysql,在/usr/local/src/mysql-5.6.16在执行下make install就可以了,不需要再cmake和make
清除安装临时文件
make clean

 

③初始化数据目录

cd /usr/local/mysql/scripts/
创建mysql用户,将/mysql  /user/local/mysql 目录赋权给mysql
useradd mysql
chown -Rf mysql:mysql /mysql
chown -Rf mysql:mysql /usr/local/mysql
./mysql_install_db --datadir=/mysql/data --basedir=/usr/local/mysql --user=mysql
出现两个OK字样,表示数据目录初始化完毕在/mysql/data下

④mysql注册服务并启动

cd /usr/local/mysql/support-files
cp mysql.server /etc/rc.d/init.d/mysql
cp my-default.cnf /etc/my.cnf
chkconfig --add mysql
chkconfig mysql on
service mysql start
mysql -u mysql -S /usr/local/mysql/mysql.sock
连接需要sock信息,可以-S指定路径或者把socket路径写在my.cnf中,配置相应路径,否则连接报错,建议写在配置文件里
[mysqld] 
socket=/usr/local/mysql/mysql.sock
[client] 
socket=/usr/local/mysql/mysql.sock
其他my.cnf参数后续添加

 # mysql主从复制
本实验通过三台服务器搭建主从复制环境
Master                 10.0.0.50
cadicateMaster   10.0.0.60
slave                   10.0.0.70
修改主库和从库的参数文件(/etc/my.cnf)
主库:10.0.0.50

#[mysqld]标签下追加
server_id =1     
log-bin=/mysql/data/mysql-bin
log-bin-index=/mysql/data/mysql-binlog.index  
log_slave_updates=1
sync-binlog = 1

 


从库1:10.0.0.60 (备主)

#[mysqld]标签下追加
server_id =2
log-bin=/mysql/data/mysql-bin
log-bin-index=/mysql/data/mysql-binlog.index
log_slave_updates=1
sync-binlog = 1

 


从库2:10.0.0.70

#[mysqld]标签下追加
read_only=1
server_id = 3

 

MHA软件由两部分组成,Manager工具包和Node工具包,具体的说明如下。

Manager工具包主要包括以下几个工具:

masterha_check_ssh              检查MHA的SSH配置状况
masterha_check_repl             检查MySQL复制状况
masterha_manger                 启动MHA
masterha_check_status           检测当前MHA运行状态
masterha_master_monitor         检测master是否宕机
masterha_master_switch          控制故障转移(自动或者手动)
masterha_conf_host              添加或删除配置的server信息

Node工具包(这些工具通常由MHA Manager的脚本触发,无需人为操作)主要包括以下几个工具:

save_binary_logs                保存和复制master的二进制日志
apply_diff_relay_logs           识别差异的中继日志事件并将其差异的事件应用于其他的slave
filter_mysqlbinlog              去除不必要的ROLLBACK事件(MHA已不再使用这个工具)
purge_relay_logs                清除中继日志(不会阻塞SQL线程)

注意:

为了尽可能的减少主库硬件损坏宕机造成的数据丢失,因此在配置MHA的同时建议配置成MySQL 5.5的半同步复制。关于半同步复制原理各位自己进行查阅。(不是必须)

角色                    ip地址          主机名          server_id                  类型
Monitor host            10.0.0.20    manager            -                      监控复制组
Master                  10.0.0.50    master             1                        写入
Candicate master        10.0.0.60    cadicate-master    2                         读
Slave                   10.0.0.70    slave              3                         读

其中master对外提供写服务,备选master(实际的slave,主机名Candicate master)提供读服务,slave也提供相关的读服务,一旦master宕机,将会把备选master提升为新的master,slave指向新的master

(1)在所有节点安装MHA node所需的perl模块(DBD:mysql),安装脚本如下:

[root@master ~]# cat install.sh 
#!/bin/bash
wget http://xrl.us/cpanm --no-check-certificate
mv cpanm /usr/bin
chmod 755 /usr/bin/cpanm
cat > /root/list << EOF
install DBD::mysql
EOF
for package in `cat /root/list`
do
    cpanm $package
done
[root@master ~]bash install.sh
[root@master ~]scp install.sh root@10.0.0.50(60,70):~/ #传到其他节点 并执行安装脚本

安装报如下错误:

Can't locate ExtUtils/Manifest.pm in @INC (@INC contains: FatPacked::33970024=HASH(0x2065768) /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /usr/bin/cpanm line 132.

解决方法:
安装下面的包
yum install perl-ExtUtils-MakeMaker -y

如果有安装epel源,也可以使用yum安装

yum install perl-DBD-MySQL -y

(2)在所有的节点安装mha node:

wget http://mysql-master-ha.googlecode.com/files/mha4mysql-node-0.56.tar.gz
tar xf mha4mysql-node-0.56.tar.gz
cd mha4mysql-node-0.56
perl Makefile.PL
make && make install

安装完成后会在/usr/local/bin目录下生成以下脚本文件:

[root@master bin]# pwd
/usr/local/bin
[root@master bin]# ll
total 40
-r-xr-xr-x 1 root root 15498 Apr 20 10:05 apply_diff_relay_logs
-r-xr-xr-x 1 root root  4807 Apr 20 10:05 filter_mysqlbinlog
-r-xr-xr-x 1 root root  7401 Apr 20 10:05 purge_relay_logs
-r-xr-xr-x 1 root root  7263 Apr 20 10:05 save_binary_logs
[root@master bin]# 

关于上面脚本的功能,上面已经介绍过了,这里不再重复了。

2.安装MHA Manager

MHA Manager中主要包括了几个管理员的命令行工具,例如master_manger,master_master_switch等。MHA Manger也依赖于perl模块,具体如下:

(1)安装MHA Node软件包之前需要安装依赖。我这里使用yum完成,没有epel源的可以使用上面提到的脚本(epel源安装也简单)。注意:在MHA Manager的主机也是需要安装MHA Node。

rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
yum install perl-DBD-MySQL -y

安装MHA Node软件包,和上面的方法一样,如下:

wget http://mysql-master-ha.googlecode.com/files/mha4mysql-node-0.56.tar.gz
tar xf mha4mysql-node-0.56.tar.gz
cd mha4mysql-node-0.56
perl Makefile.PL

小坑:报错,删除编译缓存文件。安装缺少的Curses-devel 包


CMake Error at cmake/readline.cmake:85 (MESSAGE):
Curses library not found. Please install appropriate package,


remove CMakeCache.txt and rerun cmake.On Debian/Ubuntu, package name is libncurses5-dev, on Redhat and derivates it is ncurses-devel.
Call Stack (most recent call first):
cmake/readline.cmake:128 (FIND_CURSES)
cmake/readline.cmake:202 (MYSQL_USE_BUNDLED_EDITLINE)
CMakeLists.txt:410 (MYSQL_CHECK_EDITLINE)



-- Configuring incomplete, errors occurred!
See also "/root/mysql-5.6.16/CMakeFiles/CMakeOutput.log".
See also "/root/mysql-5.6.16/CMakeFiles/CMakeError.log".


[root@cadicate-master mysql-5.6.16] yum install curses-devel -y
[root@cadicate-master mysql-5.6.16]# rm CMakeCache.txt

#报如下错误(一个老外的回答)
Can't locate CPAN.pm in @INC (@INC contains: inc /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .)
at inc/Module/AutoInstall.pm line 277.

Try to install CPAN using

yum -y install perl-CPAN
and then use cpan command to install the required modules
cpan XML::Writer
#有特么报错(这perl真是xxxxx)
Warning (usually harmless): 'YAML' not installed, will not store persistent state

perl -MCPAN -e 'install "YAML"'
make && make install

(2)安装MHA Manager。首先安装MHA Manger依赖的perl模块(我这里使用yum安装):

yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes -y

安装MHA Manager软件包:

wget http://mysql-master-ha.googlecode.com/files/mha4mysql-manager-0.56.tar.gz
tar xf mha4mysql-manager-0.56.tar.gz 
cd mha4mysql-manager-0.56
perl Makefile.PL
make && make install

安装完成后会在/usr/local/bin目录下面生成以下脚本文件,前面已经说过这些脚本的作用,这里不再重复

[root@manager bin]# pwd
/usr/local/bin
[root@manager bin]# ll
total 76
-r-xr-xr-x 1 root root 15498 Apr 20 10:58 apply_diff_relay_logs
-r-xr-xr-x 1 root root  4807 Apr 20 10:58 filter_mysqlbinlog
-r-xr-xr-x 1 root root  1995 Apr 20 11:33 masterha_check_repl
-r-xr-xr-x 1 root root  1779 Apr 20 11:33 masterha_check_ssh
-r-xr-xr-x 1 root root  1865 Apr 20 11:33 masterha_check_status
-r-xr-xr-x 1 root root  3201 Apr 20 11:33 masterha_conf_host
-r-xr-xr-x 1 root root  2517 Apr 20 11:33 masterha_manager
-r-xr-xr-x 1 root root  2165 Apr 20 11:33 masterha_master_monitor
-r-xr-xr-x 1 root root  2373 Apr 20 11:33 masterha_master_switch
-r-xr-xr-x 1 root root  3749 Apr 20 11:33 masterha_secondary_check
-r-xr-xr-x 1 root root  1739 Apr 20 11:33 masterha_stop
-r-xr-xr-x 1 root root  7401 Apr 20 10:58 purge_relay_logs
-r-xr-xr-x 1 root root  7263 Apr 20 10:58 save_binary_logs
[root@manager bin]# 

复制相关脚本到/usr/local/bin目录(软件包解压缩后就有了,不是必须,因为这些脚本不完整,需要自己修改,这是软件开发者留给我们自己发挥的,如果开启下面的任何一个脚本对应的参数,而对应这里的脚本又没有修改,则会抛错,自己被坑的很惨)

[root@manager scripts]# pwd
/root/mha4mysql-manager-0.56/samples/scripts
[root@manager scripts]# ll
total 32
-rwxr-xr-x 1 root root  3443 Jan  8  2012 master_ip_failover                #自动切换时vip管理的脚本,不是必须,如果我们使用keepalived的,我们可以自己编写脚本完成对vip的管理,比如监控mysql,如果mysql异常,我们停止keepalived就行,这样vip就会自动漂移
-rwxr-xr-x 1 root root  9186 Jan  8  2012 master_ip_online_change           #在线切换时vip的管理,不是必须,同样可以可以自行编写简单的shell完成
-rwxr-xr-x 1 root root 11867 Jan  8  2012 power_manager                     #故障发生后关闭主机的脚本,不是必须
-rwxr-xr-x 1 root root  1360 Jan  8  2012 send_report                       #因故障切换后发送报警的脚本,不是必须,可自行编写简单的shell完成。
[root@manager scripts]# cp * /usr/local/bin/
[root@manager scripts]#

3.配置SSH登录无密码验证(使用key登录,工作中常用)我的测试环境已经是使用key登录,服务器之间无需密码验证的。关于配置使用key登录,我想我不再重复。但是有一点需要注意:不能禁止 password 登陆,否则会出现错误

4.搭建主从复制环境

注意:binlog-do-db 和 replicate-ignore-db 设置必须相同。 MHA 在启动时候会检测过滤规则,如果过滤规则不同,MHA 不启动监控和故障转移。

(1)在master上执行备份(10.0.0.50)

[root@master ~]# mysqldump -uroot -p --master-data=2 --single-transaction -R --triggers -A > all.sql

其中--master-data=2代表备份时刻记录master的Binlog位置和Position,--single-transaction意思是获取一致性快照,-R意思是备份存储过程和函数,--triggres的意思是备份触发器,-A代表备份所有的库。更多信息请自行mysqldump --help查看。

(2)在master上创建复制用户:

mysql> grant replication slave on *.* to 'repl'@'10.0.0.%' identified by '123456';
Query OK, 0 rows affected (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

mysql> 

 (3)查看主库备份时的binlog名称和位置,MASTER_LOG_FILE和MASTER_LOG_POS:

#这个DBA哥们的获取文件和位置方法,有点问题。
[root@master ~]# head -n 30 all.sql |grep 'CHANGE MASTER TO' -- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=106;
#我们还是来点传统的获取方式,在master主机查看master状态信息。

mysql> show master status;
+------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000005 | 795 | | | |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

(4)把备份复制到candicateMaster和slave,也就是10.0.0.60和10.0.0.70

scp all.sql 10.0.0.60:/data/
scp all.sql 10.0.0.70:/data/

(5)导入备份到candicateMaster,执行复制相关命令

mysql -uroot -p < /data/all.sql
mysql> CHANGE MASTER TO MASTER_HOST='10.0.0.50',MASTER_USER='repl', MASTER_PASSWORD='123',MASTER_LOG_FILE='mysql-bin.000001',MASTER_LOG_POS=106;
Query OK, 0 rows affected (0.02 sec)

mysql> start slave;
Query OK, 0 rows affected (0.01 sec)

 查看复制状态(可以看见复制成功):

[root@cadicate-master ~]# mysql -uroot -p123 -e "show slave status\G" | grep -E "Slave_SQL|Slave_IO"
               Slave_IO_State: Waiting for master to send event
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it

(6)在slave(10.0.0.70)上搭建复制环境,操作和上面一样。

mysql -uroot -p < /data/all.sql
mysql> change master to   
    -> master_host='10.0.0.50',
    -> master_user='repl',
    -> master_password='123',
    -> master_log_file='mysql-bin.000005',
    -> master_log_pos=795;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

mysql> start slave;
Query OK, 0 rows affected (0.01 sec)

# 查看线程状态

[root@slave ~]# mysql -uroot -p123 -e "show slave status\G" |grep -E "Slave_SQL|Slave_IO"
               Slave_IO_State: Waiting for master to send event
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it

(7)两台slave服务器设置read_only(从库对外提供读服务,只所以没有写进配置文件,是因为随时slave会提升为master

[root@cadicate-master ~]# mysql -uroot -p123 -e 'set global read_only=1'
[root@slave ~]# mysql -uroot -p123 -e 'set global read_only=1'

(8)创建监控用户(在master上执行,也就是10.0.0.50):

mysql> grant all privileges on *.* to 'root'@'10.0.0.%' identified  by '123';
Query OK, 0 rows affected (0.00 sec)

mysql> flush  privileges;
Query OK, 0 rows affected (0.01 sec)

到这里整个集群环境已经搭建完毕,剩下的就是配置MHA软件了。

5.配置MHA

(1)创建MHA的工作目录,并且创建相关配置文件(在软件包解压后的目录里面有样例配置文件)

[root@manager ~]# mkdir -p /etc/masterha
[root@manager ~]# cp mha4mysql-manager-0.56/samples/conf/app1.cnf /etc/masterha/

修改app1.cnf配置文件,修改后的文件内容如下(注意,配置文件中的注释需要去掉,我这里是为了解释清楚):

[root@10.0.0.20 ~]# cat /etc/masterha/app1.cnf 
[server default]
manager_workdir=/var/log/masterha/app1.log              //设置manager的工作目录
manager_log=/var/log/masterha/app1/manager.log          //设置manager的日志
master_binlog_dir=/data/mysql                         //设置master 保存binlog的位置,以便MHA可以找到master的日志,我这里的也就是mysql的数据目录
master_ip_failover_script= /usr/local/bin/master_ip_failover    //设置自动failover时候的切换脚本
master_ip_online_change_script= /usr/local/bin/master_ip_online_change  //设置手动切换时候的切换脚本
password=123456         //设置mysql中root用户的密码,这个密码是前文中创建监控用户的那个密码
user=root               设置监控用户root
ping_interval=1         //设置监控主库,发送ping包的时间间隔,默认是3秒,尝试三次没有回应的时候自动进行railover
remote_workdir=/tmp     //设置远端mysql在发生切换时binlog的保存位置
repl_password=123456    //设置复制用户的密码
repl_user=repl          //设置复制环境中的复制用户名
report_script=/usr/local/send_report    //设置发生切换后发送的报警的脚本
secondary_check_script= /usr/local/bin/masterha_secondary_check -s server03 -s server02            
shutdown_script=""      //设置故障发生后关闭故障主机脚本(该脚本的主要作用是关闭主机放在发生脑裂,这里没有使用)
ssh_user=root           //设置ssh的登录用户名

[server1]
hostname=10.0.0.50
port=3306

[server2]
hostname=10.0.0.60
port=3306
candidate_master=1   //设置为候选master,如果设置该参数以后,发生主从切换以后将会将此从库提升为主库,即使这个主库不是集群中事件最新的slave
check_repl_delay=0   //默认情况下如果一个slave落后master 100M的relay logs的话,MHA将不会选择该slave作为一个新的master,因为对于这个slave的恢复需要花费很长时间,通过设置check_repl_delay=0,MHA触发切换在选择一个新的master的时候将会忽略复制延时,这个参数对于设置了candidate_master=1的主机非常有用,因为这个候选主在切换的过程中一定是新的master

[server3]
hostname=10.0.0.70
port=3306

(2)设置relay log的清除方式(在每个slave节点上):

[root@cadicate-master ~]# mysql -e 'set global relay_log_purge=0'
[root@slave ~]#       mysql -e 'set global relay_log_purge=0'

注意:

MHA在发生切换的过程中,从库的恢复过程中依赖于relay log的相关信息,所以这里要将relay log的自动清除设置为OFF,采用手动清除relay log的方式。在默认情况下,从服务器上的中继日志会在SQL线程执行完毕后被自动删除。但是在MHA环境中,这些中继日志在恢复其他从服务器时可能会被用到,因此需要禁用中继日志的自动删除功能。定期清除中继日志需要考虑到复制延时的问题。在ext3的文件系统下,删除大的文件需要一定的时间,会导致严重的复制延时。为了避免复制延时,需要暂时为中继日志创建硬链接,因为在linux系统中通过硬链接删除大文件速度会很快。(在mysql数据库中,删除大表时,通常也采用建立硬链接的方式)

MHA节点中包含了pure_relay_logs命令工具,它可以为中继日志创建硬链接,执行SET GLOBAL relay_log_purge=1,等待几秒钟以便SQL线程切换到新的中继日志,再执行SET GLOBAL relay_log_purge=0。

pure_relay_logs脚本参数如下所示:

--user mysql                      用户名
--password mysql                  密码
--port                            端口号
--workdir                         指定创建relay log的硬链接的位置,默认是/var/tmp,由于系统不同分区创建硬链接文件会失败,故需要执行硬链接具体位置,成功执行脚本后,硬链接的中继日志文件被删除
--disable_relay_log_purge         默认情况下,如果relay_log_purge=1,脚本会什么都不清理,自动退出,通过设定这个参数,当relay_log_purge=1的情况下会将relay_log_purge设置为0。清理relay log之后,最后将参数设置为OFF。

(3)设置定期清理relay脚本(两台slave服务器)

[root@cadicate-master ~]# cat purge_relay_log.sh 
#!/bin/bash
user=root
passwd=123
port=3306
log_dir='/data/masterha/log'
work_dir='/data'
purge='/usr/local/bin/purge_relay_logs'

if [ ! -d $log_dir ]
then
   mkdir $log_dir -p
fi

$purge --user=$user --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1

添加到crontab定期执行

[root@cadicate-master ~]# crontab -l
0 4 * * * /bin/bash /root/purge_relay_log.sh

purge_relay_logs脚本删除中继日志不会阻塞SQL线程。下面我们手动执行看看什么情况。

[root@cadicate-master ~]# purge_relay_logs --user=root --password=123 --port=3306 -disable_relay_log_purge --workdir=/data/

2018-03-09 15:45:59: purge_relay_logs script started.
Found relay_log.info: /mysql/data/relay-log.info
Opening /mysql/data/cadicate-master-relay-bin.000001 ..
Opening /mysql/data/cadicate-master-relay-bin.000002 ..
Opening /mysql/data/cadicate-master-relay-bin.000003 ..
Executing SET GLOBAL relay_log_purge=1; FLUSH LOGS; sleeping a few seconds so that SQL thread can delete older relay log files (if it keeps up); SET GLOBAL relay_log_purge=0; .. ok.
2018-03-09 15:46:02: All relay log purging operations succeeded.

6.检查SSH配置

检查MHA Manger到所有MHA Node的SSH连接状态:

[root@manager .ssh]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
Fri Mar 9 16:29:14 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Fri Mar 9 16:29:14 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Fri Mar 9 16:29:14 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Fri Mar 9 16:29:14 2018 - [info] Starting SSH connection tests..
Fri Mar 9 16:29:14 2018 - [debug]
Fri Mar 9 16:29:14 2018 - [debug] Connecting via SSH from root@10.0.0.50(10.0.0.50:22) to root@10.0.0.60(10.0.0.60:22)..
Fri Mar 9 16:29:14 2018 - [debug] ok.
Fri Mar 9 16:29:14 2018 - [debug] Connecting via SSH from root@10.0.0.50(10.0.0.50:22) to root@10.0.0.70(10.0.0.70:22)..
Fri Mar 9 16:29:14 2018 - [debug] ok.
Fri Mar 9 16:29:15 2018 - [debug]
Fri Mar 9 16:29:14 2018 - [debug] Connecting via SSH from root@10.0.0.60(10.0.0.60:22) to root@10.0.0.50(10.0.0.50:22)..
Fri Mar 9 16:29:15 2018 - [debug] ok.
Fri Mar 9 16:29:15 2018 - [debug] Connecting via SSH from root@10.0.0.60(10.0.0.60:22) to root@10.0.0.70(10.0.0.70:22)..
Fri Mar 9 16:29:15 2018 - [debug] ok.
Fri Mar 9 16:29:15 2018 - [debug]
Fri Mar 9 16:29:15 2018 - [debug] Connecting via SSH from root@10.0.0.70(10.0.0.70:22) to root@10.0.0.50(10.0.0.50:22)..
Fri Mar 9 16:29:15 2018 - [debug] ok.
Fri Mar 9 16:29:15 2018 - [debug] Connecting via SSH from root@10.0.0.70(10.0.0.70:22) to root@10.0.0.60(10.0.0.60:22)..
Fri Mar 9 16:29:15 2018 - [debug] ok.
Fri Mar 9 16:29:15 2018 - [info] All SSH connection tests passed successfully.

# 全部为ok 有error证明密钥配对存在问题。

 

可以看见各个节点ssh验证都是ok的。

7.检查整个复制环境状况。

通过masterha_check_repl脚本查看整个集群的状态

#这个折腾好久,这位dba大哥的binlog日志路径是/data/mysql  我的/mysql/data 好蛋疼,一顿瞎J8百度,完全无卵用,正确姿势就是耐心分析日志。

[root@manager ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Fri Mar 9 17:21:22 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Fri Mar 9 17:21:22 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Fri Mar 9 17:21:22 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Fri Mar 9 17:21:22 2018 - [info] MHA::MasterMonitor version 0.56.
Fri Mar 9 17:21:22 2018 - [info] GTID failover mode = 0
Fri Mar 9 17:21:22 2018 - [info] Dead Servers:
Fri Mar 9 17:21:22 2018 - [info] Alive Servers:
Fri Mar 9 17:21:22 2018 - [info] 10.0.0.50(10.0.0.50:3306)
Fri Mar 9 17:21:22 2018 - [info] 10.0.0.60(10.0.0.60:3306)
Fri Mar 9 17:21:22 2018 - [info] 10.0.0.70(10.0.0.70:3306)
Fri Mar 9 17:21:22 2018 - [info] Alive Slaves:
Fri Mar 9 17:21:22 2018 - [info] 10.0.0.60(10.0.0.60:3306) Version=5.6.16-log (oldest major version between slaves) log-bin:enabled
Fri Mar 9 17:21:22 2018 - [info] Replicating from 10.0.0.50(10.0.0.50:3306)
Fri Mar 9 17:21:22 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Mar 9 17:21:22 2018 - [info] 10.0.0.70(10.0.0.70:3306) Version=5.6.16 (oldest major version between slaves) log-bin:disabled
Fri Mar 9 17:21:22 2018 - [info] Replicating from 10.0.0.50(10.0.0.50:3306)
Fri Mar 9 17:21:22 2018 - [info] Current Alive Master: 10.0.0.50(10.0.0.50:3306)
Fri Mar 9 17:21:22 2018 - [info] Checking slave configurations..
Fri Mar 9 17:21:22 2018 - [warning] log-bin is not set on slave 10.0.0.70(10.0.0.70:3306). This host cannot be a master.
Fri Mar 9 17:21:22 2018 - [info] Checking replication filtering settings..
Fri Mar 9 17:21:22 2018 - [info] binlog_do_db= , binlog_ignore_db=
Fri Mar 9 17:21:22 2018 - [info] Replication filtering check ok.
Fri Mar 9 17:21:22 2018 - [info] GTID (with auto-pos) is not supported
Fri Mar 9 17:21:22 2018 - [info] Starting SSH connection tests..
Fri Mar 9 17:21:23 2018 - [info] All SSH connection tests passed successfully.
Fri Mar 9 17:21:23 2018 - [info] Checking MHA Node version..
Fri Mar 9 17:21:24 2018 - [info] Version check ok.
Fri Mar 9 17:21:24 2018 - [info] Checking SSH publickey authentication settings on the current master..
Fri Mar 9 17:21:24 2018 - [info] HealthCheck: SSH to 10.0.0.50 is reachable.
Fri Mar 9 17:21:24 2018 - [info] Master MHA Node version is 0.56.
Fri Mar 9 17:21:24 2018 - [info] Checking recovery script configurations on 10.0.0.50(10.0.0.50:3306)..
Fri Mar 9 17:21:24 2018 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/mysql/data --output_file=/tmp/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000005
Fri Mar 9 17:21:24 2018 - [info] Connecting to root@10.0.0.50(10.0.0.50:22)..
Creating /tmp if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /mysql/data, up to mysql-bin.000005
Fri Mar 9 17:21:24 2018 - [info] Binlog setting check done.
Fri Mar 9 17:21:24 2018 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Fri Mar 9 17:21:24 2018 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=10.0.0.60 --slave_ip=10.0.0.60 --slave_port=3306 --workdir=/tmp --target_version=5.6.16-log --manager_version=0.56 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ --slave_pass=xxx
Fri Mar 9 17:21:24 2018 - [info] Connecting to root@10.0.0.60(10.0.0.60:22)..
Checking slave recovery environment settings..
Opening /mysql/data/relay-log.info ... ok.
Relay log found at /mysql/data, up to cadicate-master-relay-bin.000004
Temporary relay log file is /mysql/data/cadicate-master-relay-bin.000004
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Fri Mar 9 17:21:25 2018 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=10.0.0.70 --slave_ip=10.0.0.70 --slave_port=3306 --workdir=/tmp --target_version=5.6.16 --manager_version=0.56 --relay_log_info=/mysql/data/relay-log.info --relay_dir=/mysql/data/ --slave_pass=xxx
Fri Mar 9 17:21:25 2018 - [info] Connecting to root@10.0.0.70(10.0.0.70:22)..
Checking slave recovery environment settings..
Opening /mysql/data/relay-log.info ... ok.
Relay log found at /mysql/data, up to slave-relay-bin.000002
Temporary relay log file is /mysql/data/slave-relay-bin.000002
Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Fri Mar 9 17:21:25 2018 - [info] Slaves settings check done.
Fri Mar 9 17:21:25 2018 - [info]
10.0.0.50(10.0.0.50:3306) (current master)
+--10.0.0.60(10.0.0.60:3306)
+--10.0.0.70(10.0.0.70:3306)

Fri Mar 9 17:21:25 2018 - [info] Checking replication health on 10.0.0.60..
Fri Mar 9 17:21:25 2018 - [info] ok.
Fri Mar 9 17:21:25 2018 - [info] Checking replication health on 10.0.0.70..
Fri Mar 9 17:21:25 2018 - [info] ok.
Fri Mar 9 17:21:25 2018 - [warning] master_ip_failover_script is not defined.
Fri Mar 9 17:21:25 2018 - [warning] shutdown_script is not defined.
Fri Mar 9 17:21:25 2018 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.   (这边显示ok才是真是ok不然脚本不能做出正确判断,它会乱来的你懂的)

如果发现如下错误:

Can't exec "mysqlbinlog": No such file or directory at /usr/local/share/perl5/MHA/BinlogManager.pm line 99.
mysqlbinlog version not found!
Testing mysql connection and privileges..sh: mysql: command not found

 

解决方法如下,添加软连接(所有节点)

ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql

 

所以先暂时注释master_ip_failover_script= /usr/local/bin/master_ip_failover这个选项。后面引入keepalived后和修改该脚本以后再开启该选项。

[root@manager ~]# grep master_ip_failover /etc/masterha/app1.cnf
#master_ip_failover_script= /usr/local/bin/master_ip_failover

8.检查MHA Manager的状态:

通过master_check_status脚本查看Manager的状态:

[root@manager ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).

注意:如果正常,会显示"PING_OK",否则会显示"NOT_RUNNING",这代表MHA监控没有开启。
9.开启MHA Manager监控

[root@manager ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1.log/manager.log 2>&1 & 
[1] 2201

启动参数介绍:

--remove_dead_master_conf      该参数代表当发生主从切换后,老的主库的ip将会从配置文件中移除。

--manger_log                            日志存放位置

--ignore_last_failover                 在缺省情况下,如果MHA检测到连续发生宕机,且两次宕机间隔不足8小时的话,则不会进行Failover,之所以这样限制是为了避免ping-pong效应。该参数代表忽略上次MHA触发切换产生的文件,默认情况下,MHA发生切换后会在日志目录,也就是上面我设置的/data产生app1.failover.complete文件,下次再次切换的时候如果发现该目录下存在该文件将不允许触发切换,除非在第一次切换后收到删除该文件,为了方便,这里设置为--ignore_last_failover。

# 查看MHA Manager监控是否正常:

[root@manager ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:2211) is running(0:PING_OK), master:10.0.0.50
[root@manager ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:2211) is running(0:PING_OK), master:10.0.0.50
[root@manager ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:2211) is running(0:PING_OK), master:10.0.0.50

可以看见已经在监控了,而且master的主机为10.0.0.50

10.查看启动日志

[root@manager ~]# tail -n20 /var/log/masterha/app1/manager.log
  Checking slave recovery environment settings..
    Opening /mysql/data/relay-log.info ... ok.
    Relay log found at /mysql/data, up to slave-relay-bin.000002
    Temporary relay log file is /mysql/data/slave-relay-bin.000002
    Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
 done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Fri Mar  9 17:34:40 2018 - [info] Slaves settings check done.
Fri Mar  9 17:34:40 2018 - [info] 
10.0.0.50(10.0.0.50:3306) (current master)
 +--10.0.0.60(10.0.0.60:3306)
 +--10.0.0.70(10.0.0.70:3306)

Fri Mar  9 17:34:40 2018 - [warning] master_ip_failover_script is not defined.
Fri Mar  9 17:34:40 2018 - [warning] shutdown_script is not defined.
Fri Mar  9 17:34:40 2018 - [info] Set master ping interval 1 seconds.
Fri Mar  9 17:34:40 2018 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s server03 -s server02
Fri Mar  9 17:34:40 2018 - [info] Starting ping health check on 10.0.0.50(10.0.0.50:3306)..
Fri Mar  9 17:34:40 2018 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..

其中"Ping(SELECT) succeeded, waiting until MySQL doesn't respond.."说明整个系统已经开始监控了。

11.关闭MHA Manage监控

关闭很简单,使用masterha_stop命令完成。

[root@manager ~]# masterha_stop --conf=/etc/masterha/app1.cnf
Stopped app1 successfully.
[1]+  Exit 1                  nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1
[root@manager ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).

12.配置VIP
vip配置可以采用两种方式,一种通过keepalived的方式管理虚拟ip的浮动;另外一种通过脚本方式启动虚拟ip的方式(即不需要keepalived或者heartbeat类似的软件)。

1.keepalived方式管理虚拟ip,keepalived配置方法如下:

(1)下载软件进行并进行安装(两台master,准确的说一台是master,另外一台是备选master,在没有切换以前是slave):

[root@master ~]# wget http://www.keepalived.org/software/keepalived-1.2.12.tar.gz

tar xf keepalived-1.2.12.tar.gz           
cd keepalived-1.2.12
./configure --prefix=/usr/local/keepalived
make &&  make install
cp /usr/local/keepalived/etc/rc.d/init.d/keepalived /etc/init.d/
cp /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/
mkdir /etc/keepalived
cp /usr/local/keepalived/etc/keepalived/keepalived.conf /etc/keepalived/
cp /usr/local/keepalived/sbin/keepalived /usr/sbin/
 

(2)配置keepalived的配置文件,在master上配置(10.0.0.50)

[root@master ~]# cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {
     notification_email {
     saltstack@163.com
   }
   notification_email_from dba@dbserver.com
   smtp_server 127.0.0.1
   smtp_connect_timeout 30
   router_id MySQL-HA
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 150
    advert_int 1
    nopreempt

    authentication {
    auth_type PASS
    auth_pass 1111
    }

    virtual_ipaddress {
        10.0.10.88
    }
}

其中router_id MySQL HA表示设定keepalived组的名称,将192.168.0.88这个虚拟ip绑定到该主机的eth1网卡上,并且设置了状态为backup模式,将keepalived的模式设置为非抢占模式(nopreempt),priority 150表示设置的优先级为150。下面的配置略有不同,但是都是一个意思。
在候选master上配置(10.0.0.60)

[root@cadicate-master ~]# cat /etc/keepalived/keepalived.conf 
! Configuration File for keepalived

global_defs {
     notification_email {
     saltstack@163.com
   }
   notification_email_from dba@dbserver.com
   smtp_server 127.0.0.1
   smtp_connect_timeout 30
   router_id MySQL-HA
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 120
    advert_int 1
    nopreempt

    authentication {
    auth_type PASS
    auth_pass 1111
    }

    virtual_ipaddress {
        10.0.0.88
    }
}

(3)启动keepalived服务,在master上启动并查看日志

[root@master ~]# /etc/init.d/keepalived start
Starting keepalived:                                       [  OK  ]
[root@master ~]# tail -f /var/log/messages 
Mar 10 04:27:52 master kernel: IPVS: Connection hash table configured (size=4096, memory=64Kbytes)
Mar 10 04:27:52 master kernel: IPVS: ipvs loaded.
Mar 10 04:27:52 master Keepalived_healthcheckers[12025]: Opening file '/etc/keepalived/keepalived.conf'.
Mar 10 04:27:52 master Keepalived_healthcheckers[12025]: Configuration is using : 7241 Bytes
Mar 10 04:27:52 master Keepalived_healthcheckers[12025]: Using LinkWatch kernel netlink reflector...
Mar 10 04:27:55 master Keepalived_vrrp[12026]: VRRP_Instance(VI_1) Transition to MASTER STATE
Mar 10 04:27:56 master Keepalived_vrrp[12026]: VRRP_Instance(VI_1) Entering MASTER STATE
Mar 10 04:27:56 master Keepalived_vrrp[12026]: VRRP_Instance(VI_1) setting protocol VIPs.
Mar 10 04:27:56 master Keepalived_vrrp[12026]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth1 for 10.0.0.88
Mar 10 04:27:56 master Keepalived_healthcheckers[12025]: Netlink reflector reports IP 10.0.0.88 added
Mar 10 04:28:01 master Keepalived_vrrp[12026]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth1 for 10.0.0.88

发现已经将虚拟ip 10.0.0.88绑定了网卡eth0上。

(4)查看绑定情况

[root@master ~]# ip addr |grep eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
inet 10.0.0.50/8 brd 10.255.255.255 scope global eth0
inet 10.0.0.88/32 scope global eth0

在另外一台服务器,候选master上启动keepalived服务,并观察

[root@cadicate-master ~]# /etc/init.d/keepalived start ; tail -f /var/log/messages

Starting keepalived: [ OK ]
Mar 11 10:21:24 cadicate-master ntpd[1159]: 0.0.0.0 c012 02 freq_set kernel -26.844 PPM
Mar 11 10:21:31 cadicate-master ntpd[1159]: 0.0.0.0 c615 05 clock_sync
Mar 11 10:24:56 cadicate-master Keepalived[1457]: Starting Keepalived v1.2.12 (03/09,2018)
Mar 11 10:24:56 cadicate-master Keepalived[1459]: Starting Healthcheck child process, pid=1461
Mar 11 10:24:56 cadicate-master Keepalived[1459]: Starting VRRP child process, pid=1463
Mar 11 10:24:56 cadicate-master Keepalived_vrrp[1463]: Netlink reflector reports IP 10.0.0.60 added
Mar 11 10:24:56 cadicate-master Keepalived_vrrp[1463]: Netlink reflector reports IP fe80::20c:29ff:fe39:2d82 added
Mar 11 10:24:56 cadicate-master Keepalived_vrrp[1463]: Registering Kernel netlink reflector
Mar 11 10:24:56 cadicate-master Keepalived_vrrp[1463]: Registering Kernel netlink command channel
Mar 11 10:24:56 cadicate-master Keepalived_vrrp[1463]: Registering gratuitous ARP shared channel
Mar 11 10:24:56 cadicate-master Keepalived_healthcheckers[1461]: Netlink reflector reports IP 10.0.0.60 added
Mar 11 10:24:56 cadicate-master Keepalived_healthcheckers[1461]: Netlink reflector reports IP fe80::20c:29ff:fe39:2d82 added
Mar 11 10:24:56 cadicate-master Keepalived_healthcheckers[1461]: Registering Kernel netlink reflector
Mar 11 10:24:56 cadicate-master Keepalived_healthcheckers[1461]: Registering Kernel netlink command channel
Mar 11 10:24:56 cadicate-master kernel: IPVS: Registered protocols (TCP, UDP, SCTP, AH, ESP)
Mar 11 10:24:56 cadicate-master kernel: IPVS: Connection hash table configured (size=4096, memory=64Kbytes)
Mar 11 10:24:56 cadicate-master kernel: IPVS: ipvs loaded.
Mar 11 10:25:16 cadicate-master Keepalived_vrrp[1463]: Opening file '/etc/keepalived/keepalived.conf'.
Mar 11 10:25:16 cadicate-master Keepalived_vrrp[1463]: Configuration is using : 62858 Bytes
Mar 11 10:25:16 cadicate-master Keepalived_vrrp[1463]: Using LinkWatch kernel netlink reflector...
Mar 11 10:25:16 cadicate-master Keepalived_vrrp[1463]: VRRP_Instance(VI_1) Entering BACKUP STATE
Mar 11 10:25:16 cadicate-master Keepalived_vrrp[1463]: VRRP sockpool: [ifindex(2), proto(112), unicast(0), fd(10,11)]
Mar 11 10:25:16 cadicate-master Keepalived_healthcheckers[1461]: Opening file '/etc/keepalived/keepalived.conf'.
Mar 11 10:25:16 cadicate-master Keepalived_healthcheckers[1461]: Configuration is using : 7113 Bytes
Mar 11 10:25:16 cadicate-master Keepalived_healthcheckers[1461]: Using LinkWatch kernel netlink reflector...

从上面的信息可以看到keepalived已经配置成功。
注意:

上面两台服务器的keepalived都设置为了BACKUP模式,在keepalived中2种模式,分别是master->backup模式和backup->backup模式。这两种模式有很大区别。在master->backup模式下,一旦主库宕机,虚拟ip会自动漂移到从库,当主库修复后,keepalived启动后,还会把虚拟ip抢占过来,即使设置了非抢占模式(nopreempt)抢占ip的动作也会发生。在backup->backup模式下,当主库宕机后虚拟ip会自动漂移到从库上,当原主库恢复和keepalived服务启动后,并不会抢占新主的虚拟ip,即使是优先级高于从库的优先级别,也不会发生抢占。为了减少ip漂移次数,通常是把修复好的主库当做新的备库。

(5)MHA引入keepalived(MySQL服务进程挂掉时通过MHA 停止keepalived):

要想把keepalived服务引入MHA,我们只需要修改切换是触发的脚本文件master_ip_failover即可,在该脚本中添加在master发生宕机时对keepalived的处理。

编辑脚本/usr/local/bin/master_ip_failover,修改后如下,我对perl不熟悉,所以我这里完整贴出该脚本(10.0.0.20)。#那位DBA哥们说是在master上修改,我怎么感觉不像不是那么回事。

在MHA Manager修改脚本修改后的内容如下(参考资料比较少):

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);

my $vip = '10.0.0.88';
my $ssh_start_vip = "/etc/init.d/keepalived start";
my $ssh_stop_vip = "/etc/init.d/keepalived stop";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {

    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {

        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {

        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        #`ssh $ssh_user\@cluster1 \" $ssh_start_vip \"`;
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

# A simple system call that enable the VIP on the new master
sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
     return 0  unless  ($ssh_user);
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
    print
    "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

现在已经修改这个脚本了,我们现在打开在上面提到过的参数,再检查集群状态,看是否会报错。

[root@manager ~]# grep 'master_ip_failover_script' /etc/masterha/app1.cnf
master_ip_failover_script= /usr/local/bin/master_ip_failover 
[root@manager ~]# grep 'master_ip_failover_script' /etc/masterha/app1.cnf

at Mar
10 05:22:36 2018 - [info] Slaves settings check done. Sat Mar 10 05:22:36 2018 - [info] 10.0.0.50(10.0.0.50:3306) (current master) +--10.0.0.60(10.0.0.60:3306) +--10.0.0.70(10.0.0.70:3306) Sat Mar 10 05:22:36 2018 - [info] Checking replication health on 10.0.0.60.. Sat Mar 10 05:22:36 2018 - [info] ok. Sat Mar 10 05:22:36 2018 - [info] Checking replication health on 10.0.0.70.. Sat Mar 10 05:22:36 2018 - [info] ok. Sat Mar 10 05:22:36 2018 - [info] Checking master_ip_failover_script status: Sat Mar 10 05:22:36 2018 - [info] /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=10.0.0.50 --orig_master_ip=10.0.0.50 --orig_master_port=3306 IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start=== Checking the Status of the script.. OK Sat Mar 10 05:22:36 2018 - [info] OK. Sat Mar 10 05:22:36 2018 - [warning] shutdown_script is not defined. Sat Mar 10 05:22:36 2018 - [info] Got exit code 0 (Not master dead). MySQL Replication Health is OK. [root@manager ~]#

可以看见已经没有报错了。我也哈哈
 /usr/local/bin/master_ip_failover添加或者修改的内容意思是当主库数据库发生故障时,会触发MHA切换,MHA Manager会停掉主库上的keepalived服务,触发虚拟ip漂移到备选从库,从而完成切换。当然可以在keepalived里面引入脚本,这个脚本监控mysql是否正常运行,如果不正常,则调用该脚本杀掉keepalived进程。

2.通过脚本的方式管理VIP。这里是修改/usr/local/bin/master_ip_failover,也可以使用其他的语言完成,比如php语言。使用php脚本编写的failover这里就不介绍了。修改完成后内容如下,而且如果使用脚本管理vip的话,需要手动在master服务器上绑定一个vip

[root@master ~]# /sbin/ifconfig eth1:1 10.0.0.88/24
#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);

my $vip = '10.0.0.88/24';
my $key = '1';
my $ssh_start_vip = "/sbin/ifconfig eth1:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig eth1:$key down";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {

    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {

        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {

        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub stop_vip() {
     return 0  unless  ($ssh_user);
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
    print
    "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

为了防止脑裂发生,推荐生产环境采用脚本的方式来管理虚拟ip,而不是使用keepalived来完成。到此为止,基本MHA集群已经配置完毕。接下来就是实际的测试环节了。通过一些测试来看一下MHA到底是如何进行工作的。下面将从MHA自动failover,我们手动failover,在线切换三种方式来介绍MHA的工作情况。

一.自动Failover(必须先启动MHA Manager,否则无法自动切换,当然手动切换不需要开启MHA Manager监控。)

(1)使用sysbench生成测试数据(使用yum快速安装)

yum install sysbench -y

也可以编译安装,我这里使用了编译安装的方式:http://www.cnblogs.com/benjamin77/p/8537226.html

在主库(10.0.0.50)上进行sysbench数据生成,在sbtest库下生成sbtest表,共100W记录。

 #哈哈 记得进数据库提前创建一个叫sbtest的库。(这名字有个性)

[root@master sysbench]# pwd
/root/sysbench-0.4.12-1.1/sysbench

[root@master sysbench]# ./sysbench --test=tests/db/oltp.lua --oltp-table-size=1000000 --oltp-read-only=off --init-rng=on --num-threads=16 --max-requests=0 --oltp-dist-type=uniform --max-time=1800 --mysql-user=root --mysql-socket=/usr/local/mysql/mysql.sock --mysql-password=123 --db-driver=mysql --mysql-table-engine=innodb --oltp-test-mode=complex prepare
sysbench 0.5: multi-threaded system evaluation benchmark

Creating table 'sbtest1'...
Inserting 1000000 records into 'sbtest1'

(2)停掉slave sql线程,模拟主从延时。(10.0.0.60)

mysql> stop slave io_thread;
Query OK, 0 rows affected (0.08 sec)

mysql> 

另外一台slave我们没有停止io线程,所以还在继续接收日志。

(3)模拟sysbench压力测试。

在主库上(10.0.0.50)进行压力测试,持续时间为3分钟,产生大量的binlog。

[root@master sysbench]# ./sysbench --test=tests/db/oltp.lua --oltp-table-size=1000000 --oltp-read-only=off --init-rng=on --num-threads=16 
--max-requests=0 --oltp-dist-type=uniform --max-time=180 --mysql-user=root --mysql-socket=/usr/local/mysql/mysql.sock --mysql-password=123 --db-driver=mysql --mysql-table-engine=innodb --oltp-test-mode=complex run sysbench 0.5: multi-threaded system evaluation benchmark Running the test with following options: Number of threads: 16 Random number generator seed is 0 and will be ignored Threads started! OLTP test statistics: queries performed: read: 324660 write: 92760 other: 46380 total: 463800 transactions: 23190 (128.75 per sec.) deadlocks: 0 (0.00 per sec.) read/write requests: 417420 (2317.42 per sec.) other operations: 46380 (257.49 per sec.) General statistics: total time: 180.1229s total number of events: 23190 total time taken by event execution: 2879.0035s response time: min: 12.24ms avg: 124.15ms max: 2239.50ms approx. 95 percentile: 372.38ms Threads fairness: events (avg/stddev): 1449.3750/10.03 execution time (avg/stddev): 179.9377/0.01

(4)开启slave(10.0.0.60)上的IO线程,追赶落后于master的binlog。

mysql> start slave io_thread;     
Query OK, 0 rows affected (0.00 sec)

(5)杀掉主库mysql进程,模拟主库发生故障,进行自动failover操作。

[root@master sysbench]# pkill -9 mysqld

 (6)查看MHA切换日志,了解整个切换过程,在10.0.0.20上查看日志:

 

Sun Mar 11 13:55:07 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 13:55:07 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 13:55:07 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 13:55:07 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 13:55:07 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln671] Master 10.0.0.50:3306 from which slave 10.0.0.60(10.0.0.60:3306) replicates is not defined in the configuration file!
Sun Mar 11 13:55:07 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Sun Mar 11 13:55:07 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Sun Mar 11 13:55:07 2018 - [info] Got exit code 1 (Not master dead).
^C
[root@manager ~]# > /var/log/masterha/app1/manager.log        
[root@manager ~]# tail -f /var/log/masterha/app1/manager.log 
Sun Mar 11 13:55:48 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 13:55:48 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 13:55:48 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 13:55:48 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 13:55:49 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln671] Master 10.0.0.50:3306 from which slave 10.0.0.60(10.0.0.60:3306) replicates is not defined in the configuration file!
Sun Mar 11 13:55:49 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Sun Mar 11 13:55:49 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Sun Mar 11 13:55:49 2018 - [info] Got exit code 1 (Not master dead).
tail: /var/log/masterha/app1/manager.log: file truncated
Sun Mar 11 14:02:04 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 14:02:04 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:02:04 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:02:04 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 14:02:04 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln671] Master 10.0.0.50:3306 from which slave 10.0.0.60(10.0.0.60:3306) replicates is not defined in the configuration file!
Sun Mar 11 14:02:04 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Sun Mar 11 14:02:04 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Sun Mar 11 14:02:04 2018 - [info] Got exit code 1 (Not master dead).
tail: /var/log/masterha/app1/manager.log: file truncated
Sun Mar 11 14:03:50 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 14:03:50 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:03:50 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:03:50 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 14:03:50 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln671] Master 10.0.0.50:3306 from which slave 10.0.0.60(10.0.0.60:3306) replicates is not defined in the configuration file!
Sun Mar 11 14:03:50 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Sun Mar 11 14:03:50 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Sun Mar 11 14:03:50 2018 - [info] Got exit code 1 (Not master dead).
tail: /var/log/masterha/app1/manager.log: file truncated
Sun Mar 11 14:15:27 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 14:15:27 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:15:27 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:15:27 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 14:15:27 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln781] Multi-master configuration is detected, but two or more masters are either writable (read-only is not set) or dead! Check configurations for details. Master configurations are as below: 
Master 10.0.0.50(10.0.0.50:3306)
Master 10.0.0.60(10.0.0.60:3306), replicating from 10.0.0.50(10.0.0.50:3306)

Sun Mar 11 14:15:27 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Sun Mar 11 14:15:27 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Sun Mar 11 14:15:27 2018 - [info] Got exit code 1 (Not master dead).
tail: /var/log/masterha/app1/manager.log: file truncated
Sun Mar 11 14:18:58 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 14:18:58 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:18:58 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:18:58 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 14:18:58 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln781] Multi-master configuration is detected, but two or more masters are either writable (read-only is not set) or dead! Check configurations for details. Master configurations are as below: 
Master 10.0.0.50(10.0.0.50:3306)
Master 10.0.0.60(10.0.0.60:3306), replicating from 10.0.0.50(10.0.0.50:3306)

Sun Mar 11 14:18:58 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Sun Mar 11 14:18:58 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Sun Mar 11 14:18:58 2018 - [info] Got exit code 1 (Not master dead).
tail: /var/log/masterha/app1/manager.log: file truncated
Sun Mar 11 14:35:38 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 14:35:38 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:35:38 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:35:38 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 14:35:38 2018 - [info] Multi-master configuration is detected. Current primary(writable) master is 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 14:35:38 2018 - [info] Master configurations are as below: 
Master 10.0.0.50(10.0.0.50:3306)
Master 10.0.0.60(10.0.0.60:3306), replicating from 10.0.0.50(10.0.0.50:3306), read-only

Sun Mar 11 14:35:38 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln726] Slave 10.0.0.70(10.0.0.70:3306) replicates from 10.0.0.60:3306, but real master is 10.0.0.50(10.0.0.50:3306)!
Sun Mar 11 14:35:38 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Sun Mar 11 14:35:38 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Sun Mar 11 14:35:38 2018 - [info] Got exit code 1 (Not master dead).
tail: /var/log/masterha/app1/manager.log: file truncated
Sun Mar 11 14:37:49 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 14:37:49 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:37:49 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:37:49 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 14:37:49 2018 - [info] Multi-master configuration is detected. Current primary(writable) master is 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 14:37:49 2018 - [info] Master configurations are as below: 
Master 10.0.0.50(10.0.0.50:3306)
Master 10.0.0.60(10.0.0.60:3306), replicating from 10.0.0.50(10.0.0.50:3306), read-only

Sun Mar 11 14:37:49 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln726] Slave 10.0.0.70(10.0.0.70:3306) replicates from 10.0.0.60:3306, but real master is 10.0.0.50(10.0.0.50:3306)!
Sun Mar 11 14:37:49 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Sun Mar 11 14:37:49 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Sun Mar 11 14:37:49 2018 - [info] Got exit code 1 (Not master dead).
tail: /var/log/masterha/app1/manager.log: file truncated
Sun Mar 11 14:38:57 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 14:38:57 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:38:57 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:38:57 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 14:38:57 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln781] Multi-master configuration is detected, but two or more masters are either writable (read-only is not set) or dead! Check configurations for details. Master configurations are as below: 
Master 10.0.0.50(10.0.0.50:3306)
Master 10.0.0.60(10.0.0.60:3306), replicating from 10.0.0.50(10.0.0.50:3306)

Sun Mar 11 14:38:57 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Sun Mar 11 14:38:57 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Sun Mar 11 14:38:57 2018 - [info] Got exit code 1 (Not master dead).
tail: /var/log/masterha/app1/manager.log: file truncated
Sun Mar 11 14:41:47 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 14:41:47 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:41:47 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:41:47 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 14:41:47 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln781] Multi-master configuration is detected, but two or more masters are either writable (read-only is not set) or dead! Check configurations for details. Master configurations are as below: 
Master 10.0.0.50(10.0.0.50:3306)
Master 10.0.0.60(10.0.0.60:3306), replicating from 10.0.0.50(10.0.0.50:3306)

Sun Mar 11 14:41:47 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Sun Mar 11 14:41:47 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Sun Mar 11 14:41:47 2018 - [info] Got exit code 1 (Not master dead).
tail: /var/log/masterha/app1/manager.log: file truncated
Sun Mar 11 14:44:29 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 14:44:29 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:44:29 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 14:44:29 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 14:44:29 2018 - [info] Multi-master configuration is detected. Current primary(writable) master is 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 14:44:29 2018 - [info] Master configurations are as below: 
Master 10.0.0.50(10.0.0.50:3306)
Master 10.0.0.60(10.0.0.60:3306), replicating from 10.0.0.50(10.0.0.50:3306), read-only

Sun Mar 11 14:44:29 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln726] Slave 10.0.0.70(10.0.0.70:3306) replicates from 10.0.0.60:3306, but real master is 10.0.0.50(10.0.0.50:3306)!
Sun Mar 11 14:44:29 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326
Sun Mar 11 14:44:29 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Sun Mar 11 14:44:29 2018 - [info] Got exit code 1 (Not master dead).
tail: /var/log/masterha/app1/manager.log: file truncated
Sun Mar 11 15:02:01 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 15:02:01 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 15:02:01 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 15:02:01 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 15:02:01 2018 - [info] GTID failover mode = 0
Sun Mar 11 15:02:01 2018 - [info] Dead Servers:
Sun Mar 11 15:02:01 2018 - [info] Alive Servers:
Sun Mar 11 15:02:01 2018 - [info]   10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:02:01 2018 - [info]   10.0.0.60(10.0.0.60:3306)
Sun Mar 11 15:02:01 2018 - [info]   10.0.0.70(10.0.0.70:3306)
Sun Mar 11 15:02:01 2018 - [info] Alive Slaves:
Sun Mar 11 15:02:01 2018 - [info]   10.0.0.60(10.0.0.60:3306)  Version=5.6.16-log (oldest major version between slaves) log-bin:enabled
Sun Mar 11 15:02:01 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:02:01 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Mar 11 15:02:01 2018 - [info]   10.0.0.70(10.0.0.70:3306)  Version=5.6.16 (oldest major version between slaves) log-bin:disabled
Sun Mar 11 15:02:01 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:02:01 2018 - [info] Current Alive Master: 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:02:01 2018 - [info] Checking slave configurations..
Sun Mar 11 15:02:01 2018 - [warning]  relay_log_purge=0 is not set on slave 10.0.0.60(10.0.0.60:3306).
Sun Mar 11 15:02:01 2018 - [warning]  relay_log_purge=0 is not set on slave 10.0.0.70(10.0.0.70:3306).
Sun Mar 11 15:02:01 2018 - [warning]  log-bin is not set on slave 10.0.0.70(10.0.0.70:3306). This host cannot be a master.
Sun Mar 11 15:02:01 2018 - [info] Checking replication filtering settings..
Sun Mar 11 15:02:01 2018 - [info]  binlog_do_db= , binlog_ignore_db= 
Sun Mar 11 15:02:01 2018 - [info]  Replication filtering check ok.
Sun Mar 11 15:02:01 2018 - [info] GTID (with auto-pos) is not supported
Sun Mar 11 15:02:01 2018 - [info] Starting SSH connection tests..
Sun Mar 11 15:02:02 2018 - [info] All SSH connection tests passed successfully.
Sun Mar 11 15:02:02 2018 - [info] Checking MHA Node version..
Sun Mar 11 15:02:03 2018 - [info]  Version check ok.
Sun Mar 11 15:02:03 2018 - [info] Checking SSH publickey authentication settings on the current master..
Sun Mar 11 15:02:04 2018 - [info] HealthCheck: SSH to 10.0.0.50 is reachable.
Sun Mar 11 15:02:04 2018 - [info] Master MHA Node version is 0.56.
Sun Mar 11 15:02:04 2018 - [info] Checking recovery script configurations on 10.0.0.50(10.0.0.50:3306)..
Sun Mar 11 15:02:04 2018 - [info]   Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/mysql/data --output_file=/tmp/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000002 
Sun Mar 11 15:02:04 2018 - [info]   Connecting to root@10.0.0.50(10.0.0.50:22).. 
  Creating /tmp if not exists..    ok.
  Checking output directory is accessible or not..
   ok.
  Binlog found at /mysql/data, up to mysql-bin.000002
Sun Mar 11 15:02:04 2018 - [info] Binlog setting check done.
Sun Mar 11 15:02:04 2018 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Sun Mar 11 15:02:04 2018 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=10.0.0.60 --slave_ip=10.0.0.60 --slave_port=3306 --workdir=/tmp --target_version=5.6.16-log --manager_version=0.56 --relay_log_info=/mysql/data/relay-log.info  --relay_dir=/mysql/data/  --slave_pass=xxx
Sun Mar 11 15:02:04 2018 - [info]   Connecting to root@10.0.0.60(10.0.0.60:22).. 
  Checking slave recovery environment settings..
    Opening /mysql/data/relay-log.info ... ok.
    Relay log found at /mysql/data, up to cadicate-master-relay-bin.000005
    Temporary relay log file is /mysql/data/cadicate-master-relay-bin.000005
    Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
 done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Sun Mar 11 15:02:05 2018 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=10.0.0.70 --slave_ip=10.0.0.70 --slave_port=3306 --workdir=/tmp --target_version=5.6.16 --manager_version=0.56 --relay_log_info=/mysql/data/relay-log.info  --relay_dir=/mysql/data/  --slave_pass=xxx
Sun Mar 11 15:02:05 2018 - [info]   Connecting to root@10.0.0.70(10.0.0.70:22).. 
  Checking slave recovery environment settings..
    Opening /mysql/data/relay-log.info ... ok.
    Relay log found at /mysql/data, up to slave-relay-bin.000002
    Temporary relay log file is /mysql/data/slave-relay-bin.000002
    Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
 done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Sun Mar 11 15:02:05 2018 - [info] Slaves settings check done.
Sun Mar 11 15:02:05 2018 - [info] 
10.0.0.50(10.0.0.50:3306) (current master)
 +--10.0.0.60(10.0.0.60:3306)
 +--10.0.0.70(10.0.0.70:3306)

Sun Mar 11 15:02:05 2018 - [info] Checking master_ip_failover_script status:
Sun Mar 11 15:02:05 2018 - [info]   /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=10.0.0.50 --orig_master_ip=10.0.0.50 --orig_master_port=3306 


IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===

Checking the Status of the script.. OK 
Sun Mar 11 15:02:05 2018 - [info]  OK.
Sun Mar 11 15:02:05 2018 - [warning] shutdown_script is not defined.
Sun Mar 11 15:02:05 2018 - [info] Set master ping interval 1 seconds.
Sun Mar 11 15:02:05 2018 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s server03 -s server02
Sun Mar 11 15:02:05 2018 - [info] Starting ping health check on 10.0.0.50(10.0.0.50:3306)..
Sun Mar 11 15:02:05 2018 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
^C
[root@manager ~]# >  /var/log/masterha/app1/manager.log              
[root@manager ~]# tail -f /var/log/masterha/app1/manager.log 
Sun Mar 11 15:12:20 2018 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
Sun Mar 11 15:12:20 2018 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/mysql/data --output_file=/tmp/save_binary_logs_test --manager_version=0.56 --binlog_prefix=mysql-bin
Sun Mar 11 15:12:20 2018 - [info] Executing secondary network check script: /usr/local/bin/masterha_secondary_check -s server03 -s server02  --user=root  --master_host=10.0.0.50  --master_ip=10.0.0.50  --master_port=3306 --master_user=root --master_password=123 --ping_type=SELECT
Sun Mar 11 15:12:20 2018 - [info] HealthCheck: SSH to 10.0.0.50 is reachable.
Monitoring server server03 is reachable, Master is not reachable from server03. OK.
Monitoring server server02 is reachable, Master is not reachable from server02. OK.
Sun Mar 11 15:12:20 2018 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sun Mar 11 15:12:21 2018 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Sun Mar 11 15:12:21 2018 - [warning] Connection failed 2 time(s)..
Sun Mar 11 15:12:22 2018 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Sun Mar 11 15:12:22 2018 - [warning] Connection failed 3 time(s)..
Sun Mar 11 15:12:23 2018 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Sun Mar 11 15:12:23 2018 - [warning] Connection failed 4 time(s)..
Sun Mar 11 15:12:23 2018 - [warning] Master is not reachable from health checker!
Sun Mar 11 15:12:23 2018 - [warning] Master 10.0.0.50(10.0.0.50:3306) is not reachable!
Sun Mar 11 15:12:23 2018 - [warning] SSH is reachable.
Sun Mar 11 15:12:23 2018 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.cnf again, and trying to connect to all servers to check server status..
Sun Mar 11 15:12:23 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 15:12:23 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 15:12:23 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 15:12:23 2018 - [info] GTID failover mode = 0
Sun Mar 11 15:12:23 2018 - [info] Dead Servers:
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:12:23 2018 - [info] Alive Servers:
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.60(10.0.0.60:3306)
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.70(10.0.0.70:3306)
Sun Mar 11 15:12:23 2018 - [info] Alive Slaves:
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.60(10.0.0.60:3306)  Version=5.6.16-log (oldest major version between slaves) log-bin:enabled
Sun Mar 11 15:12:23 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:12:23 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.70(10.0.0.70:3306)  Version=5.6.16 (oldest major version between slaves) log-bin:disabled
Sun Mar 11 15:12:23 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:12:23 2018 - [info] Checking slave configurations..
Sun Mar 11 15:12:23 2018 - [warning]  relay_log_purge=0 is not set on slave 10.0.0.60(10.0.0.60:3306).
Sun Mar 11 15:12:23 2018 - [warning]  relay_log_purge=0 is not set on slave 10.0.0.70(10.0.0.70:3306).
Sun Mar 11 15:12:23 2018 - [warning]  log-bin is not set on slave 10.0.0.70(10.0.0.70:3306). This host cannot be a master.
Sun Mar 11 15:12:23 2018 - [info] Checking replication filtering settings..
Sun Mar 11 15:12:23 2018 - [info]  Replication filtering check ok.
Sun Mar 11 15:12:23 2018 - [info] Master is down!
Sun Mar 11 15:12:23 2018 - [info] Terminating monitoring script.
Sun Mar 11 15:12:23 2018 - [info] Got exit code 20 (Master dead).
Sun Mar 11 15:12:23 2018 - [info] MHA::MasterFailover version 0.56.
Sun Mar 11 15:12:23 2018 - [info] Starting master failover.
Sun Mar 11 15:12:23 2018 - [info] 
Sun Mar 11 15:12:23 2018 - [info] * Phase 1: Configuration Check Phase..
Sun Mar 11 15:12:23 2018 - [info] 
Sun Mar 11 15:12:23 2018 - [info] GTID failover mode = 0
Sun Mar 11 15:12:23 2018 - [info] Dead Servers:
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:12:23 2018 - [info] Checking master reachability via MySQL(double check)...
Sun Mar 11 15:12:23 2018 - [info]  ok.
Sun Mar 11 15:12:23 2018 - [info] Alive Servers:
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.60(10.0.0.60:3306)
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.70(10.0.0.70:3306)
Sun Mar 11 15:12:23 2018 - [info] Alive Slaves:
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.60(10.0.0.60:3306)  Version=5.6.16-log (oldest major version between slaves) log-bin:enabled
Sun Mar 11 15:12:23 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:12:23 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.70(10.0.0.70:3306)  Version=5.6.16 (oldest major version between slaves) log-bin:disabled
Sun Mar 11 15:12:23 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:12:23 2018 - [info] Starting Non-GTID based failover.
Sun Mar 11 15:12:23 2018 - [info] 
Sun Mar 11 15:12:23 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Sun Mar 11 15:12:23 2018 - [info] 
Sun Mar 11 15:12:23 2018 - [info] * Phase 2: Dead Master Shutdown Phase..
Sun Mar 11 15:12:23 2018 - [info] 
Sun Mar 11 15:12:23 2018 - [info] Forcing shutdown so that applications never connect to the current master..
Sun Mar 11 15:12:23 2018 - [info] Executing master IP deactivation script:
Sun Mar 11 15:12:23 2018 - [info]   /usr/local/bin/master_ip_failover --orig_master_host=10.0.0.50 --orig_master_ip=10.0.0.50 --orig_master_port=3306 --command=stopssh --ssh_user=root  


IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===

Disabling the VIP on old master: 10.0.0.50 
Sun Mar 11 15:12:23 2018 - [info]  done.
Sun Mar 11 15:12:23 2018 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sun Mar 11 15:12:23 2018 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Sun Mar 11 15:12:23 2018 - [info] 
Sun Mar 11 15:12:23 2018 - [info] * Phase 3: Master Recovery Phase..
Sun Mar 11 15:12:23 2018 - [info] 
Sun Mar 11 15:12:23 2018 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Sun Mar 11 15:12:23 2018 - [info] 
Sun Mar 11 15:12:23 2018 - [info] The latest binary log file/position on all slaves is mysql-bin.000002:120
Sun Mar 11 15:12:23 2018 - [info] Latest slaves (Slaves that received relay log files to the latest):
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.60(10.0.0.60:3306)  Version=5.6.16-log (oldest major version between slaves) log-bin:enabled
Sun Mar 11 15:12:23 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:12:23 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.70(10.0.0.70:3306)  Version=5.6.16 (oldest major version between slaves) log-bin:disabled
Sun Mar 11 15:12:23 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:12:23 2018 - [info] The oldest binary log file/position on all slaves is mysql-bin.000002:120
Sun Mar 11 15:12:23 2018 - [info] Oldest slaves:
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.60(10.0.0.60:3306)  Version=5.6.16-log (oldest major version between slaves) log-bin:enabled
Sun Mar 11 15:12:23 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:12:23 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Mar 11 15:12:23 2018 - [info]   10.0.0.70(10.0.0.70:3306)  Version=5.6.16 (oldest major version between slaves) log-bin:disabled
Sun Mar 11 15:12:23 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:12:23 2018 - [info] 
Sun Mar 11 15:12:23 2018 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Sun Mar 11 15:12:23 2018 - [info] 
Sun Mar 11 15:12:23 2018 - [info] Fetching dead master's binary logs..
Sun Mar 11 15:12:23 2018 - [info] Executing command on the dead master 10.0.0.50(10.0.0.50:3306): save_binary_logs --command=save --start_file=mysql-bin.000002  --start_pos=120 --binlog_dir=/mysql/data --output_file=/tmp/saved_master_binlog_from_10.0.0.50_3306_20180311151223.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
  Creating /tmp if not exists..    ok.
 Concat binary/relay logs from mysql-bin.000002 pos 120 to mysql-bin.000002 EOF into /tmp/saved_master_binlog_from_10.0.0.50_3306_20180311151223.binlog ..
 Binlog Checksum enabled
  Dumping binlog format description event, from position 0 to 120.. ok.
  No need to dump effective binlog data from /mysql/data/mysql-bin.000002 (pos starts 120, filesize 120). Skipping.
 Binlog Checksum enabled
 /tmp/saved_master_binlog_from_10.0.0.50_3306_20180311151223.binlog has no effective data events.
Event not exists.
Sun Mar 11 15:12:24 2018 - [info] Additional events were not found from the orig master. No need to save.
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] * Phase 3.3: Determining New Master Phase..
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Sun Mar 11 15:12:24 2018 - [info] All slaves received relay logs to the same position. No need to resync each other.
Sun Mar 11 15:12:24 2018 - [info] Searching new master from slaves..
Sun Mar 11 15:12:24 2018 - [info]  Candidate masters from the configuration file:
Sun Mar 11 15:12:24 2018 - [info]   10.0.0.60(10.0.0.60:3306)  Version=5.6.16-log (oldest major version between slaves) log-bin:enabled
Sun Mar 11 15:12:24 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:12:24 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Mar 11 15:12:24 2018 - [info]  Non-candidate masters:
Sun Mar 11 15:12:24 2018 - [info]   10.0.0.70(10.0.0.70:3306)  Version=5.6.16 (oldest major version between slaves) log-bin:disabled
Sun Mar 11 15:12:24 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:12:24 2018 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Sun Mar 11 15:12:24 2018 - [info] New master is 10.0.0.60(10.0.0.60:3306)
Sun Mar 11 15:12:24 2018 - [info] Starting master failover..
Sun Mar 11 15:12:24 2018 - [info] 
From:
10.0.0.50(10.0.0.50:3306) (current master)
 +--10.0.0.60(10.0.0.60:3306)
 +--10.0.0.70(10.0.0.70:3306)

To:
10.0.0.60(10.0.0.60:3306) (new master)
 +--10.0.0.70(10.0.0.70:3306)
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] * Phase 3.4: Master Log Apply Phase..
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Sun Mar 11 15:12:24 2018 - [info] Starting recovery on 10.0.0.60(10.0.0.60:3306)..
Sun Mar 11 15:12:24 2018 - [info]  This server has all relay logs. Waiting all logs to be applied.. 
Sun Mar 11 15:12:24 2018 - [info]   done.
Sun Mar 11 15:12:24 2018 - [info]  All relay logs were successfully applied.
Sun Mar 11 15:12:24 2018 - [info] Getting new master's binlog name and position..
Sun Mar 11 15:12:24 2018 - [info]  mysql-bin.000002:120
Sun Mar 11 15:12:24 2018 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.0.0.60', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000002', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Sun Mar 11 15:12:24 2018 - [info] Executing master IP activate script:
Sun Mar 11 15:12:24 2018 - [info]   /usr/local/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=10.0.0.50 --orig_master_ip=10.0.0.50 --orig_master_port=3306 --new_master_host=10.0.0.60 --new_master_ip=10.0.0.60 --new_master_port=3306 --new_master_user='root' --new_master_password='123'  
Unknown option: new_master_user
Unknown option: new_master_password


IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===

Enabling the VIP - 10.0.0.88 on the new master - 10.0.0.60 
Sun Mar 11 15:12:24 2018 - [info]  OK.
Sun Mar 11 15:12:24 2018 - [info] Setting read_only=0 on 10.0.0.60(10.0.0.60:3306)..
Sun Mar 11 15:12:24 2018 - [info]  ok.
Sun Mar 11 15:12:24 2018 - [info] ** Finished master recovery successfully.
Sun Mar 11 15:12:24 2018 - [info] * Phase 3: Master Recovery Phase completed.
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] * Phase 4: Slaves Recovery Phase..
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] -- Slave diff file generation on host 10.0.0.70(10.0.0.70:3306) started, pid: 2291. Check tmp log /var/log/masterha/app1.log/10.0.0.70_3306_20180311151223.log if it takes time..
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] Log messages from 10.0.0.70 ...
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Sun Mar 11 15:12:24 2018 - [info] End of log messages from 10.0.0.70.
Sun Mar 11 15:12:24 2018 - [info] -- 10.0.0.70(10.0.0.70:3306) has the latest relay log events.
Sun Mar 11 15:12:24 2018 - [info] Generating relay diff files from the latest slave succeeded.
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] -- Slave recovery on host 10.0.0.70(10.0.0.70:3306) started, pid: 2293. Check tmp log /var/log/masterha/app1.log/10.0.0.70_3306_20180311151223.log if it takes time..
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] Log messages from 10.0.0.70 ...
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] Starting recovery on 10.0.0.70(10.0.0.70:3306)..
Sun Mar 11 15:12:24 2018 - [info]  This server has all relay logs. Waiting all logs to be applied.. 
Sun Mar 11 15:12:24 2018 - [info]   done.
Sun Mar 11 15:12:24 2018 - [info]  All relay logs were successfully applied.
Sun Mar 11 15:12:24 2018 - [info]  Resetting slave 10.0.0.70(10.0.0.70:3306) and starting replication from the new master 10.0.0.60(10.0.0.60:3306)..
Sun Mar 11 15:12:24 2018 - [info]  Executed CHANGE MASTER.
Sun Mar 11 15:12:24 2018 - [info]  Slave started.
Sun Mar 11 15:12:24 2018 - [info] End of log messages from 10.0.0.70.
Sun Mar 11 15:12:24 2018 - [info] -- Slave recovery on host 10.0.0.70(10.0.0.70:3306) succeeded.
Sun Mar 11 15:12:24 2018 - [info] All new slave servers recovered successfully.
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] * Phase 5: New master cleanup phase..
Sun Mar 11 15:12:24 2018 - [info] 
Sun Mar 11 15:12:24 2018 - [info] Resetting slave info on the new master..
Sun Mar 11 15:12:24 2018 - [info]  10.0.0.60: Resetting slave info succeeded.
Sun Mar 11 15:12:24 2018 - [info] Master failover to 10.0.0.60(10.0.0.60:3306) completed successfully.
Sun Mar 11 15:12:24 2018 - [info] Deleted server1 entry from /etc/masterha/app1.cnf .
Sun Mar 11 15:12:24 2018 - [info] 

----- Failover Report -----

app1: MySQL Master failover 10.0.0.50(10.0.0.50:3306) to 10.0.0.60(10.0.0.60:3306) succeeded

Master 10.0.0.50(10.0.0.50:3306) is down!

Check MHA Manager logs at manager:/var/log/masterha/app1/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 10.0.0.50(10.0.0.50:3306)
The latest slave 10.0.0.60(10.0.0.60:3306) has all relay logs for recovery.
Selected 10.0.0.60(10.0.0.60:3306) as a new master.
10.0.0.60(10.0.0.60:3306): OK: Applying all logs succeeded.
10.0.0.60(10.0.0.60:3306): OK: Activated master IP address.
10.0.0.70(10.0.0.70:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
10.0.0.70(10.0.0.70:3306): OK: Applying all logs succeeded. Slave started, replicating from 10.0.0.60(10.0.0.60:3306)
10.0.0.60(10.0.0.60:3306): Resetting slave info succeeded.
Master failover to 10.0.0.60(10.0.0.60:3306) completed successfully.

successfully

 而有故障的主机会被剔出集群。

查看 cadicate上的vip是否由master 飘过来

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:39:2d:82 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.60/8 brd 10.255.255.255 scope global eth0
    inet 10.0.0.88/32 scope global eth0

 

从上面的输出可以看出整个MHA的切换过程,共包括以下的步骤:

1.配置文件检查阶段,这个阶段会检查整个集群配置文件配置

2.宕机的master处理,这个阶段包括虚拟ip摘除操作,主机关机操作(这个我这里还没有实现,需要研究)

3.复制dead maste和最新slave相差的relay log,并保存到MHA Manger具体的目录下

4.识别含有最新更新的slave

5.应用从master保存的二进制日志事件(binlog events)

6.提升一个slave为新的master进行复制

7.使其他的slave连接新的master进行复制

最后启动MHA Manger监控,查看集群里面现在谁是master(在切换后监控就停止了。。。还有东西没搞对?)后来在官方网站看到这句话就明白了 。

上述模拟了master宕机的情况下手动把10.0.0.60提升为主库的操作过程。

Running MHA Manager from daemontools

Currently MHA Manager process does not run as a daemon. If failover completed successfully or the master process was killed by accident, the manager stops working. To run as a daemon, daemontool. or any external daemon program can be used. Here is an example to run from daemontools.

二.手动Failover(MHA Manager必须没有运行)

手动failover,这种场景意味着在业务上没有启用MHA自动切换功能,当主服务器故障时,人工手动调用MHA来进行故障切换操作,具体命令如下:

注意:如果,MHA manager检测到没有dead的server,将报错,并结束failover: 

进行手动切换命令如下:

[root@manager ~]# masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=10.0.0.50 --dead_master_port=3306 --new_master_host=10.0.0.60 --new_master_port=3306 --ignore_last_failover

 

输出的信息会询问你是否进行切换:

 

上述模拟了master宕机的情况下手动把10.0.0.60提升为主库的操作过程。

三.在线进行切换

 在许多情况下, 需要将现有的主服务器迁移到另外一台服务器上。 比如主服务器硬件故障,RAID 控制卡需要重建,将主服务器移到性能更好的服务器上等等。维护主服务器引起性能下降, 导致停机时间至少无法写入数据。 另外, 阻塞或杀掉当前运行的会话会导致主主之间数据不一致的问题发生。 MHA 提供快速切换和优雅的阻塞写入,这个切换过程只需要 0.5-2s 的时间,这段时间内数据是无法写入的。在很多情况下,0.5-2s 的阻塞写入是可以接受的。因此切换主服务器不需要计划分配维护时间窗口。

MHA在线切换的大概过程:
1.检测复制设置和确定当前主服务器
2.确定新的主服务器
3.阻塞写入到当前主服务器
4.等待所有从服务器赶上复制
5.授予写入到新的主服务器
6.重新设置从服务器 

注意,在线切换的时候应用架构需要考虑以下两个问题:

1.自动识别master和slave的问题(master的机器可能会切换),如果采用了vip的方式,基本可以解决这个问题。

2.负载均衡的问题(可以定义大概的读写比例,每台机器可承担的负载比例,当有机器离开集群时,需要考虑这个问题)

为了保证数据完全一致性,在最快的时间内完成切换,MHA的在线切换必须满足以下条件才会切换成功,否则会切换失败。

1.所有slave的IO线程都在运行

2.所有slave的SQL线程都在运行

3.所有的show slave status的输出中Seconds_Behind_Master参数小于或者等于running_updates_limit秒,如果在切换过程中不指定running_updates_limit,那么默认情况下running_updates_limit为1秒。

4.在master端,通过show processlist输出,没有一个更新花费的时间大于running_updates_limit秒。

在线切换步骤如下:

首先,停掉MHA监控:

[root@manager ~]# masterha_stop --conf=/etc/masterha/app1.cnf

其次,进行在线切换操作(模拟在线切换主库操作,原主库10.0.0.50变为slave,10.0.0.60提升为新的主库)

[root@manager ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=10.0.0.60 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000

有个交互问你是否要切换

[root@manager ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=10.0.0.60 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
Sun Mar 11 15:50:13 2018 - [info] MHA::MasterRotate version 0.56.
Sun Mar 11 15:50:13 2018 - [info] Starting online master switch..
Sun Mar 11 15:50:13 2018 - [info] 
Sun Mar 11 15:50:13 2018 - [info] * Phase 1: Configuration Check Phase..
Sun Mar 11 15:50:13 2018 - [info] 
Sun Mar 11 15:50:13 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 15:50:13 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 15:50:13 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 15:50:13 2018 - [info] GTID failover mode = 0
Sun Mar 11 15:50:13 2018 - [info] Current Alive Master: 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:50:13 2018 - [info] Alive Slaves:
Sun Mar 11 15:50:13 2018 - [info]   10.0.0.60(10.0.0.60:3306)  Version=5.6.16-log (oldest major version between slaves) log-bin:enabled
Sun Mar 11 15:50:13 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:50:13 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Mar 11 15:50:13 2018 - [info]   10.0.0.70(10.0.0.70:3306)  Version=5.6.16 (oldest major version between slaves) log-bin:disabled
Sun Mar 11 15:50:13 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)

It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 10.0.0.50(10.0.0.50:3306)? (YES/no): yes
Sun Mar 11 15:51:11 2018 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Sun Mar 11 15:51:11 2018 - [info] ok.
Sun Mar 11 15:51:11 2018 - [info] Checking MHA is not monitoring or doing failover..
Sun Mar 11 15:51:11 2018 - [info] Checking replication health on 10.0.0.60..
Sun Mar 11 15:51:11 2018 - [info] ok.
Sun Mar 11 15:51:11 2018 - [info] Checking replication health on 10.0.0.70..
Sun Mar 11 15:51:11 2018 - [info] ok.
Sun Mar 11 15:51:11 2018 - [info] 10.0.0.60 can be new master.
Sun Mar 11 15:51:11 2018 - [info]
From:
10.0.0.50(10.0.0.50:3306) (current master)
+--10.0.0.60(10.0.0.60:3306)
+--10.0.0.70(10.0.0.70:3306)


To:
10.0.0.60(10.0.0.60:3306) (new master)
+--10.0.0.70(10.0.0.70:3306)
+--10.0.0.50(10.0.0.50:3306)


Starting master switch from 10.0.0.50(10.0.0.50:3306) to 10.0.0.60(10.0.0.60:3306)? (yes/NO): ^CSun Mar 11 15:51:31 2018 - [info] Killing thread 9 on 10.0.0.50(10.0.0.50:3306)..
Sun Mar 11 15:51:31 2018 - [info] ok.
Sun Mar 11 15:51:31 2018 - [info] Killing thread 10 on 10.0.0.60(10.0.0.60:3306)..
Sun Mar 11 15:51:31 2018 - [info] ok.
Sun Mar 11 15:51:31 2018 - [info] Killing thread 10 on 10.0.0.70(10.0.0.70:3306)..
Sun Mar 11 15:51:31 2018 - [info] ok.
[root@manager ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=10.0.0.60 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
Sun Mar 11 15:51:45 2018 - [info] MHA::MasterRotate version 0.56.
Sun Mar 11 15:51:45 2018 - [info] Starting online master switch..
Sun Mar 11 15:51:45 2018 - [info]
Sun Mar 11 15:51:45 2018 - [info] * Phase 1: Configuration Check Phase..
Sun Mar 11 15:51:45 2018 - [info]
Sun Mar 11 15:51:45 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 15:51:45 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 15:51:45 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 15:51:45 2018 - [info] GTID failover mode = 0
Sun Mar 11 15:51:45 2018 - [info] Current Alive Master: 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:51:45 2018 - [info] Alive Slaves:
Sun Mar 11 15:51:45 2018 - [info] 10.0.0.60(10.0.0.60:3306) Version=5.6.16-log (oldest major version between slaves) log-bin:enabled
Sun Mar 11 15:51:45 2018 - [info] Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:51:45 2018 - [info] Primary candidate for the new Master (candidate_master is set)
Sun Mar 11 15:51:45 2018 - [info] 10.0.0.70(10.0.0.70:3306) Version=5.6.16 (oldest major version between slaves) log-bin:disabled
Sun Mar 11 15:51:45 2018 - [info] Replicating from 10.0.0.50(10.0.0.50:3306)


It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 10.0.0.50(10.0.0.50:3306)? (YES/no): yes
Sun Mar 11 15:51:50 2018 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Sun Mar 11 15:51:50 2018 - [info] ok.
Sun Mar 11 15:51:50 2018 - [info] Checking MHA is not monitoring or doing failover..
Sun Mar 11 15:51:50 2018 - [info] Checking replication health on 10.0.0.60..
Sun Mar 11 15:51:50 2018 - [info] ok.
Sun Mar 11 15:51:50 2018 - [info] Checking replication health on 10.0.0.70..
Sun Mar 11 15:51:50 2018 - [info] ok.
Sun Mar 11 15:51:50 2018 - [info] 10.0.0.60 can be new master.
Sun Mar 11 15:51:50 2018 - [info]
From:
10.0.0.50(10.0.0.50:3306) (current master)
+--10.0.0.60(10.0.0.60:3306)
+--10.0.0.70(10.0.0.70:3306)


To:
10.0.0.60(10.0.0.60:3306) (new master)
+--10.0.0.70(10.0.0.70:3306)
+--10.0.0.50(10.0.0.50:3306)


Starting master switch from 10.0.0.50(10.0.0.50:3306) to 10.0.0.60(10.0.0.60:3306)? (yes/NO): yes
Sun Mar 11 15:52:04 2018 - [info] Checking whether 10.0.0.60(10.0.0.60:3306) is ok for the new master..
Sun Mar 11 15:52:04 2018 - [info] ok.
Sun Mar 11 15:52:04 2018 - [info] 10.0.0.50(10.0.0.50:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
Sun Mar 11 15:52:04 2018 - [info] 10.0.0.50(10.0.0.50:3306): Resetting slave pointing to the dummy host.
Sun Mar 11 15:52:04 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Sun Mar 11 15:52:04 2018 - [info]
Sun Mar 11 15:52:04 2018 - [info] * Phase 2: Rejecting updates Phase..
Sun Mar 11 15:52:04 2018 - [info]
Sun Mar 11 15:52:04 2018 - [info] Executing master ip online change script to disable write on the current master:
Sun Mar 11 15:52:04 2018 - [info] /usr/local/bin/master_ip_online_change --command=stop --orig_master_host=10.0.0.50 --orig_master_ip=10.0.0.50 --orig_master_port=3306 --orig_master_user='root' --orig_master_password='123' --new_master_host=10.0.0.60 --new_master_ip=10.0.0.60 --new_master_port=3306 --new_master_user='root' --new_master_password='123' --orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slave
Unknown option: orig_master_password
Unknown option: new_master_password
Unknown option: orig_master_ssh_user
Unknown option: new_master_ssh_user
Unknown option: orig_master_is_new_slave
Sun Mar 11 15:52:05 2018 035158 Set read_only on the new master.. ok.
Sun Mar 11 15:52:05 2018 037347 Set read_only=1 on the orig master.. ok.
Disabling the VIP on old master: 10.0.0.50
eth1:1: unknown interface: No such device
Sun Mar 11 15:52:05 2018 194450 Killing all application threads..
Sun Mar 11 15:52:05 2018 194507 done.
Sun Mar 11 15:52:05 2018 - [info] ok.
Sun Mar 11 15:52:05 2018 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Sun Mar 11 15:52:05 2018 - [info] Executing FLUSH TABLES WITH READ LOCK..
Sun Mar 11 15:52:05 2018 - [info] ok.
Sun Mar 11 15:52:05 2018 - [info] Orig master binlog:pos is mysql-bin.000002:120.
Sun Mar 11 15:52:05 2018 - [info] Waiting to execute all relay logs on 10.0.0.60(10.0.0.60:3306)..
Sun Mar 11 15:52:05 2018 - [info] master_pos_wait(mysql-bin.000002:120) completed on 10.0.0.60(10.0.0.60:3306). Executed 0 events.
Sun Mar 11 15:52:05 2018 - [info] done.
Sun Mar 11 15:52:05 2018 - [info] Getting new master's binlog name and position..
Sun Mar 11 15:52:05 2018 - [info] mysql-bin.000003:120
Sun Mar 11 15:52:05 2018 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.0.0.60', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000003', MASTER_LOG_POS=120, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Sun Mar 11 15:52:05 2018 - [info] Executing master ip online change script to allow write on the new master:
Sun Mar 11 15:52:05 2018 - [info] /usr/local/bin/master_ip_online_change --command=start --orig_master_host=10.0.0.50 --orig_master_ip=10.0.0.50 --orig_master_port=3306 --orig_master_user='root' --orig_master_password='123' --new_master_host=10.0.0.60 --new_master_ip=10.0.0.60 --new_master_port=3306 --new_master_user='root' --new_master_password='123' --orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slave
Unknown option: orig_master_password
Unknown option: new_master_password
Unknown option: orig_master_ssh_user
Unknown option: new_master_ssh_user
Unknown option: orig_master_is_new_slave
Sun Mar 11 15:52:05 2018 332345 Set read_only=0 on the new master.
Enabling the VIP - 10.0.0.88/24 on the new master - 10.0.0.60
SIOCSIFADDR: No such device
SIOCSIFNETMASK: No such device
SIOCGIFADDR: No such device
SIOCSIFBROADCAST: No such device
eth1:1: unknown interface: No such device
Sun Mar 11 15:52:05 2018 - [info] ok.
Sun Mar 11 15:52:05 2018 - [info]
Sun Mar 11 15:52:05 2018 - [info] * Switching slaves in parallel..
Sun Mar 11 15:52:05 2018 - [info]
Sun Mar 11 15:52:05 2018 - [info] -- Slave switch on host 10.0.0.70(10.0.0.70:3306) started, pid: 1679
Sun Mar 11 15:52:05 2018 - [info]
Sun Mar 11 15:52:06 2018 - [info] Log messages from 10.0.0.70 ...
Sun Mar 11 15:52:06 2018 - [info]
Sun Mar 11 15:52:05 2018 - [info] Waiting to execute all relay logs on 10.0.0.70(10.0.0.70:3306)..
Sun Mar 11 15:52:05 2018 - [info] master_pos_wait(mysql-bin.000002:120) completed on 10.0.0.70(10.0.0.70:3306). Executed 0 events.
Sun Mar 11 15:52:05 2018 - [info] done.
Sun Mar 11 15:52:05 2018 - [info] Resetting slave 10.0.0.70(10.0.0.70:3306) and starting replication from the new master 10.0.0.60(10.0.0.60:3306)..
Sun Mar 11 15:52:05 2018 - [info] Executed CHANGE MASTER.
Sun Mar 11 15:52:06 2018 - [info] Slave started.
Sun Mar 11 15:52:06 2018 - [info] End of log messages from 10.0.0.70 ...
Sun Mar 11 15:52:06 2018 - [info]
Sun Mar 11 15:52:06 2018 - [info] -- Slave switch on host 10.0.0.70(10.0.0.70:3306) succeeded.
Sun Mar 11 15:52:06 2018 - [info] Unlocking all tables on the orig master:
Sun Mar 11 15:52:06 2018 - [info] Executing UNLOCK TABLES..
Sun Mar 11 15:52:06 2018 - [info] ok.
Sun Mar 11 15:52:06 2018 - [info] Starting orig master as a new slave..
Sun Mar 11 15:52:06 2018 - [info] Resetting slave 10.0.0.50(10.0.0.50:3306) and starting replication from the new master 10.0.0.60(10.0.0.60:3306)..
Sun Mar 11 15:52:06 2018 - [info] Executed CHANGE MASTER.
Sun Mar 11 15:52:06 2018 - [info] Slave started.
Sun Mar 11 15:52:06 2018 - [info] All new slave servers switched successfully.
Sun Mar 11 15:52:06 2018 - [info]
Sun Mar 11 15:52:06 2018 - [info] * Phase 5: New master cleanup phase..
Sun Mar 11 15:52:06 2018 - [info]
Sun Mar 11 15:52:06 2018 - [info] 10.0.0.60: Resetting slave info succeeded.
Sun Mar 11 15:52:06 2018 - [info] Switching master to 10.0.0.60(10.0.0.60:3306) completed successfully.

 

 

最后查看日志,了解切换过程,输出信息如下:

 

Sun Mar 11 15:45:41 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 15:45:41 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 15:45:41 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 15:45:41 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 15:45:41 2018 - [warning] /var/log/masterha/app1.log/app1.master_status.health already exists. You might have killed manager with SIGKILL(-9), may run two or more monitoring process for the same application, or use the same working directory. Check for details, and consider setting --workdir separately.
Sun Mar 11 15:45:42 2018 - [info] GTID failover mode = 0
Sun Mar 11 15:45:42 2018 - [info] Dead Servers:
Sun Mar 11 15:45:42 2018 - [info] Alive Servers:
Sun Mar 11 15:45:42 2018 - [info]   10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:45:42 2018 - [info]   10.0.0.60(10.0.0.60:3306)
Sun Mar 11 15:45:42 2018 - [info]   10.0.0.70(10.0.0.70:3306)
Sun Mar 11 15:45:42 2018 - [info] Alive Slaves:
Sun Mar 11 15:45:42 2018 - [info]   10.0.0.60(10.0.0.60:3306)  Version=5.6.16-log (oldest major version between slaves) log-bin:enabled
Sun Mar 11 15:45:42 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:45:42 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Mar 11 15:45:42 2018 - [info]   10.0.0.70(10.0.0.70:3306)  Version=5.6.16 (oldest major version between slaves) log-bin:disabled
Sun Mar 11 15:45:42 2018 - [info]     Replicating from 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:45:42 2018 - [info] Current Alive Master: 10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:45:42 2018 - [info] Checking slave configurations..
Sun Mar 11 15:45:42 2018 - [info]  read_only=1 is not set on slave 10.0.0.60(10.0.0.60:3306).
Sun Mar 11 15:45:42 2018 - [warning]  relay_log_purge=0 is not set on slave 10.0.0.60(10.0.0.60:3306).
Sun Mar 11 15:45:42 2018 - [warning]  relay_log_purge=0 is not set on slave 10.0.0.70(10.0.0.70:3306).
Sun Mar 11 15:45:42 2018 - [warning]  log-bin is not set on slave 10.0.0.70(10.0.0.70:3306). This host cannot be a master.
Sun Mar 11 15:45:42 2018 - [info] Checking replication filtering settings..
Sun Mar 11 15:45:42 2018 - [info]  binlog_do_db= , binlog_ignore_db= 
Sun Mar 11 15:45:42 2018 - [info]  Replication filtering check ok.
Sun Mar 11 15:45:42 2018 - [info] GTID (with auto-pos) is not supported
Sun Mar 11 15:45:42 2018 - [info] Starting SSH connection tests..
Sun Mar 11 15:45:43 2018 - [info] All SSH connection tests passed successfully.
Sun Mar 11 15:45:43 2018 - [info] Checking MHA Node version..
Sun Mar 11 15:45:45 2018 - [info]  Version check ok.
Sun Mar 11 15:45:45 2018 - [info] Checking SSH publickey authentication settings on the current master..
Sun Mar 11 15:45:45 2018 - [info] HealthCheck: SSH to 10.0.0.50 is reachable.
Sun Mar 11 15:45:45 2018 - [info] Master MHA Node version is 0.56.
Sun Mar 11 15:45:45 2018 - [info] Checking recovery script configurations on 10.0.0.50(10.0.0.50:3306)..
Sun Mar 11 15:45:45 2018 - [info]   Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/mysql/data --output_file=/tmp/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000002 
Sun Mar 11 15:45:45 2018 - [info]   Connecting to root@10.0.0.50(10.0.0.50:22).. 
  Creating /tmp if not exists..    ok.
  Checking output directory is accessible or not..
   ok.
  Binlog found at /mysql/data, up to mysql-bin.000002
Sun Mar 11 15:45:46 2018 - [info] Binlog setting check done.
Sun Mar 11 15:45:46 2018 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Sun Mar 11 15:45:46 2018 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=10.0.0.60 --slave_ip=10.0.0.60 --slave_port=3306 --workdir=/tmp --target_version=5.6.16-log --manager_version=0.56 --relay_log_info=/mysql/data/relay-log.info  --relay_dir=/mysql/data/  --slave_pass=xxx
Sun Mar 11 15:45:46 2018 - [info]   Connecting to root@10.0.0.60(10.0.0.60:22).. 
  Checking slave recovery environment settings..
    Opening /mysql/data/relay-log.info ... ok.
    Relay log found at /mysql/data, up to cadicate-master-relay-bin.000007
    Temporary relay log file is /mysql/data/cadicate-master-relay-bin.000007
    Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
 done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Sun Mar 11 15:45:47 2018 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=10.0.0.70 --slave_ip=10.0.0.70 --slave_port=3306 --workdir=/tmp --target_version=5.6.16 --manager_version=0.56 --relay_log_info=/mysql/data/relay-log.info  --relay_dir=/mysql/data/  --slave_pass=xxx
Sun Mar 11 15:45:47 2018 - [info]   Connecting to root@10.0.0.70(10.0.0.70:22).. 
  Checking slave recovery environment settings..
    Opening /mysql/data/relay-log.info ... ok.
    Relay log found at /mysql/data, up to slave-relay-bin.000005
    Temporary relay log file is /mysql/data/slave-relay-bin.000005
    Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
 done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Sun Mar 11 15:45:47 2018 - [info] Slaves settings check done.
Sun Mar 11 15:45:47 2018 - [info] 
10.0.0.50(10.0.0.50:3306) (current master)
 +--10.0.0.60(10.0.0.60:3306)
 +--10.0.0.70(10.0.0.70:3306)

Sun Mar 11 15:45:47 2018 - [info] Checking master_ip_failover_script status:
Sun Mar 11 15:45:47 2018 - [info]   /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=10.0.0.50 --orig_master_ip=10.0.0.50 --orig_master_port=3306 


IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===

Checking the Status of the script.. OK 
Sun Mar 11 15:45:47 2018 - [info]  OK.
Sun Mar 11 15:45:47 2018 - [warning] shutdown_script is not defined.
Sun Mar 11 15:45:47 2018 - [info] Set master ping interval 1 seconds.
Sun Mar 11 15:45:47 2018 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s server03 -s server02
Sun Mar 11 15:45:47 2018 - [info] Starting ping health check on 10.0.0.50(10.0.0.50:3306)..
Sun Mar 11 15:45:47 2018 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Sun Mar 11 15:49:49 2018 - [info] Got terminate signal. Exit.
^C
[root@manager ~]# > /var/log/masterha/app1/manager.log       
[root@manager ~]# tail -f /var/log/masterha/app1/manager.log 
Sun Mar 11 15:53:04 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Mar 11 15:53:04 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Sun Mar 11 15:53:04 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Sun Mar 11 15:53:04 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Mar 11 15:53:04 2018 - [info] GTID failover mode = 0
Sun Mar 11 15:53:04 2018 - [info] Dead Servers:
Sun Mar 11 15:53:04 2018 - [info] Alive Servers:
Sun Mar 11 15:53:04 2018 - [info]   10.0.0.50(10.0.0.50:3306)
Sun Mar 11 15:53:04 2018 - [info]   10.0.0.60(10.0.0.60:3306)
Sun Mar 11 15:53:04 2018 - [info]   10.0.0.70(10.0.0.70:3306)
Sun Mar 11 15:53:04 2018 - [info] Alive Slaves:
Sun Mar 11 15:53:04 2018 - [info]   10.0.0.50(10.0.0.50:3306)  Version=5.6.16-log (oldest major version between slaves) log-bin:enabled
Sun Mar 11 15:53:04 2018 - [info]     Replicating from 10.0.0.60(10.0.0.60:3306)
Sun Mar 11 15:53:04 2018 - [info]   10.0.0.70(10.0.0.70:3306)  Version=5.6.16 (oldest major version between slaves) log-bin:disabled
Sun Mar 11 15:53:04 2018 - [info]     Replicating from 10.0.0.60(10.0.0.60:3306)
Sun Mar 11 15:53:04 2018 - [info] Current Alive Master: 10.0.0.60(10.0.0.60:3306)
Sun Mar 11 15:53:04 2018 - [info] Checking slave configurations..
Sun Mar 11 15:53:04 2018 - [warning]  relay_log_purge=0 is not set on slave 10.0.0.70(10.0.0.70:3306).
Sun Mar 11 15:53:04 2018 - [warning]  log-bin is not set on slave 10.0.0.70(10.0.0.70:3306). This host cannot be a master.
Sun Mar 11 15:53:04 2018 - [info] Checking replication filtering settings..
Sun Mar 11 15:53:04 2018 - [info]  binlog_do_db= , binlog_ignore_db= 
Sun Mar 11 15:53:04 2018 - [info]  Replication filtering check ok.
Sun Mar 11 15:53:04 2018 - [info] GTID (with auto-pos) is not supported
Sun Mar 11 15:53:04 2018 - [info] Starting SSH connection tests..
Sun Mar 11 15:53:06 2018 - [info] All SSH connection tests passed successfully.
Sun Mar 11 15:53:06 2018 - [info] Checking MHA Node version..
Sun Mar 11 15:53:06 2018 - [info]  Version check ok.
Sun Mar 11 15:53:06 2018 - [info] Checking SSH publickey authentication settings on the current master..
Sun Mar 11 15:53:06 2018 - [info] HealthCheck: SSH to 10.0.0.60 is reachable.
Sun Mar 11 15:53:06 2018 - [info] Master MHA Node version is 0.56.
Sun Mar 11 15:53:06 2018 - [info] Checking recovery script configurations on 10.0.0.60(10.0.0.60:3306)..
Sun Mar 11 15:53:06 2018 - [info]   Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/mysql/data --output_file=/tmp/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000003 
Sun Mar 11 15:53:06 2018 - [info]   Connecting to root@10.0.0.60(10.0.0.60:22).. 
  Creating /tmp if not exists..    ok.
  Checking output directory is accessible or not..
   ok.
  Binlog found at /mysql/data, up to mysql-bin.000003
Sun Mar 11 15:53:06 2018 - [info] Binlog setting check done.
Sun Mar 11 15:53:06 2018 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Sun Mar 11 15:53:06 2018 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=10.0.0.50 --slave_ip=10.0.0.50 --slave_port=3306 --workdir=/tmp --target_version=5.6.16-log --manager_version=0.56 --relay_log_info=/mysql/data/relay-log.info  --relay_dir=/mysql/data/  --slave_pass=xxx
Sun Mar 11 15:53:06 2018 - [info]   Connecting to root@10.0.0.50(10.0.0.50:22).. 
  Checking slave recovery environment settings..
    Opening /mysql/data/relay-log.info ... ok.
    Relay log found at /mysql/data, up to master-relay-bin.000002
    Temporary relay log file is /mysql/data/master-relay-bin.000002
    Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
 done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Sun Mar 11 15:53:07 2018 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=10.0.0.70 --slave_ip=10.0.0.70 --slave_port=3306 --workdir=/tmp --target_version=5.6.16 --manager_version=0.56 --relay_log_info=/mysql/data/relay-log.info  --relay_dir=/mysql/data/  --slave_pass=xxx
Sun Mar 11 15:53:07 2018 - [info]   Connecting to root@10.0.0.70(10.0.0.70:22).. 
  Checking slave recovery environment settings..
    Opening /mysql/data/relay-log.info ... ok.
    Relay log found at /mysql/data, up to slave-relay-bin.000002
    Temporary relay log file is /mysql/data/slave-relay-bin.000002
    Testing mysql connection and privileges..Warning: Using a password on the command line interface can be insecure.
 done.
    Testing mysqlbinlog output.. done.
    Cleaning up test file(s).. done.
Sun Mar 11 15:53:07 2018 - [info] Slaves settings check done.
Sun Mar 11 15:53:07 2018 - [info] 
10.0.0.60(10.0.0.60:3306) (current master)
 +--10.0.0.50(10.0.0.50:3306)
 +--10.0.0.70(10.0.0.70:3306)

Sun Mar 11 15:53:07 2018 - [info] Checking master_ip_failover_script status:
Sun Mar 11 15:53:07 2018 - [info]   /usr/local/bin/master_ip_failover --command=status --ssh_user=root --orig_master_host=10.0.0.60 --orig_master_ip=10.0.0.60 --orig_master_port=3306 


IN SCRIPT TEST====/etc/init.d/keepalived stop==/etc/init.d/keepalived start===

Checking the Status of the script.. OK 
Sun Mar 11 15:53:07 2018 - [info]  OK.
Sun Mar 11 15:53:07 2018 - [warning] shutdown_script is not defined.
Sun Mar 11 15:53:07 2018 - [info] Set master ping interval 1 seconds.
Sun Mar 11 15:53:07 2018 - [info] Set secondary check script: /usr/local/bin/masterha_secondary_check -s server03 -s server02
Sun Mar 11 15:53:07 2018 - [info] Starting ping health check on 10.0.0.60(10.0.0.60:3306)..
Sun Mar 11 15:53:07 2018 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..

 

 

其中参数的意思:

--orig_master_is_new_slave 切换时加上此参数是将原 master 变为 slave 节点,如果不加此参数,原来的 master 将不启动

--running_updates_limit=10000,故障切换时,候选master 如果有延迟的话, mha 切换不能成功,加上此参数表示延迟在此时间范围内都可切换(单位为s),但是切换的时间长短是由recover 时relay 日志的大小决定 

注意:由于在线进行切换需要调用到master_ip_online_change这个脚本,但是由于该脚本不完整,需要自己进行相应的修改,我google到后发现还是有问题,脚本中new_master_password这个变量获取不到,导致在线切换失败,所以进行了相关的硬编码,直接把mysql的root用户密码赋值给变量new_master_password,如果有哪位大牛知道原因,请指点指点。这个脚本还可以管理vip。下面贴出脚本:

#!/usr/bin/env perl

#  Copyright (C) 2011 DeNA Co.,Ltd.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#  Foundation, Inc.,
#  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA

## Note: This is a sample script and is not complete. Modify the script based on your environment.

use strict;
use warnings FATAL => 'all';

use Getopt::Long;
use MHA::DBHelper;
use MHA::NodeUtil;
use Time::HiRes qw( sleep gettimeofday tv_interval );
use Data::Dumper;

my $_tstart;
my $_running_interval = 0.1;
my (
  $command,          $orig_master_host, $orig_master_ip,
  $orig_master_port, $orig_master_user, 
  $new_master_host,  $new_master_ip,    $new_master_port,
  $new_master_user,  
);


my $vip = '10.0.0.88/24';  # Virtual IP 
my $key = "1"; 
my $ssh_start_vip = "/sbin/ifconfig eth1:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig eth1:$key down";
my $ssh_user = "root";
my $new_master_password='123';
my $orig_master_password='123';
GetOptions(
  'command=s'              => \$command,
  #'ssh_user=s'             => \$ssh_user,  
  'orig_master_host=s'     => \$orig_master_host,
  'orig_master_ip=s'       => \$orig_master_ip,
  'orig_master_port=i'     => \$orig_master_port,
  'orig_master_user=s'     => \$orig_master_user,
  #'orig_master_password=s' => \$orig_master_password,
  'new_master_host=s'      => \$new_master_host,
  'new_master_ip=s'        => \$new_master_ip,
  'new_master_port=i'      => \$new_master_port,
  'new_master_user=s'      => \$new_master_user,
  #'new_master_password=s'  => \$new_master_password,
);

exit &main();

sub current_time_us {
  my ( $sec, $microsec ) = gettimeofday();
  my $curdate = localtime($sec);
  return $curdate . " " . sprintf( "%06d", $microsec );
}

sub sleep_until {
  my $elapsed = tv_interval($_tstart);
  if ( $_running_interval > $elapsed ) {
    sleep( $_running_interval - $elapsed );
  }
}

sub get_threads_util {
  my $dbh                    = shift;
  my $my_connection_id       = shift;
  my $running_time_threshold = shift;
  my $type                   = shift;
  $running_time_threshold = 0 unless ($running_time_threshold);
  $type                   = 0 unless ($type);
  my @threads;

  my $sth = $dbh->prepare("SHOW PROCESSLIST");
  $sth->execute();

  while ( my $ref = $sth->fetchrow_hashref() ) {
    my $id         = $ref->{Id};
    my $user       = $ref->{User};
    my $host       = $ref->{Host};
    my $command    = $ref->{Command};
    my $state      = $ref->{State};
    my $query_time = $ref->{Time};
    my $info       = $ref->{Info};
    $info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);
    next if ( $my_connection_id == $id );
    next if ( defined($query_time) && $query_time < $running_time_threshold );
    next if ( defined($command)    && $command eq "Binlog Dump" );
    next if ( defined($user)       && $user eq "system user" );
    next
      if ( defined($command)
      && $command eq "Sleep"
      && defined($query_time)
      && $query_time >= 1 );

    if ( $type >= 1 ) {
      next if ( defined($command) && $command eq "Sleep" );
      next if ( defined($command) && $command eq "Connect" );
    }

    if ( $type >= 2 ) {
      next if ( defined($info) && $info =~ m/^select/i );
      next if ( defined($info) && $info =~ m/^show/i );
    }

    push @threads, $ref;
  }
  return @threads;
}

sub main {
  if ( $command eq "stop" ) {
    ## Gracefully killing connections on the current master
    # 1. Set read_only= 1 on the new master
    # 2. DROP USER so that no app user can establish new connections
    # 3. Set read_only= 1 on the current master
    # 4. Kill current queries
    # * Any database access failure will result in script die.
    my $exit_code = 1;
    eval {
      ## Setting read_only=1 on the new master (to avoid accident)
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error(die_on_error)_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );
      print current_time_us() . " Set read_only on the new master.. ";
      $new_master_handler->enable_read_only();
      if ( $new_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }
      $new_master_handler->disconnect();

      # Connecting to the orig master, die if any database error happens
      my $orig_master_handler = new MHA::DBHelper();
      $orig_master_handler->connect( $orig_master_ip, $orig_master_port,
        $orig_master_user, $orig_master_password, 1 );

      ## Drop application user so that nobody can connect. Disabling per-session binlog beforehand
      #$orig_master_handler->disable_log_bin_local();
      #print current_time_us() . " Drpping app user on the orig master..\n";
      #FIXME_xxx_drop_app_user($orig_master_handler);

      ## Waiting for N * 100 milliseconds so that current connections can exit
      my $time_until_read_only = 15;
      $_tstart = [gettimeofday];
      my @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_read_only > 0 && $#threads >= 0 ) {
        if ( $time_until_read_only % 5 == 0 ) {
          printf
"%s Waiting all running %d threads are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_read_only * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_read_only--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }

      ## Setting read_only=1 on the current master so that nobody(except SUPER) can write
      print current_time_us() . " Set read_only=1 on the orig master.. ";
      $orig_master_handler->enable_read_only();
      if ( $orig_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }

      ## Waiting for M * 100 milliseconds so that current update queries can complete
      my $time_until_kill_threads = 5;
      @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {
        if ( $time_until_kill_threads % 5 == 0 ) {
          printf
"%s Waiting all running %d queries are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_kill_threads * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_kill_threads--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }



                print "Disabling the VIP on old master: $orig_master_host \n";
                &stop_vip();     


      ## Terminating all threads
      print current_time_us() . " Killing all application threads..\n";
      $orig_master_handler->kill_threads(@threads) if ( $#threads >= 0 );
      print current_time_us() . " done.\n";
      #$orig_master_handler->enable_log_bin_local();
      $orig_master_handler->disconnect();

      ## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "start" ) {
    ## Activating master ip on the new master
    # 1. Create app user with write privileges
    # 2. Moving backup script if needed
    # 3. Register new master's ip to the catalog database

# We don't return error even though activating updatable accounts/ip failed so that we don't interrupt slaves' recovery.
# If exit code is 0 or 10, MHA does not abort
    my $exit_code = 10;
    eval {
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );

      ## Set read_only=0 on the new master
      #$new_master_handler->disable_log_bin_local();
      print current_time_us() . " Set read_only=0 on the new master.\n";
      $new_master_handler->disable_read_only();

      ## Creating an app user on the new master
      #print current_time_us() . " Creating app user on the new master..\n";
      #FIXME_xxx_create_app_user($new_master_handler);
      #$new_master_handler->enable_log_bin_local();
      $new_master_handler->disconnect();

      ## Update master ip on the catalog database, etc
                print "Enabling the VIP - $vip on the new master - $new_master_host \n";
                &start_vip();
                $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "status" ) {

    # do nothing
    exit 0;
  }
  else {
    &usage();
    exit 1;
  }
}

# A simple system call that enable the VIP on the new master 
sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
  print
"Usage: master_ip_online_change --command=start|stop|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
  die;
}

四.修复宕机的Master 

通常情况下自动切换以后,原master可能已经废弃掉,待原master主机修复后,如果数据完整的情况下,可能想把原来master重新作为新主库的slave,这时我们可以借助当时自动切换时刻的MHA日志来完成对原master的修复。下面是提取相关日志的命令:

[root@manager app1]# grep -i "All other slaves should start" manager.log

获取上述信息以后,就可以直接在修复后的master上执行change master to相关操作,重新作为从库了。

最后补充一下邮件发送脚本send_report ,这个脚本在询问一位朋友后可以使用,如下:

#!/usr/bin/perl

#  Copyright (C) 2011 DeNA Co.,Ltd.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#  Foundation, Inc.,
#  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA

## Note: This is a sample script and is not complete. Modify the script based on your environment.

use strict;
use warnings FATAL => 'all';
use Mail::Sender;
use Getopt::Long;

#new_master_host and new_slave_hosts are set only when recovering master succeeded
my ( $dead_master_host, $new_master_host, $new_slave_hosts, $subject, $body );
my $smtp='smtp.163.com';
my $mail_from='xxxx';
my $mail_user='xxxxx';
my $mail_pass='xxxxx';
my $mail_to=['xxxx','xxxx'];
GetOptions(
  'orig_master_host=s' => \$dead_master_host,
  'new_master_host=s'  => \$new_master_host,
  'new_slave_hosts=s'  => \$new_slave_hosts,
  'subject=s'          => \$subject,
  'body=s'             => \$body,
);

mailToContacts($smtp,$mail_from,$mail_user,$mail_pass,$mail_to,$subject,$body);

sub mailToContacts {
    my ( $smtp, $mail_from, $user, $passwd, $mail_to, $subject, $msg ) = @_;
    open my $DEBUG, "> /tmp/monitormail.log"
        or die "Can't open the debug      file:$!\n";
    my $sender = new Mail::Sender {
        ctype       => 'text/plain; charset=utf-8',
        encoding    => 'utf-8',
        smtp        => $smtp,
        from        => $mail_from,
        auth        => 'LOGIN',
        TLS_allowed => '0',
        authid      => $user,
        authpwd     => $passwd,
        to          => $mail_to,
        subject     => $subject,
        debug       => $DEBUG
    };

    $sender->MailMsg(
        {   msg   => $msg,
            debug => $DEBUG
        }
    ) or print $Mail::Sender::Error;
    return 1;
}



# Do whatever you want here

exit 0;

最后切换以后发送告警的邮件示例,注意,这个是我后续的测试,和上面环境出现的ip不一致不要在意。

总结:

目前高可用方案可以一定程度上实现数据库的高可用,比如前面文章介绍的MMMheartbeat+drbdCluster等。还有percona的Galera Cluster等。这些高可用软件各有优劣。在进行高可用方案选择时,主要是看业务还有对数据一致性方面的要求。最后出于对数据库的高可用和数据一致性的要求,推荐使用MHA架构。

posted @ 2018-03-08 17:27  benjamin杨  阅读(1507)  评论(0编辑  收藏  举报