MHA环境搭建

 准备工作

数据库架构

角色               ip地址          主机名          server_id
Master        192.168.132.131    master            131  
Slave1        192.168.132.132    master-bak        132  
Slave2        192.168.132.133    slave2            133  

 

配置三台服务器ssh免秘钥认证

ssh-keygen -t rsa
ssh-copy-id 192.168.132.132
ssh-copy-id 192.168.132.133
ssh-copy-id 192.168.132.131

mysql主从搭建

 

一、修改配置文件 my.cnf

master

[root@master ~]# cat /etc/my.cnf|grep -v "^#\|^$"
[mysqld]
server_id=131
log_bin = master-bin
log_bin_index = master-bin.index
enforce_gtid_consistency = 1
log_slave_updates = 1
plugin_load = "rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so"
loose_rpl_semi_sync_master_enabled = 1
loose_rpl_semi_sync_slave_enabled = 1
loose_rpl_semi_sync_master_timeout = 5000
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES 

 

master-bak

[root@master-bak ~]# cat /etc/my.cnf|grep -v "^#\|^$"
[mysqld]
server_id=132
log_bin=slave-bin
log_bin_index=slave-bin.index
log-slave-updates
relay_log_index=slave-relay-bin.index
relay_log=slave-relay-bin
enforce_gtid_consistency = 1
log_slave_updates = 1
plugin_load = "rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so"
loose_rpl_semi_sync_master_enabled = 1
loose_rpl_semi_sync_slave_enabled = 1
loose_rpl_semi_sync_master_timeout = 5000
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES 

 

slave2

[root@slave2 ~]# cat /etc/my.cnf|grep -v "^#\|^$"
[mysqld]
server_id=133
log-bin=mysql-bin
relay_log_index=slave-relay-bin.index
relay_log=slave-relay-bin
relay_log_purge=0
read_only=1
enforce_gtid_consistency = 1
log_slave_updates = 1
plugin_load = "rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so"
loose_rpl_semi_sync_master_enabled = 1
loose_rpl_semi_sync_slave_enabled = 1
loose_rpl_semi_sync_master_timeout = 5000
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES

 

二、配置主从

授权(在三台服务器上配置复制用户和监控用户,三台服务器都要添加)

mysql5.7会默认加载validate_password 模块,是来控制密码长度和规则的,可以在配置文件里面关闭该模块  加上validate_password = off ,或者在mysql命令行执行set global validate_password_policy=0;来临时取消密码规则。

 

#添加复制用户
grant replication slave on *.* to 'repl'@'192.168.132.%' identified by '123456';

#添加监控用户
grant all privileges on *.* to 'root'@'192.168.132.%' identified  by 'root';

#做完这两步,可测试三个mysql是否都可以远程登录了

 

master执行(记住file和position)

mysql> show master status;
+-------------------+----------+--------------+------------------+-------------------+
| File              | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+-------------------+----------+--------------+------------------+-------------------+
| master-bin.000003 |     1455 |              |                  |                   |
+-------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

 

master-bak和slave执行

change master to master_host='192.168.132.131',master_user='repl',master_password='repl',master_log_file='mysql-master.000003',master_log_pos=1455;

#设置从服务器只读,不要在配置文件里写 set global read_only
=1

 

三、查看结果

#两个节点执行如下命令
[root@master-bak ~]# mysql -uroot -proot -e "show slave status \G"|grep "Slave_IO_Running\|Slave_SQL_Running" mysql: [Warning] Using a password on the command line interface can be insecure. Slave_IO_Running: Yes Slave_SQL_Running: Yes Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates [root@master-bak ~]#

 

四、遇到的问题

 

当遇到io_running为no的时候:

mysql> stop slave;
mysql> set GLOBAL SQL_SLAVE_SKIP_COUNTER=1;    #前提是配置文件里#gtid_mode = on注释掉
mysql> start slave;

#如果io_running还为no,可尝试从新执行 change master 。。。

##方法二 (重置slave)
mysql> stop slave;
mysql> reset slave;
mysql> start slave;

#方法三 (重设同步日志文件及位置)
mysql> show slave status \G;      #查看目前同步到mysql-bin 日志文件的位置
mysql> stop slave;
mysql> change master to master_log_file='上面查到的位置position',master_log_pos=1;
mysql> start slave;

 

当遇到 Slave_SQL_Running: No的时候:

  1. 程序可能在slave上进行了写操作
  2. 也可能是slave机器重起后,事务回滚造成的.

一般是事务回滚造成的,解决办法:

mysql> stop slave;
Query OK, 0 rows affected (0.10 sec)

mysql> set GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

#2,从新 stop slave 然后change master

 

安装mha

安装依赖包

yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker perl-CPAN

 

本例使用slave2 作为mha管理机

slave2 安装manager

tar xf mha4mysql-manager-0.57.tar.gz 
cd mha4mysql-manager-0.57
perl Makefile.PL
make && make install

 

三台mysql安装node

tar xf mha4mysql-node-0.57.tar.gz
cd mha4mysql-node-0.57
perl Makefile.PL
make && make install

安装过程中出现的问题看最下面

 

安装完成后会在/usr/local/bin目录下面生成相应的脚本

[root@slave2 bin]# ls /usr/local/bin/masterha_*
/usr/local/bin/masterha_check_repl    /usr/local/bin/masterha_conf_host       /usr/local/bin/masterha_master_switch
/usr/local/bin/masterha_check_ssh     /usr/local/bin/masterha_manager         /usr/local/bin/masterha_secondary_check
/usr/local/bin/masterha_check_status  /usr/local/bin/masterha_master_monitor  /usr/local/bin/masterha_stop
[root@slave2 bin]# 

 

复制 mha4mysql-manager-0.53/samples/scripts/下的脚本到/usr/local/bin目录(manager 服务器执行)

master_ip_failover     #自动切换时vip管理的脚本,不是必须,如果我们使用keepalived的,我们可以自己编写脚本完成对vip的管理,比如监控mysql,如果mysql异常,我们停止keepalived就行,这样vip就会自动漂移
master_ip_online_change    #在线切换时vip的管理,不是必须,同样可以可以自行编写简单的shell完成
power_manager     #故障发生后关闭主机的脚本,不是必须
send_report       #因故障切换后发送报警的脚本,不是必须,可自行编写简单的shell完成。

 

修改/usr/local/bin/master_ip_failover 脚本

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);

my $vip = '192.168.132.135/24';  #此处为你要设置的虚拟ip
my $key = '1';
my $ssh_start_vip = "/sbin/ifconfig ens33:$key $vip"; #此处改为你的网卡名称
my $ssh_stop_vip = "/sbin/ifconfig ens33:$key down";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {

    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {

        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {

        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub stop_vip() {
     return 0  unless  ($ssh_user);
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
    print
    "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

  

添加修改MHA配置文件

mkdir -p /etc/masterha
cp mha4mysql-manager-0.57/samples/conf/app1.cnf /etc/masterha/

cat /etc/masterha/app1.cnf [server default] manager_workdir=/var/log/masterha/app1.log               ##manager工作目录 manager_log=/var/log/masterha/app1/manager.log            #manager日志 master_binlog_dir=/usr/local/mysql/data/                #master保存binlog的位置,这里的路径要与master里配置的binlog的路径一致,以便mha能找到 #master_ip_failover_script= /usr/local/bin/master_ip_failover    #设置自动failover时候的切换脚本,也就是上边的哪个脚本 master_ip_online_change_script= /usr/local/bin/master_ip_online_change  #设置手动切换时候的切换脚本 password=root      #设置mysql中root用户的密码,这个密码是前文中创建监控用户的那个密码 user=root        #设置监控用户root ping_interval=1      #设置监控主库,发送ping包的时间间隔,默认是3秒,尝试三次没有回应的时候自动进行railover remote_workdir=/tmp    #设置远端mysql在发生切换时binlog的保存位置 repl_password=repl        #设置复制用户的密码 repl_user=repl           #设置复制用户的用户 report_script=/usr/local/send_report      //设置发生切换后发送的报警的脚本 secondary_check_script= /usr/local/bin/masterha_secondary_check -s 192.168.132.131 -s 192.168.132.132 shutdown_script=""  #设置故障发生后关闭故障主机脚本(该脚本的主要作用是关闭主机放在发生脑裂,这里没有使用) ssh_user=root      #设置ssh的登录用户名 [server1] hostname=192.168.132.131 port=3306 #candidate_master=1 [server2] hostname=192.168.132.132 port=3306 candidate_master=1    #//设置为候选master,如果设置该参数以后,发生主从切换以后将会将此从库提升为主库,即使这个主库不是集群中事件最新的slave check_repl_delay=0    #默认情况下如果一个slave落后master 100M的relay logs的话,MHA将不会选择该slave作为一个新的master,因为对于这个slave的恢复需要花费很长时间,通过设置check_repl_delay=0,MHA触发切换在选择一个新的master的时候将会忽略复制延时,这个参数对于设置了candidate_master=1的主机非常有用,因为这个候选主在切换的过程中一定是新的master [server3] hostname=192.168.132.133 port=3306

 

#测试ssh连接情况

[root@slave2 bin]# masterha_check_ssh -conf=/etc/masterha/app1.cnf 

 

#测试mysq集群的连接情况

masterha_check_repl -conf=/etc/masterha/app1.cnf 

 

#检查MHA的状态

[root@slave2 bin]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).        #没启动

 

master端添加vip

[root@master ~]# /sbin/ifconfig ens33:1 192.168.132.135/24

 

启动MHA

nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 & 

查看日志

表示启动成功,退出终端时会有监控进程异常退出的情况,解决办法是将命令写到脚本里,然后后台执行

 

杀掉主库mysql进程,模拟主库发生故障,进行自动failover操作。

[root@master ~]# service mysql stop
Shutting down MySQL............ SUCCESS! 

查看mha 日志(自动切换成功)

Sun Apr  8 16:39:41 2018 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
Sun Apr  8 16:39:41 2018 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/usr/local/mysql/data/ --output_file=/tmp/save_binary_logs_test --manager_version=0.53 --binlog_prefix=master-bin
Sun Apr  8 16:39:41 2018 - [info] Executing seconary network check script: /usr/local/bin/masterha_secondary_check -s 192.168.132.132 -s 192.168.132.131  --user=root  --master_host=192.168.132.131  --master_ip=192.168.132.131  --master_port=3306
Sun Apr  8 16:39:42 2018 - [info] HealthCheck: SSH to 192.168.132.131 is reachable.
Monitoring server 192.168.132.132 is reachable, Master is not reachable from 192.168.132.132. OK.
Monitoring server 192.168.132.131 is reachable, Master is not reachable from 192.168.132.131. OK.
Sun Apr  8 16:39:42 2018 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sun Apr  8 16:39:42 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.132.131' (111))
Sun Apr  8 16:39:42 2018 - [warning] Connection failed 1 time(s)..
Sun Apr  8 16:39:43 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.132.131' (111))
Sun Apr  8 16:39:43 2018 - [warning] Connection failed 2 time(s)..
Sun Apr  8 16:39:44 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.132.131' (111))
Sun Apr  8 16:39:44 2018 - [warning] Connection failed 3 time(s)..
Sun Apr  8 16:39:44 2018 - [warning] Master is not reachable from health checker!
Sun Apr  8 16:39:44 2018 - [warning] Master 192.168.132.131(192.168.132.131:3306) is not reachable!
Sun Apr  8 16:39:44 2018 - [warning] SSH is reachable.
Sun Apr  8 16:39:44 2018 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/masterha/app1.cnf again, and trying to connect to all servers to check server status..
Sun Apr  8 16:39:44 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Apr  8 16:39:44 2018 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Sun Apr  8 16:39:44 2018 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Sun Apr  8 16:39:46 2018 - [info] Dead Servers:
Sun Apr  8 16:39:46 2018 - [info]   192.168.132.131(192.168.132.131:3306)
Sun Apr  8 16:39:46 2018 - [info] Alive Servers:
Sun Apr  8 16:39:46 2018 - [info]   192.168.132.132(192.168.132.132:3306)
Sun Apr  8 16:39:46 2018 - [info]   192.168.132.133(192.168.132.133:3306)
Sun Apr  8 16:39:46 2018 - [info] Alive Slaves:
Sun Apr  8 16:39:46 2018 - [info]   192.168.132.132(192.168.132.132:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Sun Apr  8 16:39:46 2018 - [info]     Replicating from 192.168.132.131(192.168.132.131:3306)
Sun Apr  8 16:39:46 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Apr  8 16:39:46 2018 - [info]   192.168.132.133(192.168.132.133:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Sun Apr  8 16:39:46 2018 - [info]     Replicating from 192.168.132.131(192.168.132.131:3306)
Sun Apr  8 16:39:46 2018 - [info] Checking slave configurations..
Sun Apr  8 16:39:46 2018 - [info] Checking replication filtering settings..
Sun Apr  8 16:39:46 2018 - [info]  Replication filtering check ok.
Sun Apr  8 16:39:46 2018 - [info] Master is down!
Sun Apr  8 16:39:46 2018 - [info] Terminating monitoring script.
Sun Apr  8 16:39:46 2018 - [info] Got exit code 20 (Master dead).
Sun Apr  8 16:39:46 2018 - [info] MHA::MasterFailover version 0.53.
Sun Apr  8 16:39:46 2018 - [info] Starting master failover.
Sun Apr  8 16:39:46 2018 - [info] 
Sun Apr  8 16:39:46 2018 - [info] * Phase 1: Configuration Check Phase..
Sun Apr  8 16:39:46 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] Dead Servers:
Sun Apr  8 16:39:47 2018 - [info]   192.168.132.131(192.168.132.131:3306)
Sun Apr  8 16:39:47 2018 - [info] Checking master reachability via mysql(double check)..
Sun Apr  8 16:39:47 2018 - [info]  ok.
Sun Apr  8 16:39:47 2018 - [info] Alive Servers:
Sun Apr  8 16:39:47 2018 - [info]   192.168.132.132(192.168.132.132:3306)
Sun Apr  8 16:39:47 2018 - [info]   192.168.132.133(192.168.132.133:3306)
Sun Apr  8 16:39:47 2018 - [info] Alive Slaves:
Sun Apr  8 16:39:47 2018 - [info]   192.168.132.132(192.168.132.132:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Sun Apr  8 16:39:47 2018 - [info]     Replicating from 192.168.132.131(192.168.132.131:3306)
Sun Apr  8 16:39:47 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Apr  8 16:39:47 2018 - [info]   192.168.132.133(192.168.132.133:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Sun Apr  8 16:39:47 2018 - [info]     Replicating from 192.168.132.131(192.168.132.131:3306)
Sun Apr  8 16:39:47 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] * Phase 2: Dead Master Shutdown Phase..
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] Forcing shutdown so that applications never connect to the current master..
Sun Apr  8 16:39:47 2018 - [info] Executing master IP deactivatation script:
Sun Apr  8 16:39:47 2018 - [info]   /usr/local/bin/master_ip_failover --orig_master_host=192.168.132.131 --orig_master_ip=192.168.132.131 --orig_master_port=3306 --command=stopssh --ssh_user=root  


IN SCRIPT TEST====/sbin/ifconfig ens33:1 down==/sbin/ifconfig ens33:1 192.168.132.135/24===

Disabling the VIP on old master: 192.168.132.131 
Sun Apr  8 16:39:47 2018 - [info]  done.
Sun Apr  8 16:39:47 2018 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sun Apr  8 16:39:47 2018 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] * Phase 3: Master Recovery Phase..
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] The latest binary log file/position on all slaves is master-bin.000007:154
Sun Apr  8 16:39:47 2018 - [info] Latest slaves (Slaves that received relay log files to the latest):
Sun Apr  8 16:39:47 2018 - [info]   192.168.132.132(192.168.132.132:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Sun Apr  8 16:39:47 2018 - [info]     Replicating from 192.168.132.131(192.168.132.131:3306)
Sun Apr  8 16:39:47 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Apr  8 16:39:47 2018 - [info]   192.168.132.133(192.168.132.133:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Sun Apr  8 16:39:47 2018 - [info]     Replicating from 192.168.132.131(192.168.132.131:3306)
Sun Apr  8 16:39:47 2018 - [info] The oldest binary log file/position on all slaves is master-bin.000007:154
Sun Apr  8 16:39:47 2018 - [info] Oldest slaves:
Sun Apr  8 16:39:47 2018 - [info]   192.168.132.132(192.168.132.132:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Sun Apr  8 16:39:47 2018 - [info]     Replicating from 192.168.132.131(192.168.132.131:3306)
Sun Apr  8 16:39:47 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Apr  8 16:39:47 2018 - [info]   192.168.132.133(192.168.132.133:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Sun Apr  8 16:39:47 2018 - [info]     Replicating from 192.168.132.131(192.168.132.131:3306)
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] Fetching dead master's binary logs..
Sun Apr  8 16:39:47 2018 - [info] Executing command on the dead master 192.168.132.131(192.168.132.131:3306): save_binary_logs --command=save --start_file=master-bin.000007  --start_pos=154 --binlog_dir=/usr/local/mysql/data/ --output_file=/tmp/saved_master_binlog_from_192.168.132.131_3306_20180408163946.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53
  Creating /tmp if not exists..    ok.
 Concat binary/relay logs from master-bin.000007 pos 154 to master-bin.000007 EOF into /tmp/saved_master_binlog_from_192.168.132.131_3306_20180408163946.binlog ..
  Dumping binlog format description event, from position 0 to 123.. ok.
  No need to dump effective binlog data from /usr/local/mysql/data//master-bin.000007 (pos starts 154, filesize 154). Skipping.
 /tmp/saved_master_binlog_from_192.168.132.131_3306_20180408163946.binlog has no effective data events.
Event not exists.
Sun Apr  8 16:39:47 2018 - [info] Additional events were not found from the orig master. No need to save.
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] * Phase 3.3: Determining New Master Phase..
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Sun Apr  8 16:39:47 2018 - [info] All slaves received relay logs to the same position. No need to resync each other.
Sun Apr  8 16:39:47 2018 - [info] Searching new master from slaves..
Sun Apr  8 16:39:47 2018 - [info]  Candidate masters from the configuration file:
Sun Apr  8 16:39:47 2018 - [info]   192.168.132.132(192.168.132.132:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Sun Apr  8 16:39:47 2018 - [info]     Replicating from 192.168.132.131(192.168.132.131:3306)
Sun Apr  8 16:39:47 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Apr  8 16:39:47 2018 - [info]  Non-candidate masters:
Sun Apr  8 16:39:47 2018 - [info]  Searching from candidate_master slaves which have received the latest relay log events..
Sun Apr  8 16:39:47 2018 - [info] New master is 192.168.132.132(192.168.132.132:3306)
Sun Apr  8 16:39:47 2018 - [info] Starting master failover..
Sun Apr  8 16:39:47 2018 - [info] 
From:
192.168.132.131 (current master)
 +--192.168.132.132
 +--192.168.132.133

To:
192.168.132.132 (new master)
 +--192.168.132.133
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] * Phase 3.4: Master Log Apply Phase..
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Sun Apr  8 16:39:47 2018 - [info] Starting recovery on 192.168.132.132(192.168.132.132:3306)..
Sun Apr  8 16:39:47 2018 - [info]  This server has all relay logs. Waiting all logs to be applied.. 
Sun Apr  8 16:39:47 2018 - [info]   done.
Sun Apr  8 16:39:47 2018 - [info]  All relay logs were successfully applied.
Sun Apr  8 16:39:47 2018 - [info] Getting new master's binlog name and position..
Sun Apr  8 16:39:47 2018 - [info]  slave-bin.000005:2183
Sun Apr  8 16:39:47 2018 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.132.132', MASTER_PORT=3306, MASTER_LOG_FILE='slave-bin.000005', MASTER_LOG_POS=2183, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Sun Apr  8 16:39:47 2018 - [info] Executing master IP activate script:
Sun Apr  8 16:39:47 2018 - [info]   /usr/local/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.132.131 --orig_master_ip=192.168.132.131 --orig_master_port=3306 --new_master_host=192.168.132.132 --new_master_ip=192.168.132.132 --new_master_port=3306  


IN SCRIPT TEST====/sbin/ifconfig ens33:1 down==/sbin/ifconfig ens33:1 192.168.132.135/24===

Enabling the VIP - 192.168.132.135/24 on the new master - 192.168.132.132 
Sun Apr  8 16:39:47 2018 - [info]  OK.
Sun Apr  8 16:39:47 2018 - [info] Setting read_only=0 on 192.168.132.132(192.168.132.132:3306)..
Sun Apr  8 16:39:47 2018 - [info]  ok.
Sun Apr  8 16:39:47 2018 - [info] ** Finished master recovery successfully.
Sun Apr  8 16:39:47 2018 - [info] * Phase 3: Master Recovery Phase completed.
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] * Phase 4: Slaves Recovery Phase..
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Sun Apr  8 16:39:47 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info] -- Slave diff file generation on host 192.168.132.133(192.168.132.133:3306) started, pid: 3586. Check tmp log /var/log/masterha/app1.log/192.168.132.133_3306_20180408163946.log if it takes time..
Sun Apr  8 16:39:48 2018 - [info] 
Sun Apr  8 16:39:48 2018 - [info] Log messages from 192.168.132.133 ...
Sun Apr  8 16:39:48 2018 - [info] 
Sun Apr  8 16:39:47 2018 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Sun Apr  8 16:39:48 2018 - [info] End of log messages from 192.168.132.133.
Sun Apr  8 16:39:48 2018 - [info] -- 192.168.132.133(192.168.132.133:3306) has the latest relay log events.
Sun Apr  8 16:39:48 2018 - [info] Generating relay diff files from the latest slave succeeded.
Sun Apr  8 16:39:48 2018 - [info] 
Sun Apr  8 16:39:48 2018 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Sun Apr  8 16:39:48 2018 - [info] 
Sun Apr  8 16:39:48 2018 - [info] -- Slave recovery on host 192.168.132.133(192.168.132.133:3306) started, pid: 3588. Check tmp log /var/log/masterha/app1.log/192.168.132.133_3306_20180408163946.log if it takes time..
Sun Apr  8 16:39:49 2018 - [info] 
Sun Apr  8 16:39:49 2018 - [info] Log messages from 192.168.132.133 ...
Sun Apr  8 16:39:49 2018 - [info] 
Sun Apr  8 16:39:48 2018 - [info] Starting recovery on 192.168.132.133(192.168.132.133:3306)..
Sun Apr  8 16:39:48 2018 - [info]  This server has all relay logs. Waiting all logs to be applied.. 
Sun Apr  8 16:39:48 2018 - [info]   done.
Sun Apr  8 16:39:48 2018 - [info]  All relay logs were successfully applied.
Sun Apr  8 16:39:48 2018 - [info]  Resetting slave 192.168.132.133(192.168.132.133:3306) and starting replication from the new master 192.168.132.132(192.168.132.132:3306)..
Sun Apr  8 16:39:49 2018 - [info]  Executed CHANGE MASTER.
Sun Apr  8 16:39:49 2018 - [info]  Slave started.
Sun Apr  8 16:39:49 2018 - [info] End of log messages from 192.168.132.133.
Sun Apr  8 16:39:49 2018 - [info] -- Slave recovery on host 192.168.132.133(192.168.132.133:3306) succeeded.
Sun Apr  8 16:39:49 2018 - [info] All new slave servers recovered successfully.
Sun Apr  8 16:39:49 2018 - [info] 
Sun Apr  8 16:39:49 2018 - [info] * Phase 5: New master cleanup phease..
Sun Apr  8 16:39:49 2018 - [info] 
Sun Apr  8 16:39:49 2018 - [info] Resetting slave info on the new master..
Sun Apr  8 16:39:49 2018 - [info]  192.168.132.132: Resetting slave info succeeded.
Sun Apr  8 16:39:49 2018 - [info] Master failover to 192.168.132.132(192.168.132.132:3306) completed successfully.
Sun Apr  8 16:39:49 2018 - [info] Deleted server1 entry from /etc/masterha/app1.cnf .
Sun Apr  8 16:39:49 2018 - [info] 

----- Failover Report -----

app1: MySQL Master failover 192.168.132.131 to 192.168.132.132 succeeded

Master 192.168.132.131 is down!

Check MHA Manager logs at slave2:/var/log/masterha/app1/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 192.168.132.131.
The latest slave 192.168.132.132(192.168.132.132:3306) has all relay logs for recovery.
Selected 192.168.132.132 as a new master.
192.168.132.132: OK: Applying all logs succeeded.
192.168.132.132: OK: Activated master IP address.
192.168.132.133: This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.132.133: OK: Applying all logs succeeded. Slave started, replicating from 192.168.132.132.
192.168.132.132: Resetting slave info succeeded.
Master failover to 192.168.132.132(192.168.132.132:3306) completed successfully.
View Code

 

查看切换后的master信息

[root@master-bak ~]# ip addr 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:c2:7f:10 brd ff:ff:ff:ff:ff:ff
    inet 192.168.132.132/24 brd 192.168.132.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet 192.168.132.135/24 brd 192.168.132.255 scope global secondary ens33:1
       valid_lft forever preferred_lft forever
    inet6 fe80::c8dc:1ccc:c793:c91e/64 scope link 
       valid_lft forever preferred_lft forever

 

mysql> show slave status \G;
Empty set (0.00 sec)

ERROR: 
No query specified

mysql> show master status;
+------------------+----------+--------------+------------------+-------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| slave-bin.000005 |     2183 |              |                  |                   |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

mysql> 

 

停止MHA

masterha_stop -conf=/etc/masterha/app1.cnf

 

手动切换集群

手动切换集群,这种场景意味着在业务上没有启用MHA自动切换功能,当主服务器故障时,人工手动调用MHA来进行故障切换操作,具体命令如下:

#这里模拟master-bak 主机下的mysql关闭,并切换到master下的mysql前提是master下的mysql在集群中,并且mha manager处于关闭状态
[root@slave2 ~]# masterha_stop -conf=/etc/masterha/app1.cnf
[root@slave2 ~]# masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=192.168.132.132 --dead_master_port=3306 --new_master_host=192.168.132.131 --new_master_port=3306 --ignore_last_failover

 

切换日志

[root@slave2 ~]# masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=192.168.132.132 --dead_master_port=3306 --new_master_host=192.168.132.131 --new_master_port=3306 --ignore_last_failover
--dead_master_ip=<dead_master_ip> is not set. Using 192.168.132.132.
Mon Apr  9 03:13:58 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Mon Apr  9 03:13:58 2018 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Mon Apr  9 03:13:58 2018 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Mon Apr  9 03:13:58 2018 - [info] MHA::MasterFailover version 0.53.
Mon Apr  9 03:13:58 2018 - [info] Starting master failover.
Mon Apr  9 03:13:58 2018 - [info] 
Mon Apr  9 03:13:58 2018 - [info] * Phase 1: Configuration Check Phase..
Mon Apr  9 03:13:58 2018 - [info] 
Mon Apr  9 03:13:59 2018 - [info] Dead Servers:
Mon Apr  9 03:13:59 2018 - [info]   192.168.132.132(192.168.132.132:3306)
Mon Apr  9 03:13:59 2018 - [info] Checking master reachability via mysql(double check)..
Mon Apr  9 03:13:59 2018 - [info]  ok.
Mon Apr  9 03:13:59 2018 - [info] Alive Servers:
Mon Apr  9 03:13:59 2018 - [info]   192.168.132.131(192.168.132.131:3306)
Mon Apr  9 03:13:59 2018 - [info]   192.168.132.133(192.168.132.133:3306)
Mon Apr  9 03:13:59 2018 - [info] Alive Slaves:
Mon Apr  9 03:13:59 2018 - [info]   192.168.132.131(192.168.132.131:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Mon Apr  9 03:13:59 2018 - [info]     Replicating from 192.168.132.132(192.168.132.132:3306)
Mon Apr  9 03:13:59 2018 - [info]   192.168.132.133(192.168.132.133:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Mon Apr  9 03:13:59 2018 - [info]     Replicating from 192.168.132.132(192.168.132.132:3306)
Master 192.168.132.132 is dead. Proceed? (yes/NO): yes
Mon Apr  9 03:14:02 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Mon Apr  9 03:14:02 2018 - [info] 
Mon Apr  9 03:14:02 2018 - [info] * Phase 2: Dead Master Shutdown Phase..
Mon Apr  9 03:14:02 2018 - [info] 
Mon Apr  9 03:14:02 2018 - [info] HealthCheck: SSH to 192.168.132.132 is reachable.
Mon Apr  9 03:14:02 2018 - [info] Forcing shutdown so that applications never connect to the current master..
Mon Apr  9 03:14:02 2018 - [info] Executing master IP deactivatation script:
Mon Apr  9 03:14:02 2018 - [info]   /usr/local/bin/master_ip_failover --orig_master_host=192.168.132.132 --orig_master_ip=192.168.132.132 --orig_master_port=3306 --command=stopssh --ssh_user=root  


IN SCRIPT TEST====/sbin/ifconfig ens33:1 down==/sbin/ifconfig ens33:1 192.168.132.135/24===

Disabling the VIP on old master: 192.168.132.132 
Mon Apr  9 03:14:02 2018 - [info]  done.
Mon Apr  9 03:14:02 2018 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Mon Apr  9 03:14:02 2018 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Mon Apr  9 03:14:02 2018 - [info] 
Mon Apr  9 03:14:02 2018 - [info] * Phase 3: Master Recovery Phase..
Mon Apr  9 03:14:02 2018 - [info] 
Mon Apr  9 03:14:02 2018 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Mon Apr  9 03:14:02 2018 - [info] 
Mon Apr  9 03:14:02 2018 - [info] The latest binary log file/position on all slaves is slave-bin.000004:154
Mon Apr  9 03:14:02 2018 - [info] Latest slaves (Slaves that received relay log files to the latest):
Mon Apr  9 03:14:02 2018 - [info]   192.168.132.131(192.168.132.131:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Mon Apr  9 03:14:02 2018 - [info]     Replicating from 192.168.132.132(192.168.132.132:3306)
Mon Apr  9 03:14:02 2018 - [info]   192.168.132.133(192.168.132.133:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Mon Apr  9 03:14:02 2018 - [info]     Replicating from 192.168.132.132(192.168.132.132:3306)
Mon Apr  9 03:14:02 2018 - [info] The oldest binary log file/position on all slaves is slave-bin.000004:154
Mon Apr  9 03:14:02 2018 - [info] Oldest slaves:
Mon Apr  9 03:14:02 2018 - [info]   192.168.132.131(192.168.132.131:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Mon Apr  9 03:14:02 2018 - [info]     Replicating from 192.168.132.132(192.168.132.132:3306)
Mon Apr  9 03:14:02 2018 - [info]   192.168.132.133(192.168.132.133:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Mon Apr  9 03:14:02 2018 - [info]     Replicating from 192.168.132.132(192.168.132.132:3306)
Mon Apr  9 03:14:02 2018 - [info] 
Mon Apr  9 03:14:02 2018 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Mon Apr  9 03:14:02 2018 - [info] 
Mon Apr  9 03:14:02 2018 - [info] Fetching dead master's binary logs..
Mon Apr  9 03:14:02 2018 - [info] Executing command on the dead master 192.168.132.132(192.168.132.132:3306): save_binary_logs --command=save --start_file=slave-bin.000004  --start_pos=154 --binlog_dir=/usr/local/mysql/data/ --output_file=/tmp/saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53
  Creating /tmp if not exists..    ok.
 Concat binary/relay logs from slave-bin.000004 pos 154 to slave-bin.000004 EOF into /tmp/saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog ..
  Dumping binlog format description event, from position 0 to 123.. ok.
  Dumping effective binlog data from /usr/local/mysql/data//slave-bin.000004 position 154 to tail(177).. ok.
 Concat succeeded.
saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog                                                                  100%  146     0.1KB/s   00:00    
Mon Apr  9 03:14:03 2018 - [info] scp from root@192.168.132.132:/tmp/saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog to local:/var/log/masterha/app1.log/saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog succeeded.
Mon Apr  9 03:14:03 2018 - [info] HealthCheck: SSH to 192.168.132.131 is reachable.
Mon Apr  9 03:14:03 2018 - [info] HealthCheck: SSH to 192.168.132.133 is reachable.
Mon Apr  9 03:14:04 2018 - [info] 
Mon Apr  9 03:14:04 2018 - [info] * Phase 3.3: Determining New Master Phase..
Mon Apr  9 03:14:04 2018 - [info] 
Mon Apr  9 03:14:04 2018 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Mon Apr  9 03:14:04 2018 - [info] All slaves received relay logs to the same position. No need to resync each other.
Mon Apr  9 03:14:04 2018 - [info] 192.168.132.131 can be new master.
Mon Apr  9 03:14:04 2018 - [info] New master is 192.168.132.131(192.168.132.131:3306)
Mon Apr  9 03:14:04 2018 - [info] Starting master failover..
Mon Apr  9 03:14:04 2018 - [info] 
From:
192.168.132.132 (current master)
 +--192.168.132.131
 +--192.168.132.133

To:
192.168.132.131 (new master)
 +--192.168.132.133

Starting master switch from 192.168.132.132(192.168.132.132:3306) to 192.168.132.131(192.168.132.131:3306)? (yes/NO): yes
Mon Apr  9 03:14:06 2018 - [info] New master decided manually is 192.168.132.131(192.168.132.131:3306)
Mon Apr  9 03:14:06 2018 - [info] 
Mon Apr  9 03:14:06 2018 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Mon Apr  9 03:14:06 2018 - [info] 
Mon Apr  9 03:14:06 2018 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Mon Apr  9 03:14:06 2018 - [info] Sending binlog..
saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog                                                                  100%  146     0.1KB/s   00:00    
Mon Apr  9 03:14:07 2018 - [info] scp from local:/var/log/masterha/app1.log/saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog to root@192.168.132.131:/tmp/saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog succeeded.
Mon Apr  9 03:14:07 2018 - [info] 
Mon Apr  9 03:14:07 2018 - [info] * Phase 3.4: Master Log Apply Phase..
Mon Apr  9 03:14:07 2018 - [info] 
Mon Apr  9 03:14:07 2018 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Mon Apr  9 03:14:07 2018 - [info] Starting recovery on 192.168.132.131(192.168.132.131:3306)..
Mon Apr  9 03:14:07 2018 - [info]  Generating diffs succeeded.
Mon Apr  9 03:14:07 2018 - [info] Waiting until all relay logs are applied.
Mon Apr  9 03:14:07 2018 - [info]  done.
Mon Apr  9 03:14:07 2018 - [info] Getting slave status..
Mon Apr  9 03:14:07 2018 - [info] This slave(192.168.132.131)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(slave-bin.000004:154). No need to recover from Exec_Master_Log_Pos.
Mon Apr  9 03:14:07 2018 - [info] Connecting to the target slave host 192.168.132.131, running recover script..
Mon Apr  9 03:14:07 2018 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user=root --slave_host=192.168.132.131 --slave_ip=192.168.132.131  --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog --workdir=/tmp --target_version=5.7.17-log --timestamp=20180409031358 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --slave_pass=xxx
Mon Apr  9 03:14:07 2018 - [info] 
MySQL client version is 5.7.17. Using --binary-mode.
Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog on 192.168.132.131:3306. This may take long time...
Applying log files succeeded.
Mon Apr  9 03:14:07 2018 - [info]  All relay logs were successfully applied.
Mon Apr  9 03:14:07 2018 - [info] Getting new master's binlog name and position..
Mon Apr  9 03:14:07 2018 - [info]  master-bin.000003:209635979
Mon Apr  9 03:14:07 2018 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.132.131', MASTER_PORT=3306, MASTER_LOG_FILE='master-bin.000003', MASTER_LOG_POS=209635979, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Apr  9 03:14:07 2018 - [info] Executing master IP activate script:
Mon Apr  9 03:14:07 2018 - [info]   /usr/local/bin/master_ip_failover --command=start --ssh_user=root --orig_master_host=192.168.132.132 --orig_master_ip=192.168.132.132 --orig_master_port=3306 --new_master_host=192.168.132.131 --new_master_ip=192.168.132.131 --new_master_port=3306  


IN SCRIPT TEST====/sbin/ifconfig ens33:1 down==/sbin/ifconfig ens33:1 192.168.132.135/24===

Enabling the VIP - 192.168.132.135/24 on the new master - 192.168.132.131 
Mon Apr  9 03:14:07 2018 - [info]  OK.
Mon Apr  9 03:14:07 2018 - [info] ** Finished master recovery successfully.
Mon Apr  9 03:14:07 2018 - [info] * Phase 3: Master Recovery Phase completed.
Mon Apr  9 03:14:07 2018 - [info] 
Mon Apr  9 03:14:07 2018 - [info] * Phase 4: Slaves Recovery Phase..
Mon Apr  9 03:14:07 2018 - [info] 
Mon Apr  9 03:14:07 2018 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Mon Apr  9 03:14:07 2018 - [info] 
Mon Apr  9 03:14:07 2018 - [info] -- Slave diff file generation on host 192.168.132.133(192.168.132.133:3306) started, pid: 27958. Check tmp log /var/log/masterha/app1.log/192.168.132.133_3306_20180409031358.log if it takes time..
Mon Apr  9 03:14:08 2018 - [info] 
Mon Apr  9 03:14:08 2018 - [info] Log messages from 192.168.132.133 ...
Mon Apr  9 03:14:08 2018 - [info] 
Mon Apr  9 03:14:07 2018 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Mon Apr  9 03:14:08 2018 - [info] End of log messages from 192.168.132.133.
Mon Apr  9 03:14:08 2018 - [info] -- 192.168.132.133(192.168.132.133:3306) has the latest relay log events.
Mon Apr  9 03:14:08 2018 - [info] Generating relay diff files from the latest slave succeeded.
Mon Apr  9 03:14:08 2018 - [info] 
Mon Apr  9 03:14:08 2018 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Mon Apr  9 03:14:08 2018 - [info] 
Mon Apr  9 03:14:08 2018 - [info] -- Slave recovery on host 192.168.132.133(192.168.132.133:3306) started, pid: 27960. Check tmp log /var/log/masterha/app1.log/192.168.132.133_3306_20180409031358.log if it takes time..
saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog                                                                  100%  146     0.1KB/s   00:00    
Mon Apr  9 03:14:09 2018 - [info] 
Mon Apr  9 03:14:09 2018 - [info] Log messages from 192.168.132.133 ...
Mon Apr  9 03:14:09 2018 - [info] 
Mon Apr  9 03:14:08 2018 - [info] Sending binlog..
Mon Apr  9 03:14:08 2018 - [info] scp from local:/var/log/masterha/app1.log/saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog to root@192.168.132.133:/tmp/saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog succeeded.
Mon Apr  9 03:14:08 2018 - [info] Starting recovery on 192.168.132.133(192.168.132.133:3306)..
Mon Apr  9 03:14:08 2018 - [info]  Generating diffs succeeded.
Mon Apr  9 03:14:08 2018 - [info] Waiting until all relay logs are applied.
Mon Apr  9 03:14:08 2018 - [info]  done.
Mon Apr  9 03:14:08 2018 - [info] Getting slave status..
Mon Apr  9 03:14:08 2018 - [info] This slave(192.168.132.133)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(slave-bin.000004:154). No need to recover from Exec_Master_Log_Pos.
Mon Apr  9 03:14:08 2018 - [info] Connecting to the target slave host 192.168.132.133, running recover script..
Mon Apr  9 03:14:08 2018 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user=root --slave_host=192.168.132.133 --slave_ip=192.168.132.133  --slave_port=3306 --apply_files=/tmp/saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog --workdir=/tmp --target_version=5.7.17-log --timestamp=20180409031358 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.53 --slave_pass=xxx
Mon Apr  9 03:14:09 2018 - [info] 
MySQL client version is 5.7.17. Using --binary-mode.
Applying differential binary/relay log files /tmp/saved_master_binlog_from_192.168.132.132_3306_20180409031358.binlog on 192.168.132.133:3306. This may take long time...
Applying log files succeeded.
Mon Apr  9 03:14:09 2018 - [info]  All relay logs were successfully applied.
Mon Apr  9 03:14:09 2018 - [info]  Resetting slave 192.168.132.133(192.168.132.133:3306) and starting replication from the new master 192.168.132.131(192.168.132.131:3306)..
Mon Apr  9 03:14:09 2018 - [info]  Executed CHANGE MASTER.
Mon Apr  9 03:14:09 2018 - [info]  Slave started.
Mon Apr  9 03:14:09 2018 - [info] End of log messages from 192.168.132.133.
Mon Apr  9 03:14:09 2018 - [info] -- Slave recovery on host 192.168.132.133(192.168.132.133:3306) succeeded.
Mon Apr  9 03:14:09 2018 - [info] All new slave servers recovered successfully.
Mon Apr  9 03:14:09 2018 - [info] 
Mon Apr  9 03:14:09 2018 - [info] * Phase 5: New master cleanup phease..
Mon Apr  9 03:14:09 2018 - [info] 
Mon Apr  9 03:14:09 2018 - [info] Resetting slave info on the new master..
Mon Apr  9 03:14:09 2018 - [info]  192.168.132.131: Resetting slave info succeeded.
Mon Apr  9 03:14:09 2018 - [info] Master failover to 192.168.132.131(192.168.132.131:3306) completed successfully.
Mon Apr  9 03:14:09 2018 - [info] 

----- Failover Report -----

app1: MySQL Master failover 192.168.132.132 to 192.168.132.131 succeeded

Master 192.168.132.132 is down!

Check MHA Manager logs at slave2 for details.

Started manual(interactive) failover.
Invalidated master IP address on 192.168.132.132.
The latest slave 192.168.132.131(192.168.132.131:3306) has all relay logs for recovery.
Selected 192.168.132.131 as a new master.
192.168.132.131: OK: Applying all logs succeeded.
192.168.132.131: OK: Activated master IP address.
192.168.132.133: This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.132.133: OK: Applying all logs succeeded. Slave started, replicating from 192.168.132.131.
192.168.132.131: Resetting slave info succeeded.
Master failover to 192.168.132.131(192.168.132.131:3306) completed successfully.
View Code

 

在线切换集群

MHA在线切换的大概过程:

  • 1.检测复制设置和确定当前主服务器
  • 2.确定新的主服务器
  • 3.阻塞写入到当前主服务器
  • 4.等待所有从服务器赶上复制
  • 5.授予写入到新的主服务器
  • 6.重新设置从服务器 

为了保证数据完全一致性,在最快的时间内完成切换,MHA的在线切换必须满足以下条件才会切换成功,否则会切换失败

1.所有slave的IO线程都在运行

2.所有slave的SQL线程都在运行

3.所有的show slave status的输出中Seconds_Behind_Master参数小于或者等于running_updates_limit秒,如果在切换过程中不指定running_updates_limit,那么默认情况下running_updates_limit为1秒。

4.在master端,通过show processlist输出,没有一个更新花费的时间大于running_updates_limit秒。

 

切换步骤

#首先,停掉MHA监控

[root@slave2 ~]# masterha_stop --conf=/etc/masterha/app1.cnf

#其次,进行在线切换操作(模拟在线切换主库操作,原主库192.168132.131变为slave,192.168.132.132提升为新的主库)

[root@slave2 ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=192.168.132.132 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000

 

切换日志

[root@slave2 ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=192.168.132.132 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
Mon Apr  9 05:02:17 2018 - [info] MHA::MasterRotate version 0.53.
Mon Apr  9 05:02:17 2018 - [info] Starting online master switch..
Mon Apr  9 05:02:17 2018 - [info] 
Mon Apr  9 05:02:17 2018 - [info] * Phase 1: Configuration Check Phase..
Mon Apr  9 05:02:17 2018 - [info] 
Mon Apr  9 05:02:17 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Mon Apr  9 05:02:17 2018 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Mon Apr  9 05:02:17 2018 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Mon Apr  9 05:02:18 2018 - [info] Current Alive Master: 192.168.132.131(192.168.132.131:3306)
Mon Apr  9 05:02:18 2018 - [info] Alive Slaves:
Mon Apr  9 05:02:18 2018 - [info]   192.168.132.132(192.168.132.132:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Mon Apr  9 05:02:18 2018 - [info]     Replicating from 192.168.132.131(192.168.132.131:3306)
Mon Apr  9 05:02:18 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Mon Apr  9 05:02:18 2018 - [info]   192.168.132.133(192.168.132.133:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Mon Apr  9 05:02:18 2018 - [info]     Replicating from 192.168.132.131(192.168.132.131:3306)

It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.132.131(192.168.132.131:3306)? (YES/no): yes
Mon Apr  9 05:02:21 2018 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Mon Apr  9 05:02:21 2018 - [info]  ok.
Mon Apr  9 05:02:21 2018 - [info] Checking MHA is not monitoring or doing failover..
Mon Apr  9 05:02:21 2018 - [info] Checking replication health on 192.168.132.132..
Mon Apr  9 05:02:21 2018 - [info]  ok.
Mon Apr  9 05:02:21 2018 - [info] Checking replication health on 192.168.132.133..
Mon Apr  9 05:02:21 2018 - [info]  ok.
Mon Apr  9 05:02:21 2018 - [info] 192.168.132.132 can be new master.
Mon Apr  9 05:02:21 2018 - [info] 
From:
192.168.132.131 (current master)
 +--192.168.132.132
 +--192.168.132.133

To:
192.168.132.132 (new master)
 +--192.168.132.133
 +--192.168.132.131

Starting master switch from 192.168.132.131(192.168.132.131:3306) to 192.168.132.132(192.168.132.132:3306)? (yes/NO): yes
Mon Apr  9 05:02:23 2018 - [info] Checking whether 192.168.132.132(192.168.132.132:3306) is ok for the new master..
Mon Apr  9 05:02:23 2018 - [info]  ok.
Mon Apr  9 05:02:23 2018 - [info] 192.168.132.131(192.168.132.131:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
Mon Apr  9 05:02:23 2018 - [info] 192.168.132.131(192.168.132.131:3306): Resetting slave pointing to the dummy host.
Mon Apr  9 05:02:23 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Mon Apr  9 05:02:23 2018 - [info] 
Mon Apr  9 05:02:23 2018 - [info] * Phase 2: Rejecting updates Phase..
Mon Apr  9 05:02:23 2018 - [info] 
Mon Apr  9 05:02:23 2018 - [info] Executing master ip online change script to disable write on the current master:
Mon Apr  9 05:02:23 2018 - [info]   /usr/local/bin/master_ip_online_change --command=stop --orig_master_host=192.168.132.131 --orig_master_ip=192.168.132.131 --orig_master_port=3306 --new_master_host=192.168.132.132 --new_master_ip=192.168.132.132 --new_master_port=3306  
Mon Apr  9 05:02:23 2018 703610 Set read_only on the new master.. ok.
Mon Apr  9 05:02:23 2018 706671 Set read_only=1 on the orig master.. ok.
Disabling the VIP on old master: 192.168.132.131 
Mon Apr  9 05:02:23 2018 885840 Killing all application threads..
Mon Apr  9 05:02:23 2018 885904 done.
Mon Apr  9 05:02:23 2018 - [info]  ok.
Mon Apr  9 05:02:23 2018 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Mon Apr  9 05:02:23 2018 - [info] Executing FLUSH TABLES WITH READ LOCK..
Mon Apr  9 05:02:23 2018 - [info]  ok.
Mon Apr  9 05:02:23 2018 - [info] Orig master binlog:pos is master-bin.000003:209636468.
Mon Apr  9 05:02:23 2018 - [info]  Waiting to execute all relay logs on 192.168.132.132(192.168.132.132:3306)..
Mon Apr  9 05:02:23 2018 - [info]  master_pos_wait(master-bin.000003:209636468) completed on 192.168.132.132(192.168.132.132:3306). Executed 0 events.
Mon Apr  9 05:02:23 2018 - [info]   done.
Mon Apr  9 05:02:23 2018 - [info] Getting new master's binlog name and position..
Mon Apr  9 05:02:23 2018 - [info]  slave-bin.000005:197744
Mon Apr  9 05:02:23 2018 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.132.132', MASTER_PORT=3306, MASTER_LOG_FILE='slave-bin.000005', MASTER_LOG_POS=197744, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Apr  9 05:02:23 2018 - [info] Executing master ip online change script to allow write on the new master:
Mon Apr  9 05:02:23 2018 - [info]   /usr/local/bin/master_ip_online_change --command=start --orig_master_host=192.168.132.131 --orig_master_ip=192.168.132.131 --orig_master_port=3306 --new_master_host=192.168.132.132 --new_master_ip=192.168.132.132 --new_master_port=3306  
Mon Apr  9 05:02:23 2018 975249 Set read_only=0 on the new master.
Enabling the VIP - 192.168.132.135/24 on the new master - 192.168.132.132 
Mon Apr  9 05:02:24 2018 - [info]  ok.
Mon Apr  9 05:02:24 2018 - [info] 
Mon Apr  9 05:02:24 2018 - [info] * Switching slaves in parallel..
Mon Apr  9 05:02:24 2018 - [info] 
Mon Apr  9 05:02:24 2018 - [info] -- Slave switch on host 192.168.132.133(192.168.132.133:3306) started, pid: 28580
Mon Apr  9 05:02:24 2018 - [info] 
Mon Apr  9 05:02:25 2018 - [info] Log messages from 192.168.132.133 ...
Mon Apr  9 05:02:25 2018 - [info] 
Mon Apr  9 05:02:24 2018 - [info]  Waiting to execute all relay logs on 192.168.132.133(192.168.132.133:3306)..
Mon Apr  9 05:02:24 2018 - [info]  master_pos_wait(master-bin.000003:209636468) completed on 192.168.132.133(192.168.132.133:3306). Executed 0 events.
Mon Apr  9 05:02:24 2018 - [info]   done.
Mon Apr  9 05:02:24 2018 - [info]  Resetting slave 192.168.132.133(192.168.132.133:3306) and starting replication from the new master 192.168.132.132(192.168.132.132:3306)..
Mon Apr  9 05:02:24 2018 - [info]  Executed CHANGE MASTER.
Mon Apr  9 05:02:24 2018 - [info]  Slave started.
Mon Apr  9 05:02:25 2018 - [info] End of log messages from 192.168.132.133 ...
Mon Apr  9 05:02:25 2018 - [info] 
Mon Apr  9 05:02:25 2018 - [info] -- Slave switch on host 192.168.132.133(192.168.132.133:3306) succeeded.
Mon Apr  9 05:02:25 2018 - [info] Unlocking all tables on the orig master:
Mon Apr  9 05:02:25 2018 - [info] Executing UNLOCK TABLES..
Mon Apr  9 05:02:25 2018 - [info]  ok.
Mon Apr  9 05:02:25 2018 - [info] Starting orig master as a new slave..
Mon Apr  9 05:02:25 2018 - [info]  Resetting slave 192.168.132.131(192.168.132.131:3306) and starting replication from the new master 192.168.132.132(192.168.132.132:3306)..
Mon Apr  9 05:02:25 2018 - [info]  Executed CHANGE MASTER.
Mon Apr  9 05:02:25 2018 - [info]  Slave started.
Mon Apr  9 05:02:25 2018 - [info] All new slave servers switched successfully.
Mon Apr  9 05:02:25 2018 - [info] 
Mon Apr  9 05:02:25 2018 - [info] * Phase 5: New master cleanup phease..
Mon Apr  9 05:02:25 2018 - [info] 
Mon Apr  9 05:02:25 2018 - [info]  192.168.132.132: Resetting slave info succeeded.
Mon Apr  9 05:02:25 2018 - [info] Switching master to 192.168.132.132(192.168.132.132:3306) completed successfully.
View Code

--orig_master_is_new_slave 切换时加上此参数是将原 master 变为 slave 节点,如果不加此参数,原来的 master 将不启动

--running_updates_limit=10000,故障切换时,候选master 如果有延迟的话, mha 切换不能成功,加上此参数表示延迟在此时间范围内都可切换(单位为s),但是切换的时间长短是由recover 时relay 日志的大小决定 

 

注意:用mha自带的脚本(master_ip_online_change)在线进行切换时,总是提示

Mon Apr  9 04:47:43 2018 - [info]   /usr/local/bin/master_ip_online_change --command=stop --orig_master_host=192.168.132.131 --orig_master_ip=192.168.132.131 --orig_master_port=3306 --new_master_host=192.168.132.132 --new_master_ip=192.168.132.132 --new_master_port=3306  
Got Error: DBI connect(';host=192.168.132.132;port=3306;mysql_connect_timeout=4','root',...) failed: Access denied for user 'root'@'slave2' (using password: YES) at /usr/local/share/perl5/MHA/DBHelper.pm line 181.
 at /usr/local/bin/master_ip_online_change line 122.

Mon Apr  9 04:47:43 2018 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln178] Got ERROR:  at /usr/local/bin/masterha_master_switch line 53.

导致切换失败,后来网上找了一个修改后的版本如下

[root@slave2 ~]# cat /usr/local/bin/master_ip_online_change 
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';

use Getopt::Long;
use MHA::DBHelper;
use MHA::NodeUtil;
use Time::HiRes qw( sleep gettimeofday tv_interval );
use Data::Dumper;

my $_tstart;
my $_running_interval = 0.1;
my (
  $command,          $orig_master_host, $orig_master_ip,
  $orig_master_port, $orig_master_user, 
  $new_master_host,  $new_master_ip,    $new_master_port,
  $new_master_user,  
);


my $vip = '192.168.132.135/24';  # Virtual IP 
my $key = "1"; 
my $ssh_start_vip = "/sbin/ifconfig ens33:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig ens33:$key down";
my $ssh_user = "root";
my $new_master_password='root';
my $orig_master_password='root';
GetOptions(
  'command=s'              => \$command,
  #'ssh_user=s'             => \$ssh_user,  
  'orig_master_host=s'     => \$orig_master_host,
  'orig_master_ip=s'       => \$orig_master_ip,
  'orig_master_port=i'     => \$orig_master_port,
  'orig_master_user=s'     => \$orig_master_user,
  #'orig_master_password=s' => \$orig_master_password,
  'new_master_host=s'      => \$new_master_host,
  'new_master_ip=s'        => \$new_master_ip,
  'new_master_port=i'      => \$new_master_port,
  'new_master_user=s'      => \$new_master_user,
  #'new_master_password=s'  => \$new_master_password,
);

exit &main();

sub current_time_us {
  my ( $sec, $microsec ) = gettimeofday();
  my $curdate = localtime($sec);
  return $curdate . " " . sprintf( "%06d", $microsec );
}

sub sleep_until {
  my $elapsed = tv_interval($_tstart);
  if ( $_running_interval > $elapsed ) {
    sleep( $_running_interval - $elapsed );
  }
}

sub get_threads_util {
  my $dbh                    = shift;
  my $my_connection_id       = shift;
  my $running_time_threshold = shift;
  my $type                   = shift;
  $running_time_threshold = 0 unless ($running_time_threshold);
  $type                   = 0 unless ($type);
  my @threads;

  my $sth = $dbh->prepare("SHOW PROCESSLIST");
  $sth->execute();

  while ( my $ref = $sth->fetchrow_hashref() ) {
    my $id         = $ref->{Id};
    my $user       = $ref->{User};
    my $host       = $ref->{Host};
    my $command    = $ref->{Command};
    my $state      = $ref->{State};
    my $query_time = $ref->{Time};
    my $info       = $ref->{Info};
    $info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);
    next if ( $my_connection_id == $id );
    next if ( defined($query_time) && $query_time < $running_time_threshold );
    next if ( defined($command)    && $command eq "Binlog Dump" );
    next if ( defined($user)       && $user eq "system user" );
    next
      if ( defined($command)
      && $command eq "Sleep"
      && defined($query_time)
      && $query_time >= 1 );

    if ( $type >= 1 ) {
      next if ( defined($command) && $command eq "Sleep" );
      next if ( defined($command) && $command eq "Connect" );
    }

    if ( $type >= 2 ) {
      next if ( defined($info) && $info =~ m/^select/i );
      next if ( defined($info) && $info =~ m/^show/i );
    }

    push @threads, $ref;
  }
  return @threads;
}

sub main {
  if ( $command eq "stop" ) {
    ## Gracefully killing connections on the current master
    # 1. Set read_only= 1 on the new master
    # 2. DROP USER so that no app user can establish new connections
    # 3. Set read_only= 1 on the current master
    # 4. Kill current queries
    # * Any database access failure will result in script die.
    my $exit_code = 1;
    eval {
      ## Setting read_only=1 on the new master (to avoid accident)
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error(die_on_error)_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );
      print current_time_us() . " Set read_only on the new master.. ";
      $new_master_handler->enable_read_only();
      if ( $new_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }
      $new_master_handler->disconnect();

      # Connecting to the orig master, die if any database error happens
      my $orig_master_handler = new MHA::DBHelper();
      $orig_master_handler->connect( $orig_master_ip, $orig_master_port,
        $orig_master_user, $orig_master_password, 1 );

      ## Drop application user so that nobody can connect. Disabling per-session binlog beforehand
      #$orig_master_handler->disable_log_bin_local();
      #print current_time_us() . " Drpping app user on the orig master..\n";
      #FIXME_xxx_drop_app_user($orig_master_handler);

      ## Waiting for N * 100 milliseconds so that current connections can exit
      my $time_until_read_only = 15;
      $_tstart = [gettimeofday];
      my @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_read_only > 0 && $#threads >= 0 ) {
        if ( $time_until_read_only % 5 == 0 ) {
          printf
"%s Waiting all running %d threads are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_read_only * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_read_only--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }

      ## Setting read_only=1 on the current master so that nobody(except SUPER) can write
      print current_time_us() . " Set read_only=1 on the orig master.. ";
      $orig_master_handler->enable_read_only();
      if ( $orig_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }

      ## Waiting for M * 100 milliseconds so that current update queries can complete
      my $time_until_kill_threads = 5;
      @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {
        if ( $time_until_kill_threads % 5 == 0 ) {
          printf
"%s Waiting all running %d queries are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_kill_threads * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_kill_threads--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }



                print "Disabling the VIP on old master: $orig_master_host \n";
                &stop_vip();     


      ## Terminating all threads
      print current_time_us() . " Killing all application threads..\n";
      $orig_master_handler->kill_threads(@threads) if ( $#threads >= 0 );
      print current_time_us() . " done.\n";
      #$orig_master_handler->enable_log_bin_local();
      $orig_master_handler->disconnect();

      ## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "start" ) {
    ## Activating master ip on the new master
    # 1. Create app user with write privileges
    # 2. Moving backup script if needed
    # 3. Register new master's ip to the catalog database

# We don't return error even though activating updatable accounts/ip failed so that we don't interrupt slaves' recovery.
# If exit code is 0 or 10, MHA does not abort
    my $exit_code = 10;
    eval {
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );

      ## Set read_only=0 on the new master
      #$new_master_handler->disable_log_bin_local();
      print current_time_us() . " Set read_only=0 on the new master.\n";
      $new_master_handler->disable_read_only();

      ## Creating an app user on the new master
      #print current_time_us() . " Creating app user on the new master..\n";
      #FIXME_xxx_create_app_user($new_master_handler);
      #$new_master_handler->enable_log_bin_local();
      $new_master_handler->disconnect();

      ## Update master ip on the catalog database, etc
                print "Enabling the VIP - $vip on the new master - $new_master_host \n";
                &start_vip();
                $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "status" ) {

    # do nothing
    exit 0;
  }
  else {
    &usage();
    exit 1;
  }
}

# A simple system call that enable the VIP on the new master 
sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
  print
"Usage: master_ip_online_change --command=start|stop|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
  die;
}

 

mysql故障恢复后从新加入集群

如果一台主(master)服务器发生故障后,自动迁移到备份master服务器上,原来的master服务器修复后要从新加入集群,只需要从新指定到新的master服务器即可

#新的master服务器上执行192.168.132.132)

mysql> show master status;
+------------------+-----------+--------------+------------------+-------------------+
| File             | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+-----------+--------------+------------------+-------------------+
| slave-bin.000002 | 209654159 |              |                  |                   |
+------------------+-----------+--------------+------------------+-------------------+

#出现故障经过修复后的master服务器执行192.168.132.131)
mysql> stop slave

mysql> change master to master_host='192.168.132.132',master_user='repl',master_password='repl',master_log_file='slave-bin.000002',master_log_pos="209654151";
#注意,如果有数据写入的时候,master_log_pos是不断变化的,所以需要从binlog日志里查看出现故障时master_log_file的名称和相应的_master_log_pos值,同步即可,这里写的209554151会自动同步一部分数据到新的slave

mysql> reset slave
mysql> start slave;

 

对以上的进行测试

安装sysbench,模拟数据写入

#安装依赖
yum -y install  make automake libtool pkgconfig libaio-devel

#下载sysbench
git clone https://github.com/akopytov/sysbench

#安装sysbench
cd sysbench/
./autogen.sh

./configure --prefix=/usr/local/sysbench/ --with-mysql --with-mysql-includes=/usr/local/mysql/include --with-mysql-libs=/usr/local/mysql/lib
make && make install

#测试是否安装成功
/usr/local/sysbench/bin/sysbench --version 

如果提示:
/usr/local/sysbench/bin/sysbench: error while loading shared libraries: libmysqlclient.so.20: cannot open shared object file: No such file or directory

解决:
export LD_LIBRARY_PATH=/usr/local/mysql/lib

#测试安装是否成功
/usr/local/sysbench/bin/sysbench --version

 

使用

#执行sysbench命令,具体参数通过sysbench --help 获取,run之后记得执行cleanup清除数据

/usr/local/sysbench/bin/sysbench --test=/root/sysbench/src/lua/oltp_insert.lua --tables=1000 --table-size=1000 \
--num-threads=128 --max-requests=1000 --time=180 --mysql-user=root --mysql-socket=/tmp/mysql.sock --mysql-password=root \
--db-driver=mysql --mysql-db=test --report-interval=10 run

#说明:
--table-size     每张表初始化的数据数量
--tables        初始化表的数量
--num-threads     启动线程数量
--report-interval 运行时间日志,单位为妙

#写好上面的语句后, 在最后面加上以下命令并执行:
parepar     表示 用于准备测试需要的数据
run       表示 执行来测试
cleanup     表示 清除测试数据

 

master端查看

 

指定master

mysql> stop slave;
mysql> change master to master_host='192.168.132.132',master_user='repl',master_password='repl',master_log_file='slave-bin.000002',master_log_pos="209654151";
mysql> reset slave;
mysql> start slave

 

slave端查看(经过同步后)

 

切换后发送邮件报警

[root@slave2 ~] cat /usr/local/bin/master_ip_online_change

#!/usr/bin/perl

#  Copyright (C) 2011 DeNA Co.,Ltd.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#  Foundation, Inc.,
#  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA

## Note: This is a sample script and is not complete. Modify the script based on your environment.

use strict;
use warnings FATAL => 'all';
use Mail::Sender;
use Getopt::Long;

#new_master_host and new_slave_hosts are set only when recovering master succeeded
my ( $dead_master_host, $new_master_host, $new_slave_hosts, $subject, $body );
my $smtp='smtp.163.com';
my $mail_from='xxxx';
my $mail_user='xxxxx';
my $mail_pass='xxxxx';
my $mail_to=['xxxx','xxxx'];
GetOptions(
  'orig_master_host=s' => \$dead_master_host,
  'new_master_host=s'  => \$new_master_host,
  'new_slave_hosts=s'  => \$new_slave_hosts,
  'subject=s'          => \$subject,
  'body=s'             => \$body,
);

mailToContacts($smtp,$mail_from,$mail_user,$mail_pass,$mail_to,$subject,$body);

sub mailToContacts {
    my ( $smtp, $mail_from, $user, $passwd, $mail_to, $subject, $msg ) = @_;
    open my $DEBUG, "> /tmp/monitormail.log"
        or die "Can't open the debug      file:$!\n";
    my $sender = new Mail::Sender {
        ctype       => 'text/plain; charset=utf-8',
        encoding    => 'utf-8',
        smtp        => $smtp,
        from        => $mail_from,
        auth        => 'LOGIN',
        TLS_allowed => '0',
        authid      => $user,
        authpwd     => $passwd,
        to          => $mail_to,
        subject     => $subject,
        debug       => $DEBUG
    };

    $sender->MailMsg(
        {   msg   => $msg,
            debug => $DEBUG
        }
    ) or print $Mail::Sender::Error;
    return 1;
}



# Do whatever you want here

exit 0;

 

遇到的问题

①、检查mha状态报错

[root@slave2 ~]# masterha_check_ssh -conf=/etc/masterha/app1.cnf

Fri Mar 30 18:29:34 2018 - [info]   Connecting to root@192.168.132.132(192.168.132.132:22).. 
Can't exec "mysqlbinlog": 没有那个文件或目录 at /usr/local/share/perl5/MHA/BinlogManager.pm line 99.
mysqlbinlog version not found!
 at /usr/local/bin/apply_diff_relay_logs line 463.
Fri Mar 30 18:29:34 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln193] Slaves settings check failed!
Fri Mar 30 18:29:34 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln372] Slave configuration failed.
Fri Mar 30 18:29:34 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln383] Error happend on checking configurations.  at /usr/local/bin/masterha_check_repl line 48.
Fri Mar 30 18:29:34 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln478] Error happened on monitoring servers.
Fri Mar 30 18:29:34 2018 - [info] Got exit code 1 (Not master dead).

MySQL Replication Health is NOT OK!

问题解决:

#在所有节点上执行(最好先注释app1.cnf下的master_ip_failover脚本)
ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql
ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog

 

②、报错提示:

[root@slave2 ~]# masterha_check_repl -conf=/etc/masterha/app1.cnf 
Sun Apr  8 20:11:22 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Apr  8 20:11:22 2018 - [info] Reading application default configurations from /etc/masterha/app1.cnf..
Sun Apr  8 20:11:22 2018 - [info] Reading server configurations from /etc/masterha/app1.cnf..
Sun Apr  8 20:11:22 2018 - [info] MHA::MasterMonitor version 0.53.
Sun Apr  8 20:11:23 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln622] Master 192.168.132.131:3306 from which slave 192.168.132.133(192.168.132.133:3306) replicates is not defined in the configuration file!
Sun Apr  8 20:11:23 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln383] Error happend on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 298.
Sun Apr  8 20:11:23 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln478] Error happened on monitoring servers.
Sun Apr  8 20:11:23 2018 - [info] Got exit code 1 (Not master dead).

MySQL Replication Health is NOT OK!

问题解决:

查看配置文件 /etc/masterha/app1.cnf 发现只有当前master的配置存在(server3 ...),添加其他的配置即可

[server1]
hostname=192.168.132.131
port=3306

[server2]
hostname=192.168.132.132
port=3306
candidate_master=1
check_repl_delay=0

[server3]
hostname=192.168.132.133
port=3306

 

③、对于mha监控进程异常退出的解决办法

写进脚本里,然后执行脚本

#!/bin/bash

/usr/bin/nohup /usr/local/bin/masterha_manager --conf=/etc/mastermha/app1.cnf --ignore_last_failover > /var/log/mastermha/app1/manager.log 2>&1 &

启动脚本

/bin/bash start_monitor.sh 

 

posted @ 2018-04-08 15:40  FRESHMANS  阅读(706)  评论(0编辑  收藏  举报