2017-05-06  zabbix database backup + MHA one-master-two-slaves backup and high-availability scheme for a single control node's database (failed)

 

I. Backing up the standalone zabbix database

1. The backup script
[root@zabbix mysql]# pwd
/var/lib/mysql
[root@zabbix mysql]# ls ## curious what these files under the datadir are
aria_log.00000001 ibdata1 ib_logfile1 mysql.sock test
aria_log_control ib_logfile0 mysql performance_schema zabbix
[root@zabbix mysql]# du -sh .
367M .

The idea: since the zabbix database is not very large, use mysqldump to take a full backup of the whole zabbix database (tar-gzipped) so that restores stay simple. Combined with a cron job, the backup script runs daily at 00:00, names the file by date, and writes it to /var/lib/backup_mysql_zabbix/. Because every backup is a full backup, archives older than 30 days can safely be deleted to free disk space.

[root@zabbix ~]# cd /usr/local/src
[root@zabbix src]# vim zabbix_bak.sh

#!/bin/bash
Time=$(date -d "yesterday" +%F)   ## yesterday's date as YYYY-MM-DD
Backupdir=/var/lib/backup_mysql_zabbix/
if [ ! -d "$Backupdir" ]; then
    mkdir -p "$Backupdir"
fi
cd /usr/local/src || exit 1
## note: with an empty database password the script fails to run, so a password is set
/usr/bin/mysqldump -uzabbix -pzabbix zabbix > zabbix_${Time}.sql
tar -zcf ${Backupdir}${Time}.sql.tar.gz zabbix_${Time}.sql   ## pack into the backup directory
if [ $? -eq 0 ]; then
    rm -f zabbix_${Time}.sql
fi
find "$Backupdir" -type f -mtime +30 -exec rm -f {} \;   ## drop archives older than 30 days
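The `find -mtime +30` retention step can be rehearsed in a scratch directory before trusting it with real backups — a quick sketch (all paths are throwaway):

```shell
# Create one fresh and one 40-day-old fake archive, then apply the same
# find expression the backup script uses and see what survives.
workdir=$(mktemp -d)
touch "$workdir/fresh.sql.tar.gz"
touch -d "40 days ago" "$workdir/old.sql.tar.gz"
find "$workdir" -type f -mtime +30 -exec rm -f {} \;
ls "$workdir"    # only fresh.sql.tar.gz should remain
```

Note that `-mtime +30` matches files whose age in full days is strictly greater than 30, i.e. 31 days or older.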

  


=============================================================================
[root@zabbix src]# crontab -e ## field order: minute, hour, day-of-month, month, day-of-week
00 00 * * * /bin/bash /usr/local/src/zabbix_bak.sh
Manually test restorability on another machine:
[root@agent tmp]# tar zxvf 2017-05-05.sql.tar.gz

2. Restore test on another machine (temporary machine, IP 172.16.1.8)
[root@agent tmp]# mysql -uroot -p ## this machine also needs mysql: yum install mariadb-devel mariadb-server -y
MariaDB [(none)]> create database zabbix; ## the database must be created first
MariaDB [(none)]> use zabbix;
MariaDB [zabbix]> source /tmp/zabbix_2017-05-05.sql ## restoring with source is far faster than piping through the mysql client
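The archive round-trip itself can be sanity-checked without any database at hand — a sketch using a dummy dump file and throwaway directories:

```shell
# Pack a dummy dump the way the backup script does, unpack it the way the
# restore test does, and confirm the content survives unchanged.
src=$(mktemp -d); dst=$(mktemp -d)
echo "CREATE TABLE t (id INT);" > "$src/zabbix_2017-05-05.sql"
tar -zcf "$src/2017-05-05.sql.tar.gz" -C "$src" zabbix_2017-05-05.sql
tar -zxf "$src/2017-05-05.sql.tar.gz" -C "$dst"
cmp "$src/zabbix_2017-05-05.sql" "$dst/zabbix_2017-05-05.sql" && echo "round-trip OK"
```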

MariaDB [(none)]> grant all on zabbix.* to 'zabbix'@'172.16.%.%' identified by 'zabbix';
Changed DBHost in zabbix_server.conf to 172.16.1.8 and then stopped the local database on the zabbix server; the web UI could no longer connect. The likely reason is that the database host was fixed to the local machine at install time, and I have not yet found a way to change it afterwards.


II. Backing up the database of a single control node

1. Approach
For a single control node's database, the first thought is a master-slave replication setup, but then, when the control node's master crashes, a slave has to be promoted by hand, which is cumbersome. So MHA is used instead: the control node is the master, compute node 1 is a slave, and compute node 2 is the monitoring/management node. Note that once replication is established, tables that already existed on the master are synchronized to the slave as well! (I had assumed they would not be.)
MHA process list: http://blog.csdn.net/leshami/article/details/46380989

2. Configuration details
One-master multi-slave deployment:
172.16.1.6 master
172.16.1.8 slave (candidate master)
172.16.1.5 MHA management node (with enough disk space it could double as another slave)

(1) First, set up master-slave replication between control node 1 and compute node 1.
On control node 1:
[root@zabbix ~]# vim /etc/my.cnf
server-id=1
log-bin=binlog
[root@zabbix ~]# systemctl restart mariadb.service
MariaDB [(none)]> show master status;
+---------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+---------------+----------+--------------+------------------+
| binlog.000001 | 11860 | | |
+---------------+----------+--------------+------------------+
1 row in set (0.02 sec)
MariaDB [(none)]> grant replication slave on *.* to 'root'@'172.16.1.%' identified by '123456';

On compute node 1:
[root@agent ~]# vim /etc/my.cnf
server-id=2
# read_only=1 ## do not set this in an MHA setup: this slave will later be promoted to master
[root@agent ~]# systemctl restart mariadb.service
MariaDB [(none)]> stop slave;
MariaDB [(none)]> reset slave;
MariaDB [(none)]> change master to master_host='172.16.1.6',master_user='root',master_password='123456',master_log_file='binlog.000001',master_log_pos=11860;
MariaDB [(none)]> start slave;
MariaDB [(none)]> show slave status\G
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
MariaDB [(none)]> show slave hosts;
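Rather than eyeballing `show slave status\G` by hand, the two thread flags can be checked from a script. A minimal sketch — the status text below is a canned sample; on a real slave it would come from `mysql -uroot -p... -e 'SHOW SLAVE STATUS\G'`:

```shell
# Both replication threads must say Yes for the slave to be healthy.
status='
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 0'
io=$(echo "$status"  | awk -F': ' '/Slave_IO_Running/  {print $2}')
sql=$(echo "$status" | awk -F': ' '/Slave_SQL_Running/ {print $2}')
if [ "$io" = "Yes" ] && [ "$sql" = "Yes" ]; then
    result="replication OK"
else
    result="replication BROKEN"
fi
echo "$result"
```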

(2) Deploy MHA. Control node 1 and compute node 1 are plain node hosts; compute node 2 is both a node host and the manager. All three machines need Internet access. Reference: http://www.cnblogs.com/xuanzhi201111/p/4231412.html
Question 1: where does the control node's database live?
Question 2: how do the compute nodes get Internet access?
mha4mysql-manager-0.57.tar.gz download: http://down.51cto.com/data/2258543
mha4mysql-node-0.57.tar.gz download: http://down.51cto.com/data/2258544

① Node hosts: both control node 1 and compute node 1 need the following steps
[root@zabbix ~]# yum install perl-DBD-MySQL -y 
[root@zabbix ~]# tar xzvf 51CTO下载-mha4mysql-node-0.57.tar.gz
[root@zabbix ~]# cd mha4mysql-node-0.57/
[root@zabbix ~]# yum install -y perl-devel perl-CPAN
[root@zabbix ~]# perl Makefile.PL
[root@zabbix ~]# make && make install
[root@zabbix ~]# cd /usr/local/bin/
[root@zabbix bin]# ls
apply_diff_relay_logs filter_mysqlbinlog purge_relay_logs save_binary_logs

save_binary_logs // save and copy the master's binary logs
apply_diff_relay_logs // identify differential relay-log events and apply them to the other slaves
filter_mysqlbinlog // strip unnecessary ROLLBACK events (MHA no longer uses this tool)
purge_relay_logs // purge relay logs (without blocking the SQL thread)

② Manager node: // compute node 2 also needs the node software installed, exactly as above
[root@vm3 ]# yum install -y perl-devel perl-CPAN
[root@vm3 ]# tar xzvf 51CTO下载-mha4mysql-node-0.57.tar.gz
[root@vm3 ]# cd mha4mysql-node-0.57/
[root@vm3 ]# perl Makefile.PL
[root@vm3 ]# make && make install

[root@vm3 ]# yum install -y perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager rrdtool perl-rrdtool rrdtool-devel perl-Params-Validate
[root@vm3 ]# tar xf 51CTO下载-mha4mysql-manager-0.57.tar.gz
[root@vm3 ]# cd mha4mysql-manager-0.57/
[root@vm3 mha4mysql-manager-0.57]# perl Makefile.PL
[root@vm3 mha4mysql-manager-0.57]# make && make install
[root@vm3 mha4mysql-manager-0.57]# cd /usr/local/bin
[root@vm3 bin]# ls
apply_diff_relay_logs masterha_check_ssh masterha_manager masterha_secondary_check save_binary_logs
filter_mysqlbinlog masterha_check_status masterha_master_monitor masterha_stop
masterha_check_repl masterha_conf_host masterha_master_switch purge_relay_logs


Set up passwordless SSH login among the three machines; run the following on each:
On the controller node:
# ssh-keygen -t rsa
# ssh-copy-id -i /root/.ssh/id_rsa.pub root@compute1 // on compute node 2 the targets are controller1 and compute1
# ssh-copy-id -i /root/.ssh/id_rsa.pub root@compute2 // i.e. copy the public key into the peer's authorized_keys

[root@vm3 ~]# mkdir -p /etc/masterha
[root@vm3 ~]# cp mha4mysql-manager-0.57/samples/conf/app1.cnf /etc/masterha/

To be continued... reference: http://www.cnblogs.com/xuanzhi201111/p/4231412.html


[root@vm3 masterha]# cat /etc/masterha/app1.cnf ## best to keep Chinese comments out of this file
===============================================
[server default]
manager_workdir=/var/log/masterha/app1.log
manager_log=/var/log/masterha/app1/manager.log
master_binlog_dir=/var/lib/mysql
# master_ip_failover_script= /usr/local/bin/master_ip_failover
# master_ip_online_change_script= /usr/local/bin/master_ip_online_change
password=123456
user=root
ping_interval=1
remote_workdir=/tmp
repl_password=123456
repl_user=root
#report_script=/usr/local/bin/send_report
#shutdown_script=""
ssh_user=root

[server1]
hostname=172.16.1.6
port=3306

[server2]
hostname=172.16.1.8
port=3306
candidate_master=1
=====================================================================
Explanation of the settings:
manager_workdir=/var/log/masterha/app1.log // the manager's working directory
manager_log=/var/log/masterha/app1/manager.log // the manager's log file
master_binlog_dir=/data/mysql // where the master keeps its binlogs so MHA can find them; here this is simply the mysql data directory
master_ip_failover_script= /usr/local/bin/master_ip_failover // switchover script used during automatic failover
master_ip_online_change_script= /usr/local/bin/master_ip_online_change // switchover script used during manual switchover
password=123456 // password of the mysql root user, i.e. the monitoring user created earlier
user=root // the monitoring user
ping_interval=1 // seconds between pings to the master (default 3); after three attempts with no answer MHA fails over automatically
remote_workdir=/tmp // where the remote mysql hosts keep binlogs during a switchover
repl_password=123456 // password of the replication user
repl_user=root // the replication user
report_script=/usr/local/bin/send_report // alert script run after a switchover
secondary_check_script= /usr/local/bin/masterha_secondary_check -s server03 -s server02 --user=root --master_host=server02 --master_ip=192.168.2.128 --master_port=3306 // once monitoring from MHA to server02 fails, the MHA manager tries to reach server02 by logging in via server03
shutdown_script="" // script that powers off the failed host (mainly to prevent split-brain; not used here)
ssh_user=root // ssh login user
check_repl_delay=0 // by default MHA will not choose a slave lagging more than 100MB of relay logs as the new master, since its recovery would take long; check_repl_delay=0 makes MHA ignore replication delay when picking the new master. Very useful together with candidate_master=1, which guarantees that host becomes the new master during a switchover


3. Using the MHA commands
[root@vm3 masterha]# masterha_check_ssh --conf=/etc/masterha/app1.cnf ## from the manager, check SSH connectivity to the master and slaves
[root@vm3 masterha]# masterha_check_repl --conf=/etc/masterha/app1.cnf ## check replication
Error:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Mon May 8 09:50:14 2017 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln301] Got MySQL error when connecting 172.16.1.8(172.16.1.8:3306) :1045:Access denied for user 'root'@'172.16.1.5' (using password: YES), but this is not a MySQL crash. Check MySQL server settings.
at /usr/local/share/perl5/MHA/ServerManager.pm line 297.

Fix: log in to the slave (172.16.1.8) and grant the management node (172.16.1.5) permission to connect.
[root@slave ~]# mysql -uroot -p
MariaDB [(none)]> grant all on *.* to 'root'@'172.16.1.5' identified by '123456';

Next error (1):
Mon May 8 10:03:04 2017 - [error][/usr/local/share/perl5/MHA/Server.pm, ln265] Checking slave status failed on 172.16.1.6(172.16.1.6:3306). err=Got error when executing SHOW SLAVE STATUS. Access denied; you need (at least one of) the SUPER,REPLICATION CLIENT privilege(s) for this operation
Fix: grant the management node privileges on the master as well:
[root@master ~]# mysql -uroot -p
MariaDB [(none)]> show grants for 'root'@'172.16.1.5';
MariaDB [(none)]> grant all on *.* to 'root'@'172.16.1.5' identified by '123456';

Next error (2):
Mon May 8 10:08:27 2017 - [warning] log-bin is not set on slave 172.16.1.8(172.16.1.8:3306). This host cannot be a master.
Fix: add log-bin=binlog to my.cnf on the slave.

Next error (3):
Testing mysql connection and privileges..ERROR 1045 (28000): Access denied for user 'root'@'vm2.novalocal' (using password: YES)
mysql command failed with rc 1:0!
Fix: the slave's hostname changed after a reboot; grant again for the new hostname:
MariaDB [(none)]> grant all on *.* to 'root'@'vm2.novalocal' identified by '123456';

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

[root@vm3 masterha]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
[root@vm3 masterha]# mkdir -p /var/log/masterha/app1/
[root@vm3 masterha]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
--remove_dead_master_conf // after a master switchover, the dead master's entry is removed from the configuration file
--manager_log // log file location
--ignore_last_failover // by default, if MHA sees two outages less than 8 hours apart it refuses to fail over again, to avoid ping-pong switching. After a failover MHA writes app1.failover.complete into the manager working directory, and the next failover is refused while that file exists unless it is deleted first; --ignore_last_failover skips this check, which is convenient here.
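The marker-file behaviour can also be handled explicitly: instead of always passing --ignore_last_failover, remove the marker before restarting the manager. A sketch in a scratch directory standing in for manager_workdir:

```shell
# After a failover MHA leaves app1.failover.complete in its working directory
# and refuses the next failover within 8 hours while the file exists.
workdir=$(mktemp -d)                      # stands in for manager_workdir
touch "$workdir/app1.failover.complete"   # what a completed failover leaves behind
rm -f "$workdir/app1.failover.complete"   # clear it before restarting masterha_manager
[ ! -e "$workdir/app1.failover.complete" ] && echo "marker cleared"
```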

[root@vm3 masterha]# masterha_check_status --conf=/etc/masterha/app1.cnf ## verify that MHA Manager monitoring is up and the master is reported OK
app1 (pid:3087) is running(0:PING_OK), master:172.16.1.6
[root@vm3 masterha]# tail -n20 /var/log/masterha/app1/manager.log
Mon May 8 14:29:17 2017 - [info] Starting ping health check on 172.16.1.6(172.16.1.6:3306)..
Mon May 8 14:29:17 2017 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond.
// the line "Ping(SELECT) succeeded, waiting until MySQL doesn't respond.." means monitoring has started; the manager must keep running, so it is best added to /etc/rc.local:
[root@vm3 masterha]# vim /etc/rc.local
nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &

4. Install Keepalived on the master and slave, with VIP 172.16.1.254
On the master (172.16.1.6):
[root@master ~]# yum install keepalived -y
[root@master ~]# vim /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {
    notification_email {
        15706107661@163.com
    }
    notification_email_from 172.16.1.6@ip.com
    smtp_server smtp.163.com
    smtp_connect_timeout 30
    router_id MySQL-HA
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    nopreempt
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        172.16.1.254
    }
}
==============================================
On the slave (172.16.1.8):
[root@salve ~]# yum install keepalived -y
[root@salve ~]# vim /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {
    notification_email {
        15706107661@163.com
    }
    notification_email_from 172.16.1.8@ip.com
    smtp_server smtp.163.com
    smtp_connect_timeout 30
    router_id MySQL-HA
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 99
    advert_int 1
    nopreempt
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        172.16.1.254
    }
}
=================================================
Notes on the configuration:
Both keepalived instances are configured in BACKUP state. Keepalived supports two arrangements, master->backup and backup->backup, and they differ greatly. In master->backup mode, once the master dies the VIP floats to the backup; but when the old master is repaired and keepalived restarts, it grabs the VIP back, and this preemption happens even with nopreempt set. In backup->backup mode, the VIP floats to the backup when the master dies, and when the old master and its keepalived come back it does not preempt the new master's VIP, even if its priority is higher. To reduce the number of VIP moves, the repaired master is usually kept as the new backup.
For the e-mail notification part, see the mailx setup: http://www.cnblogs.com/fujinzhou/p/5687384.html

[root@master ~]# systemctl start keepalived.service
[root@master ~]# ip a
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP qlen 1000
link/ether fa:16:3e:56:49:e7 brd ff:ff:ff:ff:ff:ff
inet 172.16.1.6/24 brd 172.16.1.255 scope global dynamic eth0
valid_lft 72238sec preferred_lft 72238sec
inet 172.16.1.254/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe56:49e7/64 scope link
valid_lft forever preferred_lft forever
As shown above, the VIP is now bound to eth0. Because this host has the higher priority, the VIP will not appear on the slave.
[root@master ~]# tail /var/log/messages
May 8 15:03:04 zabbix Keepalived_vrrp[9294]: VRRP_Instance(VI_1) Entering MASTER STATE
May 8 15:03:04 zabbix Keepalived_vrrp[9294]: VRRP_Instance(VI_1) setting protocol VIPs.
May 8 15:03:04 zabbix Keepalived_vrrp[9294]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 172.16.1.254
May 8 15:03:04 zabbix Keepalived_healthcheckers[9293]: Netlink reflector reports IP 172.16.1.254 added
May 8 15:03:09 zabbix Keepalived_vrrp[9294]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth0 for 172.16.1.254

[root@salve ~]# systemctl start keepalived.service ## the slave must run keepalived too
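Checking which host holds the VIP can also be scripted instead of reading `ip a` by eye. A sketch — `ip_out` here is a canned sample; on a real host it would be `ip_out=$(ip addr show eth0)`:

```shell
# The VIP is bound locally if an inet line for it appears on the interface.
vip="172.16.1.254"
ip_out='    inet 172.16.1.6/24 brd 172.16.1.255 scope global dynamic eth0
    inet 172.16.1.254/32 scope global eth0'
if echo "$ip_out" | grep -q "inet $vip/"; then
    where="VIP bound here"
else
    where="VIP elsewhere"
fi
echo "$where"
```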

5. The Keepalived script triggered by MHA; only the VIP needs changing, the rest is boilerplate
On the database management node:
[root@vm3 ~]# vim /usr/local/bin/master_ip_failover

#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);

my $vip = '172.16.1.254';
my $ssh_start_vip = "/etc/init.d/keepalived start";
my $ssh_stop_vip  = "/etc/init.d/keepalived stop";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {

    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {

        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {

        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

# Start keepalived on the new master so that it claims the VIP
sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}

# A simple system call that disables the VIP on the old master
sub stop_vip() {
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
    print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

===========================================================================
[root@vm3 ~]# chmod +x /usr/local/bin/master_ip_failover
[root@vm3 ~]# vim /etc/masterha/app1.cnf ## uncomment the switchover script
master_ip_failover_script= /usr/local/bin/master_ip_failover

The point of the additions in /usr/local/bin/master_ip_failover: when the master database fails, MHA triggers the switchover, the MHA manager stops keepalived on the old master, the VIP floats to the candidate slave, and the switchover completes. Alternatively, keepalived itself can run a script that monitors whether mysql is alive and kills the local keepalived process when it is not.

[root@vm3 bin]# vim /usr/local/bin/send_report

#!/usr/bin/perl

use strict;
use warnings FATAL => 'all';
use Mail::Sender;
use Getopt::Long;

# new_master_host and new_slave_hosts are set only when recovering master succeeded
my ( $dead_master_host, $new_master_host, $new_slave_hosts, $subject, $body );
my $smtp      = 'smtp.163.com';
my $mail_from = '172.16.1.5@ip.com';
my $mail_user = 'MHA';
my $mail_pass = '199510';
my $mail_to   = [ '2424013264@qq.com', '15706107661@163.com' ];
GetOptions(
    'orig_master_host=s' => \$dead_master_host,
    'new_master_host=s'  => \$new_master_host,
    'new_slave_hosts=s'  => \$new_slave_hosts,
    'subject=s'          => \$subject,
    'body=s'             => \$body,
);

mailToContacts( $smtp, $mail_from, $mail_user, $mail_pass, $mail_to, $subject, $body );

sub mailToContacts {
    my ( $smtp, $mail_from, $user, $passwd, $mail_to, $subject, $msg ) = @_;
    open my $DEBUG, "> /tmp/monitormail.log"
      or die "Can't open the debug file:$!\n";
    my $sender = new Mail::Sender {
        ctype       => 'text/plain; charset=utf-8',
        encoding    => 'utf-8',
        smtp        => $smtp,
        from        => $mail_from,
        auth        => 'LOGIN',
        TLS_allowed => '0',
        authid      => $user,
        authpwd     => $passwd,
        to          => $mail_to,
        subject     => $subject,
        debug       => $DEBUG
    };

    $sender->MailMsg(
        {   msg   => $msg,
            debug => $DEBUG
        }
    ) or print $Mail::Sender::Error;
    return 1;
}

# Do whatever you want here
exit 0;

==============================================================
[root@vm3 bin]# vim /etc/masterha/app1.cnf ## mha reads the alert script path from its configuration file
report_script=/usr/local/bin/send_report
[root@vm3 bin]# chmod +x /usr/local/bin/send_report

[root@vm3 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf

6. When the master is down, promote the slave manually with masterha_master_switch; note that masterha_manager must be stopped first, since it is what performs automatic switchovers.
[root@vm3 ~]# masterha_stop --conf=/etc/masterha/app1.cnf ## stop MHA monitoring and switch by hand instead
[root@vm3 ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
Then test on the master machine by stopping mysql: $ systemctl stop mariadb.service
[root@vm3 ~]# masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=172.16.1.6 --dead_master_port=3306 --new_master_host=172.16.1.8 --new_master_port=3306 --ignore_last_failover --orig_master_is_new_slave
Option notes:
--master_state=dead (versus --master_state=alive) means the switchover is performed only when the master is dead
--orig_master_is_new_slave ## the old master becomes a slave; keep this option, because once the old master is repaired the new data lives on the former slave
--running_updates_limit=10000: during failover, a lagging candidate master would normally make the switchover fail; with this option the switchover is allowed as long as the delay is within this many seconds, though the time the switchover takes still depends on the size of the relay logs to recover


7. Rough flow of an MHA online switchover. ("Online" means the master machine has not died, not that the switchover is automatic!) Its benefit is that writes to the master are blocked only briefly while the data is migrated; the usual motive is that the original master is heavily loaded.
(1) Verify the replication settings and identify the current master
(2) Determine the new master
(3) Block writes on the current master
(4) Wait for all slaves to catch up with replication
(5) Grant write access on the new master
(6) Repoint the slaves
Note two application-side issues during an online switchover:
1. Automatically identifying master and slave (the master machine may change); a VIP largely solves this.
2. Load balancing (define rough read/write ratios and per-machine load shares, and plan for machines leaving the cluster).
To guarantee full data consistency and finish the switchover as fast as possible, MHA's online switchover succeeds only when all of the following hold; otherwise it fails:
(1) the IO thread is running on every slave
(2) the SQL thread is running on every slave
(3) Seconds_Behind_Master in every slave's SHOW SLAVE STATUS is at most running_updates_limit seconds (default 1 second if not specified during the switch)
(4) on the master, SHOW PROCESSLIST shows no update running longer than running_updates_limit seconds
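Condition (3) can be pre-checked the same way before attempting a switchover. A sketch of the comparison MHA performs, fed with a canned SHOW SLAVE STATUS line (a real check would query each slave):

```shell
# Allow the online switchover only if the slave's lag is within the limit.
limit=10000    # --running_updates_limit, in seconds
sample='        Seconds_Behind_Master: 3'
delay=$(echo "$sample" | awk -F': ' '/Seconds_Behind_Master/ {print $2}')
if [ "$delay" -le "$limit" ]; then
    verdict="switchover allowed"
else
    verdict="switchover refused"
fi
echo "$verdict"
```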

[root@vm3 ~]# vim /usr/local/bin/master_ip_online_change

#!/usr/bin/env perl
## Note: This is a sample script and is not complete. Modify the script based on your environment.

use strict;
use warnings FATAL => 'all';

use Getopt::Long;
use MHA::DBHelper;
use MHA::NodeUtil;
use Time::HiRes qw( sleep gettimeofday tv_interval );
use Data::Dumper;

my $_tstart;
my $_running_interval = 0.1;
my (
$command, $orig_master_is_new_slave, $orig_master_host,
$orig_master_ip, $orig_master_port, $orig_master_user,
$orig_master_password, $orig_master_ssh_user, $new_master_host,
$new_master_ip, $new_master_port, $new_master_user,
$new_master_password, $new_master_ssh_user
);
my $vip = '172.16.1.254/24';
my $key = '1';
my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig eth0:$key down";
my $orig_master_ssh_port = 22;
my $new_master_ssh_port = 22;
GetOptions(
'command=s' => \$command,
'orig_master_is_new_slave' => \$orig_master_is_new_slave,
'orig_master_host=s' => \$orig_master_host,
'orig_master_ip=s' => \$orig_master_ip,
'orig_master_port=i' => \$orig_master_port,
'orig_master_user=s' => \$orig_master_user,
'orig_master_password=s' => \$orig_master_password,
'orig_master_ssh_user=s' => \$orig_master_ssh_user,
'new_master_host=s' => \$new_master_host,
'new_master_ip=s' => \$new_master_ip,
'new_master_port=i' => \$new_master_port,
'new_master_user=s' => \$new_master_user,
'new_master_password=s' => \$new_master_password,
'new_master_ssh_user=s' => \$new_master_ssh_user,
'orig_master_ssh_port=i' => \$orig_master_ssh_port,
'new_master_ssh_port=i' => \$new_master_ssh_port,
);

exit &main();

sub current_time_us {
my ( $sec, $microsec ) = gettimeofday();
my $curdate = localtime($sec);
return $curdate . " " . sprintf( "%06d", $microsec );
}

sub sleep_until {
my $elapsed = tv_interval($_tstart);
if ( $_running_interval > $elapsed ) {
sleep( $_running_interval - $elapsed );
}
}

sub get_threads_util {
my $dbh = shift;
my $my_connection_id = shift;
my $running_time_threshold = shift;
my $type = shift;
$running_time_threshold = 0 unless ($running_time_threshold);
$type = 0 unless ($type);
my @threads;

my $sth = $dbh->prepare("SHOW PROCESSLIST");
$sth->execute();

while ( my $ref = $sth->fetchrow_hashref() ) {
my $id = $ref->{Id};
my $user = $ref->{User};
my $host = $ref->{Host};
my $command = $ref->{Command};
my $state = $ref->{State};
my $query_time = $ref->{Time};
my $info = $ref->{Info};
$info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);
next if ( $my_connection_id == $id );
next if ( defined($query_time) && $query_time < $running_time_threshold );
next if ( defined($command) && $command eq "Binlog Dump" );
next if ( defined($user) && $user eq "system user" );
next
if ( defined($command)
&& $command eq "Sleep"
&& defined($query_time)
&& $query_time >= 1 );

if ( $type >= 1 ) {
next if ( defined($command) && $command eq "Sleep" );
next if ( defined($command) && $command eq "Connect" );
}

if ( $type >= 2 ) {
next if ( defined($info) && $info =~ m/^select/i );
next if ( defined($info) && $info =~ m/^show/i );
}

push @threads, $ref;
}
return @threads;
}

sub main {
if ( $command eq "stop" ) {
## Gracefully killing connections on the current master
# 1. Set read_only= 1 on the new master
# 2. DROP USER so that no app user can establish new connections
# 3. Set read_only= 1 on the current master
# 4. Kill current queries
# * Any database access failure will result in script die.
my $exit_code = 1;
eval {
## Setting read_only=1 on the new master (to avoid accident)
my $new_master_handler = new MHA::DBHelper();

# args: hostname, port, user, password, raise_error(die_on_error)_or_not
$new_master_handler->connect( $new_master_ip, $new_master_port,
$new_master_user, $new_master_password, 1 );
print current_time_us() . " Set read_only on the new master.. ";
$new_master_handler->enable_read_only();
if ( $new_master_handler->is_read_only() ) {
print "ok.\n";
}
else {
die "Failed!\n";
}
$new_master_handler->disconnect();

# Connecting to the orig master, die if any database error happens
my $orig_master_handler = new MHA::DBHelper();
$orig_master_handler->connect( $orig_master_ip, $orig_master_port,
$orig_master_user, $orig_master_password, 1 );

## Drop application user so that nobody can connect. Disabling per-session binlog beforehand
$orig_master_handler->disable_log_bin_local();
print current_time_us() . " Dropping app user on the orig master..\n";
#FIXME_xxx_drop_app_user($orig_master_handler);

## Waiting for N * 100 milliseconds so that current connections can exit
my $time_until_read_only = 15;
$_tstart = [gettimeofday];
my @threads = get_threads_util( $orig_master_handler->{dbh},
$orig_master_handler->{connection_id} );
while ( $time_until_read_only > 0 && $#threads >= 0 ) {
if ( $time_until_read_only % 5 == 0 ) {
printf
"%s Waiting all running %d threads are disconnected.. (max %d milliseconds)\n",
current_time_us(), $#threads + 1, $time_until_read_only * 100;
if ( $#threads < 5 ) {
print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
foreach (@threads);
}
}
sleep_until();
$_tstart = [gettimeofday];
$time_until_read_only--;
@threads = get_threads_util( $orig_master_handler->{dbh},
$orig_master_handler->{connection_id} );
}

## Setting read_only=1 on the current master so that nobody(except SUPER) can write
print current_time_us() . " Set read_only=1 on the orig master.. ";
$orig_master_handler->enable_read_only();
if ( $orig_master_handler->is_read_only() ) {
print "ok.\n";
}
else {
die "Failed!\n";
}

## Waiting for M * 100 milliseconds so that current update queries can complete
my $time_until_kill_threads = 5;
@threads = get_threads_util( $orig_master_handler->{dbh},
$orig_master_handler->{connection_id} );
while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {
if ( $time_until_kill_threads % 5 == 0 ) {
printf
"%s Waiting all running %d queries are disconnected.. (max %d milliseconds)\n",
current_time_us(), $#threads + 1, $time_until_kill_threads * 100;
if ( $#threads < 5 ) {
print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
foreach (@threads);
}
}
sleep_until();
$_tstart = [gettimeofday];
$time_until_kill_threads--;
@threads = get_threads_util( $orig_master_handler->{dbh},
$orig_master_handler->{connection_id} );
}

## Terminating all threads
print current_time_us() . " Killing all application threads..\n";
$orig_master_handler->kill_threads(@threads) if ( $#threads >= 0 );
print current_time_us() . " done.\n";
$orig_master_handler->enable_log_bin_local();
$orig_master_handler->disconnect();

## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK
eval {
`ssh -p$orig_master_ssh_port $orig_master_ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
};
if ($@) {
warn $@;
}
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "start" ) {
## Activating master ip on the new master
# 1. Create app user with write privileges
# 2. Moving backup script if needed
# 3. Register new master's ip to the catalog database

# We don't return error even though activating updatable accounts/ip failed so that we don't interrupt slaves' recovery.
# If exit code is 0 or 10, MHA does not abort
my $exit_code = 10;
eval {
my $new_master_handler = new MHA::DBHelper();

# args: hostname, port, user, password, raise_error_or_not
$new_master_handler->connect( $new_master_ip, $new_master_port,
$new_master_user, $new_master_password, 1 );

## Set read_only=0 on the new master
$new_master_handler->disable_log_bin_local();
print current_time_us() . " Set read_only=0 on the new master.\n";
$new_master_handler->disable_read_only();

## Creating an app user on the new master
print current_time_us() . " Creating app user on the new master..\n";
#FIXME_xxx_create_app_user($new_master_handler);
$new_master_handler->enable_log_bin_local();
$new_master_handler->disconnect();

## Update master ip on the catalog database, etc
`ssh -p$new_master_ssh_port $new_master_ssh_user\@$new_master_host \" $ssh_start_vip \"`;
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
elsif ( $command eq "status" ) {

# do nothing
exit 0;
}
else {
&usage();
exit 1;
}
}

sub usage {
print
"Usage: master_ip_online_change --command=start|stop|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
die;
}


==========================================================================

masterha_stop --conf=/etc/masterha/app1.cnf
masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=172.16.1.8 --orig_master_is_new_slave --running_updates_limit=10000 --interactive=0

Error: the topology had turned into master-master replication; it only worked after removing the log-bin option on the original slave.
To list all users and their hosts in MySQL: SELECT DISTINCT CONCAT('User: ''',user,'''@''',host,''';') AS query FROM mysql.user;
mysql> select * from mysql.user where user='root' \G

To revoke a grant: MariaDB [(none)]> revoke all on *.* from 'root'@'vm2.novalocal';


8. Repairing the master machine after it crashes
Once the old Master is repaired, run CHANGE MASTER on it so that it joins as the new slave.
Likewise, run grant replication slave on the new master, then on the repaired machine:
stop slave ; reset slave;
CHANGE MASTER TO MASTER_HOST='192.168.2.129', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000005', MASTER_LOG_POS=61791, MASTER_USER='repl', MASTER_PASSWORD='xxx';
start slave ;

How MHA works: http://blog.csdn.net/zengxuewen2045/article/details/51605613

After a failover completes, to bring the old master back you must run change master to on the old machine (now a slave) by hand and verify with show slave status\G. Then, on the management node, restore the removed section in app1.cnf and restart monitoring:
[root@vm3 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
[root@vm3 ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
[root@vm3 ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:17491) is running(0:PING_OK), master:172.16.1.6


III. Errors encountered afterwards
1. masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
Fix: (1) install daemontools and run masterha_manager as a supervised daemon instead of under nohup.
Reference: http://blog.sina.com.cn/s/blog_81b2b2a10102w3mh.html
Note: a cron entry does NOT work! nohup only keeps the process alive after the terminal closes. [root@vm3 masterha]# crontab -e
* * * * * nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
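Short of installing daemontools, a plain restart loop can keep the manager alive. A minimal sketch, exercised here with /bin/true standing in for masterha_manager so it terminates; a real supervisor would loop forever around the nohup command above:

```shell
# Re-run the supervised command whenever it exits; capped at 3 restarts
# here only so the sketch finishes.
restarts=0
while [ "$restarts" -lt 3 ]; do
    /bin/true    # real use: masterha_manager --conf=/etc/masterha/app1.cnf ...
    restarts=$((restarts + 1))
    sleep 0.1    # brief pause so a crashing command cannot spin the CPU
done
echo "restarted $restarts times"
```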

2. Failover behaviour was also disappointing: after stopping the master's database service, MHA did nothing and the VIP stayed on the master.
masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=172.16.1.6 --dead_master_port=3306 --new_master_host=172.16.1.8 --new_master_port=3306 --ignore_last_failover --orig_master_is_new_slave
On the old master:
Last_IO_Error: error connecting to master 'root@172.16.1.8:3306' - retry-time: 60 retries: 86400 message: Access denied for user 'root'@'172.16.1.6' (using password: YES)

Diagnosis: the old master did not successfully become a slave; replication privileges need to be granted on the new master.
On the new master: MariaDB [(none)]> grant all on *.* to 'root'@'172.16.1.6' identified by '123456';
MariaDB [(none)]> select user,host from mysql.user;
// MariaDB [(none)]> delete from mysql.user where host='vm2.novalocal';
On the old master: # start slave;
[root@vm3 ~]# tail -n20 /var/log/masterha/app1/manager.log

3、[root@vm3 ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Tue May 9 09:57:35 2017 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln671] Master 172.16.1.6:3306 from which slave 172.16.1.8(172.16.1.8:3306) replicates is not defined in the configuration file!
Fix: an earlier manual online switchover had modified /etc/masterha/app1.cnf, leaving this section missing:
[server1]
hostname=172.16.1.6
port=3306

4. When the master's database was stopped, on the management node # tail /var/log/masterha/app1/manager.log showed:
app1: MySQL Master failover 172.16.1.6(172.16.1.6:3306) to 172.16.1.8(172.16.1.8:3306) succeeded
So the slave really was promoted, yet the VIP stayed on the old master, and the zabbix web UI crashed with it (the zabbix database host had been fixed to localhost at install time).
So what is the keepalived VIP actually good for?
Fix: the master's keepalived node was missing a shell script that monitors the local MySQL instance and kills the local keepalived process when MySQL is down.

# crontab -e
* * * * * /var/lib/mysql/check_keepalived.sh > /dev/null 2>&1 &
## write this script on the master, and on the slave as well: the slave will become master and can also crash at any time, so the slave needs the mailx mail setup too (mailx is installed by default)
[root@vm2 mysql]# vim /etc/mail.rc
set from=15706107661@163.com
set smtp=smtp.163.com
set smtp-auth-user=15706107661
set smtp-auth-password=199510
set smtp-auth=login

[root@vm1 mysql]# vim /var/lib/mysql/check_keepalived.sh
#!/bin/bash
MYSQL_CMD=/usr/bin/mysql
CHECK_TIME=3   # check 3 times
MYSQL_OK=1     # MYSQL_OK is 1 while the mysql service works, else 0
IPADDR=$(ifconfig eth0 | grep " inet " | awk '{print $2}')

function check_mysql_health() {
    $MYSQL_CMD -uroot -p123456 -S /var/lib/mysql/mysql.sock -e "show status;" > /dev/null 2>&1
    if [ $? -eq 0 ]; then
        MYSQL_OK=1
        systemctl start keepalived.service
    else
        MYSQL_OK=0
    fi
    return $MYSQL_OK
}

while [ $CHECK_TIME -ne 0 ]; do
    let "CHECK_TIME -= 1"
    check_mysql_health
    if [ $MYSQL_OK -eq 1 ]; then
        CHECK_TIME=0
        exit 0
    fi
    if [ $MYSQL_OK -eq 0 ] && [ $CHECK_TIME -eq 0 ]; then
        systemctl stop keepalived.service
        echo "mysql is down" | mailx -s "$IPADDR DB server" 15706107661@163.com
        exit 1
    fi
    sleep 1
done
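The retry logic of check_keepalived.sh can be exercised without MySQL or keepalived by stubbing the health check and recording the decision instead of calling systemctl — a sketch:

```shell
# Stub: pretend mysql never answers, and capture what the loop decides.
MYSQL_OK=0
CHECK_TIME=3
ACTION=""
check_mysql_health() { MYSQL_OK=0; }    # always "down" in this dry run
while [ "$CHECK_TIME" -ne 0 ]; do
    CHECK_TIME=$((CHECK_TIME - 1))
    check_mysql_health
    if [ "$MYSQL_OK" -eq 1 ]; then
        ACTION="keep_keepalived"        # real script: exit 0, keepalived untouched
        break
    fi
    if [ "$MYSQL_OK" -eq 0 ] && [ "$CHECK_TIME" -eq 0 ]; then
        ACTION="stop_keepalived"        # real script: systemctl stop keepalived + mailx alert
    fi
    sleep 1
done
echo "$ACTION"
```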

======================================================================================================
Make sure the mysql password used in the script is correct, otherwise the check will never succeed.
MariaDB [(none)]> update mysql.user set password=password('123456') where user='root';
Query OK, 4 rows affected (0.00 sec)
Rows matched: 7 Changed: 4 Warnings: 0

MariaDB [(none)]> flush privileges; ## this step is mandatory

5. While the master works normally, the VIP never answers ping. When the master dies, MHA notices, the ip_failover script makes the slave take over as master, the master's check script sees MySQL down and stops the local keepalived, and the VIP duly floats to the slave (stop the slave's keepalived and the VIP immediately returns to the master). But the VIP still cannot be pinged from outside?
Fix: can the VIP then live on a different subnet than the physical IPs of the master and slave, e.g. as a public IP? Yes, it can.

Open questions:
What do we actually require of the single control node's database?
(1) High availability of the mysql service: when the master's mysql dies the slave becomes master, but then the slave's IP would have to be changed by hand to the master's. If it is not changed, keepalived's VIP must be used, which means every other service has to address the VIP instead of the master's real physical IP.
(2) Or plain data backup: in that case MHA is unnecessary; a slave replicating with a 1-hour delay gives data safety and limits the damage.
The best scheme giving the single control node both service HA and data safety: control node as master, compute node 1 as slave, keepalived on both for HA, compute node 2 as the MHA monitor and also a slave replicating with a 1-hour delay. If the master is wiped by mistake, immediately stop the delayed slave's IO thread: stop slave io_thread. MySQL 5.5 and below replicate with a single thread!
Multi-threaded replication means MySQL's parallel replication. Most importantly, delayed replication is only supported from MySQL 5.6 onwards.


IV. Compute node 2 (172.16.1.5), besides acting as the MHA management node, also runs mariadb and replicates from the control node with a 30-minute delay, i.e. a one-master multi-slave topology
1. Installation
[root@vm3 ~]# yum install -y mariadb-server mariadb-devel
[root@vm3 ~]# vim /etc/my.cnf
server-id=3
read_only=1
[root@vm3 ~]# systemctl enable mariadb.service
[root@vm3 ~]# systemctl start mariadb.service
[root@vm3 ~]# mysql -uroot -p
MariaDB [(none)]>update mysql.user set password=password('123456') where user='root';
MariaDB [(none)]> flush privileges;

On the master:
MariaDB [(none)]> select * from mysql.user where user='root'\G ## check the grants
MariaDB [(none)]> grant replication slave on *.* to 'root'@'172.16.1.5' identified by '123456';
MariaDB [(none)]> use zabbix;
MariaDB [zabbix]> show tables;
MariaDB [zabbix]> show master status;

Then, on the MHA node:
change master to master_host='172.16.1.6',master_user='root',master_password='123456',master_log_file='binlog.000008',master_log_pos=13618261,master_delay=1800;

The worst part: replication was never actually working, but we did not notice and assumed we had backups (note that master_delay needs MySQL 5.6+, as said above; the MariaDB 5.5 shipped with CentOS 7 does not support it, which may well be why it failed). Monitoring, monitoring, monitoring!


posted @ 2017-07-08 21:52 by 兮瓜公籽