PG高可用之Keepalived安装示例及注意事项

适用范围

pg12+

方案概述

通过Keepalived 实现 PG HA的自主切换,以达到高可用的目的,由于Keepalived 是在PG流复制之上,所以在安装Keepalived 之前,我们需要先搭建好PG流复制,这里我们使用PG12的版本,如果是用PG12以前的版本,需要修改failover.sh的切换内容。

实施步骤

1. 资源规划:

image.png

2. 安装准备:

安装前我们需要创建数据库Keepalived,并且创建表探测表sr_delay,后续Keepalived探测,刷新sr_delay表的last_alive字段为当前探测时间。
这张表用来判断主备延迟情况,数据库故障切换时会查询用到这张表。

postgres=# create user keepalived password 'keepalived' CONNECTION LIMIT 4 ;
postgres=# create database keepalived owner keepalived;
postgres=# \c keepalived keepalived
keepalived=> create table sr_delay(id int4, last_alive timestamp(0) without time zone);


#表sr_delay只允许写入一条记录,并且不允许删除此表数据,通过触发器实现。创建触发器函数,如下所示:

CREATE FUNCTION cannt_delete () 
RETURNS trigger 
LANGUAGE plpgsql AS $$ 
BEGIN 
RAISE EXCEPTION 'Table sr_delay can not delete !';
 END; 
$$;

#创建触发器:

CREATE TRIGGER trigger_sr_delay_del 
BEFORE DELETE ON sr_delay 
FOR EACH ROW EXECUTE PROCEDURE cannt_delete() ; 

CREATE TRIGGER trigger_sr_delay_tru BEFORE TRUNCATE ON sr_delay FOR STATEMENT EXECUTE PROCEDURE cannt_delete() ;
 
#插入数据:

INSERT INTO sr_delay VALUES(1,now()) ;

  

后续Keepalived会每隔指定时间探测PostgreSQL数据库存活,并且以Keepalived用户登录Keepalived数据库刷新这张表, 配置主备库pg_hba.conf,增加如下内容:

echo "
host    keepalived    keepalived        10.85.10.0/24      md5
host    keepalived    keepalived        10.85.10.0/24      md5" >> pg_hba.conf 

 

3.安装keepalived

#两个节点

$ wget https://www.keepalived.org/software/keepalived-2.2.7.tar.gz
tar xf keepalived-2.2.7.tar.gz
cd keepalived-2.2.7 
./configure --prefix=/usr/local/keepalived/
# ./configure                        
或 
#./configure    --prefix=/postgres/keepalived-2.2.7  

make
make install

 

3.1 配置 keepalived.conf

vi /etc/keepalived/keepalived.conf

! Configuration File for keepalived
global_defs {
   # 邮件通知信息
   notification_email {
     # 定义收件人
    446464087@qq.com 
   }
   # 定义发件人
   notification_email_from test@cqdba.cn
   # SMTP服务器地址
   smtp_server 10.85.10.51
   smtp_connect_timeout 30
   # 路由器标识,一般不用改,也可以写成每个主机自己的主机名
   router_id huyidb03
   vrrp_skip_check_adv_addr
   vrrp_strict
   vrrp_garp_interval 0
   vrrp_gna_interval 0
}

# 定义用于实例执行的脚本内容,比如可以在线降低优先级,用于强制切换
vrrp_script check_pg_alived {
        script "/etc/keepalived/scirpts/check_pg.sh"
        interval 5
        fall 3 # require 3 failures for KO
}


# 一个vrrp_instance就是定义一个虚拟路由器的,实例名称
vrrp_instance VI_1 {
    # 定义初始状态,可以是MASTER或者BACKUP
    state MASTER
    #非抢占模式
    nopreempt
    # 工作接口,通告选举使用哪个接口进行
    interface ens38
    # 虚拟路由ID,如果是一组虚拟路由就定义一个ID,如果是多组就要定义多个,而且这个虚拟
    # ID还是虚拟MAC最后一段地址的信息,取值范围0-255
    virtual_router_id 51
    #权重 如果你上面定义了MASTER,这里的优先级就需要定义的比其他的高
    priority 100
    #通告频率 单位s
    advert_int 1
    #通信认证机制,这里是明文认证还有一种是加密认证
    authentication {
        auth_type PASS
        auth_pass abcdefgh
    }
    
    # 设置虚拟VIP地址
    virtual_ipaddress {
        10.85.10.77
    }
    # 追踪脚本,通常用于去执行上面的vrrp_script定义的脚本内容
    track_script {
        check_pg_alived
    }
    
    # 如果主机状态变成Master|Backup|Fault之后会去执行的通知脚本,脚本要自己写
    smtp_alert
        notify_master "/etc/keepalived/scirpts/failover.sh"
        notify_fault "/etc/keepalived/scirpts/fault.sh"
}
}

 

以上是Keepalived主节点的配置,Keepalived备节点的priority参数改成90 ,state改为BACKUP, 其余参数配置一样。

3.2 配置check_pg.sh

vi  /etc/keepalived/scirpts/check_pg.sh
#!/bin/bash 
# 配置环境变量 
pgport=5432 
pguser=keepalived 
pgdb=keepalived 
pgpwd='keepalived'
LANG=en_US.utf8 
PGHOME=/postgresql/pg12
LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib 
PATH=$PGHOME/bin:$PATH:.
MONITOR_LOG="/etc/keepalived/log/pg_monitor.log" 
SQL1="UPDATE sr_delay SET last_alive= now();" 
SQL2='SELECT 1;' 
keeplognums=30000
 
#此脚本不检查备库存活状态,如果是备库则退出 
standby_flg=`psql -p$pgport -Upostgres -At -c "SELECT pg_is_in_recovery();"` 
if [ ${standby_flg} == 't' ]; 
then 
 echo -e "`date +%F\ %T`:This is a standby database, exit!\n" > $MONITOR_LOG 
 exit 0 
fi
export PGPASSWORD=$pgpwd
#主库更新sr_delay 表 
echo $SQL1 | psql -At -p $pgport -U $pguser -d $pgdb >> $MONITOR_LOG 
#判断主库是否可用
echo $SQL2 | psql -At -p $pgport -U $pguser -d $pgdb 
if [ $? -eq 0 ] ;
then 
 echo -e "`date +%F\ %T`:Primary db is health." >> $MONITOR_LOG 
 exit 0 
else 
 echo -e "`date +%F\ %T`:Attention: Primary db is not health!" >> $MONITOR_LOG 
 exit 1 
fi

#日志保留 keeplognums 行
if [ ! -f ${MONITOR_LOG} ] ;then touch ${MONITOR_LOG};fi 
lognums=`cat ${MONITOR_LOG} |wc -l`
catnum=$((${lognums} -${keeplognums}))
if [[ $lognums -gt ${keeplognums} ]] ; then sed -i "1,${catnum}d" ${MONITOR_LOG}; fi

 

3.3 配置failover.sh

vi  /etc/keepalived/scirpts/failover.sh

#!/bin/bash 
export PGPORT=5432 
export PGUSER=keepalived 
export PG_OS_USER=postgres 
export PGDBNAME=keepalived 
export LANG=en_US.utf8
export PGPATH=/postgresql/pg12
export PGDATA=/postgresql/data

export PATH=$PATH:$PGPATH/bin
 
LOGFILE='/etc/keepalived/log/failover.log'

# 主备数据库同步时延,单位为秒 
sr_allowed_delay_time=100

SQL1='select pg_is_in_recovery from pg_is_in_recovery();' 
SQL2="select count(1) as delay_time from sr_delay where now()<(last_alive + interval '${sr_allowed_delay_time} seconds');"
#SQL2="select count(1) as delay_time from sr_delay where now()<(last_alive + interval '100 seconds');" 

sleep $sr_allowed_delay_time
db_role=`echo $SQL1 | psql -At -p $PGPORT -U $PGUSER -d $PGDBNAME -w` 
db_sr_delaytime=`echo $SQL2 | psql -p $PGPORT -d $PGDBNAME -U $PGUSER -At -w` 
SWITCH_COMMAND='pg_ctl promote -D $PGDATA' 

# 如果为备库,且延迟大于指定时间则切换为主库
# 判断 是否为备库,为主库退出
if [ ${db_role} == 'f' ];
then 
 echo -e `date +"%F %T"` "Attention:The current postgreSQL DB is master database,cannot switched!" >> $LOGFILE 
 exit 0 
fi

# 判断延迟 时间,   小于延迟时间退出   
if [ $db_sr_delaytime -gt 0 ];
then 
 echo -e `date +"%F %T"` "Attention:The current master database is health,the standby DB cannot switched!" >> $LOGFILE 
 exit 0 
fi

#大于延迟 时间切换
if [ !$db_sr_delaytime ];
then 
 echo -e `date +"%F %T"` "Attention:The current database is statndby,ready to switch master database!" >> $LOGFILE 
 su - $PG_OS_USER -c "$SWITCH_COMMAND" 
elif [ $? -eq 0 ];
then
 echo -e `date +"%F %T"` "success:The current standby database successed to switched the primary PG database !" >> $LOGFILE 
 exit 0 
else 
 echo -e `date +"%F %T"` "Error: the standby database failed to switch the primary PG database !,pelease checked it!" >> $LOGFILE 
 exit 1
fi

 

3.4 配置fault.sh

vi /etc/keepalived/scirpts/fault.sh

#!/bin/bash
LOGFILE=/etc/keepalived/log/pg_db_fault.log 
PGDATA=/postgresql/data
PGPORT=5432 
 
echo -e `date +"%F %T"` "Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!" >> $LOGFILE 
 
PGPID="`head -n1 $PGDATA/postmaster.pid`" 
 
systemctl stop keepalived 
 
kill -9 $PGPID 
if [ $? -eq 0 ] ;
then 
    echo -e `date +"%F %T"` "Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!" >> $LOGFILE
    systemctl stop keepalived 
    exit 1 
fi

 

4.启动keepalived

chmod  -R 755  /etc/keepalived/scirpts/*

systemctl daemon-reload
systemctl enable keepalived
systemctl start keepalived


#状态查询 :
[root@huyidb03 ~]# systemctl status keepalived
● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2022-06-11 11:49:14 CST; 5s ago
     Docs: man:keepalived(8)
           man:keepalived.conf(5)
           man:genhash(1)
           https://keepalived.org
  Process: 97742 ExecStart=/usr/local/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 97743 (keepalived)
   CGroup: /system.slice/keepalived.service
           ├─97612 sleep 100
           ├─97743 /usr/local/sbin/keepalived -D
           ├─97744 /usr/local/sbin/keepalived -D
           ├─97757 /bin/bash /etc/keepalived/scirpts/failover.sh
           └─97758 sleep 100

Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: (VI_1) Receive advertisement timeout
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: (VI_1) Entering MASTER STATE                # 主库上   MASTER  角色
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: (VI_1) setting VIPs.
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: (VI_1) Sending/queueing gratuitous ARPs on ens38 for 10.85.10.77   # 启动了 vip
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 11:49:18 huyidb03 Keepalived_vrrp[97744]: Sending gratuitous ARP on ens38 for 10.85.10.77

 

5.切换演示

5.1 主库日志

#主库停库
postgres@huyidb03-> pg_ctl stop
waiting for server to shut down....2022-06-11 12:03:43.219 CST [98168] LOG:  received fast shutdown request
2022-06-11 12:03:43.220 CST [98168] LOG:  aborting any active transactions
2022-06-11 12:03:43.221 CST [98168] LOG:  background worker "logical replication launcher" (PID 98175) exited with exit code 1
2022-06-11 12:03:43.221 CST [98170] LOG:  shutting down
2022-06-11 12:03:43.239 CST [98168] LOG:  database system is shut down
 done
server stopped
# 查看 keepalived 
[root@huyidb03 ~]# systemctl status keepalived
● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Sat 2022-06-11 12:03:58 CST; 6s ago
     Docs: man:keepalived(8)
           man:keepalived.conf(5)
           man:genhash(1)
           https://keepalived.org
  Process: 98183 ExecStart=/usr/local/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 98184 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/keepalived.service

Jun 11 11:55:56 huyidb03 Keepalived_vrrp[98185]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 11:55:56 huyidb03 Keepalived_vrrp[98185]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:03:47 huyidb03 Keepalived_vrrp[98185]: Script `check_pg_alived` now returning 1
Jun 11 12:03:57 huyidb03 Keepalived_vrrp[98185]: VRRP_Script(check_pg_alived) failed (exited with status 1)
Jun 11 12:03:57 huyidb03 Keepalived_vrrp[98185]: (VI_1) Entering FAULT STATE
Jun 11 12:03:57 huyidb03 Keepalived_vrrp[98185]: (VI_1) sent 0 priority
Jun 11 12:03:57 huyidb03 Keepalived_vrrp[98185]: (VI_1) removing VIPs.
Jun 11 12:03:57 huyidb03 Keepalived[98184]: Stopping
Jun 11 12:03:57 huyidb03 systemd[1]: Stopping LVS and VRRP High Availability Monitor...
Jun 11 12:03:58 huyidb03 systemd[1]: Stopped LVS and VRRP High Availability Monitor.

#查看日志
tail -100f   /etc/keepalived/log/pg_monitor.log  
2022-06-11 12:03:42: Primary db is health.
2022-06-11 12:03:47: Attention: Primary db is not health!
2022-06-11 12:03:52: Attention: Primary db is not health!
2022-06-11 12:03:57: Attention: Primary db is not health!
    
[root@huyidb03 log]# tail -100f /etc/keepalived/log/pg_db_fault.log
2022-06-11 11:52:59 Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!
2022-06-11 12:03:57 Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!

 

5.2 备库日志

tail -100f   /etc/keepalived/log/pg_monitor.log  

[root@huyidb04 log]# cat failover.log
2022-06-11 12:05:38 Attention:The current database is statndby,ready to switch master database!
    
    
[root@huyidb04 log]# tail -100f pg_monitor.log 
 2022-06-11 12:04:38: This is a standby database, exit!\n
UPDATE 1
2022-06-11 12:05:43: Primary db is health.
UPDATE 1
2022-06-11 12:05:48: Primary db is health.
UPDATE 1
        
        [root@huyidb04 ~]# systemctl status keepalived 
● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; disabled; vendor preset: disabled)
   Active: active (running) since Sat 2022-06-11 12:01:38 CST; 9min ago
     Docs: man:keepalived(8)
           man:keepalived.conf(5)
           man:genhash(1)
           https://keepalived.org
  Process: 79603 ExecStart=/usr/local/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 79604 (keepalived)
   CGroup: /system.slice/keepalived.service
           ├─79604 /usr/local/sbin/keepalived -D
           └─79605 /usr/local/sbin/keepalived -D

Jun 11 12:03:58 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:03:58 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:04:01 huyidb04 Keepalived_vrrp[79605]: smtp fd 14 returned write error
Jun 11 12:04:03 huyidb04 Keepalived_vrrp[79605]: (VI_1) Sending/queueing gratuitous ARPs on ens38 for 10.85.10.77
Jun 11 12:04:03 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:04:03 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:04:03 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:04:03 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77
Jun 11 12:04:03 huyidb04 Keepalived_vrrp[79605]: Sending gratuitous ARP on ens38 for 10.85.10.77

 

6. keepalived使用注意事项

  • 启动keepalived 前, 一定要先启动PG, 且同步状态正常
  • 先库动PG主库的keepalived ,在启动PG备库的keepalive
  • keepalive master 机 down掉, VIP会自动切换到 backup 上, 即使主数据库正常 , 这时keepalive 也会将备库切换为新的主库 (如果不是用VIP连接数据库的,就会出现双主,数据不一致的情况)
  • 正常维护的时候 , 先停备库的keepalived ; 再停主库的 keepalived 。 然后在停数据库
  • 避免开机启动keepalived ,当PG还在启动中时,不确认谁是主时,keepaliveed 谁先启动,可能会先成为MSTER 状态。VIP 将在MASTER 上启动.
  • 数据库DOWN掉时,keepalived 发现没有进程, 也会DOWN掉
    当keepalived 自己 down掉,在另一台机器上切换为master 后, 数据库不会立即切换为主, 时间由 failover.sh 脚本中的,sr_allowed_delay_time 参数控制
  • keepalived 只能 ,建议和DB 安装在同一台机器, 原因是在切换的时候需要技持OS 脚本,如:pg_ctl promote -D $PGDATA ,如果在不同机器,需要编写脚本,能远程连过来执行
  • 如何回切呢?-- 原主库启动后,先配置与新从库之间的同步,数据达到一致后,再找个时间进行切换

参考文档

 

 

posted @ 2022-07-29 11:14  www.cqdba.cn  阅读(831)  评论(0编辑  收藏  举报