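KingbaseES V8R6 Cluster Operations Case Study: Cluster Switchover Failure Caused by a Misconfigured net_device_ip
Case description:
During switchover and failover tests on a KingbaseES V8R6 cluster, the VIP could not be loaded on the new primary, so the switchover failed. Investigation showed that the cause was an incorrect net_device_ip setting in repmgr.conf.
Applicable version: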
KingbaseES V8R6
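Cluster nodes (as they appear in the logs below):
node1 (ID: 1), 192.168.1.201, primary
node2 (ID: 2), 192.168.1.202, standby
VIP: 192.168.1.88/24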
I. Symptoms
1. Run a switchover dry run
[kingbase@node202 bin]$ ./repmgr standby switchover -h 192.168.1.201 -U esrep -d esrep --dry-run
[WARNING] following problems with command line parameters detected:
database connection parameters not required when executing STANDBY SWITCHOVER
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=internal_rwcmgr options=-csearch_path="
[NOTICE] checking switchover on node "node2" (ID: 2) in --dry-run mode
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[INFO] ES connection to host "192.168.1.201" succeeded
........
[DEBUG] DoRemoteCommand():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/repmgr -f /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/../etc/repmgr.conf node check --terse -LERROR --archive-ready --optformat
[INFO] The output from primary check cmd "repmgr node check --terse -LERROR --archive-ready --optformat" is: "--status=OK --files=0
"
[INFO] 0 pending archive files
[DEBUG] lag is 0
[INFO] replication lag on this standby is 0 seconds
[DEBUG] minimum of 1 free slots (0 for siblings) required; 32 available
[INFO] 1 replication slots required, 32 available
[NOTICE] attempting to pause repmgrd on 2 nodes
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=internal_rwcmgr options=-csearch_path="
[INFO] would pause repmgrd on node "node1" (ID: 1)
[INFO] would pause repmgrd on node "node2" (ID: 2)
[NOTICE] local node "node2" (ID: 2) would be promoted to primary; current primary "node1" (ID: 1) would be demoted to standby
[DEBUG] DoRemoteCommand():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/repmgr -f /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/../etc/repmgr.conf node service --terse -LERROR --list-actions --action=stop
[INFO] following shutdown command would be run on node "node1":
"/home/kingbase/cluster/R6C8/HAC8/kingbase/bin/sys_ctl -D '/home/kingbase/cluster/R6C8/HAC8/kingbase/data' -l /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/logfile -W -m fast stop"
[INFO] parameter "shutdown_check_timeout" is set to 60 seconds
[INFO] prerequisites for executing STANDBY SWITCHOVER are met
--- As shown above, the switchover dry run succeeded.
2. Run the actual switchover
As shown below, when the switchover is executed, the new primary cannot load the VIP.
[kingbase@node202 bin]$ ./repmgr standby switchover -h 192.168.1.201 -U esrep -d esrep
[WARNING] following problems with command line parameters detected:
database connection parameters not required when executing STANDBY SWITCHOVER
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=internal_rwcmgr options=-csearch_path="
[NOTICE] executing switchover on node "node2" (ID: 2)
........
[DEBUG] DoRemoteCommand():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/repmgr -f /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/../etc/repmgr.conf node check --terse -LERROR --archive-ready --optformat
[INFO] The output from primary check cmd "repmgr node check --terse -LERROR --archive-ready --optformat" is: "--status=OK --files=0
"
[DEBUG] lag is 0
[DEBUG] minimum of 1 free slots (0 for siblings) required; 32 available
[NOTICE] attempting to pause repmgrd on 2 nodes
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=internal_rwcmgr options=-csearch_path="
[INFO] pausing repmgrd on node "node1" (ID 1)
[INFO] pausing repmgrd on node "node2" (ID 2)
[NOTICE] local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
[NOTICE] stopping current primary node "node1" (ID: 1)
........
ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/repmgr -f /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/../etc/repmgr.conf node status --is-shutdown-cleanly
[NOTICE] current primary has been cleanly shut down at location 0/A1000028
[DEBUG] local node last receive LSN is 0/A10000A0, primary shutdown checkpoint LSN is 0/A1000028
[NOTICE] PING 192.168.1.88 (192.168.1.88) 56(84) bytes of data.
--- 192.168.1.88 ping statistics ---
2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1000ms
[WARNING] ping host"192.168.1.88" failed
[DETAIL] average RTT value is not greater than zero
[2023-12-13 17:17:51] [ERROR] ip address '192.168.1.201' does not exists on localhost dev enp0s3. skip to load vip.
[INFO] loadvip result: 0, arping result: 0
[ERROR] new primary node (ID: 2) acquire the virtual ip 192.168.1.88/24 failed
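As the [ERROR] lines above show, the VIP failed to load.
3. Run a failover
During a failover, the new primary likewise failed to load the VIP.
II. Analysis
1. Check the VIP on the original primary host
As shown below, the VIP has already been removed from the original primary host.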
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP qlen 1000
link/ether 08:00:27:df:15:2c brd ff:ff:ff:ff:ff:ff
inet 192.168.1.201/24 brd 192.168.1.255 scope global enp0s3
valid_lft forever preferred_lft forever
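2. Manually load the VIP on the standby host
As shown below, after the VIP is loaded manually on the standby host, the arping test succeeds.
# Check the owner and permissions of ip and arping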
[kingbase@node202 kingbase]$ which ip
/usr/sbin/ip
[kingbase@node202 kingbase]$ ls -lh /usr/sbin/ip
-rwsr-xr-x. 1 root root 319K Nov 20 2015 /usr/sbin/ip
[kingbase@node202 kingbase]$ ls -lh bin/arping
-rwsr-xr-x 1 root root 14K Sep 2 04:17 bin/arping
# Load the VIP manually
[root@node202 ~]# ip add add 192.168.1.88/24 dev enp0s3:0
......
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 08:00:27:4c:18:12 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.202/24 brd 192.168.1.255 scope global enp0s3
valid_lft forever preferred_lft forever
inet 192.168.1.88/24 scope global secondary enp0s3
# Run the arping test
[kingbase@node202 bin]$ ./arping -U 192.168.1.88 -I enp0s3 -w 5 -c 3
Success to send 3 packets
3. Check repmgr.conf on the standby
virtual_ip='192.168.1.88'
net_device='enp0s3'
net_device_ip='192.168.1.201'
arping_path='/home/kingbase/cluster/R6C8/HAC8/kingbase/bin'
ipaddr_path='/usr/sbin'
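As the settings above show, net_device_ip in the standby's repmgr.conf is set to the primary's IP (192.168.1.201) rather than the standby's own address (192.168.1.202).
This matches the error reported during the switchover: when loading the VIP on the standby, repmgr reports that the physical IP '192.168.1.201' does not exist on device enp0s3, and it skips loading the VIP.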
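A mismatch like this can be caught on each node before a switchover is attempted. The following is a minimal sketch, not part of KingbaseES or repmgr: the function names conf_value, ip_on_device and check_net_device_ip are hypothetical helpers, and the sketch assumes repmgr.conf uses the KEY='value' form shown above and that the ip command is on the PATH.

```shell
#!/bin/sh
# Hypothetical pre-flight check: is net_device_ip really configured on net_device?
# None of these functions are repmgr commands; they are illustrative helpers.

# Print the value of a key from a repmgr.conf-style file (KEY='value'), quotes stripped.
conf_value() {
    # $1 = config file, $2 = key
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1 |
        sed "s/[[:space:]]*#.*\$//; s/^'//; s/'\$//"
}

# Read `ip addr show` output on stdin; succeed if the given IPv4 address is listed.
ip_on_device() {
    # $1 = IPv4 address to look for (without the /prefix)
    awk -v want="$1" '
        { for (i = 1; i < NF; i++)
              if ($i == "inet") { split($(i + 1), a, "/"); if (a[1] == want) found = 1 } }
        END { exit !found }
    '
}

# Compare the configured net_device_ip with the addresses on the local net_device.
check_net_device_ip() {
    # $1 = path to repmgr.conf
    dev=$(conf_value "$1" net_device)
    addr=$(conf_value "$1" net_device_ip)
    if ip -o -4 addr show dev "$dev" | ip_on_device "$addr"; then
        echo "OK: $addr is configured on $dev"
    else
        echo "MISMATCH: net_device_ip=$addr is not configured on $dev" >&2
        return 1
    fi
}
```

After sourcing these functions on a node, the check could be run as `check_net_device_ip /home/kingbase/cluster/R6C8/HAC8/kingbase/etc/repmgr.conf`. With the configuration from this case, the standby would be expected to report a mismatch, because 192.168.1.201 is not an address on its enp0s3.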
III. Resolution
After net_device_ip on the standby was changed to the standby's own IP (192.168.1.202), the problem was resolved.
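The original post shows this change only as a screenshot. As a rough sketch of how the edit might be applied from the command line, the hypothetical helper set_net_device_ip below keeps a backup copy and rewrites only the net_device_ip line. It assumes the KEY='value' format seen in repmgr.conf above. Whether a running repmgrd has to be restarted to pick up the new value is not stated in this case, so verify that for your version.

```shell
#!/bin/sh
# Hypothetical helper, not a KingbaseES tool: set net_device_ip in repmgr.conf.
set_net_device_ip() {
    # $1 = path to repmgr.conf, $2 = this node's own physical IP
    # Keep a backup (<file>.bak), then rewrite only the net_device_ip line in place.
    cp -p "$1" "$1.bak" &&
        sed "s/^[[:space:]]*net_device_ip[[:space:]]*=.*/net_device_ip='$2'/" "$1.bak" > "$1"
}
```

On the standby in this case the call would be `set_net_device_ip /home/kingbase/cluster/R6C8/HAC8/kingbase/etc/repmgr.conf 192.168.1.202`. Each node needs its own address here, so the file cannot simply be copied unchanged from the primary.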
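IV. Summary
The root cause of this failure was a misconfigured repmgr.conf, which made the cluster switchover fail. In important production environments, critical database and cluster settings are best handled by two people: one makes the change and the other reviews it.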