
KingbaseES V8R6 Cluster Operations Case: a net_device_ip Misconfiguration Causes Cluster Switchover Failure

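Case description:
During switchover and failover tests on a KingbaseES V8R6 cluster, the new primary could not load the VIP, so the switch failed. Inspection showed that the cause was an incorrect net_device_ip setting in repmgr.conf.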

Applicable version:
KingbaseES V8R6

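Cluster nodes (as they appear in the logs below): node1 (ID 1), 192.168.1.201, primary; node2 (ID 2), 192.168.1.202, standby; virtual IP 192.168.1.88/24.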

I. Symptoms

1. Dry-run the primary/standby switchover

[kingbase@node202 bin]$ ./repmgr standby switchover -h 192.168.1.201 -U esrep -d esrep --dry-run
[WARNING] following problems with command line parameters detected:
  database connection parameters not required when executing STANDBY SWITCHOVER
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=internal_rwcmgr options=-csearch_path="
[NOTICE] checking switchover on node "node2" (ID: 2) in --dry-run mode
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[INFO] ES connection to host "192.168.1.201" succeeded
........
[DEBUG] DoRemoteCommand():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/repmgr -f /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/../etc/repmgr.conf node check --terse -LERROR --archive-ready --optformat
[INFO] The output from primary check cmd "repmgr node check --terse -LERROR --archive-ready --optformat" is: "--status=OK --files=0
"
[INFO] 0 pending archive files
[DEBUG] lag is 0
[INFO] replication lag on this standby is 0 seconds
[DEBUG] minimum of 1 free slots (0 for siblings) required; 32 available
[INFO] 1 replication slots required, 32 available
[NOTICE] attempting to pause repmgrd on 2 nodes
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=internal_rwcmgr options=-csearch_path="
[INFO] would pause repmgrd on node "node1" (ID: 1)
[INFO] would pause repmgrd on node "node2" (ID: 2)
[NOTICE] local node "node2" (ID: 2) would be promoted to primary; current primary "node1" (ID: 1) would be demoted to standby
[DEBUG] DoRemoteCommand():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/repmgr -f /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/../etc/repmgr.conf node service --terse -LERROR --list-actions --action=stop
[INFO] following shutdown command would be run on node "node1":
  "/home/kingbase/cluster/R6C8/HAC8/kingbase/bin/sys_ctl  -D '/home/kingbase/cluster/R6C8/HAC8/kingbase/data' -l /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/logfile -W -m fast stop"
[INFO] parameter "shutdown_check_timeout" is set to 60 seconds
[INFO] prerequisites for executing STANDBY SWITCHOVER are met

--- As shown above, the switchover dry run succeeded.
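
Whether a dry run passed can also be checked mechanically from its output. The sketch below is an illustration and not part of the original procedure: the helper name `dry_run_ok` is hypothetical, and the marker string is copied from the last line of the dry-run log above. In this case the dry run passed even though the real switchover later failed while loading the VIP, so a passing dry run does not by itself show that the VIP settings are correct.

```shell
#!/bin/sh
# Sketch: decide whether repmgr dry-run output reports success.
# dry_run_ok is a hypothetical helper; the marker text is taken from
# the final [INFO] line of the dry-run log shown in this article.

dry_run_ok() {
    # Reads repmgr output on stdin and succeeds only if the
    # "prerequisites ... are met" message is present.
    grep -q 'prerequisites for executing STANDBY SWITCHOVER are met'
}
```

One possible use is `./repmgr standby switchover --dry-run 2>&1 | dry_run_ok && echo "dry run passed"`.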

2. Run the primary/standby switchover
As shown below, when the switchover is run, the new primary cannot load the VIP.

[kingbase@node202 bin]$ ./repmgr standby switchover -h 192.168.1.201 -U esrep -d esrep
[WARNING] following problems with command line parameters detected:
  database connection parameters not required when executing STANDBY SWITCHOVER
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=internal_rwcmgr options=-csearch_path="
[NOTICE] executing switchover on node "node2" (ID: 2)
........
[DEBUG] DoRemoteCommand():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/repmgr -f /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/../etc/repmgr.conf node check --terse -LERROR --archive-ready --optformat
[INFO] The output from primary check cmd "repmgr node check --terse -LERROR --archive-ready --optformat" is: "--status=OK --files=0
"
[DEBUG] lag is 0
[DEBUG] minimum of 1 free slots (0 for siblings) required; 32 available
[NOTICE] attempting to pause repmgrd on 2 nodes
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=internal_rwcmgr options=-csearch_path="
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=internal_rwcmgr options=-csearch_path="
[INFO] pausing repmgrd on node "node1" (ID 1)
[INFO] pausing repmgrd on node "node2" (ID 2)
[NOTICE] local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
[NOTICE] stopping current primary node "node1" (ID: 1)
........
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/repmgr -f /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/../etc/repmgr.conf node status --is-shutdown-cleanly
[NOTICE] current primary has been cleanly shut down at location 0/A1000028
[DEBUG] local node last receive LSN is 0/A10000A0, primary shutdown checkpoint LSN is 0/A1000028
[NOTICE] PING 192.168.1.88 (192.168.1.88) 56(84) bytes of data.

--- 192.168.1.88 ping statistics ---
2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1000ms
[WARNING] ping host"192.168.1.88" failed
[DETAIL] average RTT value is not greater than zero
[2023-12-13 17:17:51] [ERROR] ip address '192.168.1.201' does not exists on localhost dev enp0s3. skip to load vip.
[INFO] loadvip result: 0, arping result: 0
[ERROR] new primary node (ID: 2) acquire the virtual ip 192.168.1.88/24 failed
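
The first [ERROR] line says that 192.168.1.201 is not configured on the local device enp0s3, and that loading the VIP was skipped for that reason. A similar check can be reproduced by hand. The following is a minimal sketch, not repmgr's actual implementation: the helper name `ip_on_dev` is hypothetical, and it assumes the one-line output format of `ip -o -4 addr show dev <device>`.

```shell
#!/bin/sh
# Sketch: report whether an IPv4 address is configured on a device.
# ip_on_dev is a hypothetical helper. It reads the text printed by
# `ip -o -4 addr show dev <device>` on stdin.

ip_on_dev() {
    # $1: the IPv4 address to look for (without prefix length).
    # In one-line (-o) output, field 3 is "inet" and field 4 is ADDR/PREFIX.
    awk -v ip="$1" '
        $3 == "inet" { split($4, a, "/"); if (a[1] == ip) found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

On node202 this could be run as `ip -o -4 addr show dev enp0s3 | ip_on_dev 192.168.1.201 || echo "address not on enp0s3"`.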

As shown in the figure below, the VIP failed to load:

3. Run the failover switch
As shown in the figure below, the new primary also failed to load the VIP during the failover switch.

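II. Problem analysis
1. Check the VIP on the original primary host
As shown below, the VIP has already been removed from the original primary host.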

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP qlen 1000
    link/ether 08:00:27:df:15:2c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.201/24 brd 192.168.1.255 scope global enp0s3
       valid_lft forever preferred_lft forever

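2. Manually load the VIP on the standby host as a test
As shown below, after the VIP was loaded manually on the standby host, the arping test succeeded.

# Check the owner and permission settings of ip and arping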
[kingbase@node202 kingbase]$ which ip
/usr/sbin/ip
[kingbase@node202 kingbase]$ ls -lh /usr/sbin/ip
-rwsr-xr-x. 1 root root 319K Nov 20  2015 /usr/sbin/ip

[kingbase@node202 kingbase]$ ls -lh bin/arping
-rwsr-xr-x 1 root root 14K Sep  2 04:17 bin/arping
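
Both listings show the binaries owned by root with the setuid bit set (the `s` in `-rwsr-xr-x`), which is presumably what lets the kingbase user manage the VIP without being root. The same check can be scripted. This is a minimal sketch; the helper names `owner_uid` and `is_setuid_root` are hypothetical.

```shell
#!/bin/sh
# Sketch: check that a binary is owned by root and has the setuid bit set.
# owner_uid and is_setuid_root are hypothetical helper names.

owner_uid() {
    # ls -n prints numeric IDs (field 3 is the owner uid);
    # -L follows a symlink to the real file.
    ls -nL "$1" | awk '{ print $3 }'
}

is_setuid_root() {
    # [ -u FILE ] is true when FILE has the setuid bit set.
    [ -u "$1" ] && [ "$(owner_uid "$1")" = "0" ]
}
```

For example, `is_setuid_root /usr/sbin/ip && is_setuid_root bin/arping` would be expected to succeed given the permissions listed above.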

# Manually load the VIP
[root@node202 ~]# ip add add 192.168.1.88/24 dev enp0s3:0
......
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:4c:18:12 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.202/24 brd 192.168.1.255 scope global enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.1.88/24 scope global secondary enp0s3
    
# Run the arping test
[kingbase@node202 bin]$ ./arping -U 192.168.1.88 -I enp0s3 -w 5 -c 3
Success to send 3 packets

3. Check the repmgr.conf settings on the standby

virtual_ip='192.168.1.88'
net_device='enp0s3'
net_device_ip='192.168.1.201'
arping_path='/home/kingbase/cluster/R6C8/HAC8/kingbase/bin'
ipaddr_path='/usr/sbin'

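As the settings above show, net_device_ip in the standby's repmgr.conf is set to the primary's IP address (192.168.1.201), which does not exist on the standby's own enp0s3. This matches the error reported during the switchover.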
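
To catch this kind of mismatch before a switchover, the VIP-related settings can be read from repmgr.conf and compared with the addresses actually present on the node. The following is a minimal sketch under stated assumptions: the helper name `conf_value` is hypothetical, and it assumes simple `key='value'` lines like the ones shown above, with `#` starting a comment line.

```shell
#!/bin/sh
# Sketch: read one setting from a repmgr.conf-style file.
# conf_value is a hypothetical helper. It assumes lines of the form
# key='value' or key=value; if a key appears twice, the last one wins.

conf_value() {
    # $1: key name, $2: path to the configuration file
    awk -v key="$1" -v q="'" '
        /^[ \t]*#/ { next }                  # skip comment lines
        {
            eq = index($0, "=")
            if (eq == 0) next                # not an assignment
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/[ \t]/, "", k)
            if (k != key) next
            sub("^[ \t]*" q "?", "", v)      # strip leading blanks and quote
            sub(q "?[ \t]*$", "", v)         # strip trailing quote and blanks
            val = v
        }
        END { if (val != "") print val }
    ' "$2"
}
```

With the file shown above, `conf_value net_device_ip ../etc/repmgr.conf` would print 192.168.1.201, and `conf_value net_device ../etc/repmgr.conf` would print enp0s3. Comparing the first value against the output of `ip -o -4 addr show dev enp0s3` on the same node then shows whether the address is really local.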