KingbaseES V8R6 Cluster Operations Case Study: Misconfigured VIP Causes Cluster Failover to Fail

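Case description:
In a KingbaseES V8R6 cluster, the VIP is configured in repmgr.conf. This case tests how manually unloading and loading the VIP affects the unloading and loading of the VIP during a failover.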

Applicable version:
KingbaseES V8R6

I. Cluster node status

[kingbase@node101 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                       
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | primary | * running |          | default  | 100      | 51       | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | standby |   running | node101  | default  | 100      | 50       | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

II. Cluster VIP configuration

1. Check the VIP loaded on the host

[kingbase@node101 bin]$ ip add sh
.......
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:bd:83:57 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.101/24 brd 192.168.1.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.1.254/24 scope global secondary enp0s3:3
       valid_lft forever preferred_lft forever
       
--- As shown above, the primary host has the VIP 192.168.1.254/24 loaded.

2. Check the cluster VIP configuration

[kingbase@node101 bin]$ cat ../etc/repmgr.conf|grep -i vir
virtual_ip='192.168.1.254/24'
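
The lookup above can be wrapped in a small helper that returns only the value, which is convenient when the configured VIP has to be compared with what the host has loaded. This is a minimal sketch: the function name `configured_vip` is hypothetical (it is not a KingbaseES tool), and it assumes the single-line `virtual_ip='...'` form shown in the grep output.

```shell
# configured_vip FILE
# Hypothetical helper, not part of KingbaseES: print the virtual_ip value
# from a repmgr.conf-style file with the surrounding quotes stripped.
# Assumes a single "virtual_ip=..." entry; commented-out lines are ignored.
configured_vip() {
    sed -n "s/^[[:space:]]*virtual_ip[[:space:]]*=[[:space:]]*['\"]\{0,1\}\([^'\"#[:space:]]*\).*/\1/p" "$1" | head -n 1
}
```

For the configuration in this case, `configured_vip ../etc/repmgr.conf` prints `192.168.1.254/24`.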

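III. Manual VIP unload test

1. Unload the VIP on the primary

# As shown below, the netmask (prefix length) must be specified when unloading the VIP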
[root@node101 cron.d]# ip add delete 192.168.1.254/24 dev enp0s3

[root@node101 cron.d]# ip add sh
.......
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:bd:83:57 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.101/24 brd 192.168.1.255 scope global noprefixroute enp0s3

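2. Check cluster node status

Tips:
As shown below, unloading the VIP on the primary does not affect the cluster; the cluster status remains normal.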

[kingbase@node101 bin]$ ./repmgr cluster show

 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                       
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | primary | * running |          | default  | 100      | 51       | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | standby |   running | node101  | default  | 100      | 50       | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

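3. Automatic VIP reload

As shown below, when the cluster detects that the VIP is missing on the primary, it reloads the VIP automatically.

1) Check the VIP on the host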

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:bd:83:57 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.101/24 brd 192.168.1.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.1.254/24 scope global secondary enp0s3:3
       valid_lft forever preferred_lft forever
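--- As shown above, after the VIP was manually unloaded, the cluster loaded it again automatically.

2) Check the cluster log

As shown below, when pinging the VIP reveals that it is missing, the cluster tries to reload it automatically.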

[2023-03-09 17:47:05] [NOTICE] found primary node lost virtual_ip, try to acquire virtual_ip
[2023-03-09 17:47:07] [NOTICE] PING 192.168.1.254 (192.168.1.254) 56(84) bytes of data.

--- 192.168.1.254 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1005ms

[2023-03-09 17:47:07] [WARNING] ping host"192.168.1.254" failed
[2023-03-09 17:47:07] [DETAIL] average RTT value is not greater than zero
[2023-03-09 17:47:07] [DEBUG] executing:
  /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A loadvip
[2023-03-09 17:47:07] [DEBUG] result of command was 0 (0)
[2023-03-09 17:47:07] [DEBUG] local_command(): no output returned
[2023-03-09 17:47:07] [DEBUG] executing:
  /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A arping
[2023-03-09 17:47:07] [DEBUG] result of command was 0 (0)
[2023-03-09 17:47:07] [DEBUG] local_command(): no output returned
[2023-03-09 17:47:07] [INFO] loadvip result: 1, arping result: 1

[2023-03-09 17:47:07] [NOTICE] acquire the virtual ip 192.168.1.254/24 success on localhost
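
The log shows two steps after the failed ping: `kbha -A loadvip`, then `kbha -A arping`. The source does not show what `kbha` runs internally, so the sketch below is only an assumption about an equivalent manual sequence using standard iproute2 and iputils commands: add the address under the `enp0s3:3` label seen in the `ip add sh` output, then send gratuitous ARP so that neighbors refresh their ARP caches. The function name and the ARP count of 3 are hypothetical, and the helper only prints the commands rather than executing them.

```shell
# vip_acquire_commands CIDR DEV LABEL
# Hypothetical dry-run helper: print (do not execute) commands that load a
# VIP and then announce it with gratuitous ARP.
# Assumption: this mirrors the loadvip -> arping order in the log; the
# commands kbha actually runs are not shown in the source.
vip_acquire_commands() {
    cidr=$1
    dev=$2
    label=$3
    echo "ip addr add $cidr dev $dev label $label"
    # ${cidr%/*} strips the prefix length, leaving the bare address.
    echo "arping -U -c 3 -I $dev ${cidr%/*}"
}
```

Calling `vip_acquire_commands 192.168.1.254/24 enp0s3 enp0s3:3` prints the two commands for review; they would need root privileges to run.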

IV. Manual VIP load test (changed subnet mask)

1. Load the VIP with a different subnet mask

[root@node101 cron.d]# ip add delete 192.168.1.254/24 dev enp0s3
[root@node101 cron.d]# ip add add 192.168.1.254/32 dev enp0s3:3
[root@node101 cron.d]# ip add sh
......
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:bd:83:57 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.101/24 brd 192.168.1.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.1.254/32 scope global enp0s3
       valid_lft forever preferred_lft forever
       
--- As shown above, the VIP was manually unloaded and then loaded again with a different subnet mask (192.168.1.254/32).
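
Reading the prefix length out of the full `ip add sh` listing by eye is error-prone. As a minimal sketch, the hypothetical helper below (`loaded_cidr` is not a KingbaseES tool) filters the one-line output of `ip -o -4 addr show` and prints the address together with the prefix it is actually loaded with.

```shell
# loaded_cidr IP
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print
# the address/prefix under which IP is currently loaded (nothing if absent).
# In the one-line format, field 3 is "inet" and field 4 is ADDRESS/PREFIX.
loaded_cidr() {
    awk -v ip="$1" '$3 == "inet" { split($4, a, "/"); if (a[1] == ip) print $4 }'
}
```

Usage: `ip -o -4 addr show dev enp0s3 | loaded_cidr 192.168.1.254`. In the host state shown above, this should report `192.168.1.254/32` rather than the `/24` configured in repmgr.conf.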

2. Run the failover test

1) Stop the database service on the primary

[kingbase@node101 bin]$ ./sys_ctl stop -D /data/kingbase/r6ha/data/
waiting for server to shut down.... done
server stopped

2) Check the IP configuration on the primary

As shown below, the VIP on the primary was not unloaded.

[root@node101 cron.d]# ip add sh

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:bd:83:57 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.101/24 brd 192.168.1.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.1.254/32 scope global enp0s3

3) Check hamgr.log on the standby

[2023-03-09 17:52:28] [INFO] try to ping the trusted_servers "192.168.1.1" before execute promote_command
[2023-03-09 17:52:30] [NOTICE] PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.

--- 192.168.1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1008ms
rtt min/avg/max/mdev = 0.231/0.287/0.343/0.056 ms

[2023-03-09 17:52:30] [NOTICE] successfully ping one or more of the trusted_servers "192.168.1.1"
[2023-03-09 17:52:30] [DEBUG] test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.101 /bin/true 2>/dev/null
[2023-03-09 17:52:30] [NOTICE] try to stop old primary db (host: "192.168.1.101")
[2023-03-09 17:52:30] [DEBUG] remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.101 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A stopdb
[2023-03-09 17:52:30] [DEBUG] remote_command(): no output returned
[2023-03-09 17:52:32] [NOTICE] PING 192.168.1.254 (192.168.1.254) 56(84) bytes of data.

--- 192.168.1.254 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.472/0.505/0.538/0.033 ms

[2023-03-09 17:52:32] [WARNING] the virtual ip is already on other host, try to release it on old primary node (host: "192.168.1.101")
[2023-03-09 17:52:32] [DEBUG] test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.101 /bin/true 2>/dev/null

[2023-03-09 17:52:32] [INFO] ES connection to host "192.168.1.101" succeeded, ready to release vip on it
[2023-03-09 17:52:32] [DEBUG] remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.101 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A check_ip --ip 192.168.1.254
[2023-03-09 17:52:32] [DEBUG] remote_command(): output returned was:
1

[2023-03-09 17:52:32] [DEBUG] remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.101 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A unloadvip
RTNETLINK answers: Cannot assign requested address
[2023-03-09 17:52:32] [DEBUG] remote_command(): no output returned
[2023-03-09 17:52:32] [WARNING] old primary node (host: "192.168.1.101") release the virtual ip 192.168.1.254/24 failed
[2023-03-09 17:52:32] [NOTICE] the time from the first failure to acquire VIP is 2 seconds (max 60 seconds), try agian
[2023-03-09 17:52:32] [NOTICE] will acquire the virtual ip again
[2023-03-09 17:52:34] [NOTICE] PING 192.168.1.254 (192.168.1.254) 56(84) bytes of data.
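
As the log above shows, during failover the standby connects remotely to the primary and tries to unload the VIP. The standby reads the VIP address from repmgr.conf as 192.168.1.254/24, while the primary currently has 192.168.1.254/32 loaded. The two addresses do not match, so the VIP cannot be unloaded (RTNETLINK answers: Cannot assign requested address) and the failover fails.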
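
The failure comes down to a string-level difference between the configured and the loaded address, so it can be caught before a failover ever happens. The following is a minimal sketch of such a check, not something provided by KingbaseES: the function name `check_vip_prefix`, its messages, and its return codes are all hypothetical. It takes the configured value (from repmgr.conf) and the loaded value (from the `ip` output) as arguments.

```shell
# check_vip_prefix CONFIGURED_CIDR LOADED_CIDR
# Hypothetical pre-failover check: succeed only when the loaded VIP matches
# the configured one exactly, including the prefix length.
# Returns 0 on match, 1 on mismatch, 2 when no VIP is loaded.
check_vip_prefix() {
    configured=$1
    loaded=$2
    if [ -z "$loaded" ]; then
        echo "VIP ${configured%/*} is not loaded"
        return 2
    fi
    if [ "$configured" = "$loaded" ]; then
        echo "OK: $loaded matches repmgr.conf"
        return 0
    fi
    echo "MISMATCH: repmgr.conf has $configured but host has $loaded"
    return 1
}
```

With the values from this case, `check_vip_prefix 192.168.1.254/24 192.168.1.254/32` reports a mismatch and returns 1, which a monitoring job could treat as an alert.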

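V. Summary

    1. If the VIP is manually unloaded on the primary, the cluster does not fail over; it detects the missing VIP and loads it on the primary again automatically.
    2. If the VIP address loaded on the primary differs from the one configured in repmgr.conf, the VIP cannot be unloaded during a failover, and the failover fails.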
posted @ 2023-05-19 15:19  KINGBASE研究院