KingbaseES V8R6集群运维案例之---级联备库架构switchover测试
案例说明:
KingbaseES V8R6集群在级联备库架构下,执行switchover切换测试。
测试版本:
[kingbase@node201 bin]$ ./ksql -V
ksql (Kingbase) V008R006C008B0014
一、集群架构
如下所示,node3节点为级联备库,其upstream节点为node2:
[kingbase@node202 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 67 | | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | standby | running | node1 | default | 100 | 67 | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
3 | node3 | standby | running | node2 | default | 100 | 67 | 0 bytes | host=192.168.1.203 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
集群流复制信息:
# node1和node2节点主备了流复制
test=# select * from sys_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backen
d_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flus
h_lag | replay_lag | sync_priority | sync_state | reply_time
-------+----------+---------+------------------+---------------+-----------------+-------------+---------------
----------------+--------------+-----------+----------+------------+------------+------------+-----------+-----
------+------------+---------------+------------+-------------------------------
10981 | 16384 | esrep | node2 | 192.168.1.202 | | 55232 | 2024-09-13 11:
26:15.897739+08 | | streaming | | 3/12017580 | 3/12017580 | 3/12017580 | |
| | 1 | quorum | 2024-09-13 11:45:55.433441+08
(1 row)
test=# select * from sys_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmi
n | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+-------+------------
--+-------------+---------------------
repmgr_slot_2 | | physical | | | f | t | 10981 | 16557 |
| 3/12017580 |
(1 row)
# node2和node3主备流复制
test=# select * from sys_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend
_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush
_lag | replay_lag | sync_priority | sync_state | reply_time
------+----------+---------+------------------+---------------+-----------------+-------------+----------------
---------------+--------------+-----------+----------+------------+------------+------------+-----------+------
-----+------------+---------------+------------+-------------------------------
7487 | 16384 | esrep | node3 | 192.168.1.203 | | 57416 | 2024-09-13 11:4
4:04.561940+08 | | streaming | | 3/12017580 | 3/12017580 | 3/12017580 | |
| | 0 | async | 2024-09-13 11:46:16.602821+08
(1 row)
test=# select * from sys_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmi
n | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+-------+------------
--+-------------+---------------------
repmgr_slot_3 | | physical | | | f | t | 7487 | 16557 |
| 3/12017580 |
(1 row)
二、node2节点执行switchover切换
1、在node2节点执行switchover
[kingbase@node202 bin]$ ./repmgr standby switchover -h 192.168.1.201 -U esrep -d esrep
2、切换后集群状态
如下所示,node2为新主库节点,node1和node3为备库节点:
[kingbase@node202 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | standby | running | node2 | default | 100 | 67 | 0 bytes | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | primary | * running | | default | 100 | 68 | | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
3 | node3 | standby | running | node2 | default | 100 | 68 | 0 bytes | host=192.168.1.203 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
三、node1执行switchvoer切换
1、在node1节点执行switchover
[kingbase@node201 bin]$ ./repmgr standby switchover -h 192.168.1.202 -U esrep -d esrep
2、切换后集群状态
如下所示,切换后node1成为新的主库,node3仍作为级联备库,其upstream节点为node2:
[kingbase@node201 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 69 | | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | standby | running | node1 | default | 100 | 68 | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
3 | node3 | standby | running | node2 | default | 100 | 68 | 0 bytes | host=192.168.1.203 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
如下图所示:
四、总结
在KingbaseES V8R6集群执行switchover切换时,仍能保留原有的级联备库的架构。
附件:switchover切换失败案例
1、问题现象
如下所示,node2执行switchover切换失败,在node1原主库节点卸载vip失败:
[kingbase@node202 bin]$ ./repmgr standby switchover -h 192.168.1.201 -U esrep -d esrep
.......
[NOTICE] PING 192.168.1.88 (192.168.1.88) 56(84) bytes of data.
--- 192.168.1.88 ping statistics ---
3 packets transmitted, 2 received, 33% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.506/0.545/0.585/0.045 ms
[WARNING] the virtual ip is already on other host, try to release it on old primary node (host: "192.168.1.201")
[DEBUG] DoRemoteCommand():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/kbha -A check_ip --ip 192.168.1.88
[DEBUG] DoRemoteCommand():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/kbha -A unloadvip
RTNETLINK answers: Cannot assign requested address
[WARNING] old primary node (host: "192.168.1.201") release the virtual ip 192.168.1.88/24 failed
切换失败后集群状态:
[kingbase@node202 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+---------------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | ? unreachable | ? | default | 100 | | | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | standby | running | ? node1 | default | 100 | 67 | ? | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
3 | node3 | standby | running | node2 | default | 100 | 67 | 0 bytes | host=192.168.1.203 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
[WARNING] following issues were detected
- unable to connect to node "node1" (ID: 1)
- node "node1" (ID: 1) is registered as an active primary but is unreachable
- unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
- unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)
[HINT] execute with --verbose option to see connection error messages
2、问题分析
1)查看主库节点vip
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:df:15:2c brd ff:ff:ff:ff:ff:ff
inet 192.168.1.201/24 brd 192.168.1.255 scope global noprefixroute enp0s3
valid_lft forever preferred_lft forever
inet 192.168.1.88/32 scope global enp0s3
如下图所示,主库节点vip的掩码为32:
2)查看备库vip配置
virtual_ip='192.168.1.88/24' ##vip掩码为24
net_device='enp0s3'
net_device_ip='192.168.1.202'
arping_path='/home/kingbase/cluster/R6C8/HAC8/kingbase/bin'
ipaddr_path='/usr/sbin'
3)执行vip卸载命令
如下,在switchover切换时,远程执行主库节点vip卸载失败:
# 检查主库节点vip加载状态
[kingbase@node202 bin]$ ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/kbha -A check_ip --ip 192.168.1.88
1
# 执行远程连接主库卸载vip
[kingbase@node202 bin]$ ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/kbha -A unloadvip
RTNETLINK answers: Cannot assign requested address
4)分析总结
从以上分析可以获悉,主库vip为192.168.1.88/32,而备库配置文件vip为192.168.1.88/24,ip的子网掩码不同,导致备库执行远程卸载vip时,按照配置文件无法获取到:192.168.1.88/24的vip,故在原主库节点卸载vip失败。
3、问题解决
1)在主库节点手工卸载vip
[root@node201 ~]# ip add add 192.168.1.88/24 dev enp0s3
2)重启集群
如下所示,重启集群后主角库节点vip加载正常:
[kingbase@node201 bin]$ ip add sh
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:df:15:2c brd ff:ff:ff:ff:ff:ff
inet 192.168.1.201/24 brd 192.168.1.255 scope global noprefixroute enp0s3
valid_lft forever preferred_lft forever
inet 192.168.1.88/24 scope global secondary enp0s3:3
如下图所示,主库节点vip掩码为24:(原vip疑似手工加载)
3)重新执行switchover成功。