KingbaseES V8R6集群运维案例之---级联备库架构switchover测试

案例说明:
KingbaseES V8R6集群在级联备库架构下,执行switchover切换测试。
测试版本:

[kingbase@node201 bin]$ ./ksql -V
ksql (Kingbase) V008R006C008B0014

一、集群架构
如下所示,node3节点为级联备库,其upstream节点为node2:

[kingbase@node202 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                   
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 67       |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running | node1    | default  | 100      | 67       | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 3  | node3 | standby |   running | node2    | default  | 100      | 67       | 0 bytes | host=192.168.1.203 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

集群流复制信息:

# node1和node2节点主备了流复制
test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backen
d_start         | backend_xmin |   state   | sent_lsn | write_lsn  | flush_lsn  | replay_lsn | write_lag | flus
h_lag | replay_lag | sync_priority | sync_state |          reply_time
-------+----------+---------+------------------+---------------+-----------------+-------------+---------------
----------------+--------------+-----------+----------+------------+------------+------------+-----------+-----
------+------------+---------------+------------+-------------------------------
 10981 |    16384 | esrep   | node2            | 192.168.1.202 |                 |       55232 | 2024-09-13 11:
26:15.897739+08 |              | streaming |          | 3/12017580 | 3/12017580 | 3/12017580 |           |
      |            |             1 | quorum     | 2024-09-13 11:45:55.433441+08
(1 row)

test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin  | catalog_xmi
n | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+-------+------------
--+-------------+---------------------
 repmgr_slot_2 |        | physical  |        |          | f         | t      |      10981 | 16557 |
  | 3/12017580  |
(1 row)

# node2和node3主备流复制
test=# select * from sys_stat_replication;
 pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend
_start         | backend_xmin |   state   | sent_lsn | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush
_lag | replay_lag | sync_priority | sync_state |          reply_time
------+----------+---------+------------------+---------------+-----------------+-------------+----------------
---------------+--------------+-----------+----------+------------+------------+------------+-----------+------
-----+------------+---------------+------------+-------------------------------
 7487 |    16384 | esrep   | node3            | 192.168.1.203 |                 |       57416 | 2024-09-13 11:4
4:04.561940+08 |              | streaming |          | 3/12017580 | 3/12017580 | 3/12017580 |           |
     |            |             0 | async      | 2024-09-13 11:46:16.602821+08
(1 row)

test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin  | catalog_xmi
n | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+-------+------------
--+-------------+---------------------
 repmgr_slot_3 |        | physical  |        |          | f         | t      |       7487 | 16557 |
  | 3/12017580  |
(1 row)

二、node2节点执行switchover切换

1、在node2节点执行switchover
[kingbase@node202 bin]$ ./repmgr standby switchover -h 192.168.1.201 -U esrep -d esrep

2、切换后集群状态
如下所示,node2为新主库节点,node1和node3为备库节点:

[kingbase@node202 bin]$ ./repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                   
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 67       | 0 bytes | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | primary | * running |          | default  | 100      | 68       |         | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 3  | node3 | standby |   running | node2    | default  | 100      | 68       | 0 bytes | host=192.168.1.203 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

三、node1执行switchvoer切换

1、在node1节点执行switchover
[kingbase@node201 bin]$ ./repmgr standby switchover -h 192.168.1.202 -U esrep -d esrep

2、切换后集群状态
如下所示,切换后node1成为新的主库,node3仍作为级联备库,其upstream节点为node2:

[kingbase@node201 bin]$ ./repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                   
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 69       |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running | node1    | default  | 100      | 68       | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 3  | node3 | standby |   running | node2    | default  | 100      | 68       | 0 bytes | host=192.168.1.203 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

如下图所示:

四、总结
在KingbaseES V8R6集群执行switchover切换时,仍能保留原有的级联备库的架构。

附件:switchover切换失败案例

1、问题现象
如下所示,node2执行switchover切换失败,在node1原主库节点卸载vip失败:

[kingbase@node202 bin]$ ./repmgr standby switchover -h 192.168.1.201 -U esrep -d esrep
.......
[NOTICE] PING 192.168.1.88 (192.168.1.88) 56(84) bytes of data.

--- 192.168.1.88 ping statistics ---
3 packets transmitted, 2 received, 33% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.506/0.545/0.585/0.045 ms

[WARNING] the virtual ip is already on other host, try to release it on old primary node (host: "192.168.1.201")
[DEBUG] DoRemoteCommand():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/kbha -A check_ip --ip 192.168.1.88
[DEBUG] DoRemoteCommand():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/kbha -A unloadvip
RTNETLINK answers: Cannot assign requested address
[WARNING] old primary node (host: "192.168.1.201") release the virtual ip 192.168.1.88/24 failed

切换失败后集群状态:

[kingbase@node202 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status        | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                               
----+-------+---------+---------------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | ? unreachable | ?        | default  | 100      |          |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running     | ? node1  | default  | 100      | 67       | ?       | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 3  | node3 | standby |   running     | node2    | default  | 100      | 67       | 0 bytes | host=192.168.1.203 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

[WARNING] following issues were detected
  - unable to connect to node "node1" (ID: 1)
  - node "node1" (ID: 1) is registered as an active primary but is unreachable
  - unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
  - unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)

[HINT] execute with --verbose option to see connection error messages

2、问题分析

1)查看主库节点vip

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:df:15:2c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.201/24 brd 192.168.1.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.1.88/32 scope global enp0s3

如下图所示,主库节点vip的掩码为32:

2)查看备库vip配置

virtual_ip='192.168.1.88/24'        ##vip掩码为24
net_device='enp0s3'
net_device_ip='192.168.1.202'
arping_path='/home/kingbase/cluster/R6C8/HAC8/kingbase/bin'
ipaddr_path='/usr/sbin'

3)执行vip卸载命令
如下,在switchover切换时,远程执行主库节点vip卸载失败:

# 检查主库节点vip加载状态
[kingbase@node202 bin]$ ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/kbha -A check_ip --ip 192.168.1.88
1

# 执行远程连接主库卸载vip
[kingbase@node202 bin]$ ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 192.168.1.201 /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/kbha -A unloadvip
RTNETLINK answers: Cannot assign requested address

4)分析总结
从以上分析可以获悉,主库vip为192.168.1.88/32,而备库配置文件vip为192.168.1.88/24,ip的子网掩码不同,导致备库执行远程卸载vip时,按照配置文件无法获取到:192.168.1.88/24的vip,故在原主库节点卸载vip失败。

3、问题解决

1)在主库节点手工卸载vip

[root@node201 ~]# ip add add 192.168.1.88/24 dev enp0s3

2)重启集群
如下所示,重启集群后主角库节点vip加载正常:

[kingbase@node201 bin]$ ip add sh

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:df:15:2c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.201/24 brd 192.168.1.255 scope global noprefixroute enp0s3
       valid_lft forever preferred_lft forever
    inet 192.168.1.88/24 scope global secondary enp0s3:3

如下图所示,主库节点vip掩码为24:(原vip疑似手工加载)

3)重新执行switchover成功。

posted @ 2024-09-13 15:04  天涯客1224  阅读(27)  评论(0编辑  收藏  举报