KingbaseES V8R6 Cluster Operations Case Study: Cascading Standby Support

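Case description:
In a KingbaseES V8R6 cluster with one primary and two standbys (one of them a cascading standby), starting the cluster with sys_monitor.sh fails with 'ERROR: There are no 2 standbys in pg_stat_replication, please check all the standby servers replica from primary'. The failure was reproduced in testing and its cause identified, as described below.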

Applicable version:
KingbaseES V8R6

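Cluster architecture:
One primary (node1), one standby (node2) replicating from node1, and one cascading standby (node3) replicating from node2.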

I. Cluster node status

[kingbase@node103 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 20       |         | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | default  | 100      | 20       | 0 bytes | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node3 | standby |   running | node2    | default  | 100      | 20       | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

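II. Cluster startup failure

1. Start the cluster
As shown below, after the cluster is started with sys_monitor.sh (here via restart), the error "ERROR: There are no 2 standbys in pg_stat_replication, please check all the standby servers replica from primary" is reported: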

[kingbase@node103 bin]$ sys_monitor.sh restart
........
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 20       |         | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | default  | 100      | 20       | 0 bytes | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node3 | standby |   running | node2    | default  | 100      | 20       | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2023-08-01 19:44:22 The primary DB is started.
ERROR: There are no 2 standbys in pg_stat_replication, please check all the standby servers replica from primary

2. Check cluster node information

1) Cluster node status
As shown below, after startup the node status reported by repmgr is normal:

[kingbase@node103 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 20       |         | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | default  | 100      | 20       | 0 bytes | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node3 | standby |   running | node2    | default  | 100      | 20       | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

2) repmgrd service status on the cluster nodes
As shown below, the repmgrd service failed to start on every node:

[kingbase@node101 bin]$ ./repmgr service status
 ID | Name  | Role    | Status    | Upstream | repmgrd     | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+-------------+-----+---------+--------------------
 1  | node1 | primary | * running |          | not running | n/a | n/a     | n/a
 2  | node2 | standby |   running | node1    | not running | n/a | n/a     | n/a
 3  | node3 | standby |   running | node2    | not running | n/a | n/a     | n/a

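III. Analysis of the startup failure

1. Examine the sys_monitor.sh script
After starting the cluster, the script counts the standby nodes (shown in a figure in the original post).

2. How the standby count is determined (traced with sh -x sys_monitor.sh start)
1) The script obtains the cluster node information from repmgr cluster show.

2) It then checks the streaming replication information on the primary against that standby count.

3) Streaming replication information in the cluster

Streaming replication information on the primary (node1):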
prod=# select usename,application_name ,client_addr ,state, sync_state from sys_stat_replication;
 usename | application_name |  client_addr  |   state   | sync_state
---------+------------------+---------------+-----------+------------
 system  | node2            | 192.168.1.102 | streaming | sync
(1 row)

Streaming replication information for the cascading standby (queried on node2):
prod=# select usename,application_name ,client_addr ,state, sync_state from sys_stat_replication;
 usename | application_name |  client_addr  |   state   | sync_state
---------+------------------+---------------+-----------+------------
 system  | node3            | 192.168.1.103 | streaming | async
(1 row)

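'repmgr cluster show' reports 2 standby nodes, but sys_stat_replication on the primary shows only one streaming standby (node2), because node3 replicates from node2 rather than from the primary. The two standby counts do not match, which produces the error seen at startup:
ERROR: There are no 2 standbys in pg_stat_replication, please check all the standby servers replica from primary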
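
The count comparison described above can be illustrated with a minimal shell sketch. It is not the actual sys_monitor.sh code. The function names `count_repmgr_standbys` and `count_streaming_rows` are hypothetical, and the sample text is an abbreviated copy of the outputs shown in this post (connection-string columns omitted).

```shell
#!/usr/bin/env bash
# Illustrative sketch only: NOT the real sys_monitor.sh logic.
# Both helper names are hypothetical and read tabular text from stdin.

# Count rows whose Role column (3rd field) is "standby" in
# `repmgr cluster show` style output.
count_repmgr_standbys() {
    awk -F'|' '{ role = $3; gsub(/[ \t]/, "", role); if (role == "standby") n++ }
               END { print n + 0 }'
}

# Count rows whose state column (4th field) is "streaming" in the
# sys_stat_replication query output used in this post.
count_streaming_rows() {
    awk -F'|' '{ st = $4; gsub(/[ \t]/, "", st); if (st == "streaming") n++ }
               END { print n + 0 }'
}

# Abbreviated sample data taken from the outputs above.
cluster_show=' ID | Name  | Role    | Status    | Upstream
----+-------+---------+-----------+----------
 1  | node1 | primary | * running |
 2  | node2 | standby |   running | node1
 3  | node3 | standby |   running | node2'

primary_replication=' usename | application_name |  client_addr  |   state   | sync_state
---------+------------------+---------------+-----------+------------
 system  | node2            | 192.168.1.102 | streaming | sync'

expected=$(printf '%s\n' "$cluster_show" | count_repmgr_standbys)
actual=$(printf '%s\n' "$primary_replication" | count_streaming_rows)

echo "standbys registered in repmgr: $expected"
echo "standbys streaming from the primary: $actual"
if [ "$expected" -ne "$actual" ]; then
    echo "mismatch: a cascading standby is counted by repmgr but not by the primary"
fi
```

With the sample data the two counts are 2 and 1, which is the mismatch behind the startup error.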

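IV. Summary
1. Check the cluster's repmgr.conf configuration
Newer builds of KingbaseES V8R6 add the 'ha_running_mode' parameter to repmgr.conf. The default for an ordinary cluster is 'DG', in which repmgr management does not support cascading standbys. 'TPTC' (two sites, three centers) mode supports managing cascading standbys within the cluster.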

[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep ha_running
ha_running_mode='DG'
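
As a small sketch of the same check in script form, the helper below extracts the `ha_running_mode` value from a repmgr.conf-style file. The function name `get_ha_running_mode` is hypothetical, and the temporary file only stands in for the real repmgr.conf so that the example is self-contained.

```shell
#!/usr/bin/env bash
# Illustrative sketch: get_ha_running_mode is a hypothetical helper,
# not part of KingbaseES or repmgr.

# Print the value of ha_running_mode from the given config file
# (last occurrence wins, surrounding single quotes are stripped).
get_ha_running_mode() {
    sed -n "s/^[[:space:]]*ha_running_mode[[:space:]]*=[[:space:]]*'\{0,1\}\([^'#[:space:]]*\)'\{0,1\}.*/\1/p" "$1" | tail -n 1
}

# Stand-in for ../etc/repmgr.conf so the sketch runs anywhere.
conf=$(mktemp)
printf '%s\n' "ha_running_mode='DG'" > "$conf"

mode=$(get_ha_running_mode "$conf")
case "$mode" in
    DG)   echo "ha_running_mode=DG: cascading standbys are not supported under repmgr management" ;;
    TPTC) echo "ha_running_mode=TPTC: cascading standbys can be managed in the cluster" ;;
    *)    echo "ha_running_mode not found (possibly an earlier build without this parameter)" ;;
esac
rm -f "$conf"
```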

2. Cluster startup check (shown in a figure in the original post)

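3. Using cascading standbys
If a cascading standby is used with ha_running_mode='DG', that standby node must not be added to repmgr cluster management. In 'TPTC' mode the cascading standby mechanism also differs from that of an ordinary cluster, so it should be tested before use.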
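
To spot such nodes before starting a 'DG' cluster, one option is to look for standbys whose upstream is not the primary. The sketch below is an assumption about how that could be done from `repmgr cluster show` style text. The function name `list_cascading_standbys` is hypothetical and the sample data is abbreviated from the output earlier in this post.

```shell
#!/usr/bin/env bash
# Illustrative sketch: list_cascading_standbys is a hypothetical helper.
# It reads `repmgr cluster show` style text from stdin and prints the
# names of standbys whose Upstream column is set to a node other than
# the primary.
list_cascading_standbys() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        {
            name = trim($2); role = trim($3); upstream = trim($5)
            if (role == "primary") primary = name
            else if (role == "standby") up[name] = upstream
        }
        END {
            for (n in up)
                if (up[n] != "" && up[n] != primary) print n
        }
    ' | sort
}

# Abbreviated sample data taken from the output above.
cluster_show=' ID | Name  | Role    | Status    | Upstream
----+-------+---------+-----------+----------
 1  | node1 | primary | * running |
 2  | node2 | standby |   running | node1
 3  | node3 | standby |   running | node2'

cascading=$(printf '%s\n' "$cluster_show" | list_cascading_standbys)
if [ -n "$cascading" ]; then
    echo "cascading standbys registered in repmgr: $cascading"
fi
```

For the sample data only node3 is reported, since its upstream is node2 rather than the primary node1.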
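4. Related documentation
The following article describes creating a cascading standby on an earlier KingbaseES V8R6 build, in which repmgr.conf did not yet have the 'ha_running_mode' parameter.
"KingbaseES V8R6 Cluster Operations Case Study: Creating Cascading Replication"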
https://www.cnblogs.com/kingbase/p/15054823.html

Posted 2024-04-01 15:32 by KINGBASE Research Institute