
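KingbaseES V8R6 Cluster Operations Case: Support for Cascading Standbys

Case description:
In a KingbaseES V8R6 cluster with one primary and two standbys, one of which is a cascading standby, starting the cluster with sys_monitor.sh fails with 'ERROR: There are no 2 standbys in pg_stat_replication, please check all the standby servers replica from primary'. The cause was reproduced in testing. The symptom is shown in the figure below: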

Applicable version:
KingbaseES V8R6

Cluster architecture:

I. Cluster node status

[kingbase@node103 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 20       |         | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | default  | 100      | 20       | 0 bytes | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node3 | standby |   running | node2    | default  | 100      | 20       | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

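II. Cluster startup failure

1. Start the cluster
As shown below, after the cluster is started with sys_monitor.sh (the transcript uses restart), the error "ERROR: There are no 2 standbys in pg_stat_replication, please check all the standby servers replica from primary" is reported: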

[kingbase@node103 bin]$ sys_monitor.sh restart
........
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 20       |         | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | default  | 100      | 20       | 0 bytes | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node3 | standby |   running | node2    | default  | 100      | 20       | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2023-08-01 19:44:22 The primary DB is started.
ERROR: There are no 2 standbys in pg_stat_replication, please check all the standby servers replica from primary

2. Check the cluster node information

1) Cluster node status
As shown below, the node status is normal after the cluster starts:

[kingbase@node103 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 20       |         | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | default  | 100      | 20       | 0 bytes | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node3 | standby |   running | node2    | default  | 100      | 20       | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

2) repmgrd service status on the cluster nodes
As shown below, the repmgrd service failed to start on every node:

[kingbase@node101 bin]$ ./repmgr service status
 ID | Name  | Role    | Status    | Upstream | repmgrd     | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+-------------+-----+---------+--------------------
 1  | node1 | primary | * running |          | not running | n/a | n/a     | n/a
 2  | node2 | standby |   running | node1    | not running | n/a | n/a     | n/a
 3  | node3 | standby |   running | node2    | not running | n/a | n/a     | n/a

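III. Analyzing the startup failure

1. Inspect the sys_monitor.sh script
As the figure below shows, after starting the cluster the script counts the standby nodes:

2. How the standby count is obtained (sh -x sys_monitor.sh start)
1) The cluster node information is obtained with repmgr cluster show

2) The primary's streaming replication information is checked against that standby count

3) Streaming replication information in the cluster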

Streaming replication information on the primary (node1):
prod=# select usename,application_name ,client_addr ,state, sync_state from sys_stat_replication;
 usename | application_name |  client_addr  |   state   | sync_state
---------+------------------+---------------+-----------+------------
 system  | node2            | 192.168.1.102 | streaming | sync
(1 row)

Streaming replication information on node2, the upstream of the cascading standby:
prod=# select usename,application_name ,client_addr ,state, sync_state from sys_stat_replication;
 usename | application_name |  client_addr  |   state   | sync_state
---------+------------------+---------------+-----------+------------
 system  | node3            | 192.168.1.103 | streaming | async
(1 row)

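'repmgr cluster show' reports 2 standby nodes, but sys_stat_replication on the primary shows only 1 streaming standby (node2), because node3 replicates from node2 rather than from the primary. The two counts do not match, which produces the error "ERROR: There are no 2 standbys in pg_stat_replication, please check all the standby servers replica from primary".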
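The comparison described above can be reproduced by hand. The following is a minimal sketch, not the actual sys_monitor.sh logic (which the post only shows as a screenshot). The function names count_repmgr_standbys and check_standby_counts are hypothetical, and the ksql call in the usage comment assumes psql-style -A/-t flags.

```shell
#!/usr/bin/env bash

# Count the rows whose Role column is "standby" in `repmgr cluster show`
# output read from stdin. The header and separator lines are skipped
# because their third field is not "standby".
count_repmgr_standbys() {
    awk -F'|' 'NF > 3 {
        role = $3
        gsub(/[ \t]/, "", role)
        if (role == "standby") n++
    }
    END { print n + 0 }'
}

# Compare the standby count from repmgr ($1) with the number of rows in
# sys_stat_replication on the primary ($2). Returns non-zero on mismatch.
check_standby_counts() {
    local expected=$1 actual=$2
    if [ "$expected" -ne "$actual" ]; then
        echo "mismatch: repmgr lists $expected standbys, primary streams to $actual"
        return 1
    fi
    echo "ok: $actual standbys stream from the primary"
}

# Usage on a cluster node (assumed invocation, not executed here):
#   expected=$(./repmgr cluster show | count_repmgr_standbys)
#   actual=$(./ksql -At -U system -d esrep -c "select count(*) from sys_stat_replication")
#   check_standby_counts "$expected" "$actual"
```

With the cluster in this case, the first count would be 2 and the second 1, so the check reports a mismatch.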

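IV. Summary
1. Check the cluster's repmgr.conf
Newer builds of KingbaseES V8R6 add the 'ha_running_mode' parameter. The default for an ordinary cluster is 'DG', in which repmgr management does not support cascading standbys. 'TPTC' (two sites, three data centers) mode supports managing cascading standbys within the cluster.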

[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep ha_running
ha_running_mode='DG'
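As a convenience, the grep above can be wrapped in a small helper that extracts just the value. This is a sketch under the assumption that the parameter is written as a single `ha_running_mode='...'` line. The function names get_ha_running_mode and cascading_supported are hypothetical, and the DG/TPTC rule simply encodes the statement in this section.

```shell
#!/usr/bin/env bash

# Print the value of ha_running_mode from the repmgr.conf-style file $1,
# with surrounding quotes removed. Prints nothing if the parameter is
# absent. If it appears more than once, the last occurrence is used.
get_ha_running_mode() {
    sed -n "s/^[[:space:]]*ha_running_mode[[:space:]]*=[[:space:]]*['\"]\{0,1\}\([^'\"#[:space:]]*\).*/\1/p" "$1" | tail -n 1
}

# Succeed only when the configured mode is TPTC, the mode this post
# describes as supporting cascading standbys under repmgr management.
cascading_supported() {
    [ "$(get_ha_running_mode "$1")" = "TPTC" ]
}

# Usage (path taken from the transcript above):
#   get_ha_running_mode ../etc/repmgr.conf
#   cascading_supported ../etc/repmgr.conf || echo "cascading standbys not managed in this mode"
```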

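2. Cluster startup check, as shown in the figure below

3. Using cascading standbys
If a cascading standby is used with ha_running_mode='DG', that standby node must not be added to repmgr cluster management. In 'TPTC' mode, cascading standbys work differently from cascading standbys in an ordinary cluster, so test before using them.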
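To see whether a registered cluster already contains a cascading standby, the `repmgr cluster show` table can be scanned for standbys whose upstream is not the primary. This is an illustrative sketch only. The function name list_cascading_standbys is hypothetical, and it assumes the column order shown in the transcripts above (Name, Role, and Upstream as the second, third, and fifth columns).

```shell
#!/usr/bin/env bash

# Read `repmgr cluster show` output from stdin and print the names of
# standbys whose Upstream column is not the primary node, one per line.
list_cascading_standbys() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NF > 5 {
            name = trim($2); role = trim($3); up = trim($5)
            if (role == "primary") primary = name
            else if (role == "standby") upstream[name] = up
        }
        END {
            for (n in upstream)
                if (upstream[n] != primary) print n
        }' | sort
}

# Usage on a cluster node (not executed here):
#   ./repmgr cluster show | list_cascading_standbys
```

For the cluster in this case, node3 would be listed because its upstream is node2.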
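4. Note on related documentation
The post below created a cascading standby on an earlier build of KingbaseES V8R6, in which repmgr.conf did not yet have the 'ha_running_mode' parameter.
"KingbaseES V8R6 Cluster Operations Case: Creating Cascading Replication"
https://www.cnblogs.com/kingbase/p/15054823.html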