
KingbaseES V8R6 Cluster Operations Case Study: Standby Register Failure

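Case description:
According to the on-site implementation engineer, the standby was cloned and its database service was started, but running 'repmgr standby register' afterwards failed to register the standby with the cluster.

Applicable version:
KingbaseES V8R6

I. Symptom
As the screenshot shows, 'repmgr standby register' fails: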

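II. Analysis
1. How repmgr standby register works
As the figure shows:
1) The standby reads repmgr.conf to obtain the local node's information and connects to the local node.
2) The standby reads the repmgr.nodes metadata to obtain the primary node's information and connects to it.
3) Over the connection to the primary node, it registers the standby node.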
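
Step 1 of this flow can be reproduced by hand when troubleshooting. The following is a minimal sketch, assuming repmgr.conf uses the usual key=value or key='value' layout; conf_get is a hypothetical helper written for this illustration and is not part of repmgr or KingbaseES.

```shell
# conf_get KEY FILE
# Hypothetical helper: print the value of KEY from a repmgr.conf-style file
# (lines of the form key=value or key='value'). If the key appears more
# than once, the last occurrence is printed.
conf_get() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" \
        | tail -n 1 \
        | sed "s/^'//; s/'[[:space:]]*\$//"
}

# Example (the path is an assumption; adjust it to your installation):
#   conf_get node_id  ../etc/repmgr.conf
#   conf_get conninfo ../etc/repmgr.conf
```

The conninfo value printed this way is what repmgr uses for the local connection in step 1. Steps 2 and 3 then rely on the repmgr.nodes table, where the row for the primary carries the connection string used to reach it.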

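2. Check the standby's repmgr.conf
As the screenshot shows, the standby node's configuration is correct.

3. Check the standby's database service
As the screenshot shows, connecting remotely to the standby node reveals that, unexpectedly, its database service is running in primary state. A node that is not in recovery cannot be registered as a standby, which explains the failure.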
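
The same role check can be made from the command line. The following is a minimal sketch, assuming the server answers the PostgreSQL-style query SELECT pg_is_in_recovery(); and that ksql accepts the psql-style -A, -t and -c options. role_from_recovery_flag is a hypothetical helper, not a KingbaseES tool.

```shell
# role_from_recovery_flag FLAG
# Map the unaligned, tuples-only result of "SELECT pg_is_in_recovery();"
# (t or f) to a role name. Returns non-zero for any other input.
role_from_recovery_flag() {
    case "$1" in
        t) echo "standby" ;;
        f) echo "primary" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Example (assumed ksql options; connection values taken from the register log):
#   flag=$(ksql -h 10.0.0.101 -p 54321 -U esrep -d esrep -Atc "SELECT pg_is_in_recovery();")
#   role_from_recovery_flag "$flag"
```

A result of primary on a node that was cloned as a standby matches the state described above.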

III. Resolution
1. Create a standby.signal file in the standby's data directory so that the instance starts in standby mode
[kingbase@localhost data]$ touch standby.signal

2. On the primary node, create a replication slot for the standby
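
The original post does not show the command used for this step. The following is a sketch under two assumptions: that the slot follows repmgr's repmgr_slot_<node_id> naming convention, and that the server accepts the PostgreSQL function name pg_create_physical_replication_slot (verify whether your build exposes it under this name). slot_sql is a hypothetical helper that only prints the statement, so it can be reviewed before being run on the primary.

```shell
# slot_sql NODE_ID
# Print the SQL that creates a physical replication slot for the standby
# with the given node ID. The slot name is an assumption based on repmgr's
# repmgr_slot_<node_id> convention; the source does not show the real name.
slot_sql() {
    printf "SELECT pg_create_physical_replication_slot('repmgr_slot_%s');\n" "$1"
}

# Example: print the statement for node2 (ID 2), then run it on the primary
# with the SQL client of your choice.
#   slot_sql 2
```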

3. Restart the standby's database service (it now comes up in standby state)

[kingbase@localhost bin]$ ./sys_ctl restart -D ../data
waiting for server to shut down ....... done

4. Run repmgr standby register

[kingbase@localhost bin]$ ./repmgr standby register --force -L debug
[INFO] connecting to local node "node2" (ID: 2)
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=10.0.0.101 port=54321 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=repmgr options=-csearch_path="
[INFO] connecting to primary database
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=10.0.0.100 port=54321 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=repmgr options=-csearch_path="
[DEBUG] remote_command():
  ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 10.0.0.100 /home/kingbase/cluster/install/kingbase/bin/kbha -A updateinfo
[INFO] standby registration complete
[NOTICE] standby node "node2" (ID: 2) successfully registered

As shown above, the standby node was registered successfully.

5. Check the cluster node status

[kingbase@localhost bin]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                   
----+-------+---------+-----------+----------+----------+----------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        |         | host=10.0.0.100 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | 0 bytes | host=10.0.0.101 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
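
For scripted checks, the Status column of this output can be tested automatically. The following is a minimal sketch, assuming the column layout shown above (Status is the fourth |-separated field); all_nodes_running is a hypothetical helper, not a repmgr command.

```shell
# all_nodes_running
# Read "repmgr cluster show" output on stdin and succeed only if at least
# one node row is present and every node row has the status "running".
# Node rows are recognised by a numeric ID in the first column.
all_nodes_running() {
    awk -F'|' '
        $1 ~ /^ *[0-9]+ *$/ && NF >= 4 {
            status = $4
            gsub(/^[ *]+|[ ]+$/, "", status)   # drop the "*" marker and padding
            if (status != "running") bad = 1
            seen++
        }
        END { exit ((bad || seen == 0) ? 1 : 0) }
    '
}

# Example:
#   repmgr cluster show | all_nodes_running && echo "cluster healthy"
```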