KingbaseES V8R6集群运维案例之---备库register故障
案例说明:
据现场实施人员说,备库执行了clone,启动数据库服务,执行'repmgr standby register'后,无法将备库register到集群。
适用版本:
KingbaseES V8R6
一、问题现象
如下图所示,执行'repmgr standby register' ,register失败:
二、问题分析
1、repmgr standby register分析
如下图所示:
1) 备库读取repmgr.conf获取本节点信息,并连接。
2)备库读取repmgr.nodes元数据,获取主库节点信息,并连接。
3)连接主库节点,执行备库节点的register。
2、查看备库repmgr.conf配置
如下图所示,备库节点配置正常。
3、检查备库的数据库服务
如下图所示,远程连接到备库节点检查数据库服务,竟然发现备库数据库服务启动在primary状态???
三、问题解决
1、在备库data下创建standby.signal文件
[kingbase@localhost data]$ touch standby.signal
2、主库节点创建备库复制槽
3、重启备库数据库服务(数据库服务在standby状态)
[kingbase@localhost bin]$ ./sys_ctl restart -D ../data
等待服务器进程关闭 ....... 完成
4、执行repmgr standby register
[kingbase@localhost bin]$ ./repmgr standby register --force -L debug
[INFO] connecting to local node "node2" (ID: 2)
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=10.0.0.101 port=54321 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=repmgr options=-csearch_path="
[INFO] connecting to primary database
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=10.0.0.100 port=54321 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=repmgr options=-csearch_path="
[DEBUG] remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 22 -o ServerAliveInterval=2 -o ServerAliveCountMax=3 10.0.0.100 /home/kingbase/cluster/install/kingbase/bin/kbha -A updateinfo
[INFO] standby registration complete
[NOTICE] standby node "node2" (ID: 2) successfully registered
---如上所示,standby节点register成功。
5、查看集群节点状态
[kingbase@localhost bin]$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | | host=10.0.0.100 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
2 | node2 | standby | running | node1 | default | 100 | 1 | 0 bytes | host=10.0.0.101 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
四、总结
此次案例,是因为备库节点的数据库服务状态启动到了primary模式,导致执行'remgr stanby register'失败,在执行备库克隆后,启动数据库服务注册集群前,需要检查下当前备库的数据库服务状态,状态正常后,再执行register。