
KingbaseES V8R6 Cluster Operations Case Study: How the Remaining Standbys Follow the New Primary After a Failover

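Case description:
In a KingbaseES V8R6 cluster with one primary and multiple standbys, a failover is triggered when the database service on the primary goes down. After one standby is promoted to primary, the remaining standbys must follow the new primary node.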

Applicable version:
KingbaseES V8R6

Cluster architecture:

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running | node1    | default  | 100      | 3        | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 3  | node3 | standby |   running | node1    | default  | 100      | 3        | 0 bytes | host=192.168.1.203 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
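
When checking which standbys still point at the old primary, it can help to read this table programmatically. The following is a minimal sketch, assuming the table is the pipe-separated output of `repmgr cluster show` (the source does not name the command) and that connection strings contain no `|` character. The names `parse_cluster_show` and `standbys_of` are hypothetical helpers, and the connection strings in the sample are abbreviated.

```python
# Sample copied from the table above, with connection strings shortened.
SAMPLE = """\
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        |         | host=192.168.1.201 port=54321
 2  | node2 | standby |   running | node1    | default  | 100      | 3        | 0 bytes | host=192.168.1.202 port=54321
 3  | node3 | standby |   running | node1    | default  | 100      | 3        | 0 bytes | host=192.168.1.203 port=54321
"""


def parse_cluster_show(text):
    """Parse the pipe-separated table into a list of dicts keyed by column header."""
    rows = []
    header = None
    for line in text.splitlines():
        if "|" not in line:
            continue  # skips blank lines and the ----+---- separator row
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # the first pipe-separated line is the header
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def standbys_of(rows, upstream_name):
    """Return the names of standby nodes whose Upstream column equals upstream_name."""
    return [
        row["Name"]
        for row in rows
        if row["Role"] == "standby" and row["Upstream"] == upstream_name
    ]


if __name__ == "__main__":
    nodes = parse_cluster_show(SAMPLE)
    print(standbys_of(nodes, "node1"))
```

After a successful failover and follow, the same check run against the new table would be expected to list the surviving standbys under the new primary's name instead.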

I. Database service failure on the primary
[kingbase@node201 bin]$ ./sys_ctl stop -D ../data

II. Reviewing the failover process

1. Promoting the new primary

[DETAIL] these nodes will remain attached to the current primary:
  node3 (node ID: 3)
[NOTICE] promoting standby to primary
[DETAIL] promoting server "node2" (ID: 2) using sys_promote()
[NOTICE] waiting for promotion to complete, replay lsn: 0/80000A0
[2023-08-25 18:44:11] [NOTICE] try to stop old primary db (host: "192.168.1.201")
[NOTICE] STANDBY PROMOTE successful
[DETAIL] server "node2" (ID: 2) was successfully promoted to primary

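2. Recovering the other standbys (follow upstream)
The recovery flow is:
1) The new primary prepares to run recovery on the other standbys so that they follow the new upstream node.
2) The new primary connects to each of the other standby nodes (via ssh or securecmdd) and runs 'repmgr standby follow'.
3) The database service on each of the other standbys is stopped and then started.
4) Auto-recovery of the other standbys completes, and their upstream node is set to the new primary.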
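
The command lines involved in steps 2) and 3) can be sketched as follows. This is only an illustration that rebuilds the three commands exactly as they appear in the hamgr.log output of this case; it does not execute anything. The function name `follow_commands` and the directory layout (`bin`, `etc`, `data` under one install directory) are assumptions taken from the paths in that log, and in the log the stop and start are issued by 'repmgr standby follow' itself rather than run separately by an operator.

```python
import shlex
from pathlib import PurePosixPath


def follow_commands(install_dir, upstream_node_id, timeout=90):
    """Build (not run) the commands seen in the log for a standby follow.

    install_dir is the directory that contains bin/, etc/ and data/
    (an assumption based on the paths in this case's log).
    """
    base = PurePosixPath(install_dir)
    repmgr = str(base / "bin" / "repmgr")
    sys_ctl = str(base / "bin" / "sys_ctl")
    conf = str(base / "etc" / "repmgr.conf")
    data = str(base / "data")
    logfile = str(base / "bin" / "logfile")
    return {
        # Step 2: re-point the standby at the new upstream node.
        "follow": [repmgr, "standby", "follow", "-f", conf, "-W",
                   f"--upstream-node-id={upstream_node_id}"],
        # Step 3a: fast shutdown, waiting up to `timeout` seconds.
        "stop": [sys_ctl, "-D", data, "-l", logfile, "-w",
                 "-t", str(timeout), "-m", "fast", "stop"],
        # Step 3b: start again, waiting up to `timeout` seconds.
        "start": [sys_ctl, "-w", "-t", str(timeout), "-D", data,
                  "-l", logfile, "start"],
    }


if __name__ == "__main__":
    cmds = follow_commands("/home/kingbase/cluster/R6/R6HA/kingbase", 2)
    for step, argv in cmds.items():
        print(step, "->", shlex.join(argv))
```

Keeping the commands as argument lists rather than shell strings avoids quoting problems if a path ever contains spaces.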

# The new primary prepares recovery of the other standbys so that they follow the new upstream.
[2023-08-25 18:44:11] [INFO] sleeping 6 seconds until next reconnection attempt
[2023-08-25 18:44:17] [INFO] checking state of node 2, 1 of 10 attempts
[2023-08-25 18:44:17] [NOTICE] node 2 has recovered, reconnecting
[2023-08-25 18:44:17] [INFO] connection to node 2 succeeded
[2023-08-25 18:44:17] [INFO] original connection is still available
[2023-08-25 18:44:17] [INFO] switching to primary monitoring mode
[2023-08-25 18:44:17] [NOTICE] monitoring cluster primary "node2" (ID: 2)
[2023-08-25 18:44:17] [INFO] create a thread 0x7ff03aa29700 to check the cluster status
[2023-08-25 18:44:17] [INFO] child node: 3; attached: no
[2023-08-25 18:44:17] [INFO] [thread pid:1430] do_nodes_recovery thread begin. The pthread_t tid is 0x7ff03a012700
[2023-08-25 18:44:17] [NOTICE] [thread pid:1430] node (ID: 3; host: "192.168.1.203") will follow myself, ready to auto-recovery
[2023-08-25 18:44:17] [NOTICE] [thread pid:1430] Now, the primary host ip: 192.168.1.202

# Connect to the other standby and run repmgr standby follow
[2023-08-25 18:44:18] [INFO] [thread pid:1430] ES connection to host "192.168.1.203" succeeded, ready to do auto-recovery
[2023-08-25 18:44:18] [INFO] node "node3" (ID: 3, HOST: 192.168.1.203) auto-recovery: STANDBY FOLLOW
[2023-08-25 18:44:20] [NOTICE] executing repmgr command "/home/kingbase/cluster/R6/R6HA/kingbase/bin/repmgr  standby follow  -f /home/kingbase/cluster/R6/R6HA/kingbase/etc/repmgr.conf -W --upstream-node-id=2"
[WARNING] following problems with command line parameters detected:
  --no-wait will be ignored when executing STANDBY FOLLOW
[INFO] local node 3 can attach to follow target node 2
[DETAIL] local node's recovery point: 0/80000A0; follow target node's fork point: 0/80000A0
[INFO] creating replication slot as user "esrep"
[NOTICE] setting node 3's upstream to node 2

# Stop the database service on the other standby
[NOTICE] begin to stopp server at 2023-08-25 18:44:20.339428
[NOTICE] stopping server using "/home/kingbase/cluster/R6/R6HA/kingbase/bin/sys_ctl  -D '/home/kingbase/cluster/R6/R6HA/kingbase/data' -l /home/kingbase/cluster/R6/R6HA/kingbase/bin/logfile -w -t 90 -m fast stop"
[2023-08-25 18:44:20] [INFO] node (ID: 1): no server running
[2023-08-25 18:44:20] [INFO] [thread 0x7ff03aa29700] the cluster has no other running primary node, exit
[2023-08-25 18:44:23] [NOTICE] new standby "node3" (ID: 3) has connected
[NOTICE] stopp server finish at 2023-08-25 18:44:25.449779

# Start the database service on the other standby
[NOTICE] begin to start server at 2023-08-25 18:44:25.450081
[NOTICE] starting server using "/home/kingbase/cluster/R6/R6HA/kingbase/bin/sys_ctl  -w -t 90 -D '/home/kingbase/cluster/R6/R6HA/kingbase/data' -l /home/kingbase/cluster/R6/R6HA/kingbase/bin/logfile start"
[NOTICE] start server finish at 2023-08-25 18:44:25.683756
[WARNING] unable to connect to old upstream node 1 to remove replication slot

# Complete auto-recovery of the other standby and set its upstream node to the new primary
[HINT] if reusing this node, you should manually remove any inactive replication slots
[WARNING] node "node3" attached in state "startup"
[2023-08-25 18:44:25] [INFO] SET synchronous TO "quorum" on primary host
[NOTICE] STANDBY FOLLOW successful
[DETAIL] standby attached to upstream node "node2" (ID: 2)
[2023-08-25 18:44:26] [NOTICE] kbha: node (ID: 3) standby follow success.
[2023-08-25 18:44:26] [NOTICE] [thread pid:1430] node "node3" (ID: 3) auto-recovery success
[2023-08-25 18:44:26] [INFO] [thread pid:1430] Is standby node "node3" (ID: 3) ready for connection?
[2023-08-25 18:44:26] [INFO] [thread pid:1430] the standby node "node3" (ID: 3) connected ... OK
[2023-08-25 18:44:26] [INFO] [thread pid:1430] do_nodes_recovery thread ends. The pthread_t tid is 0x7ff03a012700
[2023-08-25 18:44:27] [INFO] thread tid:0x7ff03a012700 is not running
[2023-08-25 18:44:27] [INFO] the recovery thread was exited, reset tid
[2023-08-25 18:44:29] [NOTICE] new standby "node3" (ID: 3) has connected
[2023-08-25 18:49:18] [INFO] monitoring primary node "node2" (ID: 2) in normal state
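
To see how long the standby was unavailable during the follow, the microsecond timestamps in the stop and start messages can be subtracted. The sketch below is a hypothetical helper (`restart_timings` is not part of any Kingbase tool) that assumes the message wording shown in the log above, including the "stopp" spelling that the tool itself emits.

```python
import re
from datetime import datetime

STAMP = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})"
EVENTS = {
    "stop_begin": re.compile(r"begin to stopp server at " + STAMP),
    "stop_end": re.compile(r"stopp server finish at " + STAMP),
    "start_begin": re.compile(r"begin to start server at " + STAMP),
    "start_end": re.compile(r"start server finish at " + STAMP),
}

# The four timing lines from the successful follow above.
SUCCESS_LOG = """\
[NOTICE] begin to stopp server at 2023-08-25 18:44:20.339428
[NOTICE] stopp server finish at 2023-08-25 18:44:25.449779
[NOTICE] begin to start server at 2023-08-25 18:44:25.450081
[NOTICE] start server finish at 2023-08-25 18:44:25.683756
"""


def restart_timings(log_text):
    """Return stop, start and total durations as timedelta objects.

    Raises KeyError if one of the four expected messages is missing.
    """
    found = {}
    for name, pattern in EVENTS.items():
        match = pattern.search(log_text)
        if match:
            found[name] = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
    return {
        "stop": found["stop_end"] - found["stop_begin"],
        "start": found["start_end"] - found["start_begin"],
        "total": found["start_end"] - found["stop_begin"],
    }


if __name__ == "__main__":
    for phase, duration in restart_timings(SUCCESS_LOG).items():
        print(phase, duration.total_seconds())
```

For this case the stop phase accounts for roughly 5.1 seconds and the start for about 0.23 seconds, so the standby restart took a little over 5.3 seconds in total.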

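III. A case where follow fails after failover
As shown below, in a cluster with one primary and four standbys, the primary host went down. After one standby was promoted to the new primary, 'repmgr standby follow' failed on the other standbys: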

# node11 (the new primary) runs auto-recovery on the other standbys, but cannot connect to the database service on the other nodes, so 'repmgr standby follow' fails.
[2023-08-25 10:29:02] [WARNING] unable to connect to remote host "83.12.141.10" via SSH
[2023-08-25 10:29:02] [INFO] unable to connect via SSH to host "83.12.141.10", skip stop old primary db
[2023-08-25 10:30:33] [INFO] 3 followers to notify
[2023-08-25 10:30:33] [NOTICE] notifying node "node16" (ID: 3) to follow node 2

[2023-08-25 10:30:33] [ERROR] unable to execute repmgr.notify_follow_primary()
[2023-08-25 10:30:33] [DETAIL] 
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

[2023-08-25 10:30:33] [DETAIL] query text is:
SELECT repmgr.notify_follow_primary(2)
[2023-08-25 10:30:33] [NOTICE] notifying node "node17" (ID: 4) to follow node 2
[2023-08-25 10:30:33] [ERROR] unable to execute repmgr.notify_follow_primary()
[2023-08-25 10:30:33] [DETAIL] 
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

[2023-08-25 10:30:33] [DETAIL] query text is:
SELECT repmgr.notify_follow_primary(2)
[2023-08-25 10:30:33] [NOTICE] notifying node "node38" (ID: 5) to follow node 2
[2023-08-25 10:30:33] [ERROR] unable to execute repmgr.notify_follow_primary()
[2023-08-25 10:30:33] [DETAIL] 
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

[2023-08-25 10:30:33] [DETAIL] query text is:
SELECT repmgr.notify_follow_primary(2)

hamgr.log from one of the standby nodes:
As shown below, the database service on this standby could not be stopped and started normally:

NOTICE: setting node 3's upstream to node 2
NOTICE: begin to stopp server at 2023-08-25 10:29:59.916352
NOTICE: stopping server using "/home/kingbase/cluster/hngs_sc/hngs_sc_cluster/kingbase/bin/sys_ctl  -D '/home/kingbase/cluster/hngs_sc/hngs_sc_cluster/kingbase/data' -l /home/kingbase/cluster/hngs_sc/hngs_sc_cluster/kingbase/bin/logfile -w -t 90 -m fast stop"
sys_ctl: server does not shut down
NOTICE: stopp server finish at 2023-08-25 10:31:30.021334
ERROR: unable to stopp server
NOTICE: STANDBY FOLLOW failed
[2023-08-25 10:31:33] [ERROR] connection to database failed
[2023-08-25 10:31:33] [DETAIL] 
timeout expired

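IV. Summary
In a KingbaseES V8R6 cluster with one primary and multiple standbys, a failover is triggered when the database service on the primary goes down. After one standby is promoted to the new primary, it runs recovery on the other standby nodes so that they follow the new upstream node. If the new primary cannot connect remotely to the other standbys and stop and start their database services normally, the follow operation fails.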
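
Because the follow depends on the new primary reaching each standby, a quick reachability check from the new primary can narrow down a failure. The sketch below only tests whether a TCP connection can be opened. The database port 54321 comes from the connection strings in this article, while port 22 for ssh is the common default and is an assumption here; the securecmdd port is not given in the source, so it is left to the caller. `can_connect` and `check_nodes` are hypothetical names.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def check_nodes(hosts, ports, timeout=3.0):
    """Return {(host, port): reachable} for every host and port combination."""
    return {
        (host, port): can_connect(host, port, timeout)
        for host in hosts
        for port in ports
    }


if __name__ == "__main__":
    # Standby addresses from the cluster table in this article.
    standbys = ["192.168.1.202", "192.168.1.203"]
    # 54321: database port from the connection strings; 22: assumed ssh port.
    for target, ok in check_nodes(standbys, [54321, 22]).items():
        print(target, "reachable" if ok else "NOT reachable")
```

An open port only shows that the network path and listener exist. It does not prove that the database can be stopped cleanly, which was the actual failure in the case above.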