KingbaseES V8R3 Cluster Operations Case Study: Failover After the Primary Database Service Goes Down

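This case study analyzes failover in a KingbaseES V8R3 cluster after the database service on the primary goes down, and walks through each step of the switchover. It can serve as a reference when analyzing failover problems in KingbaseES V8R3 clusters.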

KingbaseES V8R3


 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
 0       | | 54321 | up     | 0.333333  | standby | 0          | true              | 0
 1       | | 54321 | up     | 0.333333  | primary | 0          | false             | 0
 2       | | 54321 | up     | 0.333333  | standby | 0          | false             | 0

[kingbase@node102 bin]$ ./sys_ctl stop -D ../data



2023-05-07 02:18:21: pid 11666: WARNING:  checking setuid bit of arping command
2023-05-07 02:18:21: pid 11666: DETAIL:  arping[/home/kingbase/cluster/HAR3/db/bin//arping] doesn't have setuid bit
2023-05-07 02:18:21: pid 11666: LOG:  Backend status file /home/kingbase/cluster/HAR3/run/kingbasecluster/kingbasecluster_status does not exist
2023-05-07 02:18:22: pid 11706: LOG:  watchdog node state changed from [INITIALIZING] to [STANDING FOR MASTER]

2. The number of failed health checks against the primary database reaches the threshold (HEALTH_CHECK_MAX_RETRIES=6).

2023-05-07 02:23:24: pid 11666: LOG:  health checking retry count 1
2023-05-07 02:23:24: pid 11666: LOG:  failed to connect to kingbase server on "", getsockopt() detected error "Connection refused"
2023-05-07 02:24:04: pid 11666: LOG:  health checking retry count 5
2023-05-07 02:24:04: pid 11666: LOG:  failed to connect to kingbase server on "", getsockopt() detected error "Connection refused"
2023-05-07 02:24:04: pid 11666: ERROR:  failed to make persistent db connection
2023-05-07 02:24:04: pid 11666: DETAIL:  connection to host:"" failed
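
The retry pattern in the excerpt above can be checked mechanically. The following is a minimal sketch and not part of KingbaseES: it assumes log lines follow the `YYYY-MM-DD HH:MM:SS: pid N: LOG:  health checking retry count K` layout shown above, and the names `health_check_retries` and `average_retry_interval` are illustrative. It extracts each retry with its timestamp and estimates the average time between retries.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2023-05-07 02:23:24: pid 11666: LOG:  health checking retry count 1
RETRY_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): pid (\d+): LOG:\s+"
    r"health checking retry count (\d+)"
)

def health_check_retries(lines):
    """Return (timestamp, retry_count) tuples for every retry line found."""
    found = []
    for line in lines:
        m = RETRY_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            found.append((ts, int(m.group(3))))
    return found

def average_retry_interval(retries):
    """Average seconds per retry between the first and last entry, or None."""
    if len(retries) < 2:
        return None
    (t0, c0), (t1, c1) = retries[0], retries[-1]
    if c1 == c0:
        return None
    return (t1 - t0).total_seconds() / (c1 - c0)

# Sample lines copied from the master-side excerpt above.
sample = [
    '2023-05-07 02:23:24: pid 11666: LOG:  health checking retry count 1',
    '2023-05-07 02:23:24: pid 11666: LOG:  failed to connect to kingbase server on "", getsockopt() detected error "Connection refused"',
    '2023-05-07 02:24:04: pid 11666: LOG:  health checking retry count 5',
]

retries = health_check_retries(sample)
print([count for _, count in retries])   # [1, 5]
print(average_retry_interval(retries))   # 10.0
```

For this excerpt, retry 1 at 02:23:24 and retry 5 at 02:24:04 work out to about 10 seconds per retry. That figure is inferred from the timestamps only, not read from the cluster configuration.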

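3. Before failover is executed, the master node must hold the failover lock. Here the master receives a failover lock request from the kingbasecluster standby node; by default, only the master node can hold the failover lock.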

2023-05-07 02:24:14: pid 11706: LOG:  received the failover command lock request from remote kingbasecluster node " Linux node102"
2023-05-07 02:24:14: pid 11706: LOG:  remote kingbasecluster node " Linux node102" is requesting to become a lock holder for failover ID: 0
2023-05-07 02:24:14: pid 11706: LOG:  request to become a lock holder is denied to remote kingbasecluster node " Linux node102"
2023-05-07 02:24:14: pid 11706: DETAIL:  only master/coordinator can become a lock holder
2023-05-07 02:24:14: pid 11666: LOG:  Kingbasecluster-II parent process has received failover request
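
The rule logged above ("only master/coordinator can become a lock holder") can be pictured with a toy model. This is a sketch for illustration only, not the watchdog's actual implementation: the classes `WatchdogNode` and `FailoverLock` and the state labels `"MASTER"` and `"STANDBY"` are assumptions made for this example, while the node names come from the log.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WatchdogNode:
    name: str
    state: str  # "MASTER" or "STANDBY"; hypothetical labels for this sketch

class FailoverLock:
    """Toy model of 'only master/coordinator can become a lock holder'."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None

    def request(self, node: WatchdogNode) -> bool:
        # A standby's request is denied whether or not the lock is free.
        if node.state != "MASTER":
            return False
        # The master gets the lock if it is free or already its own.
        if self.holder in (None, node.name):
            self.holder = node.name
            return True
        return False

    def is_locked(self) -> bool:
        return self.holder is not None

lock = FailoverLock()
node101 = WatchdogNode("node101", "MASTER")
node102 = WatchdogNode("node102", "STANDBY")

print(lock.request(node102))  # False: the standby is denied
print(lock.request(node101))  # True: the master becomes the lock holder
print(lock.is_locked())       # True: what the standby sees when it polls
```

This mirrors the sequence in the log: node102's request is denied, the master proceeds with the failover itself, and node102's later status checks report the FAILOVER lock as LOCKED.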


2023-05-07 02:24:14: pid 11666: LOG:  execute command: /home/kingbase/cluster/HAR3/kingbasecluster/bin/ 1 1 0 0 /home/kingbase/cluster/HAR3/db/data
2023-05-07 02:24:14: pid 11706: LOG:  received the failover command lock request from remote kingbasecluster node " Linux node102"
2023-05-07 02:24:14: pid 11706: LOG:  remote kingbasecluster node " Linux node102" is checking the status of [FAILOVER] lock for failover ID 0
2023-05-07 02:24:14: pid 11706: LOG:  FAILOVER lock is currently LOCKED


2023-05-07 02:25:28: pid 11706: LOG:  received the failover command lock request from remote kingbasecluster node " Linux node102"
2023-05-07 02:25:28: pid 11706: LOG:  remote kingbasecluster node " Linux node102" is checking the status of [FAILOVER] lock for failover ID 55
2023-05-07 02:25:28: pid 11706: LOG:  FAILOVER lock is currently LOCKED
2023-05-07 02:25:45: pid 11666: LOG:  starting fail back. reconnect host




-----------------2023-05-07 02:24:14 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [], %P = old primary node id [1], %d = node id[1], %h = host name [], %O = old primary host[] %m = new master node id [0], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/HAR3/db/data].
----ping trust ip
ping trust ip success ping times :[3], success times:[2]
----determine whether the faulty db is master or standby
master down, let become new primary.....
 2023-05-07 02:24:16 del old primary VIP on
es_client connect host: success, will stop old primary db and del the vip
stop the old primary db
sys_ctl: PID file "/home/kingbase/cluster/HAR3/db/data/" does not exist
Is server running?
DEL VIP NOW AT 2023-05-07 02:24:02 ON enp0s3
execute: [/sbin/ip addr del dev enp0s3]
Oprate del ip cmd end.
2023-05-07 02:24:16 add VIP on
ADD VIP NOW AT 2023-05-07 02:24:17 ON enp0s3
execute: [/sbin/ip addr add dev enp0s3 label enp0s3:2]
execute: /home/kingbase/cluster/HAR3/db/bin//arping -U -I enp0s3 -w 1
Success to send 1 packets
2023-05-07 02:24:17 promote begin...let become master
check db if is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-05-07 02:24:17 kingbase is ok , to prepare execute promote
execute promote
server promoting
check db if is alive after promote
ksql "port=54321 user=SUPERMANAGER_V8ADMIN  dbname=TEST connect_timeout=10"   -c "select 33333;"
2023-05-07 02:24:17 after execute promote , kingbase status is ok.
after execute promote, kingbase is ok.
2023-05-07 02:24:17 sync to async
(1 row)

2023-05-07 02:24:17 make checkpoint
check the db to see if it is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN  dbname=TEST connect_timeout=10"  -c "select 33333;"
2023-05-07 02:24:17 kingbase is ok , to prepare execute checkpoint
execute checkpoint
check the db to see if it is alive after execute checkpoint
ksql "port=54321 user=SUPERMANAGER_V8ADMIN  dbname=TEST connect_timeout=10"   -c "select 33333;"
2023-05-07 02:24:17 after execute checkpoint, kingbase is ok.
after execute checkpoint, kingbase is ok.
-----------------2023-05-07 02:24:17 failover end---------------------------------------


-----------------2023-05-07 02:25:28 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [], %P = old primary node id [0], %d = node id[2], %h = host name [], %O = old primary host[] %m = new master node id [0], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/HAR3/db/data].
----ping trust ip
ping trust ip success ping times :[3], success times:[2]
----determine whether the faulty db is master or standby
standby down, master still
The sys_stat_replication view result is : []
2023-05-07 02:25:30 sync to async
(1 row)

-----------------2023-05-07 02:25:30 failover end---------------------------------------
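
The `failover-stats` line at the top of each run carries the arguments passed to the failover script. The sketch below parses that line into a dictionary and classifies the failed node. The comparison it uses (failed node id `%d` equal to old primary id `%P` means the primary failed) is an assumption that happens to agree with both runs logged above; it is not confirmed to be the logic the script itself uses, and the function names are illustrative.

```python
import re

# Each placeholder appears as "%X = description [value]" in the stats line.
FIELD_RE = re.compile(r"%(\w)\s*=\s*[^\[]*\[([^\]]*)\]")

def parse_failover_stats(line):
    """Map placeholder letters (case-sensitive) to their bracketed values."""
    return {key: value for key, value in FIELD_RE.findall(line)}

def failed_role(stats):
    """Assumed rule: the failed node is the primary when %d equals %P."""
    return "primary" if stats["d"] == stats["P"] else "standby"

# The two failover-stats lines from the log above.
first_run = (
    "----failover-stats is %H = hostname of the new master node [], "
    "%P = old primary node id [1], %d = node id[1], %h = host name [], "
    "%O = old primary host[] %m = new master node id [0], "
    "%M = old master node id [0], "
    "%D = database cluster path [/home/kingbase/cluster/HAR3/db/data]."
)
second_run = (
    "----failover-stats is %H = hostname of the new master node [], "
    "%P = old primary node id [0], %d = node id[2], %h = host name [], "
    "%O = old primary host[] %m = new master node id [0], "
    "%M = old master node id [0], "
    "%D = database cluster path [/home/kingbase/cluster/HAR3/db/data]."
)

stats1 = parse_failover_stats(first_run)
stats2 = parse_failover_stats(second_run)
print(failed_role(stats1))  # primary
print(failed_role(stats2))  # standby
```

The results line up with the script's own messages: "master down, let become new primary" in the first run and "standby down, master still" in the second.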

IV. cluster.log on the standby node


---- Sun May 7 02:18:06 CST 2023 monitor up ----
2023-05-07 02:18:06: pid 31862: WARNING:  checking setuid bit of arping command
2023-05-07 02:18:07: pid 31886: LOG:  setting the remote node " Linux node101" as watchdog cluster master
2023-05-07 02:18:08: pid 31886: LOG:  watchdog node state changed from [INITIALIZING] to [STANDBY]
2023-05-07 02:18:08: pid 31886: LOG:  successfully joined the watchdog cluster as standby node

2. The number of failed health checks against the primary database reaches the threshold (HEALTH_CHECK_MAX_RETRIES=6).

2023-05-07 02:23:09: pid 31862: LOG:  health checking retry count 1
2023-05-07 02:23:09: pid 31862: LOG:  failed to connect to kingbase server on "", getsockopt() detected error "Connection refused"
2023-05-07 02:23:09: pid 31862: ERROR:  failed to make persistent db connection
2023-05-07 02:23:09: pid 31862: DETAIL:  connection to host:"" failed
2023-05-07 02:23:59: pid 31862: LOG:  health checking retry count 6
2023-05-07 02:23:59: pid 31862: LOG:  failed to connect to kingbase server on "", getsockopt() detected error "Connection refused"

3. The kingbasecluster standby node sends the master node a request to hold the failover lock and waits for the master node's reply.

2023-05-07 02:23:59: pid 31886: LOG:  failover request from local kingbasecluster node received on IPC interface is forwarded to master watchdog node " Linux node101"
2023-05-07 02:23:59: pid 31886: DETAIL:  waiting for the reply...
2023-05-07 02:25:16: pid 31886: LOG:  failover command lock request from local kingbasecluster node received on IPC interface is forwarded to master watchdog node " Linux node101"
2023-05-07 02:25:16: pid 31886: DETAIL:  waiting for the reply...


2023-05-07 02:25:16: pid 31862: LOG:  failover done. shutdown host
2023-05-07 02:25:30: pid 31862: LOG:  failback done. reconnect host
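
When analyzing a failover, it helps to know how long each phase took as seen from one node. The sketch below computes elapsed times between three standby-side events quoted above. It assumes every line starts with a 19-character `YYYY-MM-DD HH:MM:SS` timestamp; the helper name `event_time` is illustrative.

```python
from datetime import datetime

def event_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' timestamp of a cluster.log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")

# Lines copied from the standby node's cluster.log excerpts above.
first_retry = "2023-05-07 02:23:09: pid 31862: LOG:  health checking retry count 1"
forwarded = (
    "2023-05-07 02:23:59: pid 31886: LOG:  failover request from local "
    "kingbasecluster node received on IPC interface is forwarded to "
    'master watchdog node " Linux node101"'
)
failback_done = "2023-05-07 02:25:30: pid 31862: LOG:  failback done. reconnect host"

# Time from the first failed health check to the forwarded failover request.
detection = (event_time(forwarded) - event_time(first_retry)).total_seconds()
# Time from the first failed health check to the end of failback.
total = (event_time(failback_done) - event_time(first_retry)).total_seconds()

print(detection)  # 50.0
print(total)      # 141.0
```

All three timestamps come from the same node's log, so clock differences between nodes do not affect the result.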


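KingbaseES V8R3 cluster failover flow:
2. When the master and standby nodes detect that the number of failed health checks against the primary's database service has reached the threshold, failover is triggered.
3. Before failover, the master node must hold the failover lock. If the primary's host is down or rebooting, a kingbasecluster standby node becomes the master and acquires the failover lock.
4. Once the master node holds the failover lock, it runs failover_stream.sh to perform the failover. If the master node's host hangs, the script may not run, and the failover fails.
7. Only the master KingbaseCluster that holds the lock can perform operations such as electing a new primary and switching over. A standby KingbaseCluster can do so only after it has been re-elected as the new master.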

posted @ 2023-05-10 14:52  天涯客1224