KingbaseES V8R3集群运维案例之---ssh连接故障failover切换失败分析
案例说明:
KingbaseES V8R3集群,主库data挂在目录只读了,failover切换失败。主库failover日志,如下图所示:
适用版本:
KingbaseES V8R3
集群架构:
node 209 原主库
node 210 原备库
一、分析过程
1、查看主库recovery.log及sys_log,分析主库数据库服务down原因和时间点。
2、查看主备库cluster.log,备库是否检测到主库数据库服务down及触发了failover。
3、查看kingbasecluster的master节点failover.log日志,分析failover切换过程及失败原因。
4、查看原备库的recovery.log,原备库recovery的过程。
5、通过对不同日志的时间点对比,分析failover切换的原因及过程。
二、分析总结
1、主库在2023-10-15 07:57:01 ,检测到data目录只读,由于use_check_disk=on,所以关闭主库数据库服务。
2、通过主库failover.log日志获悉,从2023-10-15 08:01:44开始执行failover切换,但是由于ssh连接新主库加载vip失败,failover切换失败。
3、从备库cluster.log看,在2023-10-15 07:57:57时,备库检测到连接主库数据库服务失败,超过阈值10次后,触发failover请求到主库node209.
4、从备库recovery.log获悉,在备库通过ssh连接localhost检测vip地址是否存在时,一直是报错,无法检测vip是否在本机。所以,failover切换时,无法ssh连接新主库node210,导致failover切换失败,应该和这个vip检测及加载失败有关。
三、日志分析
1、主库node209日志分析
1)主库在2023-10-15 07:57:01 ,检测到data目录只读,由于use_check_disk=on,所以关闭主库数据库服务。
2023-10-15 07:57:01 recover beging...
........
could not create/write the file "/home/kingbase/cluster/clusterdb/db/data/rw_status_file_714111580".
unlink: 无法清除'/home/kingbase/cluster/clusterdb/db/data/rw_status_file_714111580' 的链接: 只读文件系统
unlink file failed "/home/kingbase/cluster/clusterdb/db/data/rw_status_file_714111580", please check it
2023-10-15 08:00:01 failed to check read/write on mount point (1 / 6).
2023-10-15 08:01:01 execute check_mount_point() failed, maybe the disk is error
2023-10-15 08:01:01 USE_CHECK_DISK = on, will exit with stop db.
stop db.....
等待服务器进程关闭 ..............................I'm already recovery now pid[19320], return nothing to do,will exit script will success
................................I'm already recovery now pid[19320], return nothing to do,will exit script will success
............................... 失败
sys_ctl: server进程没有关闭
2023-10-15 08:02:33 now will del vip [192.168.103.211/28]
but no 192.168.103.211/28 on my DEV, nothing to do with del
---------------------------------------------------------------------
2)通过主库failover.log日志获悉,从2023-10-15 08:01:44开始执行failover切换,但是由于ssh连接新主库加载vip失败,failover切换失败。
-----------------2023-10-15 08:01:44 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [192.168.103.210], %P = old primary node id [0], %d = node id[0], %h = host name [192.168.103.209], %O = old primary host[192.168.103.209] %m = new master node id [1], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/clusterdb/db/data].
----ping trust ip
ping trust ip 192.168.103.222 success
----determine whether the faulty db is master or standby
master down, let 192.168.103.210 become new primary.....
2023-10-15 08:01:46 del old primary VIP on 192.168.103.209
ssh connect host:192.168.103.209 success, will stop old primary db and del the vip
stop the old primary db
sys_ctl: 无法发送停止信号 (PID: 16367): 没有那个进程
DEL VIP NOW AT 2023-10-15 08:01:47 ON bond0
execute: [/sbin/ip addr del 192.168.103.211/28 dev bond0]
Oprate del ip cmd end.
2023-10-15 08:01:47 add VIP on 192.168.103.210
the new primary can not ssh, exit failover
2、备库node210日志分析
1)、从备库cluster.log看,在2023-10-15 07:57:57时,备库检测到连接主库数据库服务失败,超过阈值10次后,触发failover请求到主库node209.
# cluster.log:
2023-10-15 07:57:21: pid 47108: LOG: failed to connect to kingbase server on "192.168.103.209:62163", getsockopt() detected error "Connection refused"
2023-10-15 07:57:21: pid 47108: ERROR: failed to make persistent db connection
2023-10-15 07:57:57: pid 47108: LOG: health checking retry count 10
2023-10-15 07:57:57: pid 47108: LOG: failed to connect to kingbase server on "192.168.103.209:62163", getsockopt() detected error "Connection refused"
2023-10-15 07:57:57: pid 47108: ERROR: failed to make persistent db connection
2023-10-15 07:57:57: pid 47108: DETAIL: connection to host:"192.168.103.209:62163" failed
2023-10-15 07:57:57: pid 47108: LOG: setting backend node 0 status to NODE DOWN
2023-10-15 07:57:57: pid 47108: LOG: received degenerate backend request for node_id: 0 from pid [47108]
2023-10-15 07:57:57: pid 47144: LOG: new IPC connection received
2023-10-15 07:57:57: pid 47144: LOG: failover request from local kingbasecluster node received on IPC interface is forwarded to master watchdog node "192.168.103.209:9999 Linux Kylin"
2023-10-15 07:58:10: pid 47108: LOG: failover: set new primary node: -1
2023-10-15 07:58:10: pid 47108: LOG: failover: set new master node: 1
failover done. shutdown host 192.168.103.209(62163)2023-10-15 07:58:10: pid 47168: LOG: worker process received restart request
2023-10-15 07:58:10: pid 47108: LOG: failover done. shutdown host 192.168.103.209(62163)
2)从备库recovery.log获悉,在备库通过ssh连接localhost检测vip地址是否存在时,一直是报错,无法检测vip是否在本机。所以,failover切换时,无法ssh连接新主库node210,导致failover切换失败,应该和这个vip检测及加载失败有关。
# recovery.log:
2023-10-15 07:56:01 recover beging...
my pid is 52239,officially began to perform recovery
2023-10-15 07:56:01 check read/write on mount point
2023-10-15 07:56:01 check read/write on mount point (1 / 6).
2023-10-15 07:56:01 stat the directory of the mount point
.......
"/home/kingbase/cluster/clusterdb/db/data/rw_status_file_800201960" ...
2023-10-15 07:56:01 delete the file "/home/kingbase/cluster/clusterdb/db/data/rw_status_file_800201960" ... OK
2023-10-15 07:56:01 success to check read/write on mount point (1 / 6).
2023-10-15 07:56:01 check read/write on mount point ... ok
2023-10-15 07:56:01 check if the network is ok
ping trust ip 192.168.103.222 success ping times :[3], success times:[2]
determine if i am master or standby
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+-----------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.103.209 | 62163 | up | 0.500000 | primary | 0 | false | 0
1 | 192.168.103.210 | 62163 | up | 0.500000 | standby | 0 | true | 0
(2 rows)
i am standby in cluster,determine if recovery is needed
2023-10-15 07:56:03 now will del vip [192.168.103.211/28]
ip addr execute failed,will retry ,retry times:[1/3]
"ssh -o StrictHostKeyChecking=no -l root -T localhost "/sbin/ip addr | grep -w "192.168.103.211/28" | wc -l"" execute failed, error num=[ ]
ip addr execute failed,will retry ,retry times:[2/3]
"ssh -o StrictHostKeyChecking=no -l root -T localhost "/sbin/ip addr | grep -w "192.168.103.211/28" | wc -l"" execute failed, error num=[ ]
ip addr execute failed,will retry ,retry times:[3/3]
"ssh -o StrictHostKeyChecking=no -l root -T localhost "/sbin/ip addr | grep -w "192.168.103.211/28" | wc -l"" execute failed, error num=[ ]
del vip failed,after retry 3 times ,cannot del vip, will exit
---------------------------------------------------------------------
三、总结
此次集群failover切换失败是因为,切换过程中,连接新主库加载vip失败(通过ssh连接节点);在原备库的recovery.log日志中,已经很早就报了‘ssh -o StrictHostKeyChecking=no -l root -T localhost "/sbin/ip addr | grep -w "192.168.103.211/28" | wc -l"" execute failed, error num=[ ]’的故障,dba没有及时发现并处理这一问题,导致failover切换后,ssh连接新主库(原备库)节点加载vip失败,导致整个failover切换失败,dba应及时关注集群出现的故障日志并解决日志体现的问题。