随笔- 320 文章- 0 评论- 5 阅读- 34799

KingbaseES V8R3集群运维案例之---ssh连接故障failover切换失败分析

案例说明：
KingbaseES V8R3集群，主库data挂在目录只读了，failover切换失败。主库failover日志，如下图所示：

适用版本：
KingbaseES V8R3

集群架构：

node 209  原主库
 node 210  原备库

一、分析过程

1、查看主库recovery.log及sys_log，分析主库数据库服务down原因和时间点。
2、查看主备库cluster.log,备库是否检测到主库数据库服务down及触发了failover。
3、查看kingbasecluster的master节点failover.log日志，分析failover切换过程及失败原因。
4、查看原备库的recovery.log，原备库recovery的过程。
5、通过对不同日志的时间点对比，分析failover切换的原因及过程。

二、分析总结

1、主库在2023-10-15 07:57:01 ，检测到data目录只读，由于use_check_disk=on，所以关闭主库数据库服务。
2、通过主库failover.log日志获悉，从2023-10-15 08:01:44开始执行failover切换，但是由于ssh连接新主库加载vip失败，failover切换失败。
3、从备库cluster.log看，在2023-10-15 07:57:57时，备库检测到连接主库数据库服务失败，超过阈值10次后，触发failover请求到主库node209.
4、从备库recovery.log获悉，在备库通过ssh连接localhost检测vip地址是否存在时，一直是报错，无法检测vip是否在本机。所以，failover切换时，无法ssh连接新主库node210，导致failover切换失败，应该和这个vip检测及加载失败有关。

三、日志分析

1、主库node209日志分析

1）主库在2023-10-15 07:57:01 ，检测到data目录只读，由于use_check_disk=on，所以关闭主库数据库服务。

2023-10-15 07:57:01 recover beging...
........
could not create/write the file "/home/kingbase/cluster/clusterdb/db/data/rw_status_file_714111580".
unlink: 无法清除'/home/kingbase/cluster/clusterdb/db/data/rw_status_file_714111580' 的链接: 只读文件系统
unlink file failed "/home/kingbase/cluster/clusterdb/db/data/rw_status_file_714111580", please check it
2023-10-15 08:00:01 failed to check read/write on mount point (1 / 6).

2023-10-15 08:01:01 execute check_mount_point() failed, maybe the disk is error
2023-10-15 08:01:01 USE_CHECK_DISK = on, will exit with stop db.
stop db.....
等待服务器进程关闭 ..............................I'm already recovery now pid[19320], return nothing to do,will exit script will success
................................I'm already recovery now pid[19320], return nothing to do,will exit script will success
............................... 失败
sys_ctl: server进程没有关闭
2023-10-15 08:02:33 now will del vip [192.168.103.211/28]
but no 192.168.103.211/28 on my DEV, nothing to do with del
---------------------------------------------------------------------

2）通过主库failover.log日志获悉，从2023-10-15 08:01:44开始执行failover切换，但是由于ssh连接新主库加载vip失败，failover切换失败。

-----------------2023-10-15 08:01:44 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [192.168.103.210], %P = old primary node id [0], %d = node id[0], %h = host name [192.168.103.209], %O = old primary host[192.168.103.209] %m = new master node id [1], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/clusterdb/db/data].
----ping trust ip 
ping trust ip 192.168.103.222 success 
----determine whether the faulty db is master or standby 
master down, let 192.168.103.210 become new primary.....
 2023-10-15 08:01:46 del old primary VIP on 192.168.103.209
ssh connect host:192.168.103.209 success, will stop old primary db and del the vip
stop the old primary db
sys_ctl: 无法发送停止信号 (PID: 16367): 没有那个进程
DEL VIP NOW AT 2023-10-15 08:01:47 ON bond0
execute: [/sbin/ip addr del 192.168.103.211/28 dev bond0]
Oprate del ip cmd end.
2023-10-15 08:01:47 add VIP on 192.168.103.210
the new primary can not ssh, exit failover

2、备库node210日志分析

1）、从备库cluster.log看，在2023-10-15 07:57:57时，备库检测到连接主库数据库服务失败，超过阈值10次后，触发failover请求到主库node209.

# cluster.log:
2023-10-15 07:57:21: pid 47108: LOG:  failed to connect to kingbase server on "192.168.103.209:62163", getsockopt() detected error "Connection refused"
2023-10-15 07:57:21: pid 47108: ERROR:  failed to make persistent db connection

2023-10-15 07:57:57: pid 47108: LOG:  health checking retry count 10
2023-10-15 07:57:57: pid 47108: LOG:  failed to connect to kingbase server on "192.168.103.209:62163", getsockopt() detected error "Connection refused"
2023-10-15 07:57:57: pid 47108: ERROR:  failed to make persistent db connection
2023-10-15 07:57:57: pid 47108: DETAIL:  connection to host:"192.168.103.209:62163" failed
2023-10-15 07:57:57: pid 47108: LOG:  setting backend node 0 status to NODE DOWN
2023-10-15 07:57:57: pid 47108: LOG:  received degenerate backend request for node_id: 0 from pid [47108]
2023-10-15 07:57:57: pid 47144: LOG:  new IPC connection received
2023-10-15 07:57:57: pid 47144: LOG:  failover request from local kingbasecluster node received on IPC interface is forwarded to master watchdog node "192.168.103.209:9999 Linux Kylin"

2023-10-15 07:58:10: pid 47108: LOG:  failover: set new primary node: -1
2023-10-15 07:58:10: pid 47108: LOG:  failover: set new master node: 1
failover done. shutdown host 192.168.103.209(62163)2023-10-15 07:58:10: pid 47168: LOG:  worker process received restart request

2023-10-15 07:58:10: pid 47108: LOG:  failover done. shutdown host 192.168.103.209(62163)

2）从备库recovery.log获悉，在备库通过ssh连接localhost检测vip地址是否存在时，一直是报错，无法检测vip是否在本机。所以，failover切换时，无法ssh连接新主库node210，导致failover切换失败，应该和这个vip检测及加载失败有关。

# recovery.log:

2023-10-15 07:56:01 recover beging...
my pid is 52239,officially began to perform recovery
2023-10-15 07:56:01 check read/write on mount point
2023-10-15 07:56:01 check read/write on mount point (1 / 6).
2023-10-15 07:56:01 stat the directory of the mount point 
.......
"/home/kingbase/cluster/clusterdb/db/data/rw_status_file_800201960" ...
2023-10-15 07:56:01 delete the file "/home/kingbase/cluster/clusterdb/db/data/rw_status_file_800201960" ... OK
2023-10-15 07:56:01 success to check read/write on mount point (1 / 6).
2023-10-15 07:56:01 check read/write on mount point ... ok
2023-10-15 07:56:01 check if the network is ok
ping trust ip 192.168.103.222 success ping times :[3], success times:[2] 
determine if i am master or standby 
 node_id |    hostname     | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay 
---------+-----------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.103.209 | 62163 | up     | 0.500000  | primary | 0          | false             | 0
 1       | 192.168.103.210 | 62163 | up     | 0.500000  | standby | 0          | true              | 0
(2 rows)

i am standby in cluster,determine if recovery is needed
2023-10-15 07:56:03 now will del vip [192.168.103.211/28]
ip addr execute failed,will retry ,retry times:[1/3]
"ssh -o StrictHostKeyChecking=no -l root -T localhost "/sbin/ip addr | grep -w "192.168.103.211/28" | wc -l"" execute failed, error num=[   ]
ip addr execute failed,will retry ,retry times:[2/3]
"ssh -o StrictHostKeyChecking=no -l root -T localhost "/sbin/ip addr | grep -w "192.168.103.211/28" | wc -l"" execute failed, error num=[   ]
ip addr execute failed,will retry ,retry times:[3/3]
"ssh -o StrictHostKeyChecking=no -l root -T localhost "/sbin/ip addr | grep -w "192.168.103.211/28" | wc -l"" execute failed, error num=[   ]
del vip failed,after retry 3 times ,cannot del vip, will exit
---------------------------------------------------------------------

三、总结
此次集群failover切换失败是因为，切换过程中，连接新主库加载vip失败（通过ssh连接节点）；在原备库的recovery.log日志中，已经很早就报了‘ssh -o StrictHostKeyChecking=no -l root -T localhost "/sbin/ip addr | grep -w "192.168.103.211/28" | wc -l"" execute failed, error num=[ ]’的故障，dba没有及时发现并处理这一问题，导致failover切换后，ssh连接新主库（原备库）节点加载vip失败，导致整个failover切换失败，dba应及时关注集群出现的故障日志并解决日志体现的问题。