A Typical KingbaseES Cluster Case: Analyzing a Failover Triggered by a Primary Node Anomaly
Case description:
In a production environment, a system anomaly on the primary node of a KingbaseES V8R6 cluster triggered a primary-standby failover. The switchover itself succeeded, and the customer wanted to know what had triggered it. This case walks through the log analysis used to identify the cause of the failover in detail.
Applicable version:
KingbaseES V8R6
Cluster node information:
x.x.x.43  original primary
x.x.x.41  original standby
Operating system:
Kylin Server 10
Troubleshooting approach:
1. Analyze the standby's hamgr.log to establish the time point and sequence of the failover.
2. Analyze the primary's sys_log to check the database state before the failover and look for anomalies.
3. Analyze the primary's kbha.log for disk checks, gateway checks, and the status of the kbha and repmgrd processes.
4. Analyze the primary's system messages log for kernel status information.
5. Correlate the fault information recorded in these logs around the same time window to determine the specific cause of the primary node anomaly (a log-merging sketch follows this list).
6. Once the cause is identified, propose a corresponding solution.
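The correlation step amounts to lining up entries from several logs on one timeline. The Python sketch below is only an illustration of that idea; the file paths, timestamp patterns, and time window are assumptions and must be adapted to the actual deployment.

import re
from datetime import datetime

# Hypothetical log locations and the timestamp pattern used in each file.
LOGS = {
    "standby hamgr.log": ("/home/kingbase/cluster/log/hamgr.log",
                          r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]"),
    "primary sys_log":   ("/data/kingbase/data/sys_log/kingbase.log",
                          r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"),
    "primary kbha.log":  ("/home/kingbase/cluster/log/kbha.log",
                          r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]"),
}

def events(path, pattern, start, end):
    """Yield (timestamp, line) pairs whose timestamp falls inside [start, end]."""
    ts_re = re.compile(pattern)
    with open(path, errors="replace") as f:
        for line in f:
            m = ts_re.search(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            if start <= ts <= end:
                yield ts, line.rstrip()

if __name__ == "__main__":
    # The window around the failover seen in this case.
    start, end = datetime(2024, 5, 8, 10, 25), datetime(2024, 5, 8, 10, 35)
    merged = []
    for name, (path, pattern) in LOGS.items():
        merged += [(ts, name, line) for ts, line in events(path, pattern, start, end)]
    for ts, name, line in sorted(merged):
        print(f"{ts}  [{name}] {line}")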
1. Obtaining and analyzing the failover time point (standby hamgr.log)
As shown below, the standby's hamgr.log indicates that at around '2024-05-08 10:29:15' the repmgrd process on the standby detected that the primary's database service was abnormal and started PQping connection attempts. After all 6 attempts (the configured threshold) failed, failover was triggered and the standby was promoted to primary. A sketch of this retry logic follows the log excerpt.
[2024-05-08 10:29:15] [INFO] checking state of node 3, 1 of 6 attempts
[2024-05-08 10:29:25] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=10.31.1.43 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2024-05-08 10:29:25] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
........
[2024-05-08 10:30:55] [INFO] checking state of node 3, 6 of 6 attempts
[2024-05-08 10:31:05] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=10.31.1.43 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2024-05-08 10:31:05] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-05-08 10:31:05] [WARNING] unable to reconnect to node 3 after 6 attempts
......
NOTICE: promoting standby to primary
DETAIL: promoting server "ybywdb01" (ID: 1) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
INFO: SET synchronous TO "async" on primary host
NOTICE: STANDBY PROMOTE successful
DETAIL: server "ybywdb01" (ID: 1) was successfully promoted to primary
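In repmgrd, the check above is driven by libpq's PQping() together with the reconnect_attempts and reconnect_interval settings. The following Python sketch only mimics the shape of that loop under the values seen in the log (6 attempts, roughly 10-second spacing); the TCP-connect probe is a simplified stand-in for PQping(), not repmgrd's actual implementation.

import socket
import time

RECONNECT_ATTEMPTS = 6    # mirrors repmgr's reconnect_attempts
RECONNECT_INTERVAL = 10   # seconds, mirrors repmgr's reconnect_interval

def ping_primary(host, port, timeout=10):
    """Crude stand-in for PQping(): can we open a TCP connection to the primary?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def primary_is_unreachable(host, port):
    """Return True only after every reconnect attempt has failed."""
    for attempt in range(1, RECONNECT_ATTEMPTS + 1):
        print(f"checking state of primary, {attempt} of {RECONNECT_ATTEMPTS} attempts")
        if ping_primary(host, port):
            return False              # primary answered, no failover needed
        time.sleep(RECONNECT_INTERVAL)
    return True                       # threshold exceeded -> promote the standby

if __name__ == "__main__":
    # Host and port taken from the conninfo string in the log above.
    if primary_is_unreachable("10.31.1.43", 54321):
        print("all attempts failed; repmgrd would now promote the standby")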
2. Checking the primary's sys_log for the cause of the database anomaly
The database's sys_log can reveal abnormal conditions on the primary, such as slow queries, lock waits, out-of-memory kills, or core dumps from code faults, which raise database load or bring processes down and can thereby trigger a primary-standby failover.
As shown below, the primary's sys_log contains no slow-query, lock-wait, OOM, or core-dump errors before the failover was triggered. It does, however, repeatedly log "using stale statistics instead of current ones because stats collector is not responding"; in previous cases this message has been associated with heavy disk I/O pressure. A simple pattern-scanning sketch follows the excerpt.
2024-05-08 10:28:20 CST [2513598]: [1-1] user=,db=,client=LOG: using stale statistics instead of current ones because stats collector is not responding
2024-05-08 10:28:24 CST [3073927]: [220-1] user=,db=,client=LOG: using stale statistics instead of current ones because stats collector is not responding
2024-05-08 10:28:34 CST [3073927]: [221-1] user=,db=,client=LOG: using stale statistics instead of current ones because stats collector is not responding
2024-05-08 10:30:39 CST [2501733]: [11-1] user=utic_rulesuser,db=utic_rules,client=10.31.1.14LOG: duration: 149136.067 ms parse <unnamed>: SELECT aa.APPLICATION_ID,......
......
2024-05-08 10:31:17 CST [25218]: [251-1] user=,db=,client=LOG: received fast shutdown request
2024-05-08 10:31:19 CST [25218]: [252-1] user=,db=,client=LOG: aborting any active transactions
2024-05-08 10:31:19 CST [2492906]: [1-1] user=utic_rulesuser,db=utic_rules,client=10.31.1.13FATAL: terminating connection due to unexpected kingbase exit
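To screen a large sys_log for the symptom classes mentioned above, a small scanner such as the one below can help. It is only a sketch; the patterns are rough approximations of typical KingbaseES/PostgreSQL-style messages and should be adjusted to the wording actually present in the log.

import re
import sys

# Rough symptom patterns; extend or tighten them for the actual log format.
PATTERNS = {
    "slow query":       re.compile(r"duration: \d+\.\d+ ms"),
    "lock wait":        re.compile(r"still waiting for", re.I),
    "out of memory":    re.compile(r"out of memory", re.I),
    "stale statistics": re.compile(r"using stale statistics"),
    "shutdown/crash":   re.compile(r"fast shutdown request|unexpected kingbase exit"),
}

def scan(path):
    hits = {name: 0 for name in PATTERNS}
    with open(path, errors="replace") as f:
        for line in f:
            for name, pat in PATTERNS.items():
                if pat.search(line):
                    hits[name] += 1
    return hits

if __name__ == "__main__":
    # usage: python scan_syslog.py /path/to/sys_log/kingbase.log
    for name, count in scan(sys.argv[1]).items():
        print(f"{name:18s}: {count}")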
3. Checking the primary's kbha.log for disk-check information
The cluster's kbha process tests read/write access to the database data directory, and the results of these disk checks are recorded in kbha.log.
As shown below, the disk read/write check that started at "2024-05-08 10:28:30" in the primary's kbha.log should by default run 6 times, yet the log shows that only one round completed. A sketch of a similar read/write probe follows the excerpt.
[2024-05-08 10:28:30] [INFO] [thread 3571119872] begin to check the mount_point_list 1 / 6
[2024-05-08 10:28:30] [INFO] [thread 3571119872] check directory "/data/kingbase/data"
[2024-05-08 10:28:30] [INFO] [thread 3571119872] create file "/data/kingbase/data/rw_status_file_20240508102830" in directory "/data/kingbase/data" ..
[2024-05-08 10:28:32] [NOTICE] PING 10.31.1.200 (10.31.1.200) 56(84) bytes of data.
--- 10.31.1.200 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1018ms
rtt min/avg/max/mdev = 0.712/0.731/0.750/0.019 ms
[2024-05-08 10:28:33] [INFO] [thread 3571119872] create file "/data/kingbase/data/rw_status_file_20240508102830" in directory "/data/kingbase/data" .. OK
[2024-05-08 10:28:33] [INFO] [thread 3571119872] write file "/data/kingbase/data/rw_status_file_20240508102830" in directory "/data/kingbase/data" ..
[2024-05-08 10:28:36] [NOTICE] PING 10.31.1.200 (10.31.1.200) 56(84) bytes of data.
[2024-05-08 10:29:07] [WARNING] unable to ping "user=esrep dbname=esrep port=54321 host=10.31.1.43 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2024-05-08 10:29:07] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2024-05-08 10:29:10] [NOTICE] PING 10.31.1.200 (10.31.1.200) 56(84) bytes of data.
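The kbha entries above show a create-then-write probe against a status file in the data directory. The sketch below reproduces that idea in Python with a timeout, so that a hung filesystem is reported instead of blocking the caller indefinitely; it is not kbha's implementation, and the directory and timeout values are assumptions.

import os
import time
import multiprocessing

def _probe(data_dir):
    """Create a status file, write to it, flush it to disk, then remove it."""
    path = os.path.join(data_dir, "rw_status_file_" + time.strftime("%Y%m%d%H%M%S"))
    with open(path, "w") as f:
        f.write("rw-check\n")
        f.flush()
        os.fsync(f.fileno())          # force the write down to the device
    os.remove(path)

def check_rw(data_dir, timeout=10):
    """Return True if the create/write/fsync/unlink cycle finishes within `timeout` seconds."""
    p = multiprocessing.Process(target=_probe, args=(data_dir,))
    p.start()
    p.join(timeout)
    if p.is_alive():                  # probe is stuck, e.g. in uninterruptible I/O
        p.terminate()
        return False
    return p.exitcode == 0

if __name__ == "__main__":
    ok = check_rw("/data/kingbase/data")
    print("disk rw check:", "OK" if ok else "TIMEOUT/FAILED")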
4. Checking the primary's system messages log for OS status information
The system messages log records the kernel's status and error information; reading it shows the state of the operating system before the failover.
As shown below, starting at "May 8 10:31:14" the messages log records several kingbase-related processes blocked in the "D" state for more than 120 seconds, which matches the time window of the primary anomaly and the primary-standby failover.
May 8 10:31:14 ybywdb03 kernel: [6432582.020700] INFO: task kingbase:2514485 blocked for more than 120 seconds.
May 8 10:31:14 ybywdb03 kernel: [6432582.020709] Not tainted 4.19.90-23.8.v2101.ky10.x86_64 #1
May 8 10:31:14 ybywdb03 kernel: [6432582.020712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 8 10:31:14 ybywdb03 kernel: [6432582.020715] kingbase D 0 2514485 25218 0x00000080
May 8 10:31:14 ybywdb03 kernel: [6432582.020719] Call Trace:
May 8 10:31:14 ybywdb03 kernel: [6432582.020737] ? __schedule+0x296/0x910
May 8 10:31:14 ybywdb03 kernel: [6432582.020740] schedule+0x28/0x80
May 8 10:31:14 ybywdb03 kernel: [6432582.020743] schedule_timeout+0x1ee/0x3a0
May 8 10:31:14 ybywdb03 kernel: [6432582.020777] ? xfs_buf_find.isra.28+0x1c4/0x5c0 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.020780] __down+0xa7/0x100
May 8 10:31:14 ybywdb03 kernel: [6432582.020815] ? kmem_realloc+0x70/0xe0 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.020845] ? xfs_buf_find.isra.28+0x1e6/0x5c0 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.020851] ? down+0x3b/0x50
May 8 10:31:14 ybywdb03 kernel: [6432582.020853] down+0x3b/0x50
May 8 10:31:14 ybywdb03 kernel: [6432582.020879] xfs_buf_lock+0x33/0xf0 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.020907] xfs_buf_find.isra.28+0x1e6/0x5c0 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.020935] xfs_buf_get_map+0x40/0x2a0 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.020960] xfs_buf_read_map+0x28/0x170 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.020995] xfs_trans_read_buf_map+0xac/0x2e0 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.021023] xfs_read_agi+0x95/0x140 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.021053] xfs_iunlink+0x55/0x140 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.021083] ? xfs_trans_ichgtime+0x20/0x50 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.021113] xfs_remove+0x201/0x2d0 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.021145] xfs_vn_unlink+0x55/0xa0 [xfs]
May 8 10:31:14 ybywdb03 kernel: [6432582.021152] ? may_delete+0x79/0x180
May 8 10:31:14 ybywdb03 kernel: [6432582.021153] vfs_unlink+0x106/0x1a0
May 8 10:31:14 ybywdb03 kernel: [6432582.021157] do_unlinkat+0x2ba/0x320
May 8 10:31:14 ybywdb03 kernel: [6432582.021164] do_syscall_64+0x5b/0x1d0
May 8 10:31:14 ybywdb03 kernel: [6432582.021166] entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 8 10:31:14 ybywdb03 kernel: [6432582.021169] RIP: 0033:0x7f24410edff7
May 8 10:31:14 ybywdb03 kernel: [6432582.021179] Code: Bad RIP value.
May 8 10:31:14 ybywdb03 kernel: [6432582.021180] RSP: 002b:00007ffe394fc208 EFLAGS: 00000202 ORIG_RAX: 0000000000000057
May 8 10:31:14 ybywdb03 kernel: [6432582.021183] RAX: ffffffffffffffda RBX: 00007ffe394fc230 RCX: 00007f24410edff7
May 8 10:31:14 ybywdb03 kernel: [6432582.021184] RDX: 0000000000000001 RSI: 0000000000000014 RDI: 00007ffe394fc230
May 8 10:31:14 ybywdb03 kernel: [6432582.021185] RBP: 0000000000000014 R08: 0000000000000000 R09: 0000000000000001
May 8 10:31:14 ybywdb03 kernel: [6432582.021186] R10: 000000000000000f R11: 0000000000000202 R12: 0000000000000000
May 8 10:31:14 ybywdb03 kernel: [6432582.021186] R13: 0000000000000000 R14: 0000000000000006 R15: 0000000000000000
May 8 10:31:14 ybywdb03 kernel: [6432582.021190] INFO: task kingbase:2516527 blocked for more than 120 seconds.
May 8 10:31:14 ybywdb03 kernel: [6432582.021197] kingbase D 0 2516527 25218 0x00000080
.......
May 8 10:31:14 ybywdb03 kernel: [6432582.021343] INFO: task kingbase:2516528 blocked for more than 120 seconds.
May 8 10:31:14 ybywdb03 kernel: [6432582.021348] kingbase D 0 2516528 25218 0x00000080
......
May 8 10:31:14 ybywdb03 kernel: [6432582.021485] INFO: task kingbase:2516529 blocked for more than 120 seconds.
May 8 10:31:14 ybywdb03 kernel: [6432582.021490] kingbase D 0 2516529 25218 0x00000080
.......
May 8 10:31:14 ybywdb03 kernel: [6432582.021621] INFO: task kingbase:2516530 blocked for more than 120 seconds.
May 8 10:31:14 ybywdb03 kernel: [6432582.021626] kingbase D 0 2516530 25218 0x00000080
.......
May 8 10:31:14 ybywdb03 kernel: [6432582.021758] INFO: task kingbase:2516531 blocked for more than 120 seconds.
May 8 10:31:14 ybywdb03 kernel: [6432582.021763] kingbase D 0 2516531 25218 0x00000084
......
May 8 10:31:14 ybywdb03 kernel: [6432582.021889] INFO: task kingbase:2516532 blocked for more than 120 seconds.
May 8 10:31:14 ybywdb03 kernel: [6432582.021893] kingbase D 0 2516532 25218 0x00000084
.......
May 8 10:31:14 ybywdb03 kernel: [6432582.022019] INFO: task kingbase:2516534 blocked for more than 120 seconds.
May 8 10:31:14 ybywdb03 kernel: [6432582.022025] kingbase D 0 2516534 25218 0x00000084
......
May 8 10:31:14 ybywdb03 kernel: [6432582.022147] INFO: task kingbase:2516535 blocked for more than 120 seconds.
May 8 10:31:14 ybywdb03 kernel: [6432582.022152] kingbase D 0 2516535 25218 0x00000084
.......
May 8 10:31:14 ybywdb03 kernel: [6432582.022277] INFO: task kingbase:2516536 blocked for more than 120 seconds.
May 8 10:31:14 ybywdb03 kernel: [6432582.022281] kingbase D 0 2516536 25218 0x00000084
As the entries above show, the blocked kingbase processes are in the "D" (uninterruptible sleep) state; processes in this state are usually waiting for I/O to complete under heavy I/O pressure.
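As a minimal sketch, D-state processes can be enumerated directly from /proc; this is roughly equivalent to running `ps -eo pid,stat,comm` and filtering on the "D" flag.

import os

def d_state_processes():
    """Return (pid, comm) pairs for processes currently in uninterruptible sleep."""
    procs = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                data = f.read()
        except OSError:
            continue                          # process exited while scanning
        # /proc/<pid>/stat: pid (comm) state ...; comm may contain spaces,
        # so take everything between the outermost parentheses.
        comm = data[data.find("(") + 1 : data.rfind(")")]
        state = data[data.rfind(")") + 2]
        if state == "D":
            procs.append((int(pid), comm))
    return procs

if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(f"{pid:>8}  {comm}")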
5. Consulting Linux documentation to locate the system fault
A Red Hat knowledge-base article, "xfs_buf deadlock between inode deletion and block allocation", describes a failure whose error log is essentially the same as the entries in this system's messages log.
Document link: https://access.redhat.com/solutions/6984334
As shown below, the error log in that article closely matches the primary's messages log:
RHEL8: xfs_buf deadlock between inode deletion and block allocation
Issue
The task is blocked with below logs:
[6553607.856923] INFO: task irisdb:1499256 blocked for more than 120 seconds.
[6553607.856958] Tainted: G OE --------- - - 4.18.0-372.9.1.el8.x86_64 #1
[6553607.856980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[6553607.857001] task:irisdb state:D stack: 0 pid:1499256 ppid: 1 flags:0x10004080
[6553607.857004] Call Trace:
[6553607.857007] __schedule+0x2d1/0x830
[6553607.857016] schedule+0x35/0xa0
[6553607.857018] schedule_timeout+0x274/0x300
[6553607.857023] ? dma_map_sg_attrs+0x30/0x50
[6553607.857027] ? scsi_dma_map+0x35/0x40
[6553607.857033] __down+0x9a/0xf0
[6553607.857037] ? sd_init_command+0x1e9/0xc80 [sd_mod]
[6553607.857042] ? xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
[6553607.857104] down+0x3b/0x50
[6553607.857108] xfs_buf_lock+0x33/0xf0 [xfs]
[6553607.857147] xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
[6553607.857189] xfs_buf_get_map+0x4c/0x320 [xfs]
[6553607.857230] xfs_buf_read_map+0x53/0x310 [xfs]
[6553607.857267] ? xfs_read_agf+0x8e/0x120 [xfs]
[6553607.857295] xfs_trans_read_buf_map+0x1cb/0x360 [xfs]
[6553607.857343] ? xfs_read_agf+0x8e/0x120 [xfs]
[6553607.857371] xfs_read_agf+0x8e/0x120 [xfs]
[6553607.857399] xfs_alloc_read_agf+0x3e/0x1e0 [xfs]
[6553607.857427] xfs_alloc_fix_freelist+0x234/0x480 [xfs]
.......
[6553607.857973] RIP: 0033:0x7fab3945e60b
[6553607.857978] Code: Unable to access opcode bytes at RIP 0x7fab3945e5e1.
[6553607.857979] RSP: 002b:00007ffeb1429a58 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
[6553607.857981] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007fab3945e60b
[6553607.857982] RDX: 00007ffeb1429a95 RSI: 00007ffeb1428e8a RDI: 00007ffeb1429a60
[6553607.857983] RBP: 00007fab3b0c2690 R08: 00007ffeb1428e14 R09: 0000000000000000
[6553607.857985] R10: ff00ff00ff00ff00 R11: 0000000000000246 R12: 0000000000000034
[6553607.857986] R13: 0000000000000068 R14: ffffffffffffffff R15: 0000000000000003
Explanation of the error:
xfs_buf deadlock between inode deletion and block allocation
This message indicates a deadlock inside the XFS filesystem: the deletion of an inode and a concurrent block allocation contend with each other in a way that prevents either operation from making progress.
In XFS, inodes hold file metadata while blocks hold file data. Deleting a file normally involves two steps: removing the inode and then freeing the associated blocks. If these two code paths are not properly synchronized, a deadlock can result.
Possible remedies:
- Update the system firmware and kernel to the latest versions to pick up fixes for known deadlock issues.
- If the problem occurs under heavy I/O load, try to reduce that load, for example by throttling filesystem throughput or adjusting priorities (see the sketch after this list).
- If it occurs under highly concurrent operations, reduce the degree of concurrency to avoid contention.
- Where possible, schedule deletion-heavy operations in low-load periods to reduce the chance of a deadlock.
- If it occurs in a specific application, review that application's file-handling logic and avoid mixing different kinds of file operations on the same data at the same time.
- If the problem persists, consider contacting the XFS community or professional technical support.
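As one concrete, hedged example of "throttling filesystem throughput", write bandwidth for a group of processes can be capped with the cgroup v2 io.max controller. The cgroup name, device major:minor number, and 10 MB/s limit below are assumptions for illustration only; whether throttling is appropriate at all, and at what level, is a decision for the system administrators.

import os

CGROUP = "/sys/fs/cgroup/kingbase_throttle"   # hypothetical cgroup name
DEVICE = "253:0"                              # major:minor of the data volume (check with lsblk)
LIMIT = "wbps=10485760"                       # cap writes at 10 MB/s; unspecified keys stay at max

# Requires root and the "io" controller enabled in the parent's cgroup.subtree_control.
os.makedirs(CGROUP, exist_ok=True)
with open(os.path.join(CGROUP, "io.max"), "w") as f:
    f.write(f"{DEVICE} {LIMIT}\n")

# Target processes are then moved into the group, e.g.:
#   echo <pid> > /sys/fs/cgroup/kingbase_throttle/cgroup.procs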
6. Summary of the analysis
- The standby's hamgr.log shows that at around '2024-05-08 10:29:15' the repmgrd process on the standby detected that the primary's database service was abnormal and started PQping connection attempts.
- The primary's sys_log shows the message "using stale statistics instead of current ones because stats collector is not responding" from around "2024-05-08 10:28:20"; this message is generally caused by blocked disk I/O.
- The primary's kbha.log shows that the disk read/write check starting at "2024-05-08 10:28:30", which by default runs 6 times, completed only one round, i.e. the disk read/write check was abnormal.
- Starting at "May 8 10:31:14", the system messages log records multiple kingbase processes blocked in the "D" state for more than 120 seconds, matching the time of the primary anomaly and the failover.
- According to the Red Hat documentation, the system fault appears to be related to an XFS bug that triggers an XFS deadlock and blocks I/O.
In summary, this primary-node anomaly was caused by an operating-system fault that left kingbase processes stuck in the "D" state. With those processes blocked, the standby repmgrd's checks against the primary failed; once the number of failed PQping attempts exceeded the threshold, failover was triggered. The system fault is suspected to be related to an XFS bug, which the system administrators need to confirm further.