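KingbaseES V8R6 Database Case Study: Database Fails to Start After Disk Damage
Case description:
In a production environment, the host's system disk was damaged, so neither the operating system nor the database service could start. After the system was repaired, starting the database failed with an error saying that no valid checkpoint could be found (shown in a screenshot in the original post).
After running resetwal, startup then failed with a "replication checkpoint" error (also shown in a screenshot in the original post).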
I. Fault analysis
With help from the R&D team, we read the relevant source code.
An excerpt from the StartupReplicationOrigin() function:
if (readBytes != sizeof(magic))
{
    if (readBytes < 0)
        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not read file \"%s\": %m",
                        path)));
    else
        ereport(PANIC,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("could not read file \"%s\": read %d of %zu",
                        path, readBytes, sizeof(magic))));
}
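This error is most likely related to the sys_logical/replorigin_checkpoint file, which is used for logical replication. The file is rewritten when the database shuts down cleanly, but if the database service is shut down abnormally and the file is left damaged, the database service fails to start.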
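To make the check concrete, here is a minimal shell sketch that inspects the leading 4 bytes of the file in the way the startup code is described as doing. It is only an illustration: the function name check_replorigin_magic is hypothetical, the expected value 307747550 is taken from the PANIC message quoted in this article, od is assumed to read the integer in the host's byte order (the same order the server on that host would use), and any checksum stored after the magic is not verified.

```shell
#!/bin/sh
# Expected magic, copied from the PANIC message in this article (0x1257DADE).
EXPECTED_MAGIC=307747550

# check_replorigin_magic FILE   (hypothetical helper, not a KingbaseES tool)
# Prints "ok", "short N" (fewer than 4 bytes) or "bad-magic VALUE".
# Returns 0 only when the magic matches.
check_replorigin_magic() {
    file=$1
    [ -f "$file" ] || { echo "missing"; return 2; }

    # File size in bytes; tr strips the padding some wc implementations add.
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -lt 4 ]; then
        echo "short $size"
        return 1
    fi

    # First 4 bytes as an unsigned 32-bit integer in host byte order.
    magic=$(od -An -tu4 -N4 "$file" | tr -d ' \n')
    if [ "$magic" = "$EXPECTED_MAGIC" ]; then
        echo "ok"
    else
        echo "bad-magic $magic"
        return 1
    fi
}
```

With the instance stopped, it could be called as check_replorigin_magic /home/kingbase/db/r6_c8/data/sys_logical/replorigin_checkpoint. A "short" result corresponds to the "could not read file" PANIC, and a "bad-magic" result to the "wrong stateMagic" PANIC.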
II. Reproducing the fault
1. Check the file information
[kingbase@node201 data]$ cd sys_logical/
[kingbase@node201 sys_logical]$ ls -lh
total 4.0K
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 mappings
-rw------- 1 kingbase kingbase 8 Feb 21 16:08 replorigin_checkpoint
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 snapshots
2. Reproduce the fault
1) Stop the database service
[kingbase@node201 sys_logical]$ /opt/Kingbase/ES/R6_C8/Server/bin/sys_ctl stop -D /home/kingbase/db/r6_c8/data/
waiting for server to shut down.... done
server stopped
2) Simulate damage to the file (overwrite it with 8 zero bytes)
[kingbase@node201 sys_logical]$ dd if=/dev/zero of=replorigin_checkpoint bs=1 count=8
8+0 records in
8+0 records out
8 bytes (8 B) copied, 0.000241202 s, 33.2 kB/s
[kingbase@node201 sys_logical]$ ls -lh
total 4.0K
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 mappings
-rw------- 1 kingbase kingbase 8 Feb 21 16:09 replorigin_checkpoint
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 snapshots
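Because ls only shows that the file is still 8 bytes long, it can help to look at the bytes themselves after the dd step. A small sketch, where hex_bytes is a hypothetical helper name built on the standard od utility:

```shell
#!/bin/sh
# hex_bytes FILE   (hypothetical helper)
# Prints every byte of FILE as space-separated two-digit hex values on one line.
hex_bytes() {
    # -An: no offset column, -v: do not collapse repeated lines, -tx1: one byte per value
    od -An -v -tx1 "$1" | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}
```

For the zero-filled file produced above, hex_bytes replorigin_checkpoint prints eight "00" values, which is consistent with the magic value 0 reported when the server is started.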
3) Start the database
As shown below, the database fails to start with the error "PANIC: replication checkpoint has wrong stateMagic 0 instead of 307747550":
[kingbase@node201 sys_log]$ tail -100 kingbase-2024-02-21_160959.log
2024-02-21 16:09:59.682 CST [27639] LOG: database system was shut down at 2024-02-21 16:08:50 CST
2024-02-21 16:09:59.683 CST [27639] PANIC: replication checkpoint has wrong stateMagic 0 instead of 307747550
2024-02-21 16:09:59.683 CST [27639] LOG: kingbase ran into a problem it couldn't handle,it needs to be shutdown to prevent damage to your data
2024-02-21 16:09:59.691 CST [27639] WARNING:
ERROR: -----------------------stack error start-----------------------
ERROR: TIME: 2024-02-21 16:09:59.683471+08
ERROR: 1 27639 0x7fe924505555 OutPutBacktrace (backtrace.so)
ERROR: 2 27639 0x7fe9245055fa <symbol not found> (backtrace.so)
ERROR: 3 27639 0x7fe932f19100 <symbol not found> (libpthread.so.0)
ERROR: 4 27639 0x7fe930d425f7 gsignal (libc.so.6)
ERROR: 5 27639 0x7fe930d43ce8 abort (libc.so.6)
ERROR: 6 27639 0x97ad3a errfinish + 0xcfc3719a
ERROR: 7 27639 0x8f2e27 BeginRepOrigin + 0xcfbaf287
ERROR: 8 27639 0x53f216 StartupWAL + 0xcf7fb676
ERROR: 9 27639 0x8cce48 StartupProcessMain + 0xcfb892a8
ERROR: 10 27639 0x560e85 KdbAuxiliaryProcessMain + 0xcf81d2e5
ERROR: 11 27639 0x8c8acf start_child_process + 0xcfb84f2f
ERROR: 12 27639 0x8cc075 KesMasterMain + 0xcfb884d5
ERROR: 13 27639 0x4a726c main + 0xcf7636cc
ERROR: 14 27639 0x7fe930d2eb15 __libc_start_main (libc.so.6)
ERROR: 15 27639 0x4a730a _start + 0xcf7788ea
2024-02-21 16:10:00.578 CST [27637] LOG: startup process (PID 27639) was terminated by signal 6: Aborted
2024-02-21 16:10:00.578 CST [27637] LOG: aborting startup due to startup process failure
2024-02-21 16:10:00.614 CST [27637] LOG: database system is shut down
III. Resolution
Solution 1:
1. Rename the replorigin_checkpoint file (here to replorigin_checkpoint.bk)
[kingbase@node201 sys_logical]$ ls -lh
total 4.0K
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 mappings
-rw------- 1 kingbase kingbase 5 Feb 21 17:15 replorigin_checkpoint.bk
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 snapshots
2. Start the database
[kingbase@node201 sys_logical]$ /opt/Kingbase/ES/R6_C8/Server/bin/sys_ctl start -D /home/kingbase/db/r6_c8/data/
.......
server started
# After the database starts, the file is recreated
[kingbase@node201 sys_logical]$ ls -lh
total 8.0K
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 mappings
-rw------- 1 kingbase kingbase 8 Feb 21 17:52 replorigin_checkpoint
-rw------- 1 kingbase kingbase 5 Feb 21 17:15 replorigin_checkpoint.bk
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 snapshots
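The rename in Solution 1 could be scripted roughly as follows. This is a sketch under stated assumptions rather than an official procedure: quarantine_replorigin is a hypothetical function name, the database is assumed to be stopped already, the sys_logical path is the one shown in this article, and a timestamp suffix is used instead of the plain .bk suffix so that repeated runs do not overwrite an earlier copy.

```shell
#!/bin/sh
# quarantine_replorigin DATA_DIR   (hypothetical helper)
# Moves DATA_DIR/sys_logical/replorigin_checkpoint aside and prints the new name.
# Assumes the database service has already been stopped.
quarantine_replorigin() {
    datadir=$1
    src="$datadir/sys_logical/replorigin_checkpoint"
    if [ ! -f "$src" ]; then
        echo "not found: $src" >&2
        return 1
    fi

    # Keep the damaged file for later analysis instead of deleting it.
    dst="$src.bk.$(date +%Y%m%d%H%M%S)"
    if [ -e "$dst" ]; then
        echo "backup already exists: $dst" >&2
        return 1
    fi
    mv "$src" "$dst" && echo "$dst"
}
```

For the environment in this article the call would be quarantine_replorigin /home/kingbase/db/r6_c8/data, followed by the sys_ctl start command shown above. Keeping the renamed file around makes it possible to inspect what the damage looked like afterwards.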
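Solution 2:
Based on source code analysis. The excerpt below declares tmppath and tmpfd, so it appears to come from the routine that writes the checkpoint file; in upstream PostgreSQL, StartupReplicationOrigin() begins with the same max_replication_slots check: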
{
    const char *tmppath = "pg_logical/replorigin_checkpoint.tmp";
    const char *path = "pg_logical/replorigin_checkpoint";
    int         tmpfd;
    int         i;
    uint32      magic = REPLICATION_STATE_MAGIC;
    pg_crc32c   crc;

    if (max_replication_slots == 0)
        return;
    ......
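# When max_replication_slots is 0, the function returns early, so replorigin_checkpoint is not checked.
1. Configure max_replication_slots
As shown below, set max_replication_slots = 0 in kingbase.conf so that the replorigin_checkpoint check is skipped at startup: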
[kingbase@node201 sys_logical]$ cat ../kingbase.conf|grep replication_slots
max_replication_slots = 0 # max number of replication slots
2. Start the database service
[kingbase@node201 sys_logical]$ /opt/Kingbase/ES/R6_C8/Server/bin/sys_ctl start -D /home/kingbase/db/r6_c8/data/
.......
server started
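Since cat piped into grep also matches commented-out lines, a slightly stricter check of the setting can be useful before starting the server. The following is a minimal sketch: conf_value is a hypothetical helper, it reads only the one file it is given, and it does not follow include directives or any other place where the parameter might be overridden.

```shell
#!/bin/sh
# conf_value FILE PARAM   (hypothetical helper)
# Prints the value of the last uncommented "PARAM = value" line in FILE.
# Prints nothing if PARAM is not set in that file.
conf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\([^#[:space:]]*\).*/\1/p" "$1" | tail -n 1
}
```

In this article's environment, conf_value /home/kingbase/db/r6_c8/data/kingbase.conf max_replication_slots would be expected to print 0 after the change. The last matching line is used because, when a parameter appears more than once in a configuration file, the later entry is the one that takes effect.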
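IV. Appendix
Failure 1: "PANIC: could not read file "sys_logical/replorigin_checkpoint": read 3 of 4"
1. Simulate the replorigin_checkpoint failure
As shown below, the replorigin_checkpoint file is 8 bytes by default; here it is overwritten so that it is shorter than 4 bytes (echo writes 3 bytes: "aa" plus a newline):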
[kingbase@node201 sys_logical]$ echo 'aa' > replorigin_checkpoint
[kingbase@node201 sys_logical]$ ls -lh
total 4.0K
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 mappings
-rw------- 1 kingbase kingbase 3 Feb 21 17:12 replorigin_checkpoint
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 snapshots
2. Start the database
As shown below, the database service fails to start with the error "PANIC: could not read file "sys_logical/replorigin_checkpoint": read 3 of 4":
[kingbase@node201 sys_logical]$ /opt/Kingbase/ES/R6_C8/Server/bin/sys_ctl start -D /home/kingbase/db/r6_c8/data/
.......
sys_ctl: could not start server
# sys_log output:
[kingbase@node201 sys_log]$ tail -100 kingbase-2024-02-21_171412.log
2024-02-21 17:14:12.806 CST [31366] LOG: database system was shut down at 2024-02-21 16:08:50 CST
2024-02-21 17:14:12.806 CST [31366] PANIC: could not read file "sys_logical/replorigin_checkpoint": read 3 of 4
2024-02-21 17:14:12.807 CST [31366] LOG: kingbase ran into a problem it couldn't handle,it needs to be shutdown to prevent damage to your data
2024-02-21 17:14:12.813 CST [31366] WARNING:
ERROR: -----------------------stack error start-----------------------
ERROR: TIME: 2024-02-21 17:14:12.807151+08
ERROR: 1 31366 0x7f3f30ed9555 OutPutBacktrace (backtrace.so)
ERROR: 2 31366 0x7f3f30ed95fa <symbol not found> (backtrace.so)
ERROR: 3 31366 0x7f3f3f8ed100 <symbol not found> (libpthread.so.0)
ERROR: 4 31366 0x7f3f3d7165f7 gsignal (libc.so.6)
ERROR: 5 31366 0x7f3f3d717ce8 abort (libc.so.6)
ERROR: 6 31366 0x97ad3a errfinish + 0xc326319a
ERROR: 7 31366 0x8f2ba8 BeginRepOrigin + 0xc31db008
ERROR: 8 31366 0x53f216 StartupWAL + 0xc2e27676
ERROR: 9 31366 0x8cce48 StartupProcessMain + 0xc31b52a8
ERROR: 10 31366 0x560e85 KdbAuxiliaryProcessMain + 0xc2e492e5
ERROR: 11 31366 0x8c8acf start_child_process + 0xc31b0f2f
ERROR: 12 31366 0x8cc075 KesMasterMain + 0xc31b44d5
ERROR: 13 31366 0x4a726c main + 0xc2d8f6cc
ERROR: 14 31366 0x7f3f3d702b15 __libc_start_main (libc.so.6)
ERROR: 15 31366 0x4a730a _start + 0xc2da48ea
2024-02-21 17:14:13.542 CST [31364] LOG: startup process (PID 31366) was terminated by signal 6: Aborted
2024-02-21 17:14:13.542 CST [31364] LOG: aborting startup due to startup process failure
2024-02-21 17:14:13.576 CST [31364] LOG: database system is shut down
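Failure 2: "PANIC: replication checkpoint has wrong stateMagic 0 instead of 307747550"
1. Simulate the replorigin_checkpoint failure
As shown below, the replorigin_checkpoint file is 8 bytes by default; here it is at least 4 bytes long, but its first 4 bytes do not hold the expected magic value (the file is overwritten with 5 zero bytes):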
[kingbase@node201 sys_logical]$ dd if=/dev/zero of=replorigin_checkpoint bs=1 count=5
5+0 records in
5+0 records out
5 bytes (5 B) copied, 0.00024672 s, 20.3 kB/s
[kingbase@node201 sys_logical]$ ls -lh
total 4.0K
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 mappings
-rw------- 1 kingbase kingbase 5 Feb 21 17:15 replorigin_checkpoint
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 snapshots
2. Start the database
As shown below, the database service fails to start with the error "PANIC: replication checkpoint has wrong stateMagic 0 instead of 307747550":
[kingbase@node201 sys_logical]$ /opt/Kingbase/ES/R6_C8/Server/bin/sys_ctl start -D /home/kingbase/db/r6_c8/data/
.......
sys_ctl: could not start server
# sys_log output:
[kingbase@node201 sys_log]$ tail -1000 kingbase-2024-02-21_171559.log
2024-02-21 17:15:59.487 CST [31539] LOG: database system was shut down at 2024-02-21 16:08:50 CST
2024-02-21 17:15:59.487 CST [31539] PANIC: replication checkpoint has wrong stateMagic 0 instead of 307747550
2024-02-21 17:15:59.487 CST [31539] LOG: kingbase ran into a problem it couldn't handle,it needs to be shutdown to prevent damage to your data
2024-02-21 17:15:59.495 CST [31539] WARNING:
ERROR: -----------------------stack error start-----------------------
ERROR: TIME: 2024-02-21 17:15:59.487848+08
ERROR: 1 31539 0x7f1c7d671555 OutPutBacktrace (backtrace.so)
ERROR: 2 31539 0x7f1c7d6715fa <symbol not found> (backtrace.so)
ERROR: 3 31539 0x7f1c8c085100 <symbol not found> (libpthread.so.0)
ERROR: 4 31539 0x7f1c89eae5f7 gsignal (libc.so.6)
ERROR: 5 31539 0x7f1c89eafce8 abort (libc.so.6)
ERROR: 6 31539 0x97ad3a errfinish + 0x76acb19a
ERROR: 7 31539 0x8f2e27 BeginRepOrigin + 0x76a43287
ERROR: 8 31539 0x53f216 StartupWAL + 0x7668f676
ERROR: 9 31539 0x8cce48 StartupProcessMain + 0x76a1d2a8
ERROR: 10 31539 0x560e85 KdbAuxiliaryProcessMain + 0x766b12e5
ERROR: 11 31539 0x8c8acf start_child_process + 0x76a18f2f
ERROR: 12 31539 0x8cc075 KesMasterMain + 0x76a1c4d5
ERROR: 13 31539 0x4a726c main + 0x765f76cc
ERROR: 14 31539 0x7f1c89e9ab15 __libc_start_main (libc.so.6)
ERROR: 15 31539 0x4a730a _start + 0x7660c8ea
2024-02-21 17:16:00.359 CST [31537] LOG: startup process (PID 31539) was terminated by signal 6: Aborted
2024-02-21 17:16:00.359 CST [31537] LOG: aborting startup due to startup process failure
2024-02-21 17:16:00.397 CST [31537] LOG: database system is shut down