
KingbaseES V8R6 Database Case Study: Startup Failure Caused by Disk Damage

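Case description:
In a production environment, the host's system disk was damaged, so neither the operating system nor the database service could start. After the system was repaired, the database still failed to start because it could not find a valid checkpoint record:

[Figure: screenshot of the startup error]

After resetwal was run, a new "replication checkpoint" error appeared:

[Figure: screenshot of the "replication checkpoint" error]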

I. Fault Analysis
With help from the R&D team, we read the source code.
The following is an excerpt from the StartupReplicationOrigin() function:

if (readBytes != sizeof(magic))
{
    if (readBytes < 0)
        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not read file \"%s\": %m",
                        path)));
    else
        ereport(PANIC,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("could not read file \"%s\": read %d of %zu",
                        path, readBytes, sizeof(magic))));
}

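The error appears to be related to the sys_logical/replorigin_checkpoint file, which is used by logical replication. The file is reset when the database shuts down cleanly. If the database service is shut down abnormally and the file is left damaged, the service fails to start.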
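
To make the check in the excerpt concrete, here is a minimal Python sketch that mimics it: read the first 4 bytes, report a short read, then compare the value with the expected magic number. The constant is taken from the PANIC message quoted in this article. The function name classify_checkpoint, the demo helper, and the little-endian byte order (typical of x86_64 hosts) are assumptions for illustration, not KingbaseES code.

```python
import os
import tempfile

# Expected magic, as printed in the PANIC message (307747550 == 0x1257DADE).
REPLICATION_STATE_MAGIC = 307747550
MAGIC_SIZE = 4  # size of a uint32


def classify_checkpoint(path, byteorder="little"):
    """Diagnose a replorigin_checkpoint file from its first 4 bytes.

    byteorder is an assumption: little-endian matches x86_64 hosts.
    """
    if not os.path.exists(path):
        return "missing"
    with open(path, "rb") as f:
        head = f.read(MAGIC_SIZE)
    if len(head) != MAGIC_SIZE:
        return f"short read: {len(head)} of {MAGIC_SIZE}"
    magic = int.from_bytes(head, byteorder)
    if magic != REPLICATION_STATE_MAGIC:
        return f"wrong magic {magic} instead of {REPLICATION_STATE_MAGIC}"
    return "magic ok"


def demo():
    """Build scratch files in a temporary directory and classify each one."""
    results = {}
    with tempfile.TemporaryDirectory() as d:
        samples = {
            "three_bytes": b"aa\n",
            "eight_zeros": bytes(8),
            "good_magic": REPLICATION_STATE_MAGIC.to_bytes(4, "little") + bytes(4),
        }
        for name, payload in samples.items():
            p = os.path.join(d, name)
            with open(p, "wb") as f:
                f.write(payload)
            results[name] = classify_checkpoint(p)
        results["absent"] = classify_checkpoint(os.path.join(d, "absent"))
    return results


if __name__ == "__main__":
    for name, verdict in demo().items():
        print(f"{name}: {verdict}")
```

A "magic ok" verdict only covers the first 4 bytes; this sketch does not validate the rest of the file.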

II. Reproducing the Fault

1. Inspect the file

[kingbase@node201 data]$ cd sys_logical/
[kingbase@node201 sys_logical]$ ls -lh
total 4.0K
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 mappings
-rw------- 1 kingbase kingbase 8 Feb 21 16:08 replorigin_checkpoint
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 snapshots

2. Reproduce the fault
1) Stop the database service

[kingbase@node201 sys_logical]$ /opt/Kingbase/ES/R6_C8/Server/bin/sys_ctl stop -D /home/kingbase/db/r6_c8/data/
waiting for server to shut down.... done
server stopped

2) Simulate file corruption by overwriting the file with 8 zero bytes

[kingbase@node201 sys_logical]$ dd if=/dev/zero of=replorigin_checkpoint bs=1 count=8
8+0 records in
8+0 records out
8 bytes (8 B) copied, 0.000241202 s, 33.2 kB/s

[kingbase@node201 sys_logical]$ ls -lh
total 4.0K
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 mappings
-rw------- 1 kingbase kingbase 8 Feb 21 16:09 replorigin_checkpoint
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 snapshots
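
The listing above only shows that the size is still 8 bytes. To confirm what the file now contains, its first bytes can be dumped. The following is a small Python sketch that works on a scratch file in a temporary directory; hex_head, zero_fill and demo are hypothetical helper names, and zero_fill only imitates the dd command above.

```python
import os
import tempfile


def hex_head(path, count=8):
    """Return the first `count` bytes of a file as space-separated hex pairs."""
    with open(path, "rb") as f:
        return f.read(count).hex(" ")


def zero_fill(path, count):
    """Overwrite `path` with `count` zero bytes, like dd if=/dev/zero bs=1 count=N."""
    with open(path, "wb") as f:
        f.write(bytes(count))


def demo():
    """Zero-fill a scratch file and report its size and leading bytes."""
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "replorigin_checkpoint")
        zero_fill(p, 8)
        return os.path.getsize(p), hex_head(p)


if __name__ == "__main__":
    size, head = demo()
    print(size, head)
```

On a real data directory, only hex_head would be needed, and only for reading.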

3) Start the database
As the log below shows, the database fails to start with “PANIC: replication checkpoint has wrong stateMagic 0 instead of 307747550”:

[kingbase@node201 sys_log]$ tail -100 kingbase-2024-02-21_160959.log
2024-02-21 16:09:59.682 CST [27639] LOG:  database system was shut down at 2024-02-21 16:08:50 CST
2024-02-21 16:09:59.683 CST [27639] PANIC:  replication checkpoint has wrong stateMagic 0 instead of 307747550
2024-02-21 16:09:59.683 CST [27639] LOG:  kingbase ran into a problem it couldn't handle,it needs to be shutdown to prevent damage to your data
2024-02-21 16:09:59.691 CST [27639] WARNING:
        ERROR:  -----------------------stack error start-----------------------
        ERROR:  TIME: 2024-02-21 16:09:59.683471+08
        ERROR:  1 27639 0x7fe924505555 OutPutBacktrace (backtrace.so)
        ERROR:  2 27639 0x7fe9245055fa <symbol not found> (backtrace.so)
        ERROR:  3 27639 0x7fe932f19100 <symbol not found> (libpthread.so.0)
        ERROR:  4 27639 0x7fe930d425f7 gsignal (libc.so.6)
        ERROR:  5 27639 0x7fe930d43ce8 abort (libc.so.6)
        ERROR:  6 27639 0x97ad3a errfinish + 0xcfc3719a
        ERROR:  7 27639 0x8f2e27 BeginRepOrigin + 0xcfbaf287
        ERROR:  8 27639 0x53f216 StartupWAL + 0xcf7fb676
        ERROR:  9 27639 0x8cce48 StartupProcessMain + 0xcfb892a8
        ERROR:  10 27639 0x560e85 KdbAuxiliaryProcessMain + 0xcf81d2e5
        ERROR:  11 27639 0x8c8acf start_child_process + 0xcfb84f2f
        ERROR:  12 27639 0x8cc075 KesMasterMain + 0xcfb884d5
        ERROR:  13 27639 0x4a726c main + 0xcf7636cc
        ERROR:  14 27639 0x7fe930d2eb15 __libc_start_main (libc.so.6)
        ERROR:  15 27639 0x4a730a _start + 0xcf7788ea

2024-02-21 16:10:00.578 CST [27637] LOG:  startup process (PID 27639) was terminated by signal 6: Aborted
2024-02-21 16:10:00.578 CST [27637] LOG:  aborting startup due to startup process failure
2024-02-21 16:10:00.614 CST [27637] LOG:  database system is shut down
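
The PANIC line is easy to miss among the stack-trace lines. As a convenience, a short Python sketch can pull PANIC entries out of such a log. The regular expression assumes the line prefix seen in this article (timestamp, time zone, PID in brackets, level); a different log_line_prefix setting would need a different pattern. find_panics and SAMPLE are illustrative names.

```python
import re

# Matches lines such as:
# 2024-02-21 16:09:59.683 CST [27639] PANIC:  replication checkpoint has wrong ...
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (?P<tz>\S+) "
    r"\[(?P<pid>\d+)\] (?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)


def find_panics(log_text):
    """Return (pid, message) for every PANIC line in the given log text."""
    found = []
    for line in log_text.splitlines():
        m = LOG_LINE.match(line)
        if m and m.group("level") == "PANIC":
            found.append((int(m.group("pid")), m.group("msg")))
    return found


# Three lines copied from the log above.
SAMPLE = """\
2024-02-21 16:09:59.682 CST [27639] LOG:  database system was shut down at 2024-02-21 16:08:50 CST
2024-02-21 16:09:59.683 CST [27639] PANIC:  replication checkpoint has wrong stateMagic 0 instead of 307747550
2024-02-21 16:10:00.578 CST [27637] LOG:  startup process (PID 27639) was terminated by signal 6: Aborted
"""

if __name__ == "__main__":
    for pid, msg in find_panics(SAMPLE):
        print(pid, msg)
```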

III. Resolution

Option 1:
1. Rename the replorigin_checkpoint file

[kingbase@node201 sys_logical]$ ls -lh
total 4.0K
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 mappings
-rw------- 1 kingbase kingbase 5 Feb 21 17:15 replorigin_checkpoint.bk
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 snapshots

2. Start the database

[kingbase@node201 sys_logical]$ /opt/Kingbase/ES/R6_C8/Server/bin/sys_ctl start -D /home/kingbase/db/r6_c8/data/
.......
server started

# After the database starts, the file is recreated
[kingbase@node201 sys_logical]$ ls -lh
total 8.0K
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 mappings
-rw------- 1 kingbase kingbase 8 Feb 21 17:52 replorigin_checkpoint
-rw------- 1 kingbase kingbase 5 Feb 21 17:15 replorigin_checkpoint.bk
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 snapshots
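
The rename in Option 1 keeps the damaged file for later inspection instead of deleting it. A minimal Python sketch of the same step is shown here, with one extra safeguard that is my own assumption rather than part of the original procedure: it refuses to overwrite an existing backup. set_aside and demo are hypothetical names, and the demo runs against a scratch directory.

```python
import os
import tempfile


def set_aside(path, suffix=".bk"):
    """Rename `path` to `path + suffix`; refuse to clobber an existing backup."""
    backup = path + suffix
    if os.path.exists(backup):
        raise FileExistsError(f"backup already exists: {backup}")
    os.rename(path, backup)
    return backup


def demo():
    """Create a 5-byte scratch file, set it aside, and report the result."""
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "replorigin_checkpoint")
        with open(p, "wb") as f:
            f.write(bytes(5))
        backup = set_aside(p)
        return os.path.exists(p), os.path.basename(backup), os.path.getsize(backup)


if __name__ == "__main__":
    print(demo())
```

The database must be stopped before the file is moved, as in the steps above.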

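Option 2:
Based on the source code. The excerpt below declares tmppath and matches the upstream PostgreSQL routine that writes the file, CheckPointReplicationOrigin(); the upstream startup routine contains the same early return when max_replication_slots is 0: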

 {
     const char *tmppath = "pg_logical/replorigin_checkpoint.tmp";
     const char *path = "pg_logical/replorigin_checkpoint";
     int         tmpfd;
     int         i;
     uint32      magic = REPLICATION_STATE_MAGIC;
     pg_crc32c   crc;
 
     if (max_replication_slots == 0)
         return;
  ......
 # With max_replication_slots=0, the replorigin_checkpoint check is skipped.

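1. Configure max_replication_slots
As shown below, set max_replication_slots = 0 in kingbase.conf so that the replorigin_checkpoint check is skipped. This setting also means no replication slots can be used, so it suits only instances that do not depend on them: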

[kingbase@node201 sys_logical]$ cat ../kingbase.conf|grep replication_slots
max_replication_slots = 0       # max number of replication slots

2. Start the database service

[kingbase@node201 sys_logical]$ /opt/Kingbase/ES/R6_C8/Server/bin/sys_ctl start -D /home/kingbase/db/r6_c8/data/
.......
server started
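
Before restarting, it helps to confirm which value is actually in effect, because a configuration file may contain both a commented-out default and an active line. The Python sketch below returns the last uncommented assignment of a parameter. It assumes kingbase.conf follows the PostgreSQL-style "name = value  # comment" syntax where the last assignment wins; read_setting and SAMPLE_CONF are illustrative names.

```python
import re


def read_setting(conf_text, name):
    """Return the last uncommented value assigned to `name`, or None."""
    # The name must be followed by "=" or whitespace, so that a longer
    # parameter name sharing the same prefix is not matched by mistake.
    pattern = re.compile(
        r"^\s*" + re.escape(name) + r"(?:\s*=\s*|\s+)([^#\s]+)"
    )
    value = None
    for line in conf_text.splitlines():
        m = pattern.match(line)
        if m:
            value = m.group(1).strip("'")
    return value


# The active line is copied from this article; the other two are made up
# to exercise the commented-out and longer-name cases.
SAMPLE_CONF = """\
#max_replication_slots = 10
max_replication_slots = 0       # max number of replication slots
max_replication_slots_extra = 5
"""

if __name__ == "__main__":
    print(read_setting(SAMPLE_CONF, "max_replication_slots"))
```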

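IV. Appendix

Fault 1: “PANIC: could not read file "sys_logical/replorigin_checkpoint": read 3 of 4”

1. Simulate a replorigin_checkpoint fault
By default the replorigin_checkpoint file is 8 bytes. The command below shrinks it to fewer than 4 bytes (echo 'aa' writes 3 bytes: two characters plus a newline):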

[kingbase@node201 sys_logical]$ echo 'aa' > replorigin_checkpoint
[kingbase@node201 sys_logical]$ ls -lh
total 4.0K
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 mappings
-rw------- 1 kingbase kingbase 3 Feb 21 17:12 replorigin_checkpoint
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 snapshots

2. Start the database
As shown below, the database service fails to start with “PANIC: could not read file "sys_logical/replorigin_checkpoint": read 3 of 4”:

[kingbase@node201 sys_logical]$ /opt/Kingbase/ES/R6_C8/Server/bin/sys_ctl start -D /home/kingbase/db/r6_c8/data/
.......
sys_ctl: could not start server

# sys_log output:
[kingbase@node201 sys_log]$ tail -100 kingbase-2024-02-21_171412.log
2024-02-21 17:14:12.806 CST [31366] LOG:  database system was shut down at 2024-02-21 16:08:50 CST
2024-02-21 17:14:12.806 CST [31366] PANIC:  could not read file "sys_logical/replorigin_checkpoint": read 3 of 4
2024-02-21 17:14:12.807 CST [31366] LOG:  kingbase ran into a problem it couldn't handle,it needs to be shutdown to prevent damage to your data
2024-02-21 17:14:12.813 CST [31366] WARNING:
        ERROR:  -----------------------stack error start-----------------------
        ERROR:  TIME: 2024-02-21 17:14:12.807151+08
        ERROR:  1 31366 0x7f3f30ed9555 OutPutBacktrace (backtrace.so)
        ERROR:  2 31366 0x7f3f30ed95fa <symbol not found> (backtrace.so)
        ERROR:  3 31366 0x7f3f3f8ed100 <symbol not found> (libpthread.so.0)
        ERROR:  4 31366 0x7f3f3d7165f7 gsignal (libc.so.6)
        ERROR:  5 31366 0x7f3f3d717ce8 abort (libc.so.6)
        ERROR:  6 31366 0x97ad3a errfinish + 0xc326319a
        ERROR:  7 31366 0x8f2ba8 BeginRepOrigin + 0xc31db008
        ERROR:  8 31366 0x53f216 StartupWAL + 0xc2e27676
        ERROR:  9 31366 0x8cce48 StartupProcessMain + 0xc31b52a8
        ERROR:  10 31366 0x560e85 KdbAuxiliaryProcessMain + 0xc2e492e5
        ERROR:  11 31366 0x8c8acf start_child_process + 0xc31b0f2f
        ERROR:  12 31366 0x8cc075 KesMasterMain + 0xc31b44d5
        ERROR:  13 31366 0x4a726c main + 0xc2d8f6cc
        ERROR:  14 31366 0x7f3f3d702b15 __libc_start_main (libc.so.6)
        ERROR:  15 31366 0x4a730a _start + 0xc2da48ea

2024-02-21 17:14:13.542 CST [31364] LOG:  startup process (PID 31366) was terminated by signal 6: Aborted
2024-02-21 17:14:13.542 CST [31364] LOG:  aborting startup due to startup process failure
2024-02-21 17:14:13.576 CST [31364] LOG:  database system is shut down

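Fault 2: “PANIC: replication checkpoint has wrong stateMagic 0 instead of 307747550”

1. Simulate a replorigin_checkpoint fault
By default the replorigin_checkpoint file is 8 bytes. Here the file is longer than 4 bytes, so the 4-byte read succeeds, but the content is all zeros and the magic number does not match: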

[kingbase@node201 sys_logical]$ dd if=/dev/zero of=replorigin_checkpoint bs=1 count=5
5+0 records in
5+0 records out
5 bytes (5 B) copied, 0.00024672 s, 20.3 kB/s

[kingbase@node201 sys_logical]$ ls -lh
total 4.0K
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 mappings
-rw------- 1 kingbase kingbase 5 Feb 21 17:15 replorigin_checkpoint
drwx------ 2 kingbase kingbase 6 Oct 11 17:53 snapshots

2. Start the database
As shown below, the database service fails to start with “PANIC: replication checkpoint has wrong stateMagic 0 instead of 307747550”:

[kingbase@node201 sys_logical]$ /opt/Kingbase/ES/R6_C8/Server/bin/sys_ctl start -D /home/kingbase/db/r6_c8/data/
.......
sys_ctl: could not start server

# sys_log output:
[kingbase@node201 sys_log]$ tail -1000 kingbase-2024-02-21_171559.log
2024-02-21 17:15:59.487 CST [31539] LOG:  database system was shut down at 2024-02-21 16:08:50 CST
2024-02-21 17:15:59.487 CST [31539] PANIC:  replication checkpoint has wrong stateMagic 0 instead of 307747550
2024-02-21 17:15:59.487 CST [31539] LOG:  kingbase ran into a problem it couldn't handle,it needs to be shutdown to prevent damage to your data
2024-02-21 17:15:59.495 CST [31539] WARNING:
        ERROR:  -----------------------stack error start-----------------------
        ERROR:  TIME: 2024-02-21 17:15:59.487848+08
        ERROR:  1 31539 0x7f1c7d671555 OutPutBacktrace (backtrace.so)
        ERROR:  2 31539 0x7f1c7d6715fa <symbol not found> (backtrace.so)
        ERROR:  3 31539 0x7f1c8c085100 <symbol not found> (libpthread.so.0)
        ERROR:  4 31539 0x7f1c89eae5f7 gsignal (libc.so.6)
        ERROR:  5 31539 0x7f1c89eafce8 abort (libc.so.6)
        ERROR:  6 31539 0x97ad3a errfinish + 0x76acb19a
        ERROR:  7 31539 0x8f2e27 BeginRepOrigin + 0x76a43287
        ERROR:  8 31539 0x53f216 StartupWAL + 0x7668f676
        ERROR:  9 31539 0x8cce48 StartupProcessMain + 0x76a1d2a8
        ERROR:  10 31539 0x560e85 KdbAuxiliaryProcessMain + 0x766b12e5
        ERROR:  11 31539 0x8c8acf start_child_process + 0x76a18f2f
        ERROR:  12 31539 0x8cc075 KesMasterMain + 0x76a1c4d5
        ERROR:  13 31539 0x4a726c main + 0x765f76cc
        ERROR:  14 31539 0x7f1c89e9ab15 __libc_start_main (libc.so.6)
        ERROR:  15 31539 0x4a730a _start + 0x7660c8ea

2024-02-21 17:16:00.359 CST [31537] LOG:  startup process (PID 31539) was terminated by signal 6: Aborted
2024-02-21 17:16:00.359 CST [31537] LOG:  aborting startup due to startup process failure
2024-02-21 17:16:00.397 CST [31537] LOG:  database system is shut down