KingbaseES WAL (xlog) Log Cleanup: A Failure Recovery Case

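Case description:
While WAL logs were being cleaned up manually with the sys_archivecleanup tool, the control file showed that the checkpoint's WAL file was "000000010000000000000008", but the cleanup command mistakenly removed every WAL file older than "000000010000000000000009". At startup the database could not read the WAL file containing the checkpoint, so it failed to start.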

Database version:

test=# select version();
                                                       version                                                       
------------------------------------------------------------------------------------------------------------------
 KingbaseES V008R006C005B0054 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit

The WAL cleanup was performed as follows:

1) Check the current control file information

2) List the WAL files and clean them up

Before cleanup:

[kingbase@node1 sys_wal]$ ls -lh
total 80M
-rw------- 1 kingbase kingbase 16M May 11 13:26 000000010000000000000006
-rw------- 1 kingbase kingbase 16M May 11 13:26 000000010000000000000007
-rw------- 1 kingbase kingbase 16M May 11 13:26 000000010000000000000008
-rw------- 1 kingbase kingbase 16M May 11 13:00 000000010000000000000009
-rw------- 1 kingbase kingbase 16M May 11 13:02 00000001000000000000000A
drwx------ 2 kingbase kingbase  78 May 11 13:49 archive_status

Cleanup:
[kingbase@node1 bin]$ ./sys_archivecleanup /data/kingbase/v8r6_054/data/sys_wal 000000010000000000000009

After cleanup:

[kingbase@node1 sys_wal]$ ls -lh
total 32M

-rw------- 1 kingbase kingbase 16M May 11 13:00 000000010000000000000009
-rw------- 1 kingbase kingbase 16M May 11 13:02 00000001000000000000000A
drwx------ 2 kingbase kingbase  78 May 11 13:49 archive_status
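
To see in advance what a given cutoff would delete, the file names can be compared directly. The sketch below is only an illustration and is not part of sys_archivecleanup: `wal_files_before` is a hypothetical helper that reads file names on stdin and prints the WAL segments that sort before the cutoff. It assumes a single timeline and plain 24-character segment names.

```shell
# Hypothetical helper: print the WAL segment names (read from stdin)
# that sort strictly before the cutoff given as $1.
wal_files_before() {
    LC_ALL=C awk -v cutoff="$1" '
        # Keep only 24-character names made of uppercase hex digits;
        # appending "" forces a string comparison instead of a numeric one.
        length($0) == 24 && $0 !~ /[^0-9A-F]/ && ($0 "") < (cutoff "") { print }
    '
}

# Example (path taken from this case):
# ls -1 /data/kingbase/v8r6_054/data/sys_wal | wal_files_before 000000010000000000000008
```

Run against the "before cleanup" listing with the cutoff 000000010000000000000009, this prints segments 06, 07 and 08, which are exactly the files that disappeared.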

I. Database startup fails

1. Start the database service

[kingbase@node1 bin]$ ./sys_ctl start -D /data/kingbase/v8r6_054/data/
......
2022-05-12 15:29:34.641 CST [25993] HINT:  Future log output will appear in directory "sys_log".
...... stopped waiting
sys_ctl: could not start server
Examine the log output.

2. Check the database sys_log log

2022-05-12 15:29:35.309 CST [26003] LOG:  invalid primary checkpoint record
2022-05-12 15:29:35.309 CST [26003] PANIC:  could not locate a valid checkpoint record
2022-05-12 15:29:35.309 CST [26003] LOG:  kingbase ran into a problem it couldn't handle,it needs to be shutdown to prevent damage to your data
2022-05-12 15:29:35.346 CST [26003] WARNING:  
        ERROR:  -----------------------stack error start-----------------------
        ERROR:  TIME: 2022-05-12 15:29:35.309749+08
        ERROR:  1 26003 0x7fc2aa18ef6b debug_backtrace (backtrace.so)
        ERROR:  2 26003 0x7fc2aa18f53a <symbol not found> (backtrace.so)
        ERROR:  3 26003 0x7fc2b390a670 <symbol not found> (libc.so.6)
        ERROR:  4 26003 0x7fc2b390a5f7 gsignal (libc.so.6)
        ERROR:  5 26003 0x7fc2b390bce8 abort (libc.so.6)
        ERROR:  6 26003 0x9148dc errfinish + 0x4d008d3c
        ERROR:  7 26003 0x54011c StartupXLOG + 0x4cc3457c
        ERROR:  8 26003 0x774f51 StartupProcessMain + 0x4ce693b1
        ERROR:  9 26003 0x550550 AuxiliaryProcessMain + 0x4cc449b0
        ERROR:  10 26003 0x76f5c7 StartChildProcess + 0x4ce63a27
        ERROR:  11 26003 0x77350d PostmasterMain + 0x4ce6796d
        ERROR:  12 26003 0x6cb0af main + 0x4cdbf50f
        ERROR:  13 26003 0x7fc2b38f6b15 __libc_start_main (libc.so.6)
        ERROR:  14 26003 0x4a1659 _start + 0x4cbaac39

2022-05-12 15:29:40.654 CST [25993] LOG:  startup process (PID 26003) was terminated by signal 6: Aborted
2022-05-12 15:29:40.654 CST [25993] LOG:  aborting startup due to startup process failure
2022-05-12 15:29:40.728 CST [25993] LOG:  database system is shut down

As shown above, at startup the database could not read the checkpoint record from the WAL, so startup failed.

II. Read the database control file

[kingbase@node1 bin]$ ./sys_controldata -D /data/kingbase/v8r6_054/data
sys_control version number:            1201
Catalog version number:               202202151
Database system identifier:           7096019857358041449
Database cluster state:               in production
sys_control last modified:             Wed 11 May 2022 01:26:44 PM CST
Latest checkpoint location:           0/8000058
Latest checkpoint's REDO location:    0/8000028
Latest checkpoint's REDO WAL file:    000000010000000000000008
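
The REDO WAL file name can be cross-checked against the REDO location. Below is a minimal sketch, assuming timeline 1 (the leading 00000001 in the file names) and the 16 MB segment size seen in the sys_wal listings; `lsn_to_wal_file` is a hypothetical helper name, not a KingbaseES tool.

```shell
# Hypothetical helper: map a timeline ($1) and an LSN of the form HI/LO ($2)
# to a WAL segment file name, assuming 16 MB (2^24 byte) segments.
lsn_to_wal_file() {
    tli=$1
    hi=${2%%/*}    # part before the slash, hexadecimal
    lo=${2##*/}    # part after the slash, hexadecimal
    # With 16 MB segments the segment number is the low part shifted right by 24 bits.
    printf '%08X%08X%08X\n' "$tli" "$((0x$hi))" "$((0x$lo >> 24))"
}

lsn_to_wal_file 1 0/8000028   # prints 000000010000000000000008
```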

III. Check the current WAL files

As shown below, the WAL file "000000010000000000000008" that the checkpoint refers to is missing.

[kingbase@node1 sys_wal]$ ls -lh
total 32M
-rw------- 1 kingbase kingbase 16M May 11 13:00 000000010000000000000009
-rw------- 1 kingbase kingbase 16M May 11 13:02 00000001000000000000000A
drwx------ 2 kingbase kingbase  78 May 11 13:49 archive_status

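Tips:
Because the WAL file holding the checkpoint is missing, the database cannot determine whether it is in a consistent state at startup, so startup fails. In this situation you can restore from a physical backup to an earlier point in time and then start the database; if there is no physical backup, you can instead rebuild the control file and start the database. Both methods lose data, so before cleaning up WAL logs, always confirm that the WAL file you specify is the correct one.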

IV. Rebuild the control file

1. Rebuild the control file with sys_resetwal

[kingbase@node1 bin]$ ./sys_resetwal -l 00000001000000000000000A -D /data/kingbase/v8r6_054/data
The database server was not shut down cleanly.
Resetting the write-ahead log might cause data to be lost.
If you want to proceed anyway, use -f to force reset.
[kingbase@node1 bin]$ ./sys_resetwal -l 00000001000000000000000A -D /data/kingbase/v8r6_054/data -f
Write-ahead log reset

2. Check the WAL files after rebuilding the control file

[kingbase@node1 sys_wal]$ ls -lh
total 16M
-rw------- 1 kingbase kingbase 16M May 12 15:46 00000001000000000000000B
drwx------ 2 kingbase kingbase   6 May 12 15:46 archive_status
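
The listing shows the log restarting at segment 0B, one past the highest segment that was present before the reset (0A). The sketch below computes that "next segment" name; `next_wal_file` is a hypothetical helper, and it assumes 16 MB segments, where the last eight hex digits run from 00000000 to 000000FF before carrying into the middle eight.

```shell
# Hypothetical helper: print the WAL segment name that follows $1,
# assuming 16 MB segments (256 segments per value of the middle field).
next_wal_file() {
    tli=$(printf '%s' "$1" | cut -c1-8)     # timeline, kept unchanged
    log=$(printf '%s' "$1" | cut -c9-16)    # middle field, hexadecimal
    seg=$(printf '%s' "$1" | cut -c17-24)   # segment within the middle field
    log=$((0x$log))
    seg=$((0x$seg + 1))
    if [ "$seg" -gt 255 ]; then
        seg=0
        log=$((log + 1))
    fi
    printf '%s%08X%08X\n' "$tli" "$log" "$seg"
}

next_wal_file 00000001000000000000000A   # prints 00000001000000000000000B
```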

3. Check the control file information

[kingbase@node1 bin]$ ./sys_controldata -D /data/kingbase/v8r6_054/data
sys_control version number:            1201
Catalog version number:               202202151
Database system identifier:           7096019857358041449
Database cluster state:               shut down
sys_control last modified:             Thu 12 May 2022 03:46:38 PM CST
Latest checkpoint location:           0/B000028
Latest checkpoint's REDO location:    0/B000028
Latest checkpoint's REDO WAL file:    00000001000000000000000B

V. Start the database instance and verify

1. Start the database

[kingbase@node1 bin]$ ./sys_ctl start -D /data/kingbase/v8r6_054/data/
waiting for server to start....2022-05-12 15:54:53.731 CST [30496] LOG:  sepapower extension initialized
.....
 done
server started

2. Check the sys_log log (the database started normally)

[kingbase@node1 sys_log]$ tail -100 kingbase-2022-05-12_155453.log
2022-05-12 15:54:53.919 CST [30498] LOG:  database system was shut down at 2022-05-12 15:46:38 CST
2022-05-12 15:54:54.132 CST [30496] LOG:  database system is ready to accept connections

3. Connect to the database

[kingbase@node1 bin]$ ./ksql -U system -W  test -p 54322
Password: 
ksql (V8.0)
Type "help" for help.


test=# \d prod
Did not find any relation named "prod".
test=# \d
               List of relations
 Schema |        Name         | Type  | Owner  
--------+---------------------+-------+--------
 public | sys_stat_statements | view  | system
 public | t1                  | table | system
(2 rows)

VI. Summary

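1. WAL logs can be cleaned up with the sys_archivecleanup tool; first use the control file (the "Latest checkpoint's REDO WAL file" field) to determine which WAL files must be kept.
2. When running the cleanup, always confirm that the files being kept are the correct ones.
3. In a production environment, it is best to have two people verify the operation.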
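
The first two points can be turned into a small guard that runs before the cleanup. This is a sketch under stated assumptions, not a KingbaseES feature: `redo_wal_file` and `cutoff_is_safe` are hypothetical helper names, a single timeline is assumed, and only the control file's REDO WAL file is checked (archiving or replication may need even older files).

```shell
# Hypothetical helper: print the REDO WAL file name from
# sys_controldata output supplied on stdin.
redo_wal_file() {
    awk '/REDO WAL file/ { print $NF }'
}

# Hypothetical helper: succeed only if the requested cutoff ($1)
# does not sort after the REDO WAL file ($2).
cutoff_is_safe() {
    first=$(printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort | head -n 1)
    [ "$first" = "$1" ]
}

# Intended use (paths taken from this case):
# redo=$(./sys_controldata -D /data/kingbase/v8r6_054/data | redo_wal_file)
# cutoff_is_safe "$cutoff" "$redo" && \
#     ./sys_archivecleanup /data/kingbase/v8r6_054/data/sys_wal "$cutoff"
```

With the values from this case, a cutoff of 000000010000000000000009 is rejected because it sorts after the REDO WAL file 000000010000000000000008, while a cutoff of 000000010000000000000008 passes.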
posted @ 2022-05-13 13:22  KINGBASE研究院