KingbaseES V9备份恢复案例之---单实例环境PITR恢复案例

案例说明:
用户误操作删除数据，测试基于时间点（PITR)和xid的数据库恢复，比较两种恢复之间的差异。

版本信息：

prod=# select version();
                                                       version
------------------------------------------------------------------------------------------------------------------
 KingbaseES V009R001C002B0014 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-
bit
(1 row)

一、构建测试环境
1、测试数据信息

prod=# select count(*) from t1;
 count
-------
   100
(1 row)

2、数据库全量备份

[kingbase@node208 bin]$  /opt/Kingbase/ES/V9/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --stanza=kingbase info
WARN: set process-max 4 is too large, auto set to CPU core count 1
stanza: kingbase
    status: ok
    cipher: none
    db (current)
        wal archive min/max (V009R001C002B0014): 00000004000000000000000C/00000004000000000000000F

        full backup: 20241204-111133F
            timestamp start/stop: 2024-12-04 11:11:33 / 2024-12-04 11:11:41
            wal start/stop: 00000004000000000000000F / 00000004000000000000000F
            database size: 104.7MB, database backup size: 104.7MB
            repo1: backup set size: 104.7MB, backup size: 104.7MB

3、数据库事务操作

test=# \c prod
prod=# insert into t1  values (generate_series(101,200),'usr'||generate_series(101,200));
INSERT 0 100
prod=# select count(*) from t1;
 count
-------
   200
(1 row)

4、数据库增量备份

[kingbase@node208 bin]$  /opt/Kingbase/ES/V9/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --stanza=kingbase info
WARN: set process-max 4 is too large, auto set to CPU core count 1
stanza: kingbase
    status: ok
    cipher: none

    db (current)
        wal archive min/max (V009R001C002B0014): 00000004000000000000000C/000000040000000000000011

        full backup: 20241204-111133F
            timestamp start/stop: 2024-12-04 11:11:33 / 2024-12-04 11:11:41
            wal start/stop: 00000004000000000000000F / 00000004000000000000000F
            database size: 104.7MB, database backup size: 104.7MB
            repo1: backup set size: 104.7MB, backup size: 104.7MB

        incr backup: 20241204-111133F_20241204-111338I
            timestamp start/stop: 2024-12-04 11:13:38 / 2024-12-04 11:13:40
            wal start/stop: 000000040000000000000011 / 000000040000000000000011
            database size: 104.7MB, database backup size: 17.3MB
            repo1: backup set size: 104.7MB, backup size: 17.3MB
            backup reference list: 20241204-111133F

5、模拟数据库误操作
如下所示，由于误操作，t1表被删除：

prod=# checkpoint;
CHECKPOINT

# 数据库表被误删除
prod=# drop table t1;
DROP TABLE
prod=# \d
                List of relations
 Schema |          Name           | Type | Owner
--------+-------------------------+------+--------
 public | sys_stat_statements     | view | system
 public | sys_stat_statements_all | view | system
(2 rows)

二、执行数据库恢复（基于时间点PITR)
1、关闭当前数据库服务并备份数据

[kingbase@node208 bin]$ ./sys_ctl stop -D /home/kingbase/db/v9/data
waiting for server to shut down.... done
server stopped

# 备份当前数据库目录
[kingbase@node208 bin]$ ls -lh /home/kingbase/db/v9/
total 8.0K
drwx------ 23 kingbase kingbase 4.0K Dec  4 15:11 data.bk

2、执行PITR的restore
1）确定恢复到的时间点
如下所示，通过dump wal日志，获取到在drop前insert事务后提交的时间点：

如下所示，通过备份信息，获取到在drop前，增量备份完成的时间点：

2）执行PITR恢复（将时间点选择为备份完成和drop前的时间点）

[kingbase@node208 v9]$ /opt/Kingbase/ES/V9/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --stanza=kingbase --target="2024-12-04 11:13:40" --type=time restore
2024-12-04 14:46:36.723 P00   INFO: restore command begin 2.27: --band-width=0 --cmd-ssh=/opt/Kingbase/ES/V9/Server/bin/sys_securecmd --config=/home/kingbase/kbbr9_repo/sys_rman.conf --exec-id=155169-8484df31 --kb1-path=/home/kingbase/db/v9/data --link-all --log-level-console=info --log-level-file=info --log-path=/opt/Kingbase/ES/V9/Server/log --log-subprocess --non-archived-space=1024 --process-max=4 --repo1-path=/home/kingbase/kbbr9_repo --stanza=kingbase --target="2024-12-04 11:13:40" --type=time
WARN: set process-max 4 is too large, auto set to CPU core count 1
2024-12-04 14:46:36.741 P00   INFO: repo1: restore backup set 20241204-111133F, recovery will start at 2024-12-04 11:11:33
2024-12-04 14:46:36.890 P00   INFO: Restore Process: FILE: 1 / 2802 0%       SZIE: 16777216 bytes / 109782437 bytes 16MB / 104.7MB 15%
........

2024-12-04 14:46:39.216 P00   INFO: write updated /home/kingbase/db/v9/data/kingbase.auto.conf
2024-12-04 14:46:39.220 P00   INFO: restore global/sys_control (performed last to ensure aborted restores cannot be started)
2024-12-04 14:46:39.221 P00   INFO: restore size = 104.7MB, file total = 2802
2024-12-04 14:46:39.222 P00   INFO: restore command end: completed successfully (2502ms)

3、启动数据库服务执行recovery
1）查看当前wal日志

[kingbase@node208 bin]$ ls -lh /home/kingbase/db/v9/data/sys_wal
total 17M
-rw------- 1 kingbase kingbase  16M Jan  1  1970 000000040000000000000011
drwx------ 2 kingbase kingbase 4.0K Dec  4 11:27 archive_status

2）查看归档wal日志

[kingbase@node208 bin]$ ls -lh /home/kingbase/kbbr9_repo/archive/kingbase/12-1/0000000400000000/
total 113M
-rw-r----- 1 kingbase kingbase 16M Dec  4 11:11 00000004000000000000000C-361e001ab548a9c942701f5a26d9dc53b0b00ac8
-rw-r----- 1 kingbase kingbase 16M Dec  4 11:11 00000004000000000000000D-78b9501fef1a16e8fa6f134310604e0aeabae8a1
-rw-r----- 1 kingbase kingbase 16M Dec  4 11:11 00000004000000000000000E-403e8cbe47c8a0f9819263880b37b9014050bfe9
-rw-r----- 1 kingbase kingbase 368 Dec  4 11:11 00000004000000000000000F.00000028.backup
-rw-r----- 1 kingbase kingbase 16M Dec  4 11:11 00000004000000000000000F-dc8f5fe3ff3c319826d97ba79470eba5fc0d748b
-rw-r----- 1 kingbase kingbase 16M Dec  4 11:13 000000040000000000000010-7307a387f4d138c4fd0623c0149f4248d3ac3a93
-rw-r----- 1 kingbase kingbase 371 Dec  4 11:13 000000040000000000000011.00000028.backup
-rw-r----- 1 kingbase kingbase 16M Dec  4 11:13 000000040000000000000011-14aec9d95a01713acebf05a29ab42d48d6a85570
-rw-r----- 1 kingbase kingbase 16M Dec  4 11:26 000000040000000000000012-ed80130499f5632c8182d8418a206eeb72a1d447

3）kingbase.auto.conf的配置

[kingbase@node208 v9]$ cat data/kingbase.auto.conf
# Do not edit this file manually!
# Removed by sys_rman restore on 2024-12-04 14:46:39 # recovery_target_timeline = '1'
# Recovery settings generated by sys_rman restore on 2024-12-04 14:46:39
restore_command = '/opt/Kingbase/ES/V9/KESRealPro/V009R001C002B0014/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --lock-path=/home/kingbase/kbbr9_repo --stanza=kingbase archive-get %f "%p"'
recovery_target_time = '2024-12-04 11:13:40'

4）启动数据库服务
如下所示，在启动数据库服务后，连接数据库执行t1表数据查询时，执行hang住，在执行了select sys_wal_replay_resume()后，t1表数据可正常查询：

prod=# select count(*) from t1;
^CCancel request sent
ERROR:  canceling statement due to user request

prod=# select sys_wal_replay_resume();
 sys_wal_replay_resume
-----------------------

(1 row)

prod=# select count(*) from t1;
 count
-------
   200
(1 row)

5）查看sys_log日志
如下所示，在recovery过程中，读取归档日志及wal日志执行恢复：

[kingbase@node208 sys_log]$ tail -200 kingbase-2024-12-04_144733.log
2024-12-04 14:47:33.755 CST [155550] LOG:  database system was interrupted; last known up at 2024-12-04 11:11:37 CST
........
[000000040000000000000012, sys_wal/RECOVERYXLOG] --archive-timeout=600 --band-width=0 --cmd-ssh=/opt/Kingbase/ES/V9/Server/bin/sys_securecmd --config=/home/kingbase/kbbr9_repo/sys_rman.conf --exec-id=155611-e7bc32c5 --kb1-path=/home/kingbase/db/v9/data --lock-path=/home/kingbase/kbbr9_repo --log-level-console=info --log-level-file=info --log-path=/opt/Kingbase/ES/V9/Server/log --log-subprocess --process-max=4 --repo1-path=/home/kingbase/kbbr9_repo --stanza=kingbase
WARN: set process-max 4 is too large, auto set to CPU core count 1
2024-12-04 14:47:34.620 P00   INFO: found 000000040000000000000012 in the repo1: 12-1 archive
2024-12-04 14:47:34.620 P00   INFO: archive-get command end: completed successfully (29ms)
2024-12-04 14:47:34.621 CST [155550] LOG:  restored log file "000000040000000000000012" from archive
2024-12-04 14:47:34.712 CST [155550] LOG:  process one external WAL file
2024-12-04 14:47:34.712 CST [155550] LOG:  recovery stopping before commit of transaction 1135, time 2024-12-04 11:26:26.159294+08
2024-12-04 14:47:34.712 CST [155550] LOG:  recovery has paused
2024-12-04 14:47:34.712 CST [155550] HINT:  Execute sys_wal_replay_resume() to continue.

参考官方文档：
https://bbs.kingbase.com.cn/docHtml?recId=d3b047761988fde0084a665fc7151c47&url=aHR0cHM6Ly9iYnMua2luZ2Jhc2UuY29tLmNuL2tpbmdiYXNlLWRvYy92OC9oaWdobHkvYXZhaWxhYmlsaXR5L2luZGV4Lmh0bWw

恢复后发现查询卡住
如下图所示，指定了删除表t1之前的时间点做恢复，恢复后启动数据库查询t1表数据卡住，ksql没有任何查询输出，但执行 sys_wal_replay_resume() 函数之后，就可以正常查询t1表了，数据也是正常的。

8.4.1. 原因
受限于KingbaseES PITR的机制，PITR选定的时间（比如下图中的A点），一般都不是事务结束的时间，所以日志重做（redo）会继续往后走，将下一个事务的结束时间（比如下图中事务2的结束时间B点）和设定的时间（比如下图中的A点）进行对比，如果设定时间小于事务的结束时间就停止日志重做，此时数据库的状态有以下情况：
这时候会有1个或多个事务处于未提交状态，这些未提交事务（比如下图中的事务2、事务3）可能持有锁（例如drop table 持有排他锁，本示例中查询卡住就是因为拿不到t1的共享锁），此时执行 sys_wal_replay_resume() 函数将这些事务回滚就可以了，对恢复无影响
如果刚好卡在下一条事务是drop database这种无法回滚的DDL操作，就会出现DDL语句被直接执行（KingbaseES不支持事务内执行drop database的操作，无法回滚，比如下一小节提到的问题）

8.4.2. 解决方法
尝试更改指定时间点恢复的目标时刻，让恢复后的状态不卡在特殊SQL未提交前的状态。
改为指定事务ID恢复，事务ID是唯一标志，不会往下继续查时间大小，比如本示例改为xid=928。

三、执行基于xid的恢复

1、执行wal日志分析（获取xid）
如下所示，insert事务的xid为1133，drop事务的xid为1135：

# insert事务日志分析
[kingbase@node208 0000000400000000]$/home/kingbase/cluster/v9/kingbase/bin/sys_waldump 000000040000000000000010

rmgr: Heap        len (rec/tot):     66/    66, tx:       1133, lsn: 0/10004B38, prev 0/10004AF0, desc: INSERT off 15 flags 0x00, blkref #0: rel 1663/16384/16396 blk 1
rmgr: Heap        len (rec/tot):     66/    66, tx:       1133, lsn: 0/10004B80, prev 0/10004B38, desc: INSERT off 16 flags 0x00, blkref #0: rel 1663/16384/16396 blk 1
rmgr: Transaction len (rec/tot):     34/    34, tx:       1133, lsn: 0/10004BC8, prev 0/10004B80, desc: COMMIT 2024-12-04 11:12:37.005082 CST
rmgr: Standby     len (rec/tot):     46/    46, tx:          0, lsn: 0/10004BF0, prev 0/10004BC8, desc: RUNNING_XACTS nextXid 1134 latestCompletedXid 5531099 oldestRunningXid 1133
rmgr: Heap        len (rec/tot):     65/  2589, tx:       1134, lsn: 0/10004C20, prev 0/10004BF0, desc: HOT_UPDATE off 1 xmax 1134 flags 0x00 ; new off 3 xmax 0, blkref #0: rel 1663/16384/2619 blk 24 FPW
rmgr: Heap        len (rec/tot):   1367/  1367, tx:       1134, lsn: 0/10005640, prev 0/10004C20, desc: HOT_UPDATE off 2 xmax 1134 flags 0x60 ; new off 4 xmax 0, blkref #0: rel 1663/16384/2619 blk 24
rmgr: Heap        len (rec/tot):     53/  2781, tx:       1134, lsn: 0/10005B98, prev 0/10005640, desc: INPLACE off 11, blkref #0: rel 1663/16384/1259 blk 0 FPW
rmgr: Transaction len (rec/tot):    130/   130, tx:       1134, lsn: 0/10006690, prev 0/10005B98, desc: COMMIT 2024-12-04 11:13:31.876312 CST; inval SIMsgs: catcache 73 catcache 73 catcache 62 catcache 61 relcache 16396
rmgr: Standby     len (rec/tot):     42/    42, tx:          0, lsn: 0/10006718, prev 0/10006690, desc: RUNNING_XACTS nextXid 1135 latestCompletedXid 5531099 oldestRunningXid 1135
rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn: 0/10006748, prev 0/10006718, desc: SWITCH

# drop事务日志分析
[kingbase@node208 0000000400000000]$ /home/kingbase/cluster/v9/kingbase/bin/sys_waldump 000000040000000000000012

rmgr: Standby     len (rec/tot):     42/    42, tx:       1135, lsn: 0/12000208, prev 0/120001D8, desc: LOCK xid 1135 db 16384 rel 16396
rmgr: Heap        len (rec/tot):     59/  7863, tx:       1135, lsn: 0/12000238, prev 0/12000208, desc: DELETE off 13 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/1247 blk 4 FPW
.......

rmgr: Transaction len (rec/tot):    533/   533, tx:       1135, lsn: 0/120089F8, prev 0/120089C0, desc: COMMIT 2024-12-04 11:26:26.159294 CST; rels: base/16384/16396; inval SIMsgs: catcache 44 catcache 73 catcache 73 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 62 catcache 61 catcache 116 catcache 115 catcache 116 catcache 115 relcache 16396 snapshot 2608 snapshot 2608 snapshot 2608

2、指定xid执行恢复
如下所示，指定备份集及xid恢复数据库：

[kingbase@node208 v9]$ /opt/Kingbase/ES/V9/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --stanza=kingbase --target=1135 --type=xid restore
2024-12-04 15:13:08.541 P00   INFO: restore command begin 2.27: --band-width=0 --cmd-ssh=/opt/Kingbase/ES/V9/Server/bin/sys_securecmd --config=/home/kingbase/kbbr9_repo/sys_rman.conf --exec-id=166754-9e331797 --kb1-path=/home/kingbase/db/v9/data --link-all --log-level-console=info --log-level-file=info --log-path=/opt/Kingbase/ES/V9/Server/log --log-subprocess --non-archived-space=1024 --process-max=4 --repo1-path=/home/kingbase/kbbr9_repo --stanza=kingbase --target=1135 --type=xid
WARN: set process-max 4 is too large, auto set to CPU core count 1
ERROR: [032]: Current options maybe restore to newest data from archive WAL files
       Please specify --type=xid --target='xx' --set='2021xxxx-xxxxxxX_2021xxxx-xxxxxxX' or follow the user guide
       Stop current restore.
2024-12-04 15:13:08.542 P00   INFO: restore command end: aborted with exception [032]
[kingbase@node208 v9]$ /opt/Kingbase/ES/V9/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --stanza=kingbase --target=1135 --type=xid --set='20241204-111133F_20241204-111338I' restore
2024-12-04 15:15:02.150 P00   INFO: restore command begin 2.27: --band-width=0 --cmd-ssh=/opt/Kingbase/ES/V9/Server/bin/sys_securecmd --config=/home/kingbase/kbbr9_repo/sys_rman.conf --exec-id=167474-f17975be --kb1-path=/home/kingbase/db/v9/data --link-all --log-level-console=info --log-level-file=info --log-path=/opt/Kingbase/ES/V9/Server/log --log-subprocess --non-archived-space=1024 --process-max=4 --repo1-path=/home/kingbase/kbbr9_repo --set=20241204-111133F_20241204-111338I --stanza=kingbase --target=1135 --type=xid
WARN: set process-max 4 is too large, auto set to CPU core count 1
2024-12-04 15:15:02.173 P00   INFO: repo1: restore backup set 20241204-111133F_20241204-111338I, recovery will start at 2024-12-04 11:13:38
2024-12-04 15:15:02.339 P00   INFO: Restore Process: FILE: 1 / 2803 0%       SZIE: 16777216 bytes / 109818187 bytes 16MB / 104.7MB 15%
......
2024-12-04 15:15:04.687 P00   INFO: write updated /home/kingbase/db/v9/data/kingbase.auto.conf
2024-12-04 15:15:04.691 P00   INFO: restore global/sys_control (performed last to ensure aborted restores cannot be started)
2024-12-04 15:15:04.692 P00   INFO: restore size = 104.7MB, file total = 2803
2024-12-04 15:15:04.692 P00   INFO: restore command end: completed successfully (2545ms)

如下所示，指定增量备份的备份集，会自动读取全量备份集执行恢复：

3、启动数据库服务执行recovery

1）查看kingbase.auto.conf配置

[kingbase@node208 v9]$ cat data/kingbase.auto.conf
# Do not edit this file manually!
# Removed by sys_rman restore on 2024-12-04 15:15:04 # recovery_target_timeline = '1'
# Recovery settings generated by sys_rman restore on 2024-12-04 15:15:04
restore_command = '/opt/Kingbase/ES/V9/KESRealPro/V009R001C002B0014/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --lock-path=/home/kingbase/kbbr9_repo --stanza=kingbase archive-get %f "%p"'
recovery_target_xid = '1135'

2）启动数据库服务

[kingbase@node208 bin]$ ./sys_ctl start -D /home/kingbase/db/v9/data
......
server started

3）连接数据库服务
如下所示，执行xid恢复后，可以直接读取t1获取到数据：

[kingbase@node208 bin]$ ./ksql -U system test
Type "help" for help.

test=# \c prod
You are now connected to database "prod" as userName "system".
prod=# \d
                 List of relations
 Schema |          Name           | Type  | Owner
--------+-------------------------+-------+--------
 public | sys_stat_statements     | view  | system
 public | sys_stat_statements_all | view  | system
 public | t1                      | table | system
(3 rows)

prod=# select count(*) from t1;
 count
-------
   200
(1 row)

四、总结
在用户误操作，通过物理备份恢复时，可以执行基于时间点或xid的恢复，通过dump分析wal日志，可以获取到准确的事务xid，执行xid恢复更方便快捷恢复数据。

posted @ 2024-12-04 19:35 天涯客1224 阅读(13) 评论(0) 编辑收藏举报

刷新页面返回顶部

天涯客1224

KingbaseES V9备份恢复案例之---单实例环境PITR恢复案例

公告