KingbaseES V9备份恢复案例之---单实例环境PITR恢复案例
案例说明:
用户误操作删除数据,测试基于时间点(PITR)和xid的数据库恢复,比较两种恢复之间的差异。
版本信息:
prod=# select version();
version
------------------------------------------------------------------------------------------------------------------
KingbaseES V009R001C002B0014 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-
bit
(1 row)
一、构建测试环境
1、测试数据信息
prod=# select count(*) from t1;
count
-------
100
(1 row)
2、数据库全量备份
[kingbase@node208 bin]$ /opt/Kingbase/ES/V9/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --stanza=kingbase info
WARN: set process-max 4 is too large, auto set to CPU core count 1
stanza: kingbase
status: ok
cipher: none
db (current)
wal archive min/max (V009R001C002B0014): 00000004000000000000000C/00000004000000000000000F
full backup: 20241204-111133F
timestamp start/stop: 2024-12-04 11:11:33 / 2024-12-04 11:11:41
wal start/stop: 00000004000000000000000F / 00000004000000000000000F
database size: 104.7MB, database backup size: 104.7MB
repo1: backup set size: 104.7MB, backup size: 104.7MB
3、数据库事务操作
test=# \c prod
prod=# insert into t1 values (generate_series(101,200),'usr'||generate_series(101,200));
INSERT 0 100
prod=# select count(*) from t1;
count
-------
200
(1 row)
4、数据库增量备份
[kingbase@node208 bin]$ /opt/Kingbase/ES/V9/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --stanza=kingbase info
WARN: set process-max 4 is too large, auto set to CPU core count 1
stanza: kingbase
status: ok
cipher: none
db (current)
wal archive min/max (V009R001C002B0014): 00000004000000000000000C/000000040000000000000011
full backup: 20241204-111133F
timestamp start/stop: 2024-12-04 11:11:33 / 2024-12-04 11:11:41
wal start/stop: 00000004000000000000000F / 00000004000000000000000F
database size: 104.7MB, database backup size: 104.7MB
repo1: backup set size: 104.7MB, backup size: 104.7MB
incr backup: 20241204-111133F_20241204-111338I
timestamp start/stop: 2024-12-04 11:13:38 / 2024-12-04 11:13:40
wal start/stop: 000000040000000000000011 / 000000040000000000000011
database size: 104.7MB, database backup size: 17.3MB
repo1: backup set size: 104.7MB, backup size: 17.3MB
backup reference list: 20241204-111133F
5、模拟数据库误操作
如下所示,由于误操作,t1表被删除:
prod=# checkpoint;
CHECKPOINT
# 数据库表被误删除
prod=# drop table t1;
DROP TABLE
prod=# \d
List of relations
Schema | Name | Type | Owner
--------+-------------------------+------+--------
public | sys_stat_statements | view | system
public | sys_stat_statements_all | view | system
(2 rows)
二、执行数据库恢复(基于时间点PITR)
1、关闭当前数据库服务并备份数据
[kingbase@node208 bin]$ ./sys_ctl stop -D /home/kingbase/db/v9/data
waiting for server to shut down.... done
server stopped
# 备份当前数据库目录
[kingbase@node208 bin]$ ls -lh /home/kingbase/db/v9/
total 8.0K
drwx------ 23 kingbase kingbase 4.0K Dec 4 15:11 data.bk
2、执行PITR的restore
1)确定恢复到的时间点
如下所示,通过dump wal日志,获取到在drop前insert事务后提交的时间点:
如下所示,通过备份信息,获取到在drop前,增量备份完成的时间点:
2)执行PITR恢复(将时间点选择为备份完成和drop前的时间点)
[kingbase@node208 v9]$ /opt/Kingbase/ES/V9/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --stanza=kingbase --target="2024-12-04 11:13:40" --type=time restore
2024-12-04 14:46:36.723 P00 INFO: restore command begin 2.27: --band-width=0 --cmd-ssh=/opt/Kingbase/ES/V9/Server/bin/sys_securecmd --config=/home/kingbase/kbbr9_repo/sys_rman.conf --exec-id=155169-8484df31 --kb1-path=/home/kingbase/db/v9/data --link-all --log-level-console=info --log-level-file=info --log-path=/opt/Kingbase/ES/V9/Server/log --log-subprocess --non-archived-space=1024 --process-max=4 --repo1-path=/home/kingbase/kbbr9_repo --stanza=kingbase --target="2024-12-04 11:13:40" --type=time
WARN: set process-max 4 is too large, auto set to CPU core count 1
2024-12-04 14:46:36.741 P00 INFO: repo1: restore backup set 20241204-111133F, recovery will start at 2024-12-04 11:11:33
2024-12-04 14:46:36.890 P00 INFO: Restore Process: FILE: 1 / 2802 0% SZIE: 16777216 bytes / 109782437 bytes 16MB / 104.7MB 15%
........
2024-12-04 14:46:39.216 P00 INFO: write updated /home/kingbase/db/v9/data/kingbase.auto.conf
2024-12-04 14:46:39.220 P00 INFO: restore global/sys_control (performed last to ensure aborted restores cannot be started)
2024-12-04 14:46:39.221 P00 INFO: restore size = 104.7MB, file total = 2802
2024-12-04 14:46:39.222 P00 INFO: restore command end: completed successfully (2502ms)
3、启动数据库服务执行recovery
1)查看当前wal日志
[kingbase@node208 bin]$ ls -lh /home/kingbase/db/v9/data/sys_wal
total 17M
-rw------- 1 kingbase kingbase 16M Jan 1 1970 000000040000000000000011
drwx------ 2 kingbase kingbase 4.0K Dec 4 11:27 archive_status
2)查看归档wal日志
[kingbase@node208 bin]$ ls -lh /home/kingbase/kbbr9_repo/archive/kingbase/12-1/0000000400000000/
total 113M
-rw-r----- 1 kingbase kingbase 16M Dec 4 11:11 00000004000000000000000C-361e001ab548a9c942701f5a26d9dc53b0b00ac8
-rw-r----- 1 kingbase kingbase 16M Dec 4 11:11 00000004000000000000000D-78b9501fef1a16e8fa6f134310604e0aeabae8a1
-rw-r----- 1 kingbase kingbase 16M Dec 4 11:11 00000004000000000000000E-403e8cbe47c8a0f9819263880b37b9014050bfe9
-rw-r----- 1 kingbase kingbase 368 Dec 4 11:11 00000004000000000000000F.00000028.backup
-rw-r----- 1 kingbase kingbase 16M Dec 4 11:11 00000004000000000000000F-dc8f5fe3ff3c319826d97ba79470eba5fc0d748b
-rw-r----- 1 kingbase kingbase 16M Dec 4 11:13 000000040000000000000010-7307a387f4d138c4fd0623c0149f4248d3ac3a93
-rw-r----- 1 kingbase kingbase 371 Dec 4 11:13 000000040000000000000011.00000028.backup
-rw-r----- 1 kingbase kingbase 16M Dec 4 11:13 000000040000000000000011-14aec9d95a01713acebf05a29ab42d48d6a85570
-rw-r----- 1 kingbase kingbase 16M Dec 4 11:26 000000040000000000000012-ed80130499f5632c8182d8418a206eeb72a1d447
3)kingbase.auto.conf的配置
[kingbase@node208 v9]$ cat data/kingbase.auto.conf
# Do not edit this file manually!
# Removed by sys_rman restore on 2024-12-04 14:46:39 # recovery_target_timeline = '1'
# Recovery settings generated by sys_rman restore on 2024-12-04 14:46:39
restore_command = '/opt/Kingbase/ES/V9/KESRealPro/V009R001C002B0014/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --lock-path=/home/kingbase/kbbr9_repo --stanza=kingbase archive-get %f "%p"'
recovery_target_time = '2024-12-04 11:13:40'
4)启动数据库服务
如下所示,在启动数据库服务后,连接数据库执行t1表数据查询时,执行hang住,在执行了select sys_wal_replay_resume()后,t1表数据可正常查询:
prod=# select count(*) from t1;
^CCancel request sent
ERROR: canceling statement due to user request
prod=# select sys_wal_replay_resume();
sys_wal_replay_resume
-----------------------
(1 row)
prod=# select count(*) from t1;
count
-------
200
(1 row)
5)查看sys_log日志
如下所示,在recovery过程中,读取归档日志及wal日志执行恢复:
[kingbase@node208 sys_log]$ tail -200 kingbase-2024-12-04_144733.log
2024-12-04 14:47:33.755 CST [155550] LOG: database system was interrupted; last known up at 2024-12-04 11:11:37 CST
........
[000000040000000000000012, sys_wal/RECOVERYXLOG] --archive-timeout=600 --band-width=0 --cmd-ssh=/opt/Kingbase/ES/V9/Server/bin/sys_securecmd --config=/home/kingbase/kbbr9_repo/sys_rman.conf --exec-id=155611-e7bc32c5 --kb1-path=/home/kingbase/db/v9/data --lock-path=/home/kingbase/kbbr9_repo --log-level-console=info --log-level-file=info --log-path=/opt/Kingbase/ES/V9/Server/log --log-subprocess --process-max=4 --repo1-path=/home/kingbase/kbbr9_repo --stanza=kingbase
WARN: set process-max 4 is too large, auto set to CPU core count 1
2024-12-04 14:47:34.620 P00 INFO: found 000000040000000000000012 in the repo1: 12-1 archive
2024-12-04 14:47:34.620 P00 INFO: archive-get command end: completed successfully (29ms)
2024-12-04 14:47:34.621 CST [155550] LOG: restored log file "000000040000000000000012" from archive
2024-12-04 14:47:34.712 CST [155550] LOG: process one external WAL file
2024-12-04 14:47:34.712 CST [155550] LOG: recovery stopping before commit of transaction 1135, time 2024-12-04 11:26:26.159294+08
2024-12-04 14:47:34.712 CST [155550] LOG: recovery has paused
2024-12-04 14:47:34.712 CST [155550] HINT: Execute sys_wal_replay_resume() to continue.
- 恢复后发现查询卡住
如下图所示,指定了删除表t1之前的时间点做恢复,恢复后启动数据库查询t1表数据卡住,ksql没有任何查询输出,但执行 sys_wal_replay_resume() 函数之后,就可以正常查询t1表了,数据也是正常的。
8.4.1. 原因
受限于KingbaseES PITR的机制,PITR选定的时间(比如下图中的A点),一般都不是事务结束的时间,所以日志重做(redo)会继续往后走,将下一个事务的结束时间(比如下图中事务2的结束时间B点)和设定的时间(比如下图中的A点)进行对比,如果设定时间小于事务的结束时间就停止日志重做,此时数据库的状态有以下情况:
这时候会有1个或多个事务处于未提交状态,这些未提交事务(比如下图中的事务2、事务3)可能持有锁(例如drop table 持有排他锁,本示例中查询卡住就是因为拿不到t1的共享锁),此时执行 sys_wal_replay_resume() 函数将这些事务回滚就可以了,对恢复无影响
如果刚好卡在下一条事务是drop database这种无法回滚的DDL操作,就会出现DDL语句被直接执行(KingbaseES不支持事务内执行drop database的操作,无法回滚,比如下一小节提到的问题)
8.4.2. 解决方法
尝试更改指定时间点恢复的目标时刻,让恢复后的状态不卡在特殊SQL未提交前的状态。
改为指定事务ID恢复,事务ID是唯一标志,不会往下继续查时间大小,比如本示例改为xid=928。
三、执行基于xid的恢复
1、执行wal日志分析(获取xid)
如下所示,insert事务的xid为1133,drop事务的xid为1135:
# insert事务日志分析
[kingbase@node208 0000000400000000]$/home/kingbase/cluster/v9/kingbase/bin/sys_waldump 000000040000000000000010
rmgr: Heap len (rec/tot): 66/ 66, tx: 1133, lsn: 0/10004B38, prev 0/10004AF0, desc: INSERT off 15 flags 0x00, blkref #0: rel 1663/16384/16396 blk 1
rmgr: Heap len (rec/tot): 66/ 66, tx: 1133, lsn: 0/10004B80, prev 0/10004B38, desc: INSERT off 16 flags 0x00, blkref #0: rel 1663/16384/16396 blk 1
rmgr: Transaction len (rec/tot): 34/ 34, tx: 1133, lsn: 0/10004BC8, prev 0/10004B80, desc: COMMIT 2024-12-04 11:12:37.005082 CST
rmgr: Standby len (rec/tot): 46/ 46, tx: 0, lsn: 0/10004BF0, prev 0/10004BC8, desc: RUNNING_XACTS nextXid 1134 latestCompletedXid 5531099 oldestRunningXid 1133
rmgr: Heap len (rec/tot): 65/ 2589, tx: 1134, lsn: 0/10004C20, prev 0/10004BF0, desc: HOT_UPDATE off 1 xmax 1134 flags 0x00 ; new off 3 xmax 0, blkref #0: rel 1663/16384/2619 blk 24 FPW
rmgr: Heap len (rec/tot): 1367/ 1367, tx: 1134, lsn: 0/10005640, prev 0/10004C20, desc: HOT_UPDATE off 2 xmax 1134 flags 0x60 ; new off 4 xmax 0, blkref #0: rel 1663/16384/2619 blk 24
rmgr: Heap len (rec/tot): 53/ 2781, tx: 1134, lsn: 0/10005B98, prev 0/10005640, desc: INPLACE off 11, blkref #0: rel 1663/16384/1259 blk 0 FPW
rmgr: Transaction len (rec/tot): 130/ 130, tx: 1134, lsn: 0/10006690, prev 0/10005B98, desc: COMMIT 2024-12-04 11:13:31.876312 CST; inval SIMsgs: catcache 73 catcache 73 catcache 62 catcache 61 relcache 16396
rmgr: Standby len (rec/tot): 42/ 42, tx: 0, lsn: 0/10006718, prev 0/10006690, desc: RUNNING_XACTS nextXid 1135 latestCompletedXid 5531099 oldestRunningXid 1135
rmgr: XLOG len (rec/tot): 24/ 24, tx: 0, lsn: 0/10006748, prev 0/10006718, desc: SWITCH
# drop事务日志分析
[kingbase@node208 0000000400000000]$ /home/kingbase/cluster/v9/kingbase/bin/sys_waldump 000000040000000000000012
rmgr: Standby len (rec/tot): 42/ 42, tx: 1135, lsn: 0/12000208, prev 0/120001D8, desc: LOCK xid 1135 db 16384 rel 16396
rmgr: Heap len (rec/tot): 59/ 7863, tx: 1135, lsn: 0/12000238, prev 0/12000208, desc: DELETE off 13 flags 0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/1247 blk 4 FPW
.......
rmgr: Transaction len (rec/tot): 533/ 533, tx: 1135, lsn: 0/120089F8, prev 0/120089C0, desc: COMMIT 2024-12-04 11:26:26.159294 CST; rels: base/16384/16396; inval SIMsgs: catcache 44 catcache 73 catcache 73 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 62 catcache 61 catcache 116 catcache 115 catcache 116 catcache 115 relcache 16396 snapshot 2608 snapshot 2608 snapshot 2608
2、指定xid执行恢复
如下所示,指定备份集及xid恢复数据库:
[kingbase@node208 v9]$ /opt/Kingbase/ES/V9/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --stanza=kingbase --target=1135 --type=xid restore
2024-12-04 15:13:08.541 P00 INFO: restore command begin 2.27: --band-width=0 --cmd-ssh=/opt/Kingbase/ES/V9/Server/bin/sys_securecmd --config=/home/kingbase/kbbr9_repo/sys_rman.conf --exec-id=166754-9e331797 --kb1-path=/home/kingbase/db/v9/data --link-all --log-level-console=info --log-level-file=info --log-path=/opt/Kingbase/ES/V9/Server/log --log-subprocess --non-archived-space=1024 --process-max=4 --repo1-path=/home/kingbase/kbbr9_repo --stanza=kingbase --target=1135 --type=xid
WARN: set process-max 4 is too large, auto set to CPU core count 1
ERROR: [032]: Current options maybe restore to newest data from archive WAL files
Please specify --type=xid --target='xx' --set='2021xxxx-xxxxxxX_2021xxxx-xxxxxxX' or follow the user guide
Stop current restore.
2024-12-04 15:13:08.542 P00 INFO: restore command end: aborted with exception [032]
[kingbase@node208 v9]$ /opt/Kingbase/ES/V9/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --stanza=kingbase --target=1135 --type=xid --set='20241204-111133F_20241204-111338I' restore
2024-12-04 15:15:02.150 P00 INFO: restore command begin 2.27: --band-width=0 --cmd-ssh=/opt/Kingbase/ES/V9/Server/bin/sys_securecmd --config=/home/kingbase/kbbr9_repo/sys_rman.conf --exec-id=167474-f17975be --kb1-path=/home/kingbase/db/v9/data --link-all --log-level-console=info --log-level-file=info --log-path=/opt/Kingbase/ES/V9/Server/log --log-subprocess --non-archived-space=1024 --process-max=4 --repo1-path=/home/kingbase/kbbr9_repo --set=20241204-111133F_20241204-111338I --stanza=kingbase --target=1135 --type=xid
WARN: set process-max 4 is too large, auto set to CPU core count 1
2024-12-04 15:15:02.173 P00 INFO: repo1: restore backup set 20241204-111133F_20241204-111338I, recovery will start at 2024-12-04 11:13:38
2024-12-04 15:15:02.339 P00 INFO: Restore Process: FILE: 1 / 2803 0% SZIE: 16777216 bytes / 109818187 bytes 16MB / 104.7MB 15%
......
2024-12-04 15:15:04.687 P00 INFO: write updated /home/kingbase/db/v9/data/kingbase.auto.conf
2024-12-04 15:15:04.691 P00 INFO: restore global/sys_control (performed last to ensure aborted restores cannot be started)
2024-12-04 15:15:04.692 P00 INFO: restore size = 104.7MB, file total = 2803
2024-12-04 15:15:04.692 P00 INFO: restore command end: completed successfully (2545ms)
如下所示,指定增量备份的备份集,会自动读取全量备份集执行恢复:
3、启动数据库服务执行recovery
1)查看kingbase.auto.conf配置
[kingbase@node208 v9]$ cat data/kingbase.auto.conf
# Do not edit this file manually!
# Removed by sys_rman restore on 2024-12-04 15:15:04 # recovery_target_timeline = '1'
# Recovery settings generated by sys_rman restore on 2024-12-04 15:15:04
restore_command = '/opt/Kingbase/ES/V9/KESRealPro/V009R001C002B0014/Server/bin/sys_rman --config=/home/kingbase/kbbr9_repo/sys_rman.conf --lock-path=/home/kingbase/kbbr9_repo --stanza=kingbase archive-get %f "%p"'
recovery_target_xid = '1135'
2)启动数据库服务
[kingbase@node208 bin]$ ./sys_ctl start -D /home/kingbase/db/v9/data
......
server started
3)连接数据库服务
如下所示,执行xid恢复后,可以直接读取t1获取到数据:
[kingbase@node208 bin]$ ./ksql -U system test
Type "help" for help.
test=# \c prod
You are now connected to database "prod" as userName "system".
prod=# \d
List of relations
Schema | Name | Type | Owner
--------+-------------------------+-------+--------
public | sys_stat_statements | view | system
public | sys_stat_statements_all | view | system
public | t1 | table | system
(3 rows)
prod=# select count(*) from t1;
count
-------
200
(1 row)
四、总结
在用户误操作,通过物理备份恢复时,可以执行基于时间点或xid的恢复,通过dump分析wal日志,可以获取到准确的事务xid,执行xid恢复更方便快捷恢复数据。