KingbaseES V8R6集群备份恢复案例之---主库single-pro备份恢复

案例说明:
KingbaseES V8R6集群物理备份支持single-pro方式,本案例在集群执行single-pro方式备份并多次切换集群后,对集群执行了恢复测试,文档记录了恢复的详细过程。

适用版本:
KingbaseES V8R6

集群架构:

 ID | Name  | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node1 | primary | * running |          | running | 44482 | no      | n/a
 2  | node2 | standby |   running | node1    | running | 15369 | no      | 1 second(s) ago

一、查看集群备份
在执行single-prod模式的物理备份的初始化(sys_backup.sh init)后,同时会在主备库节点执行备份:

1、主库备份

[kingbase@node201 bin]$ /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/sys_rman --config=/home/kingbase/kbbr_repo/sys_rman.conf --stanza=kingbase info
WARN: set process-max 4 is too large, auto set to CPU core count 1
stanza: kingbase
    status: ok
    cipher: none

    db (current)
        wal archive min/max (V008R006C008B0014): 000000490000000300000040/00000049000000030000004A

        full backup: 20241101-105946F
            timestamp start/stop: 2024-11-01 10:59:46 / 2024-11-01 10:59:53
            wal start/stop: 000000490000000300000048 / 000000490000000300000048
            database size: 378.4MB, database backup size: 378.4MB
            repo1: backup set size: 378.4MB, backup size: 378.4MB

        incr backup: 20241101-105946F_20241101-110356I
            timestamp start/stop: 2024-11-01 11:03:56 / 2024-11-01 11:03:58
            wal start/stop: 00000049000000030000004A / 00000049000000030000004A
            database size: 378.8MB, database backup size: 25.1MB
            repo1: backup set size: 378.8MB, backup size: 25.1MB
            backup reference list: 20241101-105946F

2、备库备份

[kingbase@node202 bin]$ /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/sys_rman --config=/home/kingbase/kbbr_repo/sys_rman.conf --stanza=kingbase info
WARN: set process-max 4 is too large, auto set to CPU core count 1
stanza: kingbase
    status: ok
    cipher: none

    db (current)
        wal archive min/max (V008R006C008B0014): 000000490000000300000040/0000004B000000030000004D

        full backup: 20241101-105544F
            timestamp start/stop: 2024-11-01 10:55:44 / 2024-11-01 10:55:53
            wal start/stop: 000000490000000300000046 / 000000490000000300000046
            database size: 378.4MB, database backup size: 378.4MB
            repo2: backup set size: 378.4MB, backup size: 378.4MB

二、切换集群测试

1、执行failover切换后

[kingbase@node201 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 73       | 0 bytes | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | primary | * running |          | default  | 100      | 74       |         | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000

2、执行switchover切换后

[kingbase@node201 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 75       |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running | node1    | default  | 100      | 74       | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000

3、主库timeline时间线变化
如下所示,主库节点timeline变化:

三、主库数据恢复测试

1、模拟数据丢失

prod=# drop table t3;
DROP TABLE
prod=# drop table t2;
DROP TABLE
prod=# drop table t1;
DROP TABLE
prod=# \d
                       List of relations
 Schema |          Name           |       Type        | Owner
--------+-------------------------+-------------------+--------
 public | sys_roles               | table             | system
 public | sys_stat_statements     | view              | system
 public | sys_stat_statements_all | view              | system
 public | tb1                     | table             | system
 public | teachers                | table             | system
(15 rows)

2、通过物理备份恢复数据
1)关闭集群
[kingbase@node201 bin]$ ./sys_monitor.sh stop

2)备份原data
[kingbase@node201 kingbase]$ mv data data.bk

3)执行全量恢复
如下所示,主库节点数据库恢复成功:

[kingbase@node201 bin]$ /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/sys_rman --config=/home/kingbase/kbbr_repo/sys_rman.conf --stanza=kingbase restore
2024-11-01 11:14:17.027 P00   INFO: restore command begin 2.27: --band-width=0 --config=/home/kingbase/kbbr_repo/sys_rman.conf --exec-id=56849-a18b87e7 --kb2-host=192.168.1.202 --kb1-path=/home/kingbase/cluster/R6C8/HAC8/kingbase/data --kb2-path=/home/kingbase/cluster/R6C8/HAC8/kingbase/data --link-all --log-level-console=info --log-level-file=info --log-path=/home/kingbase/cluster/R6C8/HAC8/kingbase/log --log-subprocess --non-archived-space=1024 --process-max=4 --repo1-path=/home/kingbase/kbbr_repo --stanza=kingbase
WARN: set process-max 4 is too large, auto set to CPU core count 1
2024-11-01 11:14:17.065 P00   INFO: repo1: restore backup set 20241101-105946F_20241101-110356I, recovery will start at 2024-11-01 11:03:56
2024-11-01 11:14:17.981 P00   INFO: Restore Process: FILE: 1 / 4572 0%       SZIE: 182419456 bytes / 397204747 bytes 174.0MB / 378.8MB 45%
........

2024-11-01 11:14:23.434 P00   INFO: Restore Process: FILE: 4572 / 4572 100%       SZIE: 397204747 bytes / 397204747 bytes 378.8MB / 378.8MB 100%
2024-11-01 11:14:23.435 P00   INFO: write updated /home/kingbase/cluster/R6C8/HAC8/kingbase/data/kingbase.auto.conf
2024-11-01 11:14:23.438 P00   INFO: restore global/sys_control (performed last to ensure aborted restores cannot be started)
2024-11-01 11:14:23.439 P00   INFO: restore size = 378.8MB, file total = 4572
2024-11-01 11:14:23.440 P00   INFO: restore command end: completed successfully (6417ms)

4)查看数据恢复状态
如下所示,主库节点数据库恢复完成:

# 启动主库数据库服务
[kingbase@node201 bin]$ ./sys_ctl start -D ../data

# 访问数据库
[kingbase@node201 bin]$ ./ksql -U system test
Type "help" for help.

test=# \c prod
You are now connected to database "prod" as userName "system".
prod=# \d
                       List of relations
 Schema |          Name           |       Type        | Owner
--------+-------------------------+-------------------+--------
 public | sys_roles               | table             | system
 public | sys_stat_statements     | view              | system
 public | sys_stat_statements_all | view              | system
 public | t1                      | table             | system
 public | t2                      | table             | system
 public | t3                      | table             | system
 public | tb1                     | table             | system
 public | teachers                | table             | system
(18 rows)

prod=# select count(*) from t3;
 count(*)
----------
    10000
(1 row)

四、执行备库恢复
如下所示,在执行备库clone前,将主库kingbase.auto.conf中的“restore_command”选项注释:

[kingbase@node201 bin]$ cat ../data/kingbase.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
.......

# Recovery settings generated by sys_rman restore on 2024-11-01 11:14:23
# restore_command = '/home/kingbase/cluster/R6C8/HAC8/kingbase/bin/sys_rman --config=/home/kingbase/kbbr_repo/sys_rman.conf --stanza=kingbase archive-get %f "%p"'

1、备库clone
[kingbase@node202 bin]$ ./repmgr standby clone -h 192.168.1.201 -U esrep -d esrep

2、启动备库数据库服务

[kingbase@node202 bin]$ ./sys_ctl start -D ../data

3、注册备库
[kingbase@node202 bin]$ ./repmgr standby register --force

4、查看集群状态

[kingbase@node201 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 74       |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running | node1    | default  | 100      | 74       | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
You have mail in /var/spool/mail/kingbase

---如上所示,集群恢复完成 !

五、恢复故障案例

1、备库clone后故障
如下所示,在主库kingbase.auto.conf中没有注释“restore_command”参数后,备库执行clone,然后启动数据库服务,备库从归档日志开始恢复数据,导致主备库数据时间线timeline不一致,主备流复制建立失败:

备库sys_log日志:

[kingbase@node202 sys_log]$ tail -1000 kingbase-2024-11-01_113351.csv
2024-11-01 11:33:51.683 CST,,,32336,,67244c1f.7e50,1,,2024-11-01 11:33:51 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2024-11-01 11:33:51.688 CST,,,32338,,67244c1f.7e52,1,,2024-11-01 11:33:51 CST,,0,LOG,00000,"database system was interrupted; last known up at 2024-11-01 11:33:21 CST",,,,,,,,,""
2024-11-01 11:33:52.247 CST,,,32338,,67244c1f.7e52,2,,2024-11-01 11:33:51 CST,,0,LOG,00000,"restored log file ""0000004B.history"" from archive",,,,,,,,,""
2024-11-01 11:33:52.259 CST,,,32338,,67244c1f.7e52,3,,2024-11-01 11:33:51 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2024-11-01 11:33:52.269 CST,,,32338,,67244c1f.7e52,4,,2024-11-01 11:33:51 CST,,0,LOG,00000,"restored log file ""0000004B.history"" from archive",,,,,,,,,""
2024-11-01 11:33:52.289 CST,,,32338,,67244c1f.7e52,5,,2024-11-01 11:33:51 CST,,0,FATAL,XX000,"requested timeline 75 is not a child of this server's history","Latest checkpoint is at 3/51000058 on timeline 74, but in the history of the requested timeline, the server forked off from that timeline at 3/4D0000A0.",,,,,,,,""
2024-11-01 11:33:52.290 CST,,,32336,,67244c1f.7e50,2,,2024-11-01 11:33:51 CST,,0,LOG,00000,"startup process (PID 32338) exited with exit code 1",,,,,,,,,""
2024-11-01 11:33:52.290 CST,,,32336,,67244c1f.7e50,3,,2024-11-01 11:33:51 CST,,0,LOG,00000,"aborting startup due to startup process failure",,,,,,,,,""
2024-11-01 11:33:52.297 CST,,,32336,,67244c1f.7e50,4,,2024-11-01 11:33:51 CST,,0,LOG,00000,"database system is shut down",,,,,,,,,""

2、注释主库restore_command参数

3、备库执行clone

如下所示,备库执行clone后,集群恢复正常:

ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                       
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1  | node1 | primary | * running |          | default  | 100      | 74       |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
2  | node2 | standby |   running | node1    | default  | 100      | 74       | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000

六、总结
本文详细记录了,在集群环境下通过single-pro模式执行物理备份后的,集群恢复过程,可以用于数据库数据恢复的参考。

posted @ 2024-11-01 17:46  天涯客1224  阅读(17)  评论(0编辑  收藏  举报