lightdb wal文件误删恢复及归档wal清理

2021-09-10 17:22:42.417183T  @  startup  00000 [2021-09-10 17:22:42 CST] 0 [9298] LOCATION:  StartupXLOG, xlog.c:6347
2021-09-10 17:22:42.417206T  @  startup  XX000 [2021-09-10 17:22:42 CST] 0 [9298] FATAL:  XX000: required WAL directory "pg_wal" does not exist
2021-09-10 17:22:42.417206T  @  startup  XX000 [2021-09-10 17:22:42 CST] 0 [9298] LOCATION:  ValidateXLOGDirectoryStructure, xlog.c:4262
2021-09-10 17:22:42.417407T  @  postmaster  00000 [2021-09-10 17:22:42 CST] 0 [9296] LOG:  00000: startup process (PID 9298) exited with exit code 1
2021-09-10 17:22:42.417407T  @  postmaster  00000 [2021-09-10 17:22:42 CST] 0 [9296] LOCATION:  LogChildExit, postmaster.c:3714
2021-09-10 17:22:42.417417T  @  postmaster  00000 [2021-09-10 17:22:42 CST] 0 [9296] LOG:  00000: aborting startup due to startup process failure
2021-09-10 17:22:42.417417T  @  postmaster  00000 [2021-09-10 17:22:42 CST] 0 [9296] LOCATION:  reaper, postmaster.c:2969
2021-09-10 17:22:42.427171T  @  postmaster  00000 [2021-09-10 17:22:42 CST] 0 [9296] LOG:  00000: database system is shut down
2021-09-10 17:22:42.427171T  @  postmaster  00000 [2021-09-10 17:22:42 CST] 0 [9296] LOCATION:  UnlinkLockFiles, miscinit.c:928

  执行lt_resetwal -f PGDATA可以重新初始化wal文件,但是会丢失事务日志以及数据不一致,因为可能有full checkpoint之前的数据丢失,极端情况下某些数据块丢失。此时初始化WAL文件如下:

[zjh@lightdb1 pgsql13.2]$ cd data/lt_wal/
[zjh@lightdb1 pg_wal]$ ll
total 1048576
-rw------- 1 zjh zjh 1073741824 Sep 10 21:44 00000001000000BB00000001
drwx------ 2 zjh zjh          6 Sep 10 21:42 archive_status

  再启动PG,备份、重建。

  具体会丢失多少数据,可以通过pg_controldata输出中的latest checkpoint确认。

   如果因为wal_size设置的比较大,希望删除历史归档wal的话,可以通过pg_archivecleanup清理latest checkpoint之前的wal日志,如下:

  pg_archivecleanup /data1/zjh/coordinator/pg_wal/ 000000010000000900000023

  清理000000010000000900000023之前的wal文件,注意如果是高可用环境没有使用复制槽的话,需要避免从节点需要的WAL被清理。

   确实,比他小的没有了,但是问题在于之前的日志都还没删除(因为没有启用归档,所以老的文件会被复用,只是改名了而已),所以也不是那么方便。

    =======

  但是一定要注意,启用了归档,但是归档命令为空的情况,这会导致wal无限制保留,超过max_wal_size大小。如下:

[root@hs-10-20-30-217 pg_wal]# grep "max_wal_size" ../postgresql.conf
#max_wal_size = 16GB
max_wal_size=100GB
[root@hs-10-20-30-217 pg_wal]# grep "keep" ../postgresql.conf
#tcp_keepalives_idle = 0        # TCP_KEEPIDLE, in seconds;
#tcp_keepalives_interval = 0        # TCP_KEEPINTVL, in seconds;
#tcp_keepalives_count = 0        # TCP_KEEPCNT;
#lightdb_keep_temp_file = off           # Keep temporary files due to insufficient work memory.
#wal_keep_size = 1024        # in megabytes; 0 disables
#max_slot_wal_keep_size = -1    # in megabytes; -1 disables
#lightdb_keep_temp_file = off
max_slot_wal_keep_size=173540MB
[root@hs-10-20-30-217 pg_wal]# du . -h
104K    ./archive_status
730G    .
[root@hs-10-20-30-217 pg_wal]# ls -alt archive_status/ | more
总用量 188
drwx------ 3 lightdb lightdb 61440  7月 10 16:12 ..
drwx------ 2 lightdb lightdb 73728  7月 10 16:12 .
-rw------- 1 lightdb lightdb     0  7月 10 16:12 000000010000099700000005.ready
-rw------- 1 lightdb lightdb     0  7月 10 16:11 000000010000099700000004.ready
-rw------- 1 lightdb lightdb     0  7月 10 16:10 000000010000099700000003.ready
-rw------- 1 lightdb lightdb     0  7月 10 16:09 000000010000099700000002.ready
-rw------- 1 lightdb lightdb     0  7月 10 16:07 000000010000099700000001.ready
-rw------- 1 lightdb lightdb     0  7月 10 16:06 000000010000099700000000.ready
-rw------- 1 lightdb lightdb     0  7月 10 16:05 000000010000099600000007.ready
-rw------- 1 lightdb lightdb     0  7月 10 16:04 000000010000099600000006.ready
-rw------- 1 lightdb lightdb     0  7月 10 16:03 000000010000099600000005.ready
-rw------- 1 lightdb lightdb     0  7月 10 16:02 000000010000099600000004.ready
-rw------- 1 lightdb lightdb     0  7月 10 16:01 000000010000099600000003.ready
-rw------- 1 lightdb lightdb     0  7月 10 15:59 000000010000099600000002.ready

  归档是把双刃剑,一定要先规划好自动备份机制以及增量、全量间隔、同时做好全方位监控和告警。如果一天产生的wal大小超过库大小的1/2以上,增量备份就意义不大了。最多留个3份作为时间点恢复即可。否则可以考虑7天、15天或31天。

  因为所有的备份都需要定期验证有效性,所以保留越多、越久,成本也就越高。

  就可用性而言,此时你会发现做异步热备可能反而比物理备份更加合理。

posted @ 2021-09-10 21:53  zhjh256  阅读(546)  评论(0编辑  收藏  举报