KingbaseES V8R3集群管理和维护案例之---failover切换wal日志变化分析
案例说明:
本案例通过对KingbaseES V8R3集群failover切换过程进行观察,分析了主备库切换后wal日志的变化,对应用者了解KingbaseES V8R3(R6) failover切换过程有一定的帮助。
以下为现场案例:
failover切换后主备库的wal日志信息:
新主库数据库服务启动故障:(sys_log)
=如下所示,在sys_log中,新主库启动startup后,建立流复制,流复制的起始wal日志是:“ 00000004000000050000002A”,导致复制失败。=
适用版本:
KingbaseES V8R3/R6
节点信息:
集群节点状态信息:
[kingbase@node102 bin]$ ./ksql -U SYSTEM -W 123456 TEST -p 9999
ksql (V008R003C002B0290)
Type "help" for help.
TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | standby | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | primary | 0 | false | 0
(2 rows)
一、查看failover切换前节点信息
1、原主库wal日志
[kingbase@node102 sys_xlog]$ ls -lh
.......
-rw------- 1 kingbase kingbase 16M Jul 7 15:25 00000009000000000000002D.partial
-rw------- 1 kingbase kingbase 339 Jul 7 15:09 00000009.history
-rw------- 1 kingbase kingbase 16M Jul 29 10:56 0000000A000000000000002D
-rw------- 1 kingbase kingbase 16M Jul 29 16:32 0000000A000000000000002E
-rw------- 1 kingbase kingbase 16M Aug 3 10:22 0000000A000000000000002F
-rw------- 1 kingbase kingbase 382 Jul 7 15:25 0000000A.history
2、原主库控制文件信息
[kingbase@node102 bin]$ ./sys_controldata -D ../data
sys_control version number: 830
Catalog version number: 201608131
Database system identifier: 7080416207291699599
Database cluster state: in production
sys_control last modified: Wed 03 Aug 2022 10:26:57 AM CST
Latest checkpoint location: 0/2F000108
Prior checkpoint location: 0/2F000028
Latest checkpoint's REDO location: 0/2F0000D0
Latest checkpoint's REDO WAL file: 0000000A000000000000002F
Latest checkpoint's TimeLineID: 10
Latest checkpoint's PrevTimeLineID: 10
3、备库wal日志
[kingbase@node101 bin]$ ls -lh ../data/sys_xlog
.......
-rw------- 1 kingbase kingbase 16M Jul 7 15:25 00000009000000000000002D
-rw------- 1 kingbase kingbase 339 Jun 22 16:15 00000009.history
-rw------- 1 kingbase kingbase 16M Jul 29 16:14 0000000A000000000000002D
-rw------- 1 kingbase kingbase 16M Aug 3 10:22 0000000A000000000000002E
-rw------- 1 kingbase kingbase 16M Aug 3 10:27 0000000A000000000000002F
-rw------- 1 kingbase kingbase 382 Jul 29 10:33 0000000A.history
4、备库控制文件信息
[kingbase@node101 bin]$ ./sys_controldata -D ../data
sys_control version number: 830
Catalog version number: 201608131
Database system identifier: 7080416207291699599
Database cluster state: in archive recovery
sys_control last modified: Wed 03 Aug 2022 10:26:55 AM CST
Latest checkpoint location: 0/2F000028
Prior checkpoint location: 0/2E0002C8
Latest checkpoint's REDO location: 0/2F000028
Latest checkpoint's REDO WAL file: 0000000A000000000000002F
Latest checkpoint's TimeLineID: 10
Latest checkpoint's PrevTimeLineID: 10
二、执行failover切换(关闭主库数据库服务)
1、关闭主库数据库服务
[kingbase@node102 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down....... done
server stopped
三、failover切换完成主备状态信息
1、新主库wal日志
[kingbase@node101 bin]$ ls -lh ../data/sys_xlog
.......
-rw------- 1 kingbase kingbase 339 Jun 22 16:15 00000009.history
-rw------- 1 kingbase kingbase 16M Jul 29 16:14 0000000A000000000000002D
-rw------- 1 kingbase kingbase 16M Aug 3 10:22 0000000A000000000000002E
-rw------- 1 kingbase kingbase 16M Aug 3 10:30 0000000A000000000000002F
-rw------- 1 kingbase kingbase 16M Aug 3 10:30 0000000A0000000000000030.partial
-rw------- 1 kingbase kingbase 382 Jul 29 10:33 0000000A.history
-rw------- 1 kingbase kingbase 16M Aug 3 10:31 0000000B0000000000000030
-rw------- 1 kingbase kingbase 426 Aug 3 10:30 0000000B.history
切换完成后timeline发生切换:
查看timeline history文件信息:
2、新主库控制文件信息
[kingbase@node101 bin]$ ./sys_controldata -D ../data
sys_control version number: 830
Catalog version number: 201608131
Database system identifier: 7080416207291699599
Database cluster state: in production
sys_control last modified: Wed 03 Aug 2022 10:35:48 AM CST
Latest checkpoint location: 0/3005E110
Prior checkpoint location: 0/30004BD8
Latest checkpoint's REDO location: 0/3005B370
Latest checkpoint's REDO WAL file: 0000000B0000000000000030
Latest checkpoint's TimeLineID: 11
Latest checkpoint's PrevTimeLineID: 11
3、新备库wal日志
[kingbase@node102 bin]$ ls -lh ../data/sys_xlog
.......
-rw------- 1 kingbase kingbase 16M Jul 29 10:56 0000000A000000000000002D
-rw------- 1 kingbase kingbase 16M Jul 29 16:32 0000000A000000000000002E
-rw------- 1 kingbase kingbase 16M Aug 3 10:34 0000000A000000000000002F
-rw------- 1 kingbase kingbase 16M Aug 3 10:34 0000000A0000000000000030.partial
-rw------- 1 kingbase kingbase 382 Aug 3 10:34 0000000A.history
-rw------- 1 kingbase kingbase 16M Aug 3 10:34 0000000B0000000000000030
-rw------- 1 kingbase kingbase 426 Aug 3 10:34 0000000B.history
4、新备库控制文件信息
[kingbase@node102 bin]$ ./sys_controldata -D ../data
sys_control version number: 830
Catalog version number: 201608131
Database system identifier: 7080416207291699599
Database cluster state: in archive recovery
sys_control last modified: Wed 03 Aug 2022 10:35:42 AM CST
Latest checkpoint location: 0/30004BD8
Prior checkpoint location: 0/30004BD8
Latest checkpoint's REDO location: 0/30004BA0
Latest checkpoint's REDO WAL file: 0000000B0000000000000030
Latest checkpoint's TimeLineID: 11
Latest checkpoint's PrevTimeLineID: 11
四、将原主库作为备库恢复到集群
1、在原主库data下创建recovery.conf
[kingbase@node102 data]$ cp ../etc/recovery.done ./recovery.conf
2、查看recovery.log信息
primary node/Im node status is changed, primary ip[192.168.1.101], recovery.conf NEED_CHANGE [0] (0 is need ), I,m status is [1] (1 is down), I will be in recovery.
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | down | 0.500000 | standby | 0 | false | 0
(2 rows)
if recover node up, let it down , for rewind
2022-08-03 10:34:35 sys_rewind...
sys_rewind --target-data=/home/kingbase/cluster/R3HA/db/data --source-server="host=192.168.1.101 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST"
datadir_source = /home/kingbase/cluster/R3HA/db/data
rewinding from last common checkpoint at 0/2F000108 on timeline 10
find last common checkpoint start time from 2022-08-03 10:34:35.349563 CST to 2022-08-03 10:34:35.405349 CST, in "0.055786" seconds.
reading source file list
reading target file list
reading WAL in target
Rewind datadir file from source
update the control file: minRecoveryPoint is '0/3004D0C8', minRecoveryPointTLI is '11', and database state is 'in archive recovery'
rewind start wal location 0/2F0000D0 (file 0000000A000000000000002F), end wal location 0/3004D0C8 (file 0000000B0000000000000030). time from 2022-08-03 10:34:37.349563 CST to 2022-08-03 10:34:37.872586 CST, in "2.523023" seconds.
Done!
sed conf change #synchronous_standby_names
2022-08-03 10:34:39 file operate
cp recovery.conf...
change recovery.conf ip -> primary.ip
2022-08-03 10:34:39 change recovery.conf
delete pid file if exist
del the replication_slots if exis
drop the slot [slot_node101].
drop the slot [slot_node102].
2022-08-03 10:34:40 start up the kingbase...
waiting for server to start....LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "/home/kingbase/cluster/R3HA/db/data/sys_log".
done
server started
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node101,)
(1 row)
2022-08-03 10:34:42 create the slot [slot_node101] success.
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node102,)
(1 row)
2022-08-03 10:34:42 create the slot [slot_node102] success.
2022-08-03 10:34:42 start up standby successful!
can not get the replication of myself
如下所示:recovery过程:
五、总结
在集群执行failover切换时,可以结合wal日志和recovery.log和控制文件的变化,可以详细了解failover切换中wal日志的变化,及通过sys_rewind工具对原主库的恢复过程。