KingbaseES V8R3集群运维案例之---集群启动“DATA_SIZE_DIFF 16 (MB)”故障
案例说明:
为保证集群数据的一致性安全,在主备库的数据相差“DATA_SIZE_DIFF >=16M"以上时,该备库不能参与主备切换,并且通过kingbase_monitor.sh启动集群时,集群将无法启动;本案例对此种故障做了复现,并测试了解决方法。
适用版本:
KingbaseES V8R3
适用版本:
KingbaseES V8R3
一、案例环境
1) 配置流复制架构
=== 此环境为一主一备的架构,并且配置流复制为异步模式,这样模拟备库数据库服务宕机,不能接收主库wal日志并应用,导致主备库之间数据差异增大===
# 配置synchronous_standby_names为null(无sync模式的备库)
test=# show synchronous_standby_names;
synchronous_standby_names
---------------------------
(1 row)
# 查看流复制同步模式
test=# select * from sys_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | s
tate | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+----
-------+---------------+----------------+----------------+-----------------+---------------+------------
6113 | 10 | SYSTEM | node243 | 192.168.7.243 | | 57572 | 2021-03-01 12:04:21.733067+08 | | str
eaming | 0/2B020560 | 0/2B020560 | 0/2B020560 | 0/2B020560 | 0 | async
(1 row)
2)关闭备库自动recovery和数据库服务
# 关闭crond服务中的network_rewind.sh自启动:
root@node3 ~]# cat /etc/cron.d/KINGBASECRON
#*/1 * * * * kingbase . /etc/profile;/home/kingbase/cluster/R6HA/KHA/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf >> /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../kbha.log 2>&1
#*/1 * * * * kingbase /home/kingbase/cluster/kha/db/bin/network_rewind.sh
#*/1 * * * * root /home/kingbase/cluster/kha/kingbasecluster/bin/restartcluster.sh
# 停止备库的数据库服务
[kingbase@node3 log]$ /home/kingbase/cluster/kha/db/bin/sys_ctl stop -D /data/kingbase/cluster/r3/data
waiting for server to shut down.... done
server stopped
二、在主库进行事务处理模拟生产
1)查看主库sys_xlog
[kingbase@node1 sys_xlog]$ ls -lh
total 1.1G
......
-rw------- 1 kingbase kingbase 16M Mar 1 12:22 000000020000000000000040
-rw------- 1 kingbase kingbase 16M Mar 1 12:22 000000020000000000000041
-rw------- 1 kingbase kingbase 16M Mar 1 12:22 000000020000000000000042
-rw------- 1 kingbase kingbase 41 Mar 1 2021 00000002.history
drwx------ 2 kingbase kingbase 4.0K Mar 1 12:22 archive_status
2)模拟主库事务处理
prod=# insert into t values (generate_series(1,10000000),'usr'||generate_series(1,10000000));
INSERT 0 10000000
3)查看主库sys_xlog日志的变化(主库wal日志在增长)
[kingbase@node1 sys_xlog]$ ls -lh
total 1.8G
......
-rw------- 1 kingbase kingbase 16M Mar 1 12:31 00000002000000000000006F
-rw------- 1 kingbase kingbase 16M Mar 1 12:31 000000020000000000000070
-rw------- 1 kingbase kingbase 16M Mar 1 12:32 000000020000000000000071
-rw------- 1 kingbase kingbase 16M Mar 1 12:33 000000020000000000000072
-rw------- 1 kingbase kingbase 41 Mar 1 2021 00000002.history
drwx------ 2 kingbase kingbase 8.0K Mar 1 12:32 archive_status
三、重新启动集群模拟故障
1) 通过kingbase_monitor.sh重启集群
[kingbase@node3 bin]$ ./kingbase_monitor.sh restart
-----------------------------------------------------------------------
2021-03-01 12:30:07 KingbaseES automation beging...
.......
Received 0 response(s)
ping vip 192.168.7.245 success ping times :[3], success times:[2]
ping vip 192.168.7.245 success ping times :[3], success times:[3]
The diff between the LSN of some standby server and then LSN of primary server is greater than DATA_SIZE_DIFF 16 (MB)
Some standby's data is less than primary, could not start up the whole Cluster
如下图所示:出现”DATA_SIZE_DIFF 16 (MB)“故障:
kingbase_monitor.sh检测diff lsn语句:
2)查看备库数据库服务和流复制状态
=如下所示,集群的kingbasecluster管理服务没有启动,但是数据库服务已经启动=
备库数据库服务:
[kingbase@node3 bin]$ ps -ef |grep kingbase
kingbase 16974 1 0 12:30 ? 00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /data/kingbase/cluster/r3/data
kingbase 16991 16974 0 12:30 ? 00:00:00 kingbase: logger process
kingbase 16992 16974 8 12:30 ? 00:00:11 kingbase: startup process recovering 00000002000000000000004C
kingbase 16996 16974 0 12:30 ? 00:00:00 kingbase: checkpointer process
kingbase 16997 16974 0 12:30 ? 00:00:00 kingbase: writer process
kingbase 16998 16974 0 12:30 ? 00:00:00 kingbase: stats collector process
kingbase 17202 16974 1 12:30 ? 00:00:02 kingbase: wal receiver process streaming 0/4D000000
kingbase 17727 17725 0 12:33 ? 00:00:00 /bin/bash /home/kingbase/cluster/kha/db/bin/network_rewind.sh
主库数据库服务:
[kingbase@node1 bin]$ ps -ef |grep kingbase
kingbase 27462 1 0 12:31 ? 00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /data/kingbase/cluster/r3/data
kingbase 27472 27462 0 12:31 ? 00:00:00 kingbase: logger process
kingbase 27475 27462 0 12:31 ? 00:00:00 kingbase: checkpointer process
kingbase 27476 27462 0 12:31 ? 00:00:00 kingbase: writer process
kingbase 27477 27462 0 12:31 ? 00:00:00 kingbase: wal writer process
kingbase 27478 27462 0 12:31 ? 00:00:00 kingbase: autovacuum launcher process
kingbase 27479 27462 0 12:31 ? 00:00:00 kingbase: archiver process last was 000000020000000000000073
kingbase 27480 27462 0 12:31 ? 00:00:00 kingbase: stats collector process
kingbase 27481 27462 0 12:31 ? 00:00:00 kingbase: bgworker: syslogical supervisor
kingbase 27487 27462 0 12:31 ? 00:00:00 kingbase: wal sender process SYSTEM 192.168.7.243(59934) streaming 0/53680000
kingbase 28041 27462 0 12:32 ? 00:00:01 kingbase: autovacuum worker process prod
流复制状态:
test=# select * from sys_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | s
tate | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+---
------+---------------+----------------+----------------+-----------------+---------------+------------
27487 | 10 | SYSTEM | node243 | 192.168.7.243 | | 59934 | 2021-03-01 12:31:53.917879+08 | | ca
tchup | 0/55B20000 | 0/54000000 | 0/54000000 | 0/52FFFFD8 | 0 | async
(1 row)
3)查看备库recovery状态
=备库启动数据库服务后,startup进程执行recovery=
[kingbase@node3 bin]$ ps -ef |grep kingbase
kingbase 16974 1 0 12:30 ? 00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /data/kingbase/cluster/r3/data
kingbase 16991 16974 0 12:30 ? 00:00:00 kingbase: logger process
kingbase 16992 16974 4 12:30 ? 00:00:19 kingbase: startup process recovering 000000020000000000000071
kingbase 16996 16974 0 12:30 ? 00:00:00 kingbase: checkpointer process
kingbase 16997 16974 0 12:30 ? 00:00:00 kingbase: writer process
kingbase 16998 16974 0 12:30 ? 00:00:00 kingbase: stats collector process
kingbase 17202 16974 1 12:30 ? 00:00:04 kingbase: wal receiver process streaming 0/72000000
如下所示:备库数据库服务刚启动时,startup进程从”00000002000000000000004C“开始做recovery,已经应用wal日志到了”000000020000000000000071“。
=通过观察sys_stat_replication视图也能看到lsn的变化,说明备库一直在做recovery=
test=# select * from sys_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | s
tate | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+---
------+---------------+----------------+----------------+-----------------+---------------+------------
27487 | 10 | SYSTEM | node243 | 192.168.7.243 | | 59934 | 2021-03-01 12:31:53.917879+08 | | ca
tchup | 0/715E0000 | 0/70000000 | 0/70000000 | 0/6EFFFE98 | 0 | async
流复制状态是“catchup”:
4)备库完成recovery
test=# select * from sys_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin |
state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+---
--------+---------------+----------------+----------------+-----------------+---------------+------------
27487 | 10 | SYSTEM | node243 | 192.168.7.243 | | 59934 | 2021-03-01 12:31:53.917879+08 | | st
reaming | 0/81511BE8 | 0/81511BE8 | 0/81511BE8 | 0/81511BE8 | 0 | async
(1 row)
如下所示,此时备库的replay_location和其他的lsn已经相同,说明备库已经完成了wal日志的应用(当然本案例是在主库没有业务的情况下,replay_location和其他的lsn同步,但在生产情况下,业务繁忙时,很难达到同步,只能是缩小差异)。
流复制状态是“streaming”:
查看备库数据库服务:
[kingbase@node3 bin]$ ps -ef |grep kingbase
kingbase 16974 1 0 12:30 ? 00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /data/kingbase/cluster/r3/data
kingbase 16991 16974 0 12:30 ? 00:00:00 kingbase: logger process
kingbase 16992 16974 3 12:30 ? 00:00:20 kingbase: startup process recovering 000000020000000000000081
kingbase 16996 16974 0 12:30 ? 00:00:01 kingbase: checkpointer process
kingbase 16997 16974 0 12:30 ? 00:00:00 kingbase: writer process
kingbase 16998 16974 0 12:30 ? 00:00:00 kingbase: stats collector process
kingbase 17202 16974 0 12:30 ? 00:00:05 kingbase: wal receiver process streaming 0/81511BE8
=== 如上所示:startup进程应用的wal日志已经是备库的最后的wal日志,说明备库已经完成了recovery===
查看备库sys_xlog日志:
[kingbase@node3 bin]$ ls -lh /data/kingbase/cluster/r3/data/sys_xlog/
total 2.0G
......
-rw------- 1 kingbase kingbase 16M Mar 1 12:40 00000002000000000000007E
-rw------- 1 kingbase kingbase 16M Mar 1 12:40 00000002000000000000007F
-rw------- 1 kingbase kingbase 16M Mar 1 12:40 000000020000000000000080
-rw------- 1 kingbase kingbase 16M Mar 1 12:40 000000020000000000000081
-rw------- 1 kingbase kingbase 41 Mar 1 12:07 00000002.history
drwx------ 2 kingbase kingbase 8.0K Mar 1 12:40 archive_status
四、启动集群
kingbase@node3 bin]$ ./kingbase_monitor.sh restart
-----------------------------------------------------------------------
2021-03-01 12:43:23 KingbaseES automation beging...
......
......................
all started..
...
now we check again
=======================================================================
| ip | program| [status]
[ 192.168.7.243]| [kingbasecluster]| [active]
[ 192.168.7.248]| [kingbasecluster]| [active]
[ 192.168.7.243]| [kingbase]| [active]
[ 192.168.7.248]| [kingbase]| [active]
=======================================================================
=如上所示,集群启动正常,备库已经和主库缩小了数据差异,集群可以正常启动=
五、总结
对于集群出现”DATA_SIZE_DIFF >=16M"故障时,可以首先查看主备库的数据库服务,保证主备库的数据库服务都启动正常,并且流复制状态正常。备库在流复制正常的,并且主库的wal日志(归档和在线日志)都完整情况下,可以通过startup进程来做recovery,缩小和主库的数据差异。对于生产环境需要一个较长的时间做recovery,如果需要快速同步,可以重做备库。
附件:
主备库lsn相差(DATA_SIZE_DIFF)超过16M,cluster无法启动
适用版本:V8R3
问题说明:KingbaseES R3集群启动,主备库lsn相差(DATA_SIZE_DIFF)超过16M, cluster无法启动。
问题现象:
问题分析:
为避免备库和主库之间数据差异较大,启动集群时,发生failover,切换到备库,然后使用sys_rewind导致原主库事务回卷,导致数据丢失,所以当主备库lsn相差(DATA_SIZE_DIFF)超过16M时,集群无法启动。
问题解决:
可以不启动集群管理,手工启动主备库数据库服务,启动流复制,备库数据通过流复制,追赶主库数据,等数据差异小于阈值时,再启动集群。
或在Server/etc/HAmodule.conf文件 和 Cluster/etc/HAmodule.conf 这两个配置文件最后加上DATA_SIZE_DIFF=64 。