KingbaseES V8R3集群运维案例之---集群启动“DATA_SIZE_DIFF 16 (MB)”故障

案例说明:
为保证集群数据的一致性安全,在主备库的数据相差“DATA_SIZE_DIFF >=16M"以上时,该备库不能参与主备切换,并且通过kingbase_monitor.sh启动集群时,集群将无法启动;本案例对此种故障做了复现,并测试了解决方法。

适用版本:
KingbaseES V8R3

适用版本:

KingbaseES V8R3

一、案例环境

1) 配置流复制架构
=== 此环境为一主一备的架构,并且配置流复制为异步模式,这样模拟备库数据库服务宕机,不能接收主库wal日志并应用,导致主备库之间数据差异增大===

# 配置synchronous_standby_names为null(无sync模式的备库)
test=# show synchronous_standby_names;
 synchronous_standby_names 
---------------------------

(1 row)

# 查看流复制同步模式
test=# select * from sys_stat_replication;
 pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   s
tate   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+----
-------+---------------+----------------+----------------+-----------------+---------------+------------
 6113 |       10 | SYSTEM  | node243          | 192.168.7.243 |                 |       57572 | 2021-03-01 12:04:21.733067+08 |              | str
eaming | 0/2B020560    | 0/2B020560     | 0/2B020560     | 0/2B020560      |             0 | async
(1 row)

2)关闭备库自动recovery和数据库服务

# 关闭crond服务中的network_rewind.sh自启动:
root@node3 ~]# cat /etc/cron.d/KINGBASECRON 
#*/1 * * * * kingbase . /etc/profile;/home/kingbase/cluster/R6HA/KHA/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf >> /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../kbha.log 2>&1
#*/1 * * * * kingbase  /home/kingbase/cluster/kha/db/bin/network_rewind.sh
#*/1 * * * * root  /home/kingbase/cluster/kha/kingbasecluster/bin/restartcluster.sh

# 停止备库的数据库服务
[kingbase@node3 log]$ /home/kingbase/cluster/kha/db/bin/sys_ctl stop -D /data/kingbase/cluster/r3/data
waiting for server to shut down.... done
server stopped

二、在主库进行事务处理模拟生产

1)查看主库sys_xlog

[kingbase@node1 sys_xlog]$ ls -lh
total 1.1G
......
-rw------- 1 kingbase kingbase  16M Mar  1 12:22 000000020000000000000040
-rw------- 1 kingbase kingbase  16M Mar  1 12:22 000000020000000000000041
-rw------- 1 kingbase kingbase  16M Mar  1 12:22 000000020000000000000042
-rw------- 1 kingbase kingbase   41 Mar  1  2021 00000002.history
drwx------ 2 kingbase kingbase 4.0K Mar  1 12:22 archive_status

2)模拟主库事务处理

prod=# insert into t values (generate_series(1,10000000),'usr'||generate_series(1,10000000));
INSERT 0 10000000

3)查看主库sys_xlog日志的变化(主库wal日志在增长)

[kingbase@node1 sys_xlog]$ ls -lh
total 1.8G
......
-rw------- 1 kingbase kingbase  16M Mar  1 12:31 00000002000000000000006F
-rw------- 1 kingbase kingbase  16M Mar  1 12:31 000000020000000000000070
-rw------- 1 kingbase kingbase  16M Mar  1 12:32 000000020000000000000071
-rw------- 1 kingbase kingbase  16M Mar  1 12:33 000000020000000000000072
-rw------- 1 kingbase kingbase   41 Mar  1  2021 00000002.history
drwx------ 2 kingbase kingbase 8.0K Mar  1 12:32 archive_status

三、重新启动集群模拟故障

1) 通过kingbase_monitor.sh重启集群

[kingbase@node3 bin]$ ./kingbase_monitor.sh restart
-----------------------------------------------------------------------
2021-03-01 12:30:07 KingbaseES automation beging...
.......
Received 0 response(s)
ping vip 192.168.7.245 success ping times :[3], success times:[2]
ping vip 192.168.7.245 success ping times :[3], success times:[3]
The diff between the LSN of some standby server and then LSN of primary server is greater than DATA_SIZE_DIFF 16 (MB)
Some standby's data is less than primary, could not start up the whole Cluster

如下图所示:出现”DATA_SIZE_DIFF 16 (MB)“故障:

kingbase_monitor.sh检测diff lsn语句:

2)查看备库数据库服务和流复制状态

=如下所示,集群的kingbasecluster管理服务没有启动,但是数据库服务已经启动=

备库数据库服务:
[kingbase@node3 bin]$ ps -ef |grep kingbase

kingbase 16974     1  0 12:30 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /data/kingbase/cluster/r3/data
kingbase 16991 16974  0 12:30 ?        00:00:00 kingbase: logger process   
kingbase 16992 16974  8 12:30 ?        00:00:11 kingbase: startup process   recovering 00000002000000000000004C
kingbase 16996 16974  0 12:30 ?        00:00:00 kingbase: checkpointer process   
kingbase 16997 16974  0 12:30 ?        00:00:00 kingbase: writer process   
kingbase 16998 16974  0 12:30 ?        00:00:00 kingbase: stats collector process   
kingbase 17202 16974  1 12:30 ?        00:00:02 kingbase: wal receiver process   streaming 0/4D000000
kingbase 17727 17725  0 12:33 ?        00:00:00 /bin/bash /home/kingbase/cluster/kha/db/bin/network_rewind.sh


主库数据库服务:
[kingbase@node1 bin]$ ps -ef |grep kingbase

kingbase 27462     1  0 12:31 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /data/kingbase/cluster/r3/data
kingbase 27472 27462  0 12:31 ?        00:00:00 kingbase: logger process   
kingbase 27475 27462  0 12:31 ?        00:00:00 kingbase: checkpointer process   
kingbase 27476 27462  0 12:31 ?        00:00:00 kingbase: writer process   
kingbase 27477 27462  0 12:31 ?        00:00:00 kingbase: wal writer process   
kingbase 27478 27462  0 12:31 ?        00:00:00 kingbase: autovacuum launcher process   
kingbase 27479 27462  0 12:31 ?        00:00:00 kingbase: archiver process   last was 000000020000000000000073
kingbase 27480 27462  0 12:31 ?        00:00:00 kingbase: stats collector process   
kingbase 27481 27462  0 12:31 ?        00:00:00 kingbase: bgworker: syslogical supervisor   
kingbase 27487 27462  0 12:31 ?        00:00:00 kingbase: wal sender process SYSTEM 192.168.7.243(59934) streaming 0/53680000
kingbase 28041 27462  0 12:32 ?        00:00:01 kingbase: autovacuum worker process   prod


流复制状态:
test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |  s
tate  | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+---
------+---------------+----------------+----------------+-----------------+---------------+------------
 27487 |       10 | SYSTEM  | node243          | 192.168.7.243 |                 |       59934 | 2021-03-01 12:31:53.917879+08 |              | ca
tchup | 0/55B20000    | 0/54000000     | 0/54000000     | 0/52FFFFD8      |             0 | async
(1 row)

3)查看备库recovery状态

=备库启动数据库服务后,startup进程执行recovery=

[kingbase@node3 bin]$ ps -ef |grep kingbase

kingbase 16974     1  0 12:30 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /data/kingbase/cluster/r3/data
kingbase 16991 16974  0 12:30 ?        00:00:00 kingbase: logger process   
kingbase 16992 16974  4 12:30 ?        00:00:19 kingbase: startup process   recovering 000000020000000000000071
kingbase 16996 16974  0 12:30 ?        00:00:00 kingbase: checkpointer process   
kingbase 16997 16974  0 12:30 ?        00:00:00 kingbase: writer process   
kingbase 16998 16974  0 12:30 ?        00:00:00 kingbase: stats collector process   
kingbase 17202 16974  1 12:30 ?        00:00:04 kingbase: wal receiver process   streaming 0/72000000
  如下所示:备库数据库服务刚启动时,startup进程从”00000002000000000000004C“开始做recovery,已经应用wal日志到了”000000020000000000000071“。

=通过观察sys_stat_replication视图也能看到lsn的变化,说明备库一直在做recovery=

test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |  s
tate  | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+---
------+---------------+----------------+----------------+-----------------+---------------+------------
 27487 |       10 | SYSTEM  | node243          | 192.168.7.243 |                 |       59934 | 2021-03-01 12:31:53.917879+08 |              | ca
tchup | 0/715E0000    | 0/70000000     | 0/70000000     | 0/6EFFFE98      |             0 | async

流复制状态是“catchup”:

4)备库完成recovery

test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   
state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+---
--------+---------------+----------------+----------------+-----------------+---------------+------------
 27487 |       10 | SYSTEM  | node243          | 192.168.7.243 |                 |       59934 | 2021-03-01 12:31:53.917879+08 |              | st
reaming | 0/81511BE8    | 0/81511BE8     | 0/81511BE8     | 0/81511BE8      |             0 | async
(1 row)
如下所示,此时备库的replay_location和其他的lsn已经相同,说明备库已经完成了wal日志的应用(当然本案例是在主库没有业务的情况下,replay_location和其他的lsn同步,但在生产情况下,业务繁忙时,很难达到同步,只能是缩小差异)。

流复制状态是“streaming”:

查看备库数据库服务:

[kingbase@node3 bin]$ ps -ef |grep kingbase

kingbase 16974     1  0 12:30 ?        00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D /data/kingbase/cluster/r3/data
kingbase 16991 16974  0 12:30 ?        00:00:00 kingbase: logger process   
kingbase 16992 16974  3 12:30 ?        00:00:20 kingbase: startup process   recovering 000000020000000000000081
kingbase 16996 16974  0 12:30 ?        00:00:01 kingbase: checkpointer process   
kingbase 16997 16974  0 12:30 ?        00:00:00 kingbase: writer process   
kingbase 16998 16974  0 12:30 ?        00:00:00 kingbase: stats collector process   
kingbase 17202 16974  0 12:30 ?        00:00:05 kingbase: wal receiver process   streaming 0/81511BE8

=== 如上所示:startup进程应用的wal日志已经是备库的最后的wal日志,说明备库已经完成了recovery===

查看备库sys_xlog日志:

[kingbase@node3 bin]$ ls -lh /data/kingbase/cluster/r3/data/sys_xlog/
total 2.0G
......
-rw------- 1 kingbase kingbase  16M Mar  1 12:40 00000002000000000000007E
-rw------- 1 kingbase kingbase  16M Mar  1 12:40 00000002000000000000007F
-rw------- 1 kingbase kingbase  16M Mar  1 12:40 000000020000000000000080
-rw------- 1 kingbase kingbase  16M Mar  1 12:40 000000020000000000000081
-rw------- 1 kingbase kingbase   41 Mar  1 12:07 00000002.history
drwx------ 2 kingbase kingbase 8.0K Mar  1 12:40 archive_status

四、启动集群

kingbase@node3 bin]$ ./kingbase_monitor.sh restart
-----------------------------------------------------------------------
2021-03-01 12:43:23 KingbaseES automation beging...
......
......................
all started..
...
now we check again
=======================================================================
|             ip |                       program|              [status] 
[  192.168.7.243]|             [kingbasecluster]|              [active]
[  192.168.7.248]|             [kingbasecluster]|              [active]
[  192.168.7.243]|                    [kingbase]|              [active]
[  192.168.7.248]|                    [kingbase]|              [active]
=======================================================================

=如上所示,集群启动正常,备库已经和主库缩小了数据差异,集群可以正常启动=

五、总结

对于集群出现”DATA_SIZE_DIFF >=16M"故障时,可以首先查看主备库的数据库服务,保证主备库的数据库服务都启动正常,并且流复制状态正常。备库在流复制正常的,并且主库的wal日志(归档和在线日志)都完整情况下,可以通过startup进程来做recovery,缩小和主库的数据差异。对于生产环境需要一个较长的时间做recovery,如果需要快速同步,可以重做备库。

附件:

主备库lsn相差(DATA_SIZE_DIFF)超过16M,cluster无法启动

适用版本:V8R3
问题说明:KingbaseES R3集群启动,主备库lsn相差(DATA_SIZE_DIFF)超过16M, cluster无法启动。
问题现象:

问题分析:
为避免备库和主库之间数据差异较大,启动集群时,发生failover,切换到备库,然后使用sys_rewind导致原主库事务回卷,导致数据丢失,所以当主备库lsn相差(DATA_SIZE_DIFF)超过16M时,集群无法启动。
问题解决:
可以不启动集群管理,手工启动主备库数据库服务,启动流复制,备库数据通过流复制,追赶主库数据,等数据差异小于阈值时,再启动集群。
或在Server/etc/HAmodule.conf文件 和 Cluster/etc/HAmodule.conf 这两个配置文件最后加上DATA_SIZE_DIFF=64 。

posted @ 2024-03-29 18:38  KINGBASE研究院  阅读(21)  评论(0编辑  收藏  举报