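KingbaseES R6 Cluster: Test Cases for the repmgr.conf 'recovery' Parameter (Part 1)

Case description:
In a KingbaseES R6 cluster, when the primary node goes down (for example, it reboots or is shut down), a primary/standby failover takes place. Once the original primary's system is back to normal, that node still has to be handled in a way that keeps the cluster's data consistent and safe. The recovery parameter in repmgr.conf controls this. This case records the detailed test procedure for the three settings of the 'recovery' parameter.

Note: older versions of KingbaseES R6 support only 'manual' and 'automatic' for the recovery parameter.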

Database version:

Cluster architecture:

Cluster node information:

Case 1: Testing 'recovery = standby'

I. Failover test

1. Configure the recovery parameter (on all nodes):
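A quick way to confirm the setting on each node is to read it back from the configuration file. The following is a minimal sketch, assuming repmgr.conf uses plain `key=value` lines with optionally quoted values and `#` comment lines. The helper name `read_recovery` and the sample excerpt are hypothetical; the three allowed values are the ones named in this article.

```python
# Sketch: read the "recovery" setting from repmgr.conf-style text.
# Assumption: simple key=value lines, values optionally quoted,
# and "#" at the start of a line marks a comment.

ALLOWED = {"manual", "automatic", "standby"}  # settings named in this article


def read_recovery(conf_text):
    """Return the last recovery value found in the text, or None."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == "recovery":
            value = val.strip().strip("'\"")
    return value


# Hypothetical excerpt of repmgr.conf, for illustration only.
sample = """
# cluster node settings
node_id=1
recovery='standby'
"""

setting = read_recovery(sample)
print(setting, setting in ALLOWED)
```

Running the same check against the file on every node helps catch a node whose setting differs from the others.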

2. Check the cluster node status

[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
 1  | node243 | primary | * running |          | default  | 100      | 3        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | standby |   running | node243  | default  | 100      | 3        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

3. Reboot the primary node's system
[root@node3 ~]# reboot

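4. Check the hamgr log on the standby

The hamgr log shows that after the original primary went down, the cluster failed over and the original standby was promoted to primary.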

[kingbase@node1 log]$ tail -n 100 -f hamgr.log
[2022-03-01 13:12:23] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2022-03-01 13:12:23] [INFO] connecting to database "host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
INFO:  set_repmgrd_pid(): provided pidfile is /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/hamgrd.pid
[2022-03-01 13:12:23] [NOTICE] starting monitoring of node "node248" (ID: 2)
[2022-03-01 13:12:23] [INFO] "connection_check_type" set to "ping"
[2022-03-01 13:12:23] [INFO] monitoring connection to upstream node "node243" (ID: 1)
[2022-03-01 13:12:23] [NOTICE] try to change wal catched_up state to 1
[2022-03-01 13:12:23] [INFO] primary flush lsn is 0/12000900, local flush lsn is 0/12000848
[2022-03-01 13:12:23] [NOTICE] try to change streaming_sync state to TRUE
[2022-03-01 13:17:24] [INFO] node "node248" (ID: 2) monitoring upstream node "node243" (ID: 1) in normal state
[2022-03-01 13:20:00] [WARNING] unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2022-03-01 13:20:00] [DETAIL] PQping() returned "PQPING_REJECT"
[2022-03-01 13:20:00] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-01 13:20:00] [INFO] sleeping 6 seconds until next reconnection attempt
[2022-03-01 13:20:06] [INFO] checking state of node 1, 1 of 10 attempts
[2022-03-01 13:20:16] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-01 13:20:16] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-01 13:20:16] [INFO] sleeping 6 seconds until next reconnection attempt
......
[2022-03-01 13:21:23] [INFO] checking state of node 1, 10 of 10 attempts
[2022-03-01 13:21:23] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-01 13:21:23] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-01 13:21:23] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-01 13:21:23] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2022-03-01 13:21:23] [WARNING] wal receiver not running
[2022-03-01 13:21:23] [NOTICE] WAL receiver disconnected on all sibling nodes
[2022-03-01 13:21:23] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2022-03-01 13:21:23] [INFO] 0 active sibling nodes registered
[2022-03-01 13:21:23] [INFO] primary and this node have the same location ("default")
[2022-03-01 13:21:23] [INFO] no other sibling nodes - we win by default
[2022-03-01 13:21:23] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2022-03-01 13:21:23] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-01 13:21:23] [INFO] try to ping the trusted_servers "192.168.7.1" before execute promote_command
[2022-03-01 13:21:25] [NOTICE] PING 192.168.7.1 (192.168.7.1) 56(84) bytes of data.

--- 192.168.7.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 2.324/6.238/10.152/3.914 ms

[2022-03-01 13:21:25] [NOTICE] successfully ping one or more of the trusted_servers "192.168.7.1"
[2022-03-01 13:21:26] [NOTICE] try to stop old primary db (host: "192.168.7.243")
[2022-03-01 13:21:26] [NOTICE] PING 192.168.7.241 (192.168.7.241) 56(84) bytes of data.

--- 192.168.7.241 ping statistics ---
2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1000ms


[2022-03-01 13:21:26] [WARNING] ping host"192.168.7.241" failed
[2022-03-01 13:21:26] [DETAIL] average RTT value is not greater than zero
[2022-03-01 13:21:26] [INFO] loadvip result: 1, arping result: 1
[2022-03-01 13:21:26] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.7.241/24 success
[2022-03-01 13:21:26] [INFO] promote_command is:
  "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf"
NOTICE: promoting standby to primary
DETAIL: promoting server "node248" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node248" (ID: 2) was successfully promoted to primary
[2022-03-01 13:21:30] [INFO] switching to primary monitoring mode
[2022-03-01 13:21:30] [NOTICE] monitoring cluster primary "node248" (ID: 2)
[2022-03-01 13:21:30] [INFO] create a thread 0x7fe7dbe15700 to check the cluster status
[2022-03-01 13:21:30] [INFO] node (ID: 1): no server running
[2022-03-01 13:21:31] [INFO] [thread 0x7fe7dbe15700] the cluster has no other running primary node, exit
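The duration of the failover can be read off the log timestamps. The sketch below, using three lines copied from the log above, measures how long the standby spent retrying before giving up on the old primary and how long it took until it was monitoring itself as primary. The helper `first_event` is a hypothetical name, and the regular expression assumes the `[timestamp] [LEVEL] message` layout seen in this log.

```python
import re
from datetime import datetime

# Matches lines such as: [2022-03-01 13:20:00] [WARNING] message text
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$")


def first_event(log_text, needle):
    """Return the timestamp of the first log line whose message contains needle."""
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match and needle in match.group(3):
            return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return None


# Three lines copied from the hamgr log in this article.
excerpt = """\
[2022-03-01 13:20:00] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-01 13:21:23] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-01 13:21:30] [INFO] switching to primary monitoring mode
"""

detected = first_event(excerpt, "unable to connect to upstream node")
gave_up = first_event(excerpt, "unable to reconnect to node")
promoted = first_event(excerpt, "switching to primary monitoring mode")

retry_seconds = int((gave_up - detected).total_seconds())
failover_seconds = int((promoted - detected).total_seconds())
print(retry_seconds, failover_seconds)  # 83 90
```

In this run most of the 90 seconds was spent in the ten reconnection attempts; the promotion itself completed within a few seconds.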

II. Rejoining the original primary to the cluster after its system recovers

1. Create replication slots on the new primary

test=# select sys_create_physical_replication_slot('repmgr_slot_1');
sys_create_physical_replication_slot 
--------------------------------------
(repmgr_slot_1,)
(1 row)

test=# select sys_create_physical_replication_slot('repmgr_slot_2');
sys_create_physical_replication_slot 
--------------------------------------
(repmgr_slot_2,)
(1 row)

test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 repmgr_slot_1 |        | physical  |        |          | f         | f      |            |      |              |             | 
 repmgr_slot_2 |        | physical  |        |          | f         | f      |            |      |              |             | 
(2 rows)

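2. After the original primary's system has finished booting:

1) Back up the data directory on the node that will become the new standby
[kingbase@node3 kingbase]$ cp -r data data.bk

2) Create the standby signal file in the data directory (important)
[kingbase@node3 data]$ touch standby.signal

3) Check the connection string settings of the new standby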

[kingbase@node3 data]$ cat kingbase.auto.conf 
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
job_queue_processes = '5'
primary_conninfo = 'user=esrep connect_timeout=10 host=192.168.7.248 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node243'
recovery_target_timeline = 'latest'
primary_slot_name = 'repmgr_slot_1'
wal_retrieve_retry_interval = '5000'
synchronous_standby_names = '1 (*)'
wal_retrieve_retry_interval = '5000'
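Before starting the service, it is worth checking that primary_conninfo points at the new primary (192.168.7.248) rather than at the node itself. The following is a minimal sketch that splits the keyword=value string shown above into a dictionary. The helper `parse_conninfo` is a hypothetical name, and it assumes no value contains spaces or quotes, which holds for this string but not for every valid connection string.

```python
def parse_conninfo(conninfo):
    """Split a keyword=value connection string into a dict.

    Assumption: values contain no spaces or quotes, as in the
    primary_conninfo shown in this article.
    """
    pairs = (item.split("=", 1) for item in conninfo.split())
    return {key: value for key, value in pairs}


# Value of primary_conninfo copied from kingbase.auto.conf above.
primary_conninfo = (
    "user=esrep connect_timeout=10 host=192.168.7.248 port=54321 "
    "keepalives=1 keepalives_idle=10 keepalives_interval=1 "
    "keepalives_count=3 application_name=node243"
)

settings = parse_conninfo(primary_conninfo)
print(settings["host"], settings["port"], settings["application_name"])
```

Here the host is the new primary and application_name is the rejoining node's own name, which is the name that later appears in sys_stat_replication on the primary.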

4) Start the database service on the new standby

[kingbase@node3 bin]$ ./sys_ctl start -D ../data
......
NOTICE: standby node "node243" (ID: 1) successfully registered

5) Check the current cluster node status

[kingbase@node3 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+----------------------+----------+----------+----------+----------+----------------
 1  | node243 | primary | ! running as standby |          | default  | 100      | 3        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | standby | ! running as primary |          | default  | 100      | 4        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected
  - node "node243" (ID: 1) is registered as primary but running as standby
  - node "node248" (ID: 2) is registered as standby but running as primary
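The "!" entries in the Status column mark nodes whose registered role differs from the role they are actually running in. As a rough sketch, the same mismatches can be picked out of the `repmgr cluster show` text programmatically. The function name `role_mismatches` is hypothetical, and the sample is the table above trimmed to its first five columns for brevity.

```python
import re


def role_mismatches(cluster_show_text):
    """Return (name, registered_role, running_role) for each mismatched node."""
    found = []
    for line in cluster_show_text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Skip the header, the separator and any non-table lines.
        if len(cells) < 4 or not cells[0].isdigit():
            continue
        running = re.search(r"running as (\w+)", cells[3])
        if running and running.group(1) != cells[2]:
            found.append((cells[1], cells[2], running.group(1)))
    return found


# The table from this step, trimmed to the first five columns.
sample = """\
 ID | Name    | Role    | Status               | Upstream
----+---------+---------+----------------------+---------
 1  | node243 | primary | ! running as standby |
 2  | node248 | standby | ! running as primary |
"""

for name, registered, running in role_mismatches(sample):
    print(f"{name}: registered as {registered}, running as {running}")
```

An empty result would mean that the registered metadata and the running roles agree, as in the final `repmgr cluster show` output later in this test.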

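6) The cluster automatically recovers the new standby

As the hamgr log below shows, once the database service on the new standby is started, the cluster automatically runs recovery on it and adds the original primary to the cluster as a standby.

[2022-03-01 13:26:31] [INFO] monitoring primary node "node248" (ID: 2) in normal state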
[2022-03-01 13:27:28] [INFO] child node: 1; attached: no
[2022-03-01 13:27:28] [INFO] check node status again, try 1 / 10 times
[2022-03-01 13:27:30] [INFO] child node: 1; attached: no
.....
[2022-03-01 13:27:46] [INFO] check node status again, try 10 / 10 times
[2022-03-01 13:27:48] [INFO] child node: 1; attached: no
[2022-03-01 13:27:48] [INFO] found node down, recovery will be triggered after recovery delay time 20s
[2022-03-01 13:27:50] [INFO] child node: 1; attached: no
......
[2022-03-01 13:28:08] [INFO] child node: 1; attached: no
[2022-03-01 13:28:08] [INFO] recovery delay time reached. can do recovery now.
[2022-03-01 13:28:09] [NOTICE] mark node "node243" (ID: 1) as inactive
[2022-03-01 13:28:09] [INFO] [thread pid:30763] do_nodes_recovery thread begin. The pthread_t tid is 0x7fe7dbe15700
[2022-03-01 13:28:09] [NOTICE] [thread pid:30763] node (ID: 1; host: "192.168.7.243") is not attached, ready to auto-recovery
[2022-03-01 13:28:09] [NOTICE] [thread pid:30763] Now, the primary host ip: 192.168.7.248
[2022-03-01 13:28:10] [INFO] [thread pid:30763] ES connection to host "192.168.7.243" succeeded, ready to do auto-recovery
[2022-03-01 13:28:10] [NOTICE] kbha: node (ID: 1) is running as standby, stop it and do rejoin.

[2022-03-01 13:28:15] [INFO] unlink file /tmp/.s.KINGBASE.54321.lock
[2022-03-01 13:28:15] [NOTICE] executing repmgr command "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr --dbname="host=192.168.7.248 dbname=esrep user=esrep port=54321" node rejoin --force-rewind"
NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 4 forked off current database system timeline 3 before current recovery point 0/130000A0
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 0/12000A08 on timeline 3
sys_rewind: rewinding from last common checkpoint at 0/11000058 on timeline 3
sys_rewind: find last common checkpoint start time from 2022-03-01 13:28:15.600702 CST to 2022-03-01 13:28:16.200048 CST, in "0.599346" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/12011F70', minRecoveryPointTLI is '4', and database state is 'in archive recovery'
sys_rewind: rewind start wal location 0/11000028 (file 000000030000000000000011), end wal location 0/12011F70 (file 000000040000000000000012). time from 2022-03-01 13:28:15.600702 CST to 2022-03-01 13:28:36.045129 CST, in "20.444427" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2022-03-01 13:28:36.437003
NOTICE: starting server using "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_ctl  -w -t 90 -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' -l /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/logfile start"
NOTICE: start server finish at 2022-03-01 13:28:37.367954
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[2022-03-01 13:28:38] [NOTICE] kbha: node (ID: 1) rejoin success.

[2022-03-01 13:28:38] [NOTICE] [thread pid:30763] node "node243" (ID: 1) auto-recovery success
[2022-03-01 13:28:38] [INFO] [thread pid:30763] do_nodes_recovery thread ends. The pthread_t tid is 0x7fe7dbe15700
[2022-03-01 13:28:39] [INFO] SET synchronous TO "sync" on primary host 
[2022-03-01 13:28:39] [INFO] thread tid:0x7fe7dbe15700 is not running
[2022-03-01 13:28:39] [INFO] the recovery thread was exited, reset tid
[2022-03-01 13:28:39] [NOTICE] Some nodes reconnect, all standby nodes are OK now
[2022-03-01 13:28:41] [NOTICE] new standby "node243" (ID: 1) has connected
[2022-03-01 13:31:31] [INFO] monitoring primary node "node248" (ID: 2) in normal state
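The WAL locations in the sys_rewind output can be turned into byte offsets to see how far the rewind reached. The sketch below assumes KingbaseES keeps the PostgreSQL LSN notation, in which the two hexadecimal parts are the upper and lower 32 bits of a 64-bit position. The helper name `lsn_to_int` is hypothetical; the three LSN values are copied from the log above.

```python
def lsn_to_int(lsn):
    """Convert an LSN such as '0/12000A08' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


# Values copied from the sys_rewind lines in the log above.
diverged = lsn_to_int("0/12000A08")    # where the two servers diverged
checkpoint = lsn_to_int("0/11000058")  # last common checkpoint
rewind_end = lsn_to_int("0/12011F70")  # new minRecoveryPoint

# WAL distance from the last common checkpoint to the divergence point.
print(diverged - checkpoint)   # 16779696
# WAL the rejoined node must replay past the divergence point.
print(rewind_end - diverged)   # 71016
```

The first figure is slightly more than 16 MiB of WAL between the common checkpoint and the point of divergence; the second is about 69 KiB written on the new primary's timeline up to the recovery point.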

7) Check the database processes on the standby

8) The original primary rejoins the cluster as the new standby

[kingbase@node3 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
 1  | node243 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node248 | primary | * running |          | default  | 100      | 6        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

9) Query streaming replication information on the primary

test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 repmgr_slot_1 |        | physical  |        |          | f         | t      |      30928 | 1437 |              | 0/120130A8  | 
 repmgr_slot_2 |        | physical  |        |          | f         | f      |            |      |              |             | 
(2 rows)


test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time           
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 30928 |    16384 | esrep   | node243          | 192.168.7.243 |                 |       10817 | 2022-03-01 13:28:37.941077+08 |              | streaming | 0/120130A8 | 0/120130A8 | 0/120130A8 | 0/120130A8 |           |           |            |             1 | sync       | 2022-03-01 13:32:08.445325+08
(1 row)

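As shown above, after the original primary's system rebooted, the node was configured as a standby and its database service was started; the cluster then automatically rejoined it as the new standby.

To be continued.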

posted @ 2022-03-02 14:36  天涯客1224