KingbaseES V8R6集群运维案例之---同城双中心failover切换

案例说明：
对于同城双中心的集群架构，当生产中心的主库数据库服务down时，默认是本中心的同步备库被选举为新的主库，只有当生产中心内所有节点都down时，灾备中心的和生产中心主库连接的同步备库节点提升为新主库。
适用版本：
KingbaseES V8R6

集群架构：

      ID | Name  | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
 1  | node1 | primary | * running |          | running | 2690 | no      | n/a
 2  | node2 | standby |   running | node1    | running | 2874 | no      | 1 second(s) ago
 3  | node3 | standby |   running | node1    | running | 4118 | no      | 0 second(s) ago
 注：
 1）生产中心：node1、node2
 2）灾备中心：node3

案例一：中心内failover切换

一、集群状态信息

[kingbase@node101 bin]$ ./repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location       | Priority | Timeline | LSN_Lag | Connection string                                                                                                               
----+-------+---------+-----------+----------+----------------+----------+----------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | production     | 100      | 5        |         | host=192.168.1.101 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running | node1    | production     | 100      | 5        | 0 bytes | host=192.168.1.102 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 3  | node3 | standby |   running | node1    | local_disaster | 100      | 5        | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 
 流复制状态信息：
test=# select application_name ,client_addr , state ,sent_lsn,write_lsn,flush_lsn,replay_lsn,sync_state
test-# from sys_stat_replication;
 application_name |  client_addr  |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | sync_state
------------------+---------------+-----------+------------+------------+------------+------------+------------
 node2            | 192.168.1.102 | streaming | 0/13002570 | 0/13002570 | 0/13002570 | 0/13002570 | sync
 node3            | 192.168.1.103 | streaming | 0/13002570 | 0/13002570 | 0/13002570 | 0/13002570 | potential

二、repmgr.conf相关配置

[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep failover
failover='automatic'
failover_need_server_alive='none'

[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep sync
sync_in_same_location='0'
synchronous='sync'

三、模拟主库数据库服务down
[kingbase@node101 bin]$ ./sys_ctl stop -D /data/kingbase/tptc/rh6/data/

四、查看备库hangr.log日志

1、生产中心备库

1）生产中心备库尝试连接主库数据库服务，超过阈值10次失败后，触发failover切换。
[2023-07-04 11:21:27] [INFO] sleeping up to 6 seconds until next reconnection attempt
[2023-07-04 11:21:33] [INFO] checking state of node "node1" (ID: 1), 10 of 10 attempts
[2023-07-04 11:21:33] [WARNING] unable to reconnect to node "node1" (ID: 1) after 10 attempts

2）断开所有备库的流复制
[2023-07-04 11:21:33] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2023-07-04 11:21:33] [WARNING] wal receiver not running
[2023-07-04 11:21:33] [NOTICE] WAL receiver disconnected on all sibling nodes
[2023-07-04 11:21:33] [INFO] WAL receiver disconnected on all 1 sibling nodes

3）生产中心备库在候选备库选举中获胜，将提升为新的主库
[2023-07-04 11:21:33] [INFO] 1 active sibling nodes registered
[2023-07-04 11:21:33] [INFO] 3 total nodes registered
[2023-07-04 11:21:33] [INFO] primary node  "node1" (ID: 1) and this node have the same location ("production")
[2023-07-04 11:21:33] [INFO] local node's last receive lsn: 0/100000A0
[2023-07-04 11:21:33] [INFO] checking state of sibling node "node3" (ID: 3)
[2023-07-04 11:21:33] [INFO] node "node3" (ID: 3) reports its upstream is node 1, last seen 55 second(s) ago
[2023-07-04 11:21:33] [INFO] standby node "node3" (ID: 3) last saw primary node 55 second(s) ago
[2023-07-04 11:21:33] [NOTICE] in quorum/sync/all mode, node "node3" (ID: 3) replication is not in streaming+quorum/sync stat, skipping
[2023-07-04 11:21:33] [INFO] visible nodes: 2; total nodes: 2; no nodes have seen the primary within the last 4 seconds
[2023-07-04 11:21:33] [NOTICE] promotion candidate is "node2" (ID: 2)
[2023-07-04 11:21:33] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2023-07-04 11:21:33] [NOTICE] this node is the winner, will now promote itself and inform other nodes

4）测试网络连通性（ping网关），执行promote提升备库为新主库。
[2023-07-04 11:21:33] [INFO] try to ping the trusted_servers "192.168.1.1" before execute promote_command
[2023-07-04 11:21:35] [NOTICE] PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.

--- 192.168.1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.198/0.201/0.204/0.003 ms

[2023-07-04 11:21:35] [NOTICE] successfully ping one or more of the trusted_servers "192.168.1.1"
[2023-07-04 11:21:35] [INFO] promote_command is:
  "/home/kingbase/cluster/tptc/rh6/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/tptc/rh6/kingbase/etc/repmgr.conf"
[WARNING] 1 sibling nodes found, but option "--siblings-follow" not specified
[DETAIL] these nodes will remain attached to the current primary:
  node3 (node ID: 3)
[NOTICE] promoting standby to primary
[DETAIL] promoting server "node2" (ID: 2) using sys_promote()
[NOTICE] waiting for promotion to complete, replay lsn: 0/100000A0
[2023-07-04 11:21:35] [NOTICE] try to stop old primary db (host: "192.168.1.101")
[INFO] SET synchronous TO "async" on primary host
[NOTICE] STANDBY PROMOTE successful
[DETAIL] server "node2" (ID: 2) was successfully promoted to primary

2、灾备中心备库

[2023-07-04 11:21:42] [NOTICE] [thread pid:4567] node (ID: 3; host: "192.168.1.103") will follow myself, ready to auto-recovery
[2023-07-04 11:21:42] [NOTICE] [thread pid:4567] Now, the primary host ip: 192.168.1.102
[2023-07-04 11:21:43] [INFO] [thread pid:4567] ES connection to host "192.168.1.103" succeeded, ready to do auto-recovery
[2023-07-04 11:21:43] [INFO] node "node3" (ID: 3, HOST: 192.168.1.103) auto-recovery: STANDBY FOLLOW
[2023-07-04 11:21:45] [NOTICE] executing repmgr command "/home/kingbase/cluster/tptc/rh6/kingbase/bin/repmgr standby follow -f /home/kingbase/cluster/tptc/rh6/kingbase/etc/repmgr.conf -W --upstream-node-id=2"

---如上所示，对同城灾备中心备库执行repmgr standby follow。

3、灾备中心sys_log日志

如下所示，备库在failover过程中，连接新的主库（upstream）节点时有个被recovery，数据库服务短暂重启的过程。

2023-07-04 11:21:50.579 CST,,,10041,,64a3904e.2739,1,,2023-07-04 11:21:50 CST,,0,LOG,00000,"database system was shut down in recovery at 2023-07-04 11:21:50 CST",,,,,,,,,""
2023-07-04 11:21:50.579 CST,,,10041,,64a3904e.2739,2,,2023-07-04 11:21:50 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2023-07-04 11:21:50.586 CST,,,10041,,64a3904e.2739,3,,2023-07-04 11:21:50 CST,1/0,0,LOG,00000,"consistent recovery state reached at 0/100000A0",,,,,,,,,""
2023-07-04 11:21:50.586 CST,,,10041,,64a3904e.2739,4,,2023-07-04 11:21:50 CST,1/0,0,LOG,00000,"invalid record length at 0/100000A0: wanted 24, got 0",,,,,,,,,""
2023-07-04 11:21:50.586 CST,,,10028,,64a3904e.272c,2,,2023-07-04 11:21:50 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,,""

五、切换后集群状态

[kingbase@node102 bin]$ ./repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location       | Priority | Timeline | LSN_Lag | Connection string                                                                                                               
----+-------+---------+-----------+----------+----------------+----------+----------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | production     | 100      |          |         | host=192.168.1.101 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | primary | * running |          | production     | 100      | 6        |         | host=192.168.1.102 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
 3  | node3 | standby |   running | node2    | local_disaster | 100      | 5        | 0 bytes | host=192.168.1.103 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000

[WARNING] following issues were detected
  - unable to connect to node "node1" (ID: 1)

[HINT] execute with --verbose option to see connection error messages

案例二：中心之间failover切换

1、生产中心集群状态
如下图所示，生产中心node1节点数据服务down，现在node2为集群主节点。

2、repmgr.conf相关配置

[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep failover
failover='automatic'
failover_need_server_alive='none'

[kingbase@node101 bin]$ cat ../etc/repmgr.conf |grep sync
sync_in_same_location='0'
synchronous='sync'

3、模拟node2节点数据库服务down
[kingbase@node102 bin]$ ./sys_ctl stop -D /data/kingbase/tptc/rh6/data/

4、查看灾备备库hamgr.log

2023-06-27 14:48:03] [INFO] checking state of node "node1" (ID: 1), 10 of 10 attempts
[2023-06-27 14:48:03] [WARNING] unable to reconnect to node "node1" (ID: 1) after 10 attempts
[2023-06-27 14:48:03] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2023-06-27 14:48:03] [WARNING] wal receiver not running
[2023-06-27 14:48:03] [NOTICE] WAL receiver disconnected on all sibling nodes
[2023-06-27 14:48:03] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2023-06-27 14:48:03] [INFO] 0 active sibling nodes registered
[2023-06-27 14:48:03] [INFO] 3 total nodes registered
[2023-06-27 14:48:03] [INFO] primary node "node1" (ID: 1) has location "production", this node's location is "local_disaster"
[2023-06-27 14:48:03] [INFO] no other sibling nodes and servers alive - we win by default
[2023-06-27 14:48:03] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2023-06-27 14:48:03] [NOTICE] this node is the only available candidate and will now promote itself
[2023-06-27 14:48:03] [INFO] try to ping the trusted_servers "192.168.1.1" before execute promote_command
[2023-06-27 14:48:05] [NOTICE] PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.

--- 192.168.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.207/0.229/0.249/0.021 ms

[2023-06-27 14:48:05] [NOTICE] successfully ping one or more of the trusted_servers "192.168.1.1"
[2023-06-27 14:48:05] [INFO] promote_command is:
  "/home/kingbase/cluster/tptc/rh6/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/tptc/rh6/kingbase/etc/repmgr.conf"
[NOTICE] promoting standby to primary
[DETAIL] promoting server "node3" (ID: 3) using sys_promote()
[NOTICE] waiting for promotion to complete, replay lsn: 0/9001118
[INFO] SET synchronous TO "async" on primary host
[NOTICE] STANDBY PROMOTE successful
[DETAIL] server "node3" (ID: 3) was successfully promoted to primary

---如上所示，此灾备中心备库节点是集群唯一节点，生产中心所有节点down后，提升为新的主库。

posted @ 2023-07-04 16:53 天涯客1224 阅读(1) 评论(0) 编辑收藏举报

刷新页面返回顶部

天涯客1224

KingbaseES V8R6集群运维案例之---同城双中心failover切换

公告