KingbaseES V8R3 Cluster Operations Case: Running pcp_attach_node for a Data Node
Case description:
In a KingbaseES V8R3 one-primary, two-standby cluster, one node is a pure data node. Streaming replication in the cluster is normal, but show pool_nodes reports the data node's status as 'down'. This case shows how to refresh and re-register the data node so that it returns to normal status.
Applicable version:
KingbaseES V8R3
Cluster architecture:
192.168.1.101 management node & data node (node1)
192.168.1.102 management node & data node (node2)
192.168.1.103 data node (node3)
I. Simulating a 'down' cluster status for the data node
1. Cluster node status before the failure
As shown below, both the cluster node status and the streaming replication status are normal:
TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.333333 | primary | 0 | false | 0
1 | 192.168.1.102 | 54321 | up | 0.333333 | standby | 0 | false | 0
2 | 192.168.1.103 | 54321 | up | 0.333333 | standby | 0 | true | 0
(3 rows)
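Note that show pool_nodes is served by the kingbasecluster listener rather than the database itself, so the query is issued against the cluster port. A connection of the following form can be used (a sketch: port 9999 is taken from the watchdog log later in this case, while the user and database names are assumptions that must match your environment):
[kingbase@node101 bin]$ ./ksql -h 192.168.1.101 -p 9999 -U SYSTEM TEST -c "show pool_nodes;"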
TEST=# select * from sys_stat_replication;
  PID  | USESYSID | USENAME | APPLICATION_NAME |  CLIENT_ADDR  | CLIENT_HOSTNAME | CLIENT_PORT |         BACKEND_START         | BACKEND_XMIN |   STATE   | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 21847 |       10 | SYSTEM  | node2            | 192.168.1.102 |                 |       10719 | 2023-08-04 15:10:52.719000+08 |              | streaming | 1/72000D10    | 1/72000D10     | 1/72000D10     | 1/72000D10      |             0 | async
  3378 |       10 | SYSTEM  | node3            | 192.168.1.103 |                 |       10398 | 2023-08-04 16:24:40.434048+08 |              | streaming | 1/72000D10    | 1/72000D10     | 1/72000D10     | 1/72000D10      |             0 | async
(2 rows)
2. Simulating the data node failure
1) Disable the auto-recovery service on the data node
[root@node103 ~]# cat /etc/cron.d/KINGBASECRON
###*/1 * * * * kingbase /home/kingbase/cluster/HAR3/db/bin/network_rewind.sh
# As shown above, the crond scheduled task has been commented out
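If the entry still needs to be commented out, a one-liner along these lines can do it (a sketch; it simply prefixes the schedule line with ### as shown above):
[root@node103 ~]# sed -i 's|^\*/1|###*/1|' /etc/cron.d/KINGBASECRON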
2) Stop the database service on the data node
[kingbase@node103 bin]$ ./sys_ctl stop -D ../data
......
server stopped
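Before moving on, the shutdown can be verified with a status check (a sketch mirroring the stop command above; the exact message wording may vary by build):
[kingbase@node103 bin]$ ./sys_ctl status -D ../data
# expect a "no server running"-style message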
3) Restart the kingbasecluster service on the management nodes (as root on both the primary and standby)
[root@node101 ~]# /home/kingbase/cluster/HAR3/kingbasecluster/bin/restartcluster.sh
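The same script is run on the other management node as well, and a process check can confirm the kingbasecluster processes are back (a sketch; process names may differ slightly by version):
[root@node102 ~]# /home/kingbase/cluster/HAR3/kingbasecluster/bin/restartcluster.sh
[root@node101 ~]# ps -ef | grep kingbasecluster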
4) Start the database service on the data node again
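The instance is started with sys_ctl, mirroring the stop command in step 2 (a sketch; exact options may vary by deployment):
[kingbase@node103 bin]$ ./sys_ctl start -D ../data
The process list then confirms the instance is up and the wal receiver is streaming: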
[kingbase@node103 bin]$ ps -ef |grep kingbase
root 1145 1 0 09:42 ? 00:00:00 /home/kingbase/cluster/HAR3/db/bin/es_server -f /home/kingbase/cluster/HAR3/db/share/es_server.conf
kingbase 7350 1 0 16:24 pts/1 00:00:00 /home/kingbase/cluster/HAR3/db/bin/kingbase -D ../data
kingbase 7351 7350 0 16:24 ? 00:00:00 kingbase: logger process
kingbase 7352 7350 0 16:24 ? 00:00:00 kingbase: startup process recovering 0000000F0000000100000072
kingbase 7356 7350 0 16:24 ? 00:00:00 kingbase: checkpointer process
kingbase 7357 7350 0 16:24 ? 00:00:00 kingbase: writer process
kingbase 7358 7350 0 16:24 ? 00:00:00 kingbase: stats collector process
kingbase 7359 7350 0 16:24 ? 00:00:00 kingbase: wal receiver process streaming 1/72000D10
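Optionally, one can confirm on the data node that the instance is running as a standby. Assuming the V8R3 convention of exposing pg_* functions under sys_* names (as sys_stat_replication above suggests), the check looks like this:
TEST=# select sys_is_in_recovery();
 sys_is_in_recovery
--------------------
 t
(1 row)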
5) Check the cluster node status and streaming replication status
As shown below, the data node's status is 'down' although streaming replication is normal. Because kingbasecluster was restarted while the data node's database service was down, it recorded the node's status as 'down'; even though the database service was started afterward, kingbasecluster does not update the node's status on its own.
TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.333333 | primary | 0 | false | 0
1 | 192.168.1.102 | 54321 | up | 0.333333 | standby | 0 | true | 0
2 | 192.168.1.103 | 54321 | down | 0.333333 | standby | 0 | false | 0
(3 rows)
TEST=# select * from sys_stat_replication;
  PID  | USESYSID | USENAME | APPLICATION_NAME |  CLIENT_ADDR  | CLIENT_HOSTNAME | CLIENT_PORT |         BACKEND_START         | BACKEND_XMIN |   STATE   | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 21847 |       10 | SYSTEM  | node2            | 192.168.1.102 |                 |       10719 | 2023-08-04 15:10:52.719000+08 |              | streaming | 1/72000D10    | 1/72000D10     | 1/72000D10     | 1/72000D10      |             0 | async
  3378 |       10 | SYSTEM  | node3            | 192.168.1.103 |                 |       10398 | 2023-08-04 16:24:40.434048+08 |              | streaming | 1/72000D10    | 1/72000D10     | 1/72000D10     | 1/72000D10      |             0 | async
(2 rows)
II. Run pcp_attach_node to update the node status
1. Run pcp_attach_node on the primary node
[kingbase@node101 bin]$ ./pcp_attach_node -n 2 --verbose
Password:
pcp_attach_node -- Command Successful
# Note: 2 is the node_id of the data node
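When invoked from a different host, connection options can be passed explicitly (a sketch: the PCP port 9898 and the user name are assumptions and must match the PCP configuration of your deployment):
[kingbase@node101 bin]$ ./pcp_attach_node -h 192.168.1.101 -p 9898 -U kingbase -n 2 --verbose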
2. Check the node status
As shown below, the data node's status has returned to normal:
TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.333333 | primary | 1 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.333333 | standby | 0 | false | 0
2 | 192.168.1.103 | 54321 | up | 0.333333 | standby | 0 | false | 0
(3 rows)
TEST=# select * from sys_stat_replication;
  PID  | USESYSID | USENAME | APPLICATION_NAME |  CLIENT_ADDR  | CLIENT_HOSTNAME | CLIENT_PORT |         BACKEND_START         | BACKEND_XMIN |   STATE   | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 21847 |       10 | SYSTEM  | node2            | 192.168.1.102 |                 |       10719 | 2023-08-04 15:10:52.719000+08 |              | streaming | 1/72000DF0    | 1/72000DF0     | 1/72000DF0     | 1/72000DF0      |             0 | async
  3378 |       10 | SYSTEM  | node3            | 192.168.1.103 |                 |       10398 | 2023-08-04 16:24:40.434048+08 |              | streaming | 1/72000DF0    | 1/72000DF0     | 1/72000DF0     | 1/72000DF0      |             0 | async
(2 rows)
3. Check the cluster.log on the primary
As the log below shows, after pcp_attach_node was executed, the primary initiated a failover (a FAILBACK_REQUEST) and recovered the data node:
2023-08-04 16:26:13: pid 22474: LOG: watchdog received the failover command from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG: watchdog is processing the failover command [FAILBACK_REQUEST] received from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG: forwarding the failover request [FAILBACK_REQUEST] to all alive nodes
2023-08-04 16:26:13: pid 22474: DETAIL: watchdog cluster currently has 1 connected remote nodes
2023-08-04 16:26:13: pid 22434: LOG: Kingbasecluster-II parent process has received failover request
2023-08-04 16:26:13: pid 22474: LOG: new IPC connection received
2023-08-04 16:26:13: pid 22474: LOG: received the failover command lock request from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG: local kingbasecluster node "192.168.1.101:9999 Linux node101" is requesting to become a lock holder for failover ID: 465
2023-08-04 16:26:13: pid 22474: LOG: local kingbasecluster node "192.168.1.101:9999 Linux node101" is the lock holder
2023-08-04 16:26:13: pid 22434: LOG: starting fail back. reconnect host 192.168.1.103(54321)
2023-08-04 16:26:13: pid 22434: LOG: Node 0 is not down (status: 2)
2023-08-04 16:26:13: pid 22474: LOG: received the failover command lock request from remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-08-04 16:26:13: pid 22474: LOG: remote kingbasecluster node "192.168.1.102:9999 Linux node102" is requesting to become a lock holder for failover ID: 465
2023-08-04 16:26:13: pid 22474: LOG: lock holder request denied to remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-08-04 16:26:13: pid 22474: DETAIL: local kingbasecluster node "192.168.1.101:9999 Linux node101" is already holding the locks
2023-08-04 16:26:13: pid 2797: LOG: PCP process with pid: 4532 exit with SUCCESS.
2023-08-04 16:26:13: pid 2797: LOG: PCP process with pid: 4532 exits with status 0
2023-08-04 16:26:13: pid 22474: LOG: new IPC connection received
2023-08-04 16:26:13: pid 22474: LOG: received the failover command lock request from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG: local kingbasecluster node "192.168.1.101:9999 Linux node101" is requesting to release [FAILBACK] lock for failover ID 465
2023-08-04 16:26:13: pid 22474: LOG: local kingbasecluster node "192.168.1.101:9999 Linux node101" has released the [FAILBACK] lock for failover ID 465
2023-08-04 16:26:13: pid 22434: LOG: Do not restart children because we are failing back node id 2 host: 192.168.1.103 port: 54321 and we are in streaming replication mode and not all backends were down
2023-08-04 16:26:13: pid 22474: LOG: new IPC connection received
2023-08-04 16:26:13: pid 22474: LOG: received the failover command lock request from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG: local kingbasecluster node "192.168.1.101:9999 Linux node101" is requesting to release [FAILOVER] lock for failover ID 465
2023-08-04 16:26:13: pid 22474: LOG: local kingbasecluster node "192.168.1.101:9999 Linux node101" has released the [FAILOVER] lock for failover ID 465
2023-08-04 16:26:13: pid 22434: LOG: find_primary_node_repeatedly: waiting for finding a primary node
2023-08-04 16:26:13: pid 22434: LOG: find_primary_node: checking backend no 0
2023-08-04 16:26:13: pid 22434: LOG: find_primary_node: primary node id is 0
2023-08-04 16:26:13: pid 22474: LOG: new IPC connection received
2023-08-04 16:26:13: pid 22474: LOG: received the failover command lock request from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG: local kingbasecluster node "192.168.1.101:9999 Linux node101" is requesting to release [FOLLOW MASTER] lock for failover ID 465
2023-08-04 16:26:13: pid 22474: LOG: local kingbasecluster node "192.168.1.101:9999 Linux node101" has released the [FOLLOW MASTER] lock for failover ID 465
2023-08-04 16:26:13: pid 22434: LOG: failover: set new primary node: 0
2023-08-04 16:26:13: pid 22434: LOG: failover: set new master node: 0
2023-08-04 16:26:13: pid 2799: LOG: worker process received restart request
2023-08-04 16:26:13: pid 22474: LOG: new IPC connection received
2023-08-04 16:26:13: pid 22474: LOG: received the failover command lock request from remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-08-04 16:26:13: pid 22474: LOG: remote kingbasecluster node "192.168.1.102:9999 Linux node102" is checking the status of [FAILBACK] lock for failover ID 465
2023-08-04 16:26:13: pid 22474: LOG: FAILBACK lock is currently FREE
2023-08-04 16:26:13: pid 22474: DETAIL: request was from remote kingbasecluster node "192.168.1.102:9999 Linux node102" and lock holder is local kingbasecluster node "192.168.1.101:9999 Linux node101"
2023-08-04 16:26:13: pid 22474: LOG: received the failover command lock request from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG: local kingbasecluster node "192.168.1.101:9999 Linux node101" is requesting to resign from a lock holder for failover ID 465
2023-08-04 16:26:13: pid 22474: LOG: local kingbasecluster node "192.168.1.101:9999 Linux node101" has resigned from the lock holder
2023-08-04 16:26:13: pid 22434: LOG: failback done. reconnect host 192.168.1.103(54321)
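To pick these entries out of the full log from the command line, a filter such as the following can be used (a sketch; the log path is an assumption based on this deployment's layout):
[kingbase@node101 ~]$ grep -iE "failback|lock holder" /home/kingbase/cluster/HAR3/kingbasecluster/log/cluster.log | tail -20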
III. Summary
In a KingbaseES V8R3 cluster, when a standby node's streaming replication is normal but show pool_nodes reports its status as 'down', pcp_attach_node can be executed to refresh the node's status and re-register it with the cluster; this applies to management nodes as well as pure data nodes.
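For quick reference, the whole recovery reduces to the checks and single command below (a consolidated sketch of the steps above; node_id 2 refers to this case's data node):
TEST=# show pool_nodes;                      -- node status shows 'down'
TEST=# select * from sys_stat_replication;   -- replication is 'streaming'
[kingbase@node101 bin]$ ./pcp_attach_node -n 2 --verbose
TEST=# show pool_nodes;                      -- node status is back to 'up'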