KingbaseES V8R3 Cluster Operations Case: Running pcp_attach_node for a Data Node

Case description:
In a KingbaseES V8R3 one-primary, two-standby cluster, one node is a pure data node. Streaming replication in the cluster is normal, but show pool_nodes reports the data node's status as 'down'. This case shows how to refresh and re-register that data node so it returns to a normal state.

Applicable versions:
KingbaseES V8R3

Cluster architecture:

     192.168.1.101   management node & data node (node1)
     192.168.1.102   management node & data node (node2)
     192.168.1.103   data node (node3)

I. Simulating a data node whose cluster status is 'down'

1. Cluster node status before the failure

As shown below, both the cluster node status and the streaming replication status are normal:
TEST=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.333333  | primary | 0          | false             | 0
 1       | 192.168.1.102 | 54321 | up     | 0.333333  | standby | 0          | false             | 0
 2       | 192.168.1.103 | 54321 | up     | 0.333333  | standby | 0          | true              | 0
(3 rows)
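Note: show pool_nodes is a kingbasecluster management command, so the session above must connect through the kingbasecluster listening port rather than directly to the database port 54321. A minimal sketch of such a connection, assuming the ksql client and the kingbasecluster port 9999 that appears in this cluster's logs (the user name and database are examples from this environment):

[kingbase@node101 bin]$ ./ksql -h 192.168.1.101 -p 9999 -U system TEST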

TEST=# select * from sys_stat_replication;
  PID  | USESYSID | USENAME | APPLICATION_NAME |  CLIENT_ADDR  | CLIENT_HOSTNAME | CLIENT_PORT |         BACKEND_START         | BACKEND_XMIN |   STATE   | SENT_LOCATION | WRITE_LOC
ATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------
------+----------------+-----------------+---------------+------------
 21847 |       10 | SYSTEM  | node2            | 192.168.1.102 |                 |       10719 | 2023-08-04 15:10:52.719000+08 |              | streaming | 1/72000D10    | 1/72000D1
0     | 1/72000D10     | 1/72000D10      |             0 | async
  3378 |       10 | SYSTEM  | node3            | 192.168.1.103 |                 |       10398 | 2023-08-04 16:24:40.434048+08 |              | streaming | 1/72000D10    | 1/72000D1
0     | 1/72000D10     | 1/72000D10      |             0 | async
(2 rows)

2. Simulating the data node failure
1) Disable the auto-recovery task on the data node

[root@node103 ~]# cat /etc/cron.d/KINGBASECRON
###*/1 * * * * kingbase  /home/kingbase/cluster/HAR3/db/bin/network_rewind.sh
# As shown above, the crond scheduled task has been commented out
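The entry above drives the auto-recovery script network_rewind.sh once a minute; commenting it out keeps kingbasecluster from recovering the node automatically during the test. A one-line sketch for commenting it out with sed (editing the file manually with vi achieves the same; run as root on the data node):

[root@node103 ~]# sed -i 's|^\*/1|###&|' /etc/cron.d/KINGBASECRON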

2) Stop the database service on the data node

[kingbase@node103 bin]$ ./sys_ctl stop -D ../data
......
server stopped
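Before moving on, it is worth confirming that the instance is really stopped. Assuming sys_ctl follows the usual pg_ctl-style behavior, its status sub-command should report that no server is running (and exit non-zero) once the instance is down:

[kingbase@node103 bin]$ ./sys_ctl status -D ../data
# expected: a "no server running" style message while the instance is stopped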

3) Restart the kingbasecluster service on the management nodes (as the root user on both the primary and standby management nodes)
[root@node101 ~]# /home/kingbase/cluster/HAR3/kingbasecluster/bin/restartcluster.sh
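The restart script has to be run on both management nodes (node101 and node102). A generic ps check, not specific to this release, confirms that the kingbasecluster processes came back after the restart:

[root@node101 ~]# ps -ef | grep kingbasecluster | grep -v grep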

4) Start the database service on the data node again
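The instance is started again with the counterpart of the stop command used in step 2) (a minimal sketch with default options; append any log-file redirection used at your site), and the background processes are then checked with ps as shown below:

[kingbase@node103 bin]$ ./sys_ctl start -D ../data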

[kingbase@node103 bin]$  ps -ef |grep kingbase
root      1145     1  0 09:42 ?        00:00:00 /home/kingbase/cluster/HAR3/db/bin/es_server -f /home/kingbase/cluster/HAR3/db/share/es_server.conf
kingbase  7350     1  0 16:24 pts/1    00:00:00 /home/kingbase/cluster/HAR3/db/bin/kingbase -D ../data
kingbase  7351  7350  0 16:24 ?        00:00:00 kingbase: logger process
kingbase  7352  7350  0 16:24 ?        00:00:00 kingbase: startup process   recovering 0000000F0000000100000072
kingbase  7356  7350  0 16:24 ?        00:00:00 kingbase: checkpointer process
kingbase  7357  7350  0 16:24 ?        00:00:00 kingbase: writer process
kingbase  7358  7350  0 16:24 ?        00:00:00 kingbase: stats collector process
kingbase  7359  7350  0 16:24 ?        00:00:00 kingbase: wal receiver process   streaming 1/72000D10

5) Check the cluster node status and streaming replication status
As shown below, the data node's status is 'down' even though streaming replication is normal. Because kingbasecluster was restarted while the data node's database service was down, it registered the node as 'down'; starting the database service afterwards does not make kingbasecluster refresh the node's status on its own.

TEST=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.333333  | primary | 0          | false             | 0
 1       | 192.168.1.102 | 54321 | up     | 0.333333  | standby | 0          | true              | 0
 2       | 192.168.1.103 | 54321 | down   | 0.333333  | standby | 0          | false             | 0
(3 rows)

TEST=# select * from sys_stat_replication;
  PID  | USESYSID | USENAME | APPLICATION_NAME |  CLIENT_ADDR  | CLIENT_HOSTNAME | CLIENT_PORT |         BACKEND_START         | BACKEND_XMIN |   STATE   | SENT_LOCATION | WRITE_LOC
ATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------
------+----------------+-----------------+---------------+------------
 21847 |       10 | SYSTEM  | node2            | 192.168.1.102 |                 |       10719 | 2023-08-04 15:10:52.719000+08 |              | streaming | 1/72000D10    | 1/72000D1
0     | 1/72000D10     | 1/72000D10      |             0 | async
  3378 |       10 | SYSTEM  | node3            | 192.168.1.103 |                 |       10398 | 2023-08-04 16:24:40.434048+08 |              | streaming | 1/72000D10    | 1/72000D1
0     | 1/72000D10     | 1/72000D10      |             0 | async
(2 rows)

(Figure omitted: the data node's status is shown as 'down'.)

II. Run pcp_attach_node to update the node status

1. Run pcp_attach_node on the primary management node

[kingbase@node101 bin]$ ./pcp_attach_node -n 2 --verbose
Password:
pcp_attach_node -- Command Successful
# Note: 2 is the data node's id (the node_id column in show pool_nodes)
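To check a single backend's status from the command line before and after the attach, the pgpool-compatible pcp_node_info utility can be used, assuming it ships in the same bin directory as pcp_attach_node in this build; 2 is again the data node's id:

[kingbase@node101 bin]$ ./pcp_node_info -n 2 --verbose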

2. Check the node status information
As shown below, the data node's status information has returned to normal:

TEST=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.333333  | primary | 1          | true              | 0
 1       | 192.168.1.102 | 54321 | up     | 0.333333  | standby | 0          | false             | 0
 2       | 192.168.1.103 | 54321 | up     | 0.333333  | standby | 0          | false             | 0
(3 rows)

TEST=# select * from sys_stat_replication;
  PID  | USESYSID | USENAME | APPLICATION_NAME |  CLIENT_ADDR  | CLIENT_HOSTNAME | CLIENT_PORT |         BACKEND_START         | BACKEND_XMIN |   STATE   | SENT_LOCATION | WRITE_LOC
ATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------
------+----------------+-----------------+---------------+------------
 21847 |       10 | SYSTEM  | node2            | 192.168.1.102 |                 |       10719 | 2023-08-04 15:10:52.719000+08 |              | streaming | 1/72000DF0    | 1/72000DF
0     | 1/72000DF0     | 1/72000DF0      |             0 | async
  3378 |       10 | SYSTEM  | node3            | 192.168.1.103 |                 |       10398 | 2023-08-04 16:24:40.434048+08 |              | streaming | 1/72000DF0    | 1/72000DF
0     | 1/72000DF0     | 1/72000DF0      |             0 | async
(2 rows)

3. Check cluster.log on the primary management node

As shown below, after pcp_attach_node was executed, the primary management node initiated a failover (failback) sequence that recovered and re-attached the data node:

2023-08-04 16:26:13: pid 22474: LOG:  watchdog received the failover command from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG:  watchdog is processing the failover command [FAILBACK_REQUEST] received from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG:  forwarding the failover request [FAILBACK_REQUEST] to all alive nodes
2023-08-04 16:26:13: pid 22474: DETAIL:  watchdog cluster currently has 1 connected remote nodes
2023-08-04 16:26:13: pid 22434: LOG:  Kingbasecluster-II parent process has received failover request
2023-08-04 16:26:13: pid 22474: LOG:  new IPC connection received
2023-08-04 16:26:13: pid 22474: LOG:  received the failover command lock request from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG:  local kingbasecluster node "192.168.1.101:9999 Linux node101" is requesting to become a lock holder for failover ID: 465
2023-08-04 16:26:13: pid 22474: LOG:  local kingbasecluster node "192.168.1.101:9999 Linux node101" is the lock holder
2023-08-04 16:26:13: pid 22434: LOG:  starting fail back. reconnect host 192.168.1.103(54321)
2023-08-04 16:26:13: pid 22434: LOG:  Node 0 is not down (status: 2)
2023-08-04 16:26:13: pid 22474: LOG:  received the failover command lock request from remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-08-04 16:26:13: pid 22474: LOG:  remote kingbasecluster node "192.168.1.102:9999 Linux node102" is requesting to become a lock holder for failover ID: 465
2023-08-04 16:26:13: pid 22474: LOG:  lock holder request denied to remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-08-04 16:26:13: pid 22474: DETAIL:  local kingbasecluster node "192.168.1.101:9999 Linux node101" is already holding the locks
2023-08-04 16:26:13: pid 2797: LOG:  PCP process with pid: 4532 exit with SUCCESS.
2023-08-04 16:26:13: pid 2797: LOG:  PCP process with pid: 4532 exits with status 0
2023-08-04 16:26:13: pid 22474: LOG:  new IPC connection received
2023-08-04 16:26:13: pid 22474: LOG:  received the failover command lock request from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG:  local kingbasecluster node "192.168.1.101:9999 Linux node101" is requesting to release [FAILBACK] lock for failover ID 465
2023-08-04 16:26:13: pid 22474: LOG:  local kingbasecluster node "192.168.1.101:9999 Linux node101" has released the [FAILBACK] lock for failover ID 465
2023-08-04 16:26:13: pid 22434: LOG:  Do not restart children because we are failing back node id 2 host: 192.168.1.103 port: 54321 and we are in streaming replication mode and not all backends were down
2023-08-04 16:26:13: pid 22474: LOG:  new IPC connection received
2023-08-04 16:26:13: pid 22474: LOG:  received the failover command lock request from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG:  local kingbasecluster node "192.168.1.101:9999 Linux node101" is requesting to release [FAILOVER] lock for failover ID 465
2023-08-04 16:26:13: pid 22474: LOG:  local kingbasecluster node "192.168.1.101:9999 Linux node101" has released the [FAILOVER] lock for failover ID 465
2023-08-04 16:26:13: pid 22434: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2023-08-04 16:26:13: pid 22434: LOG:  find_primary_node: checking backend no 0
2023-08-04 16:26:13: pid 22434: LOG:  find_primary_node: primary node id is 0
2023-08-04 16:26:13: pid 22474: LOG:  new IPC connection received
2023-08-04 16:26:13: pid 22474: LOG:  received the failover command lock request from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG:  local kingbasecluster node "192.168.1.101:9999 Linux node101" is requesting to release [FOLLOW MASTER] lock for failover ID 465
2023-08-04 16:26:13: pid 22474: LOG:  local kingbasecluster node "192.168.1.101:9999 Linux node101" has released the [FOLLOW MASTER] lock for failover ID 465
2023-08-04 16:26:13: pid 22434: LOG:  failover: set new primary node: 0
2023-08-04 16:26:13: pid 22434: LOG:  failover: set new master node: 0
2023-08-04 16:26:13: pid 2799: LOG:  worker process received restart request
2023-08-04 16:26:13: pid 22474: LOG:  new IPC connection received
2023-08-04 16:26:13: pid 22474: LOG:  received the failover command lock request from remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-08-04 16:26:13: pid 22474: LOG:  remote kingbasecluster node "192.168.1.102:9999 Linux node102" is checking the status of [FAILBACK] lock for failover ID 465
2023-08-04 16:26:13: pid 22474: LOG:  FAILBACK lock is currently FREE
2023-08-04 16:26:13: pid 22474: DETAIL:  request was from remote kingbasecluster node "192.168.1.102:9999 Linux node102" and lock holder is local kingbasecluster node "192.168.1.101:9999 Linux node101"
2023-08-04 16:26:13: pid 22474: LOG:  received the failover command lock request from local kingbasecluster on IPC interface
2023-08-04 16:26:13: pid 22474: LOG:  local kingbasecluster node "192.168.1.101:9999 Linux node101" is requesting to resign from a lock holder for failover ID 465
2023-08-04 16:26:13: pid 22474: LOG:  local kingbasecluster node "192.168.1.101:9999 Linux node101" has resigned from the lock holder
2023-08-04 16:26:13: pid 22434: LOG:  failback done. reconnect host 192.168.1.103(54321)
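To pull just the failback-related messages out of a busy log, a simple grep is enough; the log path below is illustrative and should be replaced with the actual cluster.log location of your installation:

[kingbase@node101 ~]$ grep -iE 'fail ?back' /home/kingbase/cluster/HAR3/kingbasecluster/log/cluster.log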

III. Summary
In a KingbaseES V8R3 cluster, when a standby node's streaming replication is normal but show pool_nodes reports its status as 'down', pcp_attach_node can be executed to update the node's status and re-register it with the cluster. This applies both to management nodes and to pure data nodes.
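For reference, when pcp_attach_node is issued from somewhere other than the local management node, the standard pgpool-style connection options are assumed to apply in this build; the pcp host, port, and user below are placeholders to be replaced with site-specific values:

[kingbase@node101 bin]$ ./pcp_attach_node -h 192.168.1.101 -p 9898 -U kingbase -n 2 --verbose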
