KingbaseES V8R3 Cluster Operations Series -- Updating pool_nodes Node Status with pcp_attach_node
Case description:
pcp_attach_node is a cluster management tool shipped with kingbasecluster. It is typically used to recover a standby node by re-registering it with the cluster. For a node whose primary-standby streaming replication is healthy but whose status reported by show pool_nodes is abnormal, this tool is worth trying.
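Its command-line form follows the upstream pgpool-II pcp utilities on which kingbasecluster's tools are modeled; as a hedged synopsis (the -p pcp port option, defaulting to 9898 upstream, is an assumption here):
pcp_attach_node -h <cluster_host> [-p <pcp_port>] -U <pcp_user> <node_id>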
Applicable versions:
KingbaseES V8R3
Case architecture:
1. Case Symptoms
After the standby database service was started manually, sys_stat_replication showed that primary-standby streaming replication had returned to normal, but show pool_nodes still reported the node status as "down". The standby node was re-registered with the cluster using pcp_attach_node, after which its status became "up".
1) Primary-standby streaming replication status
test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 13458 |       10 | SYSTEM  | node1            | 192.168.7.248 |                 |       24328 | 2021-03-01 13:17:21.697235+08 |              | streaming | 0/1A004380    | 0/1A004380     | 0/1A004380     | 0/1A004380      |             0 | async
(1 row)
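On a narrow terminal the full view above wraps badly; selecting only the health-relevant columns (names taken from the output above) gives a quicker check, where a state of streaming confirms the standby is attached:
test=# select application_name, client_addr, state, replay_location, sync_state from sys_stat_replication;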
2) Node status from show pool_nodes
[kingbase@node2 bin]$ ./ksql -U SYSTEM -W 123456Abc test -p 9999
ksql (V008R003C002B0270)
Type "help" for help.
test=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.7.248 | 54322 | down   | 0.500000  | standby | 0          | false             | 0
 1       | 192.168.7.249 | 54322 | up     | 0.500000  | primary | 0          | true              | 0
(2 rows)
As shown above, streaming replication between primary and standby is normal, so the standby's database service status should be "up". Yet queries through port 9999 report the standby as "down", which does not match the actual node state. The likely cause is that the backend node status recorded in kingbasecluster's cache has not been refreshed; re-registering the node with pcp_attach_node makes the cluster pick up its latest status.
2. Registering the Standby Node with pcp_attach_node
1) Check the pcp connection user
[kingbase@node1 etc]$ cat pcp.conf
kingbase:e10adc3949ba59abbe56e057f20f883e
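Each line of pcp.conf has the form username:md5hash. The hash portion is a plain MD5 of the password (the value above is the MD5 of 123456), so it can be produced with a stock md5sum, for example:
[kingbase@node1 etc]$ echo -n "123456" | md5sum
e10adc3949ba59abbe56e057f20f883e  -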
2) Register the node with pcp_attach_node
[kingbase@node2 bin]$ ./pcp_attach_node -h 192.168.7.245 -U kingbase 0
Password:
pcp_attach_node -- Command Successful
'0' is the standby node ID (node_id).
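The pcp suite also provides the inverse operation, pcp_detach_node, which removes a node from the cluster's view; its flags are assumed to mirror pcp_attach_node, as in upstream pgpool-II:
[kingbase@node2 bin]$ ./pcp_detach_node -h 192.168.7.245 -U kingbase 0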
3) Check the standby's recovery.log
primary node/Im node status is changed, primary ip[192.168.7.249], recovery.conf NEED_CHANGE [0] (0 is need ), I,m status is [1] (1 is down), I will be in recovery.
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.7.249 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.7.248 | 54321 | down | 0.500000 | standby | 0 | false | 0
(2 rows)
if recover node up, let it down , for rewind
waiting for server to shut down.... done
server stopped
2022-03-01 16:18:36 set /home/kingbase/cluster/R3HA/db/data down now... already down , check again
wait kb stop 5 sec .......
2022-03-01 16:18:37 sys_rewind...
sys_rewind --target-data=/home/kingbase/cluster/R3HA/db/data --source-server="host=192.168.7.249 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST"
datadir_source = /home/kingbase/cluster/R3HA/db/data
rewinding from last common checkpoint at 2/125C2A0 on timeline 19
find last common checkpoint start time from 2022-03-01 16:18:37.147787 CST to 2022-03-01 16:18:37.178836 CST, in "0.031049" seconds.
reading source file list
reading target file list
reading WAL in target
Rewind datadir file from source
update the control file: minRecoveryPoint is '2/127FA40', minRecoveryPointTLI is '19', and database state is 'in archive recovery'
rewind start wal location 2/125C268 (file 000000130000000200000001), end wal location 2/127FA40 (file 000000130000000200000001). time from 2022-03-01 16:18:37.147787 CST to 2022-03-01 16:18:38.887227 CST, in "1.739440" seconds.
Done!
server started
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node249,)
(1 row)
2022-03-01 16:18:43 create the slot [slot_node249] success.
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node248,)
(1 row)
2022-03-01 16:18:43 create the slot [slot_node248] success.
2022-03-01 16:18:43 start up standby successful!
can not get the replication of myself
--- The log above shows that after pcp_attach_node was executed, sys_rewind was run to recover the node; sys_rewind completed successfully, but at this point the standby still showed a 'down' status.
3. Checking Node Status via show pool_nodes
test=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+---------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.7.248 | 54322 | waiting | 0.500000 | standby | 0 | false | 0
1 | 192.168.7.249 | 54322 | up | 0.500000 | primary | 0 | true | 0
(2 rows)
----- The status has been updated from "down" to "waiting". (In the upstream pgpool-II on which the kingbasecluster pcp tools are modeled, "waiting" indicates a node that has been attached but has not yet served a client connection; it switches to "up" once a connection is made.)
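A small shell loop can watch for the transition to "up" instead of re-querying by hand. This is a sketch that assumes the pcp password is supplied via a ~/.pcppass file as in upstream pgpool-II; otherwise each iteration prompts interactively:
# poll node 0 every 5 seconds until pcp_node_info reports it as up
until ./pcp_node_info -h 192.168.7.245 -U kingbase -v 0 | grep -q "Status Name: up"; do
    sleep 5
done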
4. Testing Streaming Replication Sync
Transaction on the primary:
test=# \c prod;
You are now connected to database "prod" as user "SYSTEM".
prod=# \d
List of relations
Schema | Name | Type | Owner
--------+-------------------------------+-------+--------
PUBLIC | pathman_cache_stats | view | SYSTEM
PUBLIC | pathman_concurrent_part_tasks | view | SYSTEM
PUBLIC | pathman_config | table | SYSTEM
PUBLIC | pathman_config_params | table | SYSTEM
PUBLIC | pathman_partition_list | view | SYSTEM
(5 rows)
prod=# create table t1 (id int ,name varchar(20));
CREATE TABLE
prod=# insert into t1 values (generate_series(1,1000),'tom'||generate_series(1,1000));
INSERT 0 1000
Query on the standby:
prod=# select count(*) from t1;
count
-------
1000
(1 row)
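To quantify any replay lag on the standby itself, the received and replayed WAL locations can be compared. The function names below follow the sys_ prefix convention seen in this version (sys_stat_replication, sys_rewind) and are an assumed mirror of PostgreSQL's pg_last_xlog_* functions:
prod=# select sys_last_xlog_receive_location(), sys_last_xlog_replay_location();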
Check the node status with show pool_nodes:
prod=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.7.248 | 54322 | up     | 0.500000  | standby | 0          | false             | 0
 1       | 192.168.7.249 | 54322 | up     | 0.500000  | primary | 3          | true              | 0
(2 rows)
---- The status is now "up".
5. Inspecting Cluster Node Information with the pcp Tools
1) View node information
[kingbase@node2 bin]$ ./pcp_node_info -h 192.168.7.245 -U kingbase -v 0
Password:
Hostname : 192.168.7.248
Port : 54322
Status : 2
Weight : 0.500000
Status Name: up
[kingbase@node2 bin]$ ./pcp_node_info -h 192.168.7.245 -U kingbase -v 1
Password:
Hostname : 192.168.7.249
Port : 54322
Status : 2
Weight : 0.500000
Status Name: up
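In the upstream pgpool-II convention these numeric Status values map as follows: 1 is up with no pooled connections yet, 2 is up with pooled connections, and 3 is down; this mapping is assumed to carry over to kingbasecluster. The companion pcp_node_count tool from the same suite reports how many backend nodes are configured:
[kingbase@node2 bin]$ ./pcp_node_count -h 192.168.7.245 -U kingbase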
2) View watchdog information
[kingbase@node2 bin]$ ./pcp_watchdog_info -h 192.168.7.245 -U kingbase -v
Password:
Watchdog Cluster Information
Total Nodes : 2
Remote Nodes : 1
Quorum state : QUORUM EXIST
Alive Remote Nodes : 1
VIP up on local node : NO
Master Node Name : 192.168.7.248:9999 Linux node1
Master Host Name : 192.168.7.248
Watchdog Node Information
Node Name : 192.168.7.249:9999 Linux node2
Host Name : 192.168.7.249
Delegate IP : 192.168.7.244
Kingbasecluster port : 9999
Watchdog port : 9000
Node priority : 1
Status : 7
Status Name : STANDBY
Node Name : 192.168.7.248:9999 Linux node1
Host Name : 192.168.7.248
Delegate IP : 192.168.7.244
Kingbasecluster port : 9999
Watchdog port : 9000
Node priority : 3
Status : 4
Status Name : MASTER
6. Summary
Used well, the pcp tools make it possible to resolve faults of this kind in the cluster quickly.
7. Field Case
Case description:
A standby node failed and was repaired with sys_basebackup. Afterwards, primary-standby streaming replication looked normal, but show pool_nodes still reported the node as abnormal.
Symptoms:
Resolution:
The node was re-registered with pcp_attach_node, and the cluster status returned to normal.
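For reference, a condensed sketch of the repair flow from this case, reusing the hosts, ports, and data directory shown in the earlier sections (the sys_basebackup flags are assumed to mirror the upstream pg_basebackup):
# 1. on the failed standby, rebuild the data directory from the primary
./sys_basebackup -h 192.168.7.249 -p 54321 -U SUPERMANAGER_V8ADMIN -D /home/kingbase/cluster/R3HA/db/data -X stream -P
# 2. start the standby, then confirm streaming via sys_stat_replication on the primary
# 3. re-attach the node so kingbasecluster refreshes its cached status
./pcp_attach_node -h 192.168.7.245 -U kingbase 0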