KingbaseES V8R3 Cluster Operations Case: Removing a Data Node Online

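Case description:
A KingbaseES V8R3 cluster uses a one-primary, multiple-standby architecture. The cluster has two management nodes (master and standby), and every node, including the management nodes, can serve as a data node. A data node that is not a management node can be removed online. A management node cannot be removed online; removing one requires redeploying the cluster. This case walks through removing a data node (a non-management node) from a one-primary, two-standby cluster.

Host environment: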

[kingbase@node3 bin]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.7.248   node1    # cluster management node & data node
192.168.7.249   node2    # data node
192.168.7.243   node3    # cluster management node & data node

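Cluster architecture: one primary (192.168.7.243) and two standbys (192.168.7.248 and 192.168.7.249).

Applicable version: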

KingbaseES V8R3

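I. Check the cluster status

=Note: Before removing a data node, make sure the cluster is healthy. Check both the cluster node status and the primary/standby streaming replication status.=

# Cluster node status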

TEST=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay 
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.7.243 | 54321 | up     | 0.333333  | primary | 0          | false             | 0
 1       | 192.168.7.248 | 54321 | up     | 0.333333  | standby | 0          | true              | 0
 2       | 192.168.7.249 | 54321 | up     | 0.333333  | standby | 0          | false             | 0
(3 rows)

# Primary/standby streaming replication status
TEST=# select * from sys_stat_replication;
  PID  | USESYSID | USENAME | APPLICATION_NAME |  CLIENT_ADDR  | CLIENT_HOSTNAME | CLIENT_PORT |         BACKEND_START         | BACKEND_XMIN |   STATE   | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE 
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 12316 |       10 | SYSTEM  | node249          | 192.168.7.249 |                 |       39337 | 2021-03-01 12:59:29.003870+08 |              | streaming | 0/50001E8     | 0/50001E8      | 0/50001E8      | 0/50001E8       |             3 | potential
 15429 |       10 | SYSTEM  | node248          | 192.168.7.248 |                 |       35885 | 2021-03-01 12:59:38.317605+08 |              | streaming | 0/50001E8     | 0/50001E8      | 0/50001E8      | 0/50001E8       |             2 | sync
(2 rows)
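
The node-status half of this pre-check can be scripted. The following is a minimal sketch, not part of the KingbaseES tooling: the function name `all_nodes_up` is hypothetical, and it assumes the `show pool_nodes;` output has been captured as text in the column layout shown above, with the status in the fourth column.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with KingbaseES).
# Reads "show pool_nodes;" text output on stdin and succeeds only if there is
# at least one node row and every node row reports status "up".
all_nodes_up() {
    awk -F'|' '
        # Node rows are the lines whose first column is a numeric node_id.
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            rows++
            status = $4                 # 4th column: status
            gsub(/[ \t]/, "", status)   # strip column padding
            if (status != "up") bad++
        }
        END { if (rows > 0 && bad == 0) exit 0; exit 1 }
    '
}

# Demo with the node rows captured in this section (trailing columns omitted).
if all_nodes_up <<'EOF'
 node_id |   hostname    | port  | status | lb_weight |  role
---------+---------------+-------+--------+-----------+---------
 0       | 192.168.7.243 | 54321 | up     | 0.333333  | primary
 1       | 192.168.7.248 | 54321 | up     | 0.333333  | standby
 2       | 192.168.7.249 | 54321 | up     | 0.333333  | standby
(3 rows)
EOF
then
    echo "all nodes up"
else
    echo "at least one node is not up"
fi
```

The function only inspects the status column; the streaming replication state (STATE and SYNC_STATE in sys_stat_replication) still needs to be reviewed separately.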

II. Remove the data node from the cluster

1. Disable the cron job on the data node (the network_rewind.sh scheduled task)

[kingbase@node2 bin]$ cat /etc/cron.d/KINGBASECRON 
#*/1 * * * * kingbase . /etc/profile;/home/kingbase/cluster/R6HA/KHA/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf >> /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../kbha.log 2>&1
#*/1 * * * * kingbase  /home/kingbase/cluster/kha/db/bin/network_rewind.sh
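
Commenting out the entry can also be done non-interactively. This is a minimal sketch assuming the cron file layout shown above; `comment_out_job` is a hypothetical helper name. It works as a filter, so the result can be reviewed before it replaces /etc/cron.d/KINGBASECRON, which normally requires root privileges.

```shell
#!/bin/sh
# Hypothetical helper: reads a cron file on stdin and writes it to stdout,
# prefixing "#" to every line that contains the literal text $1 and is not
# already commented out. All other lines pass through unchanged.
comment_out_job() {
    awk -v pat="$1" '
        index($0, pat) > 0 && $0 !~ /^[ \t]*#/ { print "#" $0; next }
        { print }
    '
}

# Demo on an active network_rewind.sh entry.
printf '%s\n' '*/1 * * * * kingbase  /home/kingbase/cluster/kha/db/bin/network_rewind.sh' |
    comment_out_job network_rewind.sh
```

The demo prints the entry with a leading `#`, matching the commented line in the listing above. Running the filter a second time leaves the line unchanged.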

2. Stop the database service on the data node

[kingbase@node2 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped

3. Drop the replication slot on the primary node

TEST=# select * from sys_replication_slots;
  SLOT_NAME   | PLUGIN | SLOT_TYPE | DATOID | DATABASE | ACTIVE | ACTIVE_PID | XMIN | CATALOG_XMIN | RESTART_LSN | CONFIRMED_FLUSH_LSN 
--------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
 slot_node243 |        | physical  |        |          | f      |            |      |              |             | 
 slot_node248 |        | physical  |        |          | t      |      29330 | 2076 |              | 0/70000D0   | 
 slot_node249 |        | physical  |        |          | f      |            | 2076 |              | 0/60001B0   | 
(3 rows)


TEST=# select SYS_DROP_REPLICATION_SLOT('slot_node249');
 SYS_DROP_REPLICATION_SLOT 
---------------------------
 
(1 row)

TEST=# select * from sys_replication_slots;
  SLOT_NAME   | PLUGIN | SLOT_TYPE | DATOID | DATABASE | ACTIVE | ACTIVE_PID | XMIN | CATALOG_XMIN | RESTART_LSN | CONFIRMED_FLUSH_LSN 
--------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
 slot_node243 |        | physical  |        |          | f      |            |      |              |             | 
 slot_node248 |        | physical  |        |          | t      |      29330 | 2076 |              | 0/70000D0   | 
(2 rows)
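
Before dropping a slot it is worth confirming that it is no longer in use. In PostgreSQL, dropping a replication slot that is still active fails with an error; the sketch below assumes KingbaseES V8R3 behaves the same way and simply checks that the ACTIVE column is `f`. The helper name `slot_is_inactive` is hypothetical, and the column positions are taken from the sys_replication_slots output shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads "select * from sys_replication_slots;" text
# output on stdin and succeeds only if a row for slot $1 exists and its
# ACTIVE column (6th column) is "f".
slot_is_inactive() {
    awk -F'|' -v name="$1" '
        {
            slot = $1; active = $6
            gsub(/[ \t]/, "", slot)
            gsub(/[ \t]/, "", active)
            if (slot == name) { found = 1; inactive = (active == "f") }
        }
        END { if (found && inactive) exit 0; exit 1 }
    '
}

# Demo with the slot rows captured before the drop.
if slot_is_inactive slot_node249 <<'EOF'
 slot_node243 |        | physical  |        |          | f      |            |      |              |             | 
 slot_node248 |        | physical  |        |          | t      |      29330 | 2076 |              | 0/70000D0   | 
 slot_node249 |        | physical  |        |          | f      |            | 2076 |              | 0/60001B0   | 
EOF
then
    echo "slot_node249 is inactive"
else
    echo "slot_node249 is active or missing"
fi
```

In this walkthrough slot_node249 is inactive because the database on node2 was stopped in the previous step.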

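4. Edit the configuration files (on all management nodes)

1) HAmodule.conf (present under both db/etc and kingbasecluster/etc)

=This file lists the hostnames and IP addresses of all cluster nodes, as shown below. Remove the entries for the node being deleted.=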

[kingbase@node3 etc]$ cat HAmodule.conf |grep -i all
#IP of all nodes in the cluster.example:KB_ALL_IP="(192.168.28.128 192.168.28.129 )"
KB_ALL_IP=(192.168.7.243 192.168.7.248 192.168.7.249 )
#recoord the names of all nodes.example:ALL_NODE_NAME=1 (node1 node2 node3)
ALL_NODE_NAME=(node243 node248 node249)

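=After the edit, the hostname and IP address of the node being removed (node249, 192.168.7.249) no longer appear in the two settings above.=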
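
The edit to the two array settings can be expressed as a small filter. This is a sketch under stated assumptions, not a KingbaseES tool: `remove_node` is a hypothetical name, and it assumes each setting sits on a single line in the `NAME=(item item ...)` form shown above. Anything after the closing parenthesis on those two lines is dropped.

```shell
#!/bin/sh
# Hypothetical helper: reads HAmodule.conf on stdin and writes it to stdout
# with IP $1 removed from KB_ALL_IP and node name $2 removed from
# ALL_NODE_NAME. Commented lines and all other settings pass through unchanged.
remove_node() {
    awk -v ip="$1" -v name="$2" '
        # Rebuild "KEY=(a b c)" without the item equal to word.
        function drop(line, word,    key, body, n, items, i, out) {
            key  = substr(line, 1, index(line, "("))   # e.g. "KB_ALL_IP=("
            body = substr(line, index(line, "(") + 1)
            sub(/\).*$/, "", body)                     # keep text inside ( )
            n = split(body, items, " ")
            out = ""
            for (i = 1; i <= n; i++)
                if (items[i] != word)
                    out = out (out == "" ? "" : " ") items[i]
            return key out ")"
        }
        /^KB_ALL_IP=\(/     { print drop($0, ip);   next }
        /^ALL_NODE_NAME=\(/ { print drop($0, name); next }
        { print }
    '
}

# Demo with the two settings from this section.
printf '%s\n' \
    'KB_ALL_IP=(192.168.7.243 192.168.7.248 192.168.7.249 )' \
    'ALL_NODE_NAME=(node243 node248 node249)' |
    remove_node 192.168.7.249 node249
```

The demo prints `KB_ALL_IP=(192.168.7.243 192.168.7.248)` and `ALL_NODE_NAME=(node243 node248)`. The same edit has to be applied to both copies of HAmodule.conf on every management node.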

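2) Edit the kingbasecluster configuration file (kingbasecluster.conf)

=As shown below, comment out the configuration entries of the node being removed.=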

[kingbase@node1 etc]$ tail kingbasecluster.conf
backend_hostname1='192.168.7.248'
backend_port1=54321
backend_weight1=1
backend_data_directory1='/home/kingbase/cluster/kha/db/data'

# node249 entries commented out
#backend_hostname2='192.168.7.249'
#backend_port2=54321
#backend_weight2=1
#backend_data_directory2='/home/kingbase/cluster/kha/db/data'
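
Commenting out one backend's entries can be sketched as a filter as well. `comment_out_backend` is a hypothetical helper name, and the sketch assumes the `backend_<setting><N>=` naming shown above, with each setting at the start of its line and N given as a plain number.

```shell
#!/bin/sh
# Hypothetical helper: reads kingbasecluster.conf on stdin and writes it to
# stdout, prefixing "#" to every active backend_<setting>N line, where N is $1.
# Lines for other backends (for example N=1 or N=12 when $1 is 2) are untouched.
comment_out_backend() {
    sed "s/^\(backend_[a-z_]*${1}[[:space:]]*=\)/#\1/"
}

# Demo with the backend entries from this section.
printf '%s\n' \
    "backend_hostname1='192.168.7.248'" \
    "backend_port1=54321" \
    "backend_hostname2='192.168.7.249'" \
    "backend_port2=54321" \
    "backend_weight2=1" \
    "backend_data_directory2='/home/kingbase/cluster/kha/db/data'" |
    comment_out_backend 2
```

The demo leaves the two backend 1 lines as they are and prefixes `#` to the four backend 2 lines. Lines that are already commented do not match the pattern, so applying the filter twice is harmless.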

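III. Restart the cluster and test

=== Note: In a production environment the cluster does not need to be restarted immediately; restart it at a suitable time. ===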

[kingbase@node3 bin]$ ./kingbase_monitor.sh restart
-----------------------------------------------------------------------
2021-03-01 13:26:44 KingbaseES automation beging...
......................
all started..
...
now we check again
=======================================================================
|             ip |                       program|              [status] 
[  192.168.7.243]|             [kingbasecluster]|              [active]
[  192.168.7.248]|             [kingbasecluster]|              [active]
[  192.168.7.243]|                    [kingbase]|              [active]
[  192.168.7.248]|                    [kingbase]|              [active]
=======================================================================

IV. Verify the cluster status

1. Check the streaming replication status

# Primary/standby streaming replication status
TEST=# select * from sys_stat_replication;
  PID  | USESYSID | USENAME | APPLICATION_NAME |  CLIENT_ADDR  | CLIENT_HOSTNAME | CLIENT_PORT |         BACKEND_START         | BACKEND_XMIN |   STATE   | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE 
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 29330 |       10 | SYSTEM  | node248          | 192.168.7.248 |                 |       39484 | 2021-03-01 13:27:19.649897+08 |              | streaming | 0/70000D0     | 0/70000D0      | 0/70000D0      | 0/70000D0       |             2 | sync
(1 row)


# Replication slot information
TEST=# select * from sys_replication_slots;
  SLOT_NAME   | PLUGIN | SLOT_TYPE | DATOID | DATABASE | ACTIVE | ACTIVE_PID | XMIN | CATALOG_XMIN | RESTART_LSN | CONFIRMED_FLUSH_LSN 
--------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
 slot_node243 |        | physical  |        |          | f      |            |      |              |             | 
 slot_node248 |        | physical  |        |          | t      |      29330 | 2076 |              | 0/70000D0   | 
(2 rows)

2. Check the cluster node status

[kingbase@node3 bin]$ ./ksql -U SYSTEM -W 123456 TEST -p 9999
ksql (V008R003C002B0270)
Type "help" for help.

TEST=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay 
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.7.243 | 54321 | up     | 0.500000  | primary | 0          | false             | 0
 1       | 192.168.7.248 | 54321 | up     | 0.500000  | standby | 0          | true              | 0
(2 rows)

TEST=#  select * from sys_stat_replication;
  PID  | USESYSID | USENAME | APPLICATION_NAME |  CLIENT_ADDR  | CLIENT_HOSTNAME | CLIENT_PORT |         BACKEND_START         | BACKEND_XMIN |   STATE   | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE 
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 29330 |       10 | SYSTEM  | node248          | 192.168.7.248 |                 |       39484 | 2021-03-01 13:27:19.649897+08 |              | streaming | 0/70001B0     | 0/70001B0      | 0/70001B0      | 0/70001B0       |             2 | sync
(1 row)

V. Delete the installation directory on the data node

[kingbase@node2 cluster]$ rm -rf kha/

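VI. Summary

  1. Before removing a data node from the cluster, make sure the whole cluster (cluster nodes and streaming replication) is healthy.
  2. Comment out the cron scheduled task on the data node.
  3. Stop the database service on the data node.
  4. Drop the data node's replication slot on the primary node.
  5. Edit the configuration files on all management nodes (HAmodule.conf and kingbasecluster.conf).
  6. Restart the cluster (optional).
  7. Check the cluster status.
  8. Delete the installation directory on the data node.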
posted @ 2022-01-14 19:36  KINGBASE研究院