KingbaseES V8R6 Cluster Operations Case Study: Restoring a Standby Node to a Single-Instance Database
Case description:
In a production environment, manually restoring a cluster node to a single-instance database takes two steps. First, unregister the node from repmgr so it leaves cluster management; second, detach the node from streaming replication so it becomes a standalone instance.
Applicable version:
KingbaseES V8R6
Cluster architecture:
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
1 | node1 | primary | * running | | running | 4466 | no | n/a
2 | node2 | standby | running | node1 | running | 9380 | no | 0 second(s) ago
3 | node3 | standby | running | node1 | running | 7377 | no | 1 second(s) ago
I. Unregister the Node from the Cluster
Run the following command on the node to be unregistered to remove it from repmgr cluster management:
[kingbase@node103 bin]$ ./repmgr standby unregister --force
# Checking from the primary node: node3 has been unregistered
[kingbase@node101 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 20 | | host=192.168.1.101 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node2 | standby | running | node1 | default | 100 | 20 | 0 bytes | host=192.168.1.102 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
II. Detach the Node from Streaming Replication
1. Check the current streaming replication status (as shown below, node3 is still a streaming replication standby):
test=# select * from sys_stat_replication;
 pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 4024 |    16385 | esrep   | node3            | 192.168.1.103 |                 |       25675 | 2023-07-11 09:34:59.201690+08 |              | streaming | 1/C2000DA0 | 1/C2000DA0 | 1/C2000DA0 | 1/C2000DA0 |           |           |            |             0 | async      | 2023-07-11 09:37:50.359201+08
 4026 |    16385 | esrep   | node2            | 192.168.1.102 |                 |       14609 | 2023-07-11 09:35:00.173768+08 |              | streaming | 1/C2000DA0 | 1/C2000DA0 | 1/C2000DA0 | 1/C2000DA0 |           |           |            |             1 | sync       | 2023-07-11 09:37:49.073643+08
(2 rows)
2. Replication slot status
As shown below, the replication slots are in a normal state:
test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin  | catalog_xmin | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+-------+--------------+-------------+---------------------
 repmgr_slot_2 |        | physical  |        |          | f         | t      |       4026 | 60336 |              | 1/C2000DA0  |
 repmgr_slot_3 |        | physical  |        |          | f         | t      |       4024 | 60336 |              | 1/C2000DA0  |
(2 rows)
3. Stop the cluster and database services on the standby
As shown below, sys_monitor.sh stoplocal shuts down this node's cluster management processes and its database service:
[kingbase@node103 bin]$ ./sys_monitor.sh stoplocal
Service process "node_export" was killed at process 8204
Service process "postgres_ex" was killed at process 8205
2023-07-11 09:49:39 begin to stop repmgrd on "[localhost]".
2023-07-11 09:49:39 repmgrd on "[localhost]" stop success.
2023-07-11 09:49:39 begin to stop DB on "[localhost]".
waiting for server to shut down...... done
server stopped
2023-07-11 09:49:43 DB on "[localhost]" stop success.
4. Drop the replication slot
On the primary, drop the replication slot of the detached standby node:
test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin  | catalog_xmin | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+-------+--------------+-------------+---------------------
 repmgr_slot_2 |        | physical  |        |          | f         | t      |       4026 | 60337 |              | 1/C2000E98  |
 repmgr_slot_3 |        | physical  |        |          | f         | f      |            | 60336 |              | 1/C2000DA0  |
(2 rows)
test=# select sys_drop_replication_slot('repmgr_slot_3');
sys_drop_replication_slot
---------------------------
(1 row)
test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin  | catalog_xmin | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+-------+--------------+-------------+---------------------
 repmgr_slot_2 |        | physical  |        |          | f         | t      |       4026 | 60337 |              | 1/C2002560  |
(1 row)
Tips:
A standby node's replication slot can only be dropped when its active flag is 'f' (inactive); stop the standby first if the slot is still active.
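The rule above can be sketched as a small guard. This is a standalone demo of the logic only, not a live-cluster script; in a real session the flag would come from querying sys_replication_slots, and the slot name repmgr_slot_3 matches this example:

```shell
# Guard demo: only proceed with the drop when the slot is inactive.
# In a real session the value would come from:
#   select active from sys_replication_slots where slot_name = 'repmgr_slot_3';
active=f   # 'f' here simulates a standby that has already been stopped
if [ "$active" = "f" ]; then
    echo "slot inactive - safe to drop"
else
    echo "slot still active - stop the standby first" >&2
    exit 1
fi
```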
III. Clean Up the Cluster Configuration
1. Disable the kbha process autostart on this node
[root@node103 ~]# cat /etc/cron.d/KINGBASECRON
#*/1 * * * * kingbase . /etc/profile;/home/kingbase/cluster/R6HA/ha7/kingbase/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6HA/ha7/kingbase/kingbase/bin/../etc/repmgr.conf
# Comment out or delete this line.
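The comment-out step can also be done with sed. The sketch below works on a scratch copy with a shortened, made-up command path; the real file is /etc/cron.d/KINGBASECRON and editing it requires root:

```shell
# Demo: prefix the kbha cron line with '#' so cron stops restarting the HA daemon.
cron=$(mktemp)
echo '*/1 * * * * kingbase kbha -A daemon -f /path/to/repmgr.conf' > "$cron"
sed -i '/kbha/ s/^[^#]/#&/' "$cron"   # comment the line if not already commented
cat "$cron"
rm -f "$cron"
```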
2. Remove the standby signal file (so the database starts as a primary)
[kingbase@node103 data]$ ls -lh standby.signal
-rw------- 1 kingbase kingbase 20 Jul 10 15:19 standby.signal
[kingbase@node103 data]$ mv standby.signal standby.signal.bk
3. Change the transaction commit synchronization setting
[kingbase@node103 bin]$ cat /data/kingbase/hac7/data/es_rep.conf|grep -i _commit
synchronous_commit = remote_apply
# change it to the following setting
[kingbase@node103 bin]$ cat /data/kingbase/hac7/data/es_rep.conf|grep -i _commit
synchronous_commit = local
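The edit itself can be scripted with sed. This is a demo on a scratch copy (the real file in this example environment is /data/kingbase/hac7/data/es_rep.conf); a remote_apply setting left over from the cluster can make commits wait for a synchronous standby that no longer exists:

```shell
# Demo: switch synchronous_commit from remote_apply to local on a scratch copy.
conf=$(mktemp)
echo 'synchronous_commit = remote_apply' > "$conf"
sed -i 's/^synchronous_commit *=.*/synchronous_commit = local/' "$conf"
grep -i _commit "$conf"
rm -f "$conf"
```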
4. Start the database service on this node
[kingbase@node103 bin]$ ./sys_ctl start -D /data/kingbase/hac7/data/
waiting for server to start....2023-07-11 09:49:48.082 CST [12054] LOG: sepapower extension initialized
.......
server started
# As shown below, the database service on this node now runs as a primary (not in recovery)
test=# select sys_is_in_recovery();
sys_is_in_recovery
--------------------
f
(1 row)
IV. Remove the repmgr Extension and Its Data
1. Stop loading repmgr
repmgr is loaded as an extension; stop loading it in the database configuration file (typically by removing 'repmgr' from the shared_preload_libraries entry).
Tips:
The esrep user and database store the repmgr cluster metadata; once the node has been restored to a single instance, the database and user can be dropped.
2. Drop the esrep database
test=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+--------+----------+-------------+-------------+-------------------
esrep | system | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
prod | system | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
security | system | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
template0 | system | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/system +
| | | | | system=CTc/system
template1 | system | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/system +
| | | | | system=CTc/system
test | system | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
(6 rows)
test=# drop database esrep;
DROP DATABASE
3. Drop the esrep user
test=# drop user esrep;
DROP ROLE
test=# \du
List of roles
Role name | Attributes | Member of
-----------+------------------------------------------------------------+-----------
sao | No inheritance | {}
sso | No inheritance | {}
system | Superuser, Create role, Create DB, Replication, Bypass RLS+| {}
| Password valid until infinity |
tom | Superuser | {}
V. Summary
For a KingbaseES V8R6 cluster, restoring a standby node to a single-instance database is relatively straightforward. To restore the primary to a single instance, first perform a switchover so the primary becomes a standby, then detach it from the cluster using the same procedure.