KingbaseES集群部署系列之四---单实例在线扩展流复制集群架构

案例说明:
KingbaseES V8R6单实例生产环境需求在线扩展为流复制集群架构,本案例描述了从单实例在线扩展为流复制集群的过程。(本案例在通用机环境下测试,专用机环境可参考)

适用版本:
KingbaseES V8R6

集群节点信息:

[kingbase@node201 bin]$ cat /etc/hosts
.......
192.168.1.201 node201  # Primary
192.168.1.202 node202  # Standby

文档结构:

1. 集群节点部署系统环境准备
2. 集群节点配置sys_securecmdd建立互信
3. 手工创建主备流复制架构
4. 配置repmgr集群管理
5. 执行集群switchover、failover切换及启停测试
6. 总结

一、系统环境准备
集群节点系统环境的准备参考官方文档:
https://help.kingbase.com.cn/v8/install-updata/k-deploy/k-deploy-3.html

二、配置securecmdd建立节点互信
集群节点之间通过集群组件:sys_securecmdd建立系统互信连接,用于远程命令及脚本的执行,部署参考官方文档:
https://help.kingbase.com.cn/v8/highly/availability/cluster-use/cluster-use-2.html?highlight=securecmdd#id24

主备节点配置并启动securecmdd服务:

[kingbase@node201 bin]$ ps -ef |grep securecmd
root      1042     1  0 09:27 ?        00:00:00 sys_securecmdd: /home/kingbase/cluster/securecmdd/bin/sys_securecmdd -f /etc/.kes/securecmdd_config [listener] 0 of 128-256 startups

三、配置主备流复制
原单实例节点作为流复制的Primary节点,新增节点作为流复制的Stangby节点,通过sys_basebackup工具创建主备流复制:

1、查看主库数据库服务状态

test=# select sys_is_in_recovery();
 sys_is_in_recovery
--------------------
 f
(1 row)

2、创建主备流复制
1)主备节点创建集群部署目录
[kingbase@node201 bin]$ mkdir -p /home/kingbase/cluster/HA6/HAC8/kingbase/

2)从数据库安装目录拷贝db.zip

# 数据库软件安装目录
[kingbase@node201 zip]$ pwd
/opt/Kingbase/ES/R6_C8/ClientTools/guitools/DeployTools/zip
# 拷贝db.zip到主备节点
[kingbase@node201 zip]$ cp db.zip /home/kingbase/cluster/HA6/HAC8/kingbase/
[kingbase@node201 zip]$ scp db.zip node202:/home/kingbase/cluster/HA6/HAC8/kingbase/

3)解压db.zip到当前目录

[kingbase@node201 kingbase]$ ls -lh
total 512M
drwxr-xr-x 2 kingbase kingbase 4.0K Sep  2  2023 bin
-rw-rw-r-- 1 kingbase kingbase 308M May 23 15:31 db.zip
drwxrwxr-x 3 kingbase kingbase 4.0K Sep  2  2023 include
drwxrwxr-x 5 kingbase kingbase 8.0K Sep  2  2023 lib
drwxrwxr-x 7 kingbase kingbase 4.0K Sep  2  2023 share

4)主库包含自定义表空间

prod=# \db+
                                               List of tablespaces
    Name     | Owner  |              Location              | Access privileges | Options |   Size   | Descriptio
n
-------------+--------+------------------------------------+-------------------+---------+----------+-----------
--
 sys_default | system |                                    |                   |         | 123 MB   |
 sys_global  | system |                                    |                   |         | 745 kB   |
 sysaudit    | system |                                    |                   |         | 24 kB    |
 users       | system | /home/kingbase/db/r6_c8/tps/user01 |                   |         | 18 bytes |
(4 rows)

#备库创建自定义表空间存储目录:
[kingbase@node202 ~]$ mkdir -p /home/kingbase/db/r6_c8/tps/user01

5)主节点编辑sys_hba.conf:
开放流复制连接:

# 下面的部分是为限制连接的设置(limit to connection)

host    replication     all             0.0.0.0/0              scram-sha-256
host    all             all             0.0.0.0/0              scram-sha-256
host    replication     all             ::0/0                  scram-sha-256

# reload主库数据库服务
[kingbase@node201 bin]$ ./sys_ctl reload -D /home/kingbase/db/r6_c8/data
server signaled

6)创建备库

[kingbase@node202 bin]$ ./sys_basebackup -h 192.168.1.201 -p 54325 -U system  -F p -X stream   -D /home/kingbase/db/r6_c8/data -v -P -c fast
.......
sys_basebackup: base backup completed

7)备库创建standby标识文件
[kingbase@node202 data]$ touch standby.signal

8)主备库拷贝license文件

[kingbase@node201 bin]$ cp license.dat /home/kingbase/cluster/HA6/HAC8/kingbase/bin/
[kingbase@node201 bin]$ scp license.dat node202:/home/kingbase/cluster/HA6/HAC8/kingbase/bin/

9)主备库创建kingbase.auto.conf

#主库
[kingbase@node201 data]$ cat kingbase.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
.......
primary_conninfo = 'user=system connect_timeout=10 host=192.168.1.202 port=54325 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node1'
primary_slot_name = 'slot_node1'
synchronous_standby_names = '1 (node2)'

# 主库reload数据库服务
[kingbase@node201 bin]$ ./sys_ctl reload -D /home/kingbase/db/r6_c8/data
server signaled

# 备库
[kingbase@node202 data]$ cat kingbase.auto.conf
......
primary_conninfo = 'user=system connect_timeout=10 host=192.168.1.201 port=54325 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node2'
primary_slot_name = 'slot_node2'
synchronous_standby_names = '1 (node2)'

10)主库创建复制槽

test=# select sys_create_physical_replication_slot('slot_node2');
 sys_create_physical_replication_slot
--------------------------------------
 (slot_node2,)
(1 row)

test=# select * from sys_replication_slots;
 slot_name  | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | r
estart_lsn | confirmed_flush_lsn
------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+--
-----------+---------------------
 slot_node2 |        | physical  |        |          | f         | f      |            |      |              |
           |
(1 row)

11)启动备库数据库服务

[kingbase@node202 bin]$ ./sys_ctl start -D /home/kingbase/db/r6_c8/data/
......
server started
[kingbase@node202 bin]$ ./ksql -U system test -p 54325
Type "help" for help.

test=# select sys_is_in_recovery();
 sys_is_in_recovery
--------------------
 t
(1 row)

12)主库查看流复制

test=# select * from sys_stat_replication;
 pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_
start         | backend_xmin |   state   | sent_lsn | write_lsn  | flush_lsn  | replay_lsn |   write_lag    |
flush_lag    |   replay_lag   | sync_priority | sync_state |          reply_time
------+----------+---------+------------------+---------------+-----------------+-------------+-----------------
--------------+--------------+-----------+----------+------------+------------+------------+----------------+---
-------------+----------------+---------------+------------+-------------------------------
 7514 |       10 | system  | node2            | 192.168.1.202 |                 |       44016 | 2024-05-23 16:28
:59.440173+08 |              | streaming |          | 0/6C000130 | 0/6C000130 | 0/6C000130 | 0:00:00.018945 | 0:
00:00.019326 | 0:00:00.019513 |             1 | sync       | 2024-05-23 16:28:59.450793+08
(1 row)
test=# select * from sys_replication_slots;
 slot_name  | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | r
estart_lsn | confirmed_flush_lsn
------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+--
-----------+---------------------
 slot_node2 |        | physical  |        |          | f         | t      |       7514 |      |              | 0
/6C000130  |
(1 row)

---如上所示,主备流复制创建完成。

四、配置repmgr集群管理
主备流复制集群资源监控、主备切换及资源管理,是通过repmgr工具来完成。

1、主备库配置repmgr插件

[kingbase@node201 bin]$ cat /home/kingbase/db/r6_c8/data/kingbase.conf|grep -i repmgr
 shared_preload_libraries = 'repmgr,liboracle_parser, synonym, plsql, force_view, kdb_flashback,plugin_debugger, plsql_plugin_debugger, plsql_plprofiler, kdb_ora_expr, sepapower, dblink, sys_kwr, sys_spacequota, sys_stat_statements, backtrace, kdb_utils_function, auto_bmr, sys_squeeze, src_restrict,kdb_schedule,sysaudit'

# reload主备库数据库服务
[kingbase@node201 bin]$ ./sys_ctl reload -D /home/kingbase/db/r6_c8/data
server signaled

2、主库创建esrep库和esrep用户及repmgr插件(esrep用户用于集群管理)

# 创建esrep数据库和用户
[kingbase@node201 bin]$ ./ksql -U system test -p 54325
Type "help" for help.

test=# create database esrep;
CREATE DATABASE

test=# \c esrep
You are now connected to database "esrep" as userName "system".
esrep=# create user esrep with password '123456';
CREATE ROLE
esrep=# alter user esrep with superuser;
ALTER ROLE
esrep=# \du esrep
           List of roles
 Role name | Attributes | Member of
-----------+------------+-----------
 esrep     | Superuser  | {}

# 创建repmgr插件
esrep=# create extension repmgr;
CREATE EXTENSION

3、主备库配置esrep免密(system用户已经创建免密)

[kingbase@node201 bin]$ ./sys_encpwd -H \* -P \* -D \* -U esrep -W '123456'
[kingbase@node201 bin]$ cat ~/.encpwd
*:*:*:esrep:MTIzNDU2
*:*:*:system:MTIzNDU2
# 测试免密连接
[kingbase@node201 bin]$ ./ksql -U esrep test -p 54325
Type "help" for help.

4、主备库创建集群管理目录

 Tips: 
  etc:集群配置文件目录
  log:日志存储目录

# 当前集群部署目录
[kingbase@node201 kingbase]$ pwd
/home/kingbase/cluster/HA6/HAC8/kingbase
# 创建子目录
[kingbase@node201 kingbase]$  mkdir -p log etc
[kingbase@node201 kingbase]$ ls -lh
total 308M
drwxr-xr-x 2 kingbase kingbase 4.0K May 23 15:49 bin
-rw-rw-r-- 1 kingbase kingbase 308M May 23 15:31 db.zip
drwxrwxr-x 2 kingbase kingbase    6 May 23 17:23 etc
drwxrwxr-x 3 kingbase kingbase 4.0K Sep  2  2023 include
drwxrwxr-x 5 kingbase kingbase 8.0K Sep  2  2023 lib
drwxrwxr-x 2 kingbase kingbase    6 May 23 17:23 log
drwxrwxr-x 7 kingbase kingbase 4.0K Sep  2  2023 share

5、创建集群管理配置文件
对于repmgr.conf和all_nodes_tools.conf文件,可以从已部署的集群环境下copy后修改,如下所示:

[kingbase@node201 etc]$ ls -lh *.conf
-rw-rw-r-- 1 kingbase kingbase   60 Mar 11 11:52 all_nodes_tools.conf
-rw-rw-r-- 1 kingbase kingbase 1.7K May 24 15:31 repmgr.conf

1)主备库拷贝配置文件

[kingbase@node201 etc]$ cp * /home/kingbase/cluster/HA6/HAC8/kingbase/etc/
[kingbase@node201 etc]$ scp * node202:/home/kingbase/cluster/HA6/HAC8/kingbase/etc/

2)主备库配置repmgr.conf

# 主库节点:
[kingbase@node201 etc]$ cat repmgr.conf
node_id=1
node_name='node1'
promote_command='/home/kingbase/cluster/HA6/HAC8/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/HA6/HAC8/kingbase/etc/repmgr.conf'
follow_command='/home/kingbase/cluster/HA6/HAC8/kingbase/bin/repmgr  standby follow  -f /home/kingbase/cluster/HA6/HAC8/kingbase/etc/repmgr.conf -W --upstream-node-id=%n'
conninfo='host=192.168.1.201 user=esrep dbname=esrep port=54325 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000'
log_file='/home/kingbase/cluster/HA6/HAC8/kingbase/log/hamgr.log'
kbha_log_file='/home/kingbase/cluster/HA6/HAC8/kingbase/log/kbha.log'
data_directory='/home/kingbase/db/r6_c8/data'   # 原单实例存储路径(构建集群后仍沿用)
sys_bindir='/home/kingbase/cluster/HA6/HAC8/kingbase/bin'
scmd_options='-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -p 8890 -o ServerAliveInterval=2 -o ServerAliveCountMax=3'
reconnect_attempts=5
reconnect_interval=2
failover='automatic'
recovery='automatic'
monitoring_history='no'
trusted_servers='192.168.1.1'
synchronous='all'
repmgrd_pid_file='/home/kingbase/cluster/HA6/HAC8/kingbase/etc/hamgrd.pid'
kbha_pid_file='/home/kingbase/cluster/HA6/HAC8/kingbase/etc/kbha.pid'
ping_path='/usr/bin'
auto_cluster_recovery_level=1
use_check_disk=off
use_scmd='off'
#use_scmd='on'
running_under_failure_trusted_servers=on
connection_check_type='mix'
log_level=debug

virtual_ip='192.168.1.88/24'
net_device='enp0s3'
net_device_ip='192.168.1.201'
arping_path='/home/kingbase/cluster/HA6/HAC8/kingbase/bin'
ipaddr_path='/usr/sbin'

#备库(注意以下和主库不同之处)
[kingbase@node202 etc]$ cat repmgr.conf
node_id=2
node_name='node2'
.......
conninfo='host=192.168.1.202 user=esrep dbname=esrep port=54325 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000'
......
net_device_ip='192.168.1.202'

3)主备编辑tall_nodes_tools.conf
对于all_nodes_tools.conf文件,只需修改用户密码和端口等信息:

[kingbase@node201 etc]$ cat all_nodes_tools.conf
db_u=system
db_password=MTIzNDU2  # base64加密
db_port=54325
db_name=test

4)主备库配置es_rep.conf

[kingbase@node201 data]$ cat es_rep.conf
listen_addresses='*'
max_wal_senders='32'
max_replication_slots='32'
wal_keep_segments='512'
hot_standby_feedback='on'
max_connections='100'
control_file_copy='/home/kingbase/cluster/HA6/HAC8/kingbase/copy_file'
max_prepared_transactions='100'
port='54321'
shared_buffers='512MB'
fsync='on'
wal_log_hints='on'
archive_mode='on'
wal_sender_timeout=30000ms
wal_receiver_timeout=30000ms
hot_standby='on'
log_destination='csvlog'
timezone='PRC'
wal_level='replica'
lc_messages='C'
lc_monetary='C'
lc_numeric='C'
lc_time='C'
archive_command='export TZ=Asia/Shanghai;/home/kingbase/cluster/HA6/HAC8/kingbase/bin/sys_rman --config /home/kingbase/kbbr6_repo/sys_rman.conf --stanza=kingbase archive-push %p'
logging_collector=on
full_page_writes=on
log_checkpoints=on
log_replication_commands=on
wal_compression=on
synchronous_commit=remote_apply
tcp_keepalives_idle=10
tcp_keepalives_interval=1
tcp_keepalives_count=3
tcp_user_timeout=9000

注意:在kingbase.conf下增加:
include_if_exists='./es_rep.conf' ##通过kingbase.conf加载es_rep.conf

# 应用配置
[kingbase@node201 data]$ cd /home/kingbase/cluster/HA6/HAC8/kingbase/bin
[kingbase@node201 bin]$ ./sys_ctl reload -D /home/kingbase/db/r6_c8/data
server signaled

6、注册主备库
Tips: 主库注册需要加载vip,对于通用机环境注意配置arping工具的属性。

1)注册主库节点

# 配置主备库节点arping属性
[root@node201 ~]# ls -lh /home/kingbase/cluster/HA6/HAC8/kingbase/bin/arping
-rwxr-xr-x 1 kingbase kingbase 14K Sep  2  2023 /home/kingbase/cluster/HA6/HAC8/kingbase/bin/arping
[root@node201 ~]# chown root.root /home/kingbase/cluster/HA6/HAC8/kingbase/bin/arping
[root@node201 ~]# chmod u+s /home/kingbase/cluster/HA6/HAC8/kingbase/bin/arping
[root@node201 ~]# ls -lh /home/kingbase/cluster/HA6/HAC8/kingbase/bin/arping
-rwsr-xr-x 1 root root 14K Sep  2  2023 /home/kingbase/cluster/HA6/HAC8/kingbase/bin/arping
# 执行注册
[kingbase@node201 bin]$ ./repmgr primary register --force
[INFO] connecting to primary database...
......
[NOTICE] node (ID: 1) acquire the virtual ip 192.168.1.88/24 success
[NOTICE] primary node record (ID: 1) registered
# 查看节点状态
[kingbase@node201 bin]$ ./repmgr cluster show
                                                                                                              
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        |         | host=192.168.1.201 user=esrep dbname=esrep port=54325 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000

2)备库注册

[kingbase@node202 bin]$ ./repmgr standby register --force
[INFO] connecting to local node "node2" (ID: 2)
......
[NOTICE] standby node "node2" (ID: 2) successfully registered

[kingbase@node202 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                     
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        |         | host=192.168.1.201 user=esrep dbname=esrep port=54325 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54325 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000

---如上所示,主备节点注册完成,集群节点状态正常。

五、集群switchover和failover及启停测试
Tips:集群切换测试,将影响业务的正常访问,可在业务低峰期间执行。

1、配置crond服务启动kbha和repmgrd进程

#配置crond
[kingbase@node201 kingbase]$ cat /etc/cron.d/KINGBASECRON
*/1 * * * * kingbase . /etc/prof# ile;/home/kingbase/cluster/HA6/HAC8/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/HA6/HAC8/kingbase/bin/../etc/repmgr.conf
# 启动服务
[kingbase@node201 bin]$ ps -ef |grep repmgr
kingbase  6264     1  0 15:41 ?        00:00:00 /home/kingbase/cluster/HA6/HAC8/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/../etc/repmgr.conf
kingbase  6528     1  0 15:41 ?        00:00:00 /home/kingbase/cluster/HA6/HAC8/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6C8/HAC8/kingbase/bin/../etc/repmgr.conf

2、Switchover切换测试
在新增备库上执行switchover测试:

 # 切换前状态
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                     
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 1        | 0 bytes | host=192.168.1.201 user=esrep dbname=esrep port=54325 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | primary | * running |          | default  | 100      | 2        |         | host=192.168.1.202 user=esrep dbname=esrep port=54325 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
# 执行switchover
[kingbase@node201 bin]$ ./repmgr standby switchover -h 192.168.1.202 -U esrep -d esrep
......
# 切换后状态
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                     
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        |         | host=192.168.1.201 user=esrep dbname=esrep port=54325 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
 2  | node2 | standby |   running | node1    | default  | 100      | 2        | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54325 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000

3、Failover切换测试

1)停止主库数据库服务
 2)切换后集群状态(配置recovery=automatic,集群自动恢复)
ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                     
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1  | node1 | standby |   running | node2    | default  | 100      | 3        | 0 bytes | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000
2  | node2 | primary | * running |          | default  | 100      | 4        |         | host=192.168.1.202 user=esrep dbname=esrep port=54325 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000

4、集群启停测试

 如下所示,集群可以正常启停:
[kingbase@node201 bin]$ ./sys_monitor.sh restart
......
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node1 | primary | * running |          | running | 7605  | no      | n/a
 2  | node2 | standby |   running | node1    | running | 10311 | no      | 1 second(s) ago
[2024-05-24 16:37:42] [NOTICE] redirecting logging output to "/home/kingbase/cluster/HA6/HAC8/kingbase/log/kbha.log"
[2024-05-24 16:37:43] [NOTICE] redirecting logging output to "/home/kingbase/cluster/HA6/HAC8/kingbase/log/kbha.log"

六、总结
对于单实例的生产环境在线扩展为流复制的集群架构,可以采用以上方案,但是步骤繁杂,容易出错,只为参考。建议采用以下方案:
1、业务库数据量少,可以先通过部署工具或脚本,先创建空集群;然后在单实例环境停止业务写入访问数据库后,sys_dump导出业务数据库,然后再导入新建集群的主库节点,备库自动同步。
2、对于数据量大的业务库,通过sys_dump导出导入比较慢,可以考虑物理迁移,参考以下博文:
https://www.cnblogs.com/kingbase/p/16410973.html
《KingbaseES V8R6集群维护案例之--单实例数据迁移到集群案例》

posted @ 2024-05-25 16:52  天涯客1224  阅读(34)  评论(0编辑  收藏  举报