KingbaseES V8R6集群运维案例--创建级联复制

案例说明:

在已有的一主一备的集群架构上,构建级联复制。

适用版本:
KingbaseES V8R6

案例架构:

操作步骤:

   1)增加新的主机节点,并配置相关数据库用户、ssh信任关系、资源管理等。
   2)在新节点创建和源节点相同的集群文件存储目录。
   3)拷贝源节点的文件到新节点(除了data目录下的数据)。
   4)在新节点上执行clone,注意upstream node。
   5)启动新节点数据库服务。
   6)在新节点以standby进行注册。
   7)查看集群节点状态信息。
   8)在主库做DML操作进行数据同步测试。
   9)重启集群测试。

一、查看现有集群及流复制状态

1)查看集群节点状态

[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+-----------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

2)查看主备流复制状态

test=# select * from sys_stat_replication;
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_s
tart         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_la
g | replay_lag | sync_priority | sync_state |          reply_time           
-------+----------+---------+------------------+---------------+-----------------+-------------+-----
 11257 |    16384 | esrep   | node249          | 192.168.7.249 |                 |       56568 | 2021-03-01 14:41:
41.062467+08 |              | streaming | 2/1301BEC8 | 2/1301BEC8 | 2/1301BEC8 | 2/1301BEC8 |           |         
  |            |             1 | quorum     | 2021-03-01 15:02:50.835989+08
(1 row)

=== 首先确保现有节点的集群和流复制状态正常===

二、配置新节点下集群目录存储结构和相关配置文件

1)集群目录结构

[kingbase@node3 bin]$ ls -lh /home/kingbase/cluster/R6HA/KHA/kingbase/
total 24M
drwxrwxr-x.  2 kingbase kingbase  16K Jun 24  2021 archive
drwxr-xr-x.  2 kingbase kingbase 4.0K Jun 24  2021 bin
drwxrwxr-x.  2 kingbase kingbase   72 Mar  1  2021 etc
-rw-rw-r--.  1 kingbase kingbase    4 Mar  1 12:07 hamgrd.pid
-rw-rw-r--.  1 kingbase kingbase 2.0M Mar  1  2021 hamgr.log
.......
-rw-rw-r--.  1 kingbase kingbase  20M Mar  1  2021 kbha.log
......
drwxrwxr-x.  5 kingbase kingbase 8.0K Jun 24  2021 lib
-rw-------.  1 kingbase kingbase  47K Mar  1 12:02 logfile
drwxrwxr-x.  7 kingbase kingbase 4.0K Jun 24  2021 share

=== 新节点集群目录存储结构和源节点一致,除了data目录,所有数据从源节点拷贝===

2)配置repmgr.conf配置文件

三、在新节点上执行clone

=== 注意:通过upstream-node-id执行upstream节点(上游节点)===

[kingbase@node3 bin]$ ./repmgr standby clone -h 192.168.7.249 -U esrep -d esrep --upstream-node-id=2

NOTICE: destination directory "/home/kingbase/cluster/R6HA/KHA/kingbase/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.7.249 user=esrep dbname=esrep
.......
HINT: for example: sys_ctl -D /home/kingbase/cluster/R6HA/KHA/kingbase/data start
HINT: after starting the server, you need to register this standby with "repmgr standby register"

四、启动数据库服务注册standby节点

1)启动新备库数据库服务

[kingbase@node3 bin]$ ./sys_ctl start -D ../data
.......
server started

2)注册standby节点
=== 注意:通过upstream-node-id执行upstream节点(上游节点)===

[kingbase@node3 bin]$ ./repmgr standby register --upstream-node-id=2 --force
INFO: connecting to local node "node243" (ID: 3)
INFO: connecting to primary database
WARNING: this node does not appear to be attached to upstream node "node249" (ID: 2)
INFO: standby registration complete
NOTICE: standby node "node243" (ID: 3) successfully registered

[kingbase@node3 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+--------------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

3)查看集群节点状态

[kingbase@node3 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+-------------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

=== 注意:在新节点上没有看到 node243的注册信息,应该不是正常状态===

4)查看上游节点中集群节点状态信息

[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node243 | standby |   running | node249  | default  | 100      | 5        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 [kingbase@node2 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node243 | standby |   running | node249  | default  | 100      | 5        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

=== 上游节点查询集群节点状态信息正常===

五、解决新节点(node243)集群状态非正常的故障

1)查看新节点数据库日志

node243:
......
2021-03-01 12:41:19.789 CST [4909] FATAL:  could not start WAL streaming: ERROR:  replication slot "repmgr_slot_3" does not exist
2021-03-01 12:41:24.789 CST [4910] FATAL:  could not start WAL streaming: ERROR:  replication slot "repmgr_slot_3" does not exist
2021-03-01 12:41:29.794 CST [4911] FATAL:  could not start WAL streaming: ERROR:  replication slot "repmgr_slot_3" does not exist

=== 从日志信息可知,node243通过"repmgr_slot_3"复制槽做流复制连接,而复制槽不存在===

2)查看上游节点(node249)复制槽和流复制状态信息

node249:
1、查看复制槽信息

test=# select * from sys_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+-
(0 rows)

2、查看流复制信息

test=# select * from sys_stat_replication;
 pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time 
-----+----------+---------+------------------+-------------+-----------------+-------------+
(0 rows)

=== 从以上获知,在上游节点node249查询,复制槽和流复制节点信息均为空===

3)在node249节点创建复制槽

test=# select sys_create_physical_replication_slot('repmgr_slot_3');
 sys_create_physical_replication_slot 
--------------------------------------
 (repmgr_slot_3,)
(1 row)

查看复制槽:

test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn 
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-
 repmgr_slot_3 |        | physical  |        |          | f         | t      |      18334 |      |              | 2/1301BEC8  | 
(1 row)

查看流复制:

test=# select * from sys_stat_replication;                          
  pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time           
-------+----------+---------+------------------+---------------+-----------------+-------------+
 18334 |    16384 | esrep   | node243          | 192.168.7.243 |                 |       48912 | 2021-03-01 15:08:13.075586+08 |              | streaming | 2/1301BEC8 | 2/1301BEC8 | 2/1301BEC8 | 2/1301BEC8 |           |         
  |            |             0 | async      | 2021-03-01 12:47:40.421207+08
(1 row)

=== 从以上获知,复制槽创建后,流复制状态正常===

4)在node243上查看集群节点状态

[kingbase@node3 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node243 | standby |   running | node249  | default  | 100      | 5        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

=== 从以上获知,整个集群节点状态正常===

六、测试级联复制数据同步

node248:
prod=# create table t1 (id int);
CREATE TABLE
prod=# insert into t1 values (generate_series(10,1000,10));
INSERT 0 100
prod=# select * from t1 limit 10;
 id  
-----
  10
 ......
 100
(10 rows)

node249:
prod=# \d t1;
                 Table "public.t1"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           |          | 

prod=# select * from t1 limit 10;
 id  
-----
  10
 ......
 100
(10 rows)

node243:
prod=# select * from t1 limit 10;
 id  
-----
  10
  ......
 100
(10 rows)

七、重启集群测试

1)查看集群节点状态

[kingbase@node3 bin]$ ./ksql -U esrep -d esrep
ksql (V8.0)
Type "help" for help.

esrep=# select * from repmgr.nodes;
 node_id | upstream_node_id | active | node_name |  type   | location | priority |  conninfo    | repluser |   slot_name   |                           config_file                           
---------+------------------+--------+-----------+---------+----------+----------+-------------
       1 |                  | t      | node248   | primary | default  |      100 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 | esrep    | repmgr_slot_1 | /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf
       2 |                1 | t      | node249   | standby | default  |      100 | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 | esrep    | repmgr_slot_2 | /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf
       3 |                2 | t      | node243   | standby | default  |      100 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 | esrep    | repmgr_slot_3 | /home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf
(3 rows)

2)使用sys_monitor.sh重启集群测试

[kingbase@node1 bin]$ ./sys_monitor.sh restart
2021-03-01 15:16:53 Ready to stop all DB ...
.......
2021-03-01 15:18:04 repmgrd on "[192.168.7.243]" start success.
 ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node248 | primary | * running |          | running | 3420  | no      | n/a                
 2  | node249 | standby |   running | node248  | running | 21209 | no      | 1 second(s) ago    
 3  | node243 | standby |   running | node249  | running | 7376  | no      | 0 second(s) ago    
2021-03-01 15:18:18 Done.
[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node248 | primary | * running |          | default  | 100      | 5        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | standby |   running | node248  | default  | 100      | 5        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 3  | node243 | standby |   running | node249  | default  | 100      | 5        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

=== 从以上启动信息获知,通过sys_monitor.sh一键重启集群成功,新节点亦可以加入集群管理===

posted @ 2021-07-24 11:48  KINGBASE研究院  阅读(299)  评论(0编辑  收藏  举报