KingbaseES V8R6 Cluster Operations Case Study: Changing the Physical IPs and the VIP

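  The repmgr streaming-replication management tool manages cluster nodes in a distributed fashion. Each node has its own repmgr.conf configuration file, which records the node's ID, node name, connection information, the database data directory (KBDATA), and other parameters. Once these parameters are configured, the repmgr command can deploy the cluster nodes in a single step.
  After deployment, each node runs its own repmgrd daemon to monitor database status, and each node maintains its own metadata table recording information about every node in the cluster. The daemon on the primary mainly monitors the local database service; the daemon on a standby monitors both the primary's database service and its own. During an automatic failover, once a standby has failed N consecutive attempts to connect to the primary, repmgrd elects one candidate among all standbys and promotes it to be the new primary. The remaining standbys then follow the new primary, which brings the cluster to a new state.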
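The "N failed attempts" rule can be illustrated with a small retry loop. This is only a sketch of the counting logic, not repmgrd's actual implementation: the function name primary_lost, its arguments, and the probe command in the usage comment are all hypothetical.

```shell
# Hypothetical helper: treat the primary as lost only after every one of
# N probe attempts has failed.
# Usage: primary_lost ATTEMPTS INTERVAL_SECONDS PROBE_COMMAND [ARGS...]
# Returns 0 when all attempts failed, 1 as soon as one probe succeeds.
primary_lost() (
    attempts=$1
    interval=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Any successful probe means the primary is still reachable.
        if "$@" >/dev/null 2>&1; then
            return 1
        fi
        # Wait between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
    return 0
)

# Example (assumed probe, run on a standby):
#   primary_lost 3 5 ping -c 1 -W 1 192.168.7.239 && echo "primary unreachable"
```

Passing the probe as a command keeps the retry logic separate from how reachability is actually tested.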

Figure: repmgr cluster architecture diagram.

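Case description:
In a KingbaseES V8R6 cluster, IP addresses are recorded in repmgr.conf and kingbase.auto.conf, so changing the cluster's physical IPs and VIP means editing both files. The change requires stopping the cluster services, so in a production environment, plan a downtime window beforehand to avoid disrupting application access.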
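Before editing anything, it helps to list every place an address is referenced. The sketch below uses plain grep; the function name find_ip_refs is hypothetical, and the file paths in the usage comment depend on where the cluster is installed.

```shell
# Hypothetical helper: list every line in the given files that mentions
# an IP address, prefixed with file name and line number.
# Usage: find_ip_refs IP FILE...
find_ip_refs() (
    ip=$1
    shift
    # -F: treat the address as a fixed string, so dots are not wildcards.
    # -w: whole-word match, so 192.168.7.23 does not match 192.168.7.238.
    # -H -n: always print the file name and the line number.
    grep -F -w -H -n -- "$ip" "$@"
)

# Example (paths relative to the cluster install directory):
#   find_ip_refs 192.168.7.238 etc/repmgr.conf data/kingbase.auto.conf /etc/hosts
```

The function returns a non-zero status when the address is not found anywhere, which makes it usable as a check after the edit as well.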

Environment:

Operating system:
[kingbase@node1 bin]$ cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core)

Database:
KingbaseES V8R6

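Procedure:

1. Identify the primary and standby nodes, then stop the cluster (cluster and db) services.
2. Change the system IPs and the IPs in /etc/hosts.
3. Change the physical IP and VIP settings in repmgr.conf and kingbase.auto.conf on the primary and the standby.
4. Restart the system network service to apply the new physical IPs.
5. Start the database service on the primary and the standby.
6. Register the primary with the cluster.
7. Stop the standby's database service, rejoin the standby node to the cluster, and register the standby with the cluster.
8. Check the cluster service status (cluster and db) and start the repmgrd service on the primary and the standby.
9. Restart the cluster with sys_monitor.sh to verify.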
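The ordering of the steps can be kept at hand as a printable checklist. The sketch below only prints text and runs nothing; the commands are the ones used in this walkthrough, while the function name print_ip_change_plan and the node labels in brackets are illustrative assumptions.

```shell
# Hypothetical helper: print the procedure as an ordered checklist.
# Nothing is executed; the argument is the primary's new IP address.
# Usage: print_ip_change_plan NEW_PRIMARY_IP
print_ip_change_plan() (
    new_primary_ip=$1
    cat <<EOF
[primary]   ./sys_monitor.sh stop
[all nodes] change the system IP and /etc/hosts
[all nodes] edit repmgr.conf and kingbase.auto.conf
[all nodes] restart the network service
[all nodes] ./sys_ctl start -D ../data
[primary]   ./repmgr primary register -F
[standby]   ./sys_ctl stop -D ../data
[standby]   ./repmgr node rejoin -h $new_primary_ip -U esrep -d esrep
[standby]   ./repmgr standby register -h $new_primary_ip -U esrep -d esrep -F
[all nodes] ./repmgr cluster show
[all nodes] ./repmgrd -d
[primary]   ./sys_monitor.sh restart
EOF
)

# Example:
#   print_ip_change_plan 192.168.7.249
```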

I. Check the status of the cluster's primary and standby nodes

1. Node status

[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node248 | standby |   running | node249  | default  | 100      | 4        | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | primary | * running |          | default  | 100      | 4        | host=192.168.7.239 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

2. Review repmgr.conf and kingbase.auto.conf (primary and standby)

repmgr.conf records the IP information as follows:

[kingbase@node1 etc]$ cat repmgr.conf 
on_bmj=off
node_id=1
node_name='node248'
promote_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf'
follow_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr  standby follow  -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf -W --upstream-node-id=%n'
conninfo='host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'
.......
monitoring_history='no'
trusted_servers='192.168.7.1'
virtual_ip='192.168.7.237/24'
.......

kingbase.auto.conf records the IP information as follows:

[kingbase@node1 data]$ cat kingbase.auto.conf
......
primary_conninfo = 'user=system connect_timeout=10 host=192.168.7.239 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node101'
......
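To see at a glance which address each file points to, the host= value can be pulled out of the conninfo and primary_conninfo lines. This is a minimal sketch using sed; the function name conninfo_hosts is hypothetical, and it assumes one host= entry per line, as in the files shown above.

```shell
# Hypothetical helper: print the host= value from every conninfo or
# primary_conninfo line of a configuration file.
# Usage: conninfo_hosts FILE
conninfo_hosts() {
    # Match lines whose key is "conninfo" or "primary_conninfo", then
    # capture the token after "host=" up to the next space or quote.
    sed -n -E "s/^[[:space:]]*(primary_)?conninfo[[:space:]]*=.*host=([^ ']+).*/\2/p" "$1"
}

# Example (paths relative to the cluster install directory):
#   conninfo_hosts etc/repmgr.conf
#   conninfo_hosts data/kingbase.auto.conf
```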

3. Check the system IP information

Primary:

[kingbase@node2 ~]$ cat /etc/hosts
192.168.7.238   node1   # standby
192.168.7.239   node2   # primary


2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
   link/ether 08:00:27:48:34:53 brd ff:ff:ff:ff:ff:ff
   inet 192.168.7.239/24 brd 192.168.7.255 scope global enp0s3
      valid_lft forever preferred_lft forever

Standby:

[kingbase@node1 bin]$ cat /etc/hosts
192.168.7.238   node1   # standby
192.168.7.239   node2   # primary

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
   link/ether 08:00:27:73:47:f6 brd ff:ff:ff:ff:ff:ff
   inet 192.168.7.238/24 brd 192.168.7.255 scope global enp0s3
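When the same check has to be repeated on several nodes, the address can be extracted from the interface listing instead of being read by eye. A minimal sketch, assuming the listing has the `ip addr` layout shown above; the function name ipv4_addrs is hypothetical.

```shell
# Hypothetical helper: read `ip addr` style output on stdin and print
# each IPv4 address with its prefix length, for example 192.168.7.238/24.
ipv4_addrs() {
    # IPv4 lines have "inet" as the first field and the address as the second.
    awk '$1 == "inet" { print $2 }'
}

# Example (enp0s3 is the interface name used in this environment):
#   ip -4 addr show enp0s3 | ipv4_addrs
```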

II. Change the system IPs and apply the new physical IPs
1. Stop the cluster (cluster and db)

[kingbase@node2 bin]$ ./sys_monitor.sh stop
2021-03-01 12:19:00 Ready to stop all DB ...
Service process "node_export" was killed at process 4629
.......
2021-03-01 12:19:11 Done.

2. Change the system IPs
On both the primary and the standby:

[root@node2 ~]# cat /etc/hosts
192.168.7.248   node1
192.168.7.249   node2

Primary:

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:48:34:53 brd ff:ff:ff:ff:ff:ff
    inet 192.168.7.249/24 brd 192.168.7.255 scope global enp0s3
       valid_lft forever preferred_lft forever

Standby:

2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:73:47:f6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.7.248/24 brd 192.168.7.255 scope global enp0s3
       valid_lft forever preferred_lft forever
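The /etc/hosts edit can also be scripted. The sketch below rewrites the address of one host name in a hosts-style file; the function name set_host_ip is hypothetical, it only matches the name when it is the second field on the line, and rewriting the line collapses the column spacing to single spaces. Try it on a copy of the file first.

```shell
# Hypothetical helper: set the address of one host name in a hosts-style file.
# Usage: set_host_ip FILE HOSTNAME NEW_IP
set_host_ip() (
    file=$1
    name=$2
    new_ip=$3
    tmp=$(mktemp) || return 1
    if awk -v name="$name" -v ip="$new_ip" '
            # Only touch non-comment lines whose second field is the host name.
            $1 !~ /^#/ && $2 == name { $1 = ip }
            { print }
        ' "$file" > "$tmp"
    then
        # Write back through the existing file so its owner and mode are kept.
        cat "$tmp" > "$file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
)

# Example (run as root on every node):
#   set_host_ip /etc/hosts node1 192.168.7.248
#   set_host_ip /etc/hosts node2 192.168.7.249
```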

III. Edit repmgr.conf and kingbase.auto.conf (on every node, primary and standby)

# Update the node IPs and the VIP in the configuration files

Primary:
[kingbase@node1 etc]$ cat repmgr.conf 
on_bmj=off
node_id=2
node_name='node249'
promote_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf'
follow_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr  standby follow  -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf -W --upstream-node-id=%n'
conninfo='host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'
......
monitoring_history='no'
trusted_servers='192.168.7.1'
virtual_ip='192.168.7.240/24'
.......

Standby:
[kingbase@node2 etc]$ cat repmgr.conf
on_bmj=off
node_id=1
node_name='node248'
promote_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf'
follow_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr  standby follow  -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf -W --upstream-node-id=%n'
conninfo='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'
.......
trusted_servers='192.168.7.1'
virtual_ip='192.168.7.240/24'
.......

Primary:
[kingbase@node1 data]$ cat kingbase.auto.conf
.......
primary_conninfo = 'user=system connect_timeout=10 host=192.168.7.248 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node1'
......

Standby:
[kingbase@node2 data]$ cat kingbase.auto.conf
.......
primary_conninfo = 'user=system connect_timeout=10 host=192.168.7.249 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node2'
......
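Since each old address maps to exactly one new address (238 to 248, 239 to 249, and the VIP 237 to 240), the edits can be done as literal replacements. The following is a sketch with sed, not a vendor-provided tool; the function name replace_ip is hypothetical, and it expects bare addresses without a /prefix suffix.

```shell
# Hypothetical helper: replace one IP address with another in the given
# files, leaving a FILE.bak copy of each file as it was before the run.
# Usage: replace_ip OLD_IP NEW_IP FILE...
replace_ip() (
    old=$1
    new=$2
    shift 2
    # Escape the dots so that sed matches them literally.
    old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
    for f in "$@"; do
        # The guards on both sides stop 192.168.7.23 from matching
        # inside 192.168.7.238.
        sed -i.bak -E "s/(^|[^0-9.])${old_re}([^0-9.]|\$)/\1${new}\2/g" "$f" || return 1
    done
)

# Example (paths relative to the cluster install directory, on every node):
#   replace_ip 192.168.7.238 192.168.7.248 etc/repmgr.conf data/kingbase.auto.conf
#   replace_ip 192.168.7.239 192.168.7.249 etc/repmgr.conf data/kingbase.auto.conf
#   replace_ip 192.168.7.237 192.168.7.240 etc/repmgr.conf
```

Each run overwrites the .bak file, so copy the original configuration elsewhere first if a full rollback point is needed.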

IV. Start the database service on the primary and the standby

[kingbase@node2 bin]$ ./sys_ctl start -D ../data
.......
server started

V. Register the primary with the cluster

1. Register the primary

[kingbase@node2 bin]$ ./repmgr primary register -F
INFO: connecting to primary database...
INFO: "repmgr" extension is already installed
NOTICE: PING 192.168.7.240 (192.168.7.240) 56(84) bytes of data.

--- 192.168.7.240 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1000ms


WARNING: ping host"192.168.7.240" failed
DETAIL: average RTT value is not greater than zero
NOTICE: node (ID: 2) acquire the virtual ip 192.168.7.240/24 success
NOTICE: primary node record (ID: 2) updated

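2. Check the cluster node status

As shown below, the primary node is running normally. The standby still appears as unreachable because its record carries the old address (192.168.7.238) until it is registered again:

[kingbase@node2 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+---------------+----------+----------+----------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------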
 1  | node248 | standby | ? unreachable | node249  | default  | 100      | ?        | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | primary | * running     |          | default  | 100      | 4        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected
  - unable to connect to node "node248" (ID: 1)
  - node "node248" (ID: 1) is registered as an active standby but is unreachable
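For scripted checks, the status column can be parsed rather than read by eye. This is a sketch that assumes the table layout shown above (node name in the second column, status in the fourth); the function name nodes_not_running is hypothetical.

```shell
# Hypothetical helper: read `repmgr cluster show` output on stdin and print
# the name of every node whose status is anything other than "running".
nodes_not_running() {
    awk -F'|' '
        # Data rows start with a numeric node ID in the first column.
        $1 ~ /^ *[0-9]+ *$/ {
            name = $2
            status = $4
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            # Drop the leading marker ("*" or "?") and the surrounding blanks.
            gsub(/^[ \t*?]+|[ \t]+$/, "", status)
            if (status != "running") print name
        }
    '
}

# Example:
#   ./repmgr cluster show | nodes_not_running
```

Applied to the output above, this prints node248 and nothing else.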

VI. Register the standby with the cluster

1. Stop the standby's database service

[kingbase@node1 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped

2. Rejoin the standby node to the cluster

[kingbase@node1 bin]$ ./repmgr node rejoin -h 192.168.7.249 -U esrep -d esrep
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 1/EEFF5AB0, rejoin target lsn is 1/EEFF7E78
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2021-03-01 12:18:21.290629
NOTICE: starting server using "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/sys_ctl  -w -t 90 -D '/home/kingbase/cluster/R6HA/KHA/kingbase/data' -l /home/kingbase/cluster/R6HA/KHA/kingbase/bin/logfile start"
NOTICE: start server finish at 2021-03-01 12:18:21.906449
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[kingbase@node1 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+
 1  | node248 | standby |   running | node249  | default  | 100      | 4        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | primary | * running |          | default  | 100      | 4        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

3. Register the standby with the cluster

[kingbase@node1 bin]$ ./repmgr standby register -h 192.168.7.249 -U esrep -d esrep -F
INFO: connecting to local node "node248" (ID: 1)
INFO: connecting to primary database
INFO: standby registration complete
NOTICE: standby node "node248" (ID: 1) successfully registered

4. On the primary, check the cluster status and the streaming replication status

1) Check the cluster node status

[kingbase@node2 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+
 1  | node248 | standby |   running | node249  | default  | 100      | 4        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | primary | * running |          | default  | 100      | 4        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

2) Check the streaming replication status

test=# select * from sys_stat_replication;
 pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 4939 |    16384 | esrep   | node248          | 192.168.7.248 |                 |       44492 | 2021-03-01 12:17:41.036448+08 |              | streaming | 1/EEFF7FC0 | 1/EEFF7FC0 | 1/EEFF7FC0 | 1/EEFF7FC0 |           |           |            |             1 | quorum     | 2021-03-01 12:20:49.634460+08
(1 row)
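The row is wide, and usually only two columns matter for this check: which standby is connected and whether its state is streaming. The sketch below reduces captured query output to those two columns by looking them up in the header row; the function name replication_states is hypothetical, and how the output is captured into a file or pipe is left open.

```shell
# Hypothetical helper: read the tabular output of
# "select * from sys_stat_replication" on stdin and print
# "application_name state" for every data row.
replication_states() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Until the header row is found, look for the two column names.
        !seen_header {
            name_col = 0
            state_col = 0
            for (i = 1; i <= NF; i++) {
                col = trim($i)
                if (col == "application_name") name_col = i
                if (col == "state") state_col = i
            }
            if (name_col && state_col) seen_header = 1
            next
        }
        # Data rows start with a numeric pid.
        trim($1) ~ /^[0-9]+$/ { print trim($name_col), trim($state_col) }
    '
}

# Example (replication.txt is an assumed file holding the query output):
#   replication_states < replication.txt
```

Looking the columns up by name rather than by position keeps the sketch usable if the view gains or loses columns between versions.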

5. Start the repmgrd service on the primary and the standby

[kingbase@node2 bin]$ ./repmgrd -d
[2021-03-01 12:32:27] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"

VII. Restart the cluster services to verify

1. Restart the cluster with sys_monitor.sh

[kingbase@node2 bin]$ ./sys_monitor.sh restart
2021-03-01 12:20:29 Ready to stop all DB ...
......
[2021-03-01 12:21:02] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"

2021-03-01 12:21:03 repmgrd on "[192.168.7.249]" start success.
 ID | Name    | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+------+---------+--------------------
 1  | node248 | standby |   running | node249  | running | 5135 | no      | 1 second(s) ago    
 2  | node249 | primary | * running |          | running | 6723 | no      | n/a                
2021-03-01 12:21:10 Done.

2. Check the cluster node status

[kingbase@node2 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                                                                                                
----+---------+---------+-----------+----------+----------+----------+----------+--------
 1  | node248 | standby |   running | node249  | default  | 100      | 4        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node249 | primary | * running |          | default  | 100      | 4        | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

3. Check the streaming replication status

[kingbase@node2 bin]$ ./ksql -U system test
ksql (V8.0)
Type "help" for help.

test=# select * from sys_stat_replication;
 pid  | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 6203 |    16384 | esrep   | node248          | 192.168.7.248 |                 |       44499 | 2021-03-01 12:20:49.781155+08 |              | streaming | 1/EEFF9090 | 1/EEFF9090 | 1/EEFF9090 | 1/EEFF9090 |           |           |            |             1 | quorum     | 2021-03-01 12:22:02.817815+08
(1 row)

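VIII. Summary

The cluster's physical IPs and VIP were changed successfully. Changing cluster IPs requires stopping the cluster services (cluster and db), which interrupts normal business operations, so plan the IP addressing before deploying the cluster to avoid a disruptive change later.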

posted @ 2021-06-16 20:09  天涯客1224