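KingbaseES V8R6 Cluster Operations Series: Changing the Cluster's Physical IPs and VIP
Case description:
In a KingbaseES V8R6 cluster, IP addresses are configured in repmgr.conf and kingbase.auto.conf, so changing the cluster's physical IPs or VIP means editing both files. The cluster service must be stopped for the change, so in a production environment plan a downtime window beforehand to avoid disrupting application access.
Environment:
Operating system: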
[kingbase@node1 bin]$ cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core)
Database:
KingbaseES V8R6
I. Check the primary/standby status of the cluster
1. Node status
[kingbase@node1 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------
1 | node248 | standby | running | node249 | default | 100 | 4 | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node249 | primary | * running | | default | 100 | 4 | host=192.168.7.239 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2. Check the repmgr.conf and kingbase.auto.conf configuration files (primary and standby)
[kingbase@node1 etc]$ cat repmgr.conf
on_bmj=off
node_id=1
node_name='node248'
promote_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf'
follow_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr standby follow -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf -W --upstream-node-id=%n'
conninfo='host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10
......
trusted_servers='192.168.7.1'
virtual_ip='192.168.7.237/24'
......
[kingbase@node1 data]$ cat kingbase.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
.....
primary_conninfo = 'user=system connect_timeout=10 host=192.168.7.239 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node101'
......
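Before changing anything, it helps to list every line in the two files that carries an address, so nothing is missed later. The following is a minimal sketch, not part of the original procedure; the function name `list_ip_refs` is hypothetical, and the pattern only matches text shaped like an IPv4 address without validating it.

```shell
#!/bin/sh
# Hypothetical helper: print every line (with file name and line number)
# that contains something shaped like an IPv4 address.
list_ip_refs() {
    # /dev/null is passed as an extra file so that grep always prefixes
    # matches with the file name, even when only one file is given.
    grep -nE '([0-9]{1,3}\.){3}[0-9]{1,3}' /dev/null "$@"
}

# Demo on a throwaway file holding three lines from this case.
demo=$(mktemp)
cat > "$demo" <<'EOF'
node_id=1
trusted_servers='192.168.7.1'
virtual_ip='192.168.7.237/24'
EOF
list_ip_refs "$demo"
rm -f "$demo"
```

From the bin directory used in this case, it could be run as `list_ip_refs ../etc/repmgr.conf ../data/kingbase.auto.conf` on each node.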
3. Check the system IP information
Primary:
[kingbase@node2 ~]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.7.238 node1
192.168.7.239 node2
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 08:00:27:48:34:53 brd ff:ff:ff:ff:ff:ff
inet 192.168.7.239/24 brd 192.168.7.255 scope global enp0s3
valid_lft forever preferred_lft forever
inet 192.168.7.237/24 scope global secondary enp0s3:3
valid_lft forever preferred_lft forever
Standby:
[kingbase@node1 bin]$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.7.238 node1
192.168.7.239 node2
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 08:00:27:73:47:f6 brd ff:ff:ff:ff:ff:ff
inet 192.168.7.238/24 brd 192.168.7.255 scope global enp0s3
II. Change the system IP and apply the new physical IP
1. Stop the cluster (cluster and db)
[kingbase@node2 bin]$ ./sys_monitor.sh stop
2021-03-01 12:19:00 Ready to stop all DB ...
Service process "node_export" was killed at process 4629
.......
2021-03-01 12:19:11 Done.
2. Change the system IP
On both the primary and the standby:
[root@node2 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.7.248 node1
192.168.7.249 node2
Primary:
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 08:00:27:48:34:53 brd ff:ff:ff:ff:ff:ff
inet 192.168.7.249/24 brd 192.168.7.255 scope global enp0s3
valid_lft forever preferred_lft forever
Standby:
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 08:00:27:73:47:f6 brd ff:ff:ff:ff:ff:ff
inet 192.168.7.248/24 brd 192.168.7.255 scope global enp0s3
valid_lft forever preferred_lft forever
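After editing /etc/hosts on both nodes, a quick lookup confirms that each node name now resolves to its new address. This is a small sketch under the assumption that the file uses the usual "address name [alias...]" layout; `hosts_lookup` is a hypothetical name, and the demo runs against a sample file rather than the real /etc/hosts.

```shell
#!/bin/sh
# Hypothetical helper: print the address that a hosts-format file maps to a name.
# $1 is the host name, $2 the file to read (defaults to /etc/hosts).
hosts_lookup() {
    name="$1"
    file="${2:-/etc/hosts}"
    awk -v n="$name" '
        /^[ \t]*#/ { next }                 # skip comment lines
        { for (i = 2; i <= NF; i++)         # field 1 is the address, the rest are names
              if ($i == n) { print $1; exit } }
    ' "$file"
}

# Demo against a sample file holding the new entries from this case.
sample=$(mktemp)
printf '%s\n' '127.0.0.1 localhost' '192.168.7.248 node1' '192.168.7.249 node2' > "$sample"
for h in node1 node2; do
    echo "$h -> $(hosts_lookup "$h" "$sample")"
done
rm -f "$sample"
```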
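III. Edit the repmgr.conf and kingbase.auto.conf configuration files (on every node, primary and standby)
--- Update the node IPs and the VIP in the configuration files
Standby: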
[kingbase@node1 etc]$ cat repmgr.conf
on_bmj=off
node_id=1
node_name='node248'
promote_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf'
follow_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr standby follow -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf -W --upstream-node-id=%n'
conninfo='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'
......
trusted_servers='192.168.7.1'
virtual_ip='192.168.7.240/24'
net_device='enp0s3'
......
Primary:
[kingbase@node2 etc]$ cat repmgr.conf
on_bmj=off
node_id=2
node_name='node249'
promote_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf'
follow_command='/home/kingbase/cluster/R6HA/KHA/kingbase/bin/repmgr standby follow -f /home/kingbase/cluster/R6HA/KHA/kingbase/etc/repmgr.conf -W --upstream-node-id=%n'
conninfo='host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'
......
trusted_servers='192.168.7.1'
virtual_ip='192.168.7.240/24'
net_device='enp0s3'
......
Standby:
[kingbase@node1 data]$ cat kingbase.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
......
primary_conninfo = 'user=system connect_timeout=10 host=192.168.7.249 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node101'
......
Primary:
[kingbase@node2 data]$ cat kingbase.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
......
primary_conninfo = 'user=system connect_timeout=10 host=192.168.7.248 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node101'
......
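The edits in this section amount to replacing each old address with its new one in both files on every node. Below is a minimal sketch of doing that with sed, assuming the mapping used in this case (192.168.7.238 to 192.168.7.248, 192.168.7.239 to 192.168.7.249, VIP 192.168.7.237 to 192.168.7.240). The function name `rewrite_cluster_ips` is hypothetical; it keeps the original as a `.bak` copy and should only be run while the cluster is stopped.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the old cluster addresses to the new ones in
# one config file, keeping the original as <file>.bak.
rewrite_cluster_ips() {
    conf="$1"
    cp -p "$conf" "$conf.bak" || return 1
    # Mapping taken from this case; dots are escaped so they match literally.
    sed -e 's/192\.168\.7\.238/192.168.7.248/g' \
        -e 's/192\.168\.7\.239/192.168.7.249/g' \
        -e 's/192\.168\.7\.237/192.168.7.240/g' \
        "$conf.bak" > "$conf"
}

# Demo on a throwaway file; trusted_servers is left untouched on purpose.
demo=$(mktemp)
cat > "$demo" <<'EOF'
conninfo='host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10'
trusted_servers='192.168.7.1'
virtual_ip='192.168.7.237/24'
EOF
rewrite_cluster_ips "$demo"
cat "$demo"
rm -f "$demo" "$demo.bak"
```

Comparing each file with its `.bak` copy (for example with `diff`) before starting the database is a cheap way to confirm that only the intended lines changed.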
IV. Start the database service on the primary and standby
[kingbase@node2 bin]$ ./sys_ctl start -D ../data
.......
server started
V. Register the primary with the cluster
1. Register the primary with the cluster
[kingbase@node2 bin]$ ./repmgr primary register -F
INFO: connecting to primary database...
INFO: "repmgr" extension is already installed
NOTICE: PING 192.168.7.240 (192.168.7.240) 56(84) bytes of data.
--- 192.168.7.240 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1000ms
WARNING: ping host"192.168.7.240" failed
DETAIL: average RTT value is not greater than zero
NOTICE: node (ID: 2) acquire the virtual ip 192.168.7.240/24 success
NOTICE: primary node record (ID: 2) updated
2. Check the cluster node status
[kingbase@node2 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+---------------+----------+----------+----------+-------
1 | node248 | standby | ? unreachable | node249 | default | 100 | ? | host=192.168.7.238 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node249 | primary | * running | | default | 100 | 4 | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
WARNING: following issues were detected
- unable to connect to node "node248" (ID: 1)
- node "node248" (ID: 1) is registered as an active standby but is unreachable
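At this point the standby still shows its old address and is reported as unreachable, which is expected until it is rejoined. When scripting the procedure, that state can be detected from the `repmgr cluster show` output. The sketch below is an illustration only: `unreachable_nodes` is a hypothetical name, it assumes the pipe-separated layout shown above with Status as the fourth column, and the demo rows are abridged from this case.

```shell
#!/bin/sh
# Hypothetical helper: read `repmgr cluster show` output on stdin and print
# the name of every node whose Status column contains "unreachable".
unreachable_nodes() {
    awk -F'|' '
        $4 ~ /unreachable/ {
            name = $2
            gsub(/[ \t]/, "", name)   # trim the column padding
            print name
        }
    '
}

# Demo with rows abridged from the output above.
unreachable_nodes <<'EOF'
 ID | Name    | Role    | Status        | Upstream | Location | Priority | Timeline
 1  | node248 | standby | ? unreachable | node249  | default  | 100      | ?
 2  | node249 | primary | * running     |          | default  | 100      | 4
WARNING: following issues were detected
  - node "node248" (ID: 1) is registered as an active standby but is unreachable
EOF
```

In practice the function would be fed directly, as in `./repmgr cluster show | unreachable_nodes`; the WARNING lines are ignored because they contain no column separator.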
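VI. Register the standby with the cluster
1. Registering the standby directly fails with a connection error (the standby's database service must be stopped first)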
[kingbase@node1 bin]$ ./repmgr standby register -h 192.168.7.249 -d esrep -U esrep -W -F
WARNING: following problems with command line parameters detected:
--no-wait will be ignored when executing STANDBY REGISTER
INFO: connecting to local node "node248" (ID: 1)
WARNING: database connection parameters not required when the standby to be registered is running
DETAIL: repmgr uses the "conninfo" parameter in "repmgr.conf" to connect to the standby
INFO: connecting to primary database
ERROR: connection to database failed
DETAIL:
could not connect to server: No route to host
Is the server running on host "192.168.7.239" and accepting
TCP/IP connections on port 54321?
DETAIL: attempted to connect using:
user=esrep connect_timeout=10 dbname=esrep host=192.168.7.239 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr
ERROR: connection to database failed
DETAIL:
could not connect to server: No route to host
Is the server running on host "192.168.7.238" and accepting
TCP/IP connections on port 54321?
DETAIL: attempted to connect using:
user=esrep connect_timeout=10 dbname=esrep host=192.168.7.238 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr
ERROR: unable to connect to the primary database
HINT: a primary node must be configured before registering a standby node
2. Stop the standby's database service, then rejoin the standby node to the cluster
1) Stop the database service
[kingbase@node1 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped
2) Rejoin the standby node to the cluster
[kingbase@node1 bin]$ ./repmgr node rejoin -h 192.168.7.249 -U esrep -d esrep
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 1/EEFF5AB0, rejoin target lsn is 1/EEFF7E78
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2021-03-01 12:18:21.290629
NOTICE: starting server using "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6HA/KHA/kingbase/data' -l /home/kingbase/cluster/R6HA/KHA/kingbase/bin/logfile start"
NOTICE: start server finish at 2021-03-01 12:18:21.906449
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[kingbase@node1 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+
1 | node248 | standby | running | node249 | default | 100 | 4 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node249 | primary | * running | | default | 100 | 4 | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
3) Register the standby with the cluster
[kingbase@node1 bin]$ ./repmgr standby register -h 192.168.7.249 -U esrep -d esrep -F
INFO: connecting to local node "node248" (ID: 1)
INFO: connecting to primary database
INFO: standby registration complete
NOTICE: standby node "node248" (ID: 1) successfully registered
3. On the primary, check the cluster status and the streaming replication status
1) Check the cluster node status
[kingbase@node2 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+
1 | node248 | standby | running | node249 | default | 100 | 4 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node249 | primary | * running | | default | 100 | 4 | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2) Check the streaming replication status
test=# select * from sys_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
------+----------+---------+------------------+---------------+-----------------+
4939 | 16384 | esrep | node248 | 192.168.7.248 | | 44492 | 2021-03-01 12:17:41.036448+08 | | streaming | 1/EEFF7FC0 | 1/EEFF7FC0 | 1/EEFF7FC0 | 1/EEFF7FC0 | | | | 1 | quorum | 2021-03-01 12:20:49.634460+08
(1 row)
4. Start the repmgrd service on the primary and standby
[kingbase@node2 bin]$ ./repmgrd -d
[2021-03-01 12:32:27] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
VII. Restart the cluster service to verify
1. Restart the cluster with sys_monitor.sh
[kingbase@node2 bin]$ ./sys_monitor.sh restart
2021-03-01 12:20:29 Ready to stop all DB ...
......
[2021-03-01 12:21:02] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"
2021-03-01 12:21:03 repmgrd on "[192.168.7.249]" start success.
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+------+---------+--------------------
1 | node248 | standby | running | node249 | running | 5135 | no | 1 second(s) ago
2 | node249 | primary | * running | | running | 6723 | no | n/a
2021-03-01 12:21:10 Done.
2. Check the cluster node status
[kingbase@node2 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+--------
1 | node248 | standby | running | node249 | default | 100 | 4 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node249 | primary | * running | | default | 100 | 4 | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
3. Check the streaming replication status
[kingbase@node2 bin]$ ./ksql -U system test
ksql (V8.0)
Type "help" for help.
test=# select * from sys_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
------+----------+---------+------------------+---------------+-----------------+--------
6203 | 16384 | esrep | node248 | 192.168.7.248 | | 44499 | 2021-03-01 12:20:49.781155+08 | | streaming | 1/EEFF9090 | 1/EEFF9090 | 1/EEFF9090 | 1/EEFF9090 | | | | 1 | quorum | 2021-03-01 12:22:02.817815+08
(1 row)
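The check above passes when every standby row reports `streaming` in the state column. As a rough sketch, the same check can be applied to captured query output. `check_streaming` is a hypothetical name, and it assumes the unwrapped, pipe-separated layout shown above, with application_name in column 4 and state in column 10.

```shell
#!/bin/sh
# Hypothetical helper: read the unwrapped output of
#   select * from sys_stat_replication;
# on stdin and print "application_name state" for each row that is not streaming.
# Exit status is 0 only when at least one row was seen and all were streaming.
check_streaming() {
    awk -F'|' '
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {       # data rows start with a numeric pid
            rows++
            app = $4; state = $10
            gsub(/[ \t]/, "", app); gsub(/[ \t]/, "", state)
            if (state != "streaming") { bad++; print app, state }
        }
        END { if (rows > 0 && bad == 0) exit 0; exit 1 }
    '
}

# Demo with the row from this case.
check_streaming <<'EOF' && echo "all standbys streaming"
 pid  | usesysid | usename | application_name | client_addr   | client_hostname | client_port | backend_start | backend_xmin | state
 6203 | 16384 | esrep | node248 | 192.168.7.248 | | 44499 | 2021-03-01 12:20:49.781155+08 | | streaming | 1/EEFF9090 | 1/EEFF9090 | 1/EEFF9090 | 1/EEFF9090 | | | | 1 | quorum | 2021-03-01 12:22:02.817815+08
(1 row)
EOF
```

An empty result set also fails the check, since a primary with no attached standby is not a healthy state for this cluster.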
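VIII. Summary
Changing the cluster's physical IPs and VIP requires stopping the cluster service (cluster and db), which interrupts normal business operation. Plan the IP addresses carefully before deploying the cluster so that no later change is needed.
Summary of steps:
1. Identify the primary and standby, then stop the cluster (cluster and db) service.
2. Change the system IP and the IPs in /etc/hosts.
3. Update the physical IP and VIP in repmgr.conf and kingbase.auto.conf on the primary and standby.
4. Restart the system network service to apply the new physical IP.
5. Start the database service on the primary and standby.
6. Register the primary with the cluster.
7. Stop the standby's database service, rejoin the standby node to the cluster, and register the standby with the cluster.
8. Check the cluster service status (cluster and db) and start the repmgrd service on the primary and standby.
9. Restart the cluster (sys_monitor.sh) service to verify.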
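The command sequence behind these steps can be collected into a dry-run checklist. The sketch below only prints the commands used in this case and executes nothing. `print_runbook` is a hypothetical name, its single argument is the primary's new address, and the OS-level IP change and config edits are left as a reminder line because they depend on the environment.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the commands from this case in order.
# Nothing is executed; $1 is the new address of the primary.
print_runbook() {
    primary_ip="$1"
    cat <<EOF
# on the primary, before the change
./sys_monitor.sh stop
# (change the system IP and /etc/hosts, then edit repmgr.conf and kingbase.auto.conf on every node)
# on each node
./sys_ctl start -D ../data
# on the primary
./repmgr primary register -F
# on the standby
./sys_ctl stop -D ../data
./repmgr node rejoin -h $primary_ip -U esrep -d esrep
./repmgr standby register -h $primary_ip -U esrep -d esrep -F
# on each node
./repmgrd -d
# on the primary, to verify
./sys_monitor.sh restart
./repmgr cluster show
EOF
}

print_runbook 192.168.7.249
```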
KINGBASE Research Institute