KingbaseES V8R6运维案例之---firewalld防火墙启动对failover切换的影响
案例说明:
在KingbaseES V8R6C7的版本中,repmgr.conf增加了connect_check_type参数可以指定参数值,默认值是‘mix’,以前版本为隐藏参数,默认值‘ping‘。connect_check_type参数可以配置为:
监控数据库connect_check_type设置:
connection_check _type:检查上游数据库状态的方法。
ping,使用PQping()检查;
connection,建立数据库连接;
query,使用已有连接执行SELECT TRUE进行检查;
mix,平常使用PQping()进行检查,如果多次失败,最后一次使用已有连接执行 SELECT TRUE进行检查,否则一切正常。
可选参数:ping, connection, query, mix;默认为mix。
适用版本:
KingbaseES V8R6
集群架构:
[kingbase@node202 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+--------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 13 | | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
2 | node2 | standby | running | node1 | default | 100 | 13 | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10
一、集群节点启动防火墙
如下所示,在主库节点启动防火墙:
[root@node201 ~]# systemctl start firewalld
[root@node201 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: active (running) since Wed 2023-11-29 14:08:44 CST; 4s ago
Main PID: 18450 (firewalld)
CGroup: /system.slice/firewalld.service
└─18450 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
Nov 29 14:08:44 node202 systemd[1]: Starting firewalld - dynamic firewall daemon...
Nov 29 14:08:44 node202 systemd[1]: Started firewalld - dynamic firewall daemon.
[root@node201 ~]# firewall-cmd --list-all
public (default, active)
interfaces: enp0s3 enp0s8
sources:
services: dhcpv6-client ssh
ports:
masquerade: no
forward-ports:
icmp-blocks:
rich rules:
如下图所示,启动防火墙后,默认备库节点无法远程访问主库数据库服务及securecmdd连接:
如下所示,防火墙禁止到主库系统ping连接:
[root@node201 ~]# firewall-cmd --add-rich-rule='rule protocol value=icmp drop' --permanent
success
[root@node201 ~]# firewall-cmd --complete-reload
success
[root@node201 ~]# firewall-cmd --list-all
public (default, active)
interfaces: enp0s3 enp0s8
sources:
services: dhcpv6-client ssh
ports:
masquerade: no
forward-ports:
icmp-blocks:
rich rules:
rule protocol value="icmp" drop
如下图所示,备库无法ping通主库主机但可以ping通网关:
二、配置connection_check_type参数测试
- 案例一:
1、connection_check_type='mix'
[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep connection_check
connection_check_type='mix'
节点之间通过securecmdd远程连接:
[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep scmd_
scmd_options='-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2
-o ServerAliveCountMax=5 -p 8890'
2、集群状态(启动防火墙后)
如下所示,启动防火墙后,集群之间的通讯出现异常:
[kingbase@node202 bin]$ ./repmgr cluster show
----+-------+---------+---------------+----------+----------+----------+----------+---------+--------------------------------------------------------------------------
1 | node1 | primary | ? unreachable | ? | default | 100 | | | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
2 | node2 | standby | running | ? node1 | default | 100 | 13 | ? | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10
[WARNING] following issues were detected
- unable to connect to node "node1" (ID: 1)
- node "node1" (ID: 1) is registered as an active primary but is unreachable
- unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
- unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)
[HINT] execute with --verbose option to see connection error messages
3、查看备库hamgr.log
如下图所示,备库在检测到PQping()主库异常超过阈值10次后,重新开始检测,没有执行failover切换操作:
[2023-11-29 14:13:07] [INFO] checking state of node "node1" (ID: 1), 10 of 10 attempts
[2023-11-29 14:13:07] [DEBUG] connection_ping(): result is PGRES_TUPLES_OK
[2023-11-29 14:13:07] [INFO] the database is alive by original connection
[2023-11-29 14:13:07] [NOTICE] reconnected to upstream node after 54 seconds
[2023-11-29 14:13:07] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2023-11-29 14:13:07] [DEBUG] _create_event(): event is "repmgrd_upstream_reconnect" for node 2
[2023-11-29 14:13:07] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2023-11-29 14:13:07] [DEBUG] _create_event():
INSERT INTO repmgr.events ( node_id, event, successful, details ) VALUES ($1, $2, $3, $4) RETURNING event_timestamp
[2023-11-29 14:13:07] [DEBUG] _create_event(): Event timestamp is "2023-11-29 14:13:07.971921+08"
[2023-11-29 14:13:07] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (ID: 1) in normal state
[2023-11-29 14:13:07] [DEBUG] connection_ping(): result is PGRES_TUPLES_OK
[2023-11-29 14:13:08] [DEBUG] is_server_available(): ping status for "host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10" is PQPING_OK
[2023-11-29 14:13:08] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2023-11-29 14:13:08] [DEBUG] get_slot_record():
SELECT slot_name, slot_type, active FROM pg_catalog.pg_replication_slots WHERE slot_name = 'repmgr_slot_2'
[2023-11-29 14:13:08] [DEBUG] is_wal_all_recevied():
SELECT pg_current_wal_flush_lsn() as primary_lsn, replay_lsn as local_lsn, pg_wal_lsn_diff(primary_lsn, local_lsn) <= 16777216 as catched_up from pg_stat_replication where application_name = 'node2' and state = 'streaming'
[2023-11-29 14:13:08] [DEBUG] is_rep_sync_streaming():
SELECT state,sync_state FROM pg_stat_replication where application_name = 'node2'
[2023-11-29 14:13:08] [DEBUG] get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached, n.primary_seen, n.lsn FROM repmgr.nodes n WHERE n.node_id = 2
[2023-11-29 14:13:08] [DEBUG] sleeping 2 seconds (parameter "monitor_interval_secs")
[2023-11-29 14:13:10] [DEBUG] checking host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
[2023-11-29 14:13:10] [DEBUG] connection check type is "mix"
[2023-11-29 14:13:10] [DEBUG] is_server_available(): ping status for "host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10" is PQPING_NO_RESPONSE
[2023-11-29 14:13:10] [WARNING] unable to ping "host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10"
[2023-11-29 14:13:10] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-11-29 14:13:10] [DEBUG] _create_event(): event is "repmgrd_upstream_disconnect" for node 2
[2023-11-29 14:13:10] [DEBUG] _create_event(): Event timestamp is "2023-11-29 14:13:10+0800"
[2023-11-29 14:13:10] [WARNING] unable to connect to upstream node "node1" (ID: 1)
[2023-11-29 14:13:10] [INFO] checking state of node "node1" (ID: 1), 1 of 10 attempts
[2023-11-29 14:13:10] [DEBUG] is_server_available_params(): ping status for "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 fallback_application_name=repmgr" is PQPING_NO_RESPONSE
[2023-11-29 14:13:10] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 fallback_application_name=repmgr"
[2023-11-29 14:13:10] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-11-29 14:13:10] [INFO] sleeping up to 6 seconds until next reconnection attempt
tion attempt
......
如上所示,在配置connection_check_type='mix'后,启动防火墙后,没有触发集群的failover切换。
- 案例二:
1、connection_check_type='mix'
[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep connection_check
connection_check_type='connection'
2、主库数据库服务down
3、查看备库hamgr.log
如下所示,主库数据库服务down后,备库触发了failover的主备切换:
[2023-11-29 14:35:45] [INFO] checking state of node "node1" (ID: 1), 10 of 10 attempts
[2023-11-29 14:35:45] [DEBUG] is_server_available_params(): ping status for "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 fallback_application_name=repmgr" is PQPING_NO_RESPONSE
[2023-11-29 14:35:45] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 fallback_application_name=repmgr"
[2023-11-29 14:35:45] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-11-29 14:35:45] [WARNING] unable to reconnect to node "node1" (ID: 1) after 10 attempts
........
[2023-11-29 14:35:54] [NOTICE] PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
--- 192.168.1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.252/0.256/0.261/0.016 ms
[2023-11-29 14:35:54] [NOTICE] successfully ping one or more of the trusted_servers "192.168.1.1"
[2023-11-29 14:35:54] [INFO] promote_command is:
"/home/kingbase/cluster/R6/R6HA/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6/R6HA/kingbase/etc/repmgr.conf"
[2023-11-29 14:35:54] [DEBUG] test_ssh_connection(): executing /home/kingbase/cluster/R6/R6HA/kingbase/bin/sys_securecmd -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 8890 192.168.1.201 /bin/true 2>/dev/null
........
[2023-11-29 14:35:54] [DEBUG] test_ssh_connection(): executing /home/kingbase/cluster/R6/R6HA/kingbase/bin/sys_securecmd -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 8890 192.168.1.201 /usr/bin/true 2>/dev/null
[NOTICE] promoting standby to primary
[DETAIL] promoting server "node2" (ID: 2) using sys_promote()
[NOTICE] waiting for promotion to complete, replay lsn: 2/2E000858
[2023-11-29 14:35:54] [WARNING] unable to connect to remote host "192.168.1.201" via ES
[2023-11-29 14:35:54] [INFO] unable to connect via ES to host "192.168.1.201", skip stop old primary db
[INFO] SET synchronous TO "async" on primary host
[DEBUG] setting node 2 as primary and marking existing primary as failed
[NOTICE] STANDBY PROMOTE successful
[DETAIL] server "node2" (ID: 2) was successfully promoted to primary
4、切换后集群状态
如下所示,node2被切换为primary:
[kingbase@node202 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+--------------------------------------------------------------------------
1 | node1 | primary | - failed | ? | default | 100 | | | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
2 | node2 | primary | * running | | default | 100 | 14 | | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10
[WARNING] following issues were detected
- unable to connect to node "node1" (ID: 1)
[HINT] execute with --verbose option to see connection error messages
如上所示,在配置connection_check_type=‘connection'后,启动防火墙后,触发集群的failover切换。
三、总结
在生产环境,在集群部署上线前应该配置好对应的防火墙规则,对于误启动防火墙后,需要触发failover切换的环境,可以配置connection_check_type='connection'。
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 全程不用写代码,我用AI程序员写了一个飞机大战
· MongoDB 8.0这个新功能碉堡了,比商业数据库还牛
· 记一次.NET内存居高不下排查解决与启示
· 白话解读 Dapr 1.15:你的「微服务管家」又秀新绝活了
· DeepSeek 开源周回顾「GitHub 热点速览」