KingbaseES V8R6运维案例之---firewalld防火墙启动对failover切换的影响

案例说明:
在KingbaseES V8R6C7的版本中,repmgr.conf增加了connect_check_type参数可以指定参数值,默认值是‘mix’,以前版本为隐藏参数,默认值‘ping‘。connect_check_type参数可以配置为:

监控数据库connect_check_type设置:

connection_check _type:检查上游数据库状态的方法。
ping,使用PQping()检查;
connection,建立数据库连接;
query,使用已有连接执行SELECT TRUE进行检查;
mix,平常使用PQping()进行检查,如果多次失败,最后一次使用已有连接执行 SELECT TRUE进行检查,否则一切正常。

可选参数:ping, connection, query, mix;默认为mix。

适用版本:
KingbaseES V8R6

集群架构:

[kingbase@node202 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string     
----+-------+---------+-----------+----------+----------+----------+----------+---------+--------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 13       |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
 2  | node2 | standby |   running | node1    | default  | 100      | 13       | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10

一、集群节点启动防火墙

如下所示,在主库节点启动防火墙:

[root@node201 ~]# systemctl start firewalld
[root@node201 ~]# systemctl status firewalld
 firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2023-11-29 14:08:44 CST; 4s ago
 Main PID: 18450 (firewalld)
   CGroup: /system.slice/firewalld.service
           └─18450 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid

Nov 29 14:08:44 node202 systemd[1]: Starting firewalld - dynamic firewall daemon...
Nov 29 14:08:44 node202 systemd[1]: Started firewalld - dynamic firewall daemon.
[root@node201 ~]# firewall-cmd --list-all
public (default, active)
  interfaces: enp0s3 enp0s8
  sources:
  services: dhcpv6-client ssh
  ports:
  masquerade: no
  forward-ports:
  icmp-blocks:
  rich rules:

如下图所示,启动防火墙后,默认备库节点无法远程访问主库数据库服务及securecmdd连接:

如下所示,防火墙禁止到主库系统ping连接:

[root@node201 ~]# firewall-cmd --add-rich-rule='rule protocol value=icmp drop' --permanent
success

[root@node201 ~]# firewall-cmd --complete-reload
success

[root@node201 ~]# firewall-cmd --list-all
public (default, active)
  interfaces: enp0s3 enp0s8
  sources:
  services: dhcpv6-client ssh
  ports:
  masquerade: no
  forward-ports:
  icmp-blocks:
  rich rules:
        rule protocol value="icmp" drop

如下图所示,备库无法ping通主库主机但可以ping通网关:

二、配置connection_check_type参数测试

  • 案例一:

1、connection_check_type='mix'

[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep connection_check
connection_check_type='mix'

节点之间通过securecmdd远程连接:

[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep scmd_
scmd_options='-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 
                 -o ServerAliveCountMax=5 -p 8890'

2、集群状态(启动防火墙后)
如下所示,启动防火墙后,集群之间的通讯出现异常:

[kingbase@node202 bin]$ ./repmgr cluster show

----+-------+---------+---------------+----------+----------+----------+----------+---------+--------------------------------------------------------------------------
 1  | node1 | primary | ? unreachable | ?        | default  | 100      |          |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
 2  | node2 | standby |   running     | ? node1  | default  | 100      | 13       | ?       | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10

[WARNING] following issues were detected
  - unable to connect to node "node1" (ID: 1)
  - node "node1" (ID: 1) is registered as an active primary but is unreachable
  - unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
  - unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)

[HINT] execute with --verbose option to see connection error messages

3、查看备库hamgr.log

如下图所示,备库在检测到PQping()主库异常超过阈值10次后,重新开始检测,没有执行failover切换操作:

[2023-11-29 14:13:07] [INFO] checking state of node "node1" (ID: 1), 10 of 10 attempts
[2023-11-29 14:13:07] [DEBUG] connection_ping(): result is PGRES_TUPLES_OK
[2023-11-29 14:13:07] [INFO] the database is alive by original connection
[2023-11-29 14:13:07] [NOTICE] reconnected to upstream node after 54 seconds
[2023-11-29 14:13:07] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2023-11-29 14:13:07] [DEBUG] _create_event(): event is "repmgrd_upstream_reconnect" for node 2
[2023-11-29 14:13:07] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2023-11-29 14:13:07] [DEBUG] _create_event():
   INSERT INTO repmgr.events (              node_id,              event,              successful,              details             )       VALUES ($1, $2, $3, $4)    RETURNING event_timestamp
[2023-11-29 14:13:07] [DEBUG] _create_event(): Event timestamp is "2023-11-29 14:13:07.971921+08"
[2023-11-29 14:13:07] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (ID: 1) in normal state
[2023-11-29 14:13:07] [DEBUG] connection_ping(): result is PGRES_TUPLES_OK
[2023-11-29 14:13:08] [DEBUG] is_server_available(): ping status for "host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10" is PQPING_OK
[2023-11-29 14:13:08] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2023-11-29 14:13:08] [DEBUG] get_slot_record():
SELECT slot_name, slot_type, active   FROM pg_catalog.pg_replication_slots  WHERE slot_name = 'repmgr_slot_2'
[2023-11-29 14:13:08] [DEBUG] is_wal_all_recevied():
SELECT pg_current_wal_flush_lsn() as primary_lsn, replay_lsn as local_lsn,  pg_wal_lsn_diff(primary_lsn, local_lsn) <= 16777216  as catched_up  from pg_stat_replication where application_name = 'node2' and state = 'streaming'
[2023-11-29 14:13:08] [DEBUG] is_rep_sync_streaming():
SELECT state,sync_state FROM pg_stat_replication where application_name = 'node2'
[2023-11-29 14:13:08] [DEBUG] get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached, n.primary_seen, n.lsn  FROM repmgr.nodes n  WHERE n.node_id = 2
[2023-11-29 14:13:08] [DEBUG] sleeping 2 seconds (parameter "monitor_interval_secs")
[2023-11-29 14:13:10] [DEBUG] checking host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
[2023-11-29 14:13:10] [DEBUG] connection check type is "mix"
[2023-11-29 14:13:10] [DEBUG] is_server_available(): ping status for "host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10" is PQPING_NO_RESPONSE
[2023-11-29 14:13:10] [WARNING] unable to ping "host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10"
[2023-11-29 14:13:10] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-11-29 14:13:10] [DEBUG] _create_event(): event is "repmgrd_upstream_disconnect" for node 2
[2023-11-29 14:13:10] [DEBUG] _create_event(): Event timestamp is "2023-11-29 14:13:10+0800"
[2023-11-29 14:13:10] [WARNING] unable to connect to upstream node "node1" (ID: 1)
[2023-11-29 14:13:10] [INFO] checking state of node "node1" (ID: 1), 1 of 10 attempts
[2023-11-29 14:13:10] [DEBUG] is_server_available_params(): ping status for "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 fallback_application_name=repmgr" is PQPING_NO_RESPONSE
[2023-11-29 14:13:10] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 fallback_application_name=repmgr"
[2023-11-29 14:13:10] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-11-29 14:13:10] [INFO] sleeping up to 6 seconds until next reconnection attempt
tion attempt
......

如上所示,在配置connection_check_type='mix'后,启动防火墙后,没有触发集群的failover切换。

  • 案例二:

1、connection_check_type='mix'

[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep connection_check
connection_check_type='connection'

2、主库数据库服务down
3、查看备库hamgr.log
如下所示,主库数据库服务down后,备库触发了failover的主备切换:

[2023-11-29 14:35:45] [INFO] checking state of node "node1" (ID: 1), 10 of 10 attempts
[2023-11-29 14:35:45] [DEBUG] is_server_available_params(): ping status for "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 fallback_application_name=repmgr" is PQPING_NO_RESPONSE
[2023-11-29 14:35:45] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 fallback_application_name=repmgr"
[2023-11-29 14:35:45] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-11-29 14:35:45] [WARNING] unable to reconnect to node "node1" (ID: 1) after 10 attempts
........
[2023-11-29 14:35:54] [NOTICE] PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.

--- 192.168.1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.252/0.256/0.261/0.016 ms

[2023-11-29 14:35:54] [NOTICE] successfully ping one or more of the trusted_servers "192.168.1.1"
[2023-11-29 14:35:54] [INFO] promote_command is:
  "/home/kingbase/cluster/R6/R6HA/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/R6/R6HA/kingbase/etc/repmgr.conf"
[2023-11-29 14:35:54] [DEBUG] test_ssh_connection(): executing /home/kingbase/cluster/R6/R6HA/kingbase/bin/sys_securecmd -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 8890 192.168.1.201 /bin/true 2>/dev/null
........

[2023-11-29 14:35:54] [DEBUG] test_ssh_connection(): executing /home/kingbase/cluster/R6/R6HA/kingbase/bin/sys_securecmd -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 8890 192.168.1.201 /usr/bin/true 2>/dev/null
[NOTICE] promoting standby to primary
[DETAIL] promoting server "node2" (ID: 2) using sys_promote()
[NOTICE] waiting for promotion to complete, replay lsn: 2/2E000858
[2023-11-29 14:35:54] [WARNING] unable to connect to remote host "192.168.1.201" via ES
[2023-11-29 14:35:54] [INFO] unable to connect via ES to host "192.168.1.201", skip stop old primary db
[INFO] SET synchronous TO "async" on primary host
[DEBUG] setting node 2 as primary and marking existing primary as failed
[NOTICE] STANDBY PROMOTE successful
[DETAIL] server "node2" (ID: 2) was successfully promoted to primary

4、切换后集群状态
如下所示,node2被切换为primary:

[kingbase@node202 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string     
----+-------+---------+-----------+----------+----------+----------+----------+---------+--------------------------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | default  | 100      |          |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
 2  | node2 | primary | * running |          | default  | 100      | 14       |         | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10

[WARNING] following issues were detected
  - unable to connect to node "node1" (ID: 1)

[HINT] execute with --verbose option to see connection error messages

如上所示,在配置connection_check_type=‘connection'后,启动防火墙后,触发集群的failover切换。

三、总结
在生产环境,在集群部署上线前应该配置好对应的防火墙规则,对于误启动防火墙后,需要触发failover切换的环境,可以配置connection_check_type='connection'。

posted @   天涯客1224  阅读(1)  评论(0编辑  收藏  举报
相关博文:
阅读排行:
· 全程不用写代码,我用AI程序员写了一个飞机大战
· MongoDB 8.0这个新功能碉堡了,比商业数据库还牛
· 记一次.NET内存居高不下排查解决与启示
· 白话解读 Dapr 1.15:你的「微服务管家」又秀新绝活了
· DeepSeek 开源周回顾「GitHub 热点速览」
点击右上角即可分享
微信分享提示