随笔- 320 文章- 0 评论- 5 阅读- 34799

KingbaseES V8R6运维案例之---firewalld防火墙启动对failover切换的影响

案例说明：
在KingbaseES V8R6C7的版本中，repmgr.conf增加了connect_check_type参数可以指定参数值，默认值是‘mix’，以前版本为隐藏参数，默认值‘ping‘。connect_check_type参数可以配置为：

监控数据库connect_check_type设置:

connection_check _type：检查上游数据库状态的方法。
ping，使用PQping()检查；
connection，建立数据库连接；
query，使用已有连接执行SELECT TRUE进行检查；
mix，平常使用PQping()进行检查，如果多次失败，最后一次使用已有连接执行 SELECT TRUE进行检查，否则一切正常。

可选参数：ping, connection, query, mix；默认为mix。

适用版本：
KingbaseES V8R6

集群架构：

[kingbase@node202 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string     
----+-------+---------+-----------+----------+----------+----------+----------+---------+--------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 13       |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
 2  | node2 | standby |   running | node1    | default  | 100      | 13       | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10

一、集群节点启动防火墙

如下所示，在主库节点启动防火墙：

[root@node201 ~]# systemctl start firewalld
[root@node201 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2023-11-29 14:08:44 CST; 4s ago
 Main PID: 18450 (firewalld)
   CGroup: /system.slice/firewalld.service
           └─18450 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid

Nov 29 14:08:44 node202 systemd[1]: Starting firewalld - dynamic firewall daemon...
Nov 29 14:08:44 node202 systemd[1]: Started firewalld - dynamic firewall daemon.
[root@node201 ~]# firewall-cmd --list-all
public (default, active)
  interfaces: enp0s3 enp0s8
  sources:
  services: dhcpv6-client ssh
  ports:
  masquerade: no
  forward-ports:
  icmp-blocks:
  rich rules:

如下图所示，启动防火墙后，默认备库节点无法远程访问主库数据库服务及securecmdd连接：

如下所示，防火墙禁止到主库系统ping连接：

[root@node201 ~]# firewall-cmd --add-rich-rule='rule protocol value=icmp drop' --permanent
success

[root@node201 ~]# firewall-cmd --complete-reload
success

[root@node201 ~]# firewall-cmd --list-all
public (default, active)
  interfaces: enp0s3 enp0s8
  sources:
  services: dhcpv6-client ssh
  ports:
  masquerade: no
  forward-ports:
  icmp-blocks:
  rich rules:
        rule protocol value="icmp" drop

如下图所示，备库无法ping通主库主机但可以ping通网关：

二、配置connection_check_type参数测试

案例一：

1、connection_check_type='mix'

[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep connection_check
connection_check_type='mix'

节点之间通过securecmdd远程连接：

[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep scmd_
scmd_options='-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 
                 -o ServerAliveCountMax=5 -p 8890'

2、集群状态（启动防火墙后）
如下所示，启动防火墙后，集群之间的通讯出现异常：

[kingbase@node202 bin]$ ./repmgr cluster show

----+-------+---------+---------------+----------+----------+----------+----------+---------+--------------------------------------------------------------------------
 1  | node1 | primary | ? unreachable | ?        | default  | 100      |          |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
 2  | node2 | standby |   running     | ? node1  | default  | 100      | 13       | ?       | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10

[WARNING] following issues were detected
  - unable to connect to node "node1" (ID: 1)
  - node "node1" (ID: 1) is registered as an active primary but is unreachable
  - unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
  - unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)

[HINT] execute with --verbose option to see connection error messages

3、查看备库hamgr.log

如下图所示，备库在检测到PQping()主库异常超过阈值10次后，重新开始检测，没有执行failover切换操作：

[2023-11-29 14:13:07] [INFO] checking state of node "node1" (ID: 1), 10 of 10 attempts
[2023-11-29 14:13:07] [DEBUG] connection_ping(): result is PGRES_TUPLES_OK
[2023-11-29 14:13:07] [INFO] the database is alive by original connection
[2023-11-29 14:13:07] [NOTICE] reconnected to upstream node after 54 seconds
[2023-11-29 14:13:07] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2023-11-29 14:13:07] [DEBUG] _create_event(): event is "repmgrd_upstream_reconnect" for node 2
[2023-11-29 14:13:07] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2023-11-29 14:13:07] [DEBUG] _create_event():
   INSERT INTO repmgr.events (              node_id,              event,              successful,              details             )       VALUES ($1, $2, $3, $4)    RETURNING event_timestamp
[2023-11-29 14:13:07] [DEBUG] _create_event(): Event timestamp is "2023-11-29 14:13:07.971921+08"
[2023-11-29 14:13:07] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (ID: 1) in normal state
[2023-11-29 14:13:07] [DEBUG] connection_ping(): result is PGRES_TUPLES_OK
[2023-11-29 14:13:08] [DEBUG] is_server_available(): ping status for "host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10" is PQPING_OK
[2023-11-29 14:13:08] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2023-11-29 14:13:08] [DEBUG] get_slot_record():
SELECT slot_name, slot_type, active   FROM pg_catalog.pg_replication_slots  WHERE slot_name = 'repmgr_slot_2'
[2023-11-29 14:13:08] [DEBUG] is_wal_all_recevied():
SELECT pg_current_wal_flush_lsn() as primary_lsn, replay_lsn as local_lsn,  pg_wal_lsn_diff(primary_lsn, local_lsn) <= 16777216  as catched_up  from pg_stat_replication where application_name = 'node2' and state = 'streaming'
[2023-11-29 14:13:08] [DEBUG] is_rep_sync_streaming():
SELECT state,sync_state FROM pg_stat_replication where application_name = 'node2'
[2023-11-29 14:13:08] [DEBUG] get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached, n.primary_seen, n.lsn  FROM repmgr.nodes n  WHERE n.node_id = 2
[2023-11-29 14:13:08] [DEBUG] sleeping 2 seconds (parameter "monitor_interval_secs")
[2023-11-29 14:13:10] [DEBUG] checking host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
[2023-11-29 14:13:10] [DEBUG] connection check type is "mix"
[2023-11-29 14:13:10] [DEBUG] is_server_available(): ping status for "host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10" is PQPING_NO_RESPONSE
[2023-11-29 14:13:10] [WARNING] unable to ping "host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10"
[2023-11-29 14:13:10] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-11-29 14:13:10] [DEBUG] _create_event(): event is "repmgrd_upstream_disconnect" for node 2
[2023-11-29 14:13:10] [DEBUG] _create_event(): Event timestamp is "2023-11-29 14:13:10+0800"
[2023-11-29 14:13:10] [WARNING] unable to connect to upstream node "node1" (ID: 1)
[2023-11-29 14:13:10] [INFO] checking state of node "node1" (ID: 1), 1 of 10 attempts
[2023-11-29 14:13:10] [DEBUG] is_server_available_params(): ping status for "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 fallback_application_name=repmgr" is PQPING_NO_RESPONSE
[2023-11-29 14:13:10] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 fallback_application_name=repmgr"
[2023-11-29 14:13:10] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-11-29 14:13:10] [INFO] sleeping up to 6 seconds until next reconnection attempt
tion attempt
......

如上所示，在配置connection_check_type='mix'后，启动防火墙后，没有触发集群的failover切换。

案例二：

1、connection_check_type='mix'

[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep connection_check
connection_check_type='connection'

2、主库数据库服务down
3、查看备库hamgr.log
如下所示，主库数据库服务down后，备库触发了failover的主备切换：

[2023-11-29 14:35:45] [INFO] checking state of node "node1" (ID: 1), 10 of 10 attempts
[2023-11-29 14:35:45] [DEBUG] is_server_available_params(): ping status for "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 fallback_application_name=repmgr" is PQPING_NO_RESPONSE
[2023-11-29 14:35:45] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 fallback_application_name=repmgr"
[2023-11-29 14:35:45] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-11-29 14:35:45] [WARNING] unable to reconnect to node "node1" (ID: 1) after 10 attempts
........
[2023-11-29 14:35:54] [NOTICE] PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.

--- 192.168.1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.252/0.256/0.261/0.016 ms

[2023-11-29 14:35:54] [NOTICE] successfully ping one or more of the trusted_servers "192.168.1.1"
[2023-11-29 14:35:54] [INFO] promote_command is:
  "/home/kingbase/cluster/R6/R6HA/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/R6/R6HA/kingbase/etc/repmgr.conf"
[2023-11-29 14:35:54] [DEBUG] test_ssh_connection(): executing /home/kingbase/cluster/R6/R6HA/kingbase/bin/sys_securecmd -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 8890 192.168.1.201 /bin/true 2>/dev/null
........

[2023-11-29 14:35:54] [DEBUG] test_ssh_connection(): executing /home/kingbase/cluster/R6/R6HA/kingbase/bin/sys_securecmd -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 8890 192.168.1.201 /usr/bin/true 2>/dev/null
[NOTICE] promoting standby to primary
[DETAIL] promoting server "node2" (ID: 2) using sys_promote()
[NOTICE] waiting for promotion to complete, replay lsn: 2/2E000858
[2023-11-29 14:35:54] [WARNING] unable to connect to remote host "192.168.1.201" via ES
[2023-11-29 14:35:54] [INFO] unable to connect via ES to host "192.168.1.201", skip stop old primary db
[INFO] SET synchronous TO "async" on primary host
[DEBUG] setting node 2 as primary and marking existing primary as failed
[NOTICE] STANDBY PROMOTE successful
[DETAIL] server "node2" (ID: 2) was successfully promoted to primary

4、切换后集群状态
如下所示，node2被切换为primary：

[kingbase@node202 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string     
----+-------+---------+-----------+----------+----------+----------+----------+---------+--------------------------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | default  | 100      |          |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
 2  | node2 | primary | * running |          | default  | 100      | 14       |         | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10

[WARNING] following issues were detected
  - unable to connect to node "node1" (ID: 1)

[HINT] execute with --verbose option to see connection error messages