PostgreSQL high availability with repmgr, part 9: automatic failover with 1 primary + 2 standbys
OS: Ubuntu 16.04
PostgreSQL: 9.6.8
repmgr: 4.1.1
192.168.56.101 node1
192.168.56.102 node2
192.168.56.103 node3
Set up a cluster with 1 primary + 2 standbys.
The detailed steps are omitted here; see the earlier posts in this series.
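Automatic failover only happens if repmgrd is running on each standby and repmgr.conf contains the failover-related settings. As a quick sanity check before testing, the small helper below prints those settings from a config file. The function name and the default path are illustrative assumptions; the parameter names (failover, promote_command, follow_command, reconnect_attempts, reconnect_interval) are standard repmgr 4 settings, but the values used in this cluster are not shown in this post apart from what appears in the repmgrd log further down.

```shell
#!/usr/bin/env bash
# Print the repmgr.conf settings that drive repmgrd automatic failover.
# show_failover_settings is a hypothetical helper, not part of repmgr.
show_failover_settings() {
    local conf="${1:-/etc/repmgr.conf}"
    # Match uncommented "key = value" lines for the failover-related keys.
    grep -E '^[[:space:]]*(failover|promote_command|follow_command|reconnect_attempts|reconnect_interval)[[:space:]]*=' "$conf"
}

# Usage (run on each node):
#   show_failover_settings /etc/repmgr.conf
```

Commented-out lines are skipped on purpose, so a setting that is still at its default will simply not be listed.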
$ repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Connection string
----+-------+---------+-----------+----------+------------+-----------------------------------------------------------------
1 | node1 | primary | * running | | location01 | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | node1 | location01 | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
3 | node3 | standby | running | node1 | location01 | host=192.168.56.103 user=repmgr dbname=repmgr connect_timeout=2
Manually stop the primary on node1 to simulate a failure
On node1:
$ sudo pg_ctlcluster 9.6 main stop
Check on node2:
$ repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Connection string
----+-------+---------+-----------+----------+------------+-----------------------------------------------------------------
1 | node1 | primary | - failed | | location01 | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | location01 | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
3 | node3 | standby | running | node2 | location01 | host=192.168.56.103 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- when attempting to connect to node "node1" (ID: 1), following error encountered :
"could not connect to server: Connection refused
Is the server running on host "192.168.56.101" and accepting
TCP/IP connections on port 5432?"
PostgreSQL on node2 has been promoted to the new primary,
and the upstream of node3 has been switched from node1 to node2.
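Reading the table by eye works for three nodes, but the same check can be scripted. The sketch below reads `repmgr cluster show` output on stdin and prints the name of every node listed as a running primary; after a clean failover it should print exactly one name. The function name is hypothetical, and the parsing assumes the pipe-separated table layout shown above.

```shell
#!/usr/bin/env bash
# Print the names of nodes that `repmgr cluster show` lists as running primaries.
# running_primaries is a hypothetical helper; it only parses text from stdin.
running_primaries() {
    awk -F'|' '
        NF >= 4 {
            name = $2; role = $3; status = $4
            gsub(/[ \t]+/, "", name)
            gsub(/[ \t]+/, "", role)
            # Status looks like "* running" or "- failed"; match on the word.
            if (role == "primary" && status ~ /running/) print name
        }'
}

# Usage:
#   repmgr -f /etc/repmgr.conf cluster show | running_primaries
```

The failed node1 is still recorded with the role "primary" in the metadata, which is why the status column has to be checked as well as the role column.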
Check on node3:
$ repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Connection string
----+-------+---------+-----------+----------+------------+-----------------------------------------------------------------
1 | node1 | primary | - failed | | location01 | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | location01 | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
3 | node3 | standby | running | node2 | location01 | host=192.168.56.103 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- when attempting to connect to node "node1" (ID: 1), following error encountered :
"could not connect to server: Connection refused
Is the server running on host "192.168.56.101" and accepting
TCP/IP connections on port 5432?"
Power off the node2 VM
PostgreSQL on node2 is now the new primary. To continue the HA test, cut power to the node2 VM.
Check on node3:
$ repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Connection string
----+-------+---------+-----------+----------+------------+-----------------------------------------------------------------
1 | node1 | primary | - failed | | location01 | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | - failed | | location01 | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
3 | node3 | primary | * running | | location01 | host=192.168.56.103 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- when attempting to connect to node "node1" (ID: 1), following error encountered :
"could not connect to server: Connection refused
Is the server running on host "192.168.56.101" and accepting
TCP/IP connections on port 5432?"
- when attempting to connect to node "node2" (ID: 2), following error encountered :
"timeout expired"
The repmgrd log on node3 shows the failover sequence:
$ tail -f /var/log/postgresql/repmgrd.log
[2018-09-26 10:54:59] [INFO] node "node3" (node ID: 3) monitoring upstream node "node2" (node ID: 2) in normal state
[2018-09-26 10:54:59] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-26 10:55:11] [WARNING] unable to connect to upstream node "node2" (node ID: 2)
[2018-09-26 10:55:11] [INFO] checking state of node 2, 1 of 10 attempts
[2018-09-26 10:55:13] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-26 10:55:18] [INFO] checking state of node 2, 2 of 10 attempts
[2018-09-26 10:55:20] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-26 10:55:25] [INFO] checking state of node 2, 3 of 10 attempts
[2018-09-26 10:55:27] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-26 10:55:32] [INFO] checking state of node 2, 4 of 10 attempts
[2018-09-26 10:55:34] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-26 10:55:39] [INFO] checking state of node 2, 5 of 10 attempts
[2018-09-26 10:55:41] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-26 10:55:46] [INFO] checking state of node 2, 6 of 10 attempts
[2018-09-26 10:55:48] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-26 10:55:53] [INFO] checking state of node 2, 7 of 10 attempts
[2018-09-26 10:55:55] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-26 10:56:00] [INFO] checking state of node 2, 8 of 10 attempts
[2018-09-26 10:56:02] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-26 10:56:07] [INFO] checking state of node 2, 9 of 10 attempts
[2018-09-26 10:56:09] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-26 10:56:14] [INFO] checking state of node 2, 10 of 10 attempts
[2018-09-26 10:56:16] [WARNING] unable to reconnect to node 2 after 10 attempts
[2018-09-26 10:56:16] [NOTICE] this node is the only available candidate and will now promote itself
[2018-09-26 10:56:16] [INFO] promote_command is:
"/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file"
[2018-09-26 10:56:16] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgrd.log"
[2018-09-26 10:56:18] [NOTICE] promoting standby to primary
[2018-09-26 10:56:18] [DETAIL] promoting server "node3" (ID: 3) using "sudo pg_ctlcluster 9.6 main promote"
[2018-09-26 10:56:18] [DETAIL] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2018-09-26 10:56:19] [NOTICE] STANDBY PROMOTE successful
[2018-09-26 10:56:19] [DETAIL] server "node3" (ID: 3) was successfully promoted to primary
[2018-09-26 10:56:19] [INFO] switching to primary monitoring mode
[2018-09-26 10:56:19] [NOTICE] monitoring cluster primary "node3" (node ID: 3)
[2018-09-26 10:56:29] [INFO] monitoring primary node "node3" (node ID: 3) in normal state
[2018-09-26 10:56:39] [INFO] monitoring primary node "node3" (node ID: 3) in normal state
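The log also shows how long repmgrd waited before giving up on node2. A rough estimate, using only values read off the log above (10 attempts, a 5-second sleep between attempts, and about 2 seconds per check, which presumably corresponds to connect_timeout=2 in the connection string), can be worked out as follows. The variable names are illustrative and the numbers are inferred from the log, not taken from the repmgr.conf used here.

```shell
#!/usr/bin/env bash
# Estimate how long repmgrd waits before declaring the upstream lost.
# All three values are inferred from the repmgrd log, not from repmgr.conf.
attempts=10        # "checking state of node 2, N of 10 attempts"
sleep_between=5    # "sleeping 5 seconds until next reconnection attempt"
probe_seconds=2    # each check took about 2 s against the powered-off VM

# N probes, with one sleep between consecutive probes (N - 1 sleeps).
detect_seconds=$(( attempts * probe_seconds + (attempts - 1) * sleep_between ))
echo "detection window: ${detect_seconds}s"
```

This gives 65 seconds, which matches the gap in the log between the first warning at 10:55:11 and "unable to reconnect to node 2 after 10 attempts" at 10:56:16. The promotion itself then completed about 3 seconds later, at 10:56:19.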
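Automatic failover with 1 primary + 2 standbys works essentially the same way as with 1 primary + 1 standby. The extra standby simply adds a bit more HA: as shown above, the cluster survived two successive primary failures.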