postgresql 高可用 repmgr 的使用之六 1 Primary + 1 Standby 的 auto failover
os:ubunbu 16.04
postgresql:9.6.8
repmgr:4.1.1
192.168.56.101 node1
192.168.56.102 node2
生产环境确保网络稳定的前提下,数据库能够 auto failover 是最好的状态。如果是同步复制的话,建议配置两个slave,如果是异步复制,也建议配置两个slave。
需要实现 auto failover,就需要启用 repmgrd,而启用 repmgrd 就必须设置 postgresql.conf 的 shared_preload_libraries = ‘repmgr’
$ vi /etc/postgresql/9.6/main/postgresql.conf
shared_preload_libraries = 'repmgr'
/etc/repmgr.conf 的内容
node1 节点上查看,node2 节点上基本相同。
# more /etc/repmgr.conf
##############################################
# node information
node_id=1
node_name=node1
conninfo='host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/9.6/main'
use_replication_slots=true
pg_bindir='/usr/lib/postgresql/9.6/bin'
pg_ctl_options='-l /var/log/postgres/startup.log'
rsync_options=--exclude=postgresql.local.conf --archive
#use_primary_conninfo_password=true
location='location01'
##############################################
# service_start_command
service_start_command = 'sudo pg_ctlcluster 9.6 main start'
service_stop_command = 'sudo pg_ctlcluster 9.6 main stop'
service_restart_command = 'sudo pg_ctlcluster 9.6 main restart'
service_reload_command = 'sudo pg_ctlcluster 9.6 main reload'
service_promote_command = 'sudo pg_ctlcluster 9.6 main promote'
##############################################
# log info
log_level=INFO
log_status_interval=10
log_file=/var/log/postgresql/repmgrd.log
##############################################
# repmgrd options
monitoring_history=yes
monitor_interval_secs=5
reconnect_attempts=10
reconnect_interval=5
##############################################
# automatic failover
failover=automatic
promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
启动 repmgrd
每个节点都是需要启动的
手动启动
# repmgrd -f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid
系统启动
# vi /etc/default/repmgrd
# default settings for repmgrd. This file is source by /bin/sh from
# /etc/init.d/repmgrd
# disable repmgrd by default so it won't get started upon installation
# valid values: yes/no
REPMGRD_ENABLED=yes
# configuration file (required)
REPMGRD_CONF="/etc/repmgr.conf"
# additional options
#REPMGRD_OPTS=""
# user to run repmgrd as
REPMGRD_USER=postgres
# repmgrd binary
REPMGRD_BIN=/usr/bin/repmgrd
# pid file
REPMGRD_PIDFILE=/var/run/repmgrd.pid
REPMGRD_CONF 指向了配置文件
非常有必要配置成系统启用。
# systemctl status repmgrd.service
# systemctl start repmgrd.service
# systemctl enable repmgrd.service
node1 上的 repmgrd 进程
# ps -ef|grep -i repmgrd |grep -v grep
postgres 4091 1 0 14:30 ? 00:00:00 /usr/lib/postgresql/9.6/bin/repmgrd --config-file /etc/repmgr.conf
# cat /var/log/postgresql/repmgrd.log
[2018-09-25 14:30:28] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
[2018-09-25 14:30:28] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2"
[2018-09-25 14:30:28] [NOTICE] starting monitoring of node "node1" (ID: 1)
[2018-09-25 14:30:28] [NOTICE] monitoring cluster primary "node1" (node ID: 1)
[2018-09-25 14:30:38] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 14:30:48] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
node2 上的 repmgrd 进程
# ps -ef|grep -i repmgrd |grep -v grep
postgres 4349 1 0 14:18 ? 00:00:00 /usr/lib/postgresql/9.6/bin/repmgrd --config-file /etc/repmgr.conf
# cat /var/log/postgresql/repmgrd.log
[2018-09-25 14:33:44] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
[2018-09-25 14:33:44] [INFO] connecting to database "host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2"
[2018-09-25 14:33:44] [NOTICE] starting monitoring of node "node2" (ID: 2)
[2018-09-25 14:33:44] [INFO] monitoring connection to upstream node "node1" (node ID: 1)
[2018-09-25 14:33:55] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 14:33:55] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 14:34:05] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 14:34:05] [DETAIL] last monitoring statistics update was 5 seconds ago
手动关闭主库模拟异常
node1 节点上操作
$ repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Connection string
----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
1 | node1 | primary | * running | | default | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | node1 | default | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
$ pg_ctl -D /var/lib/postgresql/9.6/main -m fast stop
或者
$ sudo pg_ctlcluster 9.6 main stop
$ tail -f /var/log/postgresql/repmgrd.log
[2018-09-25 15:08:23] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
[2018-09-25 15:08:23] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2"
[2018-09-25 15:08:23] [NOTICE] starting monitoring of node "node1" (ID: 1)
[2018-09-25 15:08:23] [NOTICE] monitoring cluster primary "node1" (node ID: 1)
[2018-09-25 15:08:33] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:43] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:53] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:03] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:13] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:23] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:33] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:43] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:53] [WARNING] unable to connect to local node
[2018-09-25 15:09:53] [INFO] checking state of node 1, 1 of 10 attempts
[2018-09-25 15:09:53] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:09:58] [INFO] checking state of node 1, 2 of 10 attempts
[2018-09-25 15:09:58] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:03] [INFO] checking state of node 1, 3 of 10 attempts
[2018-09-25 15:10:03] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:08] [INFO] checking state of node 1, 4 of 10 attempts
[2018-09-25 15:10:08] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:13] [INFO] checking state of node 1, 5 of 10 attempts
[2018-09-25 15:10:13] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:18] [INFO] checking state of node 1, 6 of 10 attempts
[2018-09-25 15:10:18] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:23] [INFO] checking state of node 1, 7 of 10 attempts
[2018-09-25 15:10:23] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:28] [INFO] checking state of node 1, 8 of 10 attempts
[2018-09-25 15:10:28] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:33] [INFO] checking state of node 1, 9 of 10 attempts
[2018-09-25 15:10:33] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts
[2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts
[2018-09-25 15:10:38] [NOTICE] unable to connect to local node, falling back to degraded monitoring
[2018-09-25 15:10:38] [INFO] monitoring primary node "node1" (node ID: 1) in degraded state
[2018-09-25 15:10:38] [DETAIL] waiting for the node to become available
[2018-09-25 15:10:48] [INFO] monitoring primary node "node1" (node ID: 1) in degraded state
[2018-09-25 15:10:48] [DETAIL] waiting for the node to become available
node2 上查看
$ repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Connection string
----+-------+---------+---------------+----------+----------+-----------------------------------------------------------------
1 | node1 | primary | ? unreachable | | default | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | node1 | default | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- when attempting to connect to node "node1" (ID: 1), following error encountered :
"could not connect to server: Connection refused
Is the server running on host "192.168.56.101" and accepting
TCP/IP connections on port 5432?"
- node "node1" (ID: 1) is registered as an active primary but is unreachable
$ tail -f /var/log/postgresql/repmgrd.log
[2018-09-25 15:07:58] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
[2018-09-25 15:07:58] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2"
[2018-09-25 15:07:58] [NOTICE] starting monitoring of node "node2" (ID: 2)
[2018-09-25 15:07:58] [INFO] monitoring connection to upstream node "node1" (node ID: 1)
[2018-09-25 15:08:08] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:08] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:18] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:18] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:28] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:28] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:38] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:38] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:48] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:48] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:08:58] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:08:58] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:08] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:08] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:18] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:18] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:28] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:28] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:38] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:38] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:48] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
[2018-09-25 15:09:48] [DETAIL] last monitoring statistics update was 5 seconds ago
[2018-09-25 15:09:53] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
[2018-09-25 15:09:53] [INFO] checking state of node 1, 1 of 10 attempts
[2018-09-25 15:09:53] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:09:58] [INFO] checking state of node 1, 2 of 10 attempts
[2018-09-25 15:09:58] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:03] [INFO] checking state of node 1, 3 of 10 attempts
[2018-09-25 15:10:03] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:08] [INFO] checking state of node 1, 4 of 10 attempts
[2018-09-25 15:10:08] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:13] [INFO] checking state of node 1, 5 of 10 attempts
[2018-09-25 15:10:13] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:18] [INFO] checking state of node 1, 6 of 10 attempts
[2018-09-25 15:10:18] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:23] [INFO] checking state of node 1, 7 of 10 attempts
[2018-09-25 15:10:23] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:28] [INFO] checking state of node 1, 8 of 10 attempts
[2018-09-25 15:10:28] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:33] [INFO] checking state of node 1, 9 of 10 attempts
[2018-09-25 15:10:33] [INFO] sleeping 5 seconds until next reconnection attempt
[2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts
[2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts
[2018-09-25 15:10:38] [NOTICE] this node is the only available candidate and will now promote itself
[2018-09-25 15:10:38] [INFO] promote_command is:
"/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file"
[2018-09-25 15:10:38] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgrd.log"
[2018-09-25 15:10:38] [NOTICE] promoting standby to primary
[2018-09-25 15:10:38] [DETAIL] promoting server "node2" (ID: 2) using "sudo pg_ctlcluster 9.6 main promote"
[2018-09-25 15:10:38] [DETAIL] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2018-09-25 15:10:39] [NOTICE] STANDBY PROMOTE successful
[2018-09-25 15:10:39] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2018-09-25 15:10:39] [INFO] switching to primary monitoring mode
[2018-09-25 15:10:39] [NOTICE] monitoring cluster primary "node2" (node ID: 2)
[2018-09-25 15:10:49] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
[2018-09-25 15:10:59] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
[2018-09-25 15:11:09] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
[2018-09-25 15:11:19] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
在经过 10*5s的时常后,发生了 auto failover: promoting standby to primary,如下
[2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts
[2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts
[2018-09-25 15:10:38] [NOTICE] this node is the only available candidate and will now promote itself
[2018-09-25 15:10:38] [INFO] promote_command is:
"/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file"
[2018-09-25 15:10:38] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgrd.log"
[2018-09-25 15:10:38] [NOTICE] promoting standby to primary
[2018-09-25 15:10:38] [DETAIL] promoting server "node2" (ID: 2) using "sudo pg_ctlcluster 9.6 main promote"
[2018-09-25 15:10:38] [DETAIL] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2018-09-25 15:10:39] [NOTICE] STANDBY PROMOTE successful
[2018-09-25 15:10:39] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2018-09-25 15:10:39] [INFO] switching to primary monitoring mode
[2018-09-25 15:10:39] [NOTICE] monitoring cluster primary "node2" (node ID: 2)
[2018-09-25 15:10:49] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
node2 上查看 cluster show
$ repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Connection string
----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
1 | node1 | primary | - failed | | default | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | default | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- when attempting to connect to node "node1" (ID: 1), following error encountered :
"could not connect to server: Connection refused
Is the server running on host "192.168.56.101" and accepting
TCP/IP connections on port 5432?"
虚拟机掉电模拟异常
经过前面的测试,node2 现在是 master,node1 通过 node rejoin后变为slave,细节可以参考上一篇blog。
node2 节点掉电前状态查看
$ repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Connection string
----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
1 | node1 | standby | running | node2 | default | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | * running | | default | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
node2 开始掉电
node1 上查看
$ repmgr -f /etc/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Connection string
----+-------+---------+---------------+----------+----------+-----------------------------------------------------------------
1 | node1 | standby | running | node2 | default | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | primary | ? unreachable | | default | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- when attempting to connect to node "node2" (ID: 2), following error encountered :
"could not connect to server: No route to host
Is the server running on host "192.168.56.102" and accepting
TCP/IP connections on port 5432?"
- node "node2" (ID: 2) is registered as an active primary but is unreachable
此时 node1 节点的也提升为 master,提升为master的日志输出基本类似,不再重复粘贴。
参考:
https://www.2ndquadrant.com/en/resources/repmgr/
https://github.com/2ndQuadrant/repmgr
https://repmgr.org/docs/4.1/using-repmgrd.html