Implement repmgr PG14 cluster
扫盲
1. autofailover实际是使用repmgrd进程主动监控并执行主备切换,看上去很强大,实则生产中慎用或者不用。
2. PG switchover时,主备 PG owner 必须配置免密,并且 只能在standby机器上执行switchover的操作
系统环境
OS: CentOS7
192.168.56.11 node1
192.168.56.12 node2
部署操作流程
1. 系统环境初始化
echo redhat | passwd root --stdin
echo -e "192.168.56.11 node1 " >>/etc/hosts
echo -e "192.168.56.12 node2 " >>/etc/hosts
sed -i 's/vault.centos.org/mirrors.aliyun.com/g' /etc/yum.repos.d/CentOS*
2. 所有节点安装PG14 <root执行>
sudo yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
sudo yum -y install epel-release yum-utils
sudo yum-config-manager --enable pgdg14
sudo yum install postgresql14-server postgresql14 -y
echo -e " export PATH=/usr/pgsql-14/bin/:$PATH" >>/var/lib/pgsql/.bash_profile
3. 所有节点安装repmgr<root执行>
## on all nodes
sudo yum -y install repmgr_14*
4. 关闭防火墙或者放行PG端口(默认5432)<root执行>
systemctl stop firewalld
5. 主节点创建repmgr账户 , pg_hba.cfg 以及postgresql.conf 参数调整 , repmgr.conf文件配置 <PG owner执行>
5.1 创建repmgr
createuser repmgr -P -s
createdb repmgr -O repmgr;
5.2 调整pg_hba.cfg 以及postgresql.conf参数
## -- postgresql.conf
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'hot_standby'
hot_standby = on
archive_mode = on
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'
## -- pg_hba.cfg
--- 这里采用了trust模式,因此无需.pgpass
local replication repmgr trust
host replication repmgr 127.0.0.1/32 trust
host replication repmgr 192.168.56.11/32 trust
host replication repmgr 192.168.56.12/32 trust
local repmgr repmgr trust
host repmgr repmgr 127.0.0.1/32 trust
host repmgr repmgr 192.168.56.11/32 trust
host repmgr repmgr 192.168.56.12/32 trust
若有需要 .pgpass ,则需要如下配置:su - postgres
cd /var/lib/pgsql
touch .pgpass
chmod 0600 .pgpass
echo -e "192.168.56.11:5432:repmgr:repmgr:passwd_123456">>.pgpass
echo -e "192.168.56.12:5432:repmgr:repmgr:passwd_123456">>.pgpass
5.3 repmgr.conf文件配置
#-- 主库
node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/14/data/'
#-- 从库
node_id=2
node_name=node2
conninfo='host=node2 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/14/data/'
6. 启动主节点PG 并 注册为primary server <PG owner执行>
# -- 首先启动主库PG
sudo /usr/pgsql-14/bin/postgresql-14-setup initdb
sudo systemctl enable postgresql-14
sudo systemctl start postgresql-14
# -- 先模拟运行注册
su - postgres
repmgr -f /etc/repmgr/14/repmgr.conf primary register --dry-run
# -- 将上面错误全部解决后,执行注册
repmgr -f /etc/repmgr/14/repmgr.conf primary register
7. 清空备节点PG data目录, 搭建/clone备节点 <PG owner执行>
# - 确保从库PG以及PG 所有data目录是空的,如果不是,则手动删除即可
systemctl stop postgresql-14
rm -rf /var/lib/pgsql/14/data
8. 将备节点注册成standby server 并启动PG实例 <PG owner执行>
repmgr -f /etc/repmgr/14/repmgr.conf standby register --force
9. 校验集群信息
-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=node2 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=192.168.56.11 port=5432 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | node1 | default | 100 | 1 | host=node2 user=repmgr port=5432 dbname=repmgr connect_timeout=2
10. 常见错误
10.1 connection to server at "192.168.56.11", port 5432 failed: Connection refused
# -- 明明PG 实例都启动了,但是死活显示如题,最后排查到 默认情况下 postgresql只监听local,因此修改参数listen-addresses ='*'
-bash-4.2$ repmgr -f /etc/repmgr/14/repmgr.conf primary register --dry-run
INFO: connecting to primary database...
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
connection to server at "192.168.56.11", port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
DETAIL: attempted to connect using:
user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path=
#-- solution 修改如下配置即可
查看 postgresql.conf
listen-addresses ='*'
10.2.ERROR: "repmgr" is not a superuser and no superuser name supplied
# 需要确保做repmgr的账户拥有超级权限
postgres=# \du
List of roles
Role name | Attributes | Member of
-----------+------------------------------------------------------------+-----------
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
repmgr | Superuser | {}
10.3. Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
# 系统级别的问题, 配置passwordless时,在节点node1上跑 ssh node2 date会直接拒绝
# - solution
修改如下 PasswordAuthentication = yes
PasswordAuthentication yes
10.4.网络有问题
-- 各种查看,发现没有问题
-bash-4.2$ repmgr -hnode1 -p5432 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/var/lib/pgsql/14/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node1 port=5432 user=repmgr dbname=repmgr
ERROR: connection to database failed
DETAIL:
connection to server at "node1" (192.168.56.11), port 5432 failed: No route to host
Is the server running on that host and accepting TCP/IP connections?
-- 最后发现是防火墙需要关掉 - systemctl stop firewalld 或者通过firewall-cmd放行PG 实例端口
10.5 WARNING: unable to connect to remote host "192.168.56.11" via SSH
-bash-4.2$ repmgr standby switchover -f /etc/repmgr/14/repmgr.conf --dry-run
NOTICE: checking switchover on node "node2" (ID: 2) in --dry-run mode
WARNING: unable to connect to remote host "192.168.56.11" via SSH
ERROR: unable to connect via SSH to host "192.168.56.11", user ""
# - solution
switchover 需要 passwordless,因此配置postgres passwordless即可
10.6 WARNING: local node "node2" is behind shutdown primary "node1"
-bash-4.2$ repmgr standby switchover -f /etc/repmgr/14/repmgr.conf --dry-run
NOTICE: checking switchover on node "node2" (ID: 2) in --dry-run mode
INFO: SSH connection to host "192.168.56.11" succeeded
INFO: able to execute "repmgr" on remote host "192.168.56.11"
INFO: 1 walsenders required, 10 available
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) would be promoted to primary; current primary "node1" (ID: 1) would be demoted to standby
INFO: following shutdown command would be run on node "node1":
"/usr/pgsql-14/bin/pg_ctl -D '/var/lib/pgsql/14/data' -W -m fast stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met
---- 看上去没问题
-bash-4.2$ repmgr standby switchover -f /etc/repmgr/14/repmgr.conf
NOTICE: executing switchover on node "node2" (ID: 2)
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node1" (ID: 1)
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
NOTICE: issuing CHECKPOINT on node "node1" (ID: 1)
DETAIL: executing server command "/usr/pgsql-14/bin/pg_ctl -D '/var/lib/pgsql/14/data' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/8000028
NOTICE: waiting up to 30 seconds (parameter "wal_receive_check_timeout") for received WAL to flush to disk
INFO: sleeping 1 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 2 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 3 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 4 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 5 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 6 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 7 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 8 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 9 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 10 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 11 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 12 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 13 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 14 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 15 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 16 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 17 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 18 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 19 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 20 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 21 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 22 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 23 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 24 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 25 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 26 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 27 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 28 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 29 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 30 of maximum 30 seconds waiting for standby to flush received WAL to disk
WARNING: local node "node2" is behind shutdown primary "node1"
DETAIL: local node last receive LSN is 0/7080000, primary shutdown checkpoint LSN is 0/8000028
NOTICE: aborting switchover
HINT: use --always-promote to force promotion of standby
-- 此时查看,主PG已经关闭了, Standby PG 还在继续运行中
-bash-4.2$ repmgr -f /etc/repmgr/14/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+---------------+----------+----------+----------+----------+--------------------------------------------------------------------------
1 | node1 | primary | ? unreachable | ? | default | 100 | | host=192.168.56.11 port=5432 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | ? node1 | default | 100 | 1 | host=node2 user=repmgr port=5432 dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- unable to connect to node "node1" (ID: 1)
- node "node1" (ID: 1) is registered as an active primary but is unreachable
- unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
- unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)
-- 然后重新启动 主PG,集群还是继续运行
systemctl start postgresql-14
-bash-4.2$ repmgr -f /etc/repmgr/14/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=192.168.56.11 port=5432 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | node1 | default | 100 | 1 | host=node2 user=repmgr port=5432 dbname=repmgr connect_timeout=2
-- Solution
11 常用repmgr命令
repmgr primary register — initialise a repmgr installation and register the primary node
repmgr primary unregister — unregister an inactive primary node
repmgr standby clone — clone a PostgreSQL standby node from another PostgreSQL node
repmgr standby register — add a standby's information to the repmgr metadata
repmgr standby unregister — remove a standby's information from the repmgr metadata
repmgr standby promote — promote a standby to a primary
repmgr standby follow — attach a running standby to a new upstream node
repmgr standby switchover — promote a standby to primary and demote the existing primary to a standby
repmgr witness register — add a witness node's information to the repmgr metadata
repmgr witness unregister — remove a witness node's information to the repmgr metadata
repmgr node status — show overview of a node's basic information and replication status
repmgr node check — performs some health checks on a node from a replication perspective
repmgr node rejoin — rejoin a dormant (stopped) node to the replication cluster
repmgr node service — show or execute the system service command to stop/start/restart/reload/promote a node
repmgr cluster show — display information about each registered node in the replication cluster
repmgr cluster matrix — runs repmgr cluster show on each node and summarizes output
repmgr cluster crosscheck — cross-checks connections between each combination of nodes
repmgr cluster event — output a formatted list of cluster events
repmgr cluster cleanup — purge monitoring history
repmgr service status — display information about the status of repmgrd on each node in the cluster
repmgr service pause — Instruct all repmgrd instances in the replication cluster to pause failover operations
repmgr service unpause — Instruct all repmgrd instances in the replication cluster to resume failover operations
repmgr daemon start — Start the repmgrd daemon on the local node
repmgr daemon stop — Stop the repmgrd daemon on the local node