repmgr安装部署(pg13)
环境:
OS:Centos7
DB:pg13
repmgr:5.3
192.168.1.105 主库 pg1
192.168.1.106 从库 pg2
192.168.1.107 从库 pg3
1.在每台机器上都安装好pg
安装请参考
https://www.cnblogs.com/hxlasky/p/16844609.html
说明:主库安装的时候根据机器配置需要对配置文件进行修改,从库可以不修改,因为部署repmgr会做一个pg_basebackup
会将主库的postgresql.conf和pg_hba.conf拷贝到从库.
2.配置/etc/hosts
192.168.1.105
hostnamectl set-hostname pg1
192.168.1.106
hostnamectl set-hostname pg2
192.168.1.107
hostnamectl set-hostname pg3
每个节点配置一致
vi /etc/hosts
192.168.1.105 pg1
192.168.1.106 pg2
192.168.1.107 pg3
3.配置等效连接
3个节点之间postgres用户能互相免密登录
部署参考:
https://www.cnblogs.com/hxlasky/p/17036903.html
4.添加sudo用户 --done
每个节点上执行,root用户下执行
echo 'postgres ALL=(ALL) NOPASSWD: ALL'> /etc/sudoers.d/postgres
5.创建repmgr用户,以及创建repmgr数据库
我这里已经提前安装好了主库,直接登录到主库上执行,在主库上192.168.1.105执行即可
[postgres@pg2 ~]$ psql
psql (14.6)
Type "help" for help.
postgres=# create database repmgr;
CREATE DATABASE
postgres=# create user repmgr with password 'repmgr' superuser login;
CREATE ROLE
postgres=# alter database repmgr owner to repmgr;
ALTER DATABASE
6.repmgr安装
每个节点上都需要安装
curl https://dl.2ndquadrant.com/default/release/get/13/rpm | bash
yum list repmgr*
yum install repmgr13
7.配置pg_hba.conf和postgresql.conf
在主库192.168.1.105上配置即可,在原来的基础上添加如下参数
su - postgres
vi /opt/pg13/data/postgresql.conf
shared_preload_libraries = 'repmgr'
wal_log_hints = on #开启支持pg_rewind
vi /opt/pg13/data/pg_hba.conf
host replication repmgr 192.168.1.105/32 trust
host replication repmgr 192.168.1.106/32 trust
host replication repmgr 192.168.1.107/32 trust
重新启动主库
su - root
systemctl stop postgresql-13
systemctl start postgresql-13
8.在postgres用户下创建.pgpass密码文件
每个节点上都要创建,可以尝试在一个节点上创建,然后scp到另外的机器,
我这里是在主库192.169.1.105上创建,然后scp到另外2个节点
su - postgres
cd ~/
vi .pgpass
#hostname:port:database:username:password
192.168.1.105:5432:repmgr:repmgr:repmgr
192.168.1.106:5432:repmgr:repmgr:repmgr
192.168.1.107:5432:repmgr:repmgr:repmgr
.pgpass文件权限必须是0600
chmod 0600 .pgpass
拷贝到另外2个节点
su - postgres
scp /home/postgres/.pgpass postgres@192.168.1.106:/home/postgres/
scp /home/postgres/.pgpass postgres@192.168.1.107:/home/postgres/
9.主库修改repmgr配置文件
机器192.168.1.105上操作
su - root
vi /etc/repmgr/13/repmgr.conf
node_id=1
node_name='pg1'
conninfo='host=192.168.1.105 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/opt/pg13/data'
log_file='/var/log/repmgr/log'
passfile='/home/postgres/.pgpass'
pg_bindir='/usr/pgsql-13/bin'
#集群faibver设置
failover='automatic'
promote_command='/usr/pgsql-13/bin/repmgr standby promote -f /etc/repmgr/13/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-13/bin/repmgr standby follow -f /etc/repmgr/13/repmgr.conf --log-to-file --upstream-node-id=%n'
10.主库注册
机器192.168.1.105上操作
[root@pg1 14]# su - postgres
[postgres@pg1 ~]$ /usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf primary register
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered
查看
[postgres@pg1 ~]$ /usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------
1 | pg1 | primary | * running | | default | 100 | 1 | host=192.168.1.105 user=repmgr dbname=repmgr connect_timeout=2
#######################从库1(192.168.1.106)############################
1.修改repmgr配置文件
修改repmgr.conf文件
su - root
vi /etc/repmgr/13/repmgr.conf
node_id=2
node_name='pg2'
conninfo='host=192.168.1.106 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/opt/pg13/data'
passfile='/home/postgres/.pgpass'
pg_bindir='/usr/pgsql-13/bin'
2.停掉从库上的pg,并清空data目录
[root@pg2 14]# systemctl stop postgresql-13
su - postgres
[postgres@pg2 pg14]$ cd /opt/pg13
[postgres@pg2 pg14]$ mv data bakdata
[postgres@pg2 pg14]$ mkdir data
[root@localhost bin]#chown -R postgres:postgres /opt/pg13
[root@localhost bin]#chmod 0700 /opt/pg13/data
3.测试(未真正执行)
su - postgres
[postgres@pg3 pg13]$/usr/pgsql-13/bin/repmgr -h 192.168.1.105 -U repmgr -d repmgr -f /etc/repmgr/13/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/opt/pg13/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.1.105 user=repmgr dbname=repmgr
DETAIL: current installation size is 32 MB
INFO: "repmgr" extension is installed in database "repmgr"
INFO: replication slot usage not requested; no replication slot will be set up for this standby
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 9 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: would execute:
pg_basebackup -l "repmgr base backup" -D /opt/pg13/data -h 192.168.1.105 -p 5432 -U repmgr -X stream
INFO: all prerequisites for "standby clone" are met
4.clone主节点数据
su - postgres
/usr/pgsql-13/bin/repmgr -h 192.168.1.105 -U repmgr -d repmgr -f /etc/repmgr/13/repmgr.conf standby clone
5.启动从库
[root@pg2 ~]#systemctl start postgresql-13
查看进程
[root@pg2 ~]# ps -ef|grep postgres
root 2407 1551 0 Jan09 pts/0 00:00:00 su - postgres
postgres 2408 2407 0 Jan09 pts/0 00:00:00 -bash
postgres 3076 1 4 00:58 ? 00:00:00 /usr/pgsql-13/bin/postmaster -D /opt/pg13/data/
postgres 3078 3076 0 00:58 ? 00:00:00 postgres: logger
postgres 3079 3076 0 00:58 ? 00:00:00 postgres: startup recovering 000000010000000000000004
postgres 3080 3076 0 00:58 ? 00:00:00 postgres: checkpointer
postgres 3081 3076 0 00:58 ? 00:00:00 postgres: background writer
postgres 3082 3076 0 00:58 ? 00:00:00 postgres: stats collector
postgres 3083 3076 1 00:58 ? 00:00:00 postgres: walreceiver streaming 0/40001F0
root 3085 3026 0 00:58 pts/1 00:00:00 grep --color=auto postgres
6.注册从节点
su - postgres
[postgres@pg2 data]$ /usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf standby register
INFO: connecting to local node "pg2" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID: 1)
INFO: standby registration complete
NOTICE: standby node "pg2" (ID: 2) successfully registered
7.查看
可以在主从的机器上查看
[postgres@pg2 data]$ /usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------
1 | pg1 | primary | * running | | default | 100 | 1 | host=192.168.1.105 user=repmgr dbname=repmgr connect_timeout=2
2 | pg2 | standby | running | pg1 | default | 100 | 1 | host=192.168.1.106 user=repmgr dbname=repmgr connect_timeout=2
#######################从库2(192.168.1.107)############################
步骤跟部署从库1一致
从库2也加入到集群后查看
[postgres@pg3 data]$ /usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------
1 | pg1 | primary | * running | | default | 100 | 1 | host=192.168.1.105 user=repmgr dbname=repmgr connect_timeout=2
2 | pg2 | standby | running | pg1 | default | 100 | 1 | host=192.168.1.106 user=repmgr dbname=repmgr connect_timeout=2
3 | pg3 | standby | running | pg1 | default | 100 | 1 | host=192.168.1.107 user=repmgr dbname=repmgr connect_timeout=2
[postgres@pg3 data]$
########################常规主从(当前主从都正常)切换##################
在从库1(192.168.1.106)上执行
先检查可行性:
su - postgres
/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind
真正执行切换:
[postgres@pg2 data]$ /usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf standby switchover --siblings-follow
NOTICE: executing switchover on node "pg2" (ID: 2)
NOTICE: attempting to pause repmgrd on 3 nodes
NOTICE: local node "pg2" (ID: 2) will be promoted to primary; current primary "pg1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "pg1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "pg1" (ID: 1)
DETAIL: executing server command "/usr/pgsql-13/bin/pg_ctl -D '/opt/pg13/data' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/7000028
NOTICE: promoting standby to primary
DETAIL: promoting server "pg2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "pg2" (ID: 2) was successfully promoted to primary
NOTICE: node "pg2" (ID: 2) promoted to primary, node "pg1" (ID: 1) demoted to standby
NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
NOTICE: switchover was successful
DETAIL: node "pg2" is now primary and node "pg1" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
查看当前的集群状态
[postgres@pg2 data]$ /usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------
1 | pg1 | standby | running | pg2 | default | 100 | 1 | host=192.168.1.105 user=repmgr dbname=repmgr connect_timeout=2
2 | pg2 | primary | * running | | default | 100 | 2 | host=192.168.1.106 user=repmgr dbname=repmgr connect_timeout=2
3 | pg3 | standby | running | pg2 | default | 100 | 1 | host=192.168.1.107 user=repmgr dbname=repmgr connect_timeout=2
[postgres@pg2 data]$
###############目的主库down掉,提升一个从库作为主库#########################
目的:主库down掉无法使用,将其中一个备库拉升为主库
1.主库(192.168.1.105)停掉
[root@pg1 ~]# systemctl stop postgresql-13
将从库1(192.168.1.106)角色提升为主库
[postgres@pg2 ~]$ /usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf standby promote
WARNING: 1 sibling nodes found, but option "--siblings-follow" not specified
DETAIL: these nodes will remain attached to the current primary:
pg3 (node ID: 3)
NOTICE: promoting standby to primary
DETAIL: promoting server "pg2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "pg2" (ID: 2) was successfully promoted to primary
这个时候查看另外的从库2是否发生切换
postgres=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_
xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | s
ync_state | reply_time
-----+----------+---------+------------------+-------------+-----------------+-------------+---------------+---------
-----+-------+----------+-----------+-----------+------------+-----------+-----------+------------+---------------+--
----------+------------
(0 rows)
发现另一个从库2是没有跟追新主库的
3.将原来的主库1作为从库加入到集群
这个时候不能启动原主库,若启动了的需要将其停掉再执行如下语句
[postgres@pg1 data]$ /usr/pgsql-13/bin/repmgr node rejoin -d 'host=192.168.1.106 port=5432 user=repmgr dbname=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
INFO: checking for package configuration file "/etc/repmgr/13/repmgr.conf"
INFO: configuration file found at: "/etc/repmgr/13/repmgr.conf"
NOTICE: rejoin target is node "pg2" (ID: 2)
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 4 forked off current database system timeline 3 before current recovery point 0/A000028
INFO: prerequisites for using pg_rewind are met
WARNING: specified file "/opt/pg13/data//opt/pg13/data/postgresql.conf" not found, skipping
WARNING: specified file "/opt/pg13/data//opt/pg13/data/postgresql.auto.conf" not found, skipping
INFO: 0 files copied to "/tmp/repmgr-config-archive-pg1"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/usr/pgsql-13/bin/pg_rewind -D '/opt/pg13/data' --source-server='host=192.168.1.106 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 0 files copied to /opt/pg13/data
INFO: directory "/tmp/repmgr-config-archive-pg1" deleted
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.1.105 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/usr/pgsql-13/bin/pg_ctl -w -D '/opt/pg13/data' start"
INFO: node "pg1" (ID: 1) is pingable
INFO: node "pg1" (ID: 1) has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
4.将从库2修改跟随新的主库(192.168.1.106)
在从库2上执行
[postgres@pg3 data]$ /usr/pgsql-13/bin/repmgr standby follow -f /etc/repmgr/13/repmgr.conf
NOTICE: attempting to find and follow current primary
ERROR: this node cannot attach to follow target node 2
DETAIL: follow target server's timeline 4 forked off current database system timeline 3 before current recovery point 0/A0000A0
好像这种方式不行,尝试下面的方式
停掉从库2
[root@pg3 ~]# systemctl stop postgresql-13
su - postgres
[postgres@pg3 data]$ /usr/pgsql-13/bin/repmgr node rejoin -d 'host=192.168.1.106 port=5432 user=repmgr dbname=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
INFO: checking for package configuration file "/etc/repmgr/13/repmgr.conf"
INFO: configuration file found at: "/etc/repmgr/13/repmgr.conf"
NOTICE: rejoin target is node "pg2" (ID: 2)
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 4 forked off current database system timeline 3 before current recovery point 0/A0000A0
INFO: prerequisites for using pg_rewind are met
WARNING: specified file "/opt/pg13/data//opt/pg13/data/postgresql.conf" not found, skipping
WARNING: specified file "/opt/pg13/data//opt/pg13/data/postgresql.auto.conf" not found, skipping
INFO: 0 files copied to "/tmp/repmgr-config-archive-pg3"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/usr/pgsql-13/bin/pg_rewind -D '/opt/pg13/data' --source-server='host=192.168.1.106 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 0 files copied to /opt/pg13/data
INFO: directory "/tmp/repmgr-config-archive-pg3" deleted
NOTICE: setting node 3's upstream to node 2
WARNING: unable to ping "host=192.168.1.107 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/usr/pgsql-13/bin/pg_ctl -w -D '/opt/pg13/data' start"
INFO: node "pg3" (ID: 3) is pingable
INFO: node "pg3" (ID: 3) has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 3 is now attached to node 2
#############################配置自动切换######################
1.修改配置文件
每个节点都需要进行修改
su - root
vi /etc/repmgr/13/repmgr.conf
#集群faibver设置(若已经配置的可略过)
failover='automatic'
promote_command='/usr/pgsql-13/bin/repmgr standby promote -f /etc/repmgr/13/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-13/bin/repmgr standby follow -f /etc/repmgr/13/repmgr.conf --log-to-file --upstream-node-id=%n'
2.启动后台守护进程repmgrd
每个节点都需要启动
su - postgres
/usr/pgsql-13/bin/repmgrd -f /etc/repmgr/13/repmgr.conf
3.模拟主库down机
192.168.1.105机器上操作
[root@pg1 ~]# systemctl stop postgresql-13
可以看到从库1自动切换为了主库
[2023-01-12 20:24:10] [WARNING] 1 sibling nodes found, but option "--siblings-follow" not specified
[2023-01-12 20:24:10] [DETAIL] these nodes will remain attached to the current primary:
pg3 (node ID: 3)
[2023-01-12 20:24:10] [NOTICE] promoting standby to primary
[2023-01-12 20:24:10] [DETAIL] promoting server "pg2" (ID: 2) using pg_promote()
[2023-01-12 20:24:10] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2023-01-12 20:24:11] [NOTICE] STANDBY PROMOTE successful
[2023-01-12 20:24:11] [DETAIL] server "pg2" (ID: 2) was successfully promoted to primary
[2023-01-12 20:24:11] [INFO] checking state of node 2, 1 of 6 attempts
[2023-01-12 20:24:11] [NOTICE] node 2 has recovered, reconnecting
[2023-01-12 20:24:11] [INFO] connection to node 2 succeeded
[2023-01-12 20:24:11] [INFO] original connection is still available
[2023-01-12 20:24:11] [INFO] 1 followers to notify
[2023-01-12 20:24:11] [NOTICE] notifying node "pg3" (ID: 3) to follow node 2
INFO: node 3 received notification to follow node 2
[2023-01-12 20:24:11] [INFO] switching to primary monitoring mode
[2023-01-12 20:24:11] [NOTICE] monitoring cluster primary "pg2" (ID: 2)
[2023-01-12 20:24:17] [NOTICE] new standby "pg3" (ID: 3) has connected
4.尝试将原来的主库启动
启动
[root@pg1 ~]# systemctl start postgresql-13
然后跟随新主库
su - postgres
[postgres@pg1 data]$ /usr/pgsql-13/bin/repmgr standby follow -f /etc/repmgr/13/repmgr.conf
ERROR: this node should be a standby (host=192.168.1.105 user=repmgr dbname=repmgr connect_timeout=2)
因为原主库启动后脱离整个集群了,不再是一个从节点,需要执行如下命令加入到集群
先停掉数据库
[root@pg1 ~]# systemctl stop postgresql-13
su - postgres
/usr/pgsql-13/bin/repmgr node rejoin -d 'host=192.168.1.106 port=5432 user=repmgr dbname=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
########################repmgr管理#####################
1.启动守护进程 repmgrd
su - postgres
/usr/pgsql-13/bin/repmgrd -f /etc/repmgr/13/repmgr.conf
2.查看状态node
/usr/pgsql-13/bin/repmgrd -f /etc/repmgr/13/repmgr.conf node status
3.集群相关
/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster show
/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster matrix
/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster crosscheck
4.service 相关
/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf service status
5.暂停
/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf service pause
-- The End --