Implement repmgr PG14 cluster

扫盲

1. autofailover实际是使用repmgrd进程主动监控并执行主备切换,看上去很强大,实则生产中慎用或者不用。
2. PG switchover时,主备 PG owner 必须配置免密,并且 只能在standby机器上执行switchover的操作

系统环境

OS: CentOS7
192.168.56.11 node1
192.168.56.12 node2

部署操作流程

1. 系统环境初始化


    echo redhat | passwd root --stdin 

    echo -e "192.168.56.11  node1 " >>/etc/hosts

    echo -e "192.168.56.12  node2 " >>/etc/hosts

    sed -i 's/vault.centos.org/mirrors.aliyun.com/g' /etc/yum.repos.d/CentOS*
 

2. 所有节点安装PG14 <root执行>

    sudo yum install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm

    sudo yum -y install epel-release yum-utils
  
    sudo yum-config-manager --enable pgdg14

    sudo yum install postgresql14-server postgresql14 -y

    echo -e " export PATH=/usr/pgsql-14/bin/:$PATH" >>/var/lib/pgsql/.bash_profile


3. 所有节点安装repmgr<root执行>

## on all nodes
sudo yum -y  install repmgr_14*

4. 关闭防火墙或者放行PG端口(默认5432)<root执行>

systemctl stop firewalld

5. 主节点创建repmgr账户 , pg_hba.cfg 以及postgresql.conf 参数调整 , repmgr.conf文件配置 <PG owner执行>

5.1 创建repmgr

createuser repmgr -P -s 

createdb  repmgr -O repmgr;

5.2 调整pg_hba.cfg 以及postgresql.conf参数

##  -- postgresql.conf
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'hot_standby'
hot_standby = on
archive_mode = on
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'

##  -- pg_hba.cfg
--- 这里采用了trust模式,因此无需.pgpass
local   replication   repmgr                              trust
host    replication   repmgr      127.0.0.1/32            trust
host    replication   repmgr      192.168.56.11/32       trust
host    replication   repmgr      192.168.56.12/32       trust

local   repmgr        repmgr                              trust
host    repmgr        repmgr      127.0.0.1/32            trust
host    repmgr        repmgr      192.168.56.11/32       trust
host    repmgr        repmgr      192.168.56.12/32       trust

若有需要 .pgpass ,则需要如下配置:su - postgres
cd /var/lib/pgsql
touch .pgpass
chmod 0600 .pgpass
echo -e "192.168.56.11:5432:repmgr:repmgr:passwd_123456">>.pgpass
echo -e "192.168.56.12:5432:repmgr:repmgr:passwd_123456">>.pgpass

5.3 repmgr.conf文件配置

#-- 主库

node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/14/data/'

#-- 从库
node_id=2
node_name=node2
conninfo='host=node2 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/14/data/'

6. 启动主节点PG 并 注册为primary server <PG owner执行>

# -- 首先启动主库PG
sudo /usr/pgsql-14/bin/postgresql-14-setup initdb

sudo systemctl enable postgresql-14

sudo systemctl start postgresql-14

# -- 先模拟运行注册
su - postgres
repmgr -f /etc/repmgr/14/repmgr.conf primary register --dry-run

# -- 将上面错误全部解决后,执行注册
repmgr -f /etc/repmgr/14/repmgr.conf primary register 


7. 清空备节点PG data目录, 搭建/clone备节点 <PG owner执行>

# - 确保从库PG以及PG 所有data目录是空的,如果不是,则手动删除即可

systemctl stop postgresql-14


rm -rf /var/lib/pgsql/14/data

8. 将备节点注册成standby server 并启动PG实例 <PG owner执行>

repmgr -f /etc/repmgr/14/repmgr.conf standby register  --force

9. 校验集群信息

-bash-4.2$ repmgr cluster show
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=node2 port=5432 fallback_application_name=repmgr options=-csearch_path="
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=192.168.56.11 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=node2 user=repmgr port=5432 dbname=repmgr connect_timeout=2

10. 常见错误

10.1 connection to server at "192.168.56.11", port 5432 failed: Connection refused

# -- 明明PG 实例都启动了,但是死活显示如题,最后排查到 默认情况下 postgresql只监听local,因此修改参数listen-addresses ='*'
-bash-4.2$ repmgr -f /etc/repmgr/14/repmgr.conf primary register --dry-run
INFO: connecting to primary database...
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
ERROR: connection to database failed
DETAIL:
connection to server at "192.168.56.11", port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

DETAIL: attempted to connect using:
  user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path=

#-- solution 修改如下配置即可
查看 postgresql.conf 
listen-addresses ='*'

10.2.ERROR: "repmgr" is not a superuser and no superuser name supplied

# 需要确保做repmgr的账户拥有超级权限
postgres=# \du
                                   List of roles
 Role name |                         Attributes                         | Member of
-----------+------------------------------------------------------------+-----------
 postgres  | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
 repmgr    | Superuser                                                  | {}

10.3. Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

# 系统级别的问题, 配置passwordless时,在节点node1上跑 ssh node2 date会直接拒绝

# - solution
修改如下 PasswordAuthentication = yes
PasswordAuthentication  yes

10.4.网络有问题

-- 各种查看,发现没有问题


-bash-4.2$ repmgr -hnode1 -p5432 -U repmgr -d repmgr -f /etc/repmgr/14/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/var/lib/pgsql/14/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node1 port=5432 user=repmgr dbname=repmgr
ERROR: connection to database failed
DETAIL:
connection to server at "node1" (192.168.56.11), port 5432 failed: No route to host
        Is the server running on that host and accepting TCP/IP connections?

-- 最后发现是防火墙需要关掉 - systemctl stop firewalld 或者通过firewall-cmd放行PG 实例端口

10.5 WARNING: unable to connect to remote host "192.168.56.11" via SSH


-bash-4.2$ repmgr standby switchover -f /etc/repmgr/14/repmgr.conf --dry-run
NOTICE: checking switchover on node "node2" (ID: 2) in --dry-run mode
WARNING: unable to connect to remote host "192.168.56.11" via SSH
ERROR: unable to connect via SSH to host "192.168.56.11", user ""


# - solution
switchover 需要 passwordless,因此配置postgres passwordless即可


10.6 WARNING: local node "node2" is behind shutdown primary "node1"

-bash-4.2$ repmgr standby switchover -f /etc/repmgr/14/repmgr.conf --dry-run
NOTICE: checking switchover on node "node2" (ID: 2) in --dry-run mode
INFO: SSH connection to host "192.168.56.11" succeeded
INFO: able to execute "repmgr" on remote host "192.168.56.11"
INFO: 1 walsenders required, 10 available
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) would be promoted to primary; current primary "node1" (ID: 1) would be demoted to standby
INFO: following shutdown command would be run on node "node1":
  "/usr/pgsql-14/bin/pg_ctl  -D '/var/lib/pgsql/14/data' -W -m fast stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met
---- 看上去没问题

-bash-4.2$ repmgr standby switchover -f /etc/repmgr/14/repmgr.conf
NOTICE: executing switchover on node "node2" (ID: 2)
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node1" (ID: 1)
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.11 port=5432 fallback_application_name=repmgr options=-csearch_path="
NOTICE: issuing CHECKPOINT on node "node1" (ID: 1)
DETAIL: executing server command "/usr/pgsql-14/bin/pg_ctl  -D '/var/lib/pgsql/14/data' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/8000028
NOTICE: waiting up to 30 seconds (parameter "wal_receive_check_timeout") for received WAL to flush to disk
INFO: sleeping 1 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 2 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 3 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 4 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 5 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 6 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 7 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 8 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 9 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 10 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 11 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 12 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 13 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 14 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 15 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 16 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 17 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 18 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 19 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 20 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 21 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 22 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 23 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 24 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 25 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 26 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 27 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 28 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 29 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 30 of maximum 30 seconds waiting for standby to flush received WAL to disk
WARNING: local node "node2" is behind shutdown primary "node1"
DETAIL: local node last receive LSN is 0/7080000, primary shutdown checkpoint LSN is 0/8000028
NOTICE: aborting switchover
HINT: use --always-promote to force promotion of standby

-- 此时查看,主PG已经关闭了, Standby PG 还在继续运行中
-bash-4.2$ repmgr -f /etc/repmgr/14/repmgr.conf cluster show
 ID | Name  | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string

----+-------+---------+---------------+----------+----------+----------+----------+--------------------------------------------------------------------------
 1  | node1 | primary | ? unreachable | ?        | default  | 100      |          | host=192.168.56.11 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running     | ? node1  | default  | 100      | 1        | host=node2 user=repmgr port=5432 dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)
  - node "node1" (ID: 1) is registered as an active primary but is unreachable
  - unable to connect to node "node2" (ID: 2)'s upstream node "node1" (ID: 1)
  - unable to determine if node "node2" (ID: 2) is attached to its upstream node "node1" (ID: 1)


-- 然后重新启动 主PG,集群还是继续运行
systemctl start postgresql-14

-bash-4.2$ repmgr -f /etc/repmgr/14/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string

----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=192.168.56.11 port=5432 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=node2 user=repmgr port=5432 dbname=repmgr connect_timeout=2



-- Solution



11 常用repmgr命令

repmgr primary register — initialise a repmgr installation and register the primary node
repmgr primary unregister — unregister an inactive primary node
repmgr standby clone — clone a PostgreSQL standby node from another PostgreSQL node
repmgr standby register — add a standby's information to the repmgr metadata
repmgr standby unregister — remove a standby's information from the repmgr metadata
repmgr standby promote — promote a standby to a primary
repmgr standby follow — attach a running standby to a new upstream node
repmgr standby switchover — promote a standby to primary and demote the existing primary to a standby
repmgr witness register — add a witness node's information to the repmgr metadata
repmgr witness unregister — remove a witness node's information to the repmgr metadata
repmgr node status — show overview of a node's basic information and replication status
repmgr node check — performs some health checks on a node from a replication perspective
repmgr node rejoin — rejoin a dormant (stopped) node to the replication cluster
repmgr node service — show or execute the system service command to stop/start/restart/reload/promote a node
repmgr cluster show — display information about each registered node in the replication cluster
repmgr cluster matrix — runs repmgr cluster show on each node and summarizes output
repmgr cluster crosscheck — cross-checks connections between each combination of nodes
repmgr cluster event — output a formatted list of cluster events
repmgr cluster cleanup — purge monitoring history
repmgr service status — display information about the status of repmgrd on each node in the cluster
repmgr service pause — Instruct all repmgrd instances in the replication cluster to pause failover operations
repmgr service unpause — Instruct all repmgrd instances in the replication cluster to resume failover operations
repmgr daemon start — Start the repmgrd daemon on the local node
repmgr daemon stop — Stop the repmgrd daemon on the local node

12 参考资料

https://repmgr.org/docs/current/index.html

posted @ 2023-05-07 09:31  DBAGPT  阅读(212)  评论(0编辑  收藏  举报