安装环境

操作系统:openeuler 22 LTS SP3
数据库:postgresql 16.3
两台已经安装了数据库的服务器

10.11.110.46	node1
10.11.110.47	node2

配置node1和node2之间的免密登录,以及postgres用户需要无密执行systemd启停数据库服务。

echo "postgres ALL = (root) NOPASSWD:/usr/bin/systemctl start postgresql.service,/usr/bin/systemctl restart postgresql.service,/usr/bin/systemctl stop postgresql.service,/usr/bin/systemctl reload postgresql.service,/usr/bin/systemctl status postgresql.service" | sudo tee /etc/sudoers.d/postgres

chmod 0440 /etc/sudoers.d/postgres
visudo -c

注意这里配置了sudo密码执行systemctl命令,手动运行的时候也必须指定命令全路径
/etc/hosts配置ip与主机名映射关系

10.11.110.46 node1
10.11.110.47 node2

repmgr源码包下载
https://www.repmgr.org/download/repmgr-5.4.1.tar.gz

安装repmgr

su - postgres
tar  zxvf repmgr-5.4.1.tar.gz
export PG_CONFIG=/usr/local/postgresql/bin/pg_config
./configure
make
make install

如果编译报错如下,是因为缺少包导致

/usr/bin/ld: 找不到 -lcurl
/usr/bin/ld: 找不到 -ljson-c

安装依赖包

dnf install libcurl-devel json-c-devel -y

配置repmgr

postgresql配置

选择node1作为主节点,修改配置文件/usr/local/postgresql/data/postgresql.conf

max_wal_senders = 10
max_replication_slots = 10
wal_level = replica
hot_standby = on
archive_mode = on
wal_log_hints = on
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'

修改配置文件/usr/local/postgresql/data/pg_hba.conf,配置repmgr用户的认证权限,确保其有适当的权限,并且能够以复制模式连接

local   replication   repmgr                                    trust
host   replication   repmgr            127.0.0.1/32            trust
host    replication   repmgr            10.11.110.46/32         trust
host    replication   repmgr            10.11.110.47/32         trust
local   repmgr        repmgr                                    trust
host    repmgr        repmgr            127.0.0.1/32            trust
host    repmgr        repmgr            10.11.110.46/32         trust
host    repmgr        repmgr            10.11.110.47/32         trust

修改完成后重启数据库systemctl restart postgresql
在node1上创建repmgr超级用户和repmgr数据库

su - postgres
createuser  -s repmgr
createdb repmgr -O repmgr
psql
postgres=# ALTER USER repmgr ENCRYPTED PASSWORD 'repmgr';

备节点配置

在node2上停止postgresql,并删除其数据目录,如果之前没有初始化数据库就不用管。
从备节点测试是否能够正常连接主库
psql 'host=node1 user=repmgr dbname=repmgr connect_timeout=2'

创建repmgr配置文件

在node1和node2上创建/usr/local/postgresql/data/repmgr.conf
node1的配置文件

node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/usr/local/postgresql/data'
log_file='/usr/local/postgresql/log/repmgr.log'
pg_bindir='/usr/local/postgresql/bin'
ssh_options='-q -o ConnectTimeout=10'
failover='automatic'
promote_command='/usr/local/postgresql/bin/repmgr standby promote -f /usr/local/postgresql/repmgr/repmgr.conf --log-to-file'
follow_command='/usr/local/postgresql/bin/repmgr standby follow -f /usr/local/postgresql/repmgr/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command='sudo systemctl start postgresql.service'
service_stop_command='sudo systemctl stop postgresql.service'
service_restart_command='sudo systemctl restart postgresql.service'
service_reload_command='sudo systemctl reload postgresql.service'

node2的配置文件

node_id=2
node_name=node2
conninfo='host=node2 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/usr/local/postgresql/data'
log_file='/usr/local/postgresql/log/repmgr.log'
pg_bindir='/usr/local/postgresql/bin'
ssh_options='-q -o ConnectTimeout=10'
failover='automatic'
promote_command='/usr/local/postgresql/bin/repmgr standby promote -f /usr/local/postgresql/repmgr/repmgr.conf --log-to-file'
follow_command='/usr/local/postgresql/bin/repmgr standby follow -f /usr/local/postgresql/repmgr/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command='sudo systemctl start postgresql.service'
service_stop_command='sudo systemctl stop postgresql.service'
service_restart_command='sudo systemctl restart postgresql.service'
service_reload_command='sudo systemctl reload postgresql.service'

其他配置项参考样例:https://raw.githubusercontent.com/EnterpriseDB/repmgr/master/repmgr.conf.sample

注册主节点

在主节点node1上注册repmgr主节点信息,将安装repmgr扩展和元数据对象,并为主服务器添加一个元数据记录。

repmgr -f /usr/local/postgresql/repmgr/repmgr.conf primary register

INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered

查看集群状态

repmgr -f /usr/local/postgresql/repmgr/repmgr.conf cluster show
 
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                     
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=node1 user=repmgr dbname=repmgr connect_timeout=2

repmgr元数据如下:

[postgres@node1 data]$ psql -Urepmgr
repmgr=# \x
Expanded display is on.
repmgr=# select * from repmgr.nodes;
-[ RECORD 1 ]----+-------------------------------------------------------
node_id          | 1
upstream_node_id | 
active           | t
node_name        | node1
type             | primary
location         | default
priority         | 100
conninfo         | host=node1 user=repmgr dbname=repmgr connect_timeout=2
repluser         | repmgr
slot_name        | 
config_file      | /usr/local/postgresql/repmgr/repmgr.conf

一个PostgreSQL复制集群中的每个服务器都将拥有自己的记录。在使用repmgrd时,当节点状态或角色发生变化时,upstream_node_id、active和type字段会更新。

备节点配置

在node2节点上测试(--dry-run)一下是否能正常clone主库数据:

[postgres@node2 data]$ repmgr -h node1 -U repmgr -d repmgr -f /usr/local/postgresql/repmgr/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/usr/local/postgresql/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node1 user=repmgr dbname=repmgr
DETAIL: current installation size is 29 MB
INFO: "repmgr" extension is installed in database "repmgr"
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 10 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: would execute:
  /usr/local/postgresql/pg_basebackup -l "repmgr base backup"  -D /usr/local/postgresql/data -h node1 -p 5432 -U repmgr -X stream 
INFO: all prerequisites for "standby clone" are met

没有报错的情况下,开始正式克隆

[postgres@node2 postgresql]$ repmgr  -h node1 -U repmgr -d repmgr -f /usr/local/postgresql/repmgr/repmgr.conf standby clone
NOTICE: destination directory "/usr/local/postgresql/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node1 user=repmgr dbname=repmgr
DETAIL: current installation size is 30 MB
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/usr/local/postgresql/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  /usr/local/postgresql/bin/pg_basebackup -l "repmgr base backup"  -D /usr/local/postgresql/data -h node1 -p 5432 -U repmgr -X stream 
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: sudo systemctl start pgserver
HINT: after starting the server, you need to register this standby with "repmgr standby register"

实际上使用了pg_basebackup命令clone了主节点的数据目录文件。主节点数据目录中的配置文件也都被复制到了备节点的数据目录中,包括postgresql.conf, postgresql.auto.conf, pg_hba.conf和pg_ident.conf。 如果不需要针对备节点定制修改这些配置的话,就可以启动备节点的数据库服务了:
sudo /usr/bin/systemctl start postgresql.service

验证复制是否正常工作

在主库node1上查看复制情况

[postgres@node1 repmgr]$ psql   -Urepmgr
psql (16.3)
Type "help" for help.

repmgr=# \x
Expanded display is on.
repmgr=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 7665
usesysid         | 16388
usename          | repmgr
application_name | node2
client_addr      | 10.11.110.47
client_hostname  | 
client_port      | 40102
backend_start    | 2024-05-28 17:39:27.557961+08
backend_xmin     | 
state            | streaming
sent_lsn         | 0/7000328
write_lsn        | 0/7000328
flush_lsn        | 0/7000328
replay_lsn       | 0/7000328
write_lag        | 
flush_lag        | 
replay_lag       | 
sync_priority    | 0
sync_state       | async
reply_time       | 2024-05-28 21:48:53.074483+08

备库node2上查看复制情况

[postgres@node1 repmgr]$ psql   -Urepmgr
psql (16.3)
Type "help" for help.

repmgr=# \x
Expanded display is on.
repmgr=# select * from pg_stat_wal_receiver;
-[ RECORD 1 ]---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pid                   | 6255
status                | streaming
receive_start_lsn     | 0/7000000
receive_start_tli     | 1
written_lsn           | 0/7000328
flushed_lsn           | 0/7000328
received_tli          | 1
last_msg_send_time    | 2024-05-28 21:48:33.053217+08
last_msg_receipt_time | 2024-05-28 21:48:33.054117+08
latest_end_lsn        | 0/7000328
latest_end_time       | 2024-05-28 17:42:57.692486+08
slot_name             | 
sender_host           | node1
sender_port           | 5432
conninfo              | user=repmgr passfile=/home/postgres/.pgpass channel_binding=disable connect_timeout=2 dbname=replication host=node1 port=5432 application_name=node2 fallback_application_name=walreceiver sslmode=disable sslcompression=0 sslcertmode=disable sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable

注册备节点

在node2上使用repmgr命令将node2注册为备节点

[postgres@node2 ~]$ repmgr -f /usr/local/postgresql/repmgr/repmgr.conf standby register
INFO: connecting to local node "node2" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID: 1)
INFO: standby registration complete
NOTICE: standby node "node2" (ID: 2) successfully registered

此时再次查看集群状态

[postgres@node2 ~]$ repmgr -f /usr/local/postgresql/repmgr/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                     
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=node1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=node2 user=repmgr dbname=repmgr connect_timeout=2

手动切换测试

在备节点node2上执行切换操作,手动将node2切换为主节点

[postgres@node2 ~]$ repmgr -f /usr/local/postgresql/repmgr/repmgr.conf standby switchover
NOTICE: executing switchover on node "node2" (ID: 2)
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "node1" (ID: 1) 
DETAIL: executing server command "sudo systemctl stop postgresql.service"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/9000028
NOTICE: waiting up to 30 seconds (parameter "wal_receive_check_timeout") for received WAL to flush to disk
INFO: sleeping 1 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 2 of maximum 30 seconds waiting for standby to flush received WAL to disk
INFO: sleeping 3 of maximum 30 seconds waiting for standby to flush received WAL to disk
[...]
INFO: sleeping 30 of maximum 30 seconds waiting for standby to flush received WAL to disk
WARNING: local node "node2" is behind shutdown primary "node1"
DETAIL: local node last receive LSN is 0/82A0000, primary shutdown checkpoint LSN is 0/9000028
NOTICE: aborting switchover
HINT: use --always-promote to force promotion of standby
[postgres@node2 ~]$ repmgr -f /usr/local/postgresql/repmgr/repmgr.conf cluster show
 ID | Name  | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string                                     
----+-------+---------+---------------+----------+----------+----------+----------+--------------------------------------------------------
 1  | node1 | primary | ? unreachable | ?        | default  | 100      |          | host=node1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running     | ? node1  | default  | 100      | 1        | host=node2 user=repmgr dbname=repmgr connect_timeout=2
 

第一次切换可能会遇到如上的错误,主节点停掉了,但是报错WARNING: local node "node2" is behind shutdown primary "node1"。查询资料得到的解决办法是,将postgresql.conf配置文件做如下修改

archive_command = '{ sleep 5;true; }'

重新加载配置sudo /usr/bin/systemctl reload postgresql.service ,再执行手动切换

[postgres@node2 ~]$ repmgr -f /usr/local/postgresql/repmgr/repmgr.conf standby switchover
NOTICE: executing switchover on node "node2" (ID: 2)
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "node1" (ID: 1) 
DETAIL: executing server command "sudo systemctl stop postgresql.service"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/A000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
NOTICE: node "node2" (ID: 2) promoted to primary, node "node1" (ID: 1) demoted to standby
NOTICE: switchover was successful
DETAIL: node "node2" is now primary and node "node1" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

重新查看集群状态

[postgres@node2 ~]$ repmgr -f /usr/local/postgresql/repmgr/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                     
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 1        | host=node1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 2        | host=node2 user=repmgr dbname=repmgr connect_timeout=2

使用repmgrd实现自动故障转移

启动repmgrd

在node1和node2上配置systemd配置文件/lib/systemd/system/repmgrd.service

[Unit]
Description=A replication manager, and failover management tool for PostgreSQL
After=syslog.target
After=network.target
After=postgresql.service

[Service]
Type=forking

User=postgres
Group=postgres

# PID file
PIDFile=/usr/local/postgresql/data/repmgrd.pid

# Location of repmgr conf file:
Environment=REPMGRDCONF=/usr/local/postgresql/repmgr/repmgr.conf
Environment=PIDFILE=/usr/local/postgresql/data/repmgrd.pid

# Where to send early-startup messages from the server
# This is normally controlled by the global default set by systemd
# StandardOutput=syslog
ExecStart=/usr/local/postgresql/bin/repmgrd -f ${REPMGRDCONF} -p ${PIDFILE} -d --verbose
ExecStop=/usr/bin/kill -TERM $MAINPID
ExecReload=/usr/bin/kill -HUP $MAINPID

# Give a reasonable amount of time for the server to start up/shut down
TimeoutSec=300

[Install]
WantedBy=multi-user.target

在node1和node2上启动repmgrd服务

systemctl enable repmgrd --now

自动故障转移测试

当前集群状态

[postgres@node2 ~]$ repmgr -f /usr/local/postgresql/repmgr/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                     
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 2        | host=node1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 2        | host=node2 user=repmgr dbname=repmgr connect_timeout=2

当前主节点在node2,模拟主节点宕掉

[postgres@node2 ~]$ sudo systemctl stop postgresql.service

等一会儿查看集群状态,node1节点已提升为主节点

[postgres@node1 ~]$ repmgr -f /usr/local/postgresql/repmgr/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                     
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        | host=node1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | - failed  | ?        | default  | 100      |          | host=node2 user=repmgr dbname=repmgr connect_timeout=2

重新启动node2节点postgresql,会出现下面的状况,这个也是这种主备高可用方案的缺点,出现了两个主节点

[postgres@node1 ~]$ repmgr -f /usr/local/postgresql/repmgr/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                     
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        | host=node1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | ! running |          | default  | 100      | 2        | host=node2 user=repmgr dbname=repmgr connect_timeout=2

node2的状态是不对的,先将postgresql再次停止,并重新以备节点的角色加入到集群

[postgres@node2 ~]$ sudo systemctl stop postgresql.service 
[postgres@node2 ~]$ repmgr -f /usr/local/postgresql/repmgr/repmgr.conf node rejoin  -d 'host=node1 dbname=repmgr user=repmgr' --force-rewind
NOTICE: rejoin target is node "node1" (ID: 1)
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 1
DETAIL: rejoin target server's timeline 3 forked off current database system timeline 2 before current recovery point 0/F000028
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/usr/local/postgresql/bin/pg_rewind -D '/usr/local/postgresql/data' --source-server='host=node1 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 0 files copied to /usr/local/postgresql/data
NOTICE: setting node 2's upstream to node 1
WARNING: unable to ping "host=node2 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "sudo systemctl start postgresql.service"
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1

注意:这里使用了--force-rewind参数,正常情况下先不加该参数执行rejoin,不成功的情况下再添加该参数。-d 参数后面指定的host为当前的主节点。在正式进行操作前可以先加--dry-run参数测试一下能否正常执行。
因此,如果是生产上用的双节点主备集群,那么当发生主备切换后,手动将主节点启动的时候就直接rejoin
repmgr -f /usr/local/postgresql/repmgr/repmgr.conf node rejoin -d 'host=node1 dbname=repmgr user=repmgr'

keepalived实现VIP

repmgr自身没有vip配置功能,可以使用keepalived来实现vip功能。
使用yum安装keepalived即可,配置如下
MASTER节点/etc/keepalived/keepalived.conf

vrrp_script chk_myscript {
  script       "/usr/local/bin/is_postgres_primary.sh"
  interval 3   # check every 3 seconds
  fall 3       # require 3 failures for KO
  rise 3       # require 3 successes for OK
  user postgres postgres
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens33
    virtual_router_id 51
    priority 100   # make all keepalived have different priority
    nopreempt
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 123465
    }
    virtual_ipaddress {
        10.11.88.44/16     # this is the standard VIP, change as needed
    }

    track_script {
        chk_myscript
    }
}

BACKUP节点/etc/keepalived/keepalived.conf

vrrp_script chk_myscript {
  script       "/usr/local/bin/is_postgres_primary.sh"
  interval 3   # check every 3 seconds
  fall 3       # require 3 failures for KO
  rise 3       # require 3 successes for OK
  user postgres postgres
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens33
    virtual_router_id 51
    priority 90   # make all keepalived have different priority
    nopreempt
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 123465
    }
    virtual_ipaddress {
        10.11.88.44/16     # this is the standard VIP, change as needed
    }

    track_script {
        chk_myscript
    }
}

postgresql检查脚本/usr/local/bin/is_postgres_primary.sh,记得赋予执行权限

#!/bin/bash
export PATH=/usr/local/postgresql/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
export PGDATA=/usr/local/postgresql/data
export PGDATABASE=postgres
export PGUSER=postgres
export PGPORT=5432
psql -q -t -c 'select pg_is_in_recovery()' |grep 'f'

配置好后启动keepalived,只有在对应节点postgresql正常运行且为primary节点时,VIP才会在当前服务器。

pgbouncer安装

下载源码包
https://www.pgbouncer.org/downloads/files/1.22.1/pgbouncer-1.22.1.tar.gz
安装依赖
dnf install make make-devel libevent libevent-devel pkgconfig pkgconfig-devel openssl openssl-devel -y
编译安装

tar zxvf pgbouncer-1.22.1.tar.gz && cd pgbouncer-1.22.1
./configure --prefix=/usr/local/pgbouncer --with-systemd
make && make install

安装好以后样例配置文件及service配置文件等在路径/usr/local/pgbouncer/share/doc/pgbouncer下面。
pgbouncer.ini配置

[databases]
test = host=10.11.88.44 port=5432 pool_size=20 dbname=test
#如果懒得配指定数据库也可以用下面这种通配符方式
* = host=10.11.88.44 port=5432 pool_size=20

[pgbouncer]
logfile = /usr/local/pgbouncer/log/pgbouncer.log
pidfile = /usr/local/pgbouncer/pgbouncer.pid

listen_addr = *
listen_port = 16432

auth_type = md5
auth_file = /usr/local/pgbouncer/userlist.txt

admin_users = postgres
stats_users = stats, root

pool_mode = session
server_reset_query = DISCARD ALL
server_check_query = select 1

max_client_conn = 1000
default_pool_size = 20
min_pool_size = 10
reserve_pool_size = 10
reserve_pool_timeout = 5
max_db_connections = 100
max_user_connections = 100

auth_file中的用户即访问后端真实postgresql数据库中的所需要的用户。
auth_type 配置为md5是会自动兼容md5和scram-sha-256两种加密方式。PG14之后默认scram-sha-256。可通过select usename,passwd from pg_shadow order by 1;查看。
日志文件路径要提前创建mkdir /usr/local/pgbouncer/log
host 这里配置的VIP地址

userlist.txt文件的内容可以直接从数据库pg_shadow里查询出来粘贴进去,样例如下:

"bill" "SCRAM-SHA-256$4096:orUtCmrN6jsZDygp3sJuog==$JevG8fFxys35E4Qe+aonmOm52PvXEHhmMgmCumsL9TM=:6E1YwsQc4CMYIXsy2koQFD2sbbphZb1s+dENcotanTM="

配置systemd启动脚本/lib/systemd/system/pgbouncer.service

[Unit]
Description=connection pooler for PostgreSQL
Documentation=man:pgbouncer(1)
Documentation=https://www.pgbouncer.org/
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
User=postgres
ExecStart=/usr/local/pgbouncer/bin/pgbouncer /usr/local/pgbouncer/pgbouncer.ini
ExecReload=/bin/kill -HUP $MAINPID
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target

启动pgbouncer后就可以连接数据库了
systemctl enable pgbouncer.service --now
psql -h 10.11.88.44 -p 16432 -U bill -d test
本地登录pgbouncer控制台,可以查看连接状态等信息。

[postgres@node1 ~]$  psql -p 16432 pgbouncer pgbouncer
psql (16.3, server 1.22.1/bouncer)
WARNING: psql major version 16, server major version 1.22.
         Some psql features might not work.
Type "help" for help.

pgbouncer=# show help;
NOTICE:  Console usage
DETAIL:  
        SHOW HELP|CONFIG|DATABASES|POOLS|CLIENTS|SERVERS|USERS|VERSION
        SHOW PEERS|PEER_POOLS
        SHOW FDS|SOCKETS|ACTIVE_SOCKETS|LISTS|MEM|STATE
        SHOW DNS_HOSTS|DNS_ZONES
        SHOW STATS|STATS_TOTALS|STATS_AVERAGES|TOTALS
        SET key = arg
        RELOAD
        PAUSE [<db>]
        RESUME [<db>]
        DISABLE <db>
        ENABLE <db>
        RECONNECT [<db>]
        KILL <db>
        SUSPEND
        SHUTDOWN
        WAIT_CLOSE [<db>]
SHOW
pgbouncer=# show clients;
 type |   user    | database  | state  | addr | port  | local_addr | local_port |      connect_time       |      request_time       | wait | wait_us | close_needed |   ptr    | link | remote_pid | tls | application_name | prepared_stateme
nts 
------+-----------+-----------+--------+------+-------+------------+------------+-------------------------+-------------------------+------+---------+--------------+----------+------+------------+-----+------------------+-----------------
----
 C    | pgbouncer | pgbouncer | active | unix | 16432 | unix       |      16432 | 2024-05-29 21:27:14 CST | 2024-05-29 21:27:20 CST |    4 |  390026 |            0 | 0xf5a2a0 |      |      84252 |     | psql             |                 
  0
 C    | bill      | test      | active | unix | 16432 | unix       |      16432 | 2024-05-29 21:13:41 CST | 2024-05-29 21:13:41 CST |    0 |       0 |            0 | 0xf59ff0 |      |      82562 |     | psql             |                 
  0
(2 rows)

总结

通过repmgrd实现自动故障转移,结合keepalived使访问请求只会发到primary节点,再加上pgbouncer进行连接池控制,基本上实现了线上最少服务器情况下的postgresql高可用架构。
不足:在发生自动故障转移后,需要手动干预将故障节点重新加入集群。
特殊场景1:即使故障转移后原主节点postgresql被手动或自动的意外拉起来,造成双主的情况也不怕,因为keepalived配置的非抢占模式,vip还是停留在新主节点服务器上。
特殊场景2:手动把主节点的keepalived停掉,此时备节点keepalived检查脚本检测到本机postgresql仍是备节点,故不会将vip添加的备机,当前就处于没有vip的情况,无法对外提供数据库访问服务。

posted on 2024-06-03 18:12  lzc_jack  阅读(287)  评论(0编辑  收藏  举报