使用Patroni管理LightDB高可用

使用Patroni管理LightDB高可用

测试环境

CPU: 海光x86
OS: Kylin Advanced Server V10 SP1
LightDB: 13.8-22.3
Patroni: 2.1.3
etcd: 3.5.4

安装部署etcd集群

需要3台机器。
centos/RHEL等可以从epel获取etcd。
麒麟ky10,ky10sp1没有etcd包,可以使用lightdb预编译的etcd-3.5.4。
解压并设置PATH:

cd $HOME
tar xzf etcd-3.5.4-ky10.hygon.tar.gz
export PATH=$HOME/etcd/bin:$PATH
which etcd
etcd --version

启动集群,

TOKEN=token-01
CLUSTER_STATE=new
NAME_1=machine-2
NAME_2=machine-34
NAME_3=machine-35
HOST_1=10.19.64.2
HOST_2=10.19.64.34
HOST_3=10.19.64.35
CLUSTER=${NAME_1}=http://${HOST_1}:2380,${NAME_2}=http://${HOST_2}:2380,${NAME_3}=http://${HOST_3}:2380
# 在10.19.62.2 (machine-2)启动
THIS_IP=${HOST_1}
THIS_NAME=${NAME_1}
etcd --data-dir=data.etcd --name ${THIS_NAME} \
--initial-advertise-peer-urls http://${THIS_IP}:2380 --listen-peer-urls http://${THIS_IP}:2380 \
--advertise-client-urls http://${THIS_IP}:2379 --listen-client-urls http://${THIS_IP}:2379 \
--enable-v2=true \
--initial-cluster ${CLUSTER} \
--initial-cluster-state ${CLUSTER_STATE} --initial-cluster-token ${TOKEN}
# 在另外两台机器启动,只需要修改THIS_NAME和THIS_IP

安装LightDB

数据库部署在机器A、B,每一个LightDB实例会有对应一个patroni,1对1。
在机器A安装单机版lightdb-13.8-22.3,安装时选择创建实例。这个实例会作为primary
在机器B安装单机版lightdb-13.8-22.3,安装时选择不创建实例。机器B会作为standby
安装步骤详见LightDB安装手册。
本文示例中机器A为10.19.64.76, 机器B为10.19.64.77。

安装patroni

在机器A, B安装patroni。
解压,安装依赖:

tar xzf patroni-2.1.3-lightdb.tar.gz
cd patroni-2.1.3-lightdb
pip3 install --user -r requirements.txt
pip3 install --user psycopg
sudo yum install python3-psutil

或离线安装依赖:

unzip patroni-requirements-ky10sp1.hygon.zip
pip3 install --user *.whl
pip3 install --user ydiff-1.2.tar.gz
pip3 install --user python-etcd-0.4.5.tar.gz
pip3 install --user pysyncobj-0.3.11.tar.gz
pip3 install --user psutil-5.7.3.tar.gz
pip3 install --user cffi-1.15.1.tar.gz

使用patroni启动lightdb集群

参考patroni文档: https://patroni.readthedocs.io/en/latest/existing_data.html

机器A: primary

进入patroni-2.1.3-lightdb目录, 根据实际IP、端口修改lightdb0.yml

# etcd配置段中添加hosts, 去掉原有hosts、host配置
hosts:
- 10.19.64.2:2379
- 10.19.64.34:2379
- 10.19.64.35:2379
# postgresql配置段
# 修改listen,设置为机器A上实例的监听端口和IP
listen: 127.0.0.1,10.19.64.76:55432
connect_addr: 10.19.64.76:55432
# 修改data_dir,同前文安装时一致
data_dir: /home/lightdb/cluster_A/lightdb0
# 修改superuser用户名密码, 同前文安装时保持一致, 如
superuser:
username: lightdb
password: lightdb999

检查环境变量是否正确配置。
此时如果db实例未启动则启动:lt_ctl start
连接实例,根据ligntdb0.ymlauthentication创建patroni需要的db用户:

ltsql -p 55432 -U lightdb
create user replicator with replication encrypted password 'rep-pass';
create user rewind_user with encrypted password 'rewind_password';

因为不是patroni初始化(bootstrap)的实例, yml配置文件中的pg_hba不会更新到lt_hba.conf, 需要手动加入,并reload:

# $LTDATA/lt_hba.conf中添加这一行
host replication replicator 10.19.64.0/24 trust
# 重新加载lt_hba.conf
lt_ctl -D cluster_A/lightdb0/ reload

启动patroni, patroni会接管lightdb0.yml配置的实例:

./patroni.py lightdb0.yml

机器B: standby

进入patroni-2.1.3-lightdb目录, 修改lightdb1.yml

# etcd配置段中添加hosts, 去掉原有hosts、host配置
hosts:
- 10.19.64.2:2379
- 10.19.64.34:2379
- 10.19.64.35:2379
# postgresql配置段中
# 修改listen,设置为机器A上实例的监听端口和IP
listen: 127.0.0.1,10.19.64.77:55433
connect_addr: 10.19.64.77:55432
# 修改data_dir, 会从primary复制到配置的dir, 目录有读写权限
data_dir: ....
# authentication与primary保持一致,如:
authentication:
replication:
username: replicator
password: rep-pass
superuser:
username: lightdb
password: lightdb999
rewind: # Has no effect on postgres 10 and lower
username: rewind_user
password: rewind_password

启动patroni, patroni会根据lightdb1.yml配置,从primary做basebackup并成为standby:

./patroni.py lightdb1.yml

验证复制模式

primary:

ltsql -p 55432 -h 10.19.64.76 -U lightdb
create table t1(id int);
insert into t1 values(55432);
select * from t1;
select * from pg_stat_replication;
show synchronous_standby_names;
show synchronous_commit;
\db+

standby:

ltsql -p 55433 -h 10.19.64.77 -U lightdb
create table t1(id int);
insert into t1 values(55433);
select * from t1;
select * from pg_stat_replication;
show synchronous_standby_names;
show synchronous_commit;
\db+

同步复制

patroni默认使用异步复制,可通过restapi在线修改复制模式:
在primary机器上执行:

curl -s localhost:8008/config | jq .
curl -s -XPATCH -d '{"synchronous_mode":true}' http://localhost:8008/config | jq .
postgres=# select usename,client_addr,sync_state from pg_stat_replication ;
usename | client_addr | sync_state
------------+-------------+------------
replicator | 10.19.64.77 | sync
(1 row)

不允许standby提升

修改lightdb1.yml,tags中添加nofailover, 重启:

tags:
# ...
nofailover: true

把primary(机器A,lightdb0)停止后,可在standby看到如下日志I am not allowed to promote

2022-12-01 16:20:18,035 INFO: Lock owner: lightdb0; I am lightdb1
2022-12-01 16:20:18,035 INFO: establishing a new patroni connection to the postgres cluster
2022-12-01 16:20:18,054 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0)
2022-12-01 16:20:25,115 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0)
2022-12-01 16:20:35,114 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0)
server signaled
2022-12-01 16:20:45,423 INFO: following a different leader because I am not allowed to promote

级联复制

使用配置lightdb2.yml, 与lightdb1在同一主机; 可比较patroni目录中lightdb1.yml与lightdb2.yml区别。
lightdb2.yml中tags包含了replicatefrom,让lightdb2这个实例从lightdb1复制,而不是lightdb0

tags:
# ...
replicatefrom: lightdb1

在lightdb0(primary,patroni中叫做leader)可以看到lightdb1正在复制:

[lightdb@localhost ~]$ ltsql -p 55432 -h 10.19.64.76
ltsql (13.8-22.3)
Type "help" for help.
lightdb@lt_test=# select usename, application_name, client_addr, sync_state from pg_stat_replication ;
usename | application_name | client_addr | sync_state
-------------+------------------+-------------+-------------
replicator | lightdb1 | 10.19.64.77 | async
(1 row)

在lightdb1(standby)上可以看到lightdb2正在复制(cascading standby), lightdb2与lightdb都在10.19.64.77:

[lightdb@localhost ~]$ ltsql -p 55433 -h 10.19.64.77
ltsql (13.8-22.3)
Type "help" for help.
lightdb@lt_test=# select usename, application_name, client_addr, sync_state from pg_stat_replication ;
usename | application_name | client_addr | sync_state
-------------+------------------+-------------+-------------
replicator | lightdb2 | 10.19.64.77 | async
(1 row)

多standby

使用上一节的lightdb2.yml,去掉tags中的replicatefrom, 重启。

从lightdb0可以看到lightdb1,lightdb2都在复制:

[lightdb@localhost ~]$ ltsql -p 55432 -h 10.19.64.76
ltsql (13.8-22.3)
Type "help" for help.
lightdb@lt_test=# select usename, application_name, client_addr, sync_state from pg_stat_replication ;
usename | application_name | client_addr | sync_state
-------------+------------------+-------------+-------------
replicator | lightdb1 | 10.19.64.77 | async
replicator | lightdb2 | 10.19.64.77 | async
(2 rows)

把lightdb0停掉,观察lightdb1,lightdb2的日志,可以看到lightdb2成为leader(primary),lightdb1跟随了lightdb2:
lightdb1:

2022-12-01 16:54:07,307 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0)
2022-12-01 16:54:17,815 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0)
2022-12-01 16:54:21,232 INFO: Local timeline=2 lsn=0/C00000A0
server signaled
2022-12-01 16:54:21,249 INFO: following a different leader because I am not allowed to promote
2022-12-01 16:54:21,358 INFO: Lock owner: lightdb2; I am lightdb1
2022-12-01 16:54:21,363 INFO: Local timeline=2 lsn=0/C00000A0
2022-12-01 16:54:21,371 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb2)
2022-12-01 16:54:21,427 INFO: Lock owner: lightdb2; I am lightdb1
2022-12-01 16:54:21,432 INFO: Local timeline=2 lsn=0/C00000A0
2022-12-01 16:54:21,441 INFO: master_timeline=3
2022-12-01 16:54:21,442 INFO: master: history=1 0/6093AB80 no recovery target specified
2 0/C00000A0 no recovery target specified
2022-12-01 16:54:21,449 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb2)
2022-12-01 16:54:31,430 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb2)

lightdb2:

2022-12-01 16:54:07,307 INFO: no action. I am (lightdb2), a secondary, and following a leader (lightdb0)
2022-12-01 16:54:17,823 INFO: no action. I am (lightdb2), a secondary, and following a leader (lightdb0)
2022-12-01 16:54:19,244 WARNING: Request failed to lightdb0: GET http://10.19.64.76:8008/patroni (HTTPConnectionPool(host='10.19.64.76', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
2022-12-01 16:54:19,320 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2022-12-01 16:54:19,325 INFO: promoted self to leader by acquiring session lock
server promoting
2022-12-01 16:54:19,329 INFO: cleared rewind state after becoming the leader
2022-12-01 16:54:21,381 INFO: no action. I am (lightdb2), the leader with the lock
2022-12-01 16:54:21,419 INFO: no action. I am (lightdb2), the leader with the lock

ltcluster集群转patroni

  1. ltcluster service pause
  2. 停止ltclusterd
  3. 停止keepalived
  4. 如果有witness,停止witness实例

以上步骤完成后,集群不会自动切换主备,再按照上文安装部署etcd集群, 安装patroni使用patroni启动lightdb集群章节中步骤配置即可。

集群监控

可使用rest api来获取集群状态:
https://patroni.readthedocs.io/en/latest/rest_api.html#health-check-endpoints
https://patroni.readthedocs.io/en/latest/rest_api.html#monitoring-endpoint

下载链接

TODO

posted on   Gao科技  阅读(94)  评论(0编辑  收藏  举报

相关博文:
阅读排行:
· winform 绘制太阳,地球,月球 运作规律
· 震惊!C++程序真的从main开始吗?99%的程序员都答错了
· AI与.NET技术实操系列(五):向量存储与相似性搜索在 .NET 中的实现
· 超详细:普通电脑也行Windows部署deepseek R1训练数据并当服务器共享给他人
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理

统计

点击右上角即可分享
微信分享提示