使用Patroni管理LightDB高可用
使用Patroni管理LightDB高可用
测试环境
CPU: 海光x86
OS: Kylin Advanced Server V10 SP1
LightDB: 13.8-22.3
Patroni: 2.1.3
etcd: 3.5.4
安装部署etcd集群
需要3台机器。
centos/RHEL等可以从epel获取etcd。
麒麟ky10,ky10sp1没有etcd包,可以使用lightdb预编译的etcd-3.5.4。
解压并设置PATH:
cd $HOME tar xzf etcd-3.5.4-ky10.hygon.tar.gz export PATH=$HOME/etcd/bin:$PATH which etcd etcd --version
启动集群,
TOKEN=token-01 CLUSTER_STATE=new NAME_1=machine-2 NAME_2=machine-34 NAME_3=machine-35 HOST_1=10.19.64.2 HOST_2=10.19.64.34 HOST_3=10.19.64.35 CLUSTER=${NAME_1}=http://${HOST_1}:2380,${NAME_2}=http://${HOST_2}:2380,${NAME_3}=http://${HOST_3}:2380 # 在10.19.62.2 (machine-2)启动 THIS_IP=${HOST_1} THIS_NAME=${NAME_1} etcd --data-dir=data.etcd --name ${THIS_NAME} \ --initial-advertise-peer-urls http://${THIS_IP}:2380 --listen-peer-urls http://${THIS_IP}:2380 \ --advertise-client-urls http://${THIS_IP}:2379 --listen-client-urls http://${THIS_IP}:2379 \ --enable-v2=true \ --initial-cluster ${CLUSTER} \ --initial-cluster-state ${CLUSTER_STATE} --initial-cluster-token ${TOKEN} # 在另外两台机器启动,只需要修改THIS_NAME和THIS_IP
安装LightDB
数据库部署在机器A、B,每一个LightDB实例会有对应一个patroni,1对1。
在机器A安装单机版
lightdb-13.8-22.3,安装时选择创建实例
。这个实例会作为primary
。
在机器B安装单机版
lightdb-13.8-22.3,安装时选择不创建实例
。机器B会作为standby
。
安装步骤详见LightDB安装手册。
本文示例中机器A为10.19.64.76, 机器B为10.19.64.77。
安装patroni
在机器A, B安装patroni。
解压,安装依赖:
tar xzf patroni-2.1.3-lightdb.tar.gz cd patroni-2.1.3-lightdb pip3 install --user -r requirements.txt pip3 install --user psycopg sudo yum install python3-psutil
或离线安装依赖:
unzip patroni-requirements-ky10sp1.hygon.zip pip3 install --user *.whl pip3 install --user ydiff-1.2.tar.gz pip3 install --user python-etcd-0.4.5.tar.gz pip3 install --user pysyncobj-0.3.11.tar.gz pip3 install --user psutil-5.7.3.tar.gz pip3 install --user cffi-1.15.1.tar.gz
使用patroni启动lightdb集群
参考patroni文档: https://patroni.readthedocs.io/en/latest/existing_data.html
机器A: primary
进入patroni-2.1.3-lightdb
目录, 根据实际IP、端口修改lightdb0.yml
# etcd配置段中添加hosts, 去掉原有hosts、host配置 hosts: - 10.19.64.2:2379 - 10.19.64.34:2379 - 10.19.64.35:2379 # postgresql配置段 # 修改listen,设置为机器A上实例的监听端口和IP listen: 127.0.0.1,10.19.64.76:55432 connect_addr: 10.19.64.76:55432 # 修改data_dir,同前文安装时一致 data_dir: /home/lightdb/cluster_A/lightdb0 # 修改superuser用户名密码, 同前文安装时保持一致, 如 superuser: username: lightdb password: lightdb999
检查环境变量是否正确配置。
此时如果db实例未启动则启动:lt_ctl start
连接实例,根据ligntdb0.yml
中authentication
创建patroni需要的db用户:
ltsql -p 55432 -U lightdb create user replicator with replication encrypted password 'rep-pass'; create user rewind_user with encrypted password 'rewind_password';
因为不是patroni初始化(bootstrap)的实例, yml配置文件中的pg_hba不会更新到lt_hba.conf, 需要手动加入,并reload:
# $LTDATA/lt_hba.conf中添加这一行 host replication replicator 10.19.64.0/24 trust # 重新加载lt_hba.conf lt_ctl -D cluster_A/lightdb0/ reload
启动patroni, patroni会接管lightdb0.yml
配置的实例:
./patroni.py lightdb0.yml
机器B: standby
进入patroni-2.1.3-lightdb
目录, 修改lightdb1.yml
# etcd配置段中添加hosts, 去掉原有hosts、host配置 hosts: - 10.19.64.2:2379 - 10.19.64.34:2379 - 10.19.64.35:2379 # postgresql配置段中 # 修改listen,设置为机器A上实例的监听端口和IP listen: 127.0.0.1,10.19.64.77:55433 connect_addr: 10.19.64.77:55432 # 修改data_dir, 会从primary复制到配置的dir, 目录有读写权限 data_dir: .... # authentication与primary保持一致,如: authentication: replication: username: replicator password: rep-pass superuser: username: lightdb password: lightdb999 rewind: # Has no effect on postgres 10 and lower username: rewind_user password: rewind_password
启动patroni, patroni会根据lightdb1.yml
配置,从primary做basebackup并成为standby:
./patroni.py lightdb1.yml
验证复制模式
primary:
ltsql -p 55432 -h 10.19.64.76 -U lightdb create table t1(id int); insert into t1 values(55432); select * from t1; select * from pg_stat_replication; show synchronous_standby_names; show synchronous_commit; \db+
standby:
ltsql -p 55433 -h 10.19.64.77 -U lightdb create table t1(id int); insert into t1 values(55433); select * from t1; select * from pg_stat_replication; show synchronous_standby_names; show synchronous_commit; \db+
同步复制
patroni默认使用异步复制,可通过restapi在线修改复制模式:
在primary机器上执行:
curl -s localhost:8008/config | jq . curl -s -XPATCH -d '{"synchronous_mode":true}' http://localhost:8008/config | jq .
postgres=# select usename,client_addr,sync_state from pg_stat_replication ; usename | client_addr | sync_state ------------+-------------+------------ replicator | 10.19.64.77 | sync (1 row)
不允许standby提升
修改lightdb1.yml,tags中添加nofailover, 重启:
tags: # ... nofailover: true
把primary(机器A,lightdb0)停止后,可在standby看到如下日志I am not allowed to promote
:
2022-12-01 16:20:18,035 INFO: Lock owner: lightdb0; I am lightdb1 2022-12-01 16:20:18,035 INFO: establishing a new patroni connection to the postgres cluster 2022-12-01 16:20:18,054 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0) 2022-12-01 16:20:25,115 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0) 2022-12-01 16:20:35,114 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0) server signaled 2022-12-01 16:20:45,423 INFO: following a different leader because I am not allowed to promote
级联复制
使用配置lightdb2.yml, 与lightdb1
在同一主机; 可比较patroni目录中lightdb1.yml与lightdb2.yml区别。
lightdb2.yml中tags包含了replicatefrom,让lightdb2
这个实例从lightdb1
复制,而不是lightdb0
:
tags: # ... replicatefrom: lightdb1
在lightdb0(primary,patroni中叫做leader)可以看到lightdb1正在复制:
[lightdb@localhost ~]$ ltsql -p 55432 -h 10.19.64.76 ltsql (13.8-22.3) Type "help" for help. lightdb@lt_test=# select usename, application_name, client_addr, sync_state from pg_stat_replication ; usename | application_name | client_addr | sync_state -------------+------------------+-------------+------------- replicator | lightdb1 | 10.19.64.77 | async (1 row)
在lightdb1(standby)上可以看到lightdb2正在复制(cascading standby), lightdb2与lightdb都在10.19.64.77:
[lightdb@localhost ~]$ ltsql -p 55433 -h 10.19.64.77 ltsql (13.8-22.3) Type "help" for help. lightdb@lt_test=# select usename, application_name, client_addr, sync_state from pg_stat_replication ; usename | application_name | client_addr | sync_state -------------+------------------+-------------+------------- replicator | lightdb2 | 10.19.64.77 | async (1 row)
多standby
使用上一节的lightdb2.yml,去掉tags中的replicatefrom, 重启。
从lightdb0可以看到lightdb1,lightdb2都在复制:
[lightdb@localhost ~]$ ltsql -p 55432 -h 10.19.64.76 ltsql (13.8-22.3) Type "help" for help. lightdb@lt_test=# select usename, application_name, client_addr, sync_state from pg_stat_replication ; usename | application_name | client_addr | sync_state -------------+------------------+-------------+------------- replicator | lightdb1 | 10.19.64.77 | async replicator | lightdb2 | 10.19.64.77 | async (2 rows)
把lightdb0停掉,观察lightdb1,lightdb2的日志,可以看到lightdb2成为leader(primary),lightdb1跟随了lightdb2:
lightdb1:
2022-12-01 16:54:07,307 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0) 2022-12-01 16:54:17,815 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0) 2022-12-01 16:54:21,232 INFO: Local timeline=2 lsn=0/C00000A0 server signaled 2022-12-01 16:54:21,249 INFO: following a different leader because I am not allowed to promote 2022-12-01 16:54:21,358 INFO: Lock owner: lightdb2; I am lightdb1 2022-12-01 16:54:21,363 INFO: Local timeline=2 lsn=0/C00000A0 2022-12-01 16:54:21,371 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb2) 2022-12-01 16:54:21,427 INFO: Lock owner: lightdb2; I am lightdb1 2022-12-01 16:54:21,432 INFO: Local timeline=2 lsn=0/C00000A0 2022-12-01 16:54:21,441 INFO: master_timeline=3 2022-12-01 16:54:21,442 INFO: master: history=1 0/6093AB80 no recovery target specified 2 0/C00000A0 no recovery target specified 2022-12-01 16:54:21,449 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb2) 2022-12-01 16:54:31,430 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb2)
lightdb2:
2022-12-01 16:54:07,307 INFO: no action. I am (lightdb2), a secondary, and following a leader (lightdb0) 2022-12-01 16:54:17,823 INFO: no action. I am (lightdb2), a secondary, and following a leader (lightdb0) 2022-12-01 16:54:19,244 WARNING: Request failed to lightdb0: GET http://10.19.64.76:8008/patroni (HTTPConnectionPool(host='10.19.64.76', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))) 2022-12-01 16:54:19,320 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'" 2022-12-01 16:54:19,325 INFO: promoted self to leader by acquiring session lock server promoting 2022-12-01 16:54:19,329 INFO: cleared rewind state after becoming the leader 2022-12-01 16:54:21,381 INFO: no action. I am (lightdb2), the leader with the lock 2022-12-01 16:54:21,419 INFO: no action. I am (lightdb2), the leader with the lock
ltcluster集群转patroni
- ltcluster service pause
- 停止ltclusterd
- 停止keepalived
- 如果有witness,停止witness实例
以上步骤完成后,集群不会自动切换主备,再按照上文安装部署etcd集群
, 安装patroni
, 使用patroni启动lightdb集群
章节中步骤配置即可。
集群监控
可使用rest api来获取集群状态:
https://patroni.readthedocs.io/en/latest/rest_api.html#health-check-endpoints
https://patroni.readthedocs.io/en/latest/rest_api.html#monitoring-endpoint
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· winform 绘制太阳,地球,月球 运作规律
· 震惊!C++程序真的从main开始吗?99%的程序员都答错了
· AI与.NET技术实操系列(五):向量存储与相似性搜索在 .NET 中的实现
· 超详细:普通电脑也行Windows部署deepseek R1训练数据并当服务器共享给他人
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理