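KingbaseES V8R6 Cluster Operations Case Study: Reproducing a libpq.so Dynamic Library Loading Failure
Case description:
A KingbaseES V8R6 cluster runs with one primary and multiple standbys. On one standby node, running 'repmgr cluster show' to check the cluster status fails with the error "conninfo": invalid connection option "tcp_user_timeout", while the other nodes work normally. The steps below reproduce the failure.
Applicable version: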
KingbaseES V8R6
I. Install the libpq packages
1. Install libpq
[root@node202 lib64]# yum install libpq*
.......
Installed:
libpqxx.x86_64 1:4.0.1-1.el7
libpqxx-devel.x86_64 1:4.0.1-1.el7
libpqxx-doc.noarch 1:4.0.1-1.el7
2. Check the system libpq dynamic libraries
[root@node202 lib64]# ls -lh /usr/lib64/libpq*
lrwxrwxrwx 1 root root 12 Dec 7 16:52 /usr/lib64/libpq.so -> libpq.so.5.5
lrwxrwxrwx 1 root root 12 Dec 7 16:52 /usr/lib64/libpq.so.5 -> libpq.so.5.5
-rwxr-xr-x 1 root root 193K Jun 24 2022 /usr/lib64/libpq.so.5.5
-rwxr-xr-x 1 root root 360K Jun 24 2014 /usr/lib64/libpqxx-4.0.so
lrwxrwxrwx 1 root root 14 Dec 7 16:52 /usr/lib64/libpqxx.so -> libpqxx-4.0.so
II. Configure system dynamic library loading
1. Configure ld.so.conf
[root@node202 lib64]# cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
/usr/lib
/usr/lib64
#/usr/local/pg15/lib
/lib
[root@node202 lib64]# ldconfig -v |grep libpq
libpqxx-4.0.so -> libpqxx.so
libpq.so.5 -> libpq.so.5.5
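2. Simulate damage to the libpq dynamic library bundled with the database
By default, the cluster tools load the libpq dynamic library bundled with the database.
Simulate damage to the bundled library by renaming libpq.so.5: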
[kingbase@node202 lib]$ ls -lh libpq*
-rwxr-xr-x. 1 kingbase kingbase 345K Mar 23 2023 libpq.so
-rwxr-xr-x. 1 kingbase kingbase 345K Mar 23 2023 libpq.so.5
-rwxr-xr-x. 1 kingbase kingbase 345K Mar 23 2023 libpq.so.5.12
-rwxr-xr-x. 1 kingbase kingbase 29K Mar 23 2023 libpqwalreceiver.so
[kingbase@node202 lib]$ mv libpq.so.5 libpq.so.5.bk
[kingbase@node202 lib]$ cd ../bin
[kingbase@node202 bin]$ ldd repmgr
linux-vdso.so.1 => (0x00007ffe7e58b000)
libpq.so.5 => /usr/lib64/libpq.so.5 (0x00007f7274219000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f7273ffc000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f7273df8000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f7273a2a000)
libssl.so.10 => /usr/lib64/libssl.so.10 (0x00007f72737b7000)
libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00007f7273354000)
libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x00007f727306f000)
libcom_err.so.2 => /usr/lib64/libcom_err.so.2 (0x00007f7272e6a000)
.......
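As the ldd output above shows, repmgr now resolves libpq.so.5 to the system copy under /usr/lib64 instead of the copy bundled with the database.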
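The same check can be scripted. The following is a minimal sketch, not part of the original procedure: it reads ldd output and reports whether libpq.so.5 resolves under the database's own lib directory. The function names are hypothetical, and the directory /home/kingbase/cluster/R6/R6HA/kingbase/lib is an assumption inferred from the repmgr.conf path that appears later in this post.

```shell
# Print the path that libpq.so.5 resolves to, reading `ldd` output on stdin.
libpq_resolved_path() {
    awk '$1 == "libpq.so.5" && $2 == "=>" { print $3; exit }'
}

# Return 0 if the resolved path lies under the expected directory, 1 otherwise.
libpq_is_bundled() {
    local resolved="$1"
    local expected_dir="$2"
    case "$resolved" in
        "$expected_dir"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumed location of the bundled libraries (hypothetical, adjust to your install).
bundled_dir="/home/kingbase/cluster/R6/R6HA/kingbase/lib"

# Sample line copied from the ldd output in this post.
sample='    libpq.so.5 => /usr/lib64/libpq.so.5 (0x00007f7274219000)'
resolved=$(printf '%s\n' "$sample" | libpq_resolved_path)

if libpq_is_bundled "$resolved" "$bundled_dir"; then
    echo "OK: bundled libpq in use ($resolved)"
else
    echo "WARNING: libpq.so.5 resolves to $resolved, outside $bundled_dir"
fi
```

On a live node, the sample line would be replaced by real output, for example: ldd ./repmgr | libpq_resolved_path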
III. Access the cluster
As shown below, running 'repmgr cluster show' now fails with the 'tcp_user_timeout' error:
[kingbase@node202 bin]$ ./repmgr cluster show
[ERROR] following errors were found in the configuration file:
"conninfo": invalid connection option "tcp_user_timeout"
(provided: "host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000")
[DETAIL] configuration file is: "/home/kingbase/cluster/R6/R6HA/kingbase/bin/../etc/repmgr.conf"
Remove the 'tcp_user_timeout' parameter from the libpq connection string in repmgr.conf:
[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep connect
conninfo='host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'
Run 'repmgr cluster show' again:
[kingbase@node202 bin]$ ./repmgr cluster show
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[ERROR] connection to database failed
[DETAIL]
SCRAM authentication requires libpq version 10 or above
[DETAIL] attempted to connect using:
user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path=
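Both errors above are consistent with an older libpq having been loaded. In upstream PostgreSQL, SCRAM authentication requires libpq 10 or above (as the error message itself states) and the tcp_user_timeout connection option was added in libpq 12, while the library file's minor number tracks the release (libpq.so.5.10 for version 10, libpq.so.5.12 for version 12). Assuming KingbaseES follows the same numbering, which the bundled file name libpq.so.5.12 suggests but the original post does not confirm, the sketch below compares a library file name against those two thresholds. The function names are hypothetical.

```shell
# Extract the minor number N from a file name of the form libpq.so.5.N.
libpq_minor() {
    local name="${1##*/}"
    local minor
    case "$name" in
        libpq.so.5.*) minor="${name#libpq.so.5.}" ;;
        *) return 1 ;;
    esac
    # Reject anything that is not a plain number (for example "5.bk").
    case "$minor" in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s\n' "$minor"
}

# Return 0 if the library is new enough for the named feature,
# 1 if it is too old, 2 if the file name or feature is not recognised.
libpq_supports() {
    local feature="$2"
    local minor
    local need
    minor=$(libpq_minor "$1") || return 2
    case "$feature" in
        scram) need=10 ;;             # SCRAM needs libpq 10 or above
        tcp_user_timeout) need=12 ;;  # option added in libpq 12
        *) return 2 ;;
    esac
    [ "$minor" -ge "$need" ]
}

# Compare the system library and the bundled library named in this post.
for lib in /usr/lib64/libpq.so.5.5 libpq.so.5.12; do
    for feature in scram tcp_user_timeout; do
        if libpq_supports "$lib" "$feature"; then
            echo "$lib: $feature supported"
        else
            echo "$lib: $feature not supported"
        fi
    done
done
```

Under these assumptions the system library libpq.so.5.5 fails both checks, which matches the two errors seen above, while the bundled libpq.so.5.12 passes both.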
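Connecting to the database with ksql fails as well. Note that the client tries the socket path /var/run/postgresql and port 5432 rather than the cluster's port 54321: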
[kingbase@node202 bin]$ ./ksql -U esrep esrep
ksql: error: could not connect to server: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
IV. Resolving the problem
1. Make sure the correct libpq dynamic library is loaded
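The original post does not list the commands for this step. Since the damage was simulated by renaming libpq.so.5 to libpq.so.5.bk, a plausible fix is to reverse that rename in the database's lib directory. The following is a minimal sketch under that assumption. The function name restore_libpq is hypothetical.

```shell
# Restore libpq.so.5 from the libpq.so.5.bk backup in the given lib directory.
# Does nothing if libpq.so.5 is already present; fails if no backup exists.
restore_libpq() {
    local lib_dir="$1"
    if [ -e "$lib_dir/libpq.so.5" ]; then
        echo "libpq.so.5 already present in $lib_dir"
        return 0
    fi
    if [ ! -e "$lib_dir/libpq.so.5.bk" ]; then
        echo "no backup libpq.so.5.bk found in $lib_dir" >&2
        return 1
    fi
    mv "$lib_dir/libpq.so.5.bk" "$lib_dir/libpq.so.5"
    echo "restored $lib_dir/libpq.so.5"
}
```

After running restore_libpq against the database's lib directory, 'ldd repmgr' can be used again to confirm that libpq.so.5 no longer resolves to /usr/lib64. If 'tcp_user_timeout' was removed from repmgr.conf during troubleshooting, it can be added back at this point.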
2. Access the cluster
[kingbase@node201 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 15 | | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
2 | node2 | standby | running | node1 | default | 100 | 15 | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
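V. Summary
On the surface, this failure looks like a connection-string parameter that is not recognized, yet removing the parameter still does not restore access to the database. Clients reach the database through libpq, and 'tcp_user_timeout' is a libpq connection-string parameter. The real fix is therefore to check whether the correct libpq dynamic library is being loaded and whether it can connect to the database.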