KingbaseES Cluster Access and Connection Series 01 -- libpq Shared Library Loading Failure
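Case description:
In a KingbaseES V8R6 cluster with one primary and multiple standbys, running 'repmgr cluster show' on one standby node fails with the error "conninfo": invalid connection option "tcp_user_timeout". All other nodes work normally. The full error output appears in the symptom section below.
Applicable version: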
KingbaseES V8R6
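Troubleshooting approach:
- Analyze the symptom (the on-site error and related logs).
- Test around the surface failure to uncover the underlying cause. (On the surface it is a parameter error; in fact it is a libpq connection failure.)
- Compare against a healthy node to work out how to fix the failure.
- Test and provide the solution.
I. Symptom
In a cluster with one primary and multiple standbys, running 'repmgr cluster show' on one standby node fails with the error "conninfo": invalid connection option "tcp_user_timeout". The other nodes work normally.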
[kingbase@node202 bin]$ ./repmgr cluster show
[ERROR] following errors were found in the configuration file:
"conninfo": invalid connection option "tcp_user_timeout"
(provided: "host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000")
[DETAIL] configuration file is: "/home/kingbase/cluster/R6/R6HA/kingbase/bin/../etc/repmgr.conf"
II. Analysis
1. Check the 'tcp_user_timeout' parameter
As shown below, 'tcp_user_timeout' is set in the cluster's libpq connection string:
[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep connect
conninfo='host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000'
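The tcp_user_timeout parameter:
Controls the number of milliseconds that transmitted data may remain unacknowledged before the connection is forcibly closed. A value of zero uses the system default. This parameter is ignored for connections made through a Unix-domain socket.
2. Remove tcp_user_timeout from the connection string
As shown below, the 'tcp_user_timeout' parameter has been removed from the connection string: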
[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep connect
conninfo='host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'
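For a quick experiment, the same edit can be tried on a copy of the connection string before touching repmgr.conf. The sketch below is only an illustration: `strip_conninfo_option` is a hypothetical helper name, and it assumes a simple conninfo string whose values contain no spaces or quotes, as in this case.

```shell
# Print a conninfo string with one option removed.
# $1 = option name to drop, $2 = conninfo string (space-separated key=value pairs)
# Assumption: no value contains spaces or quoting.
strip_conninfo_option() {
    printf '%s\n' "$2" \
        | tr ' ' '\n' \
        | grep -v "^$1=" \
        | paste -sd' ' -
}
```

Calling `strip_conninfo_option tcp_user_timeout "$conninfo"` prints the remaining options in their original order, which can then be pasted back into the configuration file.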
Running 'repmgr cluster show' again now fails with a libpq-related error:
[kingbase@node202 bin]$ ./repmgr cluster show
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[ERROR] connection to database failed
[DETAIL]
SCRAM authentication requires libpq version 10 or above
[DETAIL] attempted to connect using:
user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path=
ksql cannot connect to the database either:
[kingbase@node202 bin]$ ./ksql -U esrep esrep
ksql: error: could not connect to server: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
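3. Check the shared libraries loaded by repmgr
The errors above point to the libpq connection layer, and a likely cause is that the wrong libpq shared library is being loaded. The ksql error is a further hint: it falls back to the socket path /var/run/postgresql/.s.PGSQL.5432 with port 5432 rather than the cluster's port 54321, which suggests the client library in use is not the one shipped with KingbaseES. Check which shared libraries repmgr loads.
On the faulty node, libpq is loaded from the operating system's own library directory: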
[kingbase@node202 bin]$ ldd repmgr
linux-vdso.so.1 => (0x00007ffe7e58b000)
libpq.so.5 => /usr/lib64/libpq.so.5 (0x00007f7274219000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f7273ffc000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f7273df8000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f7273a2a000)
libssl.so.10 => /usr/lib64/libssl.so.10 (0x00007f72737b7000)
libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00007f7273354000)
libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x00007f727306f000)
libcom_err.so.2 => /usr/lib64/libcom_err.so.2 (0x00007f7272e6a000)
.......
On a healthy node, repmgr loads the libpq library shipped with the database software:
[kingbase@node202 bin]$ ldd repmgr
linux-vdso.so.1 => (0x00007ffe21b6d000)
libpq.so.5 => /home/kingbase/cluster/R6/R6HA/kingbase/bin/./../lib/libpq.so.5 (0x00007f1fde2cc000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f1fde0b0000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f1fddeac000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f1fddade000)
libssl.so.1.1 => /home/kingbase/cluster/R6/R6HA/kingbase/bin/./../lib/../lib/libssl.so.1.1 (0x00007f1fdd846000)
libcrypto.so.1.1 => /home/kingbase/cluster/R6/R6HA/kingbase/bin/./../lib/../lib/libcrypto.so.1.1 (0x00007f1fdd345000)
/lib64/ld-linux-x86-64.so.2 (0x00007f1fde51a000)
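As shown above, the faulty node loads the operating system's libpq shared library instead of the one shipped with the database, so it cannot access the database through libpq correctly. As a result, 'repmgr cluster show' can neither recognize the parameter in the connection string nor connect to the database. Both errors are consistent with an older libpq: the SCRAM message itself reports a library older than version 10, and in upstream PostgreSQL the tcp_user_timeout connection option was only added to libpq in version 12.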
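The comparison between the two nodes comes down to a single line of ldd output. As a minimal sketch, the check can be scripted so that it is easy to repeat on every node. The function names `resolved_libpq` and `libpq_is_bundled` are hypothetical, and the installation prefix passed in is whatever directory the database software was installed under on that node.

```shell
# Print the path that ldd resolved for libpq.so.5.
# Reads ldd output on stdin; prints nothing if libpq.so.5 is not listed.
resolved_libpq() {
    awk '$1 == "libpq.so.5" && $2 == "=>" { print $3 }'
}

# Succeed only if the resolved libpq path starts with the given prefix.
# Reads ldd output on stdin. $1 = installation prefix of the database software.
libpq_is_bundled() {
    case "$(resolved_libpq)" in
        "$1"/*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

A possible use on a node would be `ldd ./repmgr | libpq_is_bundled /home/kingbase/cluster/R6/R6HA/kingbase || echo "repmgr is not using the bundled libpq"`.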
III. Resolution
1. Check that the node has the shared library shipped with the database
[kingbase@node202 lib]$ pwd
/home/kingbase/cluster/R6/R6HA/kingbase/lib
[kingbase@node202 lib]$ ls -lh libpq.so.5
-rwxr-xr-x. 1 kingbase kingbase 345K Mar 23 2023 libpq.so.5
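Beyond confirming that the file exists, it can help to check that a given libpq build knows the option that was rejected. The sketch below is a heuristic, not an official check: it assumes the library stores its connection option names as plain strings (as upstream libpq does in its option table; for the KingbaseES build this is an assumption), and `libpq_mentions_option` is a hypothetical name.

```shell
# Succeed if the option name appears as a literal string inside the library file.
# $1 = path to a libpq shared library, $2 = connection option name
# -a treats the binary file as text, -F matches the name literally, -q prints nothing.
libpq_mentions_option() {
    grep -a -q -F -- "$2" "$1"
}
```

Running it against both `/usr/lib64/libpq.so.5` and the bundled `libpq.so.5` with `tcp_user_timeout` as the option gives a quick indication of which library could have produced the "invalid connection option" error.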
2. Comment out the LD_LIBRARY_PATH variable for the database user
[kingbase@node202 ~]$ cat .bashrc
# .bashrc
# User specific aliases and functions
#export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib:/usr/lib64:/usr/local/pg15/lib
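The reason this variable matters is that the glibc dynamic linker searches LD_LIBRARY_PATH before a binary's RUNPATH and before the ld.so cache, so any listed directory that contains libpq.so.5 can shadow the bundled copy (here /usr/lib64 was listed). Whether repmgr finds its bundled libraries through RUNPATH is an assumption based on the ldd paths shown earlier. The following sketch, with the hypothetical helper name `dirs_with_libpq`, lists the directories in a search path that would supply a libpq.so.5.

```shell
# Print each directory of a colon-separated search path that contains libpq.so.5.
# $1 = search path; defaults to the current LD_LIBRARY_PATH (empty if unset).
dirs_with_libpq() {
    printf '%s\n' "${1-${LD_LIBRARY_PATH-}}" \
        | tr ':' '\n' \
        | while IFS= read -r dir; do
              if [ -n "$dir" ] && [ -e "$dir/libpq.so.5" ]; then
                  printf '%s\n' "$dir"
              fi
          done
}
```

Note that editing .bashrc only affects new shells; in an already open session the variable stays set until the user logs in again or runs `unset LD_LIBRARY_PATH`.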
3. Check the ld.so.conf file
Configure ld.so.conf so that the database's library directory is listed among the shared library paths:
[root@node202 lib64]# cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
/home/kingbase/cluster/R6/R6HA/kingbase/lib
/usr/lib
/usr/lib64
/usr/local/pg15/lib
/lib
Then run ldconfig to rebuild the linker cache:
[root@node202 lib64]# ldconfig -v
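After the cache is rebuilt, `ldconfig -p` prints its contents, which makes it possible to see where libpq.so.5 is registered. The small filter below is a sketch under the assumption that the output uses the usual `name (flags) => path` layout; `cached_libpq` is a hypothetical name.

```shell
# Print the path(s) the linker cache records for libpq.so.5.
# Reads `ldconfig -p` output on stdin; if several entries exist,
# they are printed in the order the cache lists them.
cached_libpq() {
    awk '$1 == "libpq.so.5" { print $NF }'
}
```

Typical use would be `ldconfig -p | cached_libpq`; if more than one path is printed, the first one is the entry the cache lists first.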
4. Check libpq shared library loading
As shown below, libpq is once again loaded from the database's own library directory:
[kingbase@node202 bin]$ ldd repmgr
linux-vdso.so.1 => (0x00007ffe21b6d000)
libpq.so.5 => /home/kingbase/cluster/R6/R6HA/kingbase/bin/./../lib/libpq.so.5 (0x00007f1fde2cc000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f1fde0b0000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f1fddeac000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f1fddade000)
libssl.so.1.1 => /home/kingbase/cluster/R6/R6HA/kingbase/bin/./../lib/../lib/libssl.so.1.1 (0x00007f1fdd846000)
libcrypto.so.1.1 => /home/kingbase/cluster/R6/R6HA/kingbase/bin/./../lib/../lib/libcrypto.so.1.1 (0x00007f1fdd345000)
/lib64/ld-linux-x86-64.so.2 (0x00007f1fde51a000)
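5. Check the cluster status and connect to the database
As shown below, the cluster status check and the database connection both work:
# Cluster status check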
[kingbase@node202 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 15 | | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10
2 | node2 | standby | running | node1 | default | 100 | 15 | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
# Connect to the database
[kingbase@node202 bin]$ ./ksql -U esrep esrep
ksql (V8.0)
Type "help" for help.
esrep=#
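IV. Summary
In this incident, the surface symptom was that a parameter in the cluster connection string was not recognized, yet removing the parameter still did not restore access to the database. Clients access the database through libpq, and 'tcp_user_timeout' is a libpq connection string parameter, so a fundamental fix has to consider whether the correct libpq shared library is being loaded and whether the database can be reached through it.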