KingbaseES Cluster Connection Series 01 -- libpq Shared Library Loading Failure

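Case description:
In a KingbaseES V8R6 cluster with one primary and multiple standbys, running 'repmgr cluster show' on one standby node fails with the error "conninfo": invalid connection option "tcp_user_timeout", while the other nodes work normally.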

Applicable version:
KingbaseES V8R6

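Troubleshooting approach:

  1. Analyze the symptoms (the on-site error and related logs).
  2. Test against the surface fault to uncover the underlying cause (on the surface a parameter error, in fact a libpq connection failure).
  3. Compare the faulty node with a healthy one to work out how to fix the fault.
  4. Test and provide the solution.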

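I. Symptoms

In a one-primary, multiple-standby cluster, running 'repmgr cluster show' on one standby node fails with the error "conninfo": invalid connection option "tcp_user_timeout"; the other nodes work normally.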

[kingbase@node202 bin]$ ./repmgr cluster show
[ERROR] following errors were found in the configuration file:
  "conninfo": invalid connection option "tcp_user_timeout"
        (provided: "host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000")
[DETAIL] configuration file is: "/home/kingbase/cluster/R6/R6HA/kingbase/bin/../etc/repmgr.conf"

II. Analysis

1. Check the 'tcp_user_timeout' parameter

As shown below, 'tcp_user_timeout' is set in the cluster's libpq connection string:

[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep connect
conninfo='host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000'

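The tcp_user_timeout parameter:
Controls the number of milliseconds that transmitted data may remain unacknowledged before the connection is forcibly closed. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket.

2. Remove tcp_user_timeout from the connection string

As shown below, 'tcp_user_timeout' has been removed from the connection string: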

[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep connect
conninfo='host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'

Running 'repmgr cluster show' again now fails with a different, libpq-related error:

[kingbase@node202 bin]$ ./repmgr cluster show
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[ERROR] connection to database failed
[DETAIL]
SCRAM authentication requires libpq version 10 or above

[DETAIL] attempted to connect using:
  user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path=

In addition, ksql cannot connect to the database:

[kingbase@node202 bin]$ ./ksql -U esrep esrep
ksql: error: could not connect to server: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?

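3. Check the shared libraries loaded by repmgr
Both errors point to the libpq connection layer, so the likely cause is that the wrong libpq shared library is being loaded. Check which libraries repmgr resolves:

On the faulty node, libpq resolves to the copy shipped with the operating system: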

[kingbase@node202 bin]$ ldd repmgr
        linux-vdso.so.1 =>  (0x00007ffe7e58b000)
        libpq.so.5 => /usr/lib64/libpq.so.5 (0x00007f7274219000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f7273ffc000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f7273df8000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f7273a2a000)
        libssl.so.10 => /usr/lib64/libssl.so.10 (0x00007f72737b7000)
        libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00007f7273354000)
        libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x00007f727306f000)
        libcom_err.so.2 => /usr/lib64/libcom_err.so.2 (0x00007f7272e6a000)
.......

On a healthy node, repmgr loads the libpq shipped with the database software:

[kingbase@node202 bin]$ ldd repmgr
        linux-vdso.so.1 =>  (0x00007ffe21b6d000)
        libpq.so.5 => /home/kingbase/cluster/R6/R6HA/kingbase/bin/./../lib/libpq.so.5 (0x00007f1fde2cc000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f1fde0b0000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f1fddeac000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f1fddade000)
        libssl.so.1.1 => /home/kingbase/cluster/R6/R6HA/kingbase/bin/./../lib/../lib/libssl.so.1.1 (0x00007f1fdd846000)
        libcrypto.so.1.1 => /home/kingbase/cluster/R6/R6HA/kingbase/bin/./../lib/../lib/libcrypto.so.1.1 (0x00007f1fdd345000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f1fde51a000)

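The comparison shows that the faulty node loads the operating system's libpq instead of the one bundled with the database. The message "SCRAM authentication requires libpq version 10 or above" indicates that this system library predates libpq 10, and in upstream PostgreSQL the tcp_user_timeout option is only recognized by libpq 12 and later. As a result, 'repmgr cluster show' can neither recognize the parameter in the connection string nor connect to the database.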
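
The manual comparison of the two ldd listings can be scripted. The following is a minimal sketch, not part of the original procedure: the function names resolved_libpq and libpq_is_bundled are hypothetical, and the install prefix in the usage comment is the one that appears in this case.

```shell
#!/bin/sh
# Read ldd output on stdin and print the path resolved for libpq.so.5.
resolved_libpq() {
    awk '$1 == "libpq.so.5" && $2 == "=>" { print $3 }'
}

# Succeed only when the resolved path lies under the expected install prefix.
# $1 = resolved libpq path, $2 = database install prefix (site-specific).
libpq_is_bundled() {
    case "$1" in
        "$2"/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage, with the paths from this case:
#   path=$(ldd ./repmgr | resolved_libpq)
#   libpq_is_bundled "$path" /home/kingbase/cluster/R6/R6HA/kingbase \
#       || echo "unexpected libpq: $path"
```

Run against the faulty node's listing, the check would report /usr/lib64/libpq.so.5 as unexpected; on the healthy node the path under the kingbase directory passes.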

III. Resolution

1. Check that the node has the libpq library bundled with the database

[kingbase@node202 lib]$ pwd
/home/kingbase/cluster/R6/R6HA/kingbase/lib
[kingbase@node202 lib]$ ls -lh libpq.so.5
-rwxr-xr-x. 1 kingbase kingbase 345K Mar 23  2023 libpq.so.5
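
Beyond confirming that the file exists, one can roughly check whether a given libpq build knows a connection option by searching the library file for the option keyword. This is a heuristic sketch under the assumption that the keyword is stored as a plain string in the library; the function name libpq_knows_option is hypothetical.

```shell
#!/bin/sh
# Succeed when the library file contains the option keyword as a literal string.
# $1 = path to a libpq shared library, $2 = connection option name.
# -a treats the binary as text, -F matches the keyword literally, -q stays silent.
libpq_knows_option() {
    grep -a -q -F -- "$2" "$1"
}

# Usage, with the paths from this case:
#   libpq_knows_option /home/kingbase/cluster/R6/R6HA/kingbase/lib/libpq.so.5 tcp_user_timeout \
#       && echo "bundled libpq knows tcp_user_timeout"
#   libpq_knows_option /usr/lib64/libpq.so.5 tcp_user_timeout \
#       || echo "system libpq does not know tcp_user_timeout"
```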

2. Comment out the LD_LIBRARY_PATH setting of the database user

[kingbase@node202 ~]$ cat .bashrc
# .bashrc
# User specific aliases and functions

#export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib:/usr/lib64:/usr/local/pg15/lib
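
The dynamic linker normally searches the directories in LD_LIBRARY_PATH before it consults the ld.so cache, so a system directory listed there can shadow the bundled library. The sketch below lists every libpq.so.5 found along such a path, in search order; the function name find_libpq_in_path is hypothetical. Note also that editing .bashrc only affects new shells: an existing session keeps the old value until it is restarted or the variable is unset.

```shell
#!/bin/sh
# Print every libpq.so.5 found in a colon-separated search path, in search order.
# $1 = search path (defaults to the current LD_LIBRARY_PATH, if any).
find_libpq_in_path() {
    search_path=${1-${LD_LIBRARY_PATH-}}
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        # Skip empty entries produced by leading, trailing, or doubled colons.
        if [ -n "$dir" ] && [ -e "$dir/libpq.so.5" ]; then
            echo "$dir/libpq.so.5"
        fi
    done
    IFS=$old_ifs
}

# Usage:
#   find_libpq_in_path                      # inspect the current session
#   find_libpq_in_path /usr/lib:/usr/lib64:/usr/local/pg15/lib
```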

3. Check the ld.so.conf file
Configure ld.so.conf so that the database library directory is included in the shared library search path:

[root@node202 lib64]# cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
/home/kingbase/cluster/R6/R6HA/kingbase/lib
/usr/lib
/usr/lib64
/usr/local/pg15/lib
/lib

Then rebuild the linker cache:

[root@node202 lib64]# ldconfig -v 
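
After the cache is rebuilt, `ldconfig -p` prints its contents, which makes it possible to see which copies of libpq.so.5 are registered. The following sketch filters that output; the function name cached_paths_for is hypothetical, and the listing alone is not meant to show which copy a particular binary will load (ldd remains the check for that).

```shell
#!/bin/sh
# Read `ldconfig -p` output on stdin and print the paths registered for a library.
# $1 = library soname, for example libpq.so.5.
cached_paths_for() {
    awk -v lib="$1" '$1 == lib { print $NF }'
}

# Usage:
#   ldconfig -p | cached_paths_for libpq.so.5
```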

4. Check which libpq is loaded
As shown below, repmgr now loads the libpq bundled with the database:

[kingbase@node202 bin]$ ldd repmgr
        linux-vdso.so.1 =>  (0x00007ffe21b6d000)
        libpq.so.5 => /home/kingbase/cluster/R6/R6HA/kingbase/bin/./../lib/libpq.so.5 (0x00007f1fde2cc000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f1fde0b0000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f1fddeac000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f1fddade000)
        libssl.so.1.1 => /home/kingbase/cluster/R6/R6HA/kingbase/bin/./../lib/../lib/libssl.so.1.1 (0x00007f1fdd846000)
        libcrypto.so.1.1 => /home/kingbase/cluster/R6/R6HA/kingbase/bin/./../lib/../lib/libcrypto.so.1.1 (0x00007f1fdd345000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f1fde51a000)

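5. Check the cluster status and connect to the database

As shown below, the cluster status check and the database connection both work:

# Cluster status check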
[kingbase@node202 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                     
----+-------+---------+-----------+----------+----------+----------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 15       |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10                                                              
 2  | node2 | standby |   running | node1    | default  | 100      | 15       | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

# Connect to the database
[kingbase@node202 bin]$ ./ksql -U esrep esrep
ksql (V8.0)
Type "help" for help.

esrep=#

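IV. Summary
On the surface, this fault looked like a connection-string parameter that could not be recognized, yet removing the parameter still left the database unreachable. Clients reach the database through libpq, and 'tcp_user_timeout' is a libpq connection-string parameter, so the real fix is to verify that the correct libpq shared library is loaded and that it can connect to the database.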

posted @ 2024-07-26 11:04  KINGBASE研究院