KingbaseES V8R6 Cluster Operations Case Study: Reproducing a libpq.so Dynamic Library Loading Failure

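Case description:
A KingbaseES V8R6 cluster runs with one primary and multiple standbys. On one standby node, running 'repmgr cluster show' to check the cluster status fails with the error "conninfo": invalid connection option "tcp_user_timeout", while the other nodes behave normally. This article reproduces the failure.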

Applicable version:
KingbaseES V8R6

I. Install the libpq packages

1. Install libpq
[root@node202 lib64]# yum install libpq*
.......

Installed:
  libpqxx.x86_64 1:4.0.1-1.el7     
  libpqxx-devel.x86_64 1:4.0.1-1.el7     
  libpqxx-doc.noarch 1:4.0.1-1.el7

2. Check the system libpq dynamic libraries

[root@node202 lib64]# ls -lh /usr/lib64/libpq*
lrwxrwxrwx 1 root root   12 Dec  7 16:52 /usr/lib64/libpq.so -> libpq.so.5.5
lrwxrwxrwx 1 root root   12 Dec  7 16:52 /usr/lib64/libpq.so.5 -> libpq.so.5.5
-rwxr-xr-x 1 root root 193K Jun 24  2022 /usr/lib64/libpq.so.5.5
-rwxr-xr-x 1 root root 360K Jun 24  2014 /usr/lib64/libpqxx-4.0.so
lrwxrwxrwx 1 root root   14 Dec  7 16:52 /usr/lib64/libpqxx.so -> libpqxx-4.0.so

II. Configure system dynamic library loading

1. Configure ld.so.conf

[root@node202 lib64]# cat /etc/ld.so.conf
include ld.so.conf.d/*.conf
/usr/lib
/usr/lib64
#/usr/local/pg15/lib
/lib

[root@node202 lib64]# ldconfig -v |grep libpq

        libpqxx-4.0.so -> libpqxx.so
        libpq.so.5 -> libpq.so.5.5
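
To see which file the dynamic linker cache maps to a given soname, the `ldconfig -p` listing can be filtered. This is a minimal sketch, assuming the usual glibc line format (`name (flags) => path`); the function name `cached_paths_for` is illustrative and not part of any tool.

```shell
#!/bin/sh
# Print the path(s) the linker cache maps to a given soname.
# Reads `ldconfig -p` style lines on stdin: "<name> (<flags>) => <path>".
cached_paths_for() {
    soname=$1
    awk -v want="$soname" '$1 == want && $(NF-1) == "=>" { print $NF }'
}

# Usage:
#   ldconfig -p | cached_paths_for libpq.so.5
```

On the node in this article, the only cached entry for libpq.so.5 would be expected to point at /usr/lib64, since the ld.so.conf shown above lists only system directories.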

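2. Simulate damage to the libpq dynamic library bundled with the database
By default, the cluster tools load the libpq dynamic library that ships with the database.

Simulate the damage by renaming the bundled libpq.so.5: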

[kingbase@node202 lib]$ ls -lh libpq*
-rwxr-xr-x. 1 kingbase kingbase 345K Mar 23  2023 libpq.so
-rwxr-xr-x. 1 kingbase kingbase 345K Mar 23  2023 libpq.so.5
-rwxr-xr-x. 1 kingbase kingbase 345K Mar 23  2023 libpq.so.5.12
-rwxr-xr-x. 1 kingbase kingbase  29K Mar 23  2023 libpqwalreceiver.so

[kingbase@node202 lib]$ mv libpq.so.5 libpq.so.5.bk

[kingbase@node202 lib]$ cd ../bin
[kingbase@node202 bin]$ ldd repmgr
        linux-vdso.so.1 =>  (0x00007ffe7e58b000)
        libpq.so.5 => /usr/lib64/libpq.so.5 (0x00007f7274219000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f7273ffc000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f7273df8000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f7273a2a000)
        libssl.so.10 => /usr/lib64/libssl.so.10 (0x00007f72737b7000)
        libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00007f7273354000)
        libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x00007f727306f000)
        libcom_err.so.2 => /usr/lib64/libcom_err.so.2 (0x00007f7272e6a000)
.......

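As the ldd output above shows, repmgr now resolves libpq.so.5 to the system copy in /usr/lib64 rather than the one bundled with the database.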
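
The same check can be scripted by parsing the `ldd` listing and comparing the resolved path against the directory where the bundled library is expected. This is a minimal sketch; the function names are illustrative, and the install path in the usage comment is a placeholder.

```shell
#!/bin/sh
# Print the file that an ldd listing resolves for a given soname.
# Reads `ldd <binary>` output on stdin: "<soname> => <path> (<addr>)".
# Prints nothing when the soname is absent or reported as "not found".
resolved_path() {
    soname=$1
    awk -v want="$soname" '$1 == want && $2 == "=>" && $3 != "not" { print $3 }'
}

# Succeed only if the resolved path lies under the expected directory.
resolves_under() {
    path=$1
    dir=$2
    case "$path" in
        "$dir"/*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage (the install directory is a placeholder):
#   p=$(ldd ./repmgr | resolved_path libpq.so.5)
#   resolves_under "$p" /path/to/kingbase/lib || echo "libpq loaded from: $p"
```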

III. Access the cluster
As shown below, running 'repmgr cluster show' fails with the 'tcp_user_timeout' error:

[kingbase@node202 bin]$ ./repmgr cluster show
[ERROR] following errors were found in the configuration file:
  "conninfo": invalid connection option "tcp_user_timeout"
        (provided: "host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000")
[DETAIL] configuration file is: "/home/kingbase/cluster/R6/R6HA/kingbase/bin/../etc/repmgr.conf"
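
The option names that libpq accepts are compiled into the library, so one quick heuristic is to search the shared object for the option string. `tcp_user_timeout` was introduced in the PostgreSQL 12 client library, which is consistent with the older system libpq.so.5.5 rejecting it. This is a minimal sketch; the function name is illustrative, and a string match is only a hint, not proof that the option is supported.

```shell
#!/bin/sh
# Heuristic: report whether a library file contains a given option name.
# -a treats the binary as text, -F matches a fixed string, -q is silent.
lib_mentions_option() {
    lib=$1
    option=$2
    grep -aqF -- "$option" "$lib"
}

# Usage (system library path as listed earlier in this article):
#   if lib_mentions_option /usr/lib64/libpq.so.5 tcp_user_timeout; then
#       echo "option string present"
#   else
#       echo "option string absent"
#   fi
```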

Remove the 'tcp_user_timeout' parameter from the libpq connection string in repmgr.conf:

[kingbase@node202 bin]$ cat ../etc/repmgr.conf |grep connect
conninfo='host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'

Run 'repmgr cluster show' again:

[kingbase@node202 bin]$ ./repmgr cluster show
[DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[ERROR] connection to database failed
[DETAIL]
SCRAM authentication requires libpq version 10 or above

[DETAIL] attempted to connect using:
  user=esrep connect_timeout=10 dbname=esrep host=192.168.1.202 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path=

Connecting to the database with ksql also fails:

[kingbase@node202 bin]$ ./ksql -U esrep esrep
ksql: error: could not connect to server: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?

IV. Resolution

1. Make sure the correct libpq dynamic library is loaded
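
Because the failure in this reproduction was induced by renaming the bundled libpq.so.5, the simplest fix is to move it back and confirm the result with `ldd`. This is a minimal sketch, assuming the backup name libpq.so.5.bk used earlier; the function name and the refusal to overwrite an existing file are choices made here, not part of KingbaseES.

```shell
#!/bin/sh
# Restore the bundled libpq.so.5 from the .bk copy created earlier.
# Refuses to overwrite an existing libpq.so.5.
restore_bundled_libpq() {
    libdir=$1
    if [ -e "$libdir/libpq.so.5" ]; then
        echo "libpq.so.5 already present in $libdir" >&2
        return 1
    fi
    if [ ! -e "$libdir/libpq.so.5.bk" ]; then
        echo "no backup libpq.so.5.bk in $libdir" >&2
        return 1
    fi
    mv "$libdir/libpq.so.5.bk" "$libdir/libpq.so.5"
}

# Usage, from the bin directory as in the steps above:
#   restore_bundled_libpq ../lib && ldd ./repmgr | grep libpq
```

After the restore, `ldd` should resolve libpq.so.5 to the database's own lib directory again instead of /usr/lib64.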

2. Access the cluster

[kingbase@node201 bin]$ ./repmgr cluster show

 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string                                                                                                                     
----+-------+---------+-----------+----------+----------+----------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 15       |         | host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10                                                              
 2  | node2 | standby |   running | node1    | default  | 100      | 15       | 0 bytes | host=192.168.1.202 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

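V. Summary
On the surface, this failure looks like a parameter in the cluster connection string that is not recognized, yet removing the parameter still does not make the database reachable. Clients access the database through libpq, and 'tcp_user_timeout' is a libpq connection-string parameter, so the real fix is to check whether the libpq dynamic library is being loaded correctly and whether the database can be reached through it.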

posted @ 天涯客1224