KingbaseES Cluster Start/Stop Series 03 -- repmgrd Cannot Access Shared Memory at Startup

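Case description:
In a production environment running Kylin OS, the repmgrd process failed to start after a KingbaseES V8R6 cluster was brought up. hamgr.log reported 'unable to write to shared memory', and repmgrd terminated abnormally.
The repmgrd process monitors the state of the database service in a cluster. If repmgrd is not running, the standby cannot detect that the database service on the primary has gone down, so no failover is triggered. In a cluster environment, repmgrd must therefore be kept running normally.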

Applicable version:
KingbaseES V8R6

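Troubleshooting approach:

Check the cluster's hamgr.log and collect the relevant error messages.
Following the hints in the log, review the related configuration.
From the failure information, work out the possible causes.
Analyze and test each candidate cause, and reproduce the failure where possible.
Once the failure is reproduced, provide a corresponding fix.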

I. Symptoms
As shown below, repmgrd could not access shared memory after starting, and terminated:

[2024-05-10 17:29:05] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2024-05-10 17:29:05] [INFO] connecting to database "host=192.168.1.201 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000"
[2024-05-10 17:29:05] [DEBUG] connecting to: "user=esrep connect_timeout=10 dbname=esrep host=192.168.1.201 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 tcp_user_timeout=9000 fallback_application_name=repmgr options=-csearch_path="
[2024-05-10 17:29:05] [DEBUG] set_config():
  SET synchronous_commit TO 'local'
[2024-05-10 17:29:05] [DEBUG] expected extension version: 50000; extension version: 50100
[2024-05-10 17:29:05] [DEBUG] get_node_record():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached, n.primary_seen, n.lsn  FROM repmgr.nodes n  WHERE n.node_id = 1
[2024-05-10 17:29:05] [ERROR] unable to write to shared memory
[2024-05-10 17:29:05] [HINT] ensure "shared_preload_libraries" includes "repmgr"
[2024-05-10 17:29:05] [INFO] repmgrd terminating...
.......
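
When hamgr.log is long, the relevant lines can be pulled out by filtering on the bracketed log level. The following is a minimal sketch that assumes the log format shown in the excerpt above. The function name show_repmgrd_errors is hypothetical, and the example path is the one repmgrd reports in the cluster startup output later in this article.

```shell
# Print only the ERROR and HINT lines from a repmgrd log file.
# Assumes each line carries its level in square brackets,
# e.g. "[ERROR]" or "[HINT]", as in the excerpt above.
show_repmgrd_errors() {
    grep -E '\[(ERROR|HINT)\]' "$1"
}

# Example call (run on the node whose repmgrd failed):
# show_repmgrd_errors /home/kingbase/cluster/R6/R6HA/kingbase/log/hamgr.log
```

Filtering on both levels keeps each error next to the hint that repmgrd prints for it, which is what points the analysis at shared_preload_libraries and shared memory.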

II. Analysis

1. Check the cluster startup
As shown below, the cluster startup log shows that the 'repmgrd' process failed to start:

2024-05-10 17:30:28 The primary DB is started.
2024-05-10 17:30:28 begin to start repmgrd on "[192.168.1.201]".
[2024-05-10 17:30:29] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6/R6HA/kingbase/bin/../etc/repmgr.conf"
[2024-05-10 17:30:29] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6/R6HA/kingbase/log/hamgr.log"

2024-05-10 17:30:29 execute to start repmgrd on "[192.168.1.201]" failed.
.......
2024-05-10 17:30:29 execute to start repmgrd on "[192.168.1.202]" failed.
.......

# Check the process status
[kingbase@node201 bin]$ ps -ef |grep repmgrd

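2. Check that the repmgr plugin is loaded
Following the hint in the log, check the repmgr plugin setting in kingbase.conf. When the database starts, it loads the repmgr plugin into memory according to this setting.
As shown below, the repmgr plugin is configured correctly in kingbase.conf.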

shared_preload_libraries = 'repmgr,liboracle_parser, synonym, plsql, force_view, kdb_flashback,plugin_debugger, plsql_plugin_debugger, plsql_plprofiler, ora_commands,kdb_ora_expr, sepapower, dblink, sys_kwr, sys_spacequota, sys_stat_statements, backtrace, kdb_utils_function, auto_bmr, sys_squeeze, src_restrict,sys_prewarm'
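
With a list this long, it is easy to misread whether repmgr is present as a whole entry. The check can be scripted. The following is a minimal sketch: the function name has_preload_library is hypothetical, and it only parses a single configuration line passed to it as text.

```shell
# Succeed (exit status 0) if a library name is a whole element of a
# "shared_preload_libraries = '...'" line. Quotes and whitespace are
# ignored, and anything after a "#" is treated as a comment.
has_preload_library() {
    value="${1#*=}"        # text after the first "="
    value="${value%%#*}"   # drop a trailing comment, if any
    value=$(printf '%s' "$value" | tr -d "'\" \t")
    # Wrap in commas so that only whole list elements match.
    case ",$value," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example call (the path to kingbase.conf is a placeholder):
# line=$(grep -E '^[[:space:]]*shared_preload_libraries' /path/to/kingbase.conf | tail -n 1)
# has_preload_library "$line" repmgr && echo "repmgr is preloaded"
```

Matching on comma-delimited elements avoids a false positive from a library whose name merely contains the string "repmgr".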

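3. Check the shm kernel parameters

repmgrd needs to access shared memory at startup, so check whether the operating system's shm kernel parameters are configured correctly. The shm and sem kernel parameters are shown below: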

[root@node201 ~]# sysctl -p
kernel.sem = 5010 641280 5010 256
kernel.shmall = 18446744073692774399
kernel.shmmax = 18446744073692774399
kernel.shmmni = 4096
fs.file-max = 7672460
fs.aio-max-nr = 1048576
.......

4. Check system memory

[root@node201 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             61           3          41           2          16          53
Swap:             9           0           0

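kernel.shmmax is the maximum size of a single shared memory segment, in bytes. On 64-bit Linux, the largest value to use is physical memory minus 1 byte, and the recommended value is more than half of physical memory.
kernel.shmall is the total amount of shared memory available system-wide, expressed in pages. Both parameters should be adjusted to the actual hardware. The Linux shared memory page size defaults to 4 KB, and the size of every shared memory segment is an integer multiple of the page size.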

As shown above, the host has 64 GB of physical memory, while the shm kernel parameters are set far above that.
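
The comparison between kernel.shmmax and physical memory can be scripted, with one caveat: the value 18446744073692774399 does not fit into the signed 64-bit integers used by shell arithmetic, so a plain numeric test is not reliable. The following is a minimal sketch that compares the numbers as decimal strings instead. The function names uint_gt and mem_total_bytes are hypothetical, and the 64 GiB figure in the example is an assumption based on the memory size stated above.

```shell
# Succeed when decimal string $1 is strictly greater than decimal string $2.
# Both must be non-negative integers without leading zeros.
uint_gt() {
    # A number with more digits is always the larger one.
    if [ "${#1}" -ne "${#2}" ]; then
        [ "${#1}" -gt "${#2}" ]
        return $?
    fi
    # Same length: string order equals numeric order. The "x" prefix
    # forces expr to compare the operands as strings.
    expr "x$1" \> "x$2" >/dev/null
}

# Physical memory in bytes (MemTotal in /proc/meminfo is reported in kB).
mem_total_bytes() {
    awk '/^MemTotal:/ { printf "%.0f\n", $2 * 1024 }' /proc/meminfo
}

# Example with the values from this case (64 GiB assumed):
if uint_gt 18446744073692774399 $((64 * 1024 * 1024 * 1024)); then
    echo "kernel.shmmax exceeds physical memory"
fi

# On a live host the same check would read the real values:
# uint_gt "$(sysctl -n kernel.shmmax)" "$(mem_total_bytes)"
```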

5. Recommended shm configuration

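III. Resolution

1. Set the shm kernel parameters according to the actual physical memory.
2. Run sysctl -p to apply the kernel parameter settings.
3. Restart the cluster; it then returns to normal.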
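
Step 1 can be worked through as simple arithmetic. The following is a sketch only, not an official sizing rule: it assumes kernel.shmmax is set to the upper bound mentioned in the analysis (physical memory minus 1 byte) and that kernel.shmall is the same amount of memory expressed in pages. The function name shm_settings is hypothetical.

```shell
# Print candidate kernel.shmmax / kernel.shmall lines.
#   $1: physical memory in bytes
#   $2: page size in bytes (on a live host: getconf PAGE_SIZE)
shm_settings() {
    echo "kernel.shmmax = $(($1 - 1))"
    echo "kernel.shmall = $(($1 / $2))"
}

# Worked example for a 64 GiB host with 4 KiB pages:
shm_settings $((64 * 1024 * 1024 * 1024)) 4096
# kernel.shmmax = 68719476735
# kernel.shmall = 16777216
```

The resulting lines would replace the existing shm entries in the sysctl configuration file before sysctl -p is run in step 2. A smaller kernel.shmmax, such as a value just above half of physical memory, also falls within the range described earlier.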

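IV. Summary
Follow-up discussion established that this problem appeared after the system administrators ran a system optimization script. The script changed the shm kernel parameters; with the abnormal settings in place, repmgrd could not access shared memory at startup and terminated. After any system tuning is applied, check the key system parameters promptly so that the database is not affected.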

posted @ 2024-03-28 15:30 KINGBASE研究院