KingbaseES V8R3 Cluster Operations Case: Analyzing an OOM Failure on the Primary
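Case description:
In a KingbaseES V8R3 cluster, the primary database ran out of memory (OOM) and produced a core file, and we were asked to analyze the failure. The database server has 64 GB of memory, runs as a Huawei Cloud virtual machine, and has no swap.
Applicable version: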
KingbaseES V8R3
I. Problem Analysis
1. Check the database OOM information in sys_log
PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
PortalHeapMemory: 1024 total in 1 blocks; 968 free (0 chunks); 56 used
Relcache by OID: 24576 total in 2 blocks; 12976 free (4 chunks); 11600 used
CacheMemoryContext: 516096 total in 6 blocks; 159416 free (0 chunks); 356680 used
CachedPlan: 1024 total in 1 blocks; 784 free (0 chunks); 240 used
SYS_EXTENSION_OID_INDEX: 1024 total in 1 blocks; 408 free (0 chunks); 616 used
SYS_EXTENSION_NAME_INDEX: 1024 total in 1 blocks; 408 free (0 chunks); 616 used
SYS_DB_ROLE_SETTING_DATABASEID_ROL_INDEX: 1024 total in 1 blocks; 320 free (0 chunks); 704 used
SYS_OPCLASS_AM_NAME_NSP_INDEX: 1024 total in 1 blocks; 24 free (0 chunks); 1000 used
SYS_FOREIGN_DATA_WRAPPER_NAME_INDEX: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
SYS_SYNONYM_NAME_C_N_INDEX: 1024 total in 1 blocks; 320 free (0 chunks); 704 used
SYS_ENUM_OID_INDEX: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
SYS_CLASS_RELNAME_NSP_INDEX: 1024 total in 1 blocks; 272 free (0 chunks); 752 used
SYS_FOREIGN_SERVER_OID_INDEX: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
SYS_STATISTIC_RELID_ATT_INH_INDEX: 1024 total in 1 blocks; 24 free (0 chunks); 1000 used
SYS_CAST_SOURCE_TARGET_INDEX: 1024 total in 1 blocks; 320 free (0 chunks); 704 used
SYS_PKGVARIABLE_OID_INDEX: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
SYS_LANGUAGE_NAME_INDEX: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
SYS_PACKAGE_OID_INDEX: 1024 total in 1 blocks; 456 free (0 chunks); 568 used
........
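A memory context dump like the one above can run to thousands of lines, so it helps to rank the contexts by size before reading it by eye. The following is a minimal sketch, assuming each line follows the `name: N total in B blocks; F free (C chunks); U used` layout seen in this excerpt. The function names `parse_contexts` and `largest_contexts` are illustrative and are not part of KingbaseES.

```python
import re

# Matches one line of the memory context dump, for example:
# "CacheMemoryContext: 516096 total in 6 blocks; 159416 free (0 chunks); 356680 used"
CONTEXT_LINE = re.compile(
    r"^\s*(?P<name>.+?):\s+(?P<total>\d+) total in (?P<blocks>\d+) blocks; "
    r"(?P<free>\d+) free \((?P<chunks>\d+) chunks\); (?P<used>\d+) used"
)


def parse_contexts(lines):
    """Return (name, total_bytes, used_bytes) for every line that matches."""
    rows = []
    for line in lines:
        m = CONTEXT_LINE.match(line)
        if m:
            rows.append((m.group("name"), int(m.group("total")), int(m.group("used"))))
    return rows


def largest_contexts(lines, top=5):
    """Return the `top` contexts with the largest total size, biggest first."""
    return sorted(parse_contexts(lines), key=lambda row: row[1], reverse=True)[:top]


if __name__ == "__main__":
    # Lines copied from the sys_log excerpt above; elision lines are ignored.
    sample = [
        "PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used",
        "Relcache by OID: 24576 total in 2 blocks; 12976 free (4 chunks); 11600 used",
        "CacheMemoryContext: 516096 total in 6 blocks; 159416 free (0 chunks); 356680 used",
        "........",
    ]
    for name, total, used in largest_contexts(sample):
        print(f"{name}: total={total} used={used}")
```

In the excerpt shown here every context is small (the largest, CacheMemoryContext, is about 0.5 MB), which is consistent with the later conclusion that the database itself was not the memory consumer.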
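2. recovery.log on the primary node
As the recovery.log screenshot shows, shared library loading and memory access both failed during recovery on the primary.
3. System message log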
Jul 18 15:00:11 db0001 com.deepin.api.XEventMonitor[20792]: /usr/lib/deepin-daemon/dde-session-daemon: error while loading shared libraries: libgdk-3.so.0: failed to map segment from shared object
Jul 18 15:00:28 db0001 kernel: [56881707.808160] detected fb_set_par error, error code: -16
.........
Jul 18 15:00:30 db0001 com.deepin.dde.lockFront[20792]: 2023-07-18, 15:00:29.938 [Debug ] [ 0] Failed message: "请输入密码"
Jul 18 15:00:30 db0001 com.deepin.daemon.Zone[20792]: /usr/lib/deepin-daemon/dde-session-daemon: error while loading shared libraries: libatk-1.0.so.0: failed to map segment from shared object
Jul 18 15:01:37 db0001 com.deepin.dde.desktop[20792]: QThread::start: Thread creation error: 资源暂时不可用
Jul 18 15:01:37 db0001 com.deepin.dde.desktop[20792]: QThread::start: Thread creation error: 资源暂时不可用
.........
Jul 18 15:02:49 db0001 com.deepin.dde.desktop[20792]: /usr/bin/dde-desktop: error while loading shared libraries: libXau.so.6: failed to map segment from shared object
Jul 18 15:02:59 db0001 com.deepin.dde.desktop[20792]: /usr/bin/dde-desktop: error while loading shared libraries: libQt5XdgIconLoader.so.3: failed to map segment from shared object
........
Jul 18 15:03:09 db0001 com.deepin.dde.desktop[20792]: /usr/bin/dde-desktop: error while loading shared libraries: libcom_err.so.2: failed to map segment from shared object
Jul 18 15:03:21 db0001 com.deepin.dde.desktop[20792]: out of memory
Jul 18 15:03:31 db0001 com.deepin.dde.desktop[20792]: (process:5739): GLib-ERROR (recursed) **: ../../../glib/gmem.c:135: failed to allocate 16368 bytes
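As the screenshot of the system log shows, system processes were also reporting OOM errors.
4. Kernel stack trace for the kingbase process in the message log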
Jul 18 15:48:45 db0001 kernel: [56884604.361326] CPU: 24 PID: 27697 Comm: kingbase Tainted: G D 4.19.0-arm64-server #3017
Jul 18 15:48:45 db0001 kernel: [56884604.362620] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
Jul 18 15:48:45 db0001 kernel: [56884604.363585] pstate: 20400005 (nzCv daif +PAN -UAO)
Jul 18 15:48:45 db0001 kernel: [56884604.364214] pc : do_last+0x44/0x848
Jul 18 15:48:45 db0001 kernel: [56884604.364620] lr : path_openat+0x60/0x238
Jul 18 15:48:45 db0001 kernel: [56884604.365007] sp : ffff80013f5f7bf0
Jul 18 15:48:45 db0001 kernel: [56884604.365435] x29: ffff80013f5f7bf0 x28: ffff80009e9fc780
Jul 18 15:48:45 db0001 kernel: [56884604.366086] x27: ffff80013f5f7e4c x26: 0000000000000000
Jul 18 15:48:45 db0001 kernel: [56884604.366683] x25: 0000000056000000 x24: 0000000000000200
Jul 18 15:48:45 db0001 kernel: [56884604.367419] x23: ffff8000ef81c280 x22: 0000000000000002
Jul 18 15:48:45 db0001 kernel: [56884604.368032] x21: ffff800f15d3fd00 x20: 0000000000020241
Jul 18 15:48:45 db0001 kernel: [56884604.368612] x19: ffff80013f5f7d28 x18: 0000000000000000
Jul 18 15:48:45 db0001 kernel: [56884604.369224] x17: 0000000000000000 x16: 0000000000000000
Jul 18 15:48:45 db0001 kernel: [56884604.369979] x15: 0000000000000000 x14: 0000000000000000
Jul 18 15:48:45 db0001 kernel: [56884604.370787] x13: 0000000000000000 x12: 0000000000000000
Jul 18 15:48:45 db0001 kernel: [56884604.371731] x11: 0000000000000000 x10: d0d0a0d0a0d0a0bd
Jul 18 15:48:45 db0001 kernel: [56884604.373494] x9 : 72980288f3b72329 x8 : c1647d4c29ee3c8e
Jul 18 15:48:45 db0001 kernel: [56884604.374038] x7 : 2df9567f81f8954b x6 : b3fc0fc4aa596fca
Jul 18 15:48:45 db0001 kernel: [56884604.374571] x5 : 000000000000000a x4 : feff0eff0eff0f00
Jul 18 15:48:45 db0001 kernel: [56884604.375104] x3 : 0000000000000000 x2 : 0000000000000000
Jul 18 15:48:45 db0001 kernel: [56884604.375661] x1 : 0000000000000051 x0 : ffff80013f5f7d28
Jul 18 15:48:45 db0001 kernel: [56884604.376760] Call trace:
Jul 18 15:48:45 db0001 kernel: [56884604.377024] do_last+0x44/0x848
Jul 18 15:48:45 db0001 kernel: [56884604.377365] path_openat+0x60/0x238
Jul 18 15:48:45 db0001 kernel: [56884604.378283] do_filp_open+0x60/0xc0
Jul 18 15:48:45 db0001 kernel: [56884604.378654] do_sys_open+0x164/0x1f0
Jul 18 15:48:45 db0001 kernel: [56884604.379036] __arm64_sys_openat+0x20/0x28
Jul 18 15:48:45 db0001 kernel: [56884604.379444] el0_svc_common+0x90/0x160
Jul 18 15:48:45 db0001 kernel: [56884604.379830] el0_svc_handler+0x9c/0xa8
Jul 18 15:48:45 db0001 kernel: [56884604.380215] el0_svc+0x8/0xc
Jul 18 15:48:45 db0001 kernel: [56884604.381133] ---[ end trace e4652b3ad8a636a3 ]---
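II. Resolution
The system message log above shows that the database service hit a stack error at around 15:48, which is when the database OOM failure occurred. The same log also shows that at around 15:00 other processes were already failing to load shared libraries and running out of memory. The database OOM was therefore most likely caused by memory pressure across the whole system, not by a problem in the database itself.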
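The timeline argument above can be checked mechanically by finding, for each log source, the first line that looks like a memory-pressure symptom. The sketch below is an illustration only: the `SYMPTOMS` list is an assumption derived from the messages in this excerpt, the year 2023 is taken from the timestamp inside the desktop log line, and `%b` parsing assumes an English locale.

```python
import re
from datetime import datetime

# Substrings treated as memory-pressure symptoms (assumed, based on this excerpt).
SYMPTOMS = (
    "failed to map segment",
    "out of memory",
    "failed to allocate",
    "Thread creation error",
)

# Classic syslog layout: "Jul 18 15:00:11 host source[pid]: message"
SYSLOG_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<source>[^:\[]+)(?:\[\d+\])?: (?P<msg>.*)$"
)


def parse_line(line, year):
    """Return (timestamp, source, message), or None if the line does not match."""
    m = SYSLOG_LINE.match(line)
    if not m:
        return None
    # Syslog stamps carry no year, so the caller supplies one.
    when = datetime.strptime(f"{year} {m.group('stamp')}", "%Y %b %d %H:%M:%S")
    return when, m.group("source"), m.group("msg")


def first_symptoms(lines, year):
    """Map each log source to the time of its earliest symptom line."""
    earliest = {}
    for line in lines:
        parsed = parse_line(line, year)
        if parsed is None:
            continue
        when, source, msg = parsed
        if any(s in msg for s in SYMPTOMS):
            if source not in earliest or when < earliest[source]:
                earliest[source] = when
    return earliest


if __name__ == "__main__":
    # Lines copied verbatim from the message log in this case.
    log = [
        "Jul 18 15:00:11 db0001 com.deepin.api.XEventMonitor[20792]: /usr/lib/deepin-daemon/dde-session-daemon: error while loading shared libraries: libgdk-3.so.0: failed to map segment from shared object",
        "Jul 18 15:03:21 db0001 com.deepin.dde.desktop[20792]: out of memory",
        "Jul 18 15:48:45 db0001 kernel: [56884604.361326] CPU: 24 PID: 27697 Comm: kingbase Tainted: G D 4.19.0-arm64-server #3017",
    ]
    symptoms = first_symptoms(log, year=2023)
    crash_time = parse_line(log[-1], year=2023)[0]
    for source, when in sorted(symptoms.items(), key=lambda kv: kv[1]):
        print(f"{when:%H:%M:%S} {source} (before kingbase trace by {crash_time - when})")
```

Run against the full message file, a listing like this makes it easy to see whether non-database processes reported memory trouble before the kingbase trace appeared.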
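The system administrators found that the antivirus software was using more than ten GB of memory when the shortage occurred. Restarting the antivirus software brought memory usage back down, and the system remains under observation.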
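A check like the one the administrators performed can be approximated by ranking processes by resident memory. The sketch below is one possible way to do it on Linux, reading the `Name` and `VmRSS` fields from `/proc/<pid>/status`. The helper names `parse_status` and `top_rss` are illustrative, and this is not necessarily the method used in this case.

```python
import os


def parse_status(text):
    """Extract (process_name, rss_kb) from the text of /proc/<pid>/status."""
    name, rss_kb = "?", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            # Format is "VmRSS:    123456 kB"; kernel threads have no such line.
            rss_kb = int(line.split()[1])
    return name, rss_kb


def top_rss(proc_root="/proc", top=10):
    """Return up to `top` tuples (rss_kb, pid, name), largest resident set first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                name, rss_kb = parse_status(f.read())
        except OSError:
            continue  # process exited or is not readable
        rows.append((rss_kb, int(entry), name))
    rows.sort(reverse=True)
    return rows[:top]


if __name__ == "__main__":
    for rss_kb, pid, name in top_rss():
        print(f"{rss_kb / 1024 / 1024:8.2f} GiB  pid={pid}  {name}")
```

Running this periodically (for example from cron) and keeping the output gives a record of which process was growing before a shortage, which is useful on a host without swap where memory exhaustion arrives abruptly.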
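III. Summary
When troubleshooting a database failure, analyze not only the database's own logs but also the state of the whole host around the time of the failure, so that the root cause can be found.
KINGBASE Research Institute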