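Exadata compute node: free system memory keeps shrinking
1. Problem overview
A colleague responsible for a customer's Exadata project reported that the compute nodes of this Exadata run short of memory roughly every six months, and that the operating system has to be rebooted to relieve the problem. Over the last few days only about 4 GB of memory has been free, and the monitoring system has been raising alerts frequently. The customer plans to schedule a downtime window soon to perform the reboot.
2. Problem analysis
Rebooting the operating system frees the memory and works around the shortage temporarily, but the real fix is to find what is consuming the memory and resolve it for good. While the operating system had not yet been rebooted, I quickly collected some data.
2.1 Memory usage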
[root@ex01db01 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:         257503      242156        4792          10       10554        9036
Swap:         24575         698       23877
[root@ex01db01 ~]#
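The output shows that only a little over 4 GB is free, and available memory is also very low (about 9 GB out of roughly 251 GB). Swap is already in use as well, which indicates that physical memory is under pressure.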
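As a rough cross-check against the "% available" figure in the alerts, the available column can be expressed as a percentage of total memory. The sketch below assumes the procps-ng `free -m` layout shown above, where available is the 7th field on the `Mem:` line; `avail_pct` is a hypothetical helper name, not part of any tool.

```shell
# Print available memory as a percentage of total.
# Input: `free -m` output on stdin (assumes "available" is field 7 of the Mem: line).
avail_pct() {
    awk '/^Mem:/ { printf "%.2f\n", $7 * 100 / $2 }'
}
```

On a live system this would be called as `free -m | avail_pct`. For the figures above (9036 MB available out of 257503 MB) it works out to about 3.5%.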
2.2 HugePages configuration
[root@ex01db01 ~]# cat /proc/meminfo
MemTotal: 263683296 kB
MemFree: 4918632 kB
......
HugePages_Total: 92160
HugePages_Free: 3148
HugePages_Rsvd: 1812
HugePages_Surp: 0
Hugepagesize: 2048 kB
[root@ex01db01 ~]#
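HugePages are configured at 180 GB (92160 pages of 2 MB), of which about 6 GB is still free. So HugePages are not wasting much physical memory.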
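The arithmetic behind those figures can be sketched as a small helper that reads the HugePages counters and converts them to MB. `hugepage_summary` is a hypothetical name. Subtracting `HugePages_Rsvd` from `HugePages_Free` is my addition: reserved pages are still counted as free but are already promised to a mapping, so the difference is the part that is truly unclaimed.

```shell
# Summarize HugePages usage in MB from a meminfo-format file (default: /proc/meminfo).
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free  = $2 }
        /^HugePages_Rsvd:/  { rsvd  = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        END {
            mb = size_kb / 1024   # size of one huge page in MB
            printf "configured_mb=%d free_mb=%d unreserved_free_mb=%d\n",
                   total * mb, free * mb, (free - rsvd) * mb
        }
    ' "${1:-/proc/meminfo}"
}
```

Run as `hugepage_summary` on the node itself, or pass a saved copy of `/proc/meminfo` as the first argument. With the values above, 92160 pages come to 184320 MB configured and 3148 free pages to 6296 MB, of which 2672 MB is not reserved.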
2.3 Sort processes by physical memory usage
[root@ex01db01 ~]# ps auxw|head -1;ps auxw|sort -rn -k4|head -10
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 58981 0.1 3.9 13710496 10526188 ? Sl 2023 698:36 ./gse_agent -f /usr/local/gse/agent/etc/gse_agent.conf
root 200091 11.7 2.5 8135544 6648588 ? SLsl 2022 123917:07 /u01/app/19.0.0.0/grid/bin/osysmond.bin
root 112785 1.8 0.4 2851920 1121160 ? Sl Aug30 75:43 /opt/oracle.ahf/jre/bin/java --add-opens java.base/java.lang=ALL-UNNAMED
root 278445 0.6 0.3 2018196 932160 ? SLsl 2022 6648:29 /u01/app/19.0.0.0/grid/bin/ologgerd -M
oracle 40324 34.4 0.3 18764588 880948 ? Sl Aug05 13793:21 /u01/app/Agent/agent_13.5.0.0.0/oracle_common/jdk/bin/java -Xmx177M
root 353469 0.6 0.2 1876580 721004 ? Sl 2023 5249:27 /opt/ds_agent/ds_agent -w /var/opt/ds_agent -b -i -e /opt/ds_agent/ext
grid 265097 0.0 0.2 5928268 644796 ? Sl 2022 933:52 /u01/app/19.0.0.0/grid/jdk/bin/java -server -Xms128M -Xmx512M -Djava.awt.headless=true
grid 220232 1.2 0.2 2930512 743064 ? Ssl 2022 12716:14 /u01/app/19.0.0.0/grid/bin/oraagent.bin
dbmsvc 47129 1.4 0.2 7041436 688820 ? Sl 2022 15944:18 /usr/java/default//bin/java -Xms512m -Xmx512m -XX:-UseLargePages
oracle 395458 99.1 0.1 34185392 359880 ? Rs Sep01 647:36 oracleCRMDB1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
The output shows that gse_agent is using about 10 GB of physical memory (RSS) and osysmond.bin about 6 GB.
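Sorting on the %MEM column only resolves to one decimal place, so processes with the same percentage come out in arbitrary order (above, the 644796 kB process is listed before the 743064 kB one). A possible alternative is to sort on the RSS column directly. The helper below is a sketch under that assumption; `top_rss` is a hypothetical name, and it expects `ps aux` style lines without the header on stdin.

```shell
# Print the top N processes by RSS (column 6 of `ps aux`, in KiB), largest first.
# Output columns: PID, user, RSS in GiB, command name.
top_rss() {
    sort -k6,6nr | head -n "${1:-10}" |
        awk '{ printf "%s %s %.1f GiB %s\n", $2, $1, $6 / 1048576, $11 }'
}
```

With procps-ng this could be fed as `ps aux --no-headers | top_rss 10`. The built-in `ps aux --sort=-rss | head -11` gives a similar ordering without the unit conversion.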
2.4 Check the alerthistory log
2024-08-12T04:43:00+08:00 warning " Server memory is running low at 3.99% available, Top memory consumers 58981 User 0(root) Memory 8999.61 MB Command ./gse_agent -f /usr/local/
2024-08-18T22:08:09+08:00 warning " Server memory is running low at 4.00% available, Top memory consumers 58981 User 0(root) Memory 9404.48 MB Command ./gse_agent -f /usr/local/
2024-08-22T02:07:13+08:00 warning " Server memory is running low at 4.01% available, Top memory consumers 58981 User 0(root) Memory 9594.95 MB Command ./gse_agent -f /usr/local/
2024-08-23T02:07:14+08:00 warning " Server memory is running low at 3.93% available, Top memory consumers 58981 User 0(root) Memory 9655.20 MB Command ./gse_agent -f /usr/local/
2024-08-23T22:07:16+08:00 info " Server memory is running low at 3.90% available, Top memory consumers 58981 User 0(root) Memory 9705.18 MB Command ./gse_agent -f /usr/local/
2024-08-25T00:07:17+08:00 info " Server memory is running low at 4.03% available, Top memory consumers 58981 User 0(root) Memory 9770.43 MB Command ./gse_agent -f /usr/local/
2024-08-25T01:09:17+08:00 info " Server memory is running low at 4.03% available, Top memory consumers 58981 User 0(root) Memory 9773.06 MB Command ./gse_agent -f /usr/local/
2024-08-25T07:07:17+08:00 info " Server memory is running low at 4.04% available, Top memory consumers 58981 User 0(root) Memory 9788.05 MB Command ./gse_agent -f /usr/local/
2024-08-25T10:07:18+08:00 info " Server memory is running low at 4.04% available, Top memory consumers 58981 User 0(root) Memory 9795.68 MB Command ./gse_agent -f /usr/local/
2024-08-25T11:06:18+08:00 info " Server memory is running low at 4.04% available, Top memory consumers 58981 User 0(root) Memory 9798.05 MB Command ./gse_agent -f /usr/local/
2024-08-12T04:43:00+08:00 warning " Server memory is running low at 3.99% available, Top memory consumers 200091 Memory 6207.95 MB Command /u01/app/19.0.0.0/grid/bin/osysmond.bin
2024-08-18T22:08:09+08:00 warning " Server memory is running low at 4.00% available, Top memory consumers 200091 Memory 6264.67 MB Command /u01/app/19.0.0.0/grid/bin/osysmond.bin
2024-08-22T02:07:13+08:00 warning " Server memory is running low at 4.01% available, Top memory consumers 200091 Memory 6291.11 MB Command /u01/app/19.0.0.0/grid/bin/osysmond.bin
2024-08-23T02:07:14+08:00 warning " Server memory is running low at 3.93% available, Top memory consumers 200091 Memory 6299.56 MB Command /u01/app/19.0.0.0/grid/bin/osysmond.bin
2024-08-23T22:07:16+08:00 info " Server memory is running low at 3.90% available, Top memory consumers 200091 Memory 6306.55 MB Command /u01/app/19.0.0.0/grid/bin/osysmond.bin
2024-08-25T00:07:17+08:00 info " Server memory is running low at 4.03% available, Top memory consumers 200091 Memory 6315.64 MB Command /u01/app/19.0.0.0/grid/bin/osysmond.bin
2024-08-25T01:09:17+08:00 info " Server memory is running low at 4.03% available, Top memory consumers 200091 Memory 6315.98 MB Command /u01/app/19.0.0.0/grid/bin/osysmond.bin
2024-08-25T07:07:17+08:00 info " Server memory is running low at 4.04% available, Top memory consumers 200091 Memory 6318.09 MB Command /u01/app/19.0.0.0/grid/bin/osysmond.bin
2024-08-25T10:07:18+08:00 info " Server memory is running low at 4.04% available, Top memory consumers 200091 Memory 6319.09 MB Command /u01/app/19.0.0.0/grid/bin/osysmond.bin
2024-08-25T11:06:18+08:00 info " Server memory is running low at 4.04% available, Top memory consumers 200091 Memory 6319.42 MB Command /u01/app/19.0.0.0/grid/bin/osysmond.bin
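The alerthistory log shows that from August 12 to August 25 the physical memory used by the gse_agent process (PID 58981) grew from 8999 MB to 9798 MB, increasing continuously throughout the period. Likewise, the physical memory used by the osysmond.bin process (PID 200091) grew from 6207 MB to 6319 MB, also increasing continuously.
Steady growth like this, with no sign of the memory ever being released, can generally be judged a memory leak.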
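The growth per process can be pulled out of the alert lines mechanically. The following is a minimal sketch that assumes the line format shown above (the PID follows the word `consumers`, the size in MB follows the word `Memory`) and that the lines for each PID are in chronological order; `rss_growth` is a hypothetical helper name.

```shell
# For each PID in alerthistory-style lines, print the first and last reported
# memory figure and the difference. Reads the named files, or stdin if none.
rss_growth() {
    awk '
        {
            pid = ""; mb = ""
            for (i = 1; i < NF; i++) {
                if ($i == "consumers") pid = $(i + 1)   # PID follows "consumers"
                if ($i == "Memory")    mb  = $(i + 1)   # size in MB follows "Memory"
            }
            if (pid == "" || mb == "") next
            if (!(pid in first)) { first[pid] = mb; first_ts[pid] = $1; order[++n] = pid }
            last[pid] = mb; last_ts[pid] = $1
        }
        END {
            for (k = 1; k <= n; k++) {
                p = order[k]
                printf "%s %s..%s %.2f -> %.2f MB (%+.2f)\n",
                       p, first_ts[p], last_ts[p], first[p], last[p], last[p] - first[p]
            }
        }
    ' "$@"
}
```

Applied to the lines above, this reports roughly +798 MB for gse_agent and +111 MB for osysmond.bin over about 13 days.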
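2.5 Resolution
gse_agent is clearly not a process that ships with the system. This finding was reported to the customer so that the software vendor can fix the memory leak in that software.
For the osysmond.bin memory leak, there are many related bugs, and it is not yet possible to confirm which one applies here. The temporary workaround is to restart the crf service (the ora.crf resource), or to disable it.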
Workaround:
Restart the ora.crf resource:
$GI_HOME/bin/crsctl stop res ora.crf -init
$GI_HOME/bin/crsctl start res ora.crf -init
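To check whether a process keeps growing after the restart, its resident memory can be sampled at intervals. The sketch below is one possible way to do that on Linux by reading `VmRSS` from `/proc/<pid>/status`; `rss_kb` and `watch_rss` are hypothetical helper names, and the interval and sample count are arbitrary defaults.

```shell
# Print the resident set size (kB) of a PID, read from /proc/<pid>/status.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Sample a PID's RSS: watch_rss PID [interval_seconds] [count]
# Prints one "timestamp rss_kb" line per sample; stops early if the process exits.
watch_rss() {
    wr_pid=$1
    wr_interval=${2:-3600}
    wr_count=${3:-24}
    wr_i=0
    while [ "$wr_i" -lt "$wr_count" ] && [ -r "/proc/$wr_pid/status" ]; do
        printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$(rss_kb "$wr_pid")"
        wr_i=$((wr_i + 1))
        if [ "$wr_i" -lt "$wr_count" ]; then
            sleep "$wr_interval"
        fi
    done
}
```

For example, `watch_rss "$(pgrep -f osysmond.bin | head -1)" 3600 24` would record one sample per hour for a day. Note that osysmond.bin gets a new PID once ora.crf is restarted, so the PID has to be looked up again afterwards.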