cpu 突增问题排查
cpu 突增问题排查
问题背景
临清明节前,收到监控报警,信息如下:
[告警名称:cpu.idle]
[告警指标:cpu.idle]
最近5个点值:[0,2.348,0,0,0]
触发规则:
[过去5个点内,所有点的值 < 10 ]
数据时间:2021-04-01 11:40:00
备注: cpu使用率高
告警时间:2021-04-01 11:40:19
持续时长:just now
查看 cpu 空闲使用率如下:
![cpu.idle cpu.idle](./images/1617524072610.png)
cpu.idle
问题定位
1. 查看占用 CPU 的线程
使用 top -H 查看机器上每个线程对 cpu 的占用情况如下
![cp占用 cp占用](./images/1617533289030.png)
cp占用
2. 打印 java 进程的线程信息
使用 jstack -p pid > stack.txt 导出 java 进程的线程栈
通过 linux 的10进制的线程号计算出 java 的 16 进制线程号,查找对应的 Java 线程,对应的线程栈如下
"serviceAppender" #34 daemon prio=5 os_prio=0 tid=0x00007efc1a700800 nid=0x350 runnable [0x00007efc60f33000] java.lang.Thread.State: RUNNABLE at org.apache.logging.log4j.core.lookup.StrSubstitutor.substitute(StrSubstitutor.java:899) at org.apache.logging.log4j.core.lookup.StrSubstitutor.substitute(StrSubstitutor.java:872) at org.apache.logging.log4j.core.lookup.StrSubstitutor.replace(StrSubstitutor.java:427) at org.apache.logging.log4j.core.pattern.MessagePatternConverter.format(MessagePatternConverter.java:125) at org.apache.logging.log4j.core.pattern.PatternFormatter.format(PatternFormatter.java:38) "AsyncAppender" #24 daemon prio=5 os_prio=0 tid=0x00007efc1a665000 nid=0x346 runnable [0x00007efc6143e000] java.lang.Thread.State: RUNNABLE at org.apache.logging.log4j.core.lookup.StrSubstitutor.substitute(StrSubstitutor.java:1014) at org.apache.logging.log4j.core.lookup.StrSubstitutor.substitute(StrSubstitutor.java:872) at org.apache.logging.log4j.core.lookup.StrSubstitutor.replace(StrSubstitutor.java:427) at org.apache.logging.log4j.core.pattern.MessagePatternConverter.format(MessagePatternConverter.java:125) at org.apache.logging.log4j.core.pattern.PatternFormatter.format(PatternFormatter.java:38) at org.apache.logging.log4j.scribe.layout.LcLayout.toSerializable(LcLayout.java:111) at org.apache.logging.log4j.scribe.layout.LcLayout.toSerializable(LcLayout.java:32) at org.apache.logging.log4j.core.layout.AbstractStringLayout.toByteArray(AbstractStringLayout.java:258) at org.apache.logging.log4j.scribe.appender.ScribeAppender.append(ScribeAppender.java:160) at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156) at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129) at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120) at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84) at org.apache.logging.log4j.core.appender.AsyncAppender$AsyncThread.callAppenders(AsyncAppender.java:451) at org.apache.logging.log4j.core.appender.AsyncAppender$AsyncThread.run(AsyncAppender.java:404)
问题解决
从上面可以看出来是写日志导致,查看最近上线的代码,发现一位小同学在 3.30 上线的代码中加了两行 info 级别的日志。而这两行日志每天执行量在千万级别以上。
将两行代码去掉后重新上线,观察 cpu 的使用情况,恢复正常。
![cpu.idle cpu.idle](./images/1617534374607.png)
cpu.idle