A Logback Configuration Issue That Left a Production Service Unresponsive
Scenario

One day, the Alibaba Cloud ops group on DingTalk received a network-anomaly alert for some nodes. Inspecting our team's microservices, we found one node in a "zombie" state: the process was still running, and CPU, memory, and other resources looked normal, but none of the service's REST API endpoints would respond; calls routed through the gateway simply waited until they timed out.
Analysis

First, dump the thread stacks to a file with jstack:
jps -lmv
jstack PID > jstack.txt
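As a quick first pass over the dump, counting thread states shows where threads pile up. A minimal sketch (the sample file content below is fabricated for illustration; in practice you would run the pipeline on the real jstack.txt):

```shell
# Fabricated sample standing in for a real jstack.txt
cat > jstack.txt <<'EOF'
   java.lang.Thread.State: WAITING (parking)
   java.lang.Thread.State: WAITING (parking)
   java.lang.Thread.State: TIMED_WAITING (parking)
EOF

# Tally thread states; a spike in WAITING hints at blocked threads
grep -o 'java.lang.Thread.State: [A-Z_]*' jstack.txt | sort | uniq -c | sort -rn
```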
The dump contained a large number of threads in the WAITING state, all parked on the same logback logging call:
"http-nio-8088-exec-500" #157863 daemon prio=5 os_prio=0 tid=0x00007f3a9048b800 nid=0x149 waiting on condition [0x00007f3a4b4b1000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006c7b31b08> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
at ch.qos.logback.core.AsyncAppenderBase.putUninterruptibly(AsyncAppenderBase.java:181)
at ch.qos.logback.core.AsyncAppenderBase.put(AsyncAppenderBase.java:172)
at ch.qos.logback.core.AsyncAppenderBase.append(AsyncAppenderBase.java:161)
at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270)
at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257)
at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421)
at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
at ch.qos.logback.classic.Logger.info(Logger.java:591)
...
There are also 3 AsyncAppender worker threads:
"AsyncAppender-Worker-ASYNC_ALI_LOG" #120629 daemon prio=5 os_prio=0 tid=0x00007f3abc07c000 nid=0x4b7c waiting on condition [0x00007f3a677d0000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006c651bde8> (a java.util.concurrent.Semaphore$NonfairSync)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:582)
at com.aliyun.openservices.aliyun.log.producer.internals.LogAccumulator.doAppend(LogAccumulator.java:108)
at com.aliyun.openservices.aliyun.log.producer.internals.LogAccumulator.append(LogAccumulator.java:79)
at com.aliyun.openservices.aliyun.log.producer.LogProducer.send(LogProducer.java:356)
at com.aliyun.openservices.aliyun.log.producer.LogProducer.send(LogProducer.java:257)
at com.aliyun.openservices.log.logback.LoghubAppender.appendEvent(LoghubAppender.java:171)
at com.aliyun.openservices.log.logback.LoghubAppender.append(LoghubAppender.java:117)
at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
at ch.qos.logback.core.AsyncAppenderBase$Worker.run(AsyncAppenderBase.java:290)
"AsyncAppender-Worker-ASYNC_FILE" #120625 daemon prio=5 os_prio=0 tid=0x00007f3abc07e000 nid=0x4b78 waiting on condition [0x00007f3a674cd000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <cc> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
at ch.qos.logback.core.AsyncAppenderBase$Worker.run(AsyncAppenderBase.java:289)
"AsyncAppender-Worker-ASYNC_STDOUT" #120624 daemon prio=5 os_prio=0 tid=0x00007f3abc020800 nid=0x4b77 waiting on condition [0x00007f3a8a9f6000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000006c640f4a0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
at ch.qos.logback.core.AsyncAppenderBase$Worker.run(AsyncAppenderBase.java:289)
Note from the traces above that the ASYNC_ALI_LOG worker is itself parked inside the Aliyun producer's LogAccumulator, waiting on a Semaphore, so it cannot drain its queue, while the other two workers are idle, waiting to take from theirs.

Checking the project's logging configuration file, logback-spring.xml, the prod (production) profile configures 3 appenders:
<springProfile name="prod">
<root level="info" additivity="false">
<appender-ref ref="ASYNC_CONSOLE"/>
<appender-ref ref="ASYNC_FILE"/>
<appender-ref ref="ASYNC_ALI_LOG"/>
</root>
</springProfile>
They write logs to the console, to a file, and to Aliyun Log Service, respectively. Console output is generally unnecessary in production, so the ASYNC_CONSOLE appender can be removed.
Next, check the configuration of ASYNC_FILE and ASYNC_ALI_LOG:
<appender name="ASYNC_FILE" class="ch.qos.logback.classic.AsyncAppender">
<discardingThreshold>0</discardingThreshold>
<queueSize>256</queueSize>
<appender-ref ref="FILE"/>
</appender>
<appender name="ASYNC_ALI_LOG" class="ch.qos.logback.classic.AsyncAppender">
<discardingThreshold>0</discardingThreshold>
<queueSize>256</queueSize>
<appender-ref ref="aliyunLoghubAppender"/>
</appender>
Both appenders use logback's AsyncAppender, where:

- discardingThreshold is set to 0, meaning no log events are ever discarded; the default is 1/5 of the queue size
- queueSize is the capacity of the blocking queue, set to 256, which is also the default

The project uses logback-core and logback-classic version 1.2.3 and Aliyun's aliyun-log-logback-appender version 0.1.15.
Since the queue size is only the 256-event default and no events are discarded, the likely cause is that the queues filled up, so threads that tried to log blocked while adding to the queue.
Tracing into Logger.info(), it ultimately calls the append method of the parent class AsyncAppenderBase:
@Override
protected void append(E eventObject) {
    if (isQueueBelowDiscardingThreshold() && isDiscardable(eventObject)) {
        return;
    }
    preprocess(eventObject);
    put(eventObject);
}
Here, isQueueBelowDiscardingThreshold returns true, making events eligible for discarding, only when the queue's remaining capacity drops below the discardingThreshold:
private boolean isQueueBelowDiscardingThreshold() {
    return (blockingQueue.remainingCapacity() < discardingThreshold);
}
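A standalone sketch (class name is my own) of why discardingThreshold=0 disables discarding entirely: remainingCapacity() can never be below zero, so the check above never passes and every event goes to the blocking put:

```java
import java.util.concurrent.ArrayBlockingQueue;

public class DiscardCheck {
    public static void main(String[] args) {
        ArrayBlockingQueue<Object> queue = new ArrayBlockingQueue<>(256);
        for (int i = 0; i < 256; i++) {
            queue.offer(new Object());           // fill the queue completely
        }
        int discardingThreshold = 0;             // the project's setting
        // Mirrors logback's isQueueBelowDiscardingThreshold()
        boolean belowThreshold = queue.remainingCapacity() < discardingThreshold;
        System.out.println(belowThreshold);      // false: events are never discarded
    }
}
```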
Ultimately this calls BlockingQueue's put method, which is blocking: when the queue has no free space, the calling thread waits.
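The blocking behavior behind AsyncAppenderBase.put() can be demonstrated in isolation (class name is my own): ArrayBlockingQueue.put() parks when the queue is full, while offer() with a timeout gives up and returns false.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PutBlocks {
    public static void main(String[] args) throws InterruptedException {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        queue.put("event-1");                     // succeeds; queue is now full

        // queue.put("event-2") here would park this thread indefinitely,
        // just like the http-nio worker threads in the dump above.
        boolean accepted = queue.offer("event-2", 50, TimeUnit.MILLISECONDS);
        System.out.println(accepted);             // false: queue still full
    }
}
```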
We optimized the logging configuration:

- Remove the CONSOLE appender
- Increase queueSize from 256 to 2048
- Change discardingThreshold from 0 to 256, i.e. discard events once the queue's remaining capacity falls to 1/8 of its size
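Applied to ASYNC_FILE, the adjusted configuration would look roughly like this (a sketch following the original appender names; note that logback only ever discards events of level INFO and below, so WARN and ERROR events are still enqueued):

```xml
<appender name="ASYNC_FILE" class="ch.qos.logback.classic.AsyncAppender">
    <!-- discard TRACE/DEBUG/INFO events once fewer than 256 slots remain -->
    <discardingThreshold>256</discardingThreshold>
    <queueSize>2048</queueSize>
    <appender-ref ref="FILE"/>
</appender>
```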
We then restarted the hung node and verified that the service's REST API endpoints responded normally after the restart.
Summary

- Drop the CONSOLE appender when it is not needed; in production, logs usually go to a file, to ELK, or to an Aliyun LogStore
- logback's AsyncAppender performs asynchronous logging through a blocking queue; understand the meaning and configuration of key parameters such as queueSize and discardingThreshold
- For logback-core, logback-classic, and aliyun-log-logback-appender, check whether the version in use has known issues and consider upgrading to a newer release
References

- Troubleshooting an unresponsive service caused by logback: https://blog.csdn.net/a040600145/article/details/115272729
- A process hang caused by logback console output: https://blog.csdn.net/chu1109785217/article/details/123416259
- A logback deadlock hanging the program: https://blog.csdn.net/weixin_42738080/article/details/93242210