A Logback Configuration Issue That Made a Production Service Unresponsive

Scenario

One day, an alert arrived in the DingTalk Alibaba Cloud ops group: network-anomaly alarms had fired on some nodes.

Checking our team's microservices, we found that one node had gone into a zombie state: the process was still alive and CPU, memory, and other resources looked normal,
but none of the REST APIs the service exposed would respond; requests routed through the gateway simply hung until they timed out.

Analysis

First, dump the thread stacks to a file with jstack:

jps -lmv
jstack PID > jstack.txt
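A quick way to spot the pattern in a large dump is to count threads per state. A minimal sketch (the heredoc simulates a tiny excerpt of a dump; in practice the pipeline runs against the real jstack.txt produced above):

```shell
# Simulated excerpt of a jstack dump (the real one came from `jstack PID > jstack.txt`)
cat > jstack.txt <<'EOF'
"http-nio-8088-exec-1" daemon prio=5
   java.lang.Thread.State: WAITING (parking)
"http-nio-8088-exec-2" daemon prio=5
   java.lang.Thread.State: WAITING (parking)
"AsyncAppender-Worker-ASYNC_FILE" daemon prio=5
   java.lang.Thread.State: TIMED_WAITING (parking)
EOF

# Count threads per state -- makes it obvious when WAITING dominates
grep 'java.lang.Thread.State:' jstack.txt | awk '{print $2}' | sort | uniq -c | sort -rn
```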

The file contained a large number of threads in WAITING state, all parked at the same logback logging statement:

"http-nio-8088-exec-500" #157863 daemon prio=5 os_prio=0 tid=0x00007f3a9048b800 nid=0x149 waiting on condition [0x00007f3a4b4b1000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006c7b31b08> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
        at ch.qos.logback.core.AsyncAppenderBase.putUninterruptibly(AsyncAppenderBase.java:181)
        at ch.qos.logback.core.AsyncAppenderBase.put(AsyncAppenderBase.java:172)
        at ch.qos.logback.core.AsyncAppenderBase.append(AsyncAppenderBase.java:161)
        at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
        at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
        at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270)
        at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257)
        at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421)
        at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
        at ch.qos.logback.classic.Logger.info(Logger.java:591)
        ... 

There were three AsyncAppender worker threads:

"AsyncAppender-Worker-ASYNC_ALI_LOG" #120629 daemon prio=5 os_prio=0 tid=0x00007f3abc07c000 nid=0x4b7c waiting on condition [0x00007f3a677d0000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006c651bde8> (a java.util.concurrent.Semaphore$NonfairSync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:582)
        at com.aliyun.openservices.aliyun.log.producer.internals.LogAccumulator.doAppend(LogAccumulator.java:108)
        at com.aliyun.openservices.aliyun.log.producer.internals.LogAccumulator.append(LogAccumulator.java:79)
        at com.aliyun.openservices.aliyun.log.producer.LogProducer.send(LogProducer.java:356)
        at com.aliyun.openservices.aliyun.log.producer.LogProducer.send(LogProducer.java:257)
        at com.aliyun.openservices.log.logback.LoghubAppender.appendEvent(LoghubAppender.java:171)
        at com.aliyun.openservices.log.logback.LoghubAppender.append(LoghubAppender.java:117)
        at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
        at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
        at ch.qos.logback.core.AsyncAppenderBase$Worker.run(AsyncAppenderBase.java:290)

"AsyncAppender-Worker-ASYNC_FILE" #120625 daemon prio=5 os_prio=0 tid=0x00007f3abc07e000 nid=0x4b78 waiting on condition [0x00007f3a674cd000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <cc> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
        at ch.qos.logback.core.AsyncAppenderBase$Worker.run(AsyncAppenderBase.java:289)

"AsyncAppender-Worker-ASYNC_STDOUT" #120624 daemon prio=5 os_prio=0 tid=0x00007f3abc020800 nid=0x4b77 waiting on condition [0x00007f3a8a9f6000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006c640f4a0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:403)
        at ch.qos.logback.core.AsyncAppenderBase$Worker.run(AsyncAppenderBase.java:289)

Checking the project's logging configuration file logback-spring.xml: the prod profile configures three appenders:

<springProfile name="prod">
    <root level="info" additivity="false">
        <appender-ref ref="ASYNC_CONSOLE"/>
        <appender-ref ref="ASYNC_FILE"/>
        <appender-ref ref="ASYNC_ALI_LOG"/>
    </root>
</springProfile>

They write logs to the console, to a file, and to Alibaba Cloud Log Service, respectively.

Console output is rarely needed in production, so the ASYNC_CONSOLE appender can be removed.
Next, check the configuration of ASYNC_FILE and ASYNC_ALI_LOG:

<appender name="ASYNC_FILE" class="ch.qos.logback.classic.AsyncAppender">
    <discardingThreshold>0</discardingThreshold>
    <queueSize>256</queueSize>
    <appender-ref ref="FILE"/>
</appender>
<appender name="ASYNC_ALI_LOG" class="ch.qos.logback.classic.AsyncAppender">
    <discardingThreshold>0</discardingThreshold>
    <queueSize>256</queueSize>
    <appender-ref ref="aliyunLoghubAppender"/>
</appender>

Both appenders use logback's AsyncAppender, where:

  • discardingThreshold is set to 0, meaning no events are ever discarded (the default is 1/5 of the queue size);
  • queueSize is the capacity of the blocking queue, set to 256, which is also the default.

The project uses logback-core and logback-classic version 1.2.3, and aliyun-log-logback-appender version 0.1.15.

Since the default queue size of 256 is fairly small and no events are ever discarded, the hypothesis was that the queue had filled up, so every logging call blocked while trying to enqueue its event.
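This failure mode is easy to reproduce in isolation. A minimal sketch (plain java.util.concurrent, independent of logback): once an ArrayBlockingQueue is full, put() parks the calling thread indefinitely, while the non-blocking offer() returns false immediately:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

public class FullQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
        queue.put("event-1");
        queue.put("event-2");                        // queue is now full

        // Non-blocking insert fails immediately instead of hanging
        boolean accepted = queue.offer("event-3");
        System.out.println("offer accepted: " + accepted);  // prints: offer accepted: false

        // A timed offer waits briefly, then gives up. A plain put() here would
        // park forever -- which is exactly what the http-nio worker threads were doing.
        boolean acceptedTimed = queue.offer("event-3", 100, TimeUnit.MILLISECONDS);
        System.out.println("timed offer accepted: " + acceptedTimed);
    }
}
```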

Reading Logger.info() and stepping down through its internals, the call chain ends in the append method of the parent class AsyncAppenderBase,
shown here:

@Override
protected void append(E eventObject) {
    if (isQueueBelowDiscardingThreshold() && isDiscardable(eventObject)) {
        return;
    }
    preprocess(eventObject);
    put(eventObject);
}

Here isQueueBelowDiscardingThreshold returns true once the queue's remaining capacity drops below the discardingThreshold, at which point discardable events (isDiscardable accepts events of level INFO and below) are silently dropped:

private boolean isQueueBelowDiscardingThreshold() {
    return (blockingQueue.remainingCapacity() < discardingThreshold);
}
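To make the arithmetic concrete, here is a sketch that mirrors the check above (not logback code): with queueSize=256 the default threshold is 256/5 = 51, so discardable events start being dropped once fewer than 51 slots remain; with discardingThreshold=0 the condition remainingCapacity < 0 can never hold, so nothing is ever dropped and the queue can fill completely:

```java
public class DiscardThresholdDemo {
    // Mirrors AsyncAppenderBase.isQueueBelowDiscardingThreshold()
    static boolean belowThreshold(int remainingCapacity, int discardingThreshold) {
        return remainingCapacity < discardingThreshold;
    }

    public static void main(String[] args) {
        int queueSize = 256;
        int defaultThreshold = queueSize / 5;                      // 51, logback's default

        System.out.println(belowThreshold(50, defaultThreshold));  // true  -> discardable events dropped
        System.out.println(belowThreshold(51, defaultThreshold));  // false -> still enqueued

        // With discardingThreshold=0, remainingCapacity < 0 is impossible:
        System.out.println(belowThreshold(0, 0));                  // false -> queue full, put() blocks
    }
}
```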

Ultimately this calls BlockingQueue.put, which blocks until space becomes available when the queue is full. (Newer logback 1.2.x releases also offer a neverBlock option on AsyncAppender, which uses the non-blocking offer instead and drops the event when the queue is full.)

The logging configuration was optimized as follows:

  1. Remove the console appender (ASYNC_CONSOLE)
  2. Increase queueSize from 256 to 2048
  3. Change discardingThreshold from 0 to 256, i.e. events become discardable once the queue's remaining capacity drops to 1/8 of its size
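Applied to the two async appenders, the adjusted configuration might look like this (a sketch; appender names as in the original file):

```xml
<appender name="ASYNC_FILE" class="ch.qos.logback.classic.AsyncAppender">
    <!-- drop discardable events once fewer than 256 slots (1/8 of 2048) remain -->
    <discardingThreshold>256</discardingThreshold>
    <queueSize>2048</queueSize>
    <appender-ref ref="FILE"/>
</appender>
<appender name="ASYNC_ALI_LOG" class="ch.qos.logback.classic.AsyncAppender">
    <discardingThreshold>256</discardingThreshold>
    <queueSize>2048</queueSize>
    <appender-ref ref="aliyunLoghubAppender"/>
</appender>
```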

After restarting the hung node, we verified that the service's REST APIs responded normally again.

Summary

  • Remove the console appender when it is not needed; in production, logs typically go to files, ELK, or Alibaba Cloud LogStore
  • logback's AsyncAppender uses a blocking queue for asynchronous processing; understand the meaning and proper configuration of key parameters such as queueSize and discardingThreshold
  • Check whether the versions of logback-core, logback-classic, and aliyun-log-logback-appender in use have known issues; if so, consider upgrading to a newer release

posted @ 2024-04-07 22:01  cdfive