Load Test Troubleshooting and Resolution Retrospective

I.
Symptom: During the load test, interface responses were slow and some requests timed out. Later, one of the containers restarted on its own, and a dump directory was generated after the restart.
Analysis: The guess was that an OOM had occurred and that the container was configured to automatically output dump files on OOM. The dump files obtained included a thread dump, gcutil output, and netstat output.
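As a rough illustration of what such a setup could look like (a hypothetical sketch only; the actual container configuration is not shown in this post), the JVM can be told to dump the heap on OOM and/or to run an external script that collects jstack, jstat and netstat output:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/dump
-XX:OnOutOfMemoryError="/opt/scripts/collect-diagnostics.sh"

Here /data/dump and collect-diagnostics.sh are placeholder names, not values taken from the actual environment.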
Resolution process:
(1)
Starting from the jstack output, searching in turn for the deadlock and blocked keywords turned up the following thread stacks in the BLOCKED state:

"asyncExecutor-1093" #9490 prio=5 os_prio=0 tid=0x00007f12b400f800 nid=0x25c7 waiting for monitor entry [0x00007f0d1aa03000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.checkRollover(RollingFileManager.java:217)
	- waiting to lock <0x000000065804dcc8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
	at org.apache.logging.log4j.core.appender.RollingRandomAccessFileAppender.append(RollingRandomAccessFileAppender.java:207)
	at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
	at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:448)
	at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:433)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:417)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:403)
	at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
	at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2091)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2011)
	at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1884)
	at com.lala.sgs.lalalog.log4j2.Log4j2Logger.error(Log4j2Logger.java:520)
"http-nio-8888-exec-4" #363 daemon prio=5 os_prio=0 tid=0x00007f0f48003000 nid=0x222 waiting for monitor entry [0x00007f11abeb8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.logging.log4j.core.layout.TextEncoderHelper.copyDataToDestination(TextEncoderHelper.java:61)
	- waiting to lock <0x000000065804dcc8> (a org.apache.logging.log4j.core.appender.rolling.RollingRandomAccessFileManager)
	at org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeTextWithCopy(TextEncoderHelper.java:57)
	at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encodeWithThreadLocals(StringBuilderEncoder.java:70)
	at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:63)
	at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:32)
	at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:219)
	at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:57)

This BLOCKED condition no longer appeared after introducing the disruptor library and switching to async loggers; a sketch is shown below. (Reference: https://www.cnblogs.com/waterystone/p/11170540.html)
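A minimal sketch of what enabling Log4j2's all-async loggers involves (it assumes the com.lmax:disruptor jar is on the classpath; the class name and the in-code property placement are illustrative, not the project's actual bootstrap code):

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class AsyncLoggingBootstrap {
    public static void main(String[] args) {
        // Equivalent to passing -DLog4jContextSelector=... on the JVM command line;
        // it must be set before the first LogManager call to take effect.
        System.setProperty("Log4jContextSelector",
                "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");

        Logger log = LogManager.getLogger(AsyncLoggingBootstrap.class);
        // With async loggers, application threads only publish events to the
        // Disruptor ring buffer; a single background thread drives the appender,
        // so callers no longer queue up on the RollingRandomAccessFileManager monitor.
        log.info("logging now goes through the Disruptor ring buffer");
    }
}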
(2) After the logging performance problem was fixed, the load test was run again, and similar performance problems showed up in Kafka sends and the Redis connection pool.

"http-nio-8888-exec-295" #1570 daemon prio=5 os_prio=0 tid=0x00007f0fa40f6000 nid=0x6d7 waiting for monitor entry [0x00007f0d839e1000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at kafka.producer.Producer.send(Producer.scala:72)
	- waiting to lock <0x0000000656965bf8> (a java.lang.Object)
	at kafka.javaapi.producer.Producer.send(Producer.scala:33)
	at com.lala.kafka.api.produce.KafkaProducer.send(KafkaProducer.java:65)
	at com.lala.kafka.api.produce.KafkaProducer.sendBytes(KafkaProducer.java:54)
	at com.lala.kafka.api.produce.ProducerPool.sendBytes(ProducerPool.java:49)
	at com.lala.kafka.api.produce.BaseProducer.sendString(BaseProducer.java:14)

This was resolved by increasing poolSize, the number of producers used to send to Kafka; a sketch of the idea follows.
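The old kafka.javaapi.producer.Producer serializes concurrent send() calls on a single monitor (the lock visible in the trace above), so spreading traffic across several producer instances reduces the time web threads spend BLOCKED. A minimal sketch of that idea (a hypothetical class, not the internal com.lala ProducerPool):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class SimpleProducerPool {
    private final List<Producer<String, byte[]>> producers = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    public SimpleProducerPool(ProducerConfig config, int poolSize) {
        // poolSize producers instead of one shared, lock-contended instance
        for (int i = 0; i < poolSize; i++) {
            producers.add(new Producer<String, byte[]>(config));
        }
    }

    public void sendBytes(String topic, byte[] payload) {
        // round-robin so a producer blocked in send() does not stall every caller
        int idx = Math.floorMod(next.getAndIncrement(), producers.size());
        producers.get(idx).send(new KeyedMessage<String, byte[]>(topic, payload));
    }
}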

"asyncExecutor-253" #1470 prio=5 os_prio=0 tid=0x00007f151000b000 nid=0x672 waiting on condition [0x00007f0d857da000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000659daa5c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:524)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:438)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:361)
	at redis.clients.util.Pool.getResource(Pool.java:49)
	at redis.clients.jedis.JedisSentinelPool.getResource(JedisSentinelPool.java:209)
	at redis.clients.jedis.JedisSentinelPool.getResource(JedisSentinelPool.java:17)
	at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.fetchJedisConnector(JedisConnectionFactory.java:194)
	at org.springframework.data.redis.connection.jedis.JedisConnectionFactory.getConnection(JedisConnectionFactory.java:348)
	at org.springframework.data.redis.core.RedisConnectionUtils.doGetConnection(RedisConnectionUtils.java:129)
	at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:92)
	at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:79)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:194)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:169)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:157)

Looking at the Redis configuration, the pool was running with the defaults; the default values can be found in GenericObjectPoolConfig.
The issue was resolved by increasing maxTotal, maxIdle, and minIdle.
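For reference, the commons-pool2 defaults are maxTotal=8, maxIdle=8, minIdle=0, which is far too small for hundreds of concurrent web threads, so threads end up parked in LinkedBlockingDeque.takeFirst waiting for a free connection. A minimal sketch of raising the limits (the numbers are illustrative, not the values used in this test):

import redis.clients.jedis.JedisPoolConfig;

public class RedisPoolSettings {
    public static JedisPoolConfig buildPoolConfig() {
        // JedisPoolConfig extends GenericObjectPoolConfig, so these map directly
        // onto the pool parameters seen in the stack trace.
        JedisPoolConfig poolConfig = new JedisPoolConfig();
        poolConfig.setMaxTotal(200);       // upper bound on pooled connections
        poolConfig.setMaxIdle(50);         // connections kept idle for reuse
        poolConfig.setMinIdle(10);         // connections kept warm at all times
        poolConfig.setMaxWaitMillis(2000); // fail fast instead of parking indefinitely
        return poolConfig; // hand this to the JedisConnectionFactory / sentinel pool
    }
}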
(3) The load test is still in progress.
(4) Checking GC behavior with gcutil showed that young GCs were quite frequent:

  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT   
 12.42   0.00  96.21  50.59  90.24  86.46   1770  201.800     8    0.950  202.750

The JVM configuration was:

-server -Xmx6g -Xms6g -Xmn256m -XX:PermSize=128m -Xss256k -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:LargePageSizeInBytes=128m -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70

1770 young collections costing 201.8 s in total (roughly 114 ms each) pointed to a young generation that was far too small: only -Xmn256m out of a 6 GB heap. The young generation was therefore enlarged, and the configuration was changed to the following:

-server -Xmx6g -Xms6g -Xmn3072m -Xss512k -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:LargePageSizeInBytes=128m -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintClassHistogram -Xloggc:../logs/gc.log

After re-running the load test, the GC behavior improved:

  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT   
 15.07   0.00  77.15  32.25  90.37  86.32    116   24.551     8    6.216   30.767

How to capture dumps (reposted):
(1) Capturing a thread dump

Using jstack
jstack is a tool shipped with the JDK that dumps the thread stacks of the JVM with the given process ID (PID).

# Print the thread stacks to standard output
jstack PID

# Print the thread stacks to standard output, including lock information
jstack -l PID

# Force-print the thread stacks; if jstack PID does not respond (the JVM process may be hung), add the -F flag
jstack -F PID
Using jcmd
jcmd is a tool shipped with the JDK for sending commands to a JVM process; depending on the command, it can replace, or partly replace, jstack, jmap, and others. Sending the Thread.print command prints the JVM's thread stacks.

# The following command is equivalent to jstack PID
jcmd PID Thread.print

# Equivalent to jstack -l PID
jcmd PID Thread.print -l
Using kill -3
kill sends a signal to the given process. By default it sends the terminate (TERM) signal, i.e. kill PID is equivalent to kill -15 PID or kill -TERM PID. The JVM listens for the QUIT signal (whose value is 3); when it receives this signal, it prints the current thread stacks plus a summary of heap usage, so compared with jstack you additionally get the heap usage summary. jstack, on the other hand, can take the -l flag to print lock information.

kill -3 PID
# or
kill -QUIT PID

(2) Capturing a heap dump

# Dump the JVM heap to the given file; if the heap contains many objects this can take a while. The format sub-option only supports b, i.e. binary format
jmap -dump:format=b,file=FILE_WITH_PATH PID

# If the JVM process does not respond to the command, try adding the -F flag
jmap -F -dump:format=b,file=FILE_WITH_PATH PID

# To dump only live objects in the heap, add the live sub-option; live is not supported together with -F
jmap -dump:live,format=b,file=FILE_WITH_PATH PID

(3) Getting a heap summary

# The -heap option shows heap information for the given JVM process, including the heap parameter values and the size and utilization of the young and old generations
jmap -heap PID

# Likewise, if the JVM process does not respond, try adding the -F flag
jmap -F -heap PID

(4) gcutil and gccapacity

jstat -gcutil PID
jstat -gccapacity PID
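Both subcommands also accept an optional sampling interval and count (standard jstat usage); for example, to print the gcutil summary once a second, ten times:

# sample every 1000 ms, 10 samples
jstat -gcutil PID 1000 10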