记一次http连接池连接泄漏排查过程
问题现象:
一个批处理发货系统,前几天商家反馈提交批处理文件后,一直处于处理中状态。
排查的过程中,商家的问题不断的报过来,发现一台机器上所有批处理请求全部失败。我们立刻隔离了有问题的机器开始开始排查原因。
排查过程:
查看处理中的任务日志里面没有任何异常报错,服务器的指标(cpu、内存、io等等)也是平时正常状态。
首先获取服务的线程堆栈信息,指令如下:
1 ps -ef|grep java 获取服务进程 2 jstack -l xxxxx >> trade.stack 获取进程xxxxx的堆栈到trade.stack文件
堆栈文件中发现并无任何死锁,并且cpu也是正常水平。
发现堆栈有大量线程处于java.lang.Thread.State: WAITING (parking)和java.lang.Thread.State: BLOCKED (on object monitor)。
其中WAITING的线程为如下:
1 "ConsumerImplV2-ClientBusiness-trade-Pool-Thread-2" #261 prio=10 os_prio=0 tid=0x00007f5878003800 nid=0x2d8 waiting on condition [0x00007f572e512000] 2 java.lang.Thread.State: WAITING (parking) 3 at sun.misc.Unsafe.park(Native Method) 4 - parking to wait for <0x0000000742b2fc68> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) 5 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 6 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) 7 at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 8 at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) 9 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) 10 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 11 at java.lang.Thread.run(Thread.java:748) 12 13 Locked ownable synchronizers: 14 - None
其中BLOCKED有两种:
1 "ConsumerImplV2-ClientBusiness-trade-Pool-Thread-2" #1223 prio=10 os_prio=0 tid=0x00007f5870003000 nid=0x6f3 waiting for monitor entry [0x00007f5666f36000] 2 java.lang.Thread.State: BLOCKED (on object monitor) 3 at com.google.common.eventbus.Subscriber$SynchronizedSubscriber.invokeSubscriberMethod(Subscriber.java:153) 4 - waiting to lock <0x0000000742485c78> (a com.google.common.eventbus.Subscriber$SynchronizedSubscriber) 5 at com.google.common.eventbus.Subscriber$1.run(Subscriber.java:80) 6 at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456) 7 at com.google.common.eventbus.Subscriber.dispatchEvent(Subscriber.java:76) 8 at com.google.common.eventbus.Dispatcher$PerThreadQueuedDispatcher.dispatch(Dispatcher.java:119) 9 at com.google.common.eventbus.EventBus.post(EventBus.java:215)
1 "ConsumerImplV2-ClientBusiness-trade-Pool-Thread-12" #2225 prio=10 os_prio=0 tid=0x00007f5870011800 nid=0xb66 waiting on condition [0x00007f56625ee000] 2 java.lang.Thread.State: WAITING (parking) 3 at sun.misc.Unsafe.park(Native Method) 4 - parking to wait for <0x000000074305f6a0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) 5 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 6 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) 7 at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:380) 8 at org.apache.http.pool.AbstractConnPool.access$200(AbstractConnPool.java:69) 9 at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:246) 10 - locked <0x000000075675a7e0> (a org.apache.http.pool.AbstractConnPool$2) 11 at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:193) 12 at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:303) 13 at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:279) 14 at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:191) 15 at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) 16 at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) 17 at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) 18 at org.apache.http.impl.client.InternalHttpClient.doExecute$original$W51oFvtD(InternalHttpClient.java:185) 19 at org.apache.http.impl.client.InternalHttpClient.doExecute$original$W51oFvtD$accessor$t4YoGKDC(InternalHttpClient.java)
我们首先关注到了第一个BLOCKED,因为整个系统采用事件驱动的,使用了eventBus,在机器已经隔离的情况,还是有订阅冲突,可能业务上的事件驱动出现了闭环导致某几个任务一直在执行。
然而将整个业务进行梳理过后发现并没有闭环出现。
再看第二个BLOCKED有20个是获取http连接池的时候出现了等待,看下我们代码里的配置发现没有设置ConnectionRequestTimeout这个属性导致http连接池满的时候其他任务都阻塞了。但是为什么http连接池会一直处于满载的状态却不知道。
static { RequestConfig config = RequestConfig.custom() .setConnectTimeout(2000) .setSocketTimeout(3000) .build(); CLIENT = HttpClientBuilder.create() .setMaxConnTotal(512) .setMaxConnPerRoute(64) .setDefaultRequestConfig(config) .build(); }
从机器上dump了整个堆栈信息,3.6G🤯大小,使用mat分析dump文件发现我们用的http连接池里面,pending一共20个是处于等待的那几个请求获取连接的线程,available是0。
就很疑惑🤔,为什么这些连接会持续这么久不释放。
看服务器上tcp连接状态:
1 # netstat -n | awk '/^tcp/ {++State[$NF]} END {for(i in State) print i, State[i]}' 2 TIME_WAIT 19 3 CLOSE_WAIT 65 4 ESTABLISHED 176
65个处于CLOSE_WAIT状态,和我们配置的MaxConnPerRoute基本吻合,那么说明这几个连接异常了。
看下TCP协议断开连接四次挥手的过程:
其实就是连接异常了,但是没有关闭导致的。
再看看http访问的那段代码:其中有一段
try { log.info("http request url = {},param = {}", url, JSON.toJSONString(get)); CloseableHttpResponse response = CLIENT.execute(get); log.info("http response url = {},param = {}", url, JSON.toJSONString(response)); if (HttpStatus.SC_OK == response.getStatusLine().getStatusCode()) { 。。。 return buffer.toByteArray(); } } catch (SocketTimeoutException | SocketException e) { throw new BizException(BizErrorCodeEnum.BIZ_COMPENSATE); } catch (Throwable throwable) { log.error("[文件http 客户端请求返回为null] url: {}, e : {}", url, throwable); } finally { try { //关闭流 EntityUtils.consume(entity); } catch (IOException e) { return null; } } return null; }
里面的处理是如果服务端返回的不是200,就直接返回了,finally里面关闭的流是entity,但是如果不是返回200,response里面的响应不会放到entity里面,也就是不会被清理。那么问题终于找到了,只要请求的文件服务器返回非200,连接就会一直卡在这里不会被释放。其实最早看的线程栈里面的WAITING的线程就是🤯。但是单纯的看waiting的线程里面的堆栈信息没有任何头绪。
解决的办法就比较简单了,利用语法糖,把response放到try里面就可以了。
1 try (CloseableHttpResponse response = CLIENT.execute(post)) { 2 if (HttpStatus.SC_OK == response.getStatusLine().getStatusCode()) { 3 HttpEntity entity = response.getEntity(); 4 if (entity != null) { 5 String result = EntityUtils.toString(entity); 6 EntityUtils.consume(entity); 7 return JSON.parseObject(result, access); 8 } 9 } 10 } catch (Exception e) { 11 log.info("http 请求异常,url={}, failmsg={}", url, e.getMessage()); 12 }
------------------------------------------------------------------------------------------------------------------------------------------------
模拟复现的步骤也很简单:
客户端:
1 public static void main(String[] args) { 2 3 Runnable r = ()->{ 4 System.out.println("开始执行子线程请求任务"); 5 System.out.println("子线程执行结束,结果:" + HttpClientUtilsTest.multipartGet("http://127.0.0.1:9076/test")); 6 }; 7 8 ExecutorService thread = Executors.newFixedThreadPool(1,new ThreadFactory() { 9 int i = 0; 10 @Override 11 public Thread newThread(Runnable r) { 12 return new Thread(r, "TestThreadPool-" + (i++)); 13 } 14 }); 15 thread.execute(r); 16 17 try { 18 Thread.sleep(1000); 19 } catch (InterruptedException e) { 20 e.printStackTrace(); 21 } 22 23 System.out.println("主线程请求执行开始"); 24 System.out.println("主线程请求执行结束,结果" + HttpClientUtilsTest.multipartGet("http://127.0.0.1:9076/test")); 25 }
服务端:
1 @RequestMapping(value = "/test", method = RequestMethod.GET) 2 public String test() throws Exception { 3 try { 4 Thread.sleep(10000); 5 } catch (InterruptedException e) { 6 e.printStackTrace(); 7 } 8 throw new Exception(); 9 }
客户端执行结果:
开始执行子线程请求任务 主线程请求执行开始 子线程执行结束,结果:null
可以看到主线程的请求一直没有结果。
打印客户端调用服务的线程栈:
jps -l 17650 org.jetbrains.jps.cmdline.Launcher 17652 com.avalon.academy.academy.design.AcademyDesignPatternApplication 70554 sun.tools.jps.Jps 70125 com.avalon.util.HttpClientUtilsTest 9342 70126 org.jetbrains.jps.cmdline.Launcher jstack -l 70125
结果:
1 2019-12-26 14:04:37 2 Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.211-b12 mixed mode): 3 4 "Attach Listener" #13 daemon prio=9 os_prio=31 tid=0x00007f85ddc1b800 nid=0xc07 waiting on condition [0x0000000000000000] 5 java.lang.Thread.State: RUNNABLE 6 7 Locked ownable synchronizers: 8 - None 9 10 "TestThreadPool-0" #12 prio=5 os_prio=31 tid=0x00007f85ddc28800 nid=0x3e03 waiting on condition [0x0000700004337000] 11 java.lang.Thread.State: WAITING (parking) 12 at sun.misc.Unsafe.park(Native Method) 13 - parking to wait for <0x0000000797da30c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) 14 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 15 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) 16 at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 17 at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) 18 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) 19 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 20 at java.lang.Thread.run(Thread.java:748) 21 22 Locked ownable synchronizers: 23 - None 24 25 "Service Thread" #11 daemon prio=9 os_prio=31 tid=0x00007f85dc16f000 nid=0x3903 runnable [0x0000000000000000] 26 java.lang.Thread.State: RUNNABLE 27 28 Locked ownable synchronizers: 29 - None 30 31 "C1 CompilerThread2" #10 daemon prio=9 os_prio=31 tid=0x00007f85dd255800 nid=0x4103 waiting on condition [0x0000000000000000] 32 java.lang.Thread.State: RUNNABLE 33 34 Locked ownable synchronizers: 35 - None 36 37 "C2 CompilerThread1" #9 daemon prio=9 os_prio=31 tid=0x00007f85dc16e000 nid=0x3703 waiting on condition [0x0000000000000000] 38 java.lang.Thread.State: RUNNABLE 39 40 Locked ownable synchronizers: 41 - None 42 43 "C2 CompilerThread0" #8 daemon prio=9 os_prio=31 tid=0x00007f85dcb1f000 nid=0x4403 waiting on condition [0x0000000000000000] 44 java.lang.Thread.State: RUNNABLE 45 46 Locked ownable synchronizers: 47 - None 48 49 "JDWP Command Reader" #7 daemon prio=10 os_prio=31 tid=0x00007f85dc027000 nid=0x4503 runnable [0x0000000000000000] 50 java.lang.Thread.State: RUNNABLE 51 52 Locked ownable synchronizers: 53 - None 54 55 "JDWP Event Helper Thread" #6 daemon prio=10 os_prio=31 tid=0x00007f85dd02a800 nid=0x4703 runnable [0x0000000000000000] 56 java.lang.Thread.State: RUNNABLE 57 58 Locked ownable synchronizers: 59 - None 60 61 "JDWP Transport Listener: dt_socket" #5 daemon prio=10 os_prio=31 tid=0x00007f85dd82c000 nid=0x4807 runnable [0x0000000000000000] 62 java.lang.Thread.State: RUNNABLE 63 64 Locked ownable synchronizers: 65 - None 66 67 "Signal Dispatcher" #4 daemon prio=9 os_prio=31 tid=0x00007f85dd845000 nid=0x3303 runnable [0x0000000000000000] 68 java.lang.Thread.State: RUNNABLE 69 70 Locked ownable synchronizers: 71 - None 72 73 "Finalizer" #3 daemon prio=8 os_prio=31 tid=0x00007f85dc80a000 nid=0x4e03 in Object.wait() [0x0000700003919000] 74 java.lang.Thread.State: WAITING (on object monitor) 75 at java.lang.Object.wait(Native Method) 76 - waiting on <0x0000000797b00bc0> (a java.lang.ref.ReferenceQueue$Lock) 77 at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144) 78 - locked <0x0000000797b00bc0> (a java.lang.ref.ReferenceQueue$Lock) 79 at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165) 80 at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216) 81 82 Locked ownable synchronizers: 83 - None 84 85 "Reference Handler" #2 daemon prio=10 os_prio=31 tid=0x00007f85dc023000 nid=0x5003 in Object.wait() [0x0000700003816000] 86 java.lang.Thread.State: WAITING (on object monitor) 87 at java.lang.Object.wait(Native Method) 88 - waiting on <0x0000000797b1a678> (a java.lang.ref.Reference$Lock) 89 at java.lang.Object.wait(Object.java:502) 90 at java.lang.ref.Reference.tryHandlePending(Reference.java:191) 91 - locked <0x0000000797b1a678> (a java.lang.ref.Reference$Lock) 92 at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153) 93 94 Locked ownable synchronizers: 95 - None 96 97 "main" #1 prio=5 os_prio=31 tid=0x00007f85dc00b800 nid=0x1903 waiting on condition [0x0000700003203000] 98 java.lang.Thread.State: WAITING (parking) 99 at sun.misc.Unsafe.park(Native Method) 100 - parking to wait for <0x0000000797b1f480> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) 101 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 102 at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) 103 at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:380) 104 at org.apache.http.pool.AbstractConnPool.access$200(AbstractConnPool.java:69) 105 at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:246) 106 - locked <0x00000007968cf630> (a org.apache.http.pool.AbstractConnPool$2) 107 at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:193) 108 at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:303) 109 at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:279) 110 at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:191) 111 at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) 112 at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) 113 at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) 114 at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) 115 at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) 116 at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) 117 at com.avalon.util.HttpClientUtilsTest.multipartGet(HttpClientUtilsTest.java:179) 118 at com.avalon.util.HttpClientUtilsTest.main(HttpClientUtilsTest.java:293) 119 120 Locked ownable synchronizers: 121 - None 122 123 "VM Thread" os_prio=31 tid=0x00007f85dc01c800 nid=0x2d03 runnable 124 125 "GC task thread#0 (ParallelGC)" os_prio=31 tid=0x00007f85dd003800 nid=0x1f07 runnable 126 127 "GC task thread#1 (ParallelGC)" os_prio=31 tid=0x00007f85dd004800 nid=0x2a03 runnable 128 129 "GC task thread#2 (ParallelGC)" os_prio=31 tid=0x00007f85dd005000 nid=0x5303 runnable 130 131 "GC task thread#3 (ParallelGC)" os_prio=31 tid=0x00007f85dd005800 nid=0x5203 runnable 132 133 "VM Periodic Task Thread" os_prio=31 tid=0x00007f85dc16f800 nid=0x3f03 waiting on condition 134 135 JNI global references: 5906
可以看到主线程tid=0x00007f85dc00b800是被卡在了从连接池获取连接的地方,TestThreadPool-0 tid=0x00007f85ddc28800这个自线程处于一个WAITING的状态,在等待连接释放。完美的复现了我们的问题。