ES客户端异常关闭原因
一、问题描述
新测试环境的icedoc有个2分钟触发一次的定时任务,任务里有查询ES逻辑,在启动后不久(10分钟以上)开始出现ES客户端异常关闭的错误:
1.1、先出现客户端连接异常关闭的错误
2021-07-02 10:34:00.600 ERROR c.s.a.s.i.SearchServiceImpl
org.apache.http.ConnectionClosedException: Connection closed unexpectedly
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:813)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1454)
at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:970)
1.2、紧接着I/O dispatch
线程异常退出
2021-07-02 10:34:00.709 ERROR o.a.h.i.n.c.CloseableHttpAsyncClientBase$1 [run:66] [pool-1-thread-1] []
- I/O reactor terminated abnormally
org.apache.http.nio.reactor.IOReactorException: I/O dispatch worker terminated abnormally
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:359)
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: org.apache.http.ConnectionClosedException: method <init>()V not found
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:356)
at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261)
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
1.3、后续每2分钟的任务执行都是I/O reactor
已停止的错误
2021-07-02 10:36:00.633 ERROR c.s.c.f.ServiceExceptionFiler
java.lang.RuntimeException: Request cannot be executed; I/O reactor status: STOPPED
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:831)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1454)
at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:970)
调整定时触发频率为1小时,没有再发现连接异常关闭的错误了。由于其他环境都是1小时的频率,所以也没有发现这种错误
二、复现问题
2.1、在开发环境按1分钟执行一次,重复执行了1天时间,只出现一次连接中止,但并未影响后续请求
2021-07-22 00:25:44.662 ERROR c.s.a.s.i.SearchServiceImpl [search:235] [DubboServerHandler-172.16.20.14:21027-thread-200] []
java.io.IOException: 您的主机中的软件中止了一个已建立的连接。
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:828)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1454)
at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:970)
2.2、通过-Xmx64m
调整JVM堆内存后出现GC检查的异常,并复现了后续连接异常关闭的错误
2021-07-22 10:39:50.324 WARN o.a.z.ClientCnxn$SendThread [run:1162] [main-SendThread(172.16.2.55:2181)] []
- Session 0x104d4b207c30562 for server 172.16.2.55/172.16.2.55:2181, unexpected error, closing socket connection and attempting reconnect
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.zookeeper.ClientCnxn$SendThread.sendPing(ClientCnxn.java:973)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1118)
2021-07-22 10:39:50.526 ERROR c.s.c.f.ServiceExceptionFiler [afterThrowing:64] [DubboServerHandler-172.16.20.14:21027-thread-22] []
- SearchResponse com.worley.api.service.impl.SearchServiceImpl.search(SearchRequest)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:254)
at com.fasterxml.jackson.core.JsonFactory._createParser(JsonFactory.java:1322)
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:827)
at org.elasticsearch.common.xcontent.json.JsonXContent.createParser(JsonXContent.java:91)
at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:142)
at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:114)
at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:90)
at org.elasticsearch.search.lookup.SourceLookup.sourceAsMapAndType(SourceLookup.java:86)
at org.elasticsearch.search.lookup.SourceLookup.sourceAsMap(SourceLookup.java:90)
at org.elasticsearch.search.SearchHit.getSourceAsMap(SearchHit.java:412)
at com.worley.api.service.impl.SearchServiceImpl.search(SearchServiceImpl.java:250)
2021-07-22 10:39:51.869 INFO o.a.z.ClientCnxn$SendThread [logStartConnect:1032] [main-SendThread(172.16.2.55:2181)] []
- Opening socket connection to server 172.16.2.55/172.16.2.55:2181. Will not attempt to authenticate using SASL (unknown error)
2021-07-22 10:39:51.991 INFO o.a.z.ClientCnxn$SendThread [primeConnection:876] [main-SendThread(172.16.2.55:2181)] []
- Socket connection established to 172.16.2.55/172.16.2.55:2181, initiating session
2021-07-22 10:39:52.115 INFO o.a.z.ClientCnxn$SendThread [onConnected:1299] [main-SendThread(172.16.2.55:2181)] []
- Session establishment complete on server 172.16.2.55/172.16.2.55:2181, sessionid = 0x104d4b207c30562, negotiated timeout = 30000
2021-07-22 10:41:47.954 ERROR c.s.a.s.i.SearchServiceImpl [search:235] [DubboServerHandler-172.16.20.14:21027-thread-23] []
org.apache.http.ConnectionClosedException: Connection closed unexpectedly
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:813)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1454)
at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:970)
at com.worley.api.service.impl.SearchServiceImpl.search(SearchServiceImpl.java:228)
2.3、在控制台看到了几处系统ERR
的信息并没有打印在日志文件里
通过查看新测试环境启动命令,发现堆内存配置是
-Xms256m -Xmx512m
,在加上每2分钟触发一次任务,有可能是内存不足导致连接未正常退出。
但是一般这种错误会抛出业务层并打印在日志文件里,在新测试环境未发现GC
的日志
2021-07-22 12:46:56.597 INFO
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.HashMap$KeySet.iterator(HashMap.java:916)
at java.util.HashSet.iterator(HashSet.java:172)
at java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1039)
at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1038)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processTimeouts(DefaultConnectingIOReactor.java:204)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:158)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
at java.lang.Thread.run(Thread.java:748)
# 这里在控制台的ERR打印的,所以日志文件里找不到
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "I/O dispatcher 2" Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: GC overhead limit exceeded
# 日志文件里找到了关于GC的错误日志,说明新测试的iecdoc不是GC问题引起的
2021-07-22 14:40:15.802 ERROR c.s.c.f.ServiceExceptionFiler [afterThrowing:64] [DubboServerHandler-172.16.20.14:21027-thread-39] []
- SearchResult com.worley.ecdoc.service.index.ecdoc.EcdocContentIndexSearchService.cmsPublishListSearch(EcdocContentSearchVO)
java.lang.OutOfMemoryError: GC overhead limit exceeded
2.4、使用与新测试一样的频率(2分钟)重复调用,在40分钟后出现一样的错误,问题完整复现
由于nginx长连接超时配置为120s,客户端1.0.3版本未设置长连接保持时间,全部依赖nginx的配置。
我怀疑是客户端恰好在nginx即将断开长连接的时候拿到连接,在请求时nginx正好关闭长连接导致这个错误。
这里有个错误需要明确指出:
java.lang.NoSuchMethodError: org.apache.http.ConnectionClosedException: method <init>()V not found
这是由于在httpcore-nio-4.4.12
的org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput
方法里使用无参构造器实例化ConnectionClosedException
时报错导致的,
因为httpcore-4.4.6
里没有这个无参构造器,高版本里才有
@Override
public void endOfInput(final NHttpClientConnection conn) throws IOException {
final State state = getState(conn);
final HttpContext context = conn.getContext();
synchronized (context) {
if (state != null) {
if (state.getRequestState().compareTo(MessageState.READY) != 0) {
state.invalidate();
}
final HttpAsyncClientExchangeHandler handler = getHandler(conn);
if (handler != null) {
if (state.isValid()) {
handler.inputTerminated();
} else {
// 从错误日志里可以看到,用无参构造器实例化 ConnectionClosedException 失败
handler.failed(new ConnectionClosedException());
}
}
}
if (conn.getSocketTimeout() <= 0) {
conn.setSocketTimeout(1000);
}
conn.close();
}
}
最终出现I/O reactor terminated abnormally
错误,导致后续请求一直报Request cannot be executed; I/O reactor status: STOPPED
的错误:
public CloseableHttpAsyncClientBase(
final NHttpClientConnectionManager connmgr,
final ThreadFactory threadFactory,
final NHttpClientEventHandler handler) {
super();
this.connmgr = connmgr;
if (threadFactory != null && handler != null) {
this.reactorThread = threadFactory.newThread(new Runnable() {
@Override
public void run() {
try {
final IOEventDispatch ioEventDispatch = new InternalIODispatch(handler);
connmgr.execute(ioEventDispatch);
} catch (final Exception ex) {
log.error("I/O reactor terminated abnormally", ex);
} finally {
status.set(Status.STOPPED); // 异常后设置了状态为:STOPPED
}
}
});
} else {
this.reactorThread = null;
}
this.status = new AtomicReference<Status>(Status.INACTIVE);
}
2.5、使用与新测试一样的频率(2分钟)重复调用,调整httpcore-core-4.4.12
的依赖
在iecdoc里调整版本
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>4.4.12</version>
</dependency>
使用debug启动,并在endOfInput
里设置断点。等待20分钟后端点被触发,通过单步执行走到handler.failed(new ConnectionClosedException())
这行代码,这次没有类构造失败的错误了,
只是报了下列Connection is closed
的错误,后续请求正常执行,说明只要I/O dispatch
线程正常就可以继续获取连接发起请求:
2021-07-23 14:44:26.804 ERROR c.s.a.s.i.SearchServiceImpl [search:235] [DubboServerHandler-172.16.20.14:21027-thread-88] []
- search error:
org.apache.http.ConnectionClosedException: Connection is closed
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:813)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1454)
at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:970)
at com.worley.api.service.impl.SearchServiceImpl.search(SearchServiceImpl.java:228)
这个异常只是影响当前请求,可以使用spring-retry
包的重试机制解决。
三、总结
导致Request cannot be executed; I/O reactor status: STOPPED
的错误是由于I/O dispatch
线程异常退出导致
- 唯一原因是
ConnectionClosedException
的构造错误引起(httpcore-4.4.6
版本过低)
导致程序在endOfInput
方法里判断连接状态是无效的原因,目前发现有两种:
- 堆内存不足
GC overhead limit exceeded
导致连接关闭 - 由于客户端长连接未设置,完全依赖服务端主动关闭长连接。在获取请求后,此时服务端有可能主动(达到设置的过期时间)关闭长连接。
解决方式:
- 客户端设置长连接时间小于服务端,有效避免客户端发起请求前服务端突然断开长连接