ES客户端异常关闭原因

一、问题描述

新测试环境的icedoc有个2分钟触发一次的定时任务,任务里有查询ES逻辑,在启动后不久(10分钟以上)开始出现ES客户端异常关闭的错误:

1.1、先出现客户端连接异常关闭的错误

2021-07-02 10:34:00.600 ERROR c.s.a.s.i.SearchServiceImpl
org.apache.http.ConnectionClosedException: Connection closed unexpectedly
	at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:813)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
	at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
	at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
	at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1454)
	at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:970)

1.2、紧接着I/O dispatch线程异常退出

2021-07-02 10:34:00.709 ERROR o.a.h.i.n.c.CloseableHttpAsyncClientBase$1 [run:66] [pool-1-thread-1] []
				- I/O reactor terminated abnormally
org.apache.http.nio.reactor.IOReactorException: I/O dispatch worker terminated abnormally
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:359)
	at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
	at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: org.apache.http.ConnectionClosedException: method <init>()V not found
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:356)
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)

1.3、后续每2分钟的任务执行都是I/O reactor已停止的错误

2021-07-02 10:36:00.633 ERROR c.s.c.f.ServiceExceptionFiler
java.lang.RuntimeException: Request cannot be executed; I/O reactor status: STOPPED
	at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:831)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
	at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
	at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
	at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1454)
	at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:970)

调整定时触发频率为1小时,没有再发现连接异常关闭的错误了。由于其他环境都是1小时的频率,所以也没有发现这种错误

二、复现问题

2.1、在开发环境按1分钟执行一次,重复执行了1天时间,只出现一次连接中止,但并未影响后续请求

2021-07-22 00:25:44.662 ERROR c.s.a.s.i.SearchServiceImpl [search:235] [DubboServerHandler-172.16.20.14:21027-thread-200] []
java.io.IOException: 您的主机中的软件中止了一个已建立的连接。
	at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:828)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
	at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
	at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
	at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1454)
	at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:970)

2.2、通过-Xmx64m调整JVM堆内存后出现GC检查的异常,并复现了后续连接异常关闭的错误

2021-07-22 10:39:50.324 WARN  o.a.z.ClientCnxn$SendThread [run:1162] [main-SendThread(172.16.2.55:2181)] []
				- Session 0x104d4b207c30562 for server 172.16.2.55/172.16.2.55:2181, unexpected error, closing socket connection and attempting reconnect
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.apache.zookeeper.ClientCnxn$SendThread.sendPing(ClientCnxn.java:973)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1118)
2021-07-22 10:39:50.526 ERROR c.s.c.f.ServiceExceptionFiler [afterThrowing:64] [DubboServerHandler-172.16.20.14:21027-thread-22] []
				- SearchResponse com.worley.api.service.impl.SearchServiceImpl.search(SearchRequest)
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:254)
	at com.fasterxml.jackson.core.JsonFactory._createParser(JsonFactory.java:1322)
	at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:827)
	at org.elasticsearch.common.xcontent.json.JsonXContent.createParser(JsonXContent.java:91)
	at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:142)
	at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:114)
	at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:90)
	at org.elasticsearch.search.lookup.SourceLookup.sourceAsMapAndType(SourceLookup.java:86)
	at org.elasticsearch.search.lookup.SourceLookup.sourceAsMap(SourceLookup.java:90)
	at org.elasticsearch.search.SearchHit.getSourceAsMap(SearchHit.java:412)
	at com.worley.api.service.impl.SearchServiceImpl.search(SearchServiceImpl.java:250)
2021-07-22 10:39:51.869 INFO  o.a.z.ClientCnxn$SendThread [logStartConnect:1032] [main-SendThread(172.16.2.55:2181)] []
				- Opening socket connection to server 172.16.2.55/172.16.2.55:2181. Will not attempt to authenticate using SASL (unknown error)
2021-07-22 10:39:51.991 INFO  o.a.z.ClientCnxn$SendThread [primeConnection:876] [main-SendThread(172.16.2.55:2181)] []
				- Socket connection established to 172.16.2.55/172.16.2.55:2181, initiating session
2021-07-22 10:39:52.115 INFO  o.a.z.ClientCnxn$SendThread [onConnected:1299] [main-SendThread(172.16.2.55:2181)] []
				- Session establishment complete on server 172.16.2.55/172.16.2.55:2181, sessionid = 0x104d4b207c30562, negotiated timeout = 30000

2021-07-22 10:41:47.954 ERROR c.s.a.s.i.SearchServiceImpl [search:235] [DubboServerHandler-172.16.20.14:21027-thread-23] []
org.apache.http.ConnectionClosedException: Connection closed unexpectedly
	at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:813)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
	at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
	at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
	at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1454)
	at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:970)
	at com.worley.api.service.impl.SearchServiceImpl.search(SearchServiceImpl.java:228)

2.3、在控制台看到了几处系统ERR的信息并没有打印在日志文件里

通过查看新测试环境启动命令,发现堆内存配置是-Xms256m -Xmx512m,在加上每2分钟触发一次任务,有可能是内存不足导致连接未正常退出。
但是一般这种错误会抛出业务层并打印在日志文件里,在新测试环境未发现GC的日志

2021-07-22 12:46:56.597 INFO
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.HashMap$KeySet.iterator(HashMap.java:916)
	at java.util.HashSet.iterator(HashSet.java:172)
	at java.util.Collections$UnmodifiableCollection$1.<init>(Collections.java:1039)
	at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1038)
	at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processTimeouts(DefaultConnectingIOReactor.java:204)
	at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:158)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
	at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
	at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
	at java.lang.Thread.run(Thread.java:748)

# 这里在控制台的ERR打印的,所以日志文件里找不到
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "I/O dispatcher 2" Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: GC overhead limit exceeded

# 日志文件里找到了关于GC的错误日志,说明新测试的iecdoc不是GC问题引起的
2021-07-22 14:40:15.802 ERROR c.s.c.f.ServiceExceptionFiler [afterThrowing:64] [DubboServerHandler-172.16.20.14:21027-thread-39] []
				- SearchResult com.worley.ecdoc.service.index.ecdoc.EcdocContentIndexSearchService.cmsPublishListSearch(EcdocContentSearchVO)
java.lang.OutOfMemoryError: GC overhead limit exceeded

2.4、使用与新测试一样的频率(2分钟)重复调用,在40分钟后出现一样的错误,问题完整复现

由于nginx长连接超时配置为120s,客户端1.0.3版本未设置长连接保持时间,全部依赖nginx的配置。
我怀疑是客户端恰好在nginx即将断开长连接的时候拿到连接,在请求时nginx正好关闭长连接导致这个错误。

这里有个错误需要明确指出:

java.lang.NoSuchMethodError: org.apache.http.ConnectionClosedException: method <init>()V not found

这是由于在httpcore-nio-4.4.12org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput方法里使用无参构造器实例化ConnectionClosedException时报错导致的,
因为httpcore-4.4.6里没有这个无参构造器,高版本里才有

    @Override
    public void endOfInput(final NHttpClientConnection conn) throws IOException {
        final State state = getState(conn);
        final HttpContext context = conn.getContext();
        synchronized (context) {
            if (state != null) {
                if (state.getRequestState().compareTo(MessageState.READY) != 0) {
                    state.invalidate();
                }
                final HttpAsyncClientExchangeHandler handler = getHandler(conn);
                if (handler != null) {
                    if (state.isValid()) {
                        handler.inputTerminated();
                    } else {
                        // 从错误日志里可以看到,用无参构造器实例化 ConnectionClosedException 失败
                        handler.failed(new ConnectionClosedException());
                    }
                }
            }
            if (conn.getSocketTimeout() <= 0) {
                conn.setSocketTimeout(1000);
            }
            conn.close();
        }
    }

最终出现I/O reactor terminated abnormally错误,导致后续请求一直报Request cannot be executed; I/O reactor status: STOPPED的错误:

    public CloseableHttpAsyncClientBase(
            final NHttpClientConnectionManager connmgr,
            final ThreadFactory threadFactory,
            final NHttpClientEventHandler handler) {
        super();
        this.connmgr = connmgr;
        if (threadFactory != null && handler != null) {
            this.reactorThread = threadFactory.newThread(new Runnable() {
                @Override
                public void run() {
                    try {
                        final IOEventDispatch ioEventDispatch = new InternalIODispatch(handler);
                        connmgr.execute(ioEventDispatch);
                    } catch (final Exception ex) {
                        log.error("I/O reactor terminated abnormally", ex);
                    } finally {
                        status.set(Status.STOPPED); // 异常后设置了状态为:STOPPED
                    }
                }
            });
        } else {
            this.reactorThread = null;
        }
        this.status = new AtomicReference<Status>(Status.INACTIVE);
    }

2.5、使用与新测试一样的频率(2分钟)重复调用,调整httpcore-core-4.4.12的依赖

在iecdoc里调整版本

<dependency>
	<groupId>org.apache.httpcomponents</groupId>
	<artifactId>httpcore</artifactId>
	<version>4.4.12</version>
</dependency>

使用debug启动,并在endOfInput里设置断点。等待20分钟后端点被触发,通过单步执行走到handler.failed(new ConnectionClosedException())这行代码,这次没有类构造失败的错误了,
只是报了下列Connection is closed的错误,后续请求正常执行,说明只要I/O dispatch线程正常就可以继续获取连接发起请求:

2021-07-23 14:44:26.804 ERROR c.s.a.s.i.SearchServiceImpl [search:235] [DubboServerHandler-172.16.20.14:21027-thread-88] []
				- search error:
org.apache.http.ConnectionClosedException: Connection is closed
	at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:813)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
	at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
	at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
	at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1454)
	at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:970)
	at com.worley.api.service.impl.SearchServiceImpl.search(SearchServiceImpl.java:228)

这个异常只是影响当前请求,可以使用spring-retry包的重试机制解决。

三、总结

导致Request cannot be executed; I/O reactor status: STOPPED的错误是由于I/O dispatch线程异常退出导致

  • 唯一原因是ConnectionClosedException的构造错误引起(httpcore-4.4.6版本过低)

导致程序在endOfInput方法里判断连接状态是无效的原因,目前发现有两种:

  • 堆内存不足GC overhead limit exceeded导致连接关闭
  • 由于客户端长连接未设置,完全依赖服务端主动关闭长连接。在获取请求后,此时服务端有可能主动(达到设置的过期时间)关闭长连接。

解决方式:

  • 客户端设置长连接时间小于服务端,有效避免客户端发起请求前服务端突然断开长连接
posted @ 2021-10-28 16:03  爱定小闹钟  阅读(5632)  评论(0编辑  收藏  举报