EsRejectedExecutionException during Elasticsearch bulk writes

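The Alibaba Cloud ARMS console showed that the bulk-write method of our production search service was throwing a large number of exceptions: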

 [DUBBO] Got unchecked and undeclared exception which called by 192.168.x.x. service: xxx.IProductSearchService, method: saveProductEntitys, 
exception: org.springframework.data.elasticsearch.ElasticsearchException: Bulk indexing has failures. Use ElasticsearchException.getFailedDocuments() for detailed messages 
[{346833406144942081PY331010069=RemoteTransportException[[node-100][192.168.x.x:9300][indices:data/write/bulk[s]]]; nested: EsRejectedExecutionException[rejected 
execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1@c1f6e62 on EsThreadPoolExecutor[bulk, queue capacity = 200, 
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@66f7b4ec[Running, pool size = 16, active threads = 16, queued tasks = 200, completed tasks = 332599]]];}], dubbo 
version: search, current host: 192.168.x.x

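I suspected this was related to a change made to the product service just before the release: the method that updates the incremental product index had been switched to an asynchronous, no-return invocation. After building its data, that method calls the search service's interface for bulk-writing to ES.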

<dubbo:reference id="productIdxService" interface="com.xxx.ProductIdxService" lazy="true" timeout="10000">
    <!-- asynchronous call, no return value -->
    <dubbo:method name="buildIncrementalProductIndex" async="true" return="false" />
</dubbo:reference>

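Because the call is asynchronous, the caller returns almost immediately and can issue further calls sooner than before. That likely raises concurrency on the provider side and, in turn, on the interface that bulk-writes to ES.
The EsRejectedExecutionException in the log points the same way: the bulk EsThreadPoolExecutor reports queue capacity = 200, pool size = 16, active threads = 16, queued tasks = 200.
This behaves like a JUC fixed-size thread pool with a bounded queue: once every thread is busy and the queue is full, ES rejects any further task.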
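
The JUC analogy can be reproduced in a few lines. The following is a minimal sketch, not ES's actual implementation: it builds a plain `ThreadPoolExecutor` with a fixed number of threads and a bounded `ArrayBlockingQueue`, keeps every task blocked so nothing completes, and counts how many submissions are rejected. The class name `BoundedPoolDemo` and the method `countRejections` are made up for illustration; the numbers 16 and 200 are taken from the log above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPoolDemo {

    /**
     * Submits `tasks` blocking tasks to a fixed-size pool with a bounded queue
     * and returns how many submissions were rejected.
     */
    static int countRejections(int threads, int queueCapacity, int tasks) throws InterruptedException {
        // core size == max size, bounded queue, default AbortPolicy rejection handler
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // hold the thread so no task finishes early
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // all threads busy and queue full
                }
            }
        } finally {
            release.countDown(); // let the accepted tasks finish
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        // 16 threads + 200 queue slots accept 216 tasks; the remaining 84 are rejected
        System.out.println(countRejections(16, 200, 300)); // prints 84
    }
}
```

With 16 threads and a queue of 200, the pool can hold at most 216 in-flight tasks; anything submitted beyond that while the pool is saturated is rejected, which matches the state shown in the exception message.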

The ES version running in production is very old: 2.2.

Using the HTTP API that ES provides:

View the thread pool status
GET /<ES address>/_cat/thread_pool?v
// Note: ?v shows the header row; you can also choose which columns to display
// Example: /_cat/thread_pool?v&h=host,bulk.active,bulk.rejected,bulk.queue,bulk.queueSize,bulk.size,bulk.min,bulk.max,search.active,search.rejected,search.queue,search.queueSize
View each node's thread pool configuration
GET /<ES address>/_nodes/thread_pool/
View the cluster settings
GET /<ES address>/_cluster/settings
Update the cluster settings
PUT /<ES address>/_cluster/settings

{
  "transient": {
    "threadpool.bulk.type": "fixed",
    "threadpool.bulk.queue_size": 1000,
    "threadpool.bulk.size": 16,
    "threadpool.bulk.min": 16,
    "threadpool.bulk.max": 16
  }
}

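// This changes the bulk queue size to 1000. size, min and max can be submitted as well, but in testing the cluster settings showed the new values while the thread pool view still reported the old values for those three parameters; only the queue size actually changed.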
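
The same update can also be issued from code instead of an HTTP console. The sketch below uses the JDK's `java.net.http` client (Java 11+) to build a PUT against the cluster settings endpoint, `/_cluster/settings`. It sends only `threadpool.bulk.queue_size`, since that is the one setting that was observed to take effect. The class and method names are hypothetical, `http://localhost:9200` is a placeholder address, and the `threadpool.*` key is the ES 2.x spelling used in this post.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class BulkQueueSettings {

    /** Builds the JSON body for a transient update that changes only the bulk queue size. */
    static String settingsBody(int queueSize) {
        return "{\"transient\":{\"threadpool.bulk.queue_size\":" + queueSize + "}}";
    }

    /** Builds (but does not send) the PUT request; baseUrl is e.g. "http://localhost:9200". */
    static HttpRequest buildRequest(String baseUrl, int queueSize) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(queueSize)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://localhost:9200", 1000); // placeholder address
        System.out.println(request.method() + " " + request.uri());
        System.out.println(settingsBody(1000));
        // To actually send it (needs a reachable cluster):
        // java.net.http.HttpClient.newHttpClient()
        //         .send(request, java.net.http.HttpResponse.BodyHandlers.ofString());
    }
}
```

Because the setting is `transient`, it does not survive a full cluster restart, so it would need to be reapplied or moved into persistent configuration.
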
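Because this is a production ES cluster, and because the 16-thread limit is one of ES's protective mechanisms (and given the machine specs and current load), I changed only the queue size.
With the queue size at 1000, the number of EsRejectedExecutionException errors dropped sharply, although the exception still occurs under very high concurrency.
The queue capacity in the exception message now reads 1000, which confirms that the setting took effect.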
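
For the rejections that remain, one common mitigation is to retry the rejected batch after a short, growing delay. This is not something the setup described here did. The sketch below is plain Java with no ES dependency: `BulkRetry`, `isRejection` and `retryOnRejection` are hypothetical names, and detecting a rejection by looking for "EsRejectedExecutionException" in the exception chain is an assumption based on the log message shown earlier.

```java
import java.util.concurrent.Callable;

public class BulkRetry {

    /** Assumption: a rejection shows up as a class name or message text somewhere in the cause chain. */
    static boolean isRejection(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String message = String.valueOf(c.getMessage());
            if (c.getClass().getSimpleName().equals("EsRejectedExecutionException")
                    || message.contains("EsRejectedExecutionException")) {
                return true;
            }
        }
        return false;
    }

    /** Runs the call, retrying with exponential backoff only when the failure looks like a rejection. */
    static <T> T retryOnRejection(Callable<T> call, int maxAttempts, long initialDelayMs) throws Exception {
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts || !isRejection(e)) {
                    throw e; // out of attempts, or an unrelated failure
                }
                Thread.sleep(delay);
                delay *= 2; // back off so the bulk queue has time to drain
            }
        }
    }
}
```

A caller would wrap the bulk-write call in `retryOnRejection(() -> ..., 3, 200)`. Retrying a whole batch re-sends documents that already succeeded, which is harmless only when the writes are idempotent, for example index operations with explicit IDs.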

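This is a stopgap. A small number of failed writes is acceptable from a business standpoint: the impact is minor and hard to notice, and the old ES cluster is already under fairly high load.
The plan is to upgrade ES to Alibaba Cloud's ES 7.4 and to upgrade the old search service along with it, moving to a newer spring-data-elasticsearch or to rest-high-level-client.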

References:
Viewing ES thread pools https://blog.csdn.net/yun0000000/article/details/106327838/
ES write error, EsRejectedExecutionException https://elasticsearch.cn/question/4647
Querying thread pool status with /_cat/thread_pool https://www.letianbiji.com/elasticsearch/es7-cat-thread-pool.html

posted @ 2020-09-13 12:05 cdfive