【elasticsearch】 Elasticsearch Breaker CircuitBreakingException Parent Data Too Large Real Usage
Reposted from:
Author: jiankunking
Original link: https://jiankunking.com/elasticsearch-breaker-circuitbreakingexception-parent-data-too-large-transport-request-real-usage.html
Preface:
Today an alert came in: one node in our ES cluster had gone down, so I dug into the logs.
Error message:
The error shows that the parent circuit breaker was tripped.
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<reused_arrays>] would be [3207150816/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3207150728/2.9gb], new bytes reserved: [88/88b], usages [request=11064/10.8kb, fielddata=84212/82.2kb, in_flight_requests=27938762/26.6mb, accounting=285742893/272.5mb]
at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:343) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.BigArrays.adjustBreaker(BigArrays.java:399) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.BigArrays.validate(BigArrays.java:447) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.BigArrays.newLongArray(BigArrays.java:608) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.BigArrays.resize(BigArrays.java:626) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.AbstractHash.resize(AbstractHash.java:51) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.AbstractPagedHashMap.grow(AbstractPagedHashMap.java:102) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.LongHash.add(LongHash.java:107) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramAggregator$1.collect(DateHistogramAggregator.java:121) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.search.aggregations.LeafBucketCollector.collect(LeafBucketCollector.java:82) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.apache.lucene.search.MultiCollector$MultiLeafCollector.collect(MultiCollector.java:188) ~[lucene-core-8.2.0.jar:8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:05:56]
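You can inspect each breaker's current usage and limits on the nodes via the node stats API (127.0.0.1:9200 here stands in for your own node address):
curl "http://127.0.0.1:9200/_nodes/stats/breaker?pretty"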
The jvm.options configuration at the time:
-Xms3g
-Xmx3g
## GC configuration
-XX:+UseG1GC
# run explicit GCs as concurrent cycles instead of stop-the-world full GCs
-XX:+ExplicitGCInvokesConcurrent
-XX:+UseGCOverheadLimit
# heap occupancy percentage at which a concurrent GC cycle starts
-XX:InitiatingHeapOccupancyPercent=45
# percentage of the heap G1 reserves as a promotion safety margin (default 10%)
-XX:G1ReservePercent=15
# target maximum GC pause in milliseconds (default 200); lowering it increases system load
-XX:MaxGCPauseMillis=100
Root cause:
After some Googling, it turned out the problem is caused by indices.breaker.total.use_real_memory, introduced in ES 7.x. According to the documentation, when this setting is enabled the parent breaker accounts for the JVM's actual (real) memory usage rather than the sum of the child breakers' estimates.
So at what level of JVM memory usage does this breaker trip? From the cluster's default settings we can see that indices.breaker.total.limit defaults to 95% (of the heap):
curl "http://127.0.0.1:9200/_cluster/settings?include_defaults&flat_settings&local&filter_path=defaults.indices*&pretty"
Result:
{
"defaults" : {
"indices.analysis.hunspell.dictionary.ignore_case" : "false",
"indices.analysis.hunspell.dictionary.lazy" : "false",
"indices.breaker.accounting.limit" : "100%",
"indices.breaker.accounting.overhead" : "1.0",
"indices.breaker.fielddata.limit" : "40%",
"indices.breaker.fielddata.overhead" : "1.03",
"indices.breaker.fielddata.type" : "memory",
"indices.breaker.request.limit" : "60%",
"indices.breaker.request.overhead" : "1.0",
"indices.breaker.request.type" : "memory",
"indices.breaker.total.limit" : "95%",
"indices.breaker.total.use_real_memory" : "true",
"indices.breaker.type" : "hierarchy",
"indices.cache.cleanup_interval" : "1m",
"indices.fielddata.cache.size" : "-1b",
"indices.lifecycle.poll_interval" : "10m",
"indices.mapping.dynamic_timeout" : "30s",
"indices.memory.index_buffer_size" : "10%",
"indices.memory.interval" : "5s",
"indices.memory.max_index_buffer_size" : "-1",
"indices.memory.min_index_buffer_size" : "48mb",
"indices.memory.shard_inactive_time" : "5m",
"indices.queries.cache.all_segments" : "false",
"indices.queries.cache.count" : "10000",
"indices.queries.cache.size" : "10%",
"indices.query.bool.max_clause_count" : "1024",
"indices.query.query_string.allowLeadingWildcard" : "true",
"indices.query.query_string.analyze_wildcard" : "false",
"indices.recovery.internal_action_long_timeout" : "1800000ms",
"indices.recovery.internal_action_timeout" : "15m",
"indices.recovery.max_bytes_per_sec" : "40mb",
"indices.recovery.max_concurrent_file_chunks" : "2",
"indices.recovery.recovery_activity_timeout" : "1800000ms",
"indices.recovery.retry_delay_network" : "5s",
"indices.recovery.retry_delay_state_sync" : "500ms",
"indices.requests.cache.expire" : "0ms",
"indices.requests.cache.size" : "1%",
"indices.store.delete.shard.timeout" : "30s"
}
}
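As a sanity check, 95% of this node's 3 GiB heap works out to exactly the limit quoted in the exception:
# 3 GiB heap * 95% parent breaker limit (shell integer arithmetic)
echo $((3 * 1024 * 1024 * 1024 * 95 / 100))
# prints 3060164198, i.e. the "limit of [3060164198/2.8gb]" in the error above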
Why, then, does the breaker trip so frequently under the G1 collector once indices.breaker.total.use_real_memory is enabled by default?
First, a quick explanation of a few G1 parameters (a command to check their effective values follows this list):
InitiatingHeapOccupancyPercent: the heap occupancy threshold that decides whether a G1 concurrent marking cycle should start. After a GC completes, G1 compares the old generation's footprint against the whole Java heap; if the ratio exceeds this value, it schedules a concurrent cycle at the next GC, beginning with the initial-mark phase. The smaller the value, the more frequent the GC; the larger the value, the longer the application can run between cycles. When memory is being consumed quickly, though, I'd argue that starting the concurrent cycle early beats starting it late: better to see the doctor sooner rather than later.
G1NewSizePercent: the initial young generation size, 5% of the heap by default.
G1MaxNewSizePercent: the maximum young generation size, 60% of the whole Java heap by default.
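A quick way to see what these flags resolve to on your own JDK is to dump the final flag values (a minimal sketch; point java at the same JDK 11 binary Elasticsearch runs on):
# dump the JVM's effective flag values and filter the G1 settings discussed above
java -XX:+UseG1GC -XX:+PrintFlagsFinal -version | grep -E 'InitiatingHeapOccupancyPercent|G1NewSizePercent|G1MaxNewSizePercent|G1ReservePercent'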
For the detailed analysis of this issue, see: https://github.com/elastic/elasticsearch/pull/46169
In short: with the defaults Elasticsearch previously shipped in jvm.options, the combined occupancy of the old and young generations could push heap usage past 95% (in theory the threshold could reach 60% + 75% = 135%), leading to frequent circuit breaking.
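If the nodes cannot be restarted immediately, one possible stopgap (not the actual fix, and it increases the risk of a real OOM) is to lift the parent breaker's limit dynamically; indices.breaker.total.limit is a dynamic cluster setting:
curl -X PUT "http://127.0.0.1:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"transient": {"indices.breaker.total.limit": "98%"}}'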
Solution:
The most effective fix is to adjust the GC configuration to match the JDK version in use. I'm running JDK 11, so I kept G1 and lowered the IHOP value. The configuration now looks like this:
## GC configuration
-XX:+UseG1GC
# run explicit GCs as concurrent cycles instead of stop-the-world full GCs
-XX:+ExplicitGCInvokesConcurrent
-XX:+UseGCOverheadLimit
# heap occupancy percentage at which a concurrent GC cycle starts
-XX:InitiatingHeapOccupancyPercent=25
# percentage of the heap G1 reserves as a promotion safety margin (default 10%)
-XX:G1ReservePercent=15
# target maximum GC pause in milliseconds (default 200); lowering it increases system load
-XX:MaxGCPauseMillis=100
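After restarting with the lower IHOP, you can verify that concurrent cycles now start earlier; a minimal sketch with jstat (assuming the node's PID can be found via pgrep; the O column is old-generation occupancy in %):
# sample GC utilization every 1000 ms; O (old gen %) should stay well below 95
jstat -gcutil $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) 1000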