Elasticsearch reports a "Data too large" exception
We found the following exception in the logs of our production ES cluster. The Elasticsearch version is 7.3.2.
[2021-03-16T21:05:10,338][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [java-d-service-es-200-56-client-1] failed to execute on node [hsF4JzeAQ6mflJRGnJIKzQ]
org.elasticsearch.transport.RemoteTransportException: [data-es-group-online-200-67-2][10.110.200.67:9301][cluster:monitor/nodes/info[n]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [33093117638/30.8gb], which is larger than the limit of [31621696716/29.4gb], real usage: [33093114144/30.8gb], new bytes reserved: [3494/3.4kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=3494/3.4kb, accounting=104564949/99.7mb]
    at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:173) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.3.2.jar:7.3.2]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) [netty-common-4.1.36.Final.jar:4.1.36.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.36.Final.jar:4.1.36.Final]
    at java.lang.Thread.run(Thread.java:835) [?:?]
[2021-03-16T21:05:11,203][INFO ][o.e.x.s.a.AuthenticationServi
I pulled the ES source code. The class that raises the error is org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService, and the relevant code is as follows:
public void checkParentLimit(long newBytesReserved, String label) throws CircuitBreakingException {
    final MemoryUsage memoryUsed = memoryUsed(newBytesReserved);
    long parentLimit = this.parentSettings.getLimit();
    if (memoryUsed.totalUsage > parentLimit) {
        this.parentTripCount.incrementAndGet();
        final StringBuilder message = new StringBuilder("[parent] Data too large, data for [" + label + "]" +
            " would be [" + memoryUsed.totalUsage + "/" + new ByteSizeValue(memoryUsed.totalUsage) + "]" +
            ", which is larger than the limit of [" +
            parentLimit + "/" + new ByteSizeValue(parentLimit) + "]");
        if (this.trackRealMemoryUsage) {
            final long realUsage = memoryUsed.baseUsage;
            message.append(", real usage: [");
            message.append(realUsage);
            message.append("/");
            message.append(new ByteSizeValue(realUsage));
            message.append("], new bytes reserved: [");
            message.append(newBytesReserved);
            message.append("/");
            message.append(new ByteSizeValue(newBytesReserved));
            message.append("]");
        } else {
            message.append(", usages [");
            message.append(String.join(", ",
                this.breakers.entrySet().stream().map(e -> {
                    final CircuitBreaker breaker = e.getValue();
                    final long breakerUsed = (long)(breaker.getUsed() * breaker.getOverhead());
                    return e.getKey() + "=" + breakerUsed + "/" + new ByteSizeValue(breakerUsed);
                })
                .collect(Collectors.toList())));
            message.append("]");
        }
        // derive durability of a tripped parent breaker depending on whether the majority of memory tracked by
        // child circuit breakers is categorized as transient or permanent.
        CircuitBreaker.Durability durability = memoryUsed.transientChildUsage >= memoryUsed.permanentChildUsage ?
            CircuitBreaker.Durability.TRANSIENT : CircuitBreaker.Durability.PERMANENT;
        throw new CircuitBreakingException(message.toString(), memoryUsed.totalUsage, parentLimit, durability);
    }
}
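The code shows that the breaker trips only when memoryUsed.totalUsage > parentLimit. The value of parentLimit comes from the setting indices.breaker.total.limit, whose default is either 95% or 70% of the JVM heap depending on the setting indices.breaker.total.use_real_memory (default true), as the following code shows: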
public static final Setting<Boolean> USE_REAL_MEMORY_USAGE_SETTING =
    Setting.boolSetting("indices.breaker.total.use_real_memory", true, Property.NodeScope);

public static final Setting<ByteSizeValue> TOTAL_CIRCUIT_BREAKER_LIMIT_SETTING =
    Setting.memorySizeSetting("indices.breaker.total.limit", settings -> {
        if (USE_REAL_MEMORY_USAGE_SETTING.get(settings)) {
            return "95%";
        } else {
            return "70%";
        }
    }, Property.Dynamic, Property.NodeScope);
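As a rough check against the log: the logged limit of 31621696716 bytes is consistent with 95% of a 31 GB heap. The heap size is inferred from that number and is not stated anywhere in this post, and the class and method names below are hypothetical, not part of Elasticsearch. A minimal sketch of the arithmetic:

```java
public class ParentLimitSketch {

    // Assumption: a 31 GB heap, inferred from the limit printed in the log.
    static final long ASSUMED_HEAP_BYTES = 31L * 1024 * 1024 * 1024;

    /** Default percentage for indices.breaker.total.limit, as in the setting shown above. */
    static double defaultLimitPercent(boolean useRealMemory) {
        return useRealMemory ? 95.0 : 70.0;
    }

    /** Converts the default percentage into an absolute byte limit for the given heap size. */
    static long parentLimit(long heapMaxBytes, boolean useRealMemory) {
        return (long) ((defaultLimitPercent(useRealMemory) / 100.0) * heapMaxBytes);
    }

    public static void main(String[] args) {
        System.out.println("use_real_memory=true  -> " + parentLimit(ASSUMED_HEAP_BYTES, true));
        System.out.println("use_real_memory=false -> " + parentLimit(ASSUMED_HEAP_BYTES, false));
    }
}
```

Under this assumption, the value computed for use_real_memory=true matches the limit of 31621696716 bytes (29.4gb) in the exception message.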
Next, let's look at the value of memoryUsed.totalUsage. It is computed by a method of the same class:
private MemoryUsage memoryUsed(long newBytesReserved) {
    long transientUsage = 0;
    long permanentUsage = 0;

    for (CircuitBreaker breaker : this.breakers.values()) {
        long breakerUsed = (long)(breaker.getUsed() * breaker.getOverhead());
        if (breaker.getDurability() == CircuitBreaker.Durability.TRANSIENT) {
            transientUsage += breakerUsed;
        } else if (breaker.getDurability() == CircuitBreaker.Durability.PERMANENT) {
            permanentUsage += breakerUsed;
        }
    }
    if (this.trackRealMemoryUsage) {
        final long current = currentMemoryUsage();
        return new MemoryUsage(current, current + newBytesReserved, transientUsage, permanentUsage);
    } else {
        long parentEstimated = transientUsage + permanentUsage;
        return new MemoryUsage(parentEstimated, parentEstimated, transientUsage, permanentUsage);
    }
}
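To make the two branches concrete, the sketch below plugs the numbers from the exception message into each of them. The class and method names are hypothetical; only the byte counts come from the log. It assumes the logged child usages already include each breaker's overhead factor.

```java
public class ParentBreakerArithmetic {

    // Values copied from the exception message in the log.
    static final long LOGGED_LIMIT = 31621696716L;      // limit of [31621696716/29.4gb]
    static final long LOGGED_REAL_USAGE = 33093114144L; // real usage: [33093114144/30.8gb]
    static final long NEW_BYTES_RESERVED = 3494L;       // new bytes reserved: [3494/3.4kb]

    /** Total when use_real_memory is true: current heap usage plus the new reservation. */
    static long totalWithRealMemory(long currentHeapUsed, long newBytesReserved) {
        return currentHeapUsed + newBytesReserved;
    }

    /** Total when use_real_memory is false: the sum of what the child breakers track. */
    static long totalFromChildBreakers(long... childUsages) {
        long sum = 0;
        for (long usage : childUsages) {
            sum += usage;
        }
        return sum;
    }

    public static void main(String[] args) {
        long real = totalWithRealMemory(LOGGED_REAL_USAGE, NEW_BYTES_RESERVED);
        // request, fielddata, in_flight_requests, accounting (from the "usages [...]" part of the log)
        long children = totalFromChildBreakers(0L, 0L, 3494L, 104564949L);

        System.out.println("real-memory total   = " + real + ", trips: " + (real > LOGGED_LIMIT));
        System.out.println("child-breaker total = " + children + ", trips: " + (children > LOGGED_LIMIT));
    }
}
```

The real-memory total equals the 33093117638 bytes reported in the log and exceeds the limit, while the child breakers together account for only about 100 MB. This suggests that most of the heap in use at that moment was not tracked by any child breaker, so the same request would be far below the limit when only child breaker reservations are counted, even with the lower 70% default.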
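The value of trackRealMemoryUsage (taken from the setting indices.breaker.total.use_real_memory) determines whether the trip decision is based on actual memory usage or on the memory reserved by the child circuit breakers. The official documentation explains it as follows: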
Static setting determining whether the parent breaker should take real memory usage into account (true) or only consider the amount that is reserved by child circuit breakers (false). Defaults to true.
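Summary: starting at 11:50 on March 17, 2021, we changed the configuration of the production data nodes to indices.breaker.total.use_real_memory: false and performed a rolling restart of the production cluster.
Today is March 18, 2021. After the configuration was updated at noon yesterday, we ran a business load test against the cluster at 18:30 yesterday evening, and the exception did not appear. (Before the change, load testing caused nodes to drop out of the cluster, and the resulting shard relocation turned the cluster yellow.)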