disk usage exceeded flood-stage watermark

0. Problem description

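During a product development iteration, disk usage on our test server reached 96%. After deploying code to that server for testing, we found that the creation module could neither create nor edit creations, and the asset module could neither add new assets nor modify existing ones. Existing data could still be read, but any insert or update made the page throw a long stack of exceptions. The server-side log showed the following: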

0.1 Log output

org.elasticsearch.cluster.block.ClusterBlockException: index [.ds-ilm-history-5-2022.05.28-000003] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];index [.ds-ilm-history-5-2022.06.27-000004] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];index [.ds-ilm-history-5-2022.04.28-000002] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];

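The log mentions two things: too many requests, and disk usage above a threshold. At first glance either one could make ES lock the indices by itself. Note, however, that TOO_MANY_REQUESTS here is the REST status (HTTP 429) that ES attaches to the read-only-allow-delete block, and 12 is the block ID, so the text alone does not show that the cluster actually received too many requests.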
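
To make the log line easier to read, the exception message can be split into one record per blocked index. The following is a minimal Python sketch and not part of the original troubleshooting; the function name parse_cluster_blocks and the regular expression are my own assumptions, based only on the message layout shown above.

```python
import re

# One "index [name] blocked by: [STATUS/ID/description]" segment.
# The pattern is an assumption derived from the log line in this post.
BLOCK_PATTERN = re.compile(
    r"index \[(?P<index>[^\]]+)\] blocked by: "
    r"\[(?P<status>[A-Z_]+)/(?P<block_id>\d+)/(?P<description>[^\]]+)\]"
)


def parse_cluster_blocks(message):
    """Return one dict per blocked index found in the exception message."""
    return [
        {
            "index": match.group("index"),
            "status": match.group("status"),
            "block_id": int(match.group("block_id")),
            "description": match.group("description"),
        }
        for match in BLOCK_PATTERN.finditer(message)
    ]
```

Every record from the log above carries the same status, block ID, and description, which points to a single cause (the flood-stage block) rather than two independent ones.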

1. Fix for a lock caused by too many requests

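We had recently load-tested the creation write API in the test environment, so my first guess was that the number of requests during the load test had pushed memory on the ES node past its limit and that ES had locked the indices itself. I then ran the following command in Kibana Dev Tools: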

PUT _all/_settings
{
  "index.blocks.read_only_allow_delete": null
}

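After running the command, ES accepted a few writes, but then the block was triggered again: with the disk still above the flood-stage watermark, ES reapplies the block on its next disk check. So this was clearly not the root cause this time. Since the same command is still useful whenever indices are stuck with this block, I am recording it here anyway.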
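
The same request can also be sent outside Kibana. Below is a minimal sketch using only the Python standard library; the base URL http://localhost:9200 and both helper names are assumptions, and a secured cluster would additionally need authentication, which is not shown.

```python
import json
import urllib.request


def build_unblock_request(base_url, index_pattern="_all"):
    """Build the PUT <index_pattern>/_settings request that clears the block."""
    body = json.dumps({"index.blocks.read_only_allow_delete": None})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_pattern}/_settings",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def clear_read_only_block(base_url, index_pattern="_all"):
    """Send the request and return the decoded JSON response."""
    request = build_unblock_request(base_url, index_pattern)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (assumed address, not from the original post):
# clear_read_only_block("http://localhost:9200")
```

Building the request separately from sending it makes it possible to inspect the URL and body before touching the cluster.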

2. Fix for disk usage above the threshold

Since the first fix did not solve the problem, I concluded that it was caused by disk usage exceeding the threshold.

2.1 ES warning log

While going through the ES node log, I found that ES kept printing this warning:

[WARN ][o.e.c.r.a.DiskThresholdMonitor] [node01] flood stage disk watermark [95%] exceeded on [NOe0S30GTKm4BSFxz6ndUw][node01][/data/es/node01/nodes/0] free: 3.7gb[3.8%], all indices on this node will be marked read-only

2.2 Checking disk usage on the ES node

df -h

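The command showed that disk usage on the Linux server had reached 96%. We had not configured any disk thresholds when deploying ES, so the defaults applied, and the default flood-stage limit is 95%. This default can be seen in the ES source. The excerpt below is the variant for dedicated frozen nodes; the general cluster.routing.allocation.disk.watermark.flood_stage setting is declared in the same file and also defaults to 95%: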

public static final Setting<RelativeByteSizeValue> CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_FROZEN_SETTING =
        new Setting<>("cluster.routing.allocation.disk.watermark.flood_stage.frozen", "95%",
            (s) -> RelativeByteSizeValue.parseRelativeByteSizeValue(s,  "cluster.routing.allocation.disk.watermark.flood_stage.frozen"),
            Setting.Property.Dynamic, Setting.Property.NodeScope);

Source: https://github.com/elastic/elasticsearch/blob/v7.14.2/server/src/main/java/org/elasticsearch/cluster/routing/allocation/DiskThresholdSettings.java
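
As a rough cross-check of what df -h reports, the comparison against the 95% default can be sketched in Python. This is an illustration, not ES's actual implementation; the helper names are hypothetical, and the figure can differ slightly from df because df excludes blocks reserved for root from its percentage.

```python
import shutil

# Default flood-stage watermark, expressed as used disk space in percent.
FLOOD_STAGE_USED_PERCENT = 95.0


def used_percent(total_bytes, free_bytes):
    """Used space as a percentage of the total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def exceeds_flood_stage(total_bytes, free_bytes,
                        watermark_percent=FLOOD_STAGE_USED_PERCENT):
    """True when used space is strictly above the watermark."""
    return used_percent(total_bytes, free_bytes) > watermark_percent


def check_path(path, watermark_percent=FLOOD_STAGE_USED_PERCENT):
    """Apply the check to the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return exceeds_flood_stage(usage.total, usage.free, watermark_percent)
```

On the ES node, calling check_path with the data path from the warning log (/data/es/node01) would be the equivalent of reading the Use% column for that mount.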

Code that logs the warning:

for (final ObjectObjectCursor<String, DiskUsage> entry : usages) {
    final String node = entry.key;
    final DiskUsage usage = entry.value;
    final RoutingNode routingNode = routingNodes.node(node);

    if (isDedicatedFrozenNode(routingNode)) {
        ByteSizeValue total = ByteSizeValue.ofBytes(usage.getTotalBytes());
        long frozenFloodStageThreshold = diskThresholdSettings.getFreeBytesThresholdFrozenFloodStage(total).getBytes();
        if (usage.getFreeBytes() < frozenFloodStageThreshold) {
            logger.warn("flood stage disk watermark [{}] exceeded on {}",
                diskThresholdSettings.describeFrozenFloodStageThreshold(total), usage);
        }
        // skip checking high/low watermarks for frozen nodes, since frozen shards have only insignificant local storage footprint
        // and this allows us to use more of the local storage for cache.
        continue;
    }

    if (usage.getFreeBytes() < diskThresholdSettings.getFreeBytesThresholdFloodStage().getBytes() ||
        usage.getFreeDiskAsPercentage() < diskThresholdSettings.getFreeDiskThresholdFloodStage()) {

        nodesOverLowThreshold.add(node);
        nodesOverHighThreshold.add(node);
        nodesOverHighThresholdAndRelocating.remove(node);

        if (routingNode != null) { // might be temporarily null if the ClusterInfoService and the ClusterService are out of step
            for (ShardRouting routing : routingNode) {
                String indexName = routing.index().getName();
                indicesToMarkReadOnly.add(indexName);
                indicesNotToAutoRelease.add(indexName);
            }
        }

        logger.warn("flood stage disk watermark [{}] exceeded on {}, all indices on this node will be marked read-only",
            diskThresholdSettings.describeFloodStageThreshold(), usage);

Source: https://github.com/elastic/elasticsearch/blob/v7.14.2/server/src/main/java/org/elasticsearch/cluster/routing/allocation/DiskThresholdMonitor.java

So at this point we can be sure what the problem is.
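
The flood-stage branch of the Java excerpt can be restated as a small Python model. It is a simplified sketch, not a port: it ignores frozen nodes, the low/high watermark bookkeeping, and auto-release, and the function and parameter names are hypothetical.

```python
def indices_to_mark_read_only(disk_usage_by_node, indices_by_node,
                              flood_stage_free_percent=5.0,
                              flood_stage_free_bytes=0):
    """Return the names of indices that would be marked read-only.

    disk_usage_by_node maps node name -> (total_bytes, free_bytes),
    with total_bytes assumed to be positive.
    indices_by_node maps node name -> iterable of index names with
    at least one shard on that node.
    """
    marked = set()
    for node, (total_bytes, free_bytes) in disk_usage_by_node.items():
        free_percent = 100.0 * free_bytes / total_bytes
        # Same shape as the Java condition: too few free bytes OR
        # too small a free percentage trips the flood stage.
        if (free_bytes < flood_stage_free_bytes
                or free_percent < flood_stage_free_percent):
            # Every index with a shard on the node is affected,
            # not only the index that is currently being written.
            marked.update(indices_by_node.get(node, ()))
    return marked
```

With a 95% used-space watermark the free-space threshold is 5%, so a node with 3.8% free, as in the warning above, gets all of its indices marked. This also explains why unrelated indices such as .ds-ilm-history-* appear in the exception.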

2.3 Solutions

For this problem, there are several options:

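1) Adjust the watermark configuration as described in the official documentation (https://www.elastic.co/guide/en/elasticsearch/reference/6.2/disk-allocator.html)
2) Clean up the server's disk
3) Add disk capacity to the server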

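With usage already at 96%, I felt that changing the configuration would not buy much time, so I went straight to the second option. After cleaning up some files, disk usage dropped back to 55%, which should last for a while. On Elasticsearch 7.4 and later the read-only-allow-delete block is released automatically once usage falls below the high watermark; on older versions it has to be removed by hand with the command from section 1. If you are not sure which files can be deleted, or if cleanup does not free enough space, the last resort is to pay for more disk.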
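
For the first option, the watermarks are dynamic cluster settings, so they can be changed with PUT _cluster/settings without restarting the node. Below is a sketch that builds such a request body. The helper names and the ordering check are my own additions, only percentage values are handled, and the defaults shown (85%, 90%, 95%) are the stock ES values rather than a recommendation.

```python
import json

WATERMARK_PREFIX = "cluster.routing.allocation.disk.watermark."


def _percent(value):
    """Parse a value such as '95%' into 95.0 (percentages only)."""
    if not value.endswith("%"):
        raise ValueError(f"only percentage watermarks are handled: {value!r}")
    return float(value[:-1])


def build_watermark_settings(low="85%", high="90%", flood_stage="95%",
                             persistent=True):
    """Build the body for PUT _cluster/settings."""
    # Local sanity check: the three watermarks should not be out of order.
    if not _percent(low) <= _percent(high) <= _percent(flood_stage):
        raise ValueError("expected low <= high <= flood_stage")
    scope = "persistent" if persistent else "transient"
    return {
        scope: {
            WATERMARK_PREFIX + "low": low,
            WATERMARK_PREFIX + "high": high,
            WATERMARK_PREFIX + "flood_stage": flood_stage,
        }
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into Kibana Dev Tools.
    print(json.dumps(build_watermark_settings(), indent=2))
```

Raising the watermarks only postpones the block on a disk that keeps filling up, which is why cleanup or extra capacity was the better choice here.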

References:

ElasticSearch: problems caused by insufficient disk space: https://www.jianshu.com/p/55fd8a0b120b

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/disk-allocator.html

posted @ 2022-08-09 15:29 悬崖勒码
