Elasticsearch error: shard has exceeded the maximum number of retries [5] on failed allocation attempts

Contents

Check the cluster health status
Find out why shard allocation failed
Call the API to retry allocation manually

The root cause: es was originally started with an Xmx that was too small. After changing Xmx and Xms, es had to be restarted.

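Before the restart, I disabled automatic shard allocation and restarted the es nodes one at a time to avoid split-brain (the disks in this customer environment perform very poorly, worse even than mechanical hard drives). I/O problems struck anyway: after the whole cluster had been restarted, the cluster status stayed red while I watched it, and active_shards_percent was stuck at 99.4% (active_shards_percent must reach 100% before the cluster status can be green).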
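The allocation toggle around a rolling restart could look roughly like the sketch below. It assumes the toggle is done through the `cluster.routing.allocation.enable` cluster setting (the post does not say which setting was actually used). `base_url` and both helper names are hypothetical.

```python
import json
import urllib.request


def allocation_settings_body(mode):
    """Build the _cluster/settings body that sets shard allocation.

    mode is one of "all", "primaries", "new_primaries", "none",
    or None to reset the setting to its default.
    """
    allowed = {"all", "primaries", "new_primaries", "none", None}
    if mode not in allowed:
        raise ValueError(f"unsupported allocation mode: {mode!r}")
    return {"persistent": {"cluster.routing.allocation.enable": mode}}


def put_cluster_settings(base_url, body, timeout=30):
    """Send body to PUT <base_url>/_cluster/settings and return the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Assumed usage, with base = "http://<es_url>:<es_port>":
#   put_cluster_settings(base, allocation_settings_body("primaries"))  # before stopping a node
#   ... restart the node and wait for it to rejoin the cluster ...
#   put_cluster_settings(base, allocation_settings_body(None))         # restore the default
```

Setting the value back to `null` (Python `None`) restores the default rather than hard-coding `all`. Whether `primaries` or `none` is the better choice during the restart depends on the environment.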

The error message shows that the allocation retry limit was exceeded, and the system suggests retrying manually.

Check the cluster health status

Replace es_url and es_port with the es address and es port of your own environment.

curl -XGET "<es_url>:<es_port>/_cat/health?v"

The output below shows that 18 shards are in the unassign state.

epoch      timestamp cluster  status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1700542142 04:49:02  store-es red             4         4   2964 1488    0    0       18             0                  -                 99.4%
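The columns that matter here can be pulled out of that output programmatically. This is a minimal sketch: `parse_cat_health` and `active_percent` are hypothetical helpers. The percentage formula is an assumption (active shards divided by active + initializing + unassigned) that happens to reproduce the 99.4% reported above.

```python
def parse_cat_health(text):
    """Turn the two-line output of _cat/health?v into a dict keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    header, values = lines[0].split(), lines[1].split()
    if len(header) != len(values):
        raise ValueError("header and value columns do not line up")
    return dict(zip(header, values))


def active_percent(row):
    """Recompute the active shard percentage from the shard counters (assumed formula)."""
    active = int(row["shards"])
    total = active + int(row["init"]) + int(row["unassign"])
    if total == 0:
        return 100.0
    return 100.0 * active / total


# The health output from this post, with the column padding collapsed
sample = """\
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1700542142 04:49:02 store-es red 4 4 2964 1488 0 0 18 0 - 99.4%
"""

row = parse_cat_health(sample)
print(row["status"], row["unassign"], round(active_percent(row), 1))
```

With the numbers from this cluster, 2964 active shards out of 2964 + 18 gives about 99.4%, so the stuck percentage is fully explained by the 18 unassigned shards.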

Find out why shard allocation failed

Pipe the output through python -m json.tool to pretty-print the JSON, which makes it easier to read.

curl -s -XGET "<es_url>:<es_port>/_cluster/allocation/explain" | python -m json.tool

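shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true]: this message shows that the cluster already retried 5 times and failed every time, and it recommends retrying manually through the /_cluster/reroute?retry_failed=true endpoint. The details field in the output below gives the underlying failure: obtaining the shard lock timed out after 5000ms.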

{
    "index": "event",
    "shard": 2,
    "primary": true,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "at": "2023-11-20T09:39:49.248Z",
        "failed_allocation_attempts": 5,
        "details": "failed shard on node [U5GI_9APQya9VLLaz3YY5g]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[event][2]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ",
        "last_allocation_status": "no"
    },
    "can_allocate": "no",
    "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
    "node_allocation_decisions": [
        {
            "node_id": "<es node uuid>",
            "node_name": "es-2",
            "transport_address": "<es transport address>",
            "node_attributes": {
                "ml.machine_memory": "33747447808",
                "rack": "<es address>",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true"
            },
            "node_decision": "no",
            "store": {
                "in_sync": true,
                "allocation_id": "MyHsQ-vBSsCxG3da-854Sw"
            },
            "deciders": [
                {
                    "decider": "max_retry",
                    "decision": "NO",
                    "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-11-20T09:39:49.248Z], failed_attempts[5], delayed=false, details[failed shard on node [U5GI_9APQya9VLLaz3YY5g]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[event][2]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[deciders_no]]]"
                }
            ]
        }
    ]
}
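In a long reply like the one above, the part that matters is the list of deciders that answered NO. The sketch below extracts them; `blocking_deciders` and `explain_request_body` are hypothetical helper names. The second helper builds the optional request body that points the explain API at one specific shard; without a body the API chooses an unassigned shard on its own.

```python
def explain_request_body(index, shard, primary):
    """Body for _cluster/allocation/explain that targets one specific shard."""
    return {"index": index, "shard": shard, "primary": primary}


def blocking_deciders(explain):
    """Return (node_name, decider, explanation) for every NO decision in an explain reply."""
    blocked = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blocked.append(
                    (node.get("node_name"), decider.get("decider"), decider.get("explanation"))
                )
    return blocked


# Trimmed copy of the reply shown above, with the explanation shortened
reply = {
    "index": "event",
    "shard": 2,
    "primary": True,
    "node_allocation_decisions": [
        {
            "node_name": "es-2",
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "max_retry",
                    "decision": "NO",
                    "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts",
                }
            ],
        }
    ],
}

for node_name, decider, explanation in blocking_deciders(reply):
    print(f"{node_name}: {decider}: {explanation}")
```

Here the only blocker is `max_retry`, which means nothing else prevents allocation once the retry counter is reset.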

Call the API to retry allocation manually

This API must be called with a POST request.

curl -XPOST "<es_url>:<es_port>/_cluster/reroute?retry_failed=true"

Check the cluster health again and the status is now green.
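The retry and the follow-up health check can be combined into one step. This is a sketch under assumptions: the function names and `base_url` are hypothetical, and it uses the `wait_for_status` and `timeout` parameters of the `_cluster/health` API to block until the cluster turns green instead of polling by hand.

```python
import json
import urllib.request


def reroute_retry_url(base_url):
    """URL of the reroute call that retries shards which used up their allocation attempts."""
    return f"{base_url.rstrip('/')}/_cluster/reroute?retry_failed=true"


def wait_for_green_url(base_url, timeout="60s"):
    """URL of a health call that blocks until the cluster is green or the timeout passes."""
    return f"{base_url.rstrip('/')}/_cluster/health?wait_for_status=green&timeout={timeout}"


def retry_and_wait(base_url, timeout="60s"):
    """POST the retry, then wait for green; returns the final health document.

    Depending on the Elasticsearch version, a health timeout may surface as an
    HTTP 408 error raised by urlopen or as "timed_out": true in the reply.
    """
    post = urllib.request.Request(reroute_retry_url(base_url), method="POST")
    with urllib.request.urlopen(post) as resp:
        resp.read()  # the reroute reply is not needed here
    with urllib.request.urlopen(wait_for_green_url(base_url, timeout)) as resp:
        return json.load(resp)


# Assumed usage: health = retry_and_wait("http://<es_url>:<es_port>")
#                print(health["status"], health["unassigned_shards"])
```

The limit of 5 in the error message is the default of the `index.allocation.max_retries` index setting. Retrying only helps once the cause of the failures (here, the shard lock timeouts on slow disks) has passed.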

posted @ 2024-09-09 17:21 月巴左耳东
