Elasticsearch error: shard has exceeded the maximum number of retries [5] on failed allocation attempts

The root cause was that ES had originally been started with too small an xmx value. After changing xmx and xms, ES had to be restarted.

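Before restarting ES, I disabled automatic shard allocation and restarted the ES nodes one at a time to avoid split-brain problems (disk performance in this customer environment is very poor, worse even than a mechanical hard disk). I/O problems still got in the way: after the rolling restart finished, the cluster status stayed red and active_shards_percent was stuck at 99.4% (active_shards_percent has to reach 100% before the cluster status can be green).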
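For readers who have not done this before, shard allocation is normally switched off and back on through the cluster settings API. The sketch below is only an illustration and not the exact commands used in this incident: it assumes the `cluster.routing.allocation.enable` setting is applied as a `persistent` setting, that the value `none` was used, and that `BASE_URL` and both helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # hypothetical; use your own <es_url>:<es_port>


def allocation_settings(mode):
    """Build the request body for PUT /_cluster/settings.

    mode is "all", "primaries", "new_primaries", "none", or None.
    None is sent as JSON null, which resets the setting to its default.
    """
    allowed = {"all", "primaries", "new_primaries", "none", None}
    if mode not in allowed:
        raise ValueError(f"unsupported allocation mode: {mode!r}")
    return {"persistent": {"cluster.routing.allocation.enable": mode}}


def put_cluster_settings(body, base_url=BASE_URL):
    """Send the settings body to the cluster and return the parsed reply."""
    request = urllib.request.Request(
        f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Before restarting nodes: assumed value, stops all shard allocation.
    print(json.dumps(allocation_settings("none")))
    # After every node has rejoined: reset the setting to its default.
    print(json.dumps(allocation_settings(None)))
```

Calling `put_cluster_settings(allocation_settings("none"))` before the restart and `put_cluster_settings(allocation_settings(None))` afterwards would apply the two bodies printed above.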

The error message shows that allocation exceeded the maximum number of automatic retries, and it suggests retrying manually.

Check the cluster health status

Replace <es_url> and <es_port> with the ES address and ES port of your own environment

curl -XGET '<es_url>:<es_port>/_cat/health?v'

The output below shows that 18 shards are in the unassigned state

epoch      timestamp cluster  status node.total node.data shards pri  relo init unassign pending_tasks max_task_wait_time active_shards_percent
1700542142 04:49:02  store-es red    4          4         2964   1488 0    0    18       0             -                  99.4%
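When this check has to be repeated, the two-line `_cat/health?v` output can be turned into a dictionary instead of being read by eye. The following is a minimal sketch with a hypothetical helper name; it assumes whitespace-separated columns and exactly one header line and one data line. The recomputed percentage at the end assumes that no shards are initializing, so that the total is simply active plus unassigned.

```python
def parse_cat_health(text):
    """Parse the two-line output of GET /_cat/health?v into a dict."""
    lines = [line for line in text.splitlines() if line.strip()]
    header, row = lines[0].split(), lines[1].split()
    if len(header) != len(row):
        raise ValueError("header and row have different column counts")
    return dict(zip(header, row))


# The health output from this incident.
SAMPLE = """\
epoch      timestamp cluster  status node.total node.data shards pri  relo init unassign pending_tasks max_task_wait_time active_shards_percent
1700542142 04:49:02  store-es red    4          4         2964   1488 0    0    18       0             -                  99.4%
"""

health = parse_cat_health(SAMPLE)
active = int(health["shards"])
unassigned = int(health["unassign"])

print(health["status"], unassigned, health["active_shards_percent"])
# Recompute the active share from the two counters as a sanity check.
print(round(100 * active / (active + unassigned), 1))
```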

Find out why shard allocation failed

Piping the output through python -m json.tool pretty-prints the JSON so that it is easier to read

curl -s -XGET '<es_url>:<es_port>/_cluster/allocation/explain' | python -m json.tool

shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] This message shows that the cluster already retried 5 times and failed every time. Its recommendation is to retry manually through the /_cluster/reroute?retry_failed=true API

{
    "index": "event",
    "shard": 2,
    "primary": true,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "at": "2023-11-20T09:39:49.248Z",
        "failed_allocation_attempts": 5,
        "details": "failed shard on node [U5GI_9APQya9VLLaz3YY5g]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[event][2]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ",
        "last_allocation_status": "no"
    },
    "can_allocate": "no",
    "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
    "node_allocation_decisions": [
        {
            "node_id": "<ES node UUID>",
            "node_name": "es-2",
            "transport_address": "<ES transport address>",
            "node_attributes": {
                "ml.machine_memory": "33747447808",
                "rack": "<ES address>",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true"
            },
            "node_decision": "no",
            "store": {
                "in_sync": true,
                "allocation_id": "MyHsQ-vBSsCxG3da-854Sw"
            },
            "deciders": [
                {
                    "decider": "max_retry",
                    "decision": "NO",
                    "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-11-20T09:39:49.248Z], failed_attempts[5], delayed=false, details[failed shard on node [U5GI_9APQya9VLLaz3YY5g]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[event][2]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[deciders_no]]]"
                }
            ]
        }
    ]
}
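The response is long, but only a few fields matter for this diagnosis: which shard it is, how many attempts failed, and which decider on which node answered NO. The sketch below pulls out just those fields. The function name is hypothetical, and the sample is a trimmed copy of the response above that keeps only the keys being read.

```python
import json


def summarize_explain(explain):
    """Return (shard label, failed attempts, list of blocking deciders)."""
    label = "[{}][{}] {}".format(
        explain["index"],
        explain["shard"],
        "primary" if explain["primary"] else "replica",
    )
    info = explain.get("unassigned_info", {})
    attempts = info.get("failed_allocation_attempts", 0)
    blockers = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            # Only deciders that vetoed the allocation are of interest here.
            if decider["decision"] == "NO":
                blockers.append((node["node_name"], decider["decider"]))
    return label, attempts, blockers


SAMPLE = json.loads("""
{
    "index": "event",
    "shard": 2,
    "primary": true,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "failed_allocation_attempts": 5
    },
    "node_allocation_decisions": [
        {
            "node_name": "es-2",
            "node_decision": "no",
            "deciders": [
                {"decider": "max_retry", "decision": "NO"}
            ]
        }
    ]
}
""")

label, attempts, blockers = summarize_explain(SAMPLE)
print(label, attempts, blockers)
```

Here the only blocker is the max_retry decider, which is why a manual retry is enough once the shard lock contention caused by the slow disks has passed.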

Call the API to retry the failed allocation manually

This API must be called with a POST request

curl -XPOST '<es_url>:<es_port>/_cluster/reroute?retry_failed=true'

Checking the cluster health again now shows green
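On slow disks the shards may take a while to recover after the retry, so the retry and the follow-up health check can be combined into a small polling loop. This is a sketch under assumptions rather than a tested tool: `BASE_URL`, the helper names, the number of attempts and the delay are all hypothetical, and it reads the `status` field of the `/_cluster/health` JSON response instead of the `_cat` output used earlier.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:9200"  # hypothetical; use your own <es_url>:<es_port>


def http_json(method, path, base_url=BASE_URL):
    """Send a request without a body and return the parsed JSON reply."""
    request = urllib.request.Request(f"{base_url}{path}", method=method)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def retry_and_wait(call=http_json, attempts=30, delay=10.0, sleep=time.sleep):
    """POST the manual retry, then poll cluster health until it is green."""
    call("POST", "/_cluster/reroute?retry_failed=true")
    for _ in range(attempts):
        health = call("GET", "/_cluster/health")
        if health["status"] == "green":
            return health
        sleep(delay)
    raise TimeoutError("cluster did not turn green in time")
```

Calling `retry_and_wait()` against a real cluster issues the same POST as the curl command above. The `call` and `sleep` parameters exist only so that the loop can be exercised without a cluster.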

posted @ 2024-09-09 17:21  月巴左耳东