Elasticsearch error: shard has exceeded the maximum number of retries [5] on failed allocation attempts
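The trouble started because Elasticsearch had been launched with too small an xmx value. After changing xmx and xms, the nodes had to be restarted.
Before restarting, I disabled automatic shard allocation and restarted the nodes one at a time to avoid split-brain problems (disk performance in this customer environment is very poor, not even as good as a mechanical hard disk). I/O problems struck anyway: once the rolling restart finished, the cluster status stayed red and active_shards_percent stopped at 99.4% (the status can only be green when active_shards_percent is 100%). The error message shows that the retry limit had been exceeded and that the system suggests retrying manually.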
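The post does not show how allocation was switched off before the restart. A minimal Python 3 sketch, assuming the standard cluster.routing.allocation.enable setting was changed through the _cluster/settings API. The helper names and the choice of a persistent setting are illustrative assumptions, not taken from the original environment:

```python
import json
import urllib.request


def allocation_settings_body(mode):
    """Build the JSON body for PUT /_cluster/settings.

    mode is one of the values accepted by cluster.routing.allocation.enable:
    "all", "primaries", "new_primaries" or "none". Passing None resets the
    setting to its default.
    """
    allowed = {"all", "primaries", "new_primaries", "none", None}
    if mode not in allowed:
        raise ValueError("unsupported allocation mode: %r" % (mode,))
    return {"persistent": {"cluster.routing.allocation.enable": mode}}


def put_allocation_mode(base_url, mode):
    """Send the setting to a live node (hypothetical helper, not run here)."""
    data = json.dumps(allocation_settings_body(mode)).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/_cluster/settings",
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With this sketch, put_allocation_mode(base, "none") would be called before the first node goes down and put_allocation_mode(base, None) after the last node has rejoined, where base is the address of any node in the cluster.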
Check the cluster health
Replace es_url and es_port with the Elasticsearch address and port of your own environment.
curl -XGET "<es_url>:<es_port>/_cat/health?v"
The output below shows that 18 shards are in the unassign state.
epoch      timestamp cluster  status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1700542142 04:49:02  store-es red             4         4   2964 1488    0    0       18             0                  -                 99.4%
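When watching a slow cluster recover, it can help to pull the interesting columns out of this output instead of reading the whole line each time. A minimal Python 3 sketch; parse_cat_health is a hypothetical helper, and splitting on whitespace assumes that no column value (the cluster name, for example) contains a space:

```python
def parse_cat_health(text):
    """Parse the header line and value line of /_cat/health?v into a dict."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header line and a value line")
    header = lines[0].split()
    values = lines[1].split()
    if len(header) != len(values):
        raise ValueError("header and value column counts differ")
    # All values stay strings, e.g. "18" and "99.4%".
    return dict(zip(header, values))


# The output captured above, used here as sample input.
sample = """\
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1700542142 04:49:02 store-es red 4 4 2964 1488 0 0 18 0 - 99.4%
"""

health = parse_cat_health(sample)
print(health["status"], health["unassign"], health["active_shards_percent"])
# prints: red 18 99.4%
```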
Find out why shard allocation failed
Piping the output through python -m json.tool pretty-prints the JSON so that it is easier to read.
curl -s -XGET "<es_url>:<es_port>/_cluster/allocation/explain" | python -m json.tool
shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true]
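This shows that the cluster had already tried the allocation 5 times and failed every time, and that its suggestion is to retry manually through the /_cluster/reroute?retry_failed=true API. The underlying failure, visible in the details field, is a ShardLockObtainFailedException: the shard lock could not be obtained within 5000ms.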
{
    "index": "event",
    "shard": 2,
    "primary": true,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "at": "2023-11-20T09:39:49.248Z",
        "failed_allocation_attempts": 5,
        "details": "failed shard on node [U5GI_9APQya9VLLaz3YY5g]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[event][2]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ",
        "last_allocation_status": "no"
    },
    "can_allocate": "no",
    "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
    "node_allocation_decisions": [
        {
            "node_id": "<UUID of the es node>",
            "node_name": "es-2",
            "transport_address": "<transport address of the es node>",
            "node_attributes": {
                "ml.machine_memory": "33747447808",
                "rack": "<address of the es node>",
                "ml.max_open_jobs": "20",
                "xpack.installed": "true"
            },
            "node_decision": "no",
            "store": {
                "in_sync": true,
                "allocation_id": "MyHsQ-vBSsCxG3da-854Sw"
            },
            "deciders": [
                {
                    "decider": "max_retry",
                    "decision": "NO",
                    "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-11-20T09:39:49.248Z], failed_attempts[5], delayed=false, details[failed shard on node [U5GI_9APQya9VLLaz3YY5g]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[event][2]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[deciders_no]]]"
                }
            ]
        }
    ]
}
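The explain output is long, and the useful part is which decider on which node said NO. A minimal Python 3 sketch that reduces the response to that; summarize_explain is a hypothetical helper, and the sample dict is a trimmed copy of the response above that keeps only the fields the function reads:

```python
def summarize_explain(explain):
    """Return (shard label, [(node_name, decider), ...]) for blocking deciders."""
    label = "[%s][%s] %s" % (
        explain.get("index"),
        explain.get("shard"),
        "primary" if explain.get("primary") else "replica",
    )
    blockers = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            # Only deciders that vetoed the allocation are of interest.
            if decider.get("decision") == "NO":
                blockers.append((node.get("node_name"), decider.get("decider")))
    return label, blockers


# Trimmed copy of the allocation explain response shown above.
sample = {
    "index": "event",
    "shard": 2,
    "primary": True,
    "current_state": "unassigned",
    "node_allocation_decisions": [
        {
            "node_name": "es-2",
            "node_decision": "no",
            "deciders": [{"decider": "max_retry", "decision": "NO"}],
        }
    ],
}

label, blockers = summarize_explain(sample)
print(label, blockers)
# prints: [event][2] primary [('es-2', 'max_retry')]
```

On a live cluster the dict would come from json.load applied to the curl output rather than from a literal.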
Retry the allocation manually through the API
This API must be called with a POST request.
curl -XPOST "<es_url>:<es_port>/_cluster/reroute?retry_failed=true"
Checking the cluster health again now shows green.
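On disks this slow the shards may take a while to come back, so the retry and the follow-up check can be scripted. A minimal Python 3 sketch, assuming the cluster is reachable over plain HTTP without authentication; the function names, the number of attempts and the delay are all illustrative. If the underlying I/O problem persists, the retried allocation can fail again and the loop simply gives up:

```python
import json
import time
import urllib.request


def retry_failed(base_url):
    """POST /_cluster/reroute?retry_failed=true and return the parsed reply."""
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed=true"
    req = urllib.request.Request(
        url,
        data=b"",
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fetch_status(base_url):
    """Return the status field of GET /_cluster/health (green, yellow or red)."""
    url = base_url.rstrip("/") + "/_cluster/health"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["status"]


def wait_for_green(get_status, attempts=30, delay=10.0, sleep=time.sleep):
    """Poll get_status() until it returns "green".

    Returns the number of polls it took, or None if the cluster was still
    not green after the given number of attempts.
    """
    for attempt in range(1, attempts + 1):
        if get_status() == "green":
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

A typical call sequence would be retry_failed(base) followed by wait_for_green(lambda: fetch_status(base)), with base set to "http://<es_url>:<es_port>". Passing the status getter in as a parameter keeps the polling loop independent of the HTTP calls.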