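Some indices cannot be allocated after an ES restart
To solve this problem I followed the article 《es实战-分片分配失败解决方案》 ("ES in Practice: Solutions for Shard Allocation Failures"), which gives a comprehensive set of fixes for shard allocation failures.
After ES was restarted in the development environment, both the primary and replica shards of some indices failed to allocate, which left the cluster status red. The logs showed the first type of error: an exception raised while invoking the analyzer (the `调用分词器异常` message in the log below) made translog recovery fail on all 5 attempts, so the shard could not be loaded: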
[2021-11-05T17:20:33,188][DEBUG][o.e.a.a.c.a.TransportClusterAllocationExplainAction] [node-2] explaining the allocation for
[ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[dev_srms_service_process][0], node[null], [P],
recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED],
at[2021-11-04T10:43:44.702Z], failed_attempts[5], failed_nodes[[teRXfA-SRPSZ0Gt7bogtYA, AvVMttYeQ4eaVorPLofABw]], delayed=false,
details[failed shard on node [teRXfA-SRPSZ0Gt7bogtYA]: failed recovery, failure RecoveryFailedException[[dev_srms_service_process][0]:
Recovery failed on {node-2}{teRXfA-SRPSZ0Gt7bogtYA}{FaumuozZQL-4x5dfkgD5UA}{172.16.2.68}{172.16.2.68:9301}{dilm}{ml.machine_memory=33564663808,
xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to recover from gateway];
nested: EngineException[failed to recover from translog]; nested: RuntimeException[调用分词器异常]; ], allocation_status[deciders_no]]]
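Running the allocation explain query, GET /_cluster/allocation/explain, revealed the second type of error. The output shows a replica shard that no node will accept because its primary shard is not yet active. This was most likely caused by nodes being unstable during startup, so the shard failed to load: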
{
  "index" : "zhugeio_person_search_user_v2",
  "shard" : 2,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-11-05T02:23:36.603Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "4cesg3C0RIeODfnXvSRlNw",
      "node_name" : "node-3",
      "transport_address" : "172.16.2.69:9301",
      "node_attributes" : {
        "ml.machine_memory" : "33564680192",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "replica_after_primary_active",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        },
        {
          "decider" : "throttling",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        }
      ]
    },
    {
      "node_id" : "AvVMttYeQ4eaVorPLofABw",
      "node_name" : "node-1",
      "transport_address" : "172.16.2.69:9300",
      "node_attributes" : {
        "ml.machine_memory" : "33564680192",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "replica_after_primary_active",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        },
        {
          "decider" : "throttling",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        }
      ]
    },
    {
      "node_id" : "teRXfA-SRPSZ0Gt7bogtYA",
      "node_name" : "node-2",
      "transport_address" : "172.16.2.69:9302",
      "node_attributes" : {
        "ml.machine_memory" : "33564680192",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "replica_after_primary_active",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        },
        {
          "decider" : "throttling",
          "decision" : "NO",
          "explanation" : "primary shard for this replica is not yet active"
        }
      ]
    }
  ]
}
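With several nodes and several deciders per node, the explain output gets long quickly. As a rough sketch (not part of the original troubleshooting), the following Python collapses a response into "which deciders said NO on which node". The function name `summarize_blockers` is hypothetical, and the sample is a trimmed copy of the output above with only one node kept.

```python
import json


def summarize_blockers(explain: dict) -> dict:
    """Map each node name to the deciders that returned NO for this shard."""
    summary = {}
    for node in explain.get("node_allocation_decisions", []):
        summary[node["node_name"]] = [
            d["decider"]
            for d in node.get("deciders", [])
            if d.get("decision") == "NO"
        ]
    return summary


# Trimmed sample taken from the allocation explain output above (one node only).
sample = json.loads("""
{
  "index": "zhugeio_person_search_user_v2",
  "shard": 2,
  "primary": false,
  "can_allocate": "no",
  "node_allocation_decisions": [
    {
      "node_name": "node-3",
      "node_decision": "no",
      "deciders": [
        {"decider": "replica_after_primary_active", "decision": "NO"},
        {"decider": "throttling", "decision": "NO"}
      ]
    }
  ]
}
""")

print(summarize_blockers(sample))
# {'node-3': ['replica_after_primary_active', 'throttling']}
```

When every node reports `replica_after_primary_active`, as here, the replica itself is not the problem: the primary has to become active first.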
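Finally, I retried allocation of the failed shards, which solved the problem: POST /_cluster/reroute?retry_failed=true
If a replica shard stays in the INITIALIZING state and keeps the index from returning to green, try setting the index's number of replicas to 0, wait for the index to turn green, and then set the replica count back to its original value. The index should then go from yellow to green fairly quickly.
Reference articles: 《恢复状态为INITIALIZING的分片》 ("Recovering shards stuck in the INITIALIZING state") and 《ES使用Lucene修复错误的分片》 ("Repairing a corrupted shard in ES with Lucene")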
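For scripting the retry, a minimal Python sketch using only the standard library might look like this. The base URL `http://localhost:9200` and the function names are assumptions for illustration; the HTTP port of this cluster is not given in the post, and a secured cluster would also need authentication.

```python
import json
import urllib.request


def build_retry_request(base_url: str) -> urllib.request.Request:
    """Build the POST /_cluster/reroute?retry_failed=true request (not sent yet)."""
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed=true"
    return urllib.request.Request(url, method="POST")


def retry_failed_shards(base_url: str) -> dict:
    """Send the retry request and return the parsed JSON response."""
    with urllib.request.urlopen(build_retry_request(base_url), timeout=30) as resp:
        return json.load(resp)


# Assumed address; replace with the HTTP address of one of your nodes.
req = build_retry_request("http://localhost:9200")
print(req.get_method(), req.full_url)
# Call retry_failed_shards("http://localhost:9200") once all nodes are up and stable.
```

If the underlying cause is still present (for example the analyzer exception from the first error), the retried allocation can be expected to fail again, so it is worth re-running the allocation explain query afterwards.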
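The replica workaround can be sketched in Python as two settings updates with a health check in between. This is an illustration under assumptions: the base URL, the helper names, and the original replica count of 1 are not from the post, and the requests are only built here, not sent.

```python
import json
import urllib.request


def build_replica_request(base_url: str, index: str, replicas: int) -> urllib.request.Request:
    """Build PUT /<index>/_settings that sets index.number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    url = f"{base_url.rstrip('/')}/{index}/_settings"
    return urllib.request.Request(
        url, data=body, method="PUT",
        headers={"Content-Type": "application/json"},
    )


def health_url(base_url: str, index: str, timeout: str = "60s") -> str:
    """URL of the cluster health call that waits until the index is green."""
    return (f"{base_url.rstrip('/')}/_cluster/health/{index}"
            f"?wait_for_status=green&timeout={timeout}")


base = "http://localhost:9200"           # assumed HTTP address
index = "zhugeio_person_search_user_v2"  # index name taken from the output above

drop = build_replica_request(base, index, 0)     # step 1: drop replicas
wait = health_url(base, index)                   # step 2: wait for green
restore = build_replica_request(base, index, 1)  # step 3: restore (assumes 1 replica originally)

for step in (drop, restore):
    print(step.get_method(), step.full_url, step.data.decode("utf-8"))
print("GET", wait)
```

Each request can be sent with `urllib.request.urlopen(...)`. Note the original replica count before step 1, because the index has no redundancy while it is set to 0.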
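Summary:
- By default, allocation of a shard is attempted at most 5 times (the `index.allocation.max_retries` setting).
- If nodes are unstable right after startup, allocation attempts can easily fail until the limit of 5 is reached, which leaves the shard unassigned.
- Once all nodes in the cluster have started and stabilized, and as long as the shard data is not corrupted, retrying allocation of the failed shards should solve the problem.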