Elasticsearch Cluster Operations

I. Index Management
1. Create an index

PUT test-2019-03
{
  "settings": {
    "index": {
      "number_of_shards": 10,
      "number_of_replicas": 1,
      "routing": {
        "allocation": {
          "include": {
            "type": "hot"
          }
        }
      }
    }
  }
}
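
As a rough sketch, the same request body can be generated in Python so that the shard count, replica count and target node attribute become parameters. The function name `build_index_settings` and its defaults are illustrative assumptions, not part of any Elasticsearch client, and the custom `type` attribute is assumed to be defined on the nodes themselves.

```python
import json


def build_index_settings(shards=10, replicas=1, node_type="hot"):
    """Return a create-index body equivalent to the request above.

    Hypothetical helper; `type` is a custom node attribute that is assumed
    to be configured on the nodes.
    """
    if shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
                "routing": {"allocation": {"include": {"type": node_type}}},
            }
        }
    }


if __name__ == "__main__":
    # Print the body that would be sent with: PUT test-2019-03
    print(json.dumps(build_index_settings(), indent=2))
```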

2. Delete an index

DELETE test-2019-03
DELETE test*

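The index name accepts the * wildcard. On 8.x, wildcard deletes are rejected by default unless action.destructive_requires_name is set to false.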

3. Modify an index
Change the number of replicas:

PUT test-2019-03/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}

4. Rebuild an index (Reindex)

POST _reindex
{
  "source": {
    "index": [
      "test-2018-07-*"
    ]
  },
  "dest": {
    "index": "test-2018-07"
  }
}

Check running reindex tasks:

GET _tasks?detailed=true&actions=*reindex

5. Add documents:

PUT /<target>/_doc/<_id>
POST /<target>/_doc/
PUT /<target>/_create/<_id>
POST /<target>/_create/<_id>
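
The four request forms differ in who chooses the document ID and whether an existing document may be replaced. The sketch below encodes that choice; `index_request` is a hypothetical helper that only returns the method and path and sends nothing.

```python
def index_request(target, doc_id=None, create_only=False):
    """Pick the HTTP method and path for indexing one document.

    Hypothetical helper; it builds the request line only.
    """
    if create_only:
        if doc_id is None:
            raise ValueError("_create requires an explicit document ID")
        # _create is rejected with a version conflict if the ID already exists.
        return ("PUT", f"/{target}/_create/{doc_id}")
    if doc_id is None:
        # POST without an ID lets Elasticsearch generate one.
        return ("POST", f"/{target}/_doc/")
    # PUT with an ID creates the document or replaces an existing one.
    return ("PUT", f"/{target}/_doc/{doc_id}")
```

For example, `index_request("logs", "42", create_only=True)` yields the `_create` form, which is the one to use when an accidental overwrite must fail.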

6. Delete a document by ID

DELETE indexApple-2019-02/_doc/d98a33c22d9a11edab9abc17b8bcc83e

7. Delete documents with delete_by_query

POST indexApple-2019-02/_delete_by_query?conflicts=proceed
{
  "query": {
    "bool": {
      "must": {
        "term": {"appIndex": "apple"}
      },
      "filter": {
        "range": {
          "timestamp": {
            "gte": "2019-02-23 08:00:00",
            "lte": "2019-02-23 22:00:00",
            "time_zone": "+08:00"
          }
        }
      }
    }
  }
}

Check running delete_by_query tasks:

GET _tasks?detailed=true&actions=*/delete/byquery
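
A delete_by_query body like the one above is easy to get wrong by hand, so here is a minimal sketch that builds it and checks the time range first. The function name, its parameters and the timestamp format constant are assumptions made for illustration; the field names come from the example above.

```python
from datetime import datetime

# Timestamp layout used in the example above (assumed, not enforced by ES here).
FMT = "%Y-%m-%d %H:%M:%S"


def build_delete_by_query(field, value, start, end,
                          time_zone="+08:00", time_field="timestamp"):
    """Build a bool query: exact term match plus a time-range filter."""
    # strptime raises ValueError on malformed input, so bad dates fail early.
    if datetime.strptime(start, FMT) > datetime.strptime(end, FMT):
        raise ValueError("start must not be after end")
    return {
        "query": {
            "bool": {
                "must": {"term": {field: value}},
                "filter": {
                    "range": {
                        time_field: {
                            "gte": start,
                            "lte": end,
                            "time_zone": time_zone,
                        }
                    }
                },
            }
        }
    }
```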

8. Move shards between nodes
Manually relocate a shard from one node to another.

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "indexname",
        "shard": 1,
        "from_node": "nodename",
        "to_node": "nodename"
      }
    }
  ]
}
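
A small sketch of building the move command with two sanity checks. `build_move_command` is a hypothetical helper, and the node arguments are assumed to be node names as listed by GET _cat/nodes.

```python
def build_move_command(index, shard, from_node, to_node):
    """Build a _cluster/reroute body that moves one shard copy.

    Hypothetical helper; it does not contact the cluster.
    """
    if shard < 0:
        raise ValueError("shard number must be 0 or greater")
    if from_node == to_node:
        raise ValueError("source and target node must differ")
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }
```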

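II. Cluster Settings

Cluster settings are changed through the cluster settings API. The JSON snippets in this section are request bodies for:
curl -XPUT "http://<domain>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d '<settings JSON>'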

1. Shard Rebalancing Settings

{"persistent":{"cluster.routing.rebalance.enable": "all"}}
Controls which kinds of shards may be rebalanced. Four options:
all - (default) Allows shard balancing for all kinds of shards.
primaries - Allows shard balancing only for primary shards.
replicas - Allows shard balancing only for replica shards.
none - No shard balancing of any kind is allowed for any indices.

{"persistent":{"cluster.routing.allocation.allow_rebalance": "always"}}
Controls when the cluster is allowed to rebalance. Three options:
always - Always allow rebalancing.
indices_primaries_active - Only when all primaries in the cluster are allocated.
indices_all_active - (default) Only when all shards (primaries and replicas) in the cluster are allocated.

{"transient":{"cluster.routing.allocation.cluster_concurrent_rebalance": 8}}
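Sets how many shard rebalancing operations may run concurrently across the cluster. It only limits rebalancing; it has no effect on the concurrency of recovery or of other shard movements.

{"transient":{"cluster.routing.allocation.cluster_concurrent_rebalance": 0}}
Disables rebalancing.

{"transient":{"cluster.routing.allocation.cluster_concurrent_rebalance": null}}
Re-enables rebalancing (null removes the override, so the default applies again).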
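
The set and reset pattern above works for any cluster setting: a concrete value overrides the default, and JSON null removes the override. A minimal sketch, where `cluster_setting` is a hypothetical helper that only serialises the body for PUT _cluster/settings:

```python
import json


def cluster_setting(key, value, scope="transient"):
    """Build a body for PUT _cluster/settings.

    Passing value=None serialises to JSON null, which removes the explicit
    setting so the cluster falls back to its default.
    """
    if scope not in ("transient", "persistent"):
        raise ValueError("scope must be 'transient' or 'persistent'")
    return json.dumps({scope: {key: value}})


KEY = "cluster.routing.allocation.cluster_concurrent_rebalance"
print(cluster_setting(KEY, 0))     # body that stops rebalancing
print(cluster_setting(KEY, None))  # body that restores the default
```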

2. Shard Allocation Settings

{"persistent":{"cluster.routing.allocation.enable": "all"}}
Controls which kinds of shards may be allocated. Four options:
all - (default) Allows shard allocation for all kinds of shards.
primaries - Allows shard allocation only for primary shards.
new_primaries - Allows shard allocation only for primary shards for new indices.
none - No shard allocations of any kind are allowed for any indices.

{"persistent":{"cluster.routing.allocation.node_concurrent_recoveries": 8}}
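Sets how many concurrent shard recoveries (incoming and outgoing) are allowed on a node.

{"persistent":{"cluster.routing.allocation.node_initial_primaries_recoveries": 16}}
Sets how many unassigned primaries a node recovers in parallel from local disk after it restarts.

{"persistent":{"indices.recovery.max_bytes_per_sec": "500mb"}}
Limits the recovery traffic of each node, in bytes per second.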

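3. Disk-based Shard Allocation

#Set the low watermark of the data nodes to 80%
{"transient":{"cluster.routing.allocation.disk.watermark.low":"80%"}}
#Set the high watermark of the data nodes to 90%
{"transient":{"cluster.routing.allocation.disk.watermark.high":"90%"}}
#Remove the user-defined values so the cluster falls back to its defaults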
{"transient":{"cluster.routing.allocation.disk.watermark.low": null}}
{"transient":{"cluster.routing.allocation.disk.watermark.high": null}}
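
To make the two thresholds concrete, here is a simplified model of how a node is treated as its disk fills up, using the 80% and 90% values set above. `disk_state` and its return labels are invented for illustration. Real clusters also have a flood-stage watermark, accept absolute byte values, and treat primaries of newly created indices differently at the low watermark, all of which this sketch ignores.

```python
def disk_state(used_percent, low=80.0, high=90.0):
    """Classify a data node by disk usage against the two watermarks.

    Simplified model, not the actual allocation decider.
    """
    if not 0 <= used_percent <= 100:
        raise ValueError("used_percent must be between 0 and 100")
    if low > high:
        raise ValueError("low watermark must not exceed high watermark")
    if used_percent > high:
        # Above the high watermark the cluster tries to move shards away.
        return "relocate_away"
    if used_percent > low:
        # Above the low watermark no further shards are allocated here.
        return "no_new_shards"
    return "ok"
```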

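4. Shard Allocation Routing

Allocation filters state explicitly whether shards may be allocated to particular nodes. They exist at the index level and at the cluster level:
index.routing.allocation.require.{attribute}
index.routing.allocation.include.{attribute}
index.routing.allocation.exclude.{attribute}
cluster.routing.allocation.require.{attribute}
cluster.routing.allocation.include.{attribute}
cluster.routing.allocation.exclude.{attribute}
require means shards must go to a matching node, include means they may go to any node that matches at least one of the values, and exclude means they must not go to a matching node.
Index-level and cluster-level filters are both enforced, so a cluster-level exclude wins over an index-level include: if an index includes a node and the cluster excludes it, the node ends up excluded.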
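
The require, include and exclude semantics can be sketched as a small matcher. This is a simplified model written for illustration, not Elasticsearch's implementation: `node_allowed` and the dictionary layout are assumptions, and `fnmatchcase` accepts a few more wildcard forms than the plain `*` that Elasticsearch supports.

```python
from fnmatch import fnmatchcase


def _matches(node_attrs, attr, patterns):
    """True if the node's value for attr matches any comma-separated pattern."""
    value = node_attrs.get(attr)
    if value is None:
        return False
    return any(fnmatchcase(value, p.strip()) for p in patterns.split(","))


def node_allowed(node_attrs, require=None, include=None, exclude=None):
    """Apply one level of filters (index or cluster) to one node."""
    require, include, exclude = require or {}, include or {}, exclude or {}
    # require: every listed attribute must match.
    if not all(_matches(node_attrs, a, p) for a, p in require.items()):
        return False
    # include: if any are given, at least one must match.
    if include and not any(_matches(node_attrs, a, p) for a, p in include.items()):
        return False
    # exclude: none of the listed attributes may match.
    if any(_matches(node_attrs, a, p) for a, p in exclude.items()):
        return False
    return True


# A node must pass the index-level AND the cluster-level filters.
node = {"_ip": "10.100.0.11", "type": "hot"}
index_ok = node_allowed(node, include={"type": "hot"})
cluster_ok = node_allowed(node, exclude={"_ip": "10.100.0.11,10.100.0.12"})
print(index_ok and cluster_ok)
```

Here the index-level include accepts the node but the cluster-level exclude rejects it, so the combined result is a rejection.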

#Exclude one node from the cluster by IP: 10.100.0.11
{"transient":{"cluster.routing.allocation.exclude._ip":"10.100.0.11"}}

#Exclude several nodes from the cluster by IP: 10.100.0.11,10.100.0.12
{"transient":{"cluster.routing.allocation.exclude._ip":"10.100.0.11,10.100.0.12"}}

#Remove the node exclusion
{"transient":{"cluster.routing.allocation.exclude._ip": null}}

Keep an index off nodes with certain IPs:
PUT test/_settings
{"index.routing.allocation.exclude._ip": "192.168.2.*"}

Built-in attributes:
_name Match nodes by node name
_host_ip Match nodes by host IP address (IP associated with hostname)
_publish_ip Match nodes by publish IP address
_ip Match either _host_ip or _publish_ip
_host Match nodes by hostname

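5. Shard Allocation Awareness

Shard allocation awareness controls how shard copies are spread across groups of nodes.
Configure the following on every master node (each node is also assumed to carry its own zone attribute, for example node.attr.zone: zone1):
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
With one replica, the effect is that the primary and the replica of each shard are forced into different zones. If the nodes in zone2 go down, the missing copies are not reallocated to zone1.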

6. Memory (caches)

Segment memory:

1. Segment memory of a single index:

GET /_cat/segments/<index>?v&h=shard,segment,size,size.memory

2. Segment memory per node:

GET /_cat/nodes?v&h=name,port,sm

Cache statistics:
1. Cache hit and miss counts are available from the Nodes stats API:
GET /_nodes/stats/indices/query_cache,request_cache,fielddata?pretty

2. Index-level figures come from the Index stats API. The cache metrics can be selected like this:
GET /<index>/_stats/query_cache,fielddata,request_cache?pretty&human

Clearing caches:
1. Clear the caches of the whole cluster
POST /_cache/clear

2. Clear the caches of specific indices (several index names may be given)
POST /<index>/_cache/clear

3. Clear one of the three cache types of an index
POST /<index>/_cache/clear?query=true #clear only the query cache
POST /<index>/_cache/clear?request=true #clear only the request cache
POST /<index>/_cache/clear?fielddata=true #clear only the fielddata cache

III. Troubleshooting

1. Check the cluster status (red/yellow/green)
GET _cluster/health
GET _cluster/health?level=indices

2. List the indices whose status is yellow or red
GET /_cat/indices?health=yellow
GET /_cat/indices?health=red

3. Find the affected shards
GET _cluster/health?level=shards

4. Find out why shards are unassigned
GET _cluster/allocation/explain?pretty

5. Check the recovery status of an index, here the index user
GET user/_recovery?active_only=true

6. Use reroute to retry shards whose allocation failed earlier. The cluster gives up on a shard after index.allocation.max_retries (default 5) allocation attempts.
POST /_cluster/reroute?retry_failed=true
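
The order of these checks can be captured in a small triage function that takes a parsed _cluster/health response and suggests the next request. This is a sketch under assumptions: `triage` is a hypothetical helper, and the dictionaries in the usage lines are hand-written examples that use real response field names, not captured output.

```python
def triage(health):
    """Suggest the next check from a _cluster/health response dict."""
    if health.get("status") == "green":
        return "ok"
    if health.get("unassigned_shards", 0) > 0:
        # Unassigned shards: ask the cluster why they cannot be allocated.
        return "GET _cluster/allocation/explain"
    if health.get("initializing_shards", 0) > 0:
        # Shards are still recovering: watch the recovery progress.
        return "GET <index>/_recovery?active_only=true"
    # Otherwise drill down to shard level to find the affected copies.
    return "GET _cluster/health?level=shards"


# Hand-written example inputs (illustrative only).
print(triage({"status": "yellow", "unassigned_shards": 3}))
print(triage({"status": "yellow", "initializing_shards": 2}))
```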

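IV. Rolling Cluster Restart

1. Preparation
##Have the following requests ready beforehand. Some of them report the metrics to watch during the restart (stop restarting if a problem shows up); the others are supporting checks:
##Why shards are unassigned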
curl http://0.0.0.0:9200/_cluster/allocation/explain?pretty

###Cluster settings
curl http://0.0.0.0:9200/_cluster/settings?pretty

###pending-tasks
curl http://0.0.0.0:9200/_cluster/pending_tasks?pretty

###Cluster health
curl http://0.0.0.0:9200/_cluster/health?pretty

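2. Restart the client nodes
#start
Step 1: Shut down one client node
Step 2: Start the node again
Step 3: Check that the node has rejoined the cluster
Step 4: Repeat steps 1-3 for the other client nodes
#end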

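3. Restart the master nodes
#start
Step 1: Identify the IP of the currently elected master
Step 2: Shut down one node of the master-node group that is not the elected master
Step 3: Start the node again
Step 4: Check that the node has rejoined the cluster (make sure it really has)
Step 5: Repeat steps 2-4 for the other non-elected node of the master-node group
Step 6: Shut down the elected master
Step 7: Start it again
##While a new master is being elected the cluster is unavailable (indexing, search and API calls block). The election is not instantaneous: the default election time is 3s, and network problems often make it longer.
Step 8: Check the cluster status and check that the node has rejoined the cluster.
##Once a master has been elected, all cluster functions return to normal.
#end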
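
The ordering rule in these steps (non-elected master-eligible nodes first, the elected master last) keeps the number of master elections at one. A minimal sketch of that ordering, where `master_restart_order` and the node names are made up for illustration:

```python
def master_restart_order(master_eligible, elected):
    """Return the restart order with the elected master last."""
    if elected not in master_eligible:
        raise ValueError("elected master must be one of the master-eligible nodes")
    # Restart every non-elected node first, keeping the given order.
    others = [name for name in master_eligible if name != elected]
    return others + [elected]


print(master_restart_order(["m1", "m2", "m3"], "m2"))
```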

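4. Restart the data nodes
#start
Step 1: Disable shard allocation
curl -X PUT "http://0.0.0.0:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.enable": "new_primaries"}}'
##While allocation is restricted, replica shards of newly created indices cannot be allocated; primary shards of new indices are still allocated
Step 2: Run a synced flush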
curl -XPOST "http://0.0.0.0:9200/_flush/synced?pretty"
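##For indices that are not being updated at this moment, the sync marker confirms that primary and replica copies hold the same data, which shortens the time shards need to rejoin the cluster. For shards that are changing at this moment, the step does nothing for recovery when the node rejoins. Synced flush is deprecated since 7.6 and removed in 8.0, where a plain POST _flush serves the same purpose.
Step 3: Shut down one data node
Step 4: Start the node again
Step 5: Check that the node has rejoined the cluster
Step 6: Re-enable shard allocation
curl -X PUT "http://0.0.0.0:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'
Step 7: Check that the cluster status is green
##Once allocation is re-enabled, the number of unassigned shards drops at once (not to 0, because in a large ES cluster every node holds shards of indices that are being updated). Some initializing shards then appear, and it takes a while for them to reach 0 while the shards are synchronised.
Step 8: Repeat steps 1-7 for the other data nodes
Step 9: After all nodes have been restarted, check the cluster settings and make sure shard allocation is not left disabled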
#end
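
The data-node procedure can be sketched as a loop over the nodes. The sketch takes its side effects as caller-supplied functions, so it runs without a cluster; all three hooks and the function name are hypothetical, and the flush step is left out for brevity.

```python
def rolling_restart(nodes, put_settings, restart_node, wait_for_green):
    """Restart data nodes one at a time, following the steps above.

    All three callables are supplied by the caller (hypothetical hooks):
      put_settings(body)  sends PUT _cluster/settings with the given dict
      restart_node(name)  stops and starts one node, returns once it rejoined
      wait_for_green()    blocks until _cluster/health reports green
    """
    key = "cluster.routing.allocation.enable"
    for name in nodes:
        # Step 1: restrict allocation before the node goes down.
        put_settings({"transient": {key: "new_primaries"}})
        try:
            # Steps 3-5: restart the node and wait for it to rejoin.
            restart_node(name)
        finally:
            # Step 6: re-enable allocation even if the restart hook raised.
            put_settings({"transient": {key: "all"}})
        # Step 7: do not touch the next node until the cluster is green.
        wait_for_green()
```

Re-enabling allocation in a `finally` block is a design choice for this sketch: it avoids leaving the cluster with allocation restricted when a restart fails halfway.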

References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/restart-cluster.html

Author: 阿凡卢

Source: https://www.cnblogs.com/luxiaoxun/p/9533148.html

License: This work is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International license.