ELK: Elasticsearch 6.5 Cluster
The previous post introduced and tried out single-node operation of the es 6.5 series; now let's build an es 6.5 cluster.
Environment: three nodes: master (172.16.23.128), node1 (172.16.23.129), node2 (172.16.23.130). First, check the elasticsearch service status on each node:
[root@master ~]# ansible all_nodes -m shell -a "systemctl status elasticsearch"|grep -i running
   Active: active (running) since 六 2018-12-29 12:06:55 CST; 3h 33min ago
   Active: active (running) since 六 2018-12-29 12:07:43 CST; 3h 32min ago
   Active: active (running) since 六 2018-12-29 15:38:47 CST; 1min 42s ago
Check the es configuration file on each node:
[root@master ~]# ansible all_nodes -m shell -a 'cat /etc/elasticsearch/elasticsearch.yml|egrep -v "^$|^#"'
172.16.23.128 | CHANGED | rc=0 >>
cluster.name: estest
node.name: esnode2
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["172.16.23.128", "172.16.23.131"]

172.16.23.130 | CHANGED | rc=0 >>
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

172.16.23.129 | CHANGED | rc=0 >>
cluster.name: es
node.name: node1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
Now configure the cluster based on discovery.zen (reference: https://www.elastic.co/guide/en/elasticsearch/reference/6.5/modules-discovery-zen.html). The resulting configuration is as follows:
[root@master ~]# ansible all_nodes -m shell -a 'cat /etc/elasticsearch/elasticsearch.yml|egrep -v "^$|^#"'
172.16.23.128 | CHANGED | rc=0 >>
cluster.name: estest
node.name: master
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["172.16.23.128", "172.16.23.129", "172.16.23.130"]

172.16.23.130 | CHANGED | rc=0 >>
cluster.name: estest
node.name: node2
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["172.16.23.128", "172.16.23.129", "172.16.23.130"]

172.16.23.129 | CHANGED | rc=0 >>
cluster.name: estest
node.name: node1
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["172.16.23.128", "172.16.23.129", "172.16.23.130"]
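For a cluster with three master-eligible nodes, the zen discovery documentation also recommends setting discovery.zen.minimum_master_nodes to (master-eligible nodes / 2) + 1, i.e. 2 here, so that a network split cannot elect two masters. This line is not part of the configuration shown above; it is only a sketch of what could be appended to each node's elasticsearch.yml:

# /etc/elasticsearch/elasticsearch.yml (suggested addition, not present in the configs above)
# 3 master-eligible nodes: 3 / 2 + 1 = 2
discovery.zen.minimum_master_nodes: 2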
Restart the elasticsearch service:
[root@master ~]# ansible all_nodes -m shell -a "systemctl restart elasticsearch"
172.16.23.130 | CHANGED | rc=0 >>

172.16.23.128 | CHANGED | rc=0 >>

172.16.23.129 | CHANGED | rc=0 >>
Then check the cluster status:
[root@master ~]# curl -X GET "localhost:9200/_cluster/health" -s|python -m json.tool
{
    "active_primary_shards": 0,
    "active_shards": 0,
    "active_shards_percent_as_number": 100.0,
    "cluster_name": "estest",
    "delayed_unassigned_shards": 0,
    "initializing_shards": 0,
    "number_of_data_nodes": 3,
    "number_of_in_flight_fetch": 0,
    "number_of_nodes": 3,
    "number_of_pending_tasks": 0,
    "relocating_shards": 0,
    "status": "green",
    "task_max_waiting_in_queue_millis": 0,
    "timed_out": false,
    "unassigned_shards": 0
}
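When scripting against a freshly restarted cluster, the health API can also block until a desired status is reached. A small sketch using the standard wait_for_status and timeout parameters (the 30s value is just an example):

# wait up to 30 seconds for the cluster to reach green before continuing
curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&timeout=30s" -s | python -m json.tool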
Check the number of nodes:
[root@master ~]# curl -X GET "localhost:9200/_cat/nodes?v"
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.16.23.128           28          71   3    0.04    0.11     0.08 mdi       *      master
172.16.23.130           29          67   4    0.04    0.11     0.10 mdi       -      node2
172.16.23.129           28          58   4    0.12    0.20     0.13 mdi       -      node1
Look at just the master node:
[root@master ~]# curl -X GET "localhost:9200/_cat/master?v"
id                     host          ip            node
hVY-U_ocQueMtcryoGGbTg 172.16.23.128 172.16.23.128 master
Check the cluster health:
[root@master ~]# curl -X GET "localhost:9200/_cat/health?v"
epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1546070536 08:02:16  estest  green           3         3      0   0    0    0        0             0                  -                100.0%
Check the node attributes (nodeattrs):
[root@master ~]# curl -X GET "localhost:9200/_cat/nodeattrs?v"
node   host          ip            attr              value
master 172.16.23.128 172.16.23.128 ml.machine_memory 3956293632
master 172.16.23.128 172.16.23.128 xpack.installed   true
master 172.16.23.128 172.16.23.128 ml.max_open_jobs  20
master 172.16.23.128 172.16.23.128 ml.enabled        true
node2  172.16.23.130 172.16.23.130 ml.machine_memory 3956293632
node2  172.16.23.130 172.16.23.130 ml.max_open_jobs  20
node2  172.16.23.130 172.16.23.130 xpack.installed   true
node2  172.16.23.130 172.16.23.130 ml.enabled        true
node1  172.16.23.129 172.16.23.129 ml.machine_memory 3956293632
node1  172.16.23.129 172.16.23.129 ml.max_open_jobs  20
node1  172.16.23.129 172.16.23.129 xpack.installed   true
node1  172.16.23.129 172.16.23.129 ml.enabled        true
Now manually create an index named test:
# curl -X PUT "localhost:9200/test"
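The PUT above relies on the 6.x defaults (5 primary shards, 1 replica). To make the layout explicit, the shard and replica counts can also be passed in the create-index request body; the index name test2 below is purely illustrative:

# create an index with explicit shard/replica settings (test2 is a hypothetical name)
curl -X PUT "localhost:9200/test2" -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'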
Then check the index from each node:
[root@master ~]# ansible all_nodes -m shell -a 'curl -X GET "localhost:9200/_cat/indices?v" -s'
 [WARNING]: Consider using the get_url or uri module rather than running curl.  If you need to use command because get_url or uri is insufficient you can add warn=False to this command task or set command_warnings=False in ansible.cfg to get rid of this message.

172.16.23.128 | CHANGED | rc=0 >>
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   test  l0Js1PJLTPSFEdXhanVSHA   5   1          0            0      1.7kb          1.1kb

172.16.23.130 | CHANGED | rc=0 >>
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   test  l0Js1PJLTPSFEdXhanVSHA   5   1          0            0      1.7kb          1.1kb

172.16.23.129 | CHANGED | rc=0 >>
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   test  l0Js1PJLTPSFEdXhanVSHA   5   1          0            0      1.7kb          1.1kb
Check the index's shard allocation:
[root@master ~]# curl -X GET "localhost:9200/_cat/shards?v"
index shard prirep state      docs store ip            node
test  3     p      STARTED       0  230b 172.16.23.128 master
test  3     r      STARTED       0  230b 172.16.23.130 node2
test  2     r      STARTED       0  230b 172.16.23.129 node1
test  2     p      STARTED       0  230b 172.16.23.130 node2
test  1     p      STARTED       0  230b 172.16.23.129 node1
test  1     r      UNASSIGNED
test  4     p      STARTED       0  230b 172.16.23.129 node1
test  4     r      UNASSIGNED
test  0     p      STARTED       0  230b 172.16.23.128 master
test  0     r      STARTED       0  230b 172.16.23.130 node2
The output above shows that two shards are in the UNASSIGNED state. Check the cluster health:
[root@master ~]# curl -X GET "localhost:9200/_cat/health?v"
epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1546071645 08:20:45  estest  yellow          3         3      8   5    0    0        2             0                  -                 80.0%
Use the following command to locate the problem shards and the reason they are unassigned:
[root@master ~]# curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason -s| grep UNASSIGNED
test 1 r UNASSIGNED INDEX_CREATED
test 4 r UNASSIGNED INDEX_CREATED
Get more detail on an unassigned shard:
[root@master ~]# curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
  "index" : "test",
  "shard" : 1,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2018-12-29T08:14:47.378Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "hVY-U_ocQueMtcryoGGbTg",
      "node_name" : "master",
      "transport_address" : "172.16.23.128:9300",
      "node_attributes" : {
        "ml.machine_memory" : "3956293632",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "cannot allocate replica shard to a node with version [6.5.2] since this is older than the primary version [6.5.4]"
        }
      ]
    },
    {
      "node_id" : "q95yZ4W4Tj6PaXyzLZZYDQ",
      "node_name" : "node1",
      "transport_address" : "172.16.23.129:9300",
      "node_attributes" : {
        "ml.machine_memory" : "3956293632",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 2,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[test][1], node[q95yZ4W4Tj6PaXyzLZZYDQ], [P], s[STARTED], a[id=j7V8PBUvQnOZzISPAxK9Uw]]"
        }
      ]
    },
    {
      "node_id" : "_ADSWG04TEqNfX_88ejtzQ",
      "node_name" : "node2",
      "transport_address" : "172.16.23.130:9300",
      "node_attributes" : {
        "ml.machine_memory" : "3956293632",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 3,
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "cannot allocate replica shard to a node with version [6.5.2] since this is older than the primary version [6.5.4]"
        }
      ]
    }
  ]
}
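Called without a body, /_cluster/allocation/explain reports on the first unassigned shard it finds. To explain a specific shard copy (for example replica 4 of test, the other unassigned shard here), the index, shard number and primary flag can be supplied explicitly:

# explain allocation for a specific shard copy
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d '
{
  "index": "test",
  "shard": 4,
  "primary": false
}'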
The explain output shows that the nodes are not all running the same es version: node1 is on 6.5.4 while master and node2 are on 6.5.2, so the replicas of primaries hosted on node1 cannot be allocated to the older nodes:
[root@master ~]# ansible all_nodes -m shell -a 'rpm -qa|grep elasticsearch'
 [WARNING]: Consider using the yum, dnf or zypper module rather than running rpm.  If you need to use command because yum, dnf or zypper is insufficient you can add warn=False to this command task or set command_warnings=False in ansible.cfg to get rid of this message.

172.16.23.128 | CHANGED | rc=0 >>
elasticsearch-6.5.2-1.noarch

172.16.23.130 | CHANGED | rc=0 >>
elasticsearch-6.5.2-1.noarch

172.16.23.129 | CHANGED | rc=0 >>
elasticsearch-6.5.4-1.noarch
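The same mismatch is also visible from the cluster itself: _cat/nodes accepts a column list via the h parameter, and version is one of the available columns, so a quick API-side check might look like this:

# show the elasticsearch version each node is running
curl -X GET "localhost:9200/_cat/nodes?v&h=ip,name,version"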
After replacing the mismatched package so that all nodes run the same version, start the es service again and watch the cluster and shard status recover:
[root@master ~]# curl -X GET "localhost:9200/_cat/health?v"
epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1546073143 08:45:43  estest  red             1         1      2   2    0    0        8             0                  -                 20.0%
[root@master ~]# curl -X GET "localhost:9200/_cat/health?v"
epoch      timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1546073274 08:47:54  estest  green           3         3     10   5    0    0        0             0                  -                100.0%

[root@master ~]# curl -X GET "localhost:9200/_cat/shards?v"
index shard prirep state   docs store ip            node
test  3     p      STARTED    0  261b 172.16.23.128 master
test  3     r      STARTED    0  261b 172.16.23.130 node2
test  4     r      STARTED    0  261b 172.16.23.128 master
test  4     p      STARTED    0  261b 172.16.23.129 node1
test  2     r      STARTED    0  261b 172.16.23.129 node1
test  2     p      STARTED    0  261b 172.16.23.130 node2
test  1     p      STARTED    0  261b 172.16.23.129 node1
test  1     r      STARTED    0  261b 172.16.23.130 node2
test  0     p      STARTED    0  261b 172.16.23.128 master
test  0     r      STARTED    0  261b 172.16.23.130 node2
The test index consists of 10 shards: 5 primary shards and 5 replica shards. A replica shard is a copy of a primary shard; it provides fault tolerance and serves read requests. The number of primary shards is fixed when the index is created, while the number of replica shards can be changed at any time. The defaults are 5 primary shards and 1 replica.
[root@master ~]# curl -XGET localhost:9200/test?pretty
{
  "test" : {
    "aliases" : { },
    "mappings" : { },
    "settings" : {
      "index" : {
        "creation_date" : "1546071287243",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "l0Js1PJLTPSFEdXhanVSHA",
        "version" : {
          "created" : "6050299"
        },
        "provided_name" : "test"
      }
    }
  }
}

[root@master ~]# curl -XGET localhost:9200/_cat/indices?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   test  l0Js1PJLTPSFEdXhanVSHA   5   1          0            0      2.5kb          1.2kb
A primary shard and its own replica shard can never be placed on the same node (otherwise, if that node went down, both the primary and its copy would be lost and the replica would provide no fault tolerance), but a node may hold a primary shard together with replicas of other primary shards.
For more on how nodes and shard counts are allocated, see: https://blog.csdn.net/qq_38486203/article/details/80077844
Finally, a quick review of some basic es concepts:
1. cluster:
A cluster consists of one or more nodes, and every cluster is identified by its cluster name.
2. node:
A node is a single ES instance. One machine can run multiple instances, and a cluster is made up of multiple nodes; in most cases each node runs on its own machine or virtual machine.
3. index:
An index is a collection of documents.
4. shard:
ES is a distributed search engine: each index has one or more shards, and the index's data is spread across them, much like one bucket of water split across N cups.
Shards enable horizontal scaling. The N shards are spread as evenly as possible (rebalanced) across the nodes: for example, with 2 nodes and 4 primary shards (ignoring replicas), each node holds 2 shards; add 2 more nodes and each of the 4 nodes ends up with 1 shard. This process is called relocation, and ES detects the change and performs it automatically; a per-node view of the resulting distribution is shown in the sketch after this list.
Shards are independent: every shard executes each search request. Each shard is also a Lucene index, so a single shard can hold at most Integer.MAX_VALUE - 128 = 2,147,483,519 documents.
5. replica:
A replica is a backup shard, the counterpart of a primary shard.
A primary shard and its replica never sit on the same node (to avoid a single point of failure). By default an index is created with 5 primary shards and 1 replica of each (5 primary + 5 replica = 10 shards).
If you only have one node, none of the 5 replicas can be assigned (unassigned) and the cluster status becomes yellow.
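As mentioned in the shard item above, the distribution of shards across nodes can be inspected directly; a sketch using _cat/allocation, which lists how many shards each node currently holds along with its disk usage:

# per-node shard counts and disk usage
curl -X GET "localhost:9200/_cat/allocation?v"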
The three ES cluster states:
Green: all primary and replica shards are assigned and ready. Even if one machine goes down (assuming one instance per machine), no data is lost, though the status then drops to yellow.
Yellow: all primary shards are ready, but at least one primary (call it A) has a replica that is not assigned. The cluster is in a warning state: high availability and disaster tolerance are reduced. If the machine holding A then fails and its only replica is still unassigned, A's data is lost (queries become incomplete) and the cluster turns red.
Red: at least one primary shard is not available (there is no replica left to promote to primary), so query results are missing data (incomplete).
Fault tolerance: if a primary shard is lost, its replica is promoted to become the new primary, and a new replica is then created from it, so the cluster's data stays intact.
Better query performance: a replica holds the same data as its primary, so a query can be served by either the primary or a replica shard; within reason, more replicas improve read throughput (at the cost of extra cpu/disk/heap usage). Index requests, however, are executed only on primary shards; replicas do not accept index requests.
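As a minimal illustration of the read/write paths (the document id and content below are arbitrary): the write must be routed to a primary shard and is then replicated, while the subsequent search may be answered by either primary or replica copies:

# index a document (handled by a primary shard, then copied to its replica)
curl -X PUT "localhost:9200/test/_doc/1" -H 'Content-Type: application/json' -d '{"message": "hello es cluster"}'
# search (may be served by primary or replica shards)
curl -X GET "localhost:9200/test/_search?q=message:hello&pretty"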
For an existing index, the number of primary shards (number_of_shards) cannot be changed without reindexing, but the number of replicas (number_of_replicas) can be adjusted at any time.
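For example, the replica count of the test index created above could be raised to 2 (or dropped to 0 on a single-node setup) at any time through the index settings API; the value 2 is only for illustration:

# change the replica count of an existing index
curl -X PUT "localhost:9200/test/_settings" -H 'Content-Type: application/json' -d '
{
  "index": {
    "number_of_replicas": 2
  }
}'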