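Elasticsearch 7: Distributed Architecture and Distributed Search
Distributed characteristics
Benefits of Elasticsearch's distributed design:
- Storage scales horizontally
- Higher availability: the cluster keeps serving requests when some nodes stop
Elasticsearch's distributed architecture
- Clusters are distinguished by cluster name; the default is "elasticsearch"
- The name can be changed in the configuration file or on the command line with -E cluster.name="ops-es"
Nodes
A node is one Elasticsearch instance:
- It is essentially a Java process
- One machine can run several Elasticsearch processes, but in production one instance per machine is generally recommended
Every node has a name, set in the configuration file or at startup with -E node.name=es01
After a node starts, it generates a UID and stores it in the data directory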
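Coordinating Node
The node that receives a request is called the Coordinating Node
- It routes the request to the right node; an index-creation request, for example, is routed to the master node
Every node is a Coordinating Node by default
Setting all other roles to false turns a node into a dedicated Coordinating Node
Data Node
A node that can store data is called a Data Node
- A node is a data node by default after startup; set node.data: false to disable the role
Responsibilities of a Data Node
- Store shard data, which is essential for scaling data (the Master Node decides how shards are distributed across the data nodes)
Adding data nodes
- Scales data horizontally and helps remove the data single point of failure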
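Master Node
Responsibilities of the Master Node
- Handle requests such as creating and deleting indices, and decide which node each shard is allocated to
- Maintain and update the cluster state
Master Node best practices
- The master node is critical, so a deployment must avoid making it a single point of failure
- Configure several master-eligible nodes per cluster, and let each of them carry only the master role
Cluster state
The cluster state holds the information a cluster needs to operate:
- All node information
- All indices with their mappings and settings
- Shard routing information
Every node keeps a copy of the cluster state
However, only the master node may modify the cluster state, and it is responsible for publishing changes to the other nodes
- If any node could modify it, the cluster state would become inconsistent
Master-eligible nodes and the election process
Master-eligible nodes ping each other, and the node with the lowest Node ID is elected (this describes the Zen discovery used before 7.0)
The other nodes join the cluster without taking the master role; once they detect that the elected master is lost, they elect a new one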
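Split-brain
Split-brain is a classic network problem in distributed systems: a network fault leaves one node unable to reach the others
- Node2 and Node3 elect a new master
- Node1 still considers itself master, forms a cluster of its own, and keeps updating the cluster state
- The result is two masters maintaining different cluster states; when the network recovers, there is no way to pick the correct one to restore
How to avoid split-brain
Add an election condition by setting a quorum: an election can proceed only when the number of reachable master-eligible nodes is at least the quorum
- quorum = (number of master-eligible nodes / 2) + 1, using integer division
- With 3 master-eligible nodes, setting discovery.zen.minimum_master_nodes to 2 avoids split-brain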
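The quorum rule above can be checked with a few lines of Python. This is only an illustrative sketch: the function names quorum and can_elect are hypothetical and are not part of Elasticsearch.

```python
def quorum(master_eligible: int) -> int:
    """Smallest majority of the master-eligible nodes: (n / 2) + 1 with integer division."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect(reachable: int, master_eligible: int) -> bool:
    """A partition may elect a master only if it can still see a quorum."""
    return reachable >= quorum(master_eligible)


# Three master-eligible nodes, network split into {Node1} and {Node2, Node3}.
print(quorum(3))         # quorum for 3 master-eligible nodes
print(can_elect(1, 3))   # Node1 alone
print(can_elect(2, 3))   # Node2 + Node3
```

With the quorum set to 2, the isolated Node1 can no longer act as master, so only one side of the partition keeps a master.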
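From 7.0 on, this setting is no longer needed
- The minimum_master_nodes parameter was removed; Elasticsearch itself chooses the nodes that can form a quorum
- A typical master election now completes in a very short time. Scaling the cluster is safer and easier, and there are fewer configuration options that can lead to data loss
- Nodes record their state more clearly, which helps diagnose why they cannot join a cluster or why no master can be elected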
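Primary Shard
Shards are the foundation of distributed storage in Elasticsearch
- Primary shards / replica shards
Primary shards spread data across all nodes
- Primary shards distribute the data of one index over several Data Nodes, which gives horizontal storage scaling
- The number of primary shards is set when the index is created and cannot be changed afterwards by default; changing it requires a reindex
Choosing the number of shards
How to plan the primary and replica shard counts of an index
- Too few primary shards: for example, an index created with 1 primary shard
  - If the index grows quickly, the cluster cannot scale this index's data by adding nodes
- Too many primary shards: each shard holds very little data and a node ends up hosting too many shards, which hurts performance
- Too many replica shards lower the overall write performance of the cluster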
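Documents are stored on shards
A document is stored on one specific primary shard and on its replicas; for example, document 1 is stored on P0 and R0
Requirements for the document-to-shard mapping:
- Documents should be spread evenly over all shards to use the hardware fully, so that no machines sit idle while others are overloaded
- Candidate algorithms
  - Random / round robin: when looking up document 1 with many shards, several shards may have to be queried before it is found
  - Maintaining a document-to-shard mapping table: the maintenance cost is high when the number of documents is large
  - Computing it on the fly: the shard holding document 1 is derived from the document itself
The document-to-shard routing algorithm
shard = hash(_routing) % number_of_primary_shards
- The hash function spreads documents evenly over the shards
- The default _routing value is the document id
- A custom _routing value can be supplied, for example to place all products of the same country on the same shard
- This formula is the underlying reason why the number of primary shards cannot be changed freely once the index settings are in place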
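The effect of the routing formula can be sketched in Python. This is a sketch under assumptions: zlib.crc32 stands in for the hash (Elasticsearch itself uses a Murmur3 hash), and shard_for is a hypothetical helper name.

```python
import zlib


def shard_for(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(_routing) % number_of_primary_shards.

    zlib.crc32 is only a stand-in hash for illustration.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = [str(i) for i in range(1, 1001)]

# Placement with 3 primary shards versus 4 primary shards.
placement_3 = {d: shard_for(d, 3) for d in doc_ids}
placement_4 = {d: shard_for(d, 4) for d in doc_ids}

# Documents whose shard would change if the primary shard count changed.
moved = sum(1 for d in doc_ids if placement_3[d] != placement_4[d])
print(f"{moved} of {len(doc_ids)} documents would live on a different shard")

# A custom routing value keeps related documents together.
print(shard_for("country:CN", 3) == shard_for("country:CN", 3))
```

Because the modulus changes with the shard count, most existing documents would be looked up on the wrong shard after a change, which is why the data has to be reindexed instead.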
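How shards work internally
What is an ES shard
- The smallest unit of work in ES: a shard is a Lucene index
Some questions:
- Why is ES search near real-time
- How does ES avoid losing data on power failure
- Why does deleting a document not free space immediately
Immutability of the inverted index
- The inverted index uses an immutable design: once generated, it cannot be changed
- Benefits of immutability:
  - No concurrent writes to the file need to be considered, which avoids the performance cost of locking
  - Once read into the kernel's file system cache, the index stays there. As long as the file system cache has enough space, most requests are served from memory instead of hitting disk, which improves performance considerably
  - Caches are easy to build and maintain, and the data can be compressed
- The challenge of immutability: making a new document searchable requires rebuilding the whole index
Lucene Index
- In Lucene, a single inverted index file is called a Segment. Segments are self-contained and immutable; several Segments together form a Lucene Index, which corresponds to a Shard in ES
- Writing new documents generates new Segments. A query searches all Segments and merges the results. Lucene keeps a file that records all Segments, called the Commit Point
- Information about deleted documents is kept in ".del" files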
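What is Refresh
- Writing the Index Buffer into a Segment is called Refresh. Refresh does not perform an fsync
- Refresh frequency: once per second by default, configurable through index.refresh_interval. After a Refresh the data becomes searchable, which is why Elasticsearch search is near real-time
- Heavy write traffic therefore produces many Segments
- A full Index Buffer also triggers a Refresh; its default size is 10% of the JVM heap
What is the Transaction Log
- Writing a Segment to disk is relatively slow, so a Refresh first writes the Segment into the file system cache to make it searchable
- To prevent data loss, every indexed document is also written to the Transaction Log. In recent versions the Transaction Log is persisted to disk by default. Each shard has its own Transaction Log
- During an ES Refresh the Index Buffer is cleared, but the Transaction Log is not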
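What is Flush
ES Flush & Lucene Commit
- Call Refresh, which empties the Index Buffer
- Call fsync to write the Segments in the cache to disk
- Clear the Transaction Log
Flush is triggered when
- The periodic interval elapses (every 30 minutes by default)
- The Transaction Log is full (512 MB by default)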
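The interplay of Index Buffer, Refresh, Transaction Log and Flush can be pictured with a toy model. The sketch below is purely illustrative: ToyShard and its attributes are hypothetical names, and real Lucene segments are inverted indexes rather than tuples of documents.

```python
class ToyShard:
    """Toy model of one shard's write path; not how Lucene stores data."""

    def __init__(self):
        self.index_buffer = []     # newly indexed documents, not yet searchable
        self.cached_segments = []  # searchable, but only in the file system cache
        self.disk_segments = []    # fsynced segments
        self.translog = []         # durability log, survives a refresh

    def index(self, doc):
        # Every write goes to the buffer and to the transaction log.
        self.index_buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Buffer -> new immutable segment; no fsync, translog untouched.
        if self.index_buffer:
            self.cached_segments.append(tuple(self.index_buffer))
            self.index_buffer = []

    def flush(self):
        # Refresh, fsync the cached segments, then clear the translog.
        self.refresh()
        self.disk_segments.extend(self.cached_segments)
        self.cached_segments = []
        self.translog = []

    def search(self):
        # Only segments are searchable, never the index buffer.
        segments = self.disk_segments + self.cached_segments
        return [doc for segment in segments for doc in segment]


shard = ToyShard()
shard.index("doc-1")
print(shard.search())      # nothing yet: the document is still in the buffer
shard.refresh()
print(shard.search())      # searchable after the refresh
print(shard.translog)      # still logged until the next flush
shard.flush()
print(shard.translog, shard.disk_segments)
```

The model also shows why a crash between a refresh and a flush loses nothing: the documents of the unsynced segments are still in the transaction log and can be replayed.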
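What is Merge
- Segments accumulate and need to be merged periodically
- Merging reduces the number of Segments and physically removes deleted documents
- ES and Lucene merge automatically
- A merge can also be forced: POST my_index/_forcemerge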
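Distributed search
An Elasticsearch search runs in two phases:
Phase 1: Query
Phase 2: Fetch
- The user sends a search request to an ES node. That node acts as the Coordinating Node; in an example index with 3 primary shards and 1 replica each, it randomly picks 3 of the 6 shard copies (one copy of each shard) and sends them the query
- The selected shards execute the query and sort the hits. Each shard then returns From + Size sorted document ids and sort values to the Coordinating Node
- The Coordinating Node re-sorts the document id lists obtained from every shard in the Query phase and selects the ids ranked From to From + Size
- It then fetches the full documents from the relevant shards with a multi-get request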
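The two phases can be simulated in a few lines of Python. This is a simplified sketch, not the Elasticsearch implementation: hits are modelled as (score, doc_id) tuples, and the names query_phase, query_then_fetch, shards and sources are hypothetical.

```python
import heapq


def query_phase(shard_hits, frm, size):
    """One shard: return its top (frm + size) hits as (score, doc_id), best first."""
    return heapq.nlargest(frm + size, shard_hits)


def query_then_fetch(shards, sources, frm, size):
    # Query phase: gather candidates from one copy of every shard.
    candidates = []
    for shard_hits in shards:
        candidates.extend(query_phase(shard_hits, frm, size))
    # Coordinating node: re-sort globally and keep positions frm .. frm + size.
    wanted = sorted(candidates, reverse=True)[frm:frm + size]
    # Fetch phase: retrieve the full documents for the selected ids only.
    return [sources[doc_id] for _, doc_id in wanted]


# Three shards with their matching documents and scores.
shards = [
    [(0.9, "a"), (0.1, "b")],
    [(0.8, "c"), (0.7, "d")],
    [(0.5, "e")],
]
sources = {doc_id: {"_id": doc_id} for hits in shards for _, doc_id in hits}

page = query_then_fetch(shards, sources, frm=1, size=2)
print(page)
```

Only ids and sort values travel in the query phase; the full documents are loaded once, for the final page.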
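Potential problems of Query Then Fetch
Performance:
- Each shard has to collect From + Size documents
- The coordinating node has to process number_of_shards * (From + Size) documents in total
- Deep pagination makes both numbers grow
Relevance scoring
- Each shard scores relevance using only its own data, which can skew the scores. Scoring is independent between shards, so when the total number of documents is small and there is more than one primary shard, the more primary shards there are, the less accurate the scores become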
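Pagination & traversal
- From: the starting position
- Size: the number of documents to return
ES is distributed by design: the data for one query lives in several shards on several machines, yet the results must still be sorted globally (by relevance score by default)
For a query with From=990, Size=10
- Each shard collects 1000 documents. The Coordinating Node then gathers all results, and finally re-sorts them to pick the top 1000, of which the last 10 are returned
- The deeper the page, the more memory is used. To bound the memory cost of deep pagination, ES limits From + Size to 10000 documents by default (index.max_result_window)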
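The arithmetic behind deep pagination is easy to reproduce. The helper names below are hypothetical; the only values taken from the text are the From=990, Size=10 example and the default limit of 10000.

```python
def docs_handled(number_of_shards: int, frm: int, size: int):
    """Return (documents collected per shard, documents the coordinating node sorts)."""
    per_shard = frm + size
    return per_shard, number_of_shards * per_shard


def within_result_window(frm: int, size: int, max_result_window: int = 10000) -> bool:
    """A request is accepted only while From + Size stays within the window."""
    return frm + size <= max_result_window


# The example from the text, assuming an index with 5 primary shards.
print(docs_handled(5, 990, 10))
print(within_result_window(9990, 10))
print(within_result_window(9991, 10))
```

The cost grows linearly with the page depth and with the number of shards, even though the client only ever receives Size documents.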
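Search After avoids the deep pagination problem
- It avoids the performance cost of deep pagination and fetches the next page in real time
- It does not support jumping to a given page (From)
- It can only page forward
- The first search must specify a sort whose values are unique (adding _id to the sort guarantees uniqueness)
- Each following search passes the sort values of the last document of the previous page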
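The cursor logic of Search After can be sketched in plain Python, assuming a sort on age with _id as the tiebreaker. search_after_page is a hypothetical helper that mimics what a shard does; it is not a client API.

```python
def search_after_page(docs, size, search_after=None):
    """Return one page sorted by (age, _id) plus the sort values of its last hit."""
    ordered = sorted(docs, key=lambda d: (d["age"], d["_id"]))
    if search_after is not None:
        # Skip everything up to and including the last hit of the previous page.
        cursor = tuple(search_after)
        ordered = [d for d in ordered if (d["age"], d["_id"]) > cursor]
    page = ordered[:size]
    last_sort = [page[-1]["age"], page[-1]["_id"]] if page else None
    return page, last_sort


docs = [
    {"_id": "1", "age": 32},
    {"_id": "3", "age": 25},
    {"_id": "5", "age": 25},
    {"_id": "10", "age": 20},
]

page1, cursor = search_after_page(docs, size=2)
page2, cursor2 = search_after_page(docs, size=2, search_after=cursor)
print([d["_id"] for d in page1], cursor)
print([d["_id"] for d in page2], cursor2)
```

Documents 3 and 5 share the same age, so without the _id tiebreaker the cursor could not tell them apart and one of them might be skipped or repeated. Each shard only has to collect Size documents after the cursor, whatever the page depth.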
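Bucket & Metric aggregations and nested aggregations
- Metric: a family of statistical computations
- Bucket: a set of documents that satisfy a condition
Metric Aggregation
Single-value analysis
- max, min, avg, sum
- Cardinality (similar to distinct count)
Multi-value analysis
- stats, extended stats
- percentile, percentile rank
- top hits
Demo
Generate the data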
# Define the mapping of the employees index
PUT /employees/
{ "mappings":{ "properties":{ "age":{ "type": "integer" }, "gender":{ "type": "keyword" }, "job":{ "type": "text", "fields":{ "keyword": { "type": "keyword", "ignore_above": 50 } } }, "name":{ "type": "keyword" }, "salary":{ "type" : "integer" } } } }
# Insert the data; in a _bulk body every action line and every document must be on its own line
PUT /employees/_bulk
{ "index" : { "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : { "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : { "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : { "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : { "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : { "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : { "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : { "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : { "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : { "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : { "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : { "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : { "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : { "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : { "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : { "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : { "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : { "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}
Test examples
# Metric aggregation: find the lowest salary
POST employees/_search
{ "size":0, "aggs": { "min_salary": { "min": { "field": "salary" } } } }
# Result
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "min_salary" : { "value" : 9000.0 } } }
# Metric aggregation: find the highest salary
POST employees/_search
{ "size":0, "aggs": { "max_salary": { "max": { "field": "salary" } } } }
# Result
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "max_salary" : { "value" : 50000.0 } } }
# Several metric aggregations at once: lowest, highest and average salary
POST employees/_search
{ "size": 0, "aggs": { "max_salary": { "max": { "field": "salary" } }, "min_salary": { "min": { "field": "salary" } }, "avg_salary": { "avg": { "field": "salary" } } } }
# Result
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "max_salary" : { "value" : 50000.0 }, "avg_salary" : { "value" : 24700.0 }, "min_salary" : { "value" : 9000.0 } } }
# One aggregation that outputs several values: stats
POST employees/_search
{ "size": 0, "aggs": { "stats_salary": { "stats": { "field":"salary" } } } }
# Result
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "stats_salary" : { "count" : 20, "min" : 9000.0, "max" : 50000.0, "avg" : 24700.0, "sum" : 494000.0 } } }
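The stats result above can be cross-checked outside Elasticsearch. The sketch below recomputes the same five values in plain Python from the 20 salaries of the bulk request; the stats function is a hypothetical helper, not a client API.

```python
# Salaries of employees 1..20, copied from the bulk request above.
salaries = [
    35000, 50000, 18000, 22000, 18000, 25000, 20000, 20000, 22000, 9000,
    38000, 32000, 30000, 25000, 28000, 16000, 16000, 20000, 30000, 20000,
]


def stats(values):
    """Mirror the fields returned by the stats metric aggregation."""
    total = sum(values)
    return {
        "count": len(values),
        "min": float(min(values)),
        "max": float(max(values)),
        "avg": total / len(values),
        "sum": float(total),
    }


print(stats(salaries))
```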
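Bucket aggregations
Documents are assigned to buckets according to rules, which classifies them. ES provides the common Bucket Aggregations:
- Terms
- Numeric types
  - Range / Date Range
  - Histogram / Date Histogram
- Nesting is supported (buckets within buckets)
Terms Aggregation
- A terms aggregation needs doc values or fielddata on the field
- keyword fields support Terms Aggregation by default
- text fields need fielddata enabled in the mapping; buckets are then built from the analyzed terms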
# Terms aggregation on the keyword sub-field of job
POST employees/_search
{ "size": 0, "aggs": { "jobs": { "terms": { "field":"job.keyword" } } } }
# Result
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "jobs" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "Java Programmer", "doc_count" : 7 }, { "key" : "Javascript Programmer", "doc_count" : 4 }, { "key" : "QA", "doc_count" : 3 }, { "key" : "DBA", "doc_count" : 2 }, { "key" : "Web Designer", "doc_count" : 2 }, { "key" : "Dev Manager", "doc_count" : 1 }, { "key" : "Product Manager", "doc_count" : 1 } ] } } }
Aggregating on a text field requires fielddata to be enabled
# Enable fielddata on the text field so that it supports a terms aggregation
PUT employees/_mapping
{ "properties" : { "job":{ "type": "text", "fielddata": true } } }
# Terms aggregation on the text field: the buckets are the analyzed terms
POST employees/_search
{ "size": 0, "aggs": { "jobs": { "terms": { "field":"job" } } } }
# Result: unlike the keyword field, the buckets are individual lowercase tokens
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "jobs" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "programmer", "doc_count" : 11 }, { "key" : "java", "doc_count" : 7 }, { "key" : "javascript", "doc_count" : 4 }, { "key" : "qa", "doc_count" : 3 }, { "key" : "dba", "doc_count" : 2 }, { "key" : "designer", "doc_count" : 2 }, { "key" : "manager", "doc_count" : 2 }, { "key" : "web", "doc_count" : 2 }, { "key" : "dev", "doc_count" : 1 }, { "key" : "product", "doc_count" : 1 } ] } } }
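The difference between the two results comes from analysis. The sketch below reproduces both bucket lists in Python, under the assumption that lowercasing and splitting on whitespace is a good enough approximation of the standard analyzer for these job titles.

```python
from collections import Counter

# Job titles of the 20 employees, taken from the bulk request above.
jobs = (
    ["Product Manager"] + ["Dev Manager"] + ["Web Designer"] * 2 + ["QA"] * 3
    + ["Java Programmer"] * 7 + ["Javascript Programmer"] * 4 + ["DBA"] * 2
)

# keyword field: the whole value is one term.
keyword_buckets = Counter(jobs)

# text field: every document contributes each of its distinct tokens once.
text_buckets = Counter(
    token for job in jobs for token in set(job.lower().split())
)

print(keyword_buckets.most_common())
print(text_buckets.most_common())
```

The keyword field yields 7 buckets, the analyzed text field 10, and "programmer" covers both Java and Javascript programmers.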
Counting the distinct terms
# Cardinality of job.keyword, i.e. the number of distinct values. A terms aggregation on job.keyword and one on job produce different numbers of buckets
POST employees/_search
{ "size": 0, "aggs": { "cardinate": { "cardinality": { "field": "job.keyword" } } } }
# Result
{ "took" : 7, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "cardinate" : { "value" : 7 } } }
Bucketing by gender
# Terms aggregation on the gender keyword field
POST employees/_search
{ "size": 0, "aggs": { "gender": { "terms": { "field":"gender" } } } }
# Result
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "gender" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "male", "doc_count" : 12 }, { "key" : "female", "doc_count" : 8 } ] } } }
Specifying size
# Limit the number of buckets with size
POST employees/_search
{ "size": 0, "aggs": { "ages_5": { "terms": { "field":"age", "size":3 } } } }
# Result
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "ages_5" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 12, "buckets" : [ { "key" : 25, "doc_count" : 3 }, { "key" : 32, "doc_count" : 3 }, { "key" : 27, "doc_count" : 2 } ] } } }
Bucket Size
# With size on top_hits: the 3 oldest employees of every job type, with their details
POST employees/_search
{ "size": 0, "aggs": { "jobs": { "terms": { "field":"job.keyword" }, "aggs":{ "old_employee":{ "top_hits":{ "size":3, "sort":[ { "age":{ "order":"desc" } } ] } } } } } }
# Result
{ "took" : 4, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "jobs" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [
{ "key" : "Java Programmer", "doc_count" : 7, "old_employee" : { "hits" : { "total" : { "value" : 7, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "11", "_score" : null, "_source" : { "name" : "Jenny", "age" : 36, "job" : "Java Programmer", "gender" : "female", "salary" : 38000 }, "sort" : [ 36 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "15", "_score" : null, "_source" : { "name" : "King", "age" : 33, "job" : "Java Programmer", "gender" : "male", "salary" : 28000 }, "sort" : [ 33 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "9", "_score" : null, "_source" : { "name" : "Gregory", "age" : 32, "job" : "Java Programmer", "gender" : "male", "salary" : 22000 }, "sort" : [ 32 ] } ] } } },
{ "key" : "Javascript Programmer", "doc_count" : 4, "old_employee" : { "hits" : { "total" : { "value" : 4, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "14", "_score" : null, "_source" : { "name" : "Marshall", "age" : 32, "job" : "Javascript Programmer", "gender" : "male", "salary" : 25000 }, "sort" : [ 32 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "18", "_score" : null, "_source" : { "name" : "Catherine", "age" : 29, "job" : "Javascript Programmer", "gender" : "female", "salary" : 20000 }, "sort" : [ 29 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "17", "_score" : null, "_source" : { "name" : "Goodwin", "age" : 25, "job" : "Javascript Programmer", "gender" : "male", "salary" : 16000 }, "sort" : [ 25 ] } ] } } },
{ "key" : "QA", "doc_count" : 3, "old_employee" : { "hits" : { "total" : { "value" : 3, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "6", "_score" : null, "_source" : { "name" : "Lucy", "age" : 31, "job" : "QA", "gender" : "female", "salary" : 25000 }, "sort" : [ 31 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "7", "_score" : null, "_source" : { "name" : "Byrd", "age" : 27, "job" : "QA", "gender" : "male", "salary" : 20000 }, "sort" : [ 27 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "5", "_score" : null, "_source" : { "name" : "Rose", "age" : 25, "job" : "QA", "gender" : "female", "salary" : 18000 }, "sort" : [ 25 ] } ] } } },
{ "key" : "DBA", "doc_count" : 2, "old_employee" : { "hits" : { "total" : { "value" : 2, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "19", "_score" : null, "_source" : { "name" : "Boone", "age" : 30, "job" : "DBA", "gender" : "male", "salary" : 30000 }, "sort" : [ 30 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "20", "_score" : null, "_source" : { "name" : "Kathy", "age" : 29, "job" : "DBA", "gender" : "female", "salary" : 20000 }, "sort" : [ 29 ] } ] } } },
{ "key" : "Web Designer", "doc_count" : 2, "old_employee" : { "hits" : { "total" : { "value" : 2, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "4", "_score" : null, "_source" : { "name" : "Rivera", "age" : 26, "job" : "Web Designer", "gender" : "female", "salary" : 22000 }, "sort" : [ 26 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "3", "_score" : null, "_source" : { "name" : "Tran", "age" : 25, "job" : "Web Designer", "gender" : "male", "salary" : 18000 }, "sort" : [ 25 ] } ] } } },
{ "key" : "Dev Manager", "doc_count" : 1, "old_employee" : { "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "2", "_score" : null, "_source" : { "name" : "Underwood", "age" : 41, "job" : "Dev Manager", "gender" : "male", "salary" : 50000 }, "sort" : [ 41 ] } ] } } },
{ "key" : "Product Manager", "doc_count" : 1, "old_employee" : { "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "1", "_score" : null, "_source" : { "name" : "Emma", "age" : 32, "job" : "Product Manager", "gender" : "female", "salary" : 35000 }, "sort" : [ 32 ] } ] } } }
] } } }
# Range buckets
# Salary ranges; a custom key can be defined for a bucket
POST employees/_search
{ "size": 0, "aggs": { "salary_range": { "range": { "field":"salary", "ranges":[ { "to":10000 }, { "from":10000, "to":20000 }, { "key":">20000", "from":20000 } ] } } } }
# Result
{ "took" : 4, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "salary_range" : { "buckets" : [ { "key" : "*-10000.0", "to" : 10000.0, "doc_count" : 1 }, { "key" : "10000.0-20000.0", "from" : 10000.0, "to" : 20000.0, "doc_count" : 4 }, { "key" : ">20000", "from" : 20000.0, "doc_count" : 15 } ] } } }
# Salary histogram: salaries from 0 to 100000, one bucket per 5000
POST employees/_search
{ "size": 0, "aggs": { "salary_histrogram": { "histogram": { "field":"salary", "interval":5000, "extended_bounds":{ "min":0, "max":100000 } } } } }
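No result is shown for the histogram request, so the bucketing rule is sketched here in Python instead. It assumes the usual rule that a value falls into the bucket whose key is the value rounded down to a multiple of the interval, and that the bounds of the request (0 to 100000) add the empty buckets; histogram is a hypothetical helper.

```python
from collections import Counter

# Salaries of employees 1..20, copied from the bulk request above.
salaries = [
    35000, 50000, 18000, 22000, 18000, 25000, 20000, 20000, 22000, 9000,
    38000, 32000, 30000, 25000, 28000, 16000, 16000, 20000, 30000, 20000,
]


def histogram(values, interval, lower, upper):
    """Bucket key = value rounded down to a multiple of interval.

    Every key between lower and upper is emitted, including empty buckets.
    """
    counts = Counter((v // interval) * interval for v in values)
    return {key: counts.get(key, 0) for key in range(lower, upper + 1, interval)}


buckets = histogram(salaries, interval=5000, lower=0, upper=100000)
for key, doc_count in buckets.items():
    if doc_count:
        print(key, doc_count)
```

Note that the half-open buckets match the range aggregation above: a salary of exactly 20000 belongs to the bucket starting at 20000, not to the one ending there.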
Bucket sub-aggregations: a sub-aggregation can be a Bucket or a Metric aggregation
# Nested aggregation 1: bucket by job type and compute salary statistics
POST employees/_search
{ "size": 0, "aggs": { "Job_salary_stats": { "terms": { "field": "job.keyword" }, "aggs": { "salary": { "stats": { "field": "salary" } } } } } }
# Result
{ "took" : 9, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "Job_salary_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [
{ "key" : "Java Programmer", "doc_count" : 7, "salary" : { "count" : 7, "min" : 9000.0, "max" : 38000.0, "avg" : 25571.428571428572, "sum" : 179000.0 } },
{ "key" : "Javascript Programmer", "doc_count" : 4, "salary" : { "count" : 4, "min" : 16000.0, "max" : 25000.0, "avg" : 19250.0, "sum" : 77000.0 } },
{ "key" : "QA", "doc_count" : 3, "salary" : { "count" : 3, "min" : 18000.0, "max" : 25000.0, "avg" : 21000.0, "sum" : 63000.0 } },
{ "key" : "DBA", "doc_count" : 2, "salary" : { "count" : 2, "min" : 20000.0, "max" : 30000.0, "avg" : 25000.0, "sum" : 50000.0 } },
{ "key" : "Web Designer", "doc_count" : 2, "salary" : { "count" : 2, "min" : 18000.0, "max" : 22000.0, "avg" : 20000.0, "sum" : 40000.0 } },
{ "key" : "Dev Manager", "doc_count" : 1, "salary" : { "count" : 1, "min" : 50000.0, "max" : 50000.0, "avg" : 50000.0, "sum" : 50000.0 } },
{ "key" : "Product Manager", "doc_count" : 1, "salary" : { "count" : 1, "min" : 35000.0, "max" : 35000.0, "avg" : 35000.0, "sum" : 35000.0 } }
] } } }
# Multiple levels of nesting: bucket by job type, then by gender, then compute salary statistics
POST employees/_search
{ "size": 0, "aggs": { "Job_gender_stats": { "terms": { "field": "job.keyword" }, "aggs": { "gender_stats": { "terms": { "field": "gender" }, "aggs": { "salary_stats": { "stats": { "field": "salary" } } } } } } } }
# Result
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "Job_gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [
{ "key" : "Java Programmer", "doc_count" : 7, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "male", "doc_count" : 5, "salary_stats" : { "count" : 5, "min" : 9000.0, "max" : 32000.0, "avg" : 22200.0, "sum" : 111000.0 } }, { "key" : "female", "doc_count" : 2, "salary_stats" : { "count" : 2, "min" : 30000.0, "max" : 38000.0, "avg" : 34000.0, "sum" : 68000.0 } } ] } },
{ "key" : "Javascript Programmer", "doc_count" : 4, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "male", "doc_count" : 3, "salary_stats" : { "count" : 3, "min" : 16000.0, "max" : 25000.0, "avg" : 19000.0, "sum" : 57000.0 } }, { "key" : "female", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 20000.0, "max" : 20000.0, "avg" : 20000.0, "sum" : 20000.0 } } ] } },
{ "key" : "QA", "doc_count" : 3, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "female", "doc_count" : 2, "salary_stats" : { "count" : 2, "min" : 18000.0, "max" : 25000.0, "avg" : 21500.0, "sum" : 43000.0 } }, { "key" : "male", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 20000.0, "max" : 20000.0, "avg" : 20000.0, "sum" : 20000.0 } } ] } },
{ "key" : "DBA", "doc_count" : 2, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "female", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 20000.0, "max" : 20000.0, "avg" : 20000.0, "sum" : 20000.0 } }, { "key" : "male", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 30000.0, "max" : 30000.0, "avg" : 30000.0, "sum" : 30000.0 } } ] } },
{ "key" : "Web Designer", "doc_count" : 2, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "female", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 22000.0, "max" : 22000.0, "avg" : 22000.0, "sum" : 22000.0 } }, { "key" : "male", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 18000.0, "max" : 18000.0, "avg" : 18000.0, "sum" : 18000.0 } } ] } },
{ "key" : "Dev Manager", "doc_count" : 1, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "male", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 50000.0, "max" : 50000.0, "avg" : 50000.0, "sum" : 50000.0 } } ] } },
{ "key" : "Product Manager", "doc_count" : 1, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "female", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 35000.0, "max" : 35000.0, "avg" : 35000.0, "sum" : 35000.0 } } ] } }
] } } }
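Pipeline aggregations
The pipeline concept: the result of an aggregation can itself be aggregated again
The pipeline result is written into the original result. Depending on where it is placed, there are two kinds:
- sibling: the result sits at the same level as the existing result
  - min, max, avg and sum Bucket
  - Stats, Extended Stats Bucket
  - Percentiles Bucket
- parent: the result is embedded inside the existing aggregation result
  - Derivative
  - Cumulative Sum
  - Moving Function (moving window)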
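A sibling pipeline aggregation such as min_bucket works on the output of another aggregation rather than on documents. The Python sketch below illustrates the idea on the employees data: first a terms-plus-avg step per job, then a second step over the resulting buckets. The helper names avg_by_job, min_bucket and max_bucket are hypothetical.

```python
from collections import defaultdict

# (job, salary) of employees 1..20, copied from the bulk request above.
employees = [
    ("Product Manager", 35000), ("Dev Manager", 50000),
    ("Web Designer", 18000), ("Web Designer", 22000),
    ("QA", 18000), ("QA", 25000), ("QA", 20000),
    ("Java Programmer", 20000), ("Java Programmer", 22000),
    ("Java Programmer", 9000), ("Java Programmer", 38000),
    ("Java Programmer", 32000), ("Java Programmer", 30000),
    ("Javascript Programmer", 25000), ("Java Programmer", 28000),
    ("Javascript Programmer", 16000), ("Javascript Programmer", 16000),
    ("Javascript Programmer", 20000), ("DBA", 30000), ("DBA", 20000),
]


def avg_by_job(rows):
    """First stage: terms bucket on job with an avg sub-aggregation on salary."""
    grouped = defaultdict(list)
    for job, salary in rows:
        grouped[job].append(salary)
    return {job: sum(values) / len(values) for job, values in grouped.items()}


def min_bucket(buckets):
    """Second stage (sibling pipeline): the bucket with the smallest metric."""
    key = min(buckets, key=buckets.get)
    return key, buckets[key]


def max_bucket(buckets):
    """Second stage (sibling pipeline): the bucket with the largest metric."""
    key = max(buckets, key=buckets.get)
    return key, buckets[key]


averages = avg_by_job(employees)
print(min_bucket(averages))
print(max_bucket(averages))
```

The second stage never touches the documents again; it only reads the per-job averages, which is what a buckets_path such as jobs>avg_salary expresses.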
# Job type with the lowest average salary
POST employees/_search
{ "size": 0, "aggs": { "jobs": { "terms": { "field": "job.keyword", "size": 10 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } } } }, "min_salary_by_job":{ "min_bucket": { "buckets_path": "jobs>avg_salary" } } } }
# Job type with the highest average salary
POST employees/_search
{ "size": 0, "aggs": { "jobs": { "terms": { "field": "job.keyword", "size": 10 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } } } }, "max_salary_by_job":{ "max_bucket": { "buckets_path": "jobs>avg_salary" } } } }
# Average of the per-job average salaries
POST employees/_search
{ "size": 0, "aggs": { "jobs": { "terms": { "field": "job.keyword", "size": 10 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } } } }, "avg_salary_by_job":{ "avg_bucket": { "buckets_path": "jobs>avg_salary" } } } }
# Statistics over the per-job average salaries
POST employees/_search
{ "size": 0, "aggs": { "jobs": { "terms": { "field": "job.keyword", "size": 10 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } } } }, "stats_salary_by_job":{ "stats_bucket": { "buckets_path": "jobs>avg_salary" } } } }
# Percentiles of the per-job average salaries
POST employees/_search
{ "size": 0, "aggs": { "jobs": { "terms": { "field": "job.keyword", "size": 10 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } } } }, "percentiles_salary_by_job":{ "percentiles_bucket": { "buckets_path": "jobs>avg_salary" } } } }
# Derivative of the average salary by age
POST employees/_search
{ "size": 0, "aggs": { "age": { "histogram": { "field": "age", "min_doc_count": 1, "interval": 1 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } }, "derivative_avg_salary":{ "derivative": { "buckets_path": "avg_salary" } } } } } }
# Cumulative sum of the average salary by age
POST employees/_search
{ "size": 0, "aggs": { "age": { "histogram": { "field": "age", "min_doc_count": 1, "interval": 1 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } }, "cumulative_salary":{ "cumulative_sum": { "buckets_path": "avg_salary" } } } } } }
# Moving function: minimum of the average salary over a window of 10 buckets
POST employees/_search
{ "size": 0, "aggs": { "age": { "histogram": { "field": "age", "min_doc_count": 1, "interval": 1 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } }, "moving_avg_salary":{ "moving_fn": { "buckets_path": "avg_salary", "window":10, "script": "MovingFunctions.min(values)" } } } } } }
Scope and sorting
By default an ES aggregation runs on the result set of the query
ES also supports the following ways to change the scope of an aggregation
- Filter
- Post Filter
- Global
# Scope
# Query: the aggregation runs on the documents matched by the query
POST employees/_search
{ "size": 0, "query": { "range": { "age": { "gte": 20 } } }, "aggs": { "jobs": { "terms": { "field":"job.keyword" } } } }
# Filter: restricts the scope of one aggregation only; all_jobs still sees every document
POST employees/_search
{ "size": 0, "aggs": { "older_person": { "filter":{ "range":{ "age":{ "from":35 } } }, "aggs":{ "jobs":{ "terms": { "field":"job.keyword" } } }}, "all_jobs": { "terms": { "field":"job.keyword" } } } }
# Post filter: one request returns the buckets for all job types, while the hits are filtered after the aggregation
POST employees/_search
{ "aggs": { "jobs": { "terms": { "field": "job.keyword" } } }, "post_filter": { "match": { "job.keyword": "Dev Manager" } } }
# Global: the "all" aggregation ignores the query and runs on every document
POST employees/_search
{ "size": 0, "query": { "range": { "age": { "gte": 40 } } }, "aggs": { "jobs": { "terms": { "field":"job.keyword" } }, "all":{ "global":{}, "aggs":{ "salary_avg":{ "avg":{ "field":"salary" } } } } } }
Sorting:
Specify order to sort the buckets by count and key
- By default, buckets are sorted by count in descending order
- Specifying size returns that many buckets
# Order by count, then by key
POST employees/_search
{ "size": 0, "query": { "range": { "age": { "gte": 20 } } }, "aggs": { "jobs": { "terms": { "field":"job.keyword", "order":[ {"_count":"asc"}, {"_key":"desc"} ] } } } }
# Order by a sub-aggregation: the average salary
POST employees/_search
{ "size": 0, "aggs": { "jobs": { "terms": { "field":"job.keyword", "order":[ { "avg_salary":"desc" }] }, "aggs": { "avg_salary": { "avg": { "field":"salary" } } } } } }
# Order by one value of a multi-value sub-aggregation: stats_salary.min
POST employees/_search
{ "size": 0, "aggs": { "jobs": { "terms": { "field":"job.keyword", "order":[ { "stats_salary.min":"desc" }] }, "aggs": { "stats_salary": { "stats": { "field":"salary" } } } } } }
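UpdateByQuery & Reindex
Use cases:
Reindexing is generally needed in the following situations
- The index mapping changes: field types, analyzers or dictionary updates
- The index settings change: the number of primary shards changes
- Data has to be migrated within a cluster or between clusters
APIs built into ES
- UpdateByQuery: rebuilds on the existing index
- Reindex: rebuilds the index into another index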
Case 1
# Rebuilding an index
DELETE blogs/
# Index a document
PUT blogs/_doc/1
{ "content":"Hadoop is cool", "keyword":"hadoop" }
# Look at the mapping
GET blogs/_mapping
# Change the mapping: add a sub-field that uses the english analyzer
PUT blogs/_mapping
{ "properties" : { "content" : { "type" : "text", "fields" : { "english" : { "type" : "text", "analyzer":"english" } } } } }
# Index another document
PUT blogs/_doc/2
{ "content":"Elasticsearch rocks", "keyword":"elasticsearch" }
# Search for the newly indexed document
POST blogs/_search
{ "query": { "match": { "content.english": "Elasticsearch" } } }
# Search for the document indexed before the mapping change
POST blogs/_search
{ "query": { "match": { "content.english": "Hadoop" } } }
# Update all documents
POST blogs/_update_by_query
{ }
# After update_by_query, search again for the document indexed earlier
POST blogs/_search
{ "query": { "match": { "content.english": "Hadoop" } } }
Case 2: changing the mapping of an existing field
- ES does not allow the type of a field to be changed in an existing mapping
- The only option is to create a new index with the correct field types and import the data again
# Look at the mapping
GET blogs/_mapping
# Result: the type of the keyword field is text
{ "blogs" : { "mappings" : { "properties" : { "content" : { "type" : "text", "fields" : { "english" : { "type" : "text", "analyzer" : "english" }, "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }, "keyword" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } } } } } }
# Trying to change the type fails: ES does not allow an existing field to be modified
PUT blogs/_mapping
{ "properties" : { "content" : { "type" : "text", "fields" : { "english" : { "type" : "text", "analyzer" : "english" } } }, "keyword" : { "type" : "keyword" } } }
# Create a new index with the new mapping
PUT blogs_fix/
{ "mappings": { "properties" : { "content" : { "type" : "text", "fields" : { "english" : { "type" : "text", "analyzer" : "english" } } }, "keyword" : { "type" : "keyword" } } } }
# Reindex API
POST _reindex
{ "source": { "index": "blogs" }, "dest": { "index": "blogs_fix" } }
# Look at a document in the new index
GET blogs_fix/_doc/1
# Result
{ "_index" : "blogs_fix", "_type" : "_doc", "_id" : "1", "_version" : 1, "_seq_no" : 0, "_primary_term" : 1, "found" : true, "_source" : { "content" : "Hadoop is cool", "keyword" : "hadoop" } }
# Test a terms aggregation
POST blogs_fix/_search
{ "size": 0, "aggs": { "blog_keyword": { "terms": { "field": "keyword", "size": 10 } } } }
# The field is now of type keyword, so the terms aggregation works without enabling fielddata
# Result
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 2, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "blog_keyword" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "elasticsearch", "doc_count" : 1 }, { "key" : "hadoop", "doc_count" : 1 } ] } } }
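Reindex summary
The Reindex API copies documents from one index into another
When to use the Reindex API:
- Changing the number of primary shards of an index
- Changing the type of a field in the mapping
- Migrating data within a cluster or to another cluster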
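Ingest Pipeline & Painless Script
Ingest Node
A node type introduced in ES 5.0; with the default configuration every node is an Ingest Node
- It can pre-process data by intercepting Index and Bulk API requests
- It transforms the data and hands it back to the Index or Bulk API
Data can be pre-processed without Logstash, for example:
- Set a default value for a field; rename a field; split a field
- Painless scripts are supported for more complex processing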
Demo
Create a document
# Blog data with 3 fields; tags are separated by commas
PUT tech_blogs/_doc/1
{ "title":"Introducing big data......", "tags":"hadoop,elasticsearch,spark", "content":"You konw, for big data" }
# Simulate a pipeline that splits tags on ","
POST _ingest/pipeline/_simulate
{ "pipeline": { "description": "to split blog tags", "processors": [ { "split": { "field": "tags", "separator": "," } } ] }, "docs": [ { "_index": "index", "_id": "id", "_source": { "title": "Introducing big data......", "tags": "hadoop,elasticsearch,spark", "content": "You konw, for big data" } }, { "_index": "index", "_id": "idxx", "_source": { "title": "Introducing cloud computering", "tags": "openstack,k8s", "content": "You konw, for cloud" } } ] }
# Additionally add a field to the document with the set processor: the view count of the blog
POST _ingest/pipeline/_simulate
{ "pipeline": { "description": "to split blog tags", "processors": [ { "split": { "field": "tags", "separator": "," } }, { "set":{ "field": "views", "value": 0 } } ] }, "docs": [ { "_index":"index", "_id":"id", "_source":{ "title":"Introducing big data......", "tags":"hadoop,elasticsearch,spark", "content":"You konw, for big data" } }, { "_index":"index", "_id":"idxx", "_source":{ "title":"Introducing cloud computering", "tags":"openstack,k8s", "content":"You konw, for cloud" } } ] }
The requests above only simulate a pipeline. Once the test passes, create the Pipeline in ES
PUT _ingest/pipeline/blog_pipeline
{ "description": "a blog pipeline", "processors": [ { "split": { "field": "tags", "separator": "," } }, { "set":{ "field": "views", "value": 0 } } ] }
# Look at the Pipeline
GET _ingest/pipeline/blog_pipeline
# Test the stored pipeline; only the array of documents has to be supplied
POST _ingest/pipeline/blog_pipeline/_simulate
{ "docs": [ { "_source": { "title": "Introducing cloud computering", "tags": "openstack,k8s", "content": "You konw, for cloud" } } ] }
# Test 2: clear the index
DELETE tech_blogs
# Index a document without the pipeline
PUT tech_blogs/_doc/1
{ "title":"Introducing big data......", "tags":"hadoop,elasticsearch,spark", "content":"You konw, for big data" }
# Index a document with the pipeline
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{ "title": "Introducing cloud computering", "tags": "openstack,k8s", "content": "You konw, for cloud" }
# Look at the two documents: one was processed, the other was not
POST tech_blogs/_search
{}
# update_by_query over all documents fails: document 2 was already processed, so its tags field is no longer a string that can be split
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{ }
# Add a condition to update_by_query so that only unprocessed documents are updated
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{ "query": { "bool": { "must_not": { "exists": { "field": "views" } } } } }
# Search again: document 1 has now been processed by the pipeline as well
POST tech_blogs/_search
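Some built-in Processors
- Split: splits a field into an array
- Remove / Rename: removes or renames a field
- Append: adds a new value, for example a tag
- Convert: converts a type, for example from string to float
- Date / JSON: converts date formats, parses a string into JSON
- Date Index Name: assigns documents passing through the processor to an index named after a date pattern
- Fail: when an exception occurs, returns the error message specified in the Pipeline to the user
- Foreach: for array fields, applies the same processor to every element of the array
- Grok: parses log lines by pattern
- Gsub / Join / Split: string replacement, array to string, string to array
- Lowercase / Uppercase: case conversion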
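What a pipeline of processors does to a document can be mimicked in Python. The sketch below is a toy re-implementation of the split and set steps used in the demo; split, set_value and simulate are hypothetical helpers, not the Elasticsearch API.

```python
import copy


def split(field, separator):
    """Toy split processor: turn a delimited string field into a list."""
    def run(doc):
        doc[field] = doc[field].split(separator)
        return doc
    return run


def set_value(field, value):
    """Toy set processor: assign a fixed value to a field."""
    def run(doc):
        doc[field] = value
        return doc
    return run


def simulate(processors, docs):
    """Run every document through the processors in order, like _simulate."""
    results = []
    for doc in docs:
        current = copy.deepcopy(doc)  # leave the input documents untouched
        for processor in processors:
            current = processor(current)
        results.append(current)
    return results


blog_pipeline = [split("tags", ","), set_value("views", 0)]
docs = [{"title": "Introducing cloud computering", "tags": "openstack,k8s"}]

processed = simulate(blog_pipeline, docs)
print(processed)

# Running the pipeline a second time fails, because tags is already a list.
try:
    simulate(blog_pipeline, processed)
except AttributeError as error:
    print("second run failed:", error)
```

The failure on the second run mirrors the update_by_query error in the demo and shows why the query there excludes documents that already have a views field.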
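Painless
- Introduced in ES 5.x and designed specifically for ES; its syntax extends Java's
- From 6.0 on, ES supports only Painless. Groovy, JavaScript and Python are no longer supported
- Painless supports all Java data types and a subset of the Java API
- Painless scripts have the following properties:
  - High performance / secure
  - Support for explicit types or dynamically defined types
What Painless is used for:
Processing document fields
- Updating or deleting fields, and processing data in aggregations
- Script Field: compute values for returned fields
- Function Score: adjust the score of documents
Running scripts in an Ingest Pipeline
Processing data during Reindex API and Update By Query operations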
######### Demo for Painless ###############
# Add a script processor
POST _ingest/pipeline/_simulate
{ "pipeline": { "description": "to split blog tags", "processors": [ { "split": { "field": "tags", "separator": "," } }, { "script": { "source": """ if(ctx.containsKey("content")){ ctx.content_length = ctx.content.length(); }else{ ctx.content_length=0; } """ } }, { "set":{ "field": "views", "value": 0 } } ] }, "docs": [ { "_index":"index", "_id":"id", "_source":{ "title":"Introducing big data......", "tags":"hadoop,elasticsearch,spark", "content":"You konw, for big data" } }, { "_index":"index", "_id":"idxx", "_source":{ "title":"Introducing cloud computering", "tags":"openstack,k8s", "content":"You konw, for cloud" } } ] }
DELETE tech_blogs
PUT tech_blogs/_doc/1
{ "title":"Introducing big data......", "tags":"hadoop,elasticsearch,spark", "content":"You konw, for big data", "views":0 }
# Update the document with an inline script
POST tech_blogs/_update/1
{ "script": { "source": "ctx._source.views += params.new_views", "params": { "new_views":100 } } }
# Check the views counter
POST tech_blogs/_search
{ }
# Store the script in the cluster state
POST _scripts/update_views
{ "script":{ "lang": "painless", "source": "ctx._source.views += params.new_views" } }
# Update the document with the stored script
POST tech_blogs/_update/1
{ "script": { "id": "update_views", "params": { "new_views":1000 } } }
# Script field: return views plus a random number
GET tech_blogs/_search
{ "script_fields": { "rnd_views": { "script": { "lang": "painless", "source": """ java.util.Random rnd = new Random(); doc['views'].value+rnd.nextInt(1000); """ } } }, "query": { "match_all": {} } }