Elasticsearch7-分布式及分布式搜索机制

分布式特性

Elasticsearch的分布式带来的好处:

  • 存储的水平扩容
  • 提供系统的可用性,部分节点停止服务,整个集群不受影响

Elasticsearch的分布式架构

  • 不同集群通过不同集群名称区分,默认"elasticsearch"
  • 通过配置文件修改,或者在命令行中-E cluster.name="ops-es"进行设定

节点

节点是一个Elasticsearch实例:

  • 本质上就是一个JAVA进程
  • 一台机器上可以运行多个Elasticsearch进程,但是生产环境一般建议一台机器上就运行一个Elasticsearch实例

每一个节点都有名字,通过配置文件,或者启动的时候-E node.name=es01指定

每一个节点启动后,都会生产一个UID,保存在data目录下

Coordinating Node

处理请求的节点叫 Coordinating Node

  • 路由到正确的节点,例如创建索引,就路由到master节点

所有节点默认都是Coordinating Node

通过将其他类型设置成False,使其变成Coordinating Node节点

 Data Node

可以保存数据的节点,就叫Data Node节点

  • 节点启动后,默认就是数据节点,可以设置成node.data: false 禁止

Data Node的职责

  • 保存分片数据,在数据扩展上起到至关重要的作用,(由Master Node决定如何把分片分发到数据节点上)

通过增加数据节点

  • 可以解决数据水平扩展和解决数据单点的问题

 Master Node

Master Node的职责

  • 处理创建、删除索引等请求、决定分片分到那个节点
  • 维护并更新Cluster 状态

Master Node最佳实践

  • Master 节点非常重要,在部署的时候需要考虑单点的问题
  • 为一个集群设置多个Master节点/每一个节点只承担Master单一角色

集群状态信息

集群状态信息,维护一个集群中,必要信息

  • 所有节点信息
  • 所有索引和其相关的Mapping和setting信息
  • 分片路由信息

在每一个节点上都保存了集群的状态信息

但是,只有Master节点上才能修改集群状态的信息,并负责同步给其他节点

  • 因为,任意节点都能修改信息会导致Cluster state信息的不一致

Master Eligible Nodes & 选主的过程

相互ping对方,Node ID低的会成为被选举的节点

其他节点会加入集群,但是不承担Master 节点的角色,一旦发现被选中的节点丢失,就会选举出新的Master节点

脑裂问题

Split-Brain,分布式系统的经典网络问题,当出现网络问题,一个节点和其他节点无法连接

  • Node2 和Node3会重新选举Master
  • Node1 自己还是作为Master,组成一个集群,同时更新Cluster state
  • 导致2个Master节点,维护不同的cluster state。当网络恢复时,无法选择正确恢复

如何避免脑裂问题

限定一个选举条件,设置quorum(仲裁),只有在Master eligishble 节点数大于quorum时,才能进行选举

  • quorum = (master节点数/2)+1
  • 当3个master eligible时,设置discovery.zen.minimum_master_nodes为2,既避免脑裂

从7.0开始,无需此配置

  • 移除minimum_master_nodes参数,让Elasticsearch自己选择可以形成仲裁的节点
  • 典型的主节点选举现在只需要很短的时间就可以完成。集群的伸缩变得更安全、更容易、并且可能造成丢失数据的系统配置选项更少了
  • 节点更清楚的记录它们的状态,有助于判断为什么它们不能加入集群或为什么无法选举出主节点

Primary Shard

分片是Elasticsearch分布式存储基石

  • 主分片/副本分片

通过主分片将数据分布在所有节点上

  • primary shard,可以将一份索引的数据,分散在多个Data Node上,实现存储的水平扩展
  • 主分片数在索引创建时指定,后续默认不能修改,如需修改,需要重新索引

分片数设定

如何规划一个索引的主分片和副本分片数

  • 主分片数过小:例如创建1个primary shard 的index
    • 如果该索引增长很快,集群无法通过增加节点实现对这个索引的数据扩展
  • 主分片数设置过大:导致单个shard容量很小,引发一个节点上过多分片,影响性能
  • 副本分片设置过多,会降低集群整体写入性能

 文档存储在分片上

文档会存储在具体的某个主分片和副本分片上,例如:文档1,会存储在P0和R0分片上

文档到分片的映射算法:

  • 确保文档能均匀分布在所有分片上,充分利用硬件资源,避免部分机器空闲,部分机器繁忙
  • 潜在算法
    • 随机/Round Robin。当查询文档1,分片数很多,需要多次查询才可能查到文档1
    • 维护文档到分片的映射关系,当文档数据量很大的时候,维护成本高
    • 实时计算,通过文档1,自动算出,需要去那个分片上获取文档

文档到分片的路由算法

shard = hash(_routing) % number_of_primary_shards

  • hash算法确保文档均匀分散到分片中
  • 默认的_routing值是文档id
  • 可以自行限定_ronting数值,例如相同国家的商品,都分配到指定的shard
  • 设置Index settings 后,Primary数,不能随意修改的根本原因

分片的内部原理

什么是ES的分片

  • ES中最小的工作单元:是一个Lucene的index

一些问题:

  • 为什么ES的搜索是近实时的
  • ES如何保证在断电时数据也不会丢失
  • 为什么删除文档,并不会立即释放空间

倒排索引的不可变性

  • 倒排索引采用Immutable Design,一旦生产,不可更改
  • 不可变性,带来的好处:
    • 无需考虑并发写文件的问题,避免了锁机制带来的性能问题
    • 一旦写入内核的文件系统缓存,便留在哪里。只要文件系统存有足够的空间,大部分请求就会直接请求内存,不会命中磁盘,提升了很大的性能
    • 缓存容易生产和维护、数据可以被压缩
  • 不可变性,带来了的挑战:如果需要让一个新文档可以被搜索,需要从建整个索引。

Lucene Index

  • 在Lucene中,单个倒排索引文件被成为Segment,Sgement是自包含的,不可变更的,多个Sgement汇总在一起,称为Lucene的Index,其对应的就是ES中的Shard
  • 当有新文档写入时,会生成新的Segment,查询时会同时查询所有的Segment,并且对结果汇总,Lucene中有一个文件,用来记录所有Segment信息,叫做Commit Point
  • 删除的文档信息,保存在“.del”文件中

什么Refresh

  • 将Index Buffer写入Segment的过程叫Refresh。Refresh不执行fsync操作
  • Refresh频率:默认1秒发生一次,可通过index.refresh_interval配置。Refersh后,数据就可以被搜索到了。这也是为什么Elasticsearch是近实时查询的原因
  • 如果系统有大量的数据写入,那就会产生很多Segment
  • Index Buffer被占满时,会触发Refresh,默认值是JVM的10%

什么是Transaction Log

  • Segment写入磁盘的过程相对耗时,借助文件系统缓存,Refresh时,先将Segment写入缓存以开放查询
  • 为了保证数据不会丢失。所以在Index文档时,同时写Transaction Log,高版本开始,Transaction Log默认落盘,每个分片有一个Transaction Log
  • 在ES Refresh 时,Index Buffer被清空,Transaction Log不会被清空

 什么是Flush

ES Flush & Luence Commit

  • 调用Refresh,Index Buffer清空并且Refresh
  • 调用fsync,将缓存中的Segment写入磁盘
  • 清空Transaction Log
  • 默认30分钟调用一次
  • Transaction Log满(默认512M)

什么是Merge

  • Segment很多,需要被定期被合并
    • 减少Segment/删除已经删除的文档
  • ES和Luence会自动进行Merge操作
    • POST my_index/_forcemerge

 分布式搜索机制

Elasticsearch的搜索分为两步:

第一步-Query

第二部-Fetch

  • 用户发出搜索的请求到ES节点,节点搜到请求后,会以Coordinating节点身份,在6个主副本分片中随机选择3个分片,发出查询请求
  • 被选中的分片执行查询,进行排序。然后,每个分片都会返回From+Size个排序后文档id和排序值给Coordinating节点
  • Coordinating节点会将Query阶段,从每个分片获取的排序后的文档Id列表,重新进行排序。选取From到From + Size个文档的ID
  • 以 multi get 请求的方式,到相应的分片获取详细的文档数据

Query Then Fetch 的潜在问题

性能问题:

  • 每个分片上需要查的文档个数=From + Size
  • 最终协调节点需要处理:number_of_shard * (From+size)
  • 深度分页

相关性算分

  • 每一个都基于自己上分片数据进行相关度算分。这会导致大分偏离的情况,特别是数据量很少时,相关性算分在分片之间是相互独立,当文档总数很少情况下,如果主分片大于1,主分片数越多,相关性算法越不准

分页& 遍历

  • From:开始的位置
  • Size:期望获取文档的总数

ES天生就是分布式系统,查询信息,但是数据分别保存在多个分片中,多台机器上,ES天生就需要满足排序的需求(按照相关性算分)

当一个查询:From=990, Size=10

  • 会在每个分片中获取1000个文档。然后,在通过Coordinating Node聚合所有结果。最好再通过排序选取前1000个文档
  • 页数越深,占用内存越多。为了避免深度分页带来的内存开销,ES有一个设定,默认限定10000个文档

Search After避免深度分页的问题

  • 避免深度分页的性能问题,可以实时获取下一页文档信息
    • 不支持指定页数(From)
    • 只能往下分页
  • 第一步搜索需要指定sort,并且保证值是唯一的(可以通过加入_id保证唯一性)
  • 然后使用上一次,最后一个文档的sort值进行查询

 Bucket & Metric 聚合分析及嵌套聚合

  • Metric 一些一系列的统计方法
  • Bucket 一组满足条件的文档

Metric Aggregation

单值分析

  • max min avg sum
  • Cardinality(类似  distinct count)

多值分析

  • stats、extended stats
  • percentile、percentile rank
  • top hits

Demo

生产数据

#定义员工表索引的定义
PUT /employees/ 
{
  "mappings":{
    "properties":{
      "age":{
        "type": "integer"
      },
      "gender":{
        "type": "keyword"
      },
      "job":{
        "type": "text",
        "fields":{
          "keyword": {
            "type": "keyword",
            "ignore_above": 50
          }
        }
      },
      "name":{
        "type": "keyword"
      },
      "salary":{
        "type" : "integer"
      }
    }
  }
}
#插入数据
PUT /employees/_bulk
{ "index" : {  "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : {  "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : {  "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : {  "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : {  "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : {  "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : {  "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : {  "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : {  "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : {  "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : {  "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : {  "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : {  "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : {  "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : {  "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : {  "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : {  "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : {  "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}

测试样例

#Metric 聚合 找到最低工资
POST employees/_search
{
  "size":0,
  "aggs": {
    "min_salary": {
      "min": {
        "field": "salary"
      }
    }
  }
}
#查询结果
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "min_salary" : {
      "value" : 9000.0
    }
  }
}
#Metric 聚合 找到最高工资
POST employees/_search
{
  "size":0,
  "aggs": {
    "max_salary": {
      "max": {
        "field": "salary"
      }
    }
  }
}
#查询结果
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "max_salary" : {
      "value" : 50000.0
    }
  }
}
#多个Metric 聚合 找到 最低最高平均工资
POST employees/_search
{
  "size": 0,
  "aggs": {
    "max_salary": {
      "max": {
        "field": "salary"
      }
    },
    "min_salary": {
      "min": {
        "field": "salary"
      }
    },
    "avg_salary": {
      "avg": {
        "field": "salary"
      }
    }
  }
}
#查询结果
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "max_salary" : {
      "value" : 50000.0
    },
    "avg_salary" : {
      "value" : 24700.0
    },
    "min_salary" : {
      "value" : 9000.0
    }
  }
}
# 一个聚合,输出多值,统计
POST employees/_search
{
  "size": 0,
  "aggs": {
    "stats_salary": {
      "stats": {
        "field":"salary"
      }
    }
  }
}
#查询结果
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "stats_salary" : {
      "count" : 20,
      "min" : 9000.0,
      "max" : 50000.0,
      "avg" : 24700.0,
      "sum" : 494000.0
    }
  }
}

Bucket聚合分析

按照一定规则,将文档分配到不同的桶中,从而达到分类的目的,ES提供常见Bucket Aggregation

  • Terms
  • 数字类型
    • Range/Data Range
    • Histogram/Data Histogram
  • 支持嵌套(桶中桶)

Terms Aggregation

  • 字段需要打开fieldata,才能进行Terms Aggregation
    • keyword 默认支持Terms Aggregation
    • Text需要在Mapping中enable。会按照分词后的执行结果分

 

# 对job的keyword 进行聚合
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
      }
    }
  }
}
#查询结果
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "jobs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Java Programmer",
          "doc_count" : 7
        },
        {
          "key" : "Javascript Programmer",
          "doc_count" : 4
        },
        {
          "key" : "QA",
          "doc_count" : 3
        },
        {
          "key" : "DBA",
          "doc_count" : 2
        },
        {
          "key" : "Web Designer",
          "doc_count" : 2
        },
        {
          "key" : "Dev Manager",
          "doc_count" : 1
        },
        {
          "key" : "Product Manager",
          "doc_count" : 1
        }
      ]
    }
  }
}

对Text类型的进行聚合分析的话,需要打开fieldata功能

# 对 Text 字段打开 fielddata,支持terms aggregation
PUT employees/_mapping
{
  "properties" : {
    "job":{
       "type":     "text",
       "fielddata": true
    }
  }
}
# 对 Text 字段进行 terms 分词。分词后的terms
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job"
      }
    }
  }
}
#查询结果,而keyword不同,
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "jobs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "programmer",
          "doc_count" : 11
        },
        {
          "key" : "java",
          "doc_count" : 7
        },
        {
          "key" : "javascript",
          "doc_count" : 4
        },
        {
          "key" : "qa",
          "doc_count" : 3
        },
        {
          "key" : "dba",
          "doc_count" : 2
        },
        {
          "key" : "designer",
          "doc_count" : 2
        },
        {
          "key" : "manager",
          "doc_count" : 2
        },
        {
          "key" : "web",
          "doc_count" : 2
        },
        {
          "key" : "dev",
          "doc_count" : 1
        },
        {
          "key" : "product",
          "doc_count" : 1
        }
      ]
    }
  }
}

对terms统计的的做法

# 对job.keyword 和 job 进行 terms 聚合,分桶的总数并不一样
POST employees/_search
{
  "size": 0,
  "aggs": {
    "cardinate": {
      "cardinality": {
        "field": "job.keyword"
      }
    }
  }
}
#查询结果
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "cardinate" : {
      "value" : 7
    }
  }
}

对性别分桶

# 对 性别的 keyword 进行聚合
POST employees/_search
{
  "size": 0,
  "aggs": {
    "gender": {
      "terms": {
        "field":"gender"
      }
    }
  }
}
#查询结果
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "gender" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "male",
          "doc_count" : 12
        },
        {
          "key" : "female",
          "doc_count" : 8
        }
      ]
    }
  }
}

指定size

#指定 bucket 的 size
POST employees/_search
{
  "size": 0,
  "aggs": {
    "ages_5": {
      "terms": {
        "field":"age",
        "size":3
      }
    }
  }
}
#查询结果
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "ages_5" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 12,
      "buckets" : [
        {
          "key" : 25,
          "doc_count" : 3
        },
        {
          "key" : 32,
          "doc_count" : 3
        },
        {
          "key" : 27,
          "doc_count" : 2
        }
      ]
    }
  }
}

Bucket Size

# 指定size,不同工种中,年纪最大的3个员工的具体信息
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
      },
      "aggs":{
        "old_employee":{
          "top_hits":{
            "size":3,
            "sort":[
              {
                "age":{
                  "order":"desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
#查询结果
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "jobs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Java Programmer",
          "doc_count" : 7,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 7,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "11",
                  "_score" : null,
                  "_source" : {
                    "name" : "Jenny",
                    "age" : 36,
                    "job" : "Java Programmer",
                    "gender" : "female",
                    "salary" : 38000
                  },
                  "sort" : [
                    36
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "15",
                  "_score" : null,
                  "_source" : {
                    "name" : "King",
                    "age" : 33,
                    "job" : "Java Programmer",
                    "gender" : "male",
                    "salary" : 28000
                  },
                  "sort" : [
                    33
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "9",
                  "_score" : null,
                  "_source" : {
                    "name" : "Gregory",
                    "age" : 32,
                    "job" : "Java Programmer",
                    "gender" : "male",
                    "salary" : 22000
                  },
                  "sort" : [
                    32
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "Javascript Programmer",
          "doc_count" : 4,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 4,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "14",
                  "_score" : null,
                  "_source" : {
                    "name" : "Marshall",
                    "age" : 32,
                    "job" : "Javascript Programmer",
                    "gender" : "male",
                    "salary" : 25000
                  },
                  "sort" : [
                    32
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "18",
                  "_score" : null,
                  "_source" : {
                    "name" : "Catherine",
                    "age" : 29,
                    "job" : "Javascript Programmer",
                    "gender" : "female",
                    "salary" : 20000
                  },
                  "sort" : [
                    29
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "17",
                  "_score" : null,
                  "_source" : {
                    "name" : "Goodwin",
                    "age" : 25,
                    "job" : "Javascript Programmer",
                    "gender" : "male",
                    "salary" : 16000
                  },
                  "sort" : [
                    25
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "QA",
          "doc_count" : 3,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 3,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "6",
                  "_score" : null,
                  "_source" : {
                    "name" : "Lucy",
                    "age" : 31,
                    "job" : "QA",
                    "gender" : "female",
                    "salary" : 25000
                  },
                  "sort" : [
                    31
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "7",
                  "_score" : null,
                  "_source" : {
                    "name" : "Byrd",
                    "age" : 27,
                    "job" : "QA",
                    "gender" : "male",
                    "salary" : 20000
                  },
                  "sort" : [
                    27
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "5",
                  "_score" : null,
                  "_source" : {
                    "name" : "Rose",
                    "age" : 25,
                    "job" : "QA",
                    "gender" : "female",
                    "salary" : 18000
                  },
                  "sort" : [
                    25
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "DBA",
          "doc_count" : 2,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "19",
                  "_score" : null,
                  "_source" : {
                    "name" : "Boone",
                    "age" : 30,
                    "job" : "DBA",
                    "gender" : "male",
                    "salary" : 30000
                  },
                  "sort" : [
                    30
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "20",
                  "_score" : null,
                  "_source" : {
                    "name" : "Kathy",
                    "age" : 29,
                    "job" : "DBA",
                    "gender" : "female",
                    "salary" : 20000
                  },
                  "sort" : [
                    29
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "Web Designer",
          "doc_count" : 2,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "4",
                  "_score" : null,
                  "_source" : {
                    "name" : "Rivera",
                    "age" : 26,
                    "job" : "Web Designer",
                    "gender" : "female",
                    "salary" : 22000
                  },
                  "sort" : [
                    26
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "3",
                  "_score" : null,
                  "_source" : {
                    "name" : "Tran",
                    "age" : 25,
                    "job" : "Web Designer",
                    "gender" : "male",
                    "salary" : 18000
                  },
                  "sort" : [
                    25
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "Dev Manager",
          "doc_count" : 1,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "2",
                  "_score" : null,
                  "_source" : {
                    "name" : "Underwood",
                    "age" : 41,
                    "job" : "Dev Manager",
                    "gender" : "male",
                    "salary" : 50000
                  },
                  "sort" : [
                    41
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "Product Manager",
          "doc_count" : 1,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_score" : null,
                  "_source" : {
                    "name" : "Emma",
                    "age" : 32,
                    "job" : "Product Manager",
                    "gender" : "female",
                    "salary" : 35000
                  },
                  "sort" : [
                    32
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}

#Ranges 分桶

#Salary Ranges 分桶,可以自己定义 key
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_range": {
      "range": {
        "field":"salary",
        "ranges":[
          {
            "to":10000
          },
          {
            "from":10000,
            "to":20000
          },
          {
            "key":">20000",
            "from":20000
          }
        ]
      }
    }
  }
}
#查询结果
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "salary_range" : {
      "buckets" : [
        {
          "key" : "*-10000.0",
          "to" : 10000.0,
          "doc_count" : 1
        },
        {
          "key" : "10000.0-20000.0",
          "from" : 10000.0,
          "to" : 20000.0,
          "doc_count" : 4
        },
        {
          "key" : ">20000",
          "from" : 20000.0,
          "doc_count" : 15
        }
      ]
    }
  }
}
#Salary Histogram,工资0到10万,以 5000一个区间进行分桶
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_histrogram": {
      "histogram": {
        "field":"salary",
        "interval":5000,
        "extended_bounds":{
          "min":0,
          "max":100000

        }
      }
    }
  }
}

Bucket 子聚合分析、子聚合可以是Bucket 或者 Metric

# 嵌套聚合1,按照工作类型分桶,并统计工资信息
POST employees/_search
{
  "size": 0,
  "aggs": {
    "Job_salary_stats": {
      "terms": {
        "field": "job.keyword"
      },
      "aggs": {
        "salary": {
          "stats": {
            "field": "salary"
          }
        }
      }
    }
  }
}
#查询结果
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "Job_salary_stats" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Java Programmer",
          "doc_count" : 7,
          "salary" : {
            "count" : 7,
            "min" : 9000.0,
            "max" : 38000.0,
            "avg" : 25571.428571428572,
            "sum" : 179000.0
          }
        },
        {
          "key" : "Javascript Programmer",
          "doc_count" : 4,
          "salary" : {
            "count" : 4,
            "min" : 16000.0,
            "max" : 25000.0,
            "avg" : 19250.0,
            "sum" : 77000.0
          }
        },
        {
          "key" : "QA",
          "doc_count" : 3,
          "salary" : {
            "count" : 3,
            "min" : 18000.0,
            "max" : 25000.0,
            "avg" : 21000.0,
            "sum" : 63000.0
          }
        },
        {
          "key" : "DBA",
          "doc_count" : 2,
          "salary" : {
            "count" : 2,
            "min" : 20000.0,
            "max" : 30000.0,
            "avg" : 25000.0,
            "sum" : 50000.0
          }
        },
        {
          "key" : "Web Designer",
          "doc_count" : 2,
          "salary" : {
            "count" : 2,
            "min" : 18000.0,
            "max" : 22000.0,
            "avg" : 20000.0,
            "sum" : 40000.0
          }
        },
        {
          "key" : "Dev Manager",
          "doc_count" : 1,
          "salary" : {
            "count" : 1,
            "min" : 50000.0,
            "max" : 50000.0,
            "avg" : 50000.0,
            "sum" : 50000.0
          }
        },
        {
          "key" : "Product Manager",
          "doc_count" : 1,
          "salary" : {
            "count" : 1,
            "min" : 35000.0,
            "max" : 35000.0,
            "avg" : 35000.0,
            "sum" : 35000.0
          }
        }
      ]
    }
  }
}
# 多次嵌套。根据工作类型分桶,然后按照性别分桶,计算工资的统计信息
POST employees/_search
{
  "size": 0,
  "aggs": {
    "Job_gender_stats": {
      "terms": {
        "field": "job.keyword"
      },
      "aggs": {
        "gender_stats": {
          "terms": {
            "field": "gender"
          },
          "aggs": {
            "salary_stats": {
              "stats": {
                "field": "salary"
              }
            }
          }
        }
      }
    }
  }
}
#查询结果
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "Job_gender_stats" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Java Programmer",
          "doc_count" : 7,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "male",
                "doc_count" : 5,
                "salary_stats" : {
                  "count" : 5,
                  "min" : 9000.0,
                  "max" : 32000.0,
                  "avg" : 22200.0,
                  "sum" : 111000.0
                }
              },
              {
                "key" : "female",
                "doc_count" : 2,
                "salary_stats" : {
                  "count" : 2,
                  "min" : 30000.0,
                  "max" : 38000.0,
                  "avg" : 34000.0,
                  "sum" : 68000.0
                }
              }
            ]
          }
        },
        {
          "key" : "Javascript Programmer",
          "doc_count" : 4,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "male",
                "doc_count" : 3,
                "salary_stats" : {
                  "count" : 3,
                  "min" : 16000.0,
                  "max" : 25000.0,
                  "avg" : 19000.0,
                  "sum" : 57000.0
                }
              },
              {
                "key" : "female",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 20000.0,
                  "max" : 20000.0,
                  "avg" : 20000.0,
                  "sum" : 20000.0
                }
              }
            ]
          }
        },
        {
          "key" : "QA",
          "doc_count" : 3,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "female",
                "doc_count" : 2,
                "salary_stats" : {
                  "count" : 2,
                  "min" : 18000.0,
                  "max" : 25000.0,
                  "avg" : 21500.0,
                  "sum" : 43000.0
                }
              },
              {
                "key" : "male",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 20000.0,
                  "max" : 20000.0,
                  "avg" : 20000.0,
                  "sum" : 20000.0
                }
              }
            ]
          }
        },
        {
          "key" : "DBA",
          "doc_count" : 2,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "female",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 20000.0,
                  "max" : 20000.0,
                  "avg" : 20000.0,
                  "sum" : 20000.0
                }
              },
              {
                "key" : "male",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 30000.0,
                  "max" : 30000.0,
                  "avg" : 30000.0,
                  "sum" : 30000.0
                }
              }
            ]
          }
        },
        {
          "key" : "Web Designer",
          "doc_count" : 2,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "female",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 22000.0,
                  "max" : 22000.0,
                  "avg" : 22000.0,
                  "sum" : 22000.0
                }
              },
              {
                "key" : "male",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 18000.0,
                  "max" : 18000.0,
                  "avg" : 18000.0,
                  "sum" : 18000.0
                }
              }
            ]
          }
        },
        {
          "key" : "Dev Manager",
          "doc_count" : 1,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "male",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 50000.0,
                  "max" : 50000.0,
                  "avg" : 50000.0,
                  "sum" : 50000.0
                }
              }
            ]
          }
        },
        {
          "key" : "Product Manager",
          "doc_count" : 1,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "female",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 35000.0,
                  "max" : 35000.0,
                  "avg" : 35000.0,
                  "sum" : 35000.0
                }
              }
            ]
          }
        }
      ]
    }
  }
}

 Pipeline 聚合分析

管道的概念:支持聚合分析的结果,再次聚合分析

Pipeline的分析结果输出到原结果当中,根据位置的不同,分为两类:

  • sibling  结果和现有结果同级
    • min max avg sum Bucket
    • stats,Extended status Bucket
    • Percentiles Bucket
  • parent 结果内嵌到现有聚合分析结果之中
    • Derivative(求导)
    • Cumultive Sum (累计求和)
    • Moving Function (移动窗口)

 

# 平均工资最低的工作类型
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "min_salary_by_job":{
      "min_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}


# 平均工资最高的工作类型
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "max_salary_by_job":{
      "max_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}


# 平均工资的平均工资
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "avg_salary_by_job":{
      "avg_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}


# 平均工资的统计分析
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "stats_salary_by_job":{
      "stats_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}


# 平均工资的百分位数
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "percentiles_salary_by_job":{
      "percentiles_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}



#按照年龄对平均工资求导
POST employees/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "min_doc_count": 1,
        "interval": 1
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        },
        "derivative_avg_salary":{
          "derivative": {
            "buckets_path": "avg_salary"
          }
        }
      }
    }
  }
}


#Cumulative_sum
POST employees/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "min_doc_count": 1,
        "interval": 1
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        },
        "cumulative_salary":{
          "cumulative_sum": {
            "buckets_path": "avg_salary"
          }
        }
      }
    }
  }
}

#Moving Function
POST employees/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "min_doc_count": 1,
        "interval": 1
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        },
        "moving_avg_salary":{
          "moving_fn": {
            "buckets_path": "avg_salary",
            "window":10,
            "script": "MovingFunctions.min(values)"
          }
        }
      }
    }
  }
}

作用范围和排序

ES聚合分析默认作用范围是query的查询结果集

同时ES还支持一下方式改变聚合查询的作用范围

  • Filter
  • Post Filter
  • Global
#作用范围
# Query 的作用范围
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 20
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
        
      }
    }
  }
}


#Filter 的作用范围
POST employees/_search
{
  "size": 0,
  "aggs": {
    "older_person": {
      "filter":{
        "range":{
          "age":{
            "from":35
          }
        }
      },
      "aggs":{
         "jobs":{
           "terms": {
        "field":"job.keyword"
      }
      }
    }},
    "all_jobs": {
      "terms": {
        "field":"job.keyword"
        
      }
    }
  }
}



#Post field. 一条语句,找出所有的job类型。还能找到聚合后符合条件的结果
POST employees/_search
{
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  },
  "post_filter": {
    "match": {
      "job.keyword": "Dev Manager"
    }
  }
}


#global
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 40
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
        
      }
    },
    
    "all":{
      "global":{},
      "aggs":{
        "salary_avg":{
          "avg":{
            "field":"salary"
          }
        }
      }
    }
  }
}

排序:

指定order,安装count和key进行排序

  • 默认情况下,按照count降序排序
  • 指定size,就能返回相应的桶
#排序 order
#count and key
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 20
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword",
        "order":[
          {"_count":"asc"},
          {"_key":"desc"}
          ]
        
      }
    }
  }
}


#排序 order
#count and key
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword",
        "order":[  {
            "avg_salary":"desc"
          }]
        
        
      },
    "aggs": {
      "avg_salary": {
        "avg": {
          "field":"salary"
        }
      }
    }
    }
  }
}


#排序 order
#count and key
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword",
        "order":[  {
            "stats_salary.min":"desc"
          }]
        
        
      },
    "aggs": {
      "stats_salary": {
        "stats": {
          "field":"salary"
        }
      }
    }
    }
  }
}

 UpdateByQuery & Reindex

使用场景:

一般以下情况,需要重新索引

  • 索引的mapping发送变更:字段类型、分词器及字典更新
  • 索引的setting发送变更:索引主分片数发送改变
  • 集群内,集群间需要做数据迁移

 ES内置提供的API

  •  UpdateByQuery 在现有索引上重建

  • Reindex 在其他索引上重建索引

 案例1

#重建索引
DELETE blogs/

# 写入文档
PUT blogs/_doc/1
{
  "content":"Hadoop is cool",
  "keyword":"hadoop"
}

# 查看 Mapping
GET blogs/_mapping

# 修改 Mapping,增加子字段,使用英文分词器
PUT blogs/_mapping
{
      "properties" : {
        "content" : {
          "type" : "text",
          "fields" : {
            "english" : {
              "type" : "text",
              "analyzer":"english"
            }
          }
        }
      }
    }
# 写入文档
PUT blogs/_doc/2
{
  "content":"Elasticsearch rocks",
    "keyword":"elasticsearch"
}

# 查询新写入文档
POST blogs/_search
{
  "query": {
    "match": {
      "content.english": "Elasticsearch"
    }
  }

}

# 查询 Mapping 变更前写入的文档
POST blogs/_search
{
  "query": {
    "match": {
      "content.english": "Hadoop"
    }
  }
}


# Update所有文档
POST blogs/_update_by_query
{

}

# 执行update_by_query后 再查询之前写入的文档
POST blogs/_search
{
  "query": {
    "match": {
      "content.english": "Hadoop"
    }
  }
}

案例2,更新已有字段的mapping

  • ES不允许在原有mapping上对字段类型进行修改
  • 只能创建新的索引,并且设定正确的字段类型,再重新导入数据
# 查询
GET blogs/_mapping
#结果查询,我们看keyword 的字段类型是Text
{
  "blogs" : {
    "mappings" : {
      "properties" : {
        "content" : {
          "type" : "text",
          "fields" : {
            "english" : {
              "type" : "text",
              "analyzer" : "english"
            },
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "keyword" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}
#尝试修改类型,报错,ES不允许对已有字段进行修改
PUT blogs/_mapping
{
        "properties" : {
        "content" : {
          "type" : "text",
          "fields" : {
            "english" : {
              "type" : "text",
              "analyzer" : "english"
            }
          }
        },
        "keyword" : {
          "type" : "keyword"
        }
      }
}
# 创建新的索引并且设定新的Mapping
PUT blogs_fix/
{
  "mappings": {
        "properties" : {
        "content" : {
          "type" : "text",
          "fields" : {
            "english" : {
              "type" : "text",
              "analyzer" : "english"
            }
          }
        },
        "keyword" : {
          "type" : "keyword"
        }
      }    
  }
}
# Reindx API
POST  _reindex
{
  "source": {
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fix"
  }
}
#查看新索引
GET  blogs_fix/_doc/1
#查询结果
{
  "_index" : "blogs_fix",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "content" : "Hadoop is cool",
    "keyword" : "hadoop"
  }
}
# 测试 Term Aggregation
POST blogs_fix/_search
{
  "size": 0,
  "aggs": {
    "blog_keyword": {
      "terms": {
        "field": "keyword",
        "size": 10
      }
    }
  }
}
#我们修改成keyword类型,只有keyword 才能Term Aggregation
#查询结果
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "blog_keyword" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "elasticsearch",
          "doc_count" : 1
        },
        {
          "key" : "hadoop",
          "doc_count" : 1
        }
      ]
    }
  }
}

Reindex以上总结

Reindex API支持从一个索引拷贝到另一个索引中

使用ReindexAPI的场景:

  • 修改索引的主分片数
  • 改变字段的Mapping字段类型
  • 集群内/外 数据迁移

 IngestPipeline & PainlessScript

Ingest Node

ES5.0后,引入的一种新的节点类型,默认配置下,每个节点都是Ingest Node

  • 具有预处理数据的能力,可拦截Index或者Bulk API 的请求
  • 对数据进行转换,并重新返回给Index 或者Bulk API

无需Logstash,就可以进行数据的预处理,例如:

  • 为某个字段设置默认值:重命名某个字段的字段名;对字段进行Split操作
  • 支持设置Painless脚本,对数据进行更多复杂加工

Demo

创建文档

#Blog数据,包含3个字段,tags用逗号间隔
PUT tech_blogs/_doc/1
{
  "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You konw, for big data"
}
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    // 按,切割
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You konw, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You konw, for cloud"
      }
    }
  ]
}
#同时为文档,增加一个字段。blog查看量
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
// 增加一个字段,
      {
        "set":{
          "field": "views",
          "value": 0
        }
      }
    ]
  },

  "docs": [
    {
      "_index":"index",
      "_id":"id",
      "_source":{
        "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You konw, for big data"
      }
    },


    {
      "_index":"index",
      "_id":"idxx",
      "_source":{
        "title":"Introducing cloud computering",
  "tags":"openstack,k8s",
  "content":"You konw, for cloud"
      }
    }

    ]
}

以上是测试可以使用,我们测试完成后,在ES上创建一个Pipeline

PUT _ingest/pipeline/blog_pipeline
{
  "description": "a blog pipeline",
  "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },

      {
        "set":{
          "field": "views",
          "value": 0
        }
      }
    ]
}
#查看Pipleline
GET _ingest/pipeline/blog_pipeline
#测试pipeline,只需要提供文档的数组就可以了
POST _ingest/pipeline/blog_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You konw, for cloud"
      }
    }
  ]
}
#测试2  情况索引
DELETE tech_blogs

#不使用pipeline更新数据
PUT tech_blogs/_doc/1
{
  "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You konw, for big data"
}

#使用pipeline更新数据
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
  "title": "Introducing cloud computering",
  "tags": "openstack,k8s",
  "content": "You konw, for cloud"
}


#查看两条数据,一条被处理,一条未被处理
POST tech_blogs/_search
{}

#update_by_query 会导致错误
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
}

#增加update_by_query的条件
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
    "query": {
        "bool": {
            "must_not": {
                "exists": {
                    "field": "views"
                }
            }
        }
    }
}
#再次索引,这次我们可以看到文档1也被pipeline处理了
POST tech_blogs/_search

一些内置的Processors

  • Split 给一个字段分成数组
  • Remove / Rename 移除或者重命名一个字段
  • Append 增加一个新标签
  • Convert 从字符串转换成float类型
  • Date / JSON 日期格式转换,字符串转JSON
  • Data Index Name 将通过该处理器的文档,分配到指定时间格式的索引中
  • Fail 一旦出现异常,该Pipeline指定的错误信息能返回给用户
  • Foreach 数组字段,数组的每个元素都会使用到一个相同的处理器
  • Grok 日志的格式切割
  • Gsub /Join /Split 字符串转换 数组转换字符串 字符串转换数组
  • Lowercase /Upcase  大小写转换

Painless

  • 自ES5.x后引入,专门为ES设计,扩展了JAVA的语法
  • 6.0开始,ES只支持Painless。Groovy JavaScript和Python 都不支持
  • Painless支持所有java数据类型及Java API子集
  • Painless Script 具备以下特性:
    • 高性能 / 安全
    • 支持显示类型或者动态定义类型

Painless 用途:

可以对文档字段加工处理

  • 更新删除字段,处理数据聚合操作
  • Script Field: 对返回字段提前进行计算
  • Fcunction Score: 对文档的算分进行处理

在Ingest Pipeline 中执行脚本

在Reindex API, Update By Query时,对数据进行处理

#########Demo for Painless###############

# 增加一个 Script Prcessor
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
      {
        "script": {
          "source": """
          if(ctx.containsKey("content")){
            ctx.content_length = ctx.content.length();
          }else{
            ctx.content_length=0;
          }


          """
        }
      },

      {
        "set":{
          "field": "views",
          "value": 0
        }
      }
    ]
  },

  "docs": [
    {
      "_index":"index",
      "_id":"id",
      "_source":{
        "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You konw, for big data"
      }
    },


    {
      "_index":"index",
      "_id":"idxx",
      "_source":{
        "title":"Introducing cloud computering",
  "tags":"openstack,k8s",
  "content":"You konw, for cloud"
      }
    }

    ]
}


DELETE tech_blogs
PUT tech_blogs/_doc/1
{
  "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You konw, for big data",
  "views":0
}

POST tech_blogs/_update/1
{
  "script": {
    "source": "ctx._source.views += params.new_views",
    "params": {
      "new_views":100
    }
  }
}

# 查看views计数
POST tech_blogs/_search
{

}

#保存脚本在 Cluster State
POST _scripts/update_views
{
  "script":{
    "lang": "painless",
    "source": "ctx._source.views += params.new_views"
  }
}

POST tech_blogs/_update/1
{
  "script": {
    "id": "update_views",
    "params": {
      "new_views":1000
    }
  }
}


GET tech_blogs/_search
{
  "script_fields": {
    "rnd_views": {
      "script": {
        "lang": "painless",
        "source": """
          java.util.Random rnd = new Random();
          doc['views'].value+rnd.nextInt(1000);
        """
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

 

posted @ 2020-01-02 21:34  大胖猴  阅读(1749)  评论(0编辑  收藏  举报