ElasticSearch笔记-聚合查询
概述
聚合查询,它是在搜索的结果上,提供的一些聚合数据信息的方法。比如:求和、最大值、平均数等。
基本语法
GET <index_name>/_search
{
"aggs": {
"<aggs_name>": { // 聚合名称需要自己定义
"<agg_type>": {
"field": "<field_name>"
}
}
}
}
aggs_name:聚合名称
agg_type:聚合种类,比如是桶聚合(terms)或者是指标聚合(avg、sum、min、max等)
field_name:字段名称或者叫域名。
聚合类型
聚合查询可以分为:bucket聚合查询、metrics聚合查询
-
桶聚合(bucket)
相当于MYSQL的group by,作用是对数据分组,它将文档分为多个“桶”,然后在每个桶上进行统计分析。
bucket创建多个“存放“文档的桶。每个桶都与一个标准相关联(取决于聚合类型) ,该标准确定当前上下文中的文档是否“落入”它。换句话说,bucket 有效地定义了文档集。 -
指标聚合(metric)
相当于MYSQL的聚合函数,对数据分组进行统计,如max、min、avg、sum等 -
管道聚合(pipeline)
Pipeline 聚合在 Metrics 聚合的基础上进行进一步的计算和分析。它可以对桶中的统计指标进行计算、比较、筛选和排序。Pipeline 聚合是管道聚合查询的核心部分。
测试数据
点击查看代码
PUT school-data
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"properties": {
"class_id" : {
"type" : "long"
},
"age" : {
"type" : "long"
},
"birthday" : {
"type" : "date"
},
"fenshu" : {
"type" : "float"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"userid" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
POST school-data/_bulk
{ "index": {} }
{ "class_id": 1, "age": 18, "birthday": "2003-01-01","fenshu":99, "name": "John Doe", "userid": "abc123" }
{ "index": {} }
{ "class_id": 1, "age": 18, "birthday": "2001-05-10","fenshu":100, "name": "Jane Smith", "userid": "def456" }
{ "index": {} }
{ "class_id": 2, "age": 19, "birthday": "2002-09-15","fenshu":80, "name": "Bob Johnson", "userid": "ghi789" }
{ "index": {} }
{ "class_id": 2, "age": 17, "birthday": "2004-03-20","fenshu":45, "name": "Emily Brown", "userid": "jkl012" }
{ "index": {} }
{ "class_id": 2, "age": 16, "birthday": "2021-07-05","fenshu":50, "name": "Michael Wilson", "userid": "mno345" }
{ "index": {} }
{ "class_id": 3, "age": 18, "birthday": "2022-11-12","fenshu":78, "name": "Sophia Davis", "userid": "pqr678" }
{ "index": {} }
{ "class_id": 3, "age": 19, "birthday": "2002-02-25","fenshu":80, "name": "Oliver Martinez", "userid": "stu901" }
{ "index": {} }
{ "class_id": 3, "age": 17, "birthday": "2023-08-08","fenshu":95, "name": "Ava Anderson", "userid": "vwx234" }
{ "index": {} }
{ "class_id": 3, "age": 20, "birthday": "2023-07-01","fenshu":85, "name": "Liam Thomas", "userid": "yzab567" }
{ "index": {} }
{ "class_id": 4, "age": 18, "birthday": "2023-06-17","fenshu":1, "name": "Emma Garcia", "userid": "cde890" }
桶聚合
桶聚合类型
- Adjacency matrix
- Auto-interval date histogram
- Categorize text
- Children
- Composite
- Date histogram
- Date range
- Diversified sampler
- Filter
- Filters
- Geo-distance
- Geohash grid
- Geotile grid
- Global
- Histogram
- IP range
- Missing
- Multi Terms
- Nested
- Parent
- Range
- Rare terms
- Reverse nested
- Sampler
- Significant terms
- Significant text
- Terms
- Variable width histogram
- Subtleties of bucketing range fields
terms
每个唯一值一个桶,桶是动态构建的。
类似mysql中的 select type , count(type) from table group by type;
GET school-data/_search
{
"size": 0,
"aggs": {
"agg_class": {
"terms": {
"field": "class_id"
}
}
}
}
查询结果:
"aggregations" : {
"agg_class" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 3,
"doc_count" : 4
},
{
"key" : 2,
"doc_count" : 3
},
{
"key" : 1,
"doc_count" : 2
},
{
"key" : 4,
"doc_count" : 1
}
]
}
}
查询结果中可以得到每个分组的数量,这仅是一个bucket操作,doc_count是bucket操作默认执行的一个内置metric
基于桶的文档数排序:
GET school-data/_search
{
"size": 0,
"aggs": {
"agg_class": {
"terms": {
"field": "class_id",
"order": {
"_count": "desc"
}
}
}
}
}
基于分组key值排序:
GET school-data/_search
{
"size": 0,
"aggs": {
"agg_class": {
"terms": {
"field": "class_id",
"order": {
"_key": "desc"
}
}
}
}
}
使用 min_doc_count 选项可以只返回匹配超过配置的命中数的结果:
GET school-data/_search
{
"size": 0,
"aggs": {
"agg_class": {
"terms": {
"field": "class_id",
"min_doc_count": 3
}
}
}
}
range
按数值范围分组
"aggs": {
"range_height": {
"range": {
"field": "doc_height",
"ranges": [
{
"to": 800
},
{
"from": 800,
"to": 1000
},
{
"from": 1000
}
]
}
}
}
查询结果:
"aggregations" : {
"range_height" : {
"buckets" : [
{
"key" : "*-800.0",
"to" : 800.0,
"doc_count" : 61
},
{
"key" : "800.0-1000.0",
"from" : 800.0,
"to" : 1000.0,
"doc_count" : 9310
},
{
"key" : "1000.0-*",
"from" : 1000.0,
"doc_count" : 4453
}
]
}
}
date_range
基于时间范围分组
GET school-data/_search
{
"size": 0,
"aggs": {
"birthday_range": {
"date_range": {
"field": "birthday",
"ranges": [
{
"from": "2001-10-05T00:00:00.000Z",
"to": "now-300d/d"
},
{
"from": "now-300d/d",
"to": "now"
}
]
}
}
}
}
查询结果:
"birthday_range" : {
"buckets" : [
{
"key" : "2001-10-05T00:00:00.000Z-2022-10-05T00:00:00.000Z",
"from" : 1.00224E12,
"from_as_string" : "2001-10-05T00:00:00.000Z",
"to" : 1.664928E12,
"to_as_string" : "2022-10-05T00:00:00.000Z",
"doc_count" : 5
},
{
"key" : "2022-10-05T00:00:00.000Z-2023-08-01T07:05:42.485Z",
"from" : 1.664928E12,
"from_as_string" : "2022-10-05T00:00:00.000Z",
"to" : 1.690873542485E12,
"to_as_string" : "2023-08-01T07:05:42.485Z",
"doc_count" : 3
}
]
}
}
Multi Terms
当在聚合的时候需要对多个字段同时聚合的时候,可以使用 multi_terms 来完成(7.12.0支持multi_terms )
类比sql:select genre ,product, count(*) from products group by genre,product;
GET /<index_name>/_search
{
"aggs": {
"agg_name": {
"multi_terms": {
"terms": [
{
"field": "field_1"
},
{
"field": "field_2"
}
]
}
}
}
}
date_histogram
按时间区间统计平均分数
GET index5/people/_search
{
"size":0,
"aggs": {
"age_group": {
"date_histogram": {
"field": "birthday",
"interval": "year",
"format": "yyyy-MM-dd",
"min_doc_count": 0,
"extended_bounds": {
"min": "1985-01-01",
"max": "2020-01-01"
}
},
"aggs": {
"avg_fenshu": {
"avg": {
"field": "fenshu"
}
}
}
}
}
}
自动日期分组聚合 Auto Date Histogram
自动根据日期分组聚合,还提供了一个目标桶数,表示所需桶数,并自动选择桶的间隔,以最好地实现该目标。返回的桶数总是小于或等于这个目标数。Bucket 字段是可选的,如果没有指定,则默认为10个 bucket。
示例:要求3桶的目标。
POST /sales/_search?size=0
{
"aggs": {
"sales_over_time": {
"auto_date_histogram": {
"field": "date",
"buckets": 10
}
}
}
}
聚合中使用过滤器filter
GET school-data/_search
{
"size": 0,
"aggs": {
"agg_class": {
"terms": {
"field": "class_id"
},
"aggs": {
"dist_age": {
"terms": {
"field": "age",
"size": 10
}
}
}
}
}
}
查询结构:
GET school-data/_search
{
"size": 0,
"aggs": {
"agg_class": {
"terms": {
"field": "class_id"
},
"aggs": {
"dist_age": {
"terms": {
"field": "age",
"size": 10
}
}
}
}
}
}
稀有值查询 Rare terms
示例:查询桶中最多包含2条记录的桶。
GET school-data/_search
{
"size": 0,
"aggs": {
"agg_class": {
"rare_terms": {
"field": "class_id",
"max_doc_count": 2
}
}
}
}
缺少聚合 Missing aggregation
Missing聚合属于单桶聚合,对空值数据统计。
示例:统计没有价格的产品总数。
POST /sales/_search?size=0
{
"aggs": {
"products_without_a_price": {
"missing": { "field": "price" }
}
}
}
全局聚合 Global aggregators
定义搜索执行上下文中所有文档的单个存储桶。此上下文由您正在搜索的索引和文档类型定义,但不受搜索查询本身的影响。
示例:
聚合查询all_products中的平均价格统计不受外层query中的查询添加的影响。
GET school-data/_search
{
"size": 0,
"aggs": {
"agg_class": {
"terms": {
"field": "class_id"
},
"aggs": {
"dist_age": {
"cardinality": {
"field": "age"
}
},
"bucket_sorg_agg":{
"bucket_sort": {
"sort": [
{"dist_age":{
"order":"asc"
}
}
]
}
}
}
}
}
}
指标聚合案例
求全部学生的平均分数
GET index5/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"avg_fenshu": {
"avg": {
"field": "fenshu"
}
}
}
}
aggs:appgregations的缩写
avg_fenshu:聚合名称
求全部学生的总分、平均分、最大分、最小分
GET index6/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"fenshu_sum": {
"sum": {
"field": "fenshu"
}
},
"fenshu_avg": {
"avg": {
"field": "fenshu"
}
},
"fenshu_max": {
"max": {
"field": "fenshu"
}
},
"fenshu_min": {
"min": {
"field": "fenshu"
}
},
"age_distinct": {
"cardinality": {
"field": "age"
}
}
}
}
cardinality对每个bucket中指定field进行去重,取去重后的count,类似count(distinct)
查找每个班级最高分、最低分、平均分
需要使用多个metric,对bucket分析
GET index5/people/_search
{
"size":0,
"aggs": {
"class_group":{
"terms": {
"field": "classid"
},
"aggs": {
"avg_age": {
"avg": {
"field": "age"
}
},
"max_age":{
"max": {
"field": "age"
}
},
"min_age":{
"min":{
"field": "age"
}
}
}
}
}
}
按年龄区间统计平均分数
GET index5/people/_search
{
"size":0,
"aggs": {
"age_group": {
"histogram": {
"field": "age",
"interval": 1
},
"aggs": {
"fenshu": {
"avg": {
"field": "fenshu"
}
}
}
}
}
}
下钻
下钻的意思是在分组的基础上在进行分组,比如在班级分组的基础上再对年龄分组,最后对每个最小粒度的分组执行聚合分析操作,年龄分组的基础上计算平均分数
也就是需要两个bucket查询,一个metrics查询
GET index5/people/_search
{
"size":0,
"aggs": {
"sex_group": {
"terms": {
"field": "classid"
},
"aggs": {
"age_group": {
"terms": {
"field": "age"
},
"aggs":{
"avg_fenshu":{
"avg": {
"field": "fenshu"
}
}
}
}
}
}
}
}
global bucket
global bucket可排除查询条件,用于整体与按条件查询结果对比
查询1班的平均成绩与全部班级的平均成绩
GET index6/user/_search
{
"size":0,
"query": {
"term": {
"classid": {
"value": 3
}
}
},
"aggs": {
"age_avg": {
"avg": {
"field": "age"
}
},
"all_user":{
"global": {},
"aggs": {
"global_avg_fenshu": {
"avg": {
"field": "fenshu"
}
}
}
}
}
}
过滤bucket
aggs.filter,针对聚合过滤
如果filter放query里,是全局的。如果想对统计结果过滤,需要将filter放在aggs中
搜索1班90年出生学生的平均成绩
GET index5/people/_search
{
"size": 0,
"query": {
"match": {
"classid": 1
}
},
"aggs": {
"recent_90":{
"filter": {
"range": {
"birthday": {
"gte": "1990/01/01",
"lte": "1991/01/01"
}
}
},
"aggs": {
"avg_fenshu": {
"avg": {
"field": "fenshu"
}
}
}
}
}
}
排序
对分析的结果排序
按班级分组,并按平均分数排序
GET index5/people/_search
{
"size": 0,
"aggs": {
"class_group": {
"terms": {
"field": "classid",
"order": {
"avg_fenshu": "desc"
}
},
"aggs": {
"avg_fenshu": {
"avg": {
"field": "fenshu"
}
}
}
}
}
}
去重
对每个bucket中指定field进行去重,取去重后的count,类似count(distinct)
GET index5/people/_search
{
"size": 0,
"aggs": {
"years": {
"date_histogram": {
"field": "birthday",
"interval": "year",
"format": "yyyy-MM-dd",
"min_doc_count": 0,
"extended_bounds": {
"min": "1985-01-01",
"max": "2020-01-01"
}
},
"aggs": {
"age_distinct": {
"cardinality": {
"field": "age"
}
}
}
}
}
}
聚合结果排序
按age分组,显示前3组
"aggs":{
"bucket_page":{
"terms":{
"field":"age"
},
"aggs":{
"url_page":{
"bucket_sort":{
"from":0,
"size":3
}
}
}
}
}
按age分组,再按domain_userid去重后,再显示前3组
"aggs":{
"bucket_page":{
"terms":{
"field":"age"
},
"aggs":{
"bucket_user":{
"cardinality": {
"field": "domain_userid.keyword"
}
},
"url_page":{
"bucket_sort":{
"from":0,
"size":3
}
}
}
}
}
附带文档数据
使用es聚合时,,有时还需要获取query(或filter) 的相关文档结果(数据)。
比如统计各个地区编码的营业额,得到了聚合的统计结果,还想知道query结果中对应的地区名称,并根据营业额进行排序,
这时可以使用 top_hits。
GET meshop-track-*/_search
{
"query":{
"bool":{
"must":[
{
"term":{
"program_id.keyword":{
"value":"xxxxx"
}
}
}
]
}
},
"size":0,
"aggs": {
"url_group": {
"terms": {
"field": "page_urlpath.keyword",
"size": 10,
"order": {
"_count": "desc"
}
},
"aggs": {
"top_hits": {
"top_hits": {
"size": 1,
"_source": "page_title"
}
}
}
}
}
}
管道聚合
管道聚合主要用来处理来自其他聚合的产出结果,而不是来自文档集的产出,并将信息添加到最终的输出结果中。
管道聚合可以引用它们执行计算所需的聚合,方法是使用 bucket_path 参数来指示到所需指标的路径。
管道聚合类型:
max_bucket、min_bucket、sum_bucket
兄弟级管道聚合,用兄弟级聚合中指定指标的最小值标识 bucket,并输出 bucket 的值和键。指定的度量必须是数值的,同级聚合必须是多桶聚合。
GET school-data/_search
{
"size": 0,
"aggs": {
"agg_class": {
"terms": {
"field": "class_id"
},
"aggs": {
"dist_age": {
"cardinality": {
"field": "age"
}
}
}
},
"max_age":{
"max_bucket": {
"buckets_path": "agg_class>dist_age"
}
},
"min_age":{
"min_bucket": {
"buckets_path": "agg_class>dist_age"
}
},
"sum_age":{
"sum_bucket": {
"buckets_path": "agg_class>dist_age"
}
}
}
}
查询结果:
{
"aggregations" : {
"agg_class" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 3,
"doc_count" : 4,
"dist_age" : {
"value" : 4
}
},
{
"key" : 2,
"doc_count" : 3,
"dist_age" : {
"value" : 3
}
},
{
"key" : 1,
"doc_count" : 2,
"dist_age" : {
"value" : 1
}
},
{
"key" : 4,
"doc_count" : 1,
"dist_age" : {
"value" : 1
}
}
]
},
"max_age" : {
"value" : 4.0,
"keys" : [
"3"
]
},
"min_age" : {
"value" : 1.0,
"keys" : [
"1",
"4"
]
},
"sum_age" : {
"value" : 9.0
}
}
bucket_sort
父管道聚合,对其父多桶聚合的桶进行排序。可以将零个或多个排序字段与相应的排序顺序一起指定。每个 bucket 可以根据其_key、_count 或其子聚合进行排序。
此外,可以设置from和size的参数,以截断结果存储桶。
# 排序
GET school-data/_search
{
"size": 0,
"aggs": {
"agg_class": {
"terms": {
"field": "class_id"
},
"aggs": {
"dist_age": {
"cardinality": {
"field": "age"
}
},
"bucket_sorg_agg":{
"bucket_sort": {
"sort": [
{"dist_age":{
"order":"asc"
}
}
]
}
}
}
}
}
}
#分页
GET school-data/_search
{
"size": 0,
"aggs": {
"agg_class": {
"terms": {
"field": "class_id"
},
"aggs": {
"dist_age": {
"cardinality": {
"field": "age"
}
},
"bucket_sorg_agg":{
"bucket_sort": {
"sort": [
{"dist_age":{
"order":"asc"
}
}
],
"from": 1,
"size": 1
}
}
}
}
}
}
查询结果:
{
"aggregations" : {
"agg_class" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 4,
"doc_count" : 1,
"dist_age" : {
"value" : 1
}
},
{
"key" : 2,
"doc_count" : 3,
"dist_age" : {
"value" : 3
}
}
]
}
}