ElasticSearch笔记-聚合查询

概述

聚合查询,它是在搜索的结果上,提供的一些聚合数据信息的方法。比如:求和、最大值、平均数等。

基本语法

GET <index_name>/_search
{
  "aggs": {
    "<aggs_name>": { // 聚合名称需要自己定义
      "<agg_type>": {
        "field": "<field_name>"
      }
    }
  }
}

aggs_name:聚合名称
agg_type:聚合种类,比如是桶聚合(terms)或者是指标聚合(avg、sum、min、max等)
field_name:字段名称或者叫域名。

聚合类型

聚合查询可以分为:bucket聚合查询、metrics聚合查询

  • 桶聚合(bucket)
    相当于MYSQL的group by,作用是对数据分组,它将文档分为多个“桶”,然后在每个桶上进行统计分析。
    bucket创建多个“存放“文档的桶。每个桶都与一个标准相关联(取决于聚合类型) ,该标准确定当前上下文中的文档是否“落入”它。换句话说,bucket 有效地定义了文档集。

  • 指标聚合(metric)
    相当于MYSQL的聚合函数,对数据分组进行统计,如max、min、avg、sum等

  • 管道聚合(pipeline)
    Pipeline 聚合在 Metrics 聚合的基础上进行进一步的计算和分析。它可以对桶中的统计指标进行计算、比较、筛选和排序。Pipeline 聚合是管道聚合查询的核心部分。

测试数据

点击查看代码

PUT school-data
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "class_id" : {
          "type" : "long"
        },
      "age" : {
          "type" : "long"
        },
      "birthday" : {
          "type" : "date"
        },
      "fenshu" : {
          "type" : "float"
        },
      "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
      "userid" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
    }
  }
}

POST school-data/_bulk
{ "index": {} }
{ "class_id": 1, "age": 18, "birthday": "2003-01-01","fenshu":99, "name": "John Doe", "userid": "abc123" }
{ "index": {} }
{ "class_id": 1, "age": 18, "birthday": "2001-05-10","fenshu":100,  "name": "Jane Smith", "userid": "def456" }
{ "index": {} }
{ "class_id": 2, "age": 19, "birthday": "2002-09-15","fenshu":80,  "name": "Bob Johnson", "userid": "ghi789" }
{ "index": {} }
{ "class_id": 2, "age": 17, "birthday": "2004-03-20","fenshu":45,  "name": "Emily Brown", "userid": "jkl012" }
{ "index": {} }
{ "class_id": 2, "age": 16, "birthday": "2021-07-05","fenshu":50,  "name": "Michael Wilson", "userid": "mno345" }
{ "index": {} }
{ "class_id": 3, "age": 18, "birthday": "2022-11-12","fenshu":78,  "name": "Sophia Davis", "userid": "pqr678" }
{ "index": {} }
{ "class_id": 3, "age": 19, "birthday": "2002-02-25","fenshu":80,  "name": "Oliver Martinez", "userid": "stu901" }
{ "index": {} }
{ "class_id": 3, "age": 17, "birthday": "2023-08-08","fenshu":95,  "name": "Ava Anderson", "userid": "vwx234" }
{ "index": {} }
{ "class_id": 3, "age": 20, "birthday": "2023-07-01","fenshu":85,  "name": "Liam Thomas", "userid": "yzab567" }
{ "index": {} }
{ "class_id": 4, "age": 18, "birthday": "2023-06-17","fenshu":1,  "name": "Emma Garcia", "userid": "cde890" }

桶聚合

桶聚合类型

官方文档:https://www.elastic.co/guide/en/elasticsearch/reference/7.16/search-aggregations-bucket-adjacency-matrix-aggregation.html

  • Adjacency matrix
  • Auto-interval date histogram
  • Categorize text
  • Children
  • Composite
  • Date histogram
  • Date range
  • Diversified sampler
  • Filter
  • Filters
  • Geo-distance
  • Geohash grid
  • Geotile grid
  • Global
  • Histogram
  • IP range
  • Missing
  • Multi Terms
  • Nested
  • Parent
  • Range
  • Rare terms
  • Reverse nested
  • Sampler
  • Significant terms
  • Significant text
  • Terms
  • Variable width histogram
  • Subtleties of bucketing range fields

terms

每个唯一值一个桶,桶是动态构建的。
类似mysql中的 select type , count(type) from table group by type;

GET school-data/_search
{
  "size": 0, 
  "aggs": {
    "agg_class": {
      "terms": {
        "field": "class_id"
      }
    }
  }
}

查询结果:

  "aggregations" : {
    "agg_class" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 3,
          "doc_count" : 4
        },
        {
          "key" : 2,
          "doc_count" : 3
        },
        {
          "key" : 1,
          "doc_count" : 2
        },
        {
          "key" : 4,
          "doc_count" : 1
        }
      ]
    }
  }

查询结果中可以得到每个分组的数量,这仅是一个bucket操作,doc_count是bucket操作默认执行的一个内置metric
基于桶的文档数排序:

GET school-data/_search
{
  "size": 0, 
  "aggs": {
    "agg_class": {
      "terms": {
        "field": "class_id",
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}

基于分组key值排序:

GET school-data/_search
{
  "size": 0, 
  "aggs": {
    "agg_class": {
      "terms": {
        "field": "class_id",
        "order": {
          "_key": "desc"
        }
      }
    }
  }
}

使用 min_doc_count 选项可以只返回匹配超过配置的命中数的结果:

GET school-data/_search
{
  "size": 0, 
  "aggs": {
    "agg_class": {
      "terms": {
        "field": "class_id",
        "min_doc_count": 3
      }
    }
  }
}

range

按数值范围分组

"aggs": {
      "range_height": {
        "range": {
          "field": "doc_height",
          "ranges": [
            {
              "to": 800
            },
            {
              "from": 800,
              "to": 1000
            },
            {
              "from": 1000
            }
          ]
        }
      }
    }

查询结果:

"aggregations" : {
    "range_height" : {
      "buckets" : [
        {
          "key" : "*-800.0",
          "to" : 800.0,
          "doc_count" : 61
        },
        {
          "key" : "800.0-1000.0",
          "from" : 800.0,
          "to" : 1000.0,
          "doc_count" : 9310
        },
        {
          "key" : "1000.0-*",
          "from" : 1000.0,
          "doc_count" : 4453
        }
      ]
    }
  }

date_range

基于时间范围分组

GET school-data/_search
{
  "size": 0, 
  "aggs": {
    "birthday_range": {
      "date_range": {
        "field": "birthday",
        "ranges": [
          {
            "from": "2001-10-05T00:00:00.000Z",
            "to": "now-300d/d"
          },
          {
            "from": "now-300d/d",
            "to": "now"
          }
        ]
      }
    }
  }
}

查询结果:

"birthday_range" : {
      "buckets" : [
        {
          "key" : "2001-10-05T00:00:00.000Z-2022-10-05T00:00:00.000Z",
          "from" : 1.00224E12,
          "from_as_string" : "2001-10-05T00:00:00.000Z",
          "to" : 1.664928E12,
          "to_as_string" : "2022-10-05T00:00:00.000Z",
          "doc_count" : 5
        },
        {
          "key" : "2022-10-05T00:00:00.000Z-2023-08-01T07:05:42.485Z",
          "from" : 1.664928E12,
          "from_as_string" : "2022-10-05T00:00:00.000Z",
          "to" : 1.690873542485E12,
          "to_as_string" : "2023-08-01T07:05:42.485Z",
          "doc_count" : 3
        }
      ]
    }
  }

Multi Terms

当在聚合的时候需要对多个字段同时聚合的时候,可以使用 multi_terms 来完成(7.12.0支持multi_terms )
类比sql:select genre ,product, count(*) from products group by genre,product;

GET /<index_name>/_search
{
  "aggs": {
    "agg_name": {
      "multi_terms": {
        "terms": [
          {
            "field": "field_1"
          },
          {
            "field": "field_2"
          }
        ]
      }
    }
  }
}

date_histogram

按时间区间统计平均分数

GET index5/people/_search
{
  "size":0,
  "aggs": {
    "age_group": {
      "date_histogram": {
        "field": "birthday",
        "interval": "year",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "1985-01-01",
          "max": "2020-01-01"
        }
      },
      "aggs": {
        "avg_fenshu": {
          "avg": {
            "field": "fenshu"
          }
        }
      }
    }    
  }
}

自动日期分组聚合 Auto Date Histogram

自动根据日期分组聚合,还提供了一个目标桶数,表示所需桶数,并自动选择桶的间隔,以最好地实现该目标。返回的桶数总是小于或等于这个目标数。Bucket 字段是可选的,如果没有指定,则默认为10个 bucket。
示例:要求3桶的目标。

POST /sales/_search?size=0
{
  "aggs": {
    "sales_over_time": {
      "auto_date_histogram": {
        "field": "date",
        "buckets": 10
      }
    }
  }
}

聚合中使用过滤器filter

GET school-data/_search
{
  "size": 0, 
  "aggs": {
    "agg_class": {
      "terms": {
        "field": "class_id"
      },
      "aggs": {
        "dist_age": {
          "terms": {
            "field": "age",
            "size": 10
          }
        }
      }
    }
  }
}

查询结构:


GET school-data/_search
{
  "size": 0, 
  "aggs": {
    "agg_class": {
      "terms": {
        "field": "class_id"
      },
      "aggs": {
        "dist_age": {
          "terms": {
            "field": "age",
            "size": 10
          }
        }
      }
    }
  }
}

稀有值查询 Rare terms

示例:查询桶中最多包含2条记录的桶。

GET school-data/_search
{
  "size": 0, 
  "aggs": {
    "agg_class": {
      "rare_terms": {
        "field": "class_id",
        "max_doc_count": 2
      }
    }
  }
}

缺少聚合 Missing aggregation

Missing聚合属于单桶聚合,对空值数据统计。

示例:统计没有价格的产品总数。

POST /sales/_search?size=0
{
  "aggs": {
    "products_without_a_price": {
      "missing": { "field": "price" }
    }
  }
}

全局聚合 Global aggregators

定义搜索执行上下文中所有文档的单个存储桶。此上下文由您正在搜索的索引和文档类型定义,但不受搜索查询本身的影响。

示例:
聚合查询all_products中的平均价格统计不受外层query中的查询添加的影响。

GET school-data/_search
{
  "size": 0, 
  "aggs": {
    "agg_class": {
      "terms": {
        "field": "class_id"
      },
      "aggs": {
        "dist_age": {
          "cardinality": {
            "field": "age"
          }
        },
        "bucket_sorg_agg":{
          "bucket_sort": {
            "sort": [
                {"dist_age":{
                  "order":"asc"
                  }
                }
              ]
          }
        }
      }
    }
  }
}

指标聚合案例

求全部学生的平均分数

GET index5/_search
{
  "size": 0, 
  "query": {
    "match_all": {}
  },
  "aggs": {
    "avg_fenshu": {
      "avg": {
        "field": "fenshu"
      }
    }
  }
}

aggs:appgregations的缩写
avg_fenshu:聚合名称

求全部学生的总分、平均分、最大分、最小分

GET index6/_search
{
  "size": 0, 
  "query": {
    "match_all": {}
  },
  "aggs": {
    "fenshu_sum": {
      "sum": {
        "field": "fenshu"
      }
    },
    "fenshu_avg": {
      "avg": {
        "field": "fenshu"
      }
    },
    "fenshu_max": {
      "max": {
        "field": "fenshu"
      }
    },
    "fenshu_min": {
      "min": {
        "field": "fenshu"
      }
    },
    "age_distinct": {
          "cardinality": {
            "field": "age"
          }
     }
  }
}

cardinality对每个bucket中指定field进行去重,取去重后的count,类似count(distinct)

查找每个班级最高分、最低分、平均分

需要使用多个metric,对bucket分析

GET index5/people/_search
{
  "size":0,
  "aggs": {
    "class_group":{
      "terms": {
        "field": "classid"
      },
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        },
        "max_age":{
          "max": {
            "field": "age"
          }
        },
        "min_age":{
          "min":{
            "field": "age"
          }
        }
      }
    }
  }
}

按年龄区间统计平均分数

GET index5/people/_search
{
  "size":0,
  "aggs": {
    "age_group": {
      "histogram": {
        "field": "age",
        "interval": 1
      },
      "aggs": {
        "fenshu": {
          "avg": {
            "field": "fenshu"
          }
        }
      }
    }
  }
}

下钻

下钻的意思是在分组的基础上在进行分组,比如在班级分组的基础上再对年龄分组,最后对每个最小粒度的分组执行聚合分析操作,年龄分组的基础上计算平均分数
也就是需要两个bucket查询,一个metrics查询

GET index5/people/_search
{
  "size":0,
  "aggs": {
    "sex_group": {
      "terms": {
        "field": "classid"
      },
      "aggs": {
        "age_group": {
          "terms": {
            "field": "age"
          },
          "aggs":{
            "avg_fenshu":{
              "avg": {
                "field": "fenshu"
              }
            }
          }
        }
      }
    }
  }
}

global bucket

global bucket可排除查询条件,用于整体与按条件查询结果对比
查询1班的平均成绩与全部班级的平均成绩

GET index6/user/_search
{
  "size":0,
  "query": {
    "term": {
      "classid": {
        "value": 3
      }
    }
  }, 
  "aggs": {
    "age_avg": {
      "avg": {
        "field": "age"
      }
    },
    "all_user":{
      "global": {},
      "aggs": {
        "global_avg_fenshu": {
          "avg": {
            "field": "fenshu"
          }
        }
      }
    }
  }
}

过滤bucket

aggs.filter,针对聚合过滤
如果filter放query里,是全局的。如果想对统计结果过滤,需要将filter放在aggs中
搜索1班90年出生学生的平均成绩

GET index5/people/_search
{
  "size": 0,
  "query": {
    "match": {
      "classid": 1
    }
  }, 
  "aggs": {
    "recent_90":{
      "filter": {
        "range": {
          "birthday": {
            "gte": "1990/01/01",
            "lte": "1991/01/01"
          }
        }
      },
      "aggs": {
        "avg_fenshu": {
          "avg": {
            "field": "fenshu"
          }
        }
      }
    }
  }
}

排序

对分析的结果排序
按班级分组,并按平均分数排序

GET index5/people/_search
{
  "size": 0,
  "aggs": {
    "class_group": {
      "terms": {
        "field": "classid",
        "order": {
          "avg_fenshu": "desc"
        }
      },
      "aggs": {
        "avg_fenshu": {
          "avg": {
            "field": "fenshu"
          }
        }
      }
    }
  }
}

去重

对每个bucket中指定field进行去重,取去重后的count,类似count(distinct)

GET index5/people/_search
{
  "size": 0,
  "aggs": {
    "years": {
      "date_histogram": {
        "field": "birthday",
        "interval": "year",
        "format": "yyyy-MM-dd",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "1985-01-01",
          "max": "2020-01-01"
        }
      },
      "aggs": {
        "age_distinct": {
          "cardinality": {
            "field": "age"
          }
          }
      }
    }
  }
}

聚合结果排序

按age分组,显示前3组

"aggs":{
        "bucket_page":{
            "terms":{
                "field":"age"
            },
            "aggs":{
                "url_page":{
                    "bucket_sort":{
                        "from":0,
                        "size":3
                    }
                }
            }
        }
    }

按age分组,再按domain_userid去重后,再显示前3组

"aggs":{
        "bucket_page":{
            "terms":{
                "field":"age"
            },
            "aggs":{
                "bucket_user":{
                    "cardinality": {
                      "field": "domain_userid.keyword"
                    }
                },
                "url_page":{
                    "bucket_sort":{
                        "from":0,
                        "size":3
                    }
                }
            }
        }
    }

附带文档数据

使用es聚合时,,有时还需要获取query(或filter) 的相关文档结果(数据)。
比如统计各个地区编码的营业额,得到了聚合的统计结果,还想知道query结果中对应的地区名称,并根据营业额进行排序,
这时可以使用 top_hits。

GET meshop-track-*/_search
{
    "query":{
        "bool":{
            "must":[
                {
                    "term":{
                        "program_id.keyword":{
                            "value":"xxxxx"
                        }
                    }
                }
            ]
        }
    },
    "size":0,
    "aggs": {
      "url_group": {
        "terms": {
          "field": "page_urlpath.keyword",
          "size": 10,
          "order": {
            "_count": "desc"
          }
        },
        "aggs": {
          "top_hits": {
            "top_hits": {
              "size": 1,
              "_source": "page_title"
            }
          }
        }
      }
    }
}

管道聚合

管道聚合主要用来处理来自其他聚合的产出结果,而不是来自文档集的产出,并将信息添加到最终的输出结果中。
管道聚合可以引用它们执行计算所需的聚合,方法是使用 bucket_path 参数来指示到所需指标的路径。
管道聚合类型:

max_bucket、min_bucket、sum_bucket

兄弟级管道聚合,用兄弟级聚合中指定指标的最小值标识 bucket,并输出 bucket 的值和键。指定的度量必须是数值的,同级聚合必须是多桶聚合。

GET school-data/_search
{
  "size": 0, 
  "aggs": {
    "agg_class": {
      "terms": {
        "field": "class_id"
      },
      "aggs": {
        "dist_age": {
          "cardinality": {
            "field": "age"
          }
        }
      }
    },
    "max_age":{
      "max_bucket": {
        "buckets_path": "agg_class>dist_age"
      }
    },
    "min_age":{
      "min_bucket": {
        "buckets_path": "agg_class>dist_age"
      }
    },
    "sum_age":{
      "sum_bucket": {
        "buckets_path": "agg_class>dist_age"
      }
    }
  }
}

查询结果:

{
"aggregations" : {
    "agg_class" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 3,
          "doc_count" : 4,
          "dist_age" : {
            "value" : 4
          }
        },
        {
          "key" : 2,
          "doc_count" : 3,
          "dist_age" : {
            "value" : 3
          }
        },
        {
          "key" : 1,
          "doc_count" : 2,
          "dist_age" : {
            "value" : 1
          }
        },
        {
          "key" : 4,
          "doc_count" : 1,
          "dist_age" : {
            "value" : 1
          }
        }
      ]
    },
    "max_age" : {
      "value" : 4.0,
      "keys" : [
        "3"
      ]
    },
    "min_age" : {
      "value" : 1.0,
      "keys" : [
        "1",
        "4"
      ]
    },
    "sum_age" : {
      "value" : 9.0
    }
  }

bucket_sort

父管道聚合,对其父多桶聚合的桶进行排序。可以将零个或多个排序字段与相应的排序顺序一起指定。每个 bucket 可以根据其_key、_count 或其子聚合进行排序。
此外,可以设置from和size的参数,以截断结果存储桶。

# 排序
GET school-data/_search
{
  "size": 0, 
  "aggs": {
    "agg_class": {
      "terms": {
        "field": "class_id"
      },
      "aggs": {
        "dist_age": {
          "cardinality": {
            "field": "age"
          }
        },
        "bucket_sorg_agg":{
          "bucket_sort": {
            "sort": [
                {"dist_age":{
                  "order":"asc"
                  }
                }
              ]
          }
        }
      }
    }
  }
}

#分页
GET school-data/_search
{
  "size": 0, 
  "aggs": {
    "agg_class": {
      "terms": {
        "field": "class_id"
      },
      "aggs": {
        "dist_age": {
          "cardinality": {
            "field": "age"
          }
        },
        "bucket_sorg_agg":{
          "bucket_sort": {
            "sort": [
                {"dist_age":{
                  "order":"asc"
                  }
                }
              ],
              "from": 1,
              "size": 1
          }
        }
      }
    }
  }
}

查询结果:

{
  "aggregations" : {
    "agg_class" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 4,
          "doc_count" : 1,
          "dist_age" : {
            "value" : 1
          }
        },
        {
          "key" : 2,
          "doc_count" : 3,
          "dist_age" : {
            "value" : 3
          }
        }
      ]
    }
  }
posted @ 2021-01-02 00:06  .Neterr  阅读(150)  评论(0编辑  收藏  举报