ES Series, Part 14: ES Aggregations (Introduction, Metric Aggregations, Bucket Aggregations)
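I. Introduction to Aggregations
1. What are ES aggregations?
Aggregation is an important database feature: it computes aggregate values over the data set returned by a query, for example the maximum or minimum of a field (or of a computed expression), a sum, or an average. ES, being both a search engine and a data store, provides powerful aggregation capabilities as well.
Aggregations that compute a figure such as max, min, sum or avg over a data set are called metric aggregations in ES.
Relational databases, besides offering aggregate functions, can also group the queried rows with group by and then compute metrics per group. In ES, group by is called bucketing, and the corresponding aggregations are bucket aggregations.
ES also provides matrix aggregations and pipeline aggregations, which were still being refined at the time of writing.
2. How to write an aggregation query
Define aggregations in the aggregations node of the search request body, using the following syntax:
"aggregations" : {
    "<aggregation_name>" : {                              <!-- name of the aggregation -->
        "<aggregation_type>" : {                          <!-- type of the aggregation -->
            <aggregation_body>                            <!-- body: which fields to aggregate -->
        }
        [,"meta" : { [<meta_data_body>] } ]?              <!-- metadata -->
        [,"aggregations" : { [<sub_aggregation>]+ } ]?    <!-- sub-aggregations defined inside this aggregation -->
    }
    [,"<aggregation_name_2>" : { ... } ]*                 <!-- name of a further aggregation -->
}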
Note:
aggregations can also be abbreviated as aggs.
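To make the nesting concrete, the following sketch builds a request body that follows the template above as a plain Python dictionary and prints it as JSON. The helper `agg` and the aggregation names `by_age` and `max_age` are made up for illustration; only the keys `size`, `aggs`, `terms`, `max` and `field` come from the requests shown in this article.

```python
import json

def agg(agg_type, body, sub_aggs=None):
    """Return one aggregation definition: {type: body[, "aggs": {...}]}."""
    definition = {agg_type: body}
    if sub_aggs:
        definition["aggs"] = sub_aggs  # "aggs" is the short form of "aggregations"
    return definition

request_body = {
    "size": 0,  # return no hits, only the aggregation results
    "aggs": {
        # hypothetical names: a terms bucket with a max metric nested inside it
        "by_age": agg("terms", {"field": "age"},
                      sub_aggs={"max_age": agg("max", {"field": "age"})}),
    },
}
print(json.dumps(request_body, indent=2))
```

The printed JSON can be pasted after a `POST /<index>/_search` line in the same way as the examples in this article.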
3. Value sources for aggregations
The values an aggregation works on can be taken from a field or computed by a script.
II. Metric Aggregations
1. max min sum avg
Example 1: Find the maximum age across all documents
POST /book1/_search?pretty { "size": 0, "aggs": { "maxage": { "max": { "field": "age" } } } }
Result 1:
{ "took": 4, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "maxage": { "value": 54 } } }
Example 2: Add a query condition and find the maximum age among documents whose name contains 'test':
POST /book1/_search?pretty { "query":{ "term":{ "name":"test" } }, "size": 2, "sort": [ { "age": { "order": "desc" } } ], "aggs": { "maxage": { "max": { "field": "age" } } } }
Result 2:
{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 5, "max_score": null, "hits": [ { "_index": "book1", "_type": "english", "_id": "6IUkUmUBRzBxBrDgFok2", "_score": null, "_source": { "name": "test goog my money", "age": [ 14, 54, 45, 34 ], "class": "dsfdsf", "addr": "中国" }, "sort": [ 54 ] }, { "_index": "book1", "_type": "english", "_id": "54UiUmUBRzBxBrDgfIl9", "_score": null, "_source": { "name": "test goog my money", "age": [ 11, 13, 14 ], "class": "dsfdsf", "addr": "中国" }, "sort": [ 14 ] } ] }, "aggregations": { "maxage": { "value": 54 } } }
Example 3: Values from a script: compute the average age over all documents, and the average of age plus 10
POST /book1/_search?pretty { "size":0, "aggs": { "avg_age": { "avg": { "script": { "source": "doc.age.value" } } }, "avg_age10": { "avg": { "script": { "source": "doc.age.value + 10" } } } } }
Result 3:
{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "avg_age": { "value": 7.585365853658536 }, "avg_age10": { "value": 17.585365853658537 } } }
Example 4: Specify a field and read its value in the script through _value
POST /book1/_search?pretty { "size":0, "aggs": { "sun_age": { "sum": { "field":"age", "script": { "source": "_value * 2" } } } } }
Result 4:
{ "took": 4, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "sun_age": { "value": 942 } } }
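Example 5: Supply a value for documents that lack the field. If missing is not specified, documents without a value for the field are ignored: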
POST /book1/_search?pretty { "size":0, "aggs": { "sun_age": { "avg": { "field":"age", "missing":15 } } } }
Result 5:
{ "took": 12, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "sun_age": { "value": 12.847826086956522 } } }
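The figure in Result 5 can be checked by hand from numbers reported elsewhere in this article: the stats example gives a sum of 471 over 38 values, and the Missing Aggregation example reports 8 documents without an age. The arithmetic sketch below assumes each of those 8 documents contributes a single value of 15; the variable names are illustrative only.

```python
value_sum, value_count = 471, 38   # "sum" and "count" from the stats result in this article
docs_without_age = 8               # "doc_count" from the missing aggregation result
missing_default = 15               # the "missing" parameter used in Example 5

# every document without an age is treated as if its age were 15
avg_with_missing = (value_sum + docs_without_age * missing_default) / (
    value_count + docs_without_age)
print(avg_with_missing)  # agrees with the 12.8478... shown in Result 5
```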
2. Document count (count)
Example 1: Count the documents in index book1 whose age is 12
POST book1/english/_count { "query":{ "match":{ "age":12 } } }
Result 1:
{ "count": 16, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 } }
3. Value count: counts the values of a field (a multi-valued field contributes one count per value)
Example 1:
POST /book1/_search?size=0 { "aggs":{ "age_count":{ "value_count":{ "field":"age" } } } }
Result 1:
{ "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_count": { "value": 38 } } }
4. cardinality: count of distinct values
Example 1:
POST /book1/_search?size=0 { "aggs":{ "age_count":{ "value_count":{ "field":"age" } }, "name_count":{ "cardinality":{ "field":"age" } } } }
Result 1:
{ "took": 16, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "name_count": { "value": 11 }, "age_count": { "value": 38 } } }
Note: there are 38 values in total; after removing duplicates, 11 distinct values remain.
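The relationship between the two counts can be reproduced offline. The value list below is reconstructed from the bucket counts in the Terms Aggregation results later in this article (it is an inference, not the author's raw data), and the sketch counts exactly, whereas the cardinality aggregation in Elasticsearch is an approximate count.

```python
# value -> number of occurrences, reconstructed from the terms buckets in this article
occurrences = {12: 16, 1: 11, 13: 2, 14: 2, 11: 1, 16: 1, 21: 1, 33: 1, 34: 1, 45: 1, 54: 1}
values = [v for v, n in occurrences.items() for _ in range(n)]

value_count = len(values)        # every value counts, duplicates included
cardinality = len(set(values))   # distinct values only
print(value_count, cardinality)  # 38 11
```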
5. stats: returns five values (count, max, min, avg, sum)
Example 1:
POST /book1/_search?size=0 { "aggs":{ "age_count":{ "stats":{ "field":"age" } } } }
Result 1:
{ "took": 12, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_count": { "count": 38, "min": 1, "max": 54, "avg": 12.394736842105264, "sum": 471 } } }
6. Extended stats
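Extended statistics: compared with stats, four more results are returned: sum of squares, variance, standard deviation, and the interval of the mean plus/minus two standard deviations.
Example 1: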
POST /book1/_search?size=0 { "aggs":{ "age_stats":{ "extended_stats":{ "field":"age" } } } }
Result 1:
{ "took": 8, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_stats": { "count": 38, "min": 1, "max": 54, "avg": 12.394736842105264, "sum": 471, "sum_of_squares": 11049, "variance": 137.13365650969527, "std_deviation": 11.710408041981085, "std_deviation_bounds": { "upper": 35.81555292606743, "lower": -11.026079241856905 } } } }
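The extra fields in this result follow from the basic ones. The sketch below recomputes them with plain Python from a value list reconstructed from the terms buckets in this article (an inference, not the author's raw data). It uses the population variance, which is the choice that reproduces the figures above.

```python
import math

# value -> number of occurrences, reconstructed from the terms buckets in this article
occurrences = {12: 16, 1: 11, 13: 2, 14: 2, 11: 1, 16: 1, 21: 1, 33: 1, 34: 1, 45: 1, 54: 1}
values = [v for v, n in occurrences.items() for _ in range(n)]

count = len(values)
total = sum(values)
sum_of_squares = sum(v * v for v in values)
avg = total / count
variance = sum_of_squares / count - avg ** 2    # population variance: E[x^2] - E[x]^2
std_deviation = math.sqrt(variance)
lower, upper = avg - 2 * std_deviation, avg + 2 * std_deviation  # std_deviation_bounds

print(count, total, sum_of_squares)
print(round(variance, 4), round(std_deviation, 4), round(lower, 4), round(upper, 4))
```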
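7. Percentiles: the value at a given percentile
Example 1:
The values of the given field (or script) are ordered from smallest to largest while accumulating the share of matching documents that each value accounts for, and the value at each requested percentage is returned. By default the values at the [ 1, 5, 25, 50, 75, 95, 99 ] percentiles are returned. The middle entry of the result below can be read as: 50% of the documents have an age <= 12, or conversely, documents with age <= 12 make up 50% of all matching documents.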
POST /book1/_search?size=0 { "aggs":{ "age_percentiles":{ "percentiles":{ "field":"age" } } } }
Result 1:
{ "took": 16, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_percentiles": { "values": { "1.0": 1, "5.0": 1, "25.0": 1, "50.0": 12, "75.0": 13, "95.0": 40.600000000000016, "99.0": 54 } } } }
Example 2: Specify the percentiles (which values correspond to 50%, 96% and 99%)
POST /book1/_search?size=0 { "aggs":{ "age_percentiles":{ "percentiles":{ "field":"age", "percents" : [50,96,99] } } } }
Result 2:
{ "took": 6, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_percentiles": { "values": { "50.0": 12, "96.0": 44.779999999999966, "99.0": 54 } } } }
Note: 50% of the values are <= 12, 96% of the values are <= 44.78, and 99% of the values are <= 54.
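As a rough illustration of what a percentile means, the sketch below applies the simple nearest-rank definition to a value list reconstructed from the terms buckets in this article (an inference, not the author's raw data). Elasticsearch computes percentiles with an approximation algorithm (TDigest by default) and interpolates between values, so it does not return exactly these numbers in every case; the 95% and 96% entries above, for instance, are interpolated.

```python
import math

# value -> number of occurrences, reconstructed from the terms buckets in this article
occurrences = {12: 16, 1: 11, 13: 2, 14: 2, 11: 1, 16: 1, 21: 1, 33: 1, 34: 1, 45: 1, 54: 1}
values = sorted(v for v, n in occurrences.items() for _ in range(n))

def nearest_rank(sorted_values, percent):
    """Smallest value such that at least `percent` % of the values are <= it."""
    rank = max(1, math.ceil(percent / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

for p in (25, 50, 75, 99):
    print(p, nearest_rank(values, p))
```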
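8. Percentile ranks: the share of documents whose value is less than or equal to a given value
Example 1: Find the share of documents whose age is at most 12 and at most 35, the reverse of item 7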
POST /book1/_search?size=0 { "aggs":{ "aggs_perc_rank":{ "percentile_ranks":{ "field":"age", "values" : [12,35] } } } }
Result 1:
{ "took": 8, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "aggs_perc_rank": { "values": { "12.0": 71.05263157894737, "35.0": 92.76315789473685 } } } }
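Note on the result: about 71.1% of the documents have an age of at most 12, and about 92.8% have an age of at most 35.
9. Geo Bounds aggregation: the bounding box of the geo points in the document set
See the official documentation: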
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geobounds-aggregation.html
10. Geo Centroid aggregation: the centroid of the geo points
See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geocentroid-aggregation.html
III. Bucket Aggregations
1. Terms Aggregation: group by the terms (values) of a field
Example 1:
POST /book1/_search?size=0 { "aggs":{ "age_terms":{ "terms":{ "field":"age" } } } }
Note: this is equivalent to group by age
Result 1:
{ "took": 4, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 1, "buckets": [ { "key": 12, "doc_count": 16 }, { "key": 1, "doc_count": 11 }, { "key": 13, "doc_count": 2 }, { "key": 14, "doc_count": 2 }, { "key": 11, "doc_count": 1 }, { "key": 16, "doc_count": 1 }, { "key": 21, "doc_count": 1 }, { "key": 33, "doc_count": 1 }, { "key": 34, "doc_count": 1 }, { "key": 45, "doc_count": 1 } ] } } }
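Notes on the result:
"doc_count_error_upper_bound": 0: the upper bound of the error on the document counts
"sum_other_doc_count": 1: the number of documents not returned, that is, documents that fall outside the returned buckets
By default, the top 10 buckets ordered by document count, highest first, are returned.
Example 2: size sets how many buckets are returned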
POST /book1/_search?size=0 { "aggs":{ "age_terms":{ "terms":{ "field":"age", "size":5 } } } }
Result 2:
{ "took": 4, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 6, "buckets": [ { "key": 12, "doc_count": 16 }, { "key": 1, "doc_count": 11 }, { "key": 13, "doc_count": 2 }, { "key": 14, "doc_count": 2 }, { "key": 11, "doc_count": 1 } ] } } }
Example 3: Show the error bound on each bucket
POST /book1/_search?size=0 { "aggs":{ "age_terms":{ "terms":{ "field":"age", "size":5, "show_term_doc_count_error": true } } } }
Result 3:
{ "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 6, "buckets": [ { "key": 12, "doc_count": 16, "doc_count_error_upper_bound": 0 }, { "key": 1, "doc_count": 11, "doc_count_error_upper_bound": 0 }, { "key": 13, "doc_count": 2, "doc_count_error_upper_bound": 0 }, { "key": 14, "doc_count": 2, "doc_count_error_upper_bound": 0 }, { "key": 11, "doc_count": 1, "doc_count_error_upper_bound": 0 } ] } } }
Example 4: shard_size sets how many buckets each shard returns
POST /book1/_search?size=0 { "aggs":{ "age_terms":{ "terms":{ "field":"age", "size":3, "shard_size": 20 } } } }
Result 4:
{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 9, "buckets": [ { "key": 12, "doc_count": 16 }, { "key": 1, "doc_count": 11 }, { "key": 13, "doc_count": 2 } ] } } }
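Why shard_size matters is easier to see with a toy model. In the sketch below each shard reports only its own top shard_size buckets and a coordinator merges them; the shard contents and the function names are hypothetical, and the model ignores everything else Elasticsearch does. With a small shard_size a term can be undercounted because one shard did not report it.

```python
from collections import Counter

def top_buckets(counts, n):
    """Top n (key, count) pairs: highest count first, ties broken by key."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def terms_across_shards(shards, size, shard_size):
    merged = Counter()
    for shard_values in shards:
        # each shard only reports its own top shard_size buckets
        for key, count in top_buckets(Counter(shard_values), shard_size):
            merged[key] += count
    return top_buckets(merged, size)

# Hypothetical shard contents; the true totals are a=10, b=7, c=7
shards = [["a"] * 5 + ["b"] * 4 + ["c"] * 3,
          ["a"] * 5 + ["c"] * 4 + ["b"] * 3]

print(terms_across_shards(shards, size=2, shard_size=2))  # b is undercounted (4 instead of 7)
print(terms_across_shards(shards, size=2, shard_size=3))  # counts are exact
```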
order sets the sort order of the buckets
Example 5: Sort by the bucket key, "_key"
POST /book1/_search?size=0 { "aggs":{ "age_terms":{ "terms":{ "field":"age", "size":3, "order":{"_key":"desc"} } } } }
Result 5:
{ "took": 6, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 35, "buckets": [ { "key": 54, "doc_count": 1 }, { "key": 45, "doc_count": 1 }, { "key": 34, "doc_count": 1 } ] } } }
Example 6: Sort by document count, "_count"
POST /book1/_search?size=0 { "aggs":{ "age_terms":{ "terms":{ "field":"age", "size":3, "order":{"_count":"desc"} } } } }
Result 6:
{ "took": 91, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 9, "buckets": [ { "key": 12, "doc_count": 16 }, { "key": 1, "doc_count": 11 }, { "key": 13, "doc_count": 2 } ] } } }
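The effect of size and order on the buckets and on sum_other_doc_count can be imitated with a counter. The value list is reconstructed from the terms buckets in this article (an inference, not the author's raw data), the function `terms` is a hypothetical stand-in, and each value occurrence is treated as one document, which happens to agree with the doc counts shown here.

```python
from collections import Counter

# value -> number of occurrences, reconstructed from the terms buckets in this article
occurrences = {12: 16, 1: 11, 13: 2, 14: 2, 11: 1, 16: 1, 21: 1, 33: 1, 34: 1, 45: 1, 54: 1}
values = [v for v, n in occurrences.items() for _ in range(n)]

def terms(values, size=10, order="_count"):
    """Return (buckets, sum_other_doc_count) for a single, unsharded data set."""
    counts = Counter(values)
    if order == "_key":
        ranked = sorted(counts.items(), key=lambda kv: -kv[0])           # key descending
    else:
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))  # count descending, ties by key
    buckets = ranked[:size]
    other = sum(n for _, n in ranked[size:])  # counts that fall outside the returned buckets
    return buckets, other

print(terms(values, size=5))
print(terms(values, size=3, order="_key"))
```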
Example 7: Sort by a metric computed per bucket
POST /book1/_search?size=0 { "aggs":{ "age_terms":{ "terms":{ "field":"age", "order":{"max_age":"desc"} }, "aggs":{ "max_age":{ "max":{ "field":"age" } }, "min_age":{ "min":{ "field":"age" } } } } } }
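Note: documents are first grouped by age, then the max and min are computed for each group, and finally the groups are sorted by max in descending order
Example 8: Filtering buckets with a regular expression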
POST book1/_search?size=0 { "aggs":{ "tags":{ "terms":{ "field":"name", "include":"里*", "exclude":"test*" } } } }
Result 8:
{ "took": 22, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "tags": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "里", "doc_count": 13 } ] } } }
Example 9: Filtering buckets with an explicit list of values
POST book1/_search?size=0 { "aggs":{ "Chinese":{ "terms":{ "field":"name", "include":["里","国"] } }, "Test":{ "terms":{ "field":"name", "exclude":["test","the"] } } } }
Result 9:
{ "took": 23, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "Test": { "doc_count_error_upper_bound": 6, "sum_other_doc_count": 559, "buckets": [ { "key": "里", "doc_count": 12 }, { "key": "否", "doc_count": 11 }, { "key": "a", "doc_count": 7 }, { "key": "default", "doc_count": 7 }, { "key": "document", "doc_count": 7 }, { "key": "for", "doc_count": 7 }, { "key": "absolute", "doc_count": 6 }, { "key": "account", "doc_count": 6 }, { "key": "accurate", "doc_count": 6 }, { "key": "documents", "doc_count": 6 } ] }, "Chinese": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "国", "doc_count": 4 } ] } } }
Example 10: Bucket by a value computed in a script
POST book1/_search?size=0 { "aggs":{ "name":{ "terms":{ "script":{ "source":"doc['age'].value + doc.age.value", "lang": "painless" } } } } }
Note: in a script, the field value can be read as doc['age'].value or doc.age.value
Result 10:
{ "took": 18, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "name": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "24", "doc_count": 16 }, { "key": "2", "doc_count": 11 }, { "key": "0", "doc_count": 8 }, { "key": "22", "doc_count": 1 }, { "key": "26", "doc_count": 1 }, { "key": "28", "doc_count": 1 }, { "key": "32", "doc_count": 1 }, { "key": "42", "doc_count": 1 }, { "key": "66", "doc_count": 1 } ] } } }
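A way to read this result: each key is the per-document script value, that is, twice what doc.age.value returned, and the "0" bucket holds 8 documents, which suggests that in this setup a document without an age yields 0. Under that assumption (an inference from the figures, not something the article states), the buckets also account for the avg_age value of 7.585... in the script-based avg example earlier: halve the keys and average over all 41 documents. The keys are written as integers below for convenience.

```python
# key (doubled age, as returned by the script) -> doc_count, copied from the result above
buckets = {24: 16, 2: 11, 0: 8, 22: 1, 26: 1, 28: 1, 32: 1, 42: 1, 66: 1}

doc_total = sum(buckets.values())                          # every document lands in one bucket
age_sum = sum(key // 2 * n for key, n in buckets.items())  # undo the doubling
avg_age = age_sum / doc_total
print(doc_total, age_sum, avg_age)
```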
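2. Filter Aggregation: aggregate over the documents that match a filter query
Example 1: From the documents matched by the query, select those that satisfy the filter and aggregate over them, that is, filter first and then aggregate (as opposed to Example 9 above, bucket filtering, which aggregates first and then filters)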
POST book1/_search?size=0 { "aggs":{ "age_terms":{ "filter":{ "match":{"name":"test"} }, "aggs":{ "avg_age":{ "avg":{"field":"age" } } } } } }
Result 1:
{ "took": 152, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "doc_count": 5, "avg_age": { "value": 19.9 } } } }
3. Filters Aggregation: aggregate over several filter groups
Example 1: Count the documents containing 'test' and those containing '里'
POST book1/_search?size=0 { "aggs":{ "age_terms":{ "filters":{ "filters":{ "test":{ "match":{"name":"test"} }, "china":{ "match":{"name":"里"} } } } } } }
Result 1:
{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "buckets": { "china": { "doc_count": 13 }, "test": { "doc_count": 5 } } } } }
For instance, count the error and warning entries in logs, to drive log alerting:
GET logs/_search { "size": 0, "aggs": { "messages": { "filters": { "filters": { "errors": { "match": { "body": "error" } }, "warnings": { "match": { "body": "warning" } } } } } } }
Example 2: Specify a key for the bucket that collects all other documents
POST book1/_search?size=0 { "aggs":{ "age_terms":{ "filters":{ "other_bucket_key": "other_messages", "filters":{ "test":{ "match":{"name":"test"} }, "china":{ "match":{"name":"里"} } } } } } }
Result 2:
{ "took": 9, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_terms": { "buckets": { "china": { "doc_count": 13 }, "test": { "doc_count": 5 }, "other_messages": { "doc_count": 23 } } } } }
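The bucketing rule can be sketched in a few lines: every document is counted in each filter it matches, and the other bucket collects the documents that match none. The sample names are hypothetical, and matching on whitespace-separated tokens is only a crude stand-in for an analyzed match query.

```python
docs = ["test goog my money", "里 a document", "default account", "test 里"]  # hypothetical names

filters = {
    "test": lambda name: "test" in name.split(),
    "china": lambda name: "里" in name.split(),
}

def filters_agg(docs, filters, other_bucket_key=None):
    """Count documents per named filter, plus an optional bucket for the rest."""
    buckets = {key: sum(1 for d in docs if match(d)) for key, match in filters.items()}
    if other_bucket_key is not None:
        buckets[other_bucket_key] = sum(
            1 for d in docs if not any(match(d) for match in filters.values()))
    return buckets

print(filters_agg(docs, filters, other_bucket_key="other_messages"))
```

Note that the last sample document is counted in both named buckets, so the bucket counts need not add up to the number of documents.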
4. Range Aggregation: group by value ranges
Example 1:
POST book1/_search?size=0 { "aggs":{ "age_range":{ "range":{ "field":"age", "keyed":true, "ranges":[ { "to":20, "key":"TW" }, { "from":25, "to":40, "key":"TH" }, { "from":60, "key":"SIX" } ] } } } }
Result 1:
{ "took": 3, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "age_range": { "buckets": { "TW": { "to": 20, "doc_count": 31 }, "TH": { "from": 25, "to": 40, "doc_count": 2 }, "SIX": { "from": 60, "doc_count": 0 } } } } }
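In a range aggregation the from bound is inclusive and the to bound is exclusive, and a document is counted in a bucket when at least one of its values falls inside it. The sketch below checks the doc counts above against a document set inferred from the terms, script and missing results in this article (a reconstruction under that assumption, not the author's data); the function name is hypothetical.

```python
# ages per document: 31 single-valued, 2 multi-valued, 8 without an age (inferred)
docs = ([[12]] * 16 + [[1]] * 11 + [[13], [16], [21], [33]]
        + [[14, 54, 45, 34], [11, 13, 14]] + [[]] * 8)

def range_doc_count(docs, lo=None, hi=None):
    """Count documents having at least one value v with lo <= v < hi."""
    def inside(v):
        return (lo is None or v >= lo) and (hi is None or v < hi)
    return sum(1 for ages in docs if any(inside(v) for v in ages))

print(range_doc_count(docs, hi=20),         # bucket "TW"
      range_doc_count(docs, lo=25, hi=40),  # bucket "TH"
      range_doc_count(docs, lo=60))         # bucket "SIX"
```

The document with ages 14, 54, 45 and 34 is counted in both "TW" and "TH", which shows that range buckets are not mutually exclusive for multi-valued fields.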
5. Date Range Aggregation: group by date ranges
Example 1:
POST /bank/_search?size=0 { "aggs": { "range": { "date_range": { "field": "date", "format": "MM-yyy", "ranges": [ { "to": "now-10M/M" }, { "from": "now-10M/M" } ] } } } }
Result 1:
{ "took": 115, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "range": { "buckets": [ { "key": "*-2017-08-01T00:00:00.000Z", "to": 1501545600000, "to_as_string": "2017-08-01T00:00:00.000Z", "doc_count": 0 }, { "key": "2017-08-01T00:00:00.000Z-*", "from": 1501545600000, "from_as_string": "2017-08-01T00:00:00.000Z", "doc_count": 0 } ] } } }
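The expression now-10M/M means: take the current time, subtract ten months, then round down to the start of that month. A minimal sketch of this evaluation follows; the function name is hypothetical, and the "now" of mid-June 2018 is an assumption chosen because it reproduces the 2017-08-01 boundary in the result above.

```python
from datetime import datetime, timezone

def months_ago_floor(now, months):
    """Evaluate date math of the form now-<months>M/M (UTC)."""
    index = now.year * 12 + (now.month - 1) - months   # months counted from year 0
    return datetime(index // 12, index % 12 + 1, 1, tzinfo=timezone.utc)

assumed_now = datetime(2018, 6, 15, tzinfo=timezone.utc)  # assumption, see above
boundary = months_ago_floor(assumed_now, 10)
print(boundary.isoformat(), int(boundary.timestamp() * 1000))
```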
6. Date Histogram Aggregation: histogram over time
Aggregates by day, month, year and so on. Buckets can be built at intervals of year (1y), quarter (1q), month (1M), week (1w), day (1d), hour (1h), minute (1m) or second (1s), or at a custom time interval.
Example 1:
POST /bank/_search?size=0 { "aggs": { "sales_over_time": { "date_histogram": { "field": "date", "interval": "month" } } } }
Result 1:
{ "took": 9, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1000, "max_score": 0, "hits": [] }, "aggregations": { "sales_over_time": { "buckets": [] } } }
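The buckets list above is empty, so here is a small sketch of what monthly bucketing does when documents do carry dates: each timestamp is truncated to the first day of its month and the documents per month are counted. The timestamps and the function name are hypothetical, and the sketch returns only months that contain documents.

```python
from collections import Counter
from datetime import datetime

# hypothetical document dates
timestamps = [datetime(2018, 1, 5), datetime(2018, 1, 20), datetime(2018, 3, 2)]

def monthly_histogram(timestamps):
    """Return sorted (month_start, doc_count) pairs for the non-empty months."""
    counts = Counter(ts.strftime("%Y-%m-01") for ts in timestamps)
    return sorted(counts.items())

print(monthly_histogram(timestamps))
```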
7. Missing Aggregation: bucket of documents with a missing value
Example 1: Count the documents that have no value for the field
POST /book1/_search?size=0 { "aggs" : { "account_without_age" : { "missing" : { "field" : "age" } } } }
Result 1:
{ "took": 10, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 41, "max_score": 0, "hits": [] }, "aggregations": { "account_without_age": { "doc_count": 8 } } }
8. Geo Distance Aggregation: group by distance from a geo point
See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geodistance-aggregation.html