Elasticsearch 第六篇:聚合统计查询
前面一直没有记录 Elasticsearch 的聚合查询或者其它复杂的查询。本篇做一下笔记,为了方便测试,索引数据依然是第五篇生成的测试索引库 db_student_test ,别名是 student_test
第一部分 基本聚合
1、最大值 max、最小值 min、平均值 avg 、总和 sum
场景:查询语文、数学、英语 这三科的最大值、最小值、平均值
POST http://localhost:9200/student_test1/_search?size=0 { "aggs" : { "max_chinese" : { "max" : { "field" : "chinese" } }, "min_chinese" : { "min" : { "field" : "chinese" } }, "avg_chinese" : { "avg" : { "field" : "chinese" } }, "max_math": { "max" : { "field" : "math" } }, "min_math": { "min" : { "field" : "math" } }, "avg_math": { "avg" : { "field" : "math" } }, "max_english": { "max" : { "field" : "english" } }, "min_english": { "min" : { "field" : "english" } }, "avg_english": { "avg" : { "field" : "english" } } } }
查询结果是:
{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] }, "aggregations": { "avg_english": { "value": 57.78366490546798 }, "max_chinese": { "value": 98 }, "min_chinese": { "value": 25 }, "min_math": { "value": 15 }, "max_english": { "value": 98 }, "avg_chinese": { "value": 59.353859695794505 }, "avg_math": { "value": 56.92907568735187 }, "min_english": { "value": 21 }, "max_math": { "value": 99 } } }
也可以来查询语文科目分数总和,相当于 sql 的 sum 逻辑,虽然在这里并没有什么意义:
POST http://localhost:9200/student_test1/_search?size=0 { "aggs" : { "sum_chinese" : { "sum" : { "field" : "chinese" } } } }
2、求个数,相当于 sql 的 count 逻辑
场景:查询所有学生总数,这里随便 count 一个 字段就可以,例如数学这个字段
POST http://localhost:9200/student_test1/_search?size=0 { "aggs": { "age_count": { "value_count": { "field": "math" } } } }
返回结果是:
{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] }, "aggregations": { "age_count": { "value": 50084828 } } }
课间总数是:50084828 跟第五篇我们生成的数据总量一致
3、distinct 聚合,相当于 sql 的 count ( distinct )
场景:统计语文成绩有多少种值
POST http://localhost:9200/student_test1/_search?size=0 { "aggs" : { "type_count" : { "cardinality" : { "field" : "chinese" } } } }
返回结果是:
{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] }, "aggregations": { "type_count": { "value": 74 } } }
从结果上看,只有74个不同的分数,与第五篇随机生成数据的规则匹配
4、统计聚合
场景:查询语文成绩 总个数、最大值、最小值、平均值、总和等
POST http://localhost:9200/student_test1/_search?size=0 { "aggs": { "chinese_stats": { "stats": { "field": "chinese" } } } }
返回结果是:
{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] }, "aggregations": { "chinese_stats": { "count": 50084828, "min": 25, "max": 98, "avg": 59.353859695794505, "sum": 2972727854 } } }
5、加强版统计聚合,查询结果在上面的基础上,加上方差等统计学上的数据
POST http://localhost:9200/student_test1/_search?size=0 { "aggs": { "chinese_stats": { "extended_stats": { "field": "chinese" } } } }
6、分位聚合统计
默认的分位是 1% 5% 25% 50% 75% 95% 99% 《= 的概念
分位数的概念:25% 的分位数是 54,意思是小于等于 54 的样本占据了总样本的 25% ,即是 54 这个数将最底层的1/4 的数据分割出来。
POST http://localhost:9200/student_test1/_search?size=0 { "aggs": { "chinese_percents": { "percentiles": { "field": "chinese" } } } }
也可以自定义分位:
POST http://localhost:9200/student_test1/_search?size=0 { "aggs": { "chinese_percents": { "percentiles": { "field": "chinese", "percents" : [10,20,30,40,50,60,70,80,90] } } } }
7、范围聚合统计
场景:分别查询语文成绩小于40分、小于50分、小于60分的比例
POST http://localhost:9200/student_test1/_search?size=0 { "aggs": { "gge_perc_rank": { "percentile_ranks": { "field": "chinese", "values": [40,50,60] } } } }
以上是查询成绩小于40,小于50,小于60的占比,得到的数据是: 21.29% 36.09% 51.12% 可以看到这是一个接近等差的数列,可见测试数据的随机性还是很好的。
第二部分 其它聚合方式
1、Term 聚合
场景:想知道学生的语文成绩,在所有分数值上的个数
POST http://localhost:9200/student_test1/_search?size=0 { "aggs" : { "genres" : { "terms" : { "field" : "chinese" } } } }
这个查询会将字段Chinese进行聚合,例如87分聚合成一个组,88分聚合成一个组,等等;
但是这里默认是按组的大小排序,而且不会将所有的组都显示出来,数量太小的组可能被忽略,查询结果如下:
{ "took": 1, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] }, "aggregations": { "genres": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 42560269, "buckets": [ { "key": 61, "doc_count": 752863 }, { "key": 68, "doc_count": 752835 }, { "key": 55, "doc_count": 752749 }, { "key": 59, "doc_count": 752444 }, { "key": 76, "doc_count": 752405 }, { "key": 74, "doc_count": 752309 }, { "key": 56, "doc_count": 752283 }, { "key": 49, "doc_count": 752273 }, { "key": 52, "doc_count": 752201 }, { "key": 50, "doc_count": 752197 } ] } } }
如果想要自定义筛选条件,Term聚合还可以按照以下设定来查询:
post http://localhost:9200/student_test1/_search?size=0 { "aggs" : { "genres" : { "terms" : { "field" : "chinese", "size" : 100, // 可能有100个不用的分数,我们将全部都展示出来 "order" : { "_count" : "asc" }, // 按照组数由小到大排序 "min_doc_count": 752200 //过滤条件:组数最小值是752200 } } } }
查询结果是:
{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] }, "aggregations": { "genres": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": 52, "doc_count": 752201 }, { "key": 49, "doc_count": 752273 }, { "key": 56, "doc_count": 752283 }, { "key": 74, "doc_count": 752309 }, { "key": 76, "doc_count": 752405 }, { "key": 59, "doc_count": 752444 }, { "key": 55, "doc_count": 752749 }, { "key": 68, "doc_count": 752835 }, { "key": 61, "doc_count": 752863 } ] } } }
2、Filter 聚合
Filter 聚合会先进行条件过滤,在进行聚合
场景:查询华南理工大学的学生的数学科目平均分(先筛选学校,再进行分数统计聚合)
{ "aggs" : { "scut_math_avg" : { "filter" : { "term": { "school": "华南理工大学" } }, "aggs" : { "avg_price" : { "avg" : { "field" : "math" } } } } } }
查询结果是:
{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] }, "aggregations": { "scut_math_avg": { "doc_count": 1854993, "avg_price": { "value": 56.93080027795253 } } } }
3、Filters 多重聚合
场景:查询各个学校,语文、数学、英语的平均分都是多少,可以采用多重聚合,速度可能有点慢,如下
POST http://localhost:9200/student_test1/_search?size=0 { "aggs" : { "messages" : { "filters" : { "filters" : { "school_1" : { "term" : { "school" : "华南理工大学" }}, "school_2" : { "term" : { "school" : "中山大学" }}, "school_3" : { "match" : { "school" : "暨南大学" }} } }, "aggs" : { "avg_chinese" : { "avg" : { "field" : "chinese" } }, "avg_math" : { "avg" : { "field" : "math" } } } } } }
于是得到结果:
{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] }, "aggregations": { "messages": { "buckets": { "school_1": { "doc_count": 1854993, "avg_chinese": { "value": 59.353236912484306 }, "avg_math": { "value": 56.93080027795253 } }, "school_2": { "doc_count": 1855016, "avg_chinese": { "value": 59.349129064115886 }, "avg_math": { "value": 56.93540918245449 } }, "school_3": { "doc_count": 44519876, "avg_chinese": { "value": 59.35397212247402 }, "avg_math": { "value": 56.92948502372289 } } } } } }
4、Range 范围聚合
场景:想要查询语文成绩各个分数段的人数,可以这样查询
POST http://localhost:9200/student_test1/_search?size=0
{ "aggs" : { "chinese_ranges" : { "range" : { "field" : "chinese", "ranges" : [ { "to" : 60 }, { "from" : 60, "to" : 75 }, { "from" : 75, "to" : 85 }, { "from" : 85 } ] } } } }
查询结果是:
{ "took": 0, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] }, "aggregations": { "chinese_ranges": { "buckets": [ { "key": "*-60.0", "to": 60, "doc_count": 25096839 }, { "key": "60.0-75.0", "from": 60, "to": 75, "doc_count": 11278543 }, { "key": "75.0-85.0", "from": 75, "to": 85, "doc_count": 7424634 }, { "key": "85.0-*", "from": 85, "doc_count": 6284812 } ] } } }
这个返回结果的组名分别是 *-60.0 60.0-75.0 75.0-85.0 85.0-*
如果我们不想要这样的组名,可以自定义组名,例如:
POST http://localhost:9200/student_test1/_search?size=0 { "aggs" : { "chinese_ranges" : { "range" : { "field" : "chinese", "keyed" : true, "ranges" : [ { "key" : "不及格", "to" : 60 }, { "key" : "及格", "from" : 60, "to" : 75 }, { "key" : "良好", "from" : 75, "to" : 85 }, { "key" : "优秀", "from" : 85 } ] } } } }
查询结果将会是:
{ "took": 1675, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 10000, "relation": "gte" }, "max_score": null, "hits": [] }, "aggregations": { "chinese_ranges": { "buckets": { "不及格": { "to": 60, "doc_count": 25096839 }, "及格": { "from": 60, "to": 75, "doc_count": 11278543 }, "良好": { "from": 75, "to": 85, "doc_count": 7424634 }, "优秀": { "from": 85, "doc_count": 6284812 } } } } }
还有其它各种各样的、复杂的聚合查询,都是可以网上查资料,甚至还支持推荐系统的一些计算方法,例如矩阵的概念等等。
还可以参考 https://blog.csdn.net/alex_xfboy/article/details/86100037