Elasticsearch In-Depth: Aggregations
Aggregations [ˌægrɪˈgeʃən]
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations.html#search-aggregations
An aggregation summarizes your data as metrics, statistics, or other analytics. Aggregations help you answer questions like:
- What’s the average load time for my website?
- Who are my most valuable customers based on transaction volume?
- What would be considered a large file on my network?
- How many products are in each product category?
Elasticsearch organizes aggregations into three categories:
- Metric aggregations (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations-metrics.html) that calculate metrics, such as a sum or average, from field values.
- Bucket aggregations (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations-bucket.html) that group documents into buckets, also called bins, based on field values, ranges, or other criteria.
- Pipeline aggregations (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations-pipeline.html) that take input from other aggregations instead of documents or fields.
1. Run an aggregation
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations.html#run-an-agg
You can run aggregations as part of a search by specifying the search API's aggs parameter. The following search runs a terms aggregation on my-field:
curl -X GET "localhost:9200/my-index-000001/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "my-field"
      }
    }
  }
}
'
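My understanding: this is an Elasticsearch search request that uses the GET method against the index my-index-000001. It contains one aggregation, which analyzes the values of the specified field and returns the result. In detail:
- GET /my-index-000001/_search: a search against the index my-index-000001.
- aggs: the section that defines aggregations.
- my-agg-name: a custom name for this aggregation, used to identify its data in the response.
- terms: the aggregation type, here a terms aggregation, which counts the distinct values of a field and how many documents contain each one.
- field: the field the terms aggregation runs on, here my-field.
The query therefore counts the distinct values of my-field in my-index-000001 and the number of documents for each value. Look up my-agg-name in the response to see the result. This is useful for analyzing how data is distributed or how often particular values occur.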
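To make the counting concrete, here is a minimal local sketch of what a terms aggregation computes. It is an illustration in Python, not how Elasticsearch implements it: the sample documents and the helper name terms_buckets are hypothetical, size=10 mirrors the terms default, and breaking ties by key is a choice made for this sketch.

```python
from collections import Counter

def terms_buckets(docs, field, size=10):
    """Group documents by the value of `field` and count each group."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # Sort by descending count, then by key, and keep the top `size` buckets.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": n} for key, n in ordered[:size]]

# Hypothetical documents; "my-field" matches the field name in the request.
docs = [
    {"my-field": "foo"},
    {"my-field": "bar"},
    {"my-field": "foo"},
    {"other-field": 1},  # no "my-field", so it lands in no bucket
]
buckets = terms_buckets(docs, "my-field")
print(buckets)
```

Each bucket carries a key and a doc_count, the same two fields a terms aggregation returns per bucket.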
Aggregation results are in the response’s aggregations object:
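{
  "took": 78,                  // Query execution time in milliseconds (78 ms here).
  "timed_out": false,          // Whether the query timed out (false: it did not).
  "_shards": {                 // Shard information:
    "total": 1,                //   total number of shards (1 here)
    "successful": 1,           //   shards that ran the query successfully (1 here)
    "skipped": 0,              //   shards that were skipped (0 here)
    "failed": 0                //   shards that failed (0 here)
  },
  "hits": {                    // Hit information:
    "total": {                 //   total number of matching documents
      "value": 5,              //   the hit count (5 here)
      "relation": "eq"         //   how value relates to the true count ("eq": exactly equal)
    },
    "max_score": 1.0,          //   highest relevance score among the hits (1.0 here)
    "hits": [...]              //   the matching documents, elided here as "..."
  },
  "aggregations": {            // Aggregation results:
    "my-agg-name": {           //   the custom aggregation name defined in the request
      "doc_count_error_upper_bound": 0,  // upper bound on the document count error (0 here)
      "sum_other_doc_count": 0,          // documents in buckets not included in this result (0 here)
      "buckets": []                      // the list of result buckets (empty here)
    }
  }
}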
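My understanding: the response contains the time the query took (took), whether it timed out (timed_out), shard information (_shards), hit information (hits), and the aggregation results (aggregations).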
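As a rough sketch of how a client might read that structure, the Python snippet below parses a trimmed copy of the response and looks up the buckets by aggregation name. The helper get_buckets is a hypothetical name introduced here, and the hits section is left out for brevity.

```python
import json

# Trimmed copy of the response shown above ("hits" and "_shards" omitted).
raw = """
{
  "took": 78,
  "timed_out": false,
  "aggregations": {
    "my-agg-name": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}
"""

def get_buckets(response, agg_name):
    """Return the bucket list of a named aggregation, or [] if it is absent."""
    agg = response.get("aggregations", {}).get(agg_name, {})
    return agg.get("buckets", [])

response = json.loads(raw)
buckets = get_buckets(response, "my-agg-name")
print(len(buckets))
```

Looking results up by the name chosen in the request is why a descriptive aggregation name pays off.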
2. Change an aggregation's scope
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations.html#change-agg-scope
Use the query parameter to limit the documents on which an aggregation runs:
curl -X GET "localhost:9200/my-index-000001/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1d/d",
        "lt": "now/d"
      }
    }
  },
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "my-field"
      }
    }
  }
}
'
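The two date math expressions in the range query select all of yesterday: "/d" rounds down to the start of the day, so the window runs from midnight yesterday (inclusive) to midnight today (exclusive). The Python sketch below reproduces that window locally, assuming UTC and a fixed, hypothetical value for "now"; yesterday_window is a name made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

def yesterday_window(now):
    """Return (start, end) matching gte "now-1d/d" and lt "now/d" in UTC."""
    # "/d" rounds down to the start of the day.
    start_of_today = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    start_of_yesterday = start_of_today - timedelta(days=1)
    return start_of_yesterday, start_of_today

# A fixed, hypothetical "now" so the result is reproducible.
now = datetime(2023, 6, 15, 13, 45, tzinfo=timezone.utc)
start, end = yesterday_window(now)
print(start.isoformat(), end.isoformat())
```

Only documents whose @timestamp falls inside this window reach the aggregation.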
3. Return only aggregation results
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations.html#return-only-agg-results
By default, searches containing an aggregation return both search hits and aggregation results. To return only aggregation results, set size to 0:
curl -X GET "localhost:9200/my-index-000001/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "my-field"
      }
    }
  }
}
'
4. Run multiple aggregations
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations.html#run-multiple-aggs
You can specify multiple aggregations in the same request:
curl -X GET "localhost:9200/my-index-000001/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "my-first-agg-name": {
      "terms": {
        "field": "my-field"
      }
    },
    "my-second-agg-name": {
      "avg": {
        "field": "my-other-field"
      }
    }
  }
}
'
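When the request body is assembled in code rather than typed by hand, several aggregations are simply several keys under aggs. The Python sketch below builds the body above and also sets size to 0, as in section 3; build_search_body and its include_hits flag are hypothetical helpers introduced for this example, not part of any Elasticsearch client.

```python
import json

def build_search_body(aggs, include_hits=True):
    """Build a search request body; include_hits=False sets "size" to 0."""
    body = {"aggs": aggs}
    if not include_hits:
        body["size"] = 0
    return body

body = build_search_body(
    {
        "my-first-agg-name": {"terms": {"field": "my-field"}},
        "my-second-agg-name": {"avg": {"field": "my-other-field"}},
    },
    include_hits=False,
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the -d payload of the curl command.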
5. Run sub-aggregations
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations.html#run-sub-aggs
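My understanding: in Elasticsearch 8 aggregations, a sub-aggregation is a nested aggregation that runs inside a parent aggregation. Sub-aggregations let you perform multi-level analysis in a single search request.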
Bucket aggregations support bucket or metric sub-aggregations. For example, a terms aggregation with an avg sub-aggregation calculates an average value for each bucket of documents. There is no level or depth limit for nesting sub-aggregations.
Pipeline aggregations, by contrast, cannot have sub-aggregations (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations-pipeline.html#search-aggregations-pipeline).
curl -X GET "localhost:9200/my-index-000001/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "my-field"
      },
      "aggs": {
        "my-sub-agg-name": {
          "avg": {
            "field": "my-other-field"
          }
        }
      }
    }
  }
}
'
The response nests sub-aggregation results under their parent aggregation:
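{
  ...
  "aggregations": {
    "my-agg-name": {                     // the custom parent aggregation name
      "doc_count_error_upper_bound": 0,  // upper bound on the document count error
      "sum_other_doc_count": 0,          // documents in buckets not included in this result
      "buckets": [                       // the list of result buckets
        {
          "key": "foo",                  // the bucket key
          "doc_count": 5,                // number of documents in the bucket
          "my-sub-agg-name": {           // the custom sub-aggregation name
            "value": 75.0                // the value computed by the sub-aggregation
          }
        }
      ]
    }
  }
}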
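Because each sub-aggregation result sits inside its parent bucket, reading it means walking the buckets and looking up the child by name. A minimal Python sketch, using the aggregations object from the response above; sub_agg_values is a hypothetical helper name.

```python
# The "aggregations" part of the response shown above, as a Python dict.
aggregations = {
    "my-agg-name": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
            {"key": "foo", "doc_count": 5, "my-sub-agg-name": {"value": 75.0}}
        ],
    }
}

def sub_agg_values(aggregations, parent, child):
    """Map each bucket key of `parent` to the value of its `child` metric."""
    return {
        bucket["key"]: bucket[child]["value"]
        for bucket in aggregations[parent]["buckets"]
    }

averages = sub_agg_values(aggregations, "my-agg-name", "my-sub-agg-name")
print(averages)
```

The same pattern repeats at every nesting level, since a bucket sub-aggregation has its own buckets list.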
6. Add custom metadata
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations.html#add-metadata-to-an-agg
Use the meta object to associate custom metadata with an aggregation:
curl -X GET "localhost:9200/my-index-000001/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "my-agg-name": {
      "terms": {
        "field": "my-field"
      },
      "meta": {
        "my-metadata-field": "foo"
      }
    }
  }
}
'
The response returns the meta object in place:
{
  ...
  "aggregations": {
    "my-agg-name": {
      "meta": {
        "my-metadata-field": "foo"
      },
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}
7. Return the aggregation type
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations.html#return-agg-type
By default, aggregation results include the aggregation’s name but not its type. To return the aggregation type, use the typed_keys query parameter.
curl -X GET "localhost:9200/my-index-000001/_search?typed_keys&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "my-agg-name": {
      "histogram": {
        "field": "my-field",
        "interval": 1000
      }
    }
  }
}
'
The response returns the aggregation type as a prefix to the aggregation’s name.
Some aggregations return a different aggregation type from the type in the request. For example, the terms, significant terms, and percentiles aggregations return different aggregation types depending on the data type of the aggregated field.
{
  ...
  "aggregations": {
    "histogram#my-agg-name": {  // type and name: "histogram" is the type prefix, "my-agg-name" is the custom aggregation name
      "buckets": []             // the result buckets; empty here, meaning no results
    }
  }
}
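A client that receives typed keys has to split the prefix off again before it can look results up by name. The Python sketch below does this on the "#" separator seen in the response above; split_typed_key is a hypothetical helper, and treating a key without "#" as untyped is an assumption of this sketch.

```python
def split_typed_key(key):
    """Split "type#name" into (type, name); type is None without a prefix."""
    agg_type, sep, name = key.partition("#")
    if not sep:
        return None, key
    return agg_type, name

# The "aggregations" part of the response shown above.
aggregations = {"histogram#my-agg-name": {"buckets": []}}
parsed = {split_typed_key(key): value for key, value in aggregations.items()}
print(list(parsed))
```

Knowing the returned type is mainly useful when the client must pick a parser for the result without knowing the request that produced it.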
8. Use scripts in an aggregation
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations.html#use-scripts-in-an-agg
When a field doesn’t exactly match the aggregation you need, you should aggregate on a runtime field:
curl -X GET "localhost:9200/my-index-000001/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "runtime_mappings": {
    "message.length": {
      "type": "long",
      "script": "emit(doc[\u0027message.keyword\u0027].value.length())"
    }
  },
  "aggs": {
    "message_length": {
      "histogram": {
        "interval": 10,
        "field": "message.length"
      }
    }
  }
}
'
Scripts calculate field values dynamically, which adds a little overhead to the aggregation. In addition to the time spent calculating, some aggregations like terms and filters can’t use some of their optimizations with runtime fields. In total, the performance cost of using a runtime field varies from aggregation to aggregation.
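The runtime-field request in this section can be pictured as two steps: compute one value per document (the message length), then assign each value to a histogram bucket. The Python sketch below mimics both steps locally. It assumes a histogram offset of 0, so the bucket key is floor(value / interval) * interval, and unlike Elasticsearch it does not emit empty buckets between populated ones. The sample documents and the helper name histogram_buckets are hypothetical.

```python
import math
from collections import Counter

def histogram_buckets(values, interval):
    """Count values per bucket, where key = floor(value / interval) * interval."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]

# Hypothetical documents with a "message" field.
docs = [
    {"message": "short"},             # length 5
    {"message": "a longer message"},  # length 16
    {"message": "twelve chars"},      # length 12
]
# Equivalent of the runtime field: one computed value per document.
lengths = [len(doc["message"]) for doc in docs]
buckets = histogram_buckets(lengths, interval=10)
print(buckets)
```

The per-document computation in the second-to-last step is the overhead mentioned above: it runs at search time instead of being read from the index.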
9. Aggregation caches
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations.html#agg-caches
For faster responses, Elasticsearch caches the results of frequently run aggregations in the shard request cache. To get cached results, use the same preference string for each search. If you don’t need search hits, set size to 0 to avoid filling the cache.
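My understanding: setting size to 0 removes unneeded search hits from the response so that it carries only the aggregation results, which improves overall performance.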
Elasticsearch routes searches with the same preference string to the same shards. If the shards' data doesn’t change between searches, the shards return cached aggregation results.
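My understanding: as long as the data does not change, using the same preference string makes searches more efficient, because Elasticsearch can take the aggregation results straight from the cache instead of recomputing them.
In short, by routing searches with the same preference string to the same shards and caching aggregation results, Elasticsearch 8 can respond faster and search more efficiently in scenarios where the aggregation cache applies.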
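One way to keep the preference string stable is to build the search path in a single place. The Python sketch below does that with the preference and size query parameters of the search API; the helper search_path and the value "my-session-id" are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode

def search_path(index, preference, size=0):
    """Build a _search path that pins a preference string and sets size."""
    query = urlencode({"preference": preference, "size": size})
    return f"/{index}/_search?{query}"

# "my-session-id" is a hypothetical preference string; any stable value works.
path = search_path("my-index-000001", "my-session-id")
print(path)
```

Repeating a request with the same path and body gives the shards a chance to answer from the request cache, provided their data has not changed in between.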
10. Limits for long values
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations.html#limits-for-long-values
When running aggregations, Elasticsearch uses double values to hold and represent numeric data. As a result, aggregations on long numbers greater than 2^53 are approximate.
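The 2^53 boundary comes from the double format itself, whose significand holds 53 bits. The short Python sketch below shows the effect using Python floats, which are the same IEEE 754 doubles. It illustrates the arithmetic only and is not Elasticsearch code; the sample values are hypothetical.

```python
LIMIT = 2 ** 53  # 9007199254740992

# Just below 2^53, an integer still has an exact double representation.
print(float(LIMIT - 1) == LIMIT - 1)
# Above 2^53, neighbouring integers collapse onto the same double.
print(float(LIMIT + 1) == float(LIMIT))

values = [LIMIT, 1, 1]
exact_sum = sum(values)  # integer arithmetic, no rounding

# Accumulate as doubles, one addition at a time, as a sum over doubles would.
double_sum = 0.0
for v in values:
    double_sum += float(v)

print(exact_sum, int(double_sum))
```

The two sums differ by 2, which is the kind of approximation to expect from a sum or avg over very large long values.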