Elasticsearch In-Depth Series: Aggregations: Pipeline aggregations
Aggregations: Pipeline aggregations
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations-pipeline.html#search-aggregations-pipeline
Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree. There are many different types of pipeline aggregation, each computing different information from other aggregations, but these types can be broken down into two families:
Parent
A family of pipeline aggregations that is provided with the output of its parent aggregation and is able to compute new buckets or new aggregations to add to existing buckets.
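My understanding: a parent pipeline aggregation builds on existing aggregation results, computing new buckets or adding new aggregations to existing buckets. This is useful in search, data analysis, and reporting scenarios. It helps to keep the difference from regular aggregations in mind: regular aggregations operate directly on documents (computing an average, sum, or maximum, for example), while pipeline aggregations operate on the results of other aggregations, which removes the need to post-process the data on the client.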
Sibling
Pipeline aggregations that are provided with the output of a sibling aggregation and are able to compute a new aggregation which will be at the same level as the sibling aggregation.
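My understanding: in Elasticsearch 8, a sibling pipeline aggregation takes the output of a sibling aggregation as its input and computes a new aggregation that sits at the same level as that sibling. This is useful when the results of an aggregation at that level need to be summarized or compared within the same response.
Use cases:
1. Cross-dimension or cross-field analysis: when results computed over several fields or dimensions need to be analyzed together, a sibling pipeline aggregation brings them into a single response.
2. Grouping and comparison: when grouped data needs to be compared (for example, different products in the same time period of a sales dataset), a sibling pipeline aggregation can derive a new aggregation from the grouped results.
3. Further calculations: values such as sums, averages, and percentages can be computed from aggregations at the same level.
Advantages:
1. Simpler data processing: the derived result arrives in the same query response, so the client does not have to combine the data by hand.
2. Support for cross-dimension or cross-field analysis, as described above.
3. Support for further calculations such as sums, averages, and percentages.
In summary, sibling pipeline aggregations in Elasticsearch 8 compute a new aggregation result from aggregations at the same level, which keeps this kind of processing inside a single query.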
Pipeline aggregations can reference the aggregations they need to perform their computation by using the buckets_path parameter to indicate the paths to the required metrics. The syntax for defining these paths can be found in the buckets_path Syntax section below.
Pipeline aggregations cannot have sub-aggregations, but depending on the type, a pipeline aggregation can reference another pipeline aggregation in its buckets_path, allowing pipeline aggregations to be chained. For example, you can chain together two derivatives to calculate the second derivative (i.e. a derivative of a derivative).
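To make the chaining idea concrete, here is a minimal Python sketch. It is not Elasticsearch code: it only mimics what a derivative does over a series of bucket values, then feeds that output into a second derivative. The request body and the names my_date_histo, the_sum, the_deriv, and the_deriv_2, as well as the sample values, are assumptions chosen for illustration.

```python
from typing import List, Optional

# Assumed request body: the second derivative's buckets_path points at the
# first derivative, which in turn points at the "the_sum" metric.
chained_request = {
    "aggs": {
        "my_date_histo": {
            "date_histogram": {"field": "timestamp", "calendar_interval": "day"},
            "aggs": {
                "the_sum": {"sum": {"field": "lemmings"}},
                "the_deriv": {"derivative": {"buckets_path": "the_sum"}},
                "the_deriv_2": {"derivative": {"buckets_path": "the_deriv"}},
            },
        }
    }
}


def derivative(values: List[Optional[float]]) -> List[Optional[float]]:
    """Difference between each value and the previous one.

    The first position has no previous value, so it yields None, as does
    any position where either operand is missing.
    """
    if not values:
        return []
    result: List[Optional[float]] = [None]
    for prev, curr in zip(values, values[1:]):
        if prev is None or curr is None:
            result.append(None)
        else:
            result.append(curr - prev)
    return result


daily_sums = [550.0, 60.0, 375.0, 585.0]  # hypothetical "the_sum" per bucket
first = derivative(daily_sums)            # plays the role of "the_deriv"
second = derivative(first)                # plays the role of "the_deriv_2"
print(first)
print(second)
```

Each extra link in the chain loses one more leading value, because a derivative has nothing to compare its first input against.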
1. buckets_path Syntax
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations-pipeline.html#buckets-path-syntax
Most pipeline aggregations require another aggregation as their input. The input aggregation is defined via the buckets_path parameter, which follows a specific format:
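AGG_SEPARATOR       =  `>` ;
METRIC_SEPARATOR    =  `.` ;
AGG_NAME            =  <the name of the aggregation> ;
METRIC              =  <the name of the metric (in case of multi-value metrics aggregation)> ;
MULTIBUCKET_KEY     =  `[<KEY_NAME>]`
PATH                =  <AGG_NAME><MULTIBUCKET_KEY>? (<AGG_SEPARATOR>, <AGG_NAME> )* ( <METRIC_SEPARATOR>, <METRIC> ) ;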
For example, the path "my_bucket>my_stats.avg" will path to the avg value in the "my_stats" metric, which is contained in the "my_bucket" bucket aggregation.
Here are some more examples:
- multi_bucket["foo"]>single_bucket>multi_metric.avg will go to the avg metric in the "multi_metric" agg under the single bucket "single_bucket" within the "foo" bucket of the "multi_bucket" multi-bucket aggregation.
- agg1["foo"]._count will get the _count metric for the "foo" bucket in the multi-bucket aggregation "agg1".
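The path format above can be illustrated with a small parser. This is a simplified sketch written for these notes, not the parser Elasticsearch uses: it assumes names contain no ">" or "." characters and that bucket keys contain no ">" or "]". The PathElement type and parse_buckets_path function are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional


class PathElement(NamedTuple):
    name: str               # aggregation name
    key: Optional[str]      # bucket key taken from [...], if present
    metric: Optional[str]   # metric name after '.', if present


# One path element: NAME, then an optional [KEY], then an optional .METRIC
_ELEMENT = re.compile(
    r"^(?P<name>[^\[\].>]+)"
    r"(?:\[(?P<key>[^\]]*)\])?"
    r"(?:\.(?P<metric>[^\[\].>]+))?$"
)


def parse_buckets_path(path: str) -> List[PathElement]:
    """Split a buckets_path on '>' and decode each element."""
    elements: List[PathElement] = []
    for part in path.split(">"):
        match = _ELEMENT.match(part)
        if match is None:
            raise ValueError(f"cannot parse path element: {part!r}")
        key = match.group("key")
        if key is not None:
            key = key.strip("'\"")  # drop surrounding quotes around the key
        elements.append(PathElement(match.group("name"), key, match.group("metric")))
    return elements


for element in parse_buckets_path('multi_bucket["foo"]>single_bucket>multi_metric.avg'):
    print(element)
```

Reading the printed elements left to right gives the same walk as the prose: pick the "foo" bucket of multi_bucket, descend into single_bucket, then take avg from multi_metric.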
Paths are relative from the position of the pipeline aggregation; they are not absolute paths, and the path cannot go back "up" the aggregation tree. For example, this derivative is embedded inside a date_histogram and refers to a "sibling" metric "the_sum":
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "my_date_histo": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "the_sum": {
          "sum": { "field": "lemmings" }
        },
        "the_deriv": {
          "derivative": { "buckets_path": "the_sum" }
        }
      }
    }
  }
}
'
buckets_path is also used for Sibling pipeline aggregations, where the aggregation is "next" to a series of buckets instead of embedded "inside" them. For example, the max_bucket aggregation uses the buckets_path to specify a metric embedded inside a sibling aggregation:
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "sales": {
          "sum": { "field": "price" }
        }
      }
    },
    "max_monthly_sales": {
      "max_bucket": {
        "buckets_path": "sales_per_month>sales"
      }
    }
  }
}
'
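As a rough illustration of what max_bucket computes for the path sales_per_month>sales, the following Python sketch walks a list of month buckets and returns the largest "sales" value together with the key or keys of the buckets that hold it. The bucket data is made up, and the function is a simplified model rather than Elasticsearch's implementation.

```python
from typing import Dict, List, Optional


def max_bucket(buckets: List[Dict], metric: str) -> Dict:
    """Return the maximum metric value across buckets and the keys holding it."""
    best_value: Optional[float] = None
    best_keys: List[str] = []
    for bucket in buckets:
        value = bucket.get(metric, {}).get("value")
        if value is None:
            continue  # a bucket without a value contributes nothing
        if best_value is None or value > best_value:
            best_value, best_keys = value, [bucket["key_as_string"]]
        elif value == best_value:
            best_keys.append(bucket["key_as_string"])
    return {"keys": best_keys, "value": best_value}


# Hypothetical buckets of the "sales_per_month" date_histogram
sales_per_month = [
    {"key_as_string": "2015/01/01 00:00:00", "doc_count": 3, "sales": {"value": 550.0}},
    {"key_as_string": "2015/02/01 00:00:00", "doc_count": 2, "sales": {"value": 60.0}},
    {"key_as_string": "2015/03/01 00:00:00", "doc_count": 2, "sales": {"value": 375.0}},
]

max_monthly_sales = max_bucket(sales_per_month, "sales")
print(max_monthly_sales)
```

The result sits beside sales_per_month rather than inside each month bucket, which is what "sibling" refers to.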
If a Sibling pipeline agg references a multi-bucket aggregation, such as a terms agg, it also has the option to select specific keys from the multi-bucket. For example, a bucket_script could select two specific buckets (via their bucket keys) to perform the calculation:
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "sale_type": {
          "terms": { "field": "type" },
          "aggs": {
            "sales": {
              "sum": { "field": "price" }
            }
          }
        },
        "hat_vs_bag_ratio": {
          "bucket_script": {
            "buckets_path": {
              "hats": "sale_type[\u0027hat\u0027]>sales",
              "bags": "sale_type[\u0027bag\u0027]>sales"
            },
            "script": "params.hats / params.bags"
          }
        }
      }
    }
  }
}
'
buckets_path selects the hats and bags buckets (via ['hat'] and ['bag']) to use in the script specifically, instead of fetching all the buckets from the sale_type aggregation.
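The key selection can be pictured with the Python sketch below, which looks up the 'hat' and 'bag' buckets of a terms result by key and divides their "sales" values. It is an illustrative model only: the sample bucket is invented, the helper names are hypothetical, and returning None when a bucket is missing or the divisor is zero is a choice made for this sketch, not a statement about how bucket_script behaves.

```python
from typing import Dict, List, Optional


def metric_for_key(terms_buckets: List[Dict], key: str, metric: str) -> Optional[float]:
    """Find the terms bucket with the given key and return its metric value."""
    for bucket in terms_buckets:
        if bucket["key"] == key:
            return bucket[metric]["value"]
    return None  # no bucket with that key


def hat_vs_bag_ratio(month_bucket: Dict) -> Optional[float]:
    """Model of the script params.hats / params.bags for one month bucket."""
    sale_type = month_bucket["sale_type"]["buckets"]
    hats = metric_for_key(sale_type, "hat", "sales")
    bags = metric_for_key(sale_type, "bag", "sales")
    if hats is None or bags is None or bags == 0:
        return None  # sketch-specific handling of missing input
    return hats / bags


# Hypothetical month bucket from the "sales_per_month" histogram
january = {
    "key_as_string": "2015/01/01 00:00:00",
    "sale_type": {
        "buckets": [
            {"key": "hat", "doc_count": 1, "sales": {"value": 200.0}},
            {"key": "bag", "doc_count": 1, "sales": {"value": 150.0}},
            {"key": "t-shirt", "doc_count": 1, "sales": {"value": 200.0}},
        ]
    },
}

print(hat_vs_bag_ratio(january))
```

Only the two selected buckets take part in the calculation; the "t-shirt" bucket is ignored.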
2. Special Paths
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations-pipeline.html#_special_paths
Instead of pathing to a metric, buckets_path can use a special "_count" path. This instructs the pipeline aggregation to use the document count as its input. For example, a derivative can be calculated on the document count of each bucket, instead of a specific metric:
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "my_date_histo": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day"
      },
      "aggs": {
        "the_deriv": {
          "derivative": { "buckets_path": "_count" }
        }
      }
    }
  }
}
'
By using _count instead of a metric name, we can calculate the derivative of document counts in the histogram.
The buckets_path can also use "_bucket_count" and path to a multi-bucket aggregation to use the number of buckets returned by that aggregation in the pipeline aggregation instead of a metric. For example, a bucket_selector can be used here to filter out buckets which contain no buckets for an inner terms aggregation:
curl -X POST "localhost:9200/sales/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "histo": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "day"
      },
      "aggs": {
        "categories": {
          "terms": { "field": "category" }
        },
        "min_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "categories._bucket_count"
            },
            "script": {
              "source": "params.count != 0"
            }
          }
        }
      }
    }
  }
}
'
By using _bucket_count instead of a metric name, we can filter out histo buckets that contain no buckets for the categories aggregation.
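The effect of this selector can be sketched in a few lines of Python: for each histo bucket, count the buckets of its inner categories aggregation and keep the histo bucket only when that count is not zero. The sample data and the function name are assumptions for illustration; this models the outcome, not the server-side implementation.

```python
from typing import Dict, List


def select_non_empty(histo_buckets: List[Dict]) -> List[Dict]:
    """Keep histo buckets whose inner 'categories' aggregation has buckets.

    len(...) plays the role of categories._bucket_count, and the comparison
    mirrors the script "params.count != 0".
    """
    return [
        bucket
        for bucket in histo_buckets
        if len(bucket["categories"]["buckets"]) != 0
    ]


# Hypothetical daily buckets: the second day has no category buckets
histo = [
    {"key_as_string": "2015-01-01", "categories": {"buckets": [{"key": "hat", "doc_count": 2}]}},
    {"key_as_string": "2015-01-02", "categories": {"buckets": []}},
    {"key_as_string": "2015-01-03", "categories": {"buckets": [{"key": "bag", "doc_count": 1}]}},
]

kept = select_non_empty(histo)
print([bucket["key_as_string"] for bucket in kept])
```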
3. Dealing with dots in agg names
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations-pipeline.html#dots-in-agg-names
An alternate syntax is supported to cope with aggregations or metrics which have dots in the name, such as the 99.9th percentile. This metric may be referred to as:
"buckets_path": "my_percentile[99.9]"
4. Dealing with gaps in the data
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/search-aggregations-pipeline.html#gap-policy
Data in the real world is often noisy and sometimes contains gaps: places where data simply doesn't exist. This can occur for a variety of reasons, the most common being:
- Documents falling into a bucket do not contain a required field
- There are no documents matching the query for one or more buckets
- The metric being calculated is unable to generate a value, likely because another dependent bucket is missing a value. Some pipeline aggregations have specific requirements that must be met (e.g. a derivative cannot calculate a metric for the first value because there is no previous value, HoltWinters moving average needs "warmup" data to begin calculating, etc.)
Gap policies are a mechanism to inform the pipeline aggregation about the desired behavior when "gappy" or missing data is encountered. All pipeline aggregations accept the gap_policy parameter. There are currently three gap policies to choose from:
- skip: This option treats missing data as if the bucket does not exist. It will skip the bucket and continue calculating using the next available value.
- insert_zeros: This option will replace missing values with a zero (0) and pipeline aggregation computation will proceed as normal.
- keep_values: This option is similar to skip, except if the metric provides a non-null, non-NaN value this value is used, otherwise the empty bucket is skipped.
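The three policies can be compared with a small Python sketch that decides which input value a pipeline computation would see for one bucket. This is a simplified model built only from the descriptions above, not Elasticsearch's actual logic: it assumes a "gap" means either an empty bucket (doc_count of 0) or a metric value that is missing or NaN, and the function name resolve_value is hypothetical.

```python
import math
from typing import Optional


def resolve_value(value: Optional[float], doc_count: int, gap_policy: str) -> Optional[float]:
    """Return the value to feed into the computation, or None to skip the bucket."""
    usable = value is not None and not math.isnan(value)
    is_gap = doc_count == 0 or not usable  # assumption: what counts as a gap
    if not is_gap:
        return value
    if gap_policy == "skip":
        return None                         # treat the bucket as if it did not exist
    if gap_policy == "insert_zeros":
        return 0.0                          # replace the missing value with zero
    if gap_policy == "keep_values":
        return value if usable else None    # keep a real value, otherwise skip
    raise ValueError(f"unknown gap_policy: {gap_policy}")


# An empty bucket whose metric still reports a value, under each policy
for policy in ("skip", "insert_zeros", "keep_values"):
    print(policy, resolve_value(5.0, 0, policy))
```

In this model, keep_values differs from skip only for empty buckets whose metric still reports a usable value.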