Elasticsearch In-Depth Series: What's new in 8.7?

What's new in 8.7?

https://www.elastic.co/guide/en/elasticsearch/reference/8.7/release-highlights.html, other versions: 8.6 | 8.5 | 8.4 | 8.3 | 8.2 | 8.1 | 8.0

1、Time series (TSDS) GA

Time Series Data Stream (TSDS) is a feature that optimizes Elasticsearch indices for time series data. It sorts the indices to achieve better compression and uses synthetic _source to reduce index size. As a result, TSDS indices are significantly smaller than non-time_series indices that contain the same data. TSDS is particularly useful for managing high-volume time series data.
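As a rough illustration of how a TSDS is switched on, the sketch below builds the body of an index template that enables time series mode. The pattern `metrics-demo-*` and the field names `host` and `cpu_usage` are made-up placeholders; the settings `index.mode` and `index.routing_path` and the mapping parameters `time_series_dimension` and `time_series_metric` are the ones the TSDS documentation describes. Nothing is sent to a cluster here; the function only assembles the JSON.

```python
import json


def tsds_index_template(pattern: str, dimension: str, metric: str) -> dict:
    """Build a body for PUT _index_template/<name> that enables time series mode.

    `pattern`, `dimension` and `metric` are illustrative placeholders.
    """
    return {
        "index_patterns": [pattern],
        "data_stream": {},  # the template backs a data stream
        "template": {
            "settings": {
                "index.mode": "time_series",
                # dimensions used to route documents of one series to one shard
                "index.routing_path": [dimension],
            },
            "mappings": {
                "properties": {
                    "@timestamp": {"type": "date"},
                    dimension: {"type": "keyword", "time_series_dimension": True},
                    metric: {"type": "double", "time_series_metric": "gauge"},
                }
            },
        },
    }


body = tsds_index_template("metrics-demo-*", "host", "cpu_usage")
print(json.dumps(body, indent=2))
```

The dimension fields identify a time series, which is what lets Elasticsearch sort and compress documents of the same series together.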

2、Downsampling GA

Downsampling is a feature that reduces the number of stored documents in Elasticsearch time series indices, resulting in smaller indices and improved query latency. This optimization is achieved by pre-aggregating time series indices, using the time_series index schema to identify the time series. Downsampling is configured as an action in ILM, which makes it a useful tool for managing large volumes of time series data in Elasticsearch.

Because the data is pre-aggregated, less computation is needed at query time, which is where the latency improvement comes from. And because downsampling reduces the number of indexed documents, it also lowers storage requirements, which matters a great deal in large deployments.
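The idea of pre-aggregation can be sketched in a few lines. The code below is a conceptual model only, not Elasticsearch's implementation: it groups sample documents by a dimension (`host`) and a fixed time bucket, and keeps a min/max/sum/value_count summary per bucket, similar in spirit to what downsampling stores for gauge metrics. The field names and sample values are made up. The `ilm_policy` dict shows how the `downsample` action with its `fixed_interval` option is placed in an ILM phase; the choice of the warm phase and of `1h` is an assumption for illustration.

```python
def downsample(docs, interval_s):
    """Collapse raw samples into one summary per (host, time bucket)."""
    buckets = {}
    for doc in docs:
        bucket_start = doc["ts"] - doc["ts"] % interval_s  # floor to the interval
        key = (doc["host"], bucket_start)
        value = doc["cpu"]
        summary = buckets.get(key)
        if summary is None:
            buckets[key] = {"min": value, "max": value, "sum": value, "value_count": 1}
        else:
            summary["min"] = min(summary["min"], value)
            summary["max"] = max(summary["max"], value)
            summary["sum"] += value
            summary["value_count"] += 1
    return buckets


# Hypothetical raw samples: timestamps in seconds, cpu in percent.
docs = [
    {"host": "a", "ts": 0, "cpu": 10},
    {"host": "a", "ts": 1800, "cpu": 30},
    {"host": "a", "ts": 3600, "cpu": 50},
    {"host": "b", "ts": 60, "cpu": 5},
]
hourly = downsample(docs, 3600)
print(len(docs), "raw docs ->", len(hourly), "downsampled docs")

# Body for PUT _ilm/policy/<name>; phase and interval are illustrative choices.
ilm_policy = {
    "policy": {
        "phases": {
            "warm": {"actions": {"downsample": {"fixed_interval": "1h"}}}
        }
    }
}
```

Averages can still be answered after downsampling, since `sum / value_count` is recoverable from each summary.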

3、Geohex aggregations on both geo_point and geo_shape fields

Elasticsearch 8.1.0 expanded geo_grid aggregation support from rectangular tiles (geotile and geohash) to include hexagonal tiles, but for geo_point only. Elasticsearch 8.7.0 now supports Geohex aggregations over geo_shape as well, which meets the long-standing need to perform hexagonal aggregations on spatial data.

In 2018 Uber announced they had open sourced their H3 library, enabling hexagonal tiling of the planet for much better analytics of their traffic and regional pricing models. Hexagonal tiles have become increasingly popular for analytics because each tile represents a very similar geographic area on the planet, and because the distance between tile centers is very similar in all directions and consistent across the map. These benefits are now available to all Elasticsearch users.
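A hexagonal aggregation request might look like the sketch below, which assembles a search body using the `geohex_grid` aggregation. The field name `location` and the aggregation name `hex_cells` are placeholders; the field can be mapped as geo_point or, from 8.7, as geo_shape. The precision check reflects the H3 resolution range of 0 to 15.

```python
def geohex_grid_search(field: str, precision: int) -> dict:
    """Build a _search body that buckets documents into H3 hexagonal cells."""
    if not 0 <= precision <= 15:
        raise ValueError("H3 precision must be between 0 and 15")
    return {
        "size": 0,  # only the buckets are of interest, not the hits
        "aggs": {
            "hex_cells": {
                "geohex_grid": {"field": field, "precision": precision}
            }
        },
    }


body = geohex_grid_search("location", 5)
print(body)
```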

4、Allow more than one KNN search clause

Some vector search scenarios require relevance ranking using several kNN clauses, for example when ranking is based on several fields that each have their own vector, or when a document includes one vector for the image and another for the text. The user may want to obtain a relevance ranking based on a combination of all of these kNN clauses.
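To make the image-plus-text case concrete, here is a minimal sketch that builds a search body whose `knn` section is a list of clauses. The field names `image-vector` and `title-vector`, the tiny three-dimensional vectors, and the boost values are all made up. `combined_score` assumes the per-clause scores are combined as a boost-weighted sum, which is how the kNN search documentation describes the combination; treat it as a model of the ranking rather than the actual scoring code.

```python
def multi_knn_search(clauses):
    """Build a _search body with several kNN clauses (supported from 8.7)."""
    required = {"field", "query_vector", "k", "num_candidates"}
    for clause in clauses:
        missing = required - clause.keys()
        if missing:
            raise ValueError(f"kNN clause is missing {sorted(missing)}")
    return {"knn": list(clauses)}


def combined_score(similarities, boosts):
    """Boost-weighted sum of the per-clause similarity scores (assumed model)."""
    return sum(boost * score for score, boost in zip(similarities, boosts))


body = multi_knn_search([
    {"field": "image-vector", "query_vector": [0.1, 0.2, 0.3],
     "k": 5, "num_candidates": 50, "boost": 0.5},
    {"field": "title-vector", "query_vector": [0.3, 0.1, 0.9],
     "k": 5, "num_candidates": 50, "boost": 0.25},
])
print(len(body["knn"]), "kNN clauses")
print(combined_score([0.5, 1.0], [0.5, 0.25]))
```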

5、Make natural language processing GA

From 8.7, NLP model management, model allocation, and support for inference against third party models are generally available. (The new text_embedding extension to knn search is still in technical preview.)

6、Speed up ingest geoip processors

The geoip ingest processor is significantly faster.

Previous versions of the geoip library needed special permission to execute databinding code, which required an expensive permissions check and an AccessController.doPrivileged call. The current version of the geoip library no longer requires that permission, so the expensive code has been removed, resulting in better performance for the ingest geoip processor.

7、Speed up ingest set and append processors

The set and append ingest processors that use mustache templates are significantly faster.

8、Improved downsampling performance

Several improvements were made to the performance of downsampling. All hashmap lookups were removed. The metrics and label producers were also modified so that they extract the doc_values directly from the leaves. This allows extra optimizations, for example label and counter producers do not extract doc_values unless they are consumed. These changes yielded a 3x-4x performance improvement in the downsampling operation, as measured by Elastic's benchmarks.

9、The Health API is now generally available

Elasticsearch introduces a new Health API designed to report the health of the cluster. The new API provides both a high-level overview of the cluster health and a very detailed report that can include a precise diagnosis and a resolution.

10、Improved performance for get, mget and indexing with explicit `_id`s

The false positive rate for the bloom filter on the _id field was reduced from ~10% to ~1%, reducing the I/O load when a term is not present in a segment. This improves performance when retrieving documents by _id, which happens when performing get or mget requests, or when issuing _bulk requests that provide explicit `_id`s.
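A back-of-envelope calculation shows why the lower false positive rate matters. The sketch below assumes a simplified model in which every segment that does not contain the requested `_id` independently passes the bloom filter with the false positive rate, and each false positive costs one unnecessary lookup. The segment count of 29 is a made-up number for illustration.

```python
import math


def expected_wasted_lookups(segments_without_id: int, false_positive_rate: float) -> float:
    """Expected number of segments searched in vain for one _id lookup."""
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be within [0, 1]")
    return segments_without_id * false_positive_rate


# Hypothetical shard: 30 segments, the document lives in exactly one of them.
before = expected_wasted_lookups(29, 0.10)
after = expected_wasted_lookups(29, 0.01)
print(f"before: {before:.2f} wasted lookups, after: {after:.2f}")
print("reduction factor is about 10:", math.isclose(before / after, 10.0))
```

Under this model the wasted work per lookup drops by roughly a factor of ten, independent of the segment count.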

11、Speed up ingest processing with multiple pipelines

Processing documents with both a request/default pipeline and a final pipeline is significantly faster.

Rather than marshalling a document from and to JSON once per pipeline, a document is now marshalled from JSON once before any pipelines execute and then back to JSON once after all pipelines have executed.
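The difference between the two strategies can be modelled with plain `json` calls. This is a toy illustration, not the Elasticsearch ingest code: the two "pipelines" are hypothetical Python functions that mutate a dict, and the counters simply record how many parse and serialize steps each strategy performs.

```python
import json


def default_pipeline(doc):
    doc["env"] = "prod"  # stand-in for a request/default pipeline


def final_pipeline(doc):
    doc["processed"] = True  # stand-in for a final pipeline


def run_marshal_per_pipeline(raw, pipelines):
    """Old behaviour: parse and serialize around every pipeline."""
    marshal_ops = 0
    for pipeline in pipelines:
        doc = json.loads(raw)
        pipeline(doc)
        raw = json.dumps(doc)
        marshal_ops += 2
    return raw, marshal_ops


def run_marshal_once(raw, pipelines):
    """New behaviour: parse once, run all pipelines, serialize once."""
    doc = json.loads(raw)
    for pipeline in pipelines:
        pipeline(doc)
    return json.dumps(doc), 2


raw = '{"message": "hello"}'
pipelines = [default_pipeline, final_pipeline]
old_result, old_ops = run_marshal_per_pipeline(raw, pipelines)
new_result, new_ops = run_marshal_once(raw, pipelines)
print(old_ops, "marshal steps before,", new_ops, "after")
```

Both strategies produce the same document; only the number of conversions changes, and the saving grows with the number of pipelines a document passes through.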

12、Support geo_grid ingest processor

The geo_grid ingest processor supports creating indexable geometries from geohash, geotile and H3 cells.

There is already a circle ingest processor that creates a polygon from a point and radius definition. This concept is useful when spatial operations that work on indexable geometries need to be applied to geometric objects that are not defined spatially (or at least are not indexable by Lucene). For example, the string 4/8/5 has no spatial meaning until we interpret it as the address of a rectangular geotile and save the bounding box defining its border for further use. Likewise, we can interpret geohash strings like u0 as a tile, and H3 strings like 811fbffffffffff as a hexagonal cell, saving the cell border as a polygon.
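To show what "interpreting 4/8/5 as a geotile" means, the sketch below computes the bounding box of a `zoom/x/y` tile address using the standard Web Mercator tile formulas. It is an independent illustration of the geometry, not the processor's code. The `pipeline` dict shows how the geo_grid processor itself might be configured; the field name `geocell` is a placeholder.

```python
import math


def geotile_bounds(address: str) -> dict:
    """Return the lon/lat bounding box of a 'zoom/x/y' Web Mercator tile."""
    zoom, x, y = (int(part) for part in address.split("/"))
    tiles_per_axis = 2 ** zoom

    def lon(column):
        return column / tiles_per_axis * 360.0 - 180.0

    def lat(row):
        # inverse Mercator projection of the tile row boundary
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * row / tiles_per_axis))))

    return {"west": lon(x), "south": lat(y + 1), "east": lon(x + 1), "north": lat(y)}


bounds = geotile_bounds("4/8/5")
print(bounds)

# Body for PUT _ingest/pipeline/<name>; "geocell" is an illustrative field name.
pipeline = {
    "description": "turn a z/x/y geotile address into an indexable geometry",
    "processors": [
        {"geo_grid": {"field": "geocell", "tile_type": "geotile", "target_format": "wkt"}}
    ],
}
```

The tile 4/8/5 spans longitudes 0 to 22.5 and latitudes of roughly 41 to 55.8 degrees north, which is the rectangle the processor would store as a polygon.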

13、Make frequent_item_sets aggregation GA

The frequent_item_sets aggregation has been moved from technical preview to general availability.

14、Release time_series and rate (on counter fields) aggregations as tech preview

The time_series aggregation and the rate aggregation (on counter fields) are now available without the time series feature flag. This change makes these aggregations available as tech preview.

Currently there is no documentation about the time_series aggregation. This will be added in a follow-up change.

posted @ 2023-05-01 17:23  左扬