Elasticsearch in Depth: REST APIs: Document APIs: Term vectors API

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors

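Term vectors store the term information (the terms produced by analyzing a field's text) and statistics for the fields of a document. The document can be one stored in the index or an artificial one supplied by the user.

Term vectors are real-time by default.

Retrieves information and statistics for terms in the fields of a particular document.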

curl -X GET "localhost:9200/my-index-000001/_termvectors/1?pretty"

1. Request

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-api-request

        GET /<index>/_termvectors/<_id>

        GET /<index>/_termvectors

2. Prerequisites

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-api-prereqs

If the Elasticsearch security features are enabled, you must have the read index privilege for the target index or index alias.

3. Description

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-api-desc

You can retrieve term vectors for documents stored in the index or for artificial documents passed in the body of the request.

You can specify the fields you are interested in through the fields parameter, or by adding the fields to the request body.

curl -X GET "localhost:9200/my-index-000001/_termvectors/1?fields=message&pretty"

Fields can be specified using wildcards, similar to the multi match query (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/query-dsl-multi-match-query.html).

Term vectors are real-time by default, not near real-time. This can be changed by setting the realtime parameter to false.

You can request three types of values: term information, term statistics and field statistics. By default, all term information and field statistics are returned for all fields, but term statistics are excluded.
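The parameters described above can be combined in a small helper. The following is only a sketch: the function name build_termvectors_request is hypothetical, it merely builds the request path and JSON body (nothing is sent), and it assumes realtime is passed as a query parameter while the statistics flags and fields go in the body.

```python
import json
from urllib.parse import urlencode


def build_termvectors_request(index, doc_id=None, fields=None, realtime=True,
                              term_statistics=False, field_statistics=True):
    """Return (path, body) for a term vectors request; nothing is sent.

    Hypothetical helper for illustration. The defaults mirror the text:
    real-time on, field statistics on, term statistics off.
    """
    path = f"/{index}/_termvectors"
    if doc_id is not None:
        path += f"/{doc_id}"
    params = {}
    if not realtime:
        # Only add the query parameter when overriding the default.
        params["realtime"] = "false"
    if params:
        path += "?" + urlencode(params)
    body = {
        "term_statistics": term_statistics,
        "field_statistics": field_statistics,
    }
    if fields:
        body["fields"] = list(fields)
    return path, json.dumps(body, sort_keys=True)


path, body = build_termvectors_request(
    "my-index-000001", 1, fields=["message"], realtime=False, term_statistics=True
)
print(path)
print(body)
```

The returned path and body can then be passed to curl or to any HTTP client.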

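My understanding: in Elasticsearch 8, term vectors are real-time by default rather than near real-time. This differs from other Elasticsearch features such as search and aggregations, which are usually near real-time. Real-time here means that when you request the term vectors of a document, the result already reflects the latest changes to that document, even if a refresh has not yet made them visible to search.

A plausible reason for this default is that term vectors are often used for analysis tasks such as document similarity, clustering, or text classification, where you want the most recent version of a document. For search and aggregations, near real-time data is sufficient in many cases.

This behaviour can be changed. If you do not need real-time term vectors, set the realtime parameter to false; the request then only sees near real-time data, which may improve performance in some scenarios.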

4. Term information

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-api-term-info

    • term frequency in the field (always returned)

    • term positions (positions: true)

    • start and end offsets (offsets: true)

    • term payloads (payloads: true), as base64 encoded bytes

If the requested information wasn't stored in the index, it will be computed on the fly if possible. Additionally, term vectors can be computed for documents that do not even exist in the index but are instead provided by the user.

Start and end offsets assume UTF-16 encoding is being used. If you want to use these offsets to get the original text that produced a token, make sure that the string you are taking a sub-string of is also encoded using UTF-16.
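To illustrate the UTF-16 point: in a language whose strings are indexed by code point (such as Python), slicing directly with the returned offsets goes wrong as soon as the text contains characters outside the Basic Multilingual Plane. A minimal sketch, where the helper name slice_by_utf16_offsets and the sample text are made up for illustration:

```python
def slice_by_utf16_offsets(text, start_offset, end_offset):
    """Return the substring covered by UTF-16 code-unit offsets."""
    encoded = text.encode("utf-16-le")  # exactly 2 bytes per UTF-16 code unit
    return encoded[start_offset * 2:end_offset * 2].decode("utf-16-le")


# U+1F600 occupies two UTF-16 code units but is a single Python character,
# so "test" starts at UTF-16 offset 3 (emoji = 2 units, space = 1 unit).
text = "\U0001F600 test"

print(slice_by_utf16_offsets(text, 3, 7))  # slices in UTF-16 code units
print(text[3:7])                           # naive code-point slicing is off by one
```

For plain ASCII text, both approaches give the same result.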

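My understanding: in Elasticsearch 8, term information describes how each term occurs within a single field of a document: its frequency in that field and, optionally, its positions, offsets and payloads. The index-wide figures, namely total term frequency (how often the term occurs across all documents) and document frequency (how many documents contain the term), belong to term statistics, which are covered in the next section. Together, this data plays a key role in full-text search and relevance scoring, helping Elasticsearch decide which documents are more relevant to a query and how to rank them.

With this term data, Elasticsearch can:

1. Improve search performance: term data speeds up document lookup and filtering.

2. Improve search quality: term frequency and document frequency are used to compute relevance, so results match the keywords better.

3. Support advanced search features such as weighted ranking, filtering, and aggregations.

Note that term information is returned by default. It is term statistics that are not returned by default, because computing them can have a serious performance impact. If you do need them, enable them according to your actual requirements, hardware, and system performance.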

5. Term statistics

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-api-term-stats

Setting term_statistics to true (default is false) will return

    • total term frequency (how often a term occurs in all documents)

    • document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.

6. Field statistics

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-api-field-stats

Setting field_statistics to false (default is true) will omit:

    • document count (how many documents contain this field)

    • sum of document frequencies (the sum of document frequencies for all terms in this field)

    • sum of total term frequencies (the sum of total term frequencies of each term in this field)
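To make the three field statistics concrete, here is a rough local sketch that computes them for two short texts. It only imitates a whitespace tokenizer with lowercasing and is an illustration of the definitions above, not how Elasticsearch computes them internally; the function name field_statistics is hypothetical.

```python
from collections import Counter


def field_statistics(docs):
    """Compute doc_count, sum_doc_freq and sum_ttf for one field.

    Tokenization is a plain lowercase + whitespace split (an assumption
    made for illustration).
    """
    doc_freq = Counter()  # number of documents containing each term
    ttf = Counter()       # total occurrences of each term over all documents
    doc_count = 0
    for text in docs:
        tokens = text.lower().split()
        if not tokens:
            continue      # a document without terms in this field is not counted
        doc_count += 1
        ttf.update(tokens)
        doc_freq.update(set(tokens))
    return {
        "doc_count": doc_count,
        "sum_doc_freq": sum(doc_freq.values()),
        "sum_ttf": sum(ttf.values()),
    }


stats = field_statistics(["test test test ", "Another test ..."])
print(stats)
```

Here the term "test" appears in both texts (document frequency 2) and four times overall (total term frequency 4), while "another" and "..." each appear once.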

7. Terms filtering

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-api-terms-filtering

With the parameter filter, the terms returned can also be filtered based on their tf-idf scores. This can be useful for finding a good characteristic vector of a document. This feature works in a similar manner to the second phase (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/query-dsl-mlt-query.html#mlt-query-term-selection) of the More Like This Query (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/query-dsl-mlt-query.html). See example 5 (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-terms-filtering) for usage.

The following sub-parameters are supported:

max_num_terms

Maximum number of terms that must be returned per field. Defaults to 25.

min_term_freq

Ignore words with less than this frequency in the source doc. Defaults to 1.

max_term_freq

Ignore words with more than this frequency in the source doc. Defaults to unbounded.

min_doc_freq

Ignore terms which do not occur in at least this many docs. Defaults to 1.

max_doc_freq

Ignore words which occur in more than this many docs. Defaults to unbounded.

min_word_length

The minimum word length below which words will be ignored. Defaults to 0.

max_word_length

The maximum word length above which words will be ignored. Defaults to unbounded (0).
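The threshold sub-parameters can be pictured as a simple predicate applied to each term. The sketch below emulates them locally on hypothetical sample data; the real filtering happens on the server, and the function name apply_term_filter and the sample numbers are assumptions made for illustration. Selecting the top max_num_terms by score is left out here.

```python
def apply_term_filter(terms, min_term_freq=1, max_term_freq=None,
                      min_doc_freq=1, max_doc_freq=None,
                      min_word_length=0, max_word_length=0):
    """Keep only the terms that pass the frequency and length thresholds.

    None means unbounded; for max_word_length, 0 also means unbounded,
    matching the defaults listed above.
    """
    kept = {}
    for term, stats in terms.items():
        tf, df = stats["term_freq"], stats["doc_freq"]
        if tf < min_term_freq or (max_term_freq is not None and tf > max_term_freq):
            continue
        if df < min_doc_freq or (max_doc_freq is not None and df > max_doc_freq):
            continue
        if len(term) < min_word_length or (max_word_length and len(term) > max_word_length):
            continue
        kept[term] = stats
    return kept


# Hypothetical per-term numbers, for illustration only.
sample_terms = {
    "armored": {"term_freq": 1, "doc_freq": 27},
    "to": {"term_freq": 3, "doc_freq": 150000},
    "an": {"term_freq": 1, "doc_freq": 90000},
}
filtered = apply_term_filter(sample_terms, max_doc_freq=1000, min_word_length=3)
print(sorted(filtered))
```

With these thresholds, very common or very short terms are dropped before any scoring takes place.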

8. Behaviour

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-api-behavior

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures, whereas the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, a shard to get the statistics from is randomly selected. Use routing only to hit a particular shard.

9. Returning stored term vectors

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-api-stored-termvectors

First, we create an index that stores term vectors, payloads, etc.:

        curl -X PUT "localhost:9200/my-index-000001?pretty" -H 'Content-Type: application/json' -d'
        { "mappings": {
            "properties": {
              "text": {
                "type": "text",
                "term_vector": "with_positions_offsets_payloads",
                "store" : true,
                "analyzer" : "fulltext_analyzer"
               },
               "fullname": {
                "type": "text",
                "term_vector": "with_positions_offsets_payloads",
                "analyzer" : "fulltext_analyzer"
              }
            }
          },
          "settings" : {
            "index" : {
              "number_of_shards" : 1,
              "number_of_replicas" : 0
            },
            "analysis": {
              "analyzer": {
                "fulltext_analyzer": {
                  "type": "custom",
                  "tokenizer": "whitespace",
                  "filter": [
                    "lowercase",
                    "type_as_payload"
                  ]
                }
              }
            }
          }
        }'
    

Second, we add some documents:

        curl -X PUT "localhost:9200/my-index-000001/_doc/1?pretty" -H 'Content-Type: application/json' -d'
        {
          "fullname" : "John Doe",
          "text" : "test test test "
        }
        '
        curl -X PUT "localhost:9200/my-index-000001/_doc/2?refresh=wait_for&pretty" -H 'Content-Type: application/json' -d'
        {
          "fullname" : "Jane Doe",
          "text" : "Another test ..."
        }'
     

The following request returns all information and statistics for field text in document 1 (John Doe):

        curl -X GET "localhost:9200/my-index-000001/_termvectors/1?pretty" -H 'Content-Type: application/json' -d'
        {
          "fields" : ["text"],
          "offsets" : true,
          "payloads" : true,
          "positions" : true,
          "term_statistics" : true,
          "field_statistics" : true
        }'
     

Response:

        {
            "_index": "my-index-000001",
            "_id": "1",
            "_version": 1,
            "found": true,
            "took": 6,
            "term_vectors": {
              "text": {
                "field_statistics": {
                  "sum_doc_freq": 4,
                  "doc_count": 2,
                  "sum_ttf": 6
                },
                "terms": {
                  "test": {
                    "doc_freq": 2,
                    "ttf": 4,
                    "term_freq": 3,
                    "tokens": [
                      {
                        "position": 0,
                        "start_offset": 0,
                        "end_offset": 4,
                        "payload": "d29yZA=="
                      },
                      {
                        "position": 1,
                        "start_offset": 5,
                        "end_offset": 9,
                        "payload": "d29yZA=="
                      },
                      {
                        "position": 2,
                        "start_offset": 10,
                        "end_offset": 14,
                        "payload": "d29yZA=="
                      }
                    ]
                  }
                }
              }
            }
          }
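A response of this shape can be post-processed on the client side. The sketch below parses a trimmed copy of the response and decodes each base64 payload. The decoded value is "word", which is presumably the token type that the type_as_payload filter stored; that interpretation is an assumption, and the helper name decode_tokens is hypothetical.

```python
import base64
import json

# Trimmed copy of the response, keeping only the parts used here.
response_text = """
{
  "term_vectors": {
    "text": {
      "terms": {
        "test": {
          "term_freq": 3,
          "tokens": [
            {"position": 0, "start_offset": 0, "end_offset": 4, "payload": "d29yZA=="},
            {"position": 1, "start_offset": 5, "end_offset": 9, "payload": "d29yZA=="},
            {"position": 2, "start_offset": 10, "end_offset": 14, "payload": "d29yZA=="}
          ]
        }
      }
    }
  }
}
"""


def decode_tokens(response, field, term):
    """Return (position, start_offset, end_offset, decoded payload) per token."""
    tokens = response["term_vectors"][field]["terms"][term]["tokens"]
    return [
        (t["position"], t["start_offset"], t["end_offset"],
         base64.b64decode(t["payload"]).decode("utf-8"))
        for t in tokens
    ]


tokens = decode_tokens(json.loads(response_text), "text", "test")
for token in tokens:
    print(token)
```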
    

10. Generating term vectors on the fly

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-api-generate-termvectors

Term vectors which are not explicitly stored in the index are automatically computed on the fly. The following request returns all information and statistics for the fields in document 1, even though the terms haven't been explicitly stored in the index. Note that for the field text, the terms are not re-generated.

        curl -X GET "localhost:9200/my-index-000001/_termvectors/1?pretty" -H 'Content-Type: application/json' -d'
        {
          "fields" : ["text", "some_field_without_term_vectors"],
          "offsets" : true,
          "positions" : true,
          "term_statistics" : true,
          "field_statistics" : true
        }'
    

11. Artificial documents

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-artificial-doc

Term vectors can also be generated for artificial documents, that is, for documents not present in the index. For example, the following request would return the same results as in example 1. The mapping used is determined by the index.

        curl -X GET "localhost:9200/my-index-000001/_termvectors?pretty" -H 'Content-Type: application/json' -d'
        {
          "doc" : {
            "fullname" : "John Doe",
            "text" : "test test test"
          }
        }'
    

12. Per-field analyzer

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-per-field-analyzer

Additionally, a different analyzer than the one configured for the field may be provided by using the per_field_analyzer parameter. This is useful for generating term vectors in any fashion, especially when using artificial documents. When providing an analyzer for a field that already stores term vectors, the term vectors will be re-generated.

        curl -X GET "localhost:9200/my-index-000001/_termvectors?pretty" -H 'Content-Type: application/json' -d'
        {
          "doc" : {
            "fullname" : "John Doe",
            "text" : "test test test"
          },
          "fields": ["fullname"],
          "per_field_analyzer" : {
            "fullname": "keyword"
          }
        }' 
    

Response:

        {
            "_index": "my-index-000001",
            "_version": 0,
            "found": true,
            "took": 6,
            "term_vectors": {
              "fullname": {
                 "field_statistics": {
                    "sum_doc_freq": 2,
                    "doc_count": 4,
                    "sum_ttf": 4
                 },
                 "terms": {
                    "John Doe": {
                       "term_freq": 1,
                       "tokens": [
                          {
                             "position": 0,
                             "start_offset": 0,
                             "end_offset": 8
                          }
                       ]
                    }
                 }
              }
            }
          }
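The response contains the single term "John Doe" with offsets 0 to 8 because the keyword analyzer emits the whole input as one token. The contrast with whitespace tokenization can be sketched locally as follows. These two functions are simplified stand-ins written for illustration, not the Elasticsearch analyzers themselves, and their offsets are Python character indices, which coincide with UTF-16 offsets only for text like this ASCII sample.

```python
import re


def whitespace_tokens(text):
    """Lowercased tokens split on whitespace, with start and end offsets."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\S+", text)]


def keyword_tokens(text):
    """The whole input as a single, unchanged token."""
    return [(text, 0, len(text))]


print(whitespace_tokens("John Doe"))
print(keyword_tokens("John Doe"))
```

With the whitespace-style analysis the field yields two terms, while the keyword-style analysis yields one term spanning the full string.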

13. Terms filtering

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-termvectors.html#docs-termvectors-terms-filtering

Finally, the terms returned can be filtered based on their tf-idf scores. In the example below we obtain the three most "interesting" keywords from the artificial document having the given "plot" field value. Notice that the keyword "Tony" or any stop words are not part of the response, as their tf-idf must be too low.

        curl -X GET "localhost:9200/imdb/_termvectors?pretty" -H 'Content-Type: application/json' -d'
        {
          "doc": {
            "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
          },
          "term_statistics": true,
          "field_statistics": true,
          "positions": false,
          "offsets": false,
          "filter": {
            "max_num_terms": 3,
            "min_term_freq": 1,
            "min_doc_freq": 1
          }
        }'
    

Response:

        {
            "_index": "imdb",
            "_version": 0,
            "found": true,
            "term_vectors": {
               "plot": {
                  "field_statistics": {
                     "sum_doc_freq": 3384269,
                     "doc_count": 176214,
                     "sum_ttf": 3753460
                  },
                  "terms": {
                     "armored": {
                        "doc_freq": 27,
                        "ttf": 27,
                        "term_freq": 1,
                        "score": 9.74725
                     },
                     "industrialist": {
                        "doc_freq": 88,
                        "ttf": 88,
                        "term_freq": 1,
                        "score": 8.590818
                     },
                     "stark": {
                        "doc_freq": 44,
                        "ttf": 47,
                        "term_freq": 1,
                        "score": 9.272792
                     }
                  }
               }
            }
         }
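The score values in this response are consistent with a classic tf-idf formula, sqrt(term_freq) * (1 + ln((doc_count + 1) / (doc_freq + 1))). This formula is inferred from the numbers above and is an assumption, not something the documentation states. A small sketch to check it against the three returned terms:

```python
import math


def tfidf_score(term_freq, doc_freq, doc_count):
    """Classic tf-idf: sqrt(tf) * (1 + ln((N + 1) / (df + 1))).

    Assumed formula, chosen because it reproduces the sample scores.
    """
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log((doc_count + 1) / (doc_freq + 1))
    return tf * idf


doc_count = 176214  # field_statistics.doc_count from the response
scores = {
    "armored": tfidf_score(1, 27, doc_count),
    "industrialist": tfidf_score(1, 88, doc_count),
    "stark": tfidf_score(1, 44, doc_count),
}
for term, score in scores.items():
    print(term, round(score, 4))
```

Under this formula, a rarer term (lower doc_freq) receives a higher score, which is why "armored" ranks above "stark" and "industrialist".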
    
posted @ 2023-06-09 15:41  左扬