Elasticsearch Deep Dive: REST APIs / Document APIs / GET API

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#docs-get

Retrieves the specified JSON document from an index.

curl -X GET "localhost:9200/my-index-000001/_doc/0?pretty"

1、Request

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#docs-get-api-request

            GET <index>/_doc/<_id>
            HEAD <index>/_doc/<_id>
            GET <index>/_source/<_id>
            HEAD <index>/_source/<_id>
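
The four request forms differ only in the HTTP method and in whether the path uses _doc or _source. As a rough sketch, the helper below assembles such URLs with Python's standard library. The name build_get_url and the default base address http://localhost:9200 are assumptions made for illustration; they are not part of Elasticsearch.

```python
from urllib.parse import quote, urlencode


def build_get_url(index, doc_id, source_only=False, base="http://localhost:9200", **params):
    """Build the URL for <index>/_doc/<_id> or <index>/_source/<_id> (hypothetical helper)."""
    resource = "_source" if source_only else "_doc"
    url = f"{base}/{quote(index, safe='')}/{resource}/{quote(str(doc_id), safe='')}"
    if params:
        # Booleans are sent as lowercase "true"/"false" in the query string.
        pairs = {k: str(v).lower() if isinstance(v, bool) else v for k, v in params.items()}
        url += "?" + urlencode(pairs)
    return url


print(build_get_url("my-index-000001", 0, pretty=True))
print(build_get_url("my-index-000001", 1, source_only=True))
```

The same URL can be sent with either GET or HEAD, depending on whether you want the document back or only a status code.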

2、Prerequisites

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#docs-get-api-prereqs

If the Elasticsearch security features are enabled, you must have the read index privilege (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/security-privileges.html#privileges-list-indices) for the target index or index alias.

3、Description

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#docs-get-api-desc

You use GET to retrieve a document and its source or stored fields from a particular index. Use HEAD to verify that a document exists. You can use the _source resource to retrieve just the document source or to verify that it exists.

            HEAD twitter/_doc/0

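My understanding: a "HEAD twitter/_doc/0" request returns only an HTTP status code, with no response body. The status is 200 if the document exists in the "twitter" index and 404 if it does not. HEAD only verifies existence; it never returns the document source or stored fields. To fetch the source itself, use GET, for example "GET twitter/_source/0".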
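
The existence check can be sketched in Python using only the standard library. The function names head_request and document_exists and the base address are assumptions for illustration. Only the request object is built when the snippet runs; document_exists would need a reachable cluster.

```python
import urllib.error
import urllib.request


def head_request(index, doc_id, base="http://localhost:9200"):
    """Prepare (but do not send) a HEAD request for <index>/_doc/<_id>."""
    return urllib.request.Request(f"{base}/{index}/_doc/{doc_id}", method="HEAD")


def document_exists(index, doc_id, base="http://localhost:9200"):
    """Send the HEAD request: 200 means the document exists, 404 means it does not."""
    try:
        with urllib.request.urlopen(head_request(index, doc_id, base)) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # any other status (401, 500, ...) is a real error, not "missing"


req = head_request("twitter", 0)
print(req.get_method(), req.full_url)
```

Treating only 404 as "missing" keeps authentication or server errors from being mistaken for an absent document.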

4、Realtime

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#realtime

By default, the get API is realtime and is not affected by the refresh rate of the index (which controls when data becomes visible to search). If stored fields are requested (see the stored_fields parameter) and the document has been updated but not yet refreshed, the get API has to parse and analyze the source to extract the stored fields. To disable realtime GET, set the realtime parameter to false.
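
The difference between a realtime get and one that waits for a refresh can be pictured with a toy model. This is only a conceptual sketch: ToyShard and its two dictionaries are invented for illustration and do not reflect how Elasticsearch stores data internally.

```python
class ToyShard:
    """Conceptual model only: 'pending' holds writes not yet refreshed,
    'searchable' holds what a refresh has already exposed to search."""

    def __init__(self):
        self.pending = {}
        self.searchable = {}

    def index(self, doc_id, source):
        self.pending[doc_id] = source

    def refresh(self):
        self.searchable.update(self.pending)
        self.pending.clear()

    def get(self, doc_id, realtime=True):
        if realtime and doc_id in self.pending:
            return self.pending[doc_id]      # sees the write before any refresh
        return self.searchable.get(doc_id)   # realtime=False depends on a refresh


shard = ToyShard()
shard.index("0", {"message": "hello"})
print(shard.get("0"))                  # realtime: the new document is visible
print(shard.get("0", realtime=False))  # not refreshed yet, so nothing is found
shard.refresh()
print(shard.get("0", realtime=False))  # visible after the refresh
```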

5、Source filtering

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#get-source-filtering

By default, the get operation returns the contents of the _source field unless you have used the stored_fields parameter or the _source field is disabled. You can turn off _source retrieval by using the _source parameter:

        curl -X GET "localhost:9200/my-index-000001/_doc/0?_source=false&pretty"

If you only need one or two fields from the _source, use the _source_includes or _source_excludes parameters to include or filter out particular fields. This can be especially helpful with large documents where partial retrieval can save on network overhead. Both parameters take a comma-separated list of fields or wildcard expressions.

My understanding: for example, if you only need the "title" and "author" fields of each document, the same kind of filtering is available in a search request body:

            GET /my_index/_search
            {
              "_source": ["title", "author"],
              "query": {
                "match": {
                  "title": "Elasticsearch"
                }
              }
            }

Alternatively, you can use the _source_includes parameter to include the fields you need:

            GET /my_index/_search?_source_includes=title,author
            {
              "query": {
                "match": {
                  "title": "Elasticsearch"
                }
              }
            }

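Only fields that actually exist in a document's _source are returned; a requested field that a document lacks is simply absent from its result. To exclude fields instead, use the _source_excludes parameter. The same parameters work on the get API itself, for example GET /my_index/_doc/1?_source_includes=title,author.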
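
To get a feel for how include and exclude patterns interact, here is a simplified Python approximation that filters a nested dictionary by dotted field paths. The function filter_source is hypothetical, and its matching rules are an assumption that only roughly mirrors Elasticsearch: lists are treated as single values, and field names that contain dots are not handled.

```python
from fnmatch import fnmatchcase


def _flatten(obj, prefix=""):
    """Yield (dotted.path, value) pairs for every leaf; lists count as leaves."""
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict) and value:
            yield from _flatten(value, path + ".")
        else:
            yield path, value


def _matches(path, patterns):
    # A pattern matches the field itself or any sub-field below it.
    return any(fnmatchcase(path, p) or fnmatchcase(path, p + ".*") for p in patterns)


def filter_source(source, includes=(), excludes=()):
    """Approximate _source_includes / _source_excludes on a nested dict."""
    result = {}
    for path, value in _flatten(source):
        if includes and not _matches(path, includes):
            continue
        if excludes and _matches(path, excludes):
            continue
        node = result
        *parents, leaf = path.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


doc = {"title": "Elasticsearch", "author": "kim", "body": "...", "user": {"id": 7, "name": "kim"}}
print(filter_source(doc, includes=["title", "author"]))
print(filter_source(doc, includes=["*.id"], excludes=["body"]))
```

In this sketch, excludes are applied after includes, so a field matched by both is dropped.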

6、Routing

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#get-routing

If routing is used during indexing, the routing value also needs to be specified to retrieve a document. For example:

curl -X GET "localhost:9200/my-index-000001/_doc/2?routing=user1&pretty"

This request gets the document with id 2, but it is routed based on the user. The document is not fetched if the correct routing is not specified.

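My understanding: this is an example of fetching a specific document with a routing value. In Elasticsearch, the routing value determines which shard a document is stored on. By default it is the document ID, but it can be any string, such as a user name taken from the document. In this example the ?routing=user1 parameter tells Elasticsearch to hash user1 and send the get only to the shard that holds that user's documents. If the routing value supplied at read time differs from the one used at index time, the request may be sent to a different shard and the document will not be found. Custom routing is useful when large volumes of data are always queried by the same key, because requests touch a single shard instead of fanning out to all of them.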
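
The effect of a missing or wrong routing value can be shown with a small simulation. Everything here is a toy under stated assumptions: ToyIndex is invented, and shard_for uses a sum-of-bytes hash purely so the result is easy to verify by hand, whereas Elasticsearch uses a Murmur3-based hash and a slightly more involved formula.

```python
def shard_for(routing, num_shards):
    """Toy stand-in hash (sum of bytes); Elasticsearch uses Murmur3 internally."""
    return sum(routing.encode("utf-8")) % num_shards


class ToyIndex:
    def __init__(self, num_shards=5):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, doc_id, source, routing=None):
        # Without an explicit routing value, the document ID is used.
        shard = shard_for(routing or doc_id, len(self.shards))
        self.shards[shard][doc_id] = source

    def get(self, doc_id, routing=None):
        # Only the one shard selected by the routing value is consulted.
        shard = shard_for(routing or doc_id, len(self.shards))
        return self.shards[shard].get(doc_id)


index = ToyIndex()
index.put("2", {"counter": 1}, routing="user1")
print(index.get("2", routing="user1"))  # same routing, same shard: found
print(index.get("2"))                   # routed by "2" instead: a different shard here
```

With this toy hash, "user1" maps to shard 1 and "2" maps to shard 0, so the second lookup finds nothing even though the document exists.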

7、Preference

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#preference

Controls a preference of which shard replicas to execute the get request on. By default, the operation is randomized between the shard replicas.

The preference can be set to:

_local: The operation will prefer to be executed on a locally allocated shard if possible.

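My understanding: suppose you have an Elasticsearch cluster whose shard copies are spread across several nodes. With preference=_local, a get request prefers a copy located on the node that receives the request, which avoids an extra network hop and can reduce latency. For example:

GET /my_index/_doc/1?preference=_local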

Custom (string) value: A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with "jumping values" when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.

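My understanding: each document lives on exactly one shard, chosen by hashing its routing value (the document ID by default), and that shard has a primary copy plus zero or more replica copies. By default a get picks randomly among those copies. When copies are in different refresh states, repeated reads of the same document can flip between an older and a newer version, which is the "jumping values" problem. Passing the same custom preference string on every request makes Elasticsearch pick the same copy each time, so a given client sees a consistent view. Note that this is different from routing: routing decides which shard stores a document, while preference only decides which copy of that shard answers the read.

For example, an application that displays log entries could pass the user name or session ID as the preference value, so that one user's repeated reads are always served by the same copies. If the goal is instead to keep all of a user's log documents together on one shard, so that lookups do not have to touch every shard, that is achieved by indexing the documents with the user name as the routing value.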
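
How a custom preference string pins reads to one copy can be sketched as follows. The function pick_copy and the copy names are assumptions for illustration, and the CRC32 hash is only a stand-in for whatever Elasticsearch does internally.

```python
import random
import zlib


def pick_copy(copies, preference=None):
    """Choose which copy of a shard serves a get (illustrative model only)."""
    if preference is None:
        return random.choice(copies)  # default: randomized between copies
    # A stable hash makes the same preference string select the same copy.
    return copies[zlib.crc32(preference.encode("utf-8")) % len(copies)]


copies = ["primary", "replica-1", "replica-2"]
print({pick_copy(copies, "session-42") for _ in range(10)})  # always a single copy
print({pick_copy(copies) for _ in range(10)})                # typically several copies
```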

8、Refresh

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#get-refresh

The refresh parameter can be set to true in order to refresh the relevant shard before the get operation and make it searchable. Setting it to true should be done only after careful thought and verification that this does not cause a heavy load on the system (and slow down indexing).

9、Distributed

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#get-distributed

The get operation gets hashed into a specific shard id. It then gets redirected to one of the replicas within that shard id and returns the result. The replicas are the primary shard and its replicas within that shard id group. This means that the more replicas we have, the better GET scaling we will have.

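My understanding: a get request is not tied to the primary shard. By default it can be served by any copy in the shard's replication group, primary or replica, so adding replicas spreads the read load across more nodes. There is no need to opt in to reading from replicas, and the _replica preference value found in older versions is not among the values documented for 8.8 (_local or a custom string). To influence which copy answers, use one of those values, for example:

 GET /my_index/_doc/123?preference=_local

Note that with realtime set to false, copies in different refresh states may briefly return different versions of a document. If that matters, keep the default realtime behavior or pass a consistent custom preference value.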

10、Versioning support

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#get-versioning

You can use the version parameter to retrieve the document only if its current version is equal to the specified one.

Internally, when a document is updated, Elasticsearch marks the old document as deleted and adds an entirely new document. The old version of the document doesn't disappear immediately, although you won't be able to access it. Elasticsearch cleans up deleted documents in the background as you continue to index more data.
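
The version check can be modelled in a few lines. ToyVersionedStore and VersionConflict are invented names; the exception stands in for the HTTP 409 conflict that, as far as I know, Elasticsearch returns when the requested version does not match the current one.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 conflict returned on a version mismatch."""


class ToyVersionedStore:
    def __init__(self):
        self.docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source):
        # Every write replaces the document and bumps its version by one.
        version = self.docs.get(doc_id, (0, None))[0] + 1
        self.docs[doc_id] = (version, source)
        return version

    def get(self, doc_id, version=None):
        current, source = self.docs[doc_id]
        if version is not None and version != current:
            raise VersionConflict(f"current version [{current}] differs from [{version}]")
        return source


store = ToyVersionedStore()
store.index("1", {"counter": 1})
store.index("1", {"counter": 2})
print(store.get("1", version=2))  # matches the current version
try:
    store.get("1", version=1)     # the old version is no longer accessible
except VersionConflict as err:
    print("conflict:", err)
```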

11、Get the source field only

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#_source

Use the <index>/_source/<_id> resource to get just the _source field of a document. For example:

curl -X GET "localhost:9200/my-index-000001/_source/1?pretty"

You can use the source filtering parameters to control which parts of the _source are returned:

curl -X GET "localhost:9200/my-index-000001/_source/1/?_source_includes=*.id&_source_excludes=entities&pretty"

You can use HEAD with the _source endpoint to efficiently test whether or not the document _source exists. A document's source is not available if it is disabled in the mapping.

curl -I "localhost:9200/my-index-000001/_source/1?pretty"

12、Get stored fields

https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-get.html#get-stored-fields

Use the stored_fields parameter to specify the set of stored fields you want to retrieve. Any requested fields that are not stored are ignored.

My understanding: there are two common ways to get field values back from a document. First, you can rely on the _source field, which holds the complete original JSON. Second, you can mark selected fields with "store": true in the mapping and retrieve them with the stored_fields parameter. "Any requested fields that are not stored are ignored" means that if a field named in stored_fields was not stored, the response simply omits it rather than failing.

Consider for instance the following mapping:

curl -X PUT "localhost:9200/my-index-000001?pretty" -H 'Content-Type: application/json' -d'
        {
           "mappings": {
               "properties": {
                  "counter": {
                     "type": "integer",
                     "store": false
                  },
                  "tags": {
                     "type": "keyword",
                     "store": true
                  }
               }
           }
        }'

Now we can add a document:

curl -X PUT "localhost:9200/my-index-000001/_doc/1?pretty" -H 'Content-Type: application/json' -d'
        {
          "counter": 1,
          "tags": [ "production" ]
        }'

And then try to retrieve it:

curl -X GET "localhost:9200/my-index-000001/_doc/1?stored_fields=tags,counter&pretty"

The API returns the following result:

{
   "_index": "my-index-000001",
   "_id": "1",
   "_version": 1,
   "_seq_no": 22,
   "_primary_term": 1,
   "found": true,
   "fields": {
      "tags": [
         "production"
      ]
   }
}

Field values fetched from the document itself are always returned as an array. Since the counter field is not stored, the get request ignores it.
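
The result above can be reproduced with a small model of the selection rule. The function get_stored_fields is hypothetical; it only mimics the two behaviors described here (fields that are not stored are skipped, and values come back as arrays), reusing the mapping and document from this section.

```python
def get_stored_fields(mapping, source, requested):
    """Return requested fields that the mapping marks as stored, as arrays."""
    properties = mapping["mappings"]["properties"]
    fields = {}
    for name in requested:
        if not properties.get(name, {}).get("store", False):
            continue  # not stored (or unmapped): silently ignored
        if name not in source:
            continue  # nothing was indexed for this field
        value = source[name]
        fields[name] = value if isinstance(value, list) else [value]
    return fields


mapping = {
    "mappings": {
        "properties": {
            "counter": {"type": "integer", "store": False},
            "tags": {"type": "keyword", "store": True},
        }
    }
}
doc = {"counter": 1, "tags": ["production"]}
print(get_stored_fields(mapping, doc, ["tags", "counter"]))  # {'tags': ['production']}
```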

You can also retrieve metadata fields like the _routing field:

curl -X PUT "localhost:9200/my-index-000001/_doc/2?routing=user1&pretty" -H 'Content-Type: application/json' -d'
        {
          "counter" : 1,
          "tags" : ["env2"]
        }'

curl -X GET "localhost:9200/my-index-000001/_doc/2?routing=user1&stored_fields=tags,counter&pretty"