Elasticsearch In-Depth: REST APIs > Document APIs > Index API
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-index_.html#docs-index_
Adds a JSON document to the specified data stream or index and makes it searchable. If the target is an index and the document already exists, the request updates the document and increments its version.
For example, the following uses PUT twitter/_doc/1 to add a JSON document to an index:
Request:
PUT twitter/_doc/1
{ "username": "johndoe", "content": "This is my first tweet!" }
Response:
{ "_index": "twitter", "_id": "1", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "_seq_no": 0, "_primary_term": 1 }
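If the "twitter" index does not exist yet and automatic index creation is allowed, this request first creates it and then adds the JSON document to it. The document has the ID "1" and contains a "username" field and a "content" field. If no document with that ID exists, the document is created; if one already exists, it is replaced and its version is incremented automatically (sending the request above a second time would return "result": "updated" with "_version": 2).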
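To make the create-or-update behaviour concrete, here is a minimal sketch in Python. It is a toy in-memory model, not how Elasticsearch stores documents; the `index_document` helper and the `store` dictionary are names I made up for illustration.

```python
# Toy in-memory model of the index API's create-or-update behaviour.
# Assumption: a plain dict stands in for the index; only the "result"
# and "_version" fields of the real response are mirrored here.

def index_document(store, doc_id, source):
    """Create the document if the ID is new, otherwise replace it and bump the version."""
    existing = store.get(doc_id)
    version = 1 if existing is None else existing["_version"] + 1
    store[doc_id] = {"_version": version, "_source": dict(source)}
    return {
        "_id": doc_id,
        "_version": version,
        "result": "created" if existing is None else "updated",
    }


store = {}
first = index_document(store, "1", {"username": "johndoe", "content": "This is my first tweet!"})
second = index_document(store, "1", {"username": "johndoe", "content": "Edited tweet"})
print(first)
print(second)
```

The second call replaces the whole stored source rather than merging into it, which is the point of the model.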
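You cannot use the index API to send update requests for existing documents to a data stream. See Update documents in a data stream by query and Update or delete documents in a backing index.
As I understand it: documents that already exist in a data stream cannot be updated through the index API. Instead, we have to use other approaches, such as an update by query against the data stream, or an update or delete sent directly to the backing index that holds the document. The reason is that a data stream is designed for append-only data, unlike a regular index.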
1、Request
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-index_.html#docs-index-api-request
PUT /<target>/_doc/<_id>
POST /<target>/_doc/
PUT /<target>/_create/<_id>
POST /<target>/_create/<_id>
You cannot add new documents to a data stream using the PUT /<target>/_doc/<_id> request format. To specify a document ID, use the PUT /<target>/_create/<_id> format instead. See Add documents to a data stream.
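As I understand it: when adding documents to a data stream, you need to pick the right request format. PUT /<target>/_doc/<_id> cannot be used to add a new document to a data stream. To add a document with an ID of your own, use PUT /<target>/_create/<_id>, which only ever creates a document and fails if the ID already exists. If you do not need to choose the ID, POST /<target>/_doc/ lets Elasticsearch generate one.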
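The choice between the request formats can be sketched as a small helper. This is only an illustration: `index_request_line` is a hypothetical function, and it assumes the caller already knows whether the target is a data stream.

```python
# Hypothetical helper that picks one of the request formats listed above.
# Assumption: the caller knows whether the target is a data stream; nothing
# here talks to a cluster.

def index_request_line(target, doc_id=None, is_data_stream=False):
    """Return the HTTP method and path to use for adding a document."""
    if doc_id is None:
        # No ID supplied: Elasticsearch generates one.
        return f"POST /{target}/_doc/"
    if is_data_stream:
        # Data streams do not accept PUT /<target>/_doc/<_id> for new documents.
        return f"PUT /{target}/_create/{doc_id}"
    return f"PUT /{target}/_doc/{doc_id}"


print(index_request_line("my-index-000001", "1"))
print(index_request_line("my-data-stream", "1", is_data_stream=True))
print(index_request_line("my-data-stream", is_data_stream=True))
```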
2、Prerequisites
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-index_.html#docs-index-api-request
If the Elasticsearch security features are enabled, you must have the following index privileges for the target data stream, index, or index alias:
1、To add or overwrite a document using the PUT /<target>/_doc/<_id> request format, you must have the create, index, or write index privilege.
2、To add a document using the POST /<target>/_doc/, PUT /<target>/_create/<_id>, or POST /<target>/_create/<_id> request formats, you must have the create_doc, create, index, or write index privilege.
3、To automatically create a data stream or index with an index API request, you must have the auto_configure, create_index, or manage index privilege.
4、Automatic data stream creation requires a matching index template with data stream enabled. See Set up a data stream (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/set-up-a-data-stream.html).
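As I understand it: when an index API request targets a name that does not exist yet, Elasticsearch can create the data stream automatically. For this to work, there has to be an index template that matches the target name and has data streams enabled. The template tells Elasticsearch how to create the data stream and its backing indices. See the Set up a data stream page in the official Elasticsearch documentation for how to set up such a template. In short, without a matching template that enables data streams, no data stream is created automatically.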
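The first two privilege rules above can be written down as a small check. This is a sketch of the rules as stated in the text only: `may_index` is a hypothetical helper, and it ignores roles, privilege inheritance, and everything else the real security layer does.

```python
# Sketch of privilege rules 1 and 2 above. Hypothetical helper; it only
# encodes the privilege names listed in the text.

OVERWRITE_PRIVILEGES = {"create", "index", "write"}
CREATE_ONLY_PRIVILEGES = {"create_doc", "create", "index", "write"}


def may_index(method, resource, has_id, granted):
    """Return True if any granted privilege allows this request format.

    resource is "_doc" or "_create"; granted is an iterable of privilege names.
    """
    if method == "PUT" and resource == "_doc" and has_id:
        required = OVERWRITE_PRIVILEGES      # PUT /<target>/_doc/<_id>
    else:
        required = CREATE_ONLY_PRIVILEGES    # POST _doc/, PUT or POST _create/<_id>
    return bool(required & set(granted))


print(may_index("PUT", "_doc", True, {"create_doc"}))
print(may_index("PUT", "_create", True, {"create_doc"}))
```

The practical takeaway is that create_doc alone is enough for the create-only formats but not for PUT /<target>/_doc/<_id>, which can overwrite.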
3、Automatically create data streams and indices
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-index_.html#index-creation
If the request’s target doesn’t exist and matches an index template with a data_stream definition, the index operation automatically creates the data stream. See Set up a data stream.
If the target doesn’t exist and doesn’t match a data stream template, the operation automatically creates the index and applies any matching index templates.
If no mapping exists, the index operation creates a dynamic mapping. By default, new fields and objects are automatically added to the mapping if needed. For more information about field mapping, see mapping (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/mapping.html) and the update mapping API (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/indices-put-mapping.html).
Automatic index creation is controlled by the action.auto_create_index setting. This setting defaults to true, which allows any index to be created automatically. You can modify this setting to explicitly allow or block automatic creation of indices that match specified patterns, or set it to false to disable automatic index creation entirely. Specify a comma-separated list of patterns you want to allow, or prefix each pattern with + or - to indicate whether it should be allowed or blocked. When a list is specified, the default behaviour is to disallow.
- Allow auto-creation of indices called my-index-000001 or index10, block the creation of indices that match the pattern index1*, and allow creation of any other indices that match the ind* pattern. Patterns are matched in the order specified.
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d' { "persistent": { "action.auto_create_index": "my-index-000001,index10,-index1*,+ind*" } }'
- Disable automatic index creation entirely.
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d' { "persistent": { "action.auto_create_index": "false" } }'
- Allow automatic creation of any index. This is the default.
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d' { "persistent": { "action.auto_create_index": "true" } }'
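The matching rules for action.auto_create_index described above (patterns checked in order, + and - prefixes, disallow by default once a list is given) can be sketched as follows. This is my own simplified model, not the Elasticsearch implementation; `auto_create_allowed` is a hypothetical name, and `*` is the only wildcard it handles.

```python
import re


def _wildcard_to_regex(pattern):
    # Treat "*" as "any run of characters"; everything else is matched literally.
    return re.compile(".*".join(re.escape(part) for part in pattern.split("*")))


def auto_create_allowed(setting, index_name):
    """Decide whether index_name may be auto-created under the given setting value."""
    value = setting.strip()
    if value == "true":
        return True
    if value == "false":
        return False
    for raw in value.split(","):
        entry = raw.strip()
        if not entry:
            continue
        allow = not entry.startswith("-")          # "-" blocks, "+" or no prefix allows
        pattern = entry[1:] if entry[0] in "+-" else entry
        if _wildcard_to_regex(pattern).fullmatch(index_name):
            return allow                           # first matching pattern wins
    return False                                   # list given, nothing matched: disallow


setting = "my-index-000001,index10,-index1*,+ind*"
for name in ["my-index-000001", "index10", "index11", "indigo", "logs"]:
    print(name, auto_create_allowed(setting, name))
```

Note how index10 is allowed even though it also matches -index1*: it is matched by the earlier, more specific entry first.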
You can force a create operation by using the _create resource or setting the op_type parameter to create. In this case, the index operation fails if a document with the specified ID already exists in the index.
As I understand it: using the _create resource or op_type=create guarantees that creating a document never overwrites an existing one. If a document with the same ID already exists, Elasticsearch rejects the request instead of replacing the existing document. This is useful for protecting data uniqueness and integrity.
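A forced create can be modelled with a few lines of Python. Again this is a toy model under my own assumptions: `create_document` and `DocumentExistsError` are hypothetical names standing in for the real request and the conflict error Elasticsearch returns.

```python
# Toy model of a forced create (_create resource or op_type=create).
# Assumption: a dict stands in for the index, and DocumentExistsError is a
# made-up stand-in for the conflict error returned by Elasticsearch.

class DocumentExistsError(Exception):
    pass


def create_document(store, doc_id, source):
    """Add the document only if the ID is unused; never overwrite."""
    if doc_id in store:
        raise DocumentExistsError(f"document {doc_id!r} already exists")
    store[doc_id] = dict(source)
    return {"_id": doc_id, "_version": 1, "result": "created"}


store = {}
print(create_document(store, "1", {"message": "first"}))
try:
    create_document(store, "1", {"message": "second"})
except DocumentExistsError as err:
    print("rejected:", err)
print(store["1"])
```

The stored document is untouched after the rejected second call, which is the guarantee a forced create gives you.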
4、Create document IDs automatically
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-index_.html#create-document-ids-automatically
When using the POST /<target>/_doc/ request format, the op_type is automatically set to create and the index operation generates a unique ID for the document.
curl -X POST "localhost:9200/my-index-000001/_doc/?pretty" -H 'Content-Type: application/json' -d' { "@timestamp": "2099-11-15T13:12:00", "message": "GET /search HTTP/1.1 200 1070000", "user": { "id": "kimchy" } }'
The API returns the following result:
{ "_shards": { "total": 2, "failed": 0, "successful": 2 }, "_index": "my-index-000001", "_id": "W0tpsmIBdwcYyG50zbta", "_version": 1, "_seq_no": 0, "_primary_term": 1, "result": "created" }
5、Routing
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-index_.html#index-routing
By default, shard placement (or routing) is controlled by using a hash of the document’s id value. For more explicit control, the value fed into the hash function used by the router can be directly specified on a per-operation basis using the routing parameter.
As I understand it: by default Elasticsearch hashes the document ID to decide which shard a document goes to, which spreads documents across the shards. If you need more explicit control, you can pass the routing parameter on each operation to supply the value that is hashed instead.
For example:
curl -X POST "localhost:9200/my-index-000001/_doc?routing=kimchy&pretty" -H 'Content-Type: application/json' -d' { "@timestamp": "2099-11-15T13:12:00", "message": "GET /search HTTP/1.1 200 1070000", "user": { "id": "kimchy" } }'
In this example, the document is routed to a shard based on the routing parameter provided: "kimchy".
When setting up explicit mapping, you can also use the _routing field to direct the index operation to extract the routing value from the document itself. This does come at the (very minimal) cost of an additional document parsing pass. If the _routing mapping is defined and set to be required, the index operation will fail if no routing value is provided or extracted.
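The idea of hashing a routing value to pick a shard can be sketched like this. It is deliberately simplified and rests on assumptions: MD5 is only a stand-in hash (Elasticsearch uses its own hash function and a more involved formula), and `pick_shard` and `effective_routing` are hypothetical names.

```python
import hashlib


def effective_routing(doc_id, routing=None):
    """The routing value defaults to the document ID unless one is given explicitly."""
    return routing if routing is not None else doc_id


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value to a shard number with a stand-in hash.

    Assumption: MD5 is used purely for illustration, so the shard numbers
    will not match what a real cluster computes.
    """
    digest = hashlib.md5(routing_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % number_of_primary_shards


# Two documents with different IDs but the same routing value land on the same shard.
shard_a = pick_shard(effective_routing("doc-1", routing="kimchy"), 5)
shard_b = pick_shard(effective_routing("doc-2", routing="kimchy"), 5)
print(shard_a == shard_b)
```

This is why custom routing is handy for keeping related documents (for example, everything belonging to one user) on a single shard.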
6、Distributed
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-index_.html#index-distributed
The index operation is directed to the primary shard based on its route (see the Routing section above) and performed on the actual node containing this shard. After the primary shard completes the operation, if needed, the update is distributed to applicable replicas.
7、Active shards
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-index_.html#index-wait-for-active-shards
To improve the resiliency of writes to the system, indexing operations can be configured to wait for a certain number of active shard copies before proceeding with the operation. If the requisite number of active shard copies are not available, then the write operation must wait and retry, until either the requisite shard copies have started or a timeout occurs. By default, write operations only wait for the primary shards to be active before proceeding (i.e. wait_for_active_shards=1). This default can be overridden in the index settings dynamically by setting index.write.wait_for_active_shards. To alter this behavior per operation, the wait_for_active_shards request parameter can be used.
Valid values are all or any positive integer up to the total number of configured copies per shard in the index (which is number_of_replicas+1). Specifying a negative value or a number greater than the number of shard copies will throw an error.
For example, suppose we have a cluster of three nodes, A, B, and C, and we create an index named index with the number of replicas set to 3 (resulting in 4 shard copies, one more copy than there are nodes). If we attempt an indexing operation, by default the operation will only ensure the primary copy of each shard is available before proceeding. This means that even if B and C went down, and A hosted the primary shard copies, the indexing operation would still proceed with only one copy of the data. If wait_for_active_shards is set on the request to 3 (and all 3 nodes are up), then the indexing operation will require 3 active shard copies before proceeding, a requirement which should be met because there are 3 active nodes in the cluster, each one holding a copy of the shard. However, if we set wait_for_active_shards to all (or to 4, which is the same), the indexing operation will not proceed as we do not have all 4 copies of each shard active in the index. The operation will time out unless a new node is brought up in the cluster to host the fourth copy of the shard.
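The three-node example above can be checked with a small model. This is a sketch of the rule as the text states it; `required_copies` and `can_proceed` are hypothetical names, and real clusters evaluate this per shard with waiting and retries that are left out here.

```python
def required_copies(wait_for_active_shards, number_of_replicas):
    """Translate a wait_for_active_shards value into a number of shard copies."""
    total_copies = number_of_replicas + 1          # primary plus replicas
    if wait_for_active_shards == "all":
        return total_copies
    required = int(wait_for_active_shards)
    if required < 0 or required > total_copies:
        raise ValueError(
            f"wait_for_active_shards={required} is outside 0..{total_copies}"
        )
    return required


def can_proceed(wait_for_active_shards, number_of_replicas, active_copies):
    """True if enough shard copies are active for the write to start."""
    return active_copies >= required_copies(wait_for_active_shards, number_of_replicas)


# Three nodes, number_of_replicas=3: at most 3 of the 4 copies can be active.
print(can_proceed(1, 3, active_copies=1))      # default: primary only
print(can_proceed(3, 3, active_copies=3))
print(can_proceed("all", 3, active_copies=3))
```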
It is important to note that this setting greatly reduces the chances of the write operation not writing to the requisite number of shard copies, but it does not completely eliminate the possibility, because this check occurs before the write operation commences. Once the write operation is underway, it is still possible for replication to fail on any number of shard copies but still succeed on the primary. The _shards section of the write operation’s response reveals the number of shard copies on which replication succeeded/failed.
{ "_shards": { "total": 2, "failed": 0, "successful": 2 } }
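Since the check happens before the write starts, it is worth inspecting _shards in the response afterwards. Here is a minimal sketch of doing that; `replication_summary` is a hypothetical helper, and the second response body is made up for illustration.

```python
import json


def replication_summary(response_body):
    """Summarise the _shards section of a write response (given as a JSON string)."""
    shards = json.loads(response_body)["_shards"]
    return {
        "fully_replicated": shards["successful"] == shards["total"],
        "failed": shards["failed"],
        # Copies that neither succeeded nor failed, e.g. replicas that are not assigned.
        "not_written": shards["total"] - shards["successful"] - shards["failed"],
    }


healthy = replication_summary('{ "_shards": { "total": 2, "failed": 0, "successful": 2 } }')
# Illustrative body: the primary succeeded but the one replica was not written.
degraded = replication_summary('{ "_shards": { "total": 2, "failed": 0, "successful": 1 } }')
print(healthy)
print(degraded)
```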
8、Refresh
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-index_.html#index-refresh
Control when the changes made by this request are visible to search. See refresh (https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-refresh.html).
9、Noop update
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-index_.html#index-noop
When updating a document using the index API, a new version of the document is always created even if the document hasn’t changed. If this isn’t acceptable, use the _update API with detect_noop set to true. With detect_noop enabled, the update API compares the new content with the existing document before updating and skips the write if no field value has changed. This option isn’t available on the index API because the index API doesn’t fetch the old source and isn’t able to compare it against the new source.
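The difference between the two APIs can be shown with a toy model. It rests on my own simplifications: a dict stands in for the index, the partial update is merged shallowly, and `index_document` and `update_document` are hypothetical names.

```python
# Toy comparison of index-style writes and update-with-detect_noop.
# Assumptions: in-memory dict as the index, shallow merge of partial documents.

def index_document(store, doc_id, source):
    """Index API model: never looks at the old source, always writes a new version."""
    version = store[doc_id]["_version"] + 1 if doc_id in store else 1
    store[doc_id] = {"_version": version, "_source": dict(source)}
    return {"_version": version, "result": "created" if version == 1 else "updated"}


def update_document(store, doc_id, partial, detect_noop=True):
    """Update API model: fetches the old source and can skip unchanged writes."""
    current = store[doc_id]
    merged = {**current["_source"], **partial}
    if detect_noop and merged == current["_source"]:
        return {"_version": current["_version"], "result": "noop"}
    store[doc_id] = {"_version": current["_version"] + 1, "_source": merged}
    return {"_version": current["_version"] + 1, "result": "updated"}


store = {}
doc = {"user": "kimchy", "message": "hello"}
index_document(store, "1", doc)
same_via_index = index_document(store, "1", doc)          # identical content
same_via_update = update_document(store, "1", {"message": "hello"})
changed = update_document(store, "1", {"message": "bye"})
print(same_via_index, same_via_update, changed)
```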
10、Timeout
https://www.elastic.co/guide/en/elasticsearch/reference/8.8/docs-index_.html#timeout
The primary shard assigned to perform the index operation might not be available when the index operation is executed. Some reasons for this might be that the primary shard is currently recovering from a gateway or undergoing relocation. By default, the index operation will wait on the primary shard to become available for up to 1 minute before failing and responding with an error. The timeout parameter can be used to explicitly specify how long it waits. Here is an example of setting it to 5 minutes:
curl -X PUT "localhost:9200/my-index-000001/_doc/1?timeout=5m&pretty" -H 'Content-Type: application/json' -d' { "@timestamp": "2099-11-15T13:12:00", "message": "GET /search HTTP/1.1 200 1070000", "user": { "id": "kimchy" } } '
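Values such as 5m use Elasticsearch's time unit suffixes. If you build these values in client code, a small parser like the one below can help sanity-check them. It is a simplified sketch: `parse_time_value` is a hypothetical helper that only accepts whole numbers followed by one of the documented units.

```python
import re

# Seconds per unit for the Elasticsearch time unit suffixes.
_UNIT_SECONDS = {
    "d": 86400.0,
    "h": 3600.0,
    "m": 60.0,
    "s": 1.0,
    "ms": 1e-3,
    "micros": 1e-6,
    "nanos": 1e-9,
}


def parse_time_value(value):
    """Convert a value such as "5m" or "30s" to seconds (simplified parser)."""
    match = re.fullmatch(r"(\d+)(d|h|m|s|ms|micros|nanos)", value)
    if match is None:
        raise ValueError(f"not a valid time value: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


print(parse_time_value("1m"))   # the default wait described above
print(parse_time_value("5m"))   # the value used in the example request
```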