PostgreSQL Create Index Concurrently

PostgreSQL支持在线创建索引(CREATE INDEX CONCURRENTLY),不堵塞其他会话对被创建索引表的DML(INSERT,UPDATE,DELETE)操作。

PostgreSQL 提供了一个创建索引的高效特性,即“并发索引”。此功能允许我们在关系上创建索引,而不会阻塞读写设施。这并不容易管理 PostgreSQL 数据库中的数据。创建并发索引的目的可以是几个,包括其中的情况;不阻塞表制作上的书写功能是最常见的一种。

为了进一步了解其原理,我们看官网文档说明

https://www.postgresql.org/docs/devel/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY

Building Indexes Concurrently

Creating an index can interfere with regular operation of a database. Normally PostgreSQL locks the table to be indexed against writes and performs the entire index build with a single scan of the table. Other transactions can still read the table, but if they try to insert, update, or delete rows in the table they will block until the index build is finished. This could have a severe effect if the system is a live production database. Very large tables can take many hours to be indexed, and even for smaller tables, an index build can lock out writers for periods that are unacceptably long for a production system.

PostgreSQL supports building indexes without locking out writes. This method is invoked by specifying the CONCURRENTLY option of CREATE INDEX. When this option is used, PostgreSQL must perform two scans of the table, and in addition it must wait for all existing transactions that could potentially modify or use the index to terminate. Thus this method requires more total work than a standard index build and takes significantly longer to complete. However, since it allows normal operations to continue while the index is built, this method is useful for adding new indexes in a production environment. Of course, the extra CPU and I/O load imposed by the index creation might slow other operations.

In a concurrent index build, the index is actually entered as an invalid” index into the system catalogs in one transaction, then two table scans occur in two more transactions. Before each table scan, the index build must wait for existing transactions that have modified the table to terminate. After the second scan, the index build must wait for any transactions that have a snapshot (see Chapter 13) predating the second scan to terminate, including transactions used by any phase of concurrent index builds on other tables, if the indexes involved are partial or have columns that are not simple column references. Then finally the index can be marked valid” and ready for use, and the CREATE INDEX command terminates. Even then, however, the index may not be immediately usable for queries: in the worst case, it cannot be used as long as transactions exist that predate the start of the index build.

谷歌翻译大概是:

创建索引可能会干扰数据库的常规操作。通常PostgreSQL会锁定要索引的表以防止写入,并通过对表的单次扫描来执行整个索引构建。其他事务仍然可以读取该表,但如果它们尝试插入、更新或删除表中的行,它们将阻塞,直到索引构建完成。如果系统是实时生产数据库,这可能会产生严重影响。对非常大的表进行索引可能需要很多小时,即使对于较小的表,索引构建也会在生产系统无法接受的长时期内锁定写入者。

PostgreSQL支持在不锁定写入的情况下构建索引。CONCURRENTLY通过指定 的选项调用此方法CREATE INDEX。当使用这个选项时,PostgreSQL必须对表执行两次扫描,此外它必须等待所有可能修改或使用索引的现有事务终止。因此,与标准索引构建相比,此方法需要更多的总工作量,并且需要更长的时间才能完成。但是,由于它允许在构建索引时继续正常操作,因此此方法对于在生产环境中添加新索引很有用。当然,索引创建带来的额外 CPU 和 I/O 负载可能会减慢其他操作。

在并发索引构建中,索引实际上在一个事务中作为无效”索引输入到系统目录中,然后在另外两个事务中发生两次表扫描。在每次表扫描之前,索引构建必须等待已修改表的现有事务终止。在第二次扫描之后,索引构建必须等待在第二次扫描之前具有快照(参见第 13 章)的任何事务终止,包括在其他表上的并发索引构建的任何阶段使用的事务,如果所涉及的索引是部分索引或具有不是简单列引用的列。然后最后可以将索引标记为有效”并准备好使用,并且CREATE INDEX命令终止。然而,即便如此,索引也可能无法立即用于查询:在最坏的情况下,只要在索引构建开始之前存在事务,它就无法使用。

使用CREATE INDEX CONCURRENTLY创建索引,分为三个阶段,扫描两次TABLE

 

postgres=# create table t1(id int,info text);
CREATE TABLE

我们将在上表的单个列上创建一个并发索引。创建索引的命令类似于创建表。在此命令中,关键字创建索引后,会写入索引的名称。表的名称是在其上创建索引的,在括号中指定列名。PostgreSQL 中使用了几个索引,所以我们需要提到它们来指定一个特定的索引。否则,如果你没有提到任何索引,PostgreSQL 会选择默认的索引类型,“btree”:

postgres=# create index CONCURRENTLY idx_t1_c on t1(id);
CREATE INDEX
Time: 13.760 ms

postgres=# \d+ t1;
Table "public.t1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | De
scription
--------+---------+-----------+----------+---------+----------+-------------+--------------+---
----------
id | integer | | | | plain | | |
info | text | | | | extended | | |
Indexes:
"idx_t1_c" btree (id)
Access method: heap

按照前面的命令将索引应用于多个列。例如,我们想在两列 id 和info上应用索引,涉及到同一个表:

postgres=# create index CONCURRENTLY idx_t1_two on t1(id,info);
CREATE INDEX
Time: 21.186 ms

postgres=# \d+ t1;
Table "public.t1"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | De
scription
--------+---------+-----------+----------+---------+----------+-------------+--------------+---

id | integer | | | | plain | | |
info | text | | | | extended | | |
Indexes:
"idx_t1_c" btree (id)
"idx_t1_two" btree (id, info)
Access method: heap

创建唯一索引

postgres=# insert into t1 select n,n || '_andyxi' from generate_series(1,1000) n;
INSERT 0 1000
Time: 18.283 ms
postgres=# create UNIQUE INDEX CONCURRENTLY idx_unique_id on t1 using btree(id);
CREATE INDEX
Time: 20.337 ms

 

 

 

 

整个加索引流程如下:

1.start transtion 1,拿到current snapshot1.

2.扫描t1表,等待所有B表的DML(delete\update\insert)的事务结束。

3.扫描t1表,并建立idx_t1_c索引。

4.结束事务1.

5.开启事务2,拿到current snapshot2.

6.再次扫描t1表,等待所有B表的DML(delete\update\insert)的事务结束。

7.在snapshot2之后启动的事务对t1表执行的DML,会修改idx_t1_c的索引。

8.再次在扫描t1表,更新索引(从TUPLE中可以拿到版本号,在snapshot1到snapshot2之间变更的记录,将其合并到索引).

9.第8更新索引结束后,等待事务2之前开启的持有snapshot的事务结束。

10.结束索引创建,索引可见。

总结:

实际上create index CONCURRENTLY需要2次扫描,三次等待。

使用create index(不使用concurrently选项)对表建立索引时,需要对表加Share锁,即5号锁。

Share锁会阻止其它事务对表进行修改(插入、更新和删除)。

使用concurrently选项创建索引时只需要对表加4号锁,允许其它事务对表进行并发修改。

为了处理和其它写事务的并发,使用concurrently选项时,需要使用2个MVCC快照对表进行2次全表扫描,共分为3个阶段,使用3个事务。

 

 

 

posted @ 2022-10-12 10:24  青空如璃  阅读(1427)  评论(0编辑  收藏  举报