再谈PG索引-存储架构
1.索引的基本架构
PG的索引是B+树,B+树是为磁盘或其他直接存取辅助设备而设计的一种平衡查找树,在B+树中,所有记录节点都是按键值的大小顺序存放在同一层的叶节点中,各叶节点指针进行连接:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | meta page | root page(8kb,一个记录占32个bit,那么就能存256个branch page,超过了就需要扩充一级branch page来存储leaf page) | branch page … | | | branch page branch page branch page … | | | | | | | leaf page leaf page leaf page leaf page leaf page leaf page leaf page … | ———————------------- | key | | (block,offset) | 一个leaf page存放多个索引值 ———————------------- |
其中meta page和root page是必须有的,meta page需要一个页来存储,表示指向root page的page id。
随着记录数的增加,一个root page可能存不下所有的heap item,就会有branch page,甚至多层的branch page。
leaf page存储具体的key和value。
一共有几层branch,就用btree page元数据的 level 来表示,如果level为0,则表示没有branch层,root page直接指向leaf page,最多记录256条记录(假如条指针占32bit);level为1, 则表示有一层branch page,则root page存放branch page的指针,branch page指向leaf page,最多记录256*256条。依次类推,且bock size是可以设置的,当设置的更大,则一个级别存储的数据就更多。
2.看看具体的结构
PostgreSQL B-Tree是一种变种(high-concurrency B-tree management algorithm),算法详情请参考 src/backend/access/nbtree/README。我们可以使用pageinspect插件,内窥B-Tree的结构:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 | apple=# create extension pageinspect; CREATE EXTENSION apple=# \dx List of installed extensions Name | Version | Schema | Description -------------+---------+------------+------------------------------------------------------- pageinspect | 1.7 | public | inspect the contents of database pages at a low level plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language (2 rows) apple=# create table test(id int primary key, info text); CREATE TABLE apple=# insert into test select generate_series(1, 1000), md5(random()::text); INSERT 0 1000 apple=# vacuum ANALYZE test; VACUUM apple=# \d test Table "public.test" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- id | integer | | not null | info | text | | | Indexes: "test_pkey" PRIMARY KEY, btree (id) apple=# select * from bt_metap( 'test_pkey' ); magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples --------+---------+------+-------+----------+-----------+-------------+------------------------- 340322 | 3 | 3 | 1 | 3 | 1 | 0 | 1000 (1 row) apple=# select * from bt_page_stats( 'test_pkey' ,1); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' ,2); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 2 | l | 367 | 0 | 16 | 8192 | 808 | 1 | 4 | 0 | 1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' ,3); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 3 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 1 | 2 (1 row) apple=# select * from bt_page_stats( 'test_pkey' ,0); 2019-05-29 11:10:47.567 CST [49885] ERROR: block 0 is a meta page 2019-05-29 11:10:47.567 CST [49885] STATEMENT: select * from bt_page_stats( 'test_pkey' ,0); ERROR: block 0 is a meta page apple=# select * from bt_page_stats( 'test_pkey' ,4); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 4 | l | 268 | 0 | 16 | 8192 | 2788 | 2 | 0 | 0 | 1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' ,5); 2019-05-29 11:11:18.181 CST [49885] ERROR: block number out of range 2019-05-29 11:11:18.181 CST [49885] STATEMENT: select * from bt_page_stats( 'test_pkey' ,5); ERROR: block number out of range |
meta page
root page # btpo_flags=2
branch page # btpo_flags=0
leaf page # btpo_flags=1
如果即是leaf又是root则 btpo_flags=3
根据btpo_flage可以看出结构应该是这样,索引占了5个block:
1 2 3 4 5 6 7 8 9 10 | meta page (block 0,btpo_flags为2,上面可以看到最后一列) | root page ( block 3) | | | leaf page( block 1) leaf page ( block 2)leaf page( block 4) | ———————------------- | key | | (block,offset) | ———————------------- |
查看root page:
1 2 3 4 5 6 7 8 9 10 11 12 13 | apple=# select * from bt_page_items( 'test_pkey' ,3); itemoffset | ctid | itemlen | nulls | vars | data ------------+--------+---------+-------+------+------------------------- 1 | (1,0) | 8 | f | f | 2 | (2,7) | 16 | f | f | 6f 01 00 00 00 00 00 00 3 | (4,13) | 16 | f | f | dd 02 00 00 00 00 00 00 (3 rows) |
查看leaf page:
1 2 3 4 5 6 7 8 9 10 11 12 | apple=# select * from bt_page_items( 'test_pkey' ,1); itemoffset | ctid | itemlen | nulls | vars | data ------------+---------+---------+-------+------+------------------------- 1 | (3,7) | 16 | f | f | 6f 01 00 00 00 00 00 00 2 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00 3 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00 4 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00 5 | (0,4) | 16 | f | f | 04 00 00 00 00 00 00 00 6 | (0,5) | 16 | f | f | 05 00 00 00 00 00 00 00 7 | (0,6) | 16 | f | f | 06 00 00 00 00 00 00 00 8 | (0,7) | 16 | f | f | 07 00 00 00 00 00 00 00 …. |
查看一个item对应的记录:
1 2 3 4 5 | apple=# select * from test where ctid = '(3,7)' ; id | info -----+---------------------------------- 367 | 06818c090f9e5f63c95764342590a598 (1 row) |
那么索引里面的key是怎么排序的?块为什么不是连续的,块2变为了root page。
3.查看level的变化
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 | apple=# drop table test; DROP TABLE apple=# create table test(id int primary key, info text); CREATE TABLE apple=# insert into test select t.id, md5(random()::text) from generate_series(1, 20) as t(id); INSERT 0 20 apple=# select * from bt_page_stats( 'test_pkey' , 1); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 1 | l | 20 | 0 | 16 | 8192 | 7748 | 0 | 0 | 0 | 3 (1 row) apple=# select * from bt_metap( 'test_pkey' ); magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples --------+---------+------+-------+----------+-----------+-------------+------------------------- 340322 | 3 | 1 | 0 | 1 | 0 | 0 | -1 (1 row) apple=# select bt_page_items( 'test_pkey' , 1); bt_page_items ------------------------------------------------ (1, "(0,1)" ,16,f,f, "01 00 00 00 00 00 00 00" ) (2, "(0,2)" ,16,f,f, "02 00 00 00 00 00 00 00" ) (3, "(0,3)" ,16,f,f, "03 00 00 00 00 00 00 00" ) (4, "(0,4)" ,16,f,f, "04 00 00 00 00 00 00 00" ) (5, "(0,5)" ,16,f,f, "05 00 00 00 00 00 00 00" ) (6, "(0,6)" ,16,f,f, "06 00 00 00 00 00 00 00" ) (7, "(0,7)" ,16,f,f, "07 00 00 00 00 00 00 00" ) (8, "(0,8)" ,16,f,f, "08 00 00 00 00 00 00 00" ) (9, "(0,9)" ,16,f,f, "09 00 00 00 00 00 00 00" ) (10, "(0,10)" ,16,f,f, "0a 00 00 00 00 00 00 00" ) (11, "(0,11)" ,16,f,f, "0b 00 00 00 00 00 00 00" ) (12, "(0,12)" ,16,f,f, "0c 00 00 00 00 00 00 00" ) (13, "(0,13)" ,16,f,f, "0d 00 00 00 00 00 00 00" ) (14, "(0,14)" ,16,f,f, "0e 00 00 00 00 00 00 00" ) (15, "(0,15)" ,16,f,f, "0f 00 00 00 00 00 00 00" ) (16, "(0,16)" ,16,f,f, "10 00 00 00 00 00 00 00" ) (17, "(0,17)" ,16,f,f, "11 00 00 00 00 00 00 00" ) (18, "(0,18)" ,16,f,f, "12 00 00 00 00 00 00 00" ) (19, "(0,19)" ,16,f,f, "13 00 00 00 00 00 00 00" ) (20, "(0,20)" ,16,f,f, "14 00 00 00 00 00 00 00" ) (20 rows) |
插入20条,一个root页面就能存放,那么就没有必要申请一个leaf page,root page就是leaf page,他们的btpo_flags就是3;由于没有branch page,因此level也就是0;
继续插入数据:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 | apple=# insert into test select t.id, md5(random()::text) from generate_series(21, 1000) as t(id); INSERT 0 980 apple=# select * from bt_metap( 'test_pkey' ); magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples --------+---------+------+-------+----------+-----------+-------------+------------------------- 340322 | 3 | 3 | 1 | 3 | 1 | 0 | -1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 0); ERROR: block 0 is a meta page apple=# select * from bt_page_stats( 'test_pkey' , 1); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 2); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 2 | l | 367 | 0 | 16 | 8192 | 808 | 1 | 4 | 0 | 1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 3); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 3 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 1 | 2 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 4); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 4 | l | 268 | 0 | 16 | 8192 | 2788 | 2 | 0 | 0 | 1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 5); ERROR: block number out of range |
可以看到leve由0变为了1,有了新的root page,从block 1变为了block 3,且加入了三个新的leaf page。
我们可以看到一个leaf页面大概在以int类型为索引时,大概可以存放367条记录,而一个root page中记录一个leaf page指针只需要13bit大小,那么我们继续增大多少条,可以出现branch page呢?
(8192/13) * 367 - 1000 = 230210条,那么我们就插入数据试试:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 | apple=# insert into test select t.id, md5(random()::text) from generate_series(1001, 230210) as t(id); INSERT 0 229210 apple=# select * from bt_page_stats( 'test_pkey' , 0); ERROR: block 0 is a meta page apple=# analyze ; ANALYZE apple=# select * from bt_metap( 'test_pkey' ); magic | version | root | level | fastroot | fastlevel | oldest_xact | last_cleanup_num_tuples --------+---------+------+-------+----------+-----------+-------------+------------------------- 340322 | 3 | 412 | 2 | 412 | 2 | 0 | -1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 412); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 412 | r | 2 | 0 | 12 | 8192 | 8116 | 0 | 0 | 2 | 2 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 411); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 411 | i | 344 | 0 | 15 | 8192 | 1276 | 3 | 0 | 1 | 0 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 3); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 3 | i | 286 | 0 | 15 | 8192 | 2436 | 0 | 411 | 1 | 0 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 8); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 8 | l | 367 | 0 | 16 | 8192 | 808 | 7 | 9 | 0 | 1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 1); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 2); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 2 | l | 367 | 0 | 16 | 8192 | 808 | 1 | 4 | 0 | 1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 4); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 4 | l | 367 | 0 | 16 | 8192 | 808 | 2 | 5 | 0 | 1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 632); blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags -------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------ 632 | l | 362 | 0 | 16 | 8192 | 908 | 631 | 0 | 0 | 1 (1 row) apple=# select * from bt_page_stats( 'test_pkey' , 633); ERROR: block number out of range |
上面可以看到level由1变为了2,root page又由上面的block 3变为了block 412,而block 3蜕变为一个branch节点;增加了两个branch page分别是btpo_flags为0的block 411和 block 3;可以计算到最后一个page应该是(8192/13) + 2 = 632。
两个branch page是从第一个指向第二个3的下一个是411,在哪里看到3号block是第一个branch page?
所有的leaf page是从0->1->2->4->5 … ->632->0,怎么知道哪个leaf page是从哪个branch page指出来的呢?
答案是可以通过看root page和branch page里面的值来看:
从root page的值可以看到是存放了对应的branch page的块号和偏移量
1 2 3 4 5 6 7 8 9 10 11 | apple=# select * from bt_page_items( 'test_pkey' , 412); itemoffset | ctid | itemlen | nulls | vars | data ------------+----------+---------+-------+------+------------------------- 1 | (3,0) | 8 | f | f | 2 | (411,31) | 16 | f | f | 77 97 01 00 00 00 00 00 (2 rows) |
branch page存放了leaf page的block number和offset
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | apple=# select * from bt_page_items( 'test_pkey' , 3); itemoffset | ctid | itemlen | nulls | vars | data ------------+-----------+---------+-------+------+------------------------- 1 | (287,31) | 16 | f | f | 77 97 01 00 00 00 00 00 2 | (1,0) | 8 | f | f | 3 | (2,7) | 16 | f | f | 6f 01 00 00 00 00 00 00 4 | (4,13) | 16 | f | f | dd 02 00 00 00 00 00 00 5 | (5,19) | 16 | f | f | 4b 04 00 00 00 00 00 00 6 | (6,25) | 16 | f | f | b9 05 00 00 00 00 00 00 7 | (7,31) | 16 | f | f | 27 07 00 00 00 00 00 00 8 | (8,37) | 16 | f | f | 95 08 00 00 00 00 00 00 9 | (9,43) | 16 | f | f | 03 0a 00 00 00 00 00 00 |
那么root page和branch page的每个item会存放他们对应的块里面索引key按照B+树的方式进行组织,我们这里B+树高度为2,每页可存放630条记录,因此,可以大致的画出当前的拓扑图:
当然这些数据都是基于B+树进行按顺序排列的,B+树可以指定树的宽度,我们这边的宽度不指定,而是按实际大小计算的吗?例如这里:8192/13 = 630,即M=630。
4.扫描时时间消耗估算
apple=# explain (analyze, verbose, timing, costs, buffers) select id from test where id = 11;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using test_pkey on public.test (cost=0.42..8.44 rows=1 width=4) (actual time=0.247..0.248 rows=1 loops=1)
Output: id
Index Cond: (test.id = 11)
Heap Fetches: 1
Buffers: shared hit=4
Planning Time: 0.151 ms
Execution Time: 0.287 ms
(7 rows)
costs = 1 meta page + root page + branch page + leaf page = 4
apple=# explain (analyze, verbose, timing, costs, buffers) select id from test where id < 11;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using test_pkey on public.test (cost=0.42..8.61 rows=11 width=4) (actual time=0.006..0.009 rows=10 loops=1)
Output: id
Index Cond: (test.id < 11)
Heap Fetches: 10
Buffers: shared hit=4
Planning Time: 0.100 ms
Execution Time: 0.027 ms
(7 rows)
costs = 1 meta page + root page + branch page + leaf page = 4,一个块的读取消耗几乎忽略不计。
apple=# explain (analyze, verbose, timing, costs, buffers) select id from test where id in (1,3,1000, 222222, 111111111, 1232244,11);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using test_pkey on public.test (cost=0.42..35.06 rows=7 width=4) (actual time=0.035..0.088 rows=5 loops=1)
Output: id
Index Cond: (test.id = ANY ('{1,3,1000,222222,111111111,1232244,11}'::integer[]))
Heap Fetches: 5
Buffers: shared hit=27
Planning Time: 0.095 ms
Execution Time: 0.113 ms
(7 rows)
只有五条记录在leaf中有,因此:
costs = 1 meta page + 5*(root page + branch page 1 + branch page 2 + leaf page) + 2* (root page + branch page 1 + branch page 2) = 27
后续给出索引每个item对应的代码结构。
https://www.cnblogs.com/scu-cjx/p/9960483.html
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 10年+ .NET Coder 心语,封装的思维:从隐藏、稳定开始理解其本质意义
· .NET Core 中如何实现缓存的预热?
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
· AI与.NET技术实操系列:向量存储与相似性搜索在 .NET 中的实现
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· 10年+ .NET Coder 心语 ── 封装的思维:从隐藏、稳定开始理解其本质意义
· 地球OL攻略 —— 某应届生求职总结
· 提示词工程——AI应用必不可少的技术
· Open-Sora 2.0 重磅开源!
· 周边上新:园子的第一款马克杯温暖上架