lightdb/postgresql toast解析&页面物理组织
先postgresql的页结构、页物理布局
源码定义如下:
struct HeapTupleHeaderData { union { HeapTupleFields t_heap; DatumTupleFields t_datum; }t_choice; ItemPointerData t_ctid;// uint16 t_infomask2; /* number of attributes + various flags */ uint16 t_infomask; /* various flag bits, see below */ uint8 t_hoff; /* sizeof header incl. bitmap, padding */ bits8 t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs */ };
铺开来看的话:
Field | Type | Length | Description |
---|---|---|---|
t_xmin | TransactionId | 4 bytes | insert XID stamp |
t_xmax | TransactionId | 4 bytes | delete XID stamp |
t_cid | CommandId | 4 bytes | insert and/or delete CID stamp (overlays with t_xvac) |
t_xvac | TransactionId | 4 bytes | XID for VACUUM operation moving a row version |
t_ctid | ItemPointerData | 6 bytes | current TID of this or newer row version |
t_infomask2 | uint16 | 2 bytes | number of attributes, plus various flag bits |
t_infomask | uint16 | 2 bytes | various flag bits |
t_hoff | uint8 | 1 byte | offset to user data |
所有字段的详情定义在src/include/access/htup_details.h中(这里面的信息对开发非常重要)。
TOAST块在shared_buffer中也是压缩状态,只有在访问时才会被解压。
es_test=# \d+ big_search_doc_new_ic Table "public.big_search_doc_new_ic" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description --------------------+-------------------------+-----------+----------+---------+----------+--------------+------------- summary | character varying(1024) | | | | extended | | sentiment | character varying(100) | | | | extended | | industryname | character varying(100) | | | | extended | | secucode | character varying(16) | | | | extended | | filename | character varying(100) | | | | extended | | industrycode | character varying(16) | | | | extended | | format | character varying(100) | | | | extended | | reporttype | character varying(16) | | | | extended | | fstoregroup | character varying(16) | | | | extended | | title | character varying(100) | | | | extended | | docid | character varying(36) | | | | extended | | content | text | | | | extended | | industrychains | character varying[] | | | | extended | | investranking | character varying(16) | | | | extended | | industrychainnodes | character varying[] | | | | extended | | docsource | character varying(16) | | | | extended | | secuname | character varying(16) | | | | extended | | fstorepath | character varying(64) | | | | extended | | riskranking | character varying(16) | | | | extended | | orgname | character varying(32) | | | | extended | | publishdate | character varying(32) | | | | extended | | pagecount | integer | | | | plain | | authors | character varying[] | | | | extended | | Access method: heap
es_test=# select oid,relname,reltoastrelid,relpages,reltuples from pg_class where relname='big_search_doc_new_ic'; oid | relname | reltoastrelid | relpages | reltuples --------+-----------------------+---------------+----------+-------------- 335878 | big_search_doc_new_ic | 335881 | 618489 | 3.210117e+06 es_test=# select oid,relname,reltoastrelid,relpages,reltuples from pg_class where oid=335881; -- TOAST比表大是正常的 oid | relname | reltoastrelid | relpages | reltuples --------+-----------------+---------------+----------+-------------- 335881 | pg_toast_335878 | 0 | 18619610 | 6.538884e+07
select a.oid, a.relname, a.reltoastrelid, a.relpages, a.reltuples,
a.relfilenode, b.oid, b.relname, b.reltoastrelid, b.relpages,
b.relfilenode, b.reltuples from pg_class a, pg_class b where a.relname = 'data' and a.reltoastrelid = b.oid;
es_test=# select * from pg_toast_335878 limit 1; # TOAST表不能直接访问。 ERROR: relation "pg_toast_335878" does not exist LINE 1: select * from pg_toast_335878 limit 1; ^
es_test=# SELECT tuple_data_split('big_search_doc_new_ic'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('big_search_doc_new_ic', 0)) limit 1; tuple_data_split ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------- {"\\x840b0000e4bcb4e99a8fe79d80e4b8ade59bbde4bc81e4b89ae695b0e5ad97e58c96e58d87e7baa7e79a84e5a4a7e8b68be58abfefbc8c5361615320e8a18ce4b89ae5b086e698afe995bfe69c9fe58f97e79b8ae79a84e4bc98e8b4a 8e8b59be98193e38082e4b88ee7be8ee59bbd205361615320e8a18ce4b89ae79a84e7bea4e6989fe997aae88080e79bb8e6af94efbc8ce4b8ade59bbd205361615320e5b882e59cbae8bf98e5a484e4ba8ee58f91e5b195e697a9e69c9fefbc 8ce4bd86e99a8fe79d80e68891e59bbde4ba91e8aea1e7ae97e59fbae7a180e8aebee696bde98090e6b890e68890e7869fefbc8ce4b8ade59bbd2053616153e4b99fe5b086e8bf8ee69da5e799bee88ab1e9bd90e694bee38082e79bb8e5afb 9e4ba8ee2809ce88a82e6b581e59e8be2809d53616153efbc8ce68891e4bbace69bb4e79c8be5a5bde2809ce5bc80e6ba90e59e8be2809d5361615320e79a84e58f91e5b195e380825361615320e8a18ce4b89ae794b1e9809ae794a8e7b1bb e98090e6b890e59091e59e82e79bb4e8a18ce4b89ae7b1bbe6b897e9808fefbc8ce69bb4e5a49ae7bb86e58886e8a18ce4b89ae58685e68896e5b086e6b68ce78eb0e9be99e5a4b4e38082e68891e4bbace9a696e6aca1e8a686e79b96e4b8a de59bbd205361615320e8a18ce4b89ae5928ce585ade5aeb6e4bc98e8b4a8e585ace58fb8e6a087e79a84efbc8ce59d87e7bb99e4ba88e2809ce4b9b0e585a5e2809de8af84e7baa7efbc8ce68ea8e88d90e9a1bae5ba8fe4be9de6aca1e4b8 bae4b8ade59bbde69c89e8b59eefbc88383038332e484befbc89e38081e5beaee79b9fe99b86e59ba2efbc88323031332e484befbc89e38081e98791e5b1b1e8bdafe4bbb6efbc88333838382e484befbc89e38081e794a8e58f8be7bd91e7b b9cefbc883630303538382e5348efbc89e38081e98791e89db6e59bbde99985efbc883236382e484befbc89e5928ce6988ee6ba90e4ba91efbc883930392e484befbc89e380820a",NULL,NULL,NULL,"\\x49e4b8ade59bbd53616153e8a18 ce4b89aefbc9ae9a38ee887b3e4ba91e8b5b72e706466",NULL,NULL,"\\x1be585b6e4bb96e7a094e68aa5","\\x0f67726f757031","\\x41e4b8ade59bbd53616153e8a18ce4b89aefbc9ae9a38ee887b3e4ba91e8b5b7","\\x43666232 6564633232643931623462303439343466636464366137393833666631","\\x01123d470500cd1402000e20050009200500","\\x1b000000000000000013040000","\\x0fe8b685e9858d","\\x1b000000000000000013040000","\\x0 f6673746f7265","\\x03","\\x5b4d30302f30302f30302f4368516775315f354f3265414c5161484146354e72562d5a4639343031352e706466",NULL,"\\x27e6b5a6e993b6e59bbde99985e8af81e588b8","\\x33323032302d31322d3 2315430303a30303a30302b30383030","\\x88000000","\\x7b010000000000000013040000030000000100000034000000e69da8e5ad90e8b68500000028000000e69e97e790b0000028000000e8b5b5e4b8b90000","\\x0112564a0100 524a01000d20050009200500"} (1 row)
https://www.postgresql.org/docs/current/pageinspect.html#id-1.11.7.31.5
http://www.mytecdb.com/blogDetail.php?id=223
TOAST字段(可变长度类型字段,含varchar、bytea、jsonb等,不一定非得超过2000字节)的存储选项
默认情况下,当字段超过2KB的时候TOAST就会尝试压缩宽列,如果压缩之后仍然超过2KB,就会将宽字段分块,源表通过指针指向分块后的TOAST,如果低于2KB,则直接存储。所以有些宽列压缩后存储在行内,有些在TOAST是可能的。
ALTER TABLE ... SET (toast_tuple_target = N) 默认为2KB,支持表级别设置。
ALTER [ COLUMN ] ALTER TABLE ...
column_name
SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
-
PLAIN
prevents either compression or out-of-line storage; furthermore it disables use of single-byte headers for varlena types. This is the only possible strategy for columns of non-TOAST-able data types. -
EXTENDED
allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big. -
EXTERNAL
allows out-of-line storage but not compression. Use ofEXTERNAL
will make substring operations on widetext
andbytea
columns faster (at the penalty of increased storage space) because these operations are optimized to fetch only the required parts of the out-of-line value when it is not compressed. -
MAIN
allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort when there is no other way to make the row small enough to fit on a page.
alter table data add column big_data_plain text; alter table data alter column big_data_plain set storage plain; ## plain需要注意最大长度不能超过8160,否则会报错:
zjh@postgres=# alter table t alter column v set storage plain; ALTER TABLE zjh@postgres=# zjh@postgres=# zjh@postgres=# begin; BEGIN zjh@postgres=*# select pg_backend_pid(); pg_backend_pid ---------------- 175641 (1 row) zjh@postgres=*# insert into t values(rpad('y',1000000,'y')); ERROR: row is too big: size 1000032, maximum size 8160 在raw_heap_insert函数中判断
alter table data add column big_data_external text; alter table data alter column big_data_external set storage external; alter table data add column big_data_main text; alter table data alter column big_data_main set storage main;
update data set big_data_plain = big_column,big_data_external = big_column,big_data_main = big_column;
analyze data;
可通过select attname,attstorage from pg_catalog.pg_attribute where attrelid = 24594;查询列存储格式,可知已经是指定格式。
id |p | data |x | big_column |x | big_data_plain |p | big_data_external|e | big_data_main |m |
但是查询toast关系大小,仍然是0页?难道在主表中?
oid |relname|reltoastrelid|relpages|reltuples|relfilenode|oid |relname |reltoastrelid|relfilenode|relpages|reltuples| -----+-------+-------------+--------+---------+-----------+-----+--------------+-------------+-----------+--------+---------+ 24594|data | 24598| 1| 16.0| 57388|24598|pg_toast_24594| 0| 57389| 0| 0.0|
查看对应的57389文件,大小不为空(应该算是缺陷)。
TOAST压缩
在PG中,针对TOAST,默认使用的是LZ压缩算法,实现为PGLZ。压缩得通常用法是在写磁盘和网络之前进行压缩,在应用接收和读取之后解压。
https://medium.com/@lk.snatch/postgresql-compression-854a4647ee43
pg 14为什么要增加LZ4算法选择
论压缩率,LZ4和PGLZ差异是不大的,但是LZ4速度相比PGLZ要更快,所以这在没有结果集缓存时非常重要,因为每次访问都要重复解压,单笔体现不出来。
https://stackoverflow.com/questions/67537111/how-do-i-decide-between-lz4-and-snappy-compression
postgres=# postgres=# insert into tab_compression_1 select x,lpad('abc',4000,'xfe') from generate_series(1,100000) as x; INSERT 0 100000 Time: 1274.374 ms (00:01.274) postgres=# postgres=# postgres=# postgres=# insert into tab_compression_1 select x,lpad('abc',4000,'xfe') from generate_series(1,100000) as x; INSERT 0 100000 Time: 1264.036 ms (00:01.264) postgres=# postgres=# postgres=# postgres=# insert into tab_compression_1 select x,lpad('abc',4000,'xfe') from generate_series(1,100000) as x; INSERT 0 100000 Time: 1270.027 ms (00:01.270) postgres=# postgres=# postgres=# insert into tab_compression_2 select x,lpad('abc',4000,'xfe') from generate_series(1,100000) as x; INSERT 0 100000 Time: 178.854 ms postgres=# insert into tab_compression_2 select x,lpad('abc',4000,'xfe') from generate_series(1,100000) as x; INSERT 0 100000 Time: 178.418 ms postgres=# insert into tab_compression_2 select x,lpad('abc',4000,'xfe') from generate_series(1,100000) as x; INSERT 0 100000 Time: 179.050 ms postgres=# select count(distinct substr(data,3,10)) as x from tab_compression_1 ; x --- 1 (1 row) Time: 140.865 ms postgres=# select count(distinct substr(data,3,10)) as x from tab_compression_1 ; x --- 1 (1 row) Time: 136.520 ms postgres=# select count(distinct substr(data,3,10)) as x from tab_compression_1 ; x --- 1 (1 row) Time: 136.078 ms postgres=# select count(distinct substr(data,3,10)) as x from tab_compression_2 ; x --- 1 (1 row) Time: 109.513 ms postgres=# select count(distinct substr(data,3,10)) as x from tab_compression_2 ; x --- 1 (1 row) Time: 110.300 ms postgres=# select count(distinct substr(data,3,10)) as x from tab_compression_2 ; x --- 1 (1 row) Time: 111.693 ms
查询时间降低了20%。插入时间降低了86%。
那为什么默认不使用lz4呢?因为升级问题,如果直接从13及之前版本升级上来,数据就会破坏。
最后
最后,还需要一提的是,postgresql还支持真正的大对象large object,并且包含对应的API和数据字典,虽然现在一般都用TOAST代替了,但是其比TOAST更加灵活、当然也更加原始,可参见https://www.hs.net/lightdb/docs/html/largeobjects.html。
java保存图片到postgresql:https://www.enterprisedb.com/postgres-tutorials/postgresql-toast-and-working-blobsclobs-explained
现代LZ算法
https://glinscott.github.io/lz/index.html
参考
https://wiki.postgresql.org/wiki/TOAST
https://www.postgresql.org/docs/current/storage-toast.html
mysql存储类型:https://www.cnblogs.com/f-ck-need-u/p/7729251.html
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 开发者必知的日志记录最佳实践
· SQL Server 2025 AI相关能力初探
· Linux系列:如何用 C#调用 C方法造成内存泄露
· AI与.NET技术实操系列(二):开始使用ML.NET
· 记一次.NET内存居高不下排查解决与启示
· 被坑几百块钱后,我竟然真的恢复了删除的微信聊天记录!
· 没有Manus邀请码?试试免邀请码的MGX或者开源的OpenManus吧
· 【自荐】一款简洁、开源的在线白板工具 Drawnix
· 园子的第一款AI主题卫衣上架——"HELLO! HOW CAN I ASSIST YOU TODAY
· Docker 太简单,K8s 太复杂?w7panel 让容器管理更轻松!
2016-07-20 媲美oracle awr/statspack的mysql awr第一版发布