lightdb中clog(commit log)内部实现、及与wal、commit的关系及细节

  WAL相当于oracle中的redo log,mysql中的redolog,9.6及之前名为xlog,10+当前在pg_wal文件夹中,wal段默认大小为16M,在initdb时可以指定大小,后续原则上不可以修改。可以通过pg_waldump查看二进制日志的内容。wal的结构解析https://www.cnblogs.com/abclife/p/13708947.html,虽然不完全正确,比如LSN的物理文件ID解析就不正确)。wal的物理结构(定义在src/include/access/xlogrecord.h)如下:

typedef struct XLogRecord
{
    uint32        xl_tot_len;        /* total len of entire record */
    TransactionId xl_xid;        /* xact id */
    XLogRecPtr    xl_prev;        /* ptr to previous record in log */
    uint8        xl_info;        /* flag bits, see below */
    RmgrId        xl_rmid;        /* resource manager for this record */
    /* 2 bytes of padding here, initialize to zero */
    pg_crc32c    xl_crc;            /* CRC for this record */

    /* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */

} XLogRecord;

typedef struct XLogRecordBlockHeader
{
    uint8        id;                /* block reference ID */
    uint8        fork_flags;        /* fork within the relation, and flags */
    uint16        data_length;    /* number of payload bytes (not including page
                                 * image) */

    /* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
    /* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
    /* BlockNumber follows */
} XLogRecordBlockHeader;

  其中BlockHeader和block data一一对应,其中场景之一是第一次写入块或checkpoint之后第一次修改块时会生成block data。所以每个日志条目的最小长度为24字节+n字节(记录null标记,1个bit 1位,所以等于n字段数/8)+main data。

  具体为:解析wal记录的时候,首先读取XLogRecord,并根据xl_tot_len读取整条记录,然后根据其中的xl_rmid判断当前是哪个RMGR,调用具体rmgr read回调,RMGR首先根据xl_info(info是第二个关键口子,也是RMGR后的第一个关键依据)判断接下去是XLogRecordBlockHeader(0个或多个)、XLogRecordDataHeader(有且只有一个)还是其它(就是一事一议)。

   WAL归档的执行过程可见https://wiki.moritetu.xyz/?PostgreSQL/%E8%A7%A3%E6%9E%90/WAL%E3%82%A2%E3%83%BC%E3%82%AB%E3%82%A4%E3%83%96

   clog(全称Commit Log,PostgreSQL transaction-commit-log manager,主要在clog.c中实现)里面记录了事务的执行状态,每次事务提交和回滚的时候,都需要更新该状态(调用CommitTransactionCommand(void)),PostgreSQL服务器访问该文件确定事务的状态,保存在pg_xact目录中,每个文件大小为256KB,每个事务2位(bit),故1个文件可以包含131072个事务。对于第一次修改的数据行来说,因为事务状态存储在clog中,所以修改后第一次判断行的可见性需要通过访问clog来确定,而访问clog是一个非常耗费性能的过程,故关于clog访问优化,有一个很长的discussion。

  注:执行层的操作lt_xact文件的接口函数主要在transam.c中,clog.c是底层实现。

  事务在clog中的状态包括:

/*
 * Possible transaction statuses --- note that all-zeroes is the initial
 * state.
 *
 * A "subcommitted" transaction is a committed subtransaction whose parent
 * hasn't committed or aborted yet.
 */
typedef int XidStatus;

#define TRANSACTION_STATUS_IN_PROGRESS        0x00
#define TRANSACTION_STATUS_COMMITTED        0x01
#define TRANSACTION_STATUS_ABORTED            0x02
#define TRANSACTION_STATUS_SUB_COMMITTED    0x03

  在clog.c中。

  因为pg的MVCC在文件中实现undo,即使事务回滚了,新创建的行也不会被删除,但是因为clog中记录了事务的执行状态,所以其他事务在xmin和xmax判断时候可以过滤掉或不过滤掉这些记录(主要是xmax=0的情况,因为此时可能提交了、也可能稍微提交)。

pg_xact(9.6及之前名为pg_clog,虽然代码中还是clog.c)

[postgres@hs-10-20-30-194 pg_xact]$ ll
total 13208
-rw------- 1 postgres postgres 262144 May 24 17:26 0000
-rw------- 1 postgres postgres 262144 May 24 17:26 0001
-rw------- 1 postgres postgres 262144 May 24 17:27 0002
-rw------- 1 postgres postgres 262144 May 24 17:27 0003
-rw------- 1 postgres postgres 262144 May 24 17:27 0004
-rw------- 1 postgres postgres 262144 May 24 17:28 0005
-rw------- 1 postgres postgres 262144 May 24 17:28 0006
-rw------- 1 postgres postgres 262144 May 24 17:28 0007

  clog和wal的交互:这得先理解事务的完整过程。

  

  在AM层,调用xlog相关接口将WAL条目写入WAL文件,PortalDrop清理执行完成后,主入口exec_simple_query()->finish_xact_command()会依次调用CommitTransactionCommand()->CommitTransaction()->RecordTransactionCommit()->XactLogCommitRecord()调用XLogInsert()将commit wal条目写入WAL文件,然后RecordTransactionCommit()调用XLogFlush刷新commit WAL日志,然后调用TransactionIdCommitTree()更新clog。TransactionIdCommitTree->TransactionIdSetTreeStatus->TransactionIdSetPageStatus->TransactionIdSetPageStatusInternal,然后根据pageno找到slotno(使用slru简单最近最少访问算法管理),调用TransactionIdSetStatusBit(其根据xid找到偏移量,然后进行位运算更新事务状态)

 * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
 * CLOG page numbering also wraps around at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE,
 * and CLOG segment numbering at
 * 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
 * explicit notice of that fact in this module, except when comparing segment
 * and page numbers in TruncateCLOG (see CLOGPagePrecedes).
 */

/* We need two bits per xact, so four xacts fit in a byte */
#define CLOG_BITS_PER_XACT    2
#define CLOG_XACTS_PER_BYTE 4   每字节包含的事务数
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)    每BLOCK包含的事务数,32768
#define CLOG_XACT_BITMASK    ((1 << CLOG_BITS_PER_XACT) - 1)   0x11

#define TransactionIdToPage(xid)    ((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)    根据事务ID找到页,事务ID 整除 32768
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)    页内事务相对顺序号偏移量 事务ID 取余 32768
#define TransactionIdToByte(xid)    (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)  页内字节偏移量
#define TransactionIdToBIndex(xid)    ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)     字节内事务相对顺序号偏移量  事务ID 取余 4

 

 /* We store the latest async LSN for each group of transactions *

#define CLOG_XACTS_PER_LSN_GROUP 32 /* keep this a power of 2 */
#define CLOG_LSNS_PER_PAGE (CLOG_XACTS_PER_PAGE / CLOG_XACTS_PER_LSN_GROUP)

#define GetLSNIndex(slotno, xid) ((slotno) * CLOG_LSNS_PER_PAGE + \
((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)

  因为更新clog是内存中进行的,不会刷盘,那问题来了。1、重启恢复的时候哪里会用到?2、判断元祖可见性的时候哪里会调用到?

  所有的元祖在被fetch时,都会检查xmin、xmax是否已经提交,如果infomask_2上没有标记的话,就回去clog缓存区查询,如下:

TransactionIdGetStatus clog.c:654
TransactionLogFetch transam.c:79
TransactionIdDidCommit transam.c:129
HeapTupleSatisfiesMVCC heapam_visibility.c:1058
HeapTupleSatisfiesVisibility heapam_visibility.c:1695
heapgetpage heapam.c:476
heapgettup_pagemode heapam.c:917
heap_getnextslot heapam.c:1390
table_scan_getnextslot tableam.h:906
SeqNext nodeSeqscan.c:80
ExecScanFetch execScan.c:133
ExecScan execScan.c:182
ExecSeqScan nodeSeqscan.c:112
ExecProcNodeFirst execProcnode.c:454
ExecProcNode executor.h:248
ExecutePlan execMain.c:1632
standard_ExecutorRun execMain.c:350
CitusExecutorRun multi_executor.c:214
pgss_ExecutorRun pg_stat_statements.c:1043
pgsk_ExecutorRun pg_stat_kcache.c:1034
pgqs_ExecutorRun pg_qualstats.c:661
explain_ExecutorRun auto_explain.c:334
ExecutorRun execMain.c:292
PortalRunSelect pquery.c:912
PortalRun pquery.c:756
exec_simple_query postgres.c:1325
PostgresMain postgres.c:4415
BackendRun postmaster.c:4527
BackendStartup postmaster.c:4211
ServerLoop postmaster.c:1740
PostmasterMain postmaster.c:1413
main main.c:231
__libc_start_main 0x00007f3353efd555
_start 0x0000000000483aa9

 

/*
 * information stored in t_infomask:
 */
#define HEAP_HASNULL            0x0001    /* has null attribute(s) */
#define HEAP_HASVARWIDTH        0x0002    /* has variable-width attribute(s) */
#define HEAP_HASEXTERNAL        0x0004    /* has external stored attribute(s) */
#define HEAP_HASOID_OLD            0x0008    /* has an object-id field */
#define HEAP_XMAX_KEYSHR_LOCK    0x0010    /* xmax is a key-shared locker */
#define HEAP_COMBOCID            0x0020    /* t_cid is a combo cid */
#define HEAP_XMAX_EXCL_LOCK        0x0040    /* xmax is exclusive locker */
#define HEAP_XMAX_LOCK_ONLY        0x0080    /* xmax, if valid, is only a locker */

 /* xmax is a shared locker */
#define HEAP_XMAX_SHR_LOCK    (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)

#define HEAP_LOCK_MASK    (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
                         HEAP_XMAX_KEYSHR_LOCK)
#define HEAP_XMIN_COMMITTED        0x0100    /* t_xmin committed */
#define HEAP_XMIN_INVALID        0x0200    /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED        0x0400    /* t_xmax committed */
#define HEAP_XMAX_INVALID        0x0800    /* t_xmax invalid/aborted */
#define HEAP_XMAX_IS_MULTI        0x1000    /* t_xmax is a MultiXactId */
#define HEAP_UPDATED            0x2000    /* this is UPDATEd version of row */
#define HEAP_MOVED_OFF            0x4000    /* moved to another place by pre-9.0
                                         * VACUUM FULL(cluster同义词); kept for binary
                                         * upgrade support */
#define HEAP_MOVED_IN            0x8000    /* moved from another place by pre-9.0
                                         * VACUUM FULL; kept for binary
                                         * upgrade support */
#define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)

#define HEAP_XACT_MASK            0xFFF0    /* visibility-related bits */

  每当一个新的clog页面(和pg中其他页面一样,也是BLCKSZ宏定义,默认8KB)被初始化为0的时候,clog.c就会生成一条wal记录。xact.c中针对提交和回滚操作的记录(recording)也会写clog。对于同步提交:在clog记录commit前,XLOG会确保被刷新,所以WAL可以自动被保证。对于异步提交:必须跟踪最新的LSN影响的每个CLOG页,这样才能刷新响应的xlog。clog的细节描述具体可以参见:https://www.interdb.jp/pg/pgsql05.html。clog的清理参见:https://www.interdb.jp/pg/pgsql06.html#_6.4.,由vacuum freeze负责清理。

部分结构化描述可以参见https://blog.csdn.net/weixin_39540651/article/details/115677138。 

其他目录说明:

  • pg_logical
  • pg_commit_ts
  • pg_multixact
  • pg_subtrans
  • pg_snapshots
  • pg_replslot
  • pg_dynshmem

9.6目录说明

10.0+目录说明(到14为止未在发生调整) 

https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND

vacuum的内部调用链路

[lightdb@hs-10-20-30-217 ~]$ pstack 2393906
#0  0x00007f7eb7bb0a86 in ?? () from /lib64/libpthread.so.0
#1  0x00007f7eb7bb0b78 in ?? () from /lib64/libpthread.so.0
#2  0x0000000000710202 in PGSemaphoreLock (sema=0x7f4e36e3e8b8) at pg_sema.c:327
#3  0x0000000000785cf5 in LWLockAcquire (lock=0x7f7deb25bb00, mode=mode@entry=LW_EXCLUSIVE) at lwlock.c:1331
#4  0x0000000000764d2a in StartBufferIO (buf=buf@entry=0x7f4e5a3a7b00, forInput=forInput@entry=true) at bufmgr.c:4188
#5  0x0000000000766cdd in BufferAlloc (foundPtr=0x7ffe45a794fb, strategy=0x2501320, blockNum=<optimized out>, forkNum=MAIN_FORKNUM, relpersistence=112 'p', smgr=0x258bc30) at bufmgr.c:1286
#6  ReadBuffer_common (smgr=0x258bc30, relpersistence=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=<optimized out>, blockNum@entry=4198708, mode=RBM_NORMAL, strategy=0x2501320, hit=0x7ffe45a795af) at bufmgr.c:761
#7  0x0000000000767360 in ReadBufferExtended (reln=reln@entry=0x7f4e367f7548, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=4198708, mode=mode@entry=RBM_NORMAL, strategy=<optimized out>) at bufmgr.c:677
#8  0x00000000004dbcc2 in lazy_scan_heap (aggressive=true, nindexes=2, Irel=0x25682d8, vacrelstats=<optimized out>, params=0x2501434, onerel=<optimized out>) at vacuumlazy.c:1092
#9  heap_vacuum_rel (onerel=<optimized out>, params=0x2501434, bstrategy=<optimized out>) at vacuumlazy.c:518
#10 0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x2501434, rel=0x7f4e367f7548) at ../../../src/include/access/tableam.h:1463
#11 vacuum_rel (relid=29477, relation=<optimized out>, params=params@entry=0x2501434) at vacuum.c:1893
#12 0x000000000062e1cb in vacuum (relations=0x2539038, params=params@entry=0x2501434, bstrategy=<optimized out>, bstrategy@entry=0x2501320, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#13 0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x2501320, tab=0x2501430) at autovacuum.c:3137
#14 do_autovacuum () at autovacuum.c:2467
#15 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#16 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#17 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#18 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#19 <signal handler called>
#20 0x00007f7eb745a987 in select () from /lib64/libc.so.6
#21 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#22 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x24490d0) at postmaster.c:1488
#23 0x000000000048e7de in main (argc=1, argv=0x24490d0) at main.c:231
[lightdb@hs-10-20-30-217 ~]$ pstack 2393906
#0  0x00007f7eb745a987 in select () from /lib64/libc.so.6
#1  0x00000000008f0d8a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2  0x000000000062ece3 in vacuum_delay_point () at vacuum.c:2034
#3  0x00000000004dbc1d in lazy_scan_heap (aggressive=true, nindexes=2, Irel=0x25682d8, vacrelstats=<optimized out>, params=0x2501434, onerel=<optimized out>) at vacuumlazy.c:1034
#4  heap_vacuum_rel (onerel=<optimized out>, params=0x2501434, bstrategy=<optimized out>) at vacuumlazy.c:518
#5  0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x2501434, rel=0x7f4e367f7548) at ../../../src/include/access/tableam.h:1463
#6  vacuum_rel (relid=29477, relation=<optimized out>, params=params@entry=0x2501434) at vacuum.c:1893
#7  0x000000000062e1cb in vacuum (relations=0x2539038, params=params@entry=0x2501434, bstrategy=<optimized out>, bstrategy@entry=0x2501320, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#8  0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x2501320, tab=0x2501430) at autovacuum.c:3137
#9  do_autovacuum () at autovacuum.c:2467
#10 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#11 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#12 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#13 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#14 <signal handler called>
#15 0x00007f7eb745a987 in select () from /lib64/libc.so.6
#16 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#17 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x24490d0) at postmaster.c:1488
#18 0x000000000048e7de in main (argc=1, argv=0x24490d0) at main.c:231
[lightdb@hs-10-20-30-217 ~]$ pstack 2393906
#0  0x00007f7eb745a987 in select () from /lib64/libc.so.6
#1  0x00000000008f0d8a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2  0x000000000062ece3 in vacuum_delay_point () at vacuum.c:2034
#3  0x00000000004dbc1d in lazy_scan_heap (aggressive=true, nindexes=2, Irel=0x25682d8, vacrelstats=<optimized out>, params=0x2501434, onerel=<optimized out>) at vacuumlazy.c:1034
#4  heap_vacuum_rel (onerel=<optimized out>, params=0x2501434, bstrategy=<optimized out>) at vacuumlazy.c:518
#5  0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x2501434, rel=0x7f4e367f7548) at ../../../src/include/access/tableam.h:1463
#6  vacuum_rel (relid=29477, relation=<optimized out>, params=params@entry=0x2501434) at vacuum.c:1893
#7  0x000000000062e1cb in vacuum (relations=0x2539038, params=params@entry=0x2501434, bstrategy=<optimized out>, bstrategy@entry=0x2501320, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#8  0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x2501320, tab=0x2501430) at autovacuum.c:3137
#9  do_autovacuum () at autovacuum.c:2467
#10 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#11 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#12 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#13 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#14 <signal handler called>
#15 0x00007f7eb745a987 in select () from /lib64/libc.so.6
#16 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#17 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x24490d0) at postmaster.c:1488
#18 0x000000000048e7de in main (argc=1, argv=0x24490d0) at main.c:231
[lightdb@hs-10-20-30-217 ~]$ 


[lightdb@hs-10-20-30-217 base]$ pstack 3695188
#0  0x00007f1394acb987 in select () from /lib64/libc.so.6
#1  0x00000000008f0d8a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2  0x000000000062ece3 in vacuum_delay_point () at vacuum.c:2034
#3  0x00000000004e8635 in btvacuumpage (vstate=vstate@entry=0x7fff02c75f80, scanblkno=scanblkno@entry=81599) at nbtree.c:1109
#4  0x00000000004e8e21 in btvacuumscan (info=info@entry=0x7fff02c76130, stats=stats@entry=0x1106688, callback=callback@entry=0x4daac0 <lazy_tid_reaped>, callback_state=callback_state@entry=0x7efc67f13048, cycleid=<optimized out>) at nbtree.c:1029
#5  0x00000000004e8f7e in btbulkdelete (info=0x7fff02c76130, stats=0x1106688, callback=0x4daac0 <lazy_tid_reaped>, callback_state=0x7efc67f13048) at nbtree.c:871
#6  0x00000000004da85e in lazy_vacuum_index (indrel=<optimized out>, stats=stats@entry=0x10f3df8, dead_tuples=0x7efc67f13048, reltuples=263139328, vacrelstats=vacrelstats@entry=0x10f3a58) at vacuumlazy.c:2444
#7  0x00000000004db677 in lazy_vacuum_all_indexes (onerel=onerel@entry=0x7efca8105c08, Irel=Irel@entry=0x10f3648, stats=stats@entry=0x10f3dd8, vacrelstats=vacrelstats@entry=0x10f3a58, lps=lps@entry=0x0, nindexes=nindexes@entry=12) at vacuumlazy.c:1810
#8  0x00000000004dbecf in lazy_scan_heap (aggressive=true, nindexes=12, Irel=0x10f3648, vacrelstats=<optimized out>, params=0x10b27c4, onerel=<optimized out>) at vacuumlazy.c:1696
#9  heap_vacuum_rel (onerel=<optimized out>, params=0x10b27c4, bstrategy=<optimized out>) at vacuumlazy.c:518
#10 0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x10b27c4, rel=0x7efca8105c08) at ../../../src/include/access/tableam.h:1463
#11 vacuum_rel (relid=9217732, relation=<optimized out>, params=params@entry=0x10b27c4) at vacuum.c:1893
#12 0x000000000062e1cb in vacuum (relations=0x111f518, params=params@entry=0x10b27c4, bstrategy=<optimized out>, bstrategy@entry=0x10b2550, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#13 0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x10b2550, tab=0x10b27c0) at autovacuum.c:3137
#14 do_autovacuum () at autovacuum.c:2467
#15 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#16 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#17 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#18 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#19 <signal handler called>
#20 0x00007f1394acb987 in select () from /lib64/libc.so.6
#21 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#22 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0xfd0110) at postmaster.c:1488
#23 0x000000000048e7de in main (argc=1, argv=0xfd0110) at main.c:231
[lightdb@hs-10-20-30-217 base]$ pstack 3695188
#0  0x00007f1395222ff4 in pread64 () from /lib64/libpthread.so.0
#1  0x000000000076cb41 in FileRead (file=<optimized out>, buffer=buffer@entry=0x7efd0674fb00 "Q\002", amount=amount@entry=8192, offset=offset@entry=814096384, wait_event_info=wait_event_info@entry=167772173) at fd.c:2102
#2  0x00000000007902a0 in mdread (reln=<optimized out>, forknum=<optimized out>, blocknum=99377, buffer=0x7efd0674fb00 "Q\002") at md.c:663
#3  0x0000000000766a05 in ReadBuffer_common (smgr=0x10cc198, relpersistence=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=<optimized out>, blockNum@entry=99377, mode=RBM_NORMAL, strategy=0x10b2550, hit=0x7fff02c74e9f) at bufmgr.c:916
#4  0x0000000000767360 in ReadBufferExtended (reln=reln@entry=0x7efca810b6d8, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=99377, mode=mode@entry=RBM_NORMAL, strategy=<optimized out>) at bufmgr.c:677
#5  0x00000000004e864c in btvacuumpage (vstate=vstate@entry=0x7fff02c75f80, scanblkno=scanblkno@entry=99377) at nbtree.c:1117
#6  0x00000000004e8e21 in btvacuumscan (info=info@entry=0x7fff02c76130, stats=stats@entry=0x1106688, callback=callback@entry=0x4daac0 <lazy_tid_reaped>, callback_state=callback_state@entry=0x7efc67f13048, cycleid=<optimized out>) at nbtree.c:1029
#7  0x00000000004e8f7e in btbulkdelete (info=0x7fff02c76130, stats=0x1106688, callback=0x4daac0 <lazy_tid_reaped>, callback_state=0x7efc67f13048) at nbtree.c:871
#8  0x00000000004da85e in lazy_vacuum_index (indrel=<optimized out>, stats=stats@entry=0x10f3df8, dead_tuples=0x7efc67f13048, reltuples=263139328, vacrelstats=vacrelstats@entry=0x10f3a58) at vacuumlazy.c:2444
#9  0x00000000004db677 in lazy_vacuum_all_indexes (onerel=onerel@entry=0x7efca8105c08, Irel=Irel@entry=0x10f3648, stats=stats@entry=0x10f3dd8, vacrelstats=vacrelstats@entry=0x10f3a58, lps=lps@entry=0x0, nindexes=nindexes@entry=12) at vacuumlazy.c:1810
#10 0x00000000004dbecf in lazy_scan_heap (aggressive=true, nindexes=12, Irel=0x10f3648, vacrelstats=<optimized out>, params=0x10b27c4, onerel=<optimized out>) at vacuumlazy.c:1696
#11 heap_vacuum_rel (onerel=<optimized out>, params=0x10b27c4, bstrategy=<optimized out>) at vacuumlazy.c:518
#12 0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x10b27c4, rel=0x7efca8105c08) at ../../../src/include/access/tableam.h:1463
#13 vacuum_rel (relid=9217732, relation=<optimized out>, params=params@entry=0x10b27c4) at vacuum.c:1893
#14 0x000000000062e1cb in vacuum (relations=0x111f518, params=params@entry=0x10b27c4, bstrategy=<optimized out>, bstrategy@entry=0x10b2550, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#15 0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x10b2550, tab=0x10b27c0) at autovacuum.c:3137
#16 do_autovacuum () at autovacuum.c:2467
#17 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#18 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#19 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#20 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#21 <signal handler called>
#22 0x00007f1394acb987 in select () from /lib64/libc.so.6
#23 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#24 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0xfd0110) at postmaster.c:1488
#25 0x000000000048e7de in main (argc=1, argv=0xfd0110) at main.c:231


[lightdb@hs-10-20-30-217 data]$ pstack 4021998
#0  0x00007f78c40ac987 in select () from /lib64/libc.so.6
#1  0x00000000008f0d8a in pg_usleep (microsec=<optimized out>) at pgsleep.c:56
#2  0x000000000062ece3 in vacuum_delay_point () at vacuum.c:2034
#3  0x00000000004dce20 in lazy_scan_heap (aggressive=true, nindexes=12, Irel=0x2a18588, vacrelstats=<optimized out>, params=0x29d03cc, onerel=<optimized out>) at vacuumlazy.c:984
#4  heap_vacuum_rel (onerel=<optimized out>, params=0x29d03cc, bstrategy=<optimized out>) at vacuumlazy.c:518
#5  0x000000000062cfb8 in table_relation_vacuum (bstrategy=<optimized out>, params=0x29d03cc, rel=0x7f61d764f158) at ../../../src/include/access/tableam.h:1463
#6  vacuum_rel (relid=9217732, relation=<optimized out>, params=params@entry=0x29d03cc) at vacuum.c:1893
#7  0x000000000062e1cb in vacuum (relations=0x2a20378, params=params@entry=0x29d03cc, bstrategy=<optimized out>, bstrategy@entry=0x29d0158, isTopLevel=isTopLevel@entry=true) at vacuum.c:449
#8  0x000000000048acda in autovacuum_do_vac_analyze (bstrategy=0x29d0158, tab=0x29d03c8) at autovacuum.c:3137
#9  do_autovacuum () at autovacuum.c:2467
#10 0x000000000048b2ea in AutoVacWorkerMain (argv=0x0, argc=0) at autovacuum.c:1694
#11 0x0000000000711b0f in StartAutoVacWorker () at autovacuum.c:1488
#12 0x000000000071e6de in StartAutovacuumWorker () at postmaster.c:5993
#13 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5700
#14 <signal handler called>
#15 0x00007f78c40ac987 in select () from /lib64/libc.so.6
#16 0x000000000048ca6a in ServerLoop () at postmaster.c:1818
#17 0x000000000071f6d9 in PostmasterMain (argc=argc@entry=1, argv=argv@entry=0x28f4110) at postmaster.c:1488
#18 0x000000000048e7de in main (argc=1, argv=0x28f4110) at main.c:231

 https://cloud.tencent.com/developer/article/2027383

wal的内部实现解析

  WALbuffer分配在共享内存中,最小64KB,通过XLogCtlInsert控制具体backend的申请,作为XLogCtlData的成员对外提供,无单独访问接口。
45000 53000 会话1
61000 69000 会话2
69000 77000 会话1
通过调试的指针地址从上可见,确实是共享。
  XLogInsert->XLogInsertRecord->CopyXLogRecordToWAL,只负责拷贝到wal buffer中的buffer(根据lsn位置/wal块大小计算而来,类似整除,会提前计算wal lsn_start和lsn_end)(包括过大到下一页),并不写入wal文件。
LOG: INSERT @ 0/2091D708: - Heap/INSERT: off 3 flags 0x08
LOG: INSERT @ 0/2091D778: - Transaction/COMMIT: 2022-10-16 23:44:59.926298+08
  上述仅表示生成到XLogInsertRecord,无写入的意思。通过系统调用跟踪也可以看出,如下:
[{EPOLLIN, {u32=23473288, u64=23473288}}], 1, -1) = 1
recvfrom(11, "Q\0\0\0.insert into t values('wocao"..., 8192, 0, NULL, NULL) = 47
lseek(6, 0, SEEK_END) = 8192
open("base/14199/16388_fsm", O_RDWR) = 7
lseek(7, 0, SEEK_END) = 24576
open("base/14199/16388", O_RDWR) = 8
lseek(8, 0, SEEK_END) = 245760
setitimer(ITIMER_REAL, {it_interval={tv_sec=0, tv_usec=0}, it_value={tv_sec=180, tv_usec=0}}, NULL) = 0
sendto(11, "C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T", 22, 0, NULL, 0) = 22
recvfrom(11, 0xf94780, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5,
------------insert没有产生pwrite64系统调用

[{EPOLLIN, {u32=23473288, u64=23473288}}], 1, -1) = 1
recvfrom(11, "Q\0\0\0\fcommit;\0", 8192, 0, NULL, NULL) = 13
open("lt_wal/000000010000000000000001", O_RDWR) = 9
pwrite64(9, "\6\321\5\0\1\0\0\0\0\0\222 \0\0\0\0%\32\0\0\0\0\0\0yyyyyyyy"..., 8192, 9568256) = 8192
fdatasync(9) = 0
sendto(10, "\2\0\0\0\250\3\0\0w7\0\0\10\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 936, 0, NULL, 0) = 936
sendto(10, "\2\0\0\0\350\1\0\0w7\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 488, 0, NULL, 0) = 488
sendto(10, "\16\0\0\0H\0\0\0\6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0"..., 72, 0, NULL, 0) = 72
sendto(11, "C\0\0\0\vCOMMIT\0Z\0\0\0\5I", 18, 0, NULL, 0) = 18
recvfrom(11, 0xf94780, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5,

  如果wal buffer 64KB,满了的话,那会怎么样呢?理论上需要8个单行8000字节的请求就可以占满,按照实现,是进行evict淘汰不再需要的wal buffer的,但是实际未写到wal文件的话,怎么evict出去呢?后面如果又写会怎么样?这个还需要研究!!!

  当前wal文件句柄保存在xlog.c文件的openLogFile全局变量中
  因为wal也是分8k为单位读写,通过XLogWritePages写入wal文件,每次8K*n字节,那如何保证每次写小量而不重复拷贝呢?比如几十字节?实际是在commit flush(RecordTransactionCommit函数中)的时候,每次写仍然是8KB为单位,具体由startoffset += XLogWritePages(from, npages, startoffset)每次写的from和startoffset参数可以看出。因为文件系统写以块为单位,所以写8KB(如果文件系统8KB块的话,如果512/4KB,则多了8倍或2倍)和80字节本质上的io是一样的。

btree节点分裂的wal日志记录

  在 btree 索引中,页面拆分需要分配一个新页面,并在父 btree 级别插入一个新键,但出于锁定原因,这必须由两个单独的 WAL 记录反映出来。 重放第一条记录,以分配新页面并将元组移动到其中,在页面上设置一个标志,以指示键尚未插入到父级。 重放第二条记录将清除标志。 在正常操作期间,其他后端永远不会看到此中间状态,因为子页面上的锁在两个操作期间会一直持有,但如果操作在写入第二个 WAL 记录之前中断,则会看到这种状态。 搜索算法实现时考虑了这种不一致的异常,如果插入遇到设置了未完全拆分标志的页面,它将通过在继续之前将键插入父级来完成中断的拆分。

In btree indexes, for example,a page split requires a new page to be allocated, and an insertion of a new key in the parent btree level, but for locking reasons this has to be reflected by two separate WAL records. Replaying the first record, to allocate the new page and move tuples to it, sets a flag on the page to indicate that the key has not been inserted to the parent yet. Replaying the second record clears the flag. This intermediate state is never seen by other backends during normal operation, because the lock on the child page is held across the two actions, but will be seen if the operation is interrupted before writing the second WAL record. The search algorithm works with the intermediate state as normal, but if an insertion encounters a page with the incomplete-split flag set, it will finish the interrupted split by inserting the key to the parent, before proceeding.

posted @ 2021-11-07 17:03  zhjh256  阅读(1308)  评论(0编辑  收藏  举报