对PostgreSQL xmax的理解
xmax
The identity (transaction ID) of the deleting transaction, or zero for an undeleted row version. It is possible for this column to be nonzero in a visible row version. That usually indicates that the deleting transaction hasn't committed yet, or that an attempted deletion was rolled back.
http://www.postgresql.org/docs/9.1/static/ddl-system-columns.html
作一个实验:
我数据库关闭前,最后一条transaction的id是 1874。
我开一个终端A,此时终端A的当前transactionId为 1875。
[pgsql@localhost bin]$ ./psql psql (9.1.2) Type "help" for help. pgsql=# begin; BEGIN pgsql=# select xmin,xmax,* from tab01; xmin | xmax | id | cd ------+------+----+---- 1866 | 0 | 3 | 3 1867 | 0 | 4 | 4 1868 | 0 | 5 | 5 1870 | 0 | 6 | 6 1872 | 0 | 7 | 7 1873 | 0 | 8 | 8 1874 | 0 | 9 | 9 (7 rows) pgsql=#
我再开一个终端B,此时,终端B的transactionId为:1876。
[pgsql@localhost bin]$ ./psql psql (9.1.2) Type "help" for help. pgsql=# begin; BEGIN pgsql=# select xmin,xmax,* from tab01; xmin | xmax | id | cd ------+------+----+---- 1866 | 0 | 3 | 3 1867 | 0 | 4 | 4 1868 | 0 | 5 | 5 1870 | 0 | 6 | 6 1872 | 0 | 7 | 7 1873 | 0 | 8 | 8 1874 | 0 | 9 | 9 (7 rows) pgsql=#
回到终端A,执行 delete 操作:
pgsql=# delete from tab01 where id=9; DELETE 1 pgsql=#
此时,在终端A中,已经看不到删除后的数据:
pgsql=# select xmin,xmax,* from tab01; xmin | xmax | id | cd ------+------+----+---- 1866 | 0 | 3 | 3 1867 | 0 | 4 | 4 1868 | 0 | 5 | 5 1870 | 0 | 6 | 6 1872 | 0 | 7 | 7 1873 | 0 | 8 | 8 (6 rows) pgsql=#
此时,由于终端A尚未提交,所以,可以在终端B中看到如下的情形:
pgsql=# select xmin,xmax,* from tab01; xmin | xmax | id | cd ------+------+----+---- 1866 | 0 | 3 | 3 1867 | 0 | 4 | 4 1868 | 0 | 5 | 5 1870 | 0 | 6 | 6 1872 | 0 | 7 | 7 1873 | 0 | 8 | 8 1874 | 1875 | 9 | 9 (7 rows) pgsql=#
也就是说,id为9的那条记录,其xmax为1875,表明其为 transactionid为 1875的事务所删除。
回到终端A,进行提交:
pgsql=# commit; COMMIT pgsql=# select xmin,xmax,* from tab01; xmin | xmax | id | cd ------+------+----+---- 1866 | 0 | 3 | 3 1867 | 0 | 4 | 4 1868 | 0 | 5 | 5 1870 | 0 | 6 | 6 1872 | 0 | 7 | 7 1873 | 0 | 8 | 8 (6 rows) pgsql=#
再回到终端B,查看:
pgsql=# select xmin,xmax,* from tab01; xmin | xmax | id | cd ------+------+----+---- 1866 | 0 | 3 | 3 1867 | 0 | 4 | 4 1868 | 0 | 5 | 5 1870 | 0 | 6 | 6 1872 | 0 | 7 | 7 1873 | 0 | 8 | 8 (6 rows) pgsql=#
让我来再进一步,看一看:
重新开两个终端:
终端A和终端B。
在终端A中:
[pgsql@localhost bin]$ ./psql psql (9.1.2) Type "help" for help. pgsql=# select xmin,xmax,cmin,cmax,* from tab01; xmin | xmax | cmin | cmax | id | cd ------+------+------+------+----+---- 1878 | 0 | 0 | 0 | 1 | 1 1879 | 0 | 0 | 0 | 2 | 2 (2 rows) pgsql=# \q
可以看到两条由不同的事务提交所形成的记录。
然后再次使用psql: 经过update 后,自己所看到的是 xmin的变化,这时尚未提交,别的终端看到就不一样了。
终端A:
[pgsql@localhost bin]$ ./psql psql (9.1.2) Type "help" for help. pgsql=# begin; BEGIN pgsql=# update tab01 set id=3 where cd='2'; UPDATE 1 pgsql=# select xmin,xmax,cmin,cmax,* from tab01; xmin | xmax | cmin | cmax | id | cd ------+------+------+------+----+---- 1878 | 0 | 0 | 0 | 1 | 1 1880 | 0 | 0 | 0 | 3 | 2 (2 rows) pgsql=#
此时,进入终端B:
[pgsql@localhost bin]$ ./psql psql (9.1.2) Type "help" for help. pgsql=# select xmin,xmax,cmin,cmax,* from tab01; xmin | xmax | cmin | cmax | id | cd ------+------+------+------+----+---- 1878 | 0 | 0 | 0 | 1 | 1 1879 | 1880 | 0 | 0 | 2 | 2 (2 rows) pgsql=#
我推测,update的时候,自身事务所看到的是内存中的影像。同时它也已经提交到了物理文件上。而别的事务是从物理文件来读取的。
在A终端提交以后:
pgsql=# commit;
COMMIT
pgsql=#
B终端所看到的:
pgsql=# select xmin,xmax,cmin,cmax,* from tab01; xmin | xmax | cmin | cmax | id | cd ------+------+------+------+----+---- 1878 | 0 | 0 | 0 | 1 | 1 1880 | 0 | 0 | 0 | 3 | 2 (2 rows) pgsql=#
继续从代码上进行分析:
/* * heap_update - replace a tuple * * NB: do not call this directly unless you are prepared to deal with * concurrent-update conditions. Use simple_heap_update instead. * * relation - table to be modified (caller must hold suitable lock) * otid - TID of old tuple to be replaced * newtup - newly constructed tuple data to store * ctid - output parameter, used only for failure case (see below) * update_xmax - output parameter, used only for failure case (see below) * cid - update command ID (used for visibility test, and stored into * cmax/cmin if successful) * crosscheck - if not InvalidSnapshot, also check old tuple against this * wait - true if should wait for any conflicting update to commit/abort * * Normal, successful return value is HeapTupleMayBeUpdated, which * actually means we *did* update it. Failure return codes are * HeapTupleSelfUpdated, HeapTupleUpdated, or HeapTupleBeingUpdated * (the last only possible if wait == false). * * On success, the header fields of *newtup are updated to match the new * stored tuple; in particular, newtup->t_self is set to the TID where the * new tuple was inserted, and its HEAP_ONLY_TUPLE flag is set iff a HOT * update was done. However, any TOAST changes in the new tuple's * data are not reflected into *newtup. * * In the failure cases, the routine returns the tuple's t_ctid and t_xmax. * If t_ctid is the same as otid, the tuple was deleted; if different, the * tuple was updated, and t_ctid is the location of the replacement tuple. * (t_xmax is needed to verify that the replacement tuple matches.) */ HTSU_Result heap_update(Relation relation, ItemPointer otid, HeapTuple newtup, ItemPointer ctid, TransactionId *update_xmax, CommandId cid, Snapshot crosscheck, bool wait) { HTSU_Result result; TransactionId xid = GetCurrentTransactionId(); Bitmapset *hot_attrs; ItemId lp; HeapTupleData oldtup; HeapTuple heaptup; Page page; Buffer buffer, newbuf; bool need_toast, already_marked; Size newtupsize, pagefree; bool have_tuple_lock = false; bool iscombo; bool use_hot_update = false; bool all_visible_cleared = false; bool all_visible_cleared_new = false; ... ////////////////First Phase marked by gaojian newtup->t_data->t_infomask &= ~(HEAP_XACT_MASK); newtup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK); newtup->t_data->t_infomask |= (HEAP_XMAX_INVALID | HEAP_UPDATED); HeapTupleHeaderSetXmin(newtup->t_data, xid); HeapTupleHeaderSetCmin(newtup->t_data, cid); HeapTupleHeaderSetXmax(newtup->t_data, 0); /* for cleanliness */ newtup->t_tableOid = RelationGetRelid(relation); ...
if (!already_marked) { /* Clear obsolete visibility flags ... */ oldtup.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID | HEAP_XMAX_IS_MULTI | HEAP_IS_LOCKED | HEAP_MOVED); /* ... and store info about transaction updating this tuple */ ///HeapTupleHeaderSetXmax(oldtup.t_data, xid); /////>>>>>added by gaojian for testing. ////xid = (TransactionId)8888; fprintf(stderr,"x-----1,xid is :%d \n",(int)xid); HeapTupleHeaderSetXmax(oldtup.t_data, xid); HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo); sleep(60); } ... return HeapTupleMayBeUpdated; }
可以看到,第一段的 :
newtup->t_data->t_infomask &= ~(HEAP_XACT_MASK); newtup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK); newtup->t_data->t_infomask |= (HEAP_XMAX_INVALID | HEAP_UPDATED); HeapTupleHeaderSetXmin(newtup->t_data, xid); HeapTupleHeaderSetCmin(newtup->t_data, cid); HeapTupleHeaderSetXmax(newtup->t_data, 0); /* for cleanliness */ newtup->t_tableOid = RelationGetRelid(relation);
写的很明白, HeapTupleHeaderSetXmin(newtup->t_data, xid);
而第二段的 :
if (!already_marked) { /* Clear obsolete visibility flags ... */ oldtup.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID | HEAP_XMAX_IS_MULTI | HEAP_IS_LOCKED | HEAP_MOVED); /* ... and store info about transaction updating this tuple */ HeapTupleHeaderSetXmax(oldtup.t_data, xid); HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo); }
然后,我再来验证一下,加点调试代码,这样,我执行sql时会出错:
if (!already_marked) { /* Clear obsolete visibility flags ... */ oldtup.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID | HEAP_XMAX_IS_MULTI | HEAP_IS_LOCKED | HEAP_MOVED); /* ... and store info about transaction updating this tuple */ ///HeapTupleHeaderSetXmax(oldtup.t_data, xid); /////>>>>>added by gaojian for testing. ////xid = (TransactionId)8888; fprintf(stderr,"x-----1,xid is :%d \n",(int)xid); HeapTupleHeaderSetXmax(oldtup.t_data, xid); HeapTupleHeaderSetCmax(oldtup.t_data, cid, iscombo); }
执行结果:
pgsql=# begin; BEGIN pgsql=# update tab01 set id=2 where cd = '2'; ERROR: could not access status of transaction 8888 DETAIL: Could not read from file "pg_subtrans/0000" at offset 32768: Success. pgsql=# \q [pgsql@loca
再次登陆进来看看,哈,这证明我所看到的代码部分,就是实现写入xmax的部分。
总结来说,就是,在update(可能delete也是),用向oldtup写入transaction id的方式,来记录哪个transaction改动了记录。
[pgsql@localhost bin]$ ./psql psql (9.1.2) Type "help" for help. pgsql=# select xmin,xmax,cmin,cmax, * from tab01; xmin | xmax | cmin | cmax | id | cd ------+------+------+------+-----------+---- 1878 | 0 | 0 | 0 | 1 | 1 1884 | 8888 | 0 | 0 | 999888777 | 2 1885 | 0 | 0 | 0 | 2 | 2 (3 rows) pgsql=#