2PC between the binlog and the storage engine in MySQL
2PC inside MySQL
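Once the binlog is enabled, MySQL effectively keeps two copies of its data: one in the binlog and one in the storage engine. (For now, treat the data in the storage engine as a single copy; the binlog itself can also be viewed as an engine.) With two copies, keeping them consistent inevitably becomes a concern, and MySQL uses the classic two-phase commit (2PC) protocol inside the server to do so.
2PC requires a coordinator. In the 2PC implementation between the binlog and the engine, the binlog plays that role.
The function that commits a MySQL transaction is ha_commit_trans: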
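To make the protocol concrete, here is a minimal sketch of a generic 2PC coordinator. It is not MySQL code: `Participant`, `two_phase_commit`, and the `vote_yes` flag are hypothetical names introduced only for illustration, and the sketch ignores logging and crash recovery.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical participant: it votes during prepare, then commits or
// rolls back once the coordinator has decided.
struct Participant {
    std::string name;
    bool vote_yes;        // the vote this participant will cast
    std::string state;

    explicit Participant(const std::string &n, bool vote = true)
        : name(n), vote_yes(vote), state("active") {}

    bool prepare() {
        if (!vote_yes) return false;
        state = "prepared";
        return true;
    }
    void commit() {
        assert(state == "prepared");  // commit is only legal after prepare
        state = "committed";
    }
    void rollback() { state = "rolled_back"; }
};

// Phase 1: ask every participant to prepare.
// Phase 2: commit everywhere only if all voted yes; otherwise roll back all.
// Returns true when the transaction committed.
bool two_phase_commit(std::vector<Participant> &participants) {
    for (Participant &p : participants) {
        if (!p.prepare()) {
            for (Participant &q : participants) q.rollback();
            return false;
        }
    }
    for (Participant &p : participants) p.commit();
    return true;
}
```

With two participants that both vote yes, `two_phase_commit` returns true and both end in the `committed` state; a single "no" vote leaves every participant `rolled_back`.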
//sql/handler.cc
int ha_commit_trans(THD *thd, bool all, bool ignore_global_read_lock)
{
  ······
}
This function is called when a transaction commits, and it implements the 2PC commit internally. 2PC has two phases, prepare and commit, and the function body contains calls named after both: tc_log->prepare(thd, all) and tc_log->commit:
int ha_commit_trans(THD *thd, bool all, bool ignore_global_read_lock)
{
  ······
  if (!trn_ctx->no_2pc(trx_scope) && (trn_ctx->rw_ha_count(trx_scope) > 1))
    error= tc_log->prepare(thd, all);
  ······
  if (error || (error= tc_log->commit(thd, all)))
  {
    ha_rollback_trans(thd, all);
    error= 1;
    goto end;
  }
  ······
}
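The control flow of the excerpt above can be isolated in a small sketch. It is a simplification under stated assumptions: `FakeCoordinator` and `commit_trans_sketch` are hypothetical names, the `no_2pc` and `rw_ha_count` parameters stand in for the `trn_ctx` calls, and the `rolled_back` flag stands in for the call to ha_rollback_trans. The elided parts of the real function are not modeled.

```cpp
// Hypothetical stand-in for the transaction coordinator behind tc_log.
struct FakeCoordinator {
    int prepare_result = 0;   // 0 means success, as in the excerpt
    int commit_result = 0;
    int prepare_calls = 0;
    int commit_calls = 0;

    int prepare() { ++prepare_calls; return prepare_result; }
    int commit()  { ++commit_calls;  return commit_result; }
};

// Mirrors only the branching shown in the excerpt:
// 1. the prepare phase runs only when 2PC is in use and more than one
//    read-write participant takes part;
// 2. commit is skipped if prepare failed; on any error the transaction
//    is rolled back and 1 is returned.
int commit_trans_sketch(FakeCoordinator &tc, bool no_2pc, int rw_ha_count,
                        bool &rolled_back) {
    int error = 0;
    rolled_back = false;
    if (!no_2pc && rw_ha_count > 1)
        error = tc.prepare();
    if (error || (error = tc.commit())) {
        rolled_back = true;   // stands in for ha_rollback_trans(thd, all)
        return 1;
    }
    return 0;
}
```

Note how the short-circuit in `error || (error = tc.commit())` guarantees that commit is never attempted after a failed prepare.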
tc_log is a global pointer:
//sql/tc_log.cc
TC_LOG *tc_log;
Its type, TC_LOG, is defined as follows:
/**
  Transaction Coordinator Log.
  A base abstract class for three different implementations of the
  transaction coordinator.
  The server uses the transaction coordinator to order transactions
  correctly and there are three different implementations: one using
  an in-memory structure, one dummy that does not do anything, and one
  using the binary log for transaction coordination.
*/
class TC_LOG
{
public:
  /**
    Perform heuristic recovery, if --tc-heuristic-recover was used.
    @note no matter whether heuristic recovery was successful or not
    mysqld must exit. So, return value is the same in both cases.
    @retval false no heuristic recovery was requested
    @retval true heuristic recovery was performed
  */
  bool using_heuristic_recover();
  TC_LOG() {}
  virtual ~TC_LOG() {}
  enum enum_result {
    RESULT_SUCCESS,
    RESULT_ABORTED,
    RESULT_INCONSISTENT
  };
  /**
    Initialize and open the coordinator log.
    Do recovery if necessary. Called during server startup.
    @param opt_name Name of logfile.
    @retval 0 success
    @retval 1 failed
  */
  virtual int open(const char *opt_name)=0;
  /**
    Close the transaction coordinator log and free any resources.
    Called during server shutdown.
  */
  virtual void close()=0;
  /**
    Log a commit record of the transaction to the transaction
    coordinator log.
    When the function returns, the transaction commit is properly
    logged to the transaction coordinator log and can be committed in
    the storage engines.
    @param thd Session to log transaction for.
    @param all @c True if this is a "real" commit, @c false if it is a "statement" commit.
    @return Error code on failure, zero on success.
  */
  virtual enum_result commit(THD *thd, bool all) = 0;
  /**
    Log a rollback record of the transaction to the transaction
    coordinator log.
    When the function returns, the transaction has been aborted in
    the transaction coordinator log.
    @param thd Session to log transaction record for.
    @param all @c true if an explicit commit or an implicit commit
    for a statement, @c false if an internal commit of the statement.
    @return Error code on failure, zero on success.
  */
  virtual int rollback(THD *thd, bool all) = 0;
  /**
    Log a prepare record of the transaction to the storage engines.
    @param thd Session to log transaction record for.
    @param all @c true if an explicit commit or an implicit commit
    for a statement, @c false if an internal commit of the statement.
    @return Error code on failure, zero on success.
  */
  virtual int prepare(THD *thd, bool all) = 0;
};
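The source and its comments show that TC_LOG is an abstract base class: as the transaction coordinator, it declares the prepare, commit, and rollback interfaces as pure virtual functions. Because the class is abstract, the tc_log pointer used in ha_commit_trans must point to an object of a class derived from TC_LOG. Searching for where the pointer is initialized shows that it is assigned in init_server_components: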
//sql/mysqld.cc
static int init_server_components()
{
  ······
  if (total_ha_2pc > 1 || (1 == total_ha_2pc && opt_bin_log))
  {
    if (opt_bin_log)
      tc_log= &mysql_bin_log;
    else
      tc_log= &tc_log_mmap;
  }
  ······
}
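The selection logic in this excerpt, together with the virtual dispatch through the base-class pointer, can be sketched as follows. The class names `Coordinator`, `BinlogCoordinator`, and `MmapCoordinator` and the function `choose_coordinator` are hypothetical miniatures, not MySQL types. The sketch returns nullptr when the condition is false, because the excerpt does not show what tc_log holds in that case.

```cpp
#include <string>

// Hypothetical miniature of the TC_LOG hierarchy: one abstract interface
// and two concrete coordinators.
struct Coordinator {
    virtual ~Coordinator() {}
    virtual std::string name() const = 0;
};
struct BinlogCoordinator : Coordinator {
    std::string name() const override { return "binlog"; }
};
struct MmapCoordinator : Coordinator {
    std::string name() const override { return "mmap"; }
};

// Mirrors the condition in the excerpt: a coordinator is chosen when more
// than one engine supports 2PC, or when exactly one does and the binlog
// is enabled. The binlog wins whenever it is enabled.
Coordinator *choose_coordinator(int total_ha_2pc, bool opt_bin_log,
                                BinlogCoordinator &binlog,
                                MmapCoordinator &mmap) {
    if (total_ha_2pc > 1 || (total_ha_2pc == 1 && opt_bin_log)) {
        if (opt_bin_log)
            return &binlog;
        return &mmap;
    }
    return nullptr;  // assumption: the excerpt leaves tc_log untouched here
}
```

Callers only ever see a `Coordinator *`, so a call such as `choose_coordinator(1, true, b, m)->name()` resolves at run time to the binlog implementation, in the same way that tc_log->prepare resolves to a derived class.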
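The code shows that tc_log is set when more than one storage engine supports 2PC, or when exactly one does and the binlog is enabled. If the binlog is enabled, mysql_bin_log becomes the transaction coordinator; otherwise tc_log_mmap is used. With the binlog enabled, the calls in ha_commit_trans therefore resolve to the prepare and commit implemented by MYSQL_BIN_LOG. Here is MYSQL_BIN_LOG::prepare: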
int MYSQL_BIN_LOG::prepare(THD *thd, bool all)
{
  DBUG_ENTER("MYSQL_BIN_LOG::prepare");
  DBUG_ASSERT(opt_bin_log);
  /*
    The applier thread explicitly overrides the value of sql_log_bin
    with the value of log_slave_updates.
  */
  DBUG_ASSERT(thd->slave_thread ?
              opt_log_slave_updates : thd->variables.sql_log_bin);
  /*
    Set HA_IGNORE_DURABILITY to not flush the prepared record of the
    transaction to the log of storage engine (for example, InnoDB
    redo log) during the prepare phase. So that we can flush prepared
    records of transactions to the log of storage engine in a group
    right before flushing them to binary log during binlog group
    commit flush stage. Reset to HA_REGULAR_DURABILITY at the
    beginning of parsing next command.
  */
  thd->durability_property= HA_IGNORE_DURABILITY;
  int error= ha_prepare_low(thd, all);
  DBUG_RETURN(error);
}
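This function does little more than call ha_prepare_low, which puts the storage engines into the prepared state. Before the call it sets HA_IGNORE_DURABILITY so that, as the comment explains, the engine's prepare records are flushed as a group during the flush stage of binlog group commit rather than one by one here. The binlog itself only needs to be written to its file, so it can be regarded as already prepared. Once the transaction is prepared, MYSQL_BIN_LOG::commit is called to commit it: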
TC_LOG::enum_result MYSQL_BIN_LOG::commit(THD *thd, bool all)
{
  ······
  if (ordered_commit(thd, all, skip_commit))
    DBUG_RETURN(RESULT_INCONSISTENT);
  ······
}
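This function is long, so everything except the call that performs the commit is omitted here. The actual work happens in ordered_commit, which is the entry point of MySQL's group commit. Without going into detail for now: ordered_commit writes the binlog and then calls ha_commit_low to commit in the storage engine.
The main call stack is summarized below: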
ha_commit_trans
|
|_ _ _ _ MYSQL_BIN_LOG::prepare
|        |
|        |_ _ _ _ ha_prepare_low
|
|_ _ _ _ MYSQL_BIN_LOG::commit
         |
         |_ _ _ _ MYSQL_BIN_LOG::ordered_commit
                  |
                  |_ _ _ _ ha_commit_low
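As a rough illustration of the order in which these calls take effect, the toy model below records each step in a trace. Every function name ending in `_sketch` and the `Trace` type are hypothetical. Group commit, durability handling, and error paths inside ordered_commit are deliberately left out.

```cpp
#include <string>
#include <vector>

// Toy model: records the order in which the steps run.
struct Trace {
    std::vector<std::string> steps;
};

// Engine side: enter the prepared state, then later commit.
int ha_prepare_low_sketch(Trace &t) {
    t.steps.push_back("engine prepare");
    return 0;
}
int ha_commit_low_sketch(Trace &t) {
    t.steps.push_back("engine commit");
    return 0;
}

// Coordinator side: prepare only forwards to the engine; commit writes
// the binlog first and commits in the engine afterwards.
int binlog_prepare_sketch(Trace &t) {
    return ha_prepare_low_sketch(t);
}
int ordered_commit_sketch(Trace &t) {
    t.steps.push_back("binlog write");
    return ha_commit_low_sketch(t);
}
int binlog_commit_sketch(Trace &t) {
    return ordered_commit_sketch(t);
}

// Top level: prepare, then commit, stopping at the first error.
int ha_commit_trans_sketch(Trace &t) {
    int error = binlog_prepare_sketch(t);
    if (error)
        return error;
    return binlog_commit_sketch(t);
}
```

Running `ha_commit_trans_sketch` on an empty `Trace` leaves three steps in it: engine prepare, binlog write, engine commit. The binlog write sitting between the two engine steps is what lets the binlog act as the coordinator's decision record.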