CMU_15445_P4_Part2

MVCC Executors

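In this task we implement the plan executors under MVCC. Because MVCC stores several physical versions of each database object, every executor that directly reads, inserts, updates, or deletes tuples has to account for multi-version storage, that is, for building and traversing the undo log version chain described earlier.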

Insert Executor

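The Insert Executor inserts a new tuple into the TableHeap. The figure below shows the insertion process. Since inserting a tuple does not involve building a version chain, the flow is fairly simple. Two things need to be done:

1. Set the ts_ of the newly inserted tuple in the TableHeap to the transaction ID of the inserting transaction.
2. Add the inserted tuple (its RID) to the current transaction's write set.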
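
The two steps above can be sketched with simplified stand-in types. MockTableHeap, MockTxn, and MvccInsert are hypothetical names for illustration, not the BusTub API, and the value of TXN_START_ID is an assumption about how temporary transaction IDs are distinguished from commit timestamps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the BusTub types; only the fields used here are modeled.
using txn_id_t = int64_t;
constexpr txn_id_t TXN_START_ID = 1LL << 62;  // assumption: temporary txn ids start here

struct TupleMeta {
  int64_t ts_;
  bool is_deleted_;
};

struct MockTableHeap {
  std::vector<std::pair<TupleMeta, std::string>> rows_;

  // Appends a row and returns its slot number, used here as a simplified RID.
  auto InsertTuple(const TupleMeta &meta, const std::string &data) -> size_t {
    rows_.emplace_back(meta, data);
    return rows_.size() - 1;
  }
};

struct MockTxn {
  txn_id_t id_;
  std::unordered_set<size_t> write_set_;
};

// Insert under MVCC: stamp the new tuple with the inserting txn's id (step 1),
// then record its RID in the txn's write set (step 2). No undo log is created.
auto MvccInsert(MockTableHeap &heap, MockTxn &txn, const std::string &data) -> size_t {
  const size_t rid = heap.InsertTuple(TupleMeta{txn.id_, false}, data);
  txn.write_set_.insert(rid);
  return rid;
}
```

Until the transaction commits, the tuple's ts_ holds the transaction ID, which is what lets other transactions recognize it as an uncommitted write.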

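Figure: the insert process

The figure also includes a commit step. Following the course description step by step makes it straightforward to implement, so I will not go into detail here.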

Update and Delete Executor

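The Insert Executor only modifies the TableHeap. It neither creates nor modifies a version chain, because a newly created tuple has none. The Update and Delete Executors, however, modify existing tuples, which means inserting into or modifying the undo log version chain, so they are more involved.
My plan here is to write out the required steps first, then make the code changes, and finally come back and revise this post. In this part we need to implement the logic that generates UndoLogs and to update the base tuple data in the TableHeap.

Some steps are shared by the update and delete executors and can be implemented in execution_common.cpp.
Put simply, the update executor overwrites the latest data in the TableHeap, while the delete executor only sets the is_deleted_ flag on the tuple in the TableHeap and does not touch the data itself.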

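Write-Write Conflict Check

The Delete Executor and the Update Executor both modify a tuple in the TableHeap.
Because a tuple may be modified by different transactions at different timestamps, we need to check for write-write conflicts. In BusTub there are two cases:

1. A tuple has been modified by an uncommitted transaction. No other transaction may modify it, and any other transaction that tries runs into a conflict.
2. A tuple has been updated or deleted by a transaction Txn_A that has already committed. If a transaction Txn_B whose read_ts_ is smaller than Txn_A's commit_ts_ then tries to delete or update the same tuple, that is also a write-write conflict.

The following example illustrates both cases.

Figure: write-write conflict check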

case1

In row 1 of the figure above, txn10 has deleted tuple (A,2) but has not committed yet. txn9, whose read_ts_ is 3, can therefore still read (A,2).

case2

If txn9 now tries to update or delete tuple (A,2), a write-write conflict occurs and txn9 is aborted.

case3

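In the third row, the tuple was (C,2) at timestamp 2. A transaction then updated it to (C,4) and committed with commit_ts_ 4. txn9 can still read the old version of the tuple and get (C,2), but it cannot update or delete it. If txn9 tries, it is aborted, because its read_ts_ of 3 is smaller than the tuple's latest committed timestamp.

When a conflict is detected, the conflicting transaction's state should be set to TAINTED and an ExecutionException should be thrown; ExecuteSqlTxn then returns false. Abort is not implemented at this point, so there is no need to call it. Not aborting here also means the undo log version chain that has already been built does not have to be modified.

The conflicts we need to consider are simple ones, and execution in BusTub is single-threaded at this stage. A write-write conflict can only be detected from the timestamp of the base tuple in the TableHeap.
That keeps the implementation simple. My conflict check looks like this: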

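auto IsWriteWriteConflict(const TupleMeta &current_tuple_meta, const ExecutorContext *exec_ctx) -> bool {
  /** Case 1: ts_ holds a txn id, i.e. an uncommitted write, and it belongs to another transaction */
  if (current_tuple_meta.ts_ >= TXN_START_ID &&
      current_tuple_meta.ts_ != exec_ctx->GetTransaction()->GetTransactionId()) {
    return true;
  }
  /** Case 2: ts_ holds a commit timestamp that is newer than this transaction's read timestamp */
  if (current_tuple_meta.ts_ < TXN_START_ID && current_tuple_meta.ts_ > exec_ctx->GetTransaction()->GetReadTs()) {
    return true;
  }
  return false;
}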
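
To see how this check behaves on the three cases from the figure, here is a minimal self-contained variant that takes plain timestamps instead of BusTub objects. HasWriteWriteConflict is a hypothetical helper name, and the value of TXN_START_ID is an assumption; only the comparison logic is taken from the function above.

```cpp
#include <cassert>
#include <cstdint>

using timestamp_t = int64_t;
constexpr timestamp_t TXN_START_ID = 1LL << 62;  // assumption: txn ids live above this value

// tuple_ts: ts_ of the base tuple in the table heap
// txn_id:   id of the transaction that wants to write
// read_ts:  read timestamp of that transaction
auto HasWriteWriteConflict(timestamp_t tuple_ts, timestamp_t txn_id, timestamp_t read_ts) -> bool {
  if (tuple_ts >= TXN_START_ID) {
    // Uncommitted write: a conflict unless the writer is this very transaction.
    return tuple_ts != txn_id;
  }
  // Committed write: a conflict if it committed after our snapshot was taken.
  return tuple_ts > read_ts;
}
```

With txn9 at read_ts_ 3, a base tuple stamped with txn10's ID (cases 1 and 2) and a base tuple with commit timestamp 4 (case 3) both report a conflict, while a tuple committed at timestamp 3 or stamped with txn9's own ID does not.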

The Update Executor and Delete Executor should run the conflict check first, as follows:

/** write-write conflict checking */
if (IsWriteWriteConflict(old_tuple_meta, exec_ctx_)) {
    exec_ctx_->GetTransaction()->SetTainted();
    throw ExecutionException(
        "WRITE-WRITE CONFLICT IN UPDATE-EXECUTOR, \nThere is a write-write conflict because this tuple has been "
        "updated by another transaction!!!");
}

Update Executor

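The update executor should be implemented as a pipeline breaker. Before it changes anything in the TableHeap, it first reads all tuples from its child executor into a temporary buffer. Only after everything has been read does it take tuples from that buffer and update the corresponding tuples in the TableHeap. This keeps the update atomic with respect to its own input: the set of tuples to update is fixed before any write happens, so writes made while the executor is running cannot leak back into what it reads and leave the data inconsistent.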
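
A minimal sketch of the buffer-then-write pattern follows, using a plain vector as the table and a callback as the child executor. ChildNext and BufferedUpdate are hypothetical names chosen for this illustration; the real executor works on RIDs and tuples rather than integers.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical child executor: yields one row id per call, std::nullopt when exhausted.
using ChildNext = std::function<std::optional<size_t>()>;

// Phase 1 drains the child into a local buffer without touching the table.
// Phase 2 applies the update to every buffered row.
// Returns the number of rows updated.
auto BufferedUpdate(const ChildNext &child_next, std::vector<int> &table,
                    const std::function<int(int)> &update) -> size_t {
  std::vector<size_t> buffer;
  while (auto rid = child_next()) {
    buffer.push_back(*rid);
  }
  for (size_t rid : buffer) {
    table[rid] = update(table[rid]);
  }
  return buffer.size();
}
```

Because the child is fully drained before the first write, each row is updated exactly once even when the child scans the same table that is being modified.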

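Creating UndoLogs in the Executors

A Delete Executor really just flags a tuple in the TableHeap as deleted, so it and the Update Executor both modify the latest version of a tuple in the TableHeap. The two therefore have a lot in common, for example the flow for creating and modifying undo logs.

Basic flow for creating an UndoLog

In the update and delete executors, the basic flow for creating an UndoLog is:

1. Create an UndoLog for this modification. For a delete, save the complete tuple as it was before the delete; for an update, record only the columns of the original tuple that changed. Then append the undo log to the current transaction's undo log array. If the current transaction has already modified this tuple and an undo log for it exists, update that undo log instead.
2. Update the version chain by inserting the new undo log. This only requires pointing the version link at the new undo log.
3. Update the tuple and tuple meta stored in the TableHeap.
4. Add the tuple modified by this transaction to the transaction's write_set_.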
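
Step 1 distinguishes between a partial undo log for updates and a full one for deletes. The sketch below shows that difference on a simplified row type. SimpleUndoLog, MakeUpdateUndoLog, and MakeDeleteUndoLog are hypothetical names, and rows are modeled as vectors of strings rather than BusTub tuples with a schema.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using Row = std::vector<std::string>;  // simplified tuple: one string per column

// Hypothetical, reduced undo log: which columns changed and what they held before.
struct SimpleUndoLog {
  std::vector<bool> modified_fields_;
  Row old_values_;  // old values of the modified columns only, in column order
};

// Update: record only the columns whose value changes.
// Returns std::nullopt when the update changes nothing, so no undo log is needed.
auto MakeUpdateUndoLog(const Row &old_row, const Row &new_row) -> std::optional<SimpleUndoLog> {
  assert(old_row.size() == new_row.size());
  SimpleUndoLog log;
  log.modified_fields_.assign(old_row.size(), false);
  for (size_t i = 0; i < old_row.size(); ++i) {
    if (old_row[i] != new_row[i]) {
      log.modified_fields_[i] = true;
      log.old_values_.push_back(old_row[i]);
    }
  }
  if (log.old_values_.empty()) {
    return std::nullopt;
  }
  return log;
}

// Delete: keep the complete row as it was before the delete.
auto MakeDeleteUndoLog(const Row &old_row) -> SimpleUndoLog {
  SimpleUndoLog log;
  log.modified_fields_.assign(old_row.size(), true);
  log.old_values_ = old_row;
  return log;
}
```

Updating (A,3) to (A,4) yields a log that marks only the second column and stores "3", whereas deleting (A,3) stores both columns.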

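The figures below show how the Update Executor and the Delete Executor each create an undo log.

Figure: creating an UndoLog in the Update Executor

Figure: creating an UndoLog in the Delete Executor

In my implementation, the Update Executor creates an UndoLog as follows: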

/** Create a new undo log and insert it into the version chain as the latest undo log */
auto new_undolog = GenerateDiffLog(old_tuple, new_tuple, &table_info_->schema_);
if (new_undolog.has_value()) {
    /** the undo log carries the timestamp of the version it restores */
    new_undolog->ts_ = old_tuple_meta.ts_;
    /** link the new undo log to the current head of the version chain */
    auto prev_undo_link = exec_ctx_->GetTransactionManager()->GetUndoLink(*rid);
    new_undolog->prev_version_ = prev_undo_link.has_value() ? prev_undo_link.value() : UndoLink{INVALID_TXN_ID};
    /** append the undo log to this txn's undo log array */
    exec_ctx_->GetTransaction()->AppendUndoLog(new_undolog.value());
    /** index of the undo log that was just appended */
    auto new_undolog_index = exec_ctx_->GetTransaction()->GetUndoLogNum() - 1;
    UndoLink latest_undolink =
        UndoLink{exec_ctx_->GetTransaction()->GetTransactionId(), static_cast<int>(new_undolog_index)};
    /** point the version link of this RID at the new undo log */
    exec_ctx_->GetTransactionManager()->UpdateUndoLink(*rid, latest_undolink, nullptr);
    /** update the tuple in the TableHeap and stamp it with this txn's id */
    auto new_tuple_meta = old_tuple_meta;
    new_tuple_meta.ts_ = exec_ctx_->GetTransaction()->GetTransactionId();
    auto update_in_place_result = table_info_->table_->UpdateTupleInPlace(new_tuple_meta, new_tuple, *rid);
    if (!update_in_place_result) {
        LOG_DEBUG("UpdateTupleInPlace Failed, This shouldn't happen");
        return false;
    }
    /** record the RID modified by this transaction */
    exec_ctx_->GetTransaction()->AppendWriteSet(this->plan_->GetTableOid(), *rid);
}

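A Transaction Modifying the Same Tuple Several Times

One more thing sets the update and delete executors apart: the same transaction may update or delete the same tuple several times.
Before the write-write conflict check, the executor needs to check whether the tuple was last modified by the transaction itself. If so, this does not count as a write-write conflict. If the tuple was newly inserted by this transaction and has no undo log version chain, nothing more needs to be done. Otherwise, since a transaction holds at most one undo log per tuple, the existing undo log has to be updated. In other words, because a transaction can modify the same tuple repeatedly, the undo log it stores for that tuple may change over time.

We use the example in the figure above to illustrate. txn9 updates the tuple from (A,3) to (A,4), (A,5), (B,5), and (A,5) in turn, and finally deletes it. Throughout, txn9 maintains a single undo log that changes as the tuple is modified. Note that on the step from (B,5) to (A,5) the undo log still stores (A,3) rather than shrinking back to (_,3): the set of columns an undo log records only grows, never shrinks, which keeps the handling simpler.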
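
The "only grows, never shrinks" rule can be sketched as a merge function. This is a simplified model under stated assumptions: PartialRow and MergeUndoLog are hypothetical names, and an undo log is represented as one optional old value per column, where an empty slot means the column has not been recorded yet.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using Row = std::vector<std::string>;
// One slot per column; a value means "this column's original value is recorded".
using PartialRow = std::vector<std::optional<std::string>>;

// Merges one more update by the same transaction into its existing undo log.
// Columns already recorded keep their original value. A column modified for the
// first time gets its value from `current`, which still holds the original value
// for that column because this transaction has not changed it before.
auto MergeUndoLog(PartialRow log, const Row &current, const Row &next) -> PartialRow {
  assert(log.size() == current.size() && current.size() == next.size());
  for (size_t i = 0; i < log.size(); ++i) {
    if (!log[i].has_value() && current[i] != next[i]) {
      log[i] = current[i];
    }
  }
  return log;
}
```

Replaying the sequence from the example, the log goes from (_,3) after the first update to (A,3) once the first column changes, and stays (A,3) when the first column is changed back to A.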

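There is also a special case: what happens if txn9 inserts a tuple, modifies it many times, and finally deletes it?
Because the last operation is a delete, the tuple's is_deleted_ flag is certainly true. The timestamp in its tuple_meta is then set to 0, meaning the tuple never existed. If the tuple does have a version chain, however, the timestamp must not be set to 0; it is set to the transaction's commit_ts_ instead, because with 0 the earlier versions would become invisible. But how could a newly created tuple have a version chain? That situation arises in the later Bonus Task 2.

Since there are no indexes yet and every tuple is identified only by its RID, a delete marker can appear only on the base tuple in the TableHeap, never inside an undo log, because a tuple is never re-created at the same location after it is deleted. As a result, many cases can be skipped when generating undo logs, though I have not worked out all of them. They come up in later tasks once indexes and concurrency are introduced.

To summarize, the overall flow of update and delete is:

1. Get the RID from the child executor, along with the current tuple's TupleMeta, which is needed for the conflict check and for setting the UndoLog timestamp.
2. Check for a write-write conflict. If the current transaction conflicts, set its state to TAINTED and throw an exception.
3. Build the updated or deleted tuple that will replace the tuple in the TableHeap.
4. Determine whether the tuple has already been modified by the current transaction. If it has, update the contents of the TableHeap, and if an undo log exists, also update that undo log in the version chain.
5. If this is the first time the current transaction modifies the tuple, create the undo log for this modification and update the version chain.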

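A Key Question

If a transaction is updating a tuple that has been modified by another, uncommitted transaction, the current transaction should abort. This is a write-write conflict.
If a transaction is updating a tuple that it has itself modified before, there are two cases:
1. If the tuple is newly created, its version chain should be empty.
2. If the tuple is not newly created, it must have an undo log version chain, and the latest undo log in that chain must be the one the current transaction recorded earlier. Once the current transaction has generated an undo log, no other executor that could generate an undo log is allowed to touch the tuple; a new undo log can only be created after the current transaction commits.
  This point matters in the implementation because it rules out some complicated cases.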

Stop-the-world Garbage Collection

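Multi-version concurrency control keeps old versions of database objects, but running transactions may never access versions from long ago. Versions that can no longer be reached should be cleaned up promptly as garbage. Recall that the transaction manager uses the watermark to track the smallest read_ts_ among all currently running transactions.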
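
As a reminder of what the watermark tracks, here is a minimal sketch that keeps a count of running transactions per read timestamp. SimpleWatermark is a hypothetical class written for this illustration; falling back to the latest commit timestamp when no transaction is running is an assumption, not something stated above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

class SimpleWatermark {
 public:
  explicit SimpleWatermark(int64_t commit_ts) : commit_ts_(commit_ts) {}

  // Called when a transaction begins with the given read timestamp.
  void AddTxn(int64_t read_ts) { ++current_reads_[read_ts]; }

  // Called when a transaction with the given read timestamp commits or aborts.
  void RemoveTxn(int64_t read_ts) {
    auto it = current_reads_.find(read_ts);
    assert(it != current_reads_.end());
    if (--it->second == 0) {
      current_reads_.erase(it);
    }
  }

  void UpdateCommitTs(int64_t commit_ts) { commit_ts_ = commit_ts; }

  // Smallest read timestamp among running transactions.
  // Assumption: with no running transaction, the latest commit timestamp is returned.
  auto GetWatermark() const -> int64_t {
    return current_reads_.empty() ? commit_ts_ : current_reads_.begin()->first;
  }

 private:
  int64_t commit_ts_;
  std::map<int64_t, int> current_reads_;  // read_ts -> number of running txns; ordered by key
};
```

An ordered map keeps the smallest read timestamp at begin(), so GetWatermark does not need to scan all running transactions.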

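The figure above shows which undo logs and transactions are removed during garbage collection. The basic flow is:

1. Walk every tuple of every table in catalog_ together with its version chain. If the base tuple's own ts_ is already less than or equal to the watermark, none of its undo logs is needed. Otherwise, the first undo log whose ts_ is less than or equal to the watermark is the last one that can still be seen, and every older undo log behind it in the chain is marked as deleted. In the figure the watermark is 3: for the first tuple (A,4), the undo log with ts=3 is the last visible one, and all older undo logs behind it can be marked as deleted. For the third tuple (C,4), the undo log with ts=2 is still needed, and the last undo log with ts=1 can be marked as deleted.
2. For each committed transaction, go through all of its undo logs. If every one of them is marked as deleted, the transaction can be removed.
3. What garbage collection actually removes is transactions, not individual undo logs in the version chain, because the undo logs are stored inside the transactions and are only linked into a version chain through UndoLink.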
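
Steps 2 and 3 can be sketched as a pass over the transaction map. This is a simplified model: SimpleTxn and RemoveCollectableTxns are hypothetical names, each undo log is reduced to its timestamp, and a timestamp of INVALID_TS (assumed to be -1) stands for "marked as deleted".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

enum class TxnState { RUNNING, TAINTED, COMMITTED, ABORTED };
constexpr int64_t INVALID_TS = -1;  // assumption: marker for a collectable undo log

// Hypothetical reduced transaction: its state plus the ts_ of each undo log it owns.
struct SimpleTxn {
  TxnState state_{TxnState::RUNNING};
  std::vector<int64_t> undo_log_ts_;
};

// Removes every committed transaction whose undo logs are all marked as deleted.
// Returns the number of transactions removed.
auto RemoveCollectableTxns(std::unordered_map<int64_t, SimpleTxn> &txn_map) -> size_t {
  size_t removed = 0;
  for (auto it = txn_map.begin(); it != txn_map.end();) {
    const SimpleTxn &txn = it->second;
    const bool committed = txn.state_ == TxnState::COMMITTED;
    const bool all_marked = std::all_of(txn.undo_log_ts_.begin(), txn.undo_log_ts_.end(),
                                        [](int64_t ts) { return ts == INVALID_TS; });
    if (committed && all_marked) {
      it = txn_map.erase(it);  // erase returns the next valid iterator
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}
```

A committed transaction with no undo logs at all is also removed here, since the "all marked" condition holds trivially for an empty list.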

Part of my implementation is shown below. This part marks undo logs in the version chain as deleted:

auto smallest_read_ts = running_txns_.GetWatermark();
/**
 * Walk the TableHeap of every table, then walk the version chain of every tuple.
 * Undo logs that can no longer be read get ts_ = INVALID_TS (-1),
 * which marks them as collectable.
 */
auto table_names = catalog_->GetTableNames();
for (const auto &table_name : table_names) {
  auto table = catalog_->GetTable(table_name);
  auto table_heap = table->table_.get();
  auto table_iterator = std::make_unique<TableIterator>(table_heap->MakeIterator());
  /** traverse all tuples in the TableHeap */
  while (!table_iterator->IsEnd()) {
    /** fetch the current tuple */
    auto [tuple_meta, current_tuple] = table_iterator->GetTuple();
    auto rid = current_tuple.GetRid();
    auto version_undo_link = this->GetUndoLink(rid);
    bool undolog_start_deleted = tuple_meta.ts_ <= smallest_read_ts;
    /** if the base tuple's ts_ is at or below smallest_read_ts, its whole version chain can go */
    if (undolog_start_deleted) {
      this->UpdateUndoLink(rid, UndoLink{INVALID_TXN_ID, 0}, nullptr);
    }
    while (version_undo_link.has_value() && version_undo_link->IsValid()) {
      /** fetch the undo log the current link points to */
      auto undo_log = this->GetUndoLog(version_undo_link.value());
      /** index of this undo log in its transaction's undo log array */
      auto undo_log_index = version_undo_link->prev_log_idx_;
      /** look up the transaction that owns this undo log */
      auto iter = this->txn_map_.find(version_undo_link->prev_txn_);
      if (iter == this->txn_map_.end()) {
        /** stop here: a `continue` would loop forever because the link has not advanced yet */
        break;
      }

      /** advance to the next UndoLink first, because prev_version_ is modified below */
      version_undo_link = undo_log.prev_version_;

      if (undolog_start_deleted && undo_log.ts_ < smallest_read_ts) {
        /** mark this undo log as collectable with ts_ == INVALID_TS and detach it from the chain */
        undo_log.ts_ = INVALID_TS;
        undo_log.prev_version_ = UndoLink{INVALID_TXN_ID, 0};
        iter->second->ModifyUndoLog(undo_log_index, undo_log);
      }

      /** the first undo log with ts_ <= smallest_read_ts is kept; everything after it can be deleted */
      if (!undolog_start_deleted && undo_log.ts_ <= smallest_read_ts) {
        undolog_start_deleted = true;
        undo_log.prev_version_ = UndoLink{INVALID_TXN_ID, 0};
        iter->second->ModifyUndoLog(undo_log_index, undo_log);
      }
    }
    ++(*table_iterator);
  }
}

posted @ 2025-03-03 16:57 虾野百鹤
