Concepts and fundamentals behind the levelDB implementation

1. Basic concepts of levelDB

In the levelDB implementation, level (layering) is without doubt the most important core concept, and compact, which is defined and implemented on top of the levels, is the algorithm's basic operation. Before looking at levelDB I always carried a preconceived and mistaken idea: that any layered structure must come with a B+-tree-style lookup structure, where nodes at an upper level hold pointers to their children, so that at level N one can decide from which node of that level to continue the search downward. In levelDB none of this holds: between the levels, level N keeps no pointers that could guide the lookup into the next level. In essence, all the tables in a level simply form a sorted list from left to right, while the levels themselves express how old the data is: the higher the level, the newer the data, much like the way fossils accumulate in the ground. This means that to look up a key, one has to search level by level from the top in order to reach the newest value.
With this concept in place, the key compact process is easy to understand: a compaction is really the natural sinking of aged data. For an SSTable at level N, sinking it must guarantee that after it lands in level N+1, all SSTables at level N+1 are still sorted from left to right and their key ranges are pairwise disjoint. This means the SSTable has to be split up, its contents scattered piecemeal into the level below, and the operation may also force some SSTables at level N+1 to be broken apart and rebuilt.
Now back to the beginning: where does the content of a fresh SSTable come from? In fact its entries have no inherent relationship to each other; they end up in the same SSTable purely because their modifications were consecutive in time, and once the accumulated modifications reach a certain size they are packed together into one SSTable file. This also means that an SSTable is sorted internally, but the key range it covers is arbitrary.
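The top-down, newest-wins lookup described above can be sketched with a toy model (illustrative only; `Table`, `Level` and `LookupTopDown` are made-up names, not leveldb's real data structures):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model of the level structure: each level is a list of "tables",
// each table a sorted key->value map.  Lookup walks the levels from the
// top; the first hit wins because higher levels hold newer data.
using Table = std::map<std::string, std::string>;
using Level = std::vector<Table>;

bool LookupTopDown(const std::vector<Level>& levels,
                   const std::string& key, std::string* value) {
  for (const Level& level : levels) {        // newest level first
    for (const Table& table : level) {
      auto it = table.find(key);
      if (it != table.end()) {
        *value = it->second;                 // first hit is the newest copy
        return true;
      }
    }
  }
  return false;                              // key not present at any level
}
```

A stale copy of the same key may still exist at a lower level; it is simply shadowed until a compaction eventually merges it away.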

2. What the LevelDB documentation says about the implementation

The documentation states this explicitly: except for the files at the first level, which may contain overlapping keys, the key ranges of the files at every other level never overlap. When the combined size of the files at level L exceeds the configured limit, one level-L file is merged with all the level-(L+1) files whose key ranges overlap it, producing a new set of level-(L+1) files. These merges use bulk reads and writes, so the disk operations are very efficient.
Files in the young level may contain overlapping keys. However files in other levels have distinct non-overlapping key ranges. Consider level number L where L >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB for level-1, 100MB for level-2, ...), one file in level-L, and all of the overlapping files in level-(L+1) are merged to form a set of new files for level-(L+1). These merges have the effect of gradually migrating new updates from the young level to the largest level using only bulk reads and writes (i.e., minimizing expensive seeks).
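The (10^L) MB limit quoted above can be written out directly. This is a simplified sketch in the spirit of the `MaxBytesForLevel` helper in `db/version_set.cc` (note that in the real code level-0 compactions are triggered by file count rather than by total size):

```cpp
#include <cassert>

// Size limit for level L (L >= 1): 10MB for level-1, 100MB for level-2,
// and so on, multiplying by ten per level.
double MaxBytesForLevel(int level) {
  double result = 10.0 * 1048576.0;  // level-1 limit: 10MB
  while (level > 1) {
    result *= 10;
    level--;
  }
  return result;
}
```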
On compact: the compaction picks one file from level L, plus all the files at level L+1 whose key ranges overlap that file, as its input. The documentation makes a point of noting: if the level-L file overlaps only part of a level-(L+1) file, the entire level-(L+1) file is still used as compaction input and is discarded as a whole once the compaction finishes. In other words, the basic granularity levelDB operates at is the SSTable; it never extracts some keys from a file while keeping the rest.
When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and will be discarded after the compaction. Aside: because level-0 is special (files in it may overlap each other), we treat compactions from level-0 to level-1 specially: a level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.
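Because the files at level L+1 are sorted and disjoint, selecting the compaction inputs is a simple range-overlap scan. A minimal sketch (the real logic is `Version::GetOverlappingInputs` in `db/version_set.cc`; `Range` and `OverlappingInputs` here are illustrative names):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Each file at level L+1 is represented by its [smallest, largest] key
// range; every file whose range touches the target range is taken whole,
// since the granularity of a compaction is the file.
using Range = std::pair<std::string, std::string>;  // [smallest, largest]

std::vector<size_t> OverlappingInputs(const std::vector<Range>& files,
                                      const Range& target) {
  std::vector<size_t> inputs;
  for (size_t i = 0; i < files.size(); i++) {
    // Overlaps unless the file ends before the target starts or
    // starts after the target ends.
    if (!(files[i].second < target.first || files[i].first > target.second)) {
      inputs.push_back(i);
    }
  }
  return inputs;
}
```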

3. levelDB lookup: Get

The implementation has a few notable points:
1. The search always starts at level 0 and walks every level from top to bottom. This is because the levels themselves encode how old the data is; only the topmost copy reflects the latest modification.
2. At every level other than level 0 a binary search can be used, which shows that the key ranges of the files at those levels do not overlap.
3. As soon as a match is found it is returned immediately. The reason is the same as in point 1: the first match found is the latest modification.
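The binary search from point 2 can be sketched as follows (the real `FindFile` in `db/version_set.cc` works on `FileMetaData` and internal keys; here each file is reduced to just its largest key):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Find the index of the first file whose largest key is >= key.
// Returns largest_keys.size() when the key is past every file.
int FindFile(const std::vector<std::string>& largest_keys,
             const std::string& key) {
  int left = 0;
  int right = static_cast<int>(largest_keys.size());
  while (left < right) {
    int mid = (left + right) / 2;
    if (largest_keys[mid] < key) {
      left = mid + 1;    // file at "mid" ends before key: look right
    } else {
      right = mid;       // file at "mid" may contain key: look left
    }
  }
  return right;
}
```

The caller then only has to check that the key is not smaller than that file's smallest key, exactly as the `tmp2` branch below does.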
leveldb-master\db\version_set.cc
Status Version::Get(const ReadOptions& options,
                    const LookupKey& k,
                    std::string* value,
                    GetStats* stats) {
  ……
  for (int level = 0; level < config::kNumLevels; level++) {
    size_t num_files = files_[level].size();
    if (num_files == 0) continue;

    // Get the list of files to search in this level
    FileMetaData* const* files = &files_[level][0];
    if (level == 0) {
      // Level-0 files may overlap each other. Find all files that
      // overlap user_key and process them in order from newest to oldest.
      tmp.reserve(num_files);
      for (uint32_t i = 0; i < num_files; i++) {
        FileMetaData* f = files[i];
        if (ucmp->Compare(user_key, f->smallest.user_key()) >= 0 &&
            ucmp->Compare(user_key, f->largest.user_key()) <= 0) {
          tmp.push_back(f);
        }
      }
      if (tmp.empty()) continue;

      std::sort(tmp.begin(), tmp.end(), NewestFirst);
      files = &tmp[0];
      num_files = tmp.size();
    } else {
      // Binary search to find earliest index whose largest key >= ikey.
      uint32_t index = FindFile(vset_->icmp_, files_[level], ikey);
      if (index >= num_files) {
        files = NULL;
        num_files = 0;
      } else {
        tmp2 = files[index];
        if (ucmp->Compare(user_key, tmp2->smallest.user_key()) < 0) {
          // All of "tmp2" is past any data for user_key
          files = NULL;
          num_files = 0;
        } else {
          files = &tmp2;
          num_files = 1;
        }
      }
    }

    for (uint32_t i = 0; i < num_files; ++i) {
      if (last_file_read != NULL && stats->seek_file == NULL) {
        // We have had more than one seek for this read. Charge the 1st file.
        stats->seek_file = last_file_read;
        stats->seek_file_level = last_file_read_level;
      }

      FileMetaData* f = files[i];
      last_file_read = f;
      last_file_read_level = level;

      Saver saver;
      saver.state = kNotFound;
      saver.ucmp = ucmp;
      saver.user_key = user_key;
      saver.value = value;
      s = vset_->table_cache_->Get(options, f->number, f->file_size,
                                   ikey, &saver, SaveValue);
      if (!s.ok()) {
        return s;
      }
      switch (saver.state) {
        case kNotFound:
          break;      // Keep searching in other files
        case kFound:
          return s;
        case kDeleted:
          s = Status::NotFound(Slice());  // Use empty error message for speed
          return s;
        case kCorrupt:
          s = Status::Corruption("corrupted key for ", user_key);
          return s;
      }
    }
  }
……

4. LevelDB's compact

A compact is essentially a multi-way merge sort between the SSTable being merged and a group of sorted files. The merge itself is very simple, a single loop; the code is at:
leveldb-master\table\merger.cc
virtual void Next() {
  assert(Valid());

  // Ensure that all children are positioned after key().
  // If we are moving in the forward direction, it is already
  // true for all of the non-current_ children since current_ is
  // the smallest child and key() == current_->key(). Otherwise,
  // we explicitly position the non-current_ children.
  if (direction_ != kForward) {
    for (int i = 0; i < n_; i++) {
      IteratorWrapper* child = &children_[i];
      if (child != current_) {
        child->Seek(key());
        if (child->Valid() &&
            comparator_->Compare(key(), child->key()) == 0) {
          child->Next();
        }
      }
    }
    direction_ = kForward;
  }

  current_->Next();
  FindSmallest();
}
As the code shows, each step compares the current head element of every input (the children list) and picks the smallest one:
void MergingIterator::FindSmallest() {
  IteratorWrapper* smallest = NULL;
  for (int i = 0; i < n_; i++) {
    IteratorWrapper* child = &children_[i];
    if (child->Valid()) {
      if (smallest == NULL) {
        smallest = child;
      } else if (comparator_->Compare(child->key(), smallest->key()) < 0) {
        smallest = child;
      }
    }
  }
  current_ = smallest;
}
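The same k-way merge can be shown standalone. A minimal sketch (assumed names, not leveldb code) that mirrors the linear scan of FindSmallest:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy k-way merge: each "child" is a sorted vector with a cursor; every
// step scans all children for the smallest head element (a linear scan,
// as in FindSmallest) and advances the child it came from.
std::vector<std::string> MergeSorted(
    const std::vector<std::vector<std::string>>& children) {
  std::vector<size_t> pos(children.size(), 0);
  std::vector<std::string> out;
  for (;;) {
    int smallest = -1;
    for (size_t i = 0; i < children.size(); i++) {
      if (pos[i] < children[i].size() &&
          (smallest < 0 ||
           children[i][pos[i]] < children[smallest][pos[smallest]])) {
        smallest = static_cast<int>(i);
      }
    }
    if (smallest < 0) break;      // all children exhausted
    out.push_back(children[smallest][pos[smallest]]);
    pos[smallest]++;
  }
  return out;
}
```

The scan is O(k) per emitted element; a heap would make it O(log k), but the number of inputs to a compaction is small, so the simple loop is presumably good enough in practice.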

5. Mutual exclusion between the compact and Get paths

When a query runs, what it needs is the current version of the database structures, i.e. the current-version pointer returned by versions_->current(). Updating that is just a pointer update, and the database instance's mutex is taken before the pointer is read, so a concurrent change (the pointer swap) does no harm.
Status DBImpl::Get(const ReadOptions& options,
                   const Slice& key,
                   std::string* value) {
  Status s;
  MutexLock l(&mutex_);
  SequenceNumber snapshot;
  if (options.snapshot != NULL) {
    snapshot = reinterpret_cast<const SnapshotImpl*>(options.snapshot)->number_;
  } else {
    snapshot = versions_->LastSequence();
  }

  MemTable* mem = mem_;
  MemTable* imm = imm_;
  Version* current = versions_->current();
  ……
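The scheme can be sketched as follows (names and signatures are illustrative, not leveldb's actual API): readers hold the mutex only long enough to grab the current pointer and bump a reference count, and a compaction installs a new version by swapping the pointer under the same mutex:

```cpp
#include <mutex>

// Toy version of the Version/VersionSet ref-counting idea.
struct Version {
  int refs = 0;
  int data = 0;                    // stand-in for the per-level file lists
};

struct VersionSet {
  std::mutex mu;
  Version* current = nullptr;

  Version* Acquire() {             // called at the start of a read
    std::lock_guard<std::mutex> l(mu);
    current->refs++;
    return current;                // safe to use after the lock is dropped
  }
  void Release(Version* v) {       // called when the read is done
    std::lock_guard<std::mutex> l(mu);
    if (--v->refs == 0 && v != current) delete v;
  }
  void Install(Version* v) {       // called when a compaction finishes
    std::lock_guard<std::mutex> l(mu);
    Version* old = current;
    current = v;
    if (old != nullptr && old->refs == 0) delete old;
  }
};
```

A reader that acquired the old version keeps it alive via the refcount, so an in-flight Get is never affected by a concurrent compaction installing a new version.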

6. Deletion of obsolete SSTable files

After every compaction, table files whose reference count has dropped to zero are deleted:
DeleteObsoleteFiles() is called at the end of every compaction and at the end of recovery. It finds the names of all files in the database. It deletes all log files that are not the current log file. It deletes all table files that are not referenced from some level and are not the output of an active compaction.
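The core of the idea, keep every file still referenced from some level or an active compaction and delete the rest, can be sketched as (illustrative names; the real `DeleteObsoleteFiles` lives in `db/db_impl.cc` and works on actual directory listings):

```cpp
#include <set>
#include <string>
#include <vector>

// Given the full directory listing and the set of files still referenced
// (by some level or by a pending compaction), return the files to delete.
std::vector<std::string> FindObsolete(const std::vector<std::string>& all_files,
                                      const std::set<std::string>& live) {
  std::vector<std::string> to_delete;
  for (const std::string& f : all_files) {
    if (live.count(f) == 0) {
      to_delete.push_back(f);     // not referenced anywhere: obsolete
    }
  }
  return to_delete;
}
```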

posted on 2019-03-19 15:39  tsecer