mdb
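Planning to build a high-performance KV database while studying MongoDB, LevelDB, and InnoDB (practice and diary thread): http://bbs.chinaunix.net/thread-4244870-1-1.html
Ultra-high-performance network programming, asynchronous network I/O: http://bbs.chinaunix.net/thread-1214570-1-1.html
From LSM-Tree and COLA-Tree to StackOverflow and OSQA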
4. Implementation Highlights
Since all of the source code has been publicly available from the outset, and due to space limitations in this paper, only a few of the most notable implementation details will be described here. Interested parties are invited to read the code in the OpenLDAP git repository and post questions on the openldap-technical mailing list.
The MDB library API was loosely modeled after the BDB API, to ease migration of BDB-based code. The first cut of the back-mdb code was simply copied from the back-bdb source tree, and then all references to the caching layers were deleted. After a few minor API differences were accounted for, the backend was fully operational (though still in need of optimization). As of today back-mdb comprises 340KB of source code, compared to 476KB for back-bdb/hdb, so back-mdb is approximately 30% smaller.
The MDB code itself started from Martin Hedenfalk's append-only Btree code in the OpenBSD ldapd source repository[21]. The first cut of the MDB code was simply copied from the ldapd source, and then all of the Btree page cache manager was deleted and replaced with mmap accesses. The original Btree source yielded an object file of 39KB; the MDB version was 32KB. Initial testing with the append-only code proved that approach to be completely impractical. With a small test database and only a few hundred add/delete operations, the DB occupied 1027 pages but only 10 pages actually contained current data; over 99% of the space was wasted.
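The waste figure quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the two page counts given in the text; the variable names are made up for illustration.

```python
# Page counts reported for the append-only test database.
total_pages = 1027
live_pages = 10

wasted_pages = total_pages - live_pages
wasted_fraction = wasted_pages / total_pages

print(f"{wasted_pages} of {total_pages} pages hold stale data "
      f"({wasted_fraction:.1%} wasted)")
```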
Along with the mmap management and page reclamation, many other significant changes were made to arrive at the current MDB library, mostly to add features from BDB that back-mdb would need. As of today the MDB library comprises 35KB of object code. (Comparing source code is not very informative since the MDB source code has been heavily expanded with Doxygen comments. The initial version of mdb.c was 59KB, as opposed to btree.c at 76KB, but with full documentation embedded mdb.c is now 162KB. Also for comparison, BDB is now over 1.5MB of object code.)
4.1 MDB Change Summary
The append-only Btree code used a meta page at the end of the database file to point at the current root node of the Btree. New pages were always written out sequentially at the end of the file, followed by a new meta page upon transaction commit. Any application opening the database needed to search backward from the end of the file to find the most recent meta page, to get a current snapshot of the database. (Further explanation of append-only operation is available at Martin's web site[22].)
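The backward search for the most recent meta page can be sketched roughly as follows. This is a toy model, not the ldapd on-disk format: the page size, the `META`/`DATA` markers, and the helper names are all assumptions made for illustration.

```python
import struct

PAGE_SIZE = 64          # hypothetical; tiny so the example stays readable
META_MAGIC = b"META"    # hypothetical page-type markers
DATA_MAGIC = b"DATA"


def make_page(magic, payload=b""):
    """Build one fixed-size page: 4-byte marker, payload, zero padding."""
    return (magic + payload).ljust(PAGE_SIZE, b"\0")


def make_meta(root_pgno, txnid):
    """A meta page records the root page number and the committing txn ID."""
    return make_page(META_MAGIC, struct.pack("<QQ", root_pgno, txnid))


def find_latest_meta(buf):
    """Scan page by page from the end of the file toward the start and
    return (pgno, root_pgno, txnid) for the first meta page found."""
    npages = len(buf) // PAGE_SIZE
    for pgno in range(npages - 1, -1, -1):
        page = buf[pgno * PAGE_SIZE:(pgno + 1) * PAGE_SIZE]
        if page[:4] == META_MAGIC:
            root_pgno, txnid = struct.unpack_from("<QQ", page, 4)
            return pgno, root_pgno, txnid
    return None


# Two committed transactions followed by a trailing page with no meta page
# after it (for example, a write that was interrupted before commit).
log = b"".join([
    make_page(DATA_MAGIC, b"root v1"),   # page 0
    make_meta(0, 1),                     # page 1: commit of txn 1
    make_page(DATA_MAGIC, b"leaf v2"),   # page 2
    make_page(DATA_MAGIC, b"root v2"),   # page 3
    make_meta(3, 2),                     # page 4: commit of txn 2
    make_page(DATA_MAGIC, b"partial"),   # page 5: no commit followed
])

latest = find_latest_meta(log)
```

In this model the scan skips the uncommitted trailing page and lands on the meta page written by transaction 2, which is the snapshot a newly opening application would use.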
In MDB there are two meta pages occupying page 0 and page 1 of the file. They are used alternately by transactions. Each meta page points to the root node of two Btrees - one for the free list and one for the application data. New data first re-uses any available pages from the free list, then writes sequentially at the end of the file if no free pages are available. Then the older meta page is written on transaction commit. This is nothing more than standard double-buffering - any application opening the database uses the newer meta page, while a committer overwrites the older one. No locks are needed to protect readers from writers; readers are guaranteed to always see a valid root node.
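The double-buffering of the two meta pages can be modeled in a few lines. This is a minimal in-memory sketch of the idea described above, not MDB's actual data layout; `Meta`, `Env`, and their fields are hypothetical names, and real commits also involve flushing data pages to disk, which is omitted here.

```python
from dataclasses import dataclass


@dataclass
class Meta:
    txnid: int       # ID of the transaction that wrote this meta page
    main_root: int   # root page of the application-data Btree
    free_root: int   # root page of the free-list Btree


class Env:
    def __init__(self):
        # Stand-ins for page 0 and page 1 of the database file.
        self.metas = [Meta(0, 0, 0), Meta(0, 0, 0)]

    def newest_index(self):
        """Index of the meta page with the higher transaction ID."""
        return 0 if self.metas[0].txnid >= self.metas[1].txnid else 1

    def begin_read(self):
        """Readers take a private copy of the newer meta page."""
        m = self.metas[self.newest_index()]
        return Meta(m.txnid, m.main_root, m.free_root)

    def commit(self, main_root, free_root):
        """A committer overwrites the OLDER meta page, leaving the one that
        current readers started from untouched. Returns the page written."""
        newest = self.newest_index()
        older = 1 - newest
        next_txnid = self.metas[newest].txnid + 1
        self.metas[older] = Meta(next_txnid, main_root, free_root)
        return older


env = Env()
old_snapshot = env.begin_read()      # reader starts before any commit
first_written = env.commit(5, 2)     # overwrites one meta page
second_written = env.commit(9, 3)    # overwrites the other one
new_snapshot = env.begin_read()
```

The reader that took `old_snapshot` keeps seeing the root it started with, while a reader that begins after the commits picks up the newer meta page.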
The original code only supported a single Btree in a given database file. For MDB we wanted to support multiple trees in a single database file. The back-mdb indexing code uses individual databases for each attribute index, and it would be a non-starter to require a sysadmin to configure multiple mmap regions for a single back-mdb instance. Additionally, the indexing code uses BDB's sorted duplicate feature, which allows multiple data items with the same key to be stored in a Btree, and this feature needed to be added to MDB as well. These features were both added using a subdatabase mechanism, which allows a data item in a Btree to be treated as the root node of another Btree.
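To make the subdatabase idea concrete, here is a small model in which a data item of one tree is itself another tree. It is only an illustration: `Tree` is a sorted-dictionary stand-in for a Btree, and `open_subdb`, `put_dup`, and `get_dups` are hypothetical helpers, not MDB API calls.

```python
import bisect


class Tree:
    """Stand-in for a Btree: a dict plus a list of keys kept in sorted order."""

    def __init__(self):
        self.keys = []
        self.items = {}

    def put(self, key, value):
        if key not in self.items:
            bisect.insort(self.keys, key)
        self.items[key] = value

    def get(self, key):
        return self.items.get(key)


def open_subdb(main, name):
    """Named databases: the main tree maps a name to the root of another tree."""
    sub = main.get(name)
    if sub is None:
        sub = Tree()
        main.put(name, sub)
    return sub


def put_dup(tree, key, data):
    """Sorted duplicates: the value stored under `key` is a nested tree whose
    keys are the data items, so they stay sorted and free of repeats."""
    dups = tree.get(key)
    if dups is None:
        dups = Tree()
        tree.put(key, dups)
    dups.put(data, b"")


def get_dups(tree, key):
    dups = tree.get(key)
    return list(dups.keys) if dups is not None else []


main = Tree()
cn_index = open_subdb(main, b"index:cn")   # one database per attribute index
put_dup(cn_index, b"smith", 42)
put_dup(cn_index, b"smith", 7)
put_dup(cn_index, b"smith", 42)            # repeated item is stored once
smith_ids = get_dups(cn_index, b"smith")
```

Both features use the same trick in this model: a named database is a nested tree stored in the main tree, and a set of duplicates is a nested tree stored under a key.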
4.2 Locking
For simplicity the MDB library allows only one writer at a time. Creating a write transaction acquires a lock on a writer mutex; the mutex normally resides in a shared memory region so that it can be shared between multiple processes. This shared memory is separate from the region occupied by the main database. The lock region also contains a table with one slot for every active reader in the database. The slots record the reader's process and thread ID, as well as the ID of the transaction snapshot the reader is using. (The process and thread ID are recorded to allow detection of stale entries in the table, e.g. threads that exited without releasing their reader slot.) The table is constructed in processor cache-aligned memory such that False Sharing[23] of cache lines is avoided.
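One way to picture the reader table layout is to pad each slot out to a full cache line so that no two slots share one. The sketch below uses Python's ctypes purely to show the sizes and offsets; the 64-byte cache line, the field widths, the slot count, and all names are assumptions rather than MDB's actual definitions, and it demonstrates only the slot stride, not the alignment of the table's base address.

```python
import ctypes

CACHE_LINE = 64    # assumed cache-line size in bytes
MAX_READERS = 8    # hypothetical table size


class SlotFields(ctypes.Structure):
    """The information the text says each reader slot records."""
    _fields_ = [
        ("txnid", ctypes.c_uint64),  # snapshot the reader is using
        ("pid", ctypes.c_uint32),    # reader's process ID
        ("tid", ctypes.c_uint64),    # reader's thread ID
    ]


class ReaderSlot(ctypes.Structure):
    """Slot fields padded so that each slot fills exactly one cache line."""
    _fields_ = [
        ("fields", SlotFields),
        ("pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(SlotFields))),
    ]


SlotArray = ReaderSlot * MAX_READERS
table = SlotArray()

slot_size = ctypes.sizeof(ReaderSlot)
stride = ctypes.addressof(table[1]) - ctypes.addressof(table[0])

# Writing to one slot touches only that slot's cache-line-sized region.
table[0].fields.txnid = 17
```

Because consecutive slots start exactly one cache line apart, two readers updating their own slots do not write to the same line, provided the table itself begins on a cache-line boundary.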
Readers acquire a slot the first time a thread opens a read transaction. Acquiring an empty slot in the table requires locking a mutex on the table. The slot address is saved in thread-local storage and re-used the next time the thread opens a read transaction, so the thread never needs to touch the table mutex ever again. The reader stores its transaction ID in the slot at the start of the read transaction and zeroes the ID in the slot at the end of the transaction. In normal operation, there is nothing that can block the operation of readers.
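The slot life cycle described above might look something like this in outline. It is a simplified single-process sketch using Python threads; `ReaderTable` and its methods are hypothetical names, and the counter exists only to show how often the table mutex is taken.

```python
import threading


class ReaderTable:
    def __init__(self, nslots):
        self.slots = [None] * nslots     # None marks a free slot
        self.mutex = threading.Lock()    # protects slot allocation only
        self.mutex_acquisitions = 0      # instrumentation for the example
        self._tls = threading.local()    # per-thread slot index

    def _claim_slot(self):
        """Find and claim an empty slot; the only path that takes the mutex."""
        with self.mutex:
            self.mutex_acquisitions += 1
            for i, slot in enumerate(self.slots):
                if slot is None:
                    self.slots[i] = {"tid": threading.get_ident(), "txnid": 0}
                    return i
        raise RuntimeError("reader table is full")

    def begin_read(self, snapshot_txnid):
        """Re-use this thread's slot if it has one; otherwise claim one."""
        idx = getattr(self._tls, "slot", None)
        if idx is None:
            idx = self._tls.slot = self._claim_slot()
        self.slots[idx]["txnid"] = snapshot_txnid
        return idx

    def end_read(self, idx):
        """Zero the transaction ID; the slot stays owned by the thread."""
        self.slots[idx]["txnid"] = 0


table = ReaderTable(4)
first = table.begin_read(7)     # first read txn: claims a slot under the mutex
table.end_read(first)
second = table.begin_read(8)    # later read txn: no mutex, same slot

other = []
worker = threading.Thread(target=lambda: other.append(table.begin_read(8)))
worker.start()
worker.join()
```

The main thread takes the mutex once across both read transactions; the second acquisition recorded by the counter comes from the worker thread claiming its own slot.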
The reader table is used when a writer wants to allocate a page, and knows that the free list is not empty. Writes are performed using copy-on-write semantics; whenever a page is to be written, a copy is made and the copy is modified instead of the original. Once copied, the original page's ID is added to an in-memory free list. When a transaction is committed, the in-memory free list is saved as a single record in the free list DB along with the ID of the transaction for this commit. When a writer wants to pull a page from the free list DB, it compares the transaction ID of the oldest record in the free list DB with the transaction IDs of all of the active readers. If the record in the free list DB is older than all of the readers, then all of the pages in that record may be safely re-used because nothing else in the DB points to them any more.
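The reclamation check can be sketched as a comparison between the oldest free-list record and the oldest active reader. This follows the paragraph above literally (a record is reusable only if it is strictly older than every reader); the dictionary representation, the use of 0 for an idle slot, and the function names are assumptions for illustration.

```python
def oldest_active_reader(reader_txnids):
    """Smallest transaction ID among active readers; 0 marks an idle slot."""
    active = [t for t in reader_txnids if t != 0]
    return min(active) if active else None


def reclaim_pages(free_db, reader_txnids):
    """Pop and return the pages of the oldest free-list record if no active
    reader can still reference them; otherwise return an empty list.

    free_db maps the committing transaction's ID to the list of page
    numbers that transaction freed."""
    if not free_db:
        return []
    oldest_record = min(free_db)
    oldest_reader = oldest_active_reader(reader_txnids)
    if oldest_reader is None or oldest_record < oldest_reader:
        return free_db.pop(oldest_record)
    return []


# Txn 3 freed pages 10 and 11; txn 5 freed page 12.
free_db = {3: [10, 11], 5: [12]}

# Readers hold snapshots 4 and 6; the middle slot is idle.
got_first = reclaim_pages(free_db, [4, 0, 6])    # record 3 is older than both
got_second = reclaim_pages(free_db, [4, 0, 6])   # record 5 is not older than 4
got_third = reclaim_pages(free_db, [0, 0, 0])    # no active readers remain
```

When the check fails, a writer in this model would fall back to allocating fresh pages at the end of the file, as described in section 4.1.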