Mnesia, Erlang's Distributed Database: Implementation and Applications

First, a recommended read: "mnesia源码分析" (Mnesia source code analysis) by yufeng.
 
- linear hash  
ETS, DETS, and Mnesia all use the linear hash algorithm.
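To make the idea concrete, here is a minimal sketch of linear hashing in Python. It is not the ETS or DETS code: the class name `LinearHash`, the initial bucket count, and the load threshold are all assumptions chosen for illustration. The point it shows is that the table grows by splitting one bucket at a time instead of rehashing everything at once.

```python
class LinearHash:
    """Toy linear hash table (illustrative only, not the ETS/DETS code)."""

    def __init__(self, initial_buckets=4, max_load=2.0):
        self.n0 = initial_buckets   # bucket count before any doubling
        self.level = 0              # number of completed doubling rounds
        self.split = 0              # next bucket to split in the current round
        self.max_load = max_load    # assumed threshold: average items per bucket
        self.buckets = [[] for _ in range(initial_buckets)]
        self.size = 0

    def _index(self, key):
        h = hash(key)
        i = h % (self.n0 << self.level)
        if i < self.split:
            # This bucket was already split in the current round,
            # so address it with the next round's (doubled) modulus.
            i = h % (self.n0 << (self.level + 1))
        return i

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for pos, (k, _) in enumerate(bucket):
            if k == key:
                bucket[pos] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        if self.size / len(self.buckets) > self.max_load:
            self._split_one()

    def _split_one(self):
        # Grow by exactly one bucket: only the items of bucket `split` move.
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 << self.level:
            # Every bucket of this round has been split: start the next round.
            self.level += 1
            self.split = 0
        for k, v in old:
            self.buckets[self._index(k)].append((k, v))


table = LinearHash()
for i in range(1000):
    table.put(i, i * i)
print(table.get(12), len(table.buckets))
```

Each insert touches at most one bucket split, so the cost of growing is spread across operations rather than paid in one large rehash.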
 
 
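The Redis dict implementation resembles linear hashing: it rehashes incrementally, which keeps every operation O(1). Besides rehashing one bucket on each operation, it also spends 1 ms out of every 100 ms on rehashing to speed the process up.
Although the rehash itself is incremental, when the key space is very large and LRU eviction is in use, the malloc of the big buckets array alone can stall Redis for a while.
A case I ran into: a production Redis deployment used automatic master/standby failover, and for a while it kept failing over for no apparent reason. Investigation showed a key space of more than 10 million keys. Growing the dict required a malloc of a buckets array twice the size, that is 10M * 24 * 2 = 480M of memory. Memory was always full and kept in check only by LRU replacement, so clearing out a block that large meant a burst of evictions, which left the Redis instance unresponsive for several seconds and triggered the failover. Judging from this case and from memory utilization, try to keep the key space of a Redis instance from growing too large.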
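The 10M * 24 * 2 = 480M estimate above can be reproduced with a few lines of Python. The helper name `bucket_array_bytes` is hypothetical, and the 24 bytes per slot and the doubling factor are simply the figures used in the post, not values read from the Redis source.

```python
def bucket_array_bytes(keys, bytes_per_slot=24, growth=2):
    """Size of the new buckets array, using the post's own figures."""
    return keys * bytes_per_slot * growth


needed = bucket_array_bytes(10_000_000)
print(needed)            # 480000000 bytes
print(needed / 10**6)    # 480.0 (decimal megabytes)
```

On an instance that is already at its memory limit, an allocation of this size can only succeed after a comparable amount of data has been evicted.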
 
- ETS
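     Erlang's built-in database takes on 70 million QPS
     The ETS implementation is simple: it is just an in-memory dictionary. It uses read-write locks, so read-only workloads reach very high TPS. I once benchmarked a dictionary on my old T420 laptop at 4 million reads/writes per second on a single core. Judging from that number, ETS reads are about as fast as reads from a global in-memory dictionary, which is very efficient. Write performance is inevitably limited because writers contend on a table-wide lock (the default; the write_concurrency table option makes locking finer grained), and it degrades as concurrency rises. For write-heavy ETS tables, I recommend splitting the data across several tables.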
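The "split into several tables" advice can be sketched as follows. This is a Python illustration of the idea, not ETS code: `ShardedTable`, the shard count, and the CRC32-based shard choice are assumptions. In Erlang the analogous step would be picking one of N ETS tables with something like `erlang:phash2(Key, N)`.

```python
import threading
import zlib


class ShardedTable:
    """Dictionary split into shards, each guarded by its own lock."""

    def __init__(self, shards=8):
        self.shards = [dict() for _ in range(shards)]
        self.locks = [threading.Lock() for _ in range(shards)]

    def _shard(self, key):
        # Stable hash of the key picks the shard (assumed scheme).
        return zlib.crc32(repr(key).encode("utf-8")) % len(self.shards)

    def insert(self, key, value):
        i = self._shard(key)
        with self.locks[i]:          # writers only contend within one shard
            self.shards[i][key] = value

    def lookup(self, key, default=None):
        i = self._shard(key)
        with self.locks[i]:
            return self.shards[i].get(key, default)


table = ShardedTable()


def writer(start):
    for k in range(start, start + 1000):
        table.insert(k, k * 2)


threads = [threading.Thread(target=writer, args=(n * 1000,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(table.lookup(2500))
```

Writers that hit different shards no longer wait on each other, which is the effect the table split is meant to achieve.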
 
 
- DETS 
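The on-disk counterpart of ETS. A single table is limited to 2 GB. It can use a cache, but the cache size defaults to 0, so by default every read and write goes to disk.
As noted above, DETS stores data with linear hashing. Hashing is not very disk friendly and does not play well with file block caching; the cache only acts as a row-level index, and there is no block-level index.
Overall, DETS falls short of a complete storage engine and has little value on its own, so in practice it is mostly used through Mnesia, the clustered layer built on top of it.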

Since all operations performed by Dets are disk operations, it is important to realize that a single look-up operation involves a series of disk seek and read operations. For this reason, the Dets functions are much slower than the corresponding Ets functions, although Dets exports a similar interface.

Dets organizes data as a linear hash list and the hash list grows gracefully as more data is inserted into the table. Space management on the file is performed by what is called a buddy system. The current implementation keeps the entire buddy system in RAM, which implies that if the table gets heavily fragmented, quite some memory can be used up. The only way to defragment a table is to close it and then open it again with the repair option set to force.


- Mnesia 
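   A powerful distributed database implemented in pure Erlang on top of ETS/DETS. Disc-based Mnesia tables are bound by the DETS size limit, but fragmentation can be used to work around it; fragments are similar to partitioned tables.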
 
Replacing DETS with LevelDB (1/4 the startup time, 1/2 the conflicts, 1/3 the memory footprint)
Mnesia Backend Plugin Framework and a LevelDB-based Plugin: Roland Karlsson, Malcolm Matalka
 
WhatsApp:
disc_copies tables
Partitioned islands and fragmented tables
All operations run async_dirty
Use key hashing to collapse all ops per key to a single process
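The last point in the list above, routing every operation on a key to one process, can be illustrated with a small Python sketch. It uses threads and queues as a stand-in for Erlang processes and mailboxes; `KeyedWorkers`, the worker count, and the CRC32 routing are assumptions, not a description of WhatsApp's code.

```python
import queue
import threading
import zlib


class KeyedWorkers:
    """Routes all operations on a key to the same worker, serializing them."""

    def __init__(self, workers=4):
        self.queues = [queue.Queue() for _ in range(workers)]
        self.stores = [dict() for _ in range(workers)]  # one private store per worker
        for i in range(workers):
            threading.Thread(target=self._run, args=(i,), daemon=True).start()

    def _route(self, key):
        return zlib.crc32(repr(key).encode("utf-8")) % len(self.queues)

    def _run(self, i):
        store, q = self.stores[i], self.queues[i]
        while True:
            key, fn = q.get()
            # Only this worker ever touches `key`, so no lock is needed.
            store[key] = fn(store.get(key))
            q.task_done()

    def submit(self, key, fn):
        self.queues[self._route(key)].put((key, fn))

    def drain(self):
        for q in self.queues:
            q.join()

    def get(self, key):
        return self.stores[self._route(key)].get(key)


pool = KeyedWorkers()


def client():
    for _ in range(250):
        pool.submit("counter", lambda old: (old or 0) + 1)


clients = [threading.Thread(target=client) for _ in range(4)]
for c in clients:
    c.start()
for c in clients:
    c.join()
pool.drain()
print(pool.get("counter"))
```

Because updates to a key are applied one at a time by its owning worker, concurrent clients cannot lose updates even though no transaction or lock is involved, which is what makes dirty operations safe in this arrangement.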
 
 

First of all, mnesia has no 2 gigabyte limit. It is limited on a 32bit architecture, but hardly any are present anymore for real work. And on 64bit, you are not limited to 2 gigabyte. I have seen databases on the order of several hundred gigabytes. The only problem is the initial start-up time for those.

Mnesia is built to handle:
 
  • Very low latency K/V lookup, not necessarily linearizable.
  • Proper transactions with linearizable changes (C in the CAP theorem). These are allowed to run at a much worse latency as they are expected to be relatively rare.
  • On-line schema change
  • Survival even if nodes fail in a cluster (where cluster is smallish, say 10-50 machines at most)

The design is such that you avoid a separate process since data is in the Erlang system already. You have QLC for datalog-like queries. And you have the ability to store any Erlang term.

Mnesia fares well if the above is what you need. Its limits are:

  • You can't get a machine with more than 2 terabytes of memory. And loading 2 terabytes from scratch is going to be slow.
  • Since it is a CP system and not an AP system, the loss of nodes requires manual intervention. You may not need transactions as well. You might also want to be able to seamlessly add more nodes to the system and so on. For this, Riak is a better choice.
  • It uses optimistic locking which gives trouble if many processes try to access the same row in a transaction.
posted @ 2014-09-01 23:22  LittlePeng