Data replication 同步技术
http://horicky.blogspot.com/2009/11/nosql-patterns.html, NOSQL Patterns
There are 2 ways how the master propagate updates to the slave; State transfer and Operation transfer.
- In State transfer, the master passes its latest state to the slave, which then replace its current state with the latest state.
- In Operation transfer, the master propagate a sequence of operations to the slave which then apply the operations in its local state.
State Transfer Model (状态传输模型)
优点是比较robust, 消息丢失后面还可以同步弥补, 缺点传输的流量比较大
当然为了提高效率, 也不会每次都传输所有的state, 只会传输改变的部分(delta change)
问题是, 如果在分布式的环境下知道异地复本之间的差异? 最简单的方法就是把一方的数据全传过来看看, 但这样无法减少传输量
常用的方法是使用Merkle tree, 通过传递digest来降低传输量, 来定位delta change
The state transfer model is more robust against message lost because as long as a latter more updated message arrives, the replica still be able to advance to the latest state.
Even in state transfer mode, we don't want to send the full object for updating other replicas because changes typically happens within a small portion of the object. In will be a waste of network bandwidth if we send the unchanged portion of the object, so we need a mechanism to detect and send just the delta (the portion that has been changed).
One common approach is break the object into chunks and compute a hash tree of the object (Merkle tree). So the replica can just compare their hash tree to figure out which chunk of the object has been changed and only send those over.
Merkle Tree
Hash tree, http://en.wikipedia.org/wiki/Hash_tree
Hash trees were invented in 1979 by Ralph Merkle.
a hash tree or Merkle tree is a tree in which every non-leaf node is labelled with the hash of the labels of its children nodes.如何使用Merkle trees经行高效的复本同步, 参考Amazon's Dynamo, 4.7处理永久性故障:副本同步
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
http://www.slideshare.net/quipo/modern-algorithms-and-data-structures-1-bloom-filters-merkle-trees
Operation transfer mode (操作日志传输模型)
Operation transfer 优点是需要传输的数据少, 但需要可靠的消息系统来保证不丢失
在分布式的环境下会大大增加复杂性, 似乎很少有使用这种方案的
In operation transfer mode, usually much less data need to be send over the network. However, it requires a reliable message mechanism with delivery order guarantee.
基于Gossip和Vector Clock的同步技术
State Transfer Model
In a state transfer model, each replica maintain a vector clock as well as a state version tree where each state is neither > or < among each other (based on vector clock comparison). In other words, the state version tree contains all the conflicting updates.
每个replica都维护vector clock, 以及state version tree(Merkle tree)
Query Processing
Client发出query request, 并附上该replica在client端的V-client
Server端收到后, 返回具有比client所附带vector clock新(V-state > V-client)的那部分数据和其对应的vector clock
Client收到查询结果后, 会将自己的vector clock和server传回的vector clock进行merge
Update Processing
Client发出update命令并附上当前V-client
Server收到后, 发现Vclient比当前服务器的所有V-state都旧(V-client < all V-state, 表明当前服务器的状态已经包含此更新), 则抛弃该更新命令
否则说明是新的改动, server更新状态值, V-state和merkle tree, 并返回新的V-state
Internode Gossiping
版本间同步, 需要通过Merkle Tree来定位delta change, 并最终完成同步
Operation Transfer Model
In an operation transfer approach, the sequence of applying the operations is very important. At the minimum causal order need to be maintained.
Because of the ordering issue, each replica has to defer executing the operation until all the preceding operations has been executed.
Therefore replicas save the operation request to a log file and exchange the log among each other and consolidate these operation logs to figure out the right sequence to apply the operations to their local store in an appropriate order.
对于这种模式, 必须保证
1. 消息不丢失
2. 所有更新命令以相同顺序被执行, 全序问题
3. 更新命令需要defer执行, 直到在他之前的更新命令都已经被执行
Query Processing, 和STM没啥区别
When a query is submitted by the client, it will also send along its vector clock which reflect the client's view of the world. The replica will check if it has a view of the state that is later than the client's view.
Update Processing
When an update operation is received, the replica will buffer the update operation until it can be applied to the local state. Every submitted operation will be tag with 2 timestamp, V-client indicates the client's view when he is making the update request. V-receive is the replica's view when it receives the submission.
This update operation request will be sitting in the queue until the replica has received all the other updates that this one depends on. This condition is reflected in the vector clock Vi when it is larger than V-client.
关键在于如何知道该更新操作之间的操作是否已经被执行?
通过vector clock的比较是一种办法, 如图中算法, 当V-client < Vi(replica当前的vector)的时候, 认为replica已经执行过之前所有的更新命令
但vector clock只能保证偏序, 而非全序, 所以这个方法并无法保证更新顺序完全一致
The concurrent update problem at different replica can also happen. Which means there can be multiple valid sequences of operation. In order for different replica to apply concurrent update in the same order, we need a total ordering mechanism.
Internode Gossiping, 本质上update没有区别
On the background, different replicas exchange their log for the queued updates and update each other's vector clock. After the log exchange, each replica will check whether certain operation can be applied (when all the dependent operation has been received) and apply them accordingly.