Akka Source Code Analysis: Cluster DistributedData
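In the previous post we studied the source code of cluster sharding. Although Akka's cluster sharding was designed to distribute actors across the cluster, with a little adaptation it can easily serve as a simple distributed cache. How? Simple: the entity actor's id is the key, the actor's state is the value, and the state can be changed without locks.
Akka's DistributedData is somewhat similar to a cache: it is very useful when you need to share data across a cluster. Data is accessed through an API resembling that of a key/value cache, but the data stored in DistributedData are Conflict Free Replicated Data Types (CRDTs). I am not very familiar with CRDTs, so I will not cover them here; interested readers can look them up. For now, let's treat them as a way to achieve eventual consistency for replicated data.
All data entries in Akka Distributed Data are spread to all nodes, or to a group of nodes, through direct replication and gossip-based dissemination. Reads and writes offer fine-grained control over the consistency level. CRDTs allow updates from any node without a coordinator; concurrent updates from different nodes are resolved automatically by a monotonic merge function, so the state of the data eventually converges.
The akka.cluster.ddata.Replicator actor provides the API for interacting with the data. A Replicator must be started on every node (or on every node tagged with the configured role), which makes sense since it is responsible for disseminating the data. Because Replicator is just an ordinary actor with a special job, it can be started like any other actor, as long as its name, path and settings are the same on all nodes. Alternatively, it can be started through the akka.cluster.ddata.DistributedData extension.
Through the Replicator we can Update, Get, Subscribe to and Delete data.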
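The data types supported by Akka Distributed Data must be convergent CRDTs and implement the ReplicatedData trait, i.e. they must provide a monotonic merge function and their state changes must always converge. The built-in data types are: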
- Counters: GCounter, PNCounter
- Sets: GSet, ORSet
- Maps: ORMap, ORMultiMap, LWWMap, PNCounterMap
- Registers: LWWRegister, Flag
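GCounter is a grow-only counter: it can only be incremented, never decremented. It works in a way similar to a vector clock: it keeps one count per node, merges by taking the maximum of each node's count, and its value is the sum of the per-node counts. If you need both increments and decrements, use PNCounter (positive/negative counter). PNCounter tracks increments and decrements separately, each represented internally by a GCounter, and merging is done by merging the two GCounters.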
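To make the merge rule concrete, here is a minimal sketch of the two counters in plain Scala. It is not Akka's implementation: the names SketchGCounter and SketchPNCounter are made up for illustration, and nodes are identified by plain strings rather than cluster addresses.

```scala
// Simplified sketch, NOT Akka's GCounter/PNCounter. Node ids are plain strings (assumption).
final case class SketchGCounter(counts: Map[String, Long] = Map.empty) {
  // Each node only ever increments its own slot.
  def increment(node: String, n: Long = 1): SketchGCounter = {
    require(n >= 0, "a grow-only counter cannot be decremented")
    copy(counts = counts.updated(node, counts.getOrElse(node, 0L) + n))
  }

  // The visible value is the sum of all per-node counts.
  def value: Long = counts.values.sum

  // Merge keeps the per-node maximum, which is commutative, associative and idempotent.
  def merge(that: SketchGCounter): SketchGCounter =
    SketchGCounter((counts.keySet ++ that.counts.keySet).map { k =>
      k -> math.max(counts.getOrElse(k, 0L), that.counts.getOrElse(k, 0L))
    }.toMap)
}

// A PN counter is a pair of grow-only counters: one for increments, one for decrements.
final case class SketchPNCounter(
    inc: SketchGCounter = SketchGCounter(),
    dec: SketchGCounter = SketchGCounter()) {
  def increment(node: String, n: Long = 1): SketchPNCounter = copy(inc = inc.increment(node, n))
  def decrement(node: String, n: Long = 1): SketchPNCounter = copy(dec = dec.increment(node, n))
  def value: Long = inc.value - dec.value
  def merge(that: SketchPNCounter): SketchPNCounter =
    SketchPNCounter(inc.merge(that.inc), dec.merge(that.dec))
}
```

Because merge takes the per-node maximum, merging the same state twice or in a different order gives the same result, which is what lets replicas exchange state without coordination.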
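GSet is a set to which elements can only be added. ORSet (observed-remove set) allows elements to be both added and removed. An ORSet has a version vector that is incremented when an element is added; for each element, the version of the node that added it is tracked in a so-called "birth dot".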
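The observed-remove idea can be sketched with a simpler (and less space-efficient) design than the version vector and birth dots that Akka uses: every add carries a unique tag, and a remove only cancels the tags it has actually observed. The SketchORSet name and the string tags below are assumptions made for this illustration.

```scala
// Tombstone-based observed-remove set. This is a simplification for illustration,
// NOT Akka's ORSet, which uses a version vector and "birth dots" instead of tombstones.
final case class SketchORSet[A](
    adds: Set[(A, String)] = Set.empty[(A, String)],
    removes: Set[(A, String)] = Set.empty[(A, String)]) {

  // Every add must use a fresh, globally unique tag (for example "node-counter").
  def add(elem: A, tag: String): SketchORSet[A] = copy(adds = adds + (elem -> tag))

  // A remove cancels only the tags this replica has observed so far.
  def remove(elem: A): SketchORSet[A] = copy(removes = removes ++ adds.filter(_._1 == elem))

  // An element is present while at least one of its tags has not been removed.
  def elements: Set[A] = (adds -- removes).map(_._1)

  def merge(that: SketchORSet[A]): SketchORSet[A] =
    SketchORSet(adds ++ that.adds, removes ++ that.removes)
}
```

If one replica removes "a" while another concurrently adds "a" with a new tag, the merged set still contains "a", because the remove never observed the new tag.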
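ORMap (observed-remove map) is a map whose keys can be of any type and whose values must be ReplicatedData types. It supports adding, updating and removing entries. If an entry is added and removed concurrently, the add wins. If an entry is updated concurrently, the values are merged.
ORMultiMap (observed-remove multi-map) is a multi-valued map whose values are ORSets.
PNCounterMap (positive negative counter map) is a map of named counters whose values are PNCounters.
LWWMap (last writer wins map) is a map whose values are LWWRegisters (last writer wins registers).
Flag is a boolean value that is initialized to false and can be switched to true. Once set to true it cannot be changed back.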
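LWWRegister (last writer wins register) can hold any serializable value. It keeps the value that was updated last, but "last" is hard to determine, because the clocks of different nodes in a distributed system are never perfectly synchronized. Merging takes the register with the highest timestamp; if the timestamps are exactly equal, the value from the node with the lowest address wins. This means the value of an LWWRegister is not necessarily the one that was physically written last. Replicas still converge to the same value, but an update that is newer in real time can be lost if it carries an older timestamp.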
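The merge rule of a last-writer-wins register fits in a few lines. The sketch below uses a hypothetical SketchLWWRegister with the node represented as a plain string; in Akka the node is a cluster address and the timestamp comes from a configurable clock.

```scala
// Illustrative last-writer-wins register, not Akka's LWWRegister.
// `node` is a plain string here (assumption); ties are broken by the lowest node.
final case class SketchLWWRegister[A](node: String, value: A, timestamp: Long) {
  def merge(that: SketchLWWRegister[A]): SketchLWWRegister[A] =
    if (that.timestamp > timestamp) that       // newer timestamp wins
    else if (that.timestamp < timestamp) this
    else if (that.node < node) that            // equal timestamps: lowest node wins
    else this
}
```

Both merge orders pick the same winner, so replicas converge, even though the winner is decided by timestamps rather than by real-world write order.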
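Having said all that, Akka Distributed Data is not really a cache system. It does not fit every kind of problem, and eventual consistency does not suit every scenario. It is also not intended for big data: the number of top-level entries should not exceed 100,000. When a new node joins the cluster, all entries are transferred to it. All data is held in memory, which is another reason it is unsuitable for big data. When a data entry changes, its full state may be replicated to all other nodes; if the data type is a delta-CRDT, it can be replicated as deltas instead.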
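Some readers may ask why I have covered so many concepts without looking at any source code. That is because I think the key to distributed data lies in the concepts, and this feature of Akka is rarely used.

    // Imports and the Tick message are assumed from the usage below
    import java.util.concurrent.ThreadLocalRandom
    import scala.concurrent.duration._
    import akka.actor.{ Actor, ActorLogging }
    import akka.cluster.Cluster
    import akka.cluster.ddata.{ DistributedData, ORSet, ORSetKey }
    import akka.cluster.ddata.Replicator._

    object DataBot {
      private case object Tick
    }

    class DataBot extends Actor with ActorLogging {
      import DataBot._

      val replicator = DistributedData(context.system).replicator
      implicit val node = Cluster(context.system)

      import context.dispatcher
      val tickTask = context.system.scheduler.schedule(5.seconds, 5.seconds, self, Tick)

      val DataKey = ORSetKey[String]("key")

      replicator ! Subscribe(DataKey, self)

      def receive = {
        case Tick ⇒
          // pick a random lowercase letter
          val s = ThreadLocalRandom.current().nextInt(97, 123).toChar.toString
          if (ThreadLocalRandom.current().nextBoolean()) {
            // add
            log.info("Adding: {}", s)
            replicator ! Update(DataKey, ORSet.empty[String], WriteLocal)(_ + s)
          } else {
            // remove
            log.info("Removing: {}", s)
            replicator ! Update(DataKey, ORSet.empty[String], WriteLocal)(_ - s)
          }

        case _: UpdateResponse[_] ⇒ // ignore

        case c @ Changed(DataKey) ⇒
          val data = c.get(DataKey)
          log.info("Current elements: {}", data.elements)
      }

      override def postStop(): Unit = tickTask.cancel()
    }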
The above is the official demo, showing the Subscribe and Update operations on an ORSet. As you can see, every operation on the data goes through the replicator, and always by sending a message. Following this line of thought, we first need to look at the source of DistributedData(context.system).replicator.

    /**
     * Akka extension for convenient configuration and use of the
     * [[Replicator]]. Configuration settings are defined in the
     * `akka.cluster.ddata` section, see `reference.conf`.
     */
    class DistributedData(system: ExtendedActorSystem) extends Extension {

      private val config = system.settings.config.getConfig("akka.cluster.distributed-data")
      private val settings = ReplicatorSettings(config)

      /**
       * Returns true if this member is not tagged with the role configured for the
       * replicas.
       */
      def isTerminated: Boolean =
        Cluster(system).isTerminated || !settings.roles.subsetOf(Cluster(system).selfRoles)

      /**
       * `ActorRef` of the [[Replicator]] .
       */
      val replicator: ActorRef =
        if (isTerminated) {
          system.log.warning("Replicator points to dead letters: Make sure the cluster node is not terminated and has the proper role!")
          system.deadLetters
        } else {
          val name = config.getString("name")
          system.systemActorOf(Replicator.props(settings), name)
        }
    }
The DistributedData extension is almost too simple: it just creates a Replicator with systemActorOf, or returns deadLetters when the node is terminated or lacks the configured role.

    /**
     * A replicated in-memory data store supporting low latency and high availability
     * requirements.
     *
     * The `Replicator` actor takes care of direct replication and gossip based
     * dissemination of Conflict Free Replicated Data Types (CRDTs) to replicas in the
     * cluster.
     * The data types must be convergent CRDTs and implement [[ReplicatedData]], i.e.
     * they provide a monotonic merge function and the state changes always converge.
     *
     * You can use your own custom [[ReplicatedData]] or [[DeltaReplicatedData]] types,
     * and several types are provided by this package, such as:
     *
     * <ul>
     * <li>Counters: [[GCounter]], [[PNCounter]]</li>
     * <li>Registers: [[LWWRegister]], [[Flag]]</li>
     * <li>Sets: [[GSet]], [[ORSet]]</li>
     * <li>Maps: [[ORMap]], [[ORMultiMap]], [[LWWMap]], [[PNCounterMap]]</li>
     * </ul>
     *
     * The `Replicator` actor must be started on each node in the cluster, or group of
     * nodes tagged with a specific role. It communicates with other `Replicator` instances
     * with the same path (without address) that are running on other nodes . For convenience it
     * can be used with the [[DistributedData]] extension but it can also be started as an ordinary
     * actor using the `Replicator.props`. If it is started as an ordinary actor it is important
     * that it is given the same name, started on same path, on all nodes.
     *
     * The protocol for replicating the deltas supports causal consistency if the data type
     * is marked with [[RequiresCausalDeliveryOfDeltas]]. Otherwise it is only eventually
     * consistent. Without causal consistency it means that if elements 'c' and 'd' are
     * added in two separate `Update` operations these deltas may occasionally be propagated
     * to nodes in different order than the causal order of the updates. For this example it
     * can result in that set {'a', 'b', 'd'} can be seen before element 'c' is seen. Eventually
     * it will be {'a', 'b', 'c', 'd'}.
     *
     * == CRDT Garbage ==
     *
     * One thing that can be problematic with CRDTs is that some data types accumulate history (garbage).
     * For example a `GCounter` keeps track of one counter per node. If a `GCounter` has been updated
     * from one node it will associate the identifier of that node forever. That can become a problem
     * for long running systems with many cluster nodes being added and removed. To solve this problem
     * the `Replicator` performs pruning of data associated with nodes that have been removed from the
     * cluster. Data types that need pruning have to implement [[RemovedNodePruning]]. The pruning consists
     * of several steps:
     * <ol>
     * <li>When a node is removed from the cluster it is first important that all updates that were
     * done by that node are disseminated to all other nodes. The pruning will not start before the
     * `maxPruningDissemination` duration has elapsed. The time measurement is stopped when any
     * replica is unreachable, but it's still recommended to configure this with certain margin.
     * It should be in the magnitude of minutes.</li>
     * <li>The nodes are ordered by their address and the node ordered first is called leader.
     * The leader initiates the pruning by adding a `PruningInitialized` marker in the data envelope.
     * This is gossiped to all other nodes and they mark it as seen when they receive it.</li>
     * <li>When the leader sees that all other nodes have seen the `PruningInitialized` marker
     * the leader performs the pruning and changes the marker to `PruningPerformed` so that nobody
     * else will redo the pruning. The data envelope with this pruning state is a CRDT itself.
     * The pruning is typically performed by "moving" the part of the data associated with
     * the removed node to the leader node. For example, a `GCounter` is a `Map` with the node as key
     * and the counts done by that node as value. When pruning the value of the removed node is
     * moved to the entry owned by the leader node. See [[RemovedNodePruning#prune]].</li>
     * <li>Thereafter the data is always cleared from parts associated with the removed node so that
     * it does not come back when merging. See [[RemovedNodePruning#pruningCleanup]]</li>
     * <li>After another `maxPruningDissemination` duration after pruning the last entry from the
     * removed node the `PruningPerformed` markers in the data envelope are collapsed into a
     * single tombstone entry, for efficiency. Clients may continue to use old data and therefore
     * all data are always cleared from parts associated with tombstoned nodes. </li>
     * </ol>
     */
    final class Replicator(settings: ReplicatorSettings) extends Actor with ActorLogging
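The "moving" step that the Scaladoc describes for a GCounter can be pictured with a plain map from node to count. The following is only a sketch under that assumption: PruningSketch and its two functions are hypothetical names that mirror the roles of RemovedNodePruning#prune and RemovedNodePruning#pruningCleanup, and nodes are plain strings.

```scala
// Sketch of pruning a grow-only counter stored as node -> count.
// Not Akka's code: names and the String node ids are assumptions for illustration.
object PruningSketch {

  // Performed by the leader: move the removed node's count into the leader's entry,
  // so the total value of the counter is unchanged.
  def prune(counts: Map[String, Long], removed: String, collapseInto: String): Map[String, Long] =
    counts.get(removed) match {
      case Some(n) =>
        (counts - removed).updated(collapseInto, counts.getOrElse(collapseInto, 0L) + n)
      case None => counts
    }

  // Applied afterwards to incoming data so the removed node's entry
  // does not come back through a merge with an old copy.
  def pruningCleanup(counts: Map[String, Long], removed: String): Map[String, Long] =
    counts - removed
}
```

For example, pruning node "c" from Map("a" -> 3, "b" -> 4, "c" -> 5) with leader "a" gives Map("a" -> 8, "b" -> 4): the sum is preserved while the history of "c" disappears.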
Replicator is an in-memory store supporting low latency and high availability, and it is an ordinary actor. It has many fields and methods, but one field deserves our attention.

    // the actual data
    var dataEntries = Map.empty[KeyId, (DataEnvelope, Digest)]

    type KeyId = String
    // Gossip Status message contains SHA-1 digests of the data to determine when
    // to send the full data
    type Digest = ByteString
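The source comment says this is the actual data. KeyId is a String, and the value is a tuple. The first element of the tuple is a data envelope that wraps the data entry and carries the state of the pruning process for that entry; the second is a SHA-1 digest of the data, which gossip Status messages use to decide when the full data has to be sent.

    /**
     * The `DataEnvelope` wraps a data entry and carries state of the pruning process for the entry.
     */
    final case class DataEnvelope(
      data:          ReplicatedData,
      pruning:       Map[UniqueAddress, PruningState] = Map.empty,
      deltaVersions: VersionVector                    = VersionVector.empty)
      extends ReplicatorMessage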
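The comment on the Digest type says that gossip Status messages carry SHA-1 digests so that the full data is only sent when needed. A rough sketch of that idea, assuming the data is already serialized to bytes, could look like this. DigestSketch and keysToSend are hypothetical names, and the real Replicator additionally chunks the keys and handles missing keys in both directions.

```scala
import java.security.MessageDigest

// Illustration of digest-based comparison, not the Replicator's actual code.
object DigestSketch {

  // SHA-1 over the serialized form of an entry (20 bytes).
  def digest(serialized: Array[Byte]): Vector[Byte] =
    MessageDigest.getInstance("SHA-1").digest(serialized).toVector

  // Keys whose digest on the other node is missing or different:
  // only for these does the full data need to be sent.
  def keysToSend(mine: Map[String, Vector[Byte]], theirs: Map[String, Vector[Byte]]): Set[String] =
    mine.collect { case (key, d) if !theirs.get(key).contains(d) => key }.toSet
}
```

Exchanging 20-byte digests is much cheaper than exchanging the entries themselves, which matters because every gossip tick touches many keys.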
We know that all operations on the data are carried out by sending messages to the replicator, so let's look at the source of receive.

    def receive =
      if (hasDurableKeys) load
      else normalReceive
When the actor has just started it may be in the load phase (when durable keys are configured); we will ignore that for now.

    val normalReceive: Receive = {
      case Get(key, consistency, req)             ⇒ receiveGet(key, consistency, req)
      case u @ Update(key, writeC, req)           ⇒ receiveUpdate(key, u.modify, writeC, req)
      case Read(key)                              ⇒ receiveRead(key)
      case Write(key, envelope)                   ⇒ receiveWrite(key, envelope)
      case ReadRepair(key, envelope)              ⇒ receiveReadRepair(key, envelope)
      case DeltaPropagation(from, reply, deltas)  ⇒ receiveDeltaPropagation(from, reply, deltas)
      case FlushChanges                           ⇒ receiveFlushChanges()
      case DeltaPropagationTick                   ⇒ receiveDeltaPropagationTick()
      case GossipTick                             ⇒ receiveGossipTick()
      case ClockTick                              ⇒ receiveClockTick()
      case Status(otherDigests, chunk, totChunks) ⇒ receiveStatus(otherDigests, chunk, totChunks)
      case Gossip(updatedData, sendBack)          ⇒ receiveGossip(updatedData, sendBack)
      case Subscribe(key, subscriber)             ⇒ receiveSubscribe(key, subscriber)
      case Unsubscribe(key, subscriber)           ⇒ receiveUnsubscribe(key, subscriber)
      case Terminated(ref)                        ⇒ receiveTerminated(ref)
      case MemberWeaklyUp(m)                      ⇒ receiveWeaklyUpMemberUp(m)
      case MemberUp(m)                            ⇒ receiveMemberUp(m)
      case MemberRemoved(m, _)                    ⇒ receiveMemberRemoved(m)
      case evt: MemberEvent                       ⇒ receiveOtherMemberEvent(evt.member)
      case UnreachableMember(m)                   ⇒ receiveUnreachable(m)
      case ReachableMember(m)                     ⇒ receiveReachable(m)
      case GetKeyIds                              ⇒ receiveGetKeyIds()
      case Delete(key, consistency, req)          ⇒ receiveDelete(key, consistency, req)
      case RemovedNodePruningTick                 ⇒ receiveRemovedNodePruningTick()
      case GetReplicaCount                        ⇒ receiveGetReplicaCount()
      case TestFullStateGossip(enabled)           ⇒ fullStateGossipEnabled = enabled
    }
normalReceive has many branches; every operation the Replicator supports is here. Let's start with the Subscribe message.

    def receiveSubscribe(key: KeyR, subscriber: ActorRef): Unit = {
      newSubscribers.addBinding(key.id, subscriber)
      if (!subscriptionKeys.contains(key.id))
        subscriptionKeys = subscriptionKeys.updated(key.id, key)
      context.watch(subscriber)
    }
This is fairly simple: it binds key.id to the subscriber and then watches the subscriber. So what type is KeyR?

    private[akka] type KeyR = Key[ReplicatedData]

    /**
     * Key for the key-value data in [[Replicator]]. The type of the data value
     * is defined in the key. Keys are compared equal if the `id` strings are equal,
     * i.e. use unique identifiers.
     *
     * Specific classes are provided for the built in data types, e.g. [[ORSetKey]],
     * and you can create your own keys.
     */
    abstract class Key[+T <: ReplicatedData](val id: Key.KeyId) extends Serializable
KeyR is simply the key of the corresponding data type, so we won't dig deeper. Next let's see how Update is handled.

    def receiveUpdate(key: KeyR, modify: Option[ReplicatedData] ⇒ ReplicatedData,
                      writeConsistency: WriteConsistency, req: Option[Any]): Unit = {
      val localValue = getData(key.id)

      def deltaOrPlaceholder(d: DeltaReplicatedData): Option[ReplicatedDelta] = {
        d.delta match {
          case s @ Some(_) ⇒ s
          case None        ⇒ Some(NoDeltaPlaceholder)
        }
      }

      Try {
        localValue match {
          case Some(DataEnvelope(DeletedData, _, _)) ⇒ throw new DataDeleted(key, req)
          case Some(envelope @ DataEnvelope(existing, _, _)) ⇒
            modify(Some(existing)) match {
              case d: DeltaReplicatedData if deltaCrdtEnabled ⇒
                (envelope.merge(d.resetDelta.asInstanceOf[existing.T]), deltaOrPlaceholder(d))
              case d ⇒
                (envelope.merge(d.asInstanceOf[existing.T]), None)
            }
          case None ⇒ modify(None) match {
            case d: DeltaReplicatedData if deltaCrdtEnabled ⇒
              (DataEnvelope(d.resetDelta), deltaOrPlaceholder(d))
            case d ⇒ (DataEnvelope(d), None)
          }
        }
      } match {
        case Success((envelope, delta)) ⇒
          log.debug("Received Update for key [{}]", key)

          // handle the delta
          delta match {
            case Some(d) ⇒ deltaPropagationSelector.update(key.id, d)
            case None    ⇒ // not DeltaReplicatedData
          }

          // note that it's important to do deltaPropagationSelector.update before setData,
          // so that the latest delta version is used
          val newEnvelope = setData(key.id, envelope)

          val durable = isDurable(key.id)
          if (isLocalUpdate(writeConsistency)) {
            if (durable)
              durableStore ! Store(key.id, new DurableDataEnvelope(newEnvelope),
                Some(StoreReply(UpdateSuccess(key, req), StoreFailure(key, req), replyTo)))
            else
              replyTo ! UpdateSuccess(key, req)
          } else {
            val (writeEnvelope, writeDelta) = delta match {
              case Some(NoDeltaPlaceholder) ⇒ (newEnvelope, None)
              case Some(d: RequiresCausalDeliveryOfDeltas) ⇒
                val v = deltaPropagationSelector.currentVersion(key.id)
                (newEnvelope, Some(Delta(newEnvelope.copy(data = d), v, v)))
              case Some(d) ⇒ (newEnvelope.copy(data = d), None)
              case None    ⇒ (newEnvelope, None)
            }
            val writeAggregator =
              context.actorOf(WriteAggregator.props(key, writeEnvelope, writeDelta, writeConsistency,
                req, nodes, unreachable, replyTo, durable)
                .withDispatcher(context.props.dispatcher))
            if (durable) {
              durableStore ! Store(key.id, new DurableDataEnvelope(newEnvelope),
                Some(StoreReply(UpdateSuccess(key, req), StoreFailure(key, req), writeAggregator)))
            }
          }
        case Failure(e: DataDeleted[_]) ⇒
          log.debug("Received Update for deleted key [{}]", key)
          replyTo ! e
        case Failure(e) ⇒
          log.debug("Received Update for key [{}], failed: {}", key, e.getMessage)
          replyTo ! ModifyFailure(key, "Update failed: " + e.getMessage, e, req)
      }
    }
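The logic of this code is also straightforward: it fetches the local value by id, applies the caller-supplied modify function to it, merges the result into the existing envelope with merge, and stores it back with setData. For a local write the reply is sent right away; for other write consistency levels a WriteAggregator is created to write to the other nodes.

    def getData(key: KeyId): Option[DataEnvelope] =
      dataEntries.get(key).map { case (envelope, _) ⇒ envelope }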
getData is also very simple: it looks up the value for the key in the map.
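Stripped of deltas, durability and write consistency, the local part of an update boils down to "get, modify, merge, store". The sketch below shows only that core with a toy grow-only set. LocalStoreSketch and SketchGSet are hypothetical names; the real Replicator stores DataEnvelope values together with a digest.

```scala
// Toy grow-only set used as the replicated value (illustration only).
final case class SketchGSet[A](elements: Set[A] = Set.empty[A]) {
  def add(a: A): SketchGSet[A] = copy(elements + a)
  def merge(that: SketchGSet[A]): SketchGSet[A] = SketchGSet(elements ++ that.elements)
}

// The "get, modify, merge, store" core of an update, without actors.
final class LocalStoreSketch[A] {
  private var dataEntries = Map.empty[String, SketchGSet[A]]

  def getData(key: String): Option[SketchGSet[A]] = dataEntries.get(key)

  def update(key: String)(modify: Option[SketchGSet[A]] => SketchGSet[A]): SketchGSet[A] = {
    val existing = getData(key)                              // 1. read the local value
    val modified = modify(existing)                          // 2. apply the caller's function
    val merged = existing.fold(modified)(_.merge(modified))  // 3. merge into the existing value
    dataEntries = dataEntries.updated(key, merged)           // 4. store it back
    merged
  }
}
```

A call such as store.update("key")(_.getOrElse(SketchGSet[String]()).add("a")) plays the role of Update(DataKey, ORSet.empty[String], WriteLocal)(_ + s) in the demo: the initial value is used only when the key has no data yet.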
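Having analyzed this far, there is no need to analyze the data operations further. Why? All create, read, update and delete operations essentially work by modifying the dataEntries of the current actor. Given what we have learned from earlier source analysis, the synchronization mechanism does not need deeper study either (this series is meant as a simple introduction to the source). We can guess that there must be a timer (see GossipTick and DeltaPropagationTick in normalReceive) that disseminates the current dataEntries through the gossip protocol, and that when other nodes receive the data they call the merge of the CRDT data type to apply the changes. Because of the properties of CRDTs, merge never has to deal with conflicts, so after enough rounds of gossip all nodes converge to the same data. Until an update reaches a node, that node keeps seeing its old value.

    def write(key: KeyId, writeEnvelope: DataEnvelope): Option[DataEnvelope] = {
      getData(key) match {
        case someEnvelope @ Some(envelope) if envelope eq writeEnvelope ⇒ someEnvelope
        case Some(DataEnvelope(DeletedData, _, _))                      ⇒ Some(DeletedEnvelope) // already deleted
        case Some(envelope @ DataEnvelope(existing, _, _)) ⇒
          try {
            // DataEnvelope will mergeDelta when needed
            val merged = envelope.merge(writeEnvelope).addSeen(selfAddress)
            Some(setData(key, merged))
          } catch {
            case e: IllegalArgumentException ⇒
              log.warning(
                "Couldn't merge [{}], due to: {}", key, e.getMessage)
              None
          }
        case None ⇒
          // no existing data for the key
          val writeEnvelope2 =
            writeEnvelope.data match {
              case d: ReplicatedDelta ⇒
                val z = d.zero
                writeEnvelope.copy(data = z.mergeDelta(d.asInstanceOf[z.D]))
              case _ ⇒
                writeEnvelope
            }

          val writeEnvelope3 = writeEnvelope2.addSeen(selfAddress)
          Some(setData(key, writeEnvelope3))
      }
    }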
The above is the source that actually performs the write, and it supports our guess.
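How repeated "send state, merge on receipt" exchanges bring replicas to the same state can be simulated without any actors. The sketch below is a deliberately naive model, not Akka's gossip protocol: GossipSketch is a hypothetical name, each replica's state is a map from key to a grow-only set, and in every round replica i simply sends its full state to replica (i + 1) % n.

```scala
// Naive ring gossip over grow-only sets (illustration only, not Akka's protocol).
object GossipSketch {
  type State = Map[String, Set[String]] // key -> grow-only set of elements

  // CRDT merge: union per key, so the order and repetition of merges do not matter.
  def merge(a: State, b: State): State =
    (a.keySet ++ b.keySet).map { k =>
      k -> (a.getOrElse(k, Set.empty[String]) ++ b.getOrElse(k, Set.empty[String]))
    }.toMap

  // One round: replica i sends its state to its right-hand neighbour, which merges it.
  def round(replicas: Vector[State]): Vector[State] =
    replicas.indices.foldLeft(replicas) { (rs, i) =>
      val target = (i + 1) % rs.size
      rs.updated(target, merge(rs(target), rs(i)))
    }
}
```

Starting from three replicas that each hold one different element under the same key, the replicas still differ after the first round in this model and are identical after the second. Replicas that have not yet received an update simply keep serving their older state until it arrives.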
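That concludes the analysis of Akka's Cluster Distributed Data source. Readers may ask why the analysis is so shallow. There are two reasons. First, the feature has a limited range of applications, and since it is memory-based the stored data cannot be very large. Second, I think sharding solves the data-sharing problem better, and without locks. Interested readers can study this part of the source on their own.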