Akka(13): Distributed Computing: Cluster-Sharding, Sharding Computation Across a Cluster
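In the previous post on Cluster-Singleton we saw the programming support Akka offers for distributed programs: the message-driven computation model suits distributed programming well, and without any special effort, simply by following the ordinary Actor programming style, we can build a cluster-distributed program. Cluster-Singleton guarantees that the single instance of a given Actor keeps running safely and stably in a cluster, whatever happens to individual nodes, as long as at least one node remains online. There is another scenario: many resource-hungry Actors must run at the same time, and together they need far more resources than one server can provide. We then have to spread these Actors over several servers, that is, over a cluster of servers, and Cluster-Sharding is the pattern that addresses this.
Let me share what using Cluster-Sharding achieves, so we can judge together whether these results include distributing Actors across cluster nodes:
First, I have an Actor whose name is a self-assigned code (its entity-id), and Cluster-Sharding creates it on some node in the cluster. Because it lives in a cluster, I do not know which node it is on or what its actual address is; I only need the code to talk to it. If I have many such coded, resource-consuming Actors, I can use the shard number embedded in the code to have them created in other shards. Akka-Cluster can also rebalance: as nodes join or leave, it moves shards between nodes according to the current cluster membership, which includes re-creating on other online nodes all Actors from a node that has left the cluster for whatever reason. Seen this way, the Actor's self-assigned code is the core element in using Cluster-Sharding. As usual, we demonstrate Cluster-Sharding with an example. The Actor we shard is the Calculator mentioned in earlier posts: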
package clustersharding.entity

import akka.actor._
import akka.cluster._
import akka.persistence._
import scala.concurrent.duration._
import akka.cluster.sharding._

object Calculator {
  sealed trait Command
  case class Num(d: Double) extends Command
  case class Add(d: Double) extends Command
  case class Sub(d: Double) extends Command
  case class Mul(d: Double) extends Command
  case class Div(d: Double) extends Command
  case object ShowResult extends Command

  sealed trait Event
  case class SetResult(d: Any) extends Event

  def getResult(res: Double, cmd: Command): Double = cmd match {
    case Num(x) => x
    case Add(x) => res + x
    case Sub(x) => res - x
    case Mul(x) => res * x
    case Div(x) =>
      // Double division by zero yields Infinity instead of throwing, so fail explicitly
      if (x == 0.0) throw new ArithmeticException("/ by zero")
      res / x
    // commands that carry no operand are not arithmetic operations
    case _ => throw new ArithmeticException("Invalid Operation!")
  }

  case class State(result: Double) {
    def updateState(evt: Event): State = evt match {
      case SetResult(n) => copy(result = n.asInstanceOf[Double])
    }
  }

  case object Disconnect extends Command //exit cluster

  def props = Props(new Calcultor)
}
class Calcultor extends PersistentActor with ActorLogging {
  import Calculator._

  val cluster = Cluster(context.system)
  var state: State = State(0)

  override def persistenceId: String = self.path.parent.name+"-"+self.path.name

  override def receiveRecover: Receive = {
    case evt: Event => state = state.updateState(evt)
    case SnapshotOffer(_,st: State) => state = state.copy(result = st.result)
  }

  override def receiveCommand: Receive = {
    case Num(n) => persist(SetResult(getResult(state.result,Num(n))))(evt => state = state.updateState(evt))
    case Add(n) => persist(SetResult(getResult(state.result,Add(n))))(evt => state = state.updateState(evt))
    case Sub(n) => persist(SetResult(getResult(state.result,Sub(n))))(evt => state = state.updateState(evt))
    case Mul(n) => persist(SetResult(getResult(state.result,Mul(n))))(evt => state = state.updateState(evt))
    case Div(n) => persist(SetResult(getResult(state.result,Div(n))))(evt => state = state.updateState(evt))
    case ShowResult => log.info(s"Result on ${cluster.selfAddress.hostPort} is: ${state.result}")
    case Disconnect =>
      log.info(s"${cluster.selfAddress} is leaving cluster!!!")
      cluster.leave(cluster.selfAddress)
  }

  override def preRestart(reason: Throwable, message: Option[Any]): Unit = {
    log.info(s"Restarting calculator: ${reason.getMessage}")
    super.preRestart(reason, message)
  }
}
class CalcSupervisor extends Actor {
  // Resume (keep state, skip the failing message) on arithmetic errors
  def decider: PartialFunction[Throwable,SupervisorStrategy.Directive] = {
    case _: ArithmeticException => SupervisorStrategy.Resume
  }

  override def supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 5, withinTimeRange = 5.seconds){
      decider.orElse(SupervisorStrategy.defaultDecider)
    }

  val calcActor = context.actorOf(Calculator.props,"calculator")

  // forward everything to the supervised calculator, preserving the original sender
  override def receive: Receive = {
    case msg@ _ => calcActor.forward(msg)
  }
}
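As we can see, Calculator is an ordinary PersistentActor: its internal state is persisted and can be restored when the Actor restarts. CalcSupervisor supervises Calculator; it exists so that we can install a custom SupervisorStrategy.
Calculator is the entity we are going to shard. The shards for one type of Actor are set up in the cluster through the ClusterSharding.start method of Akka's Cluster-Sharding. We need to run this method on every node that will host shards: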
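To make the recovery behaviour concrete, here is a minimal, Akka-free sketch of what receiveRecover amounts to: replaying the journaled events over the initial state. ReplaySketch and recover are hypothetical names used only for illustration, and SetResult is simplified to carry a Double instead of Any.

```scala
// Akka-free model of how the persistent Calculator rebuilds its state.
// Assumption: SetResult carries a Double here (the article's version uses Any).
object ReplaySketch {
  sealed trait Event
  final case class SetResult(d: Double) extends Event

  final case class State(result: Double) {
    def updateState(evt: Event): State = evt match {
      case SetResult(n) => copy(result = n)
    }
  }

  // Recovery is a left fold of the journaled events over the initial state.
  def recover(journal: Seq[Event]): State =
    journal.foldLeft(State(0.0))((st, evt) => st.updateState(evt))

  def main(args: Array[String]): Unit = {
    // Hypothetical journal: the events written for Num(13.0) followed by Add(12.0)
    val journal = Seq(SetResult(13.0), SetResult(25.0))
    println(recover(journal)) // prints State(25.0)
  }
}
```

Because every event stores the full result, only the last event matters in this model; the fold is still the general shape of event-sourced recovery.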
/**
 * Register a named entity type by defining the [[akka.actor.Props]] of the entity actor and
 * functions to extract entity and shard identifier from messages. The [[ShardRegion]] actor
 * for this type can later be retrieved with the [[#shardRegion]] method.
 *
 * The default shard allocation strategy [[ShardCoordinator.LeastShardAllocationStrategy]]
 * is used. [[akka.actor.PoisonPill]] is used as `handOffStopMessage`.
 *
 * Some settings can be configured as described in the `akka.cluster.sharding` section
 * of the `reference.conf`.
 *
 * @param typeName the name of the entity type
 * @param entityProps the `Props` of the entity actors that will be created by the `ShardRegion`
 * @param settings configuration settings, see [[ClusterShardingSettings]]
 * @param extractEntityId partial function to extract the entity id and the message to send to the
 *   entity from the incoming message, if the partial function does not match the message will
 *   be `unhandled`, i.e. posted as `Unhandled` messages on the event stream
 * @param extractShardId function to determine the shard id for an incoming message, only messages
 *   that passed the `extractEntityId` will be used
 * @return the actor ref of the [[ShardRegion]] that is to be responsible for the shard
 */
def start(
    typeName: String,
    entityProps: Props,
    settings: ClusterShardingSettings,
    extractEntityId: ShardRegion.ExtractEntityId,
    extractShardId: ShardRegion.ExtractShardId): ActorRef = {

  val allocationStrategy = new LeastShardAllocationStrategy(
    settings.tuningParameters.leastShardAllocationRebalanceThreshold,
    settings.tuningParameters.leastShardAllocationMaxSimultaneousRebalance)

  start(typeName, entityProps, settings, extractEntityId, extractShardId, allocationStrategy, PoisonPill)
}
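start returns the ShardRegion as an ActorRef. A ShardRegion is a special Actor that manages the Actor instances, called entities, living in one or more shards. These shards may be spread over different cluster nodes, and the outside world reaches the entities through the ShardRegion. The entityProps parameter of start tells us that each shard holds only one type of Actor. The entity instances themselves are created by another internal Actor, the shard, and one shard can create many entity instances. The extractShardId and extractEntityId functions give some insight into this multi-shard, multi-entity structure. As mentioned, the Actor's self-assigned code, the entity-id, is the core element of Cluster-Sharding. In this design the entity-id also embeds the shard-id, so the encoding rule for entity-ids lets the user design the whole sharding layout, including how many shards and entities sit under each ShardRegion. When a ShardRegion receives a message, it first extracts the shard-id; if that shard does not yet exist in the cluster, a new shard is created on one of the nodes, chosen according to the load on the cluster nodes. It then looks up the entity-id inside that shard and creates a new entity instance if none exists. The whole process of creating shards and entities is driven by the user-supplied functions extractShardId and extractEntityId; through these two functions Cluster-Sharding creates and uses shards and entities as the user requires. The entity-id does not need to follow any particular order; it only has to be unique. Here is an example encoding: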
object CalculatorShard {
  import Calculator._

  case class CalcCommands(eid: String, msg: Command) //user should use it to talk to shardregion

  val shardName = "calcShard"

  val getEntityId: ShardRegion.ExtractEntityId = {
    case CalcCommands(id,msg) => (id,msg)
  }

  val getShardId: ShardRegion.ExtractShardId = {
    case CalcCommands(id,_) => id.head.toString
  }

  def entityProps = Props(new CalcSupervisor)
}
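The lookup sequence described before the CalculatorShard listing (extract the shard-id, allocate the shard if it is missing, then find or create the entity) can be modelled without Akka. The sketch below is a toy model under stated assumptions, not Akka's implementation: RoutingSketch, ClusterView, leastLoaded and route are hypothetical names, nodes are plain strings, and "load" is simply the number of shards a node hosts.

```scala
// Toy, Akka-free model of entity-id -> shard-id -> node resolution.
object RoutingSketch {
  // Toy cluster view: where each shard lives and which entities each shard holds.
  final case class ClusterView(
      shardHome: Map[String, String],      // shard-id -> node name
      entities: Map[String, Set[String]])  // shard-id -> entity-ids created so far

  val empty: ClusterView = ClusterView(Map.empty, Map.empty)

  // Same rule as getShardId in the article: the first character of the entity-id.
  def shardIdOf(entityId: String): String = entityId.head.toString

  // Simplified "least shards" choice: the node currently hosting the fewest shards.
  def leastLoaded(nodes: Seq[String], view: ClusterView): String =
    nodes.minBy(node => view.shardHome.values.count(_ == node))

  // Resolve an entity-id to a node, recording the shard and the entity if missing.
  def route(entityId: String, nodes: Seq[String], view: ClusterView): (ClusterView, String) = {
    val shardId = shardIdOf(entityId)
    val home = view.shardHome.getOrElse(shardId, leastLoaded(nodes, view))
    val known = view.entities.getOrElse(shardId, Set.empty[String])
    val updated = ClusterView(
      view.shardHome.updated(shardId, home),
      view.entities.updated(shardId, known + entityId))
    (updated, home)
  }

  def main(args: Array[String]): Unit = {
    val nodes = Seq("nodeA", "nodeB") // hypothetical node names
    val (v1, home1012) = route("1012", nodes, empty)
    val (v2, home2012) = route("2012", nodes, v1)
    val (_, home1999) = route("1999", nodes, v2)
    println(s"1012 -> $home1012, 2012 -> $home2012, 1999 -> $home1999")
  }
}
```

In this model entities 1012 and 1999 share shard 1 and therefore always land on the same node, while shard 2 goes to whichever node hosts fewer shards at that moment.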
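Users talk to the ShardRegion with CalcCommands. It is an envelope message type defined specifically for talking to the sharding system: besides the Command that Calculator normally handles, it carries eid, the id of the target entity instance. The first character of eid is the shard-id, so when composing a new eid we can either name the target shard directly or pick a shard-id at random, for example Random.nextInt(9).toString. Since each shard holds only one type of Actor, different entity-ids mean that many instances of the same Actor type exist at once, much like the Router discussed earlier: all instances perform the same computation on different inputs. Typically users generate entity-ids with some algorithm, hoping for an even spread of entities across shards, while Cluster-Sharding automatically adjusts how shards are placed on cluster nodes according to the actual cluster load.
The following code shows how to deploy shards on a cluster node: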
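One common alternative to encoding the shard in the first character is to derive the shard-id from a hash of the entity-id, which spreads arbitrary ids over a fixed number of shards. The sketch below assumes a hypothetical shard count of 10 and hypothetical entity-ids of the form calc-N; HashShardIdSketch and shardIdFor are illustrative names, not part of the article's code.

```scala
// Hash-based shard-id derivation, as an alternative to "first character of eid".
object HashShardIdSketch {
  // Hypothetical shard count; assumed to be the same on every node.
  val numberOfShards = 10

  // Deterministic: the same entity-id always maps to the same shard-id.
  // floorMod keeps the result in 0 until numberOfShards even for negative hash codes.
  def shardIdFor(entityId: String): String =
    java.lang.Math.floorMod(entityId.hashCode, numberOfShards).toString

  def main(args: Array[String]): Unit = {
    val ids = (1 to 1000).map(i => s"calc-$i") // hypothetical entity-ids
    val perShard = ids.groupBy(shardIdFor).map { case (shard, members) => shard -> members.size }
    perShard.toSeq.sortBy(_._1).foreach { case (shard, n) => println(s"shard $shard: $n entities") }
  }
}
```

Inside an ExtractShardId such a function would be applied to the id taken from the envelope, for example `case CalcCommands(id, _) => shardIdFor(id)`.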
package clustersharding.shard

import akka.persistence.journal.leveldb._
import akka.actor._
import akka.cluster.sharding._
import com.typesafe.config.ConfigFactory
import akka.util.Timeout
import scala.concurrent.duration._
import akka.pattern._
import clustersharding.entity.CalculatorShard

object CalcShards {
  def create(port: Int) = {
    val config = ConfigFactory.parseString(s"akka.remote.netty.tcp.port=${port}")
      .withFallback(ConfigFactory.load("sharding"))
    // Create an Akka system
    val system = ActorSystem("ShardingSystem", config)
    startupSharding(port,system)
  }

  def startupSharedJournal(system: ActorSystem, startStore: Boolean, path: ActorPath): Unit = {
    // Start the shared journal on one node (don't crash this SPOF)
    // This will not be needed with a distributed journal
    if (startStore)
      system.actorOf(Props[SharedLeveldbStore], "store")
    // register the shared journal
    import system.dispatcher
    implicit val timeout = Timeout(15.seconds)
    val f = (system.actorSelection(path) ? Identify(None))
    f.onSuccess {
      case ActorIdentity(_, Some(ref)) =>
        SharedLeveldbJournal.setStore(ref, system)
      case _ =>
        system.log.error("Shared journal not started at {}", path)
        system.terminate()
    }
    f.onFailure {
      case _ =>
        system.log.error("Lookup of shared journal at {} timed out", path)
        system.terminate()
    }
  }

  def startupSharding(port: Int, system: ActorSystem) = {
    // only the node on port 2551 hosts the shared journal store
    startupSharedJournal(system, startStore = (port == 2551), path =
      ActorPath.fromString("akka.tcp://ShardingSystem@127.0.0.1:2551/user/store"))

    ClusterSharding(system).start(
      typeName = CalculatorShard.shardName,
      entityProps = CalculatorShard.entityProps,
      settings = ClusterShardingSettings(system),
      extractEntityId = CalculatorShard.getEntityId,
      extractShardId = CalculatorShard.getShardId
    )
  }
}
The actual deployment code is in the startupSharding method. The next piece of code shows how to use the entities in the shards:
package clustersharding.demo

import akka.actor.ActorSystem
import akka.cluster.sharding._
import clustersharding.entity.CalculatorShard.CalcCommands
import clustersharding.entity._
import clustersharding.shard.CalcShards
import com.typesafe.config.ConfigFactory

object ClusterShardingDemo extends App {

  CalcShards.create(2551)
  CalcShards.create(0)
  CalcShards.create(0)
  CalcShards.create(0)

  Thread.sleep(1000)

  val shardingSystem = ActorSystem("ShardingSystem",ConfigFactory.load("sharding"))
  CalcShards.startupSharding(0,shardingSystem)

  Thread.sleep(1000)

  val calcRegion = ClusterSharding(shardingSystem).shardRegion(CalculatorShard.shardName)

  calcRegion ! CalcCommands("1012",Calculator.Num(13.0)) //shard 1, entity 1012
  calcRegion ! CalcCommands("1012",Calculator.Add(12.0))
  calcRegion ! CalcCommands("1012",Calculator.ShowResult) //shows address too
  calcRegion ! CalcCommands("1012",Calculator.Disconnect) //disengage cluster

  calcRegion ! CalcCommands("2012",Calculator.Num(10.0)) //shard 2, entity 2012
  calcRegion ! CalcCommands("2012",Calculator.Mul(3.0))
  calcRegion ! CalcCommands("2012",Calculator.Div(2.0))
  calcRegion ! CalcCommands("2012",Calculator.Div(0.0)) //divide by zero

  Thread.sleep(15000)
  calcRegion ! CalcCommands("1012",Calculator.ShowResult) //check if restore result on another node
  calcRegion ! CalcCommands("2012",Calculator.ShowResult)
}
The code above hand-picks the shards and entity-ids, and it also removes one node from the cluster. The output is as follows:
[INFO] [07/15/2017 09:32:49.414] [ShardingSystem-akka.actor.default-dispatcher-20] [akka.tcp://ShardingSystem@127.0.0.1:50456/system/sharding/calcShard/1/1012/calculator] Result on ShardingSystem@127.0.0.1:50456 is: 25.0
[INFO] [07/15/2017 09:32:49.414] [ShardingSystem-akka.actor.default-dispatcher-20] [akka.tcp://ShardingSystem@127.0.0.1:50456/system/sharding/calcShard/1/1012/calculator] akka.tcp://ShardingSystem@127.0.0.1:50456 is leaving cluster!!!
[WARN] [07/15/2017 09:32:49.431] [ShardingSystem-akka.actor.default-dispatcher-18] [akka://ShardingSystem/system/sharding/calcShard/2/2012/calculator] / by zero
[INFO] [07/15/2017 09:33:01.320] [ShardingSystem-akka.actor.default-dispatcher-4] [akka.tcp://ShardingSystem@127.0.0.1:50464/system/sharding/calcShard/2/2012/calculator] Result on ShardingSystem@127.0.0.1:50464 is: 15.0
[INFO] [07/15/2017 09:33:01.330] [ShardingSystem-akka.actor.default-dispatcher-18] [akka.tcp://ShardingSystem@127.0.0.1:50457/system/sharding/calcShard/1/1012/calculator] Result on ShardingSystem@127.0.0.1:50457 is: 25.0
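The output shows that after node 50456 left the cluster, entity 1012 was moved to node 50457 with its state preserved.
Below is the complete source code for this demo: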
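The state can follow the entity because persistenceId is built from the actor's position under the shard (parent name plus own name), not from the node address. The sketch below mirrors self.path.parent.name + "-" + self.path.name on plain path strings copied from the log output; PersistenceIdSketch and persistenceIdOf are hypothetical helper names, and real code would use ActorPath rather than string splitting.

```scala
// Derive the persistenceId the Calculator would compute, from a logged actor path.
object PersistenceIdSketch {
  // Mirrors self.path.parent.name + "-" + self.path.name on a plain path string.
  def persistenceIdOf(actorPath: String): String = {
    val elements = actorPath.split('/').filter(_.nonEmpty)
    elements(elements.length - 2) + "-" + elements.last
  }

  def main(args: Array[String]): Unit = {
    // The two paths logged for entity 1012, before and after it moved nodes
    val before = "akka.tcp://ShardingSystem@127.0.0.1:50456/system/sharding/calcShard/1/1012/calculator"
    val after  = "akka.tcp://ShardingSystem@127.0.0.1:50457/system/sharding/calcShard/1/1012/calculator"
    println(persistenceIdOf(before)) // prints 1012-calculator
    println(persistenceIdOf(after))  // prints 1012-calculator
  }
}
```

Both paths yield the same id, so the instance re-created on the other node replays the same journal entries from the shared store.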
build.sbt
name := "cluster-sharding"

version := "1.0"

scalaVersion := "2.11.9"

resolvers += "Akka Snapshot Repository" at "http://repo.akka.io/snapshots/"

val akkaversion = "2.4.8"

libraryDependencies ++= Seq(
  "com.typesafe.akka" %% "akka-actor" % akkaversion,
  "com.typesafe.akka" %% "akka-remote" % akkaversion,
  "com.typesafe.akka" %% "akka-cluster" % akkaversion,
  "com.typesafe.akka" %% "akka-cluster-tools" % akkaversion,
  "com.typesafe.akka" %% "akka-cluster-sharding" % akkaversion,
  "com.typesafe.akka" %% "akka-persistence" % akkaversion,
  "com.typesafe.akka" %% "akka-contrib" % akkaversion,
  "org.iq80.leveldb" % "leveldb" % "0.7",
  "org.fusesource.leveldbjni" % "leveldbjni-all" % "1.8")
resources/sharding.conf
akka.actor.warn-about-java-serializer-usage = off
akka.log-dead-letters-during-shutdown = off
akka.log-dead-letters = off

akka {
  loglevel = INFO
  actor {
    provider = "akka.cluster.ClusterActorRefProvider"
  }

  remote {
    log-remote-lifecycle-events = off
    netty.tcp {
      hostname = "127.0.0.1"
      port = 0
    }
  }

  cluster {
    seed-nodes = [
      "akka.tcp://ShardingSystem@127.0.0.1:2551"]
    log-info = off
  }

  persistence {
    journal.plugin = "akka.persistence.journal.leveldb-shared"
    journal.leveldb-shared.store {
      # DO NOT USE 'native = off' IN PRODUCTION !!!
      native = off
      dir = "target/shared-journal"
    }
    snapshot-store.plugin = "akka.persistence.snapshot-store.local"
    snapshot-store.local.dir = "target/snapshots"
  }
}
Calculator.scala
package clustersharding.entity

import akka.actor._
import akka.cluster._
import akka.persistence._
import scala.concurrent.duration._
import akka.cluster.sharding._

object Calculator {
  sealed trait Command
  case class Num(d: Double) extends Command
  case class Add(d: Double) extends Command
  case class Sub(d: Double) extends Command
  case class Mul(d: Double) extends Command
  case class Div(d: Double) extends Command
  case object ShowResult extends Command

  sealed trait Event
  case class SetResult(d: Any) extends Event

  def getResult(res: Double, cmd: Command): Double = cmd match {
    case Num(x) => x
    case Add(x) => res + x
    case Sub(x) => res - x
    case Mul(x) => res * x
    case Div(x) =>
      // Double division by zero yields Infinity instead of throwing, so fail explicitly
      if (x == 0.0) throw new ArithmeticException("/ by zero")
      res / x
    // commands that carry no operand are not arithmetic operations
    case _ => throw new ArithmeticException("Invalid Operation!")
  }

  case class State(result: Double) {
    def updateState(evt: Event): State = evt match {
      case SetResult(n) => copy(result = n.asInstanceOf[Double])
    }
  }

  case object Disconnect extends Command //exit cluster

  def props = Props(new Calcultor)
}

class Calcultor extends PersistentActor with ActorLogging {
  import Calculator._

  val cluster = Cluster(context.system)
  var state: State = State(0)

  override def persistenceId: String = self.path.parent.name+"-"+self.path.name

  override def receiveRecover: Receive = {
    case evt: Event => state = state.updateState(evt)
    case SnapshotOffer(_,st: State) => state = state.copy(result = st.result)
  }

  override def receiveCommand: Receive = {
    case Num(n) => persist(SetResult(getResult(state.result,Num(n))))(evt => state = state.updateState(evt))
    case Add(n) => persist(SetResult(getResult(state.result,Add(n))))(evt => state = state.updateState(evt))
    case Sub(n) => persist(SetResult(getResult(state.result,Sub(n))))(evt => state = state.updateState(evt))
    case Mul(n) => persist(SetResult(getResult(state.result,Mul(n))))(evt => state = state.updateState(evt))
    case Div(n) => persist(SetResult(getResult(state.result,Div(n))))(evt => state = state.updateState(evt))
    case ShowResult => log.info(s"Result on ${cluster.selfAddress.hostPort} is: ${state.result}")
    case Disconnect =>
      log.info(s"${cluster.selfAddress} is leaving cluster!!!")
      cluster.leave(cluster.selfAddress)
  }

  override def preRestart(reason: Throwable, message: Option[Any]): Unit = {
    log.info(s"Restarting calculator: ${reason.getMessage}")
    super.preRestart(reason, message)
  }
}

class CalcSupervisor extends Actor {
  // Resume (keep state, skip the failing message) on arithmetic errors
  def decider: PartialFunction[Throwable,SupervisorStrategy.Directive] = {
    case _: ArithmeticException => SupervisorStrategy.Resume
  }

  override def supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 5, withinTimeRange = 5.seconds){
      decider.orElse(SupervisorStrategy.defaultDecider)
    }

  val calcActor = context.actorOf(Calculator.props,"calculator")

  // forward everything to the supervised calculator, preserving the original sender
  override def receive: Receive = {
    case msg@ _ => calcActor.forward(msg)
  }
}

object CalculatorShard {
  import Calculator._

  case class CalcCommands(eid: String, msg: Command) //user should use it to talk to shardregion

  val shardName = "calcShard"

  val getEntityId: ShardRegion.ExtractEntityId = {
    case CalcCommands(id,msg) => (id,msg)
  }

  val getShardId: ShardRegion.ExtractShardId = {
    case CalcCommands(id,_) => id.head.toString
  }

  def entityProps = Props(new CalcSupervisor)
}
CalcShard.scala
package clustersharding.shard

import akka.persistence.journal.leveldb._
import akka.actor._
import akka.cluster.sharding._
import com.typesafe.config.ConfigFactory
import akka.util.Timeout
import scala.concurrent.duration._
import akka.pattern._
import clustersharding.entity.CalculatorShard

object CalcShards {
  def create(port: Int) = {
    val config = ConfigFactory.parseString(s"akka.remote.netty.tcp.port=${port}")
      .withFallback(ConfigFactory.load("sharding"))
    // Create an Akka system
    val system = ActorSystem("ShardingSystem", config)
    startupSharding(port,system)
  }

  def startupSharedJournal(system: ActorSystem, startStore: Boolean, path: ActorPath): Unit = {
    // Start the shared journal on one node (don't crash this SPOF)
    // This will not be needed with a distributed journal
    if (startStore)
      system.actorOf(Props[SharedLeveldbStore], "store")
    // register the shared journal
    import system.dispatcher
    implicit val timeout = Timeout(15.seconds)
    val f = (system.actorSelection(path) ? Identify(None))
    f.onSuccess {
      case ActorIdentity(_, Some(ref)) =>
        SharedLeveldbJournal.setStore(ref, system)
      case _ =>
        system.log.error("Shared journal not started at {}", path)
        system.terminate()
    }
    f.onFailure {
      case _ =>
        system.log.error("Lookup of shared journal at {} timed out", path)
        system.terminate()
    }
  }

  def startupSharding(port: Int, system: ActorSystem) = {
    // only the node on port 2551 hosts the shared journal store
    startupSharedJournal(system, startStore = (port == 2551), path =
      ActorPath.fromString("akka.tcp://ShardingSystem@127.0.0.1:2551/user/store"))

    ClusterSharding(system).start(
      typeName = CalculatorShard.shardName,
      entityProps = CalculatorShard.entityProps,
      settings = ClusterShardingSettings(system),
      extractEntityId = CalculatorShard.getEntityId,
      extractShardId = CalculatorShard.getShardId
    )
  }
}
ClusterShardingDemo.scala
package clustersharding.demo

import akka.actor.ActorSystem
import akka.cluster.sharding._
import clustersharding.entity.CalculatorShard.CalcCommands
import clustersharding.entity._
import clustersharding.shard.CalcShards
import com.typesafe.config.ConfigFactory

object ClusterShardingDemo extends App {

  CalcShards.create(2551)
  CalcShards.create(0)
  CalcShards.create(0)
  CalcShards.create(0)

  Thread.sleep(1000)

  val shardingSystem = ActorSystem("ShardingSystem",ConfigFactory.load("sharding"))
  CalcShards.startupSharding(0,shardingSystem)

  Thread.sleep(1000)

  val calcRegion = ClusterSharding(shardingSystem).shardRegion(CalculatorShard.shardName)

  calcRegion ! CalcCommands("1012",Calculator.Num(13.0)) //shard 1, entity 1012
  calcRegion ! CalcCommands("1012",Calculator.Add(12.0))
  calcRegion ! CalcCommands("1012",Calculator.ShowResult) //shows address too
  calcRegion ! CalcCommands("1012",Calculator.Disconnect) //disengage cluster

  calcRegion ! CalcCommands("2012",Calculator.Num(10.0)) //shard 2, entity 2012
  calcRegion ! CalcCommands("2012",Calculator.Mul(3.0))
  calcRegion ! CalcCommands("2012",Calculator.Div(2.0))
  calcRegion ! CalcCommands("2012",Calculator.Div(0.0)) //divide by zero

  Thread.sleep(15000)
  calcRegion ! CalcCommands("1012",Calculator.ShowResult) //check if restore result on another node
  calcRegion ! CalcCommands("2012",Calculator.ShowResult)
}