Pattern matching and its kinds, analyzed with reference to the Spark source code

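There are already plenty of articles introducing pattern matching. This one classifies the kinds of patterns, so that they are easier to recognize in day-to-day development and when reading other people's code. All code in this article is quoted from the Apache Spark source.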

1. Wildcard pattern

The wildcard pattern (_) matches any object. It is typically used as the default, catch-all alternative, for example:

/** Returns an `akka.tcp://...` URL for the Master actor given a sparkUrl `spark://host:ip`. */
def toAkkaUrl(sparkUrl: String): String = {
  sparkUrl match {
    case sparkUrlRegex(host, port) =>
      "akka.tcp://%s@%s:%s/user/%s".format(systemName, host, port, actorName)
    case _ =>
      throw new SparkException("Invalid master URL: " + sparkUrl)
  }
}

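When sparkUrl matches the regular-expression extractor sparkUrlRegex, host and port are bound to its capture groups and formatted into an Akka URL. Otherwise the wildcard case throws an "Invalid master URL" exception.

The wildcard pattern can also be used to ignore the parts of an object you do not care about, for example: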
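To try the same shape outside Spark, here is a minimal sketch. The regex SparkUrl and the method hostAndPort are hypothetical names made up for illustration, not Spark's own definitions; the point is only that a regex extractor case followed by case _ gives a "parse or reject" function.

```scala
object MasterUrlDemo {
  // Hypothetical regex modeled on the spark://host:port shape (not copied from Spark).
  private val SparkUrl = "spark://([^:]+):([0-9]+)".r

  def hostAndPort(url: String): (String, Int) = url match {
    // The regex extractor binds one variable per capture group.
    case SparkUrl(host, port) => (host, port.toInt)
    // The wildcard catches every string the regex did not match.
    case _ => throw new IllegalArgumentException("Invalid master URL: " + url)
  }
}
```

A regex pattern has to match the whole input string, so anything without the spark:// prefix falls through to the wildcard case.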

private def shouldCompress(blockId: BlockId): Boolean = {
  blockId match {
    case _: ShuffleBlockId => compressShuffle
    case _: BroadcastBlockId => compressBroadcast
    case _: RDDBlockId => compressRdds
    case _: TempLocalBlockId => compressShuffleSpill
    case _: TempShuffleBlockId => compressShuffle
    case _ => false
  }
}

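As soon as blockId is a ShuffleBlockId, the match returns compressShuffle, the Boolean flag that says whether shuffle blocks should be compressed. The _ in front of the type ignores the matched value itself, because only its type matters here.

2. Constant pattern

A constant pattern matches only itself. Any literal can be used as a constant, and so can any val or singleton object.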
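The wildcard can also sit inside a larger pattern to skip a single field. The sketch below uses two hypothetical case classes (ShuffleBlock and RddBlock, not Spark's block ids) to show both uses side by side.

```scala
object IgnorePartsDemo {
  // Hypothetical stand-ins for block ids, for illustration only.
  sealed trait Block
  final case class ShuffleBlock(shuffleId: Int, mapId: Int) extends Block
  final case class RddBlock(rddId: Int, splitIndex: Int) extends Block

  def describe(b: Block): String = b match {
    case _: ShuffleBlock => "shuffle block"   // the value is ignored, only its type is tested
    case RddBlock(id, _) => s"rdd block $id"  // the second field is ignored
  }
}
```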

    persistenceEngine = RECOVERY_MODE match {
      case "ZOOKEEPER" =>
        logInfo("Persisting recovery state to ZooKeeper")
        new ZooKeeperPersistenceEngine(SerializationExtension(context.system), conf)
      case "FILESYSTEM" =>
        logInfo("Persisting recovery state to directory: " + RECOVERY_DIR)
        new FileSystemPersistenceEngine(RECOVERY_DIR, SerializationExtension(context.system))
      case _ =>
        new BlackHolePersistenceEngine()
    }

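When RECOVERY_MODE is "ZOOKEEPER", persistenceEngine (the persistence engine) becomes a ZooKeeperPersistenceEngine object.

3. Variable pattern

A variable pattern is similar to the wildcard in that it matches any object. The difference from the wildcard is that Scala binds the variable to the matched object.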
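The Spark excerpt only shows string literals. As a small sketch of the other constants mentioned above, the following uses a val and a singleton object as patterns; MaxRetries and Shutdown are hypothetical names chosen for the example.

```scala
object ConstantPatternDemo {
  val MaxRetries = 3      // a val whose name starts with an uppercase letter
  case object Shutdown    // a singleton object

  def describe(x: Any): String = x match {
    case 0           => "the literal 0"
    case "ZOOKEEPER" => "a string literal"
    case MaxRetries  => "the val MaxRetries"
    case Shutdown    => "the singleton Shutdown"
    case Nil         => "the empty list"
    case _           => "something else"
  }
}
```

Each constant is compared with the scrutinee using ==, so describe(3) selects the MaxRetries case.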

private def substituteVariables(argument: String): String = argument match {
  case "{{WORKER_URL}}" => workerUrl
  case other => other
}

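Here other is bound to whatever value argument holds when it is not "{{WORKER_URL}}".

How does Scala tell a constant pattern from a variable pattern? There is one important syntactic rule: a simple name that starts with a lowercase letter is treated as a variable pattern, and every other reference is treated as a constant pattern. A lowercase name enclosed in backticks is also treated as a constant.

4. Constructor pattern

Its syntax is: SimplePattern ::= StableId '(' [Patterns [',']] ')'. It consists of a name, StableId, followed by a number of patterns, Patterns, in parentheses.

If the name designates a case class, the pattern first checks that the object is an instance of that case class, and then checks that the object's constructor parameters match the additional patterns supplied.

This kind of pattern is everywhere in the Spark source, where it is used to handle Akka messages.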
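The lowercase rule is easy to trip over, so here is a minimal sketch of it. The names expected, shadowing and stable are made up for the example.

```scala
object VariableVsConstantDemo {
  val expected = 42

  // `expected` starts with a lowercase letter, so this is a variable pattern:
  // it matches every x and shadows the outer val inside the case body.
  def shadowing(x: Int): String = x match {
    case expected => s"bound $expected"
  }

  // Backticks turn the same name into a constant pattern that
  // compares x with the outer val.
  def stable(x: Int): String = x match {
    case `expected` => "equals expected"
    case other      => s"other value $other"
  }
}
```

shadowing(7) returns "bound 7" even though the outer expected is 42, while stable(7) falls through to the variable pattern other.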

private[scheduler] sealed trait DAGSchedulerEvent

private[scheduler] case class JobSubmitted(
    jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    allowLocal: Boolean,
    callSite: CallSite,
    listener: JobListener,
    properties: Properties = null)
  extends DAGSchedulerEvent


 def receive = {
    case JobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, allowLocal, callSite,
        listener, properties)

    ...
  }
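A stripped-down sketch of the same idea, without Akka. The event classes below are simplified, hypothetical versions written for this example (the real JobSubmitted has many more fields); they show that the patterns inside the parentheses can themselves be constants or variables.

```scala
object ConstructorPatternDemo {
  // Simplified, hypothetical events for illustration only.
  sealed trait Event
  final case class JobSubmitted(jobId: Int, partitions: List[Int]) extends Event
  final case class JobCancelled(jobId: Int) extends Event
  case object StopAll extends Event

  def handle(e: Event): String = e match {
    case JobSubmitted(id, Nil)   => s"job $id has no partitions" // nested constant pattern
    case JobSubmitted(id, parts) => s"job $id with ${parts.size} partitions"
    case JobCancelled(id)        => s"cancel job $id"
    case StopAll                 => "stop everything"
  }
}
```

Because Event is sealed, the compiler can check that the match covers every case.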

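5. Sequence pattern

Sequences such as List or Array can be matched in the same way as case classes. You can list any number of element patterns, and you can put _* as the last element of the pattern to match however many elements remain. Note that the excerpt below matches rdd against the type BlockRDD[_], so strictly speaking it illustrates a type pattern (section 7) rather than a sequence pattern.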

    if (unpersistData) {
      logDebug("Unpersisting old RDDs: " + oldRDDs.values.map(_.id).mkString(", "))
      oldRDDs.values.foreach { rdd =>
        rdd.unpersist(false)
        // Explicitly remove blocks of BlockRDD
        rdd match {
          case b: BlockRDD[_] =>
            logInfo("Removing blocks of RDD " + b + " of time " + time)
            b.removeBlocks()
          case _ =>
        }
      }
    }
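To show fixed-length sequence patterns and _* directly, here is a minimal sketch on plain Seq, Array and List values. The method names are made up for the example.

```scala
object SequencePatternDemo {
  def describe(xs: Seq[Int]): String = xs match {
    case Seq()          => "empty"
    case Seq(only)      => s"exactly one element: $only"
    case Seq(first, _*) => s"starts with $first" // _* matches the remaining elements
  }

  // Arrays are matched the same way.
  def firstArg(args: Array[String]): Option[String] = args match {
    case Array(first, _*) => Some(first)
    case _                => None
  }

  // Lists additionally support the :: (cons) pattern.
  def sumFirstTwo(xs: List[Int]): Int = xs match {
    case a :: b :: _ => a + b
    case _           => 0
  }
}
```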

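6. Tuple pattern

Its syntax is: SimplePattern ::= '(' [Patterns [',']] ')'. A tuple pattern (p1,...,pn) is really an alias for the constructor pattern scala.TupleN(p1,...,pn), where N is the number of elements n (n>=2). A trailing comma may also be added: (p1,...,pn,). The empty tuple () is the unique value of type scala.Unit.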

    (clusterManager, deployMode) match {
      case (MESOS, CLUSTER) =>
        printErrorAndExit("Cluster deploy mode is currently not supported for Mesos clusters.")
      case (_, CLUSTER) if args.isPython =>
        printErrorAndExit("Cluster deploy mode is currently not supported for python applications.")
      case (_, CLUSTER) if isShell(args.primaryResource) =>
        printErrorAndExit("Cluster deploy mode is not applicable to Spark shells.")
      case _ =>
    }
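The same structure in a self-contained sketch: two values are paired up and matched together, mixing constants, wildcards and variables. The string values and the method check are hypothetical simplifications of the Spark code above.

```scala
object TuplePatternDemo {
  def check(manager: String, mode: String, isPython: Boolean): String =
    (manager, mode) match {
      case ("mesos", "cluster")       => "cluster mode is not supported on mesos"
      case (_, "cluster") if isPython => "cluster mode is not supported for python"
      case (m, d)                     => s"ok: $m/$d" // binds both components
    }
}
```

Matching on a tuple avoids nesting one match inside another when the decision depends on several values at once.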

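7. Type pattern

A type pattern is built from types, type variables and wildcards. It can be used as a convenient replacement for a type test followed by a type cast, as below: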

for (loc <- tasks(index).preferredLocations) {
  loc match {
    case e: ExecutorCacheTaskLocation =>
      addTo(pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer))
    case e: HDFSCacheTaskLocation => {
      val exe = sched.getExecutorsAliveOnHost(loc.host)
      exe match {
        case Some(set) => {
          for (e <- set) {
            addTo(pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer))
          }
          logInfo(s"Pending task $index has a cached location at ${e.host} " +
            ", where there are executors " + set.mkString(","))
        }
        case None => logDebug(s"Pending task $index has a cached location at ${e.host} " +
            ", but there are no executors alive there.")
      }
    }
    case _ => Unit
  }
  addTo(pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer))
  for (rack <- sched.getRackForHost(loc.host)) {
    addTo(pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer))
  }
}
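A smaller sketch of the same technique, next to the isInstanceOf/asInstanceOf code it replaces. The method names are made up for the example.

```scala
object TypePatternDemo {
  // Each case tests the runtime type and binds a variable of that type.
  def size(x: Any): Int = x match {
    case s: String    => s.length
    case xs: Seq[_]   => xs.size // type arguments are written as wildcards
    case m: Map[_, _] => m.size
    case _            => -1
  }

  // The equivalent test-then-cast version, for the String case only.
  def sizeWithCasts(x: Any): Int =
    if (x.isInstanceOf[String]) x.asInstanceOf[String].length else -1
}
```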

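8. Variable binding

This item and the next one are not really separate kinds of pattern. I list them here anyway because they matter so much in pattern matching and because many people find them confusing.

The syntax is simple: write the variable name, then @, then the pattern, as below: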

  def removeBroadcast(broadcastId: Long, tellMaster: Boolean): Int = {
    logInfo(s"Removing broadcast $broadcastId")
    val blocksToRemove = blockInfo.keys.collect {
      case bid @ BroadcastBlockId(`broadcastId`, _) => bid
    }
    blocksToRemove.foreach { blockId => removeBlock(blockId, tellMaster) }
    blocksToRemove.size
  }

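When a key matches BroadcastBlockId(`broadcastId`, _), bid is bound to the whole matched object. The backticks make `broadcastId` refer to the method parameter instead of introducing a new variable.

9. Pattern guard

A pattern guard is useful when a pattern alone is not precise enough. Scala also requires patterns to be linear: a pattern variable may appear only once in a pattern, so repeating a variable to say "these two parts must be equal" is a compile error. The fix in both cases is a pattern guard, an if condition written after the pattern and before =>, which restates the rule as a Boolean test. For example: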
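A self-contained sketch of the same collect call. The case class here is a hypothetical two-field stand-in named after Spark's BroadcastBlockId, and select is a made-up method.

```scala
object BindingDemo {
  // Hypothetical stand-in for Spark's BroadcastBlockId, for illustration only.
  final case class BroadcastBlockId(broadcastId: Long, field: String)

  def select(ids: Seq[Any], broadcastId: Long): Seq[BroadcastBlockId] =
    ids.collect {
      // bid names the whole object; the pattern to the right of @ still
      // constrains its first field to equal the broadcastId parameter.
      case bid @ BroadcastBlockId(`broadcastId`, _) => bid
    }
}
```

Without the @ binding, the case body would have to rebuild the object from its fields in order to return it.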

  def main(args: Array[String]) {
    args.length match {
      case x if x < 5 =>
        System.err.println(
          // Worker url is used in spark standalone mode to enforce fate-sharing with worker
          "Usage: CoarseGrainedExecutorBackend <driverUrl> <executorId> <hostname> " +
          "<cores> <appid> [<workerUrl>] ")
        System.exit(1)

      // NB: These arguments are provided by SparkDeploySchedulerBackend (for standalone mode)
      // and CoarseMesosSchedulerBackend (for mesos mode).
      case 5 =>
        run(args(0), args(1), args(2), args(3).toInt, args(4), None)
      case x if x > 5 =>
        run(args(0), args(1), args(2), args(3).toInt, args(4), Some(args(5)))
    }
  }
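The linearity restriction is easiest to see on a pair. In the sketch below (method names made up for the example), writing case (x, x) is rejected by the compiler, and a guard expresses the same condition.

```scala
object GuardDemo {
  // case (x, x) => ... does not compile: x may be bound only once per pattern.
  def sameOrNot(pair: (Int, Int)): String = pair match {
    case (x, y) if x == y => s"both are $x" // the guard runs after the pattern matches
    case (x, y)           => s"$x and $y differ"
  }

  // Guards also narrow a variable pattern to a range, as in the Spark code above.
  def classify(n: Int): String = n match {
    case x if x < 5 => "too few"
    case 5          => "exactly five"
    case _          => "more than five"
  }
}
```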

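Finally, when writing a pattern match, put specific cases before general ones and partial cases before catch-all ones. Otherwise some cases may never be matched, although the compiler may warn you when that happens.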

posted on 2015-03-09 23:31 Ai_togic