Spark operators

1. The sortBy operator

Preface: Spark sorts with a TeraSort-like algorithm: it first makes the partitions ordered relative to one another, then sorts within each partition, which yields a global order:

  • 1. Sample to determine boundaries: sample every partition, collect and sort the samples, work out the key range each output partition should cover, and produce the array of range upper bounds;

  • 2. Shuffle write, ordered across partitions: use a RangePartitioner, which computes the target partition of each record against the bounds array;

  • 3. Shuffle read, ordered within partitions: fetch the pieces of the same partition scattered across nodes and sort them, so each partition is internally ordered;
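For orientation, a minimal usage sketch (Scala, Spark 2.x; the data and app name are made up) of what the rest of this section traces through the source: sortBy only needs a key-extraction function, an optional sort direction, and an optional target partition count.

import org.apache.spark.{SparkConf, SparkContext}

object SortByDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sortByDemo").setMaster("local[2]"))
    val users = sc.parallelize(Seq(("alice", 31), ("bob", 25), ("carol", 40)), numSlices = 2)

    // Global sort by age, descending, into 2 output partitions.
    // Internally this becomes keyBy(f).sortByKey(...).values, as traced below.
    val byAge = users.sortBy(_._2, ascending = false, numPartitions = 2)
    byAge.collect().foreach(println)   // (carol,40), (alice,31), (bob,25)
    sc.stop()
  }
}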

  • 1. Build an RDD[(K, V)] from the RDD, then call the sortByKey operator;

org.apache.spark.rdd.RDD.scala   // Spark 2.2

def sortBy[K](
    f: (T) => K,   // sort by the key K returned by f; K must be comparable;
    ascending: Boolean = true,   // ascending by default;
    numPartitions: Int = this.partitions.length)   // partition count after sorting, defaults to this RDD's partition count;
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)  // build an RDD of (k, v)
      .sortByKey(ascending, numPartitions)  // (2) call sortByKey
      .values
}
  • 2. Instantiate a RangePartitioner and use it at shuffle write time to assign records to partitions, making the partitions ordered relative to each other; also set the shuffle-read key ordering so each partition is sorted internally;
org.apache.spark.rdd.OrderedRDDFunctions.scala

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)  // (3) build the RangePartitioner;
    new ShuffledRDD[K, V, V](self, part)  // the shuffle partitions the data with this RangePartitioner;
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)  // set the key ordering used during the shuffle;
  }
  • 3. Build the RangePartitioner, which samples the source RDD to obtain the boundary array of the ranges;
org.apache.spark.Partitioner.scala  // the defaultPartitioner method in this file decides the partitioner for RDD joins;
// HashPartitioner is the default partitioner in many scenarios; RangePartitioner is the one used for sorting;
inside the RangePartitioner class

// the elements of rangeBounds are the boundaries between partitions;
private var rangeBounds: Array[K] = {
  if (partitions <= 1) {
    Array.empty
  } else {
    // base sample size, capped at 1M;
    val sampleSize = math.min(20.0 * partitions, 1e6)
    // initial per-partition sample size; the input partitions are assumed to be balanced,
    // because the same number of records is sampled from every partition;
    // with default settings this works out to 60 samples per partition;
    val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
    // (4) sample; numItems is the total number of records seen,
    // sketched is [idx: Int, n: Long, sample: Array[K]] where idx is the partition id,
    // n is the number of elements in that partition and sample is the sampled keys;
    val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
    if (numItems == 0L) {
      Array.empty
    } else {
      // if a partition holds roughly more than 3x the average number of elements, it is
      // re-sampled so that the final partitions stay balanced;
      val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
      val candidates = ArrayBuffer.empty[(K, Float)]  // the balanced samples, i.e. the final result;
      val imbalancedPartitions = mutable.Set.empty[Int]  // imbalanced partitions that must be re-sampled;
      sketched.foreach { case (idx, n, sample) =>
        if (fraction * n > sampleSizePerPartition) {
          imbalancedPartitions += idx
        } else {
          // give each sampled element a weight: this key was picked out of "weight" records,
          // i.e. weight = 1 / (sampling probability of this element);
          val weight = (n.toDouble / sample.length).toFloat
          for (key <- sample) {
            candidates += ((key, weight))
          }
        }
      }
      if (imbalancedPartitions.nonEmpty) {
        // re-sample the imbalanced partitions;
        val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
        val seed = byteswap32(-rdd.id - 1)
        // sampling without replacement; expected sample count = 20 * m / n, where m is the
        // element count of the imbalanced partition and n the average per partition;
        // since m > 3n, such a partition contributes more than 60 samples on average;
        val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
        val weight = (1.0 / fraction).toFloat
        candidates ++= reSampled.map(x => (x, weight))
      }
      RangePartitioner.determineBounds(candidates, partitions)  // (5) derive the bounds from the unordered samples;
    }
  }
}
  • 4. Sampling: sample each partition of the data;
org.apache.spark.Partitioner.scala 
inside the RangePartitioner class

// sampling
def sketch[K : ClassTag](
      rdd: RDD[K],
      sampleSizePerPartition: Int): (Long, Array[(Int, Long, Array[K])]) = {
    val shift = rdd.id
    // sample partition by partition
    val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
      val seed = byteswap32(idx ^ (shift << 16))  // random seed
      val (sample, n) = SamplingUtils.reservoirSampleAndCount(   // reservoir sampling, which iterates over every record;
        iter, sampleSizePerPartition, seed)
      Iterator((idx, n, sample))  // (partition id, element count, samples), 60 samples per partition by default;
    }.collect()   // collect to the driver;
    val numItems = sketched.map(_._2).sum   // records seen while sampling, i.e. the RDD's element count;
    (numItems, sketched)
  }
  • 5. Determine the partition boundaries from the sampling result;
org.apache.spark.Partitioner.scala 
inside the RangePartitioner class

def determineBounds[K : Ordering : ClassTag](
      candidates: ArrayBuffer[(K, Float)],   // unordered (sampled key, weight) pairs;
      partitions: Int): Array[K] = {
    val ordering = implicitly[Ordering[K]]
    val ordered = candidates.sortBy(_._1)
    val numCandidates = ordered.size   // number of samples
    val sumWeights = ordered.map(_._2.toDouble).sum  // estimated total number of records in the RDD
    val step = sumWeights / partitions  // average number of records per output partition
    var cumWeight = 0.0
    var target = step
    val bounds = ArrayBuffer.empty[K]  // the final bounds, at most partitions - 1 elements
    var i = 0
    var j = 0
    var previousBound = Option.empty[K]  // the previous bound, to avoid duplicates
    while ((i < numCandidates) && (j < partitions - 1)) {
      val (key, weight) = ordered(i)
      cumWeight += weight
      if (cumWeight >= target) {  // a weight w means this sample stands for roughly w records;
        // skip duplicate bounds
        if (previousBound.isEmpty || ordering.gt(key, previousBound.get)) {
          bounds += key
          target += step
          j += 1
          previousBound = Some(key)
        }
      }
      i += 1
    }
    // note: the final partition count is bounds.length + 1, which is not necessarily equal to the
    // source RDD's partition count or the requested one;
    // an RDD with many duplicate keys can end up with bounds.length + 1 smaller than either;
    bounds.toArray
  }
  • 6. At shuffle write time, determine the partition each record belongs to;
org.apache.spark.Partitioner.scala 
inside the RangePartitioner class

def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    // this also shows that the final partition count equals rangeBounds.length + 1
    if (rangeBounds.length <= 128) {
      // with at most 128 boundaries, use a plain linear scan;
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // with more than 128, use binary search
      partition = binarySearch(rangeBounds, k)
      // a negative result means the key falls between two bounds (a, b) and should go to
      // partition b; binary search returns -b - 1 in that case, so -(-b - 1) - 1 = b
      if (partition < 0) {
        partition = -partition-1
      }
      if (partition > rangeBounds.length) {  // clamp to the last partition
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }
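As a quick check on the two points in the comments above (the output partition count is rangeBounds.length + 1, and it can be smaller than requested when there are few distinct keys), a RangePartitioner can be built and probed directly; the data below is made up and an existing SparkContext sc is assumed.

import org.apache.spark.RangePartitioner

// The constructor samples this RDD's keys to compute rangeBounds.
val pairs = sc.parallelize(1 to 1000).map(i => (i, i))
val rp = new RangePartitioner(4, pairs)

rp.numPartitions       // usually 4 here; can be less when there are few distinct keys
rp.getPartition(3)     // small keys fall into the first partitions
rp.getPartition(999)   // large keys fall into the last partition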

2. The RDD join operator

Preface: PairRDDFunctions is an extension class that adds methods to RDDs of (key, value) pairs; groupBy and sortBy are ultimately converted into (key, value) RDDs that call the corresponding groupByKey and sortByKey operators in PairRDDFunctions. Likewise, join is a method of PairRDDFunctions, and every join flavour is implemented on top of the cogroup operator, which in turn is backed by CoGroupedRDD;
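Before digging into the source, a short sketch (made-up data, existing SparkContext sc assumed) of the join flavours discussed below; output order is not guaranteed:

val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((2, "x"), (3, "y")))

left.join(right).collect()           // Array((2,(b,x)))
left.leftOuterJoin(right).collect()  // Array((1,(a,None)), (2,(b,Some(x))))
left.fullOuterJoin(right).collect()  // Array((1,(Some(a),None)), (2,(Some(b),Some(x))), (3,(None,Some(y))))
left.cogroup(right).collect()        // each key maps to (values from left, values from right)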

  • 1. Determine the join's partitioner;
org.apache.spark.rdd.PairRDDFunctions.scala

// with no extra argument, the default-partitioner logic decides which partitioner to use;
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
    // (2) call the join overload;
    join(other, defaultPartitioner(self, other))
  }
// with an explicit partition count, a HashPartitioner is used;
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))] = self.withScope {
    join(other, new HashPartitioner(numPartitions))
  }

// choosing the default partitioner (defined on the Partitioner companion object in org.apache.spark.Partitioner.scala)
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
    if (hasPartitioner.nonEmpty) {
      // if at least one RDD already has a partitioner, reuse the partitioner of the RDD
      // with the most partitions;
      hasPartitioner.maxBy(_.partitions.length).partitioner.get
    } else {
      // otherwise fall back to HashPartitioner
      if (rdd.context.conf.contains("spark.default.parallelism")) {
        // if the default parallelism is set, use it as the partition count;
        new HashPartitioner(rdd.context.defaultParallelism)
      } else {
        // otherwise use the larger partition count of the input RDDs;
        new HashPartitioner(rdds.map(_.partitions.length).max)
      }
    }
  }
  • 2. All join variants call the cogroup operator;
org.apache.spark.rdd.PairRDDFunctions.scala

// (3) all of them first cogroup into (k, (Iterable[V1], Iterable[V2])) and then filter according to the join type;
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)  // yields only when both sides have elements;
    )
  }

def leftOuterJoin[W](
      other: RDD[(K, W)],
      partitioner: Partitioner): RDD[(K, (V, Option[W]))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues { pair =>
      if (pair._2.isEmpty) {
        pair._1.iterator.map(v => (v, None))   // right side empty: return (left v, None);
      } else {
        for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))  // plus the inner-join results;
      }
    }
  }

def fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Option[V], Option[W]))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues {
      case (vs, Seq()) => vs.iterator.map(v => (Some(v), None))  // right side empty
      case (Seq(), ws) => ws.iterator.map(w => (None, Some(w)))  // left side empty
      case (vs, ws) => for (v <- vs.iterator; w <- ws.iterator) yield (Some(v), Some(w))  // plus the inner-join results
    }
  }
  • 3. The cogroup operator uses CoGroupedRDD;
org.apache.spark.rdd.PairRDDFunctions.scala

def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
    // with a HashPartitioner, the key must not be an array type;
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    // (4) build a CoGroupedRDD
    val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
    cg.mapValues { case Array(vs, w1s) =>
      (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
    }
  }
  • 4. Choose wide or narrow dependencies, and aggregate the result into (key, Array[Iterable]);
org.apache.spark.rdd.CoGroupedRDD.scala

// choose wide vs. narrow dependencies
override def getDependencies: Seq[Dependency[_]] = {
    rdds.map { rdd: RDD[_] =>
      if (rdd.partitioner == Some(part)) {
        logDebug("Adding one-to-one dependency with " + rdd)
        // if this rdd's partitioner equals the CoGroupedRDD's partitioner, it is a narrow dependency;
        new OneToOneDependency(rdd)
      } else {
        logDebug("Adding shuffle dependency with " + rdd)
        // otherwise it is a wide dependency and this rdd must be shuffled with the
        // CoGroupedRDD's partitioner;
        // hence in rdd1.join(rdd2), if rdd1 and rdd2 share the same partitioner,
        // the join does not trigger another shuffle;
        new ShuffleDependency[K, Any, CoGroupCombiner](
          rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
      }
    }
  }
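The comment in getDependencies is worth demonstrating: when both parents already use the partitioner the CoGroupedRDD ends up with, every dependency is one-to-one and the join itself adds no shuffle stage. A sketch with made-up data (sc is an existing SparkContext):

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(8)
// partitionBy shuffles once each; the partitioned RDDs can be cached and reused.
val a = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(p).cache()
val b = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(p).cache()

// defaultPartitioner picks p (both parents expose it), so getDependencies returns
// two OneToOneDependency instances and the join adds no extra shuffle.
val joined = a.join(b)
println(joined.toDebugString)   // no new ShuffledRDD between the partitioned parents and the join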

// the actual computation
override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {
    val split = s.asInstanceOf[CoGroupPartition]
    val numRdds = dependencies.length
    // ...
    // an ExternalAppendOnlyMap with a custom aggregator is used; the aggregator produces
    // (key, Array[Iterable]) where each Array element is the Iterable of same-key values
    // coming from one of the parent RDDs;
    val map = createExternalMap(numRdds)
    for ((it, depNum) <- rddIterators) {
      map.insertAll(it.map(pair => (pair._1, new CoGroupValue(pair._2, depNum))))
    }
    // ...
  }

// pick the aggregation data structure and define the aggregator
private def createExternalMap(numRdds: Int)
    : ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner] = {
    // newCombiner has type Array[CoGroup]; a CoGroup is an append-only buffer of values,
    // so newCombiner can be read as Array[Seq[T]]
    val createCombiner: (CoGroupValue => CoGroupCombiner) = value => {
      // value is (Any, Int): _2 is the index of the rdd inside rdds, _1 is the value from that rdd;
      val newCombiner = Array.fill(numRdds)(new CoGroup)
      newCombiner(value._2) += value._1
      newCombiner
    }
    val mergeValue: (CoGroupCombiner, CoGroupValue) => CoGroupCombiner =
      (combiner, value) => {
      combiner(value._2) += value._1
      combiner
    }
    // for rdd0.join(rdd1), the CoGroupedRDD's rdds is Seq[RDD](rdd0, rdd1), so combiner1 is
    // Array[Seq[T]](Seq[T0], Seq[T1]) where Seq[T0] is the list of rdd0's values for the key;
    val mergeCombiners: (CoGroupCombiner, CoGroupCombiner) => CoGroupCombiner =
      (combiner1, combiner2) => {
        var depNum = 0
        while (depNum < numRdds) {
          combiner1(depNum) ++= combiner2(depNum)
          depNum += 1
        }
        combiner1
      }
    new ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner](
      createCombiner, mergeValue, mergeCombiners)
  }

3. The persist operator

Preface: persist stores the RDD's data in the BlockManager's memoryStore and diskStore according to the chosen storage level; cache simply calls persist. The default storage level of an RDD is MEMORY_ONLY, while a Dataset defaults to MEMORY_AND_DISK. Calling persist only tags the RDD's partitions with a storage level: a partition is actually cached only after a task has computed it. A storage level can be assigned to an RDD only once; trying to change it afterwards raises an exception;
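A small usage sketch of the behaviour described above (lazy marking; a level can be assigned only once); the input path is hypothetical and sc is an existing SparkContext:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("/tmp/app.log")              // hypothetical input path
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_AND_DISK)        // only tags the RDD; nothing is stored yet
errors.count()                                      // first action: computed partitions are cached
errors.filter(_.contains("timeout")).count()        // served from the BlockManager, no recompute

// errors.persist(StorageLevel.DISK_ONLY)           // would throw: the level can only be set once
errors.unpersist()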

  • 1. The RDD checks its storage level;
org.apache.spark.rdd.RDD.scala

final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {  // a storage level has been set
      getOrCompute(split, context)  // read from the cache, or compute if missing
    } else {  // no caching (the default)
      computeOrReadCheckpoint(split, context)  // compute, or read the checkpoint;
    }
  }

private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
    // every partition of an rdd maps to one unique block id (rdd_id, partition_id)
    val blockId = RDDBlockId(id, partition.index) 
    // (2) call BlockManager.getOrElseUpdate to fetch the cached block, or recompute it with the given function;
    SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
      readCachedBlock = false
      computeOrReadCheckpoint(partition, context)
    }) match {
      // Left means the block was successfully put into (or found in) the BlockManager, so return it
      case Left(blockResult) =>
        if (readCachedBlock) {  // the block was already cached and read straight from the BlockManager;
          // ...
          new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
            // ...
          }
        } else {  // the block was just computed, put into the BlockManager, and read back;
          new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
        }
      // Right means the put into the BlockManager failed (e.g. level MEMORY_ONLY but not enough memory);
      // such a block is effectively uncached and must be recomputed every time;
      case Right(iter) =>
        new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
    }
  }
  • 2. Fetch the block from the BlockManager, or recompute it;
org.apache.spark.storage.BlockManager.scala

def getOrElseUpdate[T](
      blockId: BlockId,
      level: StorageLevel,
      classTag: ClassTag[T],
      makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
    // first try to read the block from the BlockManager; if present, return it, otherwise compute it;
    get[T](blockId)(classTag) match {
      case Some(block) =>
        return Left(block)
      case _ =>
    }
    // (3) compute the block and put it into the BlockManager
    doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
      // None means the computed block was stored successfully, so read it back from the BlockManager;
      case None =>
        val blockResult = getLocalValues(blockId).getOrElse {
          releaseLock(blockId)
          throw new SparkException(s"get() failed for block $blockId even though we held a lock")
        }
        releaseLock(blockId)
        Left(blockResult)
      // Some(iterator) means the put failed for lack of memory or disk space;
      // the block will have to be recomputed the next time it is used;
      case Some(iter) =>
       Right(iter)
    }
  }
  • 3. Put the block into the BlockManager;
org.apache.spark.storage.BlockManager.scala

private def doPutIterator[T](
      blockId: BlockId,
      iterator: () => Iterator[T],
      level: StorageLevel,
      classTag: ClassTag[T],
      tellMaster: Boolean = true,
      keepReadLock: Boolean = false): Option[PartiallyUnrolledIterator[T]] = {
    doPut(blockId, level, classTag, tellMaster = tellMaster, keepReadLock = keepReadLock) { info =>
      // the storage level uses memory (MEMORY_ONLY, MEMORY_AND_DISK, ...)
      if (level.useMemory) {
        // deserialized means the data is stored as Java objects;
        if (level.deserialized) {
          // cache the objects in memory directly; no deserialization is needed on read
          memoryStore.putIteratorAsValues(blockId, iterator(), classTag) match {
            // enough memory
            case Right(s) =>
              size = s
            // not enough memory
            case Left(iter) =>
              // the storage level also uses disk
              if (level.useDisk) {
                diskStore.put(blockId) { /* ... */ }
              } else {
                // no memory left and no disk fallback: the block is not cached and its
                // computed iterator is handed back;
                iteratorFromFailedMemoryStorePut = Some(iter)
              }
          }
        } else { // stored in serialized form: serialize before storing, deserialize on read
          memoryStore.putIteratorAsBytes(blockId, iterator(), classTag, level.memoryMode) match {
            case Right(s) =>
              size = s
            case Left(partiallySerializedValues) =>
              // same steps as above
              if (level.useDisk) {
                diskStore.put(blockId) { /* ... */ }
                size = diskStore.getSize(blockId)
              } else {
                iteratorFromFailedMemoryStorePut = Some(partiallySerializedValues.valuesIterator)
              }
          }
        }

      } else if (level.useDisk) {   // disk-only storage level
        diskStore.put(blockId) { /* ... */ }
      }
      // ... when the storage level has a replication factor > 1, replicate the block
      // if the block was not cached, return its iterator so downstream code can still consume it;
      iteratorFromFailedMemoryStorePut
    }
  }

4. The coalesce operator

Preface: the shuffle flag of coalesce decides whether repartitioning triggers a shuffle. repartition simply calls coalesce with shuffle = true, while coalesce itself defaults to shuffle = false. The rest of this section walks through the shuffle = false path;
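A quick sketch of the two code paths (made-up sizes, existing SparkContext sc assumed):

val wide = sc.parallelize(1 to 1000000, numSlices = 1000)

// shuffle = false (default): narrow dependency, each output task reads about 10 parent partitions.
val narrowed = wide.coalesce(100)

// Increasing the partition count, or shrinking it by orders of magnitude, needs shuffle = true;
// repartition(n) is simply coalesce(n, shuffle = true).
val rebalanced  = wide.repartition(2000)
val drastically = wide.coalesce(10, shuffle = true)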

  • 1. Decide whether to shuffle: use a shuffled repartition when increasing the partition count or reducing it drastically; when only slightly reducing the partition count, no shuffle is necessary;
org.apache.spark.rdd.RDD.scala

def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null): RDD[T] = withScope {
    if (shuffle) {
      // shuffled path
      val distributePartition = (index: Int, items: Iterator[T]) => {
        // start from a random position and spread records round-robin over the partitions;
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      // (2) no shuffle
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
  • 2. CoalescedRDD: grouping the parent partitions and computing them;
org.apache.spark.rdd.CoalescedRDD.scala

override def getPartitions: Array[Partition] = {
    // the coalescer is empty by default
    val pc = partitionCoalescer.getOrElse(new DefaultPartitionCoalescer())
    // DefaultPartitionCoalescer.coalesce returns Array[PartitionGroup];
    // each PartitionGroup is a set of parent partitions, and one task processes one group;
    // when assigning parent partitions to groups, DefaultPartitionCoalescer tries to keep
    // the groups balanced and locality-friendly;
    pc.coalesce(maxPartitions, prev).zipWithIndex.map {
      case (pg, i) =>
        val ids = pg.partitions.map(_.index).toArray
        new CoalescedRDDPartition(i, prev, ids, pg.prefLoc)
    }
  }

override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
    // CoalescedRDDPartition.parents is the set of parent partitions this task processes;
    // so with 1000 parent partitions and coalesce(10), one child task reads 100 parent partitions;
    // hence when the target partition count is much smaller than the parent's (by 1-2 orders of
    // magnitude), a shuffled repartition is usually the better choice;
    partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
      firstParent[T].iterator(parentPartition, context)
    }
  }

5. How shuffle write works

Preface: Spark's sort-based shuffle is similar to MapReduce's: data is first written to memory, spilled to disk when memory runs out, and when the last batch has been processed it is merged with the earlier spill files into one large file sorted by (partitionId, key); an index file is also produced to record where each partition starts and ends. The sort-shuffle write path is:

  • 1. Spill: data goes to memory first and is spilled to disk when memory is insufficient; before spilling, the records are sorted by (partitionId, key);

  • 2. Merge: once the last batch has been processed, all spill files are merged into one file sorted by (partitionId, key), plus an index file recording each partition's start and end offsets;

  • PS: Spark 2.2 ships three shuffle-write implementations: SortShuffleWriter, BypassMergeSortShuffleWriter, and UnsafeShuffleWriter (tungsten-sort);

  • 1. Generate the tasks: only the last stage of a job is a ResultStage; all others are ShuffleMapStages (which repartition and persist their output), so shuffle write happens in ShuffleMapStages and the corresponding task type is ShuffleMapTask;

org.apache.spark.scheduler.DAGScheduler.scala
inside the submitMissingTasks method

val tasks: Seq[Task[_]] = try {
      val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
      stage match {
        case stage: ShuffleMapStage =>
          stage.pendingPartitions.clear()
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = stage.rdd.partitions(id)
            stage.pendingPartitions += id
            // (2) instantiate the ShuffleMapTask; its runTask method will be invoked;
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
              Option(sc.applicationId), sc.applicationAttemptId)
          }

        case stage: ResultStage =>
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = stage.rdd.partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, id, properties, serializedTaskMetrics,
              Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
          }
      }
    }
  • 2. Obtain the shuffleManager object and call its write method to start writing the data;
org.apache.spark.scheduler.ShuffleMapTask.scala

override def runTask(context: TaskContext): MapStatus = {
   // some code omitted...

    var writer: ShuffleWriter[Any, Any] = null
    try {
      // (3) get the shuffleManager instance from SparkEnv;
      val manager = SparkEnv.get.shuffleManager 
      // (4) pick the writer implementation
      writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
      // (5) start writing the data
      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      writer.stop(success = true).get
    } catch {
       // some code omitted...
    }
  }
  • 3. Choose the ShuffleManager, i.e. how shuffle write is performed; the default is SortShuffleManager;
org.apache.spark.SparkEnv.scala
inside the create method

val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  // sorting and spilling operate directly on serialized binary data instead of Java objects,
  // which is faster but comes with more restrictions;
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")  // SortShuffleManager by default
val shuffleMgrClass =
  shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase(Locale.ROOT), shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)  // instantiate the shuffleManager object
  • 4. Determine the ShuffleHandle implementation, which selects the shuffle-write implementation; a configuration sketch follows the snippet below;
org.apache.spark.shuffle.sort.SortShuffleManager.scala

// called when the ShuffleDependency registers the shuffle, to pick the handle
override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // 1. mapSideCombine is false, i.e. no map-side aggregation is possible;
      // 2. and the partition count is below spark.shuffle.sort.bypassMergeThreshold (default 200);
      // because along the way each task opens one file per partition - they are merged at the end,
      // but the number of intermediate files is still large;
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // 1. the dependency has no aggregation and does not sort its output;
      // 2. the serializer supports relocation of serialized data (Kryo and Spark SQL's custom serializers do);
      // 3. the partition count is below 16777216 (2^24);
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // otherwise BaseShuffleHandle; the two handles above also extend BaseShuffleHandle;
      new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
  }

// pick the writer implementation according to the handle chosen at registration
override def getWriter[K, V](/* params omitted... */): ShuffleWriter[K, V] = {
    val env = SparkEnv.get
    handle match {
      case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
        // the "serialized" shuffle, i.e. the tungsten-sort implementation; older versions
        // required an explicit setting to use it;
        new UnsafeShuffleWriter(/* params omitted... */)
      case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
        // sort shuffle with the bypass mechanism enabled
        new BypassMergeSortShuffleWriter(/* params omitted... */)
      case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
        // the default SortShuffleWriter
        new SortShuffleWriter(/* params omitted... */)
    }
  }
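The branches above are driven by the dependency (map-side combine, serializer, partition count) and by configuration. A hedged configuration sketch; the values are illustrative, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // "sort" (the default) and "tungsten-sort" both map to SortShuffleManager in Spark 2.x.
  .set("spark.shuffle.manager", "sort")
  // The bypass path is taken only when there is no map-side combine AND the number of
  // reduce partitions is at most this threshold (default 200).
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")
  // Kryo supports relocation of serialized data, one precondition of the
  // SerializedShuffleHandle / UnsafeShuffleWriter path.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")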
  • 5. Write the data with the selected shuffle-write implementation; only SortShuffleWriter is covered here;
org.apache.spark.shuffle.sort.SortShuffleWriter.scala

override def write(records: Iterator[Product2[K, V]]): Unit = {
    sorter = if (dep.mapSideCombine) {
      // if the lineage contains an aggregating operator, aggregation happens on the map side,
      // i.e. before shuffle write;
      require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
      new ExternalSorter[K, V, C](
        context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
    } else {
      new ExternalSorter[K, V, V](
        context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
    }
    // (6) insert the records into the in-memory buffer, spilling to disk when needed (the main tuning target)
    sorter.insertAll(records)

    val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
    val tmp = Utils.tempFileWith(output)
    try {
      val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
      // merge-sort the in-memory data and all spill files into one data file ordered by (partitionId, key);
      val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
      // write the index file recording each partition's start and end offsets;
      shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
      mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
      }
    }
  }
  • 6. Insert the records into the in-memory structure, spilling to disk when needed (the main tuning target);

    org.apache.spark.util.collection.ExternalSorter.scala
    // both shuffle write and shuffle read use this class to buffer data in memory, spill it to disk, and finally merge the spills;
    
    // insert one partition's records
    def insertAll(records: Iterator[Product2[K, V]]): Unit = {
        val shouldCombine = aggregator.isDefined
        // choose the buffering data structure: a map (PartitionedAppendOnlyMap) when the dependency
        // aggregates (e.g. reduceByKey), otherwise a buffer (PartitionedPairBuffer, e.g. join);
        // both are backed by an array with data(2*n) = (partId, key) and data(2*n+1) = value;
        if (shouldCombine) {
          // build the aggregation function; values are aggregated as they are read;
          val mergeValue = aggregator.get.mergeValue
          val createCombiner = aggregator.get.createCombiner
          var kv: Product2[K, V] = null
          val update = (hadValue: Boolean, oldValue: C) => {
            if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
          }
          while (records.hasNext) {
            addElementsRead()
            kv = records.next()
            map.changeValue((getPartition(kv._1), kv._1), update)
            // *adding this record may trigger a spill
            maybeSpillCollection(usingMap = true)
          }
        } else {
          while (records.hasNext) {
            addElementsRead()
            val kv = records.next()
            buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
            maybeSpillCollection(usingMap = false)
          }
        }
      }
    
    // may spill
    private def maybeSpillCollection(usingMap: Boolean): Unit = {
        var estimatedSize = 0L
        if (usingMap) {
          // (6.1) estimate the map's memory footprint (not the real size but an extrapolation);
          // because it is only an estimate, the real usage can be much larger and an OOM can
          // happen before any spill is triggered;
          estimatedSize = map.estimateSize()
          // (6.2) possibly spill
          if (maybeSpill(map, estimatedSize)) {
            // after a spill, re-instantiate the map;
            map = new PartitionedAppendOnlyMap[K, C]
          }
        } else {
          estimatedSize = buffer.estimateSize()
          if (maybeSpill(buffer, estimatedSize)) {
            buffer = new PartitionedPairBuffer[K, C]
          }
        }

        if (estimatedSize > _peakMemoryUsedBytes) {
          _peakMemoryUsedBytes = estimatedSize
        }
      }
    
    
    
    • 6.1. Estimate the map's memory footprint: measuring the real size takes a few milliseconds, which is far too slow to do for every one of hundreds of millions of records;

      // callback, invoked once per updated record
      protected def afterUpdate(): Unit = {
          numUpdates += 1
          if (nextSampleNum == numUpdates) {
            // measure the real size again once the update count has grown by a factor of 1.1
            takeSample()
          }
        }

      // measure the real memory footprint
      private def takeSample(): Unit = {
          samples.enqueue(Sample(SizeEstimator.estimate(this), numUpdates))
          // keep only the two most recent measurements
          if (samples.size > 2) {
            samples.dequeue()
          }
          // average number of bytes per record
          val bytesDelta = samples.toList.reverse match {
            case latest :: previous :: tail =>
              // bytes added between the two measurements / records added between them
              (latest.size - previous.size).toDouble / (latest.numUpdates - previous.numUpdates)
            case _ => 0
          }
          // average number of bytes per record
          bytesPerUpdate = math.max(0, bytesDelta)
          // next measurement point: current update count * 1.1;
          nextSampleNum = math.ceil(numUpdates * SAMPLE_GROWTH_RATE).toLong
        }

      // extrapolate the current size
      def estimateSize(): Long = {
          assert(samples.nonEmpty)
          // extrapolated growth = average bytes per record * records added since the last measurement
          val extrapolatedDelta = bytesPerUpdate * (numUpdates - samples.last.numUpdates)
          // last measured size + extrapolated growth
          (samples.last.size + extrapolatedDelta).toLong
        }
      
    • 6.2. If no more memory can be obtained, trigger a spill; a configuration sketch follows this snippet;

      // may trigger a spill
      protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
          var shouldSpill = false
          // checked once every 32 records, and only when the current size has reached the current threshold;
          if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
            // try to acquire enough execution memory to double the current size;
            val amountToRequest = 2 * currentMemory - myMemoryThreshold
            // the amount actually granted
            val granted = acquireMemory(amountToRequest)
            myMemoryThreshold += granted
            // if little or nothing was granted, currentMemory stays >= myMemoryThreshold,
            // which means the shuffle memory pool is exhausted, so spill;
            shouldSpill = currentMemory >= myMemoryThreshold
          }
          // also spill when the number of in-memory records exceeds
          // spark.shuffle.spill.numElementsForceSpillThreshold (default Long.MaxValue, so almost never);
          shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
          // actually spill
          if (shouldSpill) {
            _spillCount += 1
            logSpillage(currentMemory)
            // (de)serialization is done in batches of spark.shuffle.spill.batchSize (default 10000) records;
            // before spilling, records are sorted by partition id (and by key when an ordering is required);
            // data is first written to a buffer of spark.shuffle.file.buffer (default 32k) and then flushed to disk;
            spill(collection)
            _elementsRead = 0
            _memoryBytesSpilled += currentMemory
            // release the memory
            releaseMemory()
          }
          shouldSpill
        }
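      The spill path above is governed by a few settings; a hedged sketch of the knobs mentioned in the comments (Spark 2.2 defaults in parentheses, the values below are only illustrative):

      import org.apache.spark.SparkConf

      val shuffleTuning = new SparkConf()
        // write buffer used before a spill is flushed to disk (default 32k)
        .set("spark.shuffle.file.buffer", "64k")
        // records (de)serialized per batch while spilling and merging (default 10000)
        .set("spark.shuffle.spill.batchSize", "10000")
        // force a spill after this many in-memory records regardless of size
        // (default Long.MaxValue, i.e. effectively disabled)
        .set("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000")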
      

6. How shuffle read works

Preface: shuffle read has to fetch data from remote nodes, so it involves network IO; fetching happens in batches, with a block (one partition of a map task's shuffle-write output) as the smallest unit. Under data skew a single block can be huge, and since a fetch request always pulls at least one whole block into memory, skew easily leads to OOM; this can be mitigated through configuration (the relevant knobs are sketched below);
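The fetch-side knobs referred to above, as a hedged sketch (Spark 2.2 property names; later releases rename the shuffle-to-memory threshold to spark.maxRemoteBlockSizeFetchToMem; the values are only illustrative):

import org.apache.spark.SparkConf

val readTuning = new SparkConf()
  // total bytes of remote block data allowed in flight at once (split over ~5 concurrent requests)
  .set("spark.reducer.maxSizeInFlight", "48m")
  // maximum number of remote fetch requests in flight at any time
  .set("spark.reducer.maxReqsInFlight", "64")
  // fetch requests larger than this are streamed to disk instead of buffered in memory,
  // the main defence against OOM on skewed blocks
  .set("spark.reducer.maxReqSizeShuffleToMem", "200m")
  // re-fetch a block once if the (compressed) stream looks corrupted
  .set("spark.shuffle.detectCorrupt", "true")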

  • 1. Instantiate the ShuffleReader;
org.apache.spark.shuffle.sort.SortShuffleManager.scala

override def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C] = {
    // (2) instantiate a BlockStoreShuffleReader to do the reading;
    new BlockStoreShuffleReader(
      handle.asInstanceOf[BaseShuffleHandle[K, _, C]], startPartition, endPartition, context)
  }
  • 2. Instantiate the ShuffleBlockFetcherIterator;
org.apache.spark.shuffle.BlockStoreShuffleReader.scala

override def read(): Iterator[Product2[K, C]] = {
    // (3) fetch blocks in batches;
    val wrappedStreams = new ShuffleBlockFetcherIterator(
      context,
      blockManager.shuffleClient,  // the blockManager client used to fetch remote blocks;
      blockManager,   // this executor's blockManager, used to fetch local blocks;
      // given the shuffleId and the partition range handled by this task, return the nodes holding
      // those partitions together with the block ids and sizes;
      // the return type is (BlockManagerId, Seq[(BlockId, Long)]); the service is provided by the BlockManager;
      mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
      serializerManager.wrapStream,
      // maximum amount of remote block data fetched at once (spread over about 5 nodes);
      SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024,
      // maximum number of remote fetch requests in flight at any time;
      SparkEnv.get.conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue),
      // fetch results below this threshold are kept in memory, larger ones are written straight to disk;
      SparkEnv.get.conf.get(config.REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM),
      // enable corruption detection; only compressed streams and small blocks
      // (smaller than maxBytesInFlight / 3), or the beginning of large ones, are checked;
      SparkEnv.get.conf.getBoolean("spark.shuffle.detectCorrupt", true))

    // serializer
    val serializerInstance = dep.serializer.newInstance()

    // turn the (BlockId, InputStream) iterator wrappedStreams into a (key, value) iterator recordIter;
    val recordIter = wrappedStreams.flatMap { case (blockId, wrappedStream) =>
      serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator
    }

    val readMetrics = context.taskMetrics.createTempShuffleReadMetrics()
    // wrap recordIter with metrics that count how many (key, value) records were read;
    val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
      recordIter.map { record =>
        readMetrics.incRecordsRead(1)
        record
      },
      context.taskMetrics().mergeShuffleReadMetrics())

    // make reading interruptible; interruptibleIter is the iterator everything else operates on
    val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)

    val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
      if (dep.mapSideCombine) {
        // for aggregating operators with mapSideCombine = true, merge the partial combiners;
        // an ExternalAppendOnlyMap aggregates the data and spills, and the spills are merged and aggregated at the end;
        val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
        dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
      } else {
        val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
        dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
      }
    } else {
      require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
      interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
    }

    // if a key ordering was requested, sort with an ExternalSorter, just like shuffle write;
    // the data arrives grouped by the keys' hash-based partitioning, but the user may have defined
    // a custom key ordering, so this step only sorts and does not aggregate;
    // here the ExternalSorter uses the buffer (PartitionedPairBuffer) data structure;
    dep.keyOrdering match {
      case Some(keyOrd: Ordering[K]) =>
        val sorter =
          new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
        sorter.insertAll(aggregatedIter)
        context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
        context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
        context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes)
        CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop())
      case None =>
        aggregatedIter
    }
  }
  • 3. Initialization: split the fetches into local and remote requests, and fetch local blocks concurrently;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala

// called when the object is instantiated;
private[this] def initialize(): Unit = {
    context.addTaskCompletionListener(_ => cleanup())
    // (4) split the fetches into local and remote ones, returning the remote requests;
    val remoteRequests = splitLocalRemoteBlocks()
    // randomize the order of the fetch requests
    fetchRequests ++= Utils.randomize(remoteRequests)
    // ...
    // send the first batch of remote fetch requests;
    fetchUpToMaxBytes()
    // ...
    // fetch the local blocks, concurrently with the remote fetches;
    // this simply calls blockManager.getBlockData(blockId) for each local block;
    fetchLocalBlocks()
    logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
  }
  • 4. Build the remote fetch requests; the key spot for shuffle-read tuning;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala

private[this] def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
    // fetch from up to 5 nodes in parallel, each request at most maxBytesInFlight / 5 bytes;
    val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
    // remote fetch requests
    val remoteRequests = new ArrayBuffer[FetchRequest]
    // iterate over the (BlockManager, blocks) info; these nodes hold the blocks this task has to read;
    for ((address, blockInfos) <- blocksByAddress) {
      totalBlocks += blockInfos.size
      if (address.executorId == blockManager.blockManagerId.executorId) {
        // keep the non-empty local blocks;
        localBlocks ++= blockInfos.filter(_._2 != 0).map(_._1)
        numBlocksToFetch += localBlocks.size
      } else {
        val iterator = blockInfos.iterator
        var curRequestSize = 0L
        var curBlocks = new ArrayBuffer[(BlockId, Long)]
        // iterate over the target blocks on this node
        while (iterator.hasNext) {
          val (blockId, size) = iterator.next()
          if (size > 0) {
            curBlocks += ((blockId, size))  // blocks of the current request;
            remoteBlocks += blockId
            numBlocksToFetch += 1
            curRequestSize += size  // total size of the current request;
          } else if (size < 0) {
            throw new BlockException(blockId, "Negative block size " + size)
          }
          if (curRequestSize >= targetRequestSize) {
            // once the request reaches maxBytesInFlight / 5, close it
            // and add it to the remote request queue;
            remoteRequests += new FetchRequest(address, curBlocks)
            curBlocks = new ArrayBuffer[(BlockId, Long)]
            logDebug(s"Creating fetch request of $curRequestSize at $address")
            curRequestSize = 0
          }
        }
        // close the last request: curBlocks is non-empty but smaller than targetRequestSize;
        if (curBlocks.nonEmpty) {
          remoteRequests += new FetchRequest(address, curBlocks)
        }
      }
    }
    logInfo(s"Getting $numBlocksToFetch non-empty blocks out of $totalBlocks blocks")
    remoteRequests
  }
  • 5. next() returns the fetched data and issues further remote fetch requests;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala

override def next(): (BlockId, InputStream) = {
    // ...
    while (result == null) {
      val startFetchWait = System.currentTimeMillis()
      result = results.take()    // take the first element of the result queue;
      val stopFetchWait = System.currentTimeMillis()

      result match {
         // the fetch succeeded: build the input stream and verify the data;
        case r @ SuccessFetchResult(blockId, address, size, buf, isNetworkReqDone) =>
          // ... build the input stream
          // verification reads the whole block at once, which can itself cause an OOM
          if (detectCorrupt && !input.eq(in) && size < maxBytesInFlight / 3) {
            // only compressed blocks smaller than maxBytesInFlight / 3 are checked;
            // on the first failure the block is re-fetched and its id added to a HashSet;
            // a second failure makes the fetch fail;
          }
        // the fetch failed
        case FailureFetchResult(blockId, address, e) =>
          throwFetchFailedException(blockId, address, e)
      }

      // (6) issue another round of remote fetch requests;
      fetchUpToMaxBytes()
    }
    // return (blockId, the input stream of that block)
    currentResult = result.asInstanceOf[SuccessFetchResult]
    (currentResult.blockId, new BufferReleasingInputStream(input, this))
  }
  • 6. Issue the remote fetch requests;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala

// issue remote fetch requests
private def fetchUpToMaxBytes(): Unit = {
    // fetch up to maxBytesInFlight (48M by default) worth of data;
    // since a single request is capped at maxBytesInFlight / 5, if every request is exactly 48M / 5
    // then 5 requests go out at once, each at most 48M / 5, for a total of at most 48M;
    while (fetchRequests.nonEmpty &&
      (bytesInFlight == 0 ||   // nothing currently in flight
        (reqsInFlight + 1 <= maxReqsInFlight &&  // the number of in-flight requests stays below the limit
          bytesInFlight + fetchRequests.front.size <= maxBytesInFlight))) {  // and the in-flight bytes stay below 48M
      sendRequest(fetchRequests.dequeue())  // (7) actually send the request
    }
  }
  • 7. Actually send a request and fetch the data;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala

private[this] def sendRequest(req: FetchRequest) {
    // ...
    val blockFetchingListener = new BlockFetchingListener {
      override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
         // ...
        results.put(new SuccessFetchResult(/*...*/))
         // ...
      }
      // success or failure, a matching result instance is put into results and handled downstream;
      override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
        results.put(new FailureFetchResult(/*...*/))
      }
    }
    // requests no larger than maxReqSizeShuffleToMem are buffered in memory; larger ones are streamed to disk
    if (req.size > maxReqSizeShuffleToMem) {
      val shuffleFiles = blockIds.map { _ =>
        blockManager.diskBlockManager.createTempLocalBlock()._2
      }.toArray
      shuffleFilesSet ++= shuffleFiles
      shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
        blockFetchingListener, shuffleFiles)
    } else {
      shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
        blockFetchingListener, null)
    }
  }

7. SparkSQL execution plan walkthrough


Preface: SparkSQL goes through a parsed logical plan (Parsed) -> analyzed logical plan (Analyzed) -> optimized logical plan (Optimized) -> physical plan (Physical), and finally produces RDDs to execute. Because the optimizer is applied, it can in theory outperform non-optimal hand-written RDD code, and the schema information makes it more readable;

  • Parsed logical plan: the SQL string is parsed by ANTLR into an AST, which is then turned into a logical plan;

  • Analyzed logical plan: the previous step is unresolved; this step checks tables and columns against the catalog and produces the analyzed logical plan;

  • Optimized logical plan: the analyzed plan is optimized, mainly through column pruning, merging, predicate pushdown, and so on;

  • Physical plan: the final physical execution plan is produced;
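All four stages can be printed with explain(true); a small sketch against a temporary view (names are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("planDemo").master("local[2]").getOrCreate()
spark.range(100).createOrReplaceTempView("t")

// Prints the Parsed, Analyzed, and Optimized Logical Plans plus the Physical Plan;
// note how ConstantFolding turns 1 + 2 into 3 in the optimized plan.
spark.sql("SELECT id, 1 + 2 AS c FROM t WHERE id > 10").explain(true)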

  • 1. Parsed Logical Plan: parse the SQL string into an abstract syntax tree and build the unresolved logical plan;

    • SqlBaseLexer and SqlBaseParser, Java classes generated by ANTLR from the grammar file SqlBase.g4, perform lexical and syntactic analysis on the SQL string and produce the syntax tree;
    • astBuilder turns the syntax tree into an unresolved logical plan; at this point the system does not yet know what each token refers to;
org.apache.spark.sql.SparkSession.scala
def sql(sqlText: String): DataFrame = {
    // sessionState.sqlParser.parsePlan(sqlText) parses the SQL string into a logical plan
    Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
  }

org.apache.spark.sql.catalyst.parser.ParseDriver.scala
// an implementation of ParserInterface
inside the abstract class AbstractSqlParser

// parse the syntax tree into a logical plan
override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser => // parse into an AST
    astBuilder.visitSingleStatement(parser.singleStatement()) match {  // build the logical plan from the AST
      case plan: LogicalPlan => plan
      case _ =>
        val position = Origin(None, None)
        throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
    }
  }

// parse the SQL string into an abstract syntax tree
protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
    logInfo(s"Parsing command: $command")
    // the lexer SqlBaseLexer and the parser SqlBaseParser are Java classes generated by ANTLR 4;
    // lexical analysis
    val lexer = new SqlBaseLexer(new ANTLRNoCaseStringStream(command))
    // ...
    // syntactic analysis
    val tokenStream = new CommonTokenStream(lexer)
    val parser = new SqlBaseParser(tokenStream)
    // ...
  }
  • 2. Build a QueryExecution object, which analyzes and optimizes the logical plan and produces the final physical plan;
    • analyzed: resolve the parsed unresolved logical plan into a logical plan;
    • optimized: optimize the logical plan;
    • sparkPlan: turn the optimized logical plan into a physical plan that Spark can execute;
org.apache.spark.sql.Dataset.scala
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
    // build the QueryExecution object
    val qe = sparkSession.sessionState.executePlan(logicalPlan)
    qe.assertAnalyzed()
    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
  }

// the core class that performs the Analyzed, Optimized, and SparkPlan steps
org.apache.spark.sql.execution.QueryExecution.scala
// (3) use the Analyzer to resolve the parsed unresolved logical plan into a logical plan;
lazy val analyzed: LogicalPlan = {
    SparkSession.setActiveSession(sparkSession)
    sparkSession.sessionState.analyzer.execute(logical)
  }

lazy val withCachedData: LogicalPlan = {
    assertAnalyzed()
    assertSupported()
    sparkSession.sharedState.cacheManager.useCachedData(analyzed)
  }
// (4) use the Optimizer to optimize the logical plan
lazy val optimizedPlan: LogicalPlan = sparkSession.sessionState.optimizer.execute(withCachedData)
// (5) use the SparkPlanner to generate the physical plan
lazy val sparkPlan: SparkPlan = {
    SparkSession.setActiveSession(sparkSession)
    planner.plan(ReturnAnswer(optimizedPlan)).next()
  }
  • 3. Analyzer: resolve the parsed unresolved logical plan into a logical plan;
    • the Analyzer defines many batches; each batch's rules are applied in turn to the unresolved logical plan;
    • for example, the batch named Resolution turns unresolved nodes into resolved ones; its ResolveRelations rule asks the catalog for the structure of the referenced table and resolves the table's columns; the catalog caches (table name, LogicalPlan) pairs; in short, unresolved nodes get data-type and function bindings attached;
    • catalog: an API added in Spark 2.0 for working with SparkSQL and Hive metadata; it can list databases, tables, columns, and functions, and run DDL on Hive tables (see the sketch after the RuleExecutor snippet below);
org.apache.spark.sql.catalyst.rules.RuleExecutor.scala

// the structure of a Batch
protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)

// the analyzer calls execute, which is defined on Analyzer's parent class RuleExecutor
def execute(plan: TreeType): TreeType = {
    var curPlan = plan

    batches.foreach { batch =>
      // ... apply every rule in the batch, repeating until a fixed point or the iteration limit is reached
    }
    curPlan
  }
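A short sketch of the Catalog API mentioned above (Spark 2.x; the database and table names are hypothetical, and spark is an existing SparkSession):

spark.catalog.listDatabases().show()
spark.catalog.listTables("default").show()
spark.catalog.listColumns("default", "t").show()
spark.catalog.listFunctions().show(5)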
  • 4. Optimizer: optimize the logical plan;
    • like the Analyzer, it extends RuleExecutor and defines many batches of rules that optimize the logical plan;
    • for example, the batch named Operator Optimizations optimizes the operators; the batches are applied in order;
    • classic SQL optimizations include predicate pushdown, constant folding, column pruning, and limit merging;
    • the Optimizer's rule groups include union merging, replacement (semi join), operator pushdown, operator combining, and constant folding with strength reduction;
org.apache.spark.sql.catalyst.optimizer.Optimizer.scala

// SQL optimization is the part we care most about, so here are the rules in the Optimizer's batches
Batch("Union", Once,
      CombineUnions) ::  // merge adjacent unions; in nested unions a distinct is only needed at the outermost level;
    Batch("Pullup Correlated Expressions", Once,
      PullupCorrelatedPredicates) ::  // pull subquery filters up;
    Batch("Subquery", Once,
      OptimizeSubqueries) ::  // on a subquery, recurse with Optimizer.this.execute(Subquery(s.plan));
    Batch("Replace Operators", fixedPoint,
      ReplaceIntersectWithSemiJoin,  // replace intersect with a semi join
      ReplaceExceptWithAntiJoin,   // replace except with an anti join
      ReplaceDistinctWithAggregate) ::  // replace distinct with a group-by aggregate;
    Batch("Aggregate", fixedPoint,
      RemoveLiteralFromGroupExpressions,  // remove literals from group-by expressions
      RemoveRepetitionFromGroupExpressions) ::  // remove duplicate group-by expressions;
    Batch("Operator Optimizations", fixedPoint, Seq(
      // Operator push down
      PushProjectionThroughUnion,   // push projections (column pruning) below a union, into each branch;
      ReorderJoin(conf),   // join reordering; the CBO (Cost Based Optimizer) reorders joins based on data size;
      // turn outer joins with filters into inner joins; e.g. a left outer join followed by a filter on a
      // right-table column: unmatched rows have nulls there and the filter drops them anyway, so the
      // result equals an inner join, which filters on less data;
      EliminateOuterJoin(conf), 
      PushPredicateThroughJoin,  // push join predicates down to both sides, i.e. filter before joining;
      PushDownPredicate,   // data-source predicate pushdown: a filter right after a scan is applied while reading;
      LimitPushDown(conf),  // push limit below union or join, into their children;
      ColumnPruning,   // column pruning: read only the columns that are used;
      InferFiltersFromConstraints(conf),  // infer constraints, e.g. filter(a>2) becomes filter(isnotnull(a) && a>2);
      // Operator combine
      CollapseRepartition,  // collapse adjacent repartitions;
      CollapseProject,   // collapse projections (drop unnecessary selects);
      CollapseWindow,  // collapse windows with the same partitioning and ordering;
      CombineFilters,  // combine filters;
      CombineLimits,  // combine adjacent limits, keeping the smaller one;
      CombineUnions,  // combine unions, same as the first Union batch;
      // Constant folding and strength reduction
      NullPropagation(conf),  // null propagation: stop nulls from spreading through the tree;
      FoldablePropagation,   // constant propagation; select 'c' as a order by a => select 'c' as a order by 'c';
      OptimizeIn(conf),  // optimize IN: handle empty lists and duplicates;
      ConstantFolding,   // constant folding; e.g. 1 + 2 is evaluated to 3 once instead of per row;
      ReorderAssociativeOperator, // reorder and fold associative operators; x+2+y+7 is flattened to [2,7],[x,y] and [2,7] becomes 9;
      LikeSimplification,  // simplify LIKE; e.g. name like 'shen%' becomes name.startsWith("shen");
      BooleanSimplification, // simplify boolean expressions; e.g. (a=1 and b=2) or (a=1 and b>2) becomes (a=1) and (b=2 or b>2)
      SimplifyConditionals,  // simplify if/case expressions, similar to BooleanSimplification;
      RemoveDispensableExpressions,  // remove unnecessary nodes;
      SimplifyBinaryComparison,  // simplify comparisons; if both sides of = are the same expression, replace with true;
      PruneFilters(conf),  // prune filters; e.g. parent filter a>4 and b=2 with child filter b=2: the child's filter(b=2) is dropped;
      EliminateSorts,  // remove sorts that are unused or duplicated;
      SimplifyCasts,  // drop casts whose input and output types are the same;
      SimplifyCaseConversionExpressions,  // simplify chained upper/lower conversions, keeping only the last one;
      RewriteCorrelatedScalarSubquery,  // rewrite correlated scalar subqueries as left outer joins;
      EliminateSerialization,   // eliminate unnecessary serialization;
      RemoveRedundantAliases,  // remove redundant aliases;
      RemoveRedundantProject,  // remove redundant projections (selects);
      SimplifyCreateStructOps,  // push operations into CreateStructOps;
      SimplifyCreateArrayOps,  // push operations into CreateArrayOps;
      SimplifyCreateMapOps) ++ // push operations into CreateMapOps;
      extendedOperatorOptimizationRules: _*) ::
    Batch("Check Cartesian Products", Once,
      CheckCartesianProducts(conf)) ::  // detect cartesian-product joins; fail if one occurs while spark.sql.crossJoin.enabled=false;
    Batch("Join Reorder", Once,
      CostBasedJoinReorder(conf)) ::  // cost-based join reordering (dynamic programming) to pick a good join order;
    Batch("Decimal Optimizations", fixedPoint,
      DecimalAggregates(conf)) ::  // optimize aggregations over decimals;
    Batch("Object Expressions Optimization", fixedPoint,
      EliminateMapObjects,  // eliminate MapObjects;
      CombineTypedFilters) ::  // combine adjacent typed filters
    Batch("LocalRelation", fixedPoint,
      ConvertToLocalRelation,  // optimize LocalRelation
      PropagateEmptyRelation) ::  // optimize EmptyRelation
    Batch("OptimizeCodegen", Once,
      OptimizeCodegen(conf)) ::  // optimize the generated code;
    Batch("RewriteSubquery", Once,
      RewritePredicateSubquery,  // rewrite predicate subqueries as left semi / left anti joins;
      CollapseProject) :: Nil  // collapse projections (drop unnecessary selects), as above;
  • 5. SparkPlan: generate the physical plan. Even the optimized logical plan is abstract: a join, for example, only says that two tables are connected on matching columns, not how to execute it, so the planner (eventually the CBO) has to pick a concrete join implementation based on cost;

    org.apache.spark.sql.execution.QueryExecution.scala
    lazy val sparkPlan: SparkPlan = {
        SparkSession.setActiveSession(sparkSession)
        // (5.1) call plan on the SparkPlanner; plan is defined in the parent class QueryPlanner;
        // it can return one or more physical plans (currently always one) and the first is used;
        // today the CBO mainly picks the join implementation based on table sizes; a proper cost
        // model is planned for later;
        planner.plan(ReturnAnswer(optimizedPlan)).next()
      }
    // execute the physical plan
    lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)
    // before execution, a set of rules is applied to the physical plan;
    protected def prepareForExecution(plan: SparkPlan): SparkPlan = {
        preparations.foldLeft(plan) { case (sp, rule) => rule.apply(sp) }
    }
    // these rules also extend Rule, much like the Optimizer's rules, so this step can be seen as optimizing the physical plan;
    protected def preparations: Seq[Rule[SparkPlan]] = Seq(
        python.ExtractPythonUDFs,
        PlanSubqueries(sparkSession),  // analyze and optimize subqueries again, producing a QueryExecution for each;
        EnsureRequirements(sparkSession.sessionState.conf), // check the required partitioning and insert shuffles if needed;
        // fuse a chain of operators (map, filter, ...) into a single generated Java method:
        // SparkPlans that support codegen get a WholeStageCodegenExec on top, the others an InputAdapter;
        CollapseCodegenStages(sparkSession.sessionState.conf),
        // an Exchange is essentially a shuffle; duplicated Exchanges are found and reused to avoid recomputation;
        ReuseExchange(sparkSession.sessionState.conf),
        // similarly, duplicated subqueries are found and reused to avoid recomputation;
        ReuseSubquery(sparkSession.sessionState.conf))

    • 5.1. Generating the physical plan; the inheritance chain is SparkPlanner -> SparkStrategies -> QueryPlanner;
    // the strategies defined by SparkPlanner
    def strategies: Seq[Strategy] =
        experimentalMethods.extraStrategies ++
          extraPlanningStrategies ++ (
          FileSourceStrategy ::
          DataSourceStrategy(conf) ::
          SpecialLimits ::
          Aggregation ::
          JoinSelection ::  // picks the concrete join implementation; defined in the parent class SparkStrategies;
          InMemoryScans ::  // handles cacheTable
          BasicOperators :: Nil)

    // the plan method defined in QueryPlanner
    def plan(plan: LogicalPlan): Iterator[PhysicalPlan] = {
        // strategies here are the rules defined by SparkPlanner; each rule's apply(plan) is invoked
        val candidates = strategies.iterator.flatMap(_(plan))

        // there may be subqueries, so plan is called recursively
        val plans = candidates.flatMap { candidate =>
          // ...
        }
        // finally return the physical plans
        plans
      }
    
  • 6. Executing the SparkPlan to produce RDDs: the doExecute method is called;

// QueryExecution calls execute;
lazy val toRdd: RDD[InternalRow] = executedPlan.execute()
// execute calls doExecute, and every class extending SparkPlan implements its own doExecute;
final def execute(): RDD[InternalRow] = executeQuery {
    doExecute()
 }

// for example HiveTableScanExec, the executor that scans a Hive table;
// classes extending SparkPlan usually end in Exec, marking the actual executors of the physical plan;
org.apache.spark.sql.hive.execution.HiveTableScanExec.scala

protected override def doExecute(): RDD[InternalRow] = {
    // the RDD is created here
    val rdd = if (!relation.isPartitioned) {
      Utils.withDummyCallSite(sqlContext.sparkContext) {
        hadoopReader.makeRDDForTable(hiveQlTable)
      }
    } else {
      Utils.withDummyCallSite(sqlContext.sparkContext) {
        hadoopReader.makeRDDForPartitionedTable(prunePartitions(rawPartitions))
      }
    }
    val numOutputRows = longMetric("numOutputRows")
    val outputSchema = schema
    // mapPartitionsWithIndexInternal is the internal counterpart of RDD.mapPartitionsWithIndex;
    rdd.mapPartitionsWithIndexInternal { (index, iter) =>
      // (7) GenerateUnsafeProjection generates executable Java code from the expressions and compiles it to bytecode;
      val proj = UnsafeProjection.create(outputSchema)
      proj.initialize(index)
      iter.map { r =>
        numOutputRows += 1
        proj(r)
      }
    }
  }
  • 7. Generate the Java code, compile it to bytecode, and send it to the executors for execution;
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection.scala

// UnsafeProjection.create in HiveTableScanExec.doExecute ends up calling this create method to generate the code;
private def create(
      expressions: Seq[Expression],
      subexpressionEliminationEnabled: Boolean): UnsafeProjection = {
    val ctx = newCodeGenContext()
    val eval = createCode(ctx, expressions, subexpressionEliminationEnabled)

    // the Java code template
    val codeBody = s"""
      public java.lang.Object generate(Object[] references) {
        return new SpecificUnsafeProjection(references);
      }

      class SpecificUnsafeProjection extends ${classOf[UnsafeProjection].getName} {

        private Object[] references;
        ${ctx.declareMutableStates()}

        public SpecificUnsafeProjection(Object[] references) {
          this.references = references;
          ${ctx.initMutableStates()}
        }
        // ...
      }
      """
    // format the template
    val code = CodeFormatter.stripOverlappingComments(
      new CodeAndComment(codeBody, ctx.getPlaceHolderToComments()))
    logDebug(s"code for ${expressions.mkString(",")}:\n${CodeFormatter.format(code)}")

    // compile the Java code to bytecode;
    val c = CodeGenerator.compile(code)
    // call the generate method of the template, passing in the required objects;
    c.generate(ctx.references.toArray).asInstanceOf[UnsafeProjection]
  }
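To look at the Java source these generators produce for a whole query, Spark ships a debug helper; a sketch assuming the spark session and the view t created earlier:

import org.apache.spark.sql.execution.debug._

// Prints the whole-stage-codegen subtrees of the physical plan together with
// the generated Java source.
spark.sql("SELECT id + 1 FROM t WHERE id > 10").debugCodegen()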
