Spark operators

1. The sortBy operator

Preface: Spark sorts with a TeraSort-like algorithm: it first makes the partitions ordered relative to one another, then sorts within each partition, which yields a global order:

  • 1. Sample to determine boundaries: sample every partition, collect and sort the samples, work out the key range each output partition should cover, and produce the array of range upper bounds;

  • 2. Shuffle write, ordered across partitions: use a RangePartitioner, which computes the target partition of each record against the bounds array;

  • 3. Shuffle read, ordered within partitions: fetch the pieces of the same partition scattered across nodes and sort them, so each partition is internally ordered;
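For orientation, a minimal usage sketch (Scala, Spark 2.x; the data and app name are made up) of what the rest of this section traces through the source: sortBy only needs a key-extraction function, an optional sort direction, and an optional target partition count.

import org.apache.spark.{SparkConf, SparkContext}

object SortByDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sortByDemo").setMaster("local[2]"))
    val users = sc.parallelize(Seq(("alice", 31), ("bob", 25), ("carol", 40)), numSlices = 2)

    // Global sort by age, descending, into 2 output partitions.
    // Internally this becomes keyBy(f).sortByKey(...).values, as traced below.
    val byAge = users.sortBy(_._2, ascending = false, numPartitions = 2)
    byAge.collect().foreach(println)   // (carol,40), (alice,31), (bob,25)
    sc.stop()
  }
}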

  • 1. Build an RDD[(K, V)] from the RDD, then call the sortByKey operator;

org.apache.spark.rdd.RDD.scala   // Spark 2.2

def sortBy[K](
    f: (T) => K,   // sort by the key K returned by f; K must be comparable;
    ascending: Boolean = true,   // ascending by default;
    numPartitions: Int = this.partitions.length)   // partition count after sorting, defaults to this RDD's partition count;
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)  // build an RDD of (k, v)
      .sortByKey(ascending, numPartitions)  // (2) call sortByKey
      .values
}
  • 2. Instantiate a RangePartitioner and use it at shuffle write time to assign records to partitions, making the partitions ordered relative to each other; also set the shuffle-read key ordering so each partition is sorted internally;
org.apache.spark.rdd.OrderedRDDFunctions.scala

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)  // (3) build the RangePartitioner;
    new ShuffledRDD[K, V, V](self, part)  // the shuffle partitions the data with this RangePartitioner;
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)  // set the key ordering used during the shuffle;
  }
  • 3. Build the RangePartitioner, which samples the source RDD to obtain the boundary array of the ranges;
org.apache.spark.Partitioner.scala  // the defaultPartitioner method in this file decides the partitioner for RDD joins;
// HashPartitioner is the default partitioner in many scenarios; RangePartitioner is the one used for sorting;
inside the RangePartitioner class

// the elements of rangeBounds are the boundaries between partitions;
private var rangeBounds: Array[K] = {
  if (partitions <= 1) {
    Array.empty
  } else {
    // base sample size, capped at 1M;
    val sampleSize = math.min(20.0 * partitions, 1e6)
    // initial per-partition sample size; the input partitions are assumed to be balanced,
    // because the same number of records is sampled from every partition;
    // with default settings this works out to 60 samples per partition;
    val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
    // (4) sample; numItems is the total number of records seen,
    // sketched is [idx: Int, n: Long, sample: Array[K]] where idx is the partition id,
    // n is the number of elements in that partition and sample is the sampled keys;
    val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
    if (numItems == 0L) {
      Array.empty
    } else {
      // if a partition holds roughly more than 3x the average number of elements, it is
      // re-sampled so that the final partitions stay balanced;
      val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
      val candidates = ArrayBuffer.empty[(K, Float)]  // the balanced samples, i.e. the final result;
      val imbalancedPartitions = mutable.Set.empty[Int]  // imbalanced partitions that must be re-sampled;
      sketched.foreach { case (idx, n, sample) =>
        if (fraction * n > sampleSizePerPartition) {
          imbalancedPartitions += idx
        } else {
          // give each sampled element a weight: this key was picked out of "weight" records,
          // i.e. weight = 1 / (sampling probability of this element);
          val weight = (n.toDouble / sample.length).toFloat
          for (key <- sample) {
            candidates += ((key, weight))
          }
        }
      }
      if (imbalancedPartitions.nonEmpty) {
        // re-sample the imbalanced partitions;
        val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
        val seed = byteswap32(-rdd.id - 1)
        // sampling without replacement; expected sample count = 20 * m / n, where m is the
        // element count of the imbalanced partition and n the average per partition;
        // since m > 3n, such a partition contributes more than 60 samples on average;
        val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
        val weight = (1.0 / fraction).toFloat
        candidates ++= reSampled.map(x => (x, weight))
      }
      RangePartitioner.determineBounds(candidates, partitions)  // (5) derive the bounds from the unordered samples;
    }
  }
}
  • 4. Sampling: sample each partition of the data;
org.apache.spark.Partitioner.scala 
inside the RangePartitioner class

// sampling
def sketch[K : ClassTag](
      rdd: RDD[K],
      sampleSizePerPartition: Int): (Long, Array[(Int, Long, Array[K])]) = {
    val shift = rdd.id
    // sample partition by partition
    val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
      val seed = byteswap32(idx ^ (shift << 16))  // random seed
      val (sample, n) = SamplingUtils.reservoirSampleAndCount(   // reservoir sampling, which iterates over every record;
        iter, sampleSizePerPartition, seed)
      Iterator((idx, n, sample))  // (partition id, element count, samples), 60 samples per partition by default;
    }.collect()   // collect to the driver;
    val numItems = sketched.map(_._2).sum   // records seen while sampling, i.e. the RDD's element count;
    (numItems, sketched)
  }
  • 5. Determine the partition boundaries from the sampling result;
org.apache.spark.Partitioner.scala 
inside the RangePartitioner class

def determineBounds[K : Ordering : ClassTag](
      candidates: ArrayBuffer[(K, Float)],   // unordered (sampled key, weight) pairs;
      partitions: Int): Array[K] = {
    val ordering = implicitly[Ordering[K]]
    val ordered = candidates.sortBy(_._1)
    val numCandidates = ordered.size   // number of samples
    val sumWeights = ordered.map(_._2.toDouble).sum  // estimated total number of records in the RDD
    val step = sumWeights / partitions  // average number of records per output partition
    var cumWeight = 0.0
    var target = step
    val bounds = ArrayBuffer.empty[K]  // the final bounds, at most partitions - 1 elements
    var i = 0
    var j = 0
    var previousBound = Option.empty[K]  // the previous bound, to avoid duplicates
    while ((i < numCandidates) && (j < partitions - 1)) {
      val (key, weight) = ordered(i)
      cumWeight += weight
      if (cumWeight >= target) {  // a weight w means this sample stands for roughly w records;
        // skip duplicate bounds
        if (previousBound.isEmpty || ordering.gt(key, previousBound.get)) {
          bounds += key
          target += step
          j += 1
          previousBound = Some(key)
        }
      }
      i += 1
    }
    // note: the final partition count is bounds.length + 1, which is not necessarily equal to the
    // source RDD's partition count or the requested one;
    // an RDD with many duplicate keys can end up with bounds.length + 1 smaller than either;
    bounds.toArray
  }
  • 6. At shuffle write time, determine the partition each record belongs to;
org.apache.spark.Partitioner.scala 
inside the RangePartitioner class

def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    // this also shows that the final partition count equals rangeBounds.length + 1
    if (rangeBounds.length <= 128) {
      // with at most 128 boundaries, use a plain linear scan;
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // with more than 128, use binary search
      partition = binarySearch(rangeBounds, k)
      // a negative result means the key falls between two bounds (a, b) and should go to
      // partition b; binary search returns -b - 1 in that case, so -(-b - 1) - 1 = b
      if (partition < 0) {
        partition = -partition-1
      }
      if (partition > rangeBounds.length) {  // clamp to the last partition
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }
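As a quick check on the two points in the comments above (the output partition count is rangeBounds.length + 1, and it can be smaller than requested when there are few distinct keys), a RangePartitioner can be built and probed directly; the data below is made up and an existing SparkContext sc is assumed.

import org.apache.spark.RangePartitioner

// The constructor samples this RDD's keys to compute rangeBounds.
val pairs = sc.parallelize(1 to 1000).map(i => (i, i))
val rp = new RangePartitioner(4, pairs)

rp.numPartitions       // usually 4 here; can be less when there are few distinct keys
rp.getPartition(3)     // small keys fall into the first partitions
rp.getPartition(999)   // large keys fall into the last partition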

2. The RDD join operator

Preface: PairRDDFunctions is an extension class that adds methods to RDDs of (key, value) pairs; groupBy and sortBy are ultimately converted into (key, value) RDDs that call the corresponding groupByKey and sortByKey operators in PairRDDFunctions. Likewise, join is a method of PairRDDFunctions, and every join flavour is implemented on top of the cogroup operator, which in turn is backed by CoGroupedRDD;
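Before digging into the source, a short sketch (made-up data, existing SparkContext sc assumed) of the join flavours discussed below; output order is not guaranteed:

val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((2, "x"), (3, "y")))

left.join(right).collect()           // Array((2,(b,x)))
left.leftOuterJoin(right).collect()  // Array((1,(a,None)), (2,(b,Some(x))))
left.fullOuterJoin(right).collect()  // Array((1,(Some(a),None)), (2,(Some(b),Some(x))), (3,(None,Some(y))))
left.cogroup(right).collect()        // each key maps to (values from left, values from right)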

  • 1. Determine the join's partitioner;
org.apache.spark.rdd.PairRDDFunctions.scala

// with no extra argument, the default-partitioner logic decides which partitioner to use;
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
    // (2) call the join overload;
    join(other, defaultPartitioner(self, other))
  }
// with an explicit partition count, a HashPartitioner is used;
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))] = self.withScope {
    join(other, new HashPartitioner(numPartitions))
  }

// choosing the default partitioner (defined on the Partitioner companion object in org.apache.spark.Partitioner.scala)
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val rdds = (Seq(rdd) ++ others)
    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
    if (hasPartitioner.nonEmpty) {
      // if at least one RDD already has a partitioner, reuse the partitioner of the RDD
      // with the most partitions;
      hasPartitioner.maxBy(_.partitions.length).partitioner.get
    } else {
      // otherwise fall back to HashPartitioner
      if (rdd.context.conf.contains("spark.default.parallelism")) {
        // if the default parallelism is set, use it as the partition count;
        new HashPartitioner(rdd.context.defaultParallelism)
      } else {
        // otherwise use the larger partition count of the input RDDs;
        new HashPartitioner(rdds.map(_.partitions.length).max)
      }
    }
  }
  • 2. All join variants call the cogroup operator;
org.apache.spark.rdd.PairRDDFunctions.scala

// (3) all of them first cogroup into (k, (Iterable[V1], Iterable[V2])) and then filter according to the join type;
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)  // yields only when both sides have elements;
    )
  }

def leftOuterJoin[W](
      other: RDD[(K, W)],
      partitioner: Partitioner): RDD[(K, (V, Option[W]))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues { pair =>
      if (pair._2.isEmpty) {
        pair._1.iterator.map(v => (v, None))   // right side empty: return (left v, None);
      } else {
        for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))  // plus the inner-join results;
      }
    }
  }

def fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Option[V], Option[W]))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues {
      case (vs, Seq()) => vs.iterator.map(v => (Some(v), None))  // right side empty
      case (Seq(), ws) => ws.iterator.map(w => (None, Some(w)))  // left side empty
      case (vs, ws) => for (v <- vs.iterator; w <- ws.iterator) yield (Some(v), Some(w))  // plus the inner-join results
    }
  }
  • 3. The cogroup operator uses CoGroupedRDD;
org.apache.spark.rdd.PairRDDFunctions.scala

def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
    // with a HashPartitioner, the key must not be an array type;
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    // (4) build a CoGroupedRDD
    val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
    cg.mapValues { case Array(vs, w1s) =>
      (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
    }
  }
  • 4. Choose wide or narrow dependencies, and aggregate the result into (key, Array[Iterable]);
org.apache.spark.rdd.CoGroupedRDD.scala

// choose wide vs. narrow dependencies
override def getDependencies: Seq[Dependency[_]] = {
    rdds.map { rdd: RDD[_] =>
      if (rdd.partitioner == Some(part)) {
        logDebug("Adding one-to-one dependency with " + rdd)
        // if this rdd's partitioner equals the CoGroupedRDD's partitioner, it is a narrow dependency;
        new OneToOneDependency(rdd)
      } else {
        logDebug("Adding shuffle dependency with " + rdd)
        // otherwise it is a wide dependency and this rdd must be shuffled with the
        // CoGroupedRDD's partitioner;
        // hence in rdd1.join(rdd2), if rdd1 and rdd2 share the same partitioner,
        // the join does not trigger another shuffle;
        new ShuffleDependency[K, Any, CoGroupCombiner](
          rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
      }
    }
  }
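The comment in getDependencies is worth demonstrating: when both parents already use the partitioner the CoGroupedRDD ends up with, every dependency is one-to-one and the join itself adds no shuffle stage. A sketch with made-up data (sc is an existing SparkContext):

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(8)
// partitionBy shuffles once each; the partitioned RDDs can be cached and reused.
val a = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(p).cache()
val b = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(p).cache()

// defaultPartitioner picks p (both parents expose it), so getDependencies returns
// two OneToOneDependency instances and the join adds no extra shuffle.
val joined = a.join(b)
println(joined.toDebugString)   // no new ShuffledRDD between the partitioned parents and the join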

// the actual computation
override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {
    val split = s.asInstanceOf[CoGroupPartition]
    val numRdds = dependencies.length
    // ...
    // an ExternalAppendOnlyMap with a custom aggregator is used; the aggregator produces
    // (key, Array[Iterable]) where each Array element is the Iterable of same-key values
    // coming from one of the parent RDDs;
    val map = createExternalMap(numRdds)
    for ((it, depNum) <- rddIterators) {
      map.insertAll(it.map(pair => (pair._1, new CoGroupValue(pair._2, depNum))))
    }
    // ...
  }

// pick the aggregation data structure and define the aggregator
private def createExternalMap(numRdds: Int)
    : ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner] = {
    // newCombiner has type Array[CoGroup]; a CoGroup is an append-only buffer of values,
    // so newCombiner can be read as Array[Seq[T]]
    val createCombiner: (CoGroupValue => CoGroupCombiner) = value => {
      // value is (Any, Int): _2 is the index of the rdd inside rdds, _1 is the value from that rdd;
      val newCombiner = Array.fill(numRdds)(new CoGroup)
      newCombiner(value._2) += value._1
      newCombiner
    }
    val mergeValue: (CoGroupCombiner, CoGroupValue) => CoGroupCombiner =
      (combiner, value) => {
      combiner(value._2) += value._1
      combiner
    }
    // for rdd0.join(rdd1), the CoGroupedRDD's rdds is Seq[RDD](rdd0, rdd1), so combiner1 is
    // Array[Seq[T]](Seq[T0], Seq[T1]) where Seq[T0] is the list of rdd0's values for the key;
    val mergeCombiners: (CoGroupCombiner, CoGroupCombiner) => CoGroupCombiner =
      (combiner1, combiner2) => {
        var depNum = 0
        while (depNum < numRdds) {
          combiner1(depNum) ++= combiner2(depNum)
          depNum += 1
        }
        combiner1
      }
    new ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner](
      createCombiner, mergeValue, mergeCombiners)
  }

3. The persist operator

Preface: persist stores the RDD's data in the BlockManager's memoryStore and diskStore according to the chosen storage level; cache simply calls persist. The default storage level of an RDD is MEMORY_ONLY, while a Dataset defaults to MEMORY_AND_DISK. Calling persist only tags the RDD's partitions with a storage level: a partition is actually cached only after a task has computed it. A storage level can be assigned to an RDD only once; trying to change it afterwards raises an exception;
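A small usage sketch of the behaviour described above (lazy marking; a level can be assigned only once); the input path is hypothetical and sc is an existing SparkContext:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("/tmp/app.log")              // hypothetical input path
val errors = logs.filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_AND_DISK)        // only tags the RDD; nothing is stored yet
errors.count()                                      // first action: computed partitions are cached
errors.filter(_.contains("timeout")).count()        // served from the BlockManager, no recompute

// errors.persist(StorageLevel.DISK_ONLY)           // would throw: the level can only be set once
errors.unpersist()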

  • 1. The RDD checks its storage level;
org.apache.spark.rdd.RDD.scala

final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {  // a storage level has been set
      getOrCompute(split, context)  // read from the cache, or compute if missing
    } else {  // no caching (the default)
      computeOrReadCheckpoint(split, context)  // compute, or read the checkpoint;
    }
  }

private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
    // every partition of an rdd maps to one unique block id (rdd_id, partition_id)
    val blockId = RDDBlockId(id, partition.index) 
    // (2) call BlockManager.getOrElseUpdate to fetch the cached block, or recompute it with the given function;
    SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
      readCachedBlock = false
      computeOrReadCheckpoint(partition, context)
    }) match {
      // Left means the block was successfully put into (or found in) the BlockManager, so return it
      case Left(blockResult) =>
        if (readCachedBlock) {  // the block was already cached and read straight from the BlockManager;
          // ...
          new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
            // ...
          }
        } else {  // the block was just computed, put into the BlockManager, and read back;
          new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
        }
      // Right means the put into the BlockManager failed (e.g. level MEMORY_ONLY but not enough memory);
      // such a block is effectively uncached and must be recomputed every time;
      case Right(iter) =>
        new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
    }
  }
  • 2. Fetch the block from the BlockManager, or recompute it;
org.apache.spark.storage.BlockManager.scala

def getOrElseUpdate[T](
      blockId: BlockId,
      level: StorageLevel,
      classTag: ClassTag[T],
      makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
    // first try to read the block from the BlockManager; if present, return it, otherwise compute it;
    get[T](blockId)(classTag) match {
      case Some(block) =>
        return Left(block)
      case _ =>
    }
    // (3) compute the block and put it into the BlockManager
    doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
      // None means the computed block was stored successfully, so read it back from the BlockManager;
      case None =>
        val blockResult = getLocalValues(blockId).getOrElse {
          releaseLock(blockId)
          throw new SparkException(s"get() failed for block $blockId even though we held a lock")
        }
        releaseLock(blockId)
        Left(blockResult)
      // Some(iterator) means the put failed for lack of memory or disk space;
      // the block will have to be recomputed the next time it is used;
      case Some(iter) =>
       Right(iter)
    }
  }
  • 3. Put the block into the BlockManager;
org.apache.spark.storage.BlockManager.scala

private def doPutIterator[T](
      blockId: BlockId,
      iterator: () => Iterator[T],
      level: StorageLevel,
      classTag: ClassTag[T],
      tellMaster: Boolean = true,
      keepReadLock: Boolean = false): Option[PartiallyUnrolledIterator[T]] = {
    doPut(blockId, level, classTag, tellMaster = tellMaster, keepReadLock = keepReadLock) { info =>
      // the storage level uses memory (MEMORY_ONLY, MEMORY_AND_DISK, ...)
      if (level.useMemory) {
        // deserialized means the data is stored as Java objects;
        if (level.deserialized) {
          // cache the objects in memory directly; no deserialization is needed on read
          memoryStore.putIteratorAsValues(blockId, iterator(), classTag) match {
            // enough memory
            case Right(s) =>
              size = s
            // not enough memory
            case Left(iter) =>
              // the storage level also uses disk
              if (level.useDisk) {
                diskStore.put(blockId) { /* ... */ }
              } else {
                // no memory left and no disk fallback: the block is not cached and its
                // computed iterator is handed back;
                iteratorFromFailedMemoryStorePut = Some(iter)
              }
          }
        } else { // stored in serialized form: serialize before storing, deserialize on read
          memoryStore.putIteratorAsBytes(blockId, iterator(), classTag, level.memoryMode) match {
            case Right(s) =>
              size = s
            case Left(partiallySerializedValues) =>
              // same steps as above
              if (level.useDisk) {
                diskStore.put(blockId) { /* ... */ }
                size = diskStore.getSize(blockId)
              } else {
                iteratorFromFailedMemoryStorePut = Some(partiallySerializedValues.valuesIterator)
              }
          }
        }

      } else if (level.useDisk) {   // disk-only storage level
        diskStore.put(blockId) { /* ... */ }
      }
      // ... when the storage level has a replication factor > 1, replicate the block
      // if the block was not cached, return its iterator so downstream code can still consume it;
      iteratorFromFailedMemoryStorePut
    }
  }

4. The coalesce operator

Preface: the shuffle flag of coalesce decides whether repartitioning triggers a shuffle. repartition simply calls coalesce with shuffle = true, while coalesce itself defaults to shuffle = false. The rest of this section walks through the shuffle = false path;
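A quick sketch of the two code paths (made-up sizes, existing SparkContext sc assumed):

val wide = sc.parallelize(1 to 1000000, numSlices = 1000)

// shuffle = false (default): narrow dependency, each output task reads about 10 parent partitions.
val narrowed = wide.coalesce(100)

// Increasing the partition count, or shrinking it by orders of magnitude, needs shuffle = true;
// repartition(n) is simply coalesce(n, shuffle = true).
val rebalanced  = wide.repartition(2000)
val drastically = wide.coalesce(10, shuffle = true)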

  • 1. Decide whether to shuffle: use a shuffled repartition when increasing the partition count or reducing it drastically; when only slightly reducing the partition count, no shuffle is necessary;
org.apache.spark.rdd.RDD.scala

def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null): RDD[T] = withScope {
    if (shuffle) {
      // shuffled path
      val distributePartition = (index: Int, items: Iterator[T]) => {
        // start from a random position and spread records round-robin over the partitions;
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      // (2) no shuffle
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
  • 2. CoalescedRDD: grouping the parent partitions and computing them;
org.apache.spark.rdd.CoalescedRDD.scala

override def getPartitions: Array[Partition] = {
    // the coalescer is empty by default
    val pc = partitionCoalescer.getOrElse(new DefaultPartitionCoalescer())
    // DefaultPartitionCoalescer.coalesce returns Array[PartitionGroup];
    // each PartitionGroup is a set of parent partitions, and one task processes one group;
    // when assigning parent partitions to groups, DefaultPartitionCoalescer tries to keep
    // the groups balanced and locality-friendly;
    pc.coalesce(maxPartitions, prev).zipWithIndex.map {
      case (pg, i) =>
        val ids = pg.partitions.map(_.index).toArray
        new CoalescedRDDPartition(i, prev, ids, pg.prefLoc)
    }
  }

override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
    // CoalescedRDDPartition.parents is the set of parent partitions this task processes;
    // so with 1000 parent partitions and coalesce(10), one child task reads 100 parent partitions;
    // hence when the target partition count is much smaller than the parent's (by 1-2 orders of
    // magnitude), a shuffled repartition is usually the better choice;
    partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
      firstParent[T].iterator(parentPartition, context)
    }
  }

5. How shuffle write works

Preface: Spark's sort-based shuffle is similar to MapReduce's: data is first written to memory, spilled to disk when memory runs out, and when the last batch has been processed it is merged with the earlier spill files into one large file sorted by (partitionId, key); an index file is also produced to record where each partition starts and ends. The sort-shuffle write path is:

  • 1. Spill: data goes to memory first and is spilled to disk when memory is insufficient; before spilling, the records are sorted by (partitionId, key);

  • 2. Merge: once the last batch has been processed, all spill files are merged into one file sorted by (partitionId, key), plus an index file recording each partition's start and end offsets;

  • PS: Spark 2.2 ships three shuffle-write implementations: SortShuffleWriter, BypassMergeSortShuffleWriter, and UnsafeShuffleWriter (tungsten-sort);

  • 1. Generate the tasks: only the last stage of a job is a ResultStage; all others are ShuffleMapStages (which repartition and persist their output), so shuffle write happens in ShuffleMapStages and the corresponding task type is ShuffleMapTask;

org.apache.spark.scheduler.DAGScheduler.scala
inside the submitMissingTasks method

val tasks: Seq[Task[_]] = try {
      val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
      stage match {
        case stage: ShuffleMapStage =>
          stage.pendingPartitions.clear()
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = stage.rdd.partitions(id)
            stage.pendingPartitions += id
            // (2) instantiate the ShuffleMapTask; its runTask method will be invoked;
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
              Option(sc.applicationId), sc.applicationAttemptId)
          }

        case stage: ResultStage =>
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = stage.rdd.partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, id, properties, serializedTaskMetrics,
              Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
          }
      }
    }
  • 2. Obtain the shuffleManager object and call its write method to start writing the data;
org.apache.spark.scheduler.ShuffleMapTask.scala

override def runTask(context: TaskContext): MapStatus = {
   // some code omitted...

    var writer: ShuffleWriter[Any, Any] = null
    try {
      // (3) get the shuffleManager instance from SparkEnv;
      val manager = SparkEnv.get.shuffleManager 
      // (4) pick the writer implementation
      writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
      // (5) start writing the data
      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
      writer.stop(success = true).get
    } catch {
       // some code omitted...
    }
  }
  • 3. Choose the ShuffleManager, i.e. how shuffle write is performed; the default is SortShuffleManager;
org.apache.spark.SparkEnv.scala
inside the create method

val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  // sorting and spilling operate directly on serialized binary data instead of Java objects,
  // which is faster but comes with more restrictions;
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")  // SortShuffleManager by default
val shuffleMgrClass =
  shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase(Locale.ROOT), shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)  // instantiate the shuffleManager object
  • 4. Determine the ShuffleHandle implementation, which selects the shuffle-write implementation; a configuration sketch follows the snippet below;
org.apache.spark.shuffle.sort.SortShuffleManager.scala

// called when the ShuffleDependency registers the shuffle, to pick the handle
override def registerShuffle[K, V, C](
      shuffleId: Int,
      numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
    if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
      // 1. mapSideCombine is false, i.e. no map-side aggregation is possible;
      // 2. and the partition count is below spark.shuffle.sort.bypassMergeThreshold (default 200);
      // because along the way each task opens one file per partition - they are merged at the end,
      // but the number of intermediate files is still large;
      new BypassMergeSortShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
      // 1. the dependency has no aggregation and does not sort its output;
      // 2. the serializer supports relocation of serialized data (Kryo and Spark SQL's custom serializers do);
      // 3. the partition count is below 16777216 (2^24);
      new SerializedShuffleHandle[K, V](
        shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
    } else {
      // otherwise BaseShuffleHandle; the two handles above also extend BaseShuffleHandle;
      new BaseShuffleHandle(shuffleId, numMaps, dependency)
    }
  }

// pick the writer implementation according to the handle chosen at registration
override def getWriter[K, V](/* params omitted... */): ShuffleWriter[K, V] = {
    val env = SparkEnv.get
    handle match {
      case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
        // the "serialized" shuffle, i.e. the tungsten-sort implementation; older versions
        // required an explicit setting to use it;
        new UnsafeShuffleWriter(/* params omitted... */)
      case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
        // sort shuffle with the bypass mechanism enabled
        new BypassMergeSortShuffleWriter(/* params omitted... */)
      case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
        // the default SortShuffleWriter
        new SortShuffleWriter(/* params omitted... */)
    }
  }
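The branches above are driven by the dependency (map-side combine, serializer, partition count) and by configuration. A hedged configuration sketch; the values are illustrative, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // "sort" (the default) and "tungsten-sort" both map to SortShuffleManager in Spark 2.x.
  .set("spark.shuffle.manager", "sort")
  // The bypass path is taken only when there is no map-side combine AND the number of
  // reduce partitions is at most this threshold (default 200).
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")
  // Kryo supports relocation of serialized data, one precondition of the
  // SerializedShuffleHandle / UnsafeShuffleWriter path.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")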
  • 5. Write the data with the selected shuffle-write implementation; only SortShuffleWriter is covered here;
org.apache.spark.shuffle.sort.SortShuffleWriter.scala

override def write(records: Iterator[Product2[K, V]]): Unit = {
    sorter = if (dep.mapSideCombine) {
      // if the lineage contains an aggregating operator, aggregation happens on the map side,
      // i.e. before shuffle write;
      require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
      new ExternalSorter[K, V, C](
        context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
    } else {
      new ExternalSorter[K, V, V](
        context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
    }
    // (6) insert the records into the in-memory buffer, spilling to disk when needed (the main tuning target)
    sorter.insertAll(records)

    val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
    val tmp = Utils.tempFileWith(output)
    try {
      val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
      // merge-sort the in-memory data and all spill files into one data file ordered by (partitionId, key);
      val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
      // write the index file recording each partition's start and end offsets;
      shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
      mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
    } finally {
      if (tmp.exists() && !tmp.delete()) {
        logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
      }
    }
  }
  • 6. Insert the records into the in-memory structure, spilling to disk when needed (the main tuning target);

    org.apache.spark.util.collection.ExternalSorter.scala
    // both shuffle write and shuffle read use this class to buffer data in memory, spill it to disk, and finally merge the spills;
    
    // insert one partition's records
    def insertAll(records: Iterator[Product2[K, V]]): Unit = {
        val shouldCombine = aggregator.isDefined
        // choose the buffering data structure: a map (PartitionedAppendOnlyMap) when the dependency
        // aggregates (e.g. reduceByKey), otherwise a buffer (PartitionedPairBuffer, e.g. join);
        // both are backed by an array with data(2*n) = (partId, key) and data(2*n+1) = value;
        if (shouldCombine) {
          // build the aggregation function; values are aggregated as they are read;
          val mergeValue = aggregator.get.mergeValue
          val createCombiner = aggregator.get.createCombiner
          var kv: Product2[K, V] = null
          val update = (hadValue: Boolean, oldValue: C) => {
            if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
          }
          while (records.hasNext) {
            addElementsRead()
            kv = records.next()
            map.changeValue((getPartition(kv._1), kv._1), update)
            // *adding this record may trigger a spill
            maybeSpillCollection(usingMap = true)
          }
        } else {
          while (records.hasNext) {
            addElementsRead()
            val kv = records.next()
            buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
            maybeSpillCollection(usingMap = false)
          }
        }
      }
    
    // may spill
    private def maybeSpillCollection(usingMap: Boolean): Unit = {
        var estimatedSize = 0L
        if (usingMap) {
          // (6.1) estimate the map's memory footprint (not the real size but an extrapolation);
          // because it is only an estimate, the real usage can be much larger and an OOM can
          // happen before any spill is triggered;
          estimatedSize = map.estimateSize()
          // (6.2) possibly spill
          if (maybeSpill(map, estimatedSize)) {
            // after a spill, re-instantiate the map;
            map = new PartitionedAppendOnlyMap[K, C]
          }
        } else {
          estimatedSize = buffer.estimateSize()
          if (maybeSpill(buffer, estimatedSize)) {
            buffer = new PartitionedPairBuffer[K, C]
          }
        }

        if (estimatedSize > _peakMemoryUsedBytes) {
          _peakMemoryUsedBytes = estimatedSize
        }
      }
    
    
    
    • 6.1. Estimate the map's memory footprint: measuring the real size takes a few milliseconds, which is far too slow to do for every one of hundreds of millions of records;

      // callback, invoked once per updated record
      protected def afterUpdate(): Unit = {
          numUpdates += 1
          if (nextSampleNum == numUpdates) {
            // measure the real size again once the update count has grown by a factor of 1.1
            takeSample()
          }
        }

      // measure the real memory footprint
      private def takeSample(): Unit = {
          samples.enqueue(Sample(SizeEstimator.estimate(this), numUpdates))
          // keep only the two most recent measurements
          if (samples.size > 2) {
            samples.dequeue()
          }
          // average number of bytes per record
          val bytesDelta = samples.toList.reverse match {
            case latest :: previous :: tail =>
              // bytes added between the two measurements / records added between them
              (latest.size - previous.size).toDouble / (latest.numUpdates - previous.numUpdates)
            case _ => 0
          }
          // average number of bytes per record
          bytesPerUpdate = math.max(0, bytesDelta)
          // next measurement point: current update count * 1.1;
          nextSampleNum = math.ceil(numUpdates * SAMPLE_GROWTH_RATE).toLong
        }

      // extrapolate the current size
      def estimateSize(): Long = {
          assert(samples.nonEmpty)
          // extrapolated growth = average bytes per record * records added since the last measurement
          val extrapolatedDelta = bytesPerUpdate * (numUpdates - samples.last.numUpdates)
          // last measured size + extrapolated growth
          (samples.last.size + extrapolatedDelta).toLong
        }
      
    • 6.2. If no more memory can be obtained, trigger a spill; a configuration sketch follows this snippet;

      // may trigger a spill
      protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
          var shouldSpill = false
          // checked once every 32 records, and only when the current size has reached the current threshold;
          if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
            // try to acquire enough execution memory to double the current size;
            val amountToRequest = 2 * currentMemory - myMemoryThreshold
            // the amount actually granted
            val granted = acquireMemory(amountToRequest)
            myMemoryThreshold += granted
            // if little or nothing was granted, currentMemory stays >= myMemoryThreshold,
            // which means the shuffle memory pool is exhausted, so spill;
            shouldSpill = currentMemory >= myMemoryThreshold
          }
          // also spill when the number of in-memory records exceeds
          // spark.shuffle.spill.numElementsForceSpillThreshold (default Long.MaxValue, so almost never);
          shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
          // actually spill
          if (shouldSpill) {
            _spillCount += 1
            logSpillage(currentMemory)
            // (de)serialization is done in batches of spark.shuffle.spill.batchSize (default 10000) records;
            // before spilling, records are sorted by partition id (and by key when an ordering is required);
            // data is first written to a buffer of spark.shuffle.file.buffer (default 32k) and then flushed to disk;
            spill(collection)
            _elementsRead = 0
            _memoryBytesSpilled += currentMemory
            // release the memory
            releaseMemory()
          }
          shouldSpill
        }
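      The spill path above is governed by a few settings; a hedged sketch of the knobs mentioned in the comments (Spark 2.2 defaults in parentheses, the values below are only illustrative):

      import org.apache.spark.SparkConf

      val shuffleTuning = new SparkConf()
        // write buffer used before a spill is flushed to disk (default 32k)
        .set("spark.shuffle.file.buffer", "64k")
        // records (de)serialized per batch while spilling and merging (default 10000)
        .set("spark.shuffle.spill.batchSize", "10000")
        // force a spill after this many in-memory records regardless of size
        // (default Long.MaxValue, i.e. effectively disabled)
        .set("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000")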
      

6. How shuffle read works

Preface: shuffle read has to fetch data from remote nodes, so it involves network IO; fetching happens in batches, with a block (one partition of a map task's shuffle-write output) as the smallest unit. Under data skew a single block can be huge, and since a fetch request always pulls at least one whole block into memory, skew easily leads to OOM; this can be mitigated through configuration (the relevant knobs are sketched below);
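The fetch-side knobs referred to above, as a hedged sketch (Spark 2.2 property names; later releases rename the shuffle-to-memory threshold to spark.maxRemoteBlockSizeFetchToMem; the values are only illustrative):

import org.apache.spark.SparkConf

val readTuning = new SparkConf()
  // total bytes of remote block data allowed in flight at once (split over ~5 concurrent requests)
  .set("spark.reducer.maxSizeInFlight", "48m")
  // maximum number of remote fetch requests in flight at any time
  .set("spark.reducer.maxReqsInFlight", "64")
  // fetch requests larger than this are streamed to disk instead of buffered in memory,
  // the main defence against OOM on skewed blocks
  .set("spark.reducer.maxReqSizeShuffleToMem", "200m")
  // re-fetch a block once if the (compressed) stream looks corrupted
  .set("spark.shuffle.detectCorrupt", "true")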

  • 1. Instantiate the ShuffleReader;
org.apache.spark.shuffle.sort.SortShuffleManager.scala

override def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext): ShuffleReader[K, C] = {
    // (2) instantiate a BlockStoreShuffleReader to do the reading;
    new BlockStoreShuffleReader(
      handle.asInstanceOf[BaseShuffleHandle[K, _, C]], startPartition, endPartition, context)
  }
  • 2. Instantiate the ShuffleBlockFetcherIterator;
org.apache.spark.shuffle.BlockStoreShuffleReader.scala

override def read(): Iterator[Product2[K, C]] = {
    // (3) fetch blocks in batches;
    val wrappedStreams = new ShuffleBlockFetcherIterator(
      context,
      blockManager.shuffleClient,  // the blockManager client used to fetch remote blocks;
      blockManager,   // this executor's blockManager, used to fetch local blocks;
      // given the shuffleId and the partition range handled by this task, return the nodes holding
      // those partitions together with the block ids and sizes;
      // the return type is (BlockManagerId, Seq[(BlockId, Long)]); the service is provided by the BlockManager;
      mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
      serializerManager.wrapStream,
      // maximum amount of remote block data fetched at once (spread over about 5 nodes);
      SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024,
      // maximum number of remote fetch requests in flight at any time;
      SparkEnv.get.conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue),
      // fetch results below this threshold are kept in memory, larger ones are written straight to disk;
      SparkEnv.get.conf.get(config.REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM),
      // enable corruption detection; only compressed streams and small blocks
      // (smaller than maxBytesInFlight / 3), or the beginning of large ones, are checked;
      SparkEnv.get.conf.getBoolean("spark.shuffle.detectCorrupt", true))

    // serializer
    val serializerInstance = dep.serializer.newInstance()

    // turn the (BlockId, InputStream) iterator wrappedStreams into a (key, value) iterator recordIter;
    val recordIter = wrappedStreams.flatMap { case (blockId, wrappedStream) =>
      serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator
    }

    val readMetrics = context.taskMetrics.createTempShuffleReadMetrics()
    // wrap recordIter with metrics that count how many (key, value) records were read;
    val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
      recordIter.map { record =>
        readMetrics.incRecordsRead(1)
        record
      },
      context.taskMetrics().mergeShuffleReadMetrics())

    // make reading interruptible; interruptibleIter is the iterator everything else operates on
    val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)

    val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
      if (dep.mapSideCombine) {
        // for aggregating operators with mapSideCombine = true, merge the partial combiners;
        // an ExternalAppendOnlyMap aggregates the data and spills, and the spills are merged and aggregated at the end;
        val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
        dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
      } else {
        val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
        dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
      }
    } else {
      require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
      interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
    }

    // if a key ordering was requested, sort with an ExternalSorter, just like shuffle write;
    // the data arrives grouped by the keys' hash-based partitioning, but the user may have defined
    // a custom key ordering, so this step only sorts and does not aggregate;
    // here the ExternalSorter uses the buffer (PartitionedPairBuffer) data structure;
    dep.keyOrdering match {
      case Some(keyOrd: Ordering[K]) =>
        val sorter =
          new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
        sorter.insertAll(aggregatedIter)
        context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
        context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
        context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes)
        CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop())
      case None =>
        aggregatedIter
    }
  }
  • 3. Initialization: split the fetches into local and remote requests, and fetch local blocks concurrently;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala

// called when the object is instantiated;
private[this] def initialize(): Unit = {
    context.addTaskCompletionListener(_ => cleanup())
    // (4) split the fetches into local and remote ones, returning the remote requests;
    val remoteRequests = splitLocalRemoteBlocks()
    // randomize the order of the fetch requests
    fetchRequests ++= Utils.randomize(remoteRequests)
    // ...
    // send the first batch of remote fetch requests;
    fetchUpToMaxBytes()
    // ...
    // fetch the local blocks, concurrently with the remote fetches;
    // this simply calls blockManager.getBlockData(blockId) for each local block;
    fetchLocalBlocks()
    logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
  }
  • 4. Build the remote fetch requests; the key spot for shuffle-read tuning;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala

private[this] def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
    // fetch from up to 5 nodes in parallel, each request at most maxBytesInFlight / 5 bytes;
    val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
    // remote fetch requests
    val remoteRequests = new ArrayBuffer[FetchRequest]
    // iterate over the (BlockManager, blocks) info; these nodes hold the blocks this task has to read;
    for ((address, blockInfos) <- blocksByAddress) {
      totalBlocks += blockInfos.size
      if (address.executorId == blockManager.blockManagerId.executorId) {
        // keep the non-empty local blocks;
        localBlocks ++= blockInfos.filter(_._2 != 0).map(_._1)
        numBlocksToFetch += localBlocks.size
      } else {
        val iterator = blockInfos.iterator
        var curRequestSize = 0L
        var curBlocks = new ArrayBuffer[(BlockId, Long)]
        // iterate over the target blocks on this node
        while (iterator.hasNext) {
          val (blockId, size) = iterator.next()
          if (size > 0) {
            curBlocks += ((blockId, size))  // blocks of the current request;
            remoteBlocks += blockId
            numBlocksToFetch += 1
            curRequestSize += size  // total size of the current request;
          } else if (size < 0) {
            throw new BlockException(blockId, "Negative block size " + size)
          }
          if (curRequestSize >= targetRequestSize) {
            // once the request reaches maxBytesInFlight / 5, close it
            // and add it to the remote request queue;
            remoteRequests += new FetchRequest(address, curBlocks)
            curBlocks = new ArrayBuffer[(BlockId, Long)]
            logDebug(s"Creating fetch request of $curRequestSize at $address")
            curRequestSize = 0
          }
        }
        // close the last request: curBlocks is non-empty but smaller than targetRequestSize;
        if (curBlocks.nonEmpty) {
          remoteRequests += new FetchRequest(address, curBlocks)
        }
      }
    }
    logInfo(s"Getting $numBlocksToFetch non-empty blocks out of $totalBlocks blocks")
    remoteRequests
  }
  • 5. next() returns the fetched data and issues further remote fetch requests;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala

override def next(): (BlockId, InputStream) = {
    // ...
    while (result == null) {
      val startFetchWait = System.currentTimeMillis()
      result = results.take()    // take the first element of the result queue;
      val stopFetchWait = System.currentTimeMillis()

      result match {
         // the fetch succeeded: build the input stream and verify the data;
        case r @ SuccessFetchResult(blockId, address, size, buf, isNetworkReqDone) =>
          // ... build the input stream
          // verification reads the whole block at once, which can itself cause an OOM
          if (detectCorrupt && !input.eq(in) && size < maxBytesInFlight / 3) {
            // only compressed blocks smaller than maxBytesInFlight / 3 are checked;
            // on the first failure the block is re-fetched and its id added to a HashSet;
            // a second failure makes the fetch fail;
          }
        // the fetch failed
        case FailureFetchResult(blockId, address, e) =>
          throwFetchFailedException(blockId, address, e)
      }

      // (6) issue another round of remote fetch requests;
      fetchUpToMaxBytes()
    }
    // return (blockId, the input stream of that block)
    currentResult = result.asInstanceOf[SuccessFetchResult]
    (currentResult.blockId, new BufferReleasingInputStream(input, this))
  }
  • 6. Issue the remote fetch requests;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala

// issue remote fetch requests
private def fetchUpToMaxBytes(): Unit = {
    // fetch up to maxBytesInFlight (48M by default) worth of data;
    // since a single request is capped at maxBytesInFlight / 5, if every request is exactly 48M / 5
    // then 5 requests go out at once, each at most 48M / 5, for a total of at most 48M;
    while (fetchRequests.nonEmpty &&
      (bytesInFlight == 0 ||   // nothing currently in flight
        (reqsInFlight + 1 <= maxReqsInFlight &&  // the number of in-flight requests stays below the limit
          bytesInFlight + fetchRequests.front.size <= maxBytesInFlight))) {  // and the in-flight bytes stay below 48M
      sendRequest(fetchRequests.dequeue())  // (7) actually send the request
    }
  }
  • 7. Actually send a request and fetch the data;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala

private[this] def sendRequest(req: FetchRequest) {
    // ...
    val blockFetchingListener = new BlockFetchingListener {
      override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
         // ...
        results.put(new SuccessFetchResult(/*...*/))
         // ...
      }
      // success or failure, a matching result instance is put into results and handled downstream;
      override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
        results.put(new FailureFetchResult(/*...*/))
      }
    }
    // requests no larger than maxReqSizeShuffleToMem are buffered in memory; larger ones are streamed to disk
    if (req.size > maxReqSizeShuffleToMem) {
      val shuffleFiles = blockIds.map { _ =>
        blockManager.diskBlockManager.createTempLocalBlock()._2
      }.toArray
      shuffleFilesSet ++= shuffleFiles
      shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
        blockFetchingListener, shuffleFiles)
    } else {
      shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
        blockFetchingListener, null)
    }
  }

7. SparkSQL execution plan walkthrough


Preface: SparkSQL goes through a parsed logical plan (Parsed) -> analyzed logical plan (Analyzed) -> optimized logical plan (Optimized) -> physical plan (Physical), and finally produces RDDs to execute. Because the optimizer is applied, it can in theory outperform non-optimal hand-written RDD code, and the schema information makes it more readable;

  • Parsed logical plan: the SQL string is parsed by ANTLR into an AST, which is then turned into a logical plan;

  • Analyzed logical plan: the previous step is unresolved; this step checks tables and columns against the catalog and produces the analyzed logical plan;

  • Optimized logical plan: the analyzed plan is optimized, mainly through column pruning, merging, predicate pushdown, and so on;

  • Physical plan: the final physical execution plan is produced;
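All four stages can be printed with explain(true); a small sketch against a temporary view (names are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("planDemo").master("local[2]").getOrCreate()
spark.range(100).createOrReplaceTempView("t")

// Prints the Parsed, Analyzed, and Optimized Logical Plans plus the Physical Plan;
// note how ConstantFolding turns 1 + 2 into 3 in the optimized plan.
spark.sql("SELECT id, 1 + 2 AS c FROM t WHERE id > 10").explain(true)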

  • 1. Parsed Logical Plan: parse the SQL string into an abstract syntax tree and build the unresolved logical plan;

    • SqlBaseLexer and SqlBaseParser, Java classes generated by ANTLR from the grammar file SqlBase.g4, perform lexical and syntactic analysis on the SQL string and produce the syntax tree;
    • astBuilder turns the syntax tree into an unresolved logical plan; at this point the system does not yet know what each token refers to;
org.apache.spark.sql.SparkSession.scala
def sql(sqlText: String): DataFrame = {
    // sessionState.sqlParser.parsePlan(sqlText) parses the SQL string into a logical plan
    Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
  }

org.apache.spark.sql.catalyst.parser.ParseDriver.scala
// an implementation of ParserInterface
inside the abstract class AbstractSqlParser

// parse the syntax tree into a logical plan
override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser => // parse into an AST
    astBuilder.visitSingleStatement(parser.singleStatement()) match {  // build the logical plan from the AST
      case plan: LogicalPlan => plan
      case _ =>
        val position = Origin(None, None)
        throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
    }
  }

// parse the SQL string into an abstract syntax tree
protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
    logInfo(s"Parsing command: $command")
    // the lexer SqlBaseLexer and the parser SqlBaseParser are Java classes generated by ANTLR 4;
    // lexical analysis
    val lexer = new SqlBaseLexer(new ANTLRNoCaseStringStream(command))
    // ...
    // syntactic analysis
    val tokenStream = new CommonTokenStream(lexer)
    val parser = new SqlBaseParser(tokenStream)
    // ...
  }
  • 2. Build a QueryExecution object, which analyzes and optimizes the logical plan and produces the final physical plan;
    • analyzed: resolve the parsed unresolved logical plan into a logical plan;
    • optimized: optimize the logical plan;
    • sparkPlan: turn the optimized logical plan into a physical plan that Spark can execute;
org.apache.spark.sql.Dataset.scala
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
    // build the QueryExecution object
    val qe = sparkSession.sessionState.executePlan(logicalPlan)
    qe.assertAnalyzed()
    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
  }

// the core class that performs the Analyzed, Optimized, and SparkPlan steps
org.apache.spark.sql.execution.QueryExecution.scala
// (3) use the Analyzer to resolve the parsed unresolved logical plan into a logical plan;
lazy val analyzed: LogicalPlan = {
    SparkSession.setActiveSession(sparkSession)
    sparkSession.sessionState.analyzer.execute(logical)
  }

lazy val withCachedData: LogicalPlan = {
    assertAnalyzed()
    assertSupported()
    sparkSession.sharedState.cacheManager.useCachedData(analyzed)
  }
// (4) use the Optimizer to optimize the logical plan
lazy val optimizedPlan: LogicalPlan = sparkSession.sessionState.optimizer.execute(withCachedData)
// (5) use the SparkPlanner to generate the physical plan
lazy val sparkPlan: SparkPlan = {
    SparkSession.setActiveSession(sparkSession)
    planner.plan(ReturnAnswer(optimizedPlan)).next()
  }
  • 3. Analyzer: resolve the parsed unresolved logical plan into a logical plan;
    • the Analyzer defines many batches; each batch's rules are applied in turn to the unresolved logical plan;
    • for example, the batch named Resolution turns unresolved nodes into resolved ones; its ResolveRelations rule asks the catalog for the structure of the referenced table and resolves the table's columns; the catalog caches (table name, LogicalPlan) pairs; in short, unresolved nodes get data-type and function bindings attached;
    • catalog: an API added in Spark 2.0 for working with SparkSQL and Hive metadata; it can list databases, tables, columns, and functions, and run DDL on Hive tables (see the sketch after the RuleExecutor snippet below);
org.apache.spark.sql.catalyst.rules.RuleExecutor.scala

// the structure of a Batch
protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)

// the analyzer calls execute, which is defined on Analyzer's parent class RuleExecutor
def execute(plan: TreeType): TreeType = {
    var curPlan = plan

    batches.foreach { batch =>
      // ... apply every rule in the batch, repeating until a fixed point or the iteration limit is reached
    }
    curPlan
  }
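A short sketch of the Catalog API mentioned above (Spark 2.x; the database and table names are hypothetical, and spark is an existing SparkSession):

spark.catalog.listDatabases().show()
spark.catalog.listTables("default").show()
spark.catalog.listColumns("default", "t").show()
spark.catalog.listFunctions().show(5)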
  • 4. Optimizer: optimize the logical plan;
    • like the Analyzer, it extends RuleExecutor and defines many batches of rules that optimize the logical plan;
    • for example, the batch named Operator Optimizations optimizes the operators; the batches are applied in order;
    • classic SQL optimizations include predicate pushdown, constant folding, column pruning, and limit merging;
    • the Optimizer's rule groups include union merging, replacement (semi join), operator pushdown, operator combining, and constant folding with strength reduction;
org.apache.spark.sql.catalyst.optimizer.Optimizer.scala

// SQL optimization is the part we care most about, so here are the rules in the Optimizer's batches
Batch("Union", Once,
      CombineUnions) ::  // merge adjacent unions; in nested unions a distinct is only needed at the outermost level;
    Batch("Pullup Correlated Expressions", Once,
      PullupCorrelatedPredicates) ::  // pull subquery filters up;
    Batch("Subquery", Once,
      OptimizeSubqueries) ::  // on a subquery, recurse with Optimizer.this.execute(Subquery(s.plan));
    Batch("Replace Operators", fixedPoint,
      ReplaceIntersectWithSemiJoin,  // replace intersect with a semi join
      ReplaceExceptWithAntiJoin,   // replace except with an anti join
      ReplaceDistinctWithAggregate) ::  // replace distinct with a group-by aggregate;
    Batch("Aggregate", fixedPoint,
      RemoveLiteralFromGroupExpressions,  // remove literals from group-by expressions
      RemoveRepetitionFromGroupExpressions) ::  // remove duplicate group-by expressions;
    Batch("Operator Optimizations", fixedPoint, Seq(
      // Operator push down
      PushProjectionThroughUnion,   // push projections (column pruning) below a union, into each branch;
      ReorderJoin(conf),   // join reordering; the CBO (Cost Based Optimizer) reorders joins based on data size;
      // turn outer joins with filters into inner joins; e.g. a left outer join followed by a filter on a
      // right-table column: unmatched rows have nulls there and the filter drops them anyway, so the
      // result equals an inner join, which filters on less data;
      EliminateOuterJoin(conf), 
      PushPredicateThroughJoin,  // push join predicates down to both sides, i.e. filter before joining;
      PushDownPredicate,   // data-source predicate pushdown: a filter right after a scan is applied while reading;
      LimitPushDown(conf),  // push limit below union or join, into their children;
      ColumnPruning,   // column pruning: read only the columns that are used;
      InferFiltersFromConstraints(conf),  // infer constraints, e.g. filter(a>2) becomes filter(isnotnull(a) && a>2);
      // Operator combine
      CollapseRepartition,  // collapse adjacent repartitions;
      CollapseProject,   // collapse projections (drop unnecessary selects);
      CollapseWindow,  // collapse windows with the same partitioning and ordering;
      CombineFilters,  // combine filters;
      CombineLimits,  // combine adjacent limits, keeping the smaller one;
      CombineUnions,  // combine unions, same as the first Union batch;
      // Constant folding and strength reduction
      NullPropagation(conf),  // null propagation: stop nulls from spreading through the tree;
      FoldablePropagation,   // constant propagation; select 'c' as a order by a => select 'c' as a order by 'c';
      OptimizeIn(conf),  // optimize IN: handle empty lists and duplicates;
      ConstantFolding,   // constant folding; e.g. 1 + 2 is evaluated to 3 once instead of per row;
      ReorderAssociativeOperator, // reorder and fold associative operators; x+2+y+7 is flattened to [2,7],[x,y] and [2,7] becomes 9;
      LikeSimplification,  // simplify LIKE; e.g. name like 'shen%' becomes name.startsWith("shen");
      BooleanSimplification, // simplify boolean expressions; e.g. (a=1 and b=2) or (a=1 and b>2) becomes (a=1) and (b=2 or b>2)
      SimplifyConditionals,  // simplify if/case expressions, similar to BooleanSimplification;
      RemoveDispensableExpressions,  // remove unnecessary nodes;
      SimplifyBinaryComparison,  // simplify comparisons; if both sides of = are the same expression, replace with true;
      PruneFilters(conf),  // prune filters; e.g. parent filter a>4 and b=2 with child filter b=2: the child's filter(b=2) is dropped;
      EliminateSorts,  // remove sorts that are unused or duplicated;
      SimplifyCasts,  // drop casts whose input and output types are the same;
      SimplifyCaseConversionExpressions,  // simplify chained upper/lower conversions, keeping only the last one;
      RewriteCorrelatedScalarSubquery,  // rewrite correlated scalar subqueries as left outer joins;
      EliminateSerialization,   // eliminate unnecessary serialization;
      RemoveRedundantAliases,  // remove redundant aliases;
      RemoveRedundantProject,  // remove redundant projections (selects);
      SimplifyCreateStructOps,  // push operations into CreateStructOps;
      SimplifyCreateArrayOps,  // push operations into CreateArrayOps;
      SimplifyCreateMapOps) ++ // push operations into CreateMapOps;
      extendedOperatorOptimizationRules: _*) ::
    Batch("Check Cartesian Products", Once,
      CheckCartesianProducts(conf)) ::  // detect cartesian-product joins; fail if one occurs while spark.sql.crossJoin.enabled=false;
    Batch("Join Reorder", Once,
      CostBasedJoinReorder(conf)) ::  // cost-based join reordering (dynamic programming) to pick a good join order;
    Batch("Decimal Optimizations", fixedPoint,
      DecimalAggregates(conf)) ::  // optimize aggregations over decimals;
    Batch("Object Expressions Optimization", fixedPoint,
      EliminateMapObjects,  // eliminate MapObjects;
      CombineTypedFilters) ::  // combine adjacent typed filters
    Batch("LocalRelation", fixedPoint,
      ConvertToLocalRelation,  // optimize LocalRelation
      PropagateEmptyRelation) ::  // optimize EmptyRelation
    Batch("OptimizeCodegen", Once,
      OptimizeCodegen(conf)) ::  // optimize the generated code;
    Batch("RewriteSubquery", Once,
      RewritePredicateSubquery,  // rewrite predicate subqueries as left semi / left anti joins;
      CollapseProject) :: Nil  // collapse projections (drop unnecessary selects), as above;
  • 5. SparkPlan: generate the physical plan. Even the optimized logical plan is abstract: a join, for example, only says that two tables are connected on matching columns, not how to execute it, so the planner (eventually the CBO) has to pick a concrete join implementation based on cost;

    org.apache.spark.sql.execution.QueryExecution.scala
    lazy val sparkPlan: SparkPlan = {
        SparkSession.setActiveSession(sparkSession)
        // (5.1) call plan on the SparkPlanner; plan is defined in the parent class QueryPlanner;
        // it can return one or more physical plans (currently always one) and the first is used;
        // today the CBO mainly picks the join implementation based on table sizes; a proper cost
        // model is planned for later;
        planner.plan(ReturnAnswer(optimizedPlan)).next()
      }
    // execute the physical plan
    lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)
    // before execution, a set of rules is applied to the physical plan;
    protected def prepareForExecution(plan: SparkPlan): SparkPlan = {
        preparations.foldLeft(plan) { case (sp, rule) => rule.apply(sp) }
    }
    // these rules also extend Rule, much like the Optimizer's rules, so this step can be seen as optimizing the physical plan;
    protected def preparations: Seq[Rule[SparkPlan]] = Seq(
        python.ExtractPythonUDFs,
        PlanSubqueries(sparkSession),  // analyze and optimize subqueries again, producing a QueryExecution for each;
        EnsureRequirements(sparkSession.sessionState.conf), // check the required partitioning and insert shuffles if needed;
        // fuse a chain of operators (map, filter, ...) into a single generated Java method:
        // SparkPlans that support codegen get a WholeStageCodegenExec on top, the others an InputAdapter;
        CollapseCodegenStages(sparkSession.sessionState.conf),
        // an Exchange is essentially a shuffle; duplicated Exchanges are found and reused to avoid recomputation;
        ReuseExchange(sparkSession.sessionState.conf),
        // similarly, duplicated subqueries are found and reused to avoid recomputation;
        ReuseSubquery(sparkSession.sessionState.conf))

    • 5.1. Generating the physical plan; the inheritance chain is SparkPlanner -> SparkStrategies -> QueryPlanner;
    // the strategies defined by SparkPlanner
    def strategies: Seq[Strategy] =
        experimentalMethods.extraStrategies ++
          extraPlanningStrategies ++ (
          FileSourceStrategy ::
          DataSourceStrategy(conf) ::
          SpecialLimits ::
          Aggregation ::
          JoinSelection ::  // picks the concrete join implementation; defined in the parent class SparkStrategies;
          InMemoryScans ::  // handles cacheTable
          BasicOperators :: Nil)

    // the plan method defined in QueryPlanner
    def plan(plan: LogicalPlan): Iterator[PhysicalPlan] = {
        // strategies here are the rules defined by SparkPlanner; each rule's apply(plan) is invoked
        val candidates = strategies.iterator.flatMap(_(plan))

        // there may be subqueries, so plan is called recursively
        val plans = candidates.flatMap { candidate =>
          // ...
        }
        // finally return the physical plans
        plans
      }
    
  • 6. Executing the SparkPlan to produce RDDs: the doExecute method is called;

// QueryExecution calls execute;
lazy val toRdd: RDD[InternalRow] = executedPlan.execute()
// execute calls doExecute, and every class extending SparkPlan implements its own doExecute;
final def execute(): RDD[InternalRow] = executeQuery {
    doExecute()
 }

// for example HiveTableScanExec, the executor that scans a Hive table;
// classes extending SparkPlan usually end in Exec, marking the actual executors of the physical plan;
org.apache.spark.sql.hive.execution.HiveTableScanExec.scala

protected override def doExecute(): RDD[InternalRow] = {
    // the RDD is created here
    val rdd = if (!relation.isPartitioned) {
      Utils.withDummyCallSite(sqlContext.sparkContext) {
        hadoopReader.makeRDDForTable(hiveQlTable)
      }
    } else {
      Utils.withDummyCallSite(sqlContext.sparkContext) {
        hadoopReader.makeRDDForPartitionedTable(prunePartitions(rawPartitions))
      }
    }
    val numOutputRows = longMetric("numOutputRows")
    val outputSchema = schema
    // mapPartitionsWithIndexInternal is the internal counterpart of RDD.mapPartitionsWithIndex;
    rdd.mapPartitionsWithIndexInternal { (index, iter) =>
      // (7) GenerateUnsafeProjection generates executable Java code from the expressions and compiles it to bytecode;
      val proj = UnsafeProjection.create(outputSchema)
      proj.initialize(index)
      iter.map { r =>
        numOutputRows += 1
        proj(r)
      }
    }
  }
  • 7. Generate the Java code, compile it to bytecode, and send it to the executors for execution;
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection.scala

// UnsafeProjection.create in HiveTableScanExec.doExecute ends up calling this create method to generate the code;
private def create(
      expressions: Seq[Expression],
      subexpressionEliminationEnabled: Boolean): UnsafeProjection = {
    val ctx = newCodeGenContext()
    val eval = createCode(ctx, expressions, subexpressionEliminationEnabled)

    // the Java code template
    val codeBody = s"""
      public java.lang.Object generate(Object[] references) {
        return new SpecificUnsafeProjection(references);
      }

      class SpecificUnsafeProjection extends ${classOf[UnsafeProjection].getName} {

        private Object[] references;
        ${ctx.declareMutableStates()}

        public SpecificUnsafeProjection(Object[] references) {
          this.references = references;
          ${ctx.initMutableStates()}
        }
        // ...
      }
      """
    // format the template
    val code = CodeFormatter.stripOverlappingComments(
      new CodeAndComment(codeBody, ctx.getPlaceHolderToComments()))
    logDebug(s"code for ${expressions.mkString(",")}:\n${CodeFormatter.format(code)}")

    // compile the Java code to bytecode;
    val c = CodeGenerator.compile(code)
    // call the generate method of the template, passing in the required objects;
    c.generate(ctx.references.toArray).asInstanceOf[UnsafeProjection]
  }
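To look at the Java source these generators produce for a whole query, Spark ships a debug helper; a sketch assuming the spark session and the view t created earlier:

import org.apache.spark.sql.execution.debug._

// Prints the whole-stage-codegen subtrees of the physical plan together with
// the generated Java source.
spark.sql("SELECT id + 1 FROM t WHERE id > 10").debugCodegen()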
