Spark Operators
I. The sortBy Operator
Preface: Spark sorts with a TeraSort-style algorithm: make the data ordered across partitions first, then within each partition, which yields a global ordering:
- 1. Sample to determine boundaries: sample every partition, collect and sort the samples, decide which key range each output partition will hold, and output an array of range upper bounds;
- 2. Shuffle write, ordered across partitions: use a RangePartitioner to compute each record's partition id against the bounds array;
- 3. Shuffle read, ordered within partitions: fetch the data of the same partition from the nodes that hold it and sort it, so every partition is internally ordered;
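As a quick illustration of the API walked through below, here is a minimal usage sketch; the case class, data and partition count are made up for the example, and an active SparkContext `sc` is assumed:
// Hypothetical example: sort a user RDD by age in descending order into 10 partitions.
case class User(name: String, age: Int)
val users = sc.parallelize(Seq(User("a", 30), User("b", 20), User("c", 40)))
val sorted = users.sortBy(u => u.age, ascending = false, numPartitions = 10)
sorted.collect() // Array(User(c,40), User(a,30), User(b,20))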
- 1. Build the RDD as RDD[(K, V)], then call sortByKey;
org.apache.spark.rdd.RDD.scala // Spark 2.2
def sortBy[K](
    f: (T) => K, // sort by the key K returned by f; K must be comparable;
    ascending: Boolean = true, // ascending by default;
    numPartitions: Int = this.partitions.length) // partition count after sorting, defaults to the RDD's partition count;
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f) // build an RDD of (k, v)
    .sortByKey(ascending, numPartitions) // (2) call sortByKey
    .values
}
- 2. Instantiate a RangePartitioner and use it to partition the data during shuffle write, making data ordered across partitions; also set the shuffle-read ordering key so each partition becomes internally ordered;
org.apache.spark.rdd.OrderedRDDFunctions.scala
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope
{
  val part = new RangePartitioner(numPartitions, self, ascending) // (3) build the RangePartitioner;
  new ShuffledRDD[K, V, V](self, part) // during shuffle the data is partitioned by this RangePartitioner;
    .setKeyOrdering(if (ascending) ordering else ordering.reverse) // set the key ordering used during shuffle;
}
- 3. Build the RangePartitioner: sample the source RDD to obtain the boundary array for the range partitions;
org.apache.spark.Partitioner.scala // the defaultPartitioner method on the Partitioner object in this file decides the partitioner used for RDD joins;
// HashPartitioner is the default partitioner in many scenarios; RangePartitioner is the one used for sorting;
Inside the RangePartitioner class
// the elements of the rangeBounds array are the boundaries between partitions;
private var rangeBounds: Array[K] = {
  if (partitions <= 1) {
    Array.empty
  } else {
    // base sample size, capped at 1M;
    val sampleSize = math.min(20.0 * partitions, 1e6)
    // initial per-partition sample size; the source partitions are assumed to be balanced,
    // since every partition is sampled the same number of times;
    // when the sort keeps the RDD's partition count, each partition is sampled 60 records by default;
    val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
    // (4) sample; numItems is the total number of records seen, sketched is [(idx: Int, n: Long, sample: Array[K])]
    // where idx is the partition id, n is that partition's element count and sample holds the sampled keys;
    val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
    if (numItems == 0L) {
      Array.empty
    } else {
      // a partition holding more than roughly 3x the average number of elements is re-sampled,
      // so that the final partitions stay balanced;
      val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
      val candidates = ArrayBuffer.empty[(K, Float)] // balanced samples, the final result;
      val imbalancedPartitions = mutable.Set.empty[Int] // imbalanced partitions that must be re-sampled;
      sketched.foreach { case (idx, n, sample) =>
        if (fraction * n > sampleSizePerPartition) {
          imbalancedPartitions += idx
        } else {
          // every sampled element gets a weight: the key was picked out of `weight` elements,
          // i.e. weight = 1 / the element's sampling probability;
          val weight = (n.toDouble / sample.length).toFloat
          for (key <- sample) {
            candidates += ((key, weight))
          }
        }
      }
      if (imbalancedPartitions.nonEmpty) {
        // re-sample the imbalanced partitions;
        val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
        val seed = byteswap32(-rdd.id - 1)
        // sampling without replacement; the sampled count is roughly 20 * m / n, where m is the element
        // count of the imbalanced partition and n is the average element count per partition;
        // since m > 3n, such a partition yields more than 60 samples on average;
        val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
        val weight = (1.0 / fraction).toFloat
        candidates ++= reSampled.map(x => (x, weight))
      }
      RangePartitioner.determineBounds(candidates, partitions) // (5) derive the bounds from the unsorted samples;
    }
  }
}
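To make the sample-size arithmetic above concrete, here is a small sketch with made-up numbers, assuming the sort keeps the source RDD's partition count (so partitions == rdd.partitions.length):
// Assumed numbers, for illustration only.
val partitions = 100                                 // target (and source) partition count
val sampleSize = math.min(20.0 * partitions, 1e6)    // 2000.0, capped at 1e6
val sampleSizePerPartition =
  math.ceil(3.0 * sampleSize / partitions).toInt     // ceil(3 * 2000 / 100) = 60 samples per partition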
- 4. Sampling: sample every partition of the data;
org.apache.spark.Partitioner.scala
Inside the RangePartitioner object
// sampling
def sketch[K : ClassTag](
    rdd: RDD[K],
    sampleSizePerPartition: Int): (Long, Array[(Int, Long, Array[K])]) = {
  val shift = rdd.id
  // sample partition by partition
  val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
    val seed = byteswap32(idx ^ (shift << 16)) // random seed
    val (sample, n) = SamplingUtils.reservoirSampleAndCount( // reservoir sampling, which scans every record;
      iter, sampleSizePerPartition, seed)
    Iterator((idx, n, sample)) // (partition id, element count of the partition, samples); 60 samples per partition by default;
  }.collect() // collect to the driver;
  val numItems = sketched.map(_._2).sum // number of records scanned while sampling, i.e. the RDD's record count;
  (numItems, sketched)
}
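For intuition about what reservoirSampleAndCount does, here is a minimal reservoir-sampling sketch; it is not Spark's SamplingUtils code, just the idea: keep k items so that every element of the stream has an equal chance of being kept, while also counting the stream length:
import scala.util.Random

// Minimal reservoir sampling over an iterator: returns (k samples, total count).
def reservoirSampleAndCountSketch[T: scala.reflect.ClassTag](
    input: Iterator[T], k: Int, seed: Long): (Array[T], Long) = {
  val rand = new Random(seed)
  val reservoir = new scala.collection.mutable.ArrayBuffer[T](k)
  var n = 0L
  while (input.hasNext) {
    val item = input.next()
    n += 1
    if (reservoir.size < k) reservoir += item
    else {
      val j = (rand.nextDouble() * n).toLong   // uniform index in [0, n)
      if (j < k) reservoir(j.toInt) = item     // keep the new item with probability k / n
    }
  }
  (reservoir.toArray, n)
}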
- 5. Determine the partition boundaries from the sampled results;
org.apache.spark.Partitioner.scala
Inside the RangePartitioner object
def determineBounds[K : Ordering : ClassTag](
    candidates: ArrayBuffer[(K, Float)], // unsorted (sampled key, weight) pairs;
    partitions: Int): Array[K] = {
  val ordering = implicitly[Ordering[K]]
  val ordered = candidates.sortBy(_._1)
  val numCandidates = ordered.size // number of samples
  val sumWeights = ordered.map(_._2.toDouble).sum // estimated total record count of the RDD
  val step = sumWeights / partitions // average record count per partition
  var cumWeight = 0.0
  var target = step
  val bounds = ArrayBuffer.empty[K] // the final bounds, at most partitions - 1 elements
  var i = 0
  var j = 0
  var previousBound = Option.empty[K] // the previous bound, used to avoid duplicates
  while ((i < numCandidates) && (j < partitions - 1)) {
    val (key, weight) = ordered(i)
    cumWeight += weight
    if (cumWeight >= target) { // a weight of w means this sample stands for roughly w records;
      // avoid duplicate bounds
      if (previousBound.isEmpty || ordering.gt(key, previousBound.get)) {
        bounds += key
        target += step
        j += 1
        previousBound = Some(key)
      }
    }
    i += 1
  }
  // note: the final partition count equals bounds.length + 1, which is not necessarily the source RDD's
  // or the requested partition count; when the source RDD has many duplicate keys,
  // bounds.length + 1 can end up smaller than the requested partition count;
  bounds.toArray
}
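A tiny worked example of the logic above, with made-up numbers: suppose 3 partitions are requested and six keys were sampled, each with weight 2.0 (each sample stands for about 2 records). After sorting, the keys are 10, 20, 30, 40, 50, 60, so sumWeights = 12.0 and step = 12.0 / 3 = 4.0. The cumulative weight reaches 4.0 at key 20, so 20 becomes the first bound and the target moves to 8.0; it reaches 8.0 at key 40, so 40 becomes the second bound, and with partitions - 1 = 2 bounds found the loop stops. The resulting ranges are (-inf, 20], (20, 40] and (40, +inf): three partitions.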
- 6. Determining a record's partition id at shuffle-write time;
org.apache.spark.Partitioner.scala
Inside the RangePartitioner class
def getPartition(key: Any): Int = {
  val k = key.asInstanceOf[K]
  var partition = 0
  // this also shows that the final partition count equals rangeBounds.length + 1
  if (rangeBounds.length <= 128) {
    // with at most 128 bounds, use a plain linear search;
    while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
      partition += 1
    }
  } else {
    // with more than 128 bounds, use binary search
    partition = binarySearch(rangeBounds, k)
    // a negative result means the key falls between two bounds (a, b) and should go to partition b;
    // in that case binarySearch returns -b - 1, so -(-b - 1) - 1 = b
    if (partition < 0) {
      partition = -partition - 1
    }
    if (partition > rangeBounds.length) { // last partition
      partition = rangeBounds.length
    }
  }
  if (ascending) {
    partition
  } else {
    rangeBounds.length - partition
  }
}
II. The RDD join Operator
Preface: PairRDDFunctions is a class of methods added onto RDDs of (key, value) type. groupBy and sortBy are ultimately turned into (key, value) RDDs that call the corresponding groupByKey and sortByKey in PairRDDFunctions. Likewise, join is a method of PairRDDFunctions, and every join flavor is implemented on top of the cogroup operator, which in turn is built on CoGroupedRDD;
- 1. Choosing the join partitioner;
org.apache.spark.rdd.PairRDDFunctions.scala
// with no extra arguments, the default-partitioner logic decides the partitioner;
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
  // (2) calls the join overload below;
  join(other, defaultPartitioner(self, other))
}
// when a partition count is passed, a HashPartitioner is used;
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))] = self.withScope {
  join(other, new HashPartitioner(numPartitions))
}
// choosing the default partitioner (defined on the Partitioner companion object in org.apache.spark.Partitioner.scala)
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val rdds = (Seq(rdd) ++ others)
  val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
  if (hasPartitioner.nonEmpty) {
    // if at least one RDD already has a partitioner, use the partitioner of the RDD with the most partitions;
    hasPartitioner.maxBy(_.partitions.length).partitioner.get
  } else {
    // otherwise fall back to a HashPartitioner
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      // if the default parallelism is set, use it as the partition count;
      new HashPartitioner(rdd.context.defaultParallelism)
    } else {
      // otherwise use the larger partition count of the input RDDs;
      new HashPartitioner(rdds.map(_.partitions.length).max)
    }
  }
}
- 2. Every join flavor goes through the cogroup operator;
org.apache.spark.rdd.PairRDDFunctions.scala
// (3) each flavor first calls cogroup to get (k, (Iterable[V1], Iterable[V2])), then filters the result according to the join type;
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w) // a result is produced only when both iterables are non-empty;
  )
}
def leftOuterJoin[W](
    other: RDD[(K, W)],
    partitioner: Partitioner): RDD[(K, (V, Option[W]))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    if (pair._2.isEmpty) {
      pair._1.iterator.map(v => (v, None)) // right side empty: emit (left v, None);
    } else {
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w)) // plus the inner-join results;
    }
  }
}
def fullOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Option[V], Option[W]))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues {
    case (vs, Seq()) => vs.iterator.map(v => (Some(v), None)) // keys with an empty right side
    case (Seq(), ws) => ws.iterator.map(w => (None, Some(w))) // keys with an empty left side
    case (vs, ws) => for (v <- vs.iterator; w <- ws.iterator) yield (Some(v), Some(w)) // plus the inner-join results
  }
}
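A small usage sketch of the semantics above, with made-up data and an active SparkContext `sc` assumed, showing how each flavor filters the cogrouped iterables:
// Hypothetical data for illustration.
val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right = sc.parallelize(Seq(("a", "x"), ("c", "y")))

left.cogroup(right).collect()
// ("a", (Iterable(1), Iterable("x"))), ("b", (Iterable(2), Iterable())), ("c", (Iterable(), Iterable("y")))
left.join(right).collect()          // ("a", (1, "x"))
left.leftOuterJoin(right).collect() // ("a", (1, Some("x"))), ("b", (2, None))
left.fullOuterJoin(right).collect() // ("a", (Some(1), Some("x"))), ("b", (Some(2), None)), ("c", (None, Some("y")))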
- 3. The cogroup operator is built on CoGroupedRDD;
org.apache.spark.rdd.PairRDDFunctions.scala
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  // with a HashPartitioner, keys must not be arrays;
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  // (4) use CoGroupedRDD
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}
- 4. Choosing narrow vs. shuffle dependencies and aggregating the result into (key, Array[Iterable]);
org.apache.spark.rdd.CoGroupedRDD.scala
// choose narrow or shuffle dependencies
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      // if this RDD's partitioner equals the CoGroupedRDD's partitioner, it is a narrow dependency;
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      // if this RDD's partitioner differs from the CoGroupedRDD's partitioner, it is a shuffle dependency
      // and the RDD must be shuffled with the CoGroupedRDD's partitioner;
      // hence for rdd1.join(rdd2), if rdd1 and rdd2 already share the same partitioner, no extra shuffle is triggered;
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
// the actual computation
override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {
  val split = s.asInstanceOf[CoGroupPartition]
  val numRdds = dependencies.length
  // ...
  // an ExternalAppendOnlyMap with a custom aggregator is used; its output type is (key, Array[Iterable]),
  // where each element of the Array is an Iterable holding one RDD's values for that key;
  val map = createExternalMap(numRdds)
  for ((it, depNum) <- rddIterators) {
    map.insertAll(it.map(pair => (pair._1, new CoGroupValue(pair._2, depNum))))
  }
  // ...
}
// choose the aggregation data structure and define the aggregator
private def createExternalMap(numRdds: Int)
    : ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner] = {
  // newCombiner is of type Array[CoGroup], and a CoGroup is just a Seq[T],
  // so newCombiner can be read as Array[Seq[T]]
  val createCombiner: (CoGroupValue => CoGroupCombiner) = value => {
    // value is of type (Any, Int): _2 is the RDD's index within rdds, _1 is that RDD's value;
    val newCombiner = Array.fill(numRdds)(new CoGroup)
    newCombiner(value._2) += value._1
    newCombiner
  }
  val mergeValue: (CoGroupCombiner, CoGroupValue) => CoGroupCombiner =
    (combiner, value) => {
      combiner(value._2) += value._1
      combiner
    }
  // for rdd0.join(rdd1), the CoGroupedRDD's rdds is Seq[RDD](rdd0, rdd1),
  // so the combiner1 array is Array[Seq[T]](Seq[T0], Seq[T1]), where Seq[T0] holds rdd0's values for the key;
  val mergeCombiners: (CoGroupCombiner, CoGroupCombiner) => CoGroupCombiner =
    (combiner1, combiner2) => {
      var depNum = 0
      while (depNum < numRdds) {
        combiner1(depNum) ++= combiner2(depNum)
        depNum += 1
      }
      combiner1
    }
  new ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner](
    createCombiner, mergeValue, mergeCombiners)
}
III. The persist Operator
Preface: depending on the storage level, persist stores an RDD's data in the BlockManager's memoryStore and diskStore; cache ultimately calls persist as well. The default storage level is MEMORY_ONLY for RDDs and MEMORY_AND_DISK for Datasets. Caching in Spark merely tags the RDD's partitions with a storage level; a partition is actually cached only after a task has computed it. A storage level can also only be set once per RDD; setting it again throws an error;
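A minimal usage sketch of the behavior described above; the RDD itself is made up, and an active SparkContext `sc` is assumed:
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000).map(_ * 2)   // made-up RDD
rdd.cache()                                  // same as persist(StorageLevel.MEMORY_ONLY)
// rdd.persist(StorageLevel.MEMORY_AND_DISK) // would throw: the storage level can only be set once
rdd.count()                                  // the first action actually materializes the cache
rdd.unpersist()                              // removes the cached blocks from the BlockManager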
- 1. The RDD checks its storage level;
org.apache.spark.rdd.RDD.scala
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) { // a storage level has been set
    getOrCompute(split, context) // read from the cache, or compute if not cached yet
  } else { // no caching (the default)
    computeOrReadCheckpoint(split, context) // compute, or read the checkpoint;
  }
}
private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  // each RDD partition maps to a unique block id (rdd_id, partition_id)
  val blockId = RDDBlockId(id, partition.index)
  // (2) call BlockManager.getOrElseUpdate to fetch the cached block, or recompute it with the given function;
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    readCachedBlock = false
    computeOrReadCheckpoint(partition, context)
  }) match {
    // Left means the block was successfully read from or put into the BlockManager, so return the block
    case Left(blockResult) =>
      if (readCachedBlock) { // already cached: the block was read directly from the BlockManager;
        // ...
        new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
          // ...
        }
      } else { // the block was computed, put into the BlockManager, then read back;
        new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
      }
    // Right means the put into the BlockManager failed (e.g. MEMORY_ONLY but not enough memory);
    // such a block is effectively uncached and must be recomputed every time;
    case Right(iter) =>
      new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
  }
}
- 2. Fetch the block from the BlockManager, or recompute it;
org.apache.spark.storage.BlockManager.scala
def getOrElseUpdate[T](
    blockId: BlockId,
    level: StorageLevel,
    classTag: ClassTag[T],
    makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
  // first try to read the block from the BlockManager; if it is there, return it, otherwise compute it;
  get[T](blockId)(classTag) match {
    case Some(block) =>
      return Left(block)
    case _ =>
  }
  // (3) compute the block and put it into the BlockManager
  doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
    // None means the computed block was successfully put into the BlockManager, so read it back from there;
    case None =>
      val blockResult = getLocalValues(blockId).getOrElse {
        releaseLock(blockId)
        throw new SparkException(s"get() failed for block $blockId even though we held a lock")
      }
      releaseLock(blockId)
      Left(blockResult)
    // Some(iterator) means the put failed for lack of memory or disk space; the block must be recomputed next time;
    case Some(iter) =>
      Right(iter)
  }
}
- 3. Putting the block into the BlockManager;
org.apache.spark.storage.BlockManager.scala
private def doPutIterator[T](
    blockId: BlockId,
    iterator: () => Iterator[T],
    level: StorageLevel,
    classTag: ClassTag[T],
    tellMaster: Boolean = true,
    keepReadLock: Boolean = false): Option[PartiallyUnrolledIterator[T]] = {
  doPut(blockId, level, classTag, tellMaster = tellMaster, keepReadLock = keepReadLock) { info =>
    // the storage level uses memory (MEMORY_ONLY, MEMORY_AND_DISK, ...)
    if (level.useMemory) {
      // deserialized means the data is stored as Java objects rather than serialized bytes;
      if (level.deserialized) {
        // cache the objects directly in memory; no deserialization is needed when reading
        memoryStore.putIteratorAsValues(blockId, iterator(), classTag) match {
          // enough memory
          case Right(s) =>
            size = s
          // not enough memory
          case Left(iter) =>
            // the storage level also uses disk
            if (level.useDisk) {
              diskStore.put(blockId) { /* ... */ }
            } else {
              // not enough memory and no disk fallback: the block is not cached,
              // and the iterator that computes it is returned instead;
              iteratorFromFailedMemoryStorePut = Some(iter)
            }
        }
      } else { // store in serialized form: serialize before storing, deserialize when reading
        memoryStore.putIteratorAsBytes(blockId, iterator(), classTag, level.memoryMode) match {
          case Right(s) =>
            size = s
          case Left(partiallySerializedValues) =>
            // same steps as above
            if (level.useDisk) {
              diskStore.put(blockId) { /* ... */ }
              size = diskStore.getSize(blockId)
            } else {
              iteratorFromFailedMemoryStorePut = Some(partiallySerializedValues.valuesIterator)
            }
        }
      }
    } else if (level.useDisk) { // disk-only storage level
      diskStore.put(blockId) { /* ... */ }
    }
    // ... when the storage level's replication factor is greater than 1, the block is replicated here
    // if the block could not be cached, the iterator that computes it is returned for the caller to consume;
    iteratorFromFailedMemoryStorePut
  }
}
IV. The coalesce Operator
Preface: coalesce's shuffle flag decides whether repartitioning triggers a shuffle. repartition simply calls coalesce with shuffle = true, while coalesce itself defaults to shuffle = false; the shuffle = false path is described below;
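A minimal usage sketch contrasting the two paths; the partition counts are made up and an active SparkContext `sc` is assumed:
val rdd = sc.parallelize(1 to 1000000, numSlices = 1000)   // made-up RDD with 1000 partitions
val merged = rdd.coalesce(100)                  // no shuffle: each output partition reads ~10 parent partitions
val reshuffled = rdd.repartition(2000)          // same as coalesce(2000, shuffle = true)
val balanced = rdd.coalesce(10, shuffle = true) // drastic reduction: a shuffle keeps the upstream work parallel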
- 1. Choosing whether to shuffle: when increasing the partition count, or reducing it sharply, repartition with a shuffle; when only slightly reducing it, the shuffle can be skipped;
org.apache.spark.rdd.RDD.scala
def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null): RDD[T] = withScope {
  if (shuffle) {
    // shuffle path
    val distributePartition = (index: Int, items: Iterator[T]) => {
      // use a random number as the partitioning key;
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    // (2) no shuffle
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}
- 2. Grouping partitions and computing in CoalescedRDD;
org.apache.spark.rdd.CoalescedRDD.scala
override def getPartitions: Array[Partition] = {
  // no coalescer is passed by default
  val pc = partitionCoalescer.getOrElse(new DefaultPartitionCoalescer())
  // DefaultPartitionCoalescer.coalesce returns Array[PartitionGroup];
  // each PartitionGroup is a group of parent-RDD partitions, and one task processes one PartitionGroup;
  // when assigning parent partitions to PartitionGroups, DefaultPartitionCoalescer tries to balance
  // the data while preserving locality;
  pc.coalesce(maxPartitions, prev).zipWithIndex.map {
    case (pg, i) =>
      val ids = pg.partitions.map(_.index).toArray
      new CoalescedRDDPartition(i, prev, ids, pg.prefLoc)
  }
}
override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
  // CoalescedRDDPartition.parents is the group of parent partitions this task has to process;
  // so with 1000 parent partitions and coalesce(10), one child task processes 100 parent partitions;
  // hence when the target partition count is much smaller than the parent's (one or two orders of magnitude),
  // it is usually better to repartition with a shuffle;
  partition.asInstanceOf[CoalescedRDDPartition].parents.iterator.flatMap { parentPartition =>
    firstParent[T].iterator(parentPartition, context)
  }
}
V. Shuffle Write Internals
Preface: Spark's sort-based shuffle resembles MapReduce's shuffle: data is written to memory first, spilled to disk when memory runs out, and when the last batch of data is processed it is merged with the previously spilled files into one large file sorted by (partId, keyId), plus an index file marking each partition's start and end offsets. The sort-shuffle write path is:
- 1. Spill: data is written to memory first and spilled to disk when memory runs out; before each spill the data is sorted by the hash of (partId, keyId);
- 2. Merge: once the last batch has been processed, the previously spilled files are merged into a single file sorted by (partId, keyId), and an index file recording each partition's start and end offsets is written;
- PS: Spark 2.2 ships three shuffle-write implementations: SortShuffleWriter, BypassMergeSortShuffleWriter and UnsafeShuffleWriter (tungsten-sort);
- 1. Generating tasks: only the final stage of a job is a ResultStage; every other stage is a ShuffleMapStage (it has to repartition and persist data), so shuffle write happens in ShuffleMapStages and the corresponding task is a ShuffleMapTask;
org.apache.spark.scheduler.DAGScheduler.scala
Inside the submitMissingTasks method
val tasks: Seq[Task[_]] = try {
  val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
  stage match {
    case stage: ShuffleMapStage =>
      stage.pendingPartitions.clear()
      partitionsToCompute.map { id =>
        val locs = taskIdToLocations(id)
        val part = stage.rdd.partitions(id)
        stage.pendingPartitions += id
        // (2) instantiate a ShuffleMapTask; its runTask method will be invoked;
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
          taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
          Option(sc.applicationId), sc.applicationAttemptId)
      }
    case stage: ResultStage =>
      partitionsToCompute.map { id =>
        val p: Int = stage.partitions(id)
        val part = stage.rdd.partitions(p)
        val locs = taskIdToLocations(id)
        new ResultTask(stage.id, stage.latestInfo.attemptId,
          taskBinary, part, locs, id, properties, serializedTaskMetrics,
          Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
      }
  }
}
- 2. Fetch the shuffleManager object and call its write method to start writing data;
org.apache.spark.scheduler.ShuffleMapTask.scala
override def runTask(context: TaskContext): MapStatus = {
  // some code omitted...
  var writer: ShuffleWriter[Any, Any] = null
  try {
    // (3) get the shuffleManager instance from SparkEnv;
    val manager = SparkEnv.get.shuffleManager
    // (4) choose the writer
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    // (5) start writing
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    writer.stop(success = true).get
  } catch {
    // some code omitted...
  }
- 3. Choosing the ShuffleManager, i.e. the shuffle-write flavor; the default is SortShuffleManager;
org.apache.spark.SparkEnv.scala
Inside the create method
val shortShuffleMgrNames = Map(
  "sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName,
  // sorting and spilling operate directly on serialized binary data instead of Java objects,
  // which is faster but comes with more restrictions;
  "tungsten-sort" -> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName)
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort") // SortShuffleManager by default
val shuffleMgrClass =
  shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase(Locale.ROOT), shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass) // instantiate the shuffleManager
- 4. Deciding the ShuffleHandle implementation, which in turn selects the shuffle-write implementation;
org.apache.spark.shuffle.sort.SortShuffleManager.scala
// ShuffleDependency calls this method when registering the shuffle, to select a handle
override def registerShuffle[K, V, C](
    shuffleId: Int,
    numMaps: Int,
    dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
  if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
    // 1. mapSideCombine is false, i.e. no map-side aggregation is possible;
    // 2. and the partition count is below spark.shuffle.sort.bypassMergeThreshold (200 by default);
    // along the way each task creates one file per partition; they are merged at the end,
    // but the number of intermediate files can still be large;
    new BypassMergeSortShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
    // 1. the dependency has no aggregation and does not sort its output;
    // 2. the serializer supports relocation of serialized objects (Kryo and Spark SQL's custom serializers do);
    // 3. the partition count is below 16777216 (2^24);
    new SerializedShuffleHandle[K, V](
      shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
  } else {
    // otherwise a BaseShuffleHandle; the two handles above also extend BaseShuffleHandle;
    new BaseShuffleHandle(shuffleId, numMaps, dependency)
  }
}
// choose the writer implementation according to the handle selected at registration time
override def getWriter[K, V](/* parameters omitted... */) = {
  val env = SparkEnv.get
  handle match {
    case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
      // the "serialized" shuffle, i.e. the tungsten-sort implementation; older versions required a config flag;
      new UnsafeShuffleWriter(/* parameters omitted... */)
    case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
      // sort shuffle with the bypass mechanism enabled
      new BypassMergeSortShuffleWriter(/* parameters omitted... */)
    case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
      // the default SortShuffleWriter
      new SortShuffleWriter(/* parameters omitted... */)
  }
}
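The handle selection above can be nudged through configuration; a hedged sketch of the relevant settings (the values shown are examples only):
val conf = new org.apache.spark.SparkConf()
  // the threshold that shouldBypassMergeSort compares the partition count against
  .set("spark.shuffle.sort.bypassMergeThreshold", "200")
  // Kryo supports relocation of serialized data, one precondition of the SerializedShuffleHandle path
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // the default; "tungsten-sort" maps to the same SortShuffleManager
  .set("spark.shuffle.manager", "sort")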
- 5. Writing data with the chosen shuffle-write implementation; only SortShuffleWriter is covered here;
org.apache.spark.shuffle.sort.SortShuffleWriter.scala
override def write(records: Iterator[Product2[K, V]]): Unit = {
  sorter = if (dep.mapSideCombine) {
    // if the dependency aggregates (e.g. reduceByKey), combine on the map side, i.e. before the shuffle write;
    require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
    new ExternalSorter[K, V, C](
      context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
  } else {
    new ExternalSorter[K, V, V](
      context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
  }
  // (6) insert the records into the in-memory buffer, spilling to disk when needed (the main tuning target)
  sorter.insertAll(records)
  val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
  val tmp = Utils.tempFileWith(output)
  try {
    val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
    // merge all spill files with a TimSort-based merge into one data file sorted by (partId, keyId);
    val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
    // write the index file recording each partition's start and end offsets;
    shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
    mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
  } finally {
    if (tmp.exists() && !tmp.delete()) {
      logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
    }
  }
}
- 6. Insert the records into the in-memory buffer, spilling to disk when needed (the main tuning target);
org.apache.spark.util.collection.ExternalSorter.scala
// both shuffle write and shuffle read use this class to read data into memory, spill, and finally merge;
// insert one partition's records
def insertAll(records: Iterator[Product2[K, V]]): Unit = {
  val shouldCombine = aggregator.isDefined
  // choose the data structure: a map (PartitionedAppendOnlyMap) when the dependency aggregates
  // (e.g. reduceByKey), otherwise a buffer (PartitionedPairBuffer) (e.g. join);
  // both are backed by arrays where data(2*n) = (partId, key) and data(2*n+1) = value;
  if (shouldCombine) {
    // build the combiner functions; values are aggregated as they are read;
    val mergeValue = aggregator.get.mergeValue
    val createCombiner = aggregator.get.createCombiner
    var kv: Product2[K, V] = null
    val update = (hadValue: Boolean, oldValue: C) => {
      if (hadValue) mergeValue(oldValue, kv._2) else createCombiner(kv._2)
    }
    while (records.hasNext) {
      addElementsRead()
      kv = records.next()
      map.changeValue((getPartition(kv._1), kv._1), update)
      // * inserting this record may trigger a spill
      maybeSpillCollection(usingMap = true)
    }
  } else {
    while (records.hasNext) {
      addElementsRead()
      val kv = records.next()
      buffer.insert(getPartition(kv._1), kv._1, kv._2.asInstanceOf[C])
      maybeSpillCollection(usingMap = false)
    }
  }
}
// may spill
private def maybeSpillCollection(usingMap: Boolean): Unit = {
  var estimatedSize = 0L
  if (usingMap) {
    // (6.1) estimate the map's memory footprint (it is extrapolated, not the real size);
    // because it is only an estimate, the real usage can be much larger without triggering a spill, causing OOM;
    estimatedSize = map.estimateSize()
    // (6.2) may trigger a spill
    if (maybeSpill(map, estimatedSize)) {
      // after a spill, re-instantiate the map;
      map = new PartitionedAppendOnlyMap[K, C]
    }
  } else {
    estimatedSize = buffer.estimateSize()
    if (maybeSpill(buffer, estimatedSize)) {
      buffer = new PartitionedPairBuffer[K, C]
    }
  }
  if (estimatedSize > _peakMemoryUsedBytes) {
    _peakMemoryUsedBytes = estimatedSize
  }
}
- 6.1. Estimating the map's memory footprint; measuring the real size takes a few milliseconds, which is far too slow to do for every one of hundreds of millions of records;
// callback invoked after every update
protected def afterUpdate(): Unit = {
  numUpdates += 1
  if (nextSampleNum == numUpdates) {
    // measure the map's real memory size every time the update count grows by another 10%
    takeSample()
  }
}
// measure the real memory size
private def takeSample(): Unit = {
  samples.enqueue(Sample(SizeEstimator.estimate(this), numUpdates))
  // keep only the two most recent measurements
  if (samples.size > 2) {
    samples.dequeue()
  }
  // average bytes per record
  val bytesDelta = samples.toList.reverse match {
    case latest :: previous :: tail =>
      // (bytes added between the last two measurements) / (records added between them)
      (latest.size - previous.size).toDouble / (latest.numUpdates - previous.numUpdates)
    case _ => 0
  }
  // average size of one record in bytes
  bytesPerUpdate = math.max(0, bytesDelta)
  // schedule the next real measurement at current count * 1.1;
  nextSampleNum = math.ceil(numUpdates * SAMPLE_GROWTH_RATE).toLong
}
// extrapolate the current size
def estimateSize(): Long = {
  assert(samples.nonEmpty)
  // extrapolated growth = average bytes per record * records added since the last measurement
  val extrapolatedDelta = bytesPerUpdate * (numUpdates - samples.last.numUpdates)
  // last measured size + extrapolated growth
  (samples.last.size + extrapolatedDelta).toLong
}
- 6.2. If no more memory can be acquired, trigger a spill;
// may trigger a spill
protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
  var shouldSpill = false
  // checked every 32 records, and only when the current size has reached the current memory threshold;
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    // try to acquire enough shuffle memory to double the current size;
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    // the amount actually granted
    val granted = acquireMemory(amountToRequest)
    myMemoryThreshold += granted
    // if little or nothing was granted, currentMemory is still >= myMemoryThreshold,
    // meaning the shuffle memory pool is exhausted, so spill;
    shouldSpill = currentMemory >= myMemoryThreshold
  }
  // also spill when the in-memory record count exceeds spark.shuffle.spill.numElementsForceSpillThreshold,
  // which defaults to Long.MaxValue (so this practically never triggers)
  shouldSpill = shouldSpill || _elementsRead > numElementsForceSpillThreshold
  // actually spill
  if (shouldSpill) {
    _spillCount += 1
    logSpillage(currentMemory)
    // serialization and deserialization happen in batches of spark.shuffle.spill.batchSize (10000) records;
    // before spilling, the data is sorted by the key's hash;
    // data is first written to a buffer of spark.shuffle.file.buffer (32k) and then flushed to disk;
    spill(collection)
    _elementsRead = 0
    _memoryBytesSpilled += currentMemory
    // release the memory
    releaseMemory()
  }
  shouldSpill
}
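A small worked example of the memory-request arithmetic above; the numbers are made up, and the 5 MB starting threshold is an assumption (spark.shuffle.spill.initialMemoryThreshold):
// Suppose the estimated size has grown to 6 MB while only the initial 5 MB threshold has been granted:
val currentMemory     = 6L * 1024 * 1024                      // 6 MB estimated
val myMemoryThreshold = 5L * 1024 * 1024                      // 5 MB granted so far
val amountToRequest   = 2 * currentMemory - myMemoryThreshold // 7 MB requested
// if the pool grants all 7 MB, the threshold becomes 12 MB and no spill happens;
// if it grants 0, the threshold stays at 5 MB, currentMemory (6 MB) >= threshold, and the data spills.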
VI. Shuffle Read Internals
Preface: shuffle read has to pull data from remote nodes, so it involves network IO. Fetching happens in batches, with a block (one partition's worth of one shuffle-write output) as the smallest unit. Under data skew a single block can be very large, and since a fetch request pulls at least one whole block into memory, skew easily causes OOM; this can be mitigated with configuration (see the sketch below);
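A hedged sketch of the configuration knobs referenced in this section; the key names match the Spark 2.2 code walked through below (some were renamed in later versions), and the values are examples only:
val conf = new org.apache.spark.SparkConf()
  .set("spark.reducer.maxSizeInFlight", "48m")         // max bytes fetched concurrently per reduce task
  .set("spark.reducer.maxReqsInFlight", "2147483647")  // max concurrent fetch requests (Int.MaxValue by default)
  .set("spark.reducer.maxReqSizeShuffleToMem", "200m") // requests larger than this are fetched to disk instead of memory
  .set("spark.shuffle.detectCorrupt", "true")          // verify (some of) the fetched blocks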
- 1. Instantiate the ShuffleReader;
org.apache.spark.shuffle.sort.SortShuffleManager.scala
override def getReader[K, C](
    handle: ShuffleHandle,
    startPartition: Int,
    endPartition: Int,
    context: TaskContext): ShuffleReader[K, C] = {
  // (2) instantiate a BlockStoreShuffleReader to do the reading;
  new BlockStoreShuffleReader(
    handle.asInstanceOf[BaseShuffleHandle[K, _, C]], startPartition, endPartition, context)
}
- 2. Instantiate the ShuffleBlockFetcherIterator;
org.apache.spark.shuffle.BlockStoreShuffleReader.scala
override def read(): Iterator[Product2[K, C]] = {
  // (3) fetch blocks in batches;
  val wrappedStreams = new ShuffleBlockFetcherIterator(
    context,
    blockManager.shuffleClient, // the BlockManager client, used to fetch remote blocks;
    blockManager, // the current executor's BlockManager, used to fetch local blocks;
    // given the shuffleId and the partition range this task handles, look up the nodes, block ids and block sizes
    // holding those partitions; the return type is (BlockManagerId, Seq[(BlockId, Long)]), served by the MapOutputTracker;
    mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
    serializerManager.wrapStream,
    // max bytes fetched from remote nodes at any one time (spread over 5 concurrent node fetches);
    SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024,
    // max number of fetch requests in flight at any one time;
    SparkEnv.get.conf.getInt("spark.reducer.maxReqsInFlight", Int.MaxValue),
    // fetch results smaller than this threshold go to memory; larger ones are written straight to disk;
    SparkEnv.get.conf.get(config.REDUCER_MAX_REQ_SIZE_SHUFFLE_TO_MEM),
    // enable corruption detection; only compressed blocks that are small (less than maxBytesInFlight/3)
    // or the beginning of larger blocks are verified;
    SparkEnv.get.conf.getBoolean("spark.shuffle.detectCorrupt", true))
  // serializer
  val serializerInstance = dep.serializer.newInstance()
  // turn the wrappedStreams iterator of (BlockId, InputStream) into a recordIter of (key, value);
  val recordIter = wrappedStreams.flatMap { case (blockId, wrappedStream) =>
    serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator
  }
  val readMetrics = context.taskMetrics.createTempShuffleReadMetrics()
  // wrap recordIter with metrics counting how many (key, value) records are read;
  val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
    recordIter.map { record =>
      readMetrics.incRecordsRead(1)
      record
    },
    context.taskMetrics().mergeShuffleReadMetrics())
  // make the read interruptible; interruptibleIter is the iterator everything below operates on
  val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)
  val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) {
      // for aggregating operators with mapSideCombine = true, merge the already-combined values;
      // internally an ExternalAppendOnlyMap aggregates and spills, then merges the spills and aggregates again;
      val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
      dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
    } else {
      val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
      dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
    }
  } else {
    require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
    interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
  }
  // if a key ordering is required, sort with an ExternalSorter, just like shuffle write;
  // the previous step ordered data by key hash, while the user may define a custom key ordering,
  // so this step only sorts and does not aggregate;
  // the ExternalSorter here uses the buffer: PartitionedPairBuffer data structure;
  dep.keyOrdering match {
    case Some(keyOrd: Ordering[K]) =>
      val sorter =
        new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
      sorter.insertAll(aggregatedIter)
      context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
      context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
      context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes)
      CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop())
    case None =>
      aggregatedIter
  }
}
- 3. Initialization: split the fetches into local and remote requests, and fetch local blocks concurrently;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala
// called when the object is instantiated;
private[this] def initialize(): Unit = {
  context.addTaskCompletionListener(_ => cleanup())
  // (4) split the fetches into local and remote ones, and return the remote requests;
  val remoteRequests = splitLocalRemoteBlocks()
  // randomly shuffle the remote fetch requests
  fetchRequests ++= Utils.randomize(remoteRequests)
  // ...
  // send the first batch of remote fetch requests;
  fetchUpToMaxBytes()
  // ...
  // fetch the local blocks, concurrently with the remote fetches;
  // this simply calls blockManager.getBlockData(blockId) for each local block;
  fetchLocalBlocks()
  logDebug("Got local blocks in " + Utils.getUsedTimeMs(startTime))
}
- 4. Building the remote fetch requests; the main shuffle-read tuning target;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala
private[this] def splitLocalRemoteBlocks(): ArrayBuffer[FetchRequest] = {
  // fetch from 5 nodes in parallel, up to maxBytesInFlight / 5 bytes per request;
  val targetRequestSize = math.max(maxBytesInFlight / 5, 1L)
  // the remote fetch requests
  val remoteRequests = new ArrayBuffer[FetchRequest]
  // iterate over the BlockManagers (nodes) holding the blocks this task has to read, and their block info;
  for ((address, blockInfos) <- blocksByAddress) {
    totalBlocks += blockInfos.size
    if (address.executorId == blockManager.blockManagerId.executorId) {
      // collect the non-empty local blocks;
      localBlocks ++= blockInfos.filter(_._2 != 0).map(_._1)
      numBlocksToFetch += localBlocks.size
    } else {
      val iterator = blockInfos.iterator
      var curRequestSize = 0L
      var curBlocks = new ArrayBuffer[(BlockId, Long)]
      // walk over this node's target blocks
      while (iterator.hasNext) {
        val (blockId, size) = iterator.next()
        if (size > 0) {
          curBlocks += ((blockId, size)) // blocks of the current request;
          remoteBlocks += blockId
          numBlocksToFetch += 1
          curRequestSize += size // bytes of the current request;
        } else if (size < 0) {
          throw new BlockException(blockId, "Negative block size " + size)
        }
        if (curRequestSize >= targetRequestSize) {
          // once the request reaches maxBytesInFlight / 5 bytes, build a FetchRequest
          // and append it to the remote request queue;
          remoteRequests += new FetchRequest(address, curBlocks)
          curBlocks = new ArrayBuffer[(BlockId, Long)]
          logDebug(s"Creating fetch request of $curRequestSize at $address")
          curRequestSize = 0
        }
      }
      // build the last request: curBlocks is non-empty but smaller than targetRequestSize;
      if (curBlocks.nonEmpty) {
        remoteRequests += new FetchRequest(address, curBlocks)
      }
    }
  }
  logInfo(s"Getting $numBlocksToFetch non-empty blocks out of $totalBlocks blocks")
  remoteRequests
}
- 5. next() hands out fetched data and sends further remote fetch requests;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala
override def next(): (BlockId, InputStream) = {
  // ...
  while (result == null) {
    val startFetchWait = System.currentTimeMillis()
    result = results.take() // take the first element of the result queue;
    val stopFetchWait = System.currentTimeMillis()
    result match {
      // fetch succeeded: build the input stream and verify the data;
      case r @ SuccessFetchResult(blockId, address, size, buf, isNetworkReqDone) =>
        // ... build the input stream
        // verification reads the whole block at once and may therefore OOM
        if (detectCorrupt && !input.eq(in) && size < maxBytesInFlight / 3) {
          // only compressed blocks smaller than maxBytesInFlight / 3 are verified;
          // on the first failure the block is re-fetched and its id is recorded in a HashSet;
          // a second failure makes the fetch fail;
        }
      // fetch failed
      case FailureFetchResult(blockId, address, e) =>
        throwFetchFailedException(blockId, address, e)
    }
    // (6) send another round of remote fetch requests;
    fetchUpToMaxBytes()
  }
  // return (blockId, input stream of that block)
  currentResult = result.asInstanceOf[SuccessFetchResult]
  (currentResult.blockId, new BufferReleasingInputStream(input, this))
}
- 6. Sending remote fetch requests;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala
// send remote fetch requests
private def fetchUpToMaxBytes(): Unit = {
  // fetch up to maxBytesInFlight (48M) of data at a time;
  // since each request is capped at maxBytesInFlight / 5, if every request is exactly 48M / 5
  // then up to 5 requests are sent at once, each no more than 48M / 5, for a total of at most 48M;
  while (fetchRequests.nonEmpty &&
    (bytesInFlight == 0 || // nothing in flight yet
      (reqsInFlight + 1 <= maxReqsInFlight && // the number of in-flight requests stays below the limit
        bytesInFlight + fetchRequests.front.size <= maxBytesInFlight))) { // in-flight bytes stay below 48M
    sendRequest(fetchRequests.dequeue()) // (7) actually send the request
  }
}
- 7. Actually sending a request and fetching the data;
org.apache.spark.storage.ShuffleBlockFetcherIterator.scala
private[this] def sendRequest(req: FetchRequest) {
  // ...
  val blockFetchingListener = new BlockFetchingListener {
    override def onBlockFetchSuccess(blockId: String, buf: ManagedBuffer): Unit = {
      // ...
      results.put(new SuccessFetchResult(/*...*/))
      // ...
    }
    // whether the fetch succeeds or fails, a result object is put into results and handled downstream;
    override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
      results.put(new FailureFetchResult(/*...*/))
    }
  }
  // requests no larger than maxReqSizeShuffleToMem are kept in memory, larger ones are written straight to disk
  if (req.size > maxReqSizeShuffleToMem) {
    val shuffleFiles = blockIds.map { _ =>
      blockManager.diskBlockManager.createTempLocalBlock()._2
    }.toArray
    shuffleFilesSet ++= shuffleFiles
    shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
      blockFetchingListener, shuffleFiles)
  } else {
    shuffleClient.fetchBlocks(address.host, address.port, address.executorId, blockIds.toArray,
      blockFetchingListener, null)
  }
}
VII. SparkSQL Execution Plan
Preface: SparkSQL goes through a parsed logical plan -> analyzed logical plan -> optimized logical plan -> physical plan, and finally generates RDDs for execution; because the optimizer runs over the query, it can in theory outperform hand-written but non-optimal RDD code, and the schema information makes it far more readable;
- Parsed logical plan: ANTLR parses the SQL string into an AST, which is then turned into a logical plan;
- Analyzed logical plan: the previous plan is unresolved; this step checks tables and columns against the catalog and produces the analyzed logical plan;
- Optimized logical plan: the analyzed plan is optimized, mainly via column pruning, merging, predicate pushdown, and so on;
- Physical plan: the final executable physical plan is produced;
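The four plans listed above can be inspected with explain(true); a minimal, hypothetical sketch (the table name is made up, and a SparkSession `spark` is assumed):
// Hypothetical: assumes a table named people is already registered.
val df = spark.sql("SELECT name FROM people WHERE age > 30")
df.explain(true)
// prints, in order:
// == Parsed Logical Plan ==     (unresolved)
// == Analyzed Logical Plan ==   (tables/columns resolved via the catalog)
// == Optimized Logical Plan ==  (e.g. the filter pushed closer to the scan)
// == Physical Plan ==           (the SparkPlan that will actually run)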
- 1. Parsed Logical Plan: parse the SQL string into an abstract syntax tree, then build an unresolved logical plan from it;
- The SqlBaseLexer and SqlBaseParser Java classes that ANTLR generates from the grammar file SqlBase.g4 run lexical and syntax analysis on the SQL string and produce the syntax tree;
- astBuilder turns the syntax tree into an unresolved logical plan; at this point the system does not yet know what each identifier refers to;
org.apache.spark.sql.SparkSession.scala
def sql(sqlText: String): DataFrame = {
  // sessionState.sqlParser.parsePlan(sqlText) parses the SQL string into a logical plan
  Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
}
org.apache.spark.sql.catalyst.parser.ParseDriver.scala
// an implementation of ParserInterface
Inside the abstract class AbstractSqlParser
// turn the abstract syntax tree into a logical plan
override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser => // parse into an AST
  astBuilder.visitSingleStatement(parser.singleStatement()) match { // build the logical plan from the AST
    case plan: LogicalPlan => plan
    case _ =>
      val position = Origin(None, None)
      throw new ParseException(Option(sqlText), "Unsupported SQL statement", position, position)
  }
}
// parse the SQL string into an abstract syntax tree
protected def parse[T](command: String)(toResult: SqlBaseParser => T): T = {
  logInfo(s"Parsing command: $command")
  // the lexer and parser classes SqlBaseLexer and SqlBaseParser are generated by ANTLR 4;
  // lexical analysis
  val lexer = new SqlBaseLexer(new ANTLRNoCaseStringStream(command))
  // ...
  // syntax analysis
  val tokenStream = new CommonTokenStream(lexer)
  val parser = new SqlBaseParser(tokenStream)
  // ...
}
- 2. Create the QueryExecution object, which analyzes and optimizes the logical plan and produces the final physical plan;
- analyzed: resolve the parsed, unresolved logical plan into a logical plan;
- optimized: optimize the logical plan;
- sparkPlan: turn the optimized logical plan into a physical plan Spark can execute;
org.apache.spark.sql.Dataset.scala
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
  // build the QueryExecution object
  val qe = sparkSession.sessionState.executePlan(logicalPlan)
  qe.assertAnalyzed()
  new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
}
// the core class driving analysis, optimization and physical planning
org.apache.spark.sql.execution.QueryExecution.scala
// (3) use the Analyzer to resolve the parsed, unresolved logical plan into a logical plan;
lazy val analyzed: LogicalPlan = {
  SparkSession.setActiveSession(sparkSession)
  sparkSession.sessionState.analyzer.execute(logical)
}
lazy val withCachedData: LogicalPlan = {
  assertAnalyzed()
  assertSupported()
  sparkSession.sharedState.cacheManager.useCachedData(analyzed)
}
// (4) use the Optimizer to optimize the logical plan
lazy val optimizedPlan: LogicalPlan = sparkSession.sessionState.optimizer.execute(withCachedData)
// (5) use the SparkPlanner to produce the physical plan
lazy val sparkPlan: SparkPlan = {
  SparkSession.setActiveSession(sparkSession)
  planner.plan(ReturnAnswer(optimizedPlan)).next()
}
- 3. The Analyzer resolves the parsed, unresolved logical plan into a logical plan;
- The Analyzer defines many batches; for each batch, the rules defined in it are applied to the unresolved logical plan;
- For example, the batch named Resolution turns unresolved nodes into resolved ones; its ResolveRelations rule asks the catalog for the current table's structure and resolves the table's columns from it (the catalog caches table name -> LogicalPlan pairs); in effect, the plan's nodes get bound to data types and functions;
- catalog: an API added in Spark 2.0 for working with SparkSQL and Hive metadata; it can list databases, tables, columns and functions, and run DDL against Hive tables;
org.apache.spark.sql.catalyst.rules.RuleExecutor.scala
// the shape of a Batch
protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*)
// the Analyzer runs through the execute method of its parent class RuleExecutor
def execute(plan: TreeType): TreeType = {
  var curPlan = plan
  batches.foreach { batch =>
    // ...
  }
  curPlan
}
- 4. The Optimizer optimizes the logical plan;
- Like the Analyzer, it extends RuleExecutor; the Optimizer also defines many batches that optimize the logical plan;
- For example, the batch named Operator Optimizations optimizes operators; the batches are applied in order;
- Classic SQL optimizations include predicate pushdown, constant folding, column pruning and limit merging;
- The Optimizer's rules include union merging, replacement (semi join), operator pushdown, operator combination, and constant folding with strength reduction;
org.apache.spark.sql.catalyst.optimizer.Optimizer.scala
// SQL optimization is what we care about most, so the rules in the Optimizer's batches are annotated below
Batch("Union", Once,
  CombineUnions) :: // merge adjacent unions; for nested union + distinct, only the outermost distinct is needed;
Batch("Pullup Correlated Expressions", Once,
  PullupCorrelatedPredicates) :: // pull subquery filters up;
Batch("Subquery", Once,
  OptimizeSubqueries) :: // when a subquery is met, recursively call Optimizer.this.execute(Subquery(s.plan));
Batch("Replace Operators", fixedPoint,
  ReplaceIntersectWithSemiJoin, // rewrite intersect as a semi join
  ReplaceExceptWithAntiJoin, // rewrite except as an anti join
  ReplaceDistinctWithAggregate) :: // rewrite distinct as a group-by aggregation;
Batch("Aggregate", fixedPoint,
  RemoveLiteralFromGroupExpressions, // drop literals from group by
  RemoveRepetitionFromGroupExpressions) :: // drop duplicate expressions from group by;
Batch("Operator Optimizations", fixedPoint, Seq(
  // Operator push down
  PushProjectionThroughUnion, // push column pruning through union: a select over several unions becomes a select inside each branch;
  ReorderJoin(conf), // join reordering; the CBO (Cost Based Optimizer) reorders joins based on data sizes;
  // turn outer joins with filters into inner joins; e.g. a left outer join followed by a filter on a right-table column:
  // unmatched rows have NULL there, which the filter drops, so the result equals an inner join,
  // and an inner join works on less data when filtering;
  EliminateOuterJoin(conf),
  PushPredicateThroughJoin, // push join predicates down to both sides, i.e. filter before joining;
  PushDownPredicate, // data-source predicate pushdown: a filter right after the scan is applied while reading;
  LimitPushDown(conf), // push a limit that follows a union or join down into their children;
  ColumnPruning, // column pruning: read only the columns that are used;
  InferFiltersFromConstraints(conf), // infer constraints, e.g. filter(a > 2) becomes filter(isnotnull(a) && a > 2);
  // Operator combine
  CollapseRepartition, // merge repartitions;
  CollapseProject, // merge Projects (remove unnecessary selects);
  CollapseWindow, // merge windows (same partitioning and ordering);
  CombineFilters, // merge filters;
  CombineLimits, // merge adjacent limits, keeping the smaller one;
  CombineUnions, // merge unions; same as the first Union batch;
  // Constant folding and strength reduction
  NullPropagation(conf), // null propagation: stop nulls from propagating through the tree;
  FoldablePropagation, // constant propagation; select 'c' as a order by a => select 'c' as a order by 'c';
  OptimizeIn(conf), // optimize IN: handle empty and duplicate value lists;
  ConstantFolding, // constant folding; e.g. 1 + 2 in an expression is evaluated once to 3 instead of per row;
  ReorderAssociativeOperator, // reorder and fold associative operators; x+2+y+7 is flattened into [2,7],[x,y] and [2,7] becomes 9;
  LikeSimplification, // simplify LIKE; e.g. name like 'shen%' becomes name.startsWith("shen");
  BooleanSimplification, // simplify boolean expressions; e.g. (a=1 and b=2) or (a=1 and b>2) becomes (a=1) and (b=2 || b>2)
  SimplifyConditionals, // simplify if/case expressions, similar to BooleanSimplification;
  RemoveDispensableExpressions, // remove unnecessary nodes;
  SimplifyBinaryComparison, // simplify comparisons: if both sides of = are the same expression, fold to true;
  PruneFilters(conf), // prune filters; e.g. if the parent filters a>4 and b=2 while the child already filters b=2, drop the child's filter(b=2);
  EliminateSorts, // eliminate sorts that are never consumed or are duplicated;
  SimplifyCasts, // simplify casts: drop a cast when the input type already matches the target type;
  SimplifyCaseConversionExpressions, // simplify chained upper/lower case conversions, keeping only the last one;
  RewriteCorrelatedScalarSubquery, // rewrite correlated scalar subqueries as left outer joins;
  EliminateSerialization, // eliminate unnecessary serialization;
  RemoveRedundantAliases, // remove redundant aliases;
  RemoveRedundantProject, // remove redundant Projects (selects);
  SimplifyCreateStructOps, // push operations into CreateStructOps;
  SimplifyCreateArrayOps, // push operations into CreateArrayOps;
  SimplifyCreateMapOps) ++ // push operations into CreateMapOps;
  extendedOperatorOptimizationRules: _*) ::
Batch("Check Cartesian Products", Once,
  CheckCartesianProducts(conf)) :: // detect cartesian joins; fail if one occurs while spark.sql.crossJoin.enabled=false;
Batch("Join Reorder", Once,
  CostBasedJoinReorder(conf)) :: // cost-based join reordering (dynamic programming) to pick a good join order;
Batch("Decimal Optimizations", fixedPoint,
  DecimalAggregates(conf)) :: // optimize aggregations over decimals;
Batch("Object Expressions Optimization", fixedPoint,
  EliminateMapObjects, // eliminate MapObjects;
  CombineTypedFilters) :: // merge adjacent typed filters
Batch("LocalRelation", fixedPoint,
  ConvertToLocalRelation, // optimize LocalRelation
  PropagateEmptyRelation) :: // optimize EmptyRelation
Batch("OptimizeCodegen", Once,
  OptimizeCodegen(conf)) :: // optimize generated code;
Batch("RewriteSubquery", Once,
  RewritePredicateSubquery, // rewrite predicate subqueries as left semi / left anti joins;
  CollapseProject) :: Nil // merge Projects (remove unnecessary selects), as above;
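As a quick way to watch a few of these rules fire, the optimized plan can be compared with the analyzed plan via explain; a minimal, hypothetical sketch (table and column names are made up, and a SparkSession `spark` is assumed):
// Hypothetical: constant folding, boolean simplification and predicate pushdown in action.
val df = spark.sql("SELECT name, 1 + 2 AS three FROM people WHERE age > 10 AND age > 10")
df.explain(true)
// In == Optimized Logical Plan == one would expect 1 + 2 folded to 3 (ConstantFolding),
// the duplicated predicate collapsed (BooleanSimplification), and the remaining filter
// pushed toward the relation scan (PushDownPredicate).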
- 5. SparkPlan: generating the physical plan. Even the optimized logical plan is still abstract: a join node, for instance, only says that two tables are connected on equal keys, not how that join is executed; the planner (and eventually the CBO) has to pick the cheapest concrete join implementation;
org.apache.spark.sql.execution.QueryExecution.scala
lazy val sparkPlan: SparkPlan = {
  SparkSession.setActiveSession(sparkSession)
  // (5.1) call the SparkPlanner's plan method, defined in the parent class QueryPlanner;
  // plan can return one or more physical plans (currently only one) and the first is used; today the CBO
  // mainly picks a join implementation based on table sizes, with a full cost model planned for later;
  planner.plan(ReturnAnswer(optimizedPlan)).next()
}
// the physical plan that is executed
lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)
// before execution, a set of rules is applied to the physical plan;
protected def prepareForExecution(plan: SparkPlan): SparkPlan = {
  preparations.foldLeft(plan) { case (sp, rule) => rule.apply(sp) }
}
// these rules also extend Rule, much like the Optimizer's rules, so they can be seen as physical-plan optimizations;
protected def preparations: Seq[Rule[SparkPlan]] = Seq(
  python.ExtractPythonUDFs,
  PlanSubqueries(sparkSession), // re-analyze and re-optimize subqueries, producing a QueryExecution for each;
  EnsureRequirements(sparkSession.sessionState.conf), // check the required partitioning and insert shuffles where it is not met;
  // fuse a chain of operators (map, filter, ...) into one generated Java method:
  // SparkPlans that support codegen are wrapped in a WholeStageCodegenExec, the rest get an InputAdapter;
  CollapseCodegenStages(sparkSession.sessionState.conf),
  // an Exchange is essentially a shuffle; duplicate Exchanges are found and reused to avoid recomputation;
  ReuseExchange(sparkSession.sessionState.conf),
  // similarly, duplicate subqueries are found and reused to avoid recomputation;
  ReuseSubquery(sparkSession.sessionState.conf))
- 5.1. Generating the physical plan; the inheritance chain is SparkPlanner -> SparkStrategies -> QueryPlanner;
// the strategies defined by SparkPlanner
def strategies: Seq[Strategy] =
  experimentalMethods.extraStrategies ++
    extraPlanningStrategies ++ (
    FileSourceStrategy ::
    DataSourceStrategy(conf) ::
    SpecialLimits ::
    Aggregation ::
    JoinSelection :: // picks the concrete join implementation; defined in the parent class SparkStrategies;
    InMemoryScans :: // handles cacheTable
    BasicOperators :: Nil)
// the plan method defined by QueryPlanner
def plan(plan: LogicalPlan): Iterator[PhysicalPlan] = {
  // strategies here are the rules defined in SparkPlanner; each rule's apply(plan) is invoked
  val candidates = strategies.iterator.flatMap(_(plan))
  // subqueries may remain, which are planned through recursive plan calls
  val plans = candidates.flatMap { candidate =>
    // ...
  }
  // finally return the physical plans
}
- 6. Executing the SparkPlan to produce RDDs, via the doExecute method;
// QueryExecution calls the execute method;
lazy val toRdd: RDD[InternalRow] = executedPlan.execute()
// execute delegates to doExecute, which every SparkPlan subclass must implement;
final def execute(): RDD[InternalRow] = executeQuery {
  doExecute()
}
// for example HiveTableScanExec, the executor that reads Hive tables;
// SparkPlan subclasses usually end in Exec, marking them as the physical plan's actual executors;
org.apache.spark.sql.hive.execution.HiveTableScanExec.scala
protected override def doExecute(): RDD[InternalRow] = {
  // the RDD is created here
  val rdd = if (!relation.isPartitioned) {
    Utils.withDummyCallSite(sqlContext.sparkContext) {
      hadoopReader.makeRDDForTable(hiveQlTable)
    }
  } else {
    Utils.withDummyCallSite(sqlContext.sparkContext) {
      hadoopReader.makeRDDForPartitionedTable(prunePartitions(rawPartitions))
    }
  }
  val numOutputRows = longMetric("numOutputRows")
  val outputSchema = schema
  // mapPartitionsWithIndexInternal is an internal variant of the RDD's mapPartitionsWithIndex;
  rdd.mapPartitionsWithIndexInternal { (index, iter) =>
    // (7) GenerateUnsafeProjection generates executable Java code from the expressions and compiles it to bytecode;
    val proj = UnsafeProjection.create(outputSchema)
    proj.initialize(index)
    iter.map { r =>
      numOutputRows += 1
      proj(r)
    }
  }
}
- 7. Generating Java code, compiling it to bytecode and shipping it to the executors;
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection.scala
// UnsafeProjection.create in HiveTableScanExec.doExecute ends up calling this create method to generate code;
private def create(
    expressions: Seq[Expression],
    subexpressionEliminationEnabled: Boolean): UnsafeProjection = {
  val ctx = newCodeGenContext()
  val eval = createCode(ctx, expressions, subexpressionEliminationEnabled)
  // the Java code template
  val codeBody = s"""
    public java.lang.Object generate(Object[] references) {
      return new SpecificUnsafeProjection(references);
    }
    class SpecificUnsafeProjection extends ${classOf[UnsafeProjection].getName} {
      private Object[] references;
      ${ctx.declareMutableStates()}
      public SpecificUnsafeProjection(Object[] references) {
        this.references = references;
        ${ctx.initMutableStates()}
      }
      // ...
    }
    """
  // format the template
  val code = CodeFormatter.stripOverlappingComments(
    new CodeAndComment(codeBody, ctx.getPlaceHolderToComments()))
  logDebug(s"code for ${expressions.mkString(",")}:\n${CodeFormatter.format(code)}")
  // compile the Java code to bytecode;
  val c = CodeGenerator.compile(code)
  // call the template's generate method with the objects it needs;
  c.generate(ctx.references.toArray).asInstanceOf[UnsafeProjection]
}
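The generated code can also be inspected interactively; a minimal, hypothetical sketch using Spark's debug helpers (the table name is made up, and a SparkSession `spark` is assumed):
import org.apache.spark.sql.execution.debug._

val df = spark.sql("SELECT name FROM people WHERE age > 30")
df.debugCodegen()   // prints the whole-stage-generated Java source for each codegen subtree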