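Spark Source Code Series: DataFrame repartition vs. coalesce
In Spark development, repartitioning data can improve performance, especially when joins are involved. The gain often far outweighs the cost of the repartitioning itself, so it is usually worth doing.
Spark SQL offers two main methods for repartitioning a DataFrame, repartition and coalesce. This post compares the two.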
repartition
repartition has three overloads:
- def repartition(numPartitions: Int): DataFrame
    /**
     * Returns a new [[DataFrame]] that has exactly `numPartitions` partitions.
     * @group dfops
     * @since 1.3.0
     */
    def repartition(numPartitions: Int): DataFrame = withPlan {
      Repartition(numPartitions, shuffle = true, logicalPlan)
    }
This overload returns a new DataFrame with exactly `numPartitions` partitions. It always shuffles, because it builds the plan with shuffle = true.
- def repartition(partitionExprs: Column*): DataFrame
    /**
     * Returns a new [[DataFrame]] partitioned by the given partitioning expressions preserving
     * the existing number of partitions. The resulting DataFrame is hash partitioned.
     *
     * This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
     *
     * @group dfops
     * @since 1.6.0
     */
    @scala.annotation.varargs
    def repartition(partitionExprs: Column*): DataFrame = withPlan {
      RepartitionByExpression(partitionExprs.map(_.expr), logicalPlan, numPartitions = None)
    }
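This overload returns a new DataFrame that is hash partitioned by the given expressions. It is the same operation as DISTRIBUTE BY in SQL (Hive QL).
The doc comment says the existing number of partitions is preserved, but the call passes numPartitions = None, and the RepartitionByExpression comment quoted later in this post says that in this case the count is taken from spark.sql.shuffle.partitions.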
- def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame
    /**
     * Returns a new [[DataFrame]] partitioned by the given partitioning expressions into
     * `numPartitions`. The resulting DataFrame is hash partitioned.
     *
     * This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
     *
     * @group dfops
     * @since 1.6.0
     */
    @scala.annotation.varargs
    def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame = withPlan {
      RepartitionByExpression(partitionExprs.map(_.expr), logicalPlan, Some(numPartitions))
    }
This overload returns a new DataFrame that is hash partitioned by the given expressions into `numPartitions` partitions.
It is also the same operation as DISTRIBUTE BY in SQL (Hive QL).
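To make "hash partitioned" concrete, here is a minimal pure-Scala sketch of the idea; it does not use Spark. The names HashPartitionSketch, partitionId and partitionBy are made up for illustration, and hashCode stands in for whatever hash function Spark SQL applies internally, so the partition ids will not match a real job.

```scala
object HashPartitionSketch {
  // Map a key to a partition id in [0, numPartitions).
  // The adjustment keeps negative hash codes inside the valid range.
  def partitionId(key: Any, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be greater than 0.")
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Group rows by the partition id of their key (hypothetical helper).
  def partitionBy[A](rows: Seq[A], numPartitions: Int)(key: A => Any): Map[Int, Seq[A]] =
    rows.groupBy(row => partitionId(key(row), numPartitions))
}
```

Rows with equal keys always end up in the same partition, which is what helps a later join on that key. It also shows where skew comes from: if most rows share one key, they all land in one partition.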
coalesce
- coalesce(numPartitions: Int): DataFrame
    /**
     * Returns a new [[DataFrame]] that has exactly `numPartitions` partitions.
     * Similar to coalesce defined on an [[RDD]], this operation results in a narrow dependency, e.g.
     * if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
     * the 100 new partitions will claim 10 of the current partitions.
     * @group rdd
     * @since 1.4.0
     */
    def coalesce(numPartitions: Int): DataFrame = withPlan {
      Repartition(numPartitions, shuffle = false, logicalPlan)
    }
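coalesce returns a new DataFrame with `numPartitions` partitions, provided that number is no larger than the current count. Like coalesce on an RDD, it produces a narrow dependency.
For example, going from 1000 partitions to 100 involves no shuffle; instead, each of the 100 new partitions claims 10 of the current partitions.
The reverse does not work. Because coalesce builds the plan with shuffle = false, asking for 1000 partitions on a DataFrame that has 100 triggers no shuffle and leaves it at 100. To increase the partition count, use repartition.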
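The "each new partition claims some of the current partitions" behaviour can be sketched in plain Scala. This is a simplified model and not Spark's actual coalescing algorithm, which also considers data locality; CoalesceSketch and group are names I made up for the illustration.

```scala
object CoalesceSketch {
  // Assign every existing partition index to exactly one new partition.
  // Existing partitions are never split, so no row has to be shuffled.
  def group(currentPartitions: Int, requested: Int): Vector[Vector[Int]] = {
    require(currentPartitions > 0 && requested > 0, "partition counts must be positive")
    // Without a shuffle the partition count can only stay the same or go down.
    val target = math.min(currentPartitions, requested)
    (0 until target).toVector.map { i =>
      val start = (i.toLong * currentPartitions / target).toInt
      val end = ((i + 1).toLong * currentPartitions / target).toInt
      (start until end).toVector
    }
  }
}
```

Under this model, CoalesceSketch.group(1000, 100) yields 100 groups of 10 parent partitions each, while CoalesceSketch.group(10, 100) still yields only 10 groups.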
Note: coalesce(numPartitions: Int): DataFrame and repartition(numPartitions: Int): DataFrame both build the same logical plan node, case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan). They differ only in the value of the shuffle flag.
    /**
     * Returns a new RDD that has exactly `numPartitions` partitions. Differs from
     * [[RepartitionByExpression]] as this method is called directly by DataFrame's, because the user
     * asked for `coalesce` or `repartition`. [[RepartitionByExpression]] is used when the consumer
     * of the output requires some specific ordering or distribution of the data.
     */
    case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
      extends UnaryNode {
      override def output: Seq[Attribute] = child.output
    }
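Repartition describes a result with exactly `numPartitions` partitions. Unlike RepartitionByExpression, it is created directly from the DataFrame API because the user asked for coalesce or repartition.
RepartitionByExpression is used when the consumer of the output requires a specific ordering or distribution of the data. (The source comment says "RDD" while the public methods return a DataFrame; the difference does not matter for this discussion.)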
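Since the two methods differ only in the shuffle flag, the effect on the partition count can be summarised in a small pure-Scala model. This is my own simplification: the Repartition class below is a stand-in without the child plan, and resultingPartitions is a hypothetical helper, not something that exists in Spark.

```scala
object RepartitionPlanSketch {
  // Simplified stand-in for the logical plan node (the child plan is omitted).
  final case class Repartition(numPartitions: Int, shuffle: Boolean)

  // The two DataFrame methods differ only in the shuffle flag they pass.
  def repartition(n: Int): Repartition = Repartition(n, shuffle = true)
  def coalesce(n: Int): Repartition = Repartition(n, shuffle = false)

  // Hypothetical helper: the partition count after applying the plan.
  def resultingPartitions(current: Int, plan: Repartition): Int =
    if (plan.shuffle) plan.numPartitions        // a shuffle can grow or shrink
    else math.min(current, plan.numPartitions)  // no shuffle: can only shrink
}
```

With 10 current partitions, the model gives 100 for repartition(100), 10 for coalesce(100), and 5 for coalesce(5).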
repartition(partitionExprs: Column*): DataFrame and repartition(numPartitions: Int, partitionExprs: Column*): DataFrame, on the other hand, build
case class RepartitionByExpression(partitionExpressions: Seq[Expression], child: LogicalPlan, numPartitions: Option[Int] = None) extends RedistributeData
    /**
     * This method repartitions data using [[Expression]]s into `numPartitions`, and receives
     * information about the number of partitions during execution. Used when a specific ordering or
     * distribution is expected by the consumer of the query result. Use [[Repartition]] for RDD-like
     * `coalesce` and `repartition`.
     * If `numPartitions` is not specified, the number of partitions will be the number set by
     * `spark.sql.shuffle.partitions`.
     */
    case class RepartitionByExpression(
        partitionExpressions: Seq[Expression],
        child: LogicalPlan,
        numPartitions: Option[Int] = None) extends RedistributeData {
      numPartitions match {
        case Some(n) => require(n > 0, "numPartitions must be greater than 0.")
        case None => // Ok
      }
    }
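This node repartitions data by the given expressions into `numPartitions` partitions and receives the partition count at execution time. It is used when the consumer of the query result expects a specific ordering or distribution. For RDD-like coalesce and repartition, Repartition is used instead.
If `numPartitions` is not specified, the number of partitions is the value of spark.sql.shuffle.partitions.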
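The way the optional `numPartitions` falls back to the configured value can be sketched as follows. This is a standalone illustration, not the code Spark runs: ShufflePartitionsSketch and resolve are hypothetical names, and the constant 200 is the default value of spark.sql.shuffle.partitions.

```scala
object ShufflePartitionsSketch {
  // Default value of spark.sql.shuffle.partitions.
  val DefaultShufflePartitions = 200

  // Hypothetical helper: pick the explicit count if given, else the configured one.
  def resolve(numPartitions: Option[Int],
              shufflePartitions: Int = DefaultShufflePartitions): Int = {
    // Same validation as in the case class above.
    numPartitions.foreach(n => require(n > 0, "numPartitions must be greater than 0."))
    numPartitions.getOrElse(shufflePartitions)
  }
}
```

So resolve(Some(10)) gives 10, and resolve(None) gives whatever the shuffle-partitions setting is, 200 unless it has been changed.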
Usage examples
- def repartition(numPartitions: Int): DataFrame
    // Load a test DataFrame that contains a user column
    val testDataFrame: DataFrame = readMysqlTable(sqlContext, "MYSQLTABLE", proPath)
    // Get a DataFrame with 10 partitions
    testDataFrame.repartition(10)
- def repartition(partitionExprs: Column*): DataFrame
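    import sqlContext.implicits._ // enables the $"column" syntax
    // Load a test DataFrame that contains a user column
    val testDataFrame: DataFrame = readMysqlTable(sqlContext, "MYSQLTABLE", proPath)
    // Partition by the user column; the partition count comes from spark.sql.shuffle.partitions
    testDataFrame.repartition($"user")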
- def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame
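    import sqlContext.implicits._ // enables the $"column" syntax
    // Load a test DataFrame that contains a user column
    val testDataFrame: DataFrame = readMysqlTable(sqlContext, "MYSQLTABLE", proPath)
    // Partition by the user column into 10 partitions. This can speed up joins
    // considerably, but watch out for data skew.
    testDataFrame.repartition(10, $"user")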
- coalesce(numPartitions: Int): DataFrame
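    val testDataFrame1: DataFrame = readMysqlTable(sqlContext, "MYSQLTABLE", proPath)
    val testDataFrame2 = testDataFrame1.repartition(10)
    // No shuffle: the 10 partitions are merged into 5
    testDataFrame2.coalesce(5)
    // Still no shuffle: coalesce cannot increase the count, so the result keeps 10 partitions
    testDataFrame2.coalesce(100)
    // To actually get 100 partitions, shuffle with repartition
    testDataFrame2.repartition(100)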
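How many partitions to use depends on your own workload: too many wastes resources, and too few hurts performance.
This is only a first look. I plan to dig into the underlying implementation later.
These are my own notes from work and study. Please credit the source when reposting.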