Spark Programming Model (6): Basic RDD Transformation Operations - coalesce and repartition

coalesce()

  • def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]

  • This function repartitions an RDD; when a shuffle is performed, the data is redistributed with a HashPartitioner

  • The first parameter is the target number of partitions; the second controls whether a shuffle is performed and defaults to false

  • Passing only the first parameter reduces the number of partitions in the RDD to numPartitions, so numPartitions should be smaller than the RDD's original partition count

  • If the numPartitions passed in is greater than the RDD's original partition count, the second parameter must be set to true; otherwise the call has no effect, as the example below shows

      scala> val data = sc.textFile("/data/spark_rdd.txt", 2)
      data: org.apache.spark.rdd.RDD[String] = /data/spark_rdd.txt MapPartitionsRDD[19] at textFile at <console>:24
      
      scala> data.partitions.size
      res11: Int = 2
      
      scala> val new_data = data.coalesce(1)
      new_data: org.apache.spark.rdd.RDD[String] = CoalescedRDD[20] at coalesce at <console>:26
      
      scala> new_data.partitions.size
      res12: Int = 1
      
      scala> val new_data = data.coalesce(4)
      new_data: org.apache.spark.rdd.RDD[String] = CoalescedRDD[21] at coalesce at <console>:26
      
      scala> new_data.partitions.size
      res13: Int = 2
      
      scala> val new_data = data.coalesce(4, true)
      new_data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at coalesce at <console>:26
      
      scala> new_data.partitions.size
      res14: Int = 4
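
To see what coalesce() does without a shuffle, the per-partition contents can be inspected with glom(). Below is a minimal sketch to paste into the same spark-shell session; the parallelized numbers and the names nums and merged are illustrative stand-ins, not part of the original example:

      // sc is the SparkContext provided by spark-shell
      // distribute 1..8 across 4 partitions, then coalesce down to 2 without a shuffle
      val nums = sc.parallelize(1 to 8, 4)
      val merged = nums.coalesce(2)
      // glom() turns each partition into an array, making the grouping visible;
      // adjacent parent partitions are merged locally, so no data crosses the network
      merged.glom().collect().foreach(p => println(p.mkString(", ")))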
    

repartition()

  • def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

  • Reshuffles the data in the RDD to create either more or fewer partitions, balancing the data across them

  • This operation always shuffles all of the data over the network

  • This function is simply coalesce() with the second parameter (shuffle) set to true; see the sketch after the example below

      scala> val data = sc.textFile("/data/spark_rdd.txt", 2)
      data: org.apache.spark.rdd.RDD[String] = /data/spark_rdd.txt MapPartitionsRDD[1] at textFile at <console>:24
      
      scala> data.partitions.size
      res0: Int = 2
      
      scala> val data_1 = data.repartition(1)
      data_1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at repartition at <console>:26
      
      scala> data_1.partitions.size
      res2: Int = 1
      
      scala> val data_2 = data.repartition(4)
      data_2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at repartition at <console>:26
      
      scala> data_2.partitions.size
      res3: Int = 4
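
As a quick check of the equivalence noted above, the sketch below (again assuming a spark-shell session; the names nums, viaRepartition, viaCoalesce, and sizes are illustrative) compares repartition(4) with coalesce(4, shuffle = true) and prints per-partition element counts to show the roughly even spread:

      // sc is the SparkContext provided by spark-shell
      val nums = sc.parallelize(1 to 100, 2)
      // repartition(n) is just coalesce(n, shuffle = true), so both produce a shuffled RDD
      val viaRepartition = nums.repartition(4)
      val viaCoalesce = nums.coalesce(4, shuffle = true)
      // count the elements in each partition to see the roughly even distribution
      val sizes = viaRepartition.mapPartitions(it => Iterator(it.size)).collect()
      println(sizes.mkString(", "))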
    