|NO.Z.00016|——————————|BigDataEnd|——|Hadoop&Spark.V04|——|Spark.v04|sparkcore|RDD Programming & Transformation|

1. Transformation (Important)
### --- Transformation: RDD operators fall into two categories:

~~~     Transformation: transforms an RDD; the operation is deferred (i.e., lazy);
~~~     Action: triggers the computation of an RDD and returns the result or saves it to an external system;
~~~     Transformation: returns a new RDD
~~~     Action: returns a result (an Int, a Double, a collection, etc.), never a new RDD
~~~     Be able to distinguish Transformations from Actions precisely
### --- Transformation

~~~     Each Transformation produces a new RDD, which feeds the next transformation; the resulting RDD is lazily evaluated.
~~~     That is, the chain of transformations only records the transformation lineage and performs no real computation.
~~~     Only when an Action is encountered does the actual computation start, working from the head of the lineage and applying the physical transformations (see the sketch below).
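~~~     # A minimal sketch (assumes a spark-shell session, so `sc` is available) of lazy evaluation:
~~~     # the transformations below only record lineage; nothing runs until the Action collect()
val nums    = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)         // recorded in the lineage, not executed
val big     = doubled.filter(_ > 10)  // recorded in the lineage, not executed
println(big.toDebugString)            // prints the recorded lineage of big
big.collect()                         // the Action triggers the actual computation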
### --- Common Transformation operators:

~~~     Official documentation: http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations/
2. Common transformation operators, part 1
### --- Common transformations

~~~     map(func): applies func to every element of the dataset and returns a new RDD
~~~     filter(func): applies func to every element and returns a new RDD containing only the elements for which func is true
~~~     flatMap(func): similar to map, but each input element is mapped to 0 or more output elements
~~~     mapPartitions(func): similar to map, but while map applies func to each element,
~~~     mapPartitions applies func to a whole partition at a time.
~~~     If an RDD has N elements in M partitions (N >> M), the function passed to map is called N times,
~~~     while the function passed to mapPartitions is called only M times, once per partition (see the sketch after this list)
~~~     mapPartitionsWithIndex(func): like mapPartitions, but also provides the partition index
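~~~     # A minimal sketch (assumes a spark-shell session, so `sc` is available): two long
~~~     # accumulators count how often the user function is invoked by map vs. mapPartitions
val mapCalls  = sc.longAccumulator("mapCalls")
val partCalls = sc.longAccumulator("mapPartitionsCalls")
val nums = sc.parallelize(1 to 10, 3)                            // N = 10 elements, M = 3 partitions
nums.map { x => mapCalls.add(1); x * 2 }.collect()
nums.mapPartitions { iter => partCalls.add(1); iter.map(_ * 2) }.collect()
println(s"map calls: ${mapCalls.value}")                         // expected: 10, once per element
println(s"mapPartitions calls: ${partCalls.value}")              // expected: 3, once per partition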
### --- Transformation examples

~~~     # These are all Transformations and have not been executed. To verify that they behave as expected, we need to introduce an Action operator
scala> val rdd1 = sc.parallelize(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at parallelize at <console>:24

scala> val rdd2 = rdd1.map(_*2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[43] at map at <console>:25

scala> val rdd3 = rdd2.filter(_>10)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[44] at filter at <console>:25
~~~     # collect is an Action; it triggers the Job and gathers all elements of the RDD from the Executors to the Driver. Do not use it in production
scala> rdd1.collect
res27: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> rdd2.collect
res28: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

scala> rdd3.collect
res29: Array[Int] = Array(12, 14, 16, 18, 20)
~~~     # flatMap example

scala> val rdd4 = sc.textFile("data/wc.txt")
rdd4: org.apache.spark.rdd.RDD[String] = data/wc.txt MapPartitionsRDD[51] at textFile at <console>:24

scala> rdd4.collect
res33: Array[String] = Array(hadoop mapreduce yarn, hdfs hadoop mapreduce, mapreduce yarn yanqi, yanqi, yanqi)

scala> rdd4.flatMap(_.split("\\s+")).collect
res34: Array[String] = Array(hadoop, mapreduce, yarn, hdfs, hadoop, mapreduce, mapreduce, yarn, yanqi, yanqi, yanqi)
~~~     # An RDD is partitioned: how many partitions does rdd1 have, and which elements are in each one?

scala> rdd1.getNumPartitions
res35: Int = 3

scala> rdd1.partitions.length
res36: Int = 3

scala> rdd1.mapPartitions{iter => Iterator(s"${iter.toList}")}.collect
res37: Array[String] = Array(List(1, 2, 3), List(4, 5, 6), List(7, 8, 9, 10))

scala> rdd1.mapPartitions{iter => Iterator(s"${iter.toArray.mkString("-")}")}.collect
res38: Array[String] = Array(1-2-3, 4-5-6, 7-8-9-10)

scala> rdd1.mapPartitionsWithIndex{(idx, iter) => Iterator(s"$idx:${iter.toArray.mkString("-")}")}.collect
res39: Array[String] = Array(0:1-2-3, 1:4-5-6, 2:7-8-9-10)

~~~     # Multiply every element by 2
scala> val rdd5 = rdd1.mapPartitions(iter => iter.map(_*2))
rdd5: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[56] at mapPartitions at <console>:25

scala> rdd5.collect
res40: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
### --- Differences between map and mapPartitions

~~~     map: processes one element at a time
~~~     mapPartitions: processes one partition at a time; the partition's data is released only after the whole partition has been processed, which can lead to OOM when memory is tight
~~~     Best practice: when memory is sufficient, prefer mapPartitions for better efficiency (a minimal sketch follows)
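~~~     # A minimal sketch (assumes a spark-shell session, so `sc` is available): with mapPartitions,
~~~     # an expensive, non-thread-safe object such as SimpleDateFormat is built once per partition
~~~     # instead of once per element
val millis = sc.parallelize(Seq(0L, 3600000L, 7200000L), 3)
val formatted = millis.mapPartitions { iter =>
  val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss") // created once per partition
  iter.map(ms => fmt.format(new java.util.Date(ms)))              // reused for every element
}
formatted.collect()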
3. Common transformation operators, part 2
### --- Common transformation operators

~~~     groupBy(func): groups elements by the return value of func; values with the same key go into one iterator
~~~     glom(): turns each partition into an array, producing a new RDD of type RDD[Array[T]]
~~~     sample(withReplacement, fraction, seed): sampling operator. Draws an approximate fraction of the data using the given random seed; withReplacement indicates whether sampling is done with replacement (true) or without (false)
~~~     distinct([numTasks]): removes duplicate elements and returns a new RDD. The optional numTasks parameter changes the number of partitions
~~~     coalesce(numPartitions): reduces the number of partitions, without a shuffle
~~~     repartition(numPartitions): increases or decreases the number of partitions, with a shuffle
~~~     sortBy(func, [ascending], [numTasks]): applies func to the data and sorts by the result
~~~     Wide-dependency (shuffle) operators: groupBy, distinct, repartition, sortBy
### --- Examples of common transformation operators

~~~     # Group the RDD's elements by their remainder modulo 3
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[58] at parallelize at <console>:24

scala> val group = rdd.groupBy(_%3)
group: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[60] at groupBy at <console>:25

scala> group.collect
res41: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(6, 9, 3)), (1,CompactBuffer(1, 7, 10, 4)), (2,CompactBuffer(8, 5, 2)))
~~~     # Group the RDD's elements into groups of 10

scala> val rdd = sc.parallelize(1 to 101)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[61] at parallelize at <console>:24

scala> rdd.glom.map(_.sliding(10, 10).toArray)
res42: org.apache.spark.rdd.RDD[Array[Array[Int]]] = MapPartitionsRDD[63] at map at <console>:26
~~~     # sliding is a method from the Scala collections library
~~~     # Sample the data. fraction is the approximate proportion of data to draw
~~~     # Sampling with replacement, using a fixed seed

scala> rdd.sample(true, 0.2, 2).collect
res43: Array[Int] = Array(2, 4, 5, 7, 9, 15, 22, 24, 25, 50, 57, 59, 61, 66, 71, 71, 72, 73, 75, 78, 78, 89, 90, 90, 94, 97, 100)
~~~     # Sampling without replacement, using a fixed seed
scala> rdd.sample(false, 0.2, 2).collect
res44: Array[Int] = Array(1, 4, 11, 12, 15, 17, 18, 25, 26, 28, 30, 48, 54, 55, 62, 63, 71, 75, 76, 78, 79, 84, 90, 91, 97, 99, 101)

~~~     # Sampling without replacement, without setting a seed
scala> rdd.sample(false, 0.2).collect
res45: Array[Int] = Array(7, 16, 22, 31, 43, 47, 55, 64, 68, 70, 76, 77, 81, 82, 91, 95, 96, 97)
~~~     # Deduplicate the data

scala> val random = scala.util.Random
random: util.Random.type = scala.util.Random$@2bf7c513

scala> val arr = (1 to 20).map(x => random.nextInt(10))
arr: scala.collection.immutable.IndexedSeq[Int] = Vector(0, 7, 2, 0, 9, 2, 4, 6, 8, 1, 0, 2, 5, 6, 8, 7, 8, 3, 4, 1)

scala> val rdd = sc.makeRDD(arr)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[67] at makeRDD at <console>:26

scala> rdd.distinct.collect
res46: Array[Int] = Array(0, 6, 3, 9, 4, 1, 7, 8, 5, 2)
~~~     # Repartitioning an RDD

scala> val rdd1 = sc.range(1, 10000, numSlices=10)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[72] at range at <console>:24

scala> val rdd2 = rdd1.filter(_%2==0)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[73] at filter at <console>:25

scala> rdd2.getNumPartitions
res47: Int = 10
~~~     # Reduce the number of partitions; both approaches take effect

scala> val rdd3 = rdd2.repartition(5)
rdd3: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[77] at repartition at <console>:25

scala> rdd3.getNumPartitions
res48: Int = 5

scala> val rdd4 = rdd2.coalesce(5)
rdd4: org.apache.spark.rdd.RDD[Long] = CoalescedRDD[78] at coalesce at <console>:25

scala> rdd4.getNumPartitions
res49: Int = 5
~~~     # Increase the number of partitions

scala> val rdd5 = rdd2.repartition(20)
rdd5: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[82] at repartition at <console>:25

scala> rdd5.getNumPartitions
res50: Int = 20
~~~     # Increasing the number of partitions this way has no effect

scala> val rdd6 = rdd2.coalesce(20)
rdd6: org.apache.spark.rdd.RDD[Long] = CoalescedRDD[83] at coalesce at <console>:25

scala> rdd6.getNumPartitions
res51: Int = 10
~~~     # The correct way to increase the number of partitions with coalesce

scala> val rdd6 = rdd2.coalesce(20, true)
rdd6: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[87] at coalesce at <console>:25

scala> rdd6.getNumPartitions
res52: Int = 20
~~~     # Sorting RDD elements

scala> val random = scala.util.Random
random: util.Random.type = scala.util.Random$@2bf7c513

scala> val arr = (1 to 20).map(x => random.nextInt(10))
arr: scala.collection.immutable.IndexedSeq[Int] = Vector(8, 8, 4, 4, 2, 1, 1, 6, 9, 7, 9, 6, 9, 1, 4, 0, 0, 4, 3, 3)

scala> val rdd = sc.makeRDD(arr)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[88] at makeRDD at <console>:26

scala> rdd.collect
res53: Array[Int] = Array(8, 8, 4, 4, 2, 1, 1, 6, 9, 7, 9, 6, 9, 1, 4, 0, 0, 4, 3, 3)
~~~     # Globally sorted data, ascending by default
scala> rdd.sortBy(x=>x).collect
res54: Array[Int] = Array(0, 0, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 6, 6, 7, 8, 8, 9, 9, 9)

~~~     # Globally sorted data, in descending order (ascending = false)
scala> rdd.sortBy(x=>x,false).collect
res55: Array[Int] = Array(9, 9, 9, 8, 8, 7, 6, 6, 4, 4, 4, 4, 3, 3, 2, 1, 1, 1, 0, 0)
### --- Differences between coalesce and repartition

~~~     # Source excerpt: RDD.scala
~~~     # Line 431
/**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   *
   * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }

~~~     # Line 468
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
### --- Summary:

~~~     repartition: increases or decreases the number of partitions; always shuffles
~~~     coalesce: typically used to reduce the number of partitions (no shuffle in that case); see the sketch below
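~~~     # A minimal sketch (assumes a spark-shell session, so `sc` is available): shrink partitions
~~~     # with coalesce (no shuffle) after a selective filter; use repartition to grow them
val src       = sc.range(1, 10000, numSlices = 10)
val filtered  = src.filter(_ % 100 == 0)       // few elements survive, still 10 partitions
val compacted = filtered.coalesce(2)           // narrow dependency, no shuffle
val widened   = filtered.repartition(20)       // shuffles, now 20 partitions
println(compacted.getNumPartitions)            // expected: 2
println(widened.getNumPartitions)              // expected: 20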
4. Common transformation operators, part 3
### --- Common transformation operators

~~~     # Intersection, union and difference of two RDDs:
~~~     intersection(otherRDD)
~~~     union(otherRDD)
~~~     subtract(otherRDD)
~~~     # cartesian(otherRDD): Cartesian product
~~~     # zip(otherRDD): combines two RDDs into an RDD of key-value pairs;
~~~     # it requires both RDDs to have the same number of partitions and the same number of elements, otherwise an exception is thrown.
~~~     # Wide-dependency (shuffle) operators: intersection, subtract
### --- Examples of these operators

scala> val rdd1 = sc.range(1, 21)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[100] at range at <console>:24

scala> val rdd2 = sc.range(10, 31)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[102] at range at <console>:24

scala> rdd1.intersection(rdd2).sortBy(x=>x).collect
res56: Array[Long] = Array(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
~~~     # Union of the elements, without deduplication

scala> rdd1.union(rdd2).sortBy(x=>x).collect
res57: Array[Long] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)

scala> rdd1.subtract(rdd2).sortBy(x=>x).collect
res58: Array[Long] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
~~~     # Check the number of partitions

scala> rdd1.intersection(rdd2).getNumPartitions
res59: Int = 3

scala> rdd1.union(rdd2).getNumPartitions
res60: Int = 6

scala> rdd1.subtract(rdd2).getNumPartitions
res61: Int = 3
~~~     # Cartesian product

scala> val rdd1 = sc.range(1, 5)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[141] at range at <console>:24

scala> val rdd2 = sc.range(6, 10)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[143] at range at <console>:24

scala> rdd1.cartesian(rdd2).collect
res62: Array[(Long, Long)] = Array((1,6), (1,7), (1,8), (1,9), (2,6), (2,7), (2,8), (2,9), (3,6), (4,6), (3,7), (4,7), (3,8), (3,9), (4,8), (4,9))
~~~     # Check the number of partitions

scala> rdd1.cartesian(rdd2).getNumPartitions
res63: Int = 9
~~~     # zip operation

scala> rdd1.zip(rdd2).collect
res64: Array[(Long, Long)] = Array((1,6), (2,7), (3,8), (4,9))

scala> rdd1.zip(rdd2).getNumPartitions
res65: Int = 3
~~~     # zip requires both RDDs to have the same number of partitions and the same number of elements per partition; the collect below therefore fails with an exception

scala> val rdd2 = sc.range(2, 20)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[152] at range at <console>:24
scala> rdd1.zip(rdd2).collect
### --- Notes:
~~~     union is a narrow dependency. The resulting RDD's partition count is the sum of the two RDDs' partition counts

~~~     # cartesian is a narrow dependency
~~~     The element count of the resulting RDD is the product of the two RDDs' element counts
~~~     The partition count of the resulting RDD is the product of the two RDDs' partition counts
~~~     This operation causes data explosion; use it with caution
