|NO.Z.00016|——————————|BigDataEnd|——|Hadoop&Spark.V04|——|Spark.v04|sparkcore|RDD Programming & Transformation|
1. Transformation [Important]
### --- Transformation: RDD operators fall into two categories:
~~~ Transformation: transforms an RDD; the operation is deferred (in other words, lazy);
~~~ Action: triggers the actual computation of an RDD, either returning the result or saving it to an external system;
~~~ Transformation: returns a new RDD
~~~ Action: returns a result such as an Int, a Double, or a collection (never a new RDD)
~~~ Be sure to distinguish Transformations from Actions accurately
### --- Transformation
~~~ Every Transformation produces a new RDD that is consumed by the next transformation; the resulting RDD is lazily evaluated.
~~~ That is, the whole chain of transformations only records the lineage and does not trigger any real computation.
~~~ Only when an Action is encountered does the actual computation start, executing the physical transformations from the source of the lineage (see the sketch below);
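~~~ To see that only the lineage is recorded before an Action runs, you can print an RDD's dependency chain with toDebugString. A minimal sketch, assuming a spark-shell session where sc is the SparkContext:
// Transformations only record lineage; toDebugString shows the recorded chain
val nums    = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)
val big     = doubled.filter(_ > 10)
println(big.toDebugString)   // shows filter <- map <- parallelize; nothing has executed yet
println(big.count())         // count is an Action: only now does a job actually run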

### --- Common Transformation operators:
~~~ Official documentation: http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
2. Common Transformation Operators (1)
### --- Common transformations
~~~ map(func): applies func to every element of the dataset and returns a new RDD
~~~ filter(func): applies func to every element and returns a new RDD containing only the elements for which func is true
~~~ flatMap(func): similar to map, but each input element is mapped to 0 or more output elements
~~~ mapPartitions(func): similar to map, but while map applies func to each individual element,
~~~ mapPartitions applies func to an entire partition.
~~~ If an RDD has N elements and M partitions (N >> M), the function passed to map is called N times,
~~~ while the function passed to mapPartitions is called only M times, processing all elements of one partition at a time
~~~ mapPartitionsWithIndex(func): similar to mapPartitions, but also provides the partition index
### --- Transformation examples
~~~ # These are all Transformations and have not been executed yet. To verify that they behave as expected, we need an Action operator
scala> val rdd1 = sc.parallelize(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at parallelize at <console>:24
scala> val rdd2 = rdd1.map(_*2)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[43] at map at <console>:25
scala> val rdd3 = rdd2.filter(_>10)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[44] at filter at <console>:25
~~~ # collect is an Action: it triggers job execution and gathers all elements of the RDD from the Executors to the Driver. Do not use it in production
scala> rdd1.collect
res27: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> rdd2.collect
res28: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
scala> rdd3.collect
res29: Array[Int] = Array(12, 14, 16, 18, 20)
~~~ # flatMap example
scala> val rdd4 = sc.textFile("data/wc.txt")
rdd4: org.apache.spark.rdd.RDD[String] = data/wc.txt MapPartitionsRDD[51] at textFile at <console>:24
scala> rdd4.collect
res33: Array[String] = Array(hadoop mapreduce yarn, hdfs hadoop mapreduce, mapreduce yarn yanqi, yanqi, yanqi)
scala> rdd4.flatMap(_.split("\\s+")).collect
res34: Array[String] = Array(hadoop, mapreduce, yarn, hdfs, hadoop, mapreduce, mapreduce, yarn, yanqi, yanqi, yanqi)
~~~ # RDDs are partitioned: how many partitions does rdd1 have, and which elements are in each partition?
scala> rdd1.getNumPartitions
res35: Int = 3
scala> rdd1.partitions.length
res36: Int = 3
scala> rdd1.mapPartitions{iter => Iterator(s"${iter.toList}")}.collect
res37: Array[String] = Array(List(1, 2, 3), List(4, 5, 6), List(7, 8, 9, 10))
scala> rdd1.mapPartitions{iter => Iterator(s"${iter.toArray.mkString("-")}")}.collect
res38: Array[String] = Array(1-2-3, 4-5-6, 7-8-9-10)
scala> rdd1.mapPartitionsWithIndex{(idx, iter) => Iterator(s"$idx:${iter.toArray.mkString("-")}")}.collect
res39: Array[String] = Array(0:1-2-3, 1:4-5-6, 2:7-8-9-10)
~~~ # Multiply each element by 2
scala> val rdd5 = rdd1.mapPartitions(iter => iter.map(_*2))
rdd5: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[56] at mapPartitions at <console>:25
scala> rdd5.collect
res40: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
### --- The difference between map and mapPartitions
~~~ map: processes one element at a time
~~~ mapPartitions: processes one partition at a time; the partition's data can only be released after the whole partition has been processed, so it can cause OOM when memory is tight
~~~ Best practice: when memory is plentiful, prefer mapPartitions for better efficiency; see the sketch below
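~~~ A common reason to prefer mapPartitions is that per-partition setup work is paid once per partition instead of once per element. A minimal sketch, assuming a spark-shell session; expensiveSetup is a hypothetical stand-in for any costly initialization such as opening a connection or building a parser:
// Hypothetical costly initialization, used here only for illustration
def expensiveSetup(): Int => String = x => s"value=$x"

val nums = sc.parallelize(1 to 10, 3)

// map: the function body runs per element, so the setup cost is paid 10 times
val perElement = nums.map { x => val fmt = expensiveSetup(); fmt(x) }

// mapPartitions: the setup runs once per partition (3 times), then the iterator is consumed lazily
val perPartition = nums.mapPartitions { iter =>
  val fmt = expensiveSetup()
  iter.map(fmt)
}
perPartition.collect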
3. Common Transformation Operators (2)
### --- Common transformation operators
~~~ groupBy(func): groups elements by the return value of func; values with the same key are placed into an iterator
~~~ glom(): turns each partition into an array, producing a new RDD of type RDD[Array[T]]
~~~ sample(withReplacement, fraction, seed): sampling operator. Randomly samples roughly a fraction of the data using the given random seed; withReplacement indicates whether the sampling is done with replacement (true) or without replacement (false)
~~~ distinct([numTasks]): deduplicates the RDD's elements and returns a new RDD. The optional numTasks parameter changes the number of partitions
~~~ coalesce(numPartitions): reduces the number of partitions, without a shuffle
~~~ repartition(numPartitions): increases or decreases the number of partitions, with a shuffle
~~~ sortBy(func, [ascending], [numTasks]): applies func to the elements and sorts by the result
~~~ Wide-dependency (shuffle) operators: groupBy, distinct, repartition, sortBy
### --- Examples of the common operators
~~~ # Group the RDD's elements by their remainder modulo 3
scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[58] at parallelize at <console>:24
scala> val group = rdd.groupBy(_%3)
group: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[60] at groupBy at <console>:25
scala> group.collect
res41: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(6, 9, 3)), (1,CompactBuffer(1, 7, 10, 4)), (2,CompactBuffer(8, 5, 2)))
~~~ # Split the RDD's elements into groups of 10
scala> val rdd = sc.parallelize(1 to 101)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[61] at parallelize at <console>:24
scala> rdd.glom.map(_.sliding(10, 10).toArray)
res42: org.apache.spark.rdd.RDD[Array[Array[Int]]] = MapPartitionsRDD[63] at map at <console>:26
~~~ # sliding is a method from the Scala collections library, not a Spark operator; see the local example below
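~~~ A quick local illustration of the sliding(10, 10) call used above: it cuts a sequence into consecutive groups of 10, the last group possibly being shorter.
// Plain Scala, no Spark required
val groups = (1 to 25).sliding(10, 10).toList
// three groups: 1-10, 11-20, and the remainder 21-25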
~~~ # Sampling the data. fraction is the approximate fraction of the data to sample
~~~ # Sampling with replacement, using a fixed seed
scala> rdd.sample(true, 0.2, 2).collect
res43: Array[Int] = Array(2, 4, 5, 7, 9, 15, 22, 24, 25, 50, 57, 59, 61, 66, 71, 71, 72, 73, 75, 78, 78, 89, 90, 90, 94, 97, 100)
~~~ # Sampling without replacement, using a fixed seed
scala> rdd.sample(false, 0.2, 2).collect
res44: Array[Int] = Array(1, 4, 11, 12, 15, 17, 18, 25, 26, 28, 30, 48, 54, 55, 62, 63, 71, 75, 76, 78, 79, 84, 90, 91, 97, 99, 101)
~~~ # Sampling without replacement, no seed specified
scala> rdd.sample(false, 0.2).collect
res45: Array[Int] = Array(7, 16, 22, 31, 43, 47, 55, 64, 68, 70, 76, 77, 81, 82, 91, 95, 96, 97)
~~~ # Deduplicate the data
scala> val random = scala.util.Random
random: util.Random.type = scala.util.Random$@2bf7c513
scala> val arr = (1 to 20).map(x => random.nextInt(10))
arr: scala.collection.immutable.IndexedSeq[Int] = Vector(0, 7, 2, 0, 9, 2, 4, 6, 8, 1, 0, 2, 5, 6, 8, 7, 8, 3, 4, 1)
scala> val rdd = sc.makeRDD(arr)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[67] at makeRDD at <console>:26
scala> rdd.distinct.collect
res46: Array[Int] = Array(0, 6, 3, 9, 4, 1, 7, 8, 5, 2)
~~~ # Repartitioning an RDD
scala> val rdd1 = sc.range(1, 10000, numSlices=10)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[72] at range at <console>:24
scala> val rdd2 = rdd1.filter(_%2==0)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[73] at filter at <console>:25
scala> rdd2.getNumPartitions
res47: Int = 10
~~~ # Reduce the number of partitions; both approaches take effect
scala> val rdd3 = rdd2.repartition(5)
rdd3: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[77] at repartition at <console>:25
scala> rdd3.getNumPartitions
res48: Int = 5
scala> val rdd4 = rdd2.coalesce(5)
rdd4: org.apache.spark.rdd.RDD[Long] = CoalescedRDD[78] at coalesce at <console>:25
scala> rdd4.getNumPartitions
res49: Int = 5
~~~ # Increase the number of partitions
scala> val rdd5 = rdd2.repartition(20)
rdd5: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[82] at repartition at <console>:25
scala> rdd5.getNumPartitions
res50: Int = 20
~~~ # Increasing the number of partitions this way has no effect (coalesce without shuffle cannot add partitions)
scala> val rdd6 = rdd2.coalesce(20)
rdd6: org.apache.spark.rdd.RDD[Long] = CoalescedRDD[83] at coalesce at <console>:25
scala> rdd6.getNumPartitions
res51: Int = 10
~~~ # The correct way to increase the number of partitions with coalesce (enable shuffle)
scala> val rdd6 = rdd2.coalesce(20, true)
rdd6: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[87] at coalesce at <console>:25
scala> rdd6.getNumPartitions
res52: Int = 20
~~~ # Sorting RDD elements
scala> val random = scala.util.Random
random: util.Random.type = scala.util.Random$@2bf7c513
scala> val arr = (1 to 20).map(x => random.nextInt(10))
arr: scala.collection.immutable.IndexedSeq[Int] = Vector(8, 8, 4, 4, 2, 1, 1, 6, 9, 7, 9, 6, 9, 1, 4, 0, 0, 4, 3, 3)
scala> val rdd = sc.makeRDD(arr)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[88] at makeRDD at <console>:26
scala> rdd.collect
res53: Array[Int] = Array(8, 8, 4, 4, 2, 1, 1, 6, 9, 7, 9, 6, 9, 1, 4, 0, 0, 4, 3, 3)
~~~ # Globally sorted data, ascending by default
scala> rdd.sortBy(x=>x).collect
res54: Array[Int] = Array(0, 0, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 6, 6, 7, 8, 8, 9, 9, 9)
~~~ # Globally sorted data, descending (pass false)
scala> rdd.sortBy(x=>x,false).collect
res55: Array[Int] = Array(9, 9, 9, 8, 8, 7, 6, 6, 4, 4, 4, 4, 3, 3, 2, 1, 1, 1, 0, 0)
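~~~ The func argument of sortBy derives the sort key from each element rather than sorting the elements directly. A small sketch, assuming a spark-shell session, that sorts words by a computed key (their length):
val words = sc.makeRDD(Seq("spark", "hdfs", "yarn", "mapreduce"))
words.sortBy(_.length).collect                       // shortest words first
words.sortBy(_.length, ascending = false).collect    // longest words first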
### --- The difference between coalesce and repartition
~~~ # Source excerpt: RDD.scala
~~~ # around line 431
/**
 * Return a new RDD that has exactly numPartitions partitions.
 *
 * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
 * a shuffle to redistribute data.
 *
 * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
 * which can avoid performing a shuffle.
 *
 * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
 */
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
~~~ # around line 468
def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
### --- Summary:
~~~ repartition: increases or decreases the number of partitions; always shuffles
~~~ coalesce: normally used to reduce the number of partitions (no shuffle in that case); see the sketch below
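~~~ A typical use of coalesce is shrinking the partition count after a selective filter, so that later stages do not run many nearly empty tasks, without paying for a shuffle. A minimal sketch, assuming a spark-shell session:
val raw      = sc.range(1, 1000000, numSlices = 100)
val filtered = raw.filter(_ % 1000 == 0)   // most rows dropped, but still 100 partitions

// No shuffle: neighbouring partitions are merged locally
val compact = filtered.coalesce(10)

// With shuffle: use repartition (or coalesce(n, shuffle = true)) to increase partitions or rebalance skew
val rebalanced = filtered.repartition(20)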
4. Common Transformation Operators (3)
### --- Common transformation operators
~~~ # Set operators on RDDs: intersection, union, and difference, respectively:
~~~ intersection(otherRDD)
~~~ union(otherRDD)
~~~ subtract(otherRDD)
~~~ # cartesian(otherRDD): Cartesian product
~~~ # zip(otherRDD): combines two RDDs into a key-value RDD;
~~~ # the two RDDs must have the same number of partitions and the same number of elements, otherwise an exception is thrown.
~~~ # Wide-dependency (shuffle) operators: intersection, subtract
### --- Examples of these operators
scala> val rdd1 = sc.range(1, 21)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[100] at range at <console>:24
scala> val rdd2 = sc.range(10, 31)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[102] at range at <console>:24
scala> rdd1.intersection(rdd2).sortBy(x=>x).collect
res56: Array[Long] = Array(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
~~~ # Union of the elements, without deduplication
scala> rdd1.union(rdd2).sortBy(x=>x).collect
res57: Array[Long] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
scala> rdd1.subtract(rdd2).sortBy(x=>x).collect
res58: Array[Long] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
~~~ # Check the partition counts
scala> rdd1.intersection(rdd2).getNumPartitions
res59: Int = 3
scala> rdd1.union(rdd2).getNumPartitions
res60: Int = 6
scala> rdd1.subtract(rdd2).getNumPartitions
res61: Int = 3
~~~ # Cartesian product
scala> val rdd1 = sc.range(1, 5)
rdd1: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[141] at range at <console>:24
scala> val rdd2 = sc.range(6, 10)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[143] at range at <console>:24
scala> rdd1.cartesian(rdd2).collect
res62: Array[(Long, Long)] = Array((1,6), (1,7), (1,8), (1,9), (2,6), (2,7), (2,8), (2,9), (3,6), (4,6), (3,7), (4,7), (3,8), (3,9), (4,8), (4,9))
~~~ # Check the partition counts
scala> rdd1.cartesian(rdd2).getNumPartitions
res63: Int = 9
~~~ # The zip operation (pairing elements position by position)
scala> rdd1.zip(rdd2).collect
res64: Array[(Long, Long)] = Array((1,6), (2,7), (3,8), (4,9))
scala> rdd1.zip(rdd2).getNumPartitions
res65: Int = 3
~~~ # zip requires that the two RDDs have the same number of partitions and the same number of elements; otherwise an exception is thrown
scala> val rdd2 = sc.range(2, 20)
rdd2: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[152] at range at <console>:24
scala> rdd1.zip(rdd2).collect
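~~~ # This collect fails: rdd1 has 4 elements while rdd2 has 18, so Spark aborts the job with an error along the lines of "Can only zip RDDs with same number of elements in each partition"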
### --- Notes:
~~~ union is a narrow dependency. The resulting RDD's partition count equals the sum of the two input RDDs' partition counts
~~~ # cartesian is a narrow dependency
~~~ The resulting RDD's element count equals the product of the two input RDDs' element counts
~~~ The resulting RDD's partition count equals the product of the two input RDDs' partition counts
~~~ This operation blows up the data volume; use it with caution