
Operators in Spark

Transformation operators: not executed immediately when defined; they are evaluated lazily and only run when an action is triggered.

Action operators: executed immediately, triggering the actual computation.
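
A minimal sketch (assuming a running SparkContext named sc, as in the shell examples below) showing the difference: a transformation such as map only records how a new RDD would be computed, while an action such as collect actually triggers the job.

// Assumes `sc` is an existing SparkContext (e.g. the one provided by spark-shell)
val nums = sc.parallelize(1 to 5)   // creates an RDD; nothing is computed yet
val doubled = nums.map(_ * 2)       // transformation: lazily recorded, not executed
val result = doubled.collect()      // action: the computation runs here
// result: Array(2, 4, 6, 8, 10)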

 

1. The map operator

  Transforms each element of the original RDD with a user-defined function, producing a new RDD.

scala> rdd_f1.collect()
res32: Array[String] = Array(i am a sutdnet, i am a boy)


// Split each of the original two strings into an array; each word becomes an array element, and the delimiter is a space " "
scala> var rdd_f2 = rdd_f1.map(x=>x.split(" "))
rdd_f2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[48] at map at <console>:25

scala> rdd_f2.collect()
res33: Array[Array[String]] = Array(Array(i, am, a, sutdnet), Array(i, am, a, boy))

 2. The flatMap operator

  Compared with map, flatMap flattens the resulting arrays after splitting, but only one level deep (the outermost level).

scala> var rdd_f3 = rdd_f1.flatMap(x=>x.split(" "))

scala> rdd_f3.collect()
res34: Array[String] = Array(i, am, a, sutdnet, i, am, a, boy)
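
To illustrate the claim that only the outermost level is flattened, here is a small sketch with hypothetical nested data:

// Hypothetical nested data: flatMap unwraps only the outer lists
val nested = sc.parallelize(List(List(List(1, 2), List(3)), List(List(4))))
val flattenedOnce = nested.flatMap(x => x)
// flattenedOnce.collect() would return Array(List(1, 2), List(3), List(4)) -- the inner lists remain intact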

 3. The mapPartitions operator

  Compared with map, the input to the function is an entire partition; in other words, the function operates on the iterator of each partition. This operation does not change the number of partitions.

// Create a dataset containing the numbers 1 to 10
scala> var rdd_mp = sc.parallelize(1 to 10)

// Keep only the numbers greater than 3
scala> val mapPartitonsRDD = rdd_mp.mapPartitions(iter => iter.filter(_>3))

// Print the result
scala> mapPartitonsRDD.collect()
res35: Array[Int] = Array(4, 5, 6, 7, 8, 9, 10)
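
One common reason to prefer mapPartitions over map, sketched below with a purely illustrative helper (expensiveSetup is a hypothetical stand-in for something like opening a database connection), is that per-partition setup work runs once per partition instead of once per element:

// Hypothetical example: perform costly setup once per partition, then apply it to every element
def expensiveSetup(): Int => Int = {
  val offset = 100                       // stand-in for e.g. opening a connection or loading a model
  x => x + offset
}

val rdd = sc.parallelize(1 to 10, 3)     // 3 partitions
val shifted = rdd.mapPartitions { iter =>
  val f = expensiveSetup()               // executed once per partition
  iter.map(f)                            // applied to each element of that partition
}
// shifted.collect() would return Array(101, 102, ..., 110); the number of partitions is still 3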

 4. The sortBy operator

  sortBy(f: (T) => K, ascending, numPartitions)

  f: (T) => K: takes each element of the RDD to be sorted and returns the key within that element to sort by.

  ascending: true for ascending order, false for descending.

  numPartitions: the number of partitions of the sorted RDD; by default it equals the number of partitions before sorting.

scala> val rdd_sortBy = sc.parallelize(List(("zhangsna",20),("lisi",10),("wangwu",24)))

scala> rdd_sortBy.collect()
res43: Array[(String, Int)] = Array((zhangsna,20), (lisi,10), (wangwu,24))


scala> rdd_sortBy.sortBy(x=>x._2,true).collect()
res44: Array[(String, Int)] = Array((lisi,10), (zhangsna,20), (wangwu,24))
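
A small sketch (reusing rdd_sortBy from above) of the descending flag and the numPartitions argument, which are described above but not shown in the transcript:

// Sort by the score in descending order and place the result into a single partition
val desc = rdd_sortBy.sortBy(x => x._2, ascending = false, numPartitions = 1)
// desc.collect() would return Array((wangwu,24), (zhangsna,20), (lisi,10))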

 5. The filter operator

  Filters elements: the elements for which the function returns true make up the new RDD. The function is essentially a predicate (a comparison).

scala> rdd1.collect()
res45: Array[Int] = Array(1, 2, 3)

scala> val result=rdd1.filter(x=>x>1)

scala> result.collect()
res46: Array[Int] = Array(2, 3)

 

Example:

  Find the top 5 students in a score table.

// Split the original bigdata records on the tab character \t into arrays; in the second map, x is that array.
// The fields are strings, so the score field must be converted to Int.
scala> val bigdata_map = bigdata.map(x=>x.split("\t")).map(x=>(x(0),x(1),x(2).toInt))

scala> bigdata_map.collect()
res11: Array[(String, String, Int)] = Array((1001,大数据基础,90),
(1002,大数据基础,94), (1003,大数据基础,100),
(1004,大数据基础,99), (1005,大数据基础,90),
(1006,大数据基础,94), (1007,大数据基础,100),
(1008,大数据基础,93), (1009,大数据基础,89),
(1010,大数据基础,78), (1011,大数据基础,91),
(1012,大数据基础,84))

scala> val bigdata_sort = bigdata_map.sortBy(x=>x._3,false).collect()
bigdata_sort: Array[(String, String, Int)] = Array((1003,大数据基础,100),
(1007,大数据基础,100), (1004,大数据基础,99), (1002,大数据基础,94),
(1006,大数据基础,94), (1008,大数据基础,93), (1011,大数据基础,91),
(1001,大数据基础,90), (1005,大数据基础,90), (1009,大数据基础,89),
(1012,大数据基础,84), (1010,大数据基础,78))

scala> bigdata_sort.take(5)
res13: Array[(String, String, Int)] = Array((1003,大数据基础,100), (1007,大数据基础,100), (1004,大数据基础,99), (1002,大数据基础,94), (1006,大数据基础,94))

 

6. The distinct operator

  Keeps only one copy of each duplicate element in the RDD.

scala> val data = sc.parallelize(List(1,2,3,4,4,5,3))
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[37] at parallelize at <console>:24

scala> data.distinct.collect()
res16: Array[Int] = Array(3, 4, 1, 5, 2)

 

7. The union operator

  Merges two RDDs (duplicates are not removed).

scala> val rdd_1 = sc.parallelize(List(1,2,3))
rdd_1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[41] at parallelize at <console>:24

scala> val rdd_2 = sc.parallelize(List(4,5,6))
rdd_2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at parallelize at <console>:24

scala> rdd_1.union(rdd_2).collect()
res17: Array[Int] = Array(1, 2, 3, 4, 5, 6)

 8. The intersection operator

  Returns the elements that appear in both RDDs (the set intersection).

scala> rdd_1.collect()
res18: Array[Int] = Array(1, 2, 3)

scala> val rdd_3 = sc.parallelize(List(2,3,5))
rdd_3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[44] at parallelize at <console>:24

scala> rdd_1.intersection(rdd_3).collect
res19: Array[Int] = Array(3, 2)

 9. The subtract operator

  Computes the set difference: rdd1.subtract(rdd2) returns the elements that appear in rdd1 but not in rdd2.

scala> val rdd1 = sc.parallelize(Array("A","B","C","D"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[51] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(Array("D","C","E","F"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[52] at parallelize at <console>:24

scala> val subtractRdd=rdd1.subtract(rdd2)
subtractRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at subtract at <console>:27

scala> subtractRdd.collect
res20: Array[String] = Array(B, A)

scala> val substractRdd = rdd2.subtract(rdd1)
substractRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[60] at subtract at <console>:27

scala> substractRdd.collect
res21: Array[String] = Array(E, F)

 10. The cartesian operator

  Pairs every element of one RDD with every element of the other (the Cartesian product).

//rdd_1=(1,2,3)
//rdd_2=(4,5,6)

scala> rdd_1.cartesian(rdd_2).collect
res23: Array[(Int, Int)] = Array((1,4), (1,5), (1,6), (2,4), (2,5), (2,6), (3,4), (3,5), (3,6))

 Example tasks:

  Find the IDs of students who scored 100 on an exam, and gather them into one RDD.

  Find the IDs of students who scored 100 on both subjects, and gather the result into one RDD.

scala> bigdata_map.collect
res25: Array[(String, String, Int)] = Array((1001,大数据基础,90), 
(1002,大数据基础,94), (1003,大数据基础,100), (1004,大数据基础,99),
(1005,大数据基础,90), (1006,大数据基础,94), (1007,大数据基础,100),
(1008,大数据基础,93), (1009,大数据基础,89), (1010,大数据基础,78),
(1011,大数据基础,91), (1012,大数据基础,84))
// Read result_bigdata from HDFS, split each line on the tab character into a tuple,
// then use filter to keep the tuples whose third element equals 100 and take their first element,
// i.e. extract the IDs of the students who scored 100.
scala> val bigdata_100 = sc.textFile("hdfs://master:9000//usr/root/sparkdata/result_bigdata.txt").map(x=>x.split("\t")).map(x=>(x(0),x(1),x(2).toInt)).filter(x=>x._3==100).map(x=>x._1)
bigdata_100: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[68] at map at <console>:24

scala> bigdata_100.collect
res26: Array[String] = Array(1003, 1007)

scala> math_map.collect
res27: Array[(String, String, Int)] = Array((1001,应用数学,96),
(1002,应用数学,94), (1003,应用数学,100), (1004,应用数学,100),
(1005,应用数学,94), (1006,应用数学,80), (1007,应用数学,90),
(1008,应用数学,94), (1009,应用数学,84), (1010,应用数学,86),
(1011,应用数学,79), (1012,应用数学,91))

// In the same way, use filter to output the first element of the tuples whose third element equals 100
scala> val math_100 = math_map.filter(x=>x._3==100).map(x=>x._1)
math_100: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[70] at map at <console>:25

scala> math_100.collect
res28: Array[String] = Array(1003, 1004)

// Merge the two RDDs with union and remove duplicates with distinct
scala> bigdata_100.union(math_100).distinct.collect
res29: Array[String] = Array(1003, 1007, 1004)

// Use intersection to find the students who scored 100 on both subjects
scala> bigdata_100.intersection(math_100).collect
res30: Array[String] = Array(1003)

 

 

Key-value pair RDDs:

  Composed of key-value pairs, these are called pair RDDs. They provide operations for acting on each key in parallel or regrouping data across nodes. For example, reduceByKey() reduces the data for each key separately, and join() combines elements with the same key from two RDDs into a single RDD.

 

Creating a key-value pair RDD:

scala>  val rdd = sc.parallelize(List("a","b","c"))

scala> val rdd2 = rdd.map(x=>(x,1))

scala> rdd2.collect
res31: Array[(String, Int)] = Array((a,1), (b,1), (c,1))

scala> rdd2.keys.collect
res32: Array[String] = Array(a, b, c)

scala> rdd2.values.collect
res33: Array[Int] = Array(1, 1, 1)

 

11. The mapValues operator

  Similar to map, but applies the function only to the value of each (key, value) pair, leaving the key untouched.

scala> rdd2.collect
res31: Array[(String, Int)] = Array((a,1), (b,1), (c,1))

scala> rdd2.values.collect
res33: Array[Int] = Array(1, 1, 1)

scala> rdd2.mapValues(x=>(x,4)).collect
res34: Array[(String, (Int, Int))] = Array((a,(1,4)), (b,(1,4)), (c,(1,4)))

12. The groupByKey operator

  Groups by key. When called on an RDD of (K, V) pairs, it returns a new RDD of (K, Iterable<V>) pairs, i.e. the RDD grouped by key.

scala> val rdd3 = sc.parallelize(List("a","b","c")).map(x=>(x,1))
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[87] at map at <console>:24

scala> rdd3.groupByKey().collect
res35: Array[(String, Iterable[Int])] = Array((c,CompactBuffer(1)), (a,CompactBuffer(1)), (b,CompactBuffer(1)))
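
Since every key in the transcript above appears only once, here is a small sketch with hypothetical data containing a repeated key, to make the grouping visible:

// Hypothetical data with a repeated key "a"
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey()
// grouped.collect() would return something like Array((a,CompactBuffer(1, 3)), (b,CompactBuffer(2)))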

 

13. The reduceByKey operator

  Groups a key-value pair RDD by key and then aggregates. When called on an RDD of (K, V) pairs, it returns a new RDD of (K, V) pairs in which the values of each key are aggregated with the given reduce function func, which must be of type (V, V) => V.

// Count the number of occurrences of each key
scala> rdd2.collect
res41: Array[(String, Int)] = Array((a,1), (b,1), (c,1), (c,1))

scala> rdd2.reduceByKey((x,y)=>x+y).collect
res42: Array[(String, Int)] = Array((c,2), (a,1), (b,1))

 

14. The join operator

  Combines the values of key-value pairs that share the same key.

  Outer-join variants also exist: leftOuterJoin, rightOuterJoin, and fullOuterJoin.

scala> rdd2.collect
res43: Array[(String, Int)] = Array((a,1), (b,1), (c,1), (c,1))

scala> val rdd3 = sc.parallelize(List("a","b","aa")).map(x=>(x,1))
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[96] at map at <console>:24

scala> rdd2.join(rdd3).collect
res44: Array[(String, (Int, Int))] = Array((a,(1,1)), (b,(1,1)))
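
The outer-join variants keep keys that have no match on the other side; a minimal sketch reusing rdd2 and rdd3 from the example above (output ordering may differ):

// leftOuterJoin keeps every key of rdd2; keys missing from rdd3 get None on the right side
val left = rdd2.leftOuterJoin(rdd3)
// left.collect() would contain (a,(1,Some(1))), (b,(1,Some(1))), (c,(1,None)), (c,(1,None))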

 

 

Example:

1. Output each student's total score, by adding the scores that share the same student ID in the two score tables.

2. Output each student's average score, by adding the scores with the same student ID from the two tables and computing the average.

3. Combine each student's total score and average score.

// Read the first score table from HDFS; x(0) and x(2) are the student ID and the score

scala> val bigdata_kv = sc.textFile("hdfs://master:9000//usr/root/sparkdata/result_bigdata.txt").map(x=>x.split("\t")).map(x=>(x(0),x(2).toInt))
bigdata_kv: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:24

// Inspect the big-data score RDD
scala> bigdata_kv.collect
res0: Array[(String, Int)] = Array((1001,90), (1002,94), (1003,100),
(1004,99), (1005,90), (1006,94),
(1007,100), (1008,93), (1009,89),
(1010,78), (1011,91), (1012,84))

// Read the math scores
scala> val math_kv = sc.textFile("hdfs://master:9000/usr/root/sparkdata/result_math.txt").map(x=>x.split("\t")).map(x=>(x(0),x(2).toInt))
math_kv: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[7] at map at <console>:24

// Inspect the math RDD
scala> math_kv.collect
res1: Array[(String, Int)] = Array((1001,96), (1002,94), (1003,100),
(1004,100), (1005,94), (1006,80),
(1007,90), (1008,94), (1009,84),
(1010,86), (1011,79), (1012,91))

// Merge the two RDDs
scala> val score_kv = bigdata_kv.union(math_kv)
score_kv: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[8] at union at <console>:27

// Inspect the merged RDD
scala> score_kv.collect
res2: Array[(String, Int)] = Array((1001,90), (1002,94), (1003,100),
(1004,99), (1005,90), (1006,94),
(1007,100), (1008,93), (1009,89),
(1010,78), (1011,91), (1012,84),
(1001,96), (1002,94), (1003,100),
(1004,100), (1005,94), (1006,80),
(1007,90), (1008,94), (1009,84),
(1010,86), (1011,79), (1012,91))

// Apply reduceByKey to the merged RDD, adding the values that share the same key
scala> val allscore = score_kv.reduceByKey((x,y)=>x+y)
allscore: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:25

// The aggregation is done
scala> allscore.collect
res3: Array[(String, Int)] = Array((1005,184), (1012,175), (1001,186),
(1009,173), (1002,188), (1006,174),
(1010,164), (1003,200), (1007,190),
(1008,187), (1011,170), (1004,199))

// Remap the values of the RDD holding both subjects' scores into the form (score, 1)
scala> val scores_kv_count = score_kv.mapValues(x=>(x,1))
scores_kv_count: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[10] at mapValues at <console>:25

scala> scores_kv_count.collect
res4: Array[(String, (Int, Int))] = Array((1001,(90,1)), (1002,(94,1)), (1003,(100,1)),
(1004,(99,1)), (1005,(90,1)), (1006,(94,1)),
(1007,(100,1)), (1008,(93,1)), (1009,(89,1)),
(1010,(78,1)), (1011,(91,1)), (1012,(84,1)),
(1001,(96,1)), (1002,(94,1)), (1003,(100,1)),
(1004,(100,1)), (1005,(94,1)), (1006,(80,1)),
(1007,(90,1)), (1008,(94,1)), (1009,(84,1)),
(1010,(86,1)), (1011,(79,1)), (1012,(91,1)))

// For each key, add up the scores and the counts, then divide the score sum by the count to get the average
scala> val avgscore = scores_kv_count.reduceByKey((x,y)=>(x._1+y._1,x._2+y._2)).mapValues(x=>x._1/x._2)
avgscore: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[12] at mapValues at <console>:25

scala> avgscore.collect
res5: Array[(String, Int)] = Array((1005,92), (1012,87), (1001,93),
(1009,86), (1002,94), (1006,87), (1010,82),
(1003,100), (1007,95), (1008,93), (1011,85), (1004,99))

// Use join to combine the two RDDs
scala> allscore.join(avgscore).collect
res6: Array[(String, (Int, Int))] = Array((1005,(184,92)), (1012,(175,87)),
(1001,(186,93)), (1009,(173,86)), (1002,(188,94)),
(1006,(174,87)), (1010,(164,82)), (1003,(200,100)),
(1007,(190,95)), (1008,(187,93)), (1011,(170,85)),
(1004,(199,99)))

 

 

  

15. The lookup operator (an Action-type operator)

  Returns all the values V associated with the specified key K.

scala> rdd2.collect
res46: Array[(String, Int)] = Array((a,1), (b,1), (c,1), (c,1))

scala> rdd2.lookup("c")
res47: Seq[Int] = WrappedArray(1, 1)

 
