January 20

 

4-zipWithIndex: 

Zips each element with its index:

Example 1: zip the data of a plain array with its indices:
val a = Array(1,2,3,4,5,6,7,8,9,10)
val rdd1: RDD[Int] = sc.makeRDD(a)
val rdd2: RDD[(Int, Long)] = rdd1.zipWithIndex()   // pairs each value with its zero-based index
rdd2.collect().foreach(println)
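The printed result here pairs each value with its index, i.e. (1,0), (2,1), ... up to (10,9), with the index typed as a Long.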

This is handy: every element is matched one-to-one with its position, the same way zip matches up the elements of two arrays.


5-mapPartitionsWithIndex operator:

Iterates over the data of one partition at a time and carries that partition's index along;

Its purpose: see which element ends up in which partition, i.e. inspect the partition index together with the data that lives in it:

 

Example 1:
Requirement: show the data held by each partition:
 val data = Array(1,2,3,4,5,6,7,8,9,10)
 val rdd1: RDD[Int] = sc.makeRDD(data, 3)
 val f = (index: Int, data: Iterator[Int]) => data.map(a => (index, a))
 val rdd2 = rdd1.mapPartitionsWithIndex(f)
 rdd2.collect().foreach(println)
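On a local run with 3 partitions, the 10 values are typically split so the output reads (0,1), (0,2), (0,3), (1,4), (1,5), (1,6), (2,7), (2,8), (2,9), (2,10): one (partitionIndex, value) pair per element.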



6-flatMap operator:

Use flatMap when, while traversing the elements, each element gets turned into a collection: map does the processing, flat flattens the result, and you end up with a single-level Array. [Spark has no flatten operator, only flatMap.]

Example 1:
Add 1000 to the second element of every pair in the list, then return an RDD:
val data=List(("zhangsan",2000),("lisi",4000),("wangwu",3000),("zhaoliu",2300))
val rdd1: RDD[(String,Int)] = sc.makeRDD(data,3)
val rdd2: RDD[Any] = rdd1.flatMap(a => Array(a._1, a._2 + 1000))   // String and Int are mixed, so the element type becomes Any


Notes on the method:
  flatMap flattens a two-dimensional (or nested) collection by one level. A tuple itself cannot be flattened; you first turn each tuple into an array, so that you have arrays nested inside a collection, and then the flatten step can flatten it:
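A minimal local-Scala sketch of that idea (plain collections, no RDD; the data is only illustrative):

val pairs = List(("zhangsan", 2000), ("lisi", 4000))
val nested: List[Array[Any]] = pairs.map(t => Array(t._1, t._2))   // each tuple becomes an Array
val flat: List[Any] = nested.flatten                               // flatten removes one level of nesting
// flat: List(zhangsan, 2000, lisi, 4000)

flatMap does the map step and the flatten step in one call, as the spark-shell session below shows: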
 
scala> val data=List(("zhangsan",2000),("lisi",4000),("wangwu",3000),("zhaoliu",2300))
data: List[(String, Int)] = List((zhangsan,2000), (lisi,4000), (wangwu,3000), (zhaoliu,2300))

scala> sc.makeRDD(data,3)
res41: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[16] at makeRDD at <console>:27

scala> res41.flatMap
flatMap   flatMapValues

scala> res41.flatMap(a=>Array(a._1,a._2+1000))
res42: org.apache.spark.rdd.RDD[Any] = MapPartitionsRDD[17] at flatMap at <console>:31

scala> res42.collect
res43: Array[Any] = Array(zhangsan, 3000, lisi, 5000, wangwu, 4000, zhaoliu, 3300)

 


7-filter operator:

Filters the data inside an RDD:

The filter operator drops the elements that fail the predicate, but the original partitions still exist;

so after filtering, some partitions may be left empty (their executor tasks run with nothing to do), and the number of partitions should usually be shrunk or the data redistributed (see the coalesce sketch after the example below).

Example:
scala> val data=Array(1,2,3,4,5,6,7,8,9)
data: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> sc.makeRDD(data,3)
res46: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[19] at makeRDD at <console>:27
scala> res46.mapPartitionsWithIndex((index,data)=>data.map((index,_)))
res48: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[20] at mapPartitionsWithIndex at <console>:29
scala> res48.collect
res49: Array[(Int, Int)] = Array((0,1), (0,2), (0,3), (1,4), (1,5), (1,6), (2,7), (2,8), (2,9))
scala> res46.filter(a=>a>6)
res50: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[21] at filter at <console>:31
scala> res50.collect
res51: Array[Int] = Array(7, 8, 9)
scala> res50.mapPartitionsWithIndex((index,data)=>data.map((index,_)))
res52: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[22] at mapPartitionsWithIndex at <console>:33
scala> res52.collect
res53: Array[(Int, Int)] = Array((2,7), (2,8), (2,9))
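A minimal sketch of shrinking the partitions after the filter, continuing from res46 above (coalesce(1) is only an illustrative choice):

val filtered = res46.filter(a => a > 6)    // 3 of the 9 elements survive, but there are still 3 partitions
val shrunk = filtered.coalesce(1)          // merge into fewer partitions so none of them run empty
shrunk.getNumPartitions                    // 1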

8-groupBy operator:

In plain Scala the groupBy operator hands back the grouped values as concrete collections (e.g. Arrays); in Spark the grouped values come back as an Iterable, i.e. in iterator form.

wordCount example:
val conf = new SparkConf().setMaster("local[*]").setAppName("demo")
val sc = new SparkContext(conf)
val rdd1: RDD[String] = sc.textFile("src//data")                             // read the lines
val rdd2: RDD[String] = rdd1.flatMap(a => a.split(" "))                      // split every line into words
val rdd3: RDD[(String, Int)] = rdd2.map(a => (a, 1))                         // word -> 1
val rdd4: RDD[(String, Iterable[(String, Int)])] = rdd3.groupBy(a => a._1)   // group the pairs by word
val rdd5: RDD[(String, Int)] = rdd4.mapValues(a => a.toArray.length)         // count the pairs in each group
rdd5.foreach(println)



9-groupByKey operator:

** Groups the data by its key:

** An operator that can also control partitioning (it accepts a number of partitions or a partitioner):

** Return value: RDD[(String, Iterable[Int])] (for an input of RDD[(String, Int)])
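A minimal sketch of groupByKey on a pair RDD (the data and names are only illustrative):

val pairs: RDD[(String, Int)] = sc.makeRDD(List(("zhangsan", 2000), ("lisi", 4000), ("zhangsan", 500)), 3)
val grouped: RDD[(String, Iterable[Int])] = pairs.groupByKey()   // all values with the same key end up in one Iterable
grouped.collect().foreach(println)                               // e.g. (zhangsan,CompactBuffer(2000, 500))
// the variant pairs.groupByKey(2) also sets the number of result partitions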

 

 









