January 20
4-zipWithIndex:
Zips each element with its index:
Example 1: zip the data of an ordinary array with its indexes:
val a = Array(1,2,3,4,5,6,7,8,9,10)
val rdd1: RDD[Int] = sc.makeRDD(a)
// each element is paired with its position: (element, index), where the index is a Long
val rdd2: RDD[(Int, Long)] = rdd1.zipWithIndex()
rdd2.collect().foreach(println)
This is very useful: it lines two sequences up position by position, so each element is matched with its counterpart (here, each value with its index).
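Along the same lines, two RDDs can also be zipped with each other directly. A minimal sketch (the names here are made up for illustration; zip requires both RDDs to have the same number of partitions and the same number of elements per partition):
val names = sc.makeRDD(Array("zhangsan", "lisi", "wangwu"), 2)
val scores = sc.makeRDD(Array(2000, 4000, 3000), 2)
// zip pairs the elements by position: (zhangsan,2000), (lisi,4000), (wangwu,3000)
val paired: RDD[(String, Int)] = names.zip(scores)
paired.collect().foreach(println)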
5-mapPartitionsWithIndex operator:
Iterates over one partition's worth of data at a time, carrying that partition's index along;
Purpose: see which records sit in which partition, i.e. inspect the partition number and how the corresponding data is laid out:
Example 1:
Requirement: show the data in each partition:
val data = Array(1,2,3,4,5,6,7,8,9,10)
val rdd1: RDD[Int] = sc.makeRDD(data, 3)
// for every partition, tag each element with that partition's index
val f = (index: Int, iter: Iterator[Int]) => iter.map(a => (index, a))
val rdd2 = rdd1.mapPartitionsWithIndex(f)
rdd2.collect().foreach(println)
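As a side note (not from the original notes), partition contents can also be inspected with glom, which gathers every partition into an array. A minimal sketch reusing rdd1 from above:
// glom(): RDD[Array[Int]] -- one array per partition, so partition boundaries become visible
val partitions: Array[Array[Int]] = rdd1.glom().collect()
partitions.zipWithIndex.foreach { case (arr, i) => println(s"partition $i: ${arr.mkString(",")}") }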
6-flatMap operator:
Use flatMap when processing each element turns it into a collection: map does the processing, flat flattens the result into a single-level Array. [Spark has no flatten operator, only flatMap]
Example 1:
Add 1000 to the second element of every pair in the array and return an RDD
val data=List(("zhangsan",2000),("lisi",4000),("wangwu",3000),("zhaoliu",2300))
val rdd1: RDD[(String,Int)] = sc.makeRDD(data,3)
val rdd2: RDD[Any] = rdd1.flatMap(a=>Array(a._1,a._2+1000))
Notes on the method:
flatMap flattens two-dimensional (or deeper) arrays. A tuple cannot be flattened directly; it first has to be converted into an array, producing an array nested inside an array, which can then be flattened with the flatten method:
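A minimal plain-Scala sketch of that point (the names are made up for illustration):
val pairs = List(("zhangsan", 2000), ("lisi", 4000))
// pairs.flatten would not compile: a tuple is not itself a collection
val nested = pairs.map(t => Array(t._1, t._2.toString))  // List(Array(zhangsan, 2000), Array(lisi, 4000))
val flat = nested.flatten                                 // List(zhangsan, 2000, lisi, 4000)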
scala> val data=List(("zhangsan",2000),("lisi",4000),("wangwu",3000),("zhaoliu",2300))
data: List[(String, Int)] = List((zhangsan,2000), (lisi,4000), (wangwu,3000), (zhaoliu,2300))
scala> sc.makeRDD(data,3)
res41: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[16] at makeRDD at <console>:27
scala> res41.flatMap
flatMap flatMapValues
scala> res41.flatMap(a=>Array(a._1,a._2+1000))
res42: org.apache.spark.rdd.RDD[Any] = MapPartitionsRDD[17] at flatMap at <console>:31
scala> res42.collect
res43: Array[Any] = Array(zhangsan, 3000, lisi, 5000, wangwu, 4000, zhaoliu, 3300)
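Note that mixing String and Int in the result widens the element type to RDD[Any], as seen above. If the goal is only to add 1000 to each value while keeping the pairs, a more typical alternative (just a sketch, not part of the original notes) is mapValues on the pair RDD rdd1 defined earlier:
// mapValues keeps the key and transforms only the value, so the type stays RDD[(String, Int)]
val bumped: RDD[(String, Int)] = rdd1.mapValues(v => v + 1000)
bumped.collect().foreach(println)   // (zhangsan,3000), (lisi,5000), (wangwu,4000), (zhaoliu,3300)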
7-filter operator:
Filters the data in an RDD:
filter removes the records that fail the predicate, but the original partitions still exist,
so after filtering some partitions may be left empty (their executor tasks run with nothing to do); the number of partitions should then be reduced or the data redistributed (see the coalesce sketch after the example below).
Example:
scala> val data=Array(1,2,3,4,5,6,7,8,9)
data: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> sc.makeRDD(data,3)
res46: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[19] at makeRDD at <console>:27
scala> res46.mapPartitionsWithIndex((index,data)=>data.map((index,_)))
res48: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[20] at mapPartitionsWithIndex at <console>:29
scala> res48.collect
res49: Array[(Int, Int)] = Array((0,1), (0,2), (0,3), (1,4), (1,5), (1,6), (2,7), (2,8), (2,9))
scala> res46.filter(a=>a>6)
res50: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[21] at filter at <console>:31
scala> res50.collect
res51: Array[Int] = Array(7, 8, 9)
scala> res50.mapPartitionsWithIndex((index,data)=>data.map((index,_)))
res52: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[22] at mapPartitionsWithIndex at <console>:33
scala> res52.collect
res53: Array[(Int, Int)] = Array((2,7), (2,8), (2,9))
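To shrink the partitions after filtering, as suggested above, coalesce (or repartition) can be used. A minimal sketch continuing the session:
// coalesce(1) merges the now-sparse partitions into one without a full shuffle
val filtered = res46.filter(a => a > 6).coalesce(1)
filtered.mapPartitionsWithIndex((index, data) => data.map((index, _))).collect()
// expected along the lines of: Array((0,7), (0,8), (0,9))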
8-groupBy operator:
In plain Scala, groupBy returns a Map whose values are ordinary collections (e.g. an Array); in Spark, groupBy returns the grouped values as an Iterable, i.e. RDD[(K, Iterable[T])].
WordCount example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setMaster("local[*]").setAppName("demo")
val sc = new SparkContext(conf)
val rdd1: RDD[String] = sc.textFile("src//data")
val rdd2: RDD[String] = rdd1.flatMap(a => a.split(" "))
val rdd3: RDD[(String, Int)] = rdd2.map(a => (a, 1))
// groupBy keeps the whole (word, 1) pair inside each group
val rdd4: RDD[(String, Iterable[(String, Int)])] = rdd3.groupBy(a => a._1)
// the count for each word is simply the size of its group
val rdd5: RDD[(String, Int)] = rdd4.mapValues(a => a.size)
rdd5.foreach(println)
9-groupByKey operator:
** Groups the data by key (a short sketch follows below):
** An operator that repartitions the data (it shuffles, and can be given a number of partitions):
** Return value: RDD[(String, Iterable[Int])]
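A minimal sketch of groupByKey, reusing the (word, 1) pairs rdd3 from the WordCount example above:
// groupByKey keeps only the values for each key: RDD[(String, Iterable[Int])]
val grouped: RDD[(String, Iterable[Int])] = rdd3.groupByKey()
// summing the 1s per key gives the same word count as groupBy + mapValues
val counts: RDD[(String, Int)] = grouped.mapValues(_.sum)
counts.foreach(println)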