Spark Programming Model (5): Basic RDD Transformation Operations (Transformation) - map, flatMap, distinct, filter

map()

  • map applies the function you pass in to every element of the RDD, mapping each data item to a new element and producing a new RDD.

  • Input and output partitions correspond one-to-one: the result RDD has exactly as many partitions as the input RDD (a quick check follows the example below).

      scala> val data = sc.textFile("/data/spark_rdd.txt")
      data: org.apache.spark.rdd.RDD[String] = /data/spark_rdd.txt MapPartitionsRDD[1] at textFile at <console>:24
      
      scala> val map_rdd = data.map(line => line.split("\\s+"))
      map_rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26
      
      scala> map_rdd.collect
      res0: Array[Array[String]] = Array(Array(insert, overwrite, table), Array(dataset, intersect, tochar), Array(linux, alter))
    
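
  • A minimal sketch of that partition check (the variable names are illustrative; the actual partition count depends on how the file was read):

      // getNumPartitions is available on any RDD; map preserves the partition count
      val inputPartitions  = data.getNumPartitions
      val outputPartitions = data.map(line => line.split("\\s+")).getNumPartitions
      assert(inputPartitions == outputPartitions)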

flatMap()

  • The first step works like map, but the function returns a collection for each element; flatMap then flattens all of these per-element collections into a single output RDD.

      scala> val flatMap_rdd = data.flatMap(line => line.split("\\s+"))
      flatMap_rdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:26
      
      scala> flatMap_rdd.collect
      res1: Array[String] = Array(insert, overwrite, table, dataset, intersect, tochar, linux, alter)
    
  • Note: applied to a String, flatMap flattens the string into its individual characters (the String is treated as a collection of Char), as shown below:

      scala> data.map(_.toUpperCase).collect
      res3: Array[String] = Array("INSERT     OVERWRITE TABLE ", DATASET      INTERSECT       TOCHAR, LINUX   ALTER)
      
      scala> data.flatMap(_.toUpperCase).collect
      res4: Array[Char] = Array(I, N, S, E, R, T,     , O, V, E, R, W, R, I, T, E,  , T, A, B, L, E,  , D, A, T, A, S, E, T,  , I, N, T, E, R, S, E, C, T,    , T, O, C, H, A, R, L, I, N, U, X,  , A, L, T, E, R)
    
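
  • More generally, the function passed to flatMap can return any collection, and all of those collections are flattened into one RDD. A minimal sketch (the numbers RDD here is made up purely for illustration):

      // each input element produces a small collection; flatMap concatenates them all
      val numbers = sc.parallelize(Seq(1, 2, 3))
      numbers.flatMap(n => Seq(n, n * 10)).collect()
      // yields Array(1, 10, 2, 20, 3, 30)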

distinct()

  • Removes duplicate elements from the RDD.

      scala> val distinct_rdd = data.flatMap(_.toUpperCase).distinct
      distinct_rdd: org.apache.spark.rdd.RDD[Char] = MapPartitionsRDD[13] at distinct at <console>:26
      
      scala> distinct_rdd.collect
      res6: Array[Char] = Array(T, L, R, B, O, A, I,  , S, H, C, E,   , U, V, X, N, W, D)
    
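
  • distinct shuffles the data to remove duplicates, and it also has an overload that takes the number of partitions for the result. A minimal sketch reusing the data RDD from above (words and uniqueWords are illustrative names; 4 is an arbitrary partition count):

      // deduplicate the individual words and control the parallelism of the shuffle
      val words = data.flatMap(_.split("\\s+"))
      val uniqueWords = words.distinct(4)
      uniqueWords.collect()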

filter()

  • The function passed to filter is applied to each element of the RDD; only the elements for which it returns true are kept in the output RDD.

      scala> val data_1 = sc.parallelize(Array(1, 2, 3, 4, 23, 5, 123, 98))
      data_1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:24
      
      scala> val filter_rdd = data_1.filter(_ < 10)
      filter_rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[15] at filter at <console>:26
      
      scala> filter_rdd.collect
      res7: Array[Int] = Array(1, 2, 3, 4, 5)
    
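
  • The predicate can be any function returning a Boolean; for example, reusing the flatMap_rdd of words from the flatMap example above, one can keep only the words starting with a given letter (a minimal sketch):

      // keep only the words that start with "t"; with the sample data this keeps table and tochar
      flatMap_rdd.filter(word => word.startsWith("t")).collect()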