Basic Spark operations

For the full details of these operations, please refer to the official Spark documentation.

Scala version: 2.11.8

1. Add the Spark Maven dependency; if you need to access HDFS, also add the HDFS (hadoop-client) dependency

groupId = org.apache.spark
artifactId = spark-core_2.11
version = 2.3.2

groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>

2. Creating a SparkContext

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    sc.stop()

Remember to stop the SparkContext when you are done with it.
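A minimal sketch of one common pattern (not from the original post): wrap the job body in try/finally so the context is always stopped, even when the job throws.

    val sc = new SparkContext(new SparkConf().setAppName("example").setMaster("local[*]"))
    try {
      // the actual job goes here
      println(sc.parallelize(1 to 10).sum())
    } finally {
      sc.stop()   // release the resources held by the context
    }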

3. Creating RDDs

1) The parallelize method

val words = sc.parallelize(Array("dong","jason","puma","large"),2)

2) Reading external data

val rdd = sc.textFile("path_to_file(local or hdfs)")

A very important concept here is partitions. Spark logically partitions the data, and each partition is processed by one task. When textFile reads from HDFS, the default number of partitions equals the number of HDFS blocks of the file.

As a rule of thumb, 2-4 tasks per CPU core in the cluster is a reasonable target.
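As a minimal sketch (paths and numbers are illustrative only), both parallelize and textFile accept a partition hint, and getNumPartitions shows how many partitions an RDD actually has:

    val words = sc.parallelize(Array("dong", "jason", "puma", "large"), 2)
    println(words.getNumPartitions)            // 2

    // for textFile the second argument is a minimum number of partitions
    val lines = sc.textFile("path_to_file(local or hdfs)", 8)
    println(lines.getNumPartitions)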

4. SparkContext.wholeTextFiles


    val rdd = sc.wholeTextFiles("./")
    rdd.take(1).foreach(println)

-----------------------------------
(file:/C:/notos/code/sailertest/aa.csv,name,age
jason,29
dong,27)


Each element of the result is a tuple: (filePath, fileContent).
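Since each element is a (path, content) pair, you can process the two parts separately; a small sketch (not in the original post) that counts the lines of every file:

    val filesRdd = sc.wholeTextFiles("./")
    val linesPerFile = filesRdd.mapValues(content => content.split("\n").length)
    linesPerFile.collect().foreach(println)    // (filePath, lineCount)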

5. Reading a Hadoop SequenceFile

val seqRdd = sc.sequenceFile[String,Int]("seq")
seqRdd.take(2).foreach(println)


(jason,29)
(dong,27)

The K and V type parameters of sequenceFile[K,V] must be given explicitly, and both must match the key and value types actually stored in the SequenceFile.
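For reference, a minimal sketch (assuming the output path "seq" does not already exist; not part of the original post) of how such a SequenceFile can be produced from a pair RDD:

    val pairs = sc.parallelize(Seq(("jason", 29), ("dong", 27)), 1)
    pairs.saveAsSequenceFile("seq")            // String/Int are converted to Writable types

    sc.sequenceFile[String, Int]("seq").collect().foreach(println)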

6. Passing functions to Spark

object Func{
  def concat(tp:(String,Int)):String={
    tp._1 + " " + tp._2
  }
}

val seqRdd = sc.sequenceFile[String,Int]("seq").map(Func.concat)

The example above defines the method in a singleton object. Alternatively, you can define the method in a class, as in the following example.

import org.apache.spark.rdd.RDD

class MyClass{
  val field = " "
  def concat(rdd:RDD[(String,Int)]) :RDD[String] ={
    val field_  = field                  // copy the member into a local variable
    rdd.map(tp=> tp._1 + field_ + tp._2) // only the local value is captured by the closure
  }
}

In the concat method I do not reference the MyClass member field directly: using field inside the closure (tp => tp._1 + field + tp._2 is equivalent to tp => tp._1 + this.field + tp._2) would capture a reference to the whole object,

so the entire class instance would have to be serialized and shipped with the task; copying the field into a local variable first avoids this.

7. Key-value RDD operations

    val wordcount = sc.textFile("aa.txt")
      .flatMap(_.split("\\s+",-1))
      .map(word=>(word,1))
      .reduceByKey((x,y)=> x+y)
    wordcount.collect()
      .foreach(println)
(Liu,1)
(worth,3)
(4,1)
(after,1)
(profit,1)
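reduceByKey is only one of the key-value operations; a short sketch of a few other common ones (the data here is illustrative, not from the original post):

    val ages   = sc.parallelize(Seq(("jason", 29), ("dong", 27), ("puma", 31)))
    val cities = sc.parallelize(Seq(("jason", "beijing"), ("dong", "shanghai")))

    ages.sortByKey().collect()                     // sort by key
    ages.groupByKey().mapValues(_.sum).collect()   // group values per key, then aggregate
    ages.join(cities).collect()                    // (key, (age, city)) for keys in both RDDs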

8. Computing an average

    val list = List(1, 2, 4, 5, 6)
    val rdd = sc.parallelize(list)
    // first way: divide the total sum by the element count
    val sum = rdd.reduce(_ + _)
    val num = rdd.map(x => 1).reduce(_ + _)
    // second way: aggregate a (sum, count) pair in a single pass
    val sn = rdd.aggregate((0, 0))(
      (u, v) => (u._1 + v, u._2 + 1),                // fold an element into a partition's (sum, count)
      (u1, u2) => (u1._1 + u2._1, u1._2 + u2._2)     // merge the (sum, count) pairs of two partitions
    )
    val res = sn._1.toDouble / sn._2
    println(sum.toDouble / num)
    println(res)
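As a side note (not in the original post), for a purely numeric RDD Spark also provides mean() directly, which returns the same value:

    println(rdd.mean())   // 3.6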

9. Computing the average score for each grade

    val list = List(
      ("75", 90),
      ("75", 91),
      ("75", 92),
      ("75", 93),
      ("75", 94),
      ("75", 95),
      ("75", 96),
      ("75", 97),
      ("75", 98),
      ("76", 90),
      ("76", 91),
      ("76", 92),
      ("76", 93),
      ("76", 94),
      ("76", 95),
      ("76", 96),
      ("76", 97),
      ("76", 98)
    )
    val avgScores = sc.parallelize(list)
      .combineByKey(
        (score: Int) => (score, 1),                                      // createCombiner: first score of a key -> (sum, count)
        (u: (Int, Int), v: Int) => (u._1 + v, u._2 + 1),                 // mergeValue: fold another score into (sum, count)
        (u: (Int, Int), u2: (Int, Int)) => (u._1 + u2._1, u._2 + u2._2)  // mergeCombiners: merge partial results across partitions
      ).mapValues(x => x._1.toDouble / x._2)
    avgScores.collect().foreach(println)
(75,94.0)
(76,94.0)

10. Broadcast variables

    val broadcastVar = sc.broadcast(Array(1,2,3))
    broadcastVar.value.foreach(println)

A broadcast variable is shipped to each machine once, rather than once per task.
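A minimal sketch of the typical usage (the lookup table and names are illustrative, not from the original post): broadcast a read-only map once, then read it inside a transformation through .value:

    val countryOf = sc.broadcast(Map("jason" -> "cn", "dong" -> "cn", "puma" -> "us"))

    val users  = sc.parallelize(Seq("jason", "dong", "puma"))
    val tagged = users.map(name => (name, countryOf.value.getOrElse(name, "unknown")))
    tagged.collect().foreach(println)   // (jason,cn) (dong,cn) (puma,us)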

11. Accumulators

    val rdd = sc.parallelize(List(1, 2, 3, 4))
    val acc = sc.longAccumulator("myacc")
    rdd.map(x => acc.add(x)).collect()   // collect is the action that triggers the accumulator updates
    println(acc.value)                   // 10, read back on the driver
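Here the accumulator is updated inside a transformation, so the value is only meaningful after the action (collect) has run. A slightly more direct variant (a sketch) updates it inside foreach, which is an action; for updates performed inside actions, Spark guarantees that each task's update is applied only once, even if a task is re-executed:

    val acc2 = sc.longAccumulator("myacc2")
    sc.parallelize(List(1, 2, 3, 4)).foreach(x => acc2.add(x))
    println(acc2.value)   // 10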
