Notes on Basic Spark Operations
For detailed Spark usage, please refer to the official Spark documentation.
Scala version: 2.11.8
1. Add the Spark Maven dependency; to access HDFS, also add the HDFS client dependency
groupId = org.apache.spark
artifactId = spark-core_2.11
version = 2.3.2

groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
2. Creating a SparkContext
val conf = new SparkConf().setAppName("example").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.stop()
Remember to stop the SparkContext when you are finished with it.
3. Creating RDDs
1) The parallelize method
val words = sc.parallelize(Array("dong","jason","puma","large"),2)
2) Reading external data
val rdd = sc.textFile("path_to_file(local or hdfs)")
A very important concept here is the partition: Spark logically partitions the data, and each partition is processed by one task. When textFile reads from HDFS, the default number of partitions is the number of blocks of the file.
As a rule of thumb, 2-4 tasks per CPU core in the cluster works well.
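As a sketch of controlling the partition count directly (reusing the local aa.txt file from the word-count example below; the exact counts depend on file size and block layout):

```scala
// The second argument of textFile is a *minimum* partition count hint;
// Spark may create more partitions than requested.
val fileRdd = sc.textFile("aa.txt", 8)
// parallelize takes an exact number of slices.
val memRdd = sc.parallelize(1 to 100, 4)
println(fileRdd.getNumPartitions) // at least 8
println(memRdd.getNumPartitions)  // 4
```

An existing RDD can also be reshaped afterwards with repartition (full shuffle) or coalesce (narrow, for reducing partitions).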
4. SparkContext.wholeTextFiles
val rdd = sc.wholeTextFiles("./")
rdd.take(1).foreach(println)
-----------------------------------
(file:/C:/notos/code/sailertest/aa.csv,name,age
jason,29
dong,27)
The output is a tuple of (filePath, fileContent).
5. Reading Hadoop SequenceFiles
val seqRdd = sc.sequenceFile[String, Int]("seq")
seqRdd.take(2).foreach(println)
-----------------------------------
(jason,29)
(dong,27)
The K and V in sequenceFile[K,V] must be specified explicitly, and both must match the key and value types stored in the SequenceFile.
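For a round trip, a pair RDD of types with Writable converters (such as String and Int) can be written out with saveAsSequenceFile and read back with matching type parameters. A sketch, where "seq" is a hypothetical output directory:

```scala
// Write a (String, Int) pair RDD as a SequenceFile...
val pairs = sc.parallelize(Seq(("jason", 29), ("dong", 27)), 1)
pairs.saveAsSequenceFile("seq")

// ...then read it back; the type parameters must match what was written.
val back = sc.sequenceFile[String, Int]("seq")
back.collect().foreach(println)
```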
6. Passing functions to Spark
object Func {
  def concat(tp: (String, Int)): String = {
    tp._1 + " " + tp._2
  }
}

val seqRdd = sc.sequenceFile[String, Int]("seq").map(Func.concat)
The example above defines the method in a singleton object. Alternatively, the method can be defined in a class, as in the following example:
class MyClass {
  val field = " "
  def concat(rdd: RDD[(String, Int)]): RDD[String] = {
    val field_ = field
    rdd.map(tp => tp._1 + field_ + tp._2)
  }
}
In concat we deliberately avoid using MyClass's member field directly inside the closure: writing (tp => tp._1 + field + tp._2) is really (tp => tp._1 + this.field + tp._2), which captures a reference to the enclosing object,
so the entire class instance would be serialized and shipped to the executors. Copying the field into a local variable first avoids this.
7. Key-value RDD operations
val wordcount = sc.textFile("aa.txt")
  .flatMap(_.split("\\s+", -1))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y)
wordcount.collect().foreach(println)
(Liu,1)
(worth,3)
(4,1)
(after,1)
(profit,1)
8. Computing an average
val list = List(1, 2, 4, 5, 6)
val rdd = sc.parallelize(list)
val sum = rdd.reduce(_ + _)
val num = rdd.map(x => 1).reduce(_ + _)
val sn = rdd.aggregate((0, 0))(
  (u, v) => (u._1 + v, u._2 + 1),
  (u1, u2) => (u1._1 + u2._1, u1._2 + u2._2)
)
val res = sn._1.toDouble / sn._2
println(sum.toDouble / num)
println(res)
9. Computing the average score for each grade
val list = List(
  ("75", 90), ("75", 91), ("75", 92), ("75", 93), ("75", 94), ("75", 95),
  ("75", 96), ("75", 97), ("75", 98),
  ("76", 90), ("76", 91), ("76", 92), ("76", 93), ("76", 94), ("76", 95),
  ("76", 96), ("76", 97), ("76", 98)
)
val avgScores = sc.parallelize(list)
  .combineByKey(
    (score: Int) => (score, 1),
    (u: (Int, Int), v: Int) => (u._1 + v, u._2 + 1),
    (u: (Int, Int), u2: (Int, Int)) => (u._1 + u2._1, u._2 + u2._2)
  ).mapValues(x => x._1.toDouble / x._2)
avgScores.collect().foreach(println)
(75,94.0)
(76,94.0)
10. Broadcast variables
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value.foreach(println)
A broadcast variable is shipped to each machine once, rather than being sent with every task.
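A typical use is a map-side join against a small lookup table. A sketch, with hypothetical names and ages as the small table:

```scala
// Small lookup table (assumed data), broadcast once per executor
// instead of serialized into every task's closure.
val ages = Map("jason" -> 29, "dong" -> 27)
val agesBc = sc.broadcast(ages)

val names = sc.parallelize(Seq("jason", "dong", "puma"))
// Join by looking up the broadcast map; -1 marks a missing key.
val joined = names.map(n => (n, agesBc.value.getOrElse(n, -1)))
joined.collect().foreach(println)
```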
11. Accumulators
val rdd = sc.parallelize(List(1, 2, 3, 4))
val acc = sc.longAccumulator("myacc")
rdd.map(x => acc.add(x)).collect()
println(acc.value)
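Note that Spark only guarantees exactly-once application of accumulator updates inside actions; updates made inside transformations such as map may be re-applied if a stage is re-executed. For pure side effects, updating inside foreach is the safer pattern, as in this sketch:

```scala
val rdd = sc.parallelize(List(1, 2, 3, 4))
val acc = sc.longAccumulator("myacc")
// foreach is an action, so each element's update is applied exactly once.
rdd.foreach(x => acc.add(x))
println(acc.value) // 10
```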
Feel free to repost; attribution is not required.