Spark study notes
Rote-memorization notes from learning Spark
1. Set up Spark on Linux; I only built a standalone (single-node) installation.
How to enter the spark-shell:
bin/spark-shell
sc is a pre-built SparkContext
spark is a pre-built SparkSession
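A quick sanity check inside the shell, using the pre-built sc (just an illustrative snippet):
val nums = sc.parallelize(1 to 100)
nums.filter(_ % 2 == 0).count()   // returns 50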
2. Set up the project in IDEA so it can also run on Windows.
Add the dependency and the Scala compiler plugin to the pom:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

<plugin>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>3.2.2</version>
    <executions>
        <execution>
            <goals>
                <goal>compile</goal>
                <goal>testCompile</goal>
            </goals>
        </execution>
    </executions>
</plugin>
3. Word count example code
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

// On Windows, point hadoop.home.dir at a local winutils/hadoop-common download
System.setProperty("hadoop.home.dir", "C:\\hadoop\\hadoop-common-2.2.0-bin-64bit\\hadoop-common-2.2.0-bin-master")

val conf = new SparkConf().setAppName("WC").setMaster("local[*]")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().appName("spark sql example").config(conf).getOrCreate()
import spark.implicits._

// text is the input path, out is the output directory
sc.textFile(text)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _, 1)
  .sortBy(_._2, false)
  .saveAsTextFile(out)
sc.stop()
4. An RDD's actual computation lives in its compute method.
transformations build a new RDD lazily (nothing runs yet);
actions trigger the real computation and return a result. See the sketch below.
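A minimal sketch of the lazy/eager split, assuming the sc created in item 3:
val lines = sc.parallelize(Seq("a b", "b c"))    // creating the RDD computes nothing yet
val words = lines.flatMap(_.split(" "))          // transformation: still lazy, just returns a new RDD
val total = words.count()                        // action: the job actually runs here, total == 4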
5. RDDs are read-only; to "modify" an RDD you can only transform it into a new RDD.
6. Ways to create an RDD: directly from a local collection, by reading from HDFS (or another storage system), or by transforming another RDD.
sc.makeRDD(Array(1, 2, 3, 4))
sc.parallelize(Array(1, 2, 3, 4))
sc.textFile("hdfs://1111")
7. RDD operators split into value types and key-value types.
map operates on one element at a time;
mapPartitions processes a whole partition at a time (one iterator per partition);
mapPartitionsWithIndex additionally receives the partition index. A sketch follows.
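A minimal sketch of the three operators, assuming sc from above:
val rdd = sc.parallelize(1 to 6, 2)
rdd.map(_ * 10).collect()                                    // applied element by element
rdd.mapPartitions(iter => Iterator(iter.sum)).collect()      // one sum per partition
rdd.mapPartitionsWithIndex((idx, iter) => iter.map(v => (idx, v))).collect()   // tag each value with its partition index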
9. flatMap() glom()
10. groupBy() filter()
11. sample() distinct()
12. sortBy()
13. Two-RDD (double-value) operators, illustrated in the sketch below:
union() subtract() intersection() zip()
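A minimal sketch of the two-RDD operators, assuming sc from above:
val a = sc.parallelize(Seq(1, 2, 3), 2)
val b = sc.parallelize(Seq(3, 4, 5), 2)
a.union(b).collect()          // 1, 2, 3, 3, 4, 5
a.subtract(b).collect()       // 1, 2 (order not guaranteed)
a.intersection(b).collect()   // 3
a.zip(b).collect()            // (1,3), (2,4), (3,5) -- both RDDs must have the same partition and element counts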
14. Key-value operators (see the sketch after this list):
partitionBy repartitions the RDD according to a given Partitioner
reduceByKey groupByKey aggregateByKey foldByKey combineByKey
sortByKey mapValues join cogroup
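A minimal sketch of reduceByKey and aggregateByKey, assuming sc from above:
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect()   // ("a", 4), ("b", 2)
// aggregateByKey(zeroValue)(seqOp, combOp): take the max per key inside each partition,
// then add the per-partition maxima together across partitions
pairs.aggregateByKey(0)(math.max(_, _), _ + _).collect()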
15. Action operators (a short sketch follows after this list):
reduce collect take first count takeOrdered
fold saveAsTextFile saveAsSequenceFile saveAsObjectFile
countByKey foreach
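A minimal sketch of a few actions, assuming sc from above:
val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))
nums.reduce(_ + _)         // 15
nums.take(2)               // the first two elements
nums.takeOrdered(2)        // the two smallest: 1, 2
nums.count()               // 5
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).countByKey()   // Map(a -> 2, b -> 1)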
16. Partitioning
Only key-value RDDs have a partitioner; plain value RDDs do not.
Partition IDs run from 0 to numPartitions - 1.
Hash partitioning (by the key's hashCode), range partitioning (aims to keep partitions evenly sized and ordered), and custom partitioning (sketched below).
During a shuffle every record goes through the partitioner, which decides which partition it lands in.
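A minimal custom Partitioner sketch (the class name and routing rule are made up for illustration):
import org.apache.spark.Partitioner

// send keys that start with "a" to partition 0, everything else to partition 1
class FirstLetterPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key.toString.startsWith("a")) 0 else 1
}

val pairs = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("avocado", 3)))
pairs.partitionBy(new FirstLetterPartitioner).glom().collect()   // inspect what each partition holds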
17. Accumulators
sc.accumulator(0) defines an accumulator with an initial value of 0 (sc.longAccumulator is the newer equivalent in Spark 2.x).
Update accumulators inside actions such as foreach, not inside RDD transformation functions; transformations can be re-executed, which would count the same records more than once. A sketch follows.
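A minimal accumulator sketch, assuming sc from above (the log lines are made up):
val errorCount = sc.longAccumulator("errors")
val logs = sc.parallelize(Seq("ok", "error: disk", "ok", "error: net"))
logs.foreach(line => if (line.startsWith("error")) errorCount.add(1))
errorCount.value   // 2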
18. Broadcast variables
Broadcast a read-only value to the executors once, so every task can reuse it instead of shipping a copy with each task. A sketch follows.
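A minimal broadcast sketch, assuming sc from above (the lookup table is made up):
val lookup = sc.broadcast(Map("cn" -> "China", "us" -> "USA"))
val codes = sc.parallelize(Seq("cn", "us", "cn"))
codes.map(c => lookup.value.getOrElse(c, "unknown")).collect()   // China, USA, China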
19. The two abstractions in Spark SQL: DataFrame and Dataset
// spark.read loads external data into a DataFrame
val df = spark.read.json("/opt/liutonghang/people.json")
df.show
// Temp views are scoped to the current SparkSession
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("select * from people")
sqlDF.show
// Global temp views live in the global_temp database and are visible across sessions
df.createGlobalTempView("people1")
spark.sql("select * from global_temp.people1").show()
spark.newSession().sql("select * from global_temp.people1").show()
df.select("name").show()
df.select($"name",$"age"+1).show()
df.filter($"age">21).show()
df.groupBy("age").count().show()
20. Converting an RDD to a DataFrame (the RDD holds lines like "name,age")
Method 1: map to tuples, then call toDF with column names
rdd.map(x => {
  val parm = x.split(",")
  (parm(0), parm(1).trim.toInt)
}).toDF("name", "age")
Method 2: map to a case class, then call toDF
case class People(name: String, age: Int)
rdd.map(x => {
  val para = x.split(",")
  People(para(0), para(1).trim.toInt)
}).toDF
21. Converting a DataFrame to an RDD
val rdd = df.rdd
22. The difference between DataFrame and Dataset
DataFrame = Dataset[Row]
A DataFrame only knows the column names; each row is an untyped Row, so column types are not checked at compile time.
A Dataset knows both the name and the type of every field in each row, so access is type-safe.
23. Conversions (a round-trip sketch follows)
rdd.toDF        RDD to DataFrame
rdd.toDS        RDD to Dataset
df.rdd          DataFrame to RDD
ds.rdd          Dataset to RDD
df.as[Person]   DataFrame to Dataset
ds.toDF         Dataset to DataFrame
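A minimal round-trip sketch, assuming spark, sc, and import spark.implicits._ from item 3 (Person and the sample rows are made up):
case class Person(name: String, age: Int)
val rdd = sc.parallelize(Seq(("zhangsan", 20), ("lisi", 30)))
val df  = rdd.toDF("name", "age")                        // RDD -> DataFrame
val ds  = df.as[Person]                                  // DataFrame -> Dataset[Person]
val ds2 = rdd.map { case (n, a) => Person(n, a) }.toDS   // RDD -> Dataset
ds.map(_.age + 1).show()                                 // typed access, checked at compile time
val backToRdd = ds.rdd                                   // Dataset -> RDD
val backToDf  = ds.toDF                                  // Dataset -> DataFrame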
24. User-defined functions
spark.udf.register("functionName", functionBody)
spark.sql("select functionName(name), age from people").show()
A concrete sketch follows.
Submitting the built-in SparkPi example with spark-submit:
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100
sc.textFile("../input").flatMap(_.split(" ")).map(_,1).reduceByKey(_+_).collect()
sc.textFile("input").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
Submitting a packaged word-count job:
bin/spark-submit \
--class com.WordCount \
WordCount.jar \
../input \
/out