Spark RDD Operators

Spark course notes

 

The Spark ecosystem:

Spark Core: RDD (Resilient Distributed Dataset)

Spark SQL

Spark Streaming

Spark MLlib: collaborative filtering, ALS, logistic regression, etc.   --> machine learning

Spark GraphX: graph computing

 

The focus is on the first three components (Core, SQL, Streaming).

 

-----------------Spark Core------------------------

I. What is Spark? What are its features?

https://spark.apache.org/

Apache Spark™ is a unified analytics engine for large-scale data processing.

 

Features: fast, easy to use, general-purpose, and compatible (fully compatible with Hadoop)

Fast: claimed to be up to 100x faster than Hadoop MapReduce for in-memory workloads (a comparison made before Hadoop 3)

Easy to use: supports development in multiple languages (Scala, Java, Python, R)

General-purpose: a complete ecosystem (Core, SQL, Streaming, MLlib, GraphX)

Compatible: runs on and reads from Hadoop (HDFS, YARN)

Spark replaces Hadoop MapReduce as the compute engine; it does not replace HDFS or YARN for storage and resource management.

 

II. Installing and deploying Spark; Spark HA

 

1. Spark architecture

How Spark runs:

YARN

Standalone: used here for local debugging (demos)

Worker: the slave node; the manager of resources and tasks on each server. A Worker manages only its own node.

Execution:

A Worker runs multiple Executors. The Executor is what actually executes tasks; tasks are divided into stages. ----> RDD

 

Client side: the Driver Program submits the job to the cluster.

1. spark-submit

2. spark-shell

 

2. Setting up Spark

(1) Prerequisites: JDK, hostname configuration, passwordless SSH login

(2) Pseudo-distributed mode

Simulate a distributed environment on a single VM (Master and Worker on the same node)

 

export JAVA_HOME=/usr/java/jdk1.8.0_201

export SPARK_MASTER_HOST=node3

export SPARK_MASTER_PORT=7077

 

(3) Fully distributed mode

Edit the slaves file, copy the configuration to the other two servers, and start the cluster

 

3. Spark HA

HA review:

(*) HDFS, YARN, HBase, and Spark all use a master/slave architecture

(*) so the master is a single point of failure

(1) Single-point recovery based on a file directory

(*) Essence: there is still only one Master. A recovery directory is created to save the cluster state and job information.

When the Master dies and is restarted, it reads the state information from the recovery directory and restores the previous state.

Use case: development and testing only; use ZooKeeper in production

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM 

-Dspark.deploy.recoveryDirectory=/usr/local/spark-2.1.0-bin-hadoop2.7/recovery"

 

(2) ZooKeeper-based HA: similar to Hadoop

(*) ZooKeeper review:

It works like a small database: coordination data, such as cluster metadata, is stored in ZooKeeper.

It provides data synchronization, leader election, and distributed locks.

Data synchronization: data written to one node is replicated to the other nodes

Election: ZooKeeper nodes play different roles, Leader and Follower; if the Leader dies, a new Leader is elected

Distributed lock: e.g. flash-sale scenarios; data is stored as a tree of directory-like znodes

 

Edit spark-env.sh

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 

-Dspark.deploy.zookeeper.url=node3:2181,node4:2181,node5:2181 

-Dspark.deploy.zookeeper.dir=/spark"

 

Synchronize the configuration to the other two servers.

Run start-all.sh on node3: node3 is Master, node4 and node5 are Workers

Run start-master.sh on node4: node3 is Master (Active), node4 is Master (Standby), node4 and node5 are Workers

Kill the Master process on node3:

node4 becomes Master (Active); node4 and node5 remain Workers

The web UI at http://192.168.109.134:8080/ shows the corresponding status.

 

III. Running Spark jobs: two tools

1. spark-submit: submits Spark jobs

A job is packaged as a jar.

Example: estimating Pi with the Monte Carlo method (a sketch of the idea follows after the submit command below).

 

./spark-submit --master spark://node3:7077 --class

 

--class specifies the fully qualified name of the main class

 

/usr/local/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --master spark://node3:7077 

--class org.apache.spark.examples.SparkPi 

/usr/local/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.1.0.jar 100
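
A minimal sketch of the Monte Carlo idea behind the SparkPi example above (the standard approach, not a copy of the bundled example code): scatter random points over the square [-1, 1] x [-1, 1], count how many land inside the unit circle, and estimate Pi as 4 * inside / total. Runnable as-is in the spark-shell.

import scala.math.random

val slices = 100                       // plays the same role as the trailing "100" argument above
val n = 100000 * slices                // total number of random points
val inside = sc.parallelize(1 to n, slices).map { _ =>
  val x = random * 2 - 1               // random point in the square [-1, 1] x [-1, 1]
  val y = random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0     // 1 if the point falls inside the unit circle
}.reduce(_ + _)

println(s"Pi is roughly ${4.0 * inside / n}")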

 

2. spark-shell: essentially a REPL

It runs as a standalone Application.

Two modes:

(1) Local mode

Running spark-shell with no arguments starts local mode

 

Spark context available as 'sc' (master = local[*], app id = local-1554038459298).

 

sc is the name of the SparkContext object. local[*] means local mode; the job is not submitted to the cluster.

 

(2) Cluster mode

./spark-shell --master spark://node3:7077   runs the shell against the cluster

 

Spark context available as 'sc' (master = spark://node3:7077, app id = app-20190331212447-0000).

 

master = spark://node3:7077

 

Spark session available as 'spark'

SparkSession was introduced in Spark 2.0; through the SparkSession you can access all of Spark's components. A small sketch of using it follows.
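
A small sketch of using the SparkSession from the spark-shell (the file path is the one used elsewhere in these notes; the calls are standard Spark 2.x API):

val ds = spark.read.textFile("/usr/local/tmp_files/test_WordCount.txt")   // Dataset[String] via the SQL engine
ds.show(3, false)                                                         // print the first 3 lines without truncation

spark.sparkContext eq sc   // true: 'sc' is the SparkContext owned by this SparkSession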

 

Example: the WordCount program

(*) Process a local file and print the result to the screen

scala> sc.textFile("/usr/local/tmp_files/test_WordCount.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect

 

res0: Array[(String, Int)] = Array((is,1), (love,2), (capital,1), (Beijing,2), (China,2), (I,2), (of,1), (the,1))

 

(*) Process a file on HDFS and save the result back to HDFS

sc.textFile("hdfs://node1:8020/tmp_files/test_WordCount.txt")

.flatMap(_.split(" "))

.map((_,1))

.reduceByKey(_+_)

.saveAsTextFile("hdfs://node1:8020/output/0331/test_WordCount")

 

-rw-r--r--   3 root supergroup          0 2019-03-31 21:43 /output/0331/test_WordCount/_SUCCESS

-rw-r--r--   3 root supergroup         40 2019-03-31 21:43 /output/0331/test_WordCount/part-00000

-rw-r--r--   3 root supergroup         31 2019-03-31 21:43 /output/0331/test_WordCount/part-00001

 

_SUCCESS indicates that the job completed successfully

part-00000 and part-00001 are the result files, one per partition; their contents do not overlap.

(*) Running WordCount step by step   ---->   RDD

 

scala> val rdd1 = sc.textFile("/usr/local/tmp_files/test_WordCount.txt")

rdd1: org.apache.spark.rdd.RDD[String] = /usr/local/tmp_files/test_WordCount.txt MapPartitionsRDD[12] at textFile at <console>:24

 

scala> 1+1

res2: Int = 2

 

scala> rdd1.collect

res3: Array[String] = Array(I love Beijing, I love China, Beijing is the capital of China)

 

scala> val rdd2 = rdd1.flatMap(_.split(" "))

rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at flatMap at <console>:26

 

scala> rdd2.collect

res4: Array[String] = Array(I, love, Beijing, I, love, China, Beijing, is, the, capital, of, China)

 

scala> val rdd3 = rdd2.map((_,1))

rdd3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[14] at map at <console>:28

 

scala> rdd3.collect

res5: Array[(String, Int)] = Array((I,1), (love,1), (Beijing,1), (I,1), (love,1), (China,1), (Beijing,1), (is,1), (the,1), (capital,1), (of,1), (China,1))

 

scala> val rdd4 = rdd3.reduceByKey(_+_)

rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[15] at reduceByKey at <console>:30

 

scala> rdd4.collect

res6: Array[(String, Int)] = Array((is,1), (love,2), (capital,1), (Beijing,2), (China,2), (I,2), (of,1), (the,1))

 

RDD: Resilient Distributed Dataset

(1) Dependencies: wide dependencies and narrow dependencies

(2) Operators (functions):

Transformation: lazily evaluated, e.g. map, flatMap, textFile

Action: triggers computation immediately, e.g. collect
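
A small sketch of the lazy/eager split described above (same file as the earlier example): the transformations only record the lineage, and nothing runs until the action is called.

val lines  = sc.textFile("/usr/local/tmp_files/test_WordCount.txt")   // transformation: nothing runs yet
val words  = lines.flatMap(_.split(" "))                              // transformation: still nothing runs
val counts = words.map((_, 1)).reduceByKey(_ + _)                     // transformation: only the lineage is built

counts.collect                                                        // action: the whole chain executes now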

 

 

 

 

Note: Scala review

(*) flatten: flattens a nested collection into a single list

scala> List(List(2,4,6,8,10),List(1,3,5,7,9)).flatten

res21: List[Int] = List(2, 4, 6, 8, 10, 1, 3, 5, 7, 9)

 

 

(*) flatMap: equivalent to a map followed by flatten

 

scala> var myList = List(List(2,4,6,8,10),List(1,3,5,7,9))

myList: List[List[Int]] = List(List(2, 4, 6, 8, 10), List(1, 3, 5, 7, 9))

 

scala> myList.flatMap(x=>x.map(_*2))

res22: List[Int] = List(4, 8, 12, 16, 20, 2, 6, 10, 14, 18)

 

myList.flatMap(x=>x.map(_*2))

 

How the expression executes:

1. map(_*2) is applied to each inner List (List(2, 4, 6, 8, 10) and List(1, 3, 5, 7, 9)); x stands for one inner List

2. the results are then flattened

3. Developing the Scala and Java versions of WordCount in an IDE

 

(1) Scala version of WordCount

Create a new project and add the Spark jars to it.

Export the jar (click through the wizard; there is no need to set a main class).

Upload the jar to the server.

 

spark-submit --master spark://node3:7077 --class day1025.MyWordCount /usr/local/tmp_files/Demo1.jar hdfs://node2:8020/tmp_files/test_WordCount.txt hdfs://node2:8020/output/1025/demo1

 

(2) Java version of WordCount

 

./spark-submit --master spark://node3:7077 --class day0330.JavaWordCount /usr/local/tmp_files/Demo2.jar

 

IV. The Spark job flow

1. How the WordCount program is processed

See the diagram (not included in these notes).

2. How Spark schedules tasks

When a job is submitted to the cluster, Spark performs task scheduling.

See the diagram (not included in these notes).

 

V. RDDs, RDD properties, and RDD operators

1. RDD: Resilient Distributed Dataset

(*) The most basic data abstraction in Spark.

(*) RDD properties (from the comment in the RDD source code):

* Internally, each RDD is characterized by five main properties:

*

*  - A list of partitions

 

1. A list of partitions.

An RDD is made up of partitions; each partition is processed on a different Worker, which is how the computation is distributed.

 

*  - A function for computing each split

The RDD provides operators (a compute function) to process the data in each partition.

 

*  - A list of dependencies on other RDDs

 

An RDD depends on other RDDs; the dependencies are either wide or narrow.

 

*  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

 

For key-value RDDs you can supply a custom partitioning rule (Partitioner); a sketch follows after this list of properties.

 

*  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for

*    an HDFS file)

 

Spark prefers to run each task on a node close to the data it reads (data locality).
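
A hedged sketch of the optional Partitioner property: a hypothetical partitioner (not from the notes) that routes keys to partitions by their first character, applied with partitionBy.

import org.apache.spark.Partitioner

class FirstLetterPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int =
    math.abs(key.toString.headOption.getOrElse(' ').toInt) % parts   // bucket by the first character
}

val pairs = sc.parallelize(List(("Tom", 1), ("Andy", 2), ("Lily", 3), ("Mike", 4)))
val partitioned = pairs.partitionBy(new FirstLetterPartitioner(2))
partitioned.partitioner      // Some(FirstLetterPartitioner)
partitioned.glom.collect     // inspect which keys ended up in which partition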

 

 

How do you create an RDD?

(1) With SparkContext.parallelize

scala> val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8),3)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at <console>:29

 

scala> rdd1.partitions.length

res35: Int = 3

 

scala> val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8),2)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[33] at parallelize at <console>:29

 

scala> rdd1.partitions.length

res36: Int = 2

 

(2) From an external data source

sc.textFile()

 

scala> val rdd2 = sc.textFile("/usr/local/tmp_files/test_WordCount.txt")

rdd2: org.apache.spark.rdd.RDD[String] = /usr/local/tmp_files/test_WordCount.txt MapPartitionsRDD[35] at textFile at <console>:29

 

 

2. Operators

(1) Transformations

map(func): applies a function to every element (like a for loop) and returns a new RDD

filter(func): keeps only the elements that satisfy a predicate

flatMap(func): flat + map, i.e. maps each element and then flattens the results

 

 

mapPartitions(func): operates on each partition of the RDD as a whole

mapPartitionsWithIndex(func): like mapPartitions, but also receives the partition index (see the sketch after this list)

sample(withReplacement, fraction, seed): sampling

 

Set operations:

union(otherDataset)

intersection(otherDataset)

distinct([numTasks]): deduplication

 

Aggregation (group by):

groupByKey([numTasks])

reduceByKey(func, [numTasks])

aggregateByKey(zeroValue)(seqOp,combOp,[numTasks])

 

Sorting:

sortByKey([ascending], [numTasks])

sortBy(func,[ascending], [numTasks])

 

join(otherDataset, [numTasks])

cogroup(otherDataset, [numTasks])

cartesian(otherDataset)

pipe(command, [envVars])

coalesce(numPartitions)

 

Repartitioning:

repartition(numPartitions)

repartitionAndSortWithinPartitions(partitioner)
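
mapPartitionsWithIndex and sample are listed above but not demonstrated later, so here is a minimal sketch (the data and numbers are illustrative):

val rdd = sc.parallelize(1 to 10, 3)

// Tag every element with the index of the partition it lives in.
rdd.mapPartitionsWithIndex((idx, it) => it.map(x => s"partition $idx -> $x")).collect

// Sample roughly 30% of the elements without replacement; fixing the seed makes the result repeatable.
rdd.sample(false, 0.3, 42).collect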

 

Examples:

1. Create an RDD, multiply every element by 2, then sort

scala> val rdd1 = sc.parallelize(Array(3,4,5,100,79,81,6,8))

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[36] at parallelize at <console>:29

 

scala> val rdd2 = rdd1.map(_*2)

rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[37] at map at <console>:31

 

scala> rdd2.collect

res37: Array[Int] = Array(6, 8, 10, 200, 158, 162, 12, 16)    

 

scala> rdd2.sortBy(x=>x,true).collect

res39: Array[Int] = Array(6, 8, 10, 12, 16, 158, 162, 200)                      

 

scala> rdd2.sortBy(x=>x,false).collect

res40: Array[Int] = Array(200, 162, 158, 16, 12, 10, 8, 6) 

 

def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true)

 

Filter out the elements greater than 20:

 

scala> val rdd3 = rdd2.filter(_>20)

rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[53] at filter at <console>:33

 

scala> rdd3.collect

res41: Array[Int] = Array(200, 158, 162)   

 

2. An RDD of strings

 

scala> val rdd4 = sc.parallelize(Array("a b c","d e f","g h i"))

rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[54] at parallelize at <console>:29

 

scala> rdd4.flatMap(_.split(" ")).collect

res42: Array[String] = Array(a, b, c, d, e, f, g, h, i)    

 

3. RDD set operations:

 

scala> val rdd6 = sc.parallelize(List(1,2,3,6,7,8,9,100))

rdd6: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[56] at parallelize at <console>:29

 

scala> val rdd7 = sc.parallelize(List(1,2,3,4))

rdd7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[57] at parallelize at <console>:29

 

scala> val rdd8 = rdd6.union(rdd7)

rdd8: org.apache.spark.rdd.RDD[Int] = UnionRDD[58] at union at <console>:33

 

scala> rdd8.collect

res43: Array[Int] = Array(1, 2, 3, 6, 7, 8, 9, 100, 1, 2, 3, 4)

 

scala> rdd8.distinct.collect

res44: Array[Int] = Array(100, 4, 8, 1, 9, 6, 2, 3, 7)  

 

 

4. Grouping: reduceByKey

<key, value> pairs

scala> val rdd1 = sc.parallelize(List(("Tom",1000),("Andy",2000),("Lily",1500)))

rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[62] at parallelize at <console>:29

 

scala> val rdd2 = sc.parallelize(List(("Andy",1000),("Tom",2000),("Mike",500)))

rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[63] at parallelize at <console>:29

 

scala> val rdd3 = rdd1 union rdd2

rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[64] at union at <console>:33

 

scala> rdd3.collect

res45: Array[(String, Int)] = Array((Tom,1000), (Andy,2000), (Lily,1500), (Andy,1000), (Tom,2000), (Mike,500))

 

scala> val rdd4= rdd3.groupByKey

rdd4: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[65] at groupByKey at <console>:35

 

scala> rdd4.collect

res46: Array[(String, Iterable[Int])] = Array((Tom,CompactBuffer(1000, 2000)), (Andy,CompactBuffer(2000, 1000)), (Mike,CompactBuffer(500)), (Lily,CompactBuffer(1500)))

 

scala> rdd3.reduceByKey(_+_).collect

res47: Array[(String, Int)] = Array((Tom,3000), (Andy,3000), (Mike,500), (Lily,1500))

 

reduceByKey will provide much better performance.

 

The documentation discourages groupByKey for aggregations and recommends reduceByKey, which combines values on the map side before the shuffle.
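
aggregateByKey, listed above, is another alternative to groupByKey; like reduceByKey it combines on the map side, and it can keep an accumulator of a different type than the values. A hedged sketch with illustrative data that computes a per-key average:

val sales = sc.parallelize(List(("Tom", 1000), ("Andy", 2000), ("Tom", 2000), ("Andy", 1000)))

// zeroValue (0, 0) is the (sum, count) accumulator; the first function folds one value into the
// accumulator inside a partition, the second merges accumulators coming from different partitions.
val sumCount = sales.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (a, b)   => (a._1 + b._1, a._2 + b._2)
)

sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }.collect   // e.g. Array((Tom,1500.0), (Andy,1500.0))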

 

5. cogroup

 

scala> val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))

rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[67] at parallelize at <console>:29

 

scala> val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))

rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[68] at parallelize at <console>:29

 

scala> val rdd3 = rdd1.cogroup(rdd2)

rdd3: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[70] at cogroup at <console>:33

 

scala> rdd3.collect

res48: Array[(String, (Iterable[Int], Iterable[Int]))] = Array((tom,(CompactBuffer(1, 2),CompactBuffer(1))), (jerry,(CompactBuffer(3),CompactBuffer(2))), (shuke,(CompactBuffer(),CompactBuffer(2))), (kitty,(CompactBuffer(2),CompactBuffer())))

 

6. The reduce operation (an Action)

Aggregation:

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5))

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at parallelize at <console>:29

 

scala> rdd1.reduce(_+_)

res49: Int = 15

 

 

7. Requirement: sort by value.

Approach:

1. Swap key and value, then call sortByKey

2. Swap back

 

scala> val rdd1 = sc.parallelize(List(("tom",1),("jerry",3),("ketty",2),("shuke",2)))

rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[72] at parallelize at <console>:29

 

scala> val rdd2 = sc.parallelize(List(("jerry",1),("tom",3),("shuke",5),("ketty",1)))

rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[73] at parallelize at <console>:29

 

scala> val rdd3 = rdd1.union(rdd2)

rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[74] at union at <console>:33

 

scala> val rdd4 = rdd3.reduceByKey(_+_)

rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[75] at reduceByKey at <console>:35

 

scala> rdd4.collect

res50: Array[(String, Int)] = Array((tom,4), (jerry,4), (shuke,7), (ketty,3))   

 

scala> val rdd5 = rdd4.map(t=>(t._2,t._1)).sortByKey(false).map(t=>(t._2,t._1))

rdd5: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[80] at map at <console>:37

 

scala> rdd5.collect

res51: Array[(String, Int)] = Array((shuke,7), (tom,4), (jerry,4), (ketty,3))  

 

 

(2) Actions

 

reduce(func)

 

collect()

count()

first()

take(n)

takeSample(withReplacement,num, [seed])

takeOrdered(n, [ordering])

saveAsTextFile(path)

saveAsSequenceFile(path) 

saveAsObjectFile(path) 

countByKey()

 

foreach(func): like map, but returns nothing; it is an action run for its side effects. (A short sketch of a few of these actions follows below.)
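
A short sketch of a few of the actions above against a small illustrative RDD:

val nums = sc.parallelize(List(5, 3, 8, 1, 9, 2))

nums.count          // 6
nums.first          // 5, the first element of the first partition
nums.take(3)        // Array(5, 3, 8): the first three elements, no sorting involved
nums.takeOrdered(3) // Array(1, 2, 3): the three smallest elements

val pairs = sc.parallelize(List(("a", 1), ("b", 1), ("a", 1)))
pairs.countByKey    // Map(a -> 2, b -> 1)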

 

3. Properties:

(1) The RDD caching mechanism

(*) Purpose: improve performance

(*) Usage: mark an RDD as cacheable with persist or cache

(*) Available storage levels:

  val NONE = new StorageLevel(false, false, false, false)

  val DISK_ONLY = new StorageLevel(true, false, false, false)

  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)

  val MEMORY_ONLY = new StorageLevel(false, true, false, true)

  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)

  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)

  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)

  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)

  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)

  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)

  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)

  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

  

  /**

   * Persist this RDD with the default storage level (`MEMORY_ONLY`).

   */

  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

 

  /**

   * Persist this RDD with the default storage level (`MEMORY_ONLY`).

   */

  def cache(): this.type = persist()

  

 

Example: test data, about 920,000 lines

 

scala> val rdd1 = sc.textFile("hdfs://192.168.109.131:8020/tmp_files/test_Cache.txt")

rdd1: org.apache.spark.rdd.RDD[String] = hdfs://192.168.109.131:8020/tmp_files/test_Cache.txt MapPartitionsRDD[82] at textFile at <console>:29

 

scala> rdd1.count  --> triggers the computation directly

res52: Long = 923452                                                            

 

scala> rdd1.cache  --> marks the RDD as cacheable; it does not trigger computation

res53: rdd1.type = hdfs://192.168.109.131:8020/tmp_files/test_Cache.txt MapPartitionsRDD[82] at textFile at <console>:29

 

scala> rdd1.count   --> triggers the computation again, like the first count, but this time the RDD's data is cached

res54: Long = 923452                                                            

 

scala> rdd1.count   --> computed from the cached data, so it returns much faster

res55: Long = 923452
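
A sketch of choosing an explicit storage level instead of the MEMORY_ONLY default, and of releasing the cache once it is no longer needed (same file as above):

import org.apache.spark.storage.StorageLevel

val rdd2 = sc.textFile("hdfs://192.168.109.131:8020/tmp_files/test_Cache.txt")
rdd2.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk if they do not fit in memory
rdd2.count                                   // the first action materializes and caches the data
rdd2.count                                   // served from the cache
rdd2.unpersist()                             // drop the cached blocks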

 

(2) The RDD fault-tolerance mechanism

 

 
