Spark-On-Yarn
How it works
Two deploy modes
client mode - know about it (the Driver runs in the client process that submits the job); mainly used for learning and debugging
cluster mode - used for real development/production (the Driver runs inside the ApplicationMaster on the YARN cluster)
Setup
1. A running YARN cluster
2. A history server
3. The client tool for submitting jobs - the spark-submit command
4. The compiled jar of the Spark job/program to submit - the bundled example programs can be used (see the command below)
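For instance, the SparkPi example that ships with Spark can be submitted like this (the jar name and path depend on your Spark distribution, so treat them as placeholders):
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster examples/jars/spark-examples_2.11-2.4.5.jar 100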
spark-shell vs. spark-submit
- Differences between the two commands
spark-shell: an interactive shell for Spark; once started you can type Spark code and run it immediately, generally used for learning and testing (see the snippet below)
spark-submit: used to submit the jar of a Spark job/program to a Spark cluster (usually a YARN cluster)
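As a quick illustration, a word count typed directly into spark-shell could look like the following sketch; the input path is just an assumption matching the example later in these notes, and sc is the SparkContext that spark-shell creates for you:
// inside spark-shell; `sc` is provided by the shell
val lines = sc.textFile("/spark/data/words/words.txt") // assumed HDFS path
val counts = lines.flatMap(_.split(",")).map(word => (word, 1)).reduceByKey(_ + _)
counts.take(10).foreach(println) // print the first 10 (word, count) pairs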
Developing a Spark program
Add the dependencies
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.5</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.11.12</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-compiler</artifactId>
        <version>2.11.12</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-reflect</artifactId>
        <version>2.11.12</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.49</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- Java Compiler -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <!-- Scala Compiler -->
        <plugin>
            <groupId>org.scala-tools</groupId>
            <artifactId>maven-scala-plugin</artifactId>
            <version>2.15.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Example
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object Demo02WordCount {
  def main(args: Array[String]): Unit = {
    /**
     * Steps to submit this program to YARN:
     * 1. Remove setMaster("local")
     * 2. Change the input/output paths (on the cluster the job reads from HDFS by default, so the paths must be HDFS paths)
     * 3. Create the directory in HDFS
     *    hdfs dfs -mkdir -p /spark/data/words/
     * 4. Upload the data to HDFS
     *    hdfs dfs -put words.txt /spark/data/words/
     * 5. Package the program into a jar
     * 6. Upload the jar to the server and submit the job with spark-submit
     *    spark-submit --class Demo02WordCount --master yarn --deploy-mode client spark-1.0.jar
     *    spark-submit --class Demo02WordCount --master yarn --deploy-mode cluster spark-1.0.jar
     */
    val conf: SparkConf = new SparkConf()
    conf.setAppName("Demo02WordCount")
    // conf.setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)

    // 1. Read the input file from HDFS
    val fileRDD: RDD[String] = sc.textFile("/spark/data/words/words.txt")

    // 2. Split each line into individual words
    //    flatMap is called an "operator" in Spark;
    //    an operator generally returns a new RDD
    val flatRDD: RDD[String] = fileRDD.flatMap(_.split(","))

    // 3. Group by word
    val groupRDD: RDD[(String, Iterable[String])] = flatRDD.groupBy(word => word)

    // 4. Count the occurrences of each word
    val words: RDD[String] = groupRDD.map(kv => {
      val key = kv._1
      val size = kv._2.size
      key + "," + size
    })

    // Use the HDFS Java API to check whether the output path already exists; if it does, delete it
    val hdfsConf: Configuration = new Configuration()
    hdfsConf.set("fs.defaultFS", "hdfs://master:9000")
    val fs: FileSystem = FileSystem.get(hdfsConf)
    if (fs.exists(new Path("/spark/data/words/wordCount"))) {
      fs.delete(new Path("/spark/data/words/wordCount"), true)
    }

    // 5. Save the result
    words.saveAsTextFile("/spark/data/words/wordCount")

    sc.stop()
  }
}
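As a side note (not part of the original example), the groupBy + map combination above shuffles the full list of occurrences for every word; reduceByKey is the more common way to count because it pre-aggregates per partition before the shuffle. A minimal sketch, reusing fileRDD and the imports from the example above:
// Alternative counting step: combines (word, 1) pairs on each partition before shuffling
val countRDD: RDD[String] = fileRDD
  .flatMap(_.split(","))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .map { case (word, count) => word + "," + count }
// countRDD.saveAsTextFile(...) as in the example above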