Spark Cluster Installation and Configuration
I. Scala Installation
1. Download Scala from https://www.scala-lang.org/download/2.11.12.html, copy the package to /home/jun, and extract it there. (Note: the version actually used throughout this walkthrough is Scala 2.12.6.)
[jun@master ~]$ cd scala-2.12.6/
[jun@master scala-2.12.6]$ ls -l
total 0
drwxrwxr-x. 2 jun jun 162 Apr 27 16:49 bin
drwxrwxr-x. 4 jun jun  86 Apr 27 16:49 doc
drwxrwxr-x. 2 jun jun 244 Apr 27 16:49 lib
drwxrwxr-x. 3 jun jun  18 Apr 27 16:49 man
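A sketch of the download-and-extract step, assuming the tarball is fetched with wget into /home/jun (the exact mirror URL is an assumption; any Scala 2.12.6 tarball link works):

cd /home/jun
# Assumed mirror URL for the Scala 2.12.6 tarball
wget https://downloads.lightbend.com/scala/2.12.6/scala-2.12.6.tgz
# Unpacks to /home/jun/scala-2.12.6
tar -zxvf scala-2.12.6.tgz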
2. Start Scala and use the Scala shell
[jun@master scala-2.12.6]$ bin/scala
Welcome to Scala 2.12.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171).
Type in expressions for evaluation. Or try :help.

scala> println("hello,world")
hello,world

scala> 5*9
res1: Int = 45

scala> 2*res1
res2: Int = 90

scala> :help
All commands can be abbreviated, e.g., :he instead of :help.
:completions <string>    output completions for the given string
:edit <id>|<line>        edit history
:help [command]          print this summary or command-specific help
:history [num]           show the history (optional num is commands to show)
:h? <string>             search the history
:imports [name name ...] show import history, identifying sources of names
:implicits [-v]          show the implicits in scope
:javap <path|class>      disassemble a file or class name
:line <id>|<line>        place line(s) at the end of history
:load <path>             interpret lines in a file
:paste [-raw] [path]     enter paste mode or paste a file
:power                   enable power user mode
:quit                    exit the interpreter
:replay [options]        reset the repl and replay all previous commands
:require <path>          add a jar to the classpath
:reset [options]         reset the repl to its initial state, forgetting all session entries
:save <path>             save replayable session to a file
:sh <command line>       run a shell command (result is implicitly => List[String])
:settings <options>      update compiler options, if possible; see reset
:silent                  disable/enable automatic printing of results
:type [-v] <expr>        display the type of an expression without evaluating it
:kind [-v] <type>        display the kind of a type. see also :help kind
:warnings                show the suppressed warnings from the most recent line which had any

scala> :quit
3. Copy the Scala installation to the slave nodes.
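A minimal sketch of the copy, assuming passwordless SSH from master to the slaves and the same /home/jun home directory on every node:

# Copy the extracted Scala directory to both slave nodes
scp -r /home/jun/scala-2.12.6 jun@slave0:/home/jun/
scp -r /home/jun/scala-2.12.6 jun@slave1:/home/jun/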
II. Spark Cluster Installation and Configuration
Spark is installed in Hadoop YARN mode.
1. Download spark-2.3.1-bin-hadoop2.7.tgz from http://spark.apache.org/downloads.html, copy it to /home/jun, and extract it there.
[jun@master ~]$ cd spark-2.3.1-bin-hadoop2.7/
[jun@master spark-2.3.1-bin-hadoop2.7]$ ls -l
total 84
drwxrwxr-x. 2 jun jun  4096 Jun 2 04:49 bin
drwxrwxr-x. 2 jun jun   230 Jun 2 04:49 conf
drwxrwxr-x. 5 jun jun    50 Jun 2 04:49 data
drwxrwxr-x. 4 jun jun    29 Jun 2 04:49 examples
drwxrwxr-x. 2 jun jun 12288 Jun 2 04:49 jars
drwxrwxr-x. 3 jun jun    25 Jun 2 04:49 kubernetes
-rw-rw-r--. 1 jun jun 18045 Jun 2 04:49 LICENSE
drwxrwxr-x. 2 jun jun  4096 Jun 2 04:49 licenses
-rw-rw-r--. 1 jun jun 24913 Jun 2 04:49 NOTICE
drwxrwxr-x. 8 jun jun   240 Jun 2 04:49 python
drwxrwxr-x. 3 jun jun    17 Jun 2 04:49 R
-rw-rw-r--. 1 jun jun  3809 Jun 2 04:49 README.md
-rw-rw-r--. 1 jun jun   161 Jun 2 04:49 RELEASE
drwxrwxr-x. 2 jun jun  4096 Jun 2 04:49 sbin
drwxrwxr-x. 2 jun jun    42 Jun 2 04:49 yarn
2. Configure the Linux environment variables; add the following (a sketch of applying them follows the block below):
#spark
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
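A minimal sketch of applying these variables, assuming they are appended to the jun user's ~/.bash_profile on each node (the exact profile file is an assumption; use whichever file already exports HADOOP_HOME):

# Append the Spark-related variables to the shell profile
cat >> ~/.bash_profile <<'EOF'
#spark
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
EOF
# Reload the profile in the current shell and sanity-check one variable
source ~/.bash_profile
echo $HADOOP_CONF_DIR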
3. Configure spark-env.sh. Note that this must be done on all three machines.
Copy the default configuration template and open it with gedit:
[jun@master conf]$ cp ~/spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh.template ~/spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh
[jun@master conf]$ gedit ~/spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh
Add the following settings:
export SPARK_MASTER_IP=192.168.1.100
export JAVA_HOME=/usr/java/jdk1.8.0_171/
export SCALA_HOME=/home/jun/scala-2.12.6/
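Note: the Spark 2.x standalone scripts treat SPARK_MASTER_IP as deprecated in favor of SPARK_MASTER_HOST; the old name still works but logs a warning. A hedged equivalent of the block above:

export SPARK_MASTER_HOST=192.168.1.100   # preferred name in Spark 2.x (SPARK_MASTER_IP is still accepted)
export JAVA_HOME=/usr/java/jdk1.8.0_171/
export SCALA_HOME=/home/jun/scala-2.12.6/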
4. Edit Spark's slaves file
Copy the template and open the file with gedit:
[jun@master conf]$ cp ~/spark-2.3.1-bin-hadoop2.7/conf/slaves.template slaves
[jun@master conf]$ gedit ~/spark-2.3.1-bin-hadoop2.7/conf/slaves
Delete the default localhost entry and add the following:
# A Spark Worker will be started on each of the machines listed below.
slave0
slave1
5. Copy Spark to the slave nodes.
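A minimal sketch of the copy, again assuming passwordless SSH and identical paths on every node:

# Copy the whole Spark directory (including the conf/ changes above) to both slaves
scp -r /home/jun/spark-2.3.1-bin-hadoop2.7 jun@slave0:/home/jun/
scp -r /home/jun/spark-2.3.1-bin-hadoop2.7 jun@slave1:/home/jun/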
III. Starting and Verifying the Spark Cluster
1. Start the Spark cluster
First make sure the Hadoop cluster is running, then run Spark's own start-all.sh script:
[jun@master conf]$ /home/jun/spark-2.3.1-bin-hadoop2.7/sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/jun/spark-2.3.1-bin-hadoop2.7/logs/spark-jun-org.apache.spark.deploy.master.Master-1-master.out
slave0: starting org.apache.spark.deploy.worker.Worker, logging to /home/jun/spark-2.3.1-bin-hadoop2.7/logs/spark-jun-org.apache.spark.deploy.worker.Worker-1-slave0.out
slave1: starting org.apache.spark.deploy.worker.Worker, logging to /home/jun/spark-2.3.1-bin-hadoop2.7/logs/spark-jun-org.apache.spark.deploy.worker.Worker-1-slave1.out
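If the daemons ever need to be started individually, Spark's sbin directory also ships per-daemon scripts; a sketch using the same paths as above:

# Start only the Master on this machine
/home/jun/spark-2.3.1-bin-hadoop2.7/sbin/start-master.sh
# Start a Worker on every host listed in conf/slaves
/home/jun/spark-2.3.1-bin-hadoop2.7/sbin/start-slaves.sh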
2. Verify the startup status
(1) Check the processes with jps. The master node now shows an additional Master process, and each slave node shows an additional Worker process:
[jun@master conf]$ jps
8480 ResourceManager
8273 SecondaryNameNode
8805 Master
8038 NameNode
8889 Jps

[jun@slave0 ~]$ jps
7794 DataNode
8166 Worker
7961 NodeManager
8252 Jps

[jun@slave1 ~]$ jps
5539 DataNode
5917 Worker
5711 NodeManager
5999 Jps
(2) Check the cluster status through the web UI by opening http://master:8080 in a browser.
(3) Submit a job to the Spark cluster from the command line
So that the jar can be loaded directly from the working directory, first copy the example jar into the top-level Spark installation directory:
[jun@master conf]$ cp /home/jun/spark-2.3.1-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.1.jar /home/jun/spark-2.3.1-bin-hadoop2.7/
Run the SparkPi program:
[jun@master spark-2.3.1-bin-hadoop2.7]$ bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --num-executors 3 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  spark-examples_2.11-2.3.1.jar 10
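Note: --master yarn-cluster is a deprecated spelling in Spark 2.x; it still works, but spark-submit warns that --master yarn with an explicit deploy mode should be used instead. An equivalent, more current invocation would be:

bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 3 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  spark-examples_2.11-2.3.1.jar 10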
At this point an error is reported: the container needs about 2.3 GB of virtual memory, but only 2.1 GB is available. YARN's default virtual-to-physical memory ratio is 2.1, so with 1 GB of physical memory the virtual memory limit is 2.1 GB, which is less than the roughly 2.3 GB required. The fix is to raise the virtual-to-physical memory ratio with a setting in yarn-site.xml. The reported error was:
diagnostics: Application application_1532350446978_0001 failed 2 times due to AM Container for appattempt_1532350446978_0001_000002 exited with exitCode: -103
Failing this attempt. Diagnostics: Container [pid=6514,containerID=container_1532350446978_0001_02_000001] is running beyond virtual memory limits. Current usage: 289.7 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.
Stop YARN, add the following property to yarn-site.xml, and then restart YARN (a sketch of those commands follows the property block below):
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.5</value>
</property>
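A sketch of the restart, assuming the standard Hadoop sbin scripts and that yarn-site.xml lives under $HADOOP_HOME/etc/hadoop on every node (yarn.nodemanager.vmem-pmem-ratio is read by the NodeManagers, so the slaves need the updated file too):

# Stop YARN, push the updated yarn-site.xml to the slaves, then start YARN again
$HADOOP_HOME/sbin/stop-yarn.sh
scp $HADOOP_HOME/etc/hadoop/yarn-site.xml jun@slave0:$HADOOP_HOME/etc/hadoop/
scp $HADOOP_HOME/etc/hadoop/yarn-site.xml jun@slave1:$HADOOP_HOME/etc/hadoop/
$HADOOP_HOME/sbin/start-yarn.sh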
Run the SparkPi program again; this time the final status shows that it succeeded:
2018-07-23 21:26:53 INFO Client:54 -
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 192.168.1.101
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1532352353581
     final status: SUCCEEDED
     tracking URL: http://master:18088/proxy/application_1532352327714_0001/
     user: jun
2018-07-23 21:26:53 INFO ShutdownHookManager:54 - Shutdown hook called
2018-07-23 21:26:53 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-1ed5bee9-1aa7-4888-b3ec-80ff2b153192
2018-07-23 21:26:53 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-7349a4e3-9483-4d09-91ff-e1e48cb59b46
Right-click the tracking URL and choose "Open Link" to view the run status in a browser.
Click logs and then stdout to see the result: Pi is roughly 3.143951143951144.
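The result can also be pulled from the terminal after the application finishes, assuming YARN log aggregation is enabled (an assumption; otherwise the container logs stay on the NodeManager hosts). A sketch using the application id from the output above:

yarn logs -applicationId application_1532352327714_0001 | grep "Pi is roughly"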