https://blog.csdn.net/weixin_37353303/article/details/84313473
Problems encountered when running jupyter-notebook in YARN mode, and how I solved them
Previously I had only run standalone programs in my pyspark virtual machines; now I wanted to try distributed computation.
Before starting I consulted books and blog posts, but I kept running into all kinds of problems and could not get it to work. Here is a record of the process:
There are two virtual machines in total: one acts as master, the other as slave1.
Installing Spark on the slave1 virtual machine
Hadoop was already installed on slave1, and Hadoop cluster jobs already ran successfully, so I won't repeat that here.
Copy the Spark installation directory from master to slave1, then:
(1) Go into the spark/conf folder, copy slaves.template to slaves, and add slave1 to it.
(2) Add the relevant paths to /etc/profile.
Both master and slave1 need steps (1) and (2); a sketch of the commands is shown below.
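Here is a minimal sketch of that sequence, assuming Spark lives under /hadoop/spark (the path that appears in the logs later in this post); the SPARK_HOME lines are my guess at what "add the path to /etc/profile" means, so adjust them to your setup.

scp -r /hadoop/spark root@slave1:/hadoop/     # copy the Spark install from master to slave1
cd /hadoop/spark/conf
cp slaves.template slaves                     # (1) create the slaves file
echo "slave1" >> slaves                       #     register slave1 as a worker
echo 'export SPARK_HOME=/hadoop/spark' >> /etc/profile                     # (2) assumed entries
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> /etc/profile
source /etc/profile                           # reload so the current shell picks them up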
Installing Anaconda on slave1
You can simply scp the Anaconda directory over from master and then edit /etc/profile accordingly; the screenshot above showed the exact changes.
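Since the screenshot isn't reproduced here, the /etc/profile entry below is my assumption, as is the /root/anaconda3 install path:

scp -r /root/anaconda3 root@slave1:/root/                      # copy Anaconda over from master
echo 'export PATH=/root/anaconda3/bin:$PATH' >> /etc/profile   # assumed PATH entry on slave1
source /etc/profile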
Starting up: this is where I ran into a lot of problems
Run start-all.sh in the master terminal and check with jps; the processes on both master and slave1 start normally.
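Assuming start-all.sh here is Hadoop's script, jps should show roughly the following daemons (the exact set depends on your configuration):

jps on master: NameNode, SecondaryNameNode, ResourceManager
jps on slave1: DataNode, NodeManager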
Then enter the following in the master terminal:
HADOOP_CONF_DIR=/hadoop/hadoop/etc/hadoop PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=yarn-client pyspark
According to what I had read, if HADOOP_CONF_DIR is not configured in spark-env.sh, it has to be passed on the command line as above. Jupyter Notebook did start successfully, but when I entered sc.master to see which mode it was running in, it threw a pile of errors:
[root@master home]# HADOOP_CONF_IR=/hadoop/hadoop/etc/hadoop PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
[I 18:58:24.475 NotebookApp] [nb_conda_kernels] enabled, 2 kernels found
[I 18:58:25.101 NotebookApp] ✓ nbpresent HTML export ENABLED
[W 18:58:25.101 NotebookApp] ✗ nbpresent PDF export DISABLED: No module named 'nbbrowserpdf'
[I 18:58:25.163 NotebookApp] [nb_anacondacloud] enabled
[I 18:58:25.167 NotebookApp] [nb_conda] enabled
[I 18:58:25.167 NotebookApp] Serving notebooks from local directory: /home
[I 18:58:25.167 NotebookApp] 0 active kernels
[I 18:58:25.168 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/
[I 18:58:25.168 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 18:58:33.844 NotebookApp] Kernel started: c15aabde-b441-45f2-b78d-9933e6534c27
Exception in thread "main" java.lang.Exception: When running with master 'yarn-client' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
        at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:263)
        at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:240)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /hadoop/spark/python/pyspark/shell.py:
[I 19:00:33.829 NotebookApp] Saving file at /Untitled2.ipynb
[I 19:00:57.754 NotebookApp] Creating new notebook in
[I 19:00:59.174 NotebookApp] Kernel started: ebfbdfd5-2343-4149-9fef-28877967d6c6
Exception in thread "main" java.lang.Exception: When running with master 'yarn-client' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
        at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:263)
        at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:240)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /hadoop/spark/python/pyspark/shell.py:
[I 19:01:12.315 NotebookApp] Saving file at /Untitled3.ipynb
^C[I 19:01:15.971 NotebookApp] interrupted
Serving notebooks from local directory: /home
2 active kernels
The Jupyter Notebook is running at: http://localhost:8888/
Shutdown this notebook server (y/[n])? y
[C 19:01:17.674 NotebookApp] Shutdown confirmed
[I 19:01:17.675 NotebookApp] Shutting down kernels
[I 19:01:18.189 NotebookApp] Kernel shutdown: ebfbdfd5-2343-4149-9fef-28877967d6c6
[I 19:01:18.190 NotebookApp] Kernel shutdown: c15aabde-b441-45f2-b78d-9933e6534c27
The log shows:
Exception in thread "main" java.lang.Exception: When running with master 'yarn-client' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
(Note that in the command above HADOOP_CONF_DIR was mistyped as HADOOP_CONF_IR, which by itself would trigger exactly this error.)
So I configured spark-env.sh.
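A sketch of the spark-env.sh entry, reusing the Hadoop config path from the command above (either variable satisfies the check named in the error message):

export HADOOP_CONF_DIR=/hadoop/hadoop/etc/hadoop
# or: export YARN_CONF_DIR=/hadoop/hadoop/etc/hadoop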
Then I ran it again:
I searched Baidu for each of the two errors separately.
Some posts said the cause was insufficient memory; others said the job needs two cores.
For the insufficient-memory issue, I added two settings to yarn-site.xml, namely the last two items in the picture below.
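The screenshot isn't reproduced here, so the two properties below are my assumption of the usual yarn-site.xml fix for this symptom (disabling YARN's physical/virtual memory checks), not necessarily the exact items in the picture:

<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>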
I also changed the virtual machine settings to give slave1 two processors, making it a two-core machine.
Even so, the same errors kept appearing.
I kept modifying things (I'm not sure exactly what I changed in between), ran it again,
and this time a different error appeared:
[root@master hadoop]# pyspark --master yarn
I kept searching based on the information the log gave.
When I ran
hadoop dfsadmin -report to check disk usage, I got:
Configured Capacity: 0 (0 B)
Present Capacity: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used: 0 (0 B)
DFS Used%: NaN%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Since every capacity figure was 0, no DataNode had actually registered with the NameNode, so I reformatted the namenode, as sketched below.
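A sketch of the reformatting sequence; note that formatting wipes HDFS metadata, and if a DataNode still refuses to register afterwards, its old data directory usually has to be cleared so the cluster IDs match:

stop-all.sh
hdfs namenode -format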
Since HDFS had come into the picture, I also modified hdfs-site.xml, changing the replication value from 1 to 2 (see the snippet below).
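That change corresponds to the dfs.replication property in hdfs-site.xml:

<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>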
Then I ran start-all.sh once more:
[root@master bin]# hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Configured Capacity: 18238930944 (16.99 GB)
Present Capacity: 6707884032 (6.25 GB)
DFS Remaining: 6707879936 (6.25 GB)
DFS Used: 4096 (4 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Live datanodes (1):
Name: 192.168.127.131:50010 (slave1)
Hostname: slave1
Decommission Status : Normal
Configured Capacity: 18238930944 (16.99 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 11531046912 (10.74 GB)
DFS Remaining: 6707879936 (6.25 GB)
DFS Used%: 0.00%
DFS Remaining%: 36.78%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Nov 20 21:26:11 CST 2018
I entered the following in the terminal:
pyspark --master yarn
To my pleasant surprise, the result came out this time.
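To double-check from inside the notebook that the session is really on YARN, you can inspect the sc object that pyspark creates; depending on the Spark version it reports something like:

sc.master
# 'yarn' (or 'yarn-client' on older versions)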