Big Data in Practice Handbook, Installation Part: Installation and Configuration

  • 1.2 Installation and configuration
  • 1.2.1 Install JDK 1.8

root vim /etc/profile

export JAVA_HOME=/data/software/jdk1.8.0_191
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib

  • 1.2.2 Install Hadoop + Spark

root cd /data/app/hadoop

root wget hadoop-2.7.7

root wget spark-2.4.0-bin-hadoop2.7

root cd /data/app/hadoop/hadoop-2.7.7

root mkdir name tmp data

root chown -R hadoop.hadoop /data/app/hadoop

root vim /etc/profile

export HADOOP_HOME=/data/app/hadoop/hadoop-2.7.7
export SPARK_HOME=/data/app/hadoop/spark-2.4.0-bin-hadoop2.7
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export YARN_HOME=${HADOOP_HOME}
export YARN_CONF_DIR=${YARN_HOME}/etc/hadoop

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
# Required for PySpark development and debugging
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip

hadoop source /etc/profile
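
After sourcing /etc/profile it can help to confirm that the variables actually reached the current shell. The following is a minimal sketch, not part of the original setup: the helper names missing_vars and py4j_on_pythonpath are hypothetical, and the list of required variables is simply the set exported above.

```python
import os

# Variables exported in /etc/profile above (PATH and PYTHONPATH are handled separately).
REQUIRED_VARS = ["JAVA_HOME", "HADOOP_HOME", "SPARK_HOME", "HADOOP_CONF_DIR", "YARN_CONF_DIR"]


def missing_vars(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def py4j_on_pythonpath(env):
    """Return True if some PYTHONPATH entry is a py4j zip archive."""
    entries = env.get("PYTHONPATH", "").split(os.pathsep)
    return any(
        "py4j" in os.path.basename(entry) and entry.endswith(".zip")
        for entry in entries
    )


if __name__ == "__main__":
    print("missing variables:", missing_vars(os.environ))
    print("py4j on PYTHONPATH:", py4j_on_pythonpath(os.environ))
```

Run it as the hadoop user after source /etc/profile; an empty "missing variables" list means the exports took effect.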

  • 1.2.3 Configure Hadoop

core-site.xml configuration



fs.default.name
hdfs://hadoop01:9000


fs.defaultFS
hdfs://hadoop01:9000
[File system name and access entry point]

io.file.buffer.size
131072


hadoop.tmp.dir
file:/data/app/hadoop/hadoop-2.7.7/tmp
A base for other temporary directories.


hadoop.proxyuser.hduser.hosts



hadoop.proxyuser.hduser.groups


hdfs-site.xml configuration



dfs.namenode.secondary.http-address
hadoop02:9001
[The secondary NameNode must not run on the same machine as the primary NameNode]

dfs.namenode.name.dir
file:/data/app/hadoop/hadoop-2.7.7/name
[NameNode data; if it is deleted, the primary NameNode must be reformatted]

dfs.datanode.data.dir
file:/data/app/hadoop/hadoop-2.7.7/data
[DataNode data storage]

dfs.replication
2
[Number of replicas kept for each block; the default is 3; recommended to equal the number of nodes, which is 2 here]

dfs.webhdfs.enabled
true
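
The dfs.replication value only makes sense relative to the number of DataNodes: with more replicas requested than DataNodes available, blocks stay under-replicated. A small sketch of that check follows, assuming the DataNode hosts come from the slaves file used in this guide; parse_slaves and replication_warning are hypothetical helper names.

```python
def parse_slaves(text):
    """Return hostnames from slaves-file content, skipping blank lines and # comments."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts


def replication_warning(replication, datanodes):
    """Return a warning string if the replicas cannot all be placed, else None."""
    if replication > len(datanodes):
        return (
            f"dfs.replication={replication} exceeds {len(datanodes)} DataNode(s); "
            "blocks will be under-replicated"
        )
    return None


if __name__ == "__main__":
    nodes = parse_slaves("hadoop01\nhadoop02\n")
    print(replication_warning(2, nodes) or "replication fits the cluster")
```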

mapred-site.xml configuration [most production workloads have moved to Spark and avoid MapReduce; remember that Hive depends on MapReduce]



mapreduce.framework.name
yarn


mapreduce.jobhistory.address
hadoop01:10020


mapreduce.jobhistory.webapp.address
hadoop01:19888


yarn-site.xml configuration


yarn.nodemanager.aux-services
mapreduce_shuffle


yarn.nodemanager.aux-services.mapreduce_shuffle.class
org.apache.hadoop.mapred.ShuffleHandler


yarn.resourcemanager.address
hadoop01:8032


yarn.resourcemanager.scheduler.address
hadoop01:8030


yarn.resourcemanager.resource-tracker.address
hadoop01:8031


yarn.resourcemanager.admin.address
hadoop01:8033


yarn.resourcemanager.webapp.address
hadoop01:8088
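
Once YARN is running, a quick way to confirm that the ResourceManager is listening on the addresses configured above is to try a TCP connection to each port. This is only a sketch using the standard socket module; port_open and check_resourcemanager are hypothetical names, and the port map just repeats the values from the listing.

```python
import socket

# ResourceManager ports as configured in yarn-site.xml above.
RM_PORTS = {
    "address": 8032,
    "scheduler": 8030,
    "resource-tracker": 8031,
    "admin": 8033,
    "webapp": 8088,
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_resourcemanager(host):
    """Return {label: bool} for every configured ResourceManager port."""
    return {label: port_open(host, port) for label, port in RM_PORTS.items()}

# Example (run on a machine that can resolve hadoop01):
#   print(check_resourcemanager("hadoop01"))
```

A closed port here usually points at a typo in yarn-site.xml or a ResourceManager that failed to start, so the YARN logs are the next place to look.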

hadoop-env.sh configuration

export JAVA_HOME=/data/software/jdk1.8.0_191

yarn-env.sh configuration

export JAVA_HOME=/data/software/jdk1.8.0_191

slaves

hadoop01
hadoop02

  • 1.2.4 Spark configuration

spark-env.sh configuration

export JAVA_HOME=/data/software/jdk1.8.0_191
export SPARK_MASTER_IP=hadoop01
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1   # number of Worker instances started on this node
export SPARK_WORKER_MEMORY=512M
export SPARK_LOCAL_IP=192.168.0.102   # Spark initially starts in local mode, so this must be set to this machine's IP
export PYTHONH

slaves

hadoop01
hadoop02

  • 1.2.5 Common HDFS operations

Starting HDFS

hadoop cd /data/app/hadoop/hadoop-2.7.7

hadoop ./sbin/start-dfs.sh

HDFS UI: http://192.168.0.101:50070/explorer.html#/hadoop/upload

If startup fails with a message suggesting "execstack -c libhadoop.so", install execstack and clear the flag:

root rpm -Uvh http://www.rpmfind.net/linux/fedora/linux/releases/29/Everything/x86_64/os/Packages/e/execstack-0.5.0-15.fc29.x86_64.rpm

root rpm -ivh /data/software/execstack-0.5.0-15.fc29.x86_64.rpm

root cd /data/app/hadoop/hadoop-2.7.7/lib/native

root execstack -c libhadoop.so

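[1 Read the error message carefully
2 If the hadoop or spark archive is extracted on Windows and then uploaded to Linux, the headers of the .so files get modified, and execstack reports an invalid file error]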

Common HDFS commands

1 Format the NameNode
#hadoop ./bin/hdfs namenode -format [the NameNode must be formatted before the first start]

#hadoop hadoop fs -mkdir -p /hadoop/upload
#hadoop hadoop fs -ls /hadoop/upload
#hadoop hadoop fs -put /data/app/hadoop/hadoop-2.7.7/etc/hadoop/core-site.xml /hadoop/upload
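
When the same create/upload/list sequence is repeated often, it can be scripted. The sketch below only builds the argument lists for the three hadoop fs commands shown above and runs them with subprocess; upload_commands and run_all are hypothetical names, and it assumes the hadoop binary is on PATH for the user running it.

```python
import subprocess


def upload_commands(local_path, remote_dir):
    """Return argv lists that create remote_dir, upload local_path and list the result."""
    return [
        ["hadoop", "fs", "-mkdir", "-p", remote_dir],
        ["hadoop", "fs", "-put", local_path, remote_dir],
        ["hadoop", "fs", "-ls", remote_dir],
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)

# Example, mirroring the commands above:
#   run_all(upload_commands(
#       "/data/app/hadoop/hadoop-2.7.7/etc/hadoop/core-site.xml",
#       "/hadoop/upload",
#   ))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.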

2 Show which nodes store the blocks of every file
#hadoop hdfs fsck /hadoop/upload -files -locations -blocks
1) Verify that the HDFS cluster is running; if the cluster is unhealthy, files end up stored on a single node
![Storage node information for every file block](https://img-blog.csdnimg.cn/20190213131144276.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dvbGZqc29u,size_16,color_FFFFFF,t_70)

2) Verify that the configured replica count has taken effect
3) Check the file system status: The filesystem under path '/hadoop/upload' is HEALTHY
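
The status check in step 3 can be automated by looking for the summary line in the fsck output. A minimal sketch, assuming the output was captured as text (for example via subprocess); fsck_status is a hypothetical helper name.

```python
import re


def fsck_status(output):
    """Return (path, status) from the fsck summary line, or None if it is absent."""
    match = re.search(r"The filesystem under path '([^']*)' is (\w+)", output)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    summary = "The filesystem under path '/hadoop/upload' is HEALTHY"
    print(fsck_status(summary))
```

Any status other than HEALTHY is a cue to read the per-file block report printed earlier in the same output.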

3 Check file system capacity
#root hadoop fs -df -h

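4 Startup log files; watch their output closely
#hadoop jps [if only the master shows a DataNode, the cause is that the slaves file is not configured]
Master: NameNode and DataNode processes
Slave: SecondaryNameNode and DataNode processes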
hadoop-hadoop-namenode-hadoop01.log
hadoop-hadoop-namenode-hadoop01.out
hadoop-hadoop-secondarynamenode-hadoop02.log
hadoop-hadoop-secondarynamenode-hadoop02.out
hadoop-hadoop-datanode-hadoop02.log
hadoop-hadoop-datanode-hadoop02.out
hadoop-hadoop-datanode-hadoop01.log
hadoop-hadoop-datanode-hadoop01.out
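
The expected process list per node can be checked mechanically from jps output, which prints one "pid name" pair per line. The sketch below assumes that format; the role names, the helper names and the sample PIDs are illustrative only and not taken from a real run.

```python
# Daemons expected on each node type, as listed above.
EXPECTED = {
    "master": {"NameNode", "DataNode"},
    "slave": {"SecondaryNameNode", "DataNode"},
}


def running_daemons(jps_output):
    """Return the set of process names found in jps output."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            names.add(parts[1])
    return names


def missing_daemons(role, jps_output):
    """Return the expected daemons for role that do not appear in jps_output."""
    return EXPECTED[role] - running_daemons(jps_output)


if __name__ == "__main__":
    sample = "2101 NameNode\n2345 Jps\n"  # made-up PIDs for illustration
    print(missing_daemons("master", sample))
```

If a daemon is reported missing, the matching .log file from the list above is the first place to look.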

  • 1.2.6 Spark standalone mode

Start the cluster

hadoop cd /data/app/hadoop/spark-2.4.0-bin-hadoop2.7

hadoop ./sbin/start-all.sh

Spark UI: http://hadoop01:8080/ (the port is printed at startup)

hadoop jps

Master: Master and Worker
Slave: Worker [the number of Workers is the SPARK_WORKER_INSTANCES value configured in spark-env.sh]

Debugging PySpark on the machine

[Prerequisites:
1 /etc/profile contains export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip
2 spark-env.sh exports PYTHONPATH
Debugging this way runs in local mode, so the job does not appear in the Spark UI
]
#hadoop cd /data/app/hadoop/spark-2.4.0-bin-hadoop2.7/test
#hadoop python anlysis_hislog.py
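
For reference, a script debugged this way could look like the word-count sketch below. It is not the author's anlysis_hislog.py or wordcount.py, whose contents are not shown; the names tokenize, word_counts and main are hypothetical. PySpark is imported inside main so the pure-Python helper can be exercised without a Spark installation.

```python
import sys


def tokenize(line):
    """Split a line into whitespace-separated words."""
    return line.split()


def word_counts(sc, path, limit=10):
    """Return the `limit` most frequent (word, count) pairs in the file at path."""
    counts = (
        sc.textFile(path)
        .flatMap(tokenize)
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    return counts.takeOrdered(limit, key=lambda kv: -kv[1])


def main(path):
    # Imported here so that tokenize can be tested without PySpark installed.
    from pyspark import SparkContext

    # No master is set in code: started with plain `python`, Spark falls back
    # to local mode; with spark-submit, the --master option decides.
    sc = SparkContext(appName="wordcount-sketch")
    try:
        for word, count in word_counts(sc, path):
            print(word, count)
    finally:
        sc.stop()


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Leaving the master out of the code keeps one script usable both for local debugging and for cluster submission.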

Submitting to the cluster

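hadoop ./spark-submit --master spark://hadoop01:7077 --executor-memory 500m ../test/wordcount.py hdfs://hadoop01:9000/hadoop/upload/catalina.out

[If the Workers cannot obtain resources, compare the machine's memory with the configured executor-memory; the default is 1G]

  • 1.2.7 Spark on YARN

The Spark Master and Worker processes do not need to be started.
Start YARN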
#hadoop ./sbin/start-yarn.sh
Spark submit
./bin/spark-submit --master yarn --executor-memory 500m --num-executors 1 /data/app/hadoop/spark-2.4.0-bin-hadoop2.7/test/logdev/rdd_test.py

Q&A: not enough virtual memory? Adjust the following properties in yarn-site.xml:



yarn.scheduler.maximum-allocation-mb
9000
Maximum memory available to each task; the default is 8192 MB


yarn.scheduler.minimum-allocation-mb
3072
Minimum memory available to each task


yarn.nodemanager.vmem-pmem-ratio
3
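
To see how these three values interact with --executor-memory 500m, the arithmetic can be sketched as follows. Two assumptions are baked in and should be checked against your versions: Spark adds an executor memory overhead of max(384 MB, 10% of executor memory), and the YARN scheduler rounds each request up to a multiple of yarn.scheduler.minimum-allocation-mb. The function names are hypothetical.

```python
import math


def executor_container_mb(executor_memory_mb, min_allocation_mb,
                          overhead_factor=0.10, min_overhead_mb=384):
    """Estimate the YARN container size granted for one Spark executor."""
    # Assumed Spark default: overhead = max(384 MB, 10% of executor memory).
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    requested = executor_memory_mb + overhead
    # Assumed scheduler behaviour: round up to a multiple of the minimum allocation.
    return math.ceil(requested / min_allocation_mb) * min_allocation_mb


def vmem_limit_mb(container_mb, vmem_pmem_ratio):
    """Virtual memory limit enforced by the NodeManager for a container."""
    return container_mb * vmem_pmem_ratio


# Values from this section: 500 MB executor, 3072 MB minimum allocation, ratio 3.
container = executor_container_mb(500, 3072)   # 500 + 384 = 884, rounded up to 3072
limit = vmem_limit_mb(container, 3)            # 3072 * 3 = 9216

if __name__ == "__main__":
    print("container MB:", container)
    print("virtual memory limit MB:", limit)
```

Under these assumptions each executor container is granted 3072 MB, which fits below the 9000 MB maximum, and may use up to 9216 MB of virtual memory before the NodeManager kills it.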

posted @ 2019-02-13 14:41  架构师老狼