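Building a Hadoop cluster and fixing common problems
Start the Hadoop cluster with two machines: one master and one worker node. Why? Money. Machines cost money, and data volume grows gradually. When a single node can no longer keep up, add more nodes to the cluster.
Before you start installing and configuring, read this article through a few times and get the steps straight in your head, especially the pitfalls I fell into.
I. Server overview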
- 10.0.0.237 bigserver1 //master
- 10.0.0.236 bigserver2 //datanode
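II. Set the hostname and configure hosts
1. Set the hostname
Note: the hostname command used here only lasts until the next reboot. On CentOS 7, hostnamectl set-hostname bigserver1 makes the change permanent.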
- [root@localhost ~]# hostname
- localhost.localdomain
- [root@localhost ~]# hostname bigserver1
- [root@localhost ~]# hostname
- bigserver1
2. Add the following to /etc/hosts, identically on every node
- 10.0.0.236 bigserver2
- 10.0.0.237 bigserver1
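Since every node needs the same two entries, a quick check can save some head-scratching later. The following is only a sketch: the function name hosts_has_entry is my own invention, and it assumes the usual "IP name [alias...]" layout of a hosts file.

```shell
# hosts_has_entry FILE IP NAME
# Exit status 0 if FILE contains a line whose first field is IP and
# whose remaining fields include NAME. (Hypothetical helper.)
hosts_has_entry() {
    local file=$1 ip=$2 name=$3
    awk -v ip="$ip" -v name="$name" '
        $1 == ip { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit found ? 0 : 1 }
    ' "$file"
}
```

For example, running `hosts_has_entry /etc/hosts 10.0.0.237 bigserver1 && echo ok` on each node shows whether the master's entry is present there.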
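III. Disable the firewall and SELinux
- # systemctl stop firewalld //stop
- # systemctl disable firewalld //do not start at boot
- # cat /etc/sysconfig/selinux
- SELINUX=disabled //disabled
Reboot the machine after making these changes. Once Hadoop is installed and configured, you can turn the firewall back on and open the required ports.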
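For turning the firewall back on later, here is a rough sketch that only prints the firewalld commands instead of running them. The port list is my assumption: it is copied from what the master listens on in the netstat check later in this article, and the datanode would need its own list. The function name is made up.

```shell
# Ports the master listens on, taken from the netstat output in this article.
master_ports="9000 50070 50090 8030 8031 8032 8033 8088"

# print_firewall_cmds: echo one firewall-cmd line per port, then a reload.
# Nothing is executed; review the output before running it as root.
print_firewall_cmds() {
    local port
    for port in $master_ports; do
        echo "firewall-cmd --permanent --add-port=${port}/tcp"
    done
    echo "firewall-cmd --reload"
}
```

Printing first makes it easy to review the list, and piping the output to `sh` would apply it.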
IV. Passwordless SSH login
- # ssh-keygen -t rsa
- # ssh-copy-id -i ~/.ssh/id_rsa.pub root@10.0.0.236 -p 22
- # ssh-copy-id -i ~/.ssh/id_rsa.pub root@10.0.0.237 -p 22
- # scp ~/.ssh/id_rsa root@10.0.0.236:/root/.ssh/
- # scp ~/.ssh/id_rsa root@10.0.0.237:/root/.ssh/
- After logging in to 236 and 237:
- # cd ~/.ssh/
- # chmod 600 id_rsa
I generated the key pair on my local machine and copied it to both 236 and 237.
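ssh refuses a private key whose permissions are too open, which is why the chmod 600 step matters. A small sketch to confirm the mode on each machine follows. The helper name key_mode_ok is hypothetical, and it assumes GNU stat as shipped on CentOS.

```shell
# key_mode_ok FILE
# Exit status 0 when FILE's permission bits are exactly 600.
key_mode_ok() {
    local mode
    mode=$(stat -c '%a' "$1") || return 2   # GNU stat syntax
    [ "$mode" = "600" ]
}
```

After that, `ssh -o BatchMode=yes root@bigserver2 true` from the master is one way to confirm the login works without a password prompt, since BatchMode makes ssh fail instead of asking.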
V. Install Java 1.8
- # yum install java-1.8.0-openjdk java-1.8.0-openjdk-devel
VI. Download Hadoop
https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
Download whichever version fits your needs.
- # tar zxvf hadoop-2.7.7.tar.gz
- # mkdir /bigdata
- # mv hadoop-2.7.7 /bigdata/hadoop
- # mkdir -pv /bigdata/hadoop/{tmp,var,dfs}
- # mkdir -pv /bigdata/hadoop/dfs/{name,data}
VII. Configure Hadoop
1. Back up
- # cd /bigdata/hadoop/etc
- # cp -r hadoop hadoop_bak
This step matters: a good habit here saves a lot of work later.
2. Configure core-site.xml
- <property>
- <name>hadoop.tmp.dir</name>
- <value>/bigdata/hadoop/tmp</value>
- </property>
- <property>
- <name>fs.default.name</name>
- <value>hdfs://bigserver1:9000</value>
- </property>
Add these inside <configuration></configuration>.
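To make the placement concrete, here is a sketch that writes a complete core-site.xml with the two properties inside the configuration element. The function name write_core_site and the idea of generating the file from a script are mine. The property names and values are the ones used above.

```shell
# write_core_site DIR
# Write DIR/core-site.xml with both properties wrapped in <configuration>.
write_core_site() {
    local dir=$1
    cat > "$dir/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/bigdata/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://bigserver1:9000</value>
  </property>
</configuration>
EOF
}
```

As far as I know, Hadoop 2.x treats fs.default.name as a deprecated alias of fs.defaultFS. Both work, so I kept the name used in this article.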
3. Edit hadoop-env.sh
- # whereis javac
- javac: /usr/bin/javac /usr/share/man/man1/javac.1.gz
- # ll /usr/bin/javac
- lrwxrwxrwx. 1 root root 23 Dec 27 00:08 /usr/bin/javac -> /etc/alternatives/javac
- # ll /etc/alternatives/javac
- lrwxrwxrwx. 1 root root 70 Dec 27 00:08 /etc/alternatives/javac -> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64/bin/javac
- //the steps above locate the Java installation directory
- # echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64" >> ~/.bashrc
- # source ~/.bashrc
- # vim hadoop-env.sh
- Replace export JAVA_HOME=${JAVA_HOME} with
- export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.191.b12-1.el7_6.x86_64
If Java was not installed through the package manager, use the directory where you unpacked it instead.
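The chain of ll commands above can be collapsed into one step. This is a sketch under the assumption that GNU readlink is available (it is on CentOS). The function name java_home_from is made up.

```shell
# java_home_from JAVAC_PATH
# Follow every symlink to the real javac, then strip the trailing /bin/javac.
java_home_from() {
    local real
    real=$(readlink -f "$1") || return 1
    # Two dirname calls drop "javac" and then "bin".
    dirname "$(dirname "$real")"
}
```

On the machine above, `java_home_from /usr/bin/javac` should print the same /usr/lib/jvm/... directory that was found by hand.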
4. Configure hdfs-site.xml
- <property>
- <name>dfs.name.dir</name>
- <value>/bigdata/hadoop/dfs/name</value>
- </property>
- <property>
- <name>dfs.data.dir</name>
- <value>/bigdata/hadoop/dfs/data</value>
- </property>
- <property>
- <name>dfs.replication</name>
- <value>2</value>
- </property>
- <property>
- <name>dfs.permissions</name>
- <value>false</value>
- </property>
5. Configure mapred-site.xml
- # cp mapred-site.xml.template mapred-site.xml
- # vim mapred-site.xml
- <property>
- <name>mapred.job.tracker</name>
- <value>bigserver1:49001</value>
- </property>
- <property>
- <name>mapred.local.dir</name>
- <value>/bigdata/hadoop/var</value>
- </property>
- <property>
- <name>mapreduce.framework.name</name>
- <value>yarn</value>
- </property>
6. Edit slaves
- # cat slaves
- bigserver2
7. Configure yarn-site.xml
- <property>
- <name>yarn.resourcemanager.hostname</name>
- <value>bigserver1</value>
- </property>
- <property>
- <name>yarn.resourcemanager.address</name>
- <value>${yarn.resourcemanager.hostname}:8032</value>
- </property>
- <property>
- <name>yarn.resourcemanager.scheduler.address</name>
- <value>${yarn.resourcemanager.hostname}:8030</value>
- </property>
- <property>
- <name>yarn.resourcemanager.webapp.address</name>
- <value>${yarn.resourcemanager.hostname}:8088</value>
- </property>
- <property>
- <name>yarn.resourcemanager.webapp.https.address</name>
- <value>${yarn.resourcemanager.hostname}:8090</value>
- </property>
- <property>
- <name>yarn.resourcemanager.resource-tracker.address</name>
- <value>${yarn.resourcemanager.hostname}:8031</value>
- </property>
- <property>
- <name>yarn.resourcemanager.admin.address</name>
- <value>${yarn.resourcemanager.hostname}:8033</value>
- </property>
- <property>
- <name>yarn.nodemanager.aux-services</name>
- <value>mapreduce_shuffle</value>
- </property>
- <property>
- <name>yarn.nodemanager.vmem-check-enabled</name>
- <value>false</value>
- </property>
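Do not casually set CPU and memory limits, or MapReduce will be affected. For example:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.vmem-pmem-ratio, and so on
Below is the error caused by my incomplete configuration: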
2018-12-27 09:11:54,178 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1545833322243_0001_m_000007_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2018-12-27 09:11:54,178 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1545833322243_0001_m_000008_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2018-12-27 09:11:54,178 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1545833322243_0001_r_000000_0 TaskAttempt Transitioned from NEW to UNASSIGNED
2018-12-27 09:11:54,179 INFO [Thread-52] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: mapResourceRequest:<memory:1024, vCores:1>
2018-12-27 09:11:54,185 INFO [Thread-52] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: reduceResourceRequest:<memory:1024, vCores:1>
2018-12-27 09:11:54,196 INFO [eventHandlingThread] org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Event Writer setup for JobId: job_1545833322243_0001, File: hdfs://bigserver1:9000/tmp/hadoop-yarn/staging/root/.staging/job_1545833322243_0001/job_1545833322243_0001_1.jhist
2018-12-27 09:11:55,138 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:1 ScheduledMaps:9 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2018-12-27 09:11:55,194 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1545833322243_0001: ask=3 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:0, vCores:0> knownNMs=1
2018-12-27 09:11:55,195 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1 due to lack of space for maps
2018-12-27 09:11:56,198 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1 due to lack of space for maps
2018-12-27 09:11:57,202 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1 due to lack of space for maps
At this point Hadoop is configured. The master and the datanode use identical configuration.
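Because both nodes are supposed to carry the same configuration, one way to confirm it is to compute a single checksum over the config directory on each node and compare the two values. This is only a sketch: conf_fingerprint is a hypothetical helper, and it assumes file names without spaces, which holds for the stock Hadoop config files.

```shell
# conf_fingerprint DIR
# Print one md5 checksum covering the contents of every regular file
# directly inside DIR, concatenated in a fixed (sorted) order.
conf_fingerprint() {
    ( cd "$1" && find . -maxdepth 1 -type f | LC_ALL=C sort | xargs cat | md5sum | cut -d' ' -f1 )
}
```

Running `conf_fingerprint /bigdata/hadoop/etc/hadoop` on bigserver1 and on bigserver2 should print the same value if the two directories match.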
VIII. Format and start Hadoop
1. Format only on the master node; the datanode does not need it
- # cd /bigdata/hadoop/bin/
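- # ./hadoop namenode -format
After a successful format, a current folder appears under /bigdata/hadoop/dfs/name. Once you have formatted, do not format again. Do not format again. Do not format again. Important things get said three times. Reformatting leaves the master and the datanode out of step with each other. More on that below.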
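To avoid formatting twice by accident, a small guard can check for the VERSION file that a format leaves behind. This is a sketch, not part of Hadoop: both function names are made up, and the guard only prints the format command rather than running it.

```shell
# already_formatted NAME_DIR
# Exit status 0 if NAME_DIR/current/VERSION exists, i.e. a format was done.
already_formatted() {
    [ -f "$1/current/VERSION" ]
}

# guarded_format NAME_DIR
# Refuse when NAME_DIR is already formatted; otherwise print the command to run.
guarded_format() {
    if already_formatted "$1"; then
        echo "refusing: $1 is already formatted" >&2
        return 1
    fi
    echo "./hadoop namenode -format"
}
```

With the layout used here, the call would be `guarded_format /bigdata/hadoop/dfs/name`.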
2. Start Hadoop from the master only
- # cd /bigdata/hadoop/sbin/
- # ./start-all.sh
- This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
- Starting namenodes on [bigserver1]
- bigserver1: starting namenode, logging to /home/bigdata/hadoop/logs/hadoop-root-namenode-bigserver1.out
- bigserver2: starting datanode, logging to /home/bigdata/hadoop/logs/hadoop-root-datanode-bigserver2.out
- Starting secondary namenodes [0.0.0.0]
- 0.0.0.0: starting secondarynamenode, logging to /home/bigdata/hadoop/logs/hadoop-root-secondarynamenode-bigserver1.out
- starting yarn daemons
- starting resourcemanager, logging to /home/bigdata/hadoop/logs/yarn-root-resourcemanager-bigserver1.out
- bigserver2: starting nodemanager, logging to /home/bigdata/hadoop/logs/yarn-root-nodemanager-bigserver2.out
3. Check that every node in the cluster started correctly
- //master node
- [root@bigserver1 name]# netstat -tpnl |grep java
- tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 5573/java
- tcp 0 0 10.0.0.237:9000 0.0.0.0:* LISTEN 5573/java
- tcp 0 0 0.0.0.0:50090 0.0.0.0:* LISTEN 5768/java
- tcp6 0 0 10.0.0.237:8088 :::* LISTEN 5930/java
- tcp6 0 0 10.0.0.237:8030 :::* LISTEN 5930/java
- tcp6 0 0 10.0.0.237:8031 :::* LISTEN 5930/java
- tcp6 0 0 10.0.0.237:8032 :::* LISTEN 5930/java
- tcp6 0 0 10.0.0.237:8033 :::* LISTEN 5930/java
- [root@bigserver1 name]# jps
- 3457 RunJar
- 6851 Jps
- 5573 NameNode
- 5768 SecondaryNameNode
- 5930 ResourceManager
- //datanode node
- [root@bigserver2 sbin]# netstat -tpnl |grep java
- tcp 0 0 127.0.0.1:44205 0.0.0.0:* LISTEN 3405/java
- tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 3405/java
- tcp 0 0 0.0.0.0:50075 0.0.0.0:* LISTEN 3405/java
- tcp 0 0 0.0.0.0:50020 0.0.0.0:* LISTEN 3405/java
- tcp6 0 0 :::43959 :::* LISTEN 3520/java
- tcp6 0 0 :::13562 :::* LISTEN 3520/java
- tcp6 0 0 :::8040 :::* LISTEN 3520/java
- tcp6 0 0 :::8042 :::* LISTEN 3520/java
- [root@bigserver2 sbin]# jps
- 3520 NodeManager
- 5761 Jps
- 3405 DataNode
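If any process is missing from the jps output, the setup did not succeed. That holds for both the master and the datanode: a missing process on either one means failure.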
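The comparison against the expected process list can be scripted. The sketch below reads jps output on standard input and prints whichever expected daemons are absent. The expected lists are taken from the jps output above, and missing_daemons is a hypothetical name.

```shell
# missing_daemons ROLE
# Read jps output on stdin; print each expected daemon that is not listed.
# ROLE is "master" or "datanode".
missing_daemons() {
    local role=$1 expected name out
    case $role in
        master)   expected="NameNode SecondaryNameNode ResourceManager" ;;
        datanode) expected="DataNode NodeManager" ;;
        *) echo "unknown role: $role" >&2; return 2 ;;
    esac
    out=$(cat)
    for name in $expected; do
        # jps prints "PID Name"; compare the whole second field so that
        # SecondaryNameNode does not count as NameNode.
        echo "$out" | awk -v n="$name" '$2 == n { f = 1 } END { exit f ? 0 : 1 }' \
            || echo "$name"
    done
}
```

Usage would be `jps | missing_daemons master` on bigserver1 and `jps | missing_daemons datanode` on bigserver2. Empty output means nothing is missing.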
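If everything looks fine, the cluster can now be reached by URL.
http://10.0.0.237:50070 is the node health-check tool
http://10.0.0.237:8088 is the job analysis tool for the cluster nodes
As shown in the figure below
4. Stop Hadoop from the master only
- [root@localhost sbin]# ./stop-all.sh //stop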
- This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
- Stopping namenodes on [bigserver1]
- bigserver1: stopping namenode
- bigserver2: no datanode to stop //error reported because the datanode failed to start during my initial setup
- Stopping secondary namenodes [0.0.0.0]
- 0.0.0.0: stopping secondarynamenode
- stopping yarn daemons
- stopping resourcemanager
- bigserver2: stopping nodemanager
- no proxyserver to stop
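jps showed that DataNode was missing, although NodeManager was running. The problem was that the master and the datanode did not agree on the cluster ID. It reminded me of the MySQL replication position: when that does not match, synchronization fails too. My guess is that "no datanode to stop" came from formatting the master more than once with hadoop namenode -format.
The fix:
On the master, open /bigdata/hadoop/dfs/name/current/VERSION.
On the datanode, open /bigdata/hadoop/dfs/data/current/VERSION.
Copy the clusterID from the master into the datanode's file, then restart.
I looked around online, and some people say syncing namespaceID also works. But with Hadoop 2.7.7, the datanode's /bigdata/hadoop/dfs/data/current/VERSION has no namespaceID at all, and I did not feel like adding one. Haha. I am not sure whether that would work.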
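The copy can be done without opening an editor. The following sketch assumes the VERSION files contain a line of the form clusterID=CID-..., which matches what I saw in 2.7.7. Both helper names are hypothetical. The edit keeps a .bak copy of the datanode's file in case something goes wrong.

```shell
# cluster_id_of VERSION_FILE
# Print the value of the clusterID= line.
cluster_id_of() {
    sed -n 's/^clusterID=//p' "$1"
}

# set_cluster_id VERSION_FILE NEW_ID
# Rewrite the clusterID= line in place, saving the original as VERSION_FILE.bak.
set_cluster_id() {
    sed -i.bak "s/^clusterID=.*/clusterID=$2/" "$1"
}
```

Since the two files live on different machines, read the ID on the master with `cluster_id_of /bigdata/hadoop/dfs/name/current/VERSION`, then pass that value to set_cluster_id on the datanode and restart.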
IX. Test the Hadoop cluster
1. Create a test directory in HDFS from the master
- # ./bin/hdfs dfs -mkdir /test
- # ./bin/hdfs dfs -ls /
- Found 3 items
- drwxr-xr-x - root supergroup 0 2018-12-26 20:42 /test
- drwx------ - root supergroup 0 2018-12-26 21:27 /tmp
- drwxr-xr-x - root supergroup 0 2018-12-26 20:20 /user
2. Upload test files to HDFS from the master
- # ./bin/hdfs dfs -put ./etc/hadoop/*.xml /test/
3. Test MapReduce from the master
- # ./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep /test/ ./output 'dfs[a-z.]+'
- 18/12/27 09:11:46 INFO client.RMProxy: Connecting to ResourceManager at bigserver1/10.0.0.237:8032
- 18/12/27 09:11:48 INFO input.FileInputFormat: Total input paths to process : 9
- 18/12/27 09:11:48 INFO mapreduce.JobSubmitter: number of splits:9
- 18/12/27 09:11:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1545833322243_0001
- 18/12/27 09:11:49 INFO impl.YarnClientImpl: Submitted application application_1545833322243_0001 //logs appear under userlogs on the datanode
- 18/12/27 09:11:49 INFO mapreduce.Job: The url to track the job: http://bigserver1:8088/proxy/application_1545833322243_0001/
- 18/12/27 09:11:49 INFO mapreduce.Job: Running job: job_1545833322243_0001
- 18/12/27 09:11:55 INFO mapreduce.Job: Job job_1545833322243_0001 running in uber mode : false
- 18/12/27 09:11:55 INFO mapreduce.Job: map 0% reduce 0% //stuck here, map and reduce both stay at 0%
Check the logs on the datanode:
- # cd /bigdata/hadoop/logs/userlogs
- [root@bigserver2 userlogs]# ls
- application_1545825824765_0001 application_1545827800765_0001 application_1545828806710_0001 application_1545829094007_0001 application_1545833322243_0001
- [root@bigserver2 userlogs]# ll |grep application_1545833322243_0001
- drwx--x--- 3 root root 52 12月 27 09:11 application_1545833322243_0001
- [root@bigserver2 userlogs]# cd application_1545833322243_0001
- [root@bigserver2 application_1545833322243_0001]# cd container_1545833322243_0001_01_000001/
- [root@bigserver2 container_1545833322243_0001_01_000001]# ls
- stderr stdout syslog
- [root@bigserver2 container_1545833322243_0001_01_000001]# tail -f syslog
- 2018-12-27 09:14:12,652 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1 due to lack of space for maps
- 2018-12-27 09:14:13,654 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1 due to lack of space for maps
- 2018-12-27 09:14:14,657 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1 due to lack of space for maps
- 2018-12-27 09:14:15,659 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1 due to lack of space for maps
- 2018-12-27 09:14:16,662 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1 due to lack of space for maps
- 2018-12-27 09:14:17,665 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1 due to lack of space for maps
- ...... (omitted) ......
The fix, as mentioned above: remove the memory- and CPU-related settings from yarn-site.xml, then restart Hadoop.
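To see which of those settings are still present before restarting, a quick scan of yarn-site.xml helps. This sketch only looks for the three property names listed earlier in this article, and other resource settings would need to be added to the pattern. The function name is made up.

```shell
# list_resource_overrides FILE
# Print each of the three memory-related property names that appears in FILE.
list_resource_overrides() {
    grep -oE 'yarn\.(nodemanager\.resource\.memory-mb|scheduler\.maximum-allocation-mb|nodemanager\.vmem-pmem-ratio)' "$1" | sort -u
}
```

Running `list_resource_overrides /bigdata/hadoop/etc/hadoop/yarn-site.xml` should print nothing once the settings are removed.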
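Here is what it looks like after it succeeds
X. Checking logs
Hadoop produces a lot of logs, and that is before installing HBase, Hive, Spark and so on. Besides logging in to the servers to read them, you can also view them through the web UI.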
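On the server side, a first pass over the log directory can be as simple as listing the files that contain an ERROR line. This is a sketch that assumes the daemon logs end in .log and use the "date time LEVEL class" layout seen in the excerpts in this article. The helper name is hypothetical, and --include is a GNU grep option.

```shell
# logs_with_errors LOG_DIR
# List the *.log files under LOG_DIR that contain at least one ERROR line.
logs_with_errors() {
    grep -rl --include='*.log' ' ERROR ' "$1" | sort
}
```

For this setup the call would be `logs_with_errors /bigdata/hadoop/logs`.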
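Problems are not scary. What is scary is having a problem and not knowing where it went wrong. Clicking around the logs in the web UI, I found this one: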
2018-12-27 10:23:04,569 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
java.io.IOException: Inconsistent checkpoint fields.
LV = -63 namespaceID = 1131284630 cTime = 0 ; clusterId = CID-ea915c79-c5cb-4d23-bc55-e4530f999cb0 ; blockpoolId = BP-508509447-10.0.0.237-1545809802003.
Expecting respectively: -63; 839710719; 0; CID-66e894a8-1cb1-4b8e-bac4-2bc3b526e062; BP-262790598-10.0.0.237-1545803654780.
at org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:134)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:531)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:395)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:361)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:357)
at java.lang.Thread.run(Thread.java:748)
The fix:
On the master node, delete all files under /bigdata/hadoop/tmp/dfs/namesecondary/current, then restart Hadoop.
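A slightly safer way to do that deletion is to empty the directory while keeping the directory itself, and to refuse if the path is wrong. This is a sketch with a made-up function name. It relies on GNU find's -mindepth and -delete options. Stop Hadoop before running it.

```shell
# clear_checkpoint_dir DIR
# Delete everything inside DIR but keep DIR itself.
clear_checkpoint_dir() {
    local dir=$1
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    # -mindepth 1 protects DIR itself; -delete removes children depth-first.
    find "$dir" -mindepth 1 -delete
}
```

Here that would be `clear_checkpoint_dir /bigdata/hadoop/tmp/dfs/namesecondary/current`.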