Hadoop Cluster Setup
1. Create three virtual machines
This walkthrough uses CentOS 7. Turn off the firewall on every machine.
-
Stop the firewall (run systemctl disable firewalld.service as well if it should stay off after a reboot; both commands need root):
[hadoop@localhost ~]$ sudo systemctl stop firewalld.service
-
Change the hostnames so the virtual machines are easy to tell apart.
Name the master node master and the other two nodes slave1 and slave2.
Check the hostname, then change it:
[hadoop@localhost ~]$ hostname
localhost.localdomain
[hadoop@localhost ~]$ hostnamectl set-hostname master
[hadoop@localhost ~]$ hostname
master
Reboot after the change:
[hadoop@localhost ~]$ reboot
2. Edit the IP-to-hostname mapping in /etc/hosts
Add every machine's IP address and hostname to the hosts file on all nodes; this acts as a simple local DNS:
172.16.46.161 master
172.16.46.163 slave1
172.16.46.162 slave2
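On each node the three mappings can be appended with a short sketch; the addresses below are the ones used in this guide, so substitute your own.

```shell
# IP-to-hostname mappings used in this guide (adjust to your network).
HOSTS_ENTRIES="172.16.46.161 master
172.16.46.163 slave1
172.16.46.162 slave2"

# Print the entries for review; to apply them, append the same text to
# /etc/hosts with root privileges on every node:
#   printf '%s\n' "$HOSTS_ENTRIES" | sudo tee -a /etc/hosts
printf '%s\n' "$HOSTS_ENTRIES"
```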
3. Passwordless SSH login
See the separate note on passwordless SSH login.
4. Install the JDK
See the separate note on JDK installation.
5. Install Hadoop
Download the .tar.gz package from the Hadoop download page.
Extract the archive:
[hadoop@master ~]$ tar -zxvf hadoop-2.9.2.tar.gz
Set the environment variables by adding the following at the bottom of /etc/profile:
export HADOOP_HOME=/home/hadoop/hadoop-2.9.2
export PATH=.:$HADOOP_HOME/bin:$PATH
Load the environment variables:
source /etc/profile
Verify that Hadoop is installed:
[hadoop@master ~]$ hadoop
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
CLASSNAME run the class named CLASSNAME
or
where COMMAND is one of:
fs run a generic filesystem user client
version print the version
jar <jar> run a jar file
note: please use "yarn jar" to launch
YARN applications, not this command.
checknative [-a|-h] check native hadoop and compression libraries availability
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath prints the class path needed to get the
Hadoop jar and the required libraries
credential interact with credential providers
daemonlog get/set the log level for each daemon
trace view and modify Hadoop tracing settings
Most commands print help when invoked w/o parameters.
If the usage text above appears, the installation works.
If nothing is printed, log out and back in (or run source /etc/profile again) so the environment variables are picked up; rebooting the machine also works.
6. Configure Hadoop
Change into the Hadoop installation directory.
6.1 Configure etc/hadoop/hadoop-env.sh
Set JAVA_HOME to the absolute path of the JDK installation directory.
6.2 Configure etc/hadoop/core-site.xml
Set the HDFS NameNode address and the directory Hadoop uses for temporary files at runtime:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://172.16.46.161:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/hadoop/hadoop-2.9.2/tmp</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131702</value>
</property>
</configuration>
If hadoop.tmp.dir is not configured, the data is stored under /tmp/hadoop-<username> by default.
6.3 Configure etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>4</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/hadoop-2.9.2/hdfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/hadoop-2.9.2/hdfs/data</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>172.16.46.161:9001</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
Note that dfs.replication is set to 4 here even though this cluster has only three DataNodes, so every block stays under-replicated; with three DataNodes a value of 3 or lower is the better choice.
6.4 Configure etc/hadoop/mapred-site.xml
Rename mapred-site.xml.template to mapred-site.xml:
[hadoop@master hadoop-2.9.2]$ mv etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
Then edit mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
This tells MapReduce jobs to run on YARN.
6.5 Configure etc/hadoop/slaves
Delete the existing contents and list the IP address of every node:
172.16.46.161
172.16.46.163
172.16.46.162
6.6 Configure etc/hadoop/yarn-env.sh and etc/hadoop/mapred-env.sh
Set JAVA_HOME to the absolute path of the JDK installation directory.
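All three env scripts need the same JAVA_HOME line, so the edit can be scripted. A sketch demonstrated on throwaway copies; the JDK path is an assumption, and for the real files the loop would run over etc/hadoop/*-env.sh in the install directory.

```shell
JAVA_DIR="/home/hadoop/jdk1.8.0_201"   # assumed JDK path; use your own

# Demo on scratch copies of the env scripts; point the loop at
# etc/hadoop/*-env.sh in the Hadoop install directory to do it for real.
mkdir -p /tmp/env-demo
echo 'export JAVA_HOME=${JAVA_HOME}' > /tmp/env-demo/hadoop-env.sh
echo 'export JAVA_HOME=${JAVA_HOME}' > /tmp/env-demo/yarn-env.sh

for f in /tmp/env-demo/*-env.sh; do
  # Replace the JAVA_HOME line with the absolute JDK path.
  sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=$JAVA_DIR|" "$f"
done
grep JAVA_HOME /tmp/env-demo/hadoop-env.sh
```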
6.7 Configure etc/hadoop/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.address</name>
<value>172.16.46.161:18040</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>172.16.46.161:18030</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>172.16.46.161:18088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>172.16.46.161:18025</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>172.16.46.161:18141</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
7. Copy the updated configuration to the other nodes
scp -r etc/ hadoop@slave1:~/hadoop-2.9.2/
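With more than one worker the copy can be looped; a sketch that only prints each command (the hostnames assume the /etc/hosts mappings from step 2). Drop the echo to actually run the copies.

```shell
NODES="slave1 slave2"   # worker hostnames from step 2

for node in $NODES; do
  # Prints the copy command for each worker; remove `echo` to execute.
  echo scp -r etc/ "hadoop@$node:~/hadoop-2.9.2/"
done
```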
8. Start the cluster
8.1 Format the NameNode
With the cluster configured, format the HDFS storage before first use so no stale data is left from earlier runs; formatting also creates the initial metadata.
Formatting is only needed before the very first start.
Run the command on whichever node hosts the NameNode:
bin/hdfs namenode -format
8.2 Start the NameNode and DataNodes
The cluster cannot run until both are up. Start the NameNode on a single node:
[hadoop@master hadoop-2.9.2]# sbin/hadoop-daemon.sh start namenode
starting namenode, logging to /home/hadoop/hadoop-2.9.2/logs/hadoop-root-namenode-master.out
[hadoop@master hadoop-2.9.2]# jps
3877 NameNode
3947 Jps
Start the DataNode on a single node:
[hadoop@master hadoop-2.9.2]# sbin/hadoop-daemon.sh start datanode
starting datanode, logging to /home/hadoop/hadoop-2.9.2/logs/hadoop-root-datanode-master.out
[hadoop@master hadoop-2.9.2]# jps
3877 NameNode
4060 Jps
3982 DataNode
Start the DataNode on each of the other nodes in the same way.
Starting HDFS daemon by daemon like this is tedious, and the SecondaryNameNode still has not been started, so Hadoop provides other startup scripts.
Start the whole HDFS cluster (NameNode, DataNodes, SecondaryNameNode) in one step; sbin/start-dfs.sh starts only HDFS, while the deprecated sbin/start-all.sh used below also starts YARN:
[hadoop@master hadoop-2.9.2]$ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [master]
master: starting namenode, logging to /home/hadoop/hadoop-2.9.2/logs/hadoop-hadoop-namenode-master.out
172.16.46.162: starting datanode, logging to /home/hadoop/hadoop-2.9.2/logs/hadoop-hadoop-datanode-slave2.out
172.16.46.161: starting datanode, logging to /home/hadoop/hadoop-2.9.2/logs/hadoop-hadoop-datanode-master.out
172.16.46.163: starting datanode, logging to /home/hadoop/hadoop-2.9.2/logs/hadoop-hadoop-datanode-slave1.out
Starting secondary namenodes [master]
master: starting secondarynamenode, logging to /home/hadoop/hadoop-2.9.2/logs/hadoop-hadoop-secondarynamenode-master.out
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.9.2/logs/yarn-hadoop-resourcemanager-master.out
172.16.46.163: starting nodemanager, logging to /home/hadoop/hadoop-2.9.2/logs/yarn-hadoop-nodemanager-slave1.out
172.16.46.162: starting nodemanager, logging to /home/hadoop/hadoop-2.9.2/logs/yarn-hadoop-nodemanager-slave2.out
172.16.46.161: starting nodemanager, logging to /home/hadoop/hadoop-2.9.2/logs/yarn-hadoop-nodemanager-master.out
[hadoop@master hadoop-2.9.2]$ jps
4192 Jps
3237 NameNode
3543 SecondaryNameNode
3374 DataNode
8.3 Start YARN
Run the command below on whichever node is to host the ResourceManager.
[hadoop@master hadoop-2.9.2]# sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.9.2/logs/yarn-root-resourcemanager-master.out
[hadoop@master hadoop-2.9.2]$ jps
4192 Jps
3237 NameNode
3814 NodeManager
3543 SecondaryNameNode
3374 DataNode
3695 ResourceManager
Both the ResourceManager and the NodeManagers are now running.
8.4 The Hadoop cluster is up: HDFS, YARN, and MapReduce
Instead of the step-by-step startup above, Hadoop also provides one-command start and stop scripts:
sbin/start-all.sh
sbin/stop-all.sh
9. Access the Hadoop cluster remotely
HDFS web UI: http://172.16.46.161:50070/
10. Quick test
Create a directory in the HDFS file system; both command forms below work:
bin/hdfs dfs -mkdir -p /usr/input
bin/hadoop fs -mkdir -p /usr/output
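For a fuller smoke test, one of the example MapReduce jobs bundled with the release can be run. A sketch of the command plan (stored and printed here rather than executed; run the lines from the Hadoop install directory on master, and note the job's output directory must not exist beforehand):

```shell
# Command plan for a MapReduce smoke test; the example jar name matches
# the hadoop-2.9.2 release used in this guide, and /usr/grep-out is an
# assumed, not-yet-existing output path.
PLAN='bin/hdfs dfs -put etc/hadoop/*.xml /usr/input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar grep /usr/input /usr/grep-out "dfs[a-z.]+"
bin/hdfs dfs -cat /usr/grep-out/*'
printf '%s\n' "$PLAN"
```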
Cluster deployment plan
The steps above produced a working cluster, but they put the NameNode, SecondaryNameNode, and ResourceManager all on one machine. That concentrates load on a single server and squeezes the resources available to each component, so the daemons can instead be spread across the three machines:

| | hadoop11 | hadoop12 | hadoop13 |
|---|---|---|---|
| HDFS | NameNode, DataNode | DataNode | SecondaryNameNode |
| YARN | NodeManager | ResourceManager, NodeManager | NodeManager |

This distributes the three core daemons across the three machines.
Troubleshooting
-
jps not found
jps lists Java processes; if the command is missing, the JDK was not installed properly and the Java environment variables need to be set.
-
DataNodes fail to start after a restart
The first setup usually succeeds, but after the NameNode is formatted again the DataNodes cannot start, because the NameNode no longer recognizes them.
When the NameNode is formatted it generates two identifiers, a blockPoolID and a clusterID.
A DataNode that joins the cluster records both identifiers as proof that it belongs to that NameNode; this is what ties the cluster together.
Re-formatting the NameNode regenerates both identifiers, so a DataNode that shows up with the old ones is turned away.
Fix: delete the stored data on every node (the tmp directory, including the NameNode's data), format again, and restart.
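The cleanup can be scripted; a sketch using the paths configured earlier in this guide (run it on every node, then re-format on the NameNode host only):

```shell
HADOOP_DIR="$HOME/hadoop-2.9.2"   # install path used in this guide

# Remove the old HDFS storage: hadoop.tmp.dir plus the name/data
# directories set in core-site.xml and hdfs-site.xml. Run on EVERY node.
rm -rf "$HADOOP_DIR/tmp" "$HADOOP_DIR/hdfs/name" "$HADOOP_DIR/hdfs/data"

# Then, on the NameNode host only:
#   bin/hdfs namenode -format
#   sbin/start-all.sh
echo "old HDFS storage removed"
```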
-
Every operation prints the following warning
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
It is only a warning and can be ignored; if you really want to get rid of it, see the separate write-up on building the native Hadoop libraries.