Setting up a Hadoop cluster

Prepare four machines. One acts as the NameNode and JobTracker (the ResourceManager in Hadoop 2.x terms); assume its hostname is master and that the name resolves to its IP. The other three act as DataNodes, with hostnames d1-d3, referred to collectively as ds. All machines can reach each other over the network.
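
If there is no DNS for these hostnames, the simplest way to make them resolve is an /etc/hosts entry on every machine. A minimal sketch, with placeholder IPs (replace them with your own):

# /etc/hosts, identical on all four machines; the IPs below are examples only
192.168.1.10  master
192.168.1.11  d1
192.168.1.12  d2
192.168.1.13  d3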

First, set up SSH. HDFS is a distributed file system and the nodes need to perform various operations on each other, so passwordless SSH must be configured. Generate a key pair on every machine for the user you are going to run Hadoop as, say hadoop:

ssh-keygen -t rsa

Then append master's ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on each ds node, so that master -> ds logins work without a password. The reverse direction, ds -> master, has to be set up as well. With that, every DataNode and master can reach each other; the ds nodes do not need access to one another. One more thing to remember: master's own public key must also go into its own authorized_keys, i.e. master must be able to ssh to itself, otherwise starting the SecondaryNameNode will fail, since there is only the one master here.
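
A quick way to distribute the keys is ssh-copy-id. A sketch, run as the hadoop user, assuming the hostnames above:

# on master
ssh-copy-id hadoop@d1
ssh-copy-id hadoop@d2
ssh-copy-id hadoop@d3
ssh-copy-id hadoop@master   # master to itself, needed for the SecondaryNameNode
# on each of d1, d2, d3
ssh-copy-id hadoop@master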

From here on, everything is done on master; there is no need to touch the ds nodes.

Next, download Hadoop, for example hadoop-2.7.4. Download and extract it under /opt/ on master (the examples below assume /opt/hadoop points at the extracted hadoop-2.7.4 directory), then go into /opt/hadoop/etc/hadoop and make the following changes:
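
A sketch of the download and extraction, assuming the Apache archive mirror (any mirror carrying hadoop-2.7.4 will do):

cd /opt
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz
tar -xzf hadoop-2.7.4.tar.gz
ln -s /opt/hadoop-2.7.4 /opt/hadoop   # so both /opt/hadoop and /opt/hadoop-2.7.4 paths work
chown -R hadoop:hadoop /opt/hadoop-2.7.4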

hadoop-env.sh

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

core-site.xml

<configuration>
  <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop-2.7.4/tmp</value>
        <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/hadoop/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/hadoop/hdfs/data</value>
    </property>
</configuration>

These are the HDFS settings. Make sure the directories involved exist and are writable, ideally owned by the hadoop user. dfs.replication is the replication factor: 1 means each block is stored on exactly one DataNode, with no redundant copy; if you want every block to have at least one backup on another node, set it to 2 or higher.
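
For example, on every node (paths taken from the config above; the hadoop user is the assumption used throughout this post):

sudo mkdir -p /data/hadoop/hdfs/name /data/hadoop/hdfs/data
sudo chown -R hadoop:hadoop /data/hadoop
# the name dir is used on master, the data dir on the DataNodes; creating both everywhere is harmless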

yarn-site.xml

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>

Hadoop ships with YARN, which handles resource management and scheduling for MapReduce jobs.

mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

This tells MapReduce to run on YARN. (The old MRv1 property mapred.job.tracker, which pointed at a JobTracker address such as master:9001, is ignored in Hadoop 2.x; the ResourceManager address is already given by yarn.resourcemanager.hostname in yarn-site.xml.)

 

The last two files are master and slaves, and they are straightforward: master lists the NameNode host, slaves lists the DataNode hosts.

hadoop@master:/opt/hadoop/etc/hadoop# cat master
master
hadoop@master:/opt/hadoop/etc/hadoop# cat slaves
d1
d2
d3

That completes the configuration. Now scp /opt/hadoop to each of the ds nodes.
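
For example, run on master (this assumes /opt on the ds nodes is writable by the hadoop user; otherwise copy to a temporary directory first and move it with sudo):

for host in d1 d2 d3; do
  scp -r /opt/hadoop-2.7.4 hadoop@$host:/opt/
  ssh hadoop@$host "ln -sfn /opt/hadoop-2.7.4 /opt/hadoop"
done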

Go into the Hadoop bin directory and format the NameNode first; this step only needs to be run once:

./hadoop namenode -format

Then Hadoop can be started: bring up HDFS first, then YARN. Note that in Hadoop 2.x the start-*.sh scripts live in sbin/, not bin/:

./start-dfs.sh
./start-yarn.sh

Normally, running jps -m on master and on the ds nodes should show the following processes:

master:

NameNode

SecondaryNameNode

ResourceManager

ds:

DataNode

NodeManager
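
To double-check that all three NodeManagers have registered with the ResourceManager, you can also run on master:

yarn node -list
# should list three RUNNING nodes: d1, d2 and d3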

You can test a few commands:

# ls
hadoop fs -ls /
# mkdir
hadoop fs -mkdir hdfs://master:9000/test
# add
hadoop fs -put hadoop-2.7.4.tar.gz /test
# delete
hadoop fs -rm /test/hadoop-2.7.4.tar.gz
# some options
hadoop fs -ls -h /test
hadoop fs -du -h /test/hadoop-2.7.4.tar.gz

With that, the HDFS distributed file system and the Hadoop MapReduce job environment are up and running.
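
To confirm that MapReduce jobs really run on the cluster (and not just in local mode), submit the bundled example job; the jar path below matches the layout of the 2.7.4 tarball:

hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar pi 2 10
# the running job should also show up in the ResourceManager web UI at http://master:8088/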

Check the node status

Run hadoop dfsadmin -report:
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Configured Capacity: 316636778496 (294.89 GB)
Present Capacity: 292826760164 (272.72 GB)
DFS Remaining: 288759403492 (268.93 GB)
DFS Used: 4067356672 (3.79 GB)
DFS Used%: 1.39%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (3):

Name: d3:50010 (dd3)
Hostname: d3
Decommission Status : Normal
Configured Capacity: 105545592832 (98.30 GB)
DFS Used: 1758699520 (1.64 GB)
Non DFS Used: 2879700992 (2.68 GB)
DFS Remaining: 95974314316 (89.38 GB)
DFS Used%: 1.67%
DFS Remaining%: 90.93%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 9
Last contact: Tue Dec 19 07:19:04 UTC 2017


Name: d2:50010 (d2)
Hostname: d2
Decommission Status : Normal
Configured Capacity: 105545592832 (98.30 GB)
DFS Used: 1217626112 (1.13 GB)
Non DFS Used: 3051429888 (2.84 GB)
DFS Remaining: 96343658828 (89.73 GB)
DFS Used%: 1.15%
DFS Remaining%: 91.28%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 9
Last contact: Tue Dec 19 07:19:02 UTC 2017


Name: d1:50010 (d1)
Hostname: d1
Decommission Status : Normal
Configured Capacity: 105545592832 (98.30 GB)
DFS Used: 1091031040 (1.02 GB)
Non DFS Used: 3080253440 (2.87 GB)
DFS Remaining: 96441430348 (89.82 GB)
DFS Used%: 1.03%
DFS Remaining%: 91.37%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 9
Last contact: Tue Dec 19 07:19:03 UTC 2017

The built-in web UI: http://master:50070/dfshealth.html#tab-overview

 

Test with a large file, for example the 梦幻西游 (Fantasy Westward Journey) game client, about 3.8 GB:

wget http://xyq.gdl.netease.com/MHXY-JD-3.0.212.exe.1
hadoop fs -put MHXY-JD-3.0.212.exe.1 /test

Then run du -sh /data/hadoop on each node:

master: 3.3M, since it stores no data blocks; it is only the NameNode and JobTracker (ResourceManager), in other words it only manages the cluster metadata and the MapReduce jobs.

d1:1.1G

d2:1.2G

d3:1.7G

Splitting the client evenly across three nodes would give 4035141420/3/1024/1024/1024 = 1.25 GB per node.

So with dfs.replication = 1 each block exists on exactly one DataNode: the file is split into blocks (128 MB by default) and the NameNode spreads them over d1-d3, which is why the three totals add up to roughly the file size (about 3.8 GB) rather than 3 x 1.25 GB. The split is uneven because block placement is not perfectly balanced, and the earlier file additions and deletions also skew the numbers. Note that unlike a RAID array, HDFS does not reconstruct data from parity: it stores complete block replicas (plus per-block checksums used only for integrity verification), so with replication factor 1 a lost block cannot be recovered from the other nodes.
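
You can check exactly where the blocks of the file ended up with fsck, for example:

hdfs fsck /test/MHXY-JD-3.0.212.exe.1 -files -blocks -locations
# lists every block of the file, its size, and which DataNode holds it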

As you can see, the DataNodes only needed the SSH setup; everything else was done from master.

 
