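Installing and Using a Hadoop Distributed Cluster on Virtual Machines (Detailed Version)
Cluster layout:
One master: 192.168.85.2
One slave: 192.168.85.3
JDK: jdk1.8.0_74 (the exact version is not important; use whichever you prefer)
Hadoop: 2.7.2 (the exact version is not important; the 2.x releases are all much alike)
This post builds the Hadoop cluster step by step, starting right after Ubuntu 14.04 has been installed.
A few words on setting up the Linux virtual machines:
A Hadoop cluster needs at least two virtual machines. There is no rush to create both: set up one, clone it, and then adjust a few settings on the clone.
The image I used is ubuntu-14.04.3-server-amd64.iso. I gave the VM two network adapters, one for connections from the host and one for network access; see my other post, "Network setups for virtual machines on a local host", for details. During installation, make sure to install the OpenSSH server, so that afterwards you can ssh into the VM with your tool of choice, which is more convenient. It also helps to set the hostname to master during installation so the machine is easy to identify, and to set the username to hadoop.
1. Install the JDK
Check whether a JDK is already installed: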
java -version
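If it is not installed, see "How to uninstall and install a new JDK version on Ubuntu (the permission denied problem)". If it is already installed, skip this step.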
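The check above can be wrapped in a small helper so that later steps fail early when no JDK is reachable. This is a minimal sketch; the function name check_java is made up for illustration and is not part of the JDK or Hadoop.

```shell
# check_java: print the first line of `java -version` if a JDK is on PATH,
# otherwise print a hint and return a non-zero status.
check_java() {
  if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr, so merge it into stdout.
    java -version 2>&1 | head -n 1
    return 0
  fi
  echo "java not found on PATH; install a JDK first"
  return 1
}

check_java || echo "continue only after the JDK is installed"
```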
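2. Download Hadoop
I downloaded from http://mirror.bit.edu.cn/apache/hadoop/common/, which offers many versions. For trying things out they are all the same, so pick any.
Transfer the archive to the virtual machine via ftp or ssh and extract it:
tar zxvf hadoop-2.7.2.tar.gz
Rename the directory:
mv hadoop-2.7.2 hadoop
Check the installation directory: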
hadoop@master:~/hadoop$ pwd
/home/hadoop/hadoop
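The extract and rename steps can also be done in one go. The helper below is a sketch, not part of Hadoop: unpack_hadoop is a hypothetical name, and it assumes the archive has a single top-level directory (as hadoop-2.7.2.tar.gz does) and a tar that supports --strip-components (GNU tar on Ubuntu does).

```shell
# unpack_hadoop TARBALL DEST: extract TARBALL so that the contents of its
# top-level directory (e.g. hadoop-2.7.2/) end up directly in DEST.
# Refuses to touch DEST if it already exists.
unpack_hadoop() {
  local tarball="$1" dest="$2"
  if [ -e "$dest" ]; then
    echo "refusing to overwrite existing $dest" >&2
    return 1
  fi
  mkdir -p "$dest" &&
    tar -xzf "$tarball" -C "$dest" --strip-components=1
}

# Example with the paths used in this post:
# unpack_hadoop ~/hadoop-2.7.2.tar.gz ~/hadoop
```

Refusing to overwrite an existing directory keeps a second run from mixing two Hadoop versions in one place.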
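Next come several configuration files. They all live under etc/hadoop in the installation directory, so switch to that directory to make editing easier. Here is what I put in each file.
The slaves file:
vi slaves
Change its content to the hostname of the worker node (listing master here would start the DataNode and NodeManager on master instead, whereas the jps listings later in this post show them running on slave1):
slave1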
The core-site.xml file:
vi core-site.xml
Add the following inside the <configuration> tag (fs.default.name still works in Hadoop 2.x but is deprecated in favor of fs.defaultFS):
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
The hdfs-site.xml file:
vi hdfs-site.xml
Add (dfs.name.dir and dfs.data.dir are the older names of dfs.namenode.name.dir and dfs.datanode.data.dir; Hadoop 2.x still accepts them):
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/hadoop/dfs/name</value>
<description>Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently.</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/hadoop/dfs/data</value>
<description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
The mapred-site.xml file. This one has to be copied from its template first:
cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
Add:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
The yarn-site.xml file:
vi yarn-site.xml
Add:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
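After editing four XML files by hand it is easy to miss a property. The sketch below greps each file for the property names used in the steps above. has_property is a hypothetical helper, and the default directory is the install path from this post; adjust it if yours differs. It only checks that a name is present, not that the XML is well formed.

```shell
# has_property FILE NAME: succeed if FILE contains <name>NAME</name>.
has_property() {
  grep -qF "<name>$2</name>" "$1" 2>/dev/null
}

# Assumed location of the configuration files (install path from this post).
conf_dir="${HADOOP_CONF_DIR:-$HOME/hadoop/etc/hadoop}"

# file:property pairs taken from the configuration steps above.
for pair in \
  core-site.xml:hadoop.tmp.dir \
  core-site.xml:fs.default.name \
  hdfs-site.xml:dfs.replication \
  mapred-site.xml:mapreduce.framework.name \
  yarn-site.xml:yarn.resourcemanager.hostname \
  yarn-site.xml:yarn.nodemanager.aux-services
do
  file="${pair%%:*}"
  name="${pair#*:}"
  if has_property "$conf_dir/$file" "$name"; then
    echo "ok       $file  $name"
  else
    echo "MISSING  $file  $name"
  fi
done
```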
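In theory Hadoop can run at this point. In practice, the startup scripts may report environment configuration errors, so to head those off I set up everything that needs configuring first.
First, JAVA_HOME:
vi hadoop-env.sh
Change the line to:
export JAVA_HOME=/home/hadoop/jdk1.8.0_74
Next, add the Hadoop environment variables. The source ~/.bashrc step that follows only picks them up if they live in ~/.bashrc, so put them there (below the Java settings, if you keep those in the same file):
export HADOOP_DEV_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_DEV_HOME/bin
export PATH=$PATH:$HADOOP_DEV_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_DEV_HOME}
export HADOOP_COMMON_HOME=${HADOOP_DEV_HOME}
export HADOOP_HDFS_HOME=${HADOOP_DEV_HOME}
export YARN_HOME=${HADOOP_DEV_HOME}
export HADOOP_CONF_DIR=${HADOOP_DEV_HOME}/etc/hadoop
export HDFS_CONF_DIR=${HADOOP_DEV_HOME}/etc/hadoop
export YARN_CONF_DIR=${HADOOP_DEV_HOME}/etc/hadoop
Save the changes and reload: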
source ~/.bashrc
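To confirm that the reload took effect, the current shell's PATH can be checked for the two Hadoop directories. A minimal sketch: path_contains is a made-up helper name, and the fallback path is the install location used in this post.

```shell
# path_contains DIR: succeed if DIR is one of the entries in $PATH.
path_contains() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# HADOOP_DEV_HOME is the variable exported above; fall back to the install
# path from this post if it is not set in the current shell.
hadoop_home="${HADOOP_DEV_HOME:-/home/hadoop/hadoop}"
for dir in "$hadoop_home/bin" "$hadoop_home/sbin"; do
  if path_contains "$dir"; then
    echo "on PATH:  $dir"
  else
    echo "missing:  $dir"
  fi
done
```

If either directory is reported missing, later commands such as hadoop fs will not be found without a full path.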
Format HDFS (run this from the installation directory, /home/hadoop/hadoop):
bin/hdfs namenode -format
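Before cloning the virtual machine, finish the host configuration (editing this file requires root privileges):
sudo vi /etc/hosts
Change it to:
127.0.0.1 localhost
#127.0.1.1 master
192.168.85.2 master
192.168.85.3 slave1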
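A quick way to confirm that both names map to the intended addresses is to read the file back. The sketch below uses awk on a hosts-format file; lookup_host is a hypothetical helper, not a system command.

```shell
# lookup_host FILE NAME: print the first address that FILE maps NAME to.
# FILE uses the /etc/hosts format; text after '#' is ignored.
lookup_host() {
  awk -v name="$2" '
    { sub(/#.*/, "") }                     # strip comments
    { for (i = 2; i <= NF; i++)
        if ($i == name) { print $1; exit } }
  ' "$1"
}

for host in master slave1; do
  addr=$(lookup_host /etc/hosts "$host")
  echo "$host -> ${addr:-not found}"
done
```

Because comments are stripped first, the commented-out 127.0.1.1 entry is not reported for master.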
Clone the virtual machine and boot the clone. On the clone, change the hostname and the IP address (both files require root privileges to edit):
sudo vi /etc/hostname
Change it to slave1.
sudo vi /etc/network/interfaces
You will see:
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
auto eth0
iface eth0 inet dhcp
auto eth1
iface eth1 inet static
address 192.168.85.2
netmask 255.255.255.0
Change it to:
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
auto eth0
iface eth0 inet dhcp
auto eth1
iface eth1 inet static
address 192.168.85.3
netmask 255.255.255.0
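The only line that differs between the two versions of the file is the address line, so the edit can also be previewed with sed instead of vi. This is a sketch under the assumption that the file has exactly the layout shown above; set_static_address is a made-up helper that prints the result and leaves the file itself untouched.

```shell
# set_static_address FILE OLD NEW: print FILE with the line "address OLD"
# rewritten to "address NEW". FILE is not modified.
set_static_address() {
  local file="$1" new="$3" old
  # Escape the dots so they match literally in the regular expression.
  old=$(printf '%s\n' "$2" | sed 's/\./\\./g')
  sed "s/^\([[:space:]]*address[[:space:]][[:space:]]*\)${old}[[:space:]]*\$/\1${new}/" "$file"
}

# Example: write a preview, inspect it, then install it with sudo.
# set_static_address /etc/network/interfaces 192.168.85.2 192.168.85.3 > /tmp/interfaces.new
```

Writing to a preview file first avoids ending up with a broken network configuration on a machine that is only reachable over that network.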
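Reboot the machine.
At this point the Hadoop cluster is configured.
With both machines in place, master needs to be able to log in to the slave without a password. On master: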
ssh localhost
cd ~/.ssh
ssh-keygen -t rsa
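Just press Enter at every prompt.
The master node must also be able to ssh to itself without a password. Still on master, run:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
When that is done, run ssh master to verify. Then copy the public key to the slave1 node:
scp ~/.ssh/id_rsa.pub hadoop@slave1:/home/hadoop/
scp asks for the password of the hadoop user on slave1 (hadoop) and reports when the transfer is complete.
Next, on the slave1 node, store the public key in the right place (create the directory first with mkdir -p ~/.ssh if it does not exist yet):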
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
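Two things commonly go wrong with the append step: running it twice adds the key twice, and sshd in its default configuration ignores an authorized_keys file whose permissions are too open. The sketch below handles both; add_authorized_key is a hypothetical helper, not an OpenSSH command.

```shell
# add_authorized_key PUBKEY_FILE [SSH_DIR]: append the key in PUBKEY_FILE to
# SSH_DIR/authorized_keys unless it is already there, and tighten permissions.
# SSH_DIR defaults to ~/.ssh.
add_authorized_key() {
  local pubkey="$1" ssh_dir="${2:-$HOME/.ssh}"
  local auth="$ssh_dir/authorized_keys"
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir" || return 1
  touch "$auth" && chmod 600 "$auth" || return 1
  # -x whole line, -F fixed string, -f patterns from file: literal comparison.
  if ! grep -qxFf "$pubkey" "$auth"; then
    cat "$pubkey" >> "$auth"
  fi
}

# On slave1, after the scp step:
# add_authorized_key ~/id_rsa.pub
```

OpenSSH also ships ssh-copy-id, which performs the copy and append in one command (ssh-copy-id hadoop@slave1) if you prefer not to do it by hand.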
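Let's test whether it runs (from the installation directory on master):
sbin/start-dfs.sh
sbin/start-yarn.sh
These commands start the daemons on both master and the slave. Use jps to see what is running:
hadoop@master:~$ jps
1632 SecondaryNameNode
4581 Jps
1782 ResourceManager
1402 NameNode
Running jps on slave1 gives:
4586 Jps
3210 DataNode
3356 NodeManager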
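Rather than reading the jps listings by eye, the expected daemon names can be checked against the output. A minimal sketch, with missing_daemons as a made-up helper; the daemon names are the ones from the listings above.

```shell
# missing_daemons JPS_OUTPUT NAME...: print every NAME that does not appear
# as a process name (second column) in the given jps output.
missing_daemons() {
  local output="$1" name
  shift
  for name in "$@"; do
    printf '%s\n' "$output" |
      awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' ||
      echo "$name"
  done
}

# On master (skipped when jps is not installed):
if command -v jps >/dev/null 2>&1; then
  missing_daemons "$(jps)" NameNode SecondaryNameNode ResourceManager
fi
# On slave1 the names to check would be DataNode and NodeManager.
```

No output means every expected daemon is running; any name printed is a daemon whose log is worth checking.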
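Open http://192.168.85.2:50070/ to see how master and the slave are laid out and whether they started correctly.
Now run the classic wordcount example.
Create a test1.txt and a target directory on the cluster:
cd
mkdir input
cd input
echo "hello world" > test1.txt
hadoop fs -mkdir input
The last command may fail with an error saying the input directory cannot be found. A relative HDFS path is resolved against the user's HDFS home directory, /user/hadoop, which does not exist on a freshly formatted file system. Using an absolute path with a leading / avoids the problem:
hadoop fs -mkdir /input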
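The difference between the two mkdir commands comes down to how a path without a leading / is resolved: it is taken relative to the user's HDFS home directory, /user/<username> by default. The function below only illustrates that rule in plain shell; resolve_hdfs_path is a made-up name and does not talk to HDFS.

```shell
# resolve_hdfs_path USER PATH: print the absolute HDFS path that PATH
# refers to for USER, assuming the default home directory /user/USER.
resolve_hdfs_path() {
  case "$2" in
    /*) printf '%s\n' "$2" ;;                 # already absolute
    *)  printf '/user/%s/%s\n' "$1" "$2" ;;   # relative to the home directory
  esac
}

resolve_hdfs_path hadoop input     # prints /user/hadoop/input
resolve_hdfs_path hadoop /input    # prints /input
```

If you would rather keep using relative paths, creating the home directory once with hadoop fs -mkdir -p /user/hadoop should make them work as well.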
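List the files:
hadoop@master:~/hadoop$ hadoop fs -ls /
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2016-05-10 10:36 /input
Upload the files to /input and list them (test2.txt in the listing is a second file that was in the same local directory):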
hadoop@master:~/hadoop$ hadoop fs -put ../input/*.txt /input
hadoop@master:~/hadoop$ hadoop fs -ls /
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2016-05-10 10:38 /input
hadoop@master:~/hadoop$ hadoop fs -ls /input
Found 2 items
-rw-r--r-- 1 hadoop supergroup 12 2016-05-10 10:38 /input/test1.txt
-rw-r--r-- 1 hadoop supergroup 13 2016-05-10 10:38 /input/test2.txt
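Next, use the examples jar that ships with Hadoop to process the file and count its words (run from the home directory):
hadoop/bin/hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input/test1.txt output2
The format is: hadoop script + jar subcommand + jar file + program name + input path + output directory.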
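To know what result to expect before looking at the job output, the same count can be done locally with standard tools. This is only a rough local stand-in for the example, assuming it splits on whitespace and writes one word<TAB>count line per word; local_wordcount is a made-up name.

```shell
# local_wordcount: read text on stdin and print "word<TAB>count" lines,
# sorted by word.
local_wordcount() {
  tr -s '[:space:]' '\n' |        # one token per line
    sed '/^$/d' |                 # drop empty lines
    LC_ALL=C sort |               # byte-order sort, independent of locale
    uniq -c |                     # count identical neighbours
    awk '{ printf "%s\t%s\n", $2, $1 }'
}

echo "hello world" | local_wordcount
```

For the one-line test file this prints hello and world with a count of 1 each, which is 16 bytes in total and matches the Bytes Written=16 counter in the job output.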
Output printed as the job runs:
16/05/10 10:44:14 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.85.2:8032
16/05/10 10:44:15 INFO input.FileInputFormat: Total input paths to process : 1
16/05/10 10:44:15 INFO mapreduce.JobSubmitter: number of splits:1
16/05/10 10:44:15 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1462879083278_0001
16/05/10 10:44:16 INFO impl.YarnClientImpl: Submitted application application_1462879083278_0001
16/05/10 10:44:16 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1462879083278_0001/
16/05/10 10:44:16 INFO mapreduce.Job: Running job: job_1462879083278_0001
16/05/10 10:44:30 INFO mapreduce.Job: Job job_1462879083278_0001 running in uber mode : false
16/05/10 10:44:30 INFO mapreduce.Job: map 0% reduce 0%
16/05/10 10:44:40 INFO mapreduce.Job: map 100% reduce 0%
16/05/10 10:44:47 INFO mapreduce.Job: map 100% reduce 100%
16/05/10 10:44:47 INFO mapreduce.Job: Job job_1462879083278_0001 completed successfully
16/05/10 10:44:47 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=30
        FILE: Number of bytes written=234875
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=111
        HDFS: Number of bytes written=16
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=7470
        Total time spent by all reduces in occupied slots (ms)=4602
        Total time spent by all map tasks (ms)=7470
        Total time spent by all reduce tasks (ms)=4602
        Total vcore-milliseconds taken by all map tasks=7470
        Total vcore-milliseconds taken by all reduce tasks=4602
        Total megabyte-milliseconds taken by all map tasks=7649280
        Total megabyte-milliseconds taken by all reduce tasks=4712448
    Map-Reduce Framework
        Map input records=1
        Map output records=2
        Map output bytes=20
        Map output materialized bytes=30
        Input split bytes=99
        Combine input records=2
        Combine output records=2
        Reduce input groups=2
        Reduce shuffle bytes=30
        Reduce input records=2
        Reduce output records=2
        Spilled Records=4
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=229
        CPU time spent (ms)=2730
        Physical memory (bytes) snapshot=298352640
        Virtual memory (bytes) snapshot=3748110336
        Total committed heap usage (bytes)=139145216
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=12
    File Output Format Counters
        Bytes Written=16
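If the job output is saved to a file, individual counters can be pulled out for a quick sanity check. The sketch below assumes the counters appear one per line as name=value, and that the output was captured with something like 2>&1 | tee job.log (job.log is a hypothetical file name); job_counter is a made-up helper.

```shell
# job_counter NAME [FILE...]: print the value of the first counter line of
# the form "NAME=<digits>" found in FILE, or on stdin when no FILE is given.
job_counter() {
  local name="$1"
  shift
  sed -n "s/^[[:space:]]*${name}=\([0-9][0-9]*\)[[:space:]]*\$/\1/p" "$@" | head -n 1
}

# Example against a saved log:
# job_counter 'Reduce output records' job.log
# job_counter 'Bytes Written' job.log
```

For the run shown above, Reduce output records should be 2 (one line each for hello and world).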
While the job is running, http://master:8088/proxy/application_1462879083278_0001/ shows its current state. Once you see map 100% reduce 100%, the job has succeeded; http://192.168.85.2:8088/cluster shows the details.
Finally, shut down the Hadoop cluster:
sbin/stop-dfs.sh
sbin/stop-yarn.sh
Pretty simple, isn't it?