Notes on setting up Hadoop 2.5.2 on CentOS 6.5
Plan: three machines
1. master 192.168.79.135
2. slave1 192.168.79.131
3. slave2 192.168.79.132
OS: CentOS 6.5
Hadoop: hadoop-2.5.2
JDK: 1.7.0_67
Since I am using VMware, I created the machines by cloning for convenience. With clones you often run into problems reaching the external network, and the IP addresses can drift, so I recommend cloning and booting in order: first install master, then clone master to create slave1, then clone slave1 to create slave2.
After cloning, start the machines in that order. (I originally planned to use static IPs, but that did not work out, so for now the addresses are obtained automatically.) If you are not using virtual machines, install three systems separately and configure their IPs.
1. Edit the /etc/hosts file
vi /etc/hosts
#127.0.0.1 localhost.localdomain localhost
#::1 localhost6.localdomain6 localhost6
192.168.79.135 master
192.168.79.131 slave1
192.168.79.132 slave2
(Just to be safe I commented out 127.0.0.1 as well; normally it is enough to comment out only the IPv6 ::1 entry.)
You can configure this on each machine separately, or copy the file between servers with scp, as sketched below. Passwordless login has not been set up yet at this point (passwordless SSH is covered later), so the copy will prompt for a password.
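A minimal sketch of pushing the same /etc/hosts to the two slaves with scp, assuming it is run on master as root with the IPs from the plan above (each command will still ask for the target machine's password at this stage):

scp /etc/hosts root@192.168.79.131:/etc/hosts
scp /etc/hosts root@192.168.79.132:/etc/hosts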
2. Network configuration:
This mainly means setting the hostname. On CentOS 6.5 the hostname can be configured like this:
# vi /etc/sysconfig/network
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=master
NTPSERVERARGS=iburst
Set HOSTNAME to the name planned for each machine. Do this on all three machines and, of course, do not reuse a name. A quick way to apply the new name without rebooting is shown below.
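On CentOS 6 the HOSTNAME value in /etc/sysconfig/network is only read at boot, so a small sketch (my addition, not part of the original steps) to apply the name to the running session on each node:

hostname master   # use the node's own planned name here (slave1/slave2 on the other machines)
hostname          # verify: should print the new name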
Disable the firewall:
a. Disable permanently: chkconfig iptables off
b. Stop it for the current session: service iptables stop
c. Check the firewall status (it should report that iptables is not running): service iptables status
3. Passwordless SSH login:
On every machine run:
ssh-keygen
On the master node run:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh slave1 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh slave2 cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
After that, copy the resulting authorized_keys file into the ~/.ssh/ directory on every machine (sshd looks for ~/.ssh/authorized_keys by default).
Once the copies are done, ssh from every machine to every other machine, including itself. This adds each host's RSA key to known_hosts and also verifies that passwordless login is working.
Once all machines can log in to each other without a password, scp can transfer files between them without prompting; a sketch of the distribution and verification follows.
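A minimal sketch of distributing the aggregated key file from master and then checking every login path (run on master as root; ~/.ssh already exists on the slaves because ssh-keygen was run there, and these two scp commands will still ask for the slave passwords one last time):

scp ~/.ssh/authorized_keys root@slave1:~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys root@slave2:~/.ssh/authorized_keys
# then, from each node, confirm passwordless login to every node (including itself):
ssh master date
ssh slave1 date
ssh slave2 date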
4. Install the JDK:
First check whether the system already contains a JDK. The installer may add OpenJDK automatically (in my experience CentOS 7 installs OpenJDK when the developer tools group is selected; CentOS 6.5 does not).
If OpenJDK is installed, uninstall it before installing the Sun JDK (http://blog.csdn.net/tralonzhang/article/details/7773824).
Check first:
rpm -qa | grep java
which shows something like:
java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
Uninstall:
rpm -e --nodeps java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
Download the JDK tar package for the required version and extract it to a directory of your choice (I used /usr/java/). Then configure the JDK environment variables in /etc/profile:
export JAVA_HOME=/usr/java/jdk1.7.0_67
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JRE_HOME=/usr/java/jdk1.7.0_67/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
After editing, run source /etc/profile to make the variables take effect.
Next, verify that the JDK is installed correctly:
[root@master ~]# java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)
[root@master ~]# javac -version
javac 1.7.0_67
[root@master ~]# $JAVA_HOME
-bash: /usr/java/jdk1.7.0_67: is a directory
If the correct version and JDK path are shown as above, the installation succeeded. You can then copy the Java installation and /etc/profile to the other nodes and run source /etc/profile there, as sketched below.
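A minimal sketch of copying the JDK and profile to the slaves (run on master; relies on the passwordless SSH from step 3, and the target directory is simply this cluster's chosen layout):

ssh slave1 "mkdir -p /usr/java"
ssh slave2 "mkdir -p /usr/java"
scp -r /usr/java/jdk1.7.0_67 root@slave1:/usr/java/
scp -r /usr/java/jdk1.7.0_67 root@slave2:/usr/java/
scp /etc/profile root@slave1:/etc/profile
scp /etc/profile root@slave2:/etc/profile
# then run on each slave: source /etc/profile && java -version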
5. Install hadoop-2.5.2:
Hadoop install directory: /data/hadoop/hadoop-2.5.2/
HDFS datanode data directory: /data/hadoop/hadoop-2.5.2/hdfs/data/
HDFS namenode directory: /data/hadoop/hadoop-2.5.2/hdfs/name/
On every machine create the directory /data/hadoop/, copy hadoop-2.5.2.tar.gz into it, and change into that directory.
Run tar -zxvf hadoop-2.5.2.tar.gz to extract hadoop-2.5.2.tar.gz. A sketch of this preparation follows.
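A short sketch of the per-node preparation, matching the layout planned above (the hdfs and tmp directory names are this cluster's own choice, not something Hadoop mandates):

mkdir -p /data/hadoop
cd /data/hadoop
tar -zxvf hadoop-2.5.2.tar.gz
# pre-create the local directories that the HDFS configuration below will point at
mkdir -p /data/hadoop/hadoop-2.5.2/hdfs/name /data/hadoop/hadoop-2.5.2/hdfs/data
mkdir -p /data/hadoop/tmp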
Then comes the most important part: configuring Hadoop.
Newer Hadoop releases reorganized the directory layout: in 0.20 the configuration files all lived under conf/, whereas now they are all under /data/hadoop/hadoop-2.5.2/etc/hadoop/.
During configuration I started with two tutorials found online and configured everything against them, but in the end the cluster would only barely start and the ResourceManager kept dropping offline, so it failed. I then went straight to the official documentation and found that the online tutorials even had the letter casing of property names wrong. After checking it step by step, property by property, I redid the configuration once and it worked!
More importantly, I discovered that the Hadoop package itself ships the detailed official documentation, including the default values of every configuration property, which you can simply copy out and edit to avoid typos. The cluster setup document is at: {package location}hadoop-2.5.2/share/doc/hadoop/hadoop-project-dist/hadoop-common/ClusterSetup.html
First-hand official documentation is simply more reliable. From now on I will check whether a download ships its own docs before hunting for random write-ups online and wasting time (yes, I am producing another random write-up myself... ha).
Back to the point:
(1) Change JAVA_HOME in hadoop-env.sh and yarn-env.sh to the same path used in the environment variables, as in the sketch below.
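A sketch of the line to set in both files under /data/hadoop/hadoop-2.5.2/etc/hadoop/ (the path follows the JDK installed in step 4):

# in hadoop-env.sh and in yarn-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_67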
(2) core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <!-- i/o properties -->
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
    <description>The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.</description>
  </property>
</configuration>
(3) hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hadoop/hadoop-2.5.2/hdfs/name</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/hadoop/hadoop-2.5.2/hdfs/data</value>
    <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
    <description>The default block size for new files, in bytes. You can use the following suffix (case insensitive): k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.), or provide complete size in bytes (such as 134217728 for 128 MB).</description>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>10</value>
    <description>The number of server threads for the namenode.</description>
  </property>
</configuration>
(4) mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn.</description>
  </property>
  <!-- jobhistory properties -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
    <description>MapReduce JobHistory Server IPC host:port</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
    <description>MapReduce JobHistory Server Web UI host:port</description>
  </property>
</configuration>
(5) yarn-site.xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <description>The hostname of the RM.</description>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:8032</value>
    <description>The address of the applications manager interface in the RM.</description>
  </property>
  <property>
    <description>The address of the scheduler interface.</description>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${yarn.resourcemanager.hostname}:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>${yarn.resourcemanager.hostname}:8031</value>
  </property>
  <property>
    <description>The address of the RM admin interface.</description>
    <name>yarn.resourcemanager.admin.address</name>
    <value>${yarn.resourcemanager.hostname}:8033</value>
  </property>
  <property>
    <description>The http address of the RM web application.</description>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${yarn.resourcemanager.hostname}:8088</value>
  </property>
  <property>
    <description>The minimum allocation for every container request at the RM, in MBs. Memory requests lower than this won't take effect, and the specified value will get allocated at minimum. default is 1024</description>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
  </property>
  <property>
    <description>The maximum allocation for every container request at the RM, in MBs. Memory requests higher than this won't take effect, and will get capped to this value. default value is 8192</description>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
  </property>
  <property>
    <description>Amount of physical memory, in MB, that can be allocated for containers. default value is 8192</description>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2048</value>
  </property>
  <property>
    <description>Whether to enable log aggregation. Log aggregation collects each container's logs and moves these logs onto a file-system, for e.g. HDFS, after the application completes. Users can configure the "yarn.nodemanager.remote-app-log-dir" and "yarn.nodemanager.remote-app-log-dir-suffix" properties to determine where these logs are moved to. Users can access the logs via the Application Timeline Server.</description>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>
(6) slaves (according to the official documentation, the slaves file may list either IPs or hostnames; I use hostnames so that IP changes are easy to absorb)
slave1
slave2
Many of the properties in the files above are Hadoop's default values, included only to make things explicit; when configuring, any value that matches the documented default can be omitted.
That completes the Hadoop configuration files. Next come the Hadoop environment variables; the Java variables set earlier are part of the same file, as follows:
/etc/profile
#set java_env
export JAVA_HOME=/usr/java/jdk1.7.0_67
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JRE_HOME=/usr/java/jdk1.7.0_67/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
###set hadoop_env
export HADOOP_HOME=/data/hadoop/hadoop-2.5.2
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/lib
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
Again run source /etc/profile.
Test the setup with $HADOOP_HOME (typing the variable should show the Hadoop directory path, the same trick as with JAVA_HOME above) and hadoop version.
If everything above works, copy the corresponding installation files (including /etc/profile) to the other machines, making sure the paths are exactly the same, and then Hadoop can be started; see the sketch below.
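A minimal sketch of syncing to the slaves (run on master; since the archive was already extracted on every node in step 5, copying the configured etc/hadoop directory plus /etc/profile is enough, though copying the whole hadoop-2.5.2 directory also works):

scp -r /data/hadoop/hadoop-2.5.2/etc/hadoop root@slave1:/data/hadoop/hadoop-2.5.2/etc/
scp -r /data/hadoop/hadoop-2.5.2/etc/hadoop root@slave2:/data/hadoop/hadoop-2.5.2/etc/
scp /etc/profile root@slave1:/etc/profile
scp /etc/profile root@slave2:/etc/profile
# then run on each slave: source /etc/profile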
6. First format the namenode with hadoop namenode -format; the same can be done with $HADOOP_HOME/bin/hdfs namenode -format.
If the output contains "Storage directory /data/hadoop/hadoop-2.5.2/hdfs/name has been successfully formatted", the format succeeded.
Hadoop can then be started. The official docs start each daemon individually, which seems rather tedious,
so I use start-dfs.sh to start HDFS.
Several log output locations are printed, one of which is the SecondaryNameNode at 0.0.0.0. This is expected, because hdfs-default.xml says:
<property>
  <name>dfs.namenode.rpc-bind-host</name>
  <value></value>
  <description>The actual address the RPC server will bind to. If this optional address is set, it overrides only the hostname portion of dfs.namenode.rpc-address. It can also be specified per name node or name service for HA/Federation. This is useful for making the name node listen on all interfaces by setting it to 0.0.0.0.</description>
</property>
[root@master hadoop]# jps
2630 Jps
1955 SecondaryNameNode
1785 NameNode
[root@slave1 ~]# jps
1942 Jps
1596 DataNode
If jps on master shows the two processes above and jps on the slaves shows a DataNode, then DFS started successfully (you should still check the logs on the slave nodes and track down any errors).
Once that all works, run ./start-yarn.sh to start YARN.
[root@master hadoop]# jps
2630 Jps
1955 SecondaryNameNode
1785 NameNode
2316 ResourceManager
[root@slave1 ~]# jps
1942 Jps
1596 DataNode
1774 NodeManager
After this, master gains a ResourceManager process and each slave gains a NodeManager, so YARN is up.
7. The documentation also describes a JobHistory web UI. If you only start DFS and YARN this UI is not available; it has to be started separately, and for that the daemon script is used:
The following command starts the MapReduce JobHistory server (no need to run it if you don't want the history UI); afterwards jps shows an extra JobHistoryServer process:
./mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR
The following web interfaces are available for managing the Hadoop environment:
Daemon | Web Interface | Notes |
---|---|---|
NameNode | http://nn_host:port/ | Default HTTP port is 50070. |
ResourceManager | http://rm_host:port/ | Default HTTP port is 8088. |
MapReduce JobHistory Server | http://jhs_host:port/ | Default HTTP port is 19888. |
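A quick command-line check that these web UIs are answering; a sketch assuming this cluster's hostnames and the default ports from the table (a 200 or a redirect status code means the daemon's web server is up):

curl -s -o /dev/null -w "%{http_code}\n" http://master:50070/    # NameNode
curl -s -o /dev/null -w "%{http_code}\n" http://master:8088/     # ResourceManager
curl -s -o /dev/null -w "%{http_code}\n" http://master:19888/    # JobHistory server (only after it has been started)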
Appendix: cause of the 0.0.0.0:8031 error and how to fix it
Exception:
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.ConnectException: Call From slave1/192.168.79.131 to 0.0.0.0:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
When YARN started, the slaves kept trying to connect to 0.0.0.0:8031. That 0.0.0.0 is in fact the default ResourceManager hostname: if yarn.resourcemanager.hostname is left unset in yarn-site.xml, the connection fails, YARN ultimately fails to start, and the slave logs show connection-refused errors.
This property can be set once in yarn-site.xml and referenced as a variable by the other address properties, which makes later changes easy (note: in yarn-default.xml the property's value is an IP; for easier maintenance I set it to a hostname here, which still has to be confirmed by actually starting the cluster).
Variables can be referenced across the different configuration files.