搭建hadoop集群
hadoop的架构
HDFS + MapReduce = Hadoop
MapReduce = Mapper + Reducer
hadoop的生态系统
准备四个节点,系统版本为CentOS7.3
192.168.135.170 NameNode,SecondaryNameNode,ResourceManager
192.168.135.171 DataNode,NodeManager
192.168.135.169 DataNode,NodeManager
192.168.135.172 DataNode,NodeManager
1、修改各节点hosts
# vim /etc/hosts
192.168.135.170 node1 master
192.168.135.171 node2
192.168.135.169 node3
192.168.135.172 node4
2、校对时间
# yum install -y ntp ntpdate && ntpdate pool.ntp.org
3、安装java环境
# yum install -y java java-1.8.0-openjdk-devel
# vim /etc/profile.d/java.sh
export JAVA_HOME=/usr
# source /etc/profile.d/java.sh
4、修改各节点环境变量
# vim /etc/profile.d/hadoop.sh
export HADOOP_PREFIX=/bdapps/hadoop
export PATH=$PATH:${HADOOP_PREFIX}/bin:${HADOOP_PREFIX}/sbin
export HADOOP_YARN_HOME=${HADOOP_PREFIX}
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
# source /etc/profile.d/hadoop.sh
# scp /etc/profile.d/hadoop.sh node2:/etc/profile.d/hadoop.sh
# scp /etc/profile.d/hadoop.sh node3:/etc/profile.d/hadoop.sh
# scp /etc/profile.d/hadoop.sh node4:/etc/profile.d/hadoop.sh
5、创建用户
# useradd hadoop
# echo 'hadoop' | passwd --stdin hadoop
6、设置ssh互信
# su - hadoop
$ ssh-keygen
$ ssh-copy-id node1
$ ssh-copy-id node2
$ ssh-copy-id node3
$ ssh-copy-id node4
7、配置master节点,即node1
a、创建目录
# mkdir -pv /bdapps
# mkdir -pv /data/hadoop/hdfs/{nn,snn,dn}
# chown hadoop.hadoop -R /data/hadoop/hdfs
b、下载程序包
# wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
# tar xvf hadoop-2.6.5.tar.gz -C /bdapps/
# cd /bdapps/
# ln -sv hadoop-2.6.5/ hadoop
# cd hadoop
# mkdir logs
# chmod g+w logs
# chown -R hadoop.hadoop /bdapps/hadoop
c、配置NameNode
# cd etc/hadoop/
# vim core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.135.170:8020</value>
<final>true</final>
</property>
</configuration>
d、配置yarn
# vim yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>192.168.135.170:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>192.168.135.170:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>192.168.135.170:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>192.168.135.170:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>192.168.135.170:8088</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
</configuration>
e、配置HDFS
# vim hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/hadoop/hdfs/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/hadoop/hdfs/dn</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>file:///data/hadoop/hdfs/snn</value>
</property>
<property>
<name>fs.checkpoint.edits.dir</name>
<value>file:///data/hadoop/hdfs/snn</value>
</property>
</configuration>
f、配置MapReduce framework
# cp mapred-site.xml.template mapred-site.xml
# vim mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
g、定义slaves
# vim slaves
192.168.135.171
192.168.135.169
192.168.135.172
8、配置node2,node3,node4
a、创建目录
# mkdir -pv /bdapps
# mkdir -pv /data/hadoop/hdfs/{nn,snn,dn}
# chown hadoop.hadoop -R /data/hadoop/hdfs
b、下载程序包
# wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
# tar xvf hadoop-2.6.5.tar.gz -C /bdapps/
# cd /bdapps/
# ln -sv hadoop-2.6.5/ hadoop
# cd hadoop
# mkdir logs
# chmod g+w logs
# chown -R hadoop.hadoop /bdapps/hadoop/logs
c、从node1上复制配置文件
# su - hadoop
$ scp /bdapps/hadoop/etc/hadoop/* node2:/bdapps/hadoop/etc/hadoop/
$ scp /bdapps/hadoop/etc/hadoop/* node3:/bdapps/hadoop/etc/hadoop/
$ scp /bdapps/hadoop/etc/hadoop/* node4:/bdapps/hadoop/etc/hadoop/
9、格式化HDFS,需要以hadoop用户身份在master节点上执行
# su - hadoop
$ hdfs --help
http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
$ hdfs namenode -format
common.Storage: Storage directory /data/hadoop/hdfs/nn has been successfully formatted.
$ ll /data/hadoop/hdfs/nn/current/
10、启动hadoop,有两种方式
a、在各节点上分别启动各服务
master节点需要启动HDFS的NameNode服务和yarn的ResourceManager服务。
$ hadoop-daemon.sh start namenode
$ hadoop-daemon.sh start secondarynamenode
$ yarn-daemon.sh start resourcemanager
各slave节点需要启动HDFS的DataNode服务和yarn的NodeManager服务。
$ hadoop-daemon.sh start datanode
$ yarn-daemon.sh start nodemanager
b、在master节点上用脚本控制集群中的各节点启动
$ start-dfs.sh
Starting namenodes on [node1]
node1: starting namenode, logging to /bdapps/hadoop/logs/hadoop-hadoop-namenode-node1.out
192.168.135.172: starting datanode, logging to /bdapps/hadoop/logs/hadoop-hadoop-datanode-node4.out
192.168.135.171: starting datanode, logging to /bdapps/hadoop/logs/hadoop-hadoop-datanode-node2.out
192.168.135.169: starting datanode, logging to /bdapps/hadoop/logs/hadoop-hadoop-datanode-node3.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is 38:28:13:e9:f0:e7:06:37:b9:3e:96:b5:ce:b9:06:fb.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /bdapps/hadoop/logs/hadoop-hadoop-secondarynamenode-node1.out
尝试上传一个文件
$ hdfs dfs -ls /
$ hdfs dfs -mkdir /test
$ hdfs dfs -put /etc/fstab /test/
$ hdfs dfs -lsr /
drwxr-xr-x - hadoop supergroup 0 2017-04-06 02:19 /test
-rw-r--r-- 2 hadoop supergroup 541 2017-04-06 02:19 /test/fstab
$ hdfs dfs -cat /test/fstab
查看hdfs信息
http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#dfsadmin
-report [-live] [-dead] [-decommissioning]:Reports basic filesystem information and statistics. Optional flags may be used to filter the list of displayed DataNodes.
$ hdfs dfsadmin -report
查看yarn信息
hadoop2引入了yarn框架,对每个slave节点可以通过NodeManager进行管理,启动NodeManager进程后,即可加入集群。
$ yarn node -list
17/04/07 03:33:33 INFO client.RMProxy: Connecting to ResourceManager at /192.168.135.170:8032
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
node4:46842 RUNNING node4:8042 0
node2:35812 RUNNING node2:8042 0
node3:33280 RUNNING node3:8042 0
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /bdapps/hadoop/logs/yarn-hadoop-resourcemanager-node1.out
192.168.135.172: starting nodemanager, logging to /bdapps/hadoop/logs/yarn-hadoop-nodemanager-node4.out
192.168.135.171: starting nodemanager, logging to /bdapps/hadoop/logs/yarn-hadoop-nodemanager-node2.out
192.168.135.169: starting nodemanager, logging to /bdapps/hadoop/logs/yarn-hadoop-nodemanager-node3.out
在master节点上的进程
$ jps
2272 NameNode
2849 ResourceManager
2454 SecondaryNameNode
3112 Jps
在slave节点上的进程
$ jps
12192 Jps
12086 NodeManager
11935 DataNode
11、查看WebUI
$ netstat -tnlp
a、HDFS的WebUI
http://192.168.135.170:50070
b、yarn的WebUI
http://192.168.135.170:8088
12、运行测试程序
# su - hdfs
$ cd /bdapps/hadoop/share/hadoop/mapreduce
$ yarn jar hadoop-mapreduce-examples-2.6.5.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
$ yarn jar hadoop-mapreduce-examples-2.6.5.jar wordcount /test/fstab /test/fstab.out
17/04/06 02:40:06 INFO client.RMProxy: Connecting to ResourceManager at /192.168.135.170:8032
17/04/06 02:40:12 INFO input.FileInputFormat: Total input paths to process : 1
17/04/06 02:40:12 INFO mapreduce.JobSubmitter: number of splits:1
17/04/06 02:40:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1491416651117_0001
17/04/06 02:40:14 INFO impl.YarnClientImpl: Submitted application application_1491416651117_0001
17/04/06 02:40:17 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1491416651117_0001/
17/04/06 02:40:17 INFO mapreduce.Job: Running job: job_1491416651117_0001
17/04/06 02:40:47 INFO mapreduce.Job: Job job_1491416651117_0001 running in uber mode : false
17/04/06 02:40:47 INFO mapreduce.Job: map 0% reduce 0%
17/04/06 02:41:19 INFO mapreduce.Job: map 100% reduce 0%
17/04/06 02:41:33 INFO mapreduce.Job: map 100% reduce 100%
17/04/06 02:41:34 INFO mapreduce.Job: Job job_1491416651117_0001 completed successfully
17/04/06 02:41:34 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=585
FILE: Number of bytes written=215501
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=644
HDFS: Number of bytes written=419
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=29830
Total time spent by all reduces in occupied slots (ms)=10691
Total time spent by all map tasks (ms)=29830
Total time spent by all reduce tasks (ms)=10691
Total vcore-milliseconds taken by all map tasks=29830
Total vcore-milliseconds taken by all reduce tasks=10691
Total megabyte-milliseconds taken by all map tasks=30545920
Total megabyte-milliseconds taken by all reduce tasks=10947584
Map-Reduce Framework
Map input records=12
Map output records=60
Map output bytes=648
Map output materialized bytes=585
Input split bytes=103
Combine input records=60
Combine output records=40
Reduce input groups=40
Reduce shuffle bytes=585
Reduce input records=40
Reduce output records=40
Spilled Records=80
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=281
CPU time spent (ms)=8640
Physical memory (bytes) snapshot=291602432
Virtual memory (bytes) snapshot=4209983488
Total committed heap usage (bytes)=149688320
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=541
File Output Format Counters
Bytes Written=419
$ hdfs dfs -ls /test/fstab.out
Found 2 items
-rw-r--r-- 2 hadoop supergroup 0 2017-04-06 02:41 /test/fstab.out/_SUCCESS
-rw-r--r-- 2 hadoop supergroup 419 2017-04-06 02:41 /test/fstab.out/part-r-00000
$ hdfs dfs -cat /test/fstab.out/part-r-00000
# 7
'/dev/disk' 1
/ 1
/boot 1
/dev/mapper/cl-home 1
/dev/mapper/cl-root 1
/dev/mapper/cl-swap 1
/etc/fstab 1
/home 1
0 8
01:15:45 1
11 1
2017 1
Accessible 1
Created 1
Mar 1
Sat 1
See 1
UUID=b76be3cf-613c-478a-ab8b-d1eaa67a061a 1
anaconda 1
and/or 1
are 1
blkid(8) 1
by 2
defaults 4
filesystems, 1
findfs(8), 1
for 1
fstab(5), 1
info 1
maintained 1
man 1
more 1
mount(8) 1
on 1
pages 1
reference, 1
swap 2
under 1
xfs 3