
Hadoop official documentation:

https://hadoop.apache.org/docs/

 

Installing the Hadoop cluster

 

Configure DNS resolution or the hosts file on every node (note the redirection: cat with a file operand would ignore the heredoc, so it must be cat >):

cat > /etc/hosts <<EOF
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.3.149.20 hadoop-master
10.3.149.21 hadoop-node1
10.3.149.22 hadoop-node2
EOF
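
The same hosts file needs to be present on all three machines. A minimal sketch for pushing it from the master to the workers (assuming root SSH is reachable at this point, by password if the key setup below has not been done yet):

for h in 10.3.149.21 10.3.149.22; do
    # push the freshly written hosts file to each worker node
    scp /etc/hosts root@$h:/etc/hosts
done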

Set up passwordless SSH for the root user:

ssh-keygen 
ssh-copy-id -i .ssh/id_rsa.pub root@hadoop-master
ssh-copy-id -i .ssh/id_rsa.pub root@hadoop-node1
ssh-copy-id -i .ssh/id_rsa.pub root@hadoop-node2
ssh root@hadoop-master 'date'
ssh root@hadoop-node1 'date'
ssh root@hadoop-node2 'date'

Set up the hadoop user and its passwordless SSH (the password must be piped into passwd --stdin):

useradd hadoop
echo '123456' | passwd --stdin hadoop
su - hadoop
 
ssh-keygen
ssh-copy-id -i .ssh/id_rsa.pub hadoop@hadoop-master
ssh-copy-id -i .ssh/id_rsa.pub hadoop@hadoop-node1
ssh-copy-id -i .ssh/id_rsa.pub hadoop@hadoop-node2
ssh hadoop@hadoop-master 'date'
ssh hadoop@hadoop-node1 'date'
ssh hadoop@hadoop-node2 'date'
exit

 

Install Java:

tar -xf jdk-8u231-linux-x64.tar.gz -C /usr/local/

Create a symlink:

cd /usr/local/
ln -sv jdk1.8.0_231/ jdk

Add the environment variables (again, cat > so the heredoc is actually written to the file):

cat > /etc/profile.d/java.sh <<EOF
export JAVA_HOME=/usr/local/jdk
export JRE_HOME=\$JAVA_HOME/jre
export CLASSPATH=.:\$JAVA_HOME/lib/dt.jar:\$JAVA_HOME/lib/tools.jar:\$JRE_HOME/lib
export PATH=\$PATH:\$JAVA_HOME/bin:\$JRE_HOME/bin
EOF
source /etc/profile.d/java.sh

Verify the installation:

java -version
javac -version

 

Install Hadoop:

Hadoop download mirrors:

https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/ 
http://archive.apache.org/dist/hadoop/common/

For Hadoop 2.7:

http://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz

Download the package:

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz

Extract it:

tar -xf hadoop-2.10.0.tar.gz -C /usr/local/
cd /usr/local/
ln -sv hadoop-2.10.0/ hadoop

Configure the Hadoop environment variables:

cat > /etc/profile.d/hadoop.sh <<EOF
export HADOOP_HOME=/usr/local/hadoop
export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
EOF

Apply them by sourcing the script (executing it would only set the variables in a subshell):

source /etc/profile.d/hadoop.sh

Create the data directories:

# master
mkdir -pv  /data/hadoop/hdfs/{nn,snn}
# node
mkdir -pv  /data/hadoop/hdfs/dn
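
The nn/snn directories are needed only on the master and dn only on the workers. A small sketch for creating the worker directories remotely from the master (assuming the passwordless root SSH configured earlier):

for h in hadoop-node1 hadoop-node2; do
    ssh root@$h 'mkdir -pv /data/hadoop/hdfs/dn'    # DataNode directory on each worker
done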

 

Configuration on the master node:

Enter the configuration directory:

cd /usr/local/hadoop/etc/hadoop

core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-master:8020</value>
        <final>true</final>
    </property>
</configuration>
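
fs.defaultFS is the address every client and daemon uses to reach the NameNode. Once the environment variables and config files are in place, the value Hadoop actually resolves can be checked like this:

hdfs getconf -confKey fs.defaultFS    # should print hdfs://hadoop-master:8020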

yarn-site.xml

<configuration>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>hadoop-master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>hadoop-master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>hadoop-master:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>hadoop-master:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>hadoop-master:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
</configuration>

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/hadoop/hdfs/nn</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/hadoop/hdfs/dn</value>
    </property>
    <property>
        <name>fs.checkpoint.dir</name>
        <value>file:///data/hadoop/hdfs/snn</value>
    </property>
    <property>
        <name>fs.checkpoint.edits.dir</name>
        <value>file:///data/hadoop/hdfs/snn</value>
    </property>
</configuration>
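
dfs.replication=1 keeps a single copy of each block, which is fine for a two-DataNode test cluster but gives no redundancy. Once a file has been uploaded (see the HDFS section below), its effective replication factor can be verified, for example:

hdfs dfs -stat "%r" /test/fstab    # expect 1 with the setting above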

mapred-site.xml (a stock 2.x tarball ships only mapred-site.xml.template; copy it to this name first)

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Create the master file:

cat > master <<EOF
hadoop-master
EOF

Create the slaves file:

cat > slaves <<EOF
hadoop-node1
hadoop-node2
EOF

 

Annotated notes on common configuration options:

http://blog.51yip.com/hadoop/2020.html

 

On the worker nodes:

Copy the configuration files from the master to each worker:

scp ./* root@hadoop-node1:/usr/local/hadoop/etc/hadoop/
scp ./* root@hadoop-node2:/usr/local/hadoop/etc/hadoop/

Delete the slaves file on the workers; everything else is configured the same as on the master:

rm -rf /usr/local/hadoop/etc/hadoop/slaves

 

Create the log directory:

mkdir /usr/local/hadoop/logs
chmod g+w /usr/local/hadoop/logs/

Change the owner and group (both the real install directory and the symlink):

chown -R hadoop:hadoop /data/hadoop/
cd /usr/local/
chown -R hadoop:hadoop hadoop-2.10.0/
chown -h hadoop:hadoop hadoop

 

Starting and stopping the cluster

 

Format HDFS; once it is formatted the cluster can be started:

su - hadoop
[hadoop@hadoop-master ~]$ hadoop namenode -format

 

Start HDFS first; the output below shows each node and the daemon it runs:

[hadoop@hadoop-master ~]$ start-dfs.sh 
Starting namenodes on [hadoop-master]
hadoop-master: starting namenode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-namenode-hadoop-master.out
hadoop-node2: starting datanode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-datanode-hadoop-node2.out
hadoop-node1: starting datanode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-datanode-hadoop-node1.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-secondarynamenode-hadoop-master.out

Check the processes running on the current node; this command can be used on any node:

~]$ jps
1174 Jps
32632 ResourceManager
32012 NameNode
32220 SecondaryNameNode
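
To see the whole cluster at once, a small sketch (assuming passwordless SSH as the hadoop user and that jps resolves over a non-interactive shell on each node):

for h in hadoop-master hadoop-node1 hadoop-node2; do
    echo "== $h =="
    ssh hadoop@$h jps    # list the Java daemons on each node
done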

 

Then start YARN; the output shows the daemon started on each node:

[hadoop@hadoop-master ~]$ start-yarn.sh 
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.10.0/logs/yarn-hadoop-resourcemanager-hadoop-master.out
hadoop-node2: starting nodemanager, logging to /usr/local/hadoop-2.10.0/logs/yarn-hadoop-nodemanager-hadoop-node2.out
hadoop-node1: starting nodemanager, logging to /usr/local/hadoop-2.10.0/logs/yarn-hadoop-nodemanager-hadoop-node1.out
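
Once the ResourceManager is up, confirm that both NodeManagers registered with it:

yarn node -list    # expect hadoop-node1 and hadoop-node2 in RUNNING state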

Or start everything in one go (start-all.sh is deprecated in 2.x in favor of the two scripts above, but still works):

[hadoop@hadoop-master ~]$ start-all.sh

 

Check the cluster's running state (hdfs dfsadmin is the 2.x form; hadoop dfsadmin still works but is deprecated):

hdfs dfsadmin -report

HDFS overview web UI:

http://10.3.149.20:50070/

YARN cluster web UI:

http://10.3.149.20:8088/cluster

 

Stop the cluster:

stop-dfs.sh
stop-yarn.sh

Or:

stop-all.sh

 

Using the HDFS file system

 

List a directory:

~]$ hdfs dfs -ls /

Create a directory:

~]$ hdfs dfs -mkdir /test

Upload a file:

~]$ hdfs dfs -put /etc/fstab /test/fstab

Check where the file is physically stored: the block can be found under one of the DataNodes' data directories. The default block size is 128 MB; a file larger than that is split into multiple blocks, while a file smaller than 128 MB does not actually occupy a full 128 MB on disk.

]$ cat /data/hadoop/hdfs/dn/current/BP-1469813358-10.3.149.20-1595493741225/current/finalized/subdir0/subdir0/blk_1073741825
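
Rather than digging through the DataNode's directory tree, HDFS can report a file's block layout directly:

hdfs fsck /test/fstab -files -blocks -locations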

Recursive listing:

~]$ hdfs dfs -ls -R /

View a file (it was uploaded to /test/fstab above):

~]$ hdfs dfs -cat /test/fstab

Full command reference:

https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/FileSystemShell.html

 

Word-count example:

The /usr/local/hadoop/share/hadoop/mapreduce directory ships with many example jobs that are handy for testing.

First upload a file to test against (note the leading dash on -mkdir; the /test directory may already exist from the steps above):

hdfs dfs -mkdir /test
hdfs dfs -put /etc/fstab /test/fstab

View the help: running the jar with no arguments prints the list of example programs.

yarn jar hadoop-mapreduce-examples-2.10.0.jar

Run the test, here using the wordcount example:

cd /usr/local/hadoop/share/hadoop/mapreduce
]$ yarn jar hadoop-mapreduce-examples-2.10.0.jar wordcount /test/fstab /test/count

Running jobs can be watched at:

http://10.3.149.20:8088/cluster/apps

View the result of the computation:

]$ hdfs dfs -cat /test/count/part-r-00000
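
The output directory is created by the job itself and must not exist beforehand; the reducer output lands in part-r-* files next to an empty _SUCCESS marker:

hdfs dfs -ls /test/count    # expect _SUCCESS plus part-r-00000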

 

Common YARN commands:

List running applications:

~]$ yarn application -list

List applications in all states, including finished ones:

 ~]$ yarn application -list -appStates=all

Check the status of a specific application:

~]$ yarn application  -status application_1595496103452_0001
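
If log aggregation is enabled (yarn.log-aggregation-enable; off by default), a finished application's full logs can be fetched in one command, using the application id from the status query above:

~]$ yarn logs -applicationId application_1595496103452_0001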