Building a Fully Distributed Hadoop Cluster with Docker Containers

Overview

  • Physical machine: Windows 10
  • Host: a CentOS 7 virtual machine with the Docker service installed
  • Hadoop cluster nodes: three CentOS 7 containers, hadoop1, hadoop2, and hadoop3
  • Components:
    • Container image: CentOS 7
    • Docker CE 24.0.7
    • JDK 1.8.0_181
    • Hadoop 3.1.3

1. Create the virtual machine

Install CentOS 7; this virtual machine will serve as the Docker host.

2. Install Docker

(1) Install the Docker service
yum -y install docker-ce
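
On a stock CentOS 7 system, docker-ce is usually not in the default repositories, so the yum command above may fail with "No package docker-ce available". A common prerequisite (an extra step, not part of the original write-up) is to add Docker's official repository first and then re-run the install:

yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo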
(2) Start the Docker service
systemctl start docker
systemctl status docker # check the service status
docker version # check the version

docker version output (abridged):
Client: Docker Engine - Community
 Version:           24.0.7
 ...
Server: Docker Engine - Community
 Engine:
  Version:          24.0.7
  ...

3. Build the image

(1) Pull the base image
docker pull centos:7
docker images # list local images

REPOSITORY   TAG   IMAGE ID       CREATED       SIZE
centos       7     eeb6ee3f44bd   2 years ago   204MB

(2) Build the image
a. Create the Dockerfile
vi Dockerfile

Dockerfile
FROM centos:7
MAINTAINER zyz

RUN sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-*
RUN sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://mirrors.aliyun.com|g' /etc/yum.repos.d/CentOS-*
RUN yum makecache
RUN yum update -y

RUN yum install -y openssh-server sudo
RUN sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config
RUN yum install -y openssh-clients

RUN echo "root:root" | chpasswd
RUN echo "root   ALL=(ALL)       ALL" >> /etc/sudoers
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -N ""
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ""

RUN mkdir /var/run/sshd
EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]

# FROM: the base image this image is built from
# MAINTAINER: the author (maintainer) of the image
# Block 2: configure yum (switch to a faster mirror, rebuild the cache, update packages)
# Block 3: install the SSH server and client
# Block 4: set the root password and generate the SSH host keys
# Block 5: start sshd and expose the default SSH port 22

b. Build the image
docker build -t centos7-ssh .

# "." is the build context, i.e. the directory containing the Dockerfile

docker images

REPOSITORY    TAG      IMAGE ID       CREATED        SIZE
centos7-ssh   latest   d39095d60198   17 hours ago   1.42GB

4. Create the containers

(1) Create a bridge network
docker network create hadoop
docker network ls # list networks

NETWORK ID     NAME     DRIVER   SCOPE
371545b29a8d   hadoop   bridge   local

(2) Create the containers

docker run -itd --network hadoop --name hadoop1 -p 50070:50070 -p 8088:8088 centos7-ssh
docker run -itd --network hadoop --name hadoop2 centos7-ssh
docker run -itd --network hadoop --name hadoop3 centos7-ssh

# -i: keep STDIN open (interactive), -t: allocate a pseudo-terminal, -d: run in the background (detached)
# --network: the network the container joins, --name: the container's name, -p: host-to-container port mapping, centos7-ssh: the image to create the container from

(3) Check the containers
docker ps # list running containers

docker ps output:
CONTAINER ID   IMAGE         COMMAND               CREATED        STATUS        PORTS                                                                                              NAMES
71c3b5fa9846   centos7-ssh   "/usr/sbin/sshd -D"   12 hours ago   Up 10 hours   22/tcp                                                                                             hadoop3
a16e70f1373e   centos7-ssh   "/usr/sbin/sshd -D"   17 hours ago   Up 10 hours   22/tcp                                                                                             hadoop2
bac46cc68c73   centos7-ssh   "/usr/sbin/sshd -D"   17 hours ago   Up 10 hours   0.0.0.0:8088->8088/tcp, :::8088->8088/tcp, 22/tcp, 0.0.0.0:50070->50070/tcp, :::50070->50070/tcp   hadoop1

(4) Inspect the network
docker network inspect hadoop

Network details (Containers section):
"Containers": {
            "71c3b5fa98463d995affb206496c04ee6f2fdaedda15240dc490f79f8cad23f9": {
                "Name": "hadoop3",
                "EndpointID": "2710a60e5ef5ea4e590ed2faff7a9db5eca2e5ea960867f05cc818a665a3c4bf",
                "MacAddress": "02:42:ac:13:00:04",
                "IPv4Address": "172.19.0.4/16",
                "IPv6Address": ""
            },
            "a16e70f1373ef80a53bee0fa0af01b861137ada82787bc909d708fe8774a6651": {
                "Name": "hadoop2",
                "EndpointID": "eee50c5aae298154e0811b6dfdc60fc46e9bf2d4a097b67b73582150748fd890",
                "MacAddress": "02:42:ac:13:00:03",
                "IPv4Address": "172.19.0.3/16",
                "IPv6Address": ""
            },
            "bac46cc68c739292fa36b4a1b92ed9f0347f8d622a7129922b6ab7d009b618f9": {
                "Name": "hadoop1",
                "EndpointID": "209ace260c8f2161e0aa805f3b034c66b7892fb8fa9a2a1f7d903dc9100aff3f",
                "MacAddress": "02:42:ac:13:00:02",
                "IPv4Address": "172.19.0.2/16",
                "IPv6Address": ""
            }
        }

5. Install software in the containers

5.1 Log in to the containers

Open three terminals and attach one to each container:
docker exec -it hadoop1 bash
docker exec -it hadoop2 bash
docker exec -it hadoop3 bash

5.2 Passwordless SSH

  • Passwordless SSH from hadoop1
    (1) Generate a key pair
    ssh-keygen # press Enter at every prompt
    (2) Copy the public key to every node
    ssh-copy-id hadoop1
    ssh-copy-id hadoop2
    ssh-copy-id hadoop3
    (3) Test
    ssh hadoop1 # should log in without a password; exit to return
    ssh hadoop2 # should log in without a password; exit to return
    ssh hadoop3 # should log in without a password; exit to return
  • Passwordless SSH from hadoop2 (same steps, run inside hadoop2; see the sketch after this list)
  • Passwordless SSH from hadoop3 (same steps, run inside hadoop3)
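
The repeat on hadoop2 and hadoop3 is the same three steps run inside each of those containers. As a sketch, for hadoop2:

docker exec -it hadoop2 bash  # attach to hadoop2
ssh-keygen                    # press Enter at every prompt
ssh-copy-id hadoop1
ssh-copy-id hadoop2
ssh-copy-id hadoop3
ssh hadoop1                   # should log in without a password; exit to return

Name resolution between the containers works because they all joined the user-defined hadoop network, where Docker's embedded DNS resolves container names.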

5.3 Install the JDK

(1) Download
Copy the JDK archive from the host into the container:
docker cp /package/jdk-8u181-linux-x64.tar.gz hadoop1:/package/

# the /package directory must be created in advance
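
If /package does not yet exist inside the container, docker cp will refuse to copy into it. A minimal preparation step (assuming the archive already sits in /package on the host) is:

docker exec hadoop1 mkdir -p /package  # create the copy target inside hadoop1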

(2) Install
tar -zxvf /package/jdk-8u181-linux-x64.tar.gz -C /software/

# the /software directory must be created in advance

(3) Configure
vi /etc/bashrc

export JAVA_HOME=/software/jdk1.8.0_181
export PATH=$PATH:$JAVA_HOME/bin

source /etc/bashrc # apply immediately
(4) Test
java -version

java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

(5) Copy to the other nodes
a. Copy to hadoop2
scp -r /software/jdk1.8.0_181/ hadoop2:/software/
scp /etc/bashrc hadoop2:/etc
On hadoop2, run: source /etc/bashrc
b. Copy to hadoop3
scp -r /software/jdk1.8.0_181/ hadoop3:/software/
scp /etc/bashrc hadoop3:/etc
On hadoop3, run: source /etc/bashrc
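
scp -r copies into an existing directory, so /software presumably has to exist on hadoop2 and hadoop3 first, just as the note for hadoop1 above says. A quick way to create it remotely (relying on the passwordless SSH set up in 5.2):

ssh hadoop2 "mkdir -p /software"
ssh hadoop3 "mkdir -p /software"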

5.4 Install Hadoop

(1) Download
Copy the Hadoop archive from the host into the container:
docker cp /package/hadoop-3.1.3.tar.gz hadoop1:/package/

(2) Install
tar -zxvf /package/hadoop-3.1.3.tar.gz -C /software/

(3) Environment configuration
vi /etc/bashrc

export JAVA_HOME=/software/jdk1.8.0_181
export HADOOP_HOME=/software/hadoop-3.1.3 # added
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin # modified

# run the HDFS and YARN daemons as root
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

source /etc/bashrc # apply immediately
(4) Test
hadoop version

Hadoop 3.1.3
...

(5) Hadoop configuration

  • hadoop-env.sh
    vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/software/jdk1.8.0_181
  • core-site.xml
    vi $HADOOP_HOME/etc/hadoop/core-site.xml
core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop1:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/software/hadoop-3.1.3/data</value>
    </property>
</configuration>

<!-- The data directory does not need to be created in advance; it is created automatically when the NameNode is formatted -->
  • mapred-site.xml
    vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
mapred-site.xml
<configuration>
<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>

<!-- Fix for "Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster" -->
<property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<!--
Error: "is running 221518336B beyond the 'VIRTUAL' memory limit. Current usage: 74.0 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container."
The container requested more memory than allowed and was killed by YARN's memory check.
Fix: raise the map/reduce memory settings below. -->
<property>
        <name>mapreduce.map.memory.mb</name>
        <value>1536</value>
</property>
<property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx1024M</value>
</property>
<property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>3072</value>
</property>
<property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx2560M</value>
</property>
</configuration>

  • hdfs-site.xml
    vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
hdfs-site.xml
<configuration>
<property>
        <name>dfs.replication</name>
        <value>2</value>
</property>
<property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop1:50070</value>
</property>
</configuration>

  • yarn-site.xml
    vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
<property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop1</value>
</property>
<property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>hadoop1:8088</value>
</property>
<!-- Fix for "Error: Could not find or load main class org.apache.hadoop.mapred.YarnChild" -->
<property>
<name>yarn.application.classpath</name>
<value>
    ${HADOOP_HOME}/etc/hadoop,
    ${HADOOP_HOME}/share/hadoop/common/*,
    ${HADOOP_HOME}/share/hadoop/common/lib/*,
    ${HADOOP_HOME}/share/hadoop/hdfs/*,
    ${HADOOP_HOME}/share/hadoop/hdfs/lib/*,
    ${HADOOP_HOME}/share/hadoop/mapreduce/*,
    ${HADOOP_HOME}/share/hadoop/mapreduce/lib/*,
    ${HADOOP_HOME}/share/hadoop/yarn/*,
    ${HADOOP_HOME}/share/hadoop/yarn/lib/*
</value>
</property>

</configuration>

  • workers
    vi $HADOOP_HOME/etc/hadoop/workers
hadoop1
hadoop2
hadoop3

(6) Copy Hadoop to the other nodes
scp -r /software/hadoop-3.1.3/ hadoop2:/software/
scp -r /software/hadoop-3.1.3/ hadoop3:/software/
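
One thing the steps above update only on hadoop1 is /etc/bashrc (the HADOOP_HOME, PATH, and HDFS/YARN user variables from step (3)). It likely needs to reach the other nodes as well, mirroring what was done for the JDK:

scp /etc/bashrc hadoop2:/etc
scp /etc/bashrc hadoop3:/etc
# then run "source /etc/bashrc" on hadoop2 and hadoop3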

(7) Format the NameNode (run on hadoop1)
hdfs namenode -format

(8) Start the cluster
start-all.sh
jps

jps output on each node:
# hadoop1
3488 NodeManager
2881 DataNode
5425 Jps
3369 ResourceManager
2763 NameNode
3051 SecondaryNameNode
# hadoop2
630 DataNode
1064 Jps
732 NodeManager
# hadoop3
993 Jps
599 DataNode
701 NodeManager

HDFS web UI: http://10.10.0.100:50070 (10.10.0.100 is the CentOS 7 host VM; ports 50070 and 8088 were published from hadoop1 in step 4)
YARN web UI: http://10.10.0.100:8088

(9) WordCount test
a. Prepare the data
vi words1.txt

hello michael
hello julia

vi words2.txt

hello michael is julia father

b. Upload to HDFS
hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put *.txt /wordcount/input/
hdfs dfs -ls -R /wordcount

drwxr-xr-x - root supergroup 0 2023-12-09 14:25 /wordcount/input
-rw-r--r-- 2 root supergroup 26 2023-12-09 14:25 /wordcount/input/words1.txt
-rw-r--r-- 2 root supergroup 30 2023-12-09 14:25 /wordcount/input/words2.txt

c. Run the word-count example
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /wordcount/input /wordcount/output
hadoop fs -ls -R /wordcount/output/

-rw-r--r-- 2 root supergroup 0 2023-12-09 16:11 /wordcount/output/_SUCCESS
-rw-r--r-- 2 root supergroup 40 2023-12-09 16:11 /wordcount/output/part-r-00000

hadoop fs -cat /wordcount/output/part-r-00000

2023-12-09 16:14:54,373 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
father 1
hello 3
is 1
julia 2
michael 2

(10) Done
