HDFS Environment Setup

Preface

Hadoop grew out of Nutch, an open-source Java search engine. While building Nutch, its developers ran into problems running storage and computation across even a small number of machines. Around that time Google published its GFS and MapReduce papers, and people began rebuilding Nutch's storage and computation model around the ideas in those two papers. The first attempts ran on about 20 machines, but it soon became clear that serving the web with high scalability would require clusters of several thousand nodes. At that point the Yahoo team joined the effort; the storage and computation layers were split out of Nutch and named Hadoop, and with Yahoo's help Hadoop quickly matured into a system that could deliver the scalability real applications demanded.

Big Data

Big Data refers to data sets so large that mainstream software tools cannot capture, manage, process, and organize them within a reasonable time into information that helps an enterprise make better business decisions. Big data is not just about quantity; it is usually characterized by four Vs: large Volume, high Velocity, wide Variety, and low Value density.

Volume: the sheer amount of data being produced, processed, and stored. This is big data in the literal sense: massive data sets.

Velocity: the timeliness of processing. Since one major use of big data is market prediction, results that arrive too late lose their predictive value, so turnaround time is critical; an in-depth analysis of five million records may be allowed only five minutes.

Variety: the diversity of data forms, including text, audio and video, web pages, and streams, spanning both structured and unstructured data.

Value: low value density, meaning that only a small fraction of a large data set is actually valuable.

The Challenge

As the number of internet users grows, user data and traffic volumes have long since exceeded what a single server can handle. Moore's law says that, at constant cost, hardware performance doubles roughly every 18 months. Even so, scaling out (adding more machines) is far cheaper than scaling up (buying bigger machines), and distributed systems arose from exactly this trade-off.

Hadoop Course Topics

ZooKeeper: coordination service for distributed systems.

Flume: distributed log collection.

Kafka: distributed message queue.

Hive: HQL (Hive Query Language); translates SQL-style queries into MapReduce jobs; commonly used for ETL (see "Programming Hive").

HBase: a column-oriented database built on distributed HDFS storage; random access over very large data sets, on the order of a billion rows by a million columns.

HDFS: distributed file storage.

MapReduce: distributed computation.

 

Installing the HDFS Environment

Reference:

http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-common/SingleCluster.html

 

1) Install the JDK and configure the JAVA_HOME environment variable (jdk-7u79-linux-x64.rpm)

 

[root@CentOS ~]# rpm -ivh jdk-7u79-linux-x64.rpm

[root@CentOS ~]# vi /root/.bashrc

JAVA_HOME=/usr/java/latest

PATH=$PATH:$JAVA_HOME/bin

CLASSPATH=.

export JAVA_HOME

export PATH

export CLASSPATH

[root@CentOS ~]# source .bashrc

 

Tips:

First, set the machine's IP address to be obtained dynamically (DHCP).

Install lrzsz; afterwards, `rz` uploads a file from your desktop to the server and `sz` sends a file back down:

[root@CentOS ~]# yum install -y lrzsz

[root@CentOS ~]# sz jdk-7u79-linux-x64.rpm

[root@CentOS ~]# rz

2) Disable the firewall

[root@CentOS ~]# service iptables stop

iptables: Setting chains to policy ACCEPT: filter          [  OK  ]

iptables: Flushing firewall rules:                         [  OK  ]

iptables: Unloading modules:                               [  OK  ]

[root@CentOS ~]# chkconfig iptables off

3) Set the hostname to CentOS

[root@CentOS ~]# vi /etc/sysconfig/network

NETWORKING=yes

HOSTNAME=CentOS

4) Map the hostname to the IP address

[root@CentOS ~]# vi /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4

::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.37.138 CentOS

5) Configure passwordless SSH login (key-based authentication)

SSH (Secure Shell) is a protocol defined by the IETF Network Working Group. It is a security protocol built on the application layer, and it is currently one of the more reliable protocols for securing remote login sessions and other network services.

 

[root@CentOS ~]# ssh-keygen -t rsa

Generating public/private rsa key pair.

Enter file in which to save the key (/root/.ssh/id_rsa):

Enter passphrase (empty for no passphrase):

Enter same passphrase again:

Your identification has been saved in /root/.ssh/id_rsa.

Your public key has been saved in /root/.ssh/id_rsa.pub.

The key fingerprint is:

39:c8:22:dd:e8:25:9d:e8:aa:ac:6b:ac:89:a1:98:eb root@CentOS

The key's randomart image is:

+--[ RSA 2048]----+

|                 |

|                 |

|                 |

|   . * o .       |

|  . * B S        |

|   + +   .       |

|o   o            |

|=* .             |

|^E.              |

+-----------------+

[root@CentOS ~]# ssh-copy-id CentOS

which is equivalent to:

[root@CentOS ~]# cat .ssh/id_rsa.pub  >> .ssh/authorized_keys

[root@CentOS ~]# ssh CentOS

Last login: Mon Jul 23 19:42:25 2018 from centos

 

6) Configure the basic HDFS environment

1) Upload hadoop-2.6.0_x64.tar.gz, extract it to /usr, and set the HADOOP_HOME environment variable

[root@CentOS ~]# tar -zxf hadoop-2.6.0_x64.tar.gz -C /usr/

[root@CentOS ~]# vi .bashrc

HADOOP_HOME=/usr/hadoop-2.6.0

JAVA_HOME=/usr/java/latest

PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

CLASSPATH=.

export JAVA_HOME

export PATH

export CLASSPATH

export HADOOP_HOME

[root@CentOS ~]# source .bashrc

2) Edit the Hadoop configuration files (core-site.xml, hdfs-site.xml, slaves)

core-site.xml

[root@CentOS ~]# vi /usr/hadoop-2.6.0/etc/hadoop/core-site.xml

 <property>

        <name>fs.defaultFS</name>

        <value>hdfs://CentOS:9000</value>

 </property>

 <property>

        <name>hadoop.tmp.dir</name>

        <value>/usr/hadoop-2.6.0/hadoop-${user.name}</value>

 </property>

hdfs-site.xml

[root@CentOS ~]# vi /usr/hadoop-2.6.0/etc/hadoop/hdfs-site.xml

<property>

        <name>dfs.replication</name>

        <value>1</value>

</property>
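dfs.replication controls how many copies of each block HDFS keeps; it is 1 here because this is a single-node cluster. The storage cost of replication can be sketched with back-of-the-envelope arithmetic (the 10 GB figure below is an invented example, not from this setup):

```shell
# Raw disk consumed = logical file size x replication factor.
# With the HDFS default of replication=3, a 10 GB file occupies 30 GB of raw
# disk across the cluster; with replication=1 on this single node, just 10 GB.
logical_gb=10
replication=3
echo $((logical_gb * replication))
```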

slaves

[root@CentOS ~]# vi /usr/hadoop-2.6.0/etc/hadoop/slaves

CentOS

Note: install tree to inspect the Hadoop installation directory layout:

[root@CentOS ~]# yum install -y tree

 

7) Start HDFS

1. On first use, format the NameNode:

[root@CentOS ~]# hdfs namenode -format

....

18/07/23 20:06:58 INFO namenode.NNConf: XAttrs enabled? true

18/07/23 20:06:58 INFO namenode.NNConf: Maximum size of an xattr: 16384

18/07/23 20:06:58 INFO namenode.FSImage: Allocated new BlockPoolId: BP-258224419-192.168.37.138-1532347618173

18/07/23 20:06:58 INFO common.Storage: Storage directory /usr/hadoop-2.6.0/hadoop-root/dfs/name has been successfully formatted.

18/07/23 20:06:58 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0

18/07/23 20:06:58 INFO util.ExitUtil: Exiting with status 0

Note: formatting generates the initial fsimage file:

[root@CentOS ~]# tree  /usr/hadoop-2.6.0/hadoop-root/

/usr/hadoop-2.6.0/hadoop-root/

└── dfs

    └── name

        └── current

            ├── fsimage_0000000000000000000

            ├── fsimage_0000000000000000000.md5

            ├── seen_txid

            └── VERSION

 

3 directories, 4 files

2. Start HDFS with the start script:

[root@CentOS ~]# start-dfs.sh

Starting namenodes on [CentOS]

CentOS: starting namenode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-namenode-CentOS.out

CentOS: starting datanode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-datanode-CentOS.out

Starting secondary namenodes [0.0.0.0]

The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.

RSA key fingerprint is 02:cc:8c:da:b9:6d:c9:66:6c:e8:93:64:53:8e:a1:af.

Are you sure you want to continue connecting (yes/no)? yes

0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.

0.0.0.0: starting secondarynamenode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-secondarynamenode-CentOS.out

3. Verify that everything is running:

[root@CentOS ~]# jps

2700 Jps

2597 SecondaryNameNode

2357 NameNode

2436 DataNode

Alternatively, open http://192.168.37.138:50070 (the NameNode web UI).

If you run into puzzling errors, check the startup logs:

[root@CentOS ~]# ls /usr/hadoop-2.6.0/logs/

HDFS Architecture

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

NameNode & DataNode

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
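The block splitting described above can be sketched numerically (128 MB is the default dfs.blocksize in Hadoop 2.x; the 300 MB file size is an invented example):

```shell
# Number of blocks = ceil(file size / block size), computed here with
# integer arithmetic. A 300 MB file with the 128 MB default block size
# becomes 3 blocks: two full 128 MB blocks plus one 44 MB block.
size_mb=300
block_mb=128
blocks=$(( (size_mb + block_mb - 1) / block_mb ))
echo "$blocks"
```

On a running cluster, `hdfs fsck <path> -files -blocks -locations` reports the actual block-to-DataNode mapping the NameNode maintains.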

 

NameNode & Secondary NameNode

 

Further reading (see the HDFS User Guide):

http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

 

HDFS Shell Commands

[root@CentOS ~]# hdfs dfs -help        (equivalently: hadoop fs -help)

Usage: hadoop fs [generic options]

        [-appendToFile <localsrc> ... <dst>]

        [-cat [-ignoreCrc] <src> ...]

        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]

        [-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]

        [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]

        [-cp [-f] [-p | -p[topax]] <src> ... <dst>]

       

        [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]

        [-help [cmd ...]]

        [-ls [-d] [-h] [-R] [<path> ...]]

        [-mkdir [-p] <path> ...]

        [-moveFromLocal <localsrc> ... <dst>]

        [-moveToLocal <src> <localdst>]

        [-mv <src> ... <dst>]

        [-put [-f] [-p] [-l] <localsrc> ... <dst>]

        [-rm [-f] [-r|-R] [-skipTrash] <src> ...]

        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]

        [-tail [-f] <file>]

        [-text [-ignoreCrc] <src> ...]

        [-touchz <path> ...]

        [-usage [cmd ...]]

Examples:

hdfs dfs -mkdir -p /demo/src

hdfs dfs -touchz /hello.txt

hadoop fs -appendToFile /root/aa.txt /hello.txt

hadoop fs -cat /hello.txt

hadoop fs -mv /hello.txt /hello.java

hdfs dfs -copyFromLocal /root/hadoop-2.6.0_x64.tar.gz /

hdfs dfs -copyToLocal /hadoop-2.6.0_x64.tar.gz /root/

hdfs dfs -cp /hadoop-2.6.0_x64.tar.gz /demo/src

hdfs dfs -rm -r -f /demo/src

 

Java HDFS API

1) Set up a Windows development environment

1. Extract hadoop-2.6.0_x64.tar.gz to C:/

2. Set the HADOOP_HOME environment variable

3. Copy winutils.exe and hadoop.dll into %HADOOP_HOME%/bin

4. Add the CentOS hostname/IP mapping to the Windows hosts file (C:\Windows\System32\drivers\etc)

5. Restart the IDE
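If the Java project is built with Maven, the client-side dependency is typically declared as below. This is a sketch that assumes the same 2.6.0 version installed on the server; the hadoop-client artifact pulls in the HDFS and MapReduce client libraries:

```xml
<!-- Hadoop client libraries matching the server-side 2.6.0 install -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.0</version>
</dependency>
```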

 

2) Set up the HDFS development environment (omitted)

 

posted @ 2018-07-25 17:31  henu小白