HDFS Environment Setup
Preface
Hadoop grew out of Nutch, an open-source Java search engine. While building Nutch, its developers ran into the limits of running storage and computation on a small number of machines. Around that time Google published its GFS and MapReduce papers, and the Nutch team began rebuilding Nutch's storage and computation model on the ideas in those papers. The first attempt ran on about 20 machines, but it soon became clear that serving the web at scale would require clusters of several thousand nodes. At that point a team from Yahoo joined the effort, split the storage and computation layers out of Nutch, and named the result Hadoop; with Yahoo's help, Hadoop quickly matured into a system with the scalability real applications demanded.
Big Data
Big Data refers to data sets so large that mainstream software tools cannot capture, manage, process, and organize them within a reasonable time into information that supports business decision-making. Big Data is not only about data volume; it is usually characterized by four "V"s: large volume (Volume), high velocity (Velocity), diverse formats (Variety), and low value density (Value).
Volume: the sheer amount of data being produced, processed, and stored; this is the literal, "massive data" sense of Big Data.
Velocity: the speed of processing. Since one major use of Big Data is market prediction, results that arrive too late lose their predictive value, so timeliness is critical; an in-depth analysis of 5 million records, for example, may be allowed only 5 minutes.
Variety: the diversity of data formats, including text, audio/video, web pages, streams, and other structured and unstructured data.
Value: low value density, meaning that only a small fraction of a large data set is actually valuable.
Challenges
As the number of Internet users grows, user data and traffic have long exceeded what a single server can handle. Moore's law says that at constant cost, hardware performance doubles roughly every 18 months, but even so, scaling out (adding machines) is far cheaper than scaling up (buying bigger machines), and this is why distributed systems emerged.
Hadoop Course Outline
ZooKeeper: coordination service for distributed systems.
Flume: distributed log collection.
Kafka: distributed message queue.
Hive: translates HQL (Hive Query Language), a SQL dialect, into MapReduce jobs; commonly used for ETL (see "Programming Hive").
HBase: a column-oriented database on top of distributed HDFS storage, supporting random access over very large tables (on the order of billions of rows by millions of columns).
HDFS: distributed file storage.
MapReduce: distributed computation.
HDFS Installation
Reference:
http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-common/SingleCluster.html
1) Install the JDK and set the JAVA_HOME environment variable (jdk-7u79-linux-x64.rpm)
[root@CentOS ~]# rpm -ivh jdk-7u79-linux-x64.rpm
[root@CentOS ~]# vi /root/.bashrc
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
[root@CentOS ~]# source .bashrc
Tips:
First, set the network interface to obtain its IP address dynamically (DHCP).
Install lrzsz; afterwards, rz uploads a file and sz downloads one:
[root@CentOS ~]# yum install -y lrzsz
[root@CentOS ~]# sz jdk-7u79-linux-x64.rpm
[root@CentOS ~]# rz
2) Disable the firewall
[root@CentOS ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
[root@CentOS ~]# chkconfig iptables off
3) Set the hostname to CentOS
[root@CentOS ~]# vi /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=CentOS
4) Map the hostname to the IP address
[root@CentOS ~]# vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.37.138 CentOS
5) Configure passwordless SSH login (key-based authentication)
SSH (Secure Shell) is a protocol defined by the IETF Network Working Group. It is a security protocol built on the application layer, and is currently one of the most reliable protocols for securing remote login sessions and other network services.
[root@CentOS ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
39:c8:22:dd:e8:25:9d:e8:aa:ac:6b:ac:89:a1:98:eb root@CentOS
The key's randomart image is:
+--[ RSA 2048]----+
| |
| |
| |
| . * o . |
| . * B S |
| + + . |
|o o |
|=* . |
|^E. |
+-----------------+
[root@CentOS ~]# ssh-copy-id CentOS
which is equivalent to:
[root@CentOS ~]# cat .ssh/id_rsa.pub >> .ssh/authorized_keys
[root@CentOS ~]# ssh CentOS
Last login: Mon Jul 23 19:42:25 2018 from centos
6) Configure the basic HDFS environment
1) Upload hadoop-2.6.0_x64.tar.gz, extract it to /usr, and set the HADOOP_HOME environment variable
[root@CentOS ~]# tar -zxf hadoop-2.6.0_x64.tar.gz -C /usr/
[root@CentOS ~]# vi .bashrc
HADOOP_HOME=/usr/hadoop-2.6.0
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
export HADOOP_HOME
[root@CentOS ~]# source .bashrc
2) Edit the Hadoop configuration files (core-site.xml, hdfs-site.xml, slaves)
core-site.xml
[root@CentOS ~]# vi /usr/hadoop-2.6.0/etc/hadoop/core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://CentOS:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop-2.6.0/hadoop-${user.name}</value>
</property>
hdfs-site.xml
[root@CentOS ~]# vi /usr/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
slaves
[root@CentOS ~]# vi /usr/hadoop-2.6.0/etc/hadoop/slaves
CentOS
Note: to inspect the layout of the Hadoop installation directory, install tree:
[root@CentOS ~]# yum install -y tree
7) Start HDFS
1. On first startup, format the NameNode
[root@CentOS ~]# hdfs namenode -format
....
18/07/23 20:06:58 INFO namenode.NNConf: XAttrs enabled? true
18/07/23 20:06:58 INFO namenode.NNConf: Maximum size of an xattr: 16384
18/07/23 20:06:58 INFO namenode.FSImage: Allocated new BlockPoolId: BP-258224419-192.168.37.138-1532347618173
18/07/23 20:06:58 INFO common.Storage: Storage directory /usr/hadoop-2.6.0/hadoop-root/dfs/name has been successfully formatted.
18/07/23 20:06:58 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/07/23 20:06:58 INFO util.ExitUtil: Exiting with status 0
Note: formatting generates the initial fsimage file:
[root@CentOS ~]# tree /usr/hadoop-2.6.0/hadoop-root/
/usr/hadoop-2.6.0/hadoop-root/
└── dfs
└── name
└── current
├── fsimage_0000000000000000000
├── fsimage_0000000000000000000.md5
├── seen_txid
└── VERSION
3 directories, 4 files
2. Run start-dfs.sh to start HDFS
[root@CentOS ~]# start-dfs.sh
Starting namenodes on [CentOS]
CentOS: starting namenode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-namenode-CentOS.out
CentOS: starting datanode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-datanode-CentOS.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
RSA key fingerprint is 02:cc:8c:da:b9:6d:c9:66:6c:e8:93:64:53:8e:a1:af.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-secondarynamenode-CentOS.out
3. Check that it started successfully
[root@CentOS ~]# jps
2700 Jps
2597 SecondaryNameNode
2357 NameNode
2436 DataNode
Alternatively, open http://192.168.37.138:50070 (the NameNode web UI).
If you hit strange errors, check the startup logs:
[root@CentOS ~]# ls /usr/hadoop-2.6.0/logs/
HDFS Architecture
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
① NameNode and DataNode
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
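The block and replication model described above can be illustrated with a small back-of-the-envelope sketch. This is plain Java for illustration only, not Hadoop code; it assumes the Hadoop 2.x default block size of 128 MB (note that this guide sets dfs.replication to 1, while 3 is the usual production default):

```java
// Sketch of how HDFS conceptually splits a file into fixed-size blocks,
// and how replication multiplies the physical storage required.
public class BlockSplitSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, Hadoop 2.x default

    // Number of blocks a file of the given size occupies (last block may be partial).
    static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024;       // a hypothetical 300 MB file
        int replication = 3;                      // the common dfs.replication default
        long blocks = blockCount(fileSize);
        System.out.println(blocks);               // 3 blocks: 128 + 128 + 44 MB
        System.out.println(blocks * replication); // 9 physical block replicas
    }
}
```

The NameNode only stores this block-to-DataNode mapping as metadata; the block contents themselves live on the DataNodes.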
NameNode & Secondary NameNode
Further reading (see hdfs user guide.pdf):
http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
HDFS Shell Commands
[root@CentOS ~]# hdfs dfs -help        # equivalent to: hadoop fs -help
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-help [cmd ...]]
[-ls [-d] [-h] [-R] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] <localsrc> ... <dst>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-tail [-f] <file>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-usage [cmd ...]]
Examples:
hdfs dfs -mkdir -p /demo/src
hdfs dfs -touchz /hello.txt
hadoop fs -appendToFile /root/aa.txt /hello.txt
hadoop fs -cat /hello.txt
hadoop fs -mv /hello.txt /hello.java
hdfs dfs -copyFromLocal /root/hadoop-2.6.0_x64.tar.gz /
hdfs dfs -copyToLocal /hadoop-2.6.0_x64.tar.gz /root/
hdfs dfs -cp /hadoop-2.6.0_x64.tar.gz /demo/src
hdfs dfs -rm -r -f /demo/src
Java HDFS API
1) Set up a Windows development environment
1. Extract hadoop-2.6.0_x64.tar.gz to C:/
2. Set the HADOOP_HOME environment variable
3. Copy winutils.exe and hadoop.dll into the %HADOOP_HOME%/bin directory
4. On Windows, add a hostname mapping for CentOS (in C:\Windows\System32\drivers\etc)
5. Restart the IDE
2) Set up the HDFS development environment (omitted)
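As a taste of the API this section introduces, the following is a minimal, hypothetical client sketch. It assumes the hadoop-client 2.6.x jars are on the classpath and that the single-node cluster configured above is reachable at hdfs://CentOS:9000; the path /demo/hello.txt is just an example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client sketch: connect, write a file, check it exists.
public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://CentOS:9000"); // matches core-site.xml above
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/demo/hello.txt");        // example path, not from the guide
        try (FSDataOutputStream out = fs.create(path, true)) { // overwrite if present
            out.writeBytes("hello hdfs\n");
        }
        System.out.println(fs.exists(path)); // "true" if the write succeeded
        fs.close();
    }
}
```

The same FileSystem handle also exposes open(), delete(), mkdirs(), and listStatus(), mirroring the shell commands shown earlier.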