Review Notes | Lecture 2: Hadoop (Big Data Processing Technology)
What is Hadoop
• Apache open source software framework **designed for distributed storage and processing** of large datasets on **large clusters** of commodity hardware
– Large datasets → terabytes or petabytes of data
– Large clusters → hundreds or thousands of nodes
• Hadoop is developed in Java and supports C/C++, Java, Python, etc.
A Brief History of Apache Hadoop
• Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
• However, its creators realized that their architecture wouldn’t scale to the billions of pages on the Web.
• Google published a paper in 2003 that described the architecture of Google’s distributed file system, called GFS.
• In 2004, Nutch’s developers set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS).
• In 2004, Google published the paper that introduced MapReduce.
• Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and all the major Nutch algorithms were then ported to run using MapReduce and NDFS.
• In 2006, they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale.
• By 2007, Hadoop was being used by many other companies besides Yahoo!, such as Facebook, and the New York Times.
• In April 2008, Hadoop broke a world record to become the fastest system to sort an entire terabyte of data. Running on a 910-node cluster, Hadoop sorted 1 terabyte in 209 seconds (just under 3.5 minutes), beating the previous year’s winner of 297 seconds.
• Then, in April 2009, it was announced that a team at Yahoo! had used Hadoop to sort 1 terabyte in 62 seconds.
Motivations for Hadoop
• What were the limitations of earlier large-scale computing?
• What requirements should an alternative approach have?
• How does Hadoop address those requirements?
Early Large Scale Computing
• Historically, computing was processor-bound
– Relatively small volumes of data
– Complex computation performed on that data
• Advances in computer technology have historically centered on improving the power of a single machine
• CPU advances – Moore's Law: in 1965, Gordon Moore postulated that the number of transistors on a microchip would double roughly every two years
• Single-core limit: single-core computing cannot meet current computing demands
Distributed Systems
Allows developers to use multiple machines for a single task
Distributed System: Problems
• Programming on a distributed system is much more complex
– Synchronizing data exchange
– Managing finite bandwidth
– Controlling computation time is complex
• Distributed systems must be designed with the expectation of failure
Requirements for Hadoop
• Must be easy to use
• Must support partial failure
• Must be scalable
• Component recovery: if a component fails, it should be able to recover without restarting the whole system
– Component failure or recovery during a job must not affect the final output
• Scalability: adding resources should increase load capacity
Partial Failures
Failure of a single component must not cause the failure of the whole system - only a degradation of the application performance
Failure should not result in the loss of any data
Design Principles of Hadoop
big data
parallel computation
commodity hardware
automatic parallelization & distribution
fault tolerance and automatic recovery
simple programming abstraction
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware
– A large number of low-end, cheap machines working in parallel to solve computing problems
– This is in contrast to parallel databases, which use a small number of high-end, expensive machines
• Automatic parallelization and distribution
– Hidden from the end user
• Fault tolerance and automatic recovery
– Nodes/tasks may fail and recover automatically
• Clean and simple programming abstraction
– Users provide only two functions, "map" and "reduce"
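The "map and reduce" programming abstraction can be sketched in plain Python. This is a toy, single-machine simulation of the model, not Hadoop's actual API: the user supplies only a map function and a reduce function, and the "framework" handles the shuffle/sort and grouping that Hadoop automates across a cluster.

```python
from itertools import groupby
from operator import itemgetter

# User-supplied map function: emit (word, 1) for every word in a line.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# User-supplied reduce function: sum the counts for one word.
def reduce_fn(word, counts):
    return (word, sum(counts))

# Toy "framework": run map over all inputs, shuffle (sort and group by key),
# then run reduce once per key -- the part Hadoop does across a cluster.
def run_mapreduce(lines, map_fn, reduce_fn):
    pairs = [kv for line in lines for kv in map_fn(line)]
    pairs.sort(key=itemgetter(0))  # shuffle/sort phase
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

result = run_mapreduce(["big data", "big clusters"], map_fn, reduce_fn)
print(result)  # [('big', 2), ('clusters', 1), ('data', 1)]
```

Note how the user code never mentions threads, nodes, or failures: parallelization and recovery stay hidden behind the two functions, which is exactly the "simple programming abstraction" principle above.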
Hadoop Core
Features of Hadoop
• Reliability • Hadoop infrastructure has inbuilt fault tolerance features and hence, Hadoop is highly reliable.
• Economical • Hadoop uses commodity hardware (like your PC, laptop).
• Scalability • Hadoop is highly scalable and has the inbuilt capability of integrating seamlessly with cloud-based services.
• Flexibility • Hadoop is very flexible in terms of the ability to deal with all kinds of data.
Hadoop 3.x is still in early access releases and has not yet been sufficiently tested. Hadoop 2.x is recommended.
Apache Hadoop
• NameNode high availability – the NameNode is no longer a single point of failure
• HDFS Federation – horizontal scalability for the NameNode
• YARN – the ability to process the terabytes and petabytes of data available in HDFS with non-MapReduce applications such as Storm, Giraph, etc.
Hadoop 3.x has not yet been sufficiently tested by the HBase community for production use cases. Hadoop 2.x is recommended.
Hadoop Ecosystem
The Hadoop ecosystem is neither a programming language nor a service; it is a platform or framework for solving big data problems. You can think of it as a suite that encompasses a number of services (storing, analyzing, and maintaining data).
YARN → Yet Another Resource Negotiator
Ambari → provisioning, monitoring, and maintaining clusters
Hadoop Ecosystem: Hive
• Facebook created HIVE for people who are fluent with SQL.
Thus, HIVE makes them feel at home while working in a Hadoop Ecosystem.
• Basically, HIVE is a data warehousing component that reads, writes, and manages large data sets in a distributed environment through an SQL-like interface.
• Hive's query language is called Hive Query Language (HQL), and it is very similar to SQL.
• Hive serves both purposes: large data set processing (batch query processing) and real-time processing (interactive query processing).
HIVE + SQL = HQL
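Because HQL is so close to SQL, any standard SQL engine can illustrate the kind of query Hive runs. A minimal sketch using Python's built-in sqlite3 as a stand-in (the table and column names are invented for illustration; in Hive this same SELECT would be valid HQL, but compiled into a distributed job over HDFS data rather than executed locally):

```python
import sqlite3

# In-memory table standing in for a Hive table over HDFS files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")])

# The same SELECT would be valid HQL; Hive would run it as batch jobs.
rows = conn.execute("""
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
""").fetchall()
print(rows)  # [('/home', 2), ('/cart', 1)]
```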
Hadoop Ecosystem: Pig
• Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution environment. You can think of them as analogous to Java and the JVM.
• Pig Latin is accessible to people who do not come from a programming background.
• Behind every Pig job, a MapReduce job performs the execution.
• The compiler internally converts Pig Latin into MapReduce, producing a sequential set of MapReduce jobs.
• Pig gives you a platform for building ETL (Extract, Transform, Load) data flows and for processing and analyzing huge data sets.
In Pig, a LOAD command first loads the data. Various operations are then applied to it: grouping, filtering, joining, sorting, etc. Finally, you can either dump the result to the screen or store it back in HDFS.
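The load → transform → store flow can be mimicked in plain Python. This is a toy stand-in for a Pig Latin script, with invented field names; the equivalent Pig statements are shown in the comments:

```python
# Toy stand-in for a Pig Latin script:
#   records = LOAD 'logs' AS (user, bytes);
#   big     = FILTER records BY bytes > 100;
#   grouped = GROUP big BY user;
#   totals  = FOREACH grouped GENERATE group, SUM(big.bytes);
#   DUMP totals;
records = [("alice", 120), ("bob", 80), ("alice", 300), ("bob", 150)]

big = [(user, b) for user, b in records if b > 100]  # FILTER

totals = {}                                          # GROUP + SUM
for user, b in big:
    totals[user] = totals.get(user, 0) + b

print(sorted(totals.items()))  # [('alice', 420), ('bob', 150)]
```

In real Pig, each of these steps would be compiled into stages of MapReduce jobs rather than running in local memory.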
Hadoop Ecosystem: Mahout
• Mahout provides an environment for creating machine learning applications which are scalable.
• Mahout provides a command line to invoke various algorithms. It has a predefined set of library which already contains different inbuilt algorithms for different use cases.
• Collaborative filtering: Mahout mines user behaviors, patterns, and characteristics, and based on these it predicts and makes recommendations to users. The typical use case is an e-commerce website.
• Clustering: it organizes similar groups of data together; for example, articles can be grouped into blogs, news, research papers, etc.
• Classification: it classifies and categorizes data into various subgroups.
• Frequent item set mining: it mines which objects are likely to appear together and makes suggestions.
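Item-based collaborative filtering of the kind Mahout implements can be sketched at toy scale: count how often items co-occur across users' histories, then recommend the items that co-occur most with what a user already has. The data here is invented for illustration, and Mahout would run this counting as distributed jobs:

```python
from collections import Counter
from itertools import combinations

# Each user's purchase history (toy data).
histories = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
]

# Count item co-occurrences across all users (both directions).
cooccur = Counter()
for items in histories:
    for a, b in combinations(sorted(items), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def recommend(owned, k=1):
    """Recommend the k items that co-occur most with the owned items."""
    scores = Counter()
    for item in owned:
        for (a, b), n in cooccur.items():
            if a == item and b not in owned:
                scores[b] += n
    return [item for item, _ in scores.most_common(k)]

print(recommend({"milk"}))  # ['bread']
```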
Hadoop Ecosystem: Spark
• Spark is a framework for real time data analytics in a distributed computing environment.
• Spark is written in Scala and was originally developed at the University of California, Berkeley.
• It executes in-memory computations to increase the speed of data processing over MapReduce.
• By exploiting in-memory computations and other optimizations, it can be up to 100x faster than Hadoop MapReduce for large-scale data processing. It therefore demands more memory and processing power than MapReduce.
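Spark's speedup comes largely from keeping intermediate results in memory between stages, instead of writing them back to disk the way MapReduce does. A toy illustration of lazy transformations plus an in-memory cache (this imitates the spirit of Spark's RDD, not the real PySpark API):

```python
class ToyRDD:
    """Toy imitation of Spark's RDD: transformations are lazy,
    and cache() keeps the computed result in memory for reuse."""

    def __init__(self, compute):
        self._compute = compute  # thunk that produces the data
        self._cached = None

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._cached = self._compute()  # materialize once, keep in memory
        return self

    def collect(self):
        return self._cached if self._cached is not None else self._compute()

nums = ToyRDD(lambda: list(range(5)))
squares = nums.map(lambda x: x * x).cache()       # computed once, reused below
print(squares.collect())                          # [0, 1, 4, 9, 16]
print(squares.filter(lambda x: x > 5).collect())  # [9, 16]
```

Without the `cache()` call, every downstream `collect()` would recompute the whole chain from scratch; with it, later stages start from the in-memory result, which is the effect Spark exploits at cluster scale.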
Hadoop Ecosystem: HBase
• HBase is an open source, non-relational distributed database. In other words, it is a NoSQL database.
• It is modelled after Google's BigTable, a distributed storage system designed to cope with large data sets.
• HBase was designed to run on top of HDFS and provides BigTable-like capabilities.
Hadoop Ecosystem: Zookeeper
• Before Zookeeper, coordinating the different services in the Hadoop ecosystem was very difficult and time-consuming. The services had many problems with interactions, such as sharing common configuration while synchronizing data. Even once services are configured, changes to their configuration make them complex and difficult to handle. Grouping and naming were also time-consuming.
• Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of various services in a Hadoop Ecosystem.
• Zookeeper saves a lot of time by performing synchronization, configuration maintenance, grouping and naming.
Hadoop Ecosystem: Ambari
• Ambari is an Apache Software Foundation project which aims at making the Hadoop ecosystem more manageable.
• Ambari provides:
• Hadoop cluster provisioning:
– It gives a step-by-step process for installing Hadoop services across a number of hosts.
– It also handles configuration of Hadoop services over a cluster.
• Hadoop cluster management:
– It provides a central management service for starting, stopping, and re-configuring Hadoop services across the cluster.
• Hadoop cluster monitoring:
– For monitoring health and status, Ambari provides a dashboard.
– The Ambari alert framework is an alerting service that notifies the user whenever attention is needed, for example if a node goes down or a node runs low on disk space.
Hadoop Installation and Configuration
Hadoop can be run in one of three modes:
• Standalone (or local) mode: There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
• Pseudo-distributed mode: All Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.
• Fully distributed mode: The Hadoop daemons run on a cluster of machines.
Lab steps:
• Step 1: Add a Hadoop system user
• Step 2: Configure SSH
• Step 3: Download and install Java
• Step 4: Download and install Hadoop (Hadoop is configured for non-distributed operation by default)
• Step 5: Configure Hadoop and run examples (local mode & pseudo-distributed mode)
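For the pseudo-distributed mode in step 5, Hadoop's default (standalone) configuration must be edited so that all daemons talk to the local host. A minimal sketch of the two standard config files (the property names follow the stock Hadoop 2.x single-node setup; the port and file paths may differ per installation):

```xml
<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- only one node, so no block replication -->
    <value>1</value>
  </property>
</configuration>
```

With these in place, formatting the NameNode and starting the HDFS daemons simulates a small cluster on a single machine, as described under pseudo-distributed mode above.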