Review Notes | Lecture 3: HDFS (Big Data Processing Technology)

Distributed File System

• In the traditional approach, all data was stored on a single central machine. With the rise of big data, a single machine no longer provides enough storage.
• The solution is to store the massive amount of data in a distributed way: the data is divided and distributed among many nodes (machines).

• Distribution: Blocks of a data set are distributed across multiple nodes.

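• Replication: Each data block is copied to multiple nodes on different racks.
• Fault tolerance: Data blocks can be recovered after a cluster or rack failure.
• High concurrency: The same data can be processed by multiple clients at the same time.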

Assumptions and Goals

• Hardware Failure
  ◦ Hardware failure is the norm rather than the exception. Detecting faults and recovering from them quickly and automatically is a core architectural goal.
• Streaming Data Access
  ◦ HDFS is designed for batch processing rather than interactive use. The emphasis is on high throughput of data access rather than low latency.
• Large Data Sets
  ◦ Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size, and a single instance should support tens of millions of files.
• Simple Coherency Model
  ◦ HDFS applications need a write-once-read-many access model for files. Once a file is created, written, and closed, it does not need to be changed. This assumption simplifies data coherency issues and enables high-throughput data access.
• "Moving Computation Is Cheaper than Moving Data"
  ◦ It is often better to migrate the computation closer to where the data is located than to move the data to the computation. This minimizes network congestion and increases the overall throughput of the system.
• Portability Across Heterogeneous Hardware and Software Platforms
  ◦ HDFS is designed to be easily portable from one platform to another (it is developed in Java).

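• Hadoop Distributed File System (HDFS), based on the Google File System (GFS) paper (2003)
  – Serves as the distributed file system for most tools in the Hadoop ecosystem
  – Scalability for large data sets
  – Reliability in the face of hardware failures
• HDFS is good for:
  – Large files (typically 100 MB or more per file)
  – Streaming reads
• HDFS is not good for:
  – Large numbers of small files
  – Random access to files
  – Low-latency access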

HDFS Architecture

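• Master-slave design
• Master node: a single NameNode that manages the metadata
• Slave nodes: multiple DataNodes that store the data
• Other: a Secondary NameNode, which keeps checkpoints of the metadata as a backup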

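• The NameNode is the master daemon.
• There is only one active NameNode.
• It manages the DataNodes;
stores all metadata;
manages the file system namespace, file names, locations, and directories;
receives heartbeats and block reports from all DataNodes.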

• A DataNode provides storage for data blocks;
serves read and write requests from clients;
performs block creation, deletion, and replication upon instruction from the NameNode.

NameNode

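• Two files are associated with the metadata:
  • fsimage: the complete state of the file system namespace as of the most recent checkpoint (for example, the one written when the NameNode starts)
  • editlog: all modifications made to the file system since the most recent fsimage
• Each fsimage file is a complete, persistent checkpoint of the file system metadata.
• However, it is not updated on every file system write operation, because writing out an fsimage file, which can grow to gigabytes in size, would be too slow.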

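More about fsimage:
• Each fsimage file contains the metadata of the files and directories in the file system.
• For a file, it records information such as the replication level, modification and access times, access permissions, block size, and the blocks the file is made up of.
• For a directory, it records information such as the modification time and permissions.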

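• The fsimage file does not record the DataNodes on which the blocks are stored. Instead, the NameNode keeps this mapping in memory. It builds the mapping by asking each DataNode for its block list when the DataNode joins the cluster, and it asks again periodically afterward so that its block map stays up to date.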

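• When the NameNode starts up, the first thing it does is load its image file (fsimage) into memory and apply the edits from the editlog.
• Once it has reconstructed a consistent in-memory image of the file system metadata, it creates a new fsimage file and an empty editlog.
• During this process the NameNode runs in safe mode, which means it offers clients only a read-only view of the file system.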
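The startup sequence above can be sketched as a toy model. This is only an illustration under simplifying assumptions: the namespace is a plain dictionary from path to replication factor, and the edit operations ("create", "delete", "rename") and the function names are hypothetical, not the real NameNode format or API.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    """In-memory metadata: path -> replication factor (toy model, not the real format)."""
    files: dict = field(default_factory=dict)


def apply_edit(ns, edit):
    """Apply one logged transaction to the in-memory namespace."""
    op, path, *args = edit
    if op == "create":
        ns.files[path] = args[0]            # args[0] is the replication factor
    elif op == "delete":
        ns.files.pop(path, None)
    elif op == "rename":
        ns.files[args[0]] = ns.files.pop(path)   # args[0] is the new path
    else:
        raise ValueError(f"unknown edit op: {op}")


def start_namenode(fsimage, editlog):
    """Rebuild the metadata from the last checkpoint, then write a fresh checkpoint.

    Returns (namespace, new_fsimage, new_editlog).
    """
    ns = Namespace(files=dict(fsimage))     # 1. load the fsimage into memory
    for edit in editlog:                    # 2. replay the edits in order
        apply_edit(ns, edit)
    new_fsimage = dict(ns.files)            # 3. save a new checkpoint ...
    new_editlog = []                        #    ... and start with an empty editlog
    return ns, new_fsimage, new_editlog
```

The cost of step 2 grows with the length of the editlog, which is why a long editlog makes a restart slow.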

The editlog grows unwieldy in size. How is this handled?
• The editlog would grow without bound (even if it were spread across several physical edits files).
• This has no impact on the system while the NameNode is running. If the NameNode were restarted, however, it would take a long time to apply each of the transactions in its (very long) editlog. During this time, the file system would be offline, which is generally undesirable.
The solution is to run the Secondary NameNode.

Secondary NameNode

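The Secondary NameNode produces checkpoints of the primary's in-memory file system metadata. The checkpoint process is as follows:
1. The secondary asks the primary to roll its in-progress edits file, so that new edits go to a new file.
2. The secondary retrieves the latest fsimage and edits files from the primary (using HTTP GET).
3. The secondary loads the fsimage into memory, applies each transaction from the edits, and then creates a new merged fsimage.ckpt file.
4. The secondary sends the new fsimage back to the primary (using HTTP PUT), and the primary saves it as a temporary .ckpt file.
5. The primary renames the temporary fsimage file to make it available.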
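The five checkpoint steps can be illustrated with a small in-memory model. This is a sketch under assumptions: the namespace is reduced to a set of paths, the HTTP transfers are replaced by plain copies, and the class and method names are hypothetical. Discarding the merged edit segments right away is also a simplification.

```python
class PrimaryNameNode:
    """Toy primary: an fsimage checkpoint plus edit-log segments."""

    def __init__(self, fsimage):
        self.fsimage = set(fsimage)      # last checkpoint: a set of file paths
        self.finalized_edits = []        # closed edit segments
        self.in_progress_edits = []      # segment currently being written

    def log(self, op, path):
        """Record one namespace modification in the in-progress segment."""
        self.in_progress_edits.append((op, path))

    def roll_edits(self):
        """Step 1: close the in-progress segment so new edits go to a new one."""
        self.finalized_edits.append(self.in_progress_edits)
        self.in_progress_edits = []

    def install_checkpoint(self, merged):
        """Steps 4-5: accept the merged image and drop the edits it already covers."""
        self.fsimage = set(merged)
        self.finalized_edits = []


def checkpoint(primary):
    """Run one checkpoint cycle the way the secondary would."""
    primary.roll_edits()                                       # step 1
    image = set(primary.fsimage)                               # step 2: fetch copies
    segments = [list(seg) for seg in primary.finalized_edits]
    for segment in segments:                                   # step 3: merge
        for op, path in segment:
            if op == "create":
                image.add(path)
            elif op == "delete":
                image.discard(path)
    primary.install_checkpoint(image)                          # steps 4-5
    return image
```

Because the primary rolls its edits first, it can keep accepting writes into the new segment while the secondary does the merge.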

How Are Files Stored in HDFS?

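Why are blocks in HDFS so large?
• HDFS blocks are large compared with disk blocks in order to minimize the cost of seeks.
Benefits of the block abstraction:
• A file can be larger than any single disk in the network.
• The storage subsystem deals with blocks, which simplifies storage management (because blocks are a fixed size, it is easy to calculate how many can be stored on a given disk).
• Blocks fit well with replication for providing fault tolerance and availability.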
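A short calculation makes the seek-cost argument concrete. The numbers below are assumptions chosen for illustration and do not come from these notes: a 128 MB block size, a 10 ms seek time, and a 100 MB/s transfer rate. The function names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # assumed block size in bytes


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) of each block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)   # the last block may be shorter
        blocks.append((offset, length))
        offset += length
    return blocks


def seek_overhead(block_size, seek_seconds=0.010, transfer_bytes_per_second=100e6):
    """Fraction of the time needed to read one block that is spent seeking."""
    transfer_seconds = block_size / transfer_bytes_per_second
    return seek_seconds / (seek_seconds + transfer_seconds)
```

Under these assumptions, seeking accounts for less than 1% of the time to read a 128 MB block, while for a 4 KB disk-sized block almost all of the time would be seek time.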

DataNode Failure

Data Replication

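• HDFS blocks are replicated multiple times.
• Each block is stored in multiple locations, including on different racks (typically 3 copies).
• This makes HDFS storage fault-tolerant and speeds up reads.
• The NameNode periodically collects block reports from the DataNodes to maintain the replication factor. Whenever a block is over-replicated or under-replicated, the NameNode removes or adds replicas as needed.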
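The way block reports drive replication can be sketched as follows. This is a minimal model, assuming a block report is just a list of block IDs per DataNode. The function names and the sign convention of the result are hypothetical.

```python
from collections import defaultdict


def build_block_map(block_reports):
    """Turn {datanode: [block ids]} reports into {block id: set of datanodes}."""
    block_map = defaultdict(set)
    for datanode, blocks in block_reports.items():
        for block in blocks:
            block_map[block].add(datanode)
    return dict(block_map)


def replication_actions(block_map, replication_factor=3):
    """Return {block id: delta} for blocks whose replica count is off target.

    A positive delta means that many replicas must be added (under-replicated);
    a negative delta means that many must be removed (over-replicated).
    """
    actions = {}
    for block, holders in block_map.items():
        delta = replication_factor - len(holders)
        if delta != 0:
            actions[block] = delta
    return actions
```

For example, a block reported by four DataNodes yields a delta of -1, and a block reported by only one DataNode yields +2.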

Rack Awareness

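• The NameNode also ensures that the replicas of a block are not all stored on the same rack.
• With a replication factor of 3, the rack awareness algorithm stores the first replica of a block on the local rack and the next two replicas on a different (remote) rack, on different DataNodes within that remote rack.
• Any remaining replicas are placed on random DataNodes, provided that no more than two replicas end up on the same rack.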
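The placement rule for a replication factor of 3 can be sketched like this. It is an illustration, not the actual placement code: the topology dictionary and the function name are hypothetical, and it assumes that the writer is itself a DataNode, so the first replica goes on the writer's own node.

```python
import random


def place_replicas(topology, writer_node, rng=random):
    """Pick three DataNodes for a block, given {rack: [datanodes]}.

    Replica 1 goes on the writer's node (local rack); replicas 2 and 3 go on
    two different DataNodes of one randomly chosen remote rack.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer_node in nodes)
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two DataNodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]
```

With this layout a write crosses the rack boundary only once, yet the block still survives the loss of either rack.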
Advantages of Rack Awareness:
• Improved network performance: Machines in the same rack generally have more network bandwidth between them than machines in different racks. Rack awareness therefore reduces write traffic between racks, which improves write performance. Read performance also improves, because reads can use the bandwidth of multiple racks.
• Protection against data loss: The data remains safe even if an entire rack fails because of a switch failure or power failure.

HDFS Java API

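• Configuration: A Configuration object encapsulates a client's or server's configuration.
• FileSystem: The general file system API. The first step is to retrieve an instance of the file system we want to use, in this case HDFS.
• Path: A file in a Hadoop file system is represented by a Hadoop Path object.
• FSDataInputStream: The open() method on FileSystem returns an FSDataInputStream rather than a standard java.io class.
• FSDataOutputStream: The FileSystem class has a number of methods for creating a file. The simplest is create(), which takes a Path object for the file to be created and returns an output stream.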

HDFS Read

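• When a client wants to retrieve data, it:
  ◦ Communicates with the NameNode to determine which blocks make up the file and which DataNodes those blocks are stored on
  ◦ Then communicates directly with the DataNodes to read the data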

1. The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.

• DistributedFileSystem returns an FSDataInputStream to the client for it to read data from.
• FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
2. DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the namenode returns the addresses of the datanodes that have a copy of that block. The datanodes are sorted according to their proximity to the client.
3. The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, connects to the first (closest) datanode for the first block in the file.

4. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.

5. When the end of the block is reached, DFSInputStream closes the connection to the datanode and then finds the best datanode for the next block. This happens transparently to the client.

Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It also calls the namenode to retrieve the datanode locations for the next batch of blocks as needed.
6. When the client has finished reading, it calls close() on the FSDataInputStream.

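• During reading, if the DFSInputStream encounters an error while communicating with a datanode, it tries the next closest datanode for that block. It also remembers datanodes that have failed, so that it does not needlessly retry them for later blocks. (Datanode failure)
• It also verifies checksums for the data it retrieves. If a corrupted block is found, the DFSInputStream attempts to read a replica of the block from another datanode; it also reports the corrupted block to the namenode. (Data error)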
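The read path with its two failure cases can be sketched as a small simulation. The classes and names here are hypothetical stand-ins, not the Hadoop API: proximity is reduced to a numeric distance, checksums use CRC-32 from the standard library, and "reporting to the namenode" is modeled as collecting a list of corrupt replicas.

```python
import zlib


class DataNode:
    """Toy datanode holding blocks in memory."""

    def __init__(self, name, distance):
        self.name = name
        self.distance = distance    # smaller means closer to the client (assumed metric)
        self.blocks = {}            # block id -> bytes
        self.alive = True

    def read_block(self, block_id):
        if not self.alive:
            raise ConnectionError(f"{self.name} is unreachable")
        return self.blocks[block_id]


def read_file(block_ids, locations, checksums):
    """Read blocks in order, trying the replicas of each block closest first.

    locations: {block id: [DataNode, ...]}; checksums: {block id: crc32 value}.
    Returns (data, corrupt_reports), where corrupt_reports lists the
    (block id, datanode name) pairs that would be reported to the namenode.
    """
    failed = set()
    corrupt_reports = []
    data = bytearray()
    for block_id in block_ids:
        replicas = sorted(locations[block_id], key=lambda dn: dn.distance)
        for dn in replicas:
            if dn.name in failed:
                continue                       # do not retry a node that already failed
            try:
                payload = dn.read_block(block_id)
            except ConnectionError:
                failed.add(dn.name)            # remember the failure for later blocks
                continue
            if zlib.crc32(payload) != checksums[block_id]:
                corrupt_reports.append((block_id, dn.name))
                continue                       # corrupt replica: try another datanode
            data += payload
            break
        else:
            raise IOError(f"no good replica for block {block_id}")
    return bytes(data), corrupt_reports
```

The client only notices a problem if every replica of some block is unreachable or corrupt.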

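• One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block.
• This design allows HDFS to scale to a large number of concurrent clients, because the data traffic is spread across all the datanodes in the cluster.
• Meanwhile, the namenode only has to service block location requests (which it stores in memory, making them very efficient). It does not serve data, which would quickly become a bottleneck as the number of clients grew.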

HDFS Write

1. The client creates the file by calling create() on DistributedFileSystem.

2. DistributedFileSystem makes an RPC call to the namenode to create a new file in the file system's namespace, with no blocks associated with it.
The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException.
The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.

3. As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.

4. The list of datanodes forms a pipeline.
The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.

5. The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.

6. When the client has finished writing data, it calls close() on the stream.

7. Once all packets have been acknowledged, the client contacts the namenode to signal that the file write is complete.
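The interplay of the data queue, the pipeline, and the ack queue can be sketched as follows. This is a deliberately simplified, synchronous model: each datanode is a plain list, the packet size is tiny, and every packet is fully acknowledged before the next one is sent, whereas the real stream works asynchronously. All names are hypothetical.

```python
from collections import deque

PACKET_SIZE = 4   # bytes; tiny on purpose so the example produces several packets


def write_block(data, pipeline):
    """Stream data through a pipeline of datanodes (each one a list used as storage).

    Returns the number of packets acknowledged by every datanode in the pipeline.
    """
    data_queue = deque(data[i:i + PACKET_SIZE] for i in range(0, len(data), PACKET_SIZE))
    ack_queue = deque()
    acked = 0
    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)          # sent but not yet acknowledged
        acks = 0
        for datanode in pipeline:         # each node stores the packet, then forwards it
            datanode.append(packet)
            acks += 1
        if acks == len(pipeline):         # acknowledged by all datanodes in the pipeline
            ack_queue.popleft()
            acked += 1
    assert not ack_queue                  # on close(), nothing is left unacknowledged
    return acked
```

Keeping a packet in the ack queue until every datanode has confirmed it is what makes it possible to resend the packet if a datanode fails in the middle of a write.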

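If any datanode fails while data is being written to it:
• The pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes downstream of the failed node do not miss any packets.
• The failed datanode is removed from the pipeline, and a new pipeline is constructed from the two good datanodes. The remainder of the block's data is written to the good datanodes in the pipeline.
• The namenode notices that the block is under-replicated and arranges for a further replica to be created on another node.
Subsequent blocks are then treated as normal.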
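The first two recovery steps can be sketched with the same queue model. This is only an illustration with hypothetical names. It shows how unacknowledged packets return to the front of the data queue in their original order and how the failed node is dropped from the pipeline.

```python
from collections import deque


def recover_pipeline(data_queue, ack_queue, pipeline, failed_node):
    """Requeue unacknowledged packets and rebuild the pipeline without the failed node.

    data_queue and ack_queue are deques and are modified in place.
    Returns the new pipeline as a list.
    """
    # extendleft() prepends items one by one, so feed it the ack queue in reverse
    # to keep the packets in their original send order at the front.
    data_queue.extendleft(reversed(ack_queue))
    ack_queue.clear()
    return [datanode for datanode in pipeline if datanode != failed_node]
```

After recovery the writer simply continues draining the data queue into the shorter pipeline. Restoring the missing replica is left to the namenode, as described above.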