Hadoop TDG 2 -- HDFS
The Hadoop Distributed Filesystem
The Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Let’s examine this statement in more detail:
Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware to run on.
While this may change in the future, these are areas where HDFS is not a good fit today:
Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.
Lots of small files
Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode.
As a rule of thumb, each file, directory, and block takes about 150 bytes.
So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. While storing millions of files is feasible, billions is beyond the capability of current hardware.
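To make that arithmetic explicit, here is a minimal back-of-envelope sketch in plain Java; the class name and the figures are just the rule-of-thumb numbers quoted above, not measurements.
// Back-of-envelope namenode memory estimate using the ~150-bytes-per-object
// rule of thumb: one million files, one block each, is roughly two million
// namespace objects held in namenode memory.
public class NamenodeMemoryEstimate {
    public static void main(String[] args) {
        long files = 1000000L;          // one million files
        long blocksPerFile = 1L;        // each file occupies a single block
        long bytesPerObject = 150L;     // rough in-memory cost per file/dir/block
        long totalBytes = (files + files * blocksPerFile) * bytesPerObject;
        System.out.println("Approximate namenode memory: "
                + totalBytes / (1000 * 1000) + " MB");
    }
}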
Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
This part is explained quite clearly here, clearer than the GFS paper...
HDFS Concepts
Blocks
A disk has a block size, which is the minimum amount of data that it can read or write.
Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes.
This is generally transparent to the filesystem user who is simply reading or writing a file—of whatever length. However, there are tools to perform filesystem maintenance, such as df and fsck, that operate on the filesystem block level.
HDFS, too, has the concept of a block, but it is a much larger unit—64 MB by default.
Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units.
Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.
A regular filesystem also has the concept of a block, the minimum unit of reading and writing (disk blocks are normally 512 bytes); HDFS has a block concept too, with a default size of 64 MB.
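The block size is an ordinary configuration knob. A minimal sketch of setting it programmatically, assuming the property name used by the Hadoop releases this chapter describes (dfs.block.size; newer releases renamed it dfs.blocksize); in practice it would normally go into hdfs-site.xml rather than code.
import org.apache.hadoop.conf.Configuration;

// Sketch: raise the default HDFS block size from 64 MB to 128 MB for files
// created through this Configuration object.
public class BlockSizeConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB in bytes
        System.out.println("dfs.block.size = " + conf.getLong("dfs.block.size", 0));
    }
}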
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made to be significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
This is explained rather loosely here; the GFS paper devotes a whole section to why a large chunk size is used. See Section 2.5 of the paper.
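To make the trade-off concrete, here is a rough calculation with illustrative figures (about 10 ms per seek and 100 MB/s sustained transfer, typical disk numbers of that era; the class name is made up for illustration): to keep the seek cost at about 1% of the transfer time, the block needs to be around 100 MB.
// Rough calculation: how large must a block be so that the one-time seek cost
// is only ~1% of the time spent actually transferring the block?
public class BlockSizeBackOfEnvelope {
    public static void main(String[] args) {
        double seekTimeMs = 10.0;          // illustrative average seek time
        double transferRateMBperS = 100.0; // illustrative sustained transfer rate
        double targetSeekFraction = 0.01;  // want seek to be ~1% of transfer time

        // transfer time must be seekTime / fraction; block size = rate * time
        double transferTimeS = (seekTimeMs / 1000.0) / targetSeekFraction;
        double blockSizeMB = transferRateMBperS * transferTimeS;
        System.out.println("Block size needed: ~" + blockSizeMB + " MB");
    }
}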
Having a block abstraction for a distributed filesystem brings several benefits.
- The first benefit is the most obvious: a file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
- Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem.
Simplicity is something to strive for in all systems, but it is especially important for a distributed system in which the failure modes are so varied.
The storage subsystem deals with blocks, simplifying storage management (since blocks are a fixed size, it is easy to calculate how many can be stored on a given disk) and eliminating metadata concerns (blocks are just chunks of data to be stored; file metadata such as permissions information does not need to be stored with the blocks, so another system can handle metadata separately).
- Furthermore, blocks fit well with replication for providing fault tolerance and availability.
To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three).
If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
A block that is no longer available due to corruption or machine failure can be replicated from its alternative locations to other live machines to bring the replication factor back to the normal level. (See “Data Integrity” on page 75 for more on guarding against corrupt data.) Similarly, some applications may choose to set a high replication factor for the blocks in a popular file to spread the read load on the cluster.
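As noted in the last point above, the replication factor can be raised for individual hot files through the public FileSystem API. A minimal sketch; the URI, file path, and replication value are hypothetical.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: raise the replication factor of a heavily read file so its blocks
// are spread across more datanodes. The path below is hypothetical.
public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
        boolean scheduled = fs.setReplication(new Path("/user/tom/popular.txt"), (short) 10);
        System.out.println("Replication change scheduled: " + scheduled);
    }
}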
In plain terms: why have the block concept at all? Large files obviously have to be split, since a single node cannot hold them; a fixed block size is then chosen purely for simplicity, because it makes management and replication easy.
For a large-scale system, simple design is what wins...
Like its disk filesystem cousin, HDFS’s fsck command understands blocks. For example, running:
% hadoop fsck / -files -blocks
will list the blocks that make up each file in the filesystem.
Namenodes and Datanodes
An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers).
Namenode
The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.
This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.
The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
As in GFS, the namenode manages the namespace of files and blocks, and the mapping between them.
Block locations are rebuilt dynamically each time the namenode starts, another result of designing for simplicity.
To guard against losing the in-memory metadata, a namespace image is generated periodically and saved to local disk.
To avoid losing the changes made between two images, every new operation is also written to the edit log; when the next image is generated, the edit log is cleared.
Datanodes
Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
How do we guard against the namenode going down?
Without the namenode, the filesystem cannot be used.
In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this.
- The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount (see the configuration sketch after this list).
- It is also possible to run a secondary namenode, which despite its name does not act as a namenode.
Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.
The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing.
However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary. See “The filesystem image and edit log” on page 294 for more details.
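A minimal sketch of the first mechanism above: pointing the namenode at more than one metadata directory. It assumes the Hadoop 1.x property name dfs.name.dir, which takes a comma-separated list of directories, each of which receives a copy of the metadata; the directory paths are hypothetical, and in practice this would normally go into hdfs-site.xml rather than code.
import org.apache.hadoop.conf.Configuration;

// Sketch: configure the namenode to write its persistent state to a local
// disk and to an NFS mount. Both paths below are hypothetical.
public class NamenodeDirsConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("dfs.name.dir", "/disk1/hdfs/name,/remote/hdfs/name");
        System.out.println("dfs.name.dir = " + conf.get("dfs.name.dir"));
    }
}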
This part of the design is almost identical to GFS; what is amusing is that every term is different:
master -> namenode, chunkserver -> datanode
chunk -> block, checkpoint -> namespace image, operation log -> edit log
Personally, I find the HDFS terms easier to understand.
The Command-Line Interface
We’re going to have a look at HDFS by interacting with it from the command line. There are many other interfaces to HDFS, but the command line is one of the simplest and, to many developers, the most familiar.
Basic Filesystem Operations
The filesystem is ready to be used, and we can do all of the usual filesystem operations such as reading files, creating directories, moving files, deleting data, and listing directories.
You can type hadoop fs -help to get detailed help on every command.
Start by copying a file from the local filesystem to HDFS:
% hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt
This command invokes Hadoop’s filesystem shell command fs, which supports a number of subcommands—in this case, we are running -copyFromLocal.
Let’s copy the file back to the local filesystem and check whether it’s the same:
% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = a16f231da6b05e2ba7a339320e7dacd9
MD5 (quangle.copy.txt) = a16f231da6b05e2ba7a339320e7dacd9
The MD5 digests are the same, showing that the file survived its trip to HDFS and is back intact.
Finally, let’s look at an HDFS file listing. We create a directory first just to see how it is displayed in the listing:
% hadoop fs -mkdir books
% hadoop fs -ls .
Found 2 items
drwxr-xr-x   - tom supergroup          0 2009-04-02 22:41 /user/tom/books
-rw-r--r--   1 tom supergroup        118 2009-04-02 22:29 /user/tom/quangle.txt
The information returned is very similar to the Unix command ls -l, with a few minor differences.
The first column shows the file mode.
The second column is the replication factor of the file (something a traditional Unix filesystem does not have). The entry in this column is empty for directories since the concept of replication does not apply to them—directories are treated as metadata and stored by the namenode, not the datanodes.
The third and fourth columns show the file owner and group.
The fifth column is the size of the file in bytes, or zero for directories.
The sixth and seventh columns are the last modified date and time.
Finally, the eighth column is the absolute name of the file or directory.
Hadoop Filesystems
Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation.
The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are several concrete implementations, which are described in Table 3-1.
Although HDFS is Hadoop's own filesystem, Hadoop supports other filesystems as well; here are some examples.
Filesystem         URI scheme   Java implementation (all under org.apache.hadoop)
Local              file         fs.LocalFileSystem
HDFS               hdfs         hdfs.DistributedFileSystem
KFS (CloudStore)   kfs          fs.kfs.KosmosFileSystem
FTP                ftp          fs.ftp.FTPFileSystem
S3 (native)        s3n          fs.s3native.NativeS3FileSystem
S3 (block-based)   s3           fs.s3.S3FileSystem
Description
KFS (CloudStore): CloudStore (formerly the Kosmos filesystem) is a distributed filesystem like HDFS or Google's GFS, written in C++. Find more information about it at http://kosmosfs.sourceforge.net/.
S3 (native): A filesystem backed by Amazon S3. See http://wiki.apache.org/hadoop/AmazonS3.
S3 (block-based): A filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3's 5 GB file size limit.
Although it is possible (and sometimes very convenient) to run MapReduce programs that access any of these filesystems, when you are processing large volumes of data you should choose a distributed filesystem that has the data locality optimization, such as HDFS or KFS.
The Java Interface
In this section, we dig into Hadoop's FileSystem class: the API for interacting with one of Hadoop's filesystems.
While we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across filesystems.
This is very useful when testing your program, for example, since you can rapidly run tests using data stored on the local filesystem.
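For example, the same client code can be pointed at the local filesystem during tests simply by passing a file: URI. A minimal sketch; the URIs and the /tmp path are placeholders.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: the FileSystem abstract class hides which implementation is in use,
// so code written against it runs unchanged on the local filesystem (handy
// for tests) and on HDFS. The URIs below are placeholders.
public class PortableFileSystemExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem localFs = FileSystem.get(URI.create("file:///"), conf);
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
        // The same call works on either implementation:
        System.out.println(localFs.exists(new Path("/tmp")));
        System.out.println(hdfs.exists(new Path("/tmp")));
    }
}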
Reading Data Using the FileSystem API
Reading mainly uses the following methods:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
Example code:
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Displays a file from a Hadoop filesystem on standard output.
public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
The program runs as follows:
% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
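The open() method actually returns an FSDataInputStream, which also supports random access through seek(). A sketch along the same lines as FileSystemCat, reading the file twice by seeking back to the start; treat it as an illustration rather than a drop-in utility.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: FSDataInputStream supports seek(), so the same file can be re-read
// from the beginning without reopening it.
public class FileSystemDoubleCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0); // go back to the start of the file
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}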
Writing Data
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

// Copies a local file to a Hadoop filesystem, printing a dot each time
// progress() is called to show that the write is making progress.
public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      public void progress() {
        System.out.print(".");
      }
    });
    IOUtils.copyBytes(in, out, 4096, true);
  }
}
HDFS Data Flow
Anatomy of a File Read
Step 1
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-1).
DistributedFileSystem calls the namenode, using RPC, to determine the locations of the blocks for the first few blocks in the file (step 2).
For each block, the namenode returns the addresses of the datanodes that have a copy of that block.
Furthermore, the datanodes are sorted according to their proximity to the client (according to the topology of the cluster’s network; see “Network Topology and Hadoop” on page 64). If the client is itself a datanode (in the case of a MapReduce task, for instance), then it will read from the local datanode, if it hosts a copy of the block.
For the multiple replicas of a block, the datanodes are sorted by network topology so that the client reads from the closest node.
Step 3
How do we judge how near or far two nodes are from each other?
What does it mean for two nodes in a local network to be “close” to each other?
In the context of high-volume data processing, the limiting factor is the rate at which we can transfer data between nodes—bandwidth is a scarce commodity. The idea is to use the bandwidth between two nodes as a measure of distance. Measuring actual bandwidth between nodes would be fine, but it is hard to measure in practice, so a simpler and very practical method is used instead.
For example, imagine a node n1 on rack r1 in data center d1. This can be represented as /d1/r1/n1. Using this notation, here are the distances for the four scenarios:
• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
• distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
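The distance metric above is just the number of hops from each node up to their closest common ancestor in the /datacenter/rack/node tree. A minimal sketch of that calculation; the parsing is simplified and the class and method names are made up for illustration.
// Sketch: compute the "distance" between two nodes given their positions in a
// /datacenter/rack/node hierarchy, as the sum of the hops from each node up
// to their closest common ancestor. Simplified illustration only.
public class NetworkDistance {
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/"); // e.g. {"d1", "r1", "n1"}
        String[] pb = b.substring(1).split("/");
        int common = 0;
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6
    }
}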
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file.
Step 4
Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream (step 4).
Step 5
When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block (step 5).
This happens transparently to the client, which from its point of view is just reading a continuous stream.
Step 6
Blocks are read in order with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
The read path itself is actually very simple: get the block locations from the namenode, then read directly from the datanodes. The steps above mainly reflect how the Java classes in HDFS are designed.
During reading, if the DFSInputStream encounters an error while communicating with a datanode, then it will try the next closest one for that block. It will also remember datanodes that have failed so that it doesn’t needlessly retry them for later blocks.
The DFSInputStream also verifies checksums for the data transferred to it from the datanode.
If a corrupted block is found, it is reported to the namenode before the DFSInputStream attempts to read a replica of the block from another datanode.
Things can go wrong during a read.
What if a datanode dies? Read from the next closest datanode, and have the client remember the failed one so that later blocks do not retry it needlessly.
What if the data read from a datanode is corrupted? First report it to the namenode, then read another replica from a different datanode. The report matters because the namenode is responsible for dealing with the bad replica (see Section 5.2 of the GFS paper).
One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block.
This design allows HDFS to scale to a large number of concurrent clients, since the data traffic is spread across all the datanodes in the cluster. The namenode meanwhile merely has to service block location requests (which it stores in memory, making them very efficient) and does not, for example, serve data, which would quickly become a bottleneck as the number of clients grew.
This keeps the load on the master light, exactly the same design as GFS; see Sections 2.3 and 2.4 of the GFS paper.
Anatomy of a File Write
Step 1
The client creates the file by calling create() on DistributedFileSystem (step 1 in Figure 3-3).
Step 2
DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no blocks associated with it (step 2).
The namenode performs various checks to make sure the file doesn’t already exist, and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException.
The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.
As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue.
When DistributedFileSystem receives the create call, it asks the namenode to create the file. Once the namenode's checks pass, the new file is created in the filesystem namespace and a DFSOutputStream is returned; the client writes its data into this DFSOutputStream, which splits it into packets and places them on the internal data queue.
Step 4
The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline—we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).
The DataStreamer consumes the data queue. It first asks the namenode to allocate new blocks and to pick suitable datanodes to hold the replicas; those datanodes form a pipeline, with each one storing the packet and forwarding it to the next. This pipelined approach presumably follows GFS as well; see Section 3.2 of the GFS paper.
Replica Placement
How does the namenode choose which datanodes to store replicas on? There's a tradeoff between reliability and write bandwidth and read bandwidth here.
Hadoop's strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack. Compare GFS's placement policy, Section 4.2 of the paper.
Step 5
DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue.
A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline (step 5).
What happens when a datanode crashes during the write?
If a datanode fails while data is being written to it, then the following actions are taken, which are transparent to the client writing the data.
- First the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets.
First the pipeline is closed. Because the datanodes store data as a pipeline, a failed datanode affects every datanode downstream of it, so any packet still waiting for an ack may not have been stored; all of those packets are pushed back onto the data queue to be stored again, so that nothing is lost.
- The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on.
The namenode is told so that, when the failed datanode recovers, the partial block (the incomplete data it had stored when the failure occurred) can be deleted.
- The failed datanode is removed from the pipeline and the remainder of the block’s data is written to the two good datanodes in the pipeline.
The crashed datanode is removed from the pipeline, and the remaining data (the packets that were waiting for acks and were pushed back onto the data queue?) is written to the two good datanodes left in the pipeline (with the default of three replicas, losing one leaves two?).
- The namenode notices that the block is under-replicated, and it arranges for a further replica to be created on another node. Subsequent blocks are then treated as normal.
The namenode will detect that the block above is under-replicated (the datanode failure means some replicas were never written successfully) and will arrange for the missing replicas to be created on other datanodes.
- It’s possible, but unlikely, that multiple datanodes fail while a block is being written. As long as dfs.replication.min replicas (default one) are written, the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to three).
As long as the number of successfully written replicas reaches dfs.replication.min, the write counts as successful.
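Both properties mentioned above are ordinary configuration knobs. A minimal sketch of setting them programmatically; in practice they normally live in hdfs-site.xml, and the values shown are just the defaults quoted in the text.
import org.apache.hadoop.conf.Configuration;

// Sketch: the target replication factor and the minimum number of replicas
// that must be written for a write to succeed, set explicitly for illustration.
public class ReplicationConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);      // target replication factor
        conf.setInt("dfs.replication.min", 1);  // minimum replicas for success
        System.out.println(conf.getInt("dfs.replication", -1) + " / "
                + conf.getInt("dfs.replication.min", -1));
    }
}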
Steps 6 and 7
When the client has finished writing data, it calls close() on the stream (step 6).
This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete (step 7).
The namenode already knows which blocks the file is made up of (via Data Streamer asking for block allocations), so it only has to wait for blocks to be minimally replicated before returning successfully.
Compared with GFS, the whole write path differs quite a bit; see Section 3.1 of the GFS paper.
First, GFS uses a lease mechanism to designate a primary replica for each chunk (block); this is how it guarantees that, under concurrent writes, every replica applies the writes in the same order.
HDFS simplifies this mechanism away: it does not support multiple writers, that is, no concurrent writes.
So HDFS's consistency story is very simple and does not need to be as complicated as GFS's.
Second, in GFS the client queries the master directly for the addresses of the chunkservers holding the replicas and pushes the data to each replica itself. When an error occurs, the primary reports it to the client, and the client is responsible for handling the error; its way of handling it is to write the data again, which can leave duplicate data on some replicas (see Section 3.3 of the GFS paper).
HDFS adds another layer of encapsulation, so the client does not deal with the namenode and datanodes directly.
The client only has to write its data to the DFSOutputStream, which automatically places it on the data queue,
and the DataStreamer queries the namenode for the addresses of the datanodes to store to, writing the data to the replicas in pipeline fashion.
When a datanode fails, however, the handling is somewhat more involved than in GFS.
The description above is a bit convoluted; my own understanding is simply that a failed datanode is removed from the pipeline, so the data can still be stored successfully on the healthy datanodes.
Finally, when the client has finished writing it calls close(); the client never needs to know the storage details or handle errors itself.
As long as every block has been stored successfully on at least dfs.replication.min replicas, DFSOutputStream tells the namenode that the file is complete.
Coherency Model
A coherency model for a filesystem describes the data visibility of reads and writes for a file. HDFS trades off some POSIX requirements for performance, so some operations may behave differently than you expect them to.
On read/write visibility: creating a file behaves consistently, as expected.
After creating a file, it is visible in the filesystem namespace, as expected:
Path p = new Path("p");
fs.create(p);
assertThat(fs.exists(p), is(true));
Writing to a file, however, is not immediately consistent. In the example below, even after write and flush, reading the length still returns zero.
However, any content written to the file is not guaranteed to be visible, even if the stream is flushed. So the file appears to have a length of zero:
Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
assertThat(fs.getFileStatus(p).getLen(), is(0L));
The GFS paper describes using lazy space allocation (to avoid heavy internal fragmentation): data is buffered and only written out when the block fills up, so other readers cannot see the contents of the block currently being written.
Once more than a block’s worth of data has been written, the first block will be visible to new readers. This is true of subsequent blocks, too: it is always the current block being written that is not visible to other readers.
HDFS provides a method for forcing all buffers to be synchronized to the datanodes via the sync() method on FSDataOutputStream. After a successful return from sync(), HDFS guarantees that the data written up to that point in the file is persisted and visible to all new readers.
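Continuing the write example above, a sketch of how sync() changes what a reader sees; it reuses fs from the earlier snippets and the same assertion style, so it is an illustration rather than a self-contained program.
Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
out.sync();
// After sync() returns, the bytes written so far are visible to new readers.
assertThat(fs.getFileStatus(p).getLen(), is((long) "content".length()));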
Consequences for application design
This coherency model has implications for the way you design applications. With no calls to sync(), you should be prepared to lose up to a block of data in the event of client or system failure.
For many applications, this is unacceptable, so you should call sync() at suitable points, such as after writing a certain number of records or number of bytes.
Forcing sync() calls introduces internal fragmentation, so there is a trade-off to make.
Parallel Copying with distcp
The HDFS access patterns that we have seen so far focus on single-threaded access. It’s possible to act on a collection of files, by specifying file globs, for example, but for efficient, parallel processing of these files you would have to write a program yourself.
Hadoop comes with a useful program called distcp for copying large amounts of data to and from Hadoop filesystems in parallel.
The canonical use case for distcp is for transferring data between two HDFS clusters.
If the clusters are running identical versions of Hadoop, the hdfs scheme is appropriate:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
This will copy the /foo directory (and its contents) from the first cluster to the /bar directory on the second cluster, so the second cluster ends up with the directory structure /bar/foo.
Hadoop Archives
HDFS stores small files inefficiently, since each file is stored in a block, and block metadata is held in memory by the namenode.
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files. In particular, Hadoop Archives can be used as input to MapReduce.
HDFS's design is essentially the same as GFS's.
Apart from not supporting snapshots, and deleting files directly rather than garbage-collecting them (http://labs.chinamobile.com/groups/10216_23490?fdlayenxoaencysxyant), the biggest difference is in the write path.
GFS is the more general design. It targets workloads where data is rarely modified once written, yet it still supports updates at a given offset and an atomic append operation, which is what lets it support concurrent writes to the same file.
HDFS is much simpler: whether for ease of implementation or because Hadoop's workloads do not need it, it does not support modifying files.
Early versions of HDFS did not support appends at all: once a file was closed, it was immutable. Moreover, a file did not exist until it had been successfully closed (by calling FSDataOutputStream.close()); if the client died before closing the file, or close() threw an exception, the file did not exist, as if it had never been written, and the only way to recover it was to write it again from scratch. MapReduce works well under this model, because a failed task is simply rerun from the beginning.
But in the HBase scenario, HDFS is used to store the log; a crash before the log file is closed would lose the entire log, and HBase also needs to reopen the log in append mode and write new records to it.
So an append feature was added to Hadoop (http://blog.csdn.net/zhaokunwu/article/details/7362119, File Appends in HDFS).
But this is not the same as GFS's atomic append: HDFS does not guarantee that appends are atomic, so concurrent writes are still not supported; moreover, the patch was aimed specifically at the HBase scenario, and wide use of append is not recommended.
On the reasons for this, quoting a blog post, http://blog.csdn.net/foreverdengwei/article/details/7323032
Owen O'Malley's original words: "
My personal inclination is that atomic append does very bad things to both the design of the file system and the user interface to the file system. Clearly they added atomic append to GFS before they had MapReduce. It seems like most applications would be better served by implementing in MapReduce rather than using atomic append anyways..."
Google added atomic append to GFS before it had MapReduce; Owen considers it a bad design, and argues that MapReduce can be used instead of atomic append.