Reading Notes on the Official Apache Hadoop Documentation

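In the end there is no substitute for reading the project's own documentation; it is the most authoritative source. When I first looked at it, it was entirely in English and, as for most beginners, largely impenetrable. After spending some time learning the details of deploying, configuring, and using Hadoop, I went back to the official documentation and got a good deal more out of it.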

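Hadoop is an open-source Apache project. Its core modules are HDFS and Map/Reduce, which provide distributed storage and distributed computation respectively. I have been learning Hadoop 1.0.3 recently, so I read the English official documentation for that release closely and am recording my notes here. (Updated continuously.)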

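Figure 1 shows the documentation table of contents on the Hadoop 1.0.3 home page. The sections below go through the important parts and documents in it.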

Figure 1

1. Getting Started

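The content of Overview and Single Node Setup can be found in many places and needs no special attention. Cluster Setup, however, raises several points worth noting. The following is taken from Cluster Setup.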

Configuring the Environment of the Hadoop Daemons covers hadoop-env.sh and describes some of its advanced settings. For example:

Daemon	Configure Options
NameNode	HADOOP_NAMENODE_OPTS
DataNode	HADOOP_DATANODE_OPTS
SecondaryNamenode	HADOOP_SECONDARYNAMENODE_OPTS
JobTracker	HADOOP_JOBTRACKER_OPTS
TaskTracker	HADOOP_TASKTRACKER_OPTS

It also covers HADOOP_LOG_DIR and HADOOP_HEAPSIZE; see the original document for details.

Configuring the Hadoop Daemons introduces many important settings beyond the most basic ones, such as mapred.tasktracker.{map|reduce}.tasks.maximum, dfs.hosts/dfs.hosts.exclude, and mapred.hosts/mapred.hosts.exclude.
Real-World Cluster Configurations gives two sets of reference configurations taken from real sort runs: 9TB of data sorted on a cluster with 900 nodes, and 14TB of data sorted on 1400 nodes and 20TB of data sorted on 2000 nodes.
Task Controllers explains that, besides the default DefaultTaskController, there is a second controller, LinuxTaskController, and shows how to configure it. LinuxTaskController ensures that “except the job owner and tasktracker, no other user can access any of the local files/directories including those localized as part of the distributed cache”, which further strengthens security.
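
To make the effect of mapred.tasktracker.{map|reduce}.tasks.maximum concrete, here is a minimal sketch of the arithmetic, assuming every TaskTracker in the cluster uses the same values. The function name and the node counts and slot values in the usage lines are my own illustration, not figures from the documentation; the default of 2 for both settings is what Cluster Setup states.

```python
def cluster_slots(num_tasktrackers, map_max=2, reduce_max=2):
    """Return (map_slots, reduce_slots) for a cluster of identical TaskTrackers.

    map_max and reduce_max stand in for
    mapred.tasktracker.map.tasks.maximum and
    mapred.tasktracker.reduce.tasks.maximum.
    """
    if num_tasktrackers < 0 or map_max < 0 or reduce_max < 0:
        raise ValueError("counts must be non-negative")
    return num_tasktrackers * map_max, num_tasktrackers * reduce_max


if __name__ == "__main__":
    # Hypothetical cluster: 10 TaskTrackers with the default settings.
    print(cluster_slots(10))
    # Same hypothetical cluster after raising the per-node limits.
    print(cluster_slots(10, map_max=8, reduce_max=4))
```

The point is only that these two settings cap how many tasks run at once on each node, so the cluster-wide concurrency scales with the number of TaskTrackers.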

2. Guides

“This document describes how to configure Hadoop HTTP web-consoles to require user authentication. By default Hadoop HTTP web-consoles (JobTracker, NameNode, TaskTrackers and DataNodes) allow access without any form of authentication.”

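This document explains how to use the configuration files to require authentication for the web pages that show Hadoop cluster information, since by default anyone can view them without authenticating.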

3. MapReduce

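MapReduce Tutorial is a fairly comprehensive and detailed introduction to the principles and details of MapReduce. (Although this is the 1.0.3 documentation, the wordcount program in it is not the 1.0.3 version; the Apache maintainers have not updated it.)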

Example: WordCount v1.0 (a fairly old version of the wordcount program). Its Walk-through explains in detail the input and output at each of the map, combine, and reduce stages when processing file0 and file1.
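
The map, combine, and reduce stages of that walk-through can be imitated in a few lines of plain Python. This is only a sketch of the data flow, not the tutorial's Java program and not the Hadoop API: the function names are mine, and the two input strings are illustrative stand-ins for the contents of the two input files.

```python
from collections import defaultdict


def map_phase(text):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in text.split()]


def sum_by_key(pairs):
    """Sum the counts per word; serves as both the combiner and the reducer."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def word_count(splits):
    """Run map and combine once per input split, then reduce over everything."""
    combined = []
    for text in splits:
        combined.extend(sum_by_key(map_phase(text)))  # local aggregation
    return sum_by_key(combined)                       # global aggregation


if __name__ == "__main__":
    # Illustrative contents for the two input files.
    splits = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
    for text in splits:
        print("map:    ", map_phase(text))
        print("combine:", sum_by_key(map_phase(text)))
    print("reduce: ", word_count(splits))
```

Reusing the same summing function as combiner and reducer works here because addition is associative and commutative, so summing per split first does not change the final totals.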

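User Interfaces goes into a fair amount of detail from the user's point of view: “This should help users implement, configure and tune their jobs in a fine-grained manner.” It also points out, however, that the javadoc remains the best reference, and that is what you actually consult most when writing code.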

4. HDFS

HDFS Users Guide

The HDFS user guide. Needless to say, this is required reading.

The -refreshNodes option of the DFSAdmin Command is the one you need when dynamically adding or removing datanodes.
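
As I understand it, -refreshNodes makes the namenode re-read the host lists named by dfs.hosts and dfs.hosts.exclude and then re-evaluate each datanode. The sketch below is my own simplified model of that decision, not Hadoop code: the function name, the status strings, and the host names are hypothetical, and it assumes that an empty include list means every host is permitted.

```python
def classify_datanode(host, include, exclude):
    """Return the status a datanode would get after the lists are re-read.

    include and exclude model the host lists referenced by dfs.hosts and
    dfs.hosts.exclude. An empty include list is assumed to permit all hosts.
    """
    if include and host not in include:
        return "rejected"          # not permitted to connect to the namenode
    if host in exclude:
        return "decommissioning"   # permitted, but being drained and retired
    return "in service"


if __name__ == "__main__":
    # Hypothetical host names.
    include = {"dn1", "dn2", "dn3"}
    exclude = {"dn3"}
    for host in ["dn1", "dn3", "dn4"]:
        print(host, "->", classify_datanode(host, include, exclude))
```

In this model, adding a datanode means adding it to the include list, and removing one gracefully means adding it to the exclude list, each followed by a refresh.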

The differences between the Secondary Namenode, the Checkpoint Node, and the Backup Node (understanding them is very important when setting up a backup namenode):

The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit.
The Checkpoint node periodically creates checkpoints of the namespace. It downloads fsimage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode.
The Backup node provides the same checkpointing functionality as the Checkpoint node, as well as maintaining an in-memory, up-to-date copy of the file system namespace that is always synchronized with the active NameNode state.
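
All three descriptions rest on the same operation: merging the fsimage with the edits log to produce a new image. The following toy model shows that merge under heavy simplification. The namespace is a plain dict from path to replication factor, the three record types and all names are hypothetical, and none of this reflects the real on-disk formats.

```python
def apply_edits(fsimage, edits):
    """Replay edit-log records on a copy of the image and return the new image."""
    namespace = dict(fsimage)  # leave the caller's image untouched
    for op, path, *args in edits:
        if op == "create":
            namespace[path] = args[0]                 # args[0]: replication factor
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)  # args[0]: new path
        else:
            raise ValueError(f"unknown edit record: {op}")
    return namespace


def checkpoint(fsimage, edits):
    """Merge image and edits; return the new image and an emptied edits log."""
    return apply_edits(fsimage, edits), []


if __name__ == "__main__":
    fsimage = {"/a": 3}
    edits = [("create", "/b", 2), ("rename", "/a", "/c"), ("delete", "/b")]
    new_image, new_edits = checkpoint(fsimage, edits)
    print(new_image, new_edits)
```

In terms of this model, the Secondary NameNode and the Checkpoint node run the merge periodically, while the Backup node effectively applies each edit as it arrives and so always holds a current in-memory namespace.
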
HDFS Architecture

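As the article that introduces the HDFS architecture, this is official documentation that must be read, and read closely. Most descriptions of HDFS found elsewhere are translations of this document.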