Important Concepts in Hadoop

  After studying Hadoop for a few days, I found that a few core concepts need to be understood first. Below are some good explanations from the Apache wiki, organized as follows:

  Source: http://wiki.apache.org/hadoop/FrontPage

1. NameNode  

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

The NameNode is a Single Point of Failure for the HDFS cluster. HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy. Hadoop 0.21+ has a BackupNameNode that is part of a plan to have an HA name service, but it needs active contributions from the people who want it (i.e. you) to make it Highly Available.

It is essential to look after the NameNode. Here are some recommendations from production use:

  • Use a good server with lots of RAM. The more RAM you have, the bigger the file system you can support, or the smaller the block size you can use.
  • Use ECC RAM.
  • On Java6u15 or later, run the server VM with compressed pointers -XX:+UseCompressedOops to cut the JVM heap size down.
  • List more than one name node directory in the configuration, so that multiple copies of the file system meta-data will be stored. As long as the directories are on separate disks, a single disk failure will not corrupt the meta-data (see the property sketch after this list).
  • Configure the NameNode to store one set of transaction logs on a separate disk from the image.
  • Configure the NameNode to store another set of transaction logs to a network mounted disk.
  • Monitor the disk space available to the NameNode. If free space is getting low, add more storage.
  • Do not host DataNode, JobTracker or TaskTracker services on the same system.
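The multi-directory and JVM recommendations above come down to ordinary configuration settings. Below is a minimal sketch using Hadoop 1.x names: dfs.name.dir (and, in versions that support a separate edits location, dfs.name.edits.dir) belong in hdfs-site.xml, shown here in key=value shorthand, and the JVM flag goes into HADOOP_NAMENODE_OPTS in hadoop-env.sh. The paths are placeholders, not recommendations.

dfs.name.dir=/disk1/hdfs/name,/disk2/hdfs/name,/mnt/nfs/hdfs/name
dfs.name.edits.dir=/disk3/hdfs/edits,/mnt/nfs/hdfs/edits
export HADOOP_NAMENODE_OPTS="-XX:+UseCompressedOops $HADOOP_NAMENODE_OPTS"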

If a NameNode does not start up, look at the TroubleShooting page.

2. DataNode

A DataNode stores data in the Hadoop file system (HDFS). A functional filesystem has more than one DataNode, with data replicated across them.

On startup, a DataNode connects to the NameNode; spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.

Client applications can talk directly to a DataNode, once the NameNode has provided the location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances near a DataNode talk directly to the DataNode to access the files. TaskTracker instances can, indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data.

DataNode instances can talk to each other, which is what they do when they are replicating data.

  • There is usually no need to use RAID storage for DataNode data, because data is designed to be replicated across multiple servers, rather than multiple disks on the same server.

  • An ideal configuration is for a server to have a DataNode, a TaskTracker, and matching physical disks, with one TaskTracker slot per CPU. This will allow every TaskTracker 100% of a CPU, and separate disks to read and write data (see the property sketch after this list).

  • Avoid using NFS for data storage in a production system.
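The disk layout described in the bullets above is configured by pointing the DataNode and the TaskTracker at one directory per physical disk. A minimal sketch with Hadoop 1.x property names and placeholder paths (dfs.data.dir lives in hdfs-site.xml, mapred.local.dir in mapred-site.xml):

dfs.data.dir=/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data
mapred.local.dir=/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local

The DataNode spreads blocks across the listed directories, so plain JBOD disks are used in parallel and RAID is unnecessary.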

3. JobTracker

The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.

  1. Client applications submit jobs to the JobTracker.
  2. The JobTracker talks to the NameNode to determine the location of the data.
  3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
  4. The JobTracker submits the work to the chosen TaskTracker nodes.
  5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
  6. The TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
  7. When the work is completed, the JobTracker updates its status.
  8. Client applications can poll the JobTracker for information.

The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
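Steps 1 and 8 above correspond directly to the client-side Job API. The sketch below is illustrative only: it assumes an already configured org.apache.hadoop.mapreduce.Job (a full driver that builds one is sketched in section 5.3), and JobMonitor/submitAndWatch are hypothetical names.

import org.apache.hadoop.mapreduce.Job;

// Hypothetical helper: hands an already configured Job to the JobTracker,
// then polls it until the job finishes.
public class JobMonitor {
    public static boolean submitAndWatch(Job job) throws Exception {
        job.submit();                                  // step 1: submit to the JobTracker
        while (!job.isComplete()) {                    // step 8: poll for status
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);                        // check again in five seconds
        }
        return job.isSuccessful();
    }
}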

4. TaskTracker

A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.

Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.

The TaskTracker spawns separate JVM processes to do the actual work; this is to ensure that process failure does not take down the TaskTracker. The TaskTracker monitors these spawned processes, capturing the output and exit codes. When the process finishes, successfully or not, the tracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that it is still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
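The slot counts and the child JVMs mentioned above are set per node with ordinary configuration properties. A minimal sketch using Hadoop 1.x names in mapred-site.xml (the values are placeholders to be tuned to the hardware):

mapred.tasktracker.map.tasks.maximum=2
mapred.tasktracker.reduce.tasks.maximum=2
mapred.child.java.opts=-Xmx512m

mapred.child.java.opts is passed to each spawned child JVM, which is why a runaway or crashing task cannot take the TaskTracker itself down.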

5. MapReduce

MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster. 

5.1 The Map

A map transform is provided to transform an input row of key and value into zero or more output key/value pairs (see the mapper sketch after this list):

  • map(key1,value) -> list<key2,value2>

That is, for an input it returns a list containing zero or more (key,value) pairs:

  • The output can be a different key from the input
  • The output can have multiple entries with the same key 
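As a concrete illustration of the map signature, here is a minimal word-count mapper sketch using the org.apache.hadoop.mapreduce API (the class name TokenCountMapper is just an example). The input key (key1) is the byte offset of the line; each output pair (key2, value2) is a word and the count 1.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in the input line.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // key2 = word, value2 = 1
        }
    }
}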

5.2 The Reduce

A reduce transform is provided to take all values for a specific key, and generate a new list of the reduced output (see the reducer sketch below).

  • reduce(key2, list<value2>) -> list<value3> 
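A matching reducer sketch (again a hypothetical class name): for one word it receives all the 1s that the mappers emitted, i.e. (key2, list<value2>), and writes the summed count as value3.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all counts emitted for one word and writes (word, total).
public class TokenCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);   // (key2, value3)
    }
}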

5.3 The MapReduce Engine

The key aspect of the MapReduce algorithm is that if every Map and Reduce is independent of all other ongoing Maps and Reduces, then the operation can be run in parallel on different keys and lists of data. On a large cluster of machines, you can go one step further, and run the Map operations on servers where the data lives. Rather than copy the data over the network to the program, you push out the program to the machines. The output list can then be saved to the distributed filesystem, and the reducers run to merge the results. Again, it may be possible to run these in parallel, each reducing different keys.

  • A distributed filesystem spreads multiple copies of the data across different machines. This not only offers reliability without the need for RAID-controlled disks, it offers multiple locations to run the mapping. If a machine with one copy of the data is busy or offline, another machine can be used.
  • A job scheduler (in Hadoop, the JobTracker), keeps track of which MR jobs are executing, schedules individual Maps, Reduces or intermediate merging operations to specific machines, monitors the success and failures of these individual Tasks, and works to complete the entire batch job.

  • The filesystem and job scheduler can be accessed by the people and programs that wish to read and write data, and to submit and monitor MR jobs.

Apache Hadoop is such a MapReduce engine. It provides its own distributed filesystem and runs Hadoop MapReduce jobs on servers near the data stored on the filesystem, or on any other supported filesystem, of which there is more than one.
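A minimal driver sketch that wires the two classes from sections 5.1 and 5.2 into a single job and hands it to the engine; it assumes the hypothetical TokenCountMapper and TokenCountReducer above and the Hadoop 1.x Job constructor.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures and submits the word-count job, then waits for it to finish.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");          // Job.getInstance(conf) in later versions
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenCountMapper.class);
        job.setCombinerClass(TokenCountReducer.class);  // optional map-side pre-reduce
        job.setReducerClass(TokenCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would typically be run with something like hadoop jar wordcount.jar WordCountDriver <input> <output>; the output directory must not already exist, or the job will fail immediately.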

5.4 Limitations

  • For maximum parallelism, you need the Maps and Reduces to be stateless, to not depend on any data generated in the same MapReduce job. You cannot control the order in which the maps run, or the reductions.

  • It is very inefficient if you are repeating similar searches again and again. A database with an index will always be faster than running an MR job over unindexed data. However, if that index needs to be regenerated whenever data is added, and data is being added continually, MR jobs may have an edge. That inefficiency can be measured in both CPU time and power consumed.
  • In the Hadoop implementation Reduce operations do not take place until all the Maps are complete (or have failed and been skipped). As a result, you do not get any data back until the entire mapping has finished.
  • There is a general assumption that the output of the reduce is smaller than the input to the Map. That is, you are taking a large datasource and generating smaller final values. 

5.5 Will MapReduce/Hadoop solve my problems?

If you can rewrite your algorithms as Maps and Reduces, then yes. If not, then no.

It is not a silver bullet to all the problems of scale, just a good technique to work on large sets of data when you can work on small pieces of that dataset in parallel.

6. Pseudo Distributed Hadoop

Pseudo Distributed Hadoop is where Hadoop runs as a set of independent JVMs, but only on a single host. It has much lower performance than a real Hadoop cluster, due to the smaller number of hard disks limiting IO bandwidth. It is, however, a good way to play with new MR algorithms on very small datasets, and to learn how to use Hadoop. Developers working in the Hadoop codebase usually test their code in this mode before deploying their build of Hadoop to a local test cluster.

If you are running in this mode (and don't have a proxy server fielding HTTP requests), and have not changed the default port values, then the NameNode and JobTracker web interfaces can be reached at http://localhost:50070/ and http://localhost:50030/ respectively.

6.1 Ports in Use

The ports above are the standard defaults; if the configuration files are changed then they will no longer be valid.

With only a single HDFS DataNode, the replication factor should be set to 1; the same goes for the replication factor of submitted jars. You also need to tell the JobTracker not to try handing a failing task to another TaskTracker, or to blacklist a tracker that appears to fail a lot. While those options are essential in large clusters with many machines, some of which will start to fail, on a single node cluster they do more harm than good.

mapred.submit.replication=1
mapred.skip.attempts.to.start.skipping=1
mapred.max.tracker.failures=10000
mapred.max.tracker.blacklists=10000
mapred.map.tasks.speculative.execution=false
mapred.reduce.tasks.speculative.execution=false
tasktracker.http.threads=5
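For completeness, the core settings that put Hadoop into pseudo-distributed mode in the first place are the usual ones from the Hadoop 1.x single-node setup: shown here in the same key=value shorthand, they live in core-site.xml, hdfs-site.xml and mapred-site.xml respectively, and the localhost ports are just the conventional defaults.

fs.default.name=hdfs://localhost:9000
dfs.replication=1
mapred.job.tracker=localhost:9001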