Hadoop Programming Notes (1): The Mapper and Reducer Classes in Detail
This Hadoop Programming Notes series focuses on the programming side of Hadoop: the usage and role of the main classes and interfaces, programming techniques, best practices, and so on. If you want to learn more about Hadoop's own features and its surrounding ecosystem (Pig, Hive, HBase, etc.), please see my other series, Hadoop Study Notes. I know my abilities are limited, so please bear with anything I get wrong, and do point it out to me~~
Note: this article is based on the Hadoop 1.0.4 API documentation.
1. Mapper
Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records; a given input pair may map to zero or many output pairs.
The InputFormat generates one or more InputSplits for the job, and the Hadoop Map-Reduce framework then spawns one map task for each InputSplit. Mapper implementations can access the job's Configuration object (which holds the job's various configuration settings) via JobContext.getConfiguration().
The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context) to prepare the working environment, then calls map(Object, Object, Context) for each key/value pair in the InputSplit, and finally calls cleanup(Context) to do any cleanup work.
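A minimal sketch of this lifecycle, in the style of a word count (the TokenMapper class and its fields are illustrative, not from the original post):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical word-count mapper illustrating the setup/map/cleanup lifecycle.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void setup(Context context) {
        // Called once per task, before any map() calls; read the job's
        // settings here via context.getConfiguration() if needed.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once for every key/value pair in the InputSplit.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit an intermediate pair
        }
    }

    @Override
    protected void cleanup(Context context) {
        // Called once per task, after the last map() call.
    }
}
```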
All intermediate values that share a given output key are subsequently grouped by the framework and passed to a Reducer to determine the final output. Users can control the sorting and grouping of the intermediate data by specifying two key RawComparator classes.
The Mapper outputs are partitioned, one partition per Reducer. Users can control which keys (and hence which records) go to which Reducer by implementing a custom Partitioner.
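As an illustration (this FirstLetterPartitioner class is an assumption, not from the original post), a custom Partitioner only needs to implement getPartition, and is registered with job.setPartitionerClass(FirstLetterPartitioner.class):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: route keys by their first character, so keys
// sharing a first letter always reach the same Reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        char first = s.isEmpty() ? '\0' : s.charAt(0);
        return (first & Integer.MAX_VALUE) % numPartitions;
    }
}
```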
Users can optionally specify a combiner via Job.setCombinerClass(Class) to perform local aggregation of the intermediate outputs, which helps cut down the amount of data transferred over the network from the Mapper to the Reducer.
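For example, when the reduce function is associative and commutative (as in a word count), the Reducer class itself can often be reused as the combiner; IntSumReducer here is the hypothetical summing reducer sketched at the end of this article:

```java
// Sketch: reuse a summing reducer as a map-side combiner to
// pre-aggregate counts before they cross the network.
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
```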
Applications can specify via the Configuration whether and how the intermediate outputs are to be compressed, and which CompressionCodec is to be used.
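A sketch of how this might look, assuming the Hadoop 1.x property names mapred.compress.map.output and mapred.map.output.compression.codec (verify these against your exact version):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

Configuration conf = new Configuration();
// Compress the map-side intermediate output with gzip.
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
              GzipCodec.class, CompressionCodec.class);
```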
If the job has zero reduces, the output of the Mapper is written directly to the OutputFormat without sorting by keys.
Besides the setup(), map(), and cleanup() methods introduced above, the abstract Mapper class has one more method: public void run(Mapper.Context context). Users can override this method to gain finer control over how the Mapper executes.
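The default run() simply drives the setup/map/cleanup lifecycle described above. A sketch of an override, placed inside a Mapper subclass such as the TokenMapper above (the per-record counter is a hypothetical addition; the rest mirrors the default loop):

```java
@Override
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
        // A custom run() can add per-record logic here, e.g. a counter
        // (the "app"/"records" names are example values).
        context.getCounter("app", "records").increment(1);
    }
    cleanup(context);
}
```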
2. Reducer
A Reducer implementation can obtain the job's Configuration object (which holds the job's various configuration settings) via JobContext.getConfiguration().
A Reducer has three primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper across the network using HTTP.
2. Sort
The framework merge-sorts the Reducer's inputs by key (since different Mappers may have output the same key to the same Reducer).
The shuffle and sort phases occur simultaneously; that is, map outputs are merged while they are being fetched.
2.1 SecondarySort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will then be sorted using the entire composite key, but grouped using the grouping comparator, which decides which keys and values are sent to the same call to reduce. The grouping comparator is specified via Job.setGroupingComparatorClass(Class), and the sort order is controlled by Job.setSortComparatorClass(Class).
For example, say you want to find duplicate web pages and tag them all with the URL of the "best" known example (say, the one with the highest PageRank). You would set up the job like this (a configuration sketch follows the list):
- Map Input Key: url
- Map Input Value: document
- Map Output Key: document checksum, url pagerank
- Map Output Value: url
- Partitioner: by checksum
- OutputKeyComparator: by checksum and then decreasing pagerank
- OutputValueGroupingComparator: by checksum
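A sketch of the corresponding job wiring (the comparator and partitioner class names are hypothetical placeholders for classes you would write yourself):

```java
import org.apache.hadoop.mapreduce.Job;

Job job = new Job(conf, "dedup-pages");
// Partition on checksum only, so all URLs of a duplicate page
// land on the same reducer.
job.setPartitionerClass(ChecksumPartitioner.class);
// Sort the composite key by checksum, then by decreasing pagerank,
// so the "best" URL arrives first in each group.
job.setSortComparatorClass(ChecksumThenPagerankComparator.class);
// Group on checksum alone: one reduce() call per duplicate set.
job.setGroupingComparatorClass(ChecksumGroupingComparator.class);
```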
3. Reduce
In this phase, the reduce(Object, Iterable, Context) method is called once for each <key, (collection of values)> pair in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
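Putting the phases together, a minimal sketch of a summing Reducer (this IntSumReducer matches the hypothetical combiner mentioned earlier; it is illustrative, not from the original post):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: sums the counts emitted by the mapper/combiner.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all of that key's grouped values.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);   // goes to the RecordWriter, unsorted
    }
}
```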
This article is original; please credit the source when reposting: http://www.cnblogs.com/beanmoon/archive/2012/12/06/2804594.html
Author: beanmoon
Home: http://www.cnblogs.com/beanmoon/
The copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but this notice must be kept unless the author agrees otherwise, and a link to the original must appear in a prominent position on the page.
This article is also published on my standalone blog, 豆月博客.