MapReduce Library Classes
Besides letting developers write their own map and reduce functions, Hadoop also ships a library of commonly needed mappers, reducers, and partitioners. These live in the org.apache.hadoop.mapred.lib package, which in version 1.2.1 contains one interface and a number of classes. A parallel library exists under org.apache.hadoop.mapreduce.lib, with some overlap between the two: the mapred package belongs to the old API, while the mapreduce package is the refactored new API. Both can still be used.
The package's single interface:

| Interface | Description |
| --- | --- |
| InputSampler.Sampler<K,V> | Interface to sample using an InputFormat. |
The classes:

| Class | Description |
| --- | --- |
| BinaryPartitioner<V> | Partitions BinaryComparable keys using a configurable part of the bytes array returned by BinaryComparable.getBytes(). |
| ChainMapper | Allows multiple Mapper classes to be used within a single map task. |
| ChainReducer | Allows multiple Mapper classes to be chained after a Reducer within the reduce task. |
| CombineFileInputFormat<K,V> | An abstract InputFormat that returns CombineFileSplits from the InputFormat.getSplits(JobConf, int) method. |
| CombineFileRecordReader<K,V> | A generic RecordReader that can hand out different RecordReaders for each chunk in a CombineFileSplit. |
| CombineFileSplit | A sub-collection of input files. |
| DelegatingInputFormat<K,V> | An InputFormat that delegates behaviour of paths to multiple other InputFormats. |
| DelegatingMapper<K1,V1,K2,V2> | A Mapper that delegates behaviour of paths to multiple other mappers. |
| FieldSelectionMapReduce<K,V> | Implements a mapper/reducer class that can be used to perform field selections in a manner similar to Unix cut. |
| HashPartitioner<K2,V2> | Partitions keys by their Object.hashCode(). |
| IdentityMapper<K,V> | Implements the identity function, mapping inputs directly to outputs. |
| IdentityReducer<K,V> | Performs no reduction, writing all input values directly to the output. |
| InputSampler<K,V> | Utility for collecting samples and writing a partition file for TotalOrderPartitioner. |
| InputSampler.IntervalSampler<K,V> | Samples from s splits at regular intervals. |
| InputSampler.RandomSampler<K,V> | Samples from random points in the input. |
| InputSampler.SplitSampler<K,V> | Samples the first n records from s splits. |
| InverseMapper<K,V> | A Mapper that swaps keys and values. |
| KeyFieldBasedComparator<K,V> | A comparator implementation providing a subset of the features of Unix/GNU sort. |
| KeyFieldBasedPartitioner<K2,V2> | Defines a way to partition keys based on certain key fields (see also KeyFieldBasedComparator). |
| LongSumReducer<K> | A Reducer that sums long values. |
| MultipleInputs | Supports MapReduce jobs that have multiple input paths, with a different InputFormat and Mapper for each path. |
| MultipleOutputFormat<K,V> | An abstract class extending FileOutputFormat, allowing output data to be written to different output files. |
| MultipleOutputs | Simplifies writing to additional outputs other than the job's default output, via the OutputCollector passed to the map() and reduce() methods. |
| MultipleSequenceFileOutputFormat<K,V> | Extends MultipleOutputFormat, allowing output data to be written to different output files in SequenceFile format. |
| MultipleTextOutputFormat<K,V> | Extends MultipleOutputFormat, allowing output data to be written to different output files in text format. |
| MultithreadedMapRunner<K1,V1,K2,V2> | A multithreaded implementation of MapRunnable. |
| NLineInputFormat | An InputFormat that splits N lines of input as one split. |
| NullOutputFormat<K,V> | Consumes all output and puts it in /dev/null. |
| RegexMapper<K> | A Mapper that extracts text matching a regular expression. |
| TokenCountMapper<K> | A Mapper that maps text values into <token, freq> pairs. |
| TotalOrderPartitioner<K extends WritableComparable,V> | A Partitioner effecting a total order by reading split points from an externally generated source. |
So far the following classes have been used; the remaining classes and the interface will be examined later.

1) ChainMapper and ChainReducer: used together, they let a single job run a chain of mappers inside one map task, followed by a reducer, optionally followed by further mappers in the reduce task. They suit workloads that would otherwise require several chained MapReduce jobs: because the intermediate records pass between the chained steps in memory rather than being written out between jobs, this can significantly reduce disk I/O.

2) TokenCountMapper (TokenCounterMapper in the new API): splits each input value into individual words (using Java's StringTokenizer) and emits each word with a count of 1.

3) InverseMapper: a mapper that swaps keys and values.
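To see why chaining saves I/O, here is a minimal plain-Java sketch of the ChainMapper idea; the class, the two "mapper" functions, and their transformations are made up for illustration and are not the Hadoop classes themselves:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch of the ChainMapper pattern: several map steps run back-to-back
// inside one task, so intermediate records stay in memory instead of
// being written to disk between separate MapReduce jobs.
public class ChainDemo {

    // Two hypothetical chained "mappers": lower-case each record,
    // then strip everything except letters and spaces.
    static final Function<String, String> LOWERCASE = s -> s.toLowerCase();
    static final Function<String, String> STRIP = s -> s.replaceAll("[^a-z ]", "");

    // Composing the mappers and making one pass over the records mimics
    // chaining them inside a single map task: the output of LOWERCASE is
    // handed straight to STRIP, with no intermediate materialization.
    public static List<String> runChain(List<String> records) {
        Function<String, String> chain = LOWERCASE.andThen(STRIP);
        return records.stream().map(chain).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(runChain(Arrays.asList("Hello, World!", "Chain-Mapper demo")));
    }
}
```

In the real API, each link in the chain is registered with a static addMapper call, and Hadoop feeds each mapper's output records directly into the next mapper's map() method.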
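The token-counting and key/value-swapping logic of the last two classes can likewise be sketched in plain Java; the class and method names below are invented for the sketch, and only the use of StringTokenizer mirrors the real TokenCountMapper:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Sketch of TokenCountMapper + LongSumReducer + InverseMapper logic.
public class TokenCountDemo {

    // Map phase: emit (token, 1) for every StringTokenizer token; the
    // per-token sums computed here are what LongSumReducer would produce.
    public static Map<String, Long> tokenCounts(String value) {
        Map<String, Long> counts = new HashMap<>();
        StringTokenizer st = new StringTokenizer(value);
        while (st.hasMoreTokens()) {
            counts.merge(st.nextToken(), 1L, Long::sum);
        }
        return counts;
    }

    // InverseMapper-style swap: turn (token, count) into (count, token),
    // e.g. as a second job that orders tokens by frequency. Note that
    // tokens with equal counts collide in this simplified map.
    public static Map<Long, String> invert(Map<String, Long> counts) {
        Map<Long, String> inverted = new HashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            inverted.put(e.getValue(), e.getKey());
        }
        return inverted;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = tokenCounts("the quick fox the fox the");
        System.out.println(counts);
        System.out.println(invert(counts));
    }
}
```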
References:
1. Hadoop API documentation
2. Hadoop: The Definitive Guide