Custom comparable key classes and comparator classes in MapReduce (Part 2): a beginner traces how they work through the source code

The Job class

/**
 * Define the comparator that controls
 * how the keys are sorted before they
 * are passed to the {@link Reducer}.
 * @param cls the raw comparator
 * @see #setCombinerKeyGroupingComparatorClass(Class)
 */
public void setSortComparatorClass(Class<? extends RawComparator> cls
                                   ) throws IllegalStateException {
  ensureState(JobState.DEFINE);
  conf.setOutputKeyComparatorClass(cls);
}

In other words, this method defines the comparator that controls how the keys are sorted before they are passed to the Reducer.

<? extends RawComparator>

is an upper-bounded wildcard: the class passed in must be RawComparator itself or a subtype of RawComparator.
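To make the bound concrete, here is a minimal sketch using only the JDK. RawCmp, LengthCmp and WildcardDemo are hypothetical stand-ins invented for this illustration; they are not Hadoop classes, and RawCmp merely plays the role that RawComparator plays in setSortComparatorClass.

```java
import java.util.Comparator;

final class WildcardDemo {
    // Accepts RawCmp.class or the Class object of any subtype of RawCmp.
    static String register(Class<? extends RawCmp> cls) {
        return cls.getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(register(LengthCmp.class)); // compiles: LengthCmp is a RawCmp
        System.out.println(register(RawCmp.class));    // compiles: the bound itself
        // register(String.class);                     // would not compile: String is not a RawCmp
    }
}

// Hypothetical stand-in for RawComparator (assumption, not the Hadoop interface).
interface RawCmp extends Comparator<String> {}

// A subtype of the stand-in, so its Class object satisfies the bound.
class LengthCmp implements RawCmp {
    @Override
    public int compare(String a, String b) {
        return Integer.compare(a.length(), b.length());
    }
}
```

The commented-out call shows the point of the bound: a class that is not a subtype is rejected at compile time, before any job runs.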

RawComparator

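Interface Comparator

-- sub-interface RawComparator: Compare two objects in binary.

Its compare method:

public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

-- implementing class WritableComparator

Since cls must be RawComparator or a subtype of it, a custom comparator class that extends WritableComparator is acceptable too.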

The JobConf class

Clicking through setOutputKeyComparatorClass takes us into the JobConf class:

/**
 * Set the {@link RawComparator} comparator used to compare keys.
 * @param theClass the {@link RawComparator} comparator used to
 *                 compare keys.
 * @see #setOutputValueGroupingComparator(Class)
 */
// Sets the comparator used to compare keys; the theClass parameter is that comparator.
public void setOutputKeyComparatorClass(Class<? extends RawComparator> theClass) {
  setClass(JobContext.KEY_COMPARATOR,
           theClass, RawComparator.class);
}

Its body is a single call to setClass:

setClass(JobContext.KEY_COMPARATOR, theClass, RawComparator.class);

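About setClass

 * An exception is thrown if <code>theClass</code> does not implement the
 * interface <code>xface</code>.

setClass stores theClass as the value of the KEY_COMPARATOR property defined in JobContext. The class must be RawComparator itself or a subtype of it; if it is not, an exception is thrown. In other words, theClass has to implement RawComparator.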
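The check described above can be sketched with plain JDK classes. MiniConf, ByLength and the property name "demo.key.comparator" are assumptions made up for this illustration; the sketch imitates the behaviour the Javadoc describes and is not the Hadoop Configuration class.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

final class MiniConf {
    private final Map<String, String> props = new HashMap<>();

    // Reject theClass unless it implements xface, then store its name under the property.
    void setClass(String name, Class<?> theClass, Class<?> xface) {
        if (!xface.isAssignableFrom(theClass)) {
            throw new RuntimeException(theClass + " not " + xface.getName());
        }
        props.put(name, theClass.getName());
    }

    String get(String name) {
        return props.get(name);
    }

    public static void main(String[] args) {
        MiniConf conf = new MiniConf();
        conf.setClass("demo.key.comparator", ByLength.class, Comparator.class);
        System.out.println(conf.get("demo.key.comparator"));
        try {
            // String does not implement Comparator, so this is rejected at run time.
            conf.setClass("demo.key.comparator", String.class, Comparator.class);
        } catch (RuntimeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}

// Hypothetical comparator used only to have something valid to register.
class ByLength implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        return Integer.compare(a.length(), b.length());
    }
}
```

Note that the property holds only the class name; an instance is created later, when the comparator is actually requested.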

Where there is a setOutputKeyComparatorClass, there is also a getOutputKeyComparator. It lives in the same JobConf class, and it returns the comparator used to compare keys, typed as RawComparator:

/**
 * Get the {@link RawComparator} comparator used to compare keys.
 *
 * @return the {@link RawComparator} comparator used to compare keys.
 */
public RawComparator getOutputKeyComparator() {
  // theClass is null if the KEY_COMPARATOR property has no value.
  Class<? extends RawComparator> theClass = getClass(
    JobContext.KEY_COMPARATOR, null, RawComparator.class);
  // If a comparator class was configured, create an instance of it by reflection.
  if (theClass != null)
    return ReflectionUtils.newInstance(theClass, this);
  // Otherwise, use the default comparator for the map output key class.
  return WritableComparator.get(getMapOutputKeyClass().
                                asSubclass(WritableComparable.class), this);
}

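The branch

if (theClass != null)
  return ReflectionUtils.newInstance(theClass, this);

is the one taken when we specify a comparator class ourselves, that is, when we call job.setSortComparatorClass(xxxS.class), where xxxS extends WritableComparator and overrides its compare method.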
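The "configured class if present, default otherwise" logic can be sketched without Hadoop. ComparatorLookup and Reverse are hypothetical names for this illustration, and the JDK's natural ordering stands in for the default WritableComparator; this is a sketch of the pattern, not the JobConf implementation.

```java
import java.util.Comparator;

final class ComparatorLookup {
    // Returns an instance of the configured comparator class,
    // or the natural ordering when nothing was configured (null).
    static Comparator<String> resolve(Class<? extends Comparator<String>> configured) {
        if (configured != null) {
            try {
                // Create the comparator by reflection through its no-arg constructor.
                return configured.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        }
        return Comparator.naturalOrder();
    }

    public static void main(String[] args) {
        System.out.println(resolve(null).compare("a", "b"));          // default ordering
        System.out.println(resolve(Reverse.class).compare("a", "b")); // custom ordering
    }
}

// Hypothetical custom comparator that reverses the natural order.
class Reverse implements Comparator<String> {
    public Reverse() {}

    @Override
    public int compare(String a, String b) {
        return b.compareTo(a);
    }
}
```

Creating the instance by reflection is why the custom comparator needs a constructor that takes no arguments.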

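The MapTask$MapOutputBuffer class

At this point one question remains (for the perfectionists among us): who actually calls getOutputKeyComparator?

MapTask has an inner class, MapOutputBuffer:

Field: private RawComparator<K> comparator;

Where the field is assigned: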

// k/v serialization

comparator = job.getOutputKeyComparator();

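So getOutputKeyComparator is called, and the field assigned, where the buffer sets up key/value serialization.

(Tip: Ctrl+Shift+P jumps to the matching bracket.)

Method: compare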

/**
 * Compare logical range, st i, j MOD offset capacity.
 * Compare by partition, then by key.
 * @see IndexedSortable#compare
 */
public int compare(final int mi, final int mj) {
  final int kvi = offsetFor(mi % maxRec);
  final int kvj = offsetFor(mj % maxRec);
  final int kvip = kvmeta.get(kvi + PARTITION);
  final int kvjp = kvmeta.get(kvj + PARTITION);
  // sort by partition
  if (kvip != kvjp) {
    return kvip - kvjp;
  }
  // sort by key
  return comparator.compare(kvbuffer,
      kvmeta.get(kvi + KEYSTART),
      kvmeta.get(kvi + VALSTART) - kvmeta.get(kvi + KEYSTART),
      kvbuffer,
      kvmeta.get(kvj + KEYSTART),
      kvmeta.get(kvj + VALSTART) - kvmeta.get(kvj + KEYSTART));
}
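The ordering this method implements, partition first and key second, can be sketched on a small in-memory array. PartitionThenKey, Rec and compareBytes are hypothetical names for this illustration, and the unsigned byte-by-byte comparison is an assumed stand-in for whatever key comparator the job supplies; the real buffer works on offsets into kvbuffer rather than on objects.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;

final class PartitionThenKey {
    // A record: its partition number plus the serialized key bytes.
    static final class Rec {
        final int partition;
        final byte[] key;

        Rec(int partition, String key) {
            this.partition = partition;
            this.key = key.getBytes(StandardCharsets.UTF_8);
        }

        String keyString() {
            return new String(key, StandardCharsets.UTF_8);
        }
    }

    // Compare by partition first; only records in the same partition are compared by key.
    static int compare(Rec a, Rec b, Comparator<byte[]> keyComparator) {
        if (a.partition != b.partition) {
            return a.partition - b.partition;
        }
        return keyComparator.compare(a.key, b.key);
    }

    // Assumed key comparator: lexicographic order over unsigned bytes.
    static int compareBytes(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    // Sorts a copy of the records and renders them as "partition:key" pairs.
    static String sortedKeys(Rec[] recs) {
        Rec[] copy = recs.clone();
        Arrays.sort(copy, (a, b) -> compare(a, b, PartitionThenKey::compareBytes));
        StringBuilder sb = new StringBuilder();
        for (Rec r : copy) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(r.partition).append(':').append(r.keyString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Rec[] recs = { new Rec(1, "b"), new Rec(0, "c"), new Rec(0, "a") };
        System.out.println(sortedKeys(recs)); // prints "0:a 0:c 1:b"
    }
}
```

Records for the same reducer end up contiguous, and within each partition they are in key order.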

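The comparator.compare call above matches the signature declared in RawComparator:

public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

So when we pass in xxxS, a subclass of WritableComparator, the method invoked here is the byte-array compare that xxxS inherits from WritableComparator. WritableComparator also has a second, overloaded compare method.

This is the byte-array compare in WritableComparator: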

/** Optimization hook. Override this to make SequenceFile.Sorter's scream.
 *
 * The default implementation reads the data into two {@link
 * WritableComparable}s (using {@link
 * Writable#readFields(DataInput)}, then calls {@link
 * #compare(WritableComparable,WritableComparable)}.
 */
@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
  try {
    buffer.reset(b1, s1, l1);                   // parse key1
    key1.readFields(buffer);
    buffer.reset(b2, s2, l2);                   // parse key2
    key2.readFields(buffer);
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
  return compare(key1, key2);                   // compare them
}

As far as I can tell, the first part deserializes the two keys, key1 and key2, from the byte arrays.

What finally gets called is compare(key1, key2):

/** Compare two WritableComparables.
 *
 * The default implementation uses the natural ordering, calling {@link
 * Comparable#compareTo(Object)}. */
@SuppressWarnings("unchecked")
public int compare(WritableComparable a, WritableComparable b) {
  return a.compareTo(b);
}

At this point the compareTo method of WritableComparable is called, and that is the method we overrode

(our custom class implements the WritableComparable interface and overrides compareTo).
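The chain traced above (raw bytes, then deserialized keys, then compareTo) can be sketched with JDK streams only. RawCompareSketch, IntKey and serialize are hypothetical names for this illustration, and IntKey is an assumed key type holding a single int; this is a sketch of the idea, not the WritableComparator code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class RawCompareSketch {
    // Hypothetical key type: one int, ordered by its value.
    static final class IntKey implements Comparable<IntKey> {
        int value;

        void readFields(DataInputStream in) throws IOException {
            value = in.readInt();
        }

        @Override
        public int compareTo(IntKey o) {
            return Integer.compare(value, o.value);
        }
    }

    // Byte-level entry point: rebuild both keys from their byte ranges,
    // then delegate to the keys' own compareTo.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        IntKey key1 = new IntKey();
        IntKey key2 = new IntKey();
        try {
            key1.readFields(new DataInputStream(new ByteArrayInputStream(b1, s1, l1)));
            key2.readFields(new DataInputStream(new ByteArrayInputStream(b2, s2, l2)));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return key1.compareTo(key2);
    }

    // Helper that produces the serialized form of an int key.
    static byte[] serialize(int v) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(v);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = serialize(3);
        byte[] b = serialize(10);
        System.out.println(compare(a, 0, a.length, b, 0, b.length) < 0); // prints "true"
    }
}
```

Deserializing on every comparison is the cost that the "Optimization hook" Javadoc refers to: a subclass can override the byte-array compare and work on the bytes directly.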

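One more point: as mentioned earlier, a class passed to setSortComparatorClass must be RawComparator or a subtype of it.

(1)

Suppose we define a custom key class, keyxxxS, that implements the WritableComparable interface and overrides compareTo.

In that case there is no need to call setSortComparatorClass.

getOutputKeyComparator then falls through to return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this); where getMapOutputKeyClass is: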

/**
 * Get the key class for the map output data. If it is not set, use the
 * (final) output key class. This allows the map output key class to be
 * different than the final output key class.
 *
 * @return the map output key class.
 */
public Class<?> getMapOutputKeyClass() {
  Class<?> retv = getClass(JobContext.MAP_OUTPUT_KEY_CLASS, null, Object.class);
  if (retv == null) {
    retv = getOutputKeyClass();
  }
  return retv;
}

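As the name suggests, this returns the map output key class, i.e. the one passed to job.setMapOutputKeyClass(xxx.class), for example Text or our custom keyxxxS.

How to define a custom key class such as keyxxxS

The declaration of the WritableComparable interface:

public interface WritableComparable<T> extends Writable, Comparable<T>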

WritableComparable extends Writable, whose Javadoc reads:

/**
 * A serializable object which implements a simple, efficient, serialization
 * protocol, based on {@link DataInput} and {@link DataOutput}.
 *
 * Any <code>key</code> or <code>value</code> type in the Hadoop Map-Reduce
 * framework implements this interface.

(In other words, the type of every key and value should implement this interface, for example Text and IntWritable.)

Let's check the source of the Text class to verify:

public class Text extends BinaryComparable
    implements WritableComparable<BinaryComparable> {}

 * Implementations typically implement a static <code>read(DataInput)</code>
 * method which constructs a new instance, calls {@link #readFields(DataInput)}
 * and returns the instance.

That is, an implementing class usually provides a static read method that builds a new instance, calls readFields on it, and returns it.
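The static read pattern described above can be sketched with the JDK's DataInput and DataOutput alone. ReadPattern, Stamp and roundTrip are hypothetical names for this illustration, and Stamp does not implement the Hadoop Writable interface; it only mirrors the shape of its write and readFields methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

final class ReadPattern {
    // Hypothetical value type with two fields.
    static final class Stamp {
        int counter;
        long timestamp;

        // Serialize the fields in a fixed order.
        void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        // Read the fields back in the same order they were written.
        void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }

        // The static read pattern: construct, populate via readFields, return.
        static Stamp read(DataInput in) throws IOException {
            Stamp s = new Stamp();
            s.readFields(in);
            return s;
        }
    }

    // Writes a Stamp to a byte array and reads it back through Stamp.read.
    static Stamp roundTrip(int counter, long timestamp) {
        Stamp original = new Stamp();
        original.counter = counter;
        original.timestamp = timestamp;
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            return Stamp.read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Stamp s = roundTrip(7, 123456789L);
        System.out.println(s.counter + " " + s.timestamp); // prints "7 123456789"
    }
}
```

The only contract between write and readFields is the field order, so the two methods have to be kept in step whenever a field is added.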

Below is a complete example given in the Javadoc:

Example:
 * <blockquote><pre>
 * public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
 *   // Some data
 *   private int counter;
 *   private long timestamp;
 *
 *   public void write(DataOutput out) throws IOException {
 *     out.writeInt(counter);
 *     out.writeLong(timestamp);
 *   }
 *
 *   public void readFields(DataInput in) throws IOException {
 *     counter = in.readInt();
 *     timestamp = in.readLong();
 *   }
 *
 *   public int compareTo(MyWritableComparable o) {
 *     // The Javadoc writes this.value and o.value, but the class declares
 *     // no such field; counter is used here so that the example compiles.
 *     int thisValue = this.counter;
 *     int thatValue = o.counter;
 *     return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
 *   }
 *
 *   public int hashCode() {
 *     final int prime = 31;
 *     int result = 1;
 *     result = prime * result + counter;
 *     result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
 *     return result;
 *   }
 * }