[Hadoop Source Code Reading][3] - Differences Between the Old and New APIs
http://blog.csdn.net/xw13106209/article/details/6924458
Hadoop 0.20 differs considerably from earlier releases, including changes to parts of the API structure and to the configuration file structure.
This is explained in Hadoop: The Definitive Guide; the original text follows:
The new Java MapReduce API
Release 0.20.0 of Hadoop included a new Java MapReduce API, sometimes referred to as “Context Objects,” designed to make the API easier to evolve in the future. The new API is type-incompatible with the old, however, so applications need to be rewritten to take advantage of it.
There are several notable differences between the two APIs:
- The new API favors abstract classes over interfaces, since these are easier to evolve. For example, you can add a method (with a default implementation) to an abstract class without breaking old implementations of the class. In the new API, the Mapper and Reducer interfaces are now abstract classes. [See the mapper sketch after this list.]
- The new API is in the org.apache.hadoop.mapreduce package (and subpackages). The old API can still be found in org.apache.hadoop.mapred.
- The new API makes extensive use of context objects that allow the user code to communicate with the MapReduce system. The MapContext, for example, essentially unifies the role of the JobConf, the OutputCollector, and the Reporter.
- The new API supports both a “push” and a “pull” style of iteration. In both APIs, key-value record pairs are pushed to the mapper, but in addition, the new API allows a mapper to pull records from within the map() method. The same goes for the reducer. An example of how the “pull” style can be useful is processing records in batches, rather than one by one. [See the run() sketch after this list.]
- Configuration has been unified. The old API has a special JobConf object for job configuration, which is an extension of Hadoop’s vanilla Configuration object (used for configuring daemons; see “The Configuration API” on page 130). In the new API, this distinction is dropped, so job configuration is done through a Configuration.
- Job control is performed through the Job class, rather than JobClient, which no longer exists in the new API. [See the driver sketch after this list.]
- Output files are named slightly differently: part-m-nnnnn for map outputs, and part-r-nnnnn for reduce outputs (where nnnnn is an integer designating the part number, starting from zero).
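To make the first three points concrete, here is a minimal sketch of the same word-count mapper written against both APIs. The class and field names (OldTokenMapper, TokenMapper, ONE) are my own illustrations, not from the book; only the Hadoop types and packages are real.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old API (org.apache.hadoop.mapred): Mapper is an interface, and the
// framework hands the user code a separate OutputCollector and Reporter.
class OldTokenMapper extends MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            output.collect(word, ONE);
        }
    }
}

// New API (org.apache.hadoop.mapreduce): Mapper is an abstract class, and
// a single Context object takes over the roles of JobConf, OutputCollector,
// and Reporter.
class TokenMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);  // one call, instead of collector + reporter
        }
    }
}
```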
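The “pull” style is possible because the new API's Mapper exposes a run() method that drives the iteration. The sketch below shows how overriding run() lets a mapper pull records itself and handle them in batches; BatchingMapper, BATCH_SIZE, and processBatch() are all invented for illustration, so treat this as an assumption-laden sketch rather than code from the book.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Overriding run() replaces the default push loop (which calls map() once
// per record) with a pull loop under the mapper's own control.
public class BatchingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final int BATCH_SIZE = 100;  // illustrative value

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        List<String> batch = new ArrayList<String>(BATCH_SIZE);
        while (context.nextKeyValue()) {          // pull the next record
            batch.add(context.getCurrentValue().toString());
            if (batch.size() == BATCH_SIZE) {
                processBatch(batch, context);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            processBatch(batch, context);         // flush the final partial batch
        }
        cleanup(context);
    }

    // Hypothetical batch handler; emits one record per batch as a stand-in
    // for real batched work (e.g. a single call to an external service).
    private void processBatch(List<String> batch, Context context)
            throws IOException, InterruptedException {
        context.write(new Text("batch.size"), new LongWritable(batch.size()));
    }
}
```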
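The last two differences show up most clearly in the driver. Below is a minimal new-API driver sketch: configuration goes through a plain Configuration and job control through Job. TokenMapper refers to the mapper sketched earlier, and TokenReducer is an assumed class that is not shown, so this is an outline rather than a drop-in program.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // New API: a plain Configuration, not a JobConf
        Configuration conf = new Configuration();

        // New API: Job replaces JobClient for job control
        // (later releases supersede this constructor with Job.getInstance)
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);     // mapper sketched above
        job.setReducerClass(TokenReducer.class);   // assumed reducer, not shown
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit and wait; reduce output lands in part-r-nnnnn files
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```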
Summary
On page 25 of the Chinese second edition of Hadoop: The Definitive Guide, the translator rendered "abstract class" above as 虚类 (virtual class), which is wrong; it should be 抽象类 (abstract class). An abstract class and a virtual class are different concepts, and that passage was rather baffling when I first read it. Having recently read a bit about the differences between abstract classes and interfaces, I suspected that the new Hadoop API should favor abstract classes, not "virtual classes", so I went back to the original English text and confirmed that it does indeed say "abstract class".
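A short, non-Hadoop sketch of the point at issue: adding a method with a default body to an abstract class leaves existing subclasses working, which is exactly why the new API favors abstract classes over interfaces. The class names here (Processor, MyProcessor) are invented for illustration.

```java
// Hypothetical framework class, analogous to the new API's Mapper.
abstract class Processor {
    abstract void process(String record);

    // Added in a later release: because it has a default body, subclasses
    // compiled against the earlier version keep working unchanged.
    void setup() {
        // default: do nothing
    }
}

// User code written before setup() existed; it still compiles and runs,
// inheriting the default setup(). Had Processor been an interface
// (before Java 8 default methods), adding setup() would have broken it.
class MyProcessor extends Processor {
    @Override
    void process(String record) {
        System.out.println(record);
    }
}
```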