MapReduce: Simplified Data Processing on Large Clusters (Chinese Translation, Part 2)

[Translator's note: I am a novice interested in Hadoop-style big data processing. I am currently working through Google's three world-changing papers, translating as I read. Mistakes are inevitable, so I would welcome any corrections; they will motivate my learning.]

1 Introduction

  Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

  As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
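The map/reduce model described above can be sketched in a few lines of Python, using word count as the running example. This is only an illustrative single-machine sketch, not Google's actual API: the function names (`map_fn`, `reduce_fn`, `run_mapreduce`) and the in-memory grouping step that stands in for the shuffle phase are my own assumptions.

```python
# Illustrative sketch of the map/reduce programming model (word count).
# NOT Google's API: names and the in-memory "shuffle" are assumptions.
from collections import defaultdict

def map_fn(document: str):
    # Map: emit an intermediate key/value pair for each word in a record.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key: str, values):
    # Reduce: combine all values that share the same key.
    return sum(values)

def run_mapreduce(documents):
    intermediate = defaultdict(list)
    for doc in documents:                  # map phase
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    return {key: reduce_fn(key, values)    # reduce phase
            for key, values in intermediate.items()}

counts = run_mapreduce(["the quick fox", "the lazy dog"])
# counts["the"] == 2; every other word maps to 1
```

Because `map_fn` and `reduce_fn` are pure functions of their inputs, the map phase can be split across many machines and any failed piece simply re-executed, which is exactly the fault-tolerance property the paragraph above describes.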

  The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

  Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google, including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.

--------------------------- Translation ---------------------------

1  Introduction

  Over the past five years, the authors and many other engineers at Google have implemented hundreds of special-purpose computations to process large amounts of raw data, such as crawled documents, web request logs, and so on, and to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, and the set of most frequent queries in a given day. Most of these computations are conceptually straightforward. However, the input data is usually very large, and the computation has to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The questions of how to parallelize the computation, how to distribute the data, and how to handle failures require large amounts of complex code, which obscures the originally simple computation.

  To cope with this complexity, we designed a new abstraction that lets us express the simple computations we want to perform while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing inside a library. Our abstraction is inspired by the map and reduce primitives found in Lisp and many other functional languages. We realized that most of our computations involve this pattern: apply a map function to each logical "record" of the input to produce a set of intermediate key/value pairs, and then apply a reduce function to all the values that share the same key, so as to combine the derived data appropriately. Combining this functional model with user-specified map and reduce operations lets us parallelize large computations easily, and lets us use re-execution as the primary mechanism for fault tolerance.

  The major contribution of this work is a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

  Section 2 of the paper describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored to our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 reports performance measurements of our implementation on a variety of tasks. Section 6 explores how Google uses MapReduce, including our experience in using it as the basis for rewriting our production indexing system. Section 7 discusses related and future work.

posted @ 2013-03-21 20:37 by 二手产品经理