hello world!!!!!

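Writing down some of my own insights, my ways of approaching problems, and the hardships of the programming road, in the hope of one day becoming an expert.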
December 17, 2011

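Summary: BloomFilter: a powerful tool for large-scale data processing. The Bloom Filter is a fast lookup algorithm based on mapping through multiple hash functions, proposed by Bloom in 1970. It is typically used where you need to decide quickly whether an element belongs to a set, but do not strictly require 100% accuracy. 1. An example. To show why the Bloom Filter matters, consider an example: suppose you have to write a web crawler. Because links on the web are intricately interconnected, the crawler can easily end up travelling in a “loop”. To avoid loops, the crawler needs to know which URLs it has already visited. Given a URL, how do you tell whether the crawler has visited it? With a little thought, a few approaches come to mind: 1. Take the visited...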
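
The crawler scenario above can be sketched as a small Bloom filter. This is a minimal illustration, not the post's own code: the class name `BloomFilter`, the bit-array size, the number of hash functions, and the use of seeded SHA-256 digests as the hash family are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes and hash family are assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes and initially all zero.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly seen" (false positives can occur);
        # False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


visited = BloomFilter()
visited.add("http://example.com/a")
already_seen = "http://example.com/a" in visited
```

An added URL is always reported as present; a URL that was never added is usually reported as absent, but may occasionally be reported as present, which is the "not strictly 100% correct" trade-off the summary describes.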

posted @ 2011-12-17 14:51 陈力 Views(170) Comments(0) Recommended(0)

Summary: Let’s see how MapReduce reads input data and writes output data and focus on the file formats it uses. To enable easy distributed processing, MapReduce makes certain assumptions about the data it’s processing. It also provides flexibility in dealing with a variety of data formats. Input data usually res...

posted @ 2011-12-17 02:38 陈力 Views(283) Comments(0) Recommended(0)

Summary: In many situations with MapReduce applications, we may wish to perform a “local reduce” before we distribute the mapper results. Consider the WordCount example of ... (Figure caption: The MapReduce data flow, with an emphasis on partitioning and shuffling. Each icon is a key/value pair. The shapes represent keys, whereas...)
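
The “local reduce” idea can be sketched for word counting as follows. This is a plain-Python simulation under stated assumptions, not Hadoop code: the function names `map_words`, `combine`, and `reduce_counts`, and the two sample input splits, are hypothetical.

```python
from collections import Counter


def map_words(line):
    # Mapper: emit one (word, 1) pair per token.
    return [(word, 1) for word in line.split()]


def combine(pairs):
    # Local reduce: sum counts per word within one mapper's output.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def reduce_counts(all_pairs):
    # Final reduce: sum counts per word across all mappers.
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


splits = ["the cat the dog", "the end"]          # hypothetical input splits
raw = [map_words(s) for s in splits]              # 4 + 2 = 6 pairs to shuffle
combined = [combine(p) for p in raw]              # 3 + 2 = 5 pairs to shuffle

without_combiner = reduce_counts(p for part in raw for p in part)
with_combiner = reduce_counts(p for part in combined for p in part)
```

Both paths give the same totals; the combined path simply sends fewer pairs across the network. This works here because addition is associative and commutative, which is not true of every reduce function.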

posted @ 2011-12-17 02:35 陈力 Views(233) Comments(0) Recommended(0)

Summary: Partitioner: redirecting output from Mapper. A common misconception for first-time MapReduce programmers is to use only a single reducer. After all, a single reducer sorts all of your data before processing, and who doesn’t like sorted data? Our discussions regarding MapReduce...
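
How mapper output might be routed to several reducers can be sketched as below. This is an illustration rather than Hadoop's implementation: Hadoop's default `HashPartitioner` takes the key's Java `hashCode()` modulo the number of reduce tasks, whereas this sketch substitutes CRC32 so the result is stable between Python runs. The names `partition` and `shuffle` and the sample pairs are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    # Same key always maps to the same reducer index.
    # CRC32 stands in for Java's hashCode() (assumption for this sketch).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    # Route each (key, value) pair to the bucket of its reducer.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
buckets = shuffle(pairs, 3)
```

With `num_reducers` set to 1 every pair lands in the same bucket, which is the single-reducer case the summary warns about; with more reducers, all values for a given key still meet at one reducer.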

posted @ 2011-12-17 02:28 陈力 Views(305) Comments(0) Recommended(0)

Summary: When the reducer task receives the output from the various mappers, it sorts the incoming data on the key of the (key/value) pair and groups together all values of the same key. Table 3.3 Some useful Reducer implementations predefined by Hadoop. Class Descr...
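
The sort-then-group step can be sketched in plain Python. This is a simulation under assumptions, not Hadoop code: `sort_and_group` and `reduce_sum` are hypothetical names, and the summing reducer is just one example of what a reduce function might do.

```python
from itertools import groupby
from operator import itemgetter


def sort_and_group(pairs):
    # Sort by key, then collect all values that share a key.
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_sum(key, values):
    # Example reducer: add up the values for one key.
    return key, sum(values)


incoming = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]   # merged mapper output
grouped = sort_and_group(incoming)
results = [reduce_sum(key, values) for key, values in grouped]
```

Here `grouped` is `[("a", [2, 4]), ("b", [1, 3])]` and `results` is `[("a", 6), ("b", 4)]`. The order of values within a key follows arrival order in this sketch only because Python's sort is stable.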

posted @ 2011-12-17 02:06 陈力 Views(215) Comments(0) Recommended(0)

Summary: The function generates a (possibly empty) list of (K2, V2) pairs for a given (K1, V1) input pair. The OutputCollector receives the output of the mapping process, and the Reporter provides the option to record extra information about the mapper as the task progresses. Hadoop provides a few useful mapper...
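
The shape of such a map function can be sketched in plain Python. This is a simulation under assumptions, not the Hadoop API: the `collect` callback stands in for the OutputCollector, the Reporter is left out, and `tokenize_mapper`, `run_mapper`, and the sample records are hypothetical.

```python
def tokenize_mapper(offset, line, collect):
    # (K1, V1) = (byte offset, line of text); emits zero or more (K2, V2) pairs.
    for token in line.split():
        collect(token, 1)


def run_mapper(mapper, records):
    # Gather everything the mapper emits, playing the collector's role.
    output = []
    for key, value in records:
        mapper(key, value, lambda k, v: output.append((k, v)))
    return output


records = [(0, "hello hadoop"), (13, ""), (14, "hello")]
pairs = run_mapper(tokenize_mapper, records)
```

The empty second record produces no output pairs at all, which is the "possibly empty" case from the summary.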

posted @ 2011-12-17 01:57 陈力 Views(148) Comments(0) Recommended(0)

Summary: Despite our many discussions regarding keys and values, we have yet to mention their types. The MapReduce framework won’t allow them to be any arbitrary class. For example, although we can and often do talk about certain keys and values as integers, strings, and so on, they aren’t exactly standard Java...

posted @ 2011-12-17 01:55 陈力 Views(175) Comments(0) Recommended(0)