Post archive for December 17, 2011 - 陈力

BloomFilter: a powerful tool for large-scale data processing

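Summary: The Bloom filter is a fast lookup algorithm that maps elements through multiple hash functions, proposed by Bloom in 1970. It is typically used where you need to decide quickly whether an element belongs to a set but do not strictly require 100% accuracy. 1. Example: To show why the Bloom filter matters, take an example: suppose you have to write a web crawler. Because links on the web are tangled together, the crawler can easily end up going around in a "loop". To avoid loops, it needs to know which URLs it has already visited. Given a URL, how do you tell whether the crawler has visited it? A little thought suggests the following approaches: 1. Take the visited...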
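The crawler scenario in the summary above can be sketched as a small Bloom filter. This is only a minimal illustration, not the code from the full post: the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, and the choice of salted SHA-256 as the hash family are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch; sizes and hash choice are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set every bit position that the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


visited = BloomFilter()
visited.add("http://example.com/a")
print("http://example.com/a" in visited)  # an added URL is always reported as seen
```

A lookup can return a false positive (a URL reported as visited that never was), but never a false negative, which is why the structure suits cases that do not require 100% accuracy.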

posted @ 2011-12-17 14:51 陈力

3.3 Reading and writing

Summary: Let's see how MapReduce reads input data and writes output data and focus on the file formats it uses. To enable easy distributed processing, MapReduce makes certain assumptions about the data it's processing. It also provides flexibility in dealing with a variety of data formats. Input data usually res...

posted @ 2011-12-17 02:38 陈力

Combiner: local reduce (reduce locally first, then run one final reduce)

Summary: In many situations with MapReduce applications, we may wish to perform a "local reduce" before we distribute the mapper results. Consider the WordCount example of... [Figure caption: The MapReduce data flow, with an emphasis on partitioning and shuffling. Each icon is a key/value pair. The shapes represent keys, whereas...]
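The "local reduce" idea can be sketched in plain Python for a word count. This is a simulation of the data flow only, not Hadoop's API: the function names `map_words`, `combine`, and `reduce_counts`, and the two sample input splits, are hypothetical.

```python
from collections import Counter


def map_words(line):
    # Mapper: emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]


def combine(pairs):
    # Combiner: sum counts per word within one mapper's output.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reduce_counts(all_pairs):
    # Reducer: sum the (already partially summed) counts per word.
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


splits = [["the cat the dog"], ["the end"]]  # two hypothetical input splits
shuffled = []
for split in splits:
    mapped = [pair for line in split for pair in map_words(line)]
    shuffled.extend(combine(mapped))  # only combined pairs leave the mapper

totals = reduce_counts(shuffled)
print(len(shuffled), totals)
```

Without the combiner, six pairs would be shuffled to the reducer; with it, the first split sends a single ("the", 2) pair instead of two ("the", 1) pairs. This works here because addition is associative and commutative, so summing partial sums gives the same result.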

posted @ 2011-12-17 02:35 陈力

Partitioner: redirecting output from Mapper (how to split up and direct the mapper's output)

Summary: A common misconception for first-time MapReduce programmers is to use only a single reducer. (In other words, the common mistake is to run the program with just one reducer.) After all, a single reducer sorts all of your data before processing, and who doesn't like sorted data? Our discussions regarding MapReduce...
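With more than one reducer, something has to decide which reducer receives each mapper output pair. A minimal sketch of hash partitioning follows, assuming string keys; the function names `partition` and `shuffle` are hypothetical, and `zlib.crc32` stands in for a key hash only because it is stable across Python runs.

```python
import zlib


def partition(key, num_reducers):
    # Map a key to a reducer index; the same key always gets the same index.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    # Route every (key, value) pair to the bucket of its reducer.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


pairs = [("apple", 1), ("pear", 1), ("apple", 1), ("fig", 1)]
buckets = shuffle(pairs, 3)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

The property that matters is that all pairs with the same key land on the same reducer, so each reducer sees every value for the keys it owns.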

posted @ 2011-12-17 02:28 陈力

What the reducer does (with a description of the types it already includes)

Summary: When the reducer task receives the output from the various mappers, it sorts the incoming data on the key of the (key/value) pair and groups together all values of the same key. (That is, once the reducer task has received the output of the different mappers, it sorts the data and gathers the values for each key.) Table 3.3: Some useful Reducer implementations predefined by Hadoop. Class | Descr...
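The sort-then-group step described above can be illustrated in a few lines of Python. This is a sketch of the idea rather than Hadoop's implementation; the function name `sort_and_group` and the sample data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sort_and_group(pairs):
    # Sort by key, then collect all values that share a key.
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


incoming = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]  # pairs from several mappers
grouped = sort_and_group(incoming)
sums = {key: sum(values) for key, values in grouped}  # a summing reduce step
print(grouped, sums)
```

Each (key, list of values) entry in `grouped` is what a reduce function would be called with; the dictionary `sums` shows one possible reduce that adds the values up.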

posted @ 2011-12-17 02:06 陈力

Some useful mappers

Summary: The function generates a (possibly empty) list of (K2, V2) pairs for a given (K1, V1) input pair. The OutputCollector receives the output of the mapping process, and the Reporter provides the option to record extra information about the mapper as the task progresses. Hadoop provides a few useful mapper...
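The contract "one (K1, V1) pair in, a possibly empty list of (K2, V2) pairs out" can be shown with a plain Python function. This is not Hadoop's Mapper interface: the name `tokenize_mapper` is hypothetical, and treating K1 as a byte offset and V1 as a line of text is an assumption made for the example.

```python
def tokenize_mapper(offset, line):
    # K1 = offset of the line (unused here), V1 = the line's text.
    # Emits one (token, 1) pair per token; a blank line yields an empty list.
    return [(token, 1) for token in line.split()]


print(tokenize_mapper(0, "to be or not"))
print(tokenize_mapper(13, "   "))
```

The second call returns an empty list, which is the "possibly empty" case: a mapper is free to emit nothing for an input pair.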

posted @ 2011-12-17 01:57 陈力

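A new opportunity: my English is not good, so I can only understand things from fragments, but that does help me think about what key/value is (the answer is at the end!)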

Summary: Despite our many discussions regarding keys and values, we have yet to mention their types. The MapReduce framework won't allow them to be any arbitrary class. For example, although we can and often do talk about certain keys and values as integers, strings, and so on, they aren't exactly standard Java...

posted @ 2011-12-17 01:55 陈力

hello world!!!!!