How do you count the word frequency of a web page?
Drawbacks of doing it on a single machine: it is slow, and it is limited by that machine's memory.
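For reference, the single-machine baseline is just one loop over a hash map. A minimal sketch (the class name `SingleMachineWordCount` and the method `count` are made up for illustration, they are not from these notes):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

class SingleMachineWordCount {
    // Counts words in one pass; the whole map has to fit in this machine's memory.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new HashMap<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            freq.merge(tokenizer.nextToken(), 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count("a b a c d d a b c c d b"));
    }
}
```

Everything runs in one thread and one heap, which is where the speed and memory limits come from.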
step1 input
0: a b a c d d
1: a b c c d b
step2 split: divide the input and give each piece to a different machine
m1 - 0: a b a c d d
m2 - 1: a b c c d b
step3 map: each machine runs map on its own split, with no aggregation
m1 - a,1 b,1 a,1 c,1 d,1 d,1
m2 - a,1 b,1 c,1 c,1 d,1 b,1
step4 partition + sort
m1 - a,1 a,1 b,1 | c,1 d,1 d,1
m2 - a,1 b,1 b,1 | c,1 c,1 d,1
step5 fetch + merge sort
m3 - a,1 a,1 b,1 | a,1 b,1 b,1
m4 - c,1 d,1 d,1 | c,1 c,1 d,1
m3 - a,[1,1,1] b,[1,1,1]
m4 - c,[1,1,1] d,[1,1,1]
step6 reduce: combine the values for each key
m3 - a,[3] b,[3]
m4 - c,[3] d,[3]
step7 output
a,[3] b,[3] c,[3] d,[3]
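The seven steps can be replayed in one process to check the numbers. This is only a sketch under assumptions: the partition rule (keys before "c" go to reducer 0, the rest to reducer 1) is chosen to match the m3/m4 split in the example, a `TreeMap` stands in for the sort and merge, and the class name `WordCountPipeline` is hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountPipeline {
    static final int REDUCERS = 2;

    // map: emit (word, 1) for every token, with no aggregation.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // partition: assumed rule, keys before "c" go to reducer 0, the rest to reducer 1.
    static int partition(String key) {
        return key.compareTo("c") < 0 ? 0 : 1;
    }

    static Map<String, Integer> run(List<String> splits) {
        // One sorted, grouped input per reducer (stands in for sort, fetch and merge).
        List<TreeMap<String, List<Integer>>> reducerInputs = new ArrayList<>();
        for (int r = 0; r < REDUCERS; r++) reducerInputs.add(new TreeMap<>());
        for (String split : splits) {
            for (Map.Entry<String, Integer> pair : map(split)) {
                reducerInputs.get(partition(pair.getKey()))
                        .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        // reduce: sum the list of ones for each key, then collect the output.
        Map<String, Integer> output = new TreeMap<>();
        for (TreeMap<String, List<Integer>> input : reducerInputs) {
            for (Map.Entry<String, List<Integer>> group : input.entrySet()) {
                int sum = 0;
                for (int one : group.getValue()) sum += one;
                output.put(group.getKey(), sum);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> splits = new ArrayList<>();
        splits.add("a b a c d d");
        splits.add("a b c c d b");
        System.out.println(run(splits)); // {a=3, b=3, c=3, d=3}
    }
}
```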
import java.util.Iterator;
import java.util.StringTokenizer;

// OutputCollector is provided by the MapReduce framework.
public static class Map {
    public void map(String key, String value, OutputCollector<String, Integer> output) { // key: where the article is stored, value: the article content
        // Split the article into words
        StringTokenizer tokenizer = new StringTokenizer(value);
        while (tokenizer.hasMoreTokens()) {
            String outputkey = tokenizer.nextToken();
            output.collect(outputkey, 1);
        }
    }
}
public static class Reduce {
    public void reduce(String key, Iterator<Integer> values, OutputCollector<String, Integer> output) { // key: the key emitted by map ..
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        output.collect(key, sum);
    }
}
partition and sort
The master groups keys with consistent hashing. Sorting is an external sort on disk.
How many machines for map and reduce? For example 1000 + 1000.
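The notes only say that keys are grouped with consistent hashing, so the following ring is an assumption about what that could look like, not the actual implementation. `ConsistentHashPartitioner`, the virtual-node count and the bit-mixing hash are all illustrative choices.

```java
import java.util.TreeMap;

class ConsistentHashPartitioner {
    // ring position -> reducer id
    private final TreeMap<Integer, Integer> ring = new TreeMap<>();

    ConsistentHashPartitioner(int reducers, int virtualNodes) {
        // Each reducer is placed on the ring several times to even out the load.
        for (int r = 0; r < reducers; r++) {
            for (int v = 0; v < virtualNodes; v++) {
                ring.put(hash("reducer-" + r + "#" + v), r);
            }
        }
    }

    // Mixes the bits of String.hashCode; any stable hash function would do here.
    private static int hash(String s) {
        int h = s.hashCode();
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return h & 0x7fffffff;
    }

    // A key belongs to the first reducer at or after its position, wrapping around.
    int reducerFor(String key) {
        Integer position = ring.ceilingKey(hash(key));
        if (position == null) position = ring.firstKey();
        return ring.get(position);
    }

    public static void main(String[] args) {
        ConsistentHashPartitioner partitioner = new ConsistentHashPartitioner(4, 16);
        for (String key : new String[] {"a", "b", "c", "d"}) {
            System.out.println(key + " -> reducer " + partitioner.reducerFor(key));
        }
    }
}
```

The same key always lands on the same reducer, which is what lets each reducer see every value for its keys.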
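inverted index
key: a keyword in the article, value: the article id
reduce removes duplicates, for the case where a keyword appears twice in the same article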
// Break each article up into (word, article id) pairs
public static class Map {
    public void map(String key, Document value, OutputCollector<String, Integer> output) {
        StringTokenizer tokenizer = new StringTokenizer(value.content);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            output.collect(word, value.id);
        }
    }
}
// Merge the article ids for the same word
public static class Reduce {
    public void reduce(String key, Iterator<Integer> values, OutputCollector<String, List<Integer>> output) {
        List<Integer> results = new ArrayList<>();
        int left = -1; // previous id; assumes values arrive sorted, so duplicates are adjacent
        while (values.hasNext()) {
            int now = values.next();
            if (left != now) {
                results.add(now);
            }
            left = now;
        }
        output.collect(key, results);
    }
}
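To see what the inverted index job produces, the map and reduce logic can be replayed locally. A sketch, assuming documents are identified by their array index and that ids reach the reduce step in ascending order (the class `InvertedIndexLocal` is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexLocal {
    // docs[i] is the content of the document with id i (an assumption of this sketch).
    static Map<String, List<Integer>> build(String[] docs) {
        // map + grouping: collect every (word, docId) pair under its word.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String word : docs[id].trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(id);
                }
            }
        }
        // reduce: ids are ascending, so comparing with the previous id drops duplicates.
        Map<String, List<Integer>> index = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            List<Integer> unique = new ArrayList<>();
            int previous = -1;
            for (int id : entry.getValue()) {
                if (id != previous) unique.add(id);
                previous = id;
            }
            index.put(entry.getKey(), unique);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(build(new String[] {"a b a", "b c"})); // {a=[0], b=[0, 1], c=[1]}
    }
}
```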
anagram
map key: the root of each word (its letters in sorted order), value: the word
public static class Map {
    public void map(String key, String value, OutputCollector<String, String> output) {
        StringTokenizer tokenizer = new StringTokenizer(value);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            char[] sc = word.toCharArray();
            Arrays.sort(sc); // every anagram of a word sorts to the same root
            output.collect(new String(sc), word);
        }
    }
}
reduce key: the word root, value: list of the words that share it
public static class Reduce {
    public void reduce(String key, Iterator<String> values, OutputCollector<String, List<String>> output) {
        List<String> results = new ArrayList<>();
        while (values.hasNext()) {
            results.add(values.next());
        }
        output.collect(key, results);
    }
}
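The grouping can be checked without a cluster. A small sketch, assuming the root of a word is its letters in sorted order (`AnagramGroups`, `root` and `group` are illustrative names):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AnagramGroups {
    // The map-side key: the word's letters in sorted order.
    static String root(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // The reduce-side result: every word that shares a root ends up in one list.
    static Map<String, List<String>> group(List<String> words) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String word : words) {
            groups.computeIfAbsent(root(word), k -> new ArrayList<>()).add(word);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(group(Arrays.asList("lint", "intl", "inlt", "code")));
        // {cdeo=[code], ilnt=[lint, intl, inlt]}
    }
}
```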
top k frequency
class Pair {
    String key;
    int value;

    Pair(String k, int v) {
        key = k;
        value = v;
    }
}
public static class Map {
    // key is unused; it was named _ originally, which is no longer a valid identifier in Java 9+
    public void map(String key, Document value, OutputCollector<String, Integer> output) {
        StringTokenizer tokenizer = new StringTokenizer(value.content);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            output.collect(word, 1);
        }
    }
}
public static class Reduce {
    private PriorityQueue<Pair> Q; // min-heap holding the current top k; assumes k >= 1
    private int k;
    private Comparator<Pair> cmp = new Comparator<Pair>() {
        public int compare(Pair a, Pair b) {
            if (a.value != b.value) {
                return a.value - b.value;
            }
            return b.key.compareTo(a.key);
        }
    };

    public void setup(int k) {
        Q = new PriorityQueue<Pair>(k, cmp);
        this.k = k;
    }

    public void reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        Pair cur = new Pair(key, sum);
        if (Q.size() < k) {
            Q.add(cur);
        } else {
            Pair peek = Q.peek();
            if (cmp.compare(cur, peek) > 0) {
                // cur beats the weakest of the current top k, so swap it in
                Q.poll();
                Q.add(cur);
            }
        }
    }

    public void cleanup(OutputCollector<String, Integer> output) {
        List<Pair> res = new ArrayList<>();
        while (!Q.isEmpty()) {
            res.add(Q.poll());
        }
        // The heap yields the smallest first, so walk backwards to output the largest first
        for (int i = res.size() - 1; i >= 0; i--) {
            Pair cur = res.get(i);
            output.collect(cur.key, cur.value);
        }
    }
}
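The heap logic in the reducer can be tried on its own. This sketch assumes the word counts are already available as a map; `TopKWords.topK` is a hypothetical helper, and it uses the same tie rule as the comparator above (on equal counts, the lexicographically larger word is the weaker one).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopKWords {
    static List<String> topK(Map<String, Integer> counts, int k) {
        // Min-heap: the weakest of the current top k sits at the head.
        Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
            int byCount = Integer.compare(a.getValue(), b.getValue());
            return byCount != 0 ? byCount : b.getKey().compareTo(a.getKey());
        };
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(weakestFirst);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.add(entry);
            if (heap.size() > k) heap.poll(); // evict the weakest, keeping at most k entries
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result); // strongest first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("a", 3);
        counts.put("b", 3);
        counts.put("c", 2);
        counts.put("d", 1);
        System.out.println(topK(counts, 2)); // [a, b]
    }
}
```

Keeping only k entries in the heap means memory stays at O(k) no matter how many distinct words there are.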
design a MR system:
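master controls the flow of the whole system; slaves do the actual work
1. The user specifies how many map workers and how many reduce workers to use, and that many machines are started.
2. The master decides which slaves act as map workers and which as reduce workers.
3. The master splits the input as evenly as possible among the map workers; each map worker reads its file and then runs the map job.
4. After the map job, each map worker writes its result to its local disk.
5. Transfer and organize: the map results are sent to the reduce workers.
6. The reduce workers run the reduce job and write out the result when finished.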
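If one key has far too many values for a single reducer, add a random suffix to the key, similar to a shard key: fb1, fb2, fb3 are then assigned to different reducers.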
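A sketch of the random-suffix idea, under assumptions the notes do not state: a "#" separator is added before the suffix (fb#1 rather than fb1) so the suffix can be stripped unambiguously, and a second pass adds the partial counts back together. `SaltedKeys` and both method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class SaltedKeys {
    // Map side: append a suffix in [1, shards] so one hot key spreads over several reducers.
    static String salt(String key, int shards, Random random) {
        return key + "#" + (1 + random.nextInt(shards));
    }

    // Second pass: strip the suffix and sum the partial counts per original key.
    // Assumes every key in partials was produced by salt().
    static Map<String, Integer> mergePartials(Map<String, Integer> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : partials.entrySet()) {
            String salted = entry.getKey();
            String original = salted.substring(0, salted.lastIndexOf('#'));
            totals.merge(original, entry.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Random random = new Random();
        Map<String, Integer> partials = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            partials.merge(salt("fb", 3, random), 1, Integer::sum);
        }
        System.out.println(partials);                // counts spread over fb#1, fb#2, fb#3
        System.out.println(mergePartials(partials)); // {fb=1000}
    }
}
```

The cost of the trick is the extra merge pass, since no single reducer sees all values for the hot key any more.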
Input and output are stored on GFS.
Mapper output on local disk does not need to be saved to GFS; if it is lost, redo the map task. Intermediate data is not important.
MapReduce whole process
1. start: the user program starts the master and the workers
2. assign task: the master assigns tasks to the map workers and reduce workers, along with the map and reduce code
3. split: the master splits the input data
4. map read: each map worker reads its split of the input data
5. map: each map worker runs the map job on its machine
6. map output: each map worker writes its output to a file on its local disk
7. reduce fetch: each reduce worker fetches its data from the map workers
8. reduce: each reduce worker runs the reduce job on its machine
9. reduce output: each reduce worker writes the final output data
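The whole process can be imitated in one JVM, with a thread pool standing in for the worker machines and in-memory lists standing in for local disk. This is a toy sketch of the control flow for word count, not how a real master is built; `MiniMaster`, the pool size and the hash-modulo partition rule are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class MiniMaster {
    static Map<String, Integer> run(List<String> splits, int reducers) {
        ExecutorService workers = Executors.newFixedThreadPool(4); // stands in for worker machines
        try {
            // Assign one map task per split; each returns its output partitioned by reducer.
            List<Future<List<List<String>>>> mapTasks = new ArrayList<>();
            for (String split : splits) {
                mapTasks.add(workers.submit(() -> mapTask(split, reducers)));
            }
            List<List<List<String>>> mapOutputs = new ArrayList<>();
            for (Future<List<List<String>>> task : mapTasks) mapOutputs.add(await(task));

            // Assign one reduce task per partition; each fetches its partition from every map output.
            List<Future<Map<String, Integer>>> reduceTasks = new ArrayList<>();
            for (int r = 0; r < reducers; r++) {
                final int partition = r;
                reduceTasks.add(workers.submit(() -> reduceTask(mapOutputs, partition)));
            }
            Map<String, Integer> result = new TreeMap<>();
            for (Future<Map<String, Integer>> task : reduceTasks) result.putAll(await(task));
            return result;
        } finally {
            workers.shutdown();
        }
    }

    // Map worker: read the split, emit each word into the partition of its reducer.
    static List<List<String>> mapTask(String split, int reducers) {
        List<List<String>> partitions = new ArrayList<>();
        for (int r = 0; r < reducers; r++) partitions.add(new ArrayList<>());
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                partitions.get(Math.floorMod(word.hashCode(), reducers)).add(word);
            }
        }
        return partitions;
    }

    // Reduce worker: fetch one partition from every map output and count the words in it.
    static Map<String, Integer> reduceTask(List<List<List<String>>> mapOutputs, int partition) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<List<String>> mapOutput : mapOutputs) {
            for (String word : mapOutput.get(partition)) counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    static <T> T await(Future<T> task) {
        try {
            return task.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a c d d", "a b c c d b"), 2)); // {a=3, b=3, c=3, d=3}
    }
}
```

Reduce tasks are submitted only after every map task has finished, mirroring the rule that a reducer needs the complete map output for its keys.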