Big data roles
Distributed data processing frameworks
How to count the word frequency of a web page?
Naive approach: a for loop, storing counts in a hashmap.
Drawback: a single machine is slow and limited by memory size.
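The naive single-machine approach can be sketched as below. This is an illustrative example (class and method names are my own, not from any framework):

```java
import java.util.HashMap;
import java.util.Map;

// Single-machine word count: one pass over the text, counts kept in a HashMap.
// This is the baseline that MapReduce later distributes across machines.
public class NaiveWordCount {
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.split("\\s+")) {
            if (word.isEmpty()) continue;
            // merge() adds 1 to the existing count, or inserts 1 if the word is new
            freq.merge(word, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count("a b a c d d")); // {a=2, b=1, c=1, d=2}
    }
}
```

Both the text and the hashmap must fit in one machine's memory, which is exactly the limitation noted above.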
Use multiple machines and process in parallel.
The merge step then becomes the bottleneck.
Map scatters the work across machines; reduce merges the results.
step1 input
0: a b a c d d
1: a b c c d b
step2 split: hand pieces of the input to different machines
m1 - 0: a b a c d d
m2 - 1: a b c c d b
step3 map: each machine runs map independently, with no aggregation
m1 - a,1 b,1 a,1 c,1 d,1 d,1
m2 - a,1 b,1 b,1 c,1 c,1 d,1
step4 partition + sort
m1 - a,1 a,1 b,1 | c,1 d,1 d,1
m2 - a,1 b,1 b,1 | c,1 d,1 d,1
step5 fetch + merge sort
m3 - a,1 a,1 b,1 | a,1 b,1 b,1
m4 - c,1 d,1 d,1 | c,1 d,1 d,1
m3 - a,[1,1,1] b,[1,1,1]
m4 - c,[1,1,1] d,[1,1,1]
step6 reduce: sum up each key's values
m3 - a,[3] b,[3]
m4 - c,[3] d,[3]
step7 output
a,[3] b,[3] c,[3] d,[3]
Because step3 does no merging, the map side needs no hashmap.
public static class Map {
    // key: document location, value: document content
    public void map(String key, String value, OutputCollector<String, Integer> output) {
        // split the document into words
        StringTokenizer tokenizer = new StringTokenizer(value);
        while (tokenizer.hasMoreTokens()) {
            String outputKey = tokenizer.nextToken();
            output.collect(outputKey, 1);
        }
    }
}

public static class Reduce {
    // key: the key emitted by map; values: all counts collected for that key
    public void reduce(String key, Iterator<Integer> values, OutputCollector<String, Integer> output) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        output.collect(key, sum);
    }
}
Partition and sort
The master groups keys via consistent hashing; sorting is an external sort on disk.
Each reduce worker fetches the sorted files that belong to it.
How many map and reduce machines? e.g. 1000 + 1000.
More machines means less work and time per machine, so a shorter total time, but longer startup time.
Upper bound on the number of reducers: the number of distinct keys.
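A minimal sketch of the partition step: route each key to one of R reducers by hashing. Real systems (e.g. Hadoop's default `HashPartitioner`) use this same hash-mod idea; the consistent-hashing variant mentioned above is more involved, so this simpler form is an illustrative assumption:

```java
// Route each key to one of numReduce reducers. Every occurrence of the same
// key must land on the same reducer, which hashing guarantees.
public class Partitioner {
    public static int partition(String key, int numReduce) {
        // Math.floorMod guards against negative hashCode() values
        return Math.floorMod(key.hashCode(), numReduce);
    }

    public static void main(String[] args) {
        for (String key : new String[]{"a", "b", "c", "d"}) {
            System.out.println(key + " -> reducer " + partition(key, 2));
        }
    }
}
```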
Inverted index: given a forward index, build the inverted index, i.e. given a word, return the ids of the documents containing it.
key: a word in the document; value: the document id.
reduce dedupes: the same word may appear more than once in one document.
// scatter: emit (word, docId) for every word in a document
public static class Map {
    public void map(String key, Document value, OutputCollector<String, Integer> output) {
        StringTokenizer tokenizer = new StringTokenizer(value.content);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            output.collect(word, value.id);
        }
    }
}
// merge: collect the document ids for each word, dropping duplicates
public static class Reduce {
    // values arrive sorted, so duplicate ids from the same document are adjacent
    public void reduce(String key, Iterator<Integer> values, OutputCollector<String, List<Integer>> output) {
        List<Integer> results = new ArrayList<>();
        int left = -1;
        while (values.hasNext()) {
            int now = values.next();
            if (left != now) {
                results.add(now);
            }
            left = now;
        }
        output.collect(key, results);
    }
}
Anagram grouping:
map key: the word with its letters sorted (the anagram "root"); value: the word itself
public static class Map {
    public void map(String key, String value, OutputCollector<String, String> output) {
        StringTokenizer tokenizer = new StringTokenizer(value);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            char[] sc = word.toCharArray();
            // sorting the letters gives the canonical anagram root
            Arrays.sort(sc);
            output.collect(new String(sc), word);
        }
    }
}
reduce key: the sorted-letter root; value: the list of words sharing that root
public static class Reduce {
    public void reduce(String key, Iterator<String> values, OutputCollector<String, List<String>> output) {
        List<String> results = new ArrayList<>();
        while (values.hasNext()) {
            results.add(values.next());
        }
        output.collect(key, results);
    }
}
Top k frequent words
class Pair {
    String key;
    int value;
    Pair(String k, int v) {
        key = k;
        value = v;
    }
}
public static class Map {
    // key is unused here; only the document content matters
    public void map(String key, Document value, OutputCollector<String, Integer> output) {
        StringTokenizer tokenizer = new StringTokenizer(value.content);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            output.collect(word, 1);
        }
    }
}
public static class Reduce {
    private PriorityQueue<Pair> Q;
    private int k;
    // min-heap: smallest frequency on top; ties broken by key so eviction is deterministic
    private Comparator<Pair> cmp = new Comparator<Pair>() {
        public int compare(Pair a, Pair b) {
            if (a.value != b.value) {
                return a.value - b.value;
            }
            return b.key.compareTo(a.key);
        }
    };
    public void setup(int k) {
        Q = new PriorityQueue<Pair>(k, cmp);
        this.k = k;
    }
    public void reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        Pair cur = new Pair(key, sum);
        if (Q.size() < k) {
            Q.add(cur);
        } else {
            Pair peek = Q.peek();
            // replace the heap minimum when the new pair ranks higher
            if (cmp.compare(cur, peek) > 0) {
                Q.poll();
                Q.add(cur);
            }
        }
    }
    public void cleanup(OutputCollector<String, Integer> output) {
        // pop in ascending order, then emit in descending order of frequency
        List<Pair> res = new ArrayList<>();
        while (!Q.isEmpty()) {
            res.add(Q.poll());
        }
        for (int i = res.size() - 1; i >= 0; i--) {
            Pair cur = res.get(i);
            output.collect(cur.key, cur.value);
        }
    }
}
Design a MapReduce system:
The master controls the whole workflow; the slaves do the actual work.
1. The user specifies how many map and how many reduce tasks; machines are started accordingly.
2. The master assigns which slaves act as map workers and which as reduce workers.
3. The master splits the input as evenly as possible among the map workers; each map worker reads its split and runs the map job.
4. Each map worker writes its results to its local disk.
5. The shuffle phase transfers and sorts the map output, delivering it to the reduce workers.
6. The reduce workers run the reduce job and write out the final results.
Reduce starts only after all map tasks have finished.
If a machine dies, the master reassigns its task to another machine.
If one key on a reducer is heavily skewed, add a random suffix to the key (similar to a shard key), e.g. fb1, fb2, fb3 are routed to different reducers, then merge the partial results in a second pass.
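The random-suffix (key salting) trick can be sketched as below. The class and method names are illustrative, and the suffix format `key#salt` is an assumption; any reversible scheme works:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Key salting for a skewed key: stage 1 counts salted keys ("fb#0", "fb#1", ...)
// spread across reducers; stage 2 strips the salt and merges the partial counts.
public class SaltedCount {
    // stage 1: append a random bucket suffix so one hot key spreads over many reducers
    public static String salt(String key, int buckets, Random rnd) {
        return key + "#" + rnd.nextInt(buckets);
    }

    // stage 2: strip the "#salt" suffix and sum the partial counts back together
    public static Map<String, Integer> unsalt(Map<String, Integer> salted) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map.Entry<String, Integer> e : salted.entrySet()) {
            String original = e.getKey().substring(0, e.getKey().lastIndexOf('#'));
            merged.merge(original, e.getValue(), Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> partial = new HashMap<>();
        partial.put("fb#0", 5);
        partial.put("fb#1", 3);
        partial.put("fb#2", 2);
        System.out.println(unsalt(partial)); // {fb=10}
    }
}
```

The cost is a second aggregation pass; the benefit is that no single reducer has to process the entire hot key alone.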
Input and output are stored in GFS.
Mapper output on local disk does not need to go to GFS: if it is lost, just redo the map task. Intermediate data is not critical.
Pre-processing before the mapper and after the reducer runs on separate machines.
MapReduce whole process
1. start: the user program starts the master and the workers
2. assign task: the master assigns tasks to the map workers and reduce workers, along with the map and reduce code
3. split: the master splits the input data
4. map read: each map worker reads its split of the input data
5. map: each map worker runs the map job on its own machine
6. map output: each map worker writes its output to the local disk of its machine
7. reduce fetch: each reduce worker fetches its data from the map workers
8. reduce: each reduce worker runs the reduce job on its own machine
9. reduce output: the reduce workers write the final output data
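The whole pipeline above can be walked through with a small in-memory simulation of word count (split → map → shuffle/sort → reduce). This is a single-process sketch, not a distributed implementation; the class name and structure are my own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory walkthrough of the MapReduce steps, using word count:
// each input string plays the role of one split handed to a map worker.
public class MiniMapReduce {
    public static Map<String, Integer> run(List<String> splits) {
        // map: each split emits (word, 1) pairs, no aggregation
        List<String[]> emitted = new ArrayList<>();
        for (String split : splits) {
            for (String word : split.split("\\s+")) {
                emitted.add(new String[]{word, "1"});
            }
        }
        // shuffle + sort: group values by key (TreeMap keeps keys sorted)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : emitted) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));
        }
        // reduce: sum each key's list of counts
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            out.put(e.getKey(), sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // same inputs as the step-by-step example: result is a=3 b=3 c=3 d=3
        System.out.println(run(List.of("a b a c d d", "a b c c d b")));
    }
}
```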