Big data roles

Distributed data processing frameworks

Count the word frequency of a web page?

A for loop over the words, with counts stored in a hashmap.

Drawback: only one machine, so it is slow and memory is limited.
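The single-machine approach can be sketched as follows (a minimal Java version; the class and method names are for illustration only):

```java
import java.util.HashMap;
import java.util.Map;

// Single-machine word count: one pass over the tokens, counts kept in a HashMap.
// This is exactly the approach whose drawbacks are listed above: one machine,
// and the map of distinct words must fit in memory.
public class WordCount {

    public static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.split("\\s+")) {
            freq.merge(word, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = count("a b a c d d");
        System.out.println(freq.get("a")); // 2
        System.out.println(freq.get("d")); // 2
    }
}
```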

 

Multiple machines, processing in parallel.

The merge step becomes the bottleneck.

map scatters the work; reduce merges the results.

 

step1 input

0: a b a c d d

1: a b c c d b

step2 split: divide the input among different machines

m1 - 0:  a b a c d d

m2 - 1: a b c c d b

step3 map: each machine runs map on its own, with no aggregation

m1 - a,1 b,1 a,1 c,1 d,1 d,1

m2 - a,1 b,1 b,1 c,1 c,1 d,1

step4 partition + sort

m1 - a,1 a,1 b,1 |  c,1 d,1 d,1

m2 - a,1 b,1 b,1 | c,1 d,1 d,1

step5 fetch + merge sort

m3 - a,1 a,1 b,1 | a,1 b,1 b,1

m4 - c,1 d,1 d,1 | c,1 d,1 d,1

m3 - a,[1,1,1]  b,[1,1,1]

m4 - c,[1,1,1]  d,[1,1,1]

step6 reduce: merge the counts

m3 - a,[3]  b,[3]

m4 - c,[3]  d,[3]

step7 output

a,[3]  b,[3] c,[3] d,[3]
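The seven steps above can be simulated in a single process (a toy sketch; the fixed two-way split at the letter "c" mirrors the trace above and is not how a real system partitions):

```java
import java.util.*;

// Toy single-process simulation of the trace above: map -> partition -> reduce.
public class ToyMapReduce {

    // returns [reducer m3's output, reducer m4's output]
    public static List<TreeMap<String, Integer>> run(String[] inputs) {
        TreeMap<String, Integer> r3 = new TreeMap<>(); // receives words before "c"
        TreeMap<String, Integer> r4 = new TreeMap<>(); // receives "c" and later

        for (String line : inputs) {                  // step2: one line per map machine
            for (String word : line.split(" ")) {     // step3: map emits (word, 1)
                // step4+5: partition each pair and route it to its reducer
                TreeMap<String, Integer> target = word.compareTo("c") < 0 ? r3 : r4;
                target.merge(word, 1, Integer::sum);  // step6: reduce sums the 1s
            }
        }
        return Arrays.asList(r3, r4);                 // step7: output
    }

    public static void main(String[] args) {
        // step1: the input from the trace above
        List<TreeMap<String, Integer>> out = run(new String[]{"a b a c d d", "a b c c d b"});
        System.out.println(out.get(0)); // {a=3, b=3}
        System.out.println(out.get(1)); // {c=3, d=3}
    }
}
```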

 

In step 3 nothing is merged, so no hashmap is needed:

public static class Map {

  // key: the document's storage location; value: the document's content
  public void map(String key, String value, OutputCollector<String, Integer> output) {
    // split the document into words
    StringTokenizer tokenizer = new StringTokenizer(value);
    while (tokenizer.hasMoreTokens()) {
      String outputKey = tokenizer.nextToken();
      output.collect(outputKey, 1);
    }
  }
}

public static class Reduce {

  // key: a key emitted by map; values: all the counts collected for that key
  public void reduce(String key, Iterator<Integer> values, OutputCollector<String, Integer> output) {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next();
    }
    output.collect(key, sum);
  }
}

 

partition and sort

The master groups the keys by consistent hashing; the data is sorted on disk with an external sort.

Each reducer then fetches the sorted files assigned to it from the map machines.
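As a sketch of the grouping rule, here is the simplest partitioner (hash mod number of reducers); the consistent hashing mentioned above is a refinement that keeps assignments stable when machines are added or removed. The class name is illustrative:

```java
// Simplest partition rule: every occurrence of the same key goes to the same
// reducer, and keys are spread roughly evenly across reducers.
public class SimplePartitioner {

    public static int partition(String key, int numReduceTasks) {
        // mask the sign bit so the result is non-negative even if hashCode() is negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int p1 = partition("a", 2);
        int p2 = partition("a", 2);
        System.out.println(p1 == p2);          // true: same key, same reducer
        System.out.println(p1 >= 0 && p1 < 2); // true: always a valid reducer index
    }
}
```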

 

How many machines for map and for reduce? For example, 1000 + 1000.

More machines means less work per machine and a shorter total runtime, but the startup time grows.

The number of reducers is bounded above by the number of distinct keys.

 

Given the documents (the forward index), build an inverted index: for a word, return the IDs of the documents that contain it.

map — key: a word in the document; value: the document ID

reduce deduplicates, for the case where the same word appears more than once in one document.

// scatter the words within each document

public static class Map {

  public void map(String key, Document value, OutputCollector<String, Integer> output) {

    StringTokenizer tokenizer = new StringTokenizer(value.content);

    while (tokenizer.hasMoreTokens()) {

      String word = tokenizer.nextToken();

      output.collect(word, value.id);

    }

  }

}

// merge the document IDs for the same word

public static class Reduce {

  public void reduce(String key, Iterator<Integer> values, OutputCollector<String, List<Integer>> output) {

    List<Integer> results = new ArrayList<>();

    int left = -1; // previous doc ID; values arrive sorted, so duplicates are adjacent

    while (values.hasNext()) {

      int now = values.next();

      if (left != now) {

        results.add(now);

      }

      left = now;

    }

    output.collect(key, results);

  }

}

 

anagram:

map — key: each word's root (its letters sorted); value: the word itself

public static class Map {

  public void map(String key, String value, OutputCollector<String, String> output) {

    StringTokenizer tokenizer = new StringTokenizer(value);

    while (tokenizer.hasMoreTokens()) {

      String word = tokenizer.nextToken();

      char[] sc = word.toCharArray();

      Arrays.sort(sc);

      output.collect(new String(sc), word);

    }

  }

}

reduce — key: the sorted root; value: the list of words that share it

public static class Reduce {

  public void reduce(String key, Iterator<String> values, OutputCollector<String, List<String>> output) {

    List<String> results = new ArrayList<>();

    while (values.hasNext()) {

      results.add(values.next());

    }

    output.collect(key, results);

  }

}

 

top k frequency

class Pair {

  String key;

  int value;

  Pair(String k, int v) {

    key = k;

    value = v;

  }

 }

public void map(String key, Document value, OutputCollector<String, Integer> output) { // key unused

  StringTokenizer tokenizer = new StringTokenizer(value.content);

  while (tokenizer.hasMoreTokens()) {
    String word = tokenizer.nextToken();
    output.collect(word, 1);
  }

}

public static class Reduce {

  private PriorityQueue<Pair> Q;

  private int k;

  private Comparator<Pair> cmp = new Comparator<Pair>() {

    public int compare(Pair a, Pair b) {

      if (a.value != b.value) {

        return a.value - b.value;

      }

      return b.key.compareTo(a.key);

    }

  };

  public void setup(int k) {

    Q = new PriorityQueue<Pair>(k, cmp);

    this.k = k;

  }

  public void reduce(String key, Iterator<Integer> values) { // no output here: results are emitted in cleanup

    int sum = 0;  

    while (values.hasNext()) {

      sum += values.next();

    }

    Pair cur = new Pair(key, sum);

    if (Q.size() < k) {

      Q.add(cur);

    } else {

      Pair peek = Q.peek();

      if (cmp.compare(cur, peek) > 0) {

        Q.poll();

        Q.add(cur);

      }

    }

  }

  public void cleanup(OutputCollector<String, Integer> output) {

    List<Pair> res = new ArrayList<>();

    while (!Q.isEmpty()) {

      res.add(Q.poll());

    }

    for (int i = res.size() - 1; i >= 0; i--) { // emit in descending order of count

      Pair cur = res.get(i);

      output.collect(cur.key, cur.value);

    }
  }

}
 

design a MR system:

The master controls the overall workflow; the slaves do the actual work.

1. The user specifies how many map and how many reduce workers to use; the corresponding machines are started.

2. The master assigns which slaves serve as map workers and which as reduce workers.

3. The master splits the input as evenly as possible among the map workers; each map worker reads its share and runs the map job.

4. After the map job, each map worker writes its results to its local disk.

5. Transfer and shuffle: the map results are delivered to the reduce workers.

6. The reduce workers run, then write out the final results.

 

Reduce starts only after all the map tasks have finished.

If a machine dies, its task is reassigned to a new machine.

If one key on a reducer is extremely hot, append a random suffix (similar to a shard key): fb1, fb2, fb3 are spread across different reducers.
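A sketch of this salting idea (the shard count of 3 and the `#` separator are arbitrary choices for illustration): the first round's map appends a random suffix so a hot key like `fb` spreads across several reducers, and a second round strips the suffix and merges the partial sums.

```java
import java.util.Random;

// Salting a hot key: spread "fb" over SHARDS reducers as fb#0, fb#1, fb#2.
public class KeySalter {

    static final int SHARDS = 3;
    static final Random RAND = new Random();

    // first round's map: pick a random shard for each record
    public static String salt(String key) {
        return key + "#" + RAND.nextInt(SHARDS);
    }

    // second round's map: recover the original key so the partial
    // counts for fb#0 .. fb#2 are merged by a single reducer
    public static String unsalt(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('#'));
    }

    public static void main(String[] args) {
        String salted = salt("fb");         // e.g. "fb#1"
        System.out.println(unsalt(salted)); // prints "fb"
    }
}
```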

Input and output are stored in GFS.

The mapper output on local disk does not need to be saved to GFS: if it is lost, just redo the map task. Intermediate data is not important.

There is a pre-processing step between the mapper and the reducer, and they run on different machines.

 

MapReduce whole process

1. start: the user program starts the master and the workers

2. assign task: the master assigns tasks to the map workers and the reduce workers, along with the map and reduce code

3. split: the master splits the input data

4. map read: each map worker reads its split of the input data

5. map: each map worker runs the map job on its own machine

6. map output: each map worker writes its output to its local disk

7. reduce fetch: each reduce worker fetches its data from the map workers

8. reduce: each reduce worker runs the reduce job on its own machine

9. reduce output: the reduce workers write out the final output data

posted on 2024-02-27 05:31 dddddcoke