MIT-6.824 lab1
GitHub: https://github.com/haoweiz/MIT-6.824
Part1:
Part 1 is fairly simple: we only need to implement the doMap and doReduce functions, and the work mostly comes down to reading and writing JSON in Go. Let's first look at how Part 1 is tested; the Sequential test code is as follows:
```go
func TestSequentialSingle(t *testing.T) {
	mr := Sequential("test", makeInputs(1), 1, MapFunc, ReduceFunc)
	mr.Wait()
	check(t, mr.files)
	checkWorker(t, mr.stats)
	cleanup(mr)
}

func TestSequentialMany(t *testing.T) {
	mr := Sequential("test", makeInputs(5), 3, MapFunc, ReduceFunc)
	mr.Wait()
	check(t, mr.files)
	checkWorker(t, mr.stats)
	cleanup(mr)
}
```
makeInputs(M int) splits the numbers 0 through 100000 evenly across M input files. Per the assignment, each input file must be partitioned into N intermediate files, so the map phase produces M*N files in total. The plan: first collect all of a file's key/value pairs into a slice keyvalue by calling mapF (which is really the MapFunc in test_test.go), then create nReduce files and, for each pair, compute ihash(key) % nReduce to pick the file it belongs to, writing it out with a JSON Encoder:
```go
// doMap manages one map task: it reads one of the input files
// (inFile), calls the user-defined map function (mapF) for that file's
// contents, and partitions the output into nReduce intermediate files.
func doMap(
	jobName string, // the name of the MapReduce job
	mapTaskNumber int, // which map task this is
	inFile string,
	nReduce int, // the number of reduce tasks that will be run ("R" in the paper)
	mapF func(file string, contents string) []KeyValue,
) {
	//
	// The intermediate output of a map task is stored as multiple
	// files, one per destination reduce task. The file name includes
	// both the map task number and the reduce task number. Use the
	// filename generated by reduceName(jobName, mapTaskNumber, r) as
	// the intermediate file for reduce task r. Call ihash() (see below)
	// on each key, mod nReduce, to pick r for a key/value pair.
	//
	// mapF() is the map function provided by the application. The first
	// argument should be the input file name, though the map function
	// typically ignores it. The second argument should be the entire
	// input file contents. mapF() returns a slice containing the
	// key/value pairs for reduce; see common.go for the definition of
	// KeyValue.
	//
	// One format often used for serializing data to a byte stream that the
	// other end can correctly reconstruct is JSON. You are not required to
	// use JSON, but as the output of the reduce tasks *must* be JSON,
	// familiarizing yourself with it here may prove useful. The
	// corresponding decoding functions can be found in common_reduce.go.
	//
	//   enc := json.NewEncoder(file)
	//   for _, kv := ... {
	//     err := enc.Encode(&kv)
	//
	// Remember to close the file after you have written all the values!
	//

	// Read inFile line by line and collect every key/value pair in keyvalue.
	var keyvalue []KeyValue
	fi, err := os.Open(inFile)
	if err != nil {
		log.Fatal("doMap Open: ", err)
	}
	defer fi.Close()
	br := bufio.NewReader(fi)
	for {
		a, _, c := br.ReadLine()
		if c == io.EOF {
			break
		}
		kv := mapF(inFile, string(a))
		keyvalue = append(keyvalue, kv...)
	}

	// Create nReduce intermediate files and a JSON encoder for each of them.
	var names []string
	files := make([]*os.File, 0, nReduce)
	enc := make([]*json.Encoder, 0, nReduce)
	for r := 0; r != nReduce; r++ {
		names = append(names, fmt.Sprintf("mrtmp.%s-%d-%d", jobName, mapTaskNumber, r))
		file, err := os.Create(names[r])
		if err != nil {
			log.Fatal("doMap Create: ", err)
		}
		files = append(files, file)
		enc = append(enc, json.NewEncoder(file))
	}

	// Choose the destination file for each key/value pair by hashing its key.
	for _, kv := range keyvalue {
		index := ihash(kv.Key) % nReduce
		enc[index].Encode(kv)
	}

	// Close all intermediate files.
	for _, f := range files {
		f.Close()
	}
}
```
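As an aside, the JSON stream format that doMap writes and doReduce later reads back is easy to experiment with in isolation. Below is a minimal, self-contained sketch of that round trip; the file name kv.json and the sample pairs are invented for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

type KeyValue struct {
	Key   string
	Value string
}

func main() {
	// Encode a stream of KeyValue objects, one JSON document per line.
	f, err := os.Create("kv.json")
	if err != nil {
		log.Fatal(err)
	}
	enc := json.NewEncoder(f)
	for _, kv := range []KeyValue{{"a", "1"}, {"b", "2"}} {
		if err := enc.Encode(&kv); err != nil {
			log.Fatal(err)
		}
	}
	f.Close()

	// Decode repeatedly until Decode returns an error (io.EOF at end of stream).
	f, err = os.Open("kv.json")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	dec := json.NewDecoder(f)
	for {
		var kv KeyValue
		if err := dec.Decode(&kv); err != nil {
			break
		}
		fmt.Println(kv.Key, kv.Value)
	}
}
```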
For doReduce we need to read the nMap intermediate files belonging to this reduce task, decode every key/value pair, group the values by key, and re-encode the reduced results into outFile:
```go
// doReduce manages one reduce task: it reads the intermediate
// key/value pairs (produced by the map phase) for this task, sorts the
// intermediate key/value pairs by key, calls the user-defined reduce function
// (reduceF) for each key, and writes the output to disk.
func doReduce(
	jobName string, // the name of the whole MapReduce job
	reduceTaskNumber int, // which reduce task this is
	outFile string, // write the output here
	nMap int, // the number of map tasks that were run ("M" in the paper)
	reduceF func(key string, values []string) string,
) {
	//
	// You'll need to read one intermediate file from each map task;
	// reduceName(jobName, m, reduceTaskNumber) yields the file
	// name from map task m.
	//
	// Your doMap() encoded the key/value pairs in the intermediate
	// files, so you will need to decode them. If you used JSON, you can
	// read and decode by creating a decoder and repeatedly calling
	// .Decode(&kv) on it until it returns an error.
	//
	// You may find the first example in the golang sort package
	// documentation useful.
	//
	// reduceF() is the application's reduce function. You should
	// call it once per distinct key, with a slice of all the values
	// for that key. reduceF() returns the reduced value for that key.
	//
	// You should write the reduce output as JSON encoded KeyValue
	// objects to the file named outFile. We require you to use JSON
	// because that is what the merger that combines the output
	// from all the reduce tasks expects. There is nothing special about
	// JSON -- it is just the marshalling format we chose to use.
	//

	// Create outFile and a JSON encoder for it.
	var names []string
	file, err := os.Create(outFile)
	if err != nil {
		log.Fatal("doReduce Create: ", err)
	}
	enc := json.NewEncoder(file)
	defer file.Close()

	// Read every mrtmp.<jobName>-m-<reduceTaskNumber> file and group values by key.
	kvs := make(map[string][]string)
	for m := 0; m != nMap; m++ {
		names = append(names, fmt.Sprintf("mrtmp.%s-%d-%d", jobName, m, reduceTaskNumber))
		fi, err := os.Open(names[m])
		if err != nil {
			log.Fatal("doReduce Open: ", err)
		}
		dec := json.NewDecoder(fi)
		for {
			var kv KeyValue
			err = dec.Decode(&kv)
			if err != nil {
				break
			}
			kvs[kv.Key] = append(kvs[kv.Key], kv.Value)
		}
		fi.Close()
	}

	// Call reduceF once per distinct key and write the result as JSON.
	for k, v := range kvs {
		enc.Encode(KeyValue{k, reduceF(k, v)})
	}
}
```
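One remark: the final range over the kvs map emits keys in randomized order. The tests still pass because the framework's merge step sorts the combined output, but if you want the per-task output sorted, as the handout's sort-package hint suggests, the last loop could be replaced by something like the following sketch (it assumes the "sort" package is imported):

```go
// Collect and sort the distinct keys, then reduce them in order.
keys := make([]string, 0, len(kvs))
for k := range kvs {
	keys = append(keys, k)
}
sort.Strings(keys)
for _, k := range keys {
	enc.Encode(KeyValue{k, reduceF(k, kvs[k])})
}
```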
The tests pass.
Part2:
Part 2 builds on Part 1. For word counting the task is simple: mapF emits one key/value pair per word, and reduceF just returns how many values it received, since each occurrence of a word contributed exactly one value.
```go
//
// The map function is called once for each file of input. The first
// argument is the name of the input file, and the second is the
// file's complete contents. You should ignore the input file name,
// and look only at the contents argument. The return value is a slice
// of key/value pairs.
//
func mapF(filename string, contents string) []mapreduce.KeyValue {
	// Split the contents into words: any non-letter rune is a separator.
	f := func(c rune) bool {
		return !unicode.IsLetter(c)
	}
	words := strings.FieldsFunc(contents, f)
	var keyvalue []mapreduce.KeyValue
	for _, word := range words {
		keyvalue = append(keyvalue, mapreduce.KeyValue{word, ""})
	}
	return keyvalue
}

//
// The reduce function is called once for each key generated by the
// map tasks, with a list of all the values created for that key by
// any map task.
//
func reduceF(key string, values []string) string {
	// Each occurrence of the word contributed one value, so the count
	// is simply the number of values.
	return strconv.Itoa(len(values))
}
```
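A quick note on the tokenization: strings.FieldsFunc treats every non-letter rune as a separator, so digits and punctuation split words, and an apostrophe breaks a contraction in two. A self-contained sketch (the sample sentence is made up):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

func main() {
	contents := "He said: \"don't panic\" -- twice!"
	// Split on every rune that is not a letter.
	words := strings.FieldsFunc(contents, func(c rune) bool {
		return !unicode.IsLetter(c)
	})
	fmt.Println(words) // [He said don t panic twice]
}
```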
The tests pass.
Part3:
This part is somewhat harder; it requires some familiarity with Go's concurrency primitives. Reading the test source: it splits the numbers under 100000 into 100 files, i.e. 100 map tasks, sets up 50 reduce tasks, and starts two workers. The reduce phase begins only after all map tasks finish, so schedule is called twice. Let's sketch the overall plan first. Since the work is now distributed, we can spawn ntasks goroutines; each one takes an idle worker off registerChan, assigns it a task via the call function, and puts the worker back on registerChan once the task finishes. (Here is the first small trap: I initially assumed call would put the worker back on registerChan automatically, but registerChan is not even passed to call, so it obviously cannot do that; we must re-register the worker ourselves.)
```go
func TestBasic(t *testing.T) {
	mr := setup()
	for i := 0; i < 2; i++ {
		go RunWorker(mr.address, port("worker"+strconv.Itoa(i)),
			MapFunc, ReduceFunc, -1)
	}
	mr.Wait()
	check(t, mr.files)
	checkWorker(t, mr.stats)
	cleanup(mr)
}
```
As the handout explains, the core of this part is the call function, which hands a task to a worker:
```go
// call() sends an RPC to the rpcname handler on server srv
// with arguments args, waits for the reply, and leaves the
// reply in reply. the reply argument should be the address
// of a reply structure.
//
// call() returns true if the server responded, and false
// if call() was not able to contact the server. in particular,
// reply's contents are valid if and only if call() returned true.
//
// you should assume that call() will time out and return an
// error after a while if it doesn't get a reply from the server.
//
// please use call() to send all RPCs, in master.go, mapreduce.go,
// and worker.go. please don't change this function.
//
func call(srv string, rpcname string,
	args interface{}, reply interface{}) bool {
	c, errx := rpc.Dial("unix", srv)
	if errx != nil {
		return false
	}
	defer c.Close()

	err := c.Call(rpcname, args, reply)
	if err == nil {
		return true
	}

	fmt.Println(err)
	return false
}
```
First we need to pin down call's four arguments. Per the handout, the first comes from registerChan, the second is the fixed string "Worker.DoTask", the fourth is nil, and the third is a DoTaskArgs struct. Here is what that struct looks like:
```go
// What follows are RPC types and methods.
// Field names must start with capital letters, otherwise RPC will break.

// DoTaskArgs holds the arguments that are passed to a worker when a job is
// scheduled on it.
type DoTaskArgs struct {
	JobName    string
	File       string   // only for map, the input file
	Phase      jobPhase // are we in mapPhase or reducePhase?
	TaskNumber int      // this task's index in the current phase

	// NumOtherPhase is the total number of tasks in other phase; mappers
	// need this to compute the number of output bins, and reducers need
	// this to know how many input files to collect.
	NumOtherPhase int
}
```
Matching schedule's parameters against these fields, every field has an obvious counterpart. With all of the above, we arrive at the code shown below, following the plan sketched earlier: spawn ntasks goroutines, fill in each task's DoTaskArgs, and pass it in along with registerChan. As soon as the channel yields an idle worker, whichever blocked goroutine wins the receive gets to run; since the tasks are independent, the order does not matter. A WaitGroup then blocks the main goroutine until all tasks complete. It looks perfect, but the run hangs partway through, which is the biggest trap in this part. Let's analyze why.
Run the test a few times and watch the output: tasks 90 and 91 turn out to be the last two tasks, and the run hangs right at the end. The reason is that sends on the channel block (Distributed creates registerChan without a buffer). When the second-to-last task finishes and pushes its worker back onto the channel, the last remaining goroutine takes it; but once that final task finishes, no goroutine is left to receive, so the final send blocks forever. The code below therefore hangs on the `registerChan <- address` line in the last goroutine.
```go
var wg sync.WaitGroup
wg.Add(ntasks)
for i := 0; i != ntasks; i++ {
	doTaskArgs := DoTaskArgs{jobName, mapFiles[i], phase, i, n_other}
	go func(doTaskArgs DoTaskArgs, registerChan chan string) {
		address := <-registerChan
		call(address, "Worker.DoTask", doTaskArgs, nil)
		registerChan <- address // the last goroutine blocks here forever
		wg.Done()
	}(doTaskArgs, registerChan)
}
wg.Wait()
```
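For intuition, here is a minimal standalone program (not part of the lab) that reproduces the same failure mode with an unbuffered channel:

```go
package main

func main() {
	ch := make(chan string) // unbuffered, like registerChan
	done := make(chan struct{})
	go func() {
		<-ch            // take the worker (the second-to-last task finishing)
		ch <- "worker0" // try to return it: nobody will ever receive
		close(done)
	}()
	ch <- "worker0"
	<-done // never reached; the runtime aborts with "all goroutines are asleep - deadlock!"
}
```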
The simplest fix is to perform that send in a fresh goroutine. The very last worker still never gets re-registered, but because the send now happens in its own goroutine, wg.Done() is reached and the main goroutine can return from wg.Wait(); the leftover goroutine stays blocked, but it is torn down when the program exits. After this change the code looks like this:
```go
//
// schedule() starts and waits for all tasks in the given phase (Map
// or Reduce). the mapFiles argument holds the names of the files that
// are the inputs to the map phase, one per map task. nReduce is the
// number of reduce tasks. the registerChan argument yields a stream
// of registered workers; each item is the worker's RPC address,
// suitable for passing to call(). registerChan will yield all
// existing registered workers (if any) and new ones as they register.
//
func schedule(jobName string, mapFiles []string, nReduce int, phase jobPhase, registerChan chan string) {
	var ntasks int
	var n_other int // number of inputs (for reduce) or outputs (for map)
	switch phase {
	case mapPhase:
		ntasks = len(mapFiles)
		n_other = nReduce
	case reducePhase:
		ntasks = nReduce
		n_other = len(mapFiles)
	}

	fmt.Printf("Schedule: %v %v tasks (%d I/Os)\n", ntasks, phase, n_other)

	// All ntasks tasks have to be scheduled on workers, and only once all of
	// them have been completed successfully should the function return.
	// Remember that workers may fail, and that any given worker may finish
	// multiple tasks.
	var wg sync.WaitGroup
	wg.Add(ntasks)
	for i := 0; i != ntasks; i++ {
		doTaskArgs := DoTaskArgs{jobName, mapFiles[i], phase, i, n_other}
		go func(doTaskArgs DoTaskArgs, registerChan chan string) {
			address := <-registerChan
			call(address, "Worker.DoTask", doTaskArgs, nil)
			// Re-register the worker in a separate goroutine so the
			// final send cannot deadlock the task goroutine.
			go func() {
				registerChan <- address
			}()
			wg.Done()
		}(doTaskArgs, registerChan)
	}
	wg.Wait()
	fmt.Printf("Schedule: %v phase done\n", phase)
}
```
The tests pass.
Part4:
With Part 3 in place, this part is very simple. If a worker fails, call returns false; in that case we take the next worker from the channel and retry until call returns true, so we just wrap the worker fetch and the call in a for loop. Note that a worker whose call failed is simply never put back on the channel, so it will not be picked again.
```go
//
// schedule() starts and waits for all tasks in the given phase (Map
// or Reduce). the mapFiles argument holds the names of the files that
// are the inputs to the map phase, one per map task. nReduce is the
// number of reduce tasks. the registerChan argument yields a stream
// of registered workers; each item is the worker's RPC address,
// suitable for passing to call(). registerChan will yield all
// existing registered workers (if any) and new ones as they register.
//
func schedule(jobName string, mapFiles []string, nReduce int, phase jobPhase, registerChan chan string) {
	var ntasks int
	var n_other int // number of inputs (for reduce) or outputs (for map)
	switch phase {
	case mapPhase:
		ntasks = len(mapFiles)
		n_other = nReduce
	case reducePhase:
		ntasks = nReduce
		n_other = len(mapFiles)
	}

	fmt.Printf("Schedule: %v %v tasks (%d I/Os)\n", ntasks, phase, n_other)

	// All ntasks tasks have to be scheduled on workers, and only once all of
	// them have been completed successfully should the function return.
	// Remember that workers may fail, and that any given worker may finish
	// multiple tasks.
	var wg sync.WaitGroup
	wg.Add(ntasks)
	for i := 0; i != ntasks; i++ {
		doTaskArgs := DoTaskArgs{jobName, mapFiles[i], phase, i, n_other}
		go func(doTaskArgs DoTaskArgs, registerChan chan string) {
			// Keep trying workers until one completes the task; a failed
			// worker is never re-registered.
			success := false
			var address string
			for !success {
				address = <-registerChan
				success = call(address, "Worker.DoTask", doTaskArgs, nil)
			}
			go func() {
				registerChan <- address
			}()
			wg.Done()
		}(doTaskArgs, registerChan)
	}
	wg.Wait()
	fmt.Printf("Schedule: %v phase done\n", phase)
}
```
The tests pass.
Part5:
Part 5 asks us to build an inverted index: each word that appears anywhere maps to the set of documents it appears in. So the map phase emits word -> document-name pairs, and the reduce phase computes the number of distinct documents (using a map as a set, because a word can occur several times in the same file and duplicate file names must be removed) plus the comma-separated string of those file names, which must be sorted. Note that the comment in the skeleton file is a bit misleading; just ignore it.
```go
// The mapping function is called once for each piece of the input.
// In this framework, the key is the name of the file that is being processed,
// and the value is the file's contents. The return value should be a slice of
// key/value pairs, each represented by a mapreduce.KeyValue.
func mapF(document string, value string) (res []mapreduce.KeyValue) {
	// Emit one (word, document) pair per word occurrence.
	f := func(c rune) bool {
		return !unicode.IsLetter(c)
	}
	words := strings.FieldsFunc(value, f)
	for _, word := range words {
		res = append(res, mapreduce.KeyValue{word, document})
	}
	return
}

// The reduce function is called once for each key generated by Map, with a
// list of that key's string value (merged across all inputs). The return value
// should be a single output value for that key.
func reduceF(key string, values []string) string {
	// Deduplicate document names: a word may occur several times in one file.
	m := make(map[string]string)
	for _, value := range values {
		m[value] = value
	}
	var uniqueValues []string
	for _, value := range m {
		uniqueValues = append(uniqueValues, value)
	}
	sort.Strings(uniqueValues)
	// Output format: "<count> <doc1>,<doc2>,..." with sorted document names.
	s := strconv.Itoa(len(m))
	s += " "
	s += uniqueValues[0]
	for i := 1; i != len(uniqueValues); i++ {
		s += ","
		s += uniqueValues[i]
	}
	return s
}
```
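To sanity-check the output format, here is a standalone restatement of reduceF's dedupe-sort-join logic; the function name invertedIndexValue and the file names are invented for illustration:

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// invertedIndexValue mirrors reduceF above: deduplicate, sort, then join.
func invertedIndexValue(values []string) string {
	seen := make(map[string]struct{})
	for _, v := range values {
		seen[v] = struct{}{}
	}
	docs := make([]string, 0, len(seen))
	for d := range seen {
		docs = append(docs, d)
	}
	sort.Strings(docs)
	s := strconv.Itoa(len(docs)) + " " + docs[0]
	for _, d := range docs[1:] {
		s += "," + d
	}
	return s
}

func main() {
	// "hello" appears twice in a.txt and once in b.txt.
	fmt.Println(invertedIndexValue([]string{"a.txt", "a.txt", "b.txt"})) // 2 a.txt,b.txt
}
```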
All tests pass.