This is a write-up of lab1 of MIT 6.824, recording what I learned along the way and the pitfalls I hit.

My environment is Windows 10, so I made a few environment-specific changes to the lab code. If you are only interested in the code, head to: github/zouzhitao

mapreduce overview

First, a rough look at what mapreduce actually is.

My own rough understanding is this: mapreduce is a system that runs a user-defined task in a distributed fashion. Roughly, it works as follows.

The user provides two functions:

  1. mapFunc(k1, v1) -> list(k2, v2)
  2. reduceFunc(k2, list(v2)) -> the answer for k2

The distributed system processes the user's task in parallel and ultimately produces an answer for every k2. Let's describe how it does that.
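Before diving in, here is the concrete Go shape these two hooks take in this lab (a sketch: KeyValue matches the lab's common.go, while the function bodies here are placeholders):

    // KeyValue is the pair type passed between map and reduce
    // (this matches the lab's common.go).
    type KeyValue struct {
        Key   string
        Value string
    }

    // mapFunc(k1, v1) -> list(k2, v2): k1 is an input file name, v1 its contents.
    func mapF(filename string, contents string) []KeyValue {
        return nil // placeholder: emit key/value pairs here
    }

    // reduceFunc(k2, list(v2)) -> the answer for k2.
    func reduceF(key string, values []string) string {
        return "" // placeholder: fold values into one answer here
    }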

First, there is a master that does the task scheduling.

master

  1. First it schedules workers to run the map tasks. With $M$ map tasks in total, each map task stores its results in intermediate files m-i-j, where $i \in \{0,\dots,M-1\}$ and $j \in \{0,\dots,R-1\}$ (see the file-naming sketch after this list).
  2. It then schedules workers to run the reduce tasks. With $R$ reduce tasks in total, the answer of reduce task $j$ is stored in $r_j$.
  3. Finally it merges the answers of all the reduce tasks into a single file and hands it to the user.
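For reference, the intermediate and per-reduce output file names come from two helpers in the lab's common.go; reconstructed from memory, they look roughly like this (the exact name format may differ in your checkout):

    // reduceName is the name of the intermediate file that map task mapTask
    // writes for reduce task reduceTask -- the "m-i-j" above.
    func reduceName(jobName string, mapTask int, reduceTask int) string {
        return "mrtmp." + jobName + "-" + strconv.Itoa(mapTask) + "-" + strconv.Itoa(reduceTask)
    }

    // mergeName is the name of the output file of reduce task reduceTask -- the
    // "r_j" above, which the master's merge step reads.
    func mergeName(jobName string, reduceTask int) string {
        return "mrtmp." + jobName + "-res-" + strconv.Itoa(reduceTask)
    }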

The details are all in the lab.

detail

This part walks through the lab itself (reading the code), though not in the lab's order. In my view the point of doing the lab is not the lab for its own sake but understanding the mapreduce system.

master

Let's first look at the master's code:

    // Master holds all the state that the master needs to keep track of.
    type Master struct {
        sync.Mutex

        address     string
        doneChannel chan bool

        // protected by the mutex
        newCond *sync.Cond // signals when Register() adds to workers[]
        workers []string   // each worker's UNIX-domain socket name -- its RPC address

        // Per-task information
        jobName string   // Name of currently executing job
        files   []string // Input files
        nReduce int      // Number of reduce partitions

        shutdown chan struct{}
        l        net.Listener
        stats    []int
    }

The master holds all the state needed to run one job.

master.run

This is what the master actually does:

    // Distributed schedules map and reduce tasks on workers that register with the
    // master over RPC.
    func Distributed(jobName string, files []string, nreduce int, master string) (mr *Master) {
        mr = newMaster(master)
        mr.startRPCServer()
        go mr.run(jobName, files, nreduce,
            func(phase jobPhase) {
                ch := make(chan string) // stream of worker addresses
                go mr.forwardRegistrations(ch)
                schedule(mr.jobName, mr.files, mr.nReduce, phase, ch)
            },
            func() {
                mr.stats = mr.killWorkers()
                mr.stopRPCServer()
            })
        return
    }

    // run executes a mapreduce job on the given number of mappers and reducers.
    //
    // First, it divides up the input file among the given number of mappers, and
    // schedules each task on workers as they become available. Each map task bins
    // its output in a number of bins equal to the given number of reduce tasks.
    // Once all the mappers have finished, workers are assigned reduce tasks.
    //
    // When all tasks have been completed, the reducer outputs are merged,
    // statistics are collected, and the master is shut down.
    //
    // Note that this implementation assumes a shared file system.
    func (mr *Master) run(jobName string, files []string, nreduce int,
        schedule func(phase jobPhase),
        finish func(),
    ) {
        mr.jobName = jobName
        mr.files = files
        mr.nReduce = nreduce

        fmt.Printf("%s: Starting Map/Reduce task %s\n", mr.address, mr.jobName)

        schedule(mapPhase)
        schedule(reducePhase)
        finish()
        mr.merge()

        fmt.Printf("%s: Map/Reduce task completed\n", mr.address)

        mr.doneChannel <- true
    }

schedule

What we actually have to implement is this schedule, and it is the heart of the lab: schedule hands tasks out to workers. Note that there are $M$ map tasks and $R$ reduce tasks but only $n$ workers, and usually $M>n, R>n$; that is what keeps every worker busy and the pipeline full.

    //
    // schedule() starts and waits for all tasks in the given phase (mapPhase
    // or reducePhase). the mapFiles argument holds the names of the files that
    // are the inputs to the map phase, one per map task. nReduce is the
    // number of reduce tasks. the registerChan argument yields a stream
    // of registered workers; each item is the worker's RPC address,
    // suitable for passing to call(). registerChan will yield all
    // existing registered workers (if any) and new ones as they register.
    //
    func schedule(jobName string, mapFiles []string, nReduce int, phase jobPhase, registerChan chan string) {
        var ntasks int
        var nOther int // number of inputs (for reduce) or outputs (for map)
        switch phase {
        case mapPhase:
            ntasks = len(mapFiles)
            nOther = nReduce
        case reducePhase:
            ntasks = nReduce
            nOther = len(mapFiles)
        }

        fmt.Printf("Schedule: %v %v tasks (%d I/Os)\n", ntasks, phase, nOther)

        // All ntasks tasks have to be scheduled on workers. Once all tasks
        // have completed successfully, schedule() should return.
        //
        // Your code here (Part III, Part IV).
        //

        // Part III
        var wg sync.WaitGroup
        wg.Add(ntasks)
        for i := 0; i < ntasks; i++ {
            go func(i int) {
                defer wg.Done()
                filename := ""
                if i < len(mapFiles) { // only map tasks carry an input file
                    filename = mapFiles[i]
                }
                taskArgs := DoTaskArgs{
                    JobName:       jobName,
                    File:          filename,
                    Phase:         phase,
                    TaskNumber:    i,
                    NumOtherPhase: nOther,
                }
                taskFinished := false
                for !taskFinished {
                    workAddr := <-registerChan
                    taskFinished = call(workAddr, "Worker.DoTask", taskArgs, nil)
                    // hand the worker back on its own goroutine so this send
                    // cannot block the last tasks of the phase
                    go func() { registerChan <- workAddr }()
                }
            }(i)
        }
        wg.Wait()
        fmt.Printf("Schedule: %v done\n", phase)
    }

(Note: the bound check must be i < len(mapFiles); with i <= len(mapFiles), mapFiles[i] would index out of range during the reduce phase whenever nReduce exceeds the number of input files.)

What schedule does is: for every task, invoke the call function to make an RPC asking a worker to run Worker.DoTask. This is the Part III/IV code.

A few details to note:

  1. registerChan is a channel carrying the addresses of available workers, so after a task finishes, the worker's address has to be put back onto registerChan (see the sketch after this list).
  2. The master schedules the two phases serially: it waits for all map tasks to finish before scheduling any reduce task. So schedule must not return early; it has to wait for all tasks of the current phase to complete.
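Point 1 hides a subtlety: the hand-back send must run on its own goroutine. Once the last task of a phase finishes, nothing reads registerChan any more, so a synchronous send would block forever, the deferred wg.Done() would never run, and wg.Wait() would deadlock. A minimal self-contained sketch of the channel discipline (the worker/task names are made up):

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        workers := make(chan string)
        go func() { workers <- "worker-0" }() // one pre-registered worker
        var wg sync.WaitGroup
        for i := 0; i < 3; i++ {
            wg.Add(1)
            go func(task int) {
                defer wg.Done()
                addr := <-workers
                fmt.Printf("task %d ran on %s\n", task, addr)
                // non-blocking hand-back: the final send has no receiver,
                // but it no longer prevents wg.Done() from running
                go func() { workers <- addr }()
            }(i)
        }
        wg.Wait()
    }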

Next, what does this call actually do? It simply makes the RPC that runs Worker.DoTask on the worker, so it is enough to look at what Worker.DoTask does.
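For reference, the lab's call() helper (in common_rpc.go) looks roughly like this, reconstructed from memory; in my Windows port the dial network is "tcp" rather than "unix":

    // call dials the worker's RPC address (net/rpc), invokes the named method,
    // and reports whether the call succeeded.
    func call(srv string, rpcname string, args interface{}, reply interface{}) bool {
        c, err := rpc.Dial("unix", srv) // "tcp" in my Windows port
        if err != nil {
            return false
        }
        defer c.Close()
        return c.Call(rpcname, args, reply) == nil
    }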

worker

    // DoTask is called by the master when a new task is being scheduled on this
    // worker.
    func (wk *Worker) DoTask(arg *DoTaskArgs, _ *struct{}) error {
        // ...
        switch arg.Phase {
        case mapPhase:
            doMap(arg.JobName, arg.TaskNumber, arg.File, arg.NumOtherPhase, wk.Map)
        case reducePhase:
            doReduce(arg.JobName, arg.TaskNumber, mergeName(arg.JobName, arg.TaskNumber), arg.NumOtherPhase, wk.Reduce)
        }
        // ...
    }

At its core it simply calls doMap or doReduce.

This is also the content of Part I. Let's look at what doMap and doReduce do.

doMap

    func doMap(
        jobName string, // the name of the MapReduce job
        mapTask int, // which map task this is
        inFile string,
        nReduce int, // the number of reduce tasks that will be run ("R" in the paper)
        mapF func(filename string, contents string) []KeyValue,
    ) {
        //
        // doMap manages one map task: it should read one of the input files
        // (inFile), call the user-defined map function (mapF) for that file's
        // contents, and partition mapF's output into nReduce intermediate files.
        //
        // There is one intermediate file per reduce task. The file name
        // includes both the map task number and the reduce task number. Use
        // the filename generated by reduceName(jobName, mapTask, r)
        // as the intermediate file for reduce task r. Call ihash() (see
        // below) on each key, mod nReduce, to pick r for a key/value pair.
        //
        // mapF() is the map function provided by the application. The first
        // argument should be the input file name, though the map function
        // typically ignores it. The second argument should be the entire
        // input file contents. mapF() returns a slice containing the
        // key/value pairs for reduce; see common.go for the definition of
        // KeyValue.
        //
        // Look at Go's ioutil and os packages for functions to read
        // and write files.
        //
        // Coming up with a scheme for how to format the key/value pairs on
        // disk can be tricky, especially when taking into account that both
        // keys and values could contain newlines, quotes, and any other
        // character you can think of.
        //
        // One format often used for serializing data to a byte stream that the
        // other end can correctly reconstruct is JSON. You are not required to
        // use JSON, but as the output of the reduce tasks *must* be JSON,
        // familiarizing yourself with it here may prove useful. You can write
        // out a data structure as a JSON string to a file using the commented
        // code below. The corresponding decoding functions can be found in
        // common_reduce.go.
        //
        //   enc := json.NewEncoder(file)
        //   for _, kv := ... {
        //     err := enc.Encode(&kv)
        //
        // Remember to close the file after you have written all the values!
        //
        // Your code here (Part I).
        //
        content := safeReadFile(inFile)
        ans := mapF(inFile, string(content))
        jsonEncoder := make([]*json.Encoder, nReduce)
        for i := 0; i < nReduce; i++ {
            f := safeCreaFile(reduceName(jobName, mapTask, i))
            jsonEncoder[i] = json.NewEncoder(f)
            defer f.Close()
        }
        for _, kv := range ans {
            r := ihash(kv.Key) % nReduce
            err := jsonEncoder[r].Encode(&kv)
            if err != nil {
                log.Fatal("jsonEncode err", err)
            }
        }
    }
  1. Read the input file's contents.
  2. Call the user's mapF to get a list of key/value pairs, then hash each key to route its pair into the intermediate file of one reduce task.

    In other words, each map task produces nReduce intermediate files, so $M \times R$ intermediate files are created in total. And since routing is done by hashing the key, the same key is guaranteed to end up at the same reduce task.
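The code above uses two small wrappers of mine, safeReadFile and safeCreaFile, which aren't shown; they are just fail-fast helpers, along these lines (a sketch, not the exact code from my repo):

    // safeReadFile reads a whole file, aborting on any I/O error.
    func safeReadFile(name string) []byte {
        b, err := ioutil.ReadFile(name) // io/ioutil
        if err != nil {
            log.Fatal("read ", name, ": ", err)
        }
        return b
    }

    // safeCreaFile creates a file for writing, aborting on any I/O error.
    func safeCreaFile(name string) *os.File {
        f, err := os.Create(name)
        if err != nil {
            log.Fatal("create ", name, ": ", err)
        }
        return f
    }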

reduce

    func doReduce(
        jobName string, // the name of the whole MapReduce job
        reduceTask int, // which reduce task this is
        outFile string, // write the output here
        nMap int, // the number of map tasks that were run ("M" in the paper)
        reduceF func(key string, values []string) string,
    ) {
        //
        // doReduce manages one reduce task: it should read the intermediate
        // files for the task, sort the intermediate key/value pairs by key,
        // call the user-defined reduce function (reduceF) for each key, and
        // write reduceF's output to disk.
        //
        // You'll need to read one intermediate file from each map task;
        // reduceName(jobName, m, reduceTask) yields the file
        // name from map task m.
        //
        // Your doMap() encoded the key/value pairs in the intermediate
        // files, so you will need to decode them. If you used JSON, you can
        // read and decode by creating a decoder and repeatedly calling
        // .Decode(&kv) on it until it returns an error.
        //
        // You may find the first example in the golang sort package
        // documentation useful.
        //
        // reduceF() is the application's reduce function. You should
        // call it once per distinct key, with a slice of all the values
        // for that key. reduceF() returns the reduced value for that key.
        //
        // You should write the reduce output as JSON encoded KeyValue
        // objects to the file named outFile. We require you to use JSON
        // because that is what the merger that combines the output
        // from all the reduce tasks expects. There is nothing special about
        // JSON -- it is just the marshalling format we chose to use. Your
        // output code will look something like this:
        //
        //   enc := json.NewEncoder(file)
        //   for key := ... {
        //     enc.Encode(KeyValue{key, reduceF(...)})
        //   }
        //   file.Close()
        //
        // Your code here (Part I).
        //
        kvs := make(map[string][]string)
        for i := 0; i < nMap; i++ {
            kv := jsonDecode(reduceName(jobName, i, reduceTask))
            for _, v := range kv {
                kvs[v.Key] = append(kvs[v.Key], v.Value)
            }
        }
        f := safeCreaFile(outFile)
        defer f.Close()
        enc := json.NewEncoder(f)
        for k, v := range kvs {
            reduceAns := reduceF(k, v)
            enc.Encode(KeyValue{k, reduceAns})
        }
    }

What reduce does is also simple: it first reads all the intermediate files assigned to it and builds a mapping from each key to the list of its values.

It then calls the user's reduceF for each key and JSON-encodes the answers into a single output file.
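jsonDecode above is another small wrapper of mine that isn't shown; it opens one intermediate file and calls Decode until the decoder errors out (io.EOF at a clean end of file). A sketch:

    // jsonDecode reads every KeyValue that doMap encoded into the named file.
    func jsonDecode(name string) []KeyValue {
        f, err := os.Open(name)
        if err != nil {
            log.Fatal("open ", name, ": ", err)
        }
        defer f.Close()
        var kvs []KeyValue
        dec := json.NewDecoder(f)
        for {
            var kv KeyValue
            if err := dec.Decode(&kv); err != nil {
                break // io.EOF once the file is fully consumed
            }
            kvs = append(kvs, kv)
        }
        return kvs
    }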

Part I done.

Next come two examples.

example

The two examples here are word count and inverted index.

word count

The task is to count how many times each word occurs.

    //
    // The map function is called once for each file of input. The first
    // argument is the name of the input file, and the second is the
    // file's complete contents. You should ignore the input file name,
    // and look only at the contents argument. The return value is a slice
    // of key/value pairs.
    //
    func mapF(filename string, contents string) []mapreduce.KeyValue {
        // Your code here (Part II).
        var ret []mapreduce.KeyValue
        words := strings.FieldsFunc(contents, func(x rune) bool {
            return !unicode.IsLetter(x)
        })
        for _, w := range words {
            kv := mapreduce.KeyValue{w, ""}
            ret = append(ret, kv)
        }
        return ret
    }

    //
    // The reduce function is called once for each key generated by the
    // map tasks, with a list of all the values created for that key by
    // any map task.
    //
    func reduceF(key string, values []string) string {
        // Your code here (Part II).
        return strconv.Itoa(len(values))
    }

Part II done.

One thing to watch: the test compares output with diff, which treats \n and \r\n as different, so make sure everything your code writes out uses \n line endings.

inverted index

    // The mapping function is called once for each piece of the input.
    // In this framework, the key is the name of the file that is being processed,
    // and the value is the file's contents. The return value should be a slice of
    // key/value pairs, each represented by a mapreduce.KeyValue.
    func mapF(document string, value string) (res []mapreduce.KeyValue) {
        // Your code here (Part V).
        words := strings.FieldsFunc(value, func(x rune) bool {
            return !unicode.IsLetter(x)
        })
        kvmap := make(map[string]string) // dedup: one entry per word per document
        for _, w := range words {
            kvmap[w] = document
        }
        for k, v := range kvmap {
            res = append(res, mapreduce.KeyValue{k, v})
        }
        return
    }

    // The reduce function is called once for each key generated by Map, with a
    // list of that key's string value (merged across all inputs). The return value
    // should be a single output value for that key.
    func reduceF(key string, values []string) string {
        // Your code here (Part V).
        numberOfDoc := len(values)
        sort.Strings(values)
        res := strconv.Itoa(numberOfDoc) + " " + strings.Join(values, ",")
        return res
    }

The thing to watch here is that duplicate words within the same document must be removed; stashing the words in a map takes care of that.

Finally, the environment pitfalls.

windows environment notes

  1. The UNIX-domain socket addresses the lab uses for RPC registration don't exist on Windows, so I changed them to TCP (a sketch follows this list).
  2. After the switch to TCP, make sure the worker closes its TCP connection when it is shut down.
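A minimal sketch of the swap: the lab's startRPCServer does roughly net.Listen("unix", mr.address); on Windows I listen on a TCP loopback address instead (the address and port here are my arbitrary choice):

    package main

    import (
        "log"
        "net"
    )

    func main() {
        // instead of net.Listen("unix", mr.address):
        l, err := net.Listen("tcp", "127.0.0.1:7777") // arbitrary loopback port
        if err != nil {
            log.Fatal("listen error: ", err)
        }
        defer l.Close()
        log.Println("RPC server would Accept() on", l.Addr())
    }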

reference

  1. the Google MapReduce paper
  2. lab1
  3. github/zouzhitao code repo

copyright notice

This is an original article by the author, licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

Author: taotao

When reposting, please keep this copyright notice and credit the source.
