Source-Level Analysis of How a MapReduce ReduceTask Runs

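The previous article, Source-Level Analysis of How a MapReduce MapTask Runs, was finally recovered, thankfully. It walked through the execution flow of a MapTask; this one walks through the execution flow of a ReduceTask. A ReduceTask also comes in four task types (job setup, job cleanup, task cleanup and the reduce work itself; see the corresponding part of the previous article). The reduce work itself reads one partition of data from each Map Task, sorts it, hands it group by group to the user-written reduce method, and writes the result to HDFS.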

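MapTask and ReduceTask are both subclasses of Task and correspond to what we usually call map and reduce tasks. As in the previous article, the Child class calls the run method directly. The code of ReduceTask.run() is as follows: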

  //ReduceTask.run starts much like MapTask.run: it calls initialize(), then, depending on the task type,
   //runJobCleanupTask(), runJobSetupTask() or runTaskCleanupTask(). The real work then has three steps: Copy, Sort, Reduce.
   @Override
   @SuppressWarnings("unchecked")
   public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
     throws IOException, InterruptedException, ClassNotFoundException {
     this.umbilical = umbilical;
     job.setBoolean("mapred.skip.on", isSkipping());
     /* Register the phases the reduce goes through, so that the TaskTracker can be told how far the task has got */
     if (isMapOrReduce()) {
       copyPhase = getProgress().addPhase("copy");
       sortPhase  = getProgress().addPhase("sort");
       reducePhase = getProgress().addPhase("reduce");
     }
     // start thread that will handle communication with parent
  // Create and start the reporter thread used to communicate with the TaskTracker
     TaskReporter reporter = new TaskReporter(getProgress(), umbilical,
         jvmContext);
     reporter.startCommunicationThread();
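     //When the job client initializes the job, the new API is used by default; see Job.setUseNewAPI()
     boolean useNewApi = job.getUseNewReducer();
     /* Initialize the task: mostly output-related setup such as creating the committer and setting the working directory */
     initialize(job, getJobID(), reporter, useNewApi);//the output directory is handled here
     /* The next three if statements act on the task type. These methods belong to the Task class, so they do not depend on whether the task is a MapTask or a ReduceTask */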
     // check if it is a cleanupJobTask
     if (jobCleanup) {
       runJobCleanupTask(umbilical, reporter);
       return;
     }
     if (jobSetup) {
         //mainly creates the FileSystem object for the working directory
       runJobSetupTask(umbilical, reporter);
       return;
     }
     if (taskCleanup) {
          //set the task's current phase to cleanup and delete the working directory
       runTaskCleanupTask(umbilical, reporter);
       return;
     }
 
     // Initialize the codec
     codec = initCodec();
 
     boolean isLocal = "local".equals(job.get("mapred.job.tracker", "local"));  //is this a local (single-machine) Hadoop run?
     if (!isLocal) {
         //1. Copy: fetch the map output files from the servers that ran the map tasks. The copying is done by the ReduceTask.ReduceCopier class.
         //The ReduceCopier object copies the map outputs to the machine running this reduce
       reduceCopier = new ReduceCopier(umbilical, job, reporter);
       if (!reduceCopier.fetchOutputs()) {//fetchOutputs copies the output of every map task
         if(reduceCopier.mergeThrowable instanceof FSError) {
           throw (FSError)reduceCopier.mergeThrowable;
         }
         throw new IOException("Task: " + getTaskID() +
             " - The reduce copier failed", reduceCopier.mergeThrowable);
       }
     }
     copyPhase.complete();                         // copy is already complete
     setPhase(TaskStatus.Phase.SORT);
     statusUpdate(umbilical);
 
     final FileSystem rfs = FileSystem.getLocal(job).getRaw();
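     //2. Sort (really a merge). This continues the merging already done during the copy and runs once all files have been copied.
     //The Merger utility class merges all the files, so that every required map output ends up in one merged, sorted stream.
     //The map output files gathered from the other servers are then deleted.
 
     //choose how to merge depending on whether Hadoop is running distributed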
     RawKeyValueIterator rIter = isLocal
       ? Merger.merge(job, rfs, job.getMapOutputKeyClass(),
           job.getMapOutputValueClass(), codec, getMapFiles(rfs, true),
           !conf.getKeepFailedTaskFiles(), job.getInt("io.sort.factor", 100),
           new Path(getTaskID().toString()), job.getOutputKeyComparator(),
           reporter, spilledRecordsCounter, null)
       : reduceCopier.createKVIterator(job, rfs, reporter);
 
     // free up the data structures
     mapOutputFilesOnDisk.clear();
 
     sortPhase.complete();                         // sort is complete
     setPhase(TaskStatus.Phase.REDUCE);
     statusUpdate(umbilical);
     //3. Reduce. 1. The last stage of the reduce task. It prepares the map output keyClass ("mapred.mapoutput.key.class" or "mapred.output.key.class"),
     //the valueClass ("mapred.mapoutput.value.class" or "mapred.output.value.class")
     //and the grouping Comparator ("mapred.output.value.groupfn.class" or "mapred.output.key.comparator.class")
     Class keyClass = job.getMapOutputKeyClass();
     Class valueClass = job.getMapOutputValueClass();
     RawComparator comparator = job.getOutputValueGroupingComparator();
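     //2. Run runNewReducer or runOldReducer depending on useNewApi. We analyze runNewReducer.
     if (useNewApi) {
         //3. runNewReducer
         //0. report some information to the reporter
         //1. create a TaskAttemptContext; through it create the reducer, the output and a tracking RecordWriter that counts output records; finally create the Context that collects the reduce results
         //2. reducer.run(reducerContext) starts the reduce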
       runNewReducer(job, umbilical, reporter, rIter, comparator,
                     keyClass, valueClass);
     } else {
       runOldReducer(job, umbilical, reporter, rIter, comparator,
                     keyClass, valueClass);
     }
     done(umbilical, reporter);
   }

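(1) The reduce has three phases: copy fetches the map output remotely, sort merges all the data into sorted order, and reduce does the aggregation, i.e. the reducer we write ourselves. A Progress is set up for each of the three phases and used to report status to the TaskTracker.

(2) The part of the code above from starting the reporter through the taskCleanup check (roughly lines 15-40 of the listing) is essentially the same as the corresponding part in the MapTask article; refer to that.

(3) codec = initCodec() checks whether the map output is compressed. If it is, the compression codec instance is returned; otherwise null is returned. We discuss the uncompressed case here.

(4) We discuss fully distributed Hadoop, i.e. isLocal == false. A ReduceCopier object reduceCopier is constructed, and reduceCopier.fetchOutputs() is called to copy the output of every mapper to the local machine.

(5) The copy phase is then complete; the next phase is set to sort and the status is updated.

(6) The KV iterator is chosen according to isLocal. In fully distributed mode reduceCopier.createKVIterator(job, rfs, reporter) supplies it.

(7) The sort phase is then complete; the next phase is set to reduce and the status is updated.

(8) Some configuration is read, and the processing path is chosen by whether the new API is in use. Here it is the new API, so runNewReducer(job, umbilical, reporter, rIter, comparator, keyClass, valueClass) is called, which runs the reducer.

(9) done(umbilical, reporter) does the cleanup at the end of the task: it updates the counters with updateCounters(); if the task needs a commit, it sets the Task state to COMMIT_PENDING, reports through TaskUmbilicalProtocol that the task is finished and waiting to commit, and then calls commit; it sets the task-done flag; it stops the Reporter communication thread; it sends the final status update (via sendLastUpdate); and it reports completion through TaskUmbilicalProtocol (via sendDone).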

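Some people divide a Reduce Task into five phases. 1. Shuffle, also called copy: one partition of data is fetched remotely from each MapTask; if it is larger than a threshold it is written to disk, otherwise it is kept in memory. 2. Merge: while the remote copy is running, the Reduce Task runs two background threads that merge the in-memory and on-disk files, to keep memory use and the number of disk files down. 3. Sort: the input of the user-written reduce method is grouped by key, so the copied data must be sorted. A merge sort is used, because the output of each Map Task is already sorted. 4. Reduce: each group is handed in turn to the user-written reduce method. 5. Write: the result is written to HDFS.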
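Why a merge is enough in the sort phase can be shown with a small k-way merge over runs that are already sorted. This is only a sketch of the idea, not Hadoop's Merger: the class name KWayMergeSketch and the use of plain integers as keys are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class KWayMergeSketch {
  /** One sorted run together with its current head element. */
  private static final class Run {
    final Iterator<Integer> it;
    int head;
    Run(Iterator<Integer> it) { this.it = it; this.head = it.next(); }
  }

  /** Merges runs that are each already sorted into one sorted list. */
  static List<Integer> merge(List<List<Integer>> sortedRuns) {
    // The heap always exposes the run whose head element is smallest.
    PriorityQueue<Run> heap =
        new PriorityQueue<>((a, b) -> Integer.compare(a.head, b.head));
    for (List<Integer> run : sortedRuns) {
      if (!run.isEmpty()) {
        heap.add(new Run(run.iterator()));
      }
    }
    List<Integer> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      Run smallest = heap.poll();
      out.add(smallest.head);
      if (smallest.it.hasNext()) {
        smallest.head = smallest.it.next(); // advance this run
        heap.add(smallest);                 // and put it back in the heap
      }
    }
    return out;
  }
}
```

Each element is pushed and popped once, so no run ever has to be re-sorted; only the heads of the runs are compared.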

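The five phases above are a fine-grained view. The code has three phases: copy, sort and reduce. When we run an MR program from Eclipse, the reduce percentage shown on the console is made up of these three phases, each contributing 33.3%.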
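The equal weighting of the three phases can be pictured with a tiny model. It is a simplification written for this article, not Hadoop's Progress class; PhaseProgressSketch and its methods are hypothetical names.

```java
final class PhaseProgressSketch {
  // Index 0 = copy, 1 = sort, 2 = reduce; each entry is a fraction in [0, 1].
  private final float[] phases = new float[3];

  /** Records how far one phase has got, clamped to [0, 1]. */
  void set(int phase, float fraction) {
    phases[phase] = Math.max(0f, Math.min(1f, fraction));
  }

  /** Overall progress: the three phases carry equal weight. */
  float overall() {
    float sum = 0f;
    for (float p : phases) {
      sum += p;
    }
    return sum / phases.length;
  }
}
```

With the copy phase finished and the sort phase half done, overall() gives (1 + 0.5 + 0) / 3 = 0.5, which is the kind of figure the console would show as 50%.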

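Next we focus on two things: the runNewReducer method and the ReduceCopier class. The latter is more than 2000 lines long and makes up most of the ReduceTask class.

A. We look at runNewReducer first, since it is easier than ReduceCopier. The code is: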

 @SuppressWarnings("unchecked")
   private <INKEY,INVALUE,OUTKEY,OUTVALUE>
   void runNewReducer(JobConf job,
                      final TaskUmbilicalProtocol umbilical,
                      final TaskReporter reporter,
                      RawKeyValueIterator rIter,
                      RawComparator<INKEY> comparator,
                      Class<INKEY> keyClass,
                      Class<INVALUE> valueClass
                      ) throws IOException,InterruptedException,
                               ClassNotFoundException {
     // wrap value iterator to report progress.
     final RawKeyValueIterator rawIter = rIter;
     rIter = new RawKeyValueIterator() {
       public void close() throws IOException {
         rawIter.close();
       }
       public DataInputBuffer getKey() throws IOException {
         return rawIter.getKey();
       }
       public Progress getProgress() {
         return rawIter.getProgress();
       }
       public DataInputBuffer getValue() throws IOException {
         return rawIter.getValue();
       }
       public boolean next() throws IOException {
         boolean ret = rawIter.next();
         reducePhase.set(rawIter.getProgress().get());
         reporter.progress();
         return ret;
       }
     };
     // make a task context so we can get the classes
     /* TaskAttemptContext extends JobContext and adds task-related information. Through taskContext you can get many of
     the classes involved in running the task, such as the user-defined Mapper class, the InputFormat class and so on */
     org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
       new org.apache.hadoop.mapreduce.TaskAttemptContext(job, getTaskID());
     // make a reducer
     //create an instance of the user-defined Reducer class
     org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
       (org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
         ReflectionUtils.newInstance(taskContext.getReducerClass(), job);
 
      org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW =
        new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(reduceOutputCounter,
          job, reporter, taskContext);
     job.setBoolean("mapred.skip.on", isSkipping());
     org.apache.hadoop.mapreduce.Reducer.Context
          reducerContext = createReduceContext(reducer, job, getTaskID(),
                                                rIter, reduceInputKeyCounter,
                                                reduceInputValueCounter,
                                                trackedRW, committer,
                                                reporter, comparator, keyClass,
                                                valueClass);
     reducer.run(reducerContext);
     trackedRW.close(reducerContext);
   }

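(1) The RawKeyValueIterator rIter parameter is actually an org.apache.hadoop.mapred.Merger.MergeQueue. It is first saved in a new reference rawIter, and rIter is then re-pointed at an anonymous RawKeyValueIterator that delegates to rawIter and, on every next(), tracks and reports its progress.

(2) The task context is built and an instance of the user's Reducer class is obtained; then a NewTrackingRecordWriter object trackedRW is created as the output.

(3) rIter, trackedRW and the rest are passed into org.apache.hadoop.mapreduce.Reducer.Context, giving one object that manages both reading and writing. Its parent class ReduceContext implements the input side, i.e. the operations on the iterator. ReduceContext's parent class TaskInputOutputContext implements the output side, and its write method calls trackedRW.write(key, value) directly.

(4) reducer.run(reducerContext) executes the reducer's run method, which is essentially the same as the run method in the previous article; refer to that.

(5) trackedRW.close(reducerContext) closes the output.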
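The wrapping described in (1) is an ordinary decorator. The sketch below shows the same idea on a plain java.util.Iterator; CountingReporter is a hypothetical stand-in for Hadoop's reporter and only counts calls, and withProgress is a name invented for this example.

```java
import java.util.Iterator;

final class ProgressIteratorSketch {
  /** Hypothetical stand-in for the reporter; it only counts progress calls. */
  static final class CountingReporter {
    int calls;
    void progress() { calls++; }
  }

  /** Wraps an iterator so that every next() also reports progress. */
  static <T> Iterator<T> withProgress(final Iterator<T> raw,
                                      final CountingReporter reporter) {
    return new Iterator<T>() {
      @Override
      public boolean hasNext() {
        return raw.hasNext();
      }

      @Override
      public T next() {
        T item = raw.next();  // delegate to the wrapped iterator
        reporter.progress();  // then signal that the task is still alive
        return item;
      }
    };
  }
}
```

The caller keeps using the iterator as before; the reporting happens as a side effect of reading, which is why the reduce loop itself contains no progress calls.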

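1. A word on NewTrackingRecordWriter, the class that manages the output: it is a subclass of mapreduce.RecordWriter and quite similar to NewDirectOutputCollector from the previous article, so it is not covered again here.

2. The input iterator rIter does need explaining. The ability to iterate over the different values of one key is implemented in ReduceContext. Before going into it, let us look at the code of Reducer.run():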

 public void run(Context context) throws IOException, InterruptedException {
     setup(context);
     while (context.nextKey()) {
       reduce(context.getCurrentKey(), context.getValues(), context);
     }
     cleanup(context);
   }

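We only discuss the while loop; the other parts were covered in the previous article and are much the same. The loop condition is that ReduceContext.nextKey() returns true. This method, implemented in ReduceContext, moves on to the next unique key (it has to guarantee the key is new). The input to reduce is grouped, so each call handles one key and all the values of that key; and because the output of every Map Task has been copied over and sorted, KV pairs with the same key are adjacent. Here is nextKey():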

 /** Start processing next unique key. */
   public boolean nextKey() throws IOException,InterruptedException {
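     while (hasMore && nextKeyIsSame) {    //while there is more data and the next KV has the same key, skip forward until the key changes; normally this loop does not run, because the value iterator has already advanced until nextKeyIsSame == false
       nextKeyValue();
     }
     if (hasMore) {    //there is more data
       if (inputKeyCounter != null) {
         inputKeyCounter.increment(1);    //count the new key
       }
       return nextKeyValue();    //advance to the next KV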
     } else {
       return false;
     }
   }

The method above calls nextKeyValue(), which tries to read the next key/value pair. It returns false when there is no more data and true otherwise. The code is:

 public boolean nextKeyValue() throws IOException, InterruptedException {
     if (!hasMore) {
       key = null;
       value = null;
       return false;
     }
     firstValue = !nextKeyIsSame;        //is this the first value of a key? For the first value firstValue == true (nextKeyIsSame was false); for later values of the same key it is false (nextKeyIsSame was true)
     DataInputBuffer next = input.getKey();
     currentRawKey.set(next.getData(), next.getPosition(),
                       next.getLength() - next.getPosition());
     buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
     key = keyDeserializer.deserialize(key);        //deserialize the key
     next = input.getValue();
     buffer.reset(next.getData(), next.getPosition(), next.getLength());
     value = valueDeserializer.deserialize(value);    //deserialize the value
     hasMore = input.next();        //is there more data?
     if (hasMore) {
       next = input.getKey();
       nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                          currentRawKey.getLength(),
                                          next.getData(),
                                          next.getPosition(),
                                          next.getLength() - next.getPosition()
                                          ) == 0;        //does the next KV have the same key as the current one?
     } else {            //no more data
       nextKeyIsSame = false;
     }
     inputValueCounter.increment(1);
     return true;
   }

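Two fields matter here: firstValue says whether this is the first value of the current key, and nextKeyIsSame says whether the next key equals the current key. Both play an important part when iterating over the values. This method reads the key and the value, which can then be obtained through getCurrentKey() and getCurrentValue(). It also reads the next key and compares it with the current one: if they are equal, nextKeyIsSame = true, otherwise nextKeyIsSame = false.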

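Back in run(): the loop condition is now clear, so what about the loop body? Remember the user's reduce method: it takes a key and an iterator over the values of that key, which here are context.getCurrentKey() and context.getValues(). We focus on the latter. context.getValues(), also in ReduceContext, returns an iterable object, ValueIterable, which wraps the iterator ValueIterator. That iterator does the actual iteration over the values. Its full code is: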

 protected class ValueIterator implements Iterator<VALUEIN> {
 
     @Override
     public boolean hasNext() {
       return firstValue || nextKeyIsSame;
     }
 
     @Override
     public VALUEIN next() {
       // if this is the first record, we don't need to advance
       if (firstValue) {
         firstValue = false;
         return value;
       }
       // if this isn't the first record and the next key is different, they
       // can't advance it here.
       if (!nextKeyIsSame) {
         throw new NoSuchElementException("iterate past last value");
       }
       // otherwise, go to the next key/value pair
       try {        //firstValue==false and nextKeyIsSame == true
         nextKeyValue();
         return value;
       } catch (IOException ie) {
         throw new RuntimeException("next value iterator failed", ie);
       } catch (InterruptedException ie) {
         // this is bad, but we can't modify the exception list of java.util
         throw new RuntimeException("next value iterator interrupted", ie);
       }
     }
 
     @Override
     public void remove() {
       throw new UnsupportedOperationException("remove not implemented");
     }
 
   }

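hasNext() decides whether there is another value from the firstValue and nextKeyIsSame fields described above: if either is true, there is a next value. If firstValue is true, the current value has not been handed out yet; if nextKeyIsSame is true, the next record in the input belongs to the same key.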

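next() is where values are read. There are three cases. 1. If firstValue == true, the current value is returned directly. 2. If firstValue == false and nextKeyIsSame == false, the current key has no further values, so the caller is iterating past the last value and a NoSuchElementException is thrown. 3. If firstValue == false and nextKeyIsSame == true, the next KV has the same key and this is not the first value (it may be the Nth), so nextKeyValue() is called to read the next value, which is returned. This is the mechanism by which reduce keeps fetching all the values of one key.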

That clears up the input iterator from item 2 above.
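The interplay of nextKey(), nextKeyValue() and the value iterator can be tried out on a small in-memory model. The sketch below keeps the field names firstValue, nextKeyIsSame and hasMore from the listings, but everything else is an assumption for illustration: keys and values are plain strings held in a sorted list, there is no deserialization or counter, and GroupingSketch and group are invented names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

final class GroupingSketch {
  private final List<String[]> input; // sorted {key, value} pairs
  private int pos = 0;                // index of the next unread pair
  private boolean hasMore;
  private boolean nextKeyIsSame = false;
  private boolean firstValue = false;
  private String key;
  private String value;

  GroupingSketch(List<String[]> sortedPairs) {
    this.input = sortedPairs;
    this.hasMore = !sortedPairs.isEmpty();
  }

  /** Moves to the next unique key, skipping values the caller did not read. */
  boolean nextKey() {
    while (hasMore && nextKeyIsSame) {
      nextKeyValue();
    }
    return hasMore && nextKeyValue();
  }

  /** Reads one pair and peeks at the following key. */
  boolean nextKeyValue() {
    if (!hasMore) {
      return false;
    }
    firstValue = !nextKeyIsSame;
    key = input.get(pos)[0];
    value = input.get(pos)[1];
    pos++;
    hasMore = pos < input.size();
    nextKeyIsSame = hasMore && input.get(pos)[0].equals(key);
    return true;
  }

  String currentKey() { return key; }

  /** Iterator over the values of the current key only. */
  Iterator<String> values() {
    return new Iterator<String>() {
      public boolean hasNext() { return firstValue || nextKeyIsSame; }
      public String next() {
        if (firstValue) {            // case 1: first value, already loaded
          firstValue = false;
          return value;
        }
        if (!nextKeyIsSame) {        // case 2: past the last value
          throw new NoSuchElementException("iterate past last value");
        }
        nextKeyValue();              // case 3: same key, read the next pair
        return value;
      }
    };
  }

  /** Drives the loop the way Reducer.run() does and collects the groups. */
  static Map<String, List<String>> group(List<String[]> sortedPairs) {
    GroupingSketch ctx = new GroupingSketch(sortedPairs);
    Map<String, List<String>> groups = new LinkedHashMap<>();
    while (ctx.nextKey()) {
      List<String> vals = new ArrayList<>();
      for (Iterator<String> it = ctx.values(); it.hasNext();) {
        vals.add(it.next());
      }
      groups.put(ctx.currentKey(), vals);
    }
    return groups;
  }
}
```

Note that nothing is buffered per key: the value iterator reads straight from the sorted input and simply stops when the peeked key differs, which is why the input must be sorted first.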

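B. Next is the ReduceCopier class, which does a great deal of work and is fairly complex.

The key method is ReduceCopier.fetchOutputs(), which copies the output of every map task. It is long, close to 400 lines. The code follows, with some comments added: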

  //ReduceCopier.fetchOutputs() fetches the map results
     public boolean fetchOutputs() throws IOException {
       int totalFailures = 0;
       int            numInFlight = 0, numCopied = 0;
       DecimalFormat  mbpsFormat = new DecimalFormat("0.00");
       final Progress copyPhase =
         reduceTask.getProgress().phase();
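       //(4) Merging at the same time: an in-memory merger thread InMemFSMergeThread and a file merger thread LocalFSMerger work alongside the copy.
       //They merge-sort the downloaded outputs (which may be in memory; loosely called files here) to save time and cut the number of input files,
       //lightening the later sort. InMemFSMergeThread's run loop calls doInMemMerge, which merges with the Merger utility class
       //and, if a combiner is configured, calls combinerRunner.combine.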
       LocalFSMerger localFSMergerThread = null;
       InMemFSMergeThread inMemFSMergeThread = null;
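       //(1) Getting the work list, using the GetMapEventsThread thread.
       //Its run method keeps calling getMapCompletionEvents,
       //which makes an RPC call to getMapCompletionEvents of the TaskUmbilicalProtocol,
       //using the job ID to ask the parent TaskTracker how the map tasks of this job are getting on
       //(the TaskTracker asks the JobTracker and relays the answer). It returns an array TaskCompletionEvent events[].
       //A TaskCompletionEvent carries information such as the task id and the IP address.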
       GetMapEventsThread getMapEventsThread = null;
 
       for (int i = 0; i < numMaps; i++) {
         copyPhase.addPhase();       // add sub-phase per file
       }
 
       copiers = new ArrayList<MapOutputCopier>(numCopiers);
 
       // start all the copying threads
       for (int i=0; i < numCopiers; i++) {
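           //(2) Once the servers that ran the map tasks are known, MapOutputCopier threads do the actual copying.
           //Each runs in its own thread and copies the files from one map task server at a time. MapOutputCopier's run loop calls
           //copyOutput, which calls getMapOutput to copy remotely over HTTP.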
         MapOutputCopier copier = new MapOutputCopier(conf, reporter,
             reduceTask.getJobTokenSecret());
         copiers.add(copier);
         copier.start();
       }
 
       //start the on-disk-merge thread
       localFSMergerThread = new LocalFSMerger((LocalFileSystem)localFileSys);
       //start the in memory merger thread
       inMemFSMergeThread = new InMemFSMergeThread();
       localFSMergerThread.start();
       inMemFSMergeThread.start();
 
       // start the map events thread
       getMapEventsThread = new GetMapEventsThread();
       getMapEventsThread.start();
 
       // start the clock for bandwidth measurement
       long startTime = System.currentTimeMillis();
       long currentTime = startTime;
       long lastProgressTime = startTime;
       long lastOutputTime = 0;
 
         // loop until we get all required outputs
         while (copiedMapOutputs.size() < numMaps && mergeThrowable == null) {
 
           currentTime = System.currentTimeMillis();
           boolean logNow = false;
           if (currentTime - lastOutputTime > MIN_LOG_TIME) {
             lastOutputTime = currentTime;
             logNow = true;
           }
           if (logNow) {
             LOG.info(reduceTask.getTaskID() + " Need another "
                    + (numMaps - copiedMapOutputs.size()) + " map output(s) "
                    + "where " + numInFlight + " is already in progress");
           }
 
           // Put the hash entries for the failed fetches.
           Iterator<MapOutputLocation> locItr = retryFetches.iterator();
 
           while (locItr.hasNext()) {
             MapOutputLocation loc = locItr.next();
             List<MapOutputLocation> locList =
               mapLocations.get(loc.getHost());
 
             // Check if the list exists. Map output location mapping is cleared
             // once the jobtracker restarts and is rebuilt from scratch.
             // Note that map-output-location mapping will be recreated and hence
             // we continue with the hope that we might find some locations
             // from the rebuild map.
             if (locList != null) {
               // Add to the beginning of the list so that this map is
               //tried again before the others and we can hasten the
               //re-execution of this map should there be a problem
               locList.add(0, loc);
             }
           }
 
           if (retryFetches.size() > 0) {
             LOG.info(reduceTask.getTaskID() + ": " +
                   "Got " + retryFetches.size() +
                   " map-outputs from previous failures");
           }
           // clear the "failed" fetches hashmap
           retryFetches.clear();
 
           // now walk through the cache and schedule what we can
           int numScheduled = 0;
           int numDups = 0;
 
           synchronized (scheduledCopies) {
 
             // Randomize the map output locations to prevent
             // all reduce-tasks swamping the same tasktracker
             List<String> hostList = new ArrayList<String>();
             hostList.addAll(mapLocations.keySet()); 
 
             Collections.shuffle(hostList, this.random);//shuffle the hosts to reduce hot spots
 
             Iterator<String> hostsItr = hostList.iterator();
 
             while (hostsItr.hasNext()) {
 
               String host = hostsItr.next();
 
               List<MapOutputLocation> knownOutputsByLoc =
                 mapLocations.get(host);
 
               // Check if the list exists. Map output location mapping is
               // cleared once the jobtracker restarts and is rebuilt from
               // scratch.
               // Note that map-output-location mapping will be recreated and
               // hence we continue with the hope that we might find some
               // locations from the rebuild map and add then for fetching.
               if (knownOutputsByLoc == null || knownOutputsByLoc.size() == 0) {
                 continue;
               }
 
               //Identify duplicate hosts here
               if (uniqueHosts.contains(host)) {
                  numDups += knownOutputsByLoc.size();
                  continue;
               }
 
               Long penaltyEnd = penaltyBox.get(host);
               boolean penalized = false;
 
               if (penaltyEnd != null) {
                 if (currentTime < penaltyEnd.longValue()) {
                   penalized = true;
                 } else {
                   penaltyBox.remove(host);
                 }
               }
 
               if (penalized)
                 continue;
 
               synchronized (knownOutputsByLoc) {
 
                 locItr = knownOutputsByLoc.iterator();
 
                 while (locItr.hasNext()) {
 
                   MapOutputLocation loc = locItr.next();
 
                   // Do not schedule fetches from OBSOLETE maps
                   if (obsoleteMapIds.contains(loc.getTaskAttemptId())) {
                     locItr.remove();
                     continue;
                   }
 
                   uniqueHosts.add(host);
                   scheduledCopies.add(loc);
                   locItr.remove();  // remove from knownOutputs
                   numInFlight++; numScheduled++;
 
                   break; //we have a map from this host
                 }
               }
             }
             scheduledCopies.notifyAll();
           }
 
           if (numScheduled > 0 || logNow) {
             LOG.info(reduceTask.getTaskID() + " Scheduled " + numScheduled +
                    " outputs (" + penaltyBox.size() +
                    " slow hosts and" + numDups + " dup hosts)");
           }
 
           if (penaltyBox.size() > 0 && logNow) {
             LOG.info("Penalized(slow) Hosts: ");
             for (String host : penaltyBox.keySet()) {
               LOG.info(host + " Will be considered after: " +
                   ((penaltyBox.get(host) - currentTime)/1000) + " seconds.");
             }
           }
 
           // if we have no copies in flight and we can't schedule anything
           // new, just wait for a bit
           try {
             if (numInFlight == 0 && numScheduled == 0) {
               // we should indicate progress as we don't want TT to think
               // we're stuck and kill us
               reporter.progress();
               Thread.sleep(5000);
             }
           } catch (InterruptedException e) { } // IGNORE
 
           while (numInFlight > 0 && mergeThrowable == null) {
             LOG.debug(reduceTask.getTaskID() + " numInFlight = " +
                       numInFlight);
             //the call to getCopyResult will either
             //1) return immediately with a null or a valid CopyResult object,
             //                 or
             //2) if the numInFlight is above maxInFlight, return with a
             //   CopyResult object after getting a notification from a
             //   fetcher thread,
             //So, when getCopyResult returns null, we can be sure that
             //we aren't busy enough and we should go and get more mapcompletion
             //events from the tasktracker
             CopyResult cr = getCopyResult(numInFlight);
 
             if (cr == null) {
               break;
             }
 
             if (cr.getSuccess()) {  // a successful copy
               numCopied++;
               lastProgressTime = System.currentTimeMillis();
               reduceShuffleBytes.increment(cr.getSize());
 
               long secsSinceStart =
                 (System.currentTimeMillis()-startTime)/1000+1;
               float mbs = ((float)reduceShuffleBytes.getCounter())/(1024*1024);
               float transferRate = mbs/secsSinceStart;
 
               copyPhase.startNextPhase();
               copyPhase.setStatus("copy (" + numCopied + " of " + numMaps
                                   + " at " +
                                   mbpsFormat.format(transferRate) +  " MB/s)");
 
               // Note successful fetch for this mapId to invalidate
               // (possibly) old fetch-failures
               fetchFailedMaps.remove(cr.getLocation().getTaskId());
             } else if (cr.isObsolete()) {
               //ignore
               LOG.info(reduceTask.getTaskID() +
                        " Ignoring obsolete copy result for Map Task: " +
                        cr.getLocation().getTaskAttemptId() + " from host: " +
                        cr.getHost());
             } else {
               retryFetches.add(cr.getLocation());
 
               // note the failed-fetch
               TaskAttemptID mapTaskId = cr.getLocation().getTaskAttemptId();
               TaskID mapId = cr.getLocation().getTaskId();
 
               totalFailures++;
               Integer noFailedFetches =
                 mapTaskToFailedFetchesMap.get(mapTaskId);
               noFailedFetches =
                 (noFailedFetches == null) ? 1 : (noFailedFetches + 1);
               mapTaskToFailedFetchesMap.put(mapTaskId, noFailedFetches);
               LOG.info("Task " + getTaskID() + ": Failed fetch #" +
                        noFailedFetches + " from " + mapTaskId);
 
               if (noFailedFetches >= abortFailureLimit) {
                 LOG.fatal(noFailedFetches + " failures downloading "
                           + getTaskID() + ".");
                 umbilical.shuffleError(getTaskID(),
                                  "Exceeded the abort failure limit;"
                                  + " bailing-out.", jvmContext);
               }
 
               checkAndInformJobTracker(noFailedFetches, mapTaskId,
                   cr.getError().equals(CopyOutputErrorType.READ_ERROR));
 
               // note unique failed-fetch maps
               if (noFailedFetches == maxFetchFailuresBeforeReporting) {
                 fetchFailedMaps.add(mapId);
 
                 // did we have too many unique failed-fetch maps?
                 // and did we fail on too many fetch attempts?
                 // and did we progress enough
                 //     or did we wait for too long without any progress?
 
                 // check if the reducer is healthy
                 boolean reducerHealthy =
                     (((float)totalFailures / (totalFailures + numCopied))
                      < MAX_ALLOWED_FAILED_FETCH_ATTEMPT_PERCENT);
 
                 // check if the reducer has progressed enough
                 boolean reducerProgressedEnough =
                     (((float)numCopied / numMaps)
                      >= MIN_REQUIRED_PROGRESS_PERCENT);
 
                 // check if the reducer is stalled for a long time
                 // duration for which the reducer is stalled
                 int stallDuration =
                     (int)(System.currentTimeMillis() - lastProgressTime);
                 // duration for which the reducer ran with progress
                 int shuffleProgressDuration =
                     (int)(lastProgressTime - startTime);
                 // min time the reducer should run without getting killed
                 int minShuffleRunDuration =
                     (shuffleProgressDuration > maxMapRuntime)
                     ? shuffleProgressDuration
                     : maxMapRuntime;
                 boolean reducerStalled =
                     (((float)stallDuration / minShuffleRunDuration)
                      >= MAX_ALLOWED_STALL_TIME_PERCENT);
 
                 // kill if not healthy and has insufficient progress
                 if ((fetchFailedMaps.size() >= maxFailedUniqueFetches ||
                      fetchFailedMaps.size() == (numMaps - copiedMapOutputs.size()))
                     && !reducerHealthy
                     && (!reducerProgressedEnough || reducerStalled)) {
                   LOG.fatal("Shuffle failed with too many fetch failures " +
                             "and insufficient progress!" +
                             "Killing task " + getTaskID() + ".");
                   umbilical.shuffleError(getTaskID(),
                                          "Exceeded MAX_FAILED_UNIQUE_FETCHES;"
                                          + " bailing-out.", jvmContext);
                 }
 
               }
 
               currentTime = System.currentTimeMillis();
               long currentBackOff = (long)(INITIAL_PENALTY *
                   Math.pow(PENALTY_GROWTH_RATE, noFailedFetches));
 
               penaltyBox.put(cr.getHost(), currentTime + currentBackOff);
               LOG.warn(reduceTask.getTaskID() + " adding host " +
                        cr.getHost() + " to penalty box, next contact in " +
                        (currentBackOff/1000) + " seconds");
             }
             uniqueHosts.remove(cr.getHost());
             numInFlight--;
           }
         }
 
         // all done, inform the copiers to exit
         exitGetMapEvents= true;
         try {
           getMapEventsThread.join();
           LOG.info("getMapsEventsThread joined.");
         } catch (InterruptedException ie) {
           LOG.info("getMapsEventsThread threw an exception: " +
               StringUtils.stringifyException(ie));
         }
 
         synchronized (copiers) {
           synchronized (scheduledCopies) {
             for (MapOutputCopier copier : copiers) {
               copier.interrupt();
             }
             copiers.clear();
           }
         }
 
         // copiers are done, exit and notify the waiting merge threads
         synchronized (mapOutputFilesOnDisk) {
           exitLocalFSMerge = true;
           mapOutputFilesOnDisk.notify();
         }
 
         ramManager.close();
 
         //Do a merge of in-memory files (if there are any)
         if (mergeThrowable == null) {
           try {
             // Wait for the on-disk merge to complete
             localFSMergerThread.join();
             LOG.info("Interleaved on-disk merge complete: " +
                      mapOutputFilesOnDisk.size() + " files left.");
 
             //wait for an ongoing merge (if it is in flight) to complete
             inMemFSMergeThread.join();
             LOG.info("In-memory merge complete: " +
                      mapOutputsFilesInMemory.size() + " files left.");
             } catch (InterruptedException ie) {
             LOG.warn(reduceTask.getTaskID() +
                      " Final merge of the inmemory files threw an exception: " +
                      StringUtils.stringifyException(ie));
             // check if the last merge generated an error
             if (mergeThrowable != null) {
               mergeThrowable = ie;
             }
             return false;
           }
         }
         return mergeThrowable == null && copiedMapOutputs.size() == numMaps;
     }

The method creates several threads: one LocalFSMerger thread, one InMemFSMergeThread thread, one GetMapEventsThread thread, and several MapOutputCopier threads (the number is set by "mapred.reduce.parallel.copies", default 5).

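(1) First, several MapOutputCopier threads are created, started and added to the copiers list. The run method of this thread is an endless loop that watches the scheduledCopies list, which holds the map outputs scheduled for copying. As soon as scheduledCopies contains a MapOutputLocation, the thread takes the first one and calls copyOutput(loc) to copy that map's output remotely over HTTP. copyOutput(loc) first checks whether the MapOutputLocation is in copiedMapOutputs or obsoleteMapIds; such outputs must not be copied, and the method returns -2 immediately. It then calls getMapOutput(MapOutputLocation mapOutputLoc, Path filename, int reduce) to connect to the remote TaskTracker and obtain an input stream. After a series of checks it tests whether the in-memory file system can hold this map output: if so, shuffleInMemory puts it in memory; otherwise shuffleToDisk writes it to disk. (shuffleInMemory waits until enough memory has been freed, closes the input stream and opens it again, allocates space in memory, copies the map data into that space, wraps it in a MapOutput and returns the MapOutput. shuffleToDisk first finds a suitable local location for the map output, constructs a MapOutput object, streams the input into the file behind the output stream, wraps the file in the MapOutput and returns it.) Back in copyOutput, the returned MapOutput is checked again. If it is in memory, mapOutputsFilesInMemory.add(mapOutput) is called; otherwise the file is on local disk, so it is renamed and its FileStatus is added to mapOutputFilesOnDisk. The finish method in the finally block of run puts the result for the copied MapOutputLocation into copyResults.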
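The hand-off between the scheduler and the copier threads is a classic wait/notifyAll queue. The following is a minimal sketch under heavy assumptions: locations are plain strings, the HTTP fetch is replaced by building a string, and CopierQueueSketch, schedule, takeFirst and copyAll are names made up for this example. Only scheduledCopies, copyResults and the interrupt-to-stop convention come from the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CopierQueueSketch {
  private final List<String> scheduledCopies = new ArrayList<>();
  private final List<String> copyResults =
      Collections.synchronizedList(new ArrayList<String>());

  /** Scheduler side: queue a location and wake the waiting copier threads. */
  void schedule(String location) {
    synchronized (scheduledCopies) {
      scheduledCopies.add(location);
      scheduledCopies.notifyAll();
    }
  }

  /** Copier side: block until something is scheduled, then take the first entry. */
  private String takeFirst() throws InterruptedException {
    synchronized (scheduledCopies) {
      while (scheduledCopies.isEmpty()) {
        scheduledCopies.wait();
      }
      return scheduledCopies.remove(0);
    }
  }

  private Thread startCopier() {
    Thread t = new Thread(() -> {
      try {
        while (true) {
          String loc = takeFirst();
          copyResults.add("copied:" + loc); // stand-in for the HTTP fetch
        }
      } catch (InterruptedException e) {
        // interrupted by the scheduler once all outputs are in: exit the loop
      }
    });
    t.start();
    return t;
  }

  /** Runs numCopiers threads over the locations and returns the sorted results. */
  static List<String> copyAll(List<String> locations, int numCopiers) {
    CopierQueueSketch s = new CopierQueueSketch();
    List<Thread> copiers = new ArrayList<>();
    for (int i = 0; i < numCopiers; i++) {
      copiers.add(s.startCopier());
    }
    for (String loc : locations) {
      s.schedule(loc);
    }
    try {
      while (s.copyResults.size() < locations.size()) {
        Thread.sleep(10); // poll until every location has a result
      }
      for (Thread t : copiers) {
        t.interrupt();    // same shutdown signal the text describes
      }
      for (Thread t : copiers) {
        t.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    List<String> sorted = new ArrayList<>(s.copyResults);
    Collections.sort(sorted);
    return sorted;
  }
}
```

The results are sorted before being returned only because the order in which the threads finish is not deterministic.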

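(2) A LocalFSMerger object is created and its thread started. While exitLocalFSMerge == false, its run method waits until the number of local files is >= (2 * ioSortFactor - 1), which triggers a merge of local files. ioSortFactor is the "io.sort.factor" parameter, default 10. The thread then picks the ioSortFactor smallest files (10 by default) from mapOutputFilesOnDisk (a SortedSet) into mapFiles, merge-sorts them with Merger.merge, writes the result to the file given by writer, and adds the new file to mapOutputFilesOnDisk. The combiner is not called here, even if one is configured.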
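The trigger and the choice of files can be sketched on file sizes alone. This is an illustration, not the LocalFSMerger code: DiskMergeSketch is an invented name, files are represented only by their sizes, and the size of the merged file is assumed to be the sum of its inputs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class DiskMergeSketch {
  /** A merge is triggered once there are at least 2 * ioSortFactor - 1 files. */
  static boolean shouldMerge(int numFiles, int ioSortFactor) {
    return numFiles >= 2 * ioSortFactor - 1;
  }

  /**
   * One merge round over file sizes: the ioSortFactor smallest files are
   * replaced by a single file whose size is assumed to be their sum.
   */
  static List<Long> mergeOnce(List<Long> fileSizes, int ioSortFactor) {
    List<Long> sorted = new ArrayList<>(fileSizes);
    Collections.sort(sorted);
    if (!shouldMerge(sorted.size(), ioSortFactor)) {
      return sorted; // below the trigger: nothing to do yet
    }
    long mergedSize = 0;
    for (int i = 0; i < ioSortFactor; i++) {
      mergedSize += sorted.get(i);
    }
    List<Long> result =
        new ArrayList<>(sorted.subList(ioSortFactor, sorted.size()));
    result.add(mergedSize);
    Collections.sort(result);
    return result;
  }
}
```

With the default factor of 10, the merge starts at 19 files and leaves 10 behind (9 untouched plus the merged one), so the thread never lets the file count grow far beyond 2 * ioSortFactor.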

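　　(3) An InMemFSMergeThread object is constructed and its thread started. Its run method loops, checking whether the in-memory files can be merged via exit = ramManager.waitForDataToMerge(). An in-memory merge is triggered when any one of the following holds: 1. the data copy has finished and ShuffleRamManager has been closed; 2. the memory used in ShuffleRamManager exceeds the fraction "mapred.job.shuffle.merge.percent" (default 0.66) of the available memory and at least two files are in memory; 3. the number of in-memory files exceeds "mapred.inmem.merge.threshold" (default 1000); 4. the number of requests blocked on ShuffleRamManager exceeds 0.75 of the number of copier threads "mapred.reduce.parallel.copies". When a condition is met, doInMemMerge() is called to perform the merge. It uses the utility class Merger; if a combiner is configured, the sorted data is aggregated through combinerRunner.combine before being written to the local file designated by writer. One point to note: run uses a do-while loop whose condition is (!exit), so it keeps running while exit==false, and waitForDataToMerge shows that true is returned only after ramManager has been closed.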
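
　　The four trigger conditions can be written as a single predicate. The sketch below is an illustration under stated assumptions, not the ShuffleRamManager code: the class InMemMergeTrigger, its parameter names, and the exact boundary comparisons (>= rather than >) are hypothetical choices made for this example.

```java
// Hypothetical sketch of the four in-memory merge triggers; not the Hadoop class.
class InMemMergeTrigger {

    // Share of copier threads that may be blocked before a merge is forced.
    static final float STALLED_FRACTION = 0.75f;

    static boolean shouldMerge(boolean closed,
                               float percentUsed, float mergePercent,
                               int filesInMemory, int inMemThreshold,
                               int blockedRequests, int numCopiers) {
        // Condition 2: memory usage over the threshold with at least two files.
        boolean memoryPressure = percentUsed >= mergePercent && filesInMemory >= 2;
        // Condition 3: too many files held in memory.
        boolean tooManyFiles = inMemThreshold > 0 && filesInMemory >= inMemThreshold;
        // Condition 4: too many copier threads blocked waiting for memory.
        boolean copiersStalled = blockedRequests >= numCopiers * STALLED_FRACTION;
        // Condition 1 (closed manager) or any of the others triggers a merge.
        return closed || memoryPressure || tooManyFiles || copiersStalled;
    }

    public static void main(String[] args) {
        // Defaults from the text: 0.66 merge percent, 1000 files, 5 copiers.
        System.out.println(shouldMerge(false, 0.50f, 0.66f, 3, 1000, 0, 5));
        System.out.println(shouldMerge(false, 0.70f, 0.66f, 2, 1000, 0, 5));
        System.out.println(shouldMerge(true, 0.00f, 0.66f, 0, 1000, 0, 5));
    }
}
```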

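　　(4) A GetMapEventsThread object is constructed and its thread started. Its run method calls getMapCompletionEvents() once every second until exitGetMapEvents==true (it is set to true in fetchOutputs()). That method talks to the TaskTracker and calls TaskTracker.getMapCompletionEvents to obtain the list of completed map tasks. The rule is as follows. First, check whether shouldReset contains the ID of the current reduce task; if it does, the shuffle in progress has to be rolled back, so a MapTaskCompletionEventsUpdate marked for reset is returned. If shouldReset does not contain it, look up in runningJobs the FetchStatus of the job that owns the current reduce task, get the newly completed map tasks with FetchStatus.getMapEvents(fromEventId, maxLocs), which reads the required completed maps from allMapEvents, wrap that list in a MapTaskCompletionEventsUpdate, and return it. Where does the data in allMapEvents come from? The TaskTracker has a MapEventsFetcherThread thread whose run method periodically goes through all jobs in runningJobs and, for each job that has a reduce task in the SHUFFLE phase, collects the job's FetchStatus. For each FetchStatus it calls fetchMapCompletionEvents(currentTime), which calls queryJobTracker(fromEventId, jobId, jobClient) to talk to the JobTracker. Through JobTracker.getTaskCompletionEvents, the matching TaskCompletionEvent objects are read from taskCompletionEvents in JobInProgress, and those that belong to map tasks are used to update allMapEvents.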
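
　　The incremental lookup by fromEventId and maxLocs can be illustrated with a small cursor over an append-only list. This is only a sketch under assumptions: the class MapEventCursorSketch and the use of plain strings for events are hypothetical, and only the parameter names fromEventId and maxLocs come from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: allMapEvents is modelled as an append-only list of strings.
class MapEventCursorSketch {

    private final List<String> allMapEvents = new ArrayList<>();

    // Called when a newly completed map task becomes known.
    void addEvent(String event) {
        allMapEvents.add(event);
    }

    // Return at most maxLocs events, starting at index fromEventId.
    List<String> getMapEvents(int fromEventId, int maxLocs) {
        if (fromEventId >= allMapEvents.size()) {
            return new ArrayList<>(); // nothing new since the last call
        }
        int end = Math.min(fromEventId + maxLocs, allMapEvents.size());
        return new ArrayList<>(allMapEvents.subList(fromEventId, end));
    }

    public static void main(String[] args) {
        MapEventCursorSketch status = new MapEventCursorSketch();
        for (int i = 0; i < 5; i++) {
            status.addEvent("map_" + i);
        }
        int fromEventId = 0;
        List<String> batch;
        // The caller advances its cursor by the number of events it received.
        while (!(batch = status.getMapEvents(fromEventId, 3)).isEmpty()) {
            System.out.println("fetched " + batch);
            fromEventId += batch.size();
        }
    }
}
```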

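　　After getMapCompletionEvents() obtains the MapTaskCompletionEventsUpdate, it puts the completed maps into TaskCompletionEvent events[]. If the update is a reset, fromEventId, obsoleteMapIds, and mapLocations are reset. fromEventId is then updated to record the latest completed-map event already fetched, so later fetches return only the events after it. Next, every TaskCompletionEvent in events is handled according to its status: SUCCEEDED events go into mapLocations (which maps each TaskTracker host to its list of completed tasks), so their map output can be fetched; OBSOLETE/FAILED/KILLED events go into obsoleteMapIds, meaning that fetching from these maps stops; TIPFAILED events go into copiedMapOutputs, meaning that no data needs to be fetched from these maps. Finally, the number of entries newly added to mapLocations is returned.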
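
　　The status handling amounts to a switch over the event status. The sketch below uses simplified stand-in types: the classes MapEventDispatchSketch and Event, the string identifiers, and the Status enum are assumptions for illustration and not the Hadoop types. The set for obsolete attempts is named obsoleteMapIds, the same name the text uses where the reset is described.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of routing completion events by status.
class MapEventDispatchSketch {

    enum Status { SUCCEEDED, FAILED, KILLED, OBSOLETE, TIPFAILED }

    // Simplified stand-in for a task completion event.
    static final class Event {
        final String attemptId;
        final String taskId;
        final String host;
        final Status status;

        Event(String attemptId, String taskId, String host, Status status) {
            this.attemptId = attemptId;
            this.taskId = taskId;
            this.host = host;
            this.status = status;
        }
    }

    final Map<String, List<String>> mapLocations = new HashMap<>(); // host -> attempts
    final Set<String> obsoleteMapIds = new HashSet<>();
    final Set<String> copiedMapOutputs = new HashSet<>();

    // Route each event and return how many new locations were added.
    int dispatch(List<Event> events) {
        int numNewMaps = 0;
        for (Event e : events) {
            switch (e.status) {
                case SUCCEEDED: // output is available on e.host
                    mapLocations.computeIfAbsent(e.host, h -> new ArrayList<>())
                                .add(e.attemptId);
                    numNewMaps++;
                    break;
                case FAILED:
                case KILLED:
                case OBSOLETE:  // stop fetching from this attempt
                    obsoleteMapIds.add(e.attemptId);
                    break;
                case TIPFAILED: // whole task failed: treat as nothing left to fetch
                    copiedMapOutputs.add(e.taskId);
                    break;
            }
        }
        return numNewMaps;
    }

    public static void main(String[] args) {
        MapEventDispatchSketch sketch = new MapEventDispatchSketch();
        List<Event> events = new ArrayList<>();
        events.add(new Event("attempt_m_0_0", "task_m_0", "hostA", Status.SUCCEEDED));
        events.add(new Event("attempt_m_1_0", "task_m_1", "hostB", Status.FAILED));
        events.add(new Event("attempt_m_2_0", "task_m_2", "hostA", Status.TIPFAILED));
        System.out.println("new locations: " + sketch.dispatch(events));
    }
}
```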

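　　After these threads are started in fetchOutputs(), they still cannot do any work: suitable MapOutputLocation entries from mapLocations must first be put into scheduledCopies, which wakes up the MapOutputCopier threads to copy. The copy results are then handled as follows: A. successful copies are removed from fetchFailedMaps; B. Obsolete results are ignored; C. all other failures are added to retryFetches, and the failure count of the corresponding mapTaskId is incremented and stored in mapTaskToFailedFetchesMap, the structure that holds each mapTaskId together with its failure count.

　　Failures are covered by three fault-tolerance mechanisms. Mechanism 1: once the number of copy failures exceeds the limit (Math.max(30, numMaps / 10)), the Reduce Task is killed (and waits for the scheduler to run it again). Mechanism 2: once the number of copy failures is >= maxFetchFailuresBeforeReporting (set by the parameter "mapreduce.reduce.shuffle.maxfetchfailures", default 10), the map is added to fetchFailedMaps, and the reduce task is killed if all of the following hold: 1. the node running the reducer is unhealthy; 2. the size of fetchFailedMaps exceeds the limit (default 5), or equals the number of maps the reducer needs minus the size of copiedMapOutputs; 3. the reducer has not made enough progress, or it has stalled past the timeout. Mechanism 3: if neither of the first two applies, the copy of the corresponding map output is retried after a delay that grows exponentially with the number of failures (an exponential backoff): the delay is 10000*Math.pow(1.3, noFailedFetches), and the entry is placed in penaltyBox as a penalty.

　　Finally, once the copy phase is complete, some cleanup is done: ramManager is closed, which makes the InMemFSMergeThread thread finish and exit; exitGetMapEvents=true makes GetMapEventsThread finish and exit; exitLocalFSMerge=true makes the LocalFSMerger thread finish and exit; every MapOutputCopier thread in copiers is interrupted in turn, and the list is emptied with copiers.clear().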
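
　　To get a feel for the two formulas quoted above, the abort limit and the retry delay can be computed directly. The class FetchPenaltySketch and its constant and method names are hypothetical; only the expressions Math.max(30, numMaps / 10) and 10000*Math.pow(1.3, noFailedFetches) come from the text, and treating the delay as milliseconds is an assumption.

```java
// Hypothetical sketch of the abort limit and the retry delay formulas.
class FetchPenaltySketch {

    static final long INITIAL_PENALTY = 10000L; // base delay, assumed to be milliseconds
    static final double GROWTH_RATE = 1.3;      // multiplier applied per failed fetch

    // Number of failed fetches after which the reduce task gives up.
    static int abortFailureLimit(int numMaps) {
        return Math.max(30, numMaps / 10);
    }

    // Delay before the next retry; it grows exponentially with the failure count.
    static long penaltyDelay(int noFailedFetches) {
        return (long) (INITIAL_PENALTY * Math.pow(GROWTH_RATE, noFailedFetches));
    }

    public static void main(String[] args) {
        System.out.println("abort limit for 100 maps:  " + abortFailureLimit(100));
        System.out.println("abort limit for 1000 maps: " + abortFailureLimit(1000));
        for (int failures = 1; failures <= 5; failures++) {
            System.out.println(failures + " failed fetches -> wait "
                    + penaltyDelay(failures) + " before retrying");
        }
    }
}
```

　　Because the growth rate is only 1.3, the delay rises slowly at first: it takes three failures to roughly double the base delay, so a briefly unavailable host is retried fairly soon, while a persistently failing one is contacted less and less often.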

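　　This concludes the walkthrough of the reduce task. Much of the overall MapReduce process has now been covered, and the general flow should be clear. Many topics remain untouched, such as the recovery mechanism, fault tolerance, speculative execution of tasks, quicksort and merge sort, and the file stream handling, including file names and locations. These will be studied in later posts.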

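　　Reference: 1. Dong Xicheng (董西成), 《Hadoop技术内幕：深入理解MapReduce架构设计与实现原理》 (Hadoop Internals: An In-Depth Look at the Design and Implementation of the MapReduce Architecture)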
