MapReduce Source Code Walkthrough

Mapper

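First, the mapper performs the mapping: each word is mapped to the form (word, 1).

In a MapReduce job, the map phase is carried out by a MapTask. Through its run() method, MapTask eventually invokes the map() method that the user overrides in their Mapper.

A distributed computation usually has to be divided into at least two phases: the Map phase and the Reduce phase.

In the first phase, the concurrent MapTask instances run fully in parallel and independently of one another. Suppose, for example, that each of the input files is divided into 3 pieces that are stored on different data nodes in the cluster. A MapTask process handles the data assigned to it: it reads the data line by line, splits each line into words on spaces, and stores the words in a hash map, with the word as the key and 1 as the value. Once it has read all of the data assigned to it, it divides that hash map into two hash maps by the range of each word's first letter (partitioning and sorting): HashMap (a-p) and HashMap (q-z).

The concurrent ReduceTask instances of the second phase are likewise independent of one another, but their input depends on the output of all of the MapTask instances of the previous phase.

One ReduceTask totals the words that start with a-p and the other totals the words that start with q-z; each then writes its result to a file.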
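The two phases described above can be sketched in plain Java, without Hadoop. This is only an illustration: the class and method names (FirstLetterPartitionSketch, partitionFor, mapTask, reduceTask) are hypothetical, and the sketch assumes each map task keeps a running local count per word before handing its two partitions to the reduce tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FirstLetterPartitionSketch {

    // Partition 0 holds words starting with a-p; partition 1 holds everything else (q-z).
    static int partitionFor(String word) {
        char first = Character.toLowerCase(word.charAt(0));
        return (first >= 'a' && first <= 'p') ? 0 : 1;
    }

    // Simulates one map task: split each line on whitespace, count the words,
    // and keep each count in the partition chosen by the word's first letter.
    static List<Map<String, Integer>> mapTask(List<String> lines) {
        List<Map<String, Integer>> partitions = new ArrayList<>();
        partitions.add(new TreeMap<>());
        partitions.add(new TreeMap<>());
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // blank line
                }
                partitions.get(partitionFor(word)).merge(word, 1, Integer::sum);
            }
        }
        return partitions;
    }

    // Simulates one reduce task: merge the same partition from every map task.
    static Map<String, Integer> reduceTask(List<List<Map<String, Integer>>> mapOutputs,
                                           int partition) {
        Map<String, Integer> total = new TreeMap<>();
        for (List<Map<String, Integer>> output : mapOutputs) {
            output.get(partition).forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Map<String, Integer>>> outputs = new ArrayList<>();
        outputs.add(mapTask(List.of("apple zebra apple")));
        outputs.add(mapTask(List.of("quick apple", "zebra pear")));
        System.out.println("a-p: " + reduceTask(outputs, 0));
        System.out.println("q-z: " + reduceTask(outputs, 1));
    }
}
```

Each reduce task only ever reads its own partition, which is why the reduce tasks do not depend on each other but do depend on every map task having finished.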

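Note: the MapReduce programming model allows only one map phase and one reduce phase per job. If the business logic is very complex, the only option is to write several MapReduce programs and run them serially.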

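So how is work assigned to the MapTasks?

How is work assigned to the ReduceTasks?

How are the MapTasks and ReduceTasks connected to each other?

What happens if a MapTask fails?

Having every MapTask take care of partitioning its own output by itself would also be cumbersome.

These concerns are handled by a coordinator, MRAppMaster, so a running MapReduce program consists of three kinds of processes:

1）MRAppMaster: schedules the whole program and coordinates its state

2）MapTask: handles the entire data-processing flow of the map phase

3）ReduceTask: handles the entire data-processing flow of the reduce phase

The classes that the user writes fall into three corresponding parts: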

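1）Driver phase

The whole program needs a Driver to submit it; what is submitted is a job object that describes all of the necessary information

2）Mapper phase

（1）The user-defined Mapper must extend the Mapper parent class

（2）The Mapper's input is in the form of KV pairs (the KV types can be customized)

（3）The Mapper's business logic is written in the map() method

（4）The Mapper's output is in the form of KV pairs (the KV types can be customized)

（5）The MapTask process calls map() once for each <K,V> pair

3）Reducer phase

（1）The user-defined Reducer must extend the Reducer parent class

（2）The Reducer's input types match the Mapper's output types, so they are also KV pairs

（3）The Reducer's business logic is written in the reduce() method

（4）The ReduceTask process calls reduce() once for each group of <k,v> pairs that share the same k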
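The calling convention in the list above (map() once per input pair, reduce() once per key group) can be illustrated with a small plain-Java model. It does not use the Hadoop API; MapReduceCallSketch and its map, reduce and run methods are hypothetical stand-ins, and the grouping step is a simplified substitute for the real shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapReduceCallSketch {
    static int mapCalls = 0;
    static int reduceCalls = 0;

    // Stand-in for the user's map(): called once per input <offset, line> pair,
    // emits one <word, 1> pair for every word on the line.
    static void map(long offset, String line, List<Map.Entry<String, Integer>> out) {
        mapCalls++;
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
    }

    // Stand-in for the user's reduce(): called once per key with all values for that key.
    static int reduce(String key, List<Integer> values) {
        reduceCalls++;
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        mapCalls = 0;
        reduceCalls = 0;
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            map(offset, line, mapped);
            offset += line.length() + 1; // +1 for the line terminator
        }
        // Simplified shuffle: group all emitted values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c")));
        System.out.println("map calls: " + mapCalls + ", reduce calls: " + reduceCalls);
    }
}
```

With two input lines and three distinct words, map() runs twice and reduce() runs three times, which matches points （5） and （4） of the two lists.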

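How input splits determine MapTask parallelism

1. The parallelism of a job's Map phase is determined by the number of splits the client computes when it submits the job

2. Each split is assigned to one MapTask instance

3. By default, split size = BlockSize

4. Splitting does not consider the data set as a whole; each file is split separately, one by one.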
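A small sketch may make rule 4 concrete. It assumes the default rule from point 3 (split size equals block size) and deliberately leaves out refinements of the real InputFormat such as configurable minimum and maximum split sizes; SplitPlanningSketch, Split and planSplits are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SplitPlanningSketch {

    // One split covers `length` bytes of `file`, starting at byte `start`.
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return file + ":" + start + "+" + length;
        }
    }

    // Splits every file on its own; a split never spans two files.
    static List<Split> planSplits(Map<String, Long> fileSizes, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (Map.Entry<String, Long> file : fileSizes.entrySet()) {
            long remaining = file.getValue();
            long start = 0;
            while (remaining > 0) {
                long length = Math.min(splitSize, remaining);
                splits.add(new Split(file.getKey(), start, length));
                start += length;
                remaining -= length;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("a.txt", 300 * mb);
        files.put("b.txt", 50 * mb);
        List<Split> splits = planSplits(files, 128 * mb);
        splits.forEach(System.out::println);
        System.out.println("map tasks: " + splits.size());
    }
}
```

For a 300 MB file and a 50 MB file with a 128 MB block size, this yields four splits (three for the first file, one for the second) and therefore four MapTasks, whereas treating the 350 MB as one data set would give only three.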

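Analysis of the job submission process

The overall flow is:

1）Submit the job.

2）Check the state: if (state == JobState.DEFINE) {submit();}

3）Inside submit():

（1）Make sure the job is in the right state: ensureState(JobState.DEFINE)

（2）Use the new API: setUseNewAPI()

（3）Connect to the cluster: connect()

（4）Obtain a job submitter from the cluster that was returned: final JobSubmitter submitter = getJobSubmitter(cluster.getFileSystem(), cluster.getClient())

（5）Let the submitter submit the job: return submitter.submitJobInternal(Job.this, cluster)

4）Inside submitJobInternal():

（1）Check whether the output path is set and whether it already exists: checkSpecs(job)

（2）Register a JobId: JobID jobId = submitClient.getNewJobID();

（3）Copy a file into the submit directory; this file is the jar we set earlier with setJarByClass(xxx.class)

（4）Call the InputFormat on the client side to split the input files and write the split information into the submit directory (submitJobDir)

（5）Record the number of splits: conf.setInt(MRJobConfig.NUM_MAPS, maps);

（6）Write the job's configuration conf into the submit directory as well

（7）Formally submit the job.

First we call: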

boolean b = job.waitForCompletion(true);

The body of that method:

public boolean waitForCompletion(boolean verbose
                                   ) throws IOException, InterruptedException,
                                            ClassNotFoundException {
    if (state == JobState.DEFINE) {
      submit();
    }
    if (verbose) {
      monitorAndPrintJob();
    } else {
      // get the completion poll interval from the client.
      int completionPollIntervalMillis =
        Job.getCompletionPollInterval(cluster.getConf());
      while (!isComplete()) {
        try {
          Thread.sleep(completionPollIntervalMillis);
        } catch (InterruptedException ie) {
        }
      }
    }
    return isSuccessful();
  }

It then calls the submit() method:

/**
   * Submit the job to the cluster and return immediately.
   * @throws IOException
   */
  public void submit()
         throws IOException, InterruptedException, ClassNotFoundException {
    ensureState(JobState.DEFINE);
    setUseNewAPI();
    connect();
    final JobSubmitter submitter =
        getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
      public JobStatus run() throws IOException, InterruptedException,
      ClassNotFoundException {
        return submitter.submitJobInternal(Job.this, cluster);
      }
    });
    state = JobState.RUNNING;
    LOG.info("The url to track the job: " + getTrackingURL());
   }

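In submit(), ensureState(JobState.DEFINE) confirms once more that the job is in the DEFINE state. If it is, setUseNewAPI() settles which API the job uses (Hadoop 2.x upgraded many APIs, and this is the point at which programs written against the old API are accounted for), and connect() is then called to establish the connection to the cluster, for example a YARN cluster.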
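The state handling can be modelled in a few lines. This is a minimal sketch, not Hadoop's Job class: JobStateSketch and its submissions counter are hypothetical, only the two states that appear in the listings are modelled, and the exception type is an assumption.

```java
class JobStateSketch {
    enum JobState { DEFINE, RUNNING }

    private JobState state = JobState.DEFINE;
    int submissions = 0; // counts how often the job was really submitted

    // Fails fast if the job is not in the expected state.
    void ensureState(JobState expected) {
        if (state != expected) {
            throw new IllegalStateException(
                "Job in state " + state + " instead of " + expected);
        }
    }

    void submit() {
        ensureState(JobState.DEFINE); // a job may only be submitted once
        submissions++;
        state = JobState.RUNNING;
    }

    // Mirrors the guard at the top of waitForCompletion(): submit only if still DEFINE.
    void waitForCompletion() {
        if (state == JobState.DEFINE) {
            submit();
        }
    }

    public static void main(String[] args) {
        JobStateSketch job = new JobStateSketch();
        job.waitForCompletion();
        job.waitForCompletion(); // second call does not submit again
        System.out.println("submissions: " + job.submissions);
        try {
            job.submit(); // direct re-submission is rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The check in waitForCompletion() and the check in submit() overlap on purpose: the first makes repeated waits harmless, the second protects against callers that invoke submit() directly.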

The connect() method looks like this:

 private synchronized void connect()
          throws IOException, InterruptedException, ClassNotFoundException {
    if (cluster == null) {
      cluster =
        ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
                   public Cluster run()
                          throws IOException, InterruptedException,
                                 ClassNotFoundException {
                     return new Cluster(getConfiguration());
                   }
                 });
    }
  }

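connect() first checks cluster. If it is null, a new Cluster is created with return new Cluster(getConfiguration());. If the job is configured for local mode, the client behind this Cluster is a LocalJobRunner; if it is configured for a YARN cluster, it is a YARNRunner.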
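The lazy connection and the local-versus-YARN choice can be sketched as follows. This is a simplified model rather than Hadoop code: ClusterConnectSketch and its ClientProtocol interface are hypothetical, a plain Map stands in for the Configuration object, and the sketch assumes the choice is driven by the mapreduce.framework.name property with local as the fallback.

```java
import java.util.HashMap;
import java.util.Map;

class ClusterConnectSketch {

    // Hypothetical stand-in for the client that talks to the execution framework.
    interface ClientProtocol {
        String name();
    }

    private final Map<String, String> conf;
    private ClientProtocol client; // stays null until the first connect()
    int created = 0;               // how many clients were created

    ClusterConnectSketch(Map<String, String> conf) {
        this.conf = new HashMap<>(conf);
    }

    // Creates the client on first use only, like the null check in connect().
    synchronized ClientProtocol connect() {
        if (client == null) {
            String framework = conf.getOrDefault("mapreduce.framework.name", "local");
            final String runner = "yarn".equals(framework) ? "YARNRunner" : "LocalJobRunner";
            client = () -> runner;
            created++;
        }
        return client;
    }

    public static void main(String[] args) {
        ClusterConnectSketch yarnJob =
            new ClusterConnectSketch(Map.of("mapreduce.framework.name", "yarn"));
        System.out.println(yarnJob.connect().name());

        ClusterConnectSketch localJob = new ClusterConnectSketch(Map.of());
        System.out.println(localJob.connect().name());
    }
}
```

Because the method is synchronized and guarded by the null check, calling it repeatedly returns the same client instead of reconnecting.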

Once connected to the cluster, we can submit the job we want to run. This statement is executed:

final JobSubmitter submitter = getJobSubmitter(cluster.getFileSystem(), cluster.getClient());

It obtains a submitter from the file system and the client protocol of the cluster that was returned. Then the statement

return submitter.submitJobInternal(Job.this, cluster);

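uses the submitter to actually submit our job. submitJobInternal(Job.this, cluster) is the method that really submits the job; its full source is listed at the end of the article.

Inside it, checkSpecs(job) first checks whether the output path has been set and whether it already exists. Then this statement is executed:

Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf); obtains the stagingDir for the cluster we configured, that is, the temporary folder that holds the data produced during the MapReduce process.

Next, JobID jobId = submitClient.getNewJobID(); registers our job and returns its JobId, and job.setJobID(jobId); stores that id on the job. copyAndConfigureFiles(job, submitJobDir); then copies a file into the directory: the jar we set earlier with setJarByClass(xxx.class). After that, int maps = writeSplits(job, submitJobDir); is where the client calls the InputFormat to split our input files and write the split information into the directory (submitJobDir); conf.setInt(MRJobConfig.NUM_MAPS, maps); then records the number of splits as the number of map tasks. The split information marks which part of the data each split covers and so acts as an index. With this index, the MapReduce AppMaster knows how many MapTasks to start and which part of the data each one processes, which is why this information also has to be submitted to the temporary directory on HDFS.

Then writeConf(conf, submitJobFile); writes our job's configuration conf into the submit directory as well. After it has run, the temporary directory contains job.xml, which stores the job's configuration, including both the manually configured and the default cluster settings.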

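Up to this point, all of this work has been preparation for the cluster. A temporary directory is created that every data node in the cluster can access. The jar is placed in it first, then the split information, and finally the configuration file. The MapTasks can then read these three pieces of information from the temporary folder and carry out their own work.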
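To make the three staged artifacts tangible, the sketch below writes placeholder versions of them into a local temporary directory instead of HDFS. Everything except the file name job.xml is an assumption made for illustration: the StagingDirSketch class, the job id job_0001, the names job.jar and job.split, and the file contents.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StagingDirSketch {

    // Creates a throwaway local directory that plays the role of the staging area.
    static Path newStagingArea() {
        try {
            return Files.createTempDirectory("staging");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes placeholder jar, split info and configuration into <stagingArea>/<jobId>.
    static Path stage(Path stagingArea, String jobId, int numSplits) {
        try {
            Path submitJobDir = Files.createDirectories(stagingArea.resolve(jobId));
            Files.write(submitJobDir.resolve("job.jar"), new byte[0]);
            Files.write(submitJobDir.resolve("job.split"),
                ("splits=" + numSplits).getBytes(StandardCharsets.UTF_8));
            String xml = "<configuration><property><name>mapreduce.job.maps</name><value>"
                + numSplits + "</value></property></configuration>";
            Files.write(submitJobDir.resolve("job.xml"), xml.getBytes(StandardCharsets.UTF_8));
            return submitJobDir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lists the file names in a directory in sorted order.
    static List<String> list(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.map(p -> p.getFileName().toString())
                          .sorted()
                          .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path submitJobDir = stage(newStagingArea(), "job_0001", 2);
        System.out.println(submitJobDir.getFileName() + " contains " + list(submitJobDir));
    }
}
```

The point of the layout is that a task only needs the job id and the staging area to find everything it requires.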

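Line 240 of the program then performs the actual submission of the job (the submitClient.submitJob(...) call), and the job starts running.

At this point the submission code has finished executing.

Summary

Job submission proceeds as follows: the job's state is checked, the client connects to the cluster, and the output directory is checked. Because the job is carried out by other nodes in the cluster, those nodes need to know the essential information about it. Submission therefore has to place that information in a temporary directory (on HDFS) that all of them can access. This information consists of the jar, the split information, and the configuration (job.xml).

Appendix: source code of the submitJobInternal(Job.this, cluster) method: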

/**
   * Internal method for submitting jobs to the system.
   *
   * <p>The job submission process involves:
   * <ol>
   *   <li>
   *   Checking the input and output specifications of the job.
   *   </li>
   *   <li>
   *   Computing the {@link InputSplit}s for the job.
   *   </li>
   *   <li>
   *   Setup the requisite accounting information for the
   *   {@link DistributedCache} of the job, if necessary.
   *   </li>
   *   <li>
   *   Copying the job's jar and configuration to the map-reduce system
   *   directory on the distributed file-system.
   *   </li>
   *   <li>
   *   Submitting the job to the <code>JobTracker</code> and optionally
   *   monitoring it's status.
   *   </li>
   * </ol></p>
   * @param job the configuration to submit
   * @param cluster the handle to the Cluster
   * @throws ClassNotFoundException
   * @throws InterruptedException
   * @throws IOException
   */

  JobStatus submitJobInternal(Job job, Cluster cluster)
  throws ClassNotFoundException, InterruptedException, IOException {
    //validate the jobs output specs
    checkSpecs(job);
    Configuration conf = job.getConfiguration();
    addMRFrameworkToDistributedCache(conf);
    Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
    //configure the command line options correctly on the submitting dfs
    InetAddress ip = InetAddress.getLocalHost();
    if (ip != null) {
      submitHostAddress = ip.getHostAddress();
      submitHostName = ip.getHostName();
      conf.set(MRJobConfig.JOB_SUBMITHOST,submitHostName);
      conf.set(MRJobConfig.JOB_SUBMITHOSTADDR,submitHostAddress);
    }
    JobID jobId = submitClient.getNewJobID();
    job.setJobID(jobId);
    Path submitJobDir = new Path(jobStagingArea, jobId.toString());
    JobStatus status = null;
    try {
      conf.set(MRJobConfig.USER_NAME,
          UserGroupInformation.getCurrentUser().getShortUserName());
      conf.set("hadoop.http.filter.initializers",
          "org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
      conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
      LOG.debug("Configuring job " + jobId + " with " + submitJobDir
          + " as the submit dir");
      // get delegation token for the dir
      TokenCache.obtainTokensForNamenodes(job.getCredentials(),
          new Path[] { submitJobDir }, conf);
      populateTokenCache(conf, job.getCredentials());
      // generate a secret to authenticate shuffle transfers
      if (TokenCache.getShuffleSecretKey(job.getCredentials()) == null) {
        KeyGenerator keyGen;
        try {
          keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
          keyGen.init(SHUFFLE_KEY_LENGTH);
        } catch (NoSuchAlgorithmException e) {
          throw new IOException("Error generating shuffle secret key", e);
        }
        SecretKey shuffleKey = keyGen.generateKey();
        TokenCache.setShuffleSecretKey(shuffleKey.getEncoded(),
            job.getCredentials());
      }
      if (CryptoUtils.isEncryptedSpillEnabled(conf)) {
        conf.setInt(MRJobConfig.MR_AM_MAX_ATTEMPTS, 1);
        LOG.warn("Max job attempts set to 1 since encrypted intermediate" +
                "data spill is enabled");
      }
      copyAndConfigureFiles(job, submitJobDir);
      Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);
      // Create the splits for the job
      LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
      int maps = writeSplits(job, submitJobDir);
      conf.setInt(MRJobConfig.NUM_MAPS, maps);
      LOG.info("number of splits:" + maps);
      // write "queue admins of the queue to which job is being submitted"
      // to job file.
      String queue = conf.get(MRJobConfig.QUEUE_NAME,
          JobConf.DEFAULT_QUEUE_NAME);
      AccessControlList acl = submitClient.getQueueAdmins(queue);
      conf.set(toFullPropertyName(queue,
          QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());
      // removing jobtoken referrals before copying the jobconf to HDFS
      // as the tasks don't need this setting, actually they may break
      // because of it if present as the referral will point to a
      // different job.
      TokenCache.cleanUpTokenReferral(conf);
      if (conf.getBoolean(
          MRJobConfig.JOB_TOKEN_TRACKING_IDS_ENABLED,
          MRJobConfig.DEFAULT_JOB_TOKEN_TRACKING_IDS_ENABLED)) {
        // Add HDFS tracking ids
        ArrayList<String> trackingIds = new ArrayList<String>();
        for (Token<? extends TokenIdentifier> t :
            job.getCredentials().getAllTokens()) {
          trackingIds.add(t.decodeIdentifier().getTrackingId());
        }
        conf.setStrings(MRJobConfig.JOB_TOKEN_TRACKING_IDS,
            trackingIds.toArray(new String[trackingIds.size()]));
      }
      // Set reservation info if it exists
      ReservationId reservationId = job.getReservationId();
      if (reservationId != null) {
        conf.set(MRJobConfig.RESERVATION_ID, reservationId.toString());
      }
      // Write job file to submit dir
      writeConf(conf, submitJobFile);
      //
      // Now, actually submit the job (using the submit name)
      //
      printTokens(jobId, job.getCredentials());
      status = submitClient.submitJob(
          jobId, submitJobDir.toString(), job.getCredentials());
      if (status != null) {
        return status;
      } else {
        throw new IOException("Could not launch job");
      }
    } finally {
      if (status == null) {
        LOG.info("Cleaning up the staging area " + submitJobDir);
        if (jtFs != null && submitJobDir != null)
          jtFs.delete(submitJobDir, true);
      }
    }
  }