Source-code analysis of the split computation and host selection algorithms in FileInputFormat, starting from WordCount job submission
See also: "Introduction to the split and host selection algorithms in the FileInputFormat class", "Analysis of how Hadoop 2.6.0's FileInputFormat splits tasks (i.e. how to control the number of FileInputFormat map tasks)", "The flow of FileInputFormat's getSplits method for computing InputSplits in Hadoop", and "Hadoop job split handling and task locality analysis (source-code analysis, part one)".
Before the analysis, a few preliminary notes:
(Note that a Block's hosts and a Split's hosts are not the same thing: a Split's hosts are derived from the hosts of the Blocks it covers by a specific selection method. When one Block corresponds to exactly one Split (the common case), the two host lists are identical. When the mapping is not one-to-one (a Split spans more than one Block), the Split has to choose its hosts by a selection algorithm.
Splits and MapTasks are in one-to-one correspondence: one Split is processed by one MapTask, so data locality is determined by the Split's hosts.
BlocksMap stores the mapping from Block to BlockInfo. A Block mainly holds three fields: long blockId (the unique identifier of the block), long numBytes (the number of bytes of file data the block contains), and long generationStamp (the block's generation stamp, i.e. its version/timestamp). BlockInfo (called BlockInfoContiguous in Hadoop 2.7.3) holds the hostnames of all replicas of the block.)
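To make the distinction between Block hosts and Split hosts concrete, here is a minimal sketch (an assumed illustration, not Hadoop source; the path, offsets and datanode names are made up) of how a split that covers exactly one block simply reuses that block's hosts via the FileSplit constructor, which is what later drives MapTask locality:

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitHostsSketch {
  public static void main(String[] args) throws Exception {
    // Pretend HDFS reported one 128 MB block with three replicas on dn1/dn2/dn3.
    BlockLocation blk = new BlockLocation(
        new String[] {"dn1:50010", "dn2:50010", "dn3:50010"}, // names (host:port)
        new String[] {"dn1", "dn2", "dn3"},                   // hosts
        0L, 128L * 1024 * 1024);                              // offset, length

    // A split covering exactly this block carries the block's hosts unchanged,
    // so the scheduler can place the corresponding MapTask on dn1, dn2 or dn3.
    FileSplit split = new FileSplit(new Path("/data/input.txt"),
        blk.getOffset(), blk.getLength(), blk.getHosts());

    System.out.println(split + " hosts=" + String.join(",", split.getLocations()));
  }
}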
Now for the analysis (the source here is hadoop-2.7.3-src).
Starting from WordCount: org.apache.hadoop.examples.WordCount.main() internally calls org.apache.hadoop.mapreduce.Job.waitForCompletion(boolean).
// This code is in org.apache.hadoop.examples.WordCount
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration(); // job execution specification; Configuration is the map/reduce configuration class that describes the map-reduce work to the Hadoop framework
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length < 2) {
System.err.println("Usage: wordcount <in> [<in>...] <out>");
System.exit(2);
}
Job job = Job.getInstance(conf, "word count"); // create the job instance and set its name
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class); // set the job's Mapper class
job.setCombinerClass(IntSumReducer.class); // set the job's Combiner class
job.setReducerClass(IntSumReducer.class); // set the job's Reducer class
job.setOutputKeyClass(Text.class); // set the key class of the job's output
job.setOutputValueClass(IntWritable.class); // set the value class of the job's output
for (int i = 0; i < otherArgs.length - 1; ++i) {
FileInputFormat.addInputPath(job, new Path(otherArgs[i])); // add an input path to the job; this FileInputFormat is org.apache.hadoop.mapreduce.lib.input.FileInputFormat
}
FileOutputFormat.setOutputPath(job,
new Path(otherArgs[otherArgs.length - 1])); // set the job's output path
System.exit(job.waitForCompletion(true) ? 0 : 1); // run the job by calling Job.waitForCompletion()
}
Inside Job.waitForCompletion(), the method calls submit() on the same class; submit() in turn calls org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(Job, Cluster).
// This code is in org.apache.hadoop.mapreduce.Job
// ......
/**
* Submit the job to the cluster and wait for it to finish.
* @param verbose print the progress to the user
* @return true if the job succeeded
* @throws IOException thrown if the communication with the
* <code>JobTracker</code> is lost
*/
public boolean waitForCompletion(boolean verbose
) throws IOException, InterruptedException,
ClassNotFoundException {
if (state == JobState.DEFINE) {
submit(); // calls submit() on this class
}
if (verbose) {
monitorAndPrintJob();
} else {
// get the completion poll interval from the client.
int completionPollIntervalMillis =
Job.getCompletionPollInterval(cluster.getConf());
while (!isComplete()) {
try {
Thread.sleep(completionPollIntervalMillis);
} catch (InterruptedException ie) {
}
}
}
return isSuccessful();
}
// ......
/**
* Submit the job to the cluster and return immediately.
* @throws IOException
*/
public void submit()
throws IOException, InterruptedException, ClassNotFoundException {
ensureState(JobState.DEFINE);
setUseNewAPI();
connect();
final JobSubmitter submitter =
getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
public JobStatus run() throws IOException, InterruptedException,
ClassNotFoundException {
return submitter.submitJobInternal(Job.this, cluster); // hand off to JobSubmitter.submitJobInternal()
}
});
state = JobState.RUNNING;
LOG.info("The url to track the job: " + getTrackingURL());
}
Inside JobSubmitter.submitJobInternal(Job, Cluster), writeSplits(JobContext, Path) is called on the same class to create the splits for the job. writeSplits(JobContext, Path) in turn calls either (1) writeNewSplits(JobContext, Path), used when the job uses the new API (the case for WordCount on Hadoop 2.x), or (2) writeOldSplits(JobConf, Path), used for the old API. JobSubmitter.writeNewSplits(JobContext, Path) calls getSplits(JobContext) of the abstract class org.apache.hadoop.mapreduce.InputFormat to compute the list of logical splits over the job's input files, while JobSubmitter.writeOldSplits(JobConf, Path) calls getSplits(JobConf, int) of the old-API org.apache.hadoop.mapred.InputFormat for the same purpose.
// This code is in org.apache.hadoop.mapreduce.JobSubmitter
// ......
/**
* Internal method for submitting jobs to the system.
*
* <p>The job submission process involves:
* <ol>
* <li>
* Checking the input and output specifications of the job.
* </li>
* <li>
* Computing the {@link InputSplit}s for the job.
* </li>
* <li>
* Setup the requisite accounting information for the
* {@link DistributedCache} of the job, if necessary.
* </li>
* <li>
* Copying the job's jar and configuration to the map-reduce system
* directory on the distributed file-system.
* </li>
* <li>
* Submitting the job to the <code>JobTracker</code> and optionally
* monitoring its status.
* </li>
* </ol></p>
* @param job the configuration to submit
* @param cluster the handle to the Cluster
* @throws ClassNotFoundException
* @throws InterruptedException
* @throws IOException
*/
JobStatus submitJobInternal(Job job, Cluster cluster)
throws ClassNotFoundException, InterruptedException, IOException {
//validate the jobs output specs
checkSpecs(job);
Configuration conf = job.getConfiguration();
addMRFrameworkToDistributedCache(conf);
Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
//configure the command line options correctly on the submitting dfs
InetAddress ip = InetAddress.getLocalHost();
if (ip != null) {
submitHostAddress = ip.getHostAddress();
submitHostName = ip.getHostName();
conf.set(MRJobConfig.JOB_SUBMITHOST,submitHostName);
conf.set(MRJobConfig.JOB_SUBMITHOSTADDR,submitHostAddress);
}
JobID jobId = submitClient.getNewJobID();
job.setJobID(jobId);
Path submitJobDir = new Path(jobStagingArea, jobId.toString());
JobStatus status = null;
try {
conf.set(MRJobConfig.USER_NAME,
UserGroupInformation.getCurrentUser().getShortUserName());
conf.set("hadoop.http.filter.initializers",
"org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
LOG.debug("Configuring job " + jobId + " with " + submitJobDir
+ " as the submit dir");
// get delegation token for the dir
TokenCache.obtainTokensForNamenodes(job.getCredentials(),
new Path[] { submitJobDir }, conf);
populateTokenCache(conf, job.getCredentials());
// generate a secret to authenticate shuffle transfers
if (TokenCache.getShuffleSecretKey(job.getCredentials()) == null) {
KeyGenerator keyGen;
try {
keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
keyGen.init(SHUFFLE_KEY_LENGTH);
} catch (NoSuchAlgorithmException e) {
throw new IOException("Error generating shuffle secret key", e);
}
SecretKey shuffleKey = keyGen.generateKey();
TokenCache.setShuffleSecretKey(shuffleKey.getEncoded(),
job.getCredentials());
}
if (CryptoUtils.isEncryptedSpillEnabled(conf)) {
conf.setInt(MRJobConfig.MR_AM_MAX_ATTEMPTS, 1);
LOG.warn("Max job attempts set to 1 since encrypted intermediate" +
"data spill is enabled");
}
copyAndConfigureFiles(job, submitJobDir);
Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);
// Create the splits for the job
LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
int maps = writeSplits(job, submitJobDir); // create the input splits for the job
conf.setInt(MRJobConfig.NUM_MAPS, maps);
LOG.info("number of splits:" + maps); // write "queue admins of the queue to which job is being submitted"
// to job file.
String queue = conf.get(MRJobConfig.QUEUE_NAME,
JobConf.DEFAULT_QUEUE_NAME);
AccessControlList acl = submitClient.getQueueAdmins(queue);
conf.set(toFullPropertyName(queue,
QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());
// removing jobtoken referrals before copying the jobconf to HDFS
// as the tasks don't need this setting, actually they may break
// because of it if present as the referral will point to a
// different job.
TokenCache.cleanUpTokenReferral(conf);
if (conf.getBoolean(
MRJobConfig.JOB_TOKEN_TRACKING_IDS_ENABLED,
MRJobConfig.DEFAULT_JOB_TOKEN_TRACKING_IDS_ENABLED)) {
// Add HDFS tracking ids
ArrayList<String> trackingIds = new ArrayList<String>();
for (Token<? extends TokenIdentifier> t :
job.getCredentials().getAllTokens()) {
trackingIds.add(t.decodeIdentifier().getTrackingId());
}
conf.setStrings(MRJobConfig.JOB_TOKEN_TRACKING_IDS,
trackingIds.toArray(new String[trackingIds.size()]));
}
// Set reservation info if it exists
ReservationId reservationId = job.getReservationId();
if (reservationId != null) {
conf.set(MRJobConfig.RESERVATION_ID, reservationId.toString());
}
// Write job file to submit dir
writeConf(conf, submitJobFile);
//
// Now, actually submit the job (using the submit name)
//
printTokens(jobId, job.getCredentials());
status = submitClient.submitJob(
jobId, submitJobDir.toString(), job.getCredentials()); // actually submit the job
if (status != null) {
return status;
} else {
throw new IOException("Could not launch job");
}
} finally {
if (status == null) {
LOG.info("Cleaning up the staging area " + submitJobDir);
if (jtFs != null && submitJobDir != null)
jtFs.delete(submitJobDir, true);
}
}
}
// ......
private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
Path jobSubmitDir) throws IOException,
InterruptedException, ClassNotFoundException {
JobConf jConf = (JobConf)job.getConfiguration();
int maps;
if (jConf.getUseNewMapper()) {
maps = writeNewSplits(job, jobSubmitDir);
} else {
maps = writeOldSplits(jConf, jobSubmitDir);
}
return maps;
}
// ......
@SuppressWarnings("unchecked")
private <T extends InputSplit>
int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = job.getConfiguration();
InputFormat<?, ?> input =
ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
List<InputSplit> splits = input.getSplits(job);
T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new SplitComparator());
JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
jobSubmitDir.getFileSystem(conf), array);
return array.length;
}
// ......
//method to write splits for old api mapper.
private int writeOldSplits(JobConf job, Path jobSubmitDir)
throws IOException {
org.apache.hadoop.mapred.InputSplit[] splits =
job.getInputFormat().getSplits(job, job.getNumMapTasks());
// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(splits, new Comparator<org.apache.hadoop.mapred.InputSplit>() {
public int compare(org.apache.hadoop.mapred.InputSplit a,
org.apache.hadoop.mapred.InputSplit b) {
try {
long left = a.getLength();
long right = b.getLength();
if (left == right) {
return 0;
} else if (left < right) {
return 1;
} else {
return -1;
}
} catch (IOException ie) {
throw new RuntimeException("Problem getting input split size", ie);
}
}
});
JobSplitWriter.createSplitFiles(jobSubmitDir, job,
jobSubmitDir.getFileSystem(job), splits);
return splits.length;
}
(1) The call to getSplits(JobContext) of the abstract class org.apache.hadoop.mapreduce.InputFormat actually dispatches to the implementation org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(JobContext). (There are two classes with the same name here, org.apache.hadoop.mapreduce.lib.input.FileInputFormat and org.apache.hadoop.mapred.FileInputFormat; we look at org.apache.hadoop.mapreduce.lib.input.FileInputFormat for the following reasons: first, org.apache.hadoop.mapred.FileInputFormat implements the old-API org.apache.hadoop.mapred.InputFormat and is covered under (2) below; second, the FileInputFormat used in WordCount is org.apache.hadoop.mapreduce.lib.input.FileInputFormat.)
A quick aside on the new and old MapReduce APIs: starting with release 0.20.0, Hadoop ships both. The new API is built on top of the old one and improves extensibility and ease of use. The old API lives in the org.apache.hadoop.mapred package, while the new API lives in org.apache.hadoop.mapreduce and its sub-packages.
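To make the new/old API contrast concrete, here is a minimal old-API driver sketch (an assumed illustration, not one of the Hadoop examples; the input/output paths come from args, and no custom Mapper/Reducer is set so the identity defaults apply). An old-API job is built from JobConf and submitted through JobClient, so JobConf.getUseNewMapper() (the mapred.mapper.new-api flag that Job.setUseNewAPI() manages during submission) stays false and JobSubmitter takes the writeOldSplits() branch, whereas the new-API WordCount above goes through writeNewSplits():

// Old API: everything comes from org.apache.hadoop.mapred
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class OldApiWordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(OldApiWordCountDriver.class);
    conf.setJobName("word count (old api)");
    conf.setInputFormat(TextInputFormat.class); // old-API InputFormat from org.apache.hadoop.mapred
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // JobClient submits with mapred.mapper.new-api == false, so split creation
    // goes through writeOldSplits() and org.apache.hadoop.mapred.FileInputFormat.getSplits(JobConf, int).
    JobClient.runJob(conf);
  }
}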
// This code is in org.apache.hadoop.mapreduce.InputFormat
/**
* Logically split the set of input files for the job.
*
* <p>Each {@link InputSplit} is then assigned to an individual {@link Mapper}
* for processing.</p>
*
* <p><i>Note</i>: The split is a <i>logical</i> split of the inputs and the
* input files are not physically split into chunks. For e.g. a split could
* be <i><input-file-path, start, offset></i> tuple. The InputFormat
* also creates the {@link RecordReader} to read the {@link InputSplit}.
*
* @param context job configuration.
* @return an array of {@link InputSplit}s for the job.
*/
public abstract
List<InputSplit> getSplits(JobContext context
) throws IOException, InterruptedException;
Within org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(JobContext), the new-API split computation and host assignment are carried out; the details of that algorithm are covered in the articles referenced at the top.
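The heart of that method is: compute a split size per file, chop the file at split-size boundaries, and attach to each split the hosts of the block containing its start offset. Below is a heavily simplified, self-contained sketch of that logic (an assumed illustration, not the Hadoop source: the Split/Block helper classes stand in for FileSplit/BlockLocation, and error handling, empty or unsplittable files, and cached hosts are omitted; SPLIT_SLOP and the min/max sizes mirror the real minSize/maxSize read from mapreduce.input.fileinputformat.split.minsize and .maxsize):

import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

  /** Simplified stand-in for a FileSplit: (start, length, hosts). */
  static class Split {
    final long start, length;
    final String[] hosts;
    Split(long start, long length, String[] hosts) {
      this.start = start; this.length = length; this.hosts = hosts;
    }
  }

  /** Simplified stand-in for a BlockLocation: (offset, length, hosts). */
  static class Block {
    final long offset, length;
    final String[] hosts;
    Block(long offset, long length, String[] hosts) {
      this.offset = offset; this.length = length; this.hosts = hosts;
    }
  }

  private static final double SPLIT_SLOP = 1.1; // 10% slop, as in FileInputFormat

  // splitSize = max(minSize, min(maxSize, blockSize))
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  // Index of the block that contains the given file offset.
  static int getBlockIndex(Block[] blocks, long offset) {
    for (int i = 0; i < blocks.length; i++) {
      if (offset >= blocks[i].offset && offset < blocks[i].offset + blocks[i].length) {
        return i;
      }
    }
    throw new IllegalArgumentException("offset " + offset + " is outside the file");
  }

  // Chop one file of 'length' bytes into splits, as getSplits() does per input file.
  static List<Split> splitsForFile(long length, long blockSize,
                                   long minSize, long maxSize, Block[] blocks) {
    List<Split> splits = new ArrayList<>();
    long splitSize = computeSplitSize(blockSize, minSize, maxSize);
    long bytesRemaining = length;
    // Keep cutting full-size splits while what is left exceeds 1.1 * splitSize.
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      int blkIndex = getBlockIndex(blocks, length - bytesRemaining);
      splits.add(new Split(length - bytesRemaining, splitSize, blocks[blkIndex].hosts));
      bytesRemaining -= splitSize;
    }
    // The tail (possibly up to 1.1 * splitSize long) becomes the last split.
    if (bytesRemaining != 0) {
      int blkIndex = getBlockIndex(blocks, length - bytesRemaining);
      splits.add(new Split(length - bytesRemaining, bytesRemaining, blocks[blkIndex].hosts));
    }
    return splits;
  }
}

With the defaults (minSize = 1, maxSize = Long.MAX_VALUE) the split size equals the block size, so this degenerates to one split per block, which is why Block hosts and Split hosts usually coincide; raising the minimum split size above the block size (e.g. via FileInputFormat.setMinInputSplitSize(job, ...)) is the usual way to reduce the number of map tasks, as discussed in the referenced articles.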
(2) The call to getSplits(JobConf, int) of the old-API org.apache.hadoop.mapred.InputFormat actually dispatches to the implementation org.apache.hadoop.mapred.FileInputFormat.getSplits(JobConf, int); here org.apache.hadoop.mapred.FileInputFormat is the implementation class of the old-API InputFormat. For the details see the referenced article "Introduction to the split and host selection algorithms in the FileInputFormat class".
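The host selection problem only arises when a split spans more than one block, so the candidate hosts differ from block to block. The idea in org.apache.hadoop.mapred.FileInputFormat is, roughly, to credit each host with the number of bytes of the split it actually stores and keep the top contributors. The following is a minimal sketch of that idea only (an assumed illustration, not the real getSplitHosts implementation, which additionally works at rack granularity and limits the result to the replication factor):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitHostSelectionSketch {

  /** Simplified stand-in for a BlockLocation: (offset, length, hosts). */
  static class Block {
    final long offset, length;
    final String[] hosts;
    Block(long offset, long length, String[] hosts) {
      this.offset = offset; this.length = length; this.hosts = hosts;
    }
  }

  /**
   * Pick up to maxHosts hosts for the split [splitStart, splitStart + splitLength),
   * ranked by how many bytes of the split each host physically stores.
   */
  static String[] chooseSplitHosts(Block[] blocks, long splitStart,
                                   long splitLength, int maxHosts) {
    long splitEnd = splitStart + splitLength;
    Map<String, Long> bytesPerHost = new HashMap<>();
    for (Block b : blocks) {
      // Overlap between this block and the split.
      long overlap = Math.min(splitEnd, b.offset + b.length) - Math.max(splitStart, b.offset);
      if (overlap <= 0) continue; // block lies entirely outside the split
      for (String host : b.hosts) {
        bytesPerHost.merge(host, overlap, Long::sum);
      }
    }
    // Sort hosts by contributed bytes, largest first, and keep the top maxHosts.
    List<Map.Entry<String, Long>> ranked = new ArrayList<>(bytesPerHost.entrySet());
    ranked.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()));
    int n = Math.min(maxHosts, ranked.size());
    String[] hosts = new String[n];
    for (int i = 0; i < n; i++) {
      hosts[i] = ranked.get(i).getKey();
    }
    return hosts;
  }
}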