Job Flow: Submitting an MR Job

1. The entry point of a standard MR job:

// Passing true makes the client monitor the job and print the progress of the Job and its Tasks
System.exit(job.waitForCompletion(true) ? 0 : 1);

2. Internal implementation of job.waitForCompletion(true):

public boolean waitForCompletion(boolean verbose
                                   ) throws IOException, InterruptedException,
                                            ClassNotFoundException {
    if (state == JobState.DEFINE) {
      submit(); // the core of this method is submit()
    }
    if (verbose) { // the argument decides whether the job's detailed progress is printed
      monitorAndPrintJob();
    } else {
      // get the completion poll interval from the client.
      int completionPollIntervalMillis =
        Job.getCompletionPollInterval(cluster.getConf());
      while (!isComplete()) {
        try {
          Thread.sleep(completionPollIntervalMillis);
        } catch (InterruptedException ie) {
        }
      }
    }
    return isSuccessful();
}
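The non-verbose branch above is a plain poll-and-sleep loop. Here is a minimal stdlib-only sketch of that pattern. `PollSketch` and `waitUntil` are hypothetical names, and the `BooleanSupplier` stands in for `isComplete()`; this is not Hadoop code.

```java
import java.util.function.BooleanSupplier;

final class PollSketch {
    // Checks isComplete, sleeping pollIntervalMillis between checks, until it returns true.
    // Returns how many times isComplete was evaluated.
    // Unlike the Hadoop loop, which swallows InterruptedException and keeps polling,
    // this sketch restores the interrupt flag and stops waiting.
    static int waitUntil(BooleanSupplier isComplete, long pollIntervalMillis) {
        int checks = 1;
        while (!isComplete.getAsBoolean()) {
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            }
            checks++;
        }
        return checks;
    }
}
```

With a supplier that turns true on its third evaluation, `waitUntil` returns after three checks and two sleeps.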

3. Internal implementation of submit() in the Job class:

public void submit()
         throws IOException, InterruptedException, ClassNotFoundException {
    ensureState(JobState.DEFINE);
    setUseNewAPI(); // use the new MapReduce API
    connect(); // initializes the Cluster member of Job: the client-side object used to talk to the cluster (the ResourceManager in YARN mode, not the NameNode)
    final JobSubmitter submitter =
        getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
      public JobStatus run() throws IOException, InterruptedException,
      ClassNotFoundException {
        // submit the job
        return submitter.submitJobInternal(Job.this, cluster);
      }
    });
    state = JobState.RUNNING; // set the JobState to RUNNING
    LOG.info("The url to track the job: " + getTrackingURL());
}
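submit() is guarded by the job state: it only runs while the job is still in DEFINE and then moves it to RUNNING. The sketch below models just that guard. `JobStateSketch` is a hypothetical class with two states only, and the real connect and submit work is elided.

```java
final class JobStateSketch {
    enum State { DEFINE, RUNNING }

    private State state = State.DEFINE;

    // Mirrors the idea of ensureState(): fail fast if the job is not in the expected state.
    private void ensureState(State expected) {
        if (state != expected) {
            throw new IllegalStateException(
                "Job in state " + state + " instead of " + expected);
        }
    }

    void submit() {
        ensureState(State.DEFINE);
        // ... the real method connects to the cluster and submits the job here ...
        state = State.RUNNING;
    }

    State state() {
        return state;
    }
}
```

This is also why waitForCompletion() checks `state == JobState.DEFINE` before calling submit(): a job that is already running is monitored, not submitted a second time.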

3.1.1. Internal implementation of connect():

private synchronized void connect()
          throws IOException, InterruptedException, ClassNotFoundException {
    if (cluster == null) {
      cluster =
        ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
                   public Cluster run()
                          throws IOException, InterruptedException,
                                 ClassNotFoundException {
                     // Create a Cluster object and store it in the cluster field of Job,
                     // so the Job holds a reference to the Cluster.
                     return new Cluster(getConfiguration());
                   }
                 });
    }
}

3.1.2. What new Cluster() does:

// new Cluster(conf) delegates to this constructor with jobTrackAddr == null
public Cluster(InetSocketAddress jobTrackAddr, Configuration conf)
      throws IOException {
    this.conf = conf;
    this.ugi = UserGroupInformation.getCurrentUser();
    initialize(jobTrackAddr, conf); // the important part is inside this method
}

3.1.3. How the client proxy object of Cluster is instantiated (inside initialize()):

synchronized (frameworkLoader) {
      for (ClientProtocolProvider provider : frameworkLoader) {
        LOG.debug("Trying ClientProtocolProvider : "
            + provider.getClass().getName());
        // ClientProtocol is the protocol the client uses to submit and track jobs (it is not a NameNode protocol).
        // Like other Hadoop RPC protocol interfaces, it declares a versionID field.
        ClientProtocol clientProtocol = null;
        try {
          if (jobTrackAddr == null) {
            // in YARN mode the provider creates a YARNRunner object
            clientProtocol = provider.create(conf);
          } else {
            clientProtocol = provider.create(jobTrackAddr, conf);
          }
          if (clientProtocol != null) {  // initialize the member fields of Cluster
            clientProtocolProvider = provider;
            client = clientProtocol;     // the client proxy object held by Cluster
            LOG.debug("Picked " + provider.getClass().getName()
                + " as the ClientProtocolProvider");
            break;
          }
          else {
            LOG.debug("Cannot pick " + provider.getClass().getName()
                + " as the ClientProtocolProvider - returned null protocol");
          }
        }
        catch (Exception e) {
          LOG.info("Failed to use " + provider.getClass().getName()
              + " due to error: " + e.getMessage());
        }
     }
 }

3.1.4. The versionID field declared in the ClientProtocol interface

//Version 37: More efficient serialization format for framework counters

public static final long versionID = 37L;

3.1.5. provider.create(), which creates the client proxy object, has two implementations: LocalClientProtocolProvider (local mode, not covered here) and YarnClientProtocolProvider (YARN mode).

public ClientProtocol create(Configuration conf) throws IOException {
    if (MRConfig.YARN_FRAMEWORK_NAME.equals(conf.get(MRConfig.FRAMEWORK_NAME))) {
      return new YARNRunner(conf); // instantiate the client proxy object, a YARNRunner
    }
    return null;
}
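Taken together, 3.1.3 and 3.1.5 amount to "ask each provider in turn and keep the first one that returns a non-null client". Here is a stdlib-only sketch of that selection. `ProviderSketch` and its `Provider` interface are hypothetical stand-ins for ClientProtocolProvider, the clients are plain strings, and the "local" fallback when the property is unset is an assumption about the local provider's default.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ProviderSketch {
    /** Stand-in for ClientProtocolProvider: returns a client name, or null if it does not apply. */
    interface Provider {
        String create(Map<String, String> conf);
    }

    // A provider that only answers for one framework name.
    // Assumption: an unset mapreduce.framework.name is treated as "local".
    static Provider forFramework(String framework, String clientName) {
        return conf -> framework.equals(
                conf.getOrDefault("mapreduce.framework.name", "local"))
                ? clientName : null;
    }

    // Same shape as the loop in initialize(): the first non-null client wins.
    static String pick(List<Provider> providers, Map<String, String> conf) {
        for (Provider provider : providers) {
            String client = provider.create(conf);
            if (client != null) {
                return client;
            }
        }
        return null; // no provider accepted this configuration
    }

    // Convenience wrapper: builds a conf and the two providers discussed above.
    static String pickFor(String frameworkOrNull) {
        Map<String, String> conf = new HashMap<>();
        if (frameworkOrNull != null) {
            conf.put("mapreduce.framework.name", frameworkOrNull);
        }
        List<Provider> providers = new ArrayList<>();
        providers.add(forFramework("local", "LocalJobRunner"));
        providers.add(forFramework("yarn", "YARNRunner"));
        return pick(providers, conf);
    }
}
```

`ProviderSketch.pickFor("yarn")` selects the YARN client, while an unrecognized framework name leaves every provider returning null, which is the case where no client can be picked.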

3.1.6. Implementation of new YARNRunner()

ResourceMgrDelegate is the client-side delegate for the ResourceManager. It extends YarnClient (an abstract class, not an interface) and uses an ApplicationClientProtocol proxy to talk to the RM directly: submitting a job, killing a job, querying a job's status, and so on. ResourceMgrDelegate is constructed with a YarnConfiguration, which is how the properties in yarn-site.xml, core-site.xml and similar configuration files are read.

  public YARNRunner(Configuration conf) {
   this(conf, new ResourceMgrDelegate(new YarnConfiguration(conf)));
  }

  // (the intermediate two-argument constructor is not shown in this excerpt)
  public YARNRunner(Configuration conf, ResourceMgrDelegate resMgrDelegate,
      ClientCache clientCache) {
    this.conf = conf;
    try {
      this.resMgrDelegate = resMgrDelegate;
      this.clientCache = clientCache;
      this.defaultFileContext = FileContext.getFileContext(this.conf);
    } catch (UnsupportedFileSystemException ufe) {
      throw new RuntimeException("Error in instantiating YarnClient", ufe);
    }
  }

3.2.1. Implementation of submitJobInternal() in the JobSubmitter class:

  JobStatus submitJobInternal(Job job, Cluster cluster)
  throws ClassNotFoundException, InterruptedException, IOException {
    // validate the job's output specification; an exception is thrown if the output path already exists
    checkSpecs(job);

    // returns the parent directory under which the job's resources (jar, job.xml, split files, ...) are staged
    // default: /tmp/hadoop-yarn/staging/<user>/.staging (here the user was root); the base directory can be changed via yarn.app.mapreduce.am.staging-dir
    Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster,
                                                     job.getConfiguration());
    // the job configuration, including the options passed on the command line
    Configuration conf = job.getConfiguration();

    // record the host name and IP address of the submitting client
    InetAddress ip = InetAddress.getLocalHost();
    if (ip != null) {
      submitHostAddress = ip.getHostAddress();
      submitHostName = ip.getHostName();
      conf.set(MRJobConfig.JOB_SUBMITHOST,submitHostName);
      conf.set(MRJobConfig.JOB_SUBMITHOSTADDR,submitHostAddress);
    }

    // ask YARN's ResourceManager for a new JobID via RPC
    JobID jobId = submitClient.getNewJobID();
    job.setJobID(jobId);

    // join the staging directory and the JobID into the full directory for this job's files
    Path submitJobDir = new Path(jobStagingArea, jobId.toString());
    JobStatus status = null;

    try {
      conf.set(MRJobConfig.USER_NAME,
          UserGroupInformation.getCurrentUser().getShortUserName());
      conf.set("hadoop.http.filter.initializers",
          "org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
      conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
      LOG.debug("Configuring job " + jobId + " with " + submitJobDir
          + " as the submit dir");
      // get delegation token for the dir
      TokenCache.obtainTokensForNamenodes(job.getCredentials(),
          new Path[] { submitJobDir }, conf);
      populateTokenCache(conf, job.getCredentials());
      // generate a secret to authenticate shuffle transfers
      if (TokenCache.getShuffleSecretKey(job.getCredentials()) == null) {
        KeyGenerator keyGen;
        try {
          keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
          keyGen.init(SHUFFLE_KEY_LENGTH);
        } catch (NoSuchAlgorithmException e) {
          throw new IOException("Error generating shuffle secret key", e);
        }
        SecretKey shuffleKey = keyGen.generateKey();
        TokenCache.setShuffleSecretKey(shuffleKey.getEncoded(),
            job.getCredentials());
      }
      // copy the required files to the cluster; by default they are written with a replication factor of 10 (mapreduce.client.submit.file.replication)
      copyAndConfigureFiles(job, submitJobDir);
      Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);


 

      // Create the splits for the job
      LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));

      // compute the number of maps and write the information about each input split (covered in detail later)
      int maps = writeSplits(job, submitJobDir);
      conf.setInt(MRJobConfig.NUM_MAPS, maps);
      LOG.info("number of splits:" + maps);
      // write "queue admins of the queue to which job is being submitted"
      // to job file. (this is where the scheduler queue name is read)
      String queue = conf.get(MRJobConfig.QUEUE_NAME,
          JobConf.DEFAULT_QUEUE_NAME);
      AccessControlList acl = submitClient.getQueueAdmins(queue);
      conf.set(toFullPropertyName(queue,
          QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());
      // removing jobtoken referrals before copying the jobconf to HDFS
      // as the tasks don't need this setting, actually they may break
      // because of it if present as the referral will point to a
      // different job.
      TokenCache.cleanUpTokenReferral(conf);
      if (conf.getBoolean(
          MRJobConfig.JOB_TOKEN_TRACKING_IDS_ENABLED,
          MRJobConfig.DEFAULT_JOB_TOKEN_TRACKING_IDS_ENABLED)) {
        // Add HDFS tracking ids
        ArrayList<String> trackingIds = new ArrayList<String>();
        for (Token<? extends TokenIdentifier> t :
            job.getCredentials().getAllTokens()) {
          trackingIds.add(t.decodeIdentifier().getTrackingId());
        }
        conf.setStrings(MRJobConfig.JOB_TOKEN_TRACKING_IDS,
            trackingIds.toArray(new String[trackingIds.size()]));
      }
      // Write job file to submit dir
      // (this writes job.xml)
      writeConf(conf, submitJobFile);
      //
      // Now, actually submit the job (using the submit name)
      //
      printTokens(jobId, job.getCredentials());

      // submitJob() is the call that actually submits the job (analyzed in detail in 3.2.2)
      status = submitClient.submitJob(
          jobId, submitJobDir.toString(), job.getCredentials());
      if (status != null) {
        return status;
      } else {
        throw new IOException("Could not launch job");
      }
    } finally {
      if (status == null) {
        LOG.info("Cleaning up the staging area " + submitJobDir);
        if (jtFs != null && submitJobDir != null)
          jtFs.delete(submitJobDir, true);
      }
    }
  }
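The shuffle-secret step in submitJobInternal() uses only the JDK's javax.crypto API, so it can be tried in isolation. In the sketch below, the values "HmacSHA1" and 64 bits are assumptions: the excerpt shows only the constant names SHUFFLE_KEYGEN_ALGORITHM and SHUFFLE_KEY_LENGTH, not their values. `ShuffleKeySketch` is a hypothetical class.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

final class ShuffleKeySketch {
    // Assumed values; the excerpt above does not show what these constants hold.
    static final String SHUFFLE_KEYGEN_ALGORITHM = "HmacSHA1";
    static final int SHUFFLE_KEY_LENGTH = 64; // key size in bits

    // Generates a fresh random secret and returns its raw bytes,
    // the same sequence of calls as in submitJobInternal().
    static byte[] newShuffleKey() {
        try {
            KeyGenerator keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
            keyGen.init(SHUFFLE_KEY_LENGTH);
            SecretKey shuffleKey = keyGen.generateKey();
            return shuffleKey.getEncoded();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Error generating shuffle secret key", e);
        }
    }
}
```

In the real method the encoded key is stored in the job's credentials, and the key is only generated when the credentials do not already contain one.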

3.2.2. Implementation of submitClient.submitJob():

submitJob() is an abstract method of the ClientProtocol interface, and it has two implementations: LocalJobRunner (local mode, omitted) and YARNRunner (YARN mode). In YARN mode submitClient is the YARNRunner object itself, so this method runs in the client process; the remote calls to the ResourceManager are made inside it, through resMgrDelegate.

public JobStatus submitJob(JobID jobId, String jobSubmitDir, Credentials ts)
  throws IOException, InterruptedException {
    addHistoryToken(ts);
    // build the information needed to launch the MR ApplicationMaster (MR-AM)
    ApplicationSubmissionContext appContext =
      createApplicationSubmissionContext(conf, jobSubmitDir, ts);
    // submit the job to the RM, which returns an applicationId
    try {
      ApplicationId applicationId =
          resMgrDelegate.submitApplication(appContext);
      ApplicationReport appMaster = resMgrDelegate
          .getApplicationReport(applicationId);
      String diagnostics =
          (appMaster == null ?
              "application report is null" : appMaster.getDiagnostics());
      if (appMaster == null
          || appMaster.getYarnApplicationState() == YarnApplicationState.FAILED
          || appMaster.getYarnApplicationState() == YarnApplicationState.KILLED) {
        throw new IOException("Failed to run job : " +
            diagnostics);
      }
      // finally return the job's current status and exit
      return clientCache.getClient(jobId).getJobStatus(jobId);
    } catch (YarnException e) {
      throw new IOException(e);
    }
}

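Summary:

1. Why was YARN created?

The Hadoop 1.0 ecosystem was built almost entirely around MapReduce. Its poor scalability, low resource utilization and reliability problems became increasingly painful, which is what led to YARN. The Hadoop 2.0 ecosystem is built around YARN, and frameworks such as Storm and Spark can run on top of it.
2. What is the Configuration class for?

Configuration is a common class shared by all Hadoop modules. It loads the configuration files found on the classpath and reads and writes the options in them.
3. What is the GenericOptionsParser class for?
4. How do command-line arguments get into the conf variable?
5. Which method receives the arguments that were passed in?

GenericOptionsParser automatically sets command-line arguments into the conf variable. Its constructor calls parseGeneralOptions() internally to parse the arguments it was given.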

Configuration conf = new Configuration();

String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

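6. How do you specify the number of reduces on the command line?

The rule for command-line configuration is -D followed by a MapReduce configuration option, for example -D mapreduce.job.reduces=2. Other options such as -fs are supported as well.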
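To make the -D rule concrete, here is a deliberately simplified, stdlib-only model of what happens to the arguments. It is not the real GenericOptionsParser: `GenericOptionsSketch` is a hypothetical class, only -D is handled, and the configuration is a plain map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GenericOptionsSketch {
    final Map<String, String> conf = new LinkedHashMap<>();
    final List<String> remaining = new ArrayList<>();

    // Moves every "-D key=value" (or "-Dkey=value") pair into conf
    // and keeps all other arguments, in order, in remaining.
    GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String keyValue = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                keyValue = args[++i];           // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                keyValue = arg.substring(2);    // form: -Dkey=value
            }
            if (keyValue == null) {
                remaining.add(arg);             // not a -D option: left for the application
                continue;
            }
            int eq = keyValue.indexOf('=');
            if (eq > 0) {
                conf.put(keyValue.substring(0, eq), keyValue.substring(eq + 1));
            }
            // a -D argument without "key=value" is silently dropped in this sketch
        }
    }
}
```

Given the arguments `-D mapreduce.job.reduces=2 in out`, the sketch stores the reduce count in `conf` and leaves `in` and `out` as the remaining arguments, which is the role getRemainingArgs() plays in the snippet above.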
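7. How many maps and reduces are there by default?

By default the number of reduces is 1. The number of maps is not a fixed default: it equals the number of input splits computed at submission time (writeSplits() in 3.2.1), so a single small input file yields 1 map.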
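8. What does setJarByClass do?

setJarByClass() first checks that the job is still in the DEFINE state (that is, not yet submitted), then uses the class to find the jar file that contains it and assigns the jar's path to the mapreduce.job.jar property. The jar is found by asking the classloader for the matching resources on the classpath and iterating over them; see the findContainingJar method in the ClassUtil class for the actual implementation.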
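The lookup described above can be sketched with JDK classes only. This is a simplified approximation, not the ClassUtil source: `FindJarSketch` is a hypothetical class, and URL decoding of the path is left out.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Enumeration;

final class FindJarSketch {
    // java.lang.String -> "java/lang/String.class"
    static String classFileName(Class<?> clazz) {
        return clazz.getName().replace('.', '/') + ".class";
    }

    // "file:/opt/app/job.jar!/com/x/Y.class" -> "/opt/app/job.jar"
    static String jarPathFromUrlPath(String urlPath) {
        String path = urlPath.startsWith("file:")
                ? urlPath.substring("file:".length()) : urlPath;
        int bang = path.indexOf('!');
        return bang >= 0 ? path.substring(0, bang) : path;
    }

    // Returns the path of the jar that contains clazz, or null if the class
    // was not loaded from a jar (for example from a classes directory).
    static String findContainingJar(Class<?> clazz) throws IOException {
        ClassLoader loader = clazz.getClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader(); // bootstrap classes have no loader
        }
        Enumeration<URL> urls = loader.getResources(classFileName(clazz));
        while (urls.hasMoreElements()) {
            URL url = urls.nextElement();
            if ("jar".equals(url.getProtocol())) {
                return jarPathFromUrlPath(url.getPath());
            }
        }
        return null;
    }
}
```

When the job class is run from an IDE's classes directory instead of a jar, a lookup like this finds nothing, which is one reason jobs are usually packaged as a jar before being submitted to a cluster.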
9. Which parameter makes the console print the current progress of the (MapReduce) job?

To print the current progress on the console, pass true as the argument of job.waitForCompletion().
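10. Which setting causes a YARNRunner object to be created to submit the job?

If the mapreduce.framework.name property is set to "yarn" in the configuration (normally in mapred-site.xml, not in an HDFS configuration file), a YARNRunner object is created to submit the job.
11. Which class reads the properties in yarn-site.xml, core-site.xml and the other configuration files?

YarnConfiguration, which ResourceMgrDelegate is constructed with (see 3.1.6).
12. Which method of the JobSubmitter class submits the job to the cluster?

The submitJobInternal() method of the JobSubmitter class.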
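13. What role does DistributedCache play in MapReduce?

After the files are uploaded to HDFS, they are additionally cached by DistributedCache. When a compute node receives the first task of a job, DistributedCache automatically caches the job's files in a local directory on that node and unpacks archives such as .zip, .jar and .tar before the task starts. For later tasks of the same job on the same node, DistributedCache does not download the job files again; the task simply runs. When a job has many tasks, this avoids downloading the same job's files several times on one node and greatly improves task execution efficiency.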
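14. Is splitting each input file a physical or a logical division, and what is the difference?

Logical. Dividing a stored file into Blocks is the physical division; a split only records which byte range of which file a map task should read, and no data is moved or rewritten.
15. Which factors determine the split size?
16. How are the splits computed?

Three parameters; see "Job Flow: Factors That Determine the Number of Maps".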
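As a rough illustration of those three parameters, the sketch below follows the formula used by FileInputFormat as I understand it: the split size is the block size clamped between a minimum and a maximum, and the last split may run up to 10% over. `SplitSketch` is a hypothetical class; the 1.1 slop factor and the property names mapreduce.input.fileinputformat.split.minsize and split.maxsize do not appear in this article and should be read as assumptions. Unsplittable formats and zero-length files are ignored.

```java
final class SplitSketch {
    // Assumed slop factor: a trailing remainder of up to 10% of a split
    // is folded into the last split instead of becoming its own split.
    static final double SPLIT_SLOP = 1.1;

    // The three parameters: block size, minimum split size, maximum split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (and therefore map tasks) for one splittable file.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the final, possibly shorter or slightly longer, split
        }
        return splits;
    }
}
```

With a 128 MB block size and default minimum and maximum, a 300 MB file gives 3 splits (128 + 128 + 44 MB), while a 129 MB file stays a single split because the extra 1 MB is within the slop. Raising the minimum above the block size produces fewer, larger splits; lowering the maximum below it produces more, smaller ones.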
