InputSplit -> RecordReader -> map(key, value, context): a walkthrough of the process

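The figure above first describes how a Task (MapTask or ReduceTask) executes on the TaskTracker side. MapTask (in org.apache.hadoop.mapred) is first invoked by TaskRunner. It performs some initialization and then its run() method is called; if the job uses the new API, run() calls runNewMapper() to start the map operation.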

1) Analysis of runNewMapper()

1. First, create a Mapper object

// make a mapper

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =

      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)

        ReflectionUtils.newInstance(taskContext.getMapperClass(), job);

2. Create an InputFormat object

  org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =

       (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)

         ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);

3. Create an InputSplit object

  org.apache.hadoop.mapreduce.InputSplit split = null;

    split = getSplitDetails(new Path(splitIndex.getSplitLocation()),

        splitIndex.getStartOffset()); // load the details of this split

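Here splitIndex is of type TaskSplitIndex, which identifies the split this MapTask processes. TaskSplitIndex has two fields:

String splitLocation // path of job.split on HDFS

long startOffset // position of this task's split inside job.split

Using these two fields, the task first locates job.split and then reads the details of its own split at position startOffset.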
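The lookup by (file, offset) can be sketched with plain java.io. This is only an illustration under stated assumptions: SplitRecord, writeAll, and readAt are hypothetical names, the record layout (path, start, length) is made up, and an in-memory byte array stands in for job.split on HDFS. It is not how Hadoop serializes splits.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SplitIndexSketch {

    /** Hypothetical stand-in for the details of one split. */
    static final class SplitRecord {
        final String path;
        final long start;
        final long length;

        SplitRecord(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    /** Serializes the records back to back and returns the byte offset of each one. */
    static long[] writeAll(ByteArrayOutputStream file, SplitRecord[] records) {
        long[] offsets = new long[records.length];
        try (DataOutputStream out = new DataOutputStream(file)) {
            for (int i = 0; i < records.length; i++) {
                offsets[i] = out.size(); // bytes written so far = offset of record i
                out.writeUTF(records[i].path);
                out.writeLong(records[i].start);
                out.writeLong(records[i].length);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offsets;
    }

    /** Skips to startOffset and deserializes the single record stored there. */
    static SplitRecord readAt(byte[] file, long startOffset) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
            in.skipBytes((int) startOffset);
            String path = in.readUTF();
            long start = in.readLong();
            long length = in.readLong();
            return new SplitRecord(path, start, length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SplitRecord[] records = {
            new SplitRecord("/data/a.txt", 0L, 64L),
            new SplitRecord("/data/a.txt", 64L, 64L),
            new SplitRecord("/data/b.txt", 0L, 10L),
        };
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        long[] offsets = writeAll(file, records);
        // A task that was handed offsets[1] reads only its own record.
        SplitRecord mine = readAt(file.toByteArray(), offsets[1]);
        System.out.println(mine.path + " start=" + mine.start + " length=" + mine.length);
    }
}
```

The point of the design is that each task only needs a path and one long to find its split, rather than carrying the whole split description with it.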

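4. Use the InputFormat and the InputSplit to create the RecordReader object input. At this point the concrete kind of record reader behind input should already be determined, for example LineRecordReader.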

  org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =

       new NewTrackingRecordReader<INKEY,INVALUE>

           (split, inputFormat, reporter, job, taskContext);

5. Create the output collector object output (if the number of reduce tasks is 0, a NewDirectOutputCollector is created; otherwise a NewOutputCollector is created)

     // get an output object

      if (job.getNumReduceTasks() == 0) { // the job has no reduce tasks

         output =

           new NewDirectOutputCollector(taskContext, job, umbilical, reporter);

      } else {

        output = new NewOutputCollector(taskContext, job, umbilical, reporter);

      }

6. Use the objects above to create mapperContext, an object of the MapContext class

mapperContext = contextConstructor.newInstance(mapper, job, getTaskID(),

                                                     input, output, committer,

                                                     reporter, split);

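In MapContext, reader and split are data members. getCurrentKey() returns the current key and getCurrentValue() returns the current value. nextKeyValue() returns true if there is another record and false otherwise. All three methods simply delegate to reader. RecordReader is an abstract class that does not implement these methods; its subclasses do.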

  @Override

   public KEYIN getCurrentKey() throws IOException, InterruptedException {

     return reader.getCurrentKey();

   }

   @Override

   public VALUEIN getCurrentValue() throws IOException, InterruptedException {

     return reader.getCurrentValue();

   }

   @Override

   public boolean nextKeyValue() throws IOException, InterruptedException {

     return reader.nextKeyValue();

   }

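In the MapContext constructor, the field reader is initialized from input, so its concrete type is already determined and the concrete implementations of these methods are the ones called, for example those of LineRecordReader (found in org.apache.hadoop.mapreduce.lib.input, because the new API reimplements LineRecordReader). Part of the LineRecordReader source follows: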

 private CompressionCodecFactory compressionCodecs = null;

   private long start;

   private long pos;

   private long end;

   private LineReader in;

   private int maxLineLength;

   private LongWritable key = null;

   private Text value = null;

   private Seekable filePosition;

   private CompressionCodec codec;

 public void initialize(InputSplit genericSplit,

                          TaskAttemptContext context) throws IOException {

     FileSplit split = (FileSplit) genericSplit;

     Configuration job = context.getConfiguration();

     this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",

                                     Integer.MAX_VALUE);

     start = split.getStart();

     end = start + split.getLength();

     final Path file = split.getPath();

     compressionCodecs = new CompressionCodecFactory(job);

     codec = compressionCodecs.getCodec(file);

     // open the file and seek to the start of the split

     FileSystem fs = file.getFileSystem(job);

     FSDataInputStream fileIn = fs.open(split.getPath());

     if (isCompressedInput()) {

       decompressor = CodecPool.getDecompressor(codec);

       if (codec instanceof SplittableCompressionCodec) {

         final SplitCompressionInputStream cIn =

           ((SplittableCompressionCodec)codec).createInputStream(

             fileIn, decompressor, start, end,

             SplittableCompressionCodec.READ_MODE.BYBLOCK);

         in = new LineReader(cIn, job);

         start = cIn.getAdjustedStart();

         end = cIn.getAdjustedEnd();

         filePosition = cIn;

       } else {

         in = new LineReader(codec.createInputStream(fileIn, decompressor),

             job);

         filePosition = fileIn;

       }

     } else {

       fileIn.seek(start);

       in = new LineReader(fileIn, job);

       filePosition = fileIn;

     }

     // If this is not the first split, we always throw away first record

     // because we always (except the last split) read one extra line in

     // next() method.

     if (start != 0) {

       start += in.readLine(new Text(), 0, maxBytesToConsume(start));

     }

     this.pos = start;

   }

  @Override

   public LongWritable getCurrentKey() {

     return key;

   }

   @Override

   public Text getCurrentValue() {

     return value;

   }

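The 7th and 8th field declarations in the listing above are the class fields key and value. The reader is initialized by initialize(), and reading data from the split into key/value is done by nextKeyValue():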

 public boolean nextKeyValue() throws IOException {

     if (key == null) {

       key = new LongWritable();

     }

     key.set(pos); // the record's byte offset is used as the key

     if (value == null) {

       value = new Text();

     }

     int newSize = 0;

     // We always read one extra line, which lies outside the upper

     // split limit i.e. (end - 1)

     while (getFilePosition() <= end) {

      // read the value; this calls through many helper functions

       newSize = in.readLine(value, maxLineLength,

           Math.max(maxBytesToConsume(pos), maxLineLength));

       if (newSize == 0) {

         break;

       }

       pos += newSize; // advance pos

       if (newSize < maxLineLength) {

         break;

       }

       // line too long. try again

       LOG.info("Skipped line of size " + newSize + " at pos " +

                (pos - newSize));

     }

     if (newSize == 0) {

       key = null;

       value = null;

       return false;

     } else {

       return true;

     }

   }

7. Initialize the record reader input (for example LineRecordReader.initialize())

 input.initialize(split, mapperContext);

8. Call the run() method of the Mapper class:

 mapper.run(mapperContext);

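The structure of the Mapper class is as follows:

Mapper has an inner class Context. Its methods setup(), map(), and cleanup() are invoked from run(), which is implemented as: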

  public void run(Context context) throws IOException, InterruptedException {

    setup(context);

    try {

      while (context.nextKeyValue()) {

        map(context.getCurrentKey(), context.getCurrentValue(), context);

      }

    } finally {

      cleanup(context);

    }

  }
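The run() loop is a template method: the framework fixes the order setup, map once per record, cleanup, and user code fills in map. A minimal sketch of that control flow, assuming hypothetical cut-down classes MiniContext and MiniMapper that iterate over an in-memory list instead of a real split:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class RunLoopSketch {

    /** Hypothetical context: walks an in-memory list and numbers the records 0, 1, 2, ... */
    static final class MiniContext {
        private final Iterator<String> records;
        private long key = -1;
        private String value;
        final List<String> log = new ArrayList<>();

        MiniContext(List<String> input) {
            this.records = input.iterator();
        }

        boolean nextKeyValue() {
            if (!records.hasNext()) {
                return false;
            }
            key++;
            value = records.next();
            return true;
        }

        long getCurrentKey() { return key; }

        String getCurrentValue() { return value; }
    }

    /** Hypothetical mapper base class with the same run() shape as the listing above. */
    static abstract class MiniMapper {
        void setup(MiniContext context) { context.log.add("setup"); }

        abstract void map(long key, String value, MiniContext context);

        void cleanup(MiniContext context) { context.log.add("cleanup"); }

        final void run(MiniContext context) {
            setup(context);
            try {
                while (context.nextKeyValue()) {
                    map(context.getCurrentKey(), context.getCurrentValue(), context);
                }
            } finally {
                cleanup(context); // runs even if map() throws
            }
        }
    }

    /** Runs a mapper over two records and returns the order of the calls. */
    static List<String> demo() {
        MiniContext context = new MiniContext(Arrays.asList("a", "b"));
        MiniMapper mapper = new MiniMapper() {
            @Override
            void map(long key, String value, MiniContext c) {
                c.log.add("map " + key + "=" + value);
            }
        };
        mapper.run(context);
        return context.log;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [setup, map 0=a, map 1=b, cleanup]
    }
}
```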

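From MapTask's point of view, context.nextKeyValue() inside Mapper.run() works as shown in the following flowchart:

The LineRecordReader source was given above; a brief analysis follows.

LineRecordReader has three core fields: pos, key, and value. pos is the offset in the file of the record being read. Each call to nextKeyValue() reads one record from the split: it sets pos as the record's key, stores the record in value, and finally advances pos so that it becomes the offset of the next record. nextKeyValue() returns a boolean: true means a record was read successfully, false means the split has no more records.

Then map(context.getCurrentKey(), context.getCurrentValue(), context) is executed. context.getCurrentKey() calls LineRecordReader's getCurrentKey(), which simply returns the current key; context.getCurrentValue() works the same way.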
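The interplay of pos, key, and value, together with the two boundary rules visible in the source above (skip the first line when start != 0, and keep reading while the position is <= end), can be sketched on an in-memory byte array. OffsetLineReader and readAll are hypothetical names; the sketch assumes ASCII text with '\n' line endings and leaves out compression and maxLineLength handling.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class OffsetLineReaderSketch {

    /** Hypothetical, simplified reader for one split [start, start + length) of data. */
    static final class OffsetLineReader {
        private final byte[] data;
        private final long end;
        private long pos;
        private long key;
        private String value;

        OffsetLineReader(byte[] data, long start, long length) {
            this.data = data;
            this.end = start + length;
            this.pos = start;
            if (start != 0) {
                // Not the first split: the previous split already read this line.
                pos += lineLengthAt(pos);
            }
        }

        /** Bytes from 'from' up to and including the next '\n' (or end of data). */
        private int lineLengthAt(long from) {
            int i = (int) from;
            while (i < data.length && data[i] != '\n') {
                i++;
            }
            if (i < data.length) {
                i++; // include the newline itself
            }
            return i - (int) from;
        }

        boolean nextKeyValue() {
            if (pos > end) {
                return false; // mirrors: while (getFilePosition() <= end)
            }
            int consumed = lineLengthAt(pos);
            if (consumed == 0) {
                return false; // end of data
            }
            key = pos; // the record's offset is the key
            boolean endsWithNewline = data[(int) pos + consumed - 1] == '\n';
            int textLength = endsWithNewline ? consumed - 1 : consumed;
            value = new String(data, (int) pos, textLength, StandardCharsets.US_ASCII);
            pos += consumed; // offset of the next record
            return true;
        }

        long getCurrentKey() { return key; }

        String getCurrentValue() { return value; }
    }

    /** Cuts text into fixed-size splits and reads every split with its own reader. */
    static List<String> readAll(String text, long splitSize) {
        byte[] data = text.getBytes(StandardCharsets.US_ASCII);
        List<String> out = new ArrayList<>();
        for (long start = 0; start < data.length; start += splitSize) {
            long length = Math.min(splitSize, data.length - start);
            OffsetLineReader reader = new OffsetLineReader(data, start, length);
            while (reader.nextKeyValue()) {
                out.add(reader.getCurrentKey() + "=" + reader.getCurrentValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Split boundaries fall in the middle of lines, yet each line appears once.
        System.out.println(readAll("aa\nbbb\nc\ndddd\n", 5)); // [0=aa, 3=bbb, 7=c, 9=dddd]
    }
}
```

With a split size of 5 the boundaries cut through "bbb" and "dddd", but every line is still emitted exactly once: the split that contains a line's first byte reads it to the end, and the following split discards the partial line it starts in.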

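2) Based on the analysis above, the task logic of MapTask is shown in the following diagram:

The input split is produced by steps 2 and 3 above. RecordReader corresponds to the record reader object input from step 4. OutputCollector corresponds to the output collector object output from step 5. Step 8 corresponds to the Mapper in the diagram. Next we look at what happens after the Mapper, which means going inside the map() method of the Mapper class.

Users override Mapper's map method. WordCount is used as the example here; its overridden map method is: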

 public void map(Object key, Text value, Context context

                    ) throws IOException, InterruptedException {

      StringTokenizer itr = new StringTokenizer(value.toString());

      while (itr.hasMoreTokens()) {

        word.set(itr.nextToken());

        context.write(word, one);

      }

    }

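Focus on the last statement: context.write(word, one). We know that Context is an inner class of Mapper that extends MapContext, so what does this write method actually do? The whole call chain is traced below: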

context.write(word, one);

The Context class simply extends MapContext and has no write method of its own. MapContext has no write method either. Its parent class TaskInputOutputContext does, with this source:

public void write(KEYOUT key, VALUEOUT value

                    ) throws IOException, InterruptedException {

    output.write(key, value);

  }

output is a field of this class, declared as:

private RecordWriter<KEYOUT,VALUEOUT> output;

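RecordWriter is an abstract class with no fields, only the unimplemented abstract methods write and close. Context obtains the output field through inheritance, and that field must hold some concrete implementation of RecordWriter. Which one? To find out, look at where the context object comes from, namely runNewMapper() (step 8):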

 mapper.run(mapperContext);

mapper is a Mapper object, and its run method is called:

 public void run(Context context) throws IOException, InterruptedException {

    setup(context);

    try {

      while (context.nextKeyValue()) {

        map(context.getCurrentKey(), context.getCurrentValue(), context);

      }

    } finally {

      cleanup(context);

    }

  }

The mapperContext object is passed in as context, so context originates from mapperContext. We therefore need to see how mapperContext is created:

org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context mapperContext = null;

mapperContext = contextConstructor.newInstance(mapper, job, getTaskID(),input, output, committer,reporter, split);

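mapperContext is of type Context, and the second statement is equivalent to creating a new object with new Context(....). The constructors of the classes involved are shown in the following figure:

super calls the parent class constructor. Here is the code that creates mapperContext once more: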

mapperContext = contextConstructor.newInstance(mapper, job, getTaskID(),input, output, committer,reporter, split);

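job corresponds to conf, getTaskID() corresponds to taskid, input corresponds to reader, output corresponds to writer, and so on. Looking at the constructors of these three classes, it is not hard to see that the output object is ultimately assigned to the RecordWriter output field of TaskInputOutputContext. Now return to how this output is defined: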

// get an output object

      if (job.getNumReduceTasks() == 0) { // the job has no reduce tasks

         output =

           new NewDirectOutputCollector(taskContext, job, umbilical, reporter);

      } else {

        output = new NewOutputCollector(taskContext, job, umbilical, reporter);

      }

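So the output field of TaskInputOutputContext holds a NewOutputCollector (an implementation of the abstract class RecordWriter) whenever the job has at least one reduce task; with zero reduce tasks it holds a NewDirectOutputCollector instead.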
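The chain traced so far (a writer is chosen by reduce count, handed up through the constructors, and reached again through the inherited write method) can be condensed into a sketch. Every class name here is a hypothetical miniature of the Hadoop class it imitates; the two writers only record which one was called.

```java
import java.util.ArrayList;
import java.util.List;

public class WriteDispatchSketch {

    /** Miniature of the abstract RecordWriter. */
    static abstract class MiniRecordWriter<K, V> {
        abstract void write(K key, V value);
    }

    /** Stands in for NewDirectOutputCollector (no reduce tasks). */
    static final class DirectWriter<K, V> extends MiniRecordWriter<K, V> {
        private final List<String> sink;
        DirectWriter(List<String> sink) { this.sink = sink; }
        @Override
        void write(K key, V value) { sink.add("direct:" + key + "=" + value); }
    }

    /** Stands in for NewOutputCollector (at least one reduce task). */
    static final class BufferingWriter<K, V> extends MiniRecordWriter<K, V> {
        private final List<String> sink;
        BufferingWriter(List<String> sink) { this.sink = sink; }
        @Override
        void write(K key, V value) { sink.add("buffered:" + key + "=" + value); }
    }

    /** Only this class declares write(); it forwards to whatever writer it was given. */
    static class MiniTaskInputOutputContext<K, V> {
        private final MiniRecordWriter<K, V> output;
        MiniTaskInputOutputContext(MiniRecordWriter<K, V> output) { this.output = output; }
        public void write(K key, V value) { output.write(key, value); }
    }

    /** Adds nothing related to writing; passes output up with super(...). */
    static class MiniMapContext<K, V> extends MiniTaskInputOutputContext<K, V> {
        MiniMapContext(MiniRecordWriter<K, V> output) { super(output); }
    }

    /** Like Mapper.Context: inherits write() unchanged. */
    static final class MiniContext<K, V> extends MiniMapContext<K, V> {
        MiniContext(MiniRecordWriter<K, V> output) { super(output); }
    }

    /** Builds the context the way step 5 and step 6 do, then writes one pair. */
    static String demo(int numReduceTasks) {
        List<String> sink = new ArrayList<>();
        MiniRecordWriter<String, Integer> output;
        if (numReduceTasks == 0) {
            output = new DirectWriter<String, Integer>(sink);
        } else {
            output = new BufferingWriter<String, Integer>(sink);
        }
        MiniContext<String, Integer> context = new MiniContext<String, Integer>(output);
        context.write("hello", 1);
        return sink.get(0);
    }

    public static void main(String[] args) {
        System.out.println(demo(0)); // direct:hello=1
        System.out.println(demo(2)); // buffered:hello=1
    }
}
```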

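Context inherits this output field from TaskInputOutputContext and, more importantly, its write method. To summarize the Context class, the fields and methods known so far are as follows:

Back to the map method: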

 public void map(Object key, Text value, Context context

                    ) throws IOException, InterruptedException {

      StringTokenizer itr = new StringTokenizer(value.toString());

      while (itr.hasMoreTokens()) {

        word.set(itr.nextToken());

        context.write(word, one);

      }

    }

So context.write() actually calls the write method of the NewOutputCollector class. Part of that class's declaration:

private final MapOutputCollector<K,V> collector; // in-memory buffer for the map output

    private final org.apache.hadoop.mapreduce.Partitioner<K,V> partitioner; // the Partitioner used by the job (HashPartitioner by default)

    private final int partitions; // number of reduce tasks

    NewOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,

                       JobConf job,

                       TaskUmbilicalProtocol umbilical,

                       TaskReporter reporter

                       ) throws IOException, ClassNotFoundException {

      collector = new MapOutputBuffer<K,V>(umbilical, job, reporter); // create the collector

      partitions = jobContext.getNumReduceTasks(); // number of reduce tasks

      if (partitions > 0) {

        partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)

          ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job); // instantiate the job's Partitioner (HashPartitioner by default)

      } else {

        partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {

          @Override

          public int getPartition(K key, V value, int numPartitions) {

            return -1;

          }

        };

      }

    }

    public void write(K key, V value) throws IOException, InterruptedException {

      collector.collect(key, value,

                        partitioner.getPartition(key, value, partitions));

    }

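NewOutputCollector has three data members: collector, partitioner, and partitions, all initialized in the constructor. collector is a MapOutputBuffer object and is the core field of the class. partitioner is a Partitioner object that determines how this map's output is partitioned. Partitioning a key/value pair simply means calling getPartition(key, value, reduceNums), which returns an integer used as the pair's partition number. Users can supply their own partitioner class, which amounts to supplying their own getPartition(key, value, reduceNums). Partitioning only divides the map output into regions according to the key (numbered 0, 1, 2, 3, and so on), and each region is processed by one reduce task. The default partitioner is HashPartitioner: it ANDs the key's hash code with Integer.MAX_VALUE, which clears the sign bit so the result is non-negative, then takes the result modulo the job's number of reduce tasks and uses that remainder as the partition number of the key/value pair. The partition number therefore depends only on the key. Its definition is: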

public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */

  public int getPartition(K key, V value,

                          int numReduceTasks) {

    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

  }

}
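Why the mask matters is easiest to see with a key whose hash code is negative. The sketch below reuses the exact expression from getPartition in a standalone helper; partitionFor and naivePartitionFor are hypothetical names, and Integer keys are used because Integer.hashCode() returns the value itself.

```java
public class PartitionSketch {

    /** Same arithmetic as HashPartitioner.getPartition, without the Hadoop types. */
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** The tempting but wrong version: % keeps the sign of a negative hash code. */
    static int naivePartitionFor(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = -7; // Integer.hashCode() is the int value, so the hash is -7
        System.out.println(naivePartitionFor(key, 3)); // -1, not a valid partition
        System.out.println(partitionFor(key, 3));      // 1

        // Math.abs is not a safe alternative: it cannot represent -Integer.MIN_VALUE.
        System.out.println(Math.abs(Integer.MIN_VALUE) < 0);    // true
        System.out.println(partitionFor(Integer.MIN_VALUE, 3)); // 0
    }
}
```

Clearing the sign bit guarantees a result in the range 0 to numReduceTasks - 1 for every possible hash code, including Integer.MIN_VALUE.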

Next, NewOutputCollector's write method in turn calls collector.collect(key, value, partition):

public void write(K key, V value) throws IOException, InterruptedException {

      collector.collect(key, value,

                        partitioner.getPartition(key, value, partitions));

    }

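Without going inside collect, the first impression is that collector stores the key, the value, and the partition (the pair's partition number) together in the in-memory buffer. The next step is to analyze the spill process of the map phase.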
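That first impression can be written down as a toy model. This is an assumption-laden sketch, not MapOutputBuffer's actual implementation, which this article does not examine: MiniCollector and MiniOutputCollector are hypothetical, and a sorted map of lists stands in for the in-memory buffer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CollectSketch {

    /** Hypothetical collector: remembers each record under its partition number. */
    static final class MiniCollector {
        final Map<Integer, List<String>> byPartition = new TreeMap<>();

        void collect(String key, int value, int partition) {
            byPartition.computeIfAbsent(partition, p -> new ArrayList<>())
                       .add(key + "=" + value);
        }
    }

    /** Hypothetical miniature of NewOutputCollector.write(): partition, then collect. */
    static final class MiniOutputCollector {
        private final MiniCollector collector = new MiniCollector();
        private final int partitions;

        MiniOutputCollector(int partitions) {
            this.partitions = partitions;
        }

        void write(String key, int value) {
            int partition = (key.hashCode() & Integer.MAX_VALUE) % partitions;
            collector.collect(key, value, partition);
        }

        Map<Integer, List<String>> contents() {
            return collector.byPartition;
        }
    }

    /** Feeds WordCount-style output for the words a, b, a into two partitions. */
    static Map<Integer, List<String>> demo() {
        MiniOutputCollector output = new MiniOutputCollector(2);
        for (String word : new String[] {"a", "b", "a"}) {
            output.write(word, 1);
        }
        return output.contents();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // {0=[b=1], 1=[a=1, a=1]}
    }
}
```

Both occurrences of "a" land in the same partition, which is what lets a single reduce task see every value for a given key.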

Reference:

http://zheming.wang/hadoop-mapreduce-zhi-xing-liu-cheng-xiang-jie.html
