一. Preface

Last time we analyzed the client-side source code. This time we analyze the Mapper source code, to give a clearer picture of how the Hadoop framework works.

二. Code

The user-defined code is as follows:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<Object, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }
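
For context, here is a minimal driver sketch (not part of the original post; paths and the map-only setting are assumptions for illustration) showing how this Mapper gets registered on the client so that the framework can later look it up and instantiate it:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver, for illustration only.
    public class MyDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count (map only)");
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyMapper.class);     // stored in the job configuration; read back later by getMapperClass()
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(0);               // map-only job: MapTask then skips the sort phase entirely
        FileInputFormat.addInputPath(job, new Path(args[0]));    // placeholder input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // placeholder output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }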

The Mapper base class that it extends looks like this:

    public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

      /**
       * The <code>Context</code> passed on to the {@link Mapper} implementations.
       */
      public abstract class Context
          implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
      }

      /**
       * Called once at the beginning of the task.
       */
      protected void setup(Context context
                           ) throws IOException, InterruptedException {
        // NOTHING
      }

      /**
       * Called once for each key/value pair in the input split. Most applications
       * should override this, but the default is the identity function.
       */
      @SuppressWarnings("unchecked")
      protected void map(KEYIN key, VALUEIN value,
                         Context context) throws IOException, InterruptedException {
        context.write((KEYOUT) key, (VALUEOUT) value);
      }

      /**
       * Called once at the end of the task.
       */
      protected void cleanup(Context context
                             ) throws IOException, InterruptedException {
        // NOTHING
      }

      /**
       * Expert users can override this method for more complete control over the
       * execution of the Mapper.
       * @param context
       * @throws IOException
       */
      public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
          while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
          }
        } finally {
          cleanup(context);
        }
      }
    }

Analysis: we override the map method; run() calls map() through dynamic dispatch, so once the context is passed into run(), our version of map is invoked once for every key/value pair that nextKeyValue() delivers.
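
This is the classic template-method pattern. As a toy illustration (plain Java, not Hadoop code) of why overriding map is enough for the base-class loop to pick up our logic:

    // Toy sketch (not Hadoop code): the base class owns the loop, the subclass supplies
    // the per-record logic, and dynamic dispatch makes the loop call the override.
    abstract class Template {
      protected void handle(String record) { }          // default: do nothing (identity-style)

      public void run(java.util.Iterator<String> records) {
        while (records.hasNext()) {
          handle(records.next());                        // calls the subclass override
        }
      }
    }

    class UpperCaseTemplate extends Template {
      @Override
      protected void handle(String record) {
        System.out.println(record.toUpperCase());
      }
    }

    // Usage: new UpperCaseTemplate().run(java.util.Arrays.asList("a", "b").iterator());  // prints A then B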

MapTask source analysis:

The Container encapsulates a launch script; through a remote call it starts YarnChild. If the task assigned to it is a map task, a MapTask object is created via reflection and its run method is invoked.
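
"Created via reflection" means the concrete class is not hard-coded: its name is resolved at runtime and the object is built reflectively. As a rough, simplified sketch of what ReflectionUtils.newInstance boils down to (an assumption-level illustration only; the real code path in YarnChild/MapTask also deserializes the task and injects the Configuration):

    // Simplified sketch of reflective instantiation, illustration only.
    static <T> T instantiate(Class<T> clazz) throws Exception {
      java.lang.reflect.Constructor<T> ctor = clazz.getDeclaredConstructor(); // no-arg constructor
      ctor.setAccessible(true);                                               // allow non-public constructors
      return ctor.newInstance();
    }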

The run method of MapTask:

    public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
        throws IOException, ClassNotFoundException, InterruptedException {
      this.umbilical = umbilical;

      if (isMapTask()) {
        // If there are no reducers then there won't be any sort. Hence the map
        // phase will govern the entire attempt's progress.
        if (conf.getNumReduceTasks() == 0) {          // no reduce phase
          mapPhase = getProgress().addPhase("map", 1.0f);
        } else {
          // If there are reducers then the entire attempt's progress will be
          // split between the map phase (67%) and the sort phase (33%).
          mapPhase = getProgress().addPhase("map", 0.667f);
          sortPhase = getProgress().addPhase("sort", 0.333f);  // sorting is only needed when there is a reduce phase; with no reduce tasks there is nothing to sort
        }
      }
      // ... (code omitted in this excerpt) ...
      if (useNewApi) {
        runNewMapper(job, splitMetaInfo, umbilical, reporter);  // use the new API
      } else {
        runOldMapper(job, splitMetaInfo, umbilical, reporter);
      }
      done(umbilical, reporter);
    }

Analysis of runNewMapper:

    private <INKEY,INVALUE,OUTKEY,OUTVALUE>
    void runNewMapper(final JobConf job,
                      final TaskSplitIndex splitIndex,
                      final TaskUmbilicalProtocol umbilical,
                      TaskReporter reporter
                      ) throws IOException, ClassNotFoundException,
                               InterruptedException {
      // make a task context so we can get the classes
      org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
        new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,   // the job we configured on the client
                                                                    getTaskID(),
                                                                    reporter);  // create the task attempt context
      // make a mapper
      org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
        (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
          ReflectionUtils.newInstance(taskContext.getMapperClass(), job);  // instantiate the user-defined Mapper via reflection -- see Analysis 1
      // make the input format
      org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
        (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
          ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);  // instantiate the InputFormat via reflection -- see Analysis 2
      // rebuild the input split
      org.apache.hadoop.mapreduce.InputSplit split = null;
      split = getSplitDetails(new Path(splitIndex.getSplitLocation()),  // each split entry corresponds to one MapTask; a split records four things (file, offset, length, locations)
          splitIndex.getStartOffset());
      LOG.info("Processing split: " + split);

      org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
        new NewTrackingRecordReader<INKEY,INVALUE>   // see Analysis 3
          (split, inputFormat, reporter, taskContext);  // the InputFormat and split prepared above feed the input side: the stream is opened and read as text, line by line

      job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
      org.apache.hadoop.mapreduce.RecordWriter output = null;

      // get an output object
      if (job.getNumReduceTasks() == 0) {
        output =
          new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
      } else {
        output = new NewOutputCollector(taskContext, job, umbilical, reporter);
      }

      org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
      mapContext =
        new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(),  // see Analysis 4
            input, output,  // the map context wraps both input and output, so values are fetched through it; Context.getCurrentKey() in Mapper therefore ultimately reads from the input's LineRecordReader
            committer,
            reporter, split);

      org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
          mapperContext =
            new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
                mapContext);
      try {
        input.initialize(split, mapperContext);   // initialize the input -- see Analysis 5
        mapper.run(mapperContext);                // run the Mapper -- see Analysis 6
        mapPhase.complete();
        setPhase(TaskStatus.Phase.SORT);
        statusUpdate(umbilical);
        input.close();
        input = null;
        output.close(mapperContext);              // close the output
        output = null;
      } finally {
        closeQuietly(input);
        closeQuietly(output, mapperContext);
      }
    }

Analysis 1 source (getMapperClass):

    @SuppressWarnings("unchecked")
    public Class<? extends Mapper<?,?,?,?>> getMapperClass()
        throws ClassNotFoundException {
      return (Class<? extends Mapper<?,?,?,?>>)
        conf.getClass(MAP_CLASS_ATTR, Mapper.class);  // if the user configured a Mapper, take it from the configuration; otherwise fall back to the default
    }

Analysis 2 source (getInputFormatClass):

    public Class<? extends InputFormat<?,?>> getInputFormatClass()
        throws ClassNotFoundException {
      return (Class<? extends InputFormat<?,?>>)
        conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);  // if the user configured an InputFormat, take it; otherwise fall back to TextInputFormat
    }

Conclusion: the framework uses TextInputFormat by default.

Note: the inheritance chain is InputFormat > FileInputFormat > TextInputFormat.
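
If the default does not fit, the driver can register a different implementation; a minimal sketch (assuming a driver context like the one sketched earlier, with KeyValueTextInputFormat as just one built-in alternative):

    // Sketch: overriding the default input format on the driver side (assumed driver context).
    job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.class);
    // Without this call, getInputFormatClass() falls back to TextInputFormat, as shown above.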

Analysis 3 source (NewTrackingRecordReader):

    static class NewTrackingRecordReader<K,V>
        extends org.apache.hadoop.mapreduce.RecordReader<K,V> {
      private final org.apache.hadoop.mapreduce.RecordReader<K,V> real;
      private final org.apache.hadoop.mapreduce.Counter inputRecordCounter;
      private final org.apache.hadoop.mapreduce.Counter fileInputByteCounter;
      private final TaskReporter reporter;
      private final List<Statistics> fsStats;

      NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
          org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
          TaskReporter reporter,
          org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
          throws InterruptedException, IOException {
        this.reporter = reporter;
        this.inputRecordCounter = reporter
            .getCounter(TaskCounter.MAP_INPUT_RECORDS);
        this.fileInputByteCounter = reporter
            .getCounter(FileInputFormatCounter.BYTES_READ);

        List<Statistics> matchedStats = null;
        if (split instanceof org.apache.hadoop.mapreduce.lib.input.FileSplit) {
          matchedStats = getFsStatistics(((org.apache.hadoop.mapreduce.lib.input.FileSplit) split)
              .getPath(), taskContext.getConfiguration());
        }
        fsStats = matchedStats;

        long bytesInPrev = getInputBytes(fsStats);
        this.real = inputFormat.createRecordReader(split, taskContext);  // see Analysis 3.1: real is a LineRecordReader
        long bytesInCurr = getInputBytes(fsStats);
        fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
      }

Analysis 3.1 source (TextInputFormat.createRecordReader):

    public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

      @Override
      public RecordReader<LongWritable, Text>
          createRecordReader(InputSplit split,
                             TaskAttemptContext context) {
        String delimiter = context.getConfiguration().get(
            "textinputformat.record.delimiter");
        byte[] recordDelimiterBytes = null;
        if (null != delimiter)
          recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
        return new LineRecordReader(recordDelimiterBytes);  // returns a LineRecordReader, a line-oriented reader
      }
    }
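
Since createRecordReader reads the key "textinputformat.record.delimiter" from the configuration, the record separator can be changed on the driver side. A small sketch (the blank-line delimiter is just an example):

    // Sketch: custom record delimiter picked up by TextInputFormat.createRecordReader above.
    conf.set("textinputformat.record.delimiter", "\n\n");  // e.g. treat blank-line-separated blocks as one record
    // If the key is not set, LineRecordReader uses its default newline handling.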

Analysis 4 source (MapContextImpl):

    public MapContextImpl(Configuration conf, TaskAttemptID taskid,
        RecordReader<KEYIN,VALUEIN> reader,   // the reader is the input object
        RecordWriter<KEYOUT,VALUEOUT> writer,
        OutputCommitter committer,
        StatusReporter reporter,
        InputSplit split) {
      super(conf, taskid, writer, committer, reporter);
      this.reader = reader;
      this.split = split;
    }

    /**
     * Get the input split for this map.
     */
    public InputSplit getInputSplit() {
      return split;
    }

    @Override
    public KEYIN getCurrentKey() throws IOException, InterruptedException {
      return reader.getCurrentKey();  // delegates to the input reader, which wraps a LineRecordReader
    }

    @Override
    public VALUEIN getCurrentValue() throws IOException, InterruptedException {
      return reader.getCurrentValue();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
      return reader.nextKeyValue();
    }

Analysis 5 source (LineRecordReader.initialize):

    public void initialize(InputSplit genericSplit,
        TaskAttemptContext context) throws IOException {
      FileSplit split = (FileSplit) genericSplit;
      Configuration job = context.getConfiguration();
      this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
      start = split.getStart();            // start offset of this split
      end = start + split.getLength();     // end offset of this split
      final Path file = split.getPath();

      // open the file and seek to the start of the split
      final FileSystem fs = file.getFileSystem(job);
      fileIn = fs.open(file);

      CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
      if (null != codec) {
        isCompressedInput = true;
        decompressor = CodecPool.getDecompressor(codec);
        if (codec instanceof SplittableCompressionCodec) {
          final SplitCompressionInputStream cIn =
              ((SplittableCompressionCodec)codec).createInputStream(
                  fileIn, decompressor, start, end,
                  SplittableCompressionCodec.READ_MODE.BYBLOCK);
          in = new CompressedSplitLineReader(cIn, job,
              this.recordDelimiterBytes);
          start = cIn.getAdjustedStart();
          end = cIn.getAdjustedEnd();
          filePosition = cIn;
        } else {
          in = new SplitLineReader(codec.createInputStream(fileIn,
              decompressor), job, this.recordDelimiterBytes);
          filePosition = fileIn;
        }
      } else {
        fileIn.seek(start);   // many mappers run in parallel; each seeks to its own split
        in = new UncompressedSplitLineReader(
            fileIn, job, this.recordDelimiterBytes, split.getLength());
        filePosition = fileIn;
      }
      // If this is not the first split, we always throw away first record
      // because we always (except the last split) read one extra line in
      // next() method.
      if (start != 0) {   // every split except the first
        start += in.readLine(new Text(), 0, maxBytesToConsume(start));
        // A non-first split reads one line into a throwaway Text, adds its length to start,
        // and therefore begins at the next full line. Since every split (except the last)
        // also reads one extra line past its end, records that straddle split boundaries
        // are reassembled exactly once.
      }
      this.pos = start;
    }
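
To make the boundary rule concrete, here is a toy simulation (plain Java, not Hadoop code) under the stated assumptions: a non-first split discards its partial first line, and every split keeps reading until it has finished the line that starts at or before its end offset, so each line is produced exactly once:

    import java.util.ArrayList;
    import java.util.List;

    // Toy simulation of the split-boundary rule; offsets and data are made up for illustration.
    public class SplitBoundaryDemo {
      static List<String> readSplit(String data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {                            // not the first split: skip the partial first line
          int nl = data.indexOf('\n', pos);
          pos = (nl == -1) ? data.length() : nl + 1;
        }
        while (pos <= end && pos < data.length()) {  // a line starting at or before 'end' is read in full
          int nl = data.indexOf('\n', pos);
          int lineEnd = (nl == -1) ? data.length() : nl;
          lines.add(data.substring(pos, lineEnd));
          pos = (nl == -1) ? data.length() : nl + 1;
        }
        return lines;
      }

      public static void main(String[] args) {
        String data = "alpha\nbravo\ncharlie\ndelta\n";
        int mid = 8;                                               // boundary falls inside "bravo"
        System.out.println(readSplit(data, 0, mid));               // [alpha, bravo]
        System.out.println(readSplit(data, mid, data.length()));   // [charlie, delta]
      }
    }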

Analysis 6 is simply the run method of Mapper:

    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      try {
        while (context.nextKeyValue()) {   // see Analysis 6.1
          map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
      } finally {
        cleanup(context);
      }
    }

Analysis 6.1: tracing the call shows that context.nextKeyValue() ends up in LineRecordReader's nextKeyValue method:

    public boolean nextKeyValue() throws IOException {
      if (key == null) {
        key = new LongWritable();   // the key holds the byte offset
      }
      key.set(pos);                 // current offset
      if (value == null) {
        value = new Text();         // default value type
      }
      int newSize = 0;
      // We always read one extra line, which lies outside the upper
      // split limit i.e. (end - 1)
      while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
        if (pos == 0) {
          newSize = skipUtfByteOrderMark();
        } else {
          newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));  // here the actual line is read into value
          pos += newSize;
        }

        if ((newSize == 0) || (newSize < maxLineLength)) {
          break;
        }

        // line too long. try again
        LOG.info("Skipped line of size " + newSize + " at pos " +
            (pos - newSize));
      }
      if (newSize == 0) {
        key = null;
        value = null;
        return false;
      } else {
        return true;
      }
    }

    @Override  // nextKeyValue() updates key and value in place, so the getters simply hand back the same references
    public LongWritable getCurrentKey() {
      return key;
    }

    @Override
    public Text getCurrentValue() {
      return value;
    }
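
One practical consequence of this reference-style access (my own note, not from the original post): the framework keeps reusing the same key/value objects between calls to nextKeyValue(), so a Mapper that wants to keep records around must copy them. A hedged sketch (the caching mapper below is hypothetical):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper illustrating the object-reuse pitfall implied by the getters above.
    public class CachingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final List<Text> seen = new ArrayList<>();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // seen.add(value);          // WRONG: every element would end up pointing at one reused Text
        seen.add(new Text(value));   // copy the current contents before caching
        context.write(value, new IntWritable(1));
      }
    }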

To be continued. You are welcome to follow my WeChat official account, LHWorld.
