Studying the MapReduce InputFormat Process

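Yesterday I spent a few hours working through the first phase of MapReduce: how the data read from input files is mapped to key-value pairs, that is, the InputFormat process. The process is not especially hard, but it has a lot of details.

Few people study it closely, so today let me walk you through how this code works.

I also spent some time drawing a few structure diagrams to make things easier to follow.

One note up front: what I studied is mainly the old MapReduce API, the one under the mapred package.

In its most basic form InputFormat is just an interface, and all the Formats that appear later derive from it. Its structure is shown below; it contains only the 2 most important methods: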

public interface InputFormat<K, V> {

  /**
   * Logically split the set of input files for the job.
   *
   * <p>Each {@link InputSplit} is then assigned to an individual {@link Mapper}
   * for processing.</p>
   *
   * <p><i>Note</i>: The split is a <i>logical</i> split of the inputs and the
   * input files are not physically split into chunks. For e.g. a split could
   * be <i><input-file-path, start, offset></i> tuple.
   *
   * @param job job configuration.
   * @param numSplits the desired number of splits, a hint.
   * @return an array of {@link InputSplit}s for the job.
   */
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

  /**
   * Get the {@link RecordReader} for the given {@link InputSplit}.
   *
   * <p>It is the responsibility of the <code>RecordReader</code> to respect
   * record boundaries while processing the logical split to present a
   * record-oriented view to the individual task.</p>
   *
   * @param split the {@link InputSplit}
   * @param job the job that this split belongs to
   * @return a {@link RecordReader}
   */
  RecordReader<K, V> getRecordReader(InputSplit split,
                                     JobConf job,
                                     Reporter reporter) throws IOException;
}

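So the rest of this walkthrough centers on these 2 methods. The case we use most is reading input data from files, which is the FileInputFormat class. Its declaration is as follows:

public abstract class FileInputFormat<K, V> implements InputFormat<K, V>

Let's look at one of its main methods:

public InputSplit[] getSplits(JobConf job, int numSplits)

It returns an array of InputSplit objects. InputSplit is an abstract notion of one slice of the input. Its structure is as follows: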

public interface InputSplit extends Writable {

  /**
   * Get the total number of bytes in the data of the <code>InputSplit</code>.
   *
   * @return the number of bytes in the input split.
   * @throws IOException
   */
  long getLength() throws IOException;

  /**
   * Get the list of hostnames where the input split is located.
   *
   * @return list of hostnames where data of the <code>InputSplit</code> is
   *         located as an array of <code>String</code>s.
   * @throws IOException
   */
  String[] getLocations() throws IOException;
}

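It provides 2 data-related methods. The splits returned by getSplits are later handed to the RecordReader. Before getting into the getSplits method there is one more class to understand, FileStatus, which wraps a set of basic information about a file: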

public class FileStatus implements Writable, Comparable {

  private Path path;
  private long length;
  private boolean isdir;
  private short block_replication;
  private long blocksize;
  private long modification_time;
  private long access_time;
  private FsPermission permission;
  private String owner;
  private String group;
.....

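By this point you may be feeling a little dizzy, so here is a small class diagram I made:

[Figure: class diagram]

As you can see, to stay compatible with both the old and the new API, FileSplit extends the new abstract class InputSplit and at the same time implements the old interface-style InputSplit. Now let's look at the core of getSplits: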

/** Splits files returned by {@link #listStatus(JobConf)} when
   * they're too big.*/
  @SuppressWarnings("deprecation")
  public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
    // List the status of every input file
    FileStatus[] files = listStatus(job);

    // Save the number of input files in the job-conf
    job.setLong(NUM_INPUT_FILES, files.length);
    long totalSize = 0;                           // compute total size
    for (FileStatus file: files) {                // check we have valid files
      if (file.isDir()) {
        // A directory rather than a plain file: fail immediately
        throw new IOException("Not a file: "+ file.getPath());
      }
      totalSize += file.getLen();
    }

    // The size the user is aiming for: total size divided by the requested number of splits
    long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
    // The configured minimum split size
    long minSize = Math.max(job.getLong("mapred.min.split.size", 1),
                            minSplitSize);

    // generate splits
    // The list of FileSplits, with an initial capacity of numSplits
    ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
    NetworkTopology clusterMap = new NetworkTopology();
    for (FileStatus file: files) {
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job);
      long length = file.getLen();
      // Get the block locations of this file
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
      // The file is non-empty and splittable
      if ((length != 0) && isSplitable(fs, path)) {
        // Block size of this file
        long blockSize = file.getBlockSize();
        // Derive the final split size from the goal size, the minimum size and the block size
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        long bytesRemaining = length;
        // Keep cutting splits while the remaining bytes exceed SPLIT_SLOP (1.1) times the split size
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          // Hosts that can serve the data of this split
          String[] splitHosts = getSplitHosts(blkLocations,
              length-bytesRemaining, splitSize, clusterMap);
          // Add the FileSplit
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
              splitHosts));
          // One splitSize fewer bytes left to assign
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          // Add the leftover tail; at this point bytesRemaining is at most 1.1 times splitSize
          splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkLocations.length-1].getHosts()));
        }
      } else if (length != 0) {
        // Not splittable: add the whole file as a single split
        String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
        splits.add(new FileSplit(path, 0, length, splitHosts));
      } else {
        //Create empty hosts array for zero length files
        splits.add(new FileSplit(path, 0, length, new String[0]));
      }
    }
    // Finally return the FileSplits as an array
    LOG.debug("Total # of splits: " + splits.size());
    return splits.toArray(new FileSplit[splits.size()]);
  }

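The computeSplitSize method used in there deserves a closer look, because it covers several cases at once. In short, the split size is never smaller than the configured minimum, and otherwise it is the smaller of the goal size and the block size: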

protected long computeSplitSize(long goalSize, long minSize,
                                       long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }
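To make the formula concrete, here is a small stand-alone sketch. The formula itself is copied from computeSplitSize above; the class name SplitSizeDemo and all the input numbers are made up for illustration and do not come from Hadoop.

```java
// Stand-alone sketch: the formula is the one shown in computeSplitSize,
// the class name and the sample sizes are illustrative assumptions.
class SplitSizeDemo {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;

        // Goal larger than a block: the split is capped at the block size.
        System.out.println(computeSplitSize(512 * mb, 1, blockSize) / mb);        // 64

        // Goal smaller than a block: the goal wins, so splits are smaller than a block.
        System.out.println(computeSplitSize(16 * mb, 1, blockSize) / mb);         // 16

        // Minimum larger than both: the minimum wins, so splits span several blocks.
        System.out.println(computeSplitSize(16 * mb, 128 * mb, blockSize) / mb);  // 128
    }
}
```

In other words, asking for more splits can push the split size below the block size, while raising mapred.min.split.size is the way to get splits larger than a block.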

The flowchart for the process above looks like this:

[Figure: flowchart of getSplits]

There are 3 cases and 3 execution paths.
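The splitting loop can also be tried out without a cluster. The following is a minimal sketch that keeps only the offset arithmetic of getSplits; the class SplitLoopDemo and its split method are hypothetical names, host lookup is left out entirely, and because hosts are ignored the "not splittable" and "empty file" cases collapse into one branch here.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the offset arithmetic in getSplits, with host handling omitted.
// SplitLoopDemo and split(...) are illustrative names, not Hadoop API.
class SplitLoopDemo {
    static final double SPLIT_SLOP = 1.1; // same threshold as in the loop above

    // Returns {start, length} pairs for a single file.
    static List<long[]> split(long length, long splitSize, boolean splitable) {
        List<long[]> out = new ArrayList<>();
        if (length != 0 && splitable) {
            long bytesRemaining = length;
            // Cut full-size splits while more than 1.1 split sizes remain.
            while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                out.add(new long[] {length - bytesRemaining, splitSize});
                bytesRemaining -= splitSize;
            }
            // The tail becomes the last split; it may be up to 10% larger than splitSize.
            if (bytesRemaining != 0) {
                out.add(new long[] {length - bytesRemaining, bytesRemaining});
            }
        } else {
            // Unsplittable or empty file: one split covering the whole file.
            out.add(new long[] {0, length});
        }
        return out;
    }

    public static void main(String[] args) {
        // 130 bytes with a split size of 64: the 2 leftover bytes are folded
        // into the second split instead of becoming a tiny third one.
        for (long[] s : split(130, 64, true)) {
            System.out.println(s[0] + "+" + s[1]); // prints 0+64 then 64+66
        }
    }
}
```

The point of SPLIT_SLOP is visible in the output: a tail that is only slightly over one split size is not cut again, which avoids scheduling a map task for a handful of bytes.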

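Once getSplits has run, the data in the files has been divided into InputSplits, and the next step is for a RecordReader to pull the records out of a split one by one. In FileInputFormat getRecordReader is an abstract method that subclasses must implement, so I picked 2 typical subclasses here: SequenceFileInputFormat and TextInputFormat.

Their getRecordReader implementations are as follows: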

// SequenceFileInputFormat
public RecordReader<K, V> getRecordReader(InputSplit split,
                                      JobConf job, Reporter reporter)
    throws IOException {
    reporter.setStatus(split.toString());
    return new SequenceFileRecordReader<K, V>(job, (FileSplit) split);
  }

// TextInputFormat
public RecordReader<LongWritable, Text> getRecordReader(
                                          InputSplit genericSplit, JobConf job,
                                          Reporter reporter)
    throws IOException {
    reporter.setStatus(genericSplit.toString());
    return new LineRecordReader(job, (FileSplit) genericSplit);
  }

The only difference is LineRecordReader versus SequenceFileRecordReader, which tells us that the two formats read their data in different ways. Let's dig a little deeper:

/** An {@link RecordReader} for {@link SequenceFile}s. */
public class SequenceFileRecordReader<K, V> implements RecordReader<K, V> {

  private SequenceFile.Reader in;
  private long start;
  private long end;
  private boolean more = true;
  protected Configuration conf;

  public SequenceFileRecordReader(Configuration conf, FileSplit split)
    throws IOException {
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(conf);
    // Open a SequenceFile reader on the file
    this.in = new SequenceFile.Reader(fs, path, conf);
    this.end = split.getStart() + split.getLength();
    this.conf = conf;

    if (split.getStart() > in.getPosition())
      in.sync(split.getStart());                  // sync to start

    this.start = in.getPosition();
    more = start < end;
  }

  ......

  /**
   * Get the next key-value pair.
   */
  public synchronized boolean next(K key, V value) throws IOException {
    // Check whether any record is left
    if (!more) return false;
    long pos = in.getPosition();
    boolean remaining = (in.next(key) != null);
    if (remaining) {
      getCurrentValue(value);
    }
    if (pos >= end && in.syncSeen()) {
      more = false;
    } else {
      more = remaining;
    }
    return more;
  }

We can see that SequenceFileRecordReader reads from the input stream in one key-value pair at a time. The other reader is implemented as follows:

/**
 * Treats keys as offset in file and value as line.
 */
public class LineRecordReader implements RecordReader<LongWritable, Text> {
  private static final Log LOG
    = LogFactory.getLog(LineRecordReader.class.getName());

  private CompressionCodecFactory compressionCodecs = null;
  private long start;
  private long pos;
  private long end;
  private LineReader in;
  int maxLineLength;

  ....

  /** Read a line. */
  public synchronized boolean next(LongWritable key, Text value)
    throws IOException {

    while (pos < end) {
      // Set the key to the current byte offset
      key.set(pos);

      // Read one line starting at the current position into value
      int newSize = in.readLine(value, maxLineLength,
                                Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
                                         maxLineLength));
      if (newSize == 0) {
        return false;
      }
      pos += newSize;
      if (newSize < maxLineLength) {
        return true;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
    }

    return false;
  }

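This reader works line by line: the key is the byte offset at which the line starts, and the value is the line read from the input stream at that position.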
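Since a split boundary rarely falls exactly on a line break, it is worth seeing how offset keys and line values behave across two splits. The sketch below is a plain-Java simulation, not Hadoop code: LineSplitDemo and readSplit are hypothetical names, and the boundary rule it uses (a split that does not start at offset 0 backs up one byte and discards everything through the next newline, while the previous split reads its last line past its own end) is my assumption about the constructor code elided above, based on how the old API reader is generally described.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of offset/line records across split boundaries.
// Names and the boundary rule are illustrative assumptions, see the text above.
class LineSplitDemo {

    // Returns the records of the split [start, start + length) as "offset:line".
    static List<String> readSplit(byte[] data, long start, long length) {
        long end = start + length;
        int pos = (int) start;
        if (start != 0) {
            // Assumed rule: back up one byte and skip through the next newline,
            // because the previous split is responsible for that line.
            pos = (int) start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // first byte after the newline
        }
        List<String> records = new ArrayList<>();
        // Same condition as next(): a line is read if it STARTS before end,
        // even when it finishes after end.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            String line = new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8);
            records.add(lineStart + ":" + line);
            pos++; // step over the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 5 falls in the middle of "bbbb".
        System.out.println(readSplit(data, 0, 5)); // [0:aa, 3:bbbb]
        System.out.println(readSplit(data, 5, 6)); // [8:cc]
    }
}
```

Under these assumptions every line is delivered exactly once, and the key of each record is the line's byte offset in the whole file rather than its offset within the split.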

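With these 2 methods we obtain the new key-value pairs, which are then fed to the map operation that follows.

I have skipped many details of the InputFormat process here, but the overall flow is as described above.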
