Studying the InputFormat Stage of MapReduce

Reposted from: http://blog.csdn.net/androidlushangderen/article/details/41114259

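Yesterday I spent a few hours studying the first stage of MapReduce: the step at the very beginning that maps the data in the input files to key-value pairs, which is the job of InputFormat. The process is not especially hard, but it has plenty of details, and few people seem to study it closely. In this post I walk through how the code works; I also spent some time drawing a few structure diagrams to make it easier to follow. One note up front: I am looking at the old MapReduce API, the one in the mapred package.

At its most basic, InputFormat is just an interface, and the various Format classes that appear later all derive from it. Its structure is shown below; it contains only the two most important methods: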

public interface InputFormat<K, V> {
  /**
   * Logically split the set of input files for the job.
   *
   * Each {@link InputSplit} is then assigned to an individual {@link Mapper}
   * for processing.
   *
   * Note: The split is a logical split of the inputs and the
   * input files are not physically split into chunks. For e.g. a split could
   * be <input-file-path, start, offset> tuple.
   *
   * @param job job configuration.
   * @param numSplits the desired number of splits, a hint.
   * @return an array of {@link InputSplit}s for the job.
   */
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
  /**
   * Get the {@link RecordReader} for the given {@link InputSplit}.
   *
   * It is the responsibility of the <code>RecordReader</code> to respect
   * record boundaries while processing the logical split to present a
   * record-oriented view to the individual task.
   *
   * @param split the {@link InputSplit}
   * @param job the job that this split belongs to
   * @return a {@link RecordReader}
   */
  RecordReader<K, V> getRecordReader(InputSplit split,
                                     JobConf job,
                                     Reporter reporter) throws IOException;
}

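The analysis below is therefore organized around these two methods. The most common source of input is files, which is what the FileInputFormat class handles. Its declaration is: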

public abstract class FileInputFormat<K, V> implements InputFormat<K, V>

Let's look at one of its main methods:

public InputSplit[] getSplits(JobConf job, int numSplits)

It returns an array of InputSplit objects. InputSplit is an abstraction for one logical slice of the input, defined as follows:

public interface InputSplit extends Writable {
  /**
   * Get the total number of bytes in the data of the <code>InputSplit</code>.
   *
   * @return the number of bytes in the input split.
   * @throws IOException
   */
  long getLength() throws IOException;
  /**
   * Get the list of hostnames where the input split is located.
   *
   * @return list of hostnames where data of the <code>InputSplit</code> is
   * located as an array of <code>String</code>s.
   * @throws IOException
   */
  String[] getLocations() throws IOException;
}

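It provides two data-related methods. The splits returned by getSplits are later passed on to the RecordReader. Before digging into getSplits there is one more class to understand, FileStatus, which wraps a file's basic metadata: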

public class FileStatus implements Writable, Comparable {
  private Path path;
  private long length;
  private boolean isdir;
  private short block_replication;
  private long blocksize;
  private long modification_time;
  private long access_time;
  private FsPermission permission;
  private String owner;
  private String group;

  .....

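If your head is spinning by now, here is a small class diagram I drew of the relationships:

As you can see, to stay compatible with both the old and the new API, FileSplit extends the new abstract class InputSplit while also implementing the old interface form of InputSplit. Now let's look at the core of getSplits: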

/** Splits files returned by {@link #listStatus(JobConf)} when
 * they're too big.*/
@SuppressWarnings("deprecation")
public InputSplit[] getSplits(JobConf job, int numSplits)
  throws IOException {
  // Get the FileStatus of every input file
  FileStatus[] files = listStatus(job);
  // Save the number of input files in the job-conf
  job.setLong(NUM_INPUT_FILES, files.length);
  long totalSize = 0;
  // compute total size
  for (FileStatus file: files) { // check we have valid files
    if (file.isDir()) {
      // A directory rather than a plain file: throw an exception
      throw new IOException("Not a file: "+ file.getPath());
    }
    totalSize += file.getLen();
  }
  // Desired split size: total size divided by the requested number of splits
  long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
  // Minimum split size: the larger of the configured value and this format's own minimum
  long minSize = Math.max(job.getLong("mapred.min.split.size", 1),
                          minSplitSize);
  // generate splits
  // List that collects the FileSplits; numSplits is only its initial capacity
  ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
  NetworkTopology clusterMap = new NetworkTopology();
  for (FileStatus file: files) {
    Path path = file.getPath();
    FileSystem fs = path.getFileSystem(job);
    long length = file.getLen();
    // Get the block locations of this file
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
    // Case 1: the file is non-empty and can be split
    if ((length != 0) && isSplitable(fs, path)) {
      // Block size of this file
      long blockSize = file.getBlockSize();
      // Derive the final split size from the goal size, minimum size and block size
      long splitSize = computeSplitSize(goalSize, minSize, blockSize);
      long bytesRemaining = length;
      // Keep cutting while the remaining bytes exceed SPLIT_SLOP (1.1) times the split size
      while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
        // Hosts that hold the data for this split
        String[] splitHosts = getSplitHosts(blkLocations,
            length-bytesRemaining, splitSize, clusterMap);
        // Add the FileSplit
        splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
            splitHosts));
        // One splitSize fewer bytes left to assign
        bytesRemaining -= splitSize;
      }
      if (bytesRemaining != 0) {
        // Add the tail that was left over; it is at most 1.1 times splitSize
        splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
            blkLocations[blkLocations.length-1].getHosts()));
      }
    } else if (length != 0) {
      // Case 2: not splittable, so the whole file becomes one split
      String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
      splits.add(new FileSplit(path, 0, length, splitHosts));
    } else {
      //Create empty hosts array for zero length files
      splits.add(new FileSplit(path, 0, length, new String[0]));
    }
  }
  // Finally return the splits as a FileSplit array
  LOG.debug("Total # of splits: " + splits.size());
  return splits.toArray(new FileSplit[splits.size()]);
}

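The computeSplitSize method inside it deserves a closer look. It weighs three values, the goal size, the block size and the configured minimum: the split size is the smaller of the goal size and the block size, but never less than the minimum: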

protected long computeSplitSize(long goalSize, long minSize,
                                long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
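To see how the three inputs interact, here is a small standalone sketch that copies the formula into plain Java with no Hadoop dependency. The class name SplitSizeDemo and the sample sizes are made up for illustration; only the formula itself comes from the listing above.

```java
// Illustrative only: the formula from computeSplitSize, outside Hadoop.
class SplitSizeDemo {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        // Goal size below the block size: the goal size is used.
        System.out.println(computeSplitSize(32 * MB, 1, 64 * MB) / MB);        // 32
        // Goal size above the block size: capped at the block size.
        System.out.println(computeSplitSize(256 * MB, 1, 64 * MB) / MB);       // 64
        // Configured minimum above both: the minimum is used.
        System.out.println(computeSplitSize(32 * MB, 128 * MB, 64 * MB) / MB); // 128
    }
}
```

With the default minimum of 1 byte, a split is therefore never larger than one block; raising mapred.min.split.size above the block size is the only way, under this formula, to get splits that span several blocks.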

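The flowchart for the process above comes down to three cases, each with its own execution path: a non-empty, splittable file is cut into splits by the loop; a non-empty file that cannot be split becomes a single split covering the whole file; and an empty file becomes a single zero-length split with an empty hosts array.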
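The effect of the 1.1 slop factor in the splittable case is easiest to see by tracing the loop with small numbers. The sketch below mimics the loop for a single file, using plain {start, length} pairs instead of FileSplit and ignoring host selection; SplitLoopDemo and its split method are hypothetical names, and splitSize is assumed to be positive.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the splitting loop from getSplits for ONE splittable file,
// with long[] {start, length} pairs standing in for Hadoop's FileSplit.
class SplitLoopDemo {
    static final double SPLIT_SLOP = 1.1; // 10% slop, as in the listing

    static List<long[]> split(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        if (length == 0) {
            // An empty file still produces one zero-length split.
            splits.add(new long[] {0, 0});
            return splits;
        }
        long bytesRemaining = length;
        // Cut full-size splits while more than 1.1 split sizes remain.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // Whatever is left (at most 1.1 * splitSize) becomes the last split.
        if (bytesRemaining != 0) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        // 130 bytes with a split size of 64: prints 0+64 then 64+66.
        for (long[] s : split(130, 64)) {
            System.out.println(s[0] + "+" + s[1]);
        }
    }
}
```

A 130-byte file with a 64-byte split size yields two splits rather than three: after the first cut, 66 bytes remain, which is only about 1.03 split sizes, so the tail is kept whole instead of producing a tiny 2-byte split.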

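Once getSplits has finished, the input files have been logically divided into InputSplits, and the next step is for a RecordReader to pull out the records inside them one by one. getRecordReader is an abstract method in FileInputFormat and must be implemented by subclasses. I picked two typical subclasses here, SequenceFileInputFormat and TextInputFormat; their getRecordReader implementations are as follows: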

// In SequenceFileInputFormat
public RecordReader<K, V> getRecordReader(InputSplit split,
                                          JobConf job, Reporter reporter)
  throws IOException {
  reporter.setStatus(split.toString());
  return new SequenceFileRecordReader<K, V>(job, (FileSplit) split);
}

// In TextInputFormat
public RecordReader<LongWritable, Text> getRecordReader(
                                          InputSplit genericSplit, JobConf job,
                                          Reporter reporter)
  throws IOException {
  reporter.setStatus(genericSplit.toString());
  return new LineRecordReader(job, (FileSplit) genericSplit);
}

The only difference between them is LineRecordReader versus SequenceFileRecordReader, which suggests that the two formats read their data differently. Digging further:

/** An {@link RecordReader} for {@link SequenceFile}s. */
public class SequenceFileRecordReader<K, V> implements RecordReader<K, V> {
  private SequenceFile.Reader in;
  private long start;
  private long end;
  private boolean more = true;
  protected Configuration conf;
  public SequenceFileRecordReader(Configuration conf, FileSplit split)
    throws IOException {
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(conf);
    // Open a reader on the file through the file system
    this.in = new SequenceFile.Reader(fs, path, conf);
    this.end = split.getStart() + split.getLength();
    this.conf = conf;
    if (split.getStart() > in.getPosition())
      in.sync(split.getStart()); // sync to start
    this.start = in.getPosition();
    more = start < end;
  }
  ......
  /**
   * Read the next key-value pair.
   */
  public synchronized boolean next(K key, V value) throws IOException {
    // Stop if there are no more records
    if (!more) return false;
    long pos = in.getPosition();
    boolean remaining = (in.next(key) != null);
    if (remaining) {
      getCurrentValue(value);
    }
    if (pos >= end && in.syncSeen()) {
      more = false;
    } else {
      more = remaining;
    }
    return more;
  }

As we can see, SequenceFileRecordReader reads from the input stream in one key-value pair at a time. The other reader is implemented as follows:

/**
 * Treats keys as offset in file and value as line.
 */
public class LineRecordReader implements RecordReader<LongWritable, Text> {
  private static final Log LOG
    = LogFactory.getLog(LineRecordReader.class.getName());
  private CompressionCodecFactory compressionCodecs = null;
  private long start;
  private long pos;
  private long end;
  private LineReader in;
  int maxLineLength;
  ....
  /** Read a line. */
  public synchronized boolean next(LongWritable key, Text value)
    throws IOException {
    while (pos < end) {
      // The key is the current position in the file
      key.set(pos);
      // Read one line from that position into value
      int newSize = in.readLine(value, maxLineLength,
                                Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
                                         maxLineLength));
      if (newSize == 0) {
        return false;
      }
      pos += newSize;
      if (newSize < maxLineLength) {
        return true;
      }
      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
    }
    return false;
  }
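The offset-as-key behavior in the listing above can be reproduced on a small in-memory buffer. The sketch below is a simplification in plain Java, not Hadoop code: LineOffsetDemo and readLines are hypothetical names, it reads the whole buffer rather than one split, recognizes only '\n' as the line terminator, and has no maximum line length.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: maps the byte offset of each line start to the line text,
// the same (key, value) shape that LineRecordReader hands to the mapper.
class LineOffsetDemo {
    static Map<Long, String> readLines(byte[] data) {
        Map<Long, String> records = new LinkedHashMap<>();
        int pos = 0;
        while (pos < data.length) {
            // Find the end of the current line.
            int eol = pos;
            while (eol < data.length && data[eol] != '\n') {
                eol++;
            }
            // Key = offset of the line start, value = line without the newline.
            records.put((long) pos,
                        new String(data, pos, eol - pos, StandardCharsets.UTF_8));
            // Advance past the line and its newline, like pos += newSize.
            pos = eol + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "hello\nmap\nreduce\n".getBytes(StandardCharsets.UTF_8);
        // Prints: 0 hello, 6 map, 10 reduce (tab-separated, one per line).
        readLines(data).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

The keys are byte offsets, not line numbers, which is why consecutive keys differ by the length of the previous line plus its newline.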

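LineRecordReader reads the input stream line by line, using the current read position as the key and the line as the value. Either way, the result is a stream of key-value pairs that feed the subsequent map operation.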
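To make the hand-off to map concrete, here is a minimal sketch of the pull loop that consumes a reader in the old-API style: the caller allocates one key object and one value object, then passes them to next() repeatedly until it returns false. MiniReader, over and drain are hypothetical stand-ins written for this example, with StringBuilder playing the role of the reusable Writable objects; the real interface is RecordReader<K, V>.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the "fill the caller's key/value, return false at the end"
// contract of next(), modeled without any Hadoop classes.
class ReaderLoopDemo {
    // Hypothetical stand-in for the old-API RecordReader.
    interface MiniReader {
        boolean next(StringBuilder key, StringBuilder value);
    }

    // Builds a reader over fixed {key, value} pairs.
    static MiniReader over(String[][] pairs) {
        int[] index = {0};
        return (key, value) -> {
            if (index[0] >= pairs.length) {
                return false; // no more records
            }
            // Overwrite the caller's objects instead of allocating new ones.
            key.setLength(0);
            key.append(pairs[index[0]][0]);
            value.setLength(0);
            value.append(pairs[index[0]][1]);
            index[0]++;
            return true;
        };
    }

    // The consumer loop: one key and one value object reused for every record.
    static List<String> drain(MiniReader reader) {
        List<String> seen = new ArrayList<>();
        StringBuilder key = new StringBuilder();
        StringBuilder value = new StringBuilder();
        while (reader.next(key, value)) {
            seen.add(key + "=" + value); // a real job would call map(key, value, ...) here
        }
        return seen;
    }

    public static void main(String[] args) {
        String[][] pairs = {{"a", "1"}, {"b", "2"}};
        System.out.println(drain(over(pairs))); // [a=1, b=2]
    }
}
```

Because the same key and value objects are overwritten on every call, a consumer that wants to keep a record beyond the current iteration has to copy it, which is why the sketch stores a new String in the list.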

I have skipped many details of the InputFormat process; the overall flow is as described above.
