hadoop FileSplit

package org.apache.hadoop.mapreduce.lib.input;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.classification.InterfaceStability.Evolving;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.SplitLocationInfo;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

/** A section of an input file. Returned by {@link

* InputFormat#getSplits(JobContext)} and passed to

* {@link InputFormat#createRecordReader(InputSplit,TaskAttemptContext)}.

*

* Extends the abstract class InputSplit and implements the Writable interface.

*/

@InterfaceAudience.Public

@InterfaceStability.Stable

public class FileSplit extends InputSplit implements Writable {

private Path file;

private long start;

private long length;

private String[] hosts;

private SplitLocationInfo[] hostInfos;

public FileSplit() {}

/** Constructs a split with host information

*

* @param file the file name

* @param start the position of the first byte in the file to process

* @param length the number of bytes in the file to process (the length of the split)

* @param hosts the list of hosts containing the block, possibly null

*/

public FileSplit(Path file, long start, long length, String[] hosts) {

this.file = file;

this.start = start;

this.length = length;

this.hosts = hosts;

}

/** Constructs a split with host and cached-blocks information

*

* @param file the file name

* @param start the position of the first byte in the file to process

* @param length the number of bytes in the file to process (the length of the split)

* @param hosts the list of hosts containing the block

* @param inMemoryHosts the list of hosts containing the block in memory

*/

public FileSplit(Path file, long start, long length, String[] hosts,

String[] inMemoryHosts) {

this(file, start, length, hosts);

hostInfos = new SplitLocationInfo[hosts.length];

for (int i = 0; i < hosts.length; i++) {

// because N will be tiny, scanning is probably faster than a HashSet

boolean inMemory = false;

for (String inMemoryHost : inMemoryHosts) {

if (inMemoryHost.equals(hosts[i])) {

inMemory = true;

break;

}

}

hostInfos[i] = new SplitLocationInfo(hosts[i], inMemory);

}

}

/** The file containing this split's data. */

public Path getPath() { return file; }

/** The position of the first byte in the file to process. */

public long getStart() { return start; }

/** The number of bytes in the file to process. */

@Override

public long getLength() { return length; }

@Override

public String toString() { return file + ":" + start + "+" + length; }

////////////////////////////////////////////

// Writable methods

////////////////////////////////////////////

@Override

public void write(DataOutput out) throws IOException {

Text.writeString(out, file.toString());

out.writeLong(start);

out.writeLong(length);

}

@Override

public void readFields(DataInput in) throws IOException {

file = new Path(Text.readString(in));

start = in.readLong();

length = in.readLong();

hosts = null;

}

@Override

public String[] getLocations() throws IOException {

if (this.hosts == null) {

return new String[]{};

} else {

return this.hosts;

}

}

@Override

@Evolving

public SplitLocationInfo[] getLocationInfo() throws IOException {

return hostInfos;

}

}
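
The Writable methods above write only the path, start offset and length. The host list is not serialized, and readFields resets hosts to null. The following is a minimal sketch of that round trip, not Hadoop code. MiniSplit and MiniSplitDemo are hypothetical stand-ins so the example runs without Hadoop on the classpath, and writeUTF/readUTF are assumed substitutes for Text.writeString/Text.readString.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical demo class; it only mirrors the shape of FileSplit's Writable methods. */
public class MiniSplitDemo {

    /** Simplified stand-in for FileSplit (a String replaces Path). */
    static final class MiniSplit {
        private String file;
        private long start;
        private long length;
        private String[] hosts;

        MiniSplit() {}

        MiniSplit(String file, long start, long length, String[] hosts) {
            this.file = file;
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }

        /** Serializes path, start and length; hosts are deliberately left out. */
        void write(DataOutput out) throws IOException {
            out.writeUTF(file); // assumption: stands in for Text.writeString
            out.writeLong(start);
            out.writeLong(length);
        }

        /** Restores the three serialized fields and clears the host list. */
        void readFields(DataInput in) throws IOException {
            file = in.readUTF(); // assumption: stands in for Text.readString
            start = in.readLong();
            length = in.readLong();
            hosts = null;
        }

        /** Returns an empty array when no hosts are known, as getLocations() does. */
        String[] getLocations() {
            return hosts == null ? new String[0] : hosts;
        }

        @Override
        public String toString() {
            return file + ":" + start + "+" + length;
        }
    }

    /** Writes a split to a byte buffer and reads it back into a fresh instance. */
    static MiniSplit roundTrip(MiniSplit original) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            original.write(out);
        }
        MiniSplit copy = new MiniSplit();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            copy.readFields(in);
        }
        return copy;
    }

    public static void main(String[] args) throws IOException {
        MiniSplit original = new MiniSplit(
            "hdfs://nn/data/part-0", 0L, 134217728L, new String[] {"host1", "host2"});
        MiniSplit copy = roundTrip(original);
        System.out.println(copy);                       // hdfs://nn/data/part-0:0+134217728
        System.out.println(copy.getLocations().length); // 0
    }
}
```

If the real class behaves like this sketch, a task that receives a deserialized split should expect getLocations() to return an empty array. In the listing above, the locations are only populated on the side that constructs the split with a host list.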
