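InputFormat has two important methods: (1) List<InputSplit> getSplits(JobContext job); (2) RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context), shown here with the type parameters used by TextInputFormat. These two methods correspond to the two functions described above.
First, the inheritance chain: (1) public class TextInputFormat extends FileInputFormat; (2) public abstract class FileInputFormat<K, V> extends InputFormat; (3) public abstract class InputFormat. The top-level parent InputFormat has only the two unimplemented abstract methods getSplits and createRecordReader, whereas FileInputFormat contains many more methods, as shown in the figure below. The source of TextInputFormat is: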
/** An {@link InputFormat} for plain text files. Files are broken into lines.
 * Either linefeed or carriage-return are used to signal end of line. Keys are
 * the position in the file, and values are the line of text.. */
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    return new LineRecordReader();
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}
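The isSplitable rule above boils down to "a file is splittable only if no compression codec matches it". The following is a minimal, Hadoop-free sketch of that idea. It assumes the codec is chosen by file-name suffix; the suffix list and the names SplitableSketch and findCodecSuffix are hypothetical and only stand in for what CompressionCodecFactory.getCodec does.

```java
import java.util.Arrays;
import java.util.List;

public class SplitableSketch {
    // Hypothetical suffix list, for illustration only; the real lookup is
    // performed by CompressionCodecFactory from the configured codecs.
    private static final List<String> COMPRESSED_SUFFIXES =
        Arrays.asList(".gz", ".bz2", ".deflate");

    /** Returns the matching suffix, or null when no codec applies. */
    public static String findCodecSuffix(String fileName) {
        for (String suffix : COMPRESSED_SUFFIXES) {
            if (fileName.endsWith(suffix)) {
                return suffix;
            }
        }
        return null;
    }

    /** Same decision as TextInputFormat.isSplitable: splittable iff no codec was found. */
    public static boolean isSplitable(String fileName) {
        return findCodecSuffix(fileName) == null;
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("logs/part-00000.txt")); // true
        System.out.println(isSplitable("logs/part-00000.gz"));  // false
    }
}
```

A file that is not splittable ends up as a single split covering the whole file, no matter how many blocks it occupies.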
/**
 * Generate the list of files and make them into FileSplits.
 */
public List<InputSplit> getSplits(JobContext job
                                  ) throws IOException {
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  long maxSize = getMaxSplitSize(job);  // Long.MAX_VALUE by default

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  for (FileStatus file : files) {
    Path path = file.getPath();
    FileSystem fs = path.getFileSystem(job.getConfiguration());
    long length = file.getLen();  // length of the whole file
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
    if ((length != 0) && isSplitable(job, path)) {  // true by default, false for compressed files
      long blockSize = file.getBlockSize();  // e.g. 64 MB = 67108864 bytes
      // split size = Math.max(minSize, Math.min(maxSize, blockSize))
      long splitSize = computeSplitSize(blockSize, minSize, maxSize);

      long bytesRemaining = length;
      while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
        int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
        splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                                 blkLocations[blkIndex].getHosts()));  // hosts are host names; names would be IP:port
        bytesRemaining -= splitSize;  // bytes still to be assigned
      }

      if (bytesRemaining != 0) {  // the last split
        splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                   blkLocations[blkLocations.length-1].getHosts()));
      }
    } else if (length != 0) {  // isSplitable(job, path) returned false
      splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
    } else {
      //Create empty hosts array for zero length files
      splits.add(new FileSplit(path, 0, length, new String[0]));
    }
  }

  // Save the number of input files in the job-conf
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());

  LOG.debug("Total # of splits: " + splits.size());
  return splits;
}
(1) minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)): getFormatMinSplitSize() returns 1, and getMinSplitSize(job) reads the size configured under "mapred.min.split.size", which defaults to 1;
(2) maxSize = getMaxSplitSize(job): getMaxSplitSize(job) reads "mapred.max.split.size", which defaults to Long.MAX_VALUE, the largest value of type long;
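(5) Next, if the file length is non-zero and the file is splittable (isSplitable returns true): get the block size (64 MB by default here) and compute the split size splitSize via computeSplitSize(blockSize, minSize, maxSize), which evaluates Math.max(minSize, Math.min(maxSize, blockSize)); then set bytesRemaining (the number of bytes not yet assigned to a split) to the length of the whole file. A. Another InputSplit is cut off only while bytesRemaining exceeds splitSize by a certain margin: ((double) bytesRemaining)/splitSize > SPLIT_SLOP (1.1); B. in that case getBlockIndex(blkLocations, length-bytesRemaining) is called to obtain the block index; its second argument is the offset of the new split within the whole file, which starts at 0 and grows with every loop iteration. The method's code is as follows: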
protected int getBlockIndex(BlockLocation[] blkLocations,
                            long offset) {
  for (int i = 0 ; i < blkLocations.length; i++) {
    // is the offset inside this block?
    if ((blkLocations[i].getOffset() <= offset) &&
        (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())){
      return i;
    }
  }
  BlockLocation last = blkLocations[blkLocations.length -1];
  long fileLength = last.getOffset() + last.getLength() -1;
  throw new IllegalArgumentException("Offset " + offset +
                                     " is outside of file (0.." +
                                     fileLength + ")");
}
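The if condition in this method selects the index of the block that contains the given offset; C. the hosts of that block, together with the file path, the start offset, and the split size splitSize, are wrapped in an InputSplit (a FileSplit) and added to List<InputSplit> splits; D. bytesRemaining -= splitSize updates the number of remaining bytes; E. control returns to A for another test: if the condition still holds, steps B, C, and D repeat, otherwise the loop exits. If bytesRemaining is still non-zero after the loop, some data has not been assigned yet, so the remaining bytes are added to splits as one final split that uses the hosts of the last block.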
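(7) If the file length is 0, splits.add(new FileSplit(path, 0, length, new String[0])) adds a split with no block hosts whose start offset and length are both 0;
(8) The number of files under the input directory is stored under "mapreduce.input.num.files" so that it can be checked later.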
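The split-size arithmetic in the steps above can be tried out without a cluster. The following is a minimal sketch, assuming a single splittable file; it reproduces only computeSplitSize and the SPLIT_SLOP loop. The class SplitSizeSketch and the method planSplits are hypothetical names, and each split is represented as a plain {offset, length} pair rather than a FileSplit.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSizeSketch {
    public static final double SPLIT_SLOP = 1.1; // same threshold as in getSplits

    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Returns {offset, length} pairs for one splittable file of the given length. */
    public static List<long[]> planSplits(long length, long blockSize,
                                          long minSize, long maxSize) {
        List<long[]> splits = new ArrayList<>();
        if (length == 0) {
            splits.add(new long[] {0, 0}); // zero-length file: one empty split
            return splits;
        }
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long bytesRemaining = length;
        // Cut full-size splits while more than 1.1 split sizes remain.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // Whatever is left (up to 1.1 split sizes) becomes the last split.
        if (bytesRemaining != 0) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 200 MB file, 64 MB blocks: three 64 MB splits plus one 8 MB split.
        for (long[] s : planSplits(200 * mb, 64 * mb, 1, Long.MAX_VALUE)) {
            System.out.println("offset=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
        // 70 MB file: 70/64 is below 1.1, so the whole file is a single split.
        System.out.println(planSplits(70 * mb, 64 * mb, 1, Long.MAX_VALUE).size()); // 1
    }
}
```

The 70 MB case shows why SPLIT_SLOP exists: a short tail of less than 10% of a split is attached to the preceding split instead of producing a tiny extra map task.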
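The RecordReader used by TextInputFormat is org.apache.hadoop.mapreduce.lib.input.LineRecordReader. We introduced LineRecordReader in the article on the source-level analysis of how a MapReduce MapTask runs: its initialize method mainly obtains the start and end positions of the split and opens the input stream (a decompressing stream if the file is compressed). The mapper's key/value pair is filled in by LineRecordReader.nextKeyValue(), which sets key to the offset in the file and reads value through LineReader.readLine(value, maxLineLength, Math.max((int)Math.min(Integer.MAX_VALUE, end-pos), maxLineLength)). That method reads one line of data into value; its code is as follows: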
/**
 * Read one line from the InputStream into the given Text. A line
 * can be terminated by one of the following: '\n' (LF) , '\r' (CR),
 * or '\r\n' (CR+LF). EOF also terminates an otherwise unterminated
 * line.
 *
 * @param str the object to store the given line (without newline)
 * @param maxLineLength the maximum number of bytes to store into str;
 *  the rest of the line is silently discarded.
 * @param maxBytesToConsume the maximum number of bytes to consume
 *  in this call. This is only a hint, because if the line cross
 *  this threshold, we allow it to happen. It can overshoot
 *  potentially by as much as one buffer length.
 *
 * @return the number of bytes read including the (longest) newline
 * found.
 *
 * @throws IOException if the underlying stream throws
 */
public int readLine(Text str, int maxLineLength,
                    int maxBytesToConsume) throws IOException {
  /* We're reading data from in, but the head of the stream may be
   * already buffered in buffer, so we have several cases:
   * 1. No newline characters are in the buffer, so we need to copy
   *    everything and read another buffer from the stream.
   * 2. An unambiguously terminated line is in buffer, so we just
   *    copy to str.
   * 3. Ambiguously terminated line is in buffer, i.e. buffer ends
   *    in CR. In this case we copy everything up to CR to str, but
   *    we also need to see what follows CR: if it's LF, then we
   *    need consume LF as well, so next call to readLine will read
   *    from after that.
   * We use a flag prevCharCR to signal if previous character was CR
   * and, if it happens to be at the end of the buffer, delay
   * consuming it until we have a chance to look at the char that
   * follows.
   */
  str.clear();
  int txtLength = 0; //tracks str.getLength(), as an optimization
  int newlineLength = 0; //length of terminating newline
  boolean prevCharCR = false; //true of prev char was CR
  long bytesConsumed = 0;
  do {
    int startPosn = bufferPosn; //starting from where we left off the last time
    if (bufferPosn >= bufferLength) {
      startPosn = bufferPosn = 0;
      if (prevCharCR)
        ++bytesConsumed; //account for CR from previous read
      // reads some bytes from the input stream into the buffer array
      // and returns the number of bytes actually read
      bufferLength = in.read(buffer);
      if (bufferLength <= 0) // no more data
        break; // EOF
    }
    // '\n': ASCII 10, line feed (LF); '\r': ASCII 13, carriage return (CR)
    for (; bufferPosn < bufferLength; ++bufferPosn) { //search for newline
      if (buffer[bufferPosn] == LF) { // current byte is the line feed \n
        newlineLength = (prevCharCR) ? 2 : 1;
        ++bufferPosn; // at next invocation proceed from following byte (skip the newline)
        break;
      }
      if (prevCharCR) { //CR + notLF, we are at notLF (the previous byte was \r)
        newlineLength = 1;
        break;
      }
      prevCharCR = (buffer[bufferPosn] == CR);
    }
    int readLength = bufferPosn - startPosn;
    if (prevCharCR && newlineLength == 0) // no newline found yet and the buffer ends in \r
      --readLength; //CR at the end of the buffer
    bytesConsumed += readLength;
    int appendLength = readLength - newlineLength; // newlineLength is the length of the line terminator
    if (appendLength > maxLineLength - txtLength) {
      appendLength = maxLineLength - txtLength;
    }
    if (appendLength > 0) {
      str.append(buffer, startPosn, appendLength); // append the data to str
      txtLength += appendLength;
    }
  } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume); // loop while no newline was found and the limit is not exceeded

  if (bytesConsumed > (long)Integer.MAX_VALUE)
    throw new IOException("Too many bytes before newline: " + bytesConsumed);
  return (int)bytesConsumed;
}
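The purpose of this method is to read one record (one line) into str. bytesConsumed tracks the total number of bytes read. bufferLength = in.read(buffer) reads up to one buffer of data from the input stream into buffer and stores the number of bytes actually read in bufferLength. The if statement at the start of the do-while body makes sure that all bufferLength bytes have been processed before the next batch is read from the input stream. newlineLength is the length of the line terminator (0, 1, or 2); it is needed because different systems mark the end of a line differently. There are three variants: \r (carriage return), \n (line feed), and \r\n (\n: line terminator on UNIX systems; \r\n: line terminator on Windows systems; \r: line terminator on Mac OS systems).
The for loop checks byte by byte for \r or \n. If the current byte is a carriage return \r, prevCharCR is set to true. If the current byte is a line feed \n, then with prevCharCR==true (the previous byte was \r) newlineLength=2 (the terminator is \r\n), and with prevCharCR==false (the previous byte was not \r) newlineLength=1 (the terminator is \n); in both cases the for loop exits. If the current byte is not \n and prevCharCR==true (the terminator is a lone \r), newlineLength = 1 and the for loop exits as well. Once the terminator is found, the data length appendLength (excluding the terminator) is computed, and appendLength bytes starting at the given position in buffer are appended to str (which here is actually value); txtLength is the length of what has been stored in str so far.
The do-while loop continues only while both conditions hold: 1. no line terminator has been found (newlineLength == 0); 2. the number of bytes read has not reached the limit (bytesConsumed < maxBytesToConsume). One possible problem is a \r\n terminator whose two bytes do not arrive in the same batch, with \n in the next batch. This is not an issue: the for loop checks whether the byte after \r is \n before it sets newlineLength, and the \r at the end of the buffer is not counted until the next batch has been read. As this method shows, even a record that spans splits or blocks cannot shake its determination to read a complete line.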
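The terminator handling described above can be exercised with a small stand-alone program. The following is a simplified sketch of the same loop, not the Hadoop class itself: it assumes a plain InputStream and a ByteArrayOutputStream in place of Text, drops the maxLineLength and maxBytesToConsume limits, and uses a deliberately tiny buffer so that a \r\n pair gets split across two reads. The names LineReaderSketch and readAll are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineReaderSketch {
    private static final byte CR = '\r';
    private static final byte LF = '\n';

    private final InputStream in;
    private final byte[] buffer;
    private int bufferLength = 0; // number of valid bytes in buffer
    private int bufferPosn = 0;   // next unread position in buffer

    public LineReaderSketch(InputStream in, int bufferSize) {
        this.in = in;
        this.buffer = new byte[bufferSize];
    }

    /** Stores one line (without terminator) in out; returns bytes consumed, 0 at EOF. */
    public int readLine(ByteArrayOutputStream out) throws IOException {
        out.reset();
        int newlineLength = 0;
        boolean prevCharCR = false;
        int bytesConsumed = 0;
        do {
            int startPosn = bufferPosn;
            if (bufferPosn >= bufferLength) {   // buffer used up: refill it
                startPosn = bufferPosn = 0;
                if (prevCharCR) {
                    ++bytesConsumed;            // count the CR held back from the last buffer
                }
                bufferLength = in.read(buffer);
                if (bufferLength <= 0) {
                    break;                      // EOF
                }
            }
            for (; bufferPosn < bufferLength; ++bufferPosn) {
                if (buffer[bufferPosn] == LF) {
                    newlineLength = prevCharCR ? 2 : 1;
                    ++bufferPosn;               // next call starts after the LF
                    break;
                }
                if (prevCharCR) {               // CR followed by a non-LF byte
                    newlineLength = 1;
                    break;
                }
                prevCharCR = (buffer[bufferPosn] == CR);
            }
            int readLength = bufferPosn - startPosn;
            if (prevCharCR && newlineLength == 0) {
                --readLength;                   // CR is the last byte of this buffer
            }
            bytesConsumed += readLength;
            int appendLength = readLength - newlineLength;
            if (appendLength > 0) {
                out.write(buffer, startPosn, appendLength);
            }
        } while (newlineLength == 0);
        return bytesConsumed;
    }

    /** Convenience helper: splits text into lines using the reader above. */
    public static List<String> readAll(String text, int bufferSize) {
        LineReaderSketch reader = new LineReaderSketch(
            new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)), bufferSize);
        List<String> lines = new ArrayList<>();
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        try {
            while (reader.readLine(line) > 0) {
                lines.add(new String(line.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // With a 4-byte buffer the \r of "abc\r\n" ends the first read and the \n starts the second.
        System.out.println(readAll("abc\r\ndef\nghi\rjkl", 4)); // [abc, def, ghi, jkl]
    }
}
```

In the first call the reader appends "abc", holds back the trailing \r, refills the buffer, sees \n as the first byte, and reports 5 consumed bytes, which is the straddling case discussed above.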
public boolean nextKeyValue() throws IOException {
  if (key == null) {
    key = new LongWritable();
  }
  key.set(pos);  // the key is the offset of the line in the file
  if (value == null) {
    value = new Text();
  }
  int newSize = 0;
  while (pos < end) {
    newSize = in.readLine(value, maxLineLength,
                          Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
                                   maxLineLength));
    if (newSize == 0) {
      break;
    }
    pos += newSize;
    if (newSize < maxLineLength) {
      break;
    }

    // line too long. try again
    LOG.info("Skipped line of size " + newSize + " at pos " +
             (pos - newSize));
  }
  if (newSize == 0) {
    key = null;
    value = null;
    return false;
  } else {
    return true;
  }
}
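To see how nextKeyValue turns a split into (offset, line) pairs, here is a minimal, Hadoop-free sketch. It assumes ASCII data held in a byte array, '\n' as the only terminator, and no maxLineLength limit; RecordReaderSketch and readRecords are hypothetical names. The skipping of a partial first line for splits that do not start at offset 0 happens in initialize and is not modeled here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RecordReaderSketch {
    private final byte[] data;
    private final long end; // exclusive end offset of the split
    private long pos;       // offset of the next unread byte
    private long key;       // offset of the current line
    private String value;   // content of the current line

    public RecordReaderSketch(byte[] data, long start, long end) {
        this.data = data;
        this.pos = start;
        this.end = end;
    }

    /** Reads one '\n'-terminated line at pos; returns bytes consumed (0 at end of data). */
    private int readLine(StringBuilder out) {
        int i = (int) pos;
        while (i < data.length && data[i] != '\n') {
            out.append((char) data[i]); // ASCII-only assumption
            i++;
        }
        if (i < data.length) {
            i++; // consume the '\n'
        }
        return i - (int) pos;
    }

    /** Mirrors the control flow above: a line is read only if it starts before end. */
    public boolean nextKeyValue() {
        if (pos >= end) {
            return false;
        }
        key = pos;
        StringBuilder line = new StringBuilder();
        int newSize = readLine(line);
        if (newSize == 0) {
            return false;
        }
        pos += newSize;
        value = line.toString();
        return true;
    }

    /** Collects every record of the split [start, end) as "offset<TAB>line". */
    public static List<String> readRecords(String text, long start, long end) {
        RecordReaderSketch reader = new RecordReaderSketch(
            text.getBytes(StandardCharsets.US_ASCII), start, end);
        List<String> records = new ArrayList<>();
        while (reader.nextKeyValue()) {
            records.add(reader.key + "\t" + reader.value);
        }
        return records;
    }

    public static void main(String[] args) {
        // The split ends at byte 5, in the middle of "bbbb", yet that line is read in full.
        for (String record : readRecords("aa\nbbbb\ncc\n", 0, 5)) {
            System.out.println(record);
        }
    }
}
```

The second record starts at offset 3, which is before the split end of 5, so it is read to its terminator at offset 7 even though it crosses the split boundary; the line "cc" starts at offset 8 and is left to the next split.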
