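- List<InputSplit> getSplits(): computes the input splits (InputSplit) from the input files. This covers how the data or files are divided into splits.
- RecordReader<K,V> createRecordReader(): creates a RecordReader that reads data from an InputSplit. This covers how the data inside a split is read.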
1. Validate the input-specification of the job.
2. Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper. In other words, each InputSplit is handed to its own MapTask.
3. Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
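The second responsibility, cutting input into logical splits, can be illustrated with a minimal sketch. It is not Hadoop's actual split algorithm: `LogicalSplitSketch`, `Range`, and `computeRanges` are hypothetical names, and the fixed split size is an assumption. The point is only that a logical split is a (start, length) range over a file, and no bytes are copied or moved.

```java
import java.util.ArrayList;
import java.util.List;

public class LogicalSplitSketch {

    /** Hypothetical stand-in for a split: a byte range, not a copy of the data. */
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return start + "+" + length;
        }
    }

    /** Cuts a file of fileLength bytes into consecutive ranges of at most splitSize bytes. */
    static List<Range> computeRanges(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last range may be shorter than splitSize.
            ranges.add(new Range(start, Math.min(splitSize, fileLength - start)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 300-byte file with an assumed split size of 128 bytes.
        System.out.println(computeRanges(300, 128)); // prints [0+128, 128+128, 256+44]
    }
}
```

Each range would then be handed to one Mapper, which reads only its own portion of the file.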
- public abstract class InputFormat<K, V> {
- /**
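- * Each InputSplit is assigned to an individual Mapper.
- * Notes: 1. The split is a logical division of the input data; the input files are not physically cut into chunks.
- * Each split is described by the input file's path, a start position, an offset (length), and so on.
- * 2. The RecordReader created by the InputFormat also works from the InputSplit.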
- */
- public abstract
- List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;
- /**
- * Creates a record reader for a given split.
- */
- public abstract
- RecordReader<K,V> createRecordReader(InputSplit split,TaskAttemptContext context)
- throws IOException, InterruptedException;
- }
- public abstract class InputSplit {
- /**
- * Returns the size of the split, so that splits can be sorted by size.
- */
- public abstract long getLength() throws IOException, InterruptedException;
- /**
- * Get the list of nodes by name where the data for the split would be local.
- * The locations do not need to be serialized.
- * (That is, the nodes on which this split's data is stored.)
- */
- public abstract String[] getLocations() throws IOException, InterruptedException;
- }
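Since getLength() exists so that splits can be ordered by size, here is a small sketch of such an ordering. `SplitSortSketch` and `SizedSplit` are hypothetical stand-ins, not Hadoop classes, and sorting largest-first is one plausible policy chosen for illustration.

```java
import java.util.Arrays;

public class SplitSortSketch {

    /** Hypothetical minimal split: a label plus its length in bytes. */
    static final class SizedSplit {
        final String name;
        final long length;

        SizedSplit(String name, long length) {
            this.name = name;
            this.length = length;
        }

        long getLength() {
            return length;
        }
    }

    /** Sorts the array in place so that the largest split comes first. */
    static void sortLargestFirst(SizedSplit[] splits) {
        Arrays.sort(splits, (a, b) -> Long.compare(b.getLength(), a.getLength()));
    }

    public static void main(String[] args) {
        SizedSplit[] splits = {
            new SizedSplit("a", 44),
            new SizedSplit("b", 128),
            new SizedSplit("c", 64)
        };
        sortLargestFirst(splits);
        for (SizedSplit s : splits) {
            System.out.println(s.name + " " + s.length); // b 128, then c 64, then a 44
        }
    }
}
```

Starting the largest splits first tends to keep one long-running task from holding up the end of a job.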
2.1 A subclass of InputSplit: the FileSplit class
- public FileSplit() {}
- /** Constructs a split with host information
- *
- * @param file the file name
- * @param start the position of the first byte in the file to process
- * @param length the number of bytes in the file to process
- * @param hosts the list of hosts containing the block, possibly null
- */
- public FileSplit(Path file, long start, long length, String[] hosts) {
- this.file = file;
- this.start = start;
- this.length = length;
- this.hosts = hosts;
- }
- /** The file containing this split's data. */
- public Path getPath() { return file; }
- /** The position of the first byte in the file to process. */
- public long getStart() { return start; }
- /** The number of bytes in the file to process. */
- @Override
- public long getLength() { return length; }
- @Override
- public String toString() { return file + ":" + start + "+" + length; }
- ////////////////////////////////////////////
- // Writable methods
- ////////////////////////////////////////////
- @Override
- public void write(DataOutput out) throws IOException {
- Text.writeString(out, file.toString());
- out.writeLong(start);
- out.writeLong(length);
- }
- @Override
- public void readFields(DataInput in) throws IOException {
- file = new Path(Text.readString(in));
- start = in.readLong();
- length = in.readLong();
- hosts = null;
- }
- @Override
- public String[] getLocations() throws IOException {
- if (this.hosts == null) {
- return new String[]{};
- } else {
- return this.hosts;
- }
- }
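The write/readFields pair above serializes only the path, start, and length; the host list is reset to null on read. The sketch below replays that round trip with a simplified stand-in class. `SplitRoundTripSketch` and `SimpleFileSplit` are hypothetical, and `writeUTF`/`readUTF` are used in place of Hadoop's `Text.writeString`/`Text.readString` so that the example needs only the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SplitRoundTripSketch {

    /** Hypothetical simplified split that mirrors the fields of the excerpt above. */
    static final class SimpleFileSplit {
        String file;
        long start;
        long length;
        String[] hosts;

        SimpleFileSplit() {}

        SimpleFileSplit(String file, long start, long length, String[] hosts) {
            this.file = file;
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }

        void write(DataOutput out) throws IOException {
            out.writeUTF(file); // stand-in for Text.writeString
            out.writeLong(start);
            out.writeLong(length);
            // hosts are deliberately not written
        }

        void readFields(DataInput in) throws IOException {
            file = in.readUTF(); // stand-in for Text.readString
            start = in.readLong();
            length = in.readLong();
            hosts = null;
        }

        String[] getLocations() {
            return hosts == null ? new String[0] : hosts;
        }
    }

    /** Serializes a split to bytes and reads it back into a fresh instance. */
    static SimpleFileSplit roundTrip(SimpleFileSplit original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            SimpleFileSplit copy = new SimpleFileSplit();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SimpleFileSplit original =
            new SimpleFileSplit("/data/input.txt", 128, 64, new String[] {"node1", "node2"});
        SimpleFileSplit copy = roundTrip(original);
        System.out.println(copy.file + ":" + copy.start + "+" + copy.length); // /data/input.txt:128+64
        System.out.println(copy.getLocations().length); // 0
    }
}
```

The locations matter when the job decides where to place tasks; the deserialized copy that reaches a task only needs to know which bytes to read.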
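A FileSplit, as described above, always belongs to a single input file: a large file may be cut into several splits, but one split never spans two files. So with FileInputFormat (the input format that produces FileSplits), even a very small file is handled as its own InputSplit, and each InputSplit is processed by its own Mapper task. When the input consists of a large number of small files, there are just as many InputSplits and therefore just as many Mappers. The cost of creating and tearing down that many Mapper tasks is huge, and can even be disastrous for the cluster!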
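To put rough numbers on the small-file problem, the sketch below counts splits under a simplified model: every non-empty file yields ceil(length / splitSize) splits and no split spans two files. `SmallFilesSketch` and `countSplits` are hypothetical names, and the 128 MB split size and file sizes are assumptions picked for illustration.

```java
import java.util.Arrays;

public class SmallFilesSketch {

    /** Counts splits when each file is split on its own (simplified model). */
    static long countSplits(long[] fileLengths, long splitSize) {
        long total = 0;
        for (long length : fileLengths) {
            // Ceiling division: a partial last range still needs its own split.
            total += (length + splitSize - 1) / splitSize;
        }
        return total;
    }

    public static void main(String[] args) {
        long splitSize = 128L * 1024 * 1024; // assumed split size of 128 MB

        long[] manySmallFiles = new long[10_000];
        Arrays.fill(manySmallFiles, 1024L); // 10,000 files of 1 KB each

        long[] oneFile = { 10_000L * 1024 }; // the same bytes in a single file

        System.out.println(countSplits(manySmallFiles, splitSize)); // 10000
        System.out.println(countSplits(oneFile, splitSize));        // 1
    }
}
```

The same volume of data needs 10,000 Mapper tasks in the first case and one in the second, which is why per-task startup cost dominates when inputs are tiny.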
3. FileInputFormat
