Hadoop MapReduce InputFormat/OutputFormat
InputFormat
import java.io.IOException;
import java.util.List;

/**
 * InputFormat describes the input-specification for a Map-Reduce job.
 *
 * The Map-Reduce framework relies on the InputFormat of the job to:
 *
 * 1. Validate the input-specification of the job.
 *
 * 2. Split up the input file(s) into logical InputSplits, each of which is
 *    then assigned to an individual Mapper.
 *
 * 3. Provide the RecordReader implementation to be used to glean input
 *    records from the logical InputSplit for processing by the Mapper.
 *
 * The default behavior of file-based InputFormats, typically sub-classes of
 * FileInputFormat, is to split the input into logical InputSplits based on the
 * total size, in bytes, of the input files. However, the FileSystem blocksize
 * of the input files is treated as an upper bound for input splits. A lower
 * bound on the split size can be set via mapred.min.split.size.
 *
 * Clearly, logical splits based on input size are insufficient for many
 * applications, since record boundaries must be respected. In such cases, the
 * application also has to implement a RecordReader, which is responsible for
 * respecting record boundaries and presenting a record-oriented view of the
 * logical InputSplit to the individual task.
 */
public abstract class InputFormat<K, V> {

  /**
   * Logically split the set of input files for the job.
   *
   * <p>
   * Each {@link InputSplit} is then assigned to an individual {@link Mapper}
   * for processing.
   * </p>
   *
   * <p>
   * <i>Note</i>: The split is a <i>logical</i> split of the inputs and the
   * input files are not physically split into chunks. For example, a split
   * could be an <i><input-file-path, start, offset></i> tuple. The InputFormat
   * also creates the {@link RecordReader} to read the {@link InputSplit}.
   *
   * @param context
   *          job configuration.
   * @return a list of {@link InputSplit}s for the job.
   */
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  /**
   * Create a record reader for a given split. The framework will call
   * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
   * the split is used.
   *
   * @param split
   *          the split to be read
   * @param context
   *          the information about the task
   * @return a new record reader
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
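For context, here is a minimal driver sketch (hypothetical class, job name and paths, assuming the Hadoop 2.x org.apache.hadoop.mapreduce API) showing where an InputFormat plugs into a job and how the lower bound on split size mentioned above can be set:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Lower bound on split size in bytes, as mentioned in the Javadoc above
    // (newer releases also accept mapreduce.input.fileinputformat.split.minsize).
    conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);

    Job job = Job.getInstance(conf, "inputformat-demo"); // hypothetical job name
    job.setJarByClass(InputFormatDriver.class);

    // TextInputFormat is a FileInputFormat subclass: it splits files by size
    // and presents each line to the Mapper as a <LongWritable offset, Text line> pair.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // ... mapper, reducer and output configuration omitted ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}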
InputSplit
import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;

/**
 * <code>InputSplit</code> represents the data to be processed by an individual
 * {@link Mapper}.
 *
 * <p>
 * Typically, it presents a byte-oriented view of the input, and it is the
 * responsibility of the {@link RecordReader} of the job to process this and
 * present a record-oriented view.
 *
 * @see InputFormat
 * @see RecordReader
 */
public abstract class InputSplit {

  /**
   * Get the size of the split, so that the input splits can be sorted by
   * size.
   *
   * @return the number of bytes in the split
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract long getLength() throws IOException, InterruptedException;

  /**
   * Get the list of nodes by name where the data for the split would be
   * local. The locations do not need to be serialized.
   *
   * @return a new array of the node names.
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract String[] getLocations() throws IOException,
      InterruptedException;
}
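As an illustration of this contract, a hypothetical custom split (not part of Hadoop) might carry the <input-file-path, start, length> tuple described above. Concrete splits in the new API normally also implement Writable so the framework can ship them to the tasks, while the locations are deliberately left out of the serialized form:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class FileRegionSplit extends InputSplit implements Writable {
  private String path;     // file containing this split's data
  private long start;      // byte offset where the split begins
  private long length;     // number of bytes in the split
  private String[] hosts;  // nodes where the data is local; used only for scheduling

  public FileRegionSplit() {}  // no-arg constructor required for deserialization

  public FileRegionSplit(String path, long start, long length, String[] hosts) {
    this.path = path;
    this.start = start;
    this.length = length;
    this.hosts = hosts;
  }

  @Override
  public long getLength() { return length; }

  @Override
  public String[] getLocations() {
    return hosts == null ? new String[0] : hosts;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    Text.writeString(out, path);
    out.writeLong(start);
    out.writeLong(length);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    path = Text.readString(in);
    start = in.readLong();
    length = in.readLong();
    hosts = null;  // locations are not serialized
  }
}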
RecordReader
import java.io.Closeable;
import java.io.IOException;

/**
 * The record reader breaks the data into key/value pairs for input to the
 * {@link Mapper}.
 *
 * @param <KEYIN>
 * @param <VALUEIN>
 */
public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {

  /**
   * Called once at initialization.
   *
   * @param split
   *          the split that defines the range of records to read
   * @param context
   *          the information about the task
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException;

  /**
   * Read the next key/value pair.
   *
   * @return true if a key/value pair was read
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract boolean nextKeyValue() throws IOException,
      InterruptedException;

  /**
   * Get the current key.
   *
   * @return the current key, or null if there is no current key
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract KEYIN getCurrentKey() throws IOException,
      InterruptedException;

  /**
   * Get the current value.
   *
   * @return the object that was read
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract VALUEIN getCurrentValue() throws IOException,
      InterruptedException;

  /**
   * The current progress of the record reader through its data.
   *
   * @return a number between 0.0 and 1.0 that is the fraction of the data
   *         read
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract float getProgress() throws IOException,
      InterruptedException;

  /**
   * Close the record reader.
   */
  public abstract void close() throws IOException;
}
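A common way to see these methods in action is the "whole file as a single record" pattern. The sketch below (a hypothetical class, assuming the split is a FileSplit produced by a FileInputFormat subclass) emits exactly one <NullWritable, BytesWritable> pair per split:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
  private FileSplit split;
  private Configuration conf;
  private final BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    this.split = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (processed) {
      return false;  // only one record per split
    }
    byte[] contents = new byte[(int) split.getLength()];
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(file);
      IOUtils.readFully(in, contents, 0, contents.length);
      value.set(contents, 0, contents.length);
    } finally {
      IOUtils.closeStream(in);
    }
    processed = true;
    return true;
  }

  @Override
  public NullWritable getCurrentKey() { return NullWritable.get(); }

  @Override
  public BytesWritable getCurrentValue() { return value; }

  @Override
  public float getProgress() { return processed ? 1.0f : 0.0f; }

  @Override
  public void close() { /* nothing to close */ }
}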
OutputFormat
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;

/**
 * <code>OutputFormat</code> describes the output-specification for a Map-Reduce
 * job.
 *
 * <p>
 * The Map-Reduce framework relies on the <code>OutputFormat</code> of the job
 * to:
 * <ol>
 * <li>
 * Validate the output-specification of the job. For example, check that the
 * output directory doesn't already exist.</li>
 * <li>
 * Provide the {@link RecordWriter} implementation to be used to write out the
 * output files of the job. Output files are stored in a {@link FileSystem}.</li>
 * </ol>
 *
 * @see RecordWriter
 */
public abstract class OutputFormat<K, V> {

  /**
   * Get the {@link RecordWriter} for the given task.
   *
   * @param context
   *          the information about the current task.
   * @return a {@link RecordWriter} to write the output for the job.
   * @throws IOException
   */
  public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context)
      throws IOException, InterruptedException;

  /**
   * Check for validity of the output-specification for the job.
   *
   * <p>
   * This is to validate the output specification for the job when it is
   * submitted. Typically it checks that the output does not already exist,
   * throwing an exception when it does, so that output is not overwritten.
   * </p>
   *
   * @param context
   *          information about the job
   * @throws IOException
   *           when output should not be attempted
   */
  public abstract void checkOutputSpecs(JobContext context)
      throws IOException, InterruptedException;

  /**
   * Get the output committer for this output format. This is responsible for
   * ensuring the output is committed correctly.
   *
   * @param context
   *          the task context
   * @return an output committer
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context)
      throws IOException, InterruptedException;
}
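As a brief usage sketch (hypothetical class and job name, Hadoop 2.x API), the driver below wires a TextOutputFormat into the job; checkOutputSpecs() runs at submission time, so the job fails immediately if the output path already exists:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "outputformat-demo"); // hypothetical name
    job.setJarByClass(OutputFormatDriver.class);

    // ... input, mapper and reducer configuration omitted ...

    // TextOutputFormat writes "key<TAB>value" lines; its checkOutputSpecs()
    // rejects the job at submission time if the output path already exists.
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}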
RecordWriter
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;

/**
 * <code>RecordWriter</code> writes the output <key, value> pairs to an
 * output file.
 *
 * <p>
 * <code>RecordWriter</code> implementations write the job outputs to the
 * {@link FileSystem}.
 *
 * @see OutputFormat
 */
public abstract class RecordWriter<K, V> {

  /**
   * Writes a key/value pair.
   *
   * @param key
   *          the key to write.
   * @param value
   *          the value to write.
   * @throws IOException
   */
  public abstract void write(K key, V value) throws IOException,
      InterruptedException;

  /**
   * Close this <code>RecordWriter</code> to future operations.
   *
   * @param context
   *          the context of the task
   * @throws IOException
   */
  public abstract void close(TaskAttemptContext context) throws IOException,
      InterruptedException;
}
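For illustration, a hypothetical writer in the spirit of TextOutputFormat's line writer might emit one key<TAB>value line per pair to a stream opened by its OutputFormat, and close that stream when the task finishes:

import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class TabSeparatedRecordWriter<K, V> extends RecordWriter<K, V> {
  private final DataOutputStream out;  // typically opened by the OutputFormat

  public TabSeparatedRecordWriter(DataOutputStream out) {
    this.out = out;
  }

  @Override
  public void write(K key, V value) throws IOException {
    // one "key<TAB>value" line per pair
    out.write(key.toString().getBytes(StandardCharsets.UTF_8));
    out.write('\t');
    out.write(value.toString().getBytes(StandardCharsets.UTF_8));
    out.write('\n');
  }

  @Override
  public void close(TaskAttemptContext context) throws IOException {
    out.close();  // flush and release the underlying file
  }
}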
OutputCommitter
import java.io.IOException;

/**
 * <code>OutputCommitter</code> describes the commit of task output for a
 * Map-Reduce job.
 *
 * <p>
 * The Map-Reduce framework relies on the <code>OutputCommitter</code> of the
 * job to:
 * <ol>
 * <li>
 * Set up the job during initialization. For example, create the temporary
 * output directory for the job during job initialization.</li>
 * <li>
 * Clean up the job after job completion. For example, remove the temporary
 * output directory after job completion.</li>
 * <li>
 * Set up the task's temporary output.</li>
 * <li>
 * Check whether a task needs a commit. This is to avoid the commit procedure
 * if a task does not need a commit.</li>
 * <li>
 * Commit the task output.</li>
 * <li>
 * Discard the task commit.</li>
 * </ol>
 *
 * @see org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
 * @see JobContext
 * @see TaskAttemptContext
 */
public abstract class OutputCommitter {

  /**
   * For the framework to set up the job output during initialization.
   *
   * @param jobContext
   *          Context of the job whose output is being written.
   * @throws IOException
   *           if temporary output could not be created
   */
  public abstract void setupJob(JobContext jobContext) throws IOException;

  /**
   * For cleaning up the job's output after job completion.
   *
   * @param jobContext
   *          Context of the job whose output is being written.
   * @throws IOException
   */
  public abstract void cleanupJob(JobContext jobContext) throws IOException;

  /**
   * Sets up output for the task.
   *
   * @param taskContext
   *          Context of the task whose output is being written.
   * @throws IOException
   */
  public abstract void setupTask(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * Check whether the task needs a commit.
   *
   * @param taskContext
   * @return true if the task needs a commit
   * @throws IOException
   */
  public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * Promote the task's temporary output to the final output location.
   *
   * The task's output is moved to the job's output directory.
   *
   * @param taskContext
   *          Context of the task whose output is being written.
   * @throws IOException
   *           if the commit fails
   */
  public abstract void commitTask(TaskAttemptContext taskContext)
      throws IOException;

  /**
   * Discard the task output.
   *
   * @param taskContext
   * @throws IOException
   */
  public abstract void abortTask(TaskAttemptContext taskContext)
      throws IOException;
}
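A minimal sketch against the contract quoted above (a hypothetical no-op committer, useful when the RecordWriter streams directly to an external system) makes the role of each hook concrete; note that newer Hadoop releases deprecate cleanupJob() in favour of commitJob()/abortJob():

import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class NoOpOutputCommitter extends OutputCommitter {
  @Override
  public void setupJob(JobContext jobContext) throws IOException {
    // called once before any tasks run; e.g. create a temporary output directory
  }

  @Override
  public void cleanupJob(JobContext jobContext) throws IOException {
    // called once after the job finishes; e.g. delete the temporary directory
  }

  @Override
  public void setupTask(TaskAttemptContext taskContext) throws IOException {
    // called before each task attempt writes its output
  }

  @Override
  public boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException {
    return false;  // nothing to promote, so the framework skips commitTask()
  }

  @Override
  public void commitTask(TaskAttemptContext taskContext) throws IOException {
    // would move the attempt's temporary output to the final location
  }

  @Override
  public void abortTask(TaskAttemptContext taskContext) throws IOException {
    // would discard the attempt's temporary output
  }
}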