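CHAPTER 7. Input and Output Patterns

This chapter focuses on what is probably the most often overlooked way to improve the value of MapReduce: customizing input and output. You will not always want to load or store data the way MapReduce does out of the box. Sometimes you can skip the time-consuming step of storing data in HDFS, store only part of the data rather than all of it, or feed the output straight to the next process once MapReduce finishes. Sometimes the basic Hadoop paradigm of file blocks and input splits does not do what you need, and that is where a custom InputFormat or OutputFormat comes in.

Three patterns in this chapter deal with input: generating data, external source input, and partition pruning. All three share an interesting property: the map phase is completely unaware that anything complicated happened before it received its input key/value pairs. A custom input format is an effective way to abstract away the details of how the data is loaded.

On the other hand, Hadoop will not always store data the way you need it. One pattern in this chapter, external source output, writes data to a system outside of Hadoop. A custom output format likewise keeps the map and reduce phases unaware of the complexity involved in writing the data out.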

Customizing Input and Output in Hadoop

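Hadoop lets you modify how data is loaded in two main ways: configuring how contiguous chunks of input are generated from blocks, and configuring how records appear in the map phase. The two classes involved are RecordReader and InputFormat. They work with the Hadoop framework in much the same way that mappers and reducers are plugged in. Hadoop also lets you modify how data is stored, through OutputFormat and RecordWriter.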

InputFormat

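Hadoop relies on the input format of the job to do three things:

1. Validate the input configuration for the job, for example checking that the data exists.

2. Split the input blocks and files into logical chunks of type InputSplit, each of which is assigned to one map task.

3. Create the RecordReader implementation that turns an InputSplit into key/value pairs. These pairs are sent to the mapper one at a time.

The most commonly used input formats are subclasses of FileInputFormat, and the Hadoop default is TextInputFormat. This class first validates the job's input, making sure that all input paths exist. It then splits each input file logically by its size in bytes, using the block size as the upper bound for a split. For example, with a 64MB block size, a 160MB file is split into three logical chunks: 0MB-64MB, 64MB-128MB, and 128MB-160MB. Each map task is assigned exactly one of these splits, and the RecordReader is then responsible for generating key/value pairs from it.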
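The arithmetic in the 160MB example can be sketched in plain Java. This is only an illustration of the idea: SplitPlanner and planSplits are hypothetical names, not Hadoop classes, and the real FileInputFormat also honors configured minimum and maximum split sizes and other details that this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: cuts a file length into block-sized logical splits. */
final class SplitPlanner {

    /** Returns {start, length} pairs covering [0, fileLength). */
    static List<long[]> planSplits(long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += blockSize) {
            // The last split is simply whatever is left over
            long length = Math.min(blockSize, fileLength - start);
            splits.add(new long[] {start, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        for (long[] split : planSplits(160 * mb, 64 * mb)) {
            long startMb = split[0] / mb;
            long endMb = (split[0] + split[1]) / mb;
            System.out.println(startMb + "MB-" + endMb + "MB");
        }
    }
}
```

Each returned pair corresponds to one map task in the description above.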

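Typically, the RecordReader has the additional responsibility of fixing up boundaries, because split boundaries are arbitrary and will most likely not fall on a record boundary. For example, TextInputFormat uses a LineRecordReader to read a text file and create one key/value pair per line of text (that is, per run of bytes terminated by a newline) for each map task. The key is the byte offset of the line within the file, and the value is the whole line as a string. Because a split's byte range is unlikely to line up neatly with newline characters, LineRecordReader reads past the end of its split until it reaches the end of the line, so that a complete record is read. That last bit of data may, in principle, live in a different block on a different node, in which case it is read from the host that holds it. The read is handled by an FSDataInputStream, so we do not have to work out where the blocks are.

When writing your own format, don't be afraid to read past the boundaries of a split. Just make sure you test thoroughly that no data is duplicated or lost.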
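The boundary rule can be tried out on an in-memory byte array. The sketch below is a simplified illustration, not Hadoop's LineRecordReader: the class name and readSplit method are hypothetical, and it assumes newline-terminated UTF-8 text. Every split except the first discards the line it lands in, and every split keeps reading while the next line starts at or before its end offset, so each line is produced exactly once.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: reads the lines "owned" by the byte range [start, end). */
final class BoundaryAwareLineReader {

    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Not the first split: the previous reader finishes this line,
            // so discard everything up to and including the next newline.
            pos = skipLine(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line that starts at or before 'end' belongs to this split,
        // even though reading it crosses the split boundary.
        while (pos <= end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    private static int skipLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbb\nccc\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 5;
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            System.out.println("[" + start + "," + end + ") -> " + readSplit(data, start, end));
        }
    }
}
```

Running the splits over several different split sizes and checking that the concatenated output matches the original lines is a cheap way to test for duplicated or lost records.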

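Notice: Custom input formats are not limited to file-based input. As long as you can express the input as InputSplit objects and key/value pairs, custom or otherwise, you can read anything into the map phase of a MapReduce job in parallel. Just keep in mind what an input split represents, and try to take advantage of data locality.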

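The InputFormat abstract class has two abstract methods:

getSplits

A typical implementation uses the given JobContext object to retrieve the configured input and returns a List of InputSplit objects. Each InputSplit has a method that returns an array of the machines where its data is located in the cluster, which gives the framework a hint as to which TaskTracker should run the map task. This method is also a good place to verify the configuration and throw any necessary exceptions, because it is called on the front end (that is, before the job is submitted to the JobTracker).

createRecordReader

This method is called on the back end to generate an implementation of RecordReader, which is discussed in more detail shortly. Typically, a new instance is created and immediately returned, because the record reader has an initialize method that the framework calls.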

RecordReader

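The RecordReader abstract class creates key/value pairs from a given InputSplit. While the InputSplit represents a byte-oriented view of the split, the RecordReader makes sense of those bytes so that a mapper can process them. This is why Hadoop MapReduce is considered "schema on read": the schema is defined in the RecordReader, based solely on the record reader implementation, which changes depending on the input the job expects. Bytes are read from the input source and turned into a WritableComparable key and a Writable value. Custom data types are very common when creating custom input formats, because they are a nice object-oriented way to present information to a mapper.

A RecordReader uses the data within the boundaries created by the input split to generate key/value pairs. In the context of file-based input, the "start" is the byte position in the file where the RecordReader should start generating key/value pairs, and the "end" is where it should stop reading records. These are not hard boundaries as far as the API is concerned: nothing stops a developer from reading the entire file for each map task. Reading the whole file is not advised, but reading outside the boundaries is often necessary to ensure that a complete record is generated.

Consider the case of XML. While TextInputFormat can grab each line, XML elements typically span multiple lines and will be broken up by a MapReduce input split. By reading past the "end" of the input split boundary, you can get a complete record. After finding the end of a record, you only need to make sure that each record reader starts at the beginning of an XML element: after seeking to the start of the input split, keep reading until the opening tag of the XML element is found. This lets the MapReduce framework cover the entire contents of an XML file without duplicating any XML records. Any XML that is skipped while seeking forward to the start of an element is read by the preceding map task.

The RecordReader abstract class has several methods that must be overridden.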

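initialize

This method takes as arguments the map task's assigned InputSplit and a TaskAttemptContext. For file-based input formats, this is a good place to seek to the byte position in the file where reading should begin.

getCurrentKey and getCurrentValue

The framework uses these methods to hand the generated key/value pairs to the mapper. Where possible, reuse the objects these methods return rather than allocating new ones for every record.

nextKeyValue

Reads a single key/value pair and returns true until the data is consumed.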

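getProgress

This is an optional method that the framework uses for metrics gathering.

close

The framework calls this method to clean up resources once there are no more key/value pairs to process.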
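The order in which these methods are called can be sketched without Hadoop on the classpath. Everything below is a hypothetical stand-in: SimpleRecordReader only borrows the method names from the list above and is not the real RecordReader API, and drain is an assumed approximation of what the framework does on behalf of a mapper.

```java
import java.util.ArrayList;
import java.util.List;

final class ReaderLifecycleDemo {

    /** Hypothetical stand-in for a record reader; not Hadoop's API. */
    interface SimpleRecordReader<K, V> extends AutoCloseable {
        void initialize(int recordsToEmit);
        boolean nextKeyValue();
        K getCurrentKey();
        V getCurrentValue();
        float getProgress();
        @Override void close();
    }

    /** Emits keys 0..n-1 with values "record-<key>", reusing one value object. */
    static final class CountingReader implements SimpleRecordReader<Integer, StringBuilder> {
        private int total;
        private int emitted;
        private final StringBuilder value = new StringBuilder(); // reused for every record
        boolean closed;

        public void initialize(int recordsToEmit) {
            total = recordsToEmit;
            emitted = 0;
        }

        public boolean nextKeyValue() {
            if (emitted >= total) {
                return false; // no more input for the mapper
            }
            value.setLength(0);
            value.append("record-").append(emitted);
            emitted++;
            return true;
        }

        public Integer getCurrentKey() { return emitted - 1; }
        public StringBuilder getCurrentValue() { return value; }
        public float getProgress() { return total == 0 ? 1.0f : (float) emitted / total; }
        public void close() { closed = true; }
    }

    /** Drives the reader: initialize, loop on nextKeyValue, then close. */
    static List<String> drain(CountingReader reader, int n) {
        List<String> seen = new ArrayList<>();
        reader.initialize(n);
        try {
            while (reader.nextKeyValue()) {
                // Copy the value now: the reader overwrites it on the next call
                seen.add(reader.getCurrentKey() + "=" + reader.getCurrentValue());
            }
        } finally {
            reader.close();
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(drain(new CountingReader(), 3));
    }
}
```

Because the value object is reused, a caller that wants to keep a record beyond the current iteration has to copy it, which is the price of the object reuse recommended above.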

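OutputFormat

Hadoop relies on the output format of the job for two main tasks:

1. Validate the output configuration for the job.

2. Create the RecordWriter implementation that writes the output of the job.

The counterpart to FileInputFormat, FileOutputFormat handles file-based output. Because most output from a MapReduce job is written to HDFS, the many file-based output formats that ship with the API cover most needs. The Hadoop default is TextOutputFormat, which stores tab-delimited key/value pairs in the configured output directory in HDFS. TextOutputFormat also validates that the output directory does not exist before the job starts.

TextOutputFormat uses a LineRecordWriter to write the key/value pairs of each map task or reduce task, depending on whether there is a reduce phase. This class uses the toString method to serialize each key/value pair to a part file in HDFS, separating key and value with a tab. The tab is the default delimiter and can be changed through the job configuration.

As with input formats, you are not restricted to storing data in HDFS. As long as you can write key/value pairs to some other sink from Java, for example over a JDBC connection, you can use MapReduce to do a parallel bulk write. Just make sure that whatever you are writing to can handle the connections opened by the many tasks.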

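The OutputFormat abstract class has three abstract methods that must be implemented:

checkOutputSpecs

Validates the output specification for the job, for example making sure that the output directory does not already exist before the job is submitted. Otherwise, the output could be overwritten (depending on the configuration).

getRecordWriter

Returns the RecordWriter implementation that serializes key/value pairs to the output, which can be a FileSystem object.

getOutputCommitter

The output committer of a job sets up each task during initialization, commits the task upon successful completion (a commit, as opposed to a job submit), and cleans up each task when it finishes, whether it succeeded or not. For file-based output, a FileOutputCommitter can handle all the heavy lifting. It creates a temporary output directory for each map task and moves the output of successful tasks to the final output directory.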
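The write-to-temporary-then-move idea can be sketched on the local file system with java.nio.file. This is only an illustration of the protocol described above, not the FileOutputCommitter implementation: the class, its three methods, and the part-file naming are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: commit-by-move on the local file system. */
final class TempDirCommitSketch {

    /** A task attempt writes under a temporary directory, never the final one. */
    static Path writeAttempt(Path outputDir, String taskId, String content) throws IOException {
        Path attemptDir = Files.createDirectories(outputDir.resolve("_temporary").resolve(taskId));
        Path part = attemptDir.resolve("part-" + taskId);
        Files.write(part, content.getBytes(StandardCharsets.UTF_8));
        return part;
    }

    /** Commit: a successful attempt's file is moved into the final output directory. */
    static Path commit(Path outputDir, Path part) throws IOException {
        return Files.move(part, outputDir.resolve(part.getFileName()));
    }

    /** Abort: whatever a failed attempt wrote is simply discarded. */
    static void abort(Path part) throws IOException {
        Files.deleteIfExists(part);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("job-output");
        Path good = writeAttempt(out, "m-00000", "complete output\n");
        Path bad = writeAttempt(out, "m-00001", "partial out");
        commit(out, good);
        abort(bad);
        System.out.println("committed exists: " + Files.exists(out.resolve("part-m-00000")));
        System.out.println("aborted exists:   " + Files.exists(out.resolve("part-m-00001")));
    }
}
```

The point of the design is that readers of the final directory never see output from an attempt that did not finish.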

RecordWriter

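The RecordWriter abstract class writes key/value pairs to a file system or another output. Unlike its RecordReader counterpart, it has no initialize phase. However, the constructor can always be used to set up the record writer as needed. Any parameters can be passed in during construction, because the record writer instance is created via OutputFormat.getRecordWriter.

The class has two methods:

write

The framework calls this method for each key/value pair that needs to be written. Its implementation depends very much on your use case. The example later in this chapter shows how to write key/value pairs to an external in-memory key/value store rather than to a file system.

close

The framework calls this method when there are no more key/value pairs to write. Use it to release file handles, shut down connections to other services, or do any other cleanup that is needed.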

Generating Data

Pattern Description

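The generating data pattern is interesting because, instead of loading data that comes from somewhere outside, it generates that data on the fly and in parallel.

Intent

You want to generate a lot of data from scratch.

Motivation

The most distinctive thing about this pattern is that it does not load data. With it, you generate the data and store it back in the distributed file system.

Generating data is not very common. Typically, you generate a batch of data once and then use it over and over again. When you do need to generate data, though, MapReduce is an excellent tool for the job.

The most common use case for this pattern is generating random data. Building a representative data set is useful for large-scale testing when the real data set is still too small. It is also useful for building "toy domains" in which to research a proof of concept at scale.

Generating random data is also often used as part of a benchmark, such as the commonly used TeraGen/TeraSort and DFSIO.

Unfortunately, implementing this pattern in Hadoop is not straightforward, because the underlying framework assigns one input split to each map task and calls the map function once per record. In this pattern there are no input splits and no records, so we have to fool the framework into thinking there are.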

Structure

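To implement this pattern in Hadoop, write a custom InputFormat and have a RecordReader generate the random data. The map function is completely oblivious to the origin of the data, so the data can be built on the fly rather than loaded from a file in HDFS. For the most part the identity mapper is fine here, but you may want to do some post-processing in the map task, or even analyze the data right away. See Figure 7-1.

This pattern is map-only.

· The InputFormat creates the fake splits from nothing. The number of splits is configurable.

· The RecordReader takes its fake split and generates random records from it. In some cases you can put information in the input split that tells the record reader what to generate. For example, to generate random date/time data, each input split could represent one hour.

· In most cases the identity mapper is used to write the data out as it comes in.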
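The one-split-per-hour idea from the list above can be sketched in plain Java. HourlySplitSketch, HourSplit, getSplits, and generate are hypothetical names chosen for this illustration and do not extend any Hadoop class; in a real job the split would extend InputSplit and the generation would live in a RecordReader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class HourlySplitSketch {

    static final long HOUR_MILLIS = 60L * 60L * 1000L;

    /** Hypothetical fake split: holds no data, only a hint about what to generate. */
    static final class HourSplit {
        final long startMillis;

        HourSplit(long startMillis) {
            this.startMillis = startMillis;
        }
    }

    /** One fake split per hour, starting at dayStartMillis. */
    static List<HourSplit> getSplits(long dayStartMillis, int hours) {
        List<HourSplit> splits = new ArrayList<>();
        for (int h = 0; h < hours; h++) {
            splits.add(new HourSplit(dayStartMillis + h * HOUR_MILLIS));
        }
        return splits;
    }

    /** What a record reader might do with the hint: random timestamps inside that hour. */
    static long[] generate(HourSplit split, int count, long seed) {
        Random rnd = new Random(seed);
        long[] timestamps = new long[count];
        for (int i = 0; i < count; i++) {
            // nextDouble() is in [0, 1), so the offset stays strictly below one hour
            timestamps[i] = split.startMillis + (long) (rnd.nextDouble() * HOUR_MILLIS);
        }
        return timestamps;
    }

    public static void main(String[] args) {
        List<HourSplit> splits = getSplits(0L, 24);
        long[] sample = generate(splits.get(5), 3, 42L);
        for (long t : sample) {
            System.out.println("hour 5 timestamp (ms since start of day): " + t);
        }
    }
}
```

Since every split covers a disjoint hour, the tasks can run independently and their outputs never overlap in time.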

Figure 7-1. The structure of the generating data pattern

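Notice: The lazy way to implement this pattern is to seed the job with many fake input files, each containing a single bogus record. You can then use a generic InputFormat and RecordReader and generate the data in the map function.

Consequences

Each mapper outputs a file containing random data.

Resemblances

There are a number of ways to create random data with SQL and Pig, but none that is eloquent or terse.

Performance analysis

The main performance consideration here is how many map tasks are needed to generate the data. In general, the more map tasks you have, the faster the data is generated, because you make better use of the parallelism of the cluster. However, there is little point in launching more map tasks than you have map slots, since they are all doing the same thing.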
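The reasoning about map slots can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, as a simplification that is not stated in the text, that every task takes about the same time, so the job runs in "waves" of at most one task per slot. MapWaveEstimate and waves are hypothetical names.

```java
/** Illustrative only: how many rounds of map tasks a cluster needs. */
final class MapWaveEstimate {

    /** Ceiling of mapTasks / mapSlots, computed with integer arithmetic. */
    static int waves(int mapTasks, int mapSlots) {
        if (mapTasks < 0 || mapSlots <= 0) {
            throw new IllegalArgumentException("need mapTasks >= 0 and mapSlots > 0");
        }
        return (mapTasks + mapSlots - 1) / mapSlots;
    }

    public static void main(String[] args) {
        int slots = 40;
        for (int tasks : new int[] {20, 40, 41, 80}) {
            System.out.println(tasks + " tasks on " + slots + " slots: " + waves(tasks, slots) + " wave(s)");
        }
    }
}
```

Under that assumption, going from 40 to 41 tasks on 40 slots doubles the number of waves, which is why matching the task count to the available slots is a sensible starting point.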

Generating Data Examples

Generating random StackOverflow comments

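To generate random StackOverflow data, we take a list of 1,000 words and use it to build random short comments. We also need to generate a random score, a row ID, a user ID, and a creation date.

Driver code. The driver parses four command-line arguments to configure the job. It sets our custom input format and then calls static methods to configure it further. All output is written to the given output directory. The identity mapper is used, and the reduce phase is disabled by setting the number of reduce tasks to zero.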

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    int numMapTasks = Integer.parseInt(args[0]);
    int numRecordsPerTask = Integer.parseInt(args[1]);
    Path wordList = new Path(args[2]);
    Path outputDir = new Path(args[3]);

    Job job = new Job(conf, "RandomDataGenerationDriver");
    job.setJarByClass(RandomDataGenerationDriver.class);

    // Map-only job: with zero reducers the mapper output is the final output
    job.setNumReduceTasks(0);

    // Configure the custom input format through its static helpers
    job.setInputFormatClass(RandomStackOverflowInputFormat.class);
    RandomStackOverflowInputFormat.setNumMapTasks(job, numMapTasks);
    RandomStackOverflowInputFormat.setNumRecordPerTask(job, numRecordsPerTask);
    RandomStackOverflowInputFormat.setRandomWordList(job, wordList);

    TextOutputFormat.setOutputPath(job, outputDir);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 2);
}

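InputSplit code. The FakeInputSplit class simply extends InputSplit and implements Writable. None of the overridden methods has a real implementation, and methods that must return a value return a trivial one. The split exists only to trick the framework into assigning a task to generate the random data.

public static class FakeInputSplit extends InputSplit implements Writable {

    // Nothing to deserialize: the split carries no state
    public void readFields(DataInput arg0) throws IOException {
    }

    // Nothing to serialize
    public void write(DataOutput arg0) throws IOException {
    }

    public long getLength() throws IOException, InterruptedException {
        return 0;
    }

    // No locality hints: the data does not exist anywhere yet
    public String[] getLocations() throws IOException, InterruptedException {
        return new String[0];
    }
}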

InputFormat code. The input format has two main purposes: returning the list of input splits the framework needs to create map tasks, and creating the RandomStackOverflowRecordReader for each map task. We override getSplits to return a configured number of FakeInputSplit objects. The number is read from the configuration. When the framework calls createRecordReader, a RandomStackOverflowRecordReader is instantiated, initialized, and returned.

public static class RandomStackOverflowInputFormat extends
        InputFormat<Text, NullWritable> {

    public static final String NUM_MAP_TASKS = "random.generator.map.tasks";
    public static final String NUM_RECORDS_PER_TASK = "random.generator.num.records.per.map.task";
    public static final String RANDOM_WORD_LIST = "random.generator.random.word.file";

    public List<InputSplit> getSplits(JobContext job) throws IOException {
        // Get the configured number of map tasks
        int numSplits = job.getConfiguration().getInt(NUM_MAP_TASKS, -1);

        // Create a number of input splits equivalent to the number of tasks
        ArrayList<InputSplit> splits = new ArrayList<InputSplit>();
        for (int i = 0; i < numSplits; ++i) {
            splits.add(new FakeInputSplit());
        }
        return splits;
    }

    public RecordReader<Text, NullWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Create a new RandomStackOverflowRecordReader and initialize it
        RandomStackOverflowRecordReader rr = new RandomStackOverflowRecordReader();
        rr.initialize(split, context);
        return rr;
    }

    public static void setNumMapTasks(Job job, int i) {
        job.getConfiguration().setInt(NUM_MAP_TASKS, i);
    }

    public static void setNumRecordPerTask(Job job, int i) {
        job.getConfiguration().setInt(NUM_RECORDS_PER_TASK, i);
    }

    public static void setRandomWordList(Job job, Path file) {
        DistributedCache.addCacheFile(file.toUri(), job.getConfiguration());
    }
}

RecordReader code. The record reader is where the data is actually generated. It is given our FakeInputSplit during initialization but simply ignores it. The number of records to create is pulled from the job configuration, and the list of random words is read from the DistributedCache. On each call to nextKeyValue, a random record is created using a simple random number generator. The body of the comment is generated by a helper function that picks between 1 and 30 random words from the list. A counter is incremented to keep track of how many records have been generated. Once all the records are generated, the record reader returns false, signaling to the framework that there is no more input for the mapper.

public static class RandomStackOverflowRecordReader extends
        RecordReader<Text, NullWritable> {

    private int numRecordsToCreate = 0;
    private int createdRecords = 0;
    private Text key = new Text();
    private NullWritable value = NullWritable.get();
    private Random rndm = new Random();
    private ArrayList<String> randomWords = new ArrayList<String>();

    // This object formats the random creation date as a date string
    private SimpleDateFormat frmt = new SimpleDateFormat(
            "yyyy-MM-dd'T'HH:mm:ss.SSS");

    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Get the number of records to create from the configuration
        this.numRecordsToCreate = context.getConfiguration().getInt(
                NUM_RECORDS_PER_TASK, -1);

        // Get the list of random words from the DistributedCache
        URI[] files = DistributedCache.getCacheFiles(context
                .getConfiguration());

        // Read the list of random words into a list
        BufferedReader rdr = new BufferedReader(new FileReader(
                files[0].toString()));
        String line;
        while ((line = rdr.readLine()) != null) {
            randomWords.add(line);
        }
        rdr.close();
    }

    public boolean nextKeyValue() throws IOException, InterruptedException {
        // If we still have records to create
        if (createdRecords < numRecordsToCreate) {
            // Generate random data. nextInt(bound) is used instead of
            // Math.abs(nextInt()) % bound, which can yield a negative number
            int score = rndm.nextInt(15000);
            int rowId = rndm.nextInt(1000000000);
            int postId = rndm.nextInt(100000000);
            int userId = rndm.nextInt(1000000);
            String creationDate = frmt.format(Math.abs(rndm.nextLong()));

            // Create a string of text from the random words
            String text = getRandomText();

            String randomRecord = "<row Id=\"" + rowId + "\" PostId=\""
                    + postId + "\" Score=\"" + score + "\" Text=\"" + text
                    + "\" CreationDate=\"" + creationDate + "\" UserId=\""
                    + userId + "\" />";

            key.set(randomRecord);
            ++createdRecords;
            return true;
        } else {
            // We are done creating records
            return false;
        }
    }

    private String getRandomText() {
        StringBuilder bldr = new StringBuilder();
        // Between 1 and 30 words per comment
        int numWords = rndm.nextInt(30) + 1;
        for (int i = 0; i < numWords; ++i) {
            bldr.append(randomWords.get(rndm.nextInt(randomWords.size()))
                    + " ");
        }
        return bldr.toString();
    }

    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    public NullWritable getCurrentValue() throws IOException,
            InterruptedException {
        return value;
    }

    public float getProgress() throws IOException, InterruptedException {
        return (float) createdRecords / (float) numRecordsToCreate;
    }

    public void close() throws IOException {
        // nothing to do here...
    }
}

External Source Output

Pattern Description

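As mentioned earlier in this chapter, the external source output pattern writes data to a system outside of Hadoop.

Intent

You want to write MapReduce output to a remote location.

Motivation

With this pattern, we can send the data output by the MapReduce framework directly to an external source. This is extremely useful for loading data straight into a system instead of staging it first. The pattern skips storing the data in a file system entirely and sends it directly to where it belongs. Using MapReduce to bulk load an external source in parallel has its uses.

The data is written out in parallel. Because the output goes to an external source, you need to be sure that the destination system can handle this level of parallelism and all of the open connections.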

Structure

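Figure 7-2 shows the structure of the external source output pattern, explained below:

· The OutputFormat verifies the output specification of the job configuration before the job is submitted. This is a good time to make sure that the external source is fully functional, since it would be bad to discover that it is unavailable only when it is time to commit the data to it. The OutputFormat is also responsible for creating the RecordWriter.

· The RecordWriter writes all key/value pairs to the external source. Its implementation varies by external source. During construction of the object, it establishes whatever connections it needs using the external source's API. Those connections are then used to write the data from the map or reduce task.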

Figure 7-2. The structure of the external source output pattern

Consequences

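The output data has been sent to the external source, and the external source has loaded it successfully.

Notice: Task failures are bound to happen, and when they do, any key/value pairs already passed to the write method cannot be taken back. In a typical MapReduce job, temporary output is written to the file system, and if a task fails, that output is discarded. When writing to an external source directly, the source receives the data as a stream. If a task fails, the external source will not automatically know about the failure and discard the data it received from that task. If this is unacceptable, consider using a custom OutputCommitter to write temporary output to the file system.

Performance analysis

From a MapReduce perspective there is not much to worry about, since the map and reduce phases are generic. You do have to be careful that the receiver of the data can handle the parallel connections. A thousand tasks writing to a single relational database is not going to work well. To avoid this, you may have to have each reducer handle a bit more data, which reduces the number of parallel writers. If the destination system supports parallel ingest well, this may not be a problem. For example, when writing to a sharded database, each reducer could write to a specific database instance. (Translator's note: Oracle RAC, perhaps?)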

External Source Output Example

Writing to Redis instances

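This example is a basic means of writing to a number of Redis instances in parallel from MapReduce. Redis is an open source, in-memory key/value store. It is often referred to as a data structure server, since the value stored under a key can be a string, hash, list, set, or sorted set. Redis is written in ANSI C and works on most POSIX systems, such as Linux, without any external dependencies.

To work with the Hadoop framework, Jedis is used to communicate with Redis. Jedis is an open source "blazingly small and sane Redis Java client." Redis clients written for other languages can be found online.

This example has no actual analytics in it, and the same is true of the rest of the examples in this chapter. It focuses on how to take a data set stored in HDFS and store it in an external data source using a custom OutputFormat. Here, StackOverflow user data is written to a configurable number of Redis instances as a mapping from user to reputation. The mappings are spread evenly and pseudo-randomly across the instances, all under a single Redis hash key.

A Redis hash is a map between string fields and string values, similar to a Java HashMap. Each hash is identified by a key. Every hash can store more than four billion field-value pairs.

Problem: Given a set of user information, randomly distribute user-to-reputation mappings to a configurable number of Redis instances in parallel.

OutputFormat code. The RedisHashOutputFormat class is responsible for establishing and verifying the job configuration before the job is submitted to the JobTracker. It also creates the RecordWriter that serializes all the output key/value pairs. Typically that output is a file in HDFS, but, as we will see shortly, that is not the case here.

The output format contains configuration variables that must be set by the driver so that the job has all the information it needs to run.

Here, a couple of public static methods take some of the guesswork out of what a developer needs to set. This output format takes a list of Redis instance hosts as a CSV string and the Redis hash key to write all the output to. The checkOutputSpecs method makes sure that both parameters are set before the job runs, because the job would fail without them. This is where you want to validate your configuration.

The getRecordWriter method is called on the back end to create a RecordWriter instance for the map or reduce task. Here, we fetch the configuration variables that RedisHashRecordWriter requires and return a new instance of it. This record writer is a nested class of RedisHashOutputFormat, which is not required but is more of a convention.

The final method of this output format is getOutputCommitter. The framework uses the output committer to manage any temporary output before committing, in case the task fails and needs to be re-executed. For this implementation, we generally don't care whether a task fails and needs to be re-executed. As long as the job finishes, we are okay. An output committer is required by the framework, but NullOutputFormat contains an output committer implementation that does nothing, so we borrow it.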

public static class RedisHashOutputFormat extends OutputFormat<Text, Text> {

    public static final String REDIS_HOSTS_CONF = "mapred.redishashoutputformat.hosts";
    public static final String REDIS_HASH_KEY_CONF = "mapred.redishashinputformat.key";

    public static void setRedisHosts(Job job, String hosts) {
        job.getConfiguration().set(REDIS_HOSTS_CONF, hosts);
    }

    public static void setRedisHashKey(Job job, String hashKey) {
        job.getConfiguration().set(REDIS_HASH_KEY_CONF, hashKey);
    }

    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        return new RedisHashRecordWriter(job.getConfiguration().get(
                REDIS_HASH_KEY_CONF), job.getConfiguration().get(
                REDIS_HOSTS_CONF));
    }

    public void checkOutputSpecs(JobContext job) throws IOException {
        // Fail fast, before the job runs, if either setting is missing
        String hosts = job.getConfiguration().get(REDIS_HOSTS_CONF);
        if (hosts == null || hosts.isEmpty()) {
            throw new IOException(REDIS_HOSTS_CONF
                    + " is not set in configuration.");
        }

        String hashKey = job.getConfiguration().get(REDIS_HASH_KEY_CONF);
        if (hashKey == null || hashKey.isEmpty()) {
            throw new IOException(REDIS_HASH_KEY_CONF
                    + " is not set in configuration.");
        }
    }

    public OutputCommitter getOutputCommitter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Borrow the do-nothing committer from NullOutputFormat
        return (new NullOutputFormat<Text, Text>())
                .getOutputCommitter(context);
    }

    public static class RedisHashRecordWriter extends
            RecordWriter<Text, Text> {
        // code in next section
    }
}

RecordWriter code. The RedisHashRecordWriter class handles the connections to Redis via the Jedis client and writes the data out. Each key/value pair is written to one of the Redis instances, chosen so that the data is spread evenly across all of them. The constructor stores the hash key to write to and creates a new Jedis instance for each host.

Each Jedis instance is then connected and mapped to an integer. The write method uses this map to pick the Jedis instance: it takes the hash code of the key modulo the number of configured Redis instances, and the result determines which instance receives the key/value pair. All Jedis instances are disconnected in the close method.

public static class RedisHashRecordWriter extends RecordWriter<Text, Text> {

    private HashMap<Integer, Jedis> jedisMap = new HashMap<Integer, Jedis>();
    private String hashKey = null;

    public RedisHashRecordWriter(String hashKey, String hosts) {
        this.hashKey = hashKey;

        // Create a connection to Redis for each host
        // Map an integer 0-(numRedisInstances - 1) to the instance
        int i = 0;
        for (String host : hosts.split(",")) {
            Jedis jedis = new Jedis(host);
            jedis.connect();
            jedisMap.put(i, jedis);
            ++i;
        }
    }

    public void write(Text key, Text value) throws IOException,
            InterruptedException {
        // Get the Jedis instance that this key/value pair will be
        // written to. Masking the sign bit keeps the index non-negative;
        // Math.abs would still be negative for Integer.MIN_VALUE
        Jedis j = jedisMap.get((key.hashCode() & Integer.MAX_VALUE)
                % jedisMap.size());

        // Write the key/value pair
        j.hset(hashKey, key.toString(), value.toString());
    }

    public void close(TaskAttemptContext context) throws IOException,
            InterruptedException {
        // For each jedis instance, disconnect it
        for (Jedis jedis : jedisMap.values()) {
            jedis.disconnect();
        }
    }
}
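The routing step, hash code modulo the number of instances, can be explored without Redis or Hadoop. The sketch below is an illustration under stated assumptions: InstanceRouter, routeByHash, and histogram are hypothetical names, and it hashes Java strings with String.hashCode rather than Hadoop Text objects, so the exact distribution will differ from the real job. It uses Math.floorMod, one of several ways to keep the index non-negative; taking Math.abs of the hash does not work for Integer.MIN_VALUE.

```java
import java.util.Arrays;

/** Illustrative only: picks an instance index from a key's hash code. */
final class InstanceRouter {

    /** Maps any int hash code to an index in [0, numInstances). */
    static int routeByHash(int hash, int numInstances) {
        if (numInstances <= 0) {
            throw new IllegalArgumentException("numInstances must be positive");
        }
        return Math.floorMod(hash, numInstances);
    }

    /** Counts how many of the user IDs 1..userCount land on each instance. */
    static int[] histogram(int userCount, int numInstances) {
        int[] counts = new int[numInstances];
        for (int id = 1; id <= userCount; id++) {
            counts[routeByHash(Integer.toString(id).hashCode(), numInstances)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // How evenly do 100,000 user IDs spread over three instances?
        System.out.println(Arrays.toString(histogram(100_000, 3)));

        // Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE, so the
        // remainder is negative and would not be a valid index
        System.out.println(Math.abs(Integer.MIN_VALUE) % 3);
        System.out.println(routeByHash(Integer.MIN_VALUE, 3));
    }
}
```

Because the index depends only on the key, the same user ID always lands on the same instance, which also tells a later reader which instance to query.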

Mapper code. The mapper is quite simple: it takes the user ID and reputation from each record and outputs them. The output format does most of the heavy lifting, so it can be reused many times to write whatever you want to a Redis hash.

public static class RedisOutputMapper extends
        Mapper<Object, Text, Text, Text> {

    private Text outkey = new Text();
    private Text outvalue = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = MRDPUtils.transformXmlToMap(value
                .toString());

        String userId = parsed.get("Id");
        String reputation = parsed.get("Reputation");

        // Skip records that lack either attribute; Text.set(null) would fail
        if (userId == null || reputation == null) {
            return;
        }

        // Set our output key and values
        outkey.set(userId);
        outvalue.set(reputation);

        context.write(outkey, outvalue);
    }
}

Driver code. The driver parses the command-line arguments and calls our public static methods to configure where the data is written in Redis.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Path inputPath = new Path(args[0]);
    String hosts = args[1];
    String hashName = args[2];

    Job job = new Job(conf, "Redis Output");
    job.setJarByClass(RedisOutputDriver.class);

    // Map-only job: each mapper writes straight to Redis
    job.setMapperClass(RedisOutputMapper.class);
    job.setNumReduceTasks(0);

    job.setInputFormatClass(TextInputFormat.class);
    TextInputFormat.setInputPaths(job, inputPath);

    job.setOutputFormatClass(RedisHashOutputFormat.class);
    RedisHashOutputFormat.setRedisHosts(job, hosts);
    RedisHashOutputFormat.setRedisHashKey(job, hashName);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    int code = job.waitForCompletion(true) ? 0 : 2;
    System.exit(code);
}

Source: http://blog.csdn.net/cuirong1986/article/details/8510162
