An Analysis of Mapper and Reducer
1. Mapper/Reducer in the Old API
Mapper and Reducer encapsulate the application's data-processing logic. To keep the interface simple, MapReduce requires that all data stored on the underlying distributed file system be interpreted as key/value pairs and handed to the map/reduce functions of the Mapper/Reducer, which in turn produce new key/value pairs. The Mapper and Reducer class hierarchies are very similar, so we take Mapper as the example. As the class diagram in the figure shows, a Mapper consists of three parts: initialization, the map operation, and cleanup.
(1) Initialization
Mapper extends the JobConfigurable interface, whose configure method allows the Mapper to be initialized via a JobConf parameter.
(2) The Map Operation
The MapReduce framework uses the RecordReader supplied by the InputFormat to read key/value pairs from the InputSplit one at a time and hands each pair to the following map() function:
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException;
Besides key and value, the function takes two more parameters, of types OutputCollector and Reporter, used respectively to emit results and to update Counter values.
(3) Cleanup
Mapper acquires a close method by extending the Closeable interface (Hadoop's own Closeable, which in turn extends java.io.Closeable); users can implement this method to clean up after the Mapper.
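Putting the three parts together, here is a minimal sketch of an old-API Mapper; the class name LineLengthMapper and the parameter example.skip.empty.lines are hypothetical. It extends MapReduceBase, which supplies empty default implementations of configure and close:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LineLengthMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private boolean skipEmpty; // hypothetical job parameter

  @Override // initialization: read parameters from the JobConf
  public void configure(JobConf job) {
    skipEmpty = job.getBoolean("example.skip.empty.lines", true);
  }

  @Override // map operation: emit (line text, line length) pairs
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    if (skipEmpty && value.getLength() == 0) {
      reporter.incrCounter("Example", "EmptyLines", 1); // update a Counter
      return;
    }
    output.collect(value, new IntWritable(value.getLength()));
  }

  @Override // cleanup: release any resources held by the Mapper
  public void close() throws IOException {
    // nothing to release in this sketch
  }
}

Extending MapReduceBase rather than implementing JobConfigurable and Closeable directly is the conventional way to avoid writing empty configure/close bodies.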
MapReduce ships with many Mapper/Reducer implementations, most of them fairly simple, as shown in the figure. Their functions are as follows (a usage sketch follows the list):
❑ ChainMapper/ChainReducer: support chained jobs.
❑ IdentityMapper/IdentityReducer: emit the input key/value unchanged.
❑ InverseMapper: swaps the key and the value.
❑ RegexMapper: matches strings against a regular expression.
❑ TokenCountMapper: splits a string into tokens (words); it can serve as the Mapper for WordCount.
❑ LongSumReducer: for each key, sums the long-typed values.
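As a hedged usage sketch (the job name and input/output paths are placeholders), a word-count job in the old API can be assembled entirely from these stock classes, combining TokenCountMapper with LongSumReducer:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class StockWordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(StockWordCount.class);
    conf.setJobName("stock-wordcount");
    // no custom code needed: the stock implementations suffice
    conf.setMapperClass(TokenCountMapper.class);
    conf.setCombinerClass(LongSumReducer.class);
    conf.setReducerClass(LongSumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}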
A MapReduce application does not strictly need a Mapper. The MapReduce framework provides an interface more general than Mapper: MapRunnable, shown in the figure. Users can implement this interface to customize how the Mapper is invoked, or to implement the key/value processing logic themselves; for example, Hadoop Pipes implements its own MapRunnable that sends the data over a socket directly to another process for handling. Another benefit of exposing this interface is that it permits multi-threaded Mappers.
As shown in the figure, MapReduce provides two MapRunnable implementations, MapRunner and MultithreadedMapRunner, with MapRunner as the default. MultithreadedMapRunner is a multi-threaded MapRunnable; by default it starts 10 threads per Mapper and is typically used for non-CPU-bound jobs to improve throughput.
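For reference, a minimal MapRunnable implementation looks roughly like the sketch below, which mirrors the default MapRunner's read-map loop. This is a simplified, hedged reconstruction, not the actual MapRunner source (incremental progress reporting and skip-records logic are omitted):

import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.ReflectionUtils;

public class SimpleMapRunner<K1, V1, K2, V2>
    implements MapRunnable<K1, V1, K2, V2> {
  private Mapper<K1, V1, K2, V2> mapper;

  @SuppressWarnings("unchecked")
  public void configure(JobConf job) {
    // instantiate the Mapper class configured for the job
    this.mapper = (Mapper<K1, V1, K2, V2>)
        ReflectionUtils.newInstance(job.getMapperClass(), job);
  }

  public void run(RecordReader<K1, V1> input,
                  OutputCollector<K2, V2> output,
                  Reporter reporter) throws IOException {
    try {
      K1 key = input.createKey();
      V1 value = input.createValue();
      while (input.next(key, value)) {            // pull pairs from the RecordReader
        mapper.map(key, value, output, reporter); // hand each to map()
      }
    } finally {
      mapper.close();                             // always run the Mapper's cleanup
    }
  }
}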
2. Mapper/Reducer in the New API
As the figure shows, the new API changes the old one in the following ways:
❑ Mapper changed from an interface to a class, and it no longer extends the JobConfigurable and Closeable interfaces; instead, the class directly provides two methods, setup and cleanup, for initialization and cleanup work (see the lifecycle sketch after this list).
❑ The parameters are wrapped in a Context object, which gives the interface good extensibility.
❑ The MapRunnable interface was removed and a run method was added to Mapper, making it easy for users to customize how the map() function is invoked; the default implementation of run is the same as the old MapRunner's run implementation.
❑ In the new API, the iterator type Reducer uses to traverse the values changed to java.lang.Iterable, so users can traverse all the values in "foreach" style, as shown below:
void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
    throws IOException, InterruptedException {
  for (VALUEIN value : values) { // note the "foreach" iteration style
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
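To illustrate the new setup/cleanup lifecycle end to end, here is a hedged sketch (the class name ThresholdMapper and the parameter example.min.length are made up) of a Mapper that reads a threshold in setup and flushes a summary counter in cleanup:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ThresholdMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private int minLength;
  private long dropped;

  @Override // replaces the old API's configure(JobConf)
  protected void setup(Context context) {
    minLength = context.getConfiguration().getInt("example.min.length", 1);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.getLength() < minLength) {
      dropped++; // too short: count it, emit nothing
      return;
    }
    context.write(value, new IntWritable(value.getLength()));
  }

  @Override // replaces the old API's close()
  protected void cleanup(Context context) {
    context.getCounter("Example", "DroppedLines").increment(dropped);
  }
}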
The complete code of the Mapper class follows:
package org.apache.hadoop.mapreduce;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.compress.CompressionCodec;
/**
* Maps input key/value pairs to a set of intermediate key/value pairs.
*
 * <p>Maps are the individual tasks which transform input records into
 * intermediate records. The transformed intermediate records need not be of
* the same type as the input records. A given input pair may map to zero or
* many output pairs.</p>
*
* <p>The Hadoop Map-Reduce framework spawns one map task for each
* {@link InputSplit} generated by the {@link InputFormat} for the job.
* <code>Mapper</code> implementations can access the {@link Configuration} for
* the job via the {@link JobContext#getConfiguration()}.
*
* <p>The framework first calls
* {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by
* {@link #map(Object, Object, Context)}
* for each key/value pair in the <code>InputSplit</code>. Finally
* {@link #cleanup(Context)} is called.</p>
*
* <p>All intermediate values associated with a given output key are
* subsequently grouped by the framework, and passed to a {@link Reducer} to
* determine the final output. Users can control the sorting and grouping by
* specifying two key {@link RawComparator} classes.</p>
*
* <p>The <code>Mapper</code> outputs are partitioned per
* <code>Reducer</code>. Users can control which keys (and hence records) go to
* which <code>Reducer</code> by implementing a custom {@link Partitioner}.
*
* <p>Users can optionally specify a <code>combiner</code>, via
* {@link Job#setCombinerClass(Class)}, to perform local aggregation of the
* intermediate outputs, which helps to cut down the amount of data transferred
* from the <code>Mapper</code> to the <code>Reducer</code>.
*
* <p>Applications can specify if and how the intermediate
* outputs are to be compressed and which {@link CompressionCodec}s are to be
* used via the <code>Configuration</code>.</p>
*
* <p>If the job has zero
* reduces then the output of the <code>Mapper</code> is directly written
* to the {@link OutputFormat} without sorting by keys.</p>
*
* <p>Example:</p>
* <p><blockquote><pre>
* public class TokenCounterMapper
* extends Mapper<Object, Text, Text, IntWritable>{
*
* private final static IntWritable one = new IntWritable(1);
* private Text word = new Text();
*
 *   public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
* StringTokenizer itr = new StringTokenizer(value.toString());
* while (itr.hasMoreTokens()) {
* word.set(itr.nextToken());
 *       context.write(word, one);
* }
* }
* }
* </pre></blockquote></p>
*
* <p>Applications may override the {@link #run(Context)} method to exert
* greater control on map processing e.g. multi-threaded <code>Mapper</code>s
* etc.</p>
*
* @see InputFormat
* @see JobContext
* @see Partitioner
* @see Reducer
*/
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  public class Context
extends MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
public Context(Configuration conf, TaskAttemptID taskid,
RecordReader<KEYIN,VALUEIN> reader,
RecordWriter<KEYOUT,VALUEOUT> writer,
OutputCommitter committer,
StatusReporter reporter,
InputSplit split) throws IOException, InterruptedException {
super(conf, taskid, reader, writer, committer, reporter, split);
}
}
/**
* Called once at the beginning of the task.
*/
protected void setup(Context context
) throws IOException, InterruptedException {
// NOTHING
}
/**
* Called once for each key/value pair in the input split. Most applications
* should override this, but the default is the identity function.
*/
@SuppressWarnings("unchecked")
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException, InterruptedException {
context.write((KEYOUT) key, (VALUEOUT) value);
}
/**
* Called once at the end of the task.
*/
protected void cleanup(Context context
) throws IOException, InterruptedException {
// NOTHING
}
/**
* Expert users can override this method for more complete control over the
* execution of the Mapper.
* @param context
* @throws IOException
*/
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
cleanup(context);
}
}
As the code shows, the Mapper class defines a nested class, Context, which extends MapContext. Let us look at the source code of the MapContext class:
package org.apache.hadoop.mapreduce;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
/**
* The context that is given to the {@link Mapper}.
* @param <KEYIN> the key input type to the Mapper
* @param <VALUEIN> the value input type to the Mapper
* @param <KEYOUT> the key output type from the Mapper
* @param <VALUEOUT> the value output type from the Mapper
*/
public class MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
extends TaskInputOutputContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
private RecordReader<KEYIN,VALUEIN> reader;
private InputSplit split;

public MapContext(Configuration conf, TaskAttemptID taskid,
RecordReader<KEYIN,VALUEIN> reader,
RecordWriter<KEYOUT,VALUEOUT> writer,
OutputCommitter committer,
StatusReporter reporter,
InputSplit split) {
super(conf, taskid, writer, committer, reporter);
this.reader = reader;
this.split = split;
}
/**
* Get the input split for this map.
*/
public InputSplit getInputSplit() {
return split;
}
@Override
public KEYIN getCurrentKey() throws IOException, InterruptedException {
return reader.getCurrentKey();
}
@Override
public VALUEIN getCurrentValue() throws IOException, InterruptedException {
return reader.getCurrentValue();
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
return reader.nextKeyValue();
}
}
The MapContext class extends TaskInputOutputContext; next, the code of the TaskInputOutputContext class:
package org.apache.hadoop.mapreduce;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Progressable;
/**
* A context object that allows input and output from the task. It is only
* supplied to the {@link Mapper} or {@link Reducer}.
* @param <KEYIN> the input key type for the task
* @param <VALUEIN> the input value type for the task
* @param <KEYOUT> the output key type for the task
* @param <VALUEOUT> the output value type for the task
*/
public abstract class TaskInputOutputContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
extends TaskAttemptContext implements Progressable {
private RecordWriter<KEYOUT,VALUEOUT> output;
private StatusReporter reporter;
private OutputCommitter committer;

public TaskInputOutputContext(Configuration conf, TaskAttemptID taskid,
RecordWriter<KEYOUT,VALUEOUT> output,
OutputCommitter committer,
StatusReporter reporter) {
super(conf, taskid);
this.output = output;
this.reporter = reporter;
this.committer = committer;
}
/**
 * Advance to the next key/value pair.
 * @return true if a new key/value pair was read, false at end of input
 */
public abstract
boolean nextKeyValue() throws IOException, InterruptedException;
/**
* Get the current key.
* @return the current key object or null if there isn't one
* @throws IOException
* @throws InterruptedException
*/
public abstract
KEYIN getCurrentKey() throws IOException, InterruptedException;
/**
* Get the current value.
* @return the value object that was read into
* @throws IOException
* @throws InterruptedException
*/
public abstract VALUEIN getCurrentValue() throws IOException,
InterruptedException;
/**
* Generate an output key/value pair.
*/
public void write(KEYOUT key, VALUEOUT value
) throws IOException, InterruptedException {
output.write(key, value);
}
public Counter getCounter(Enum<?> counterName) {
return reporter.getCounter(counterName);
}
public Counter getCounter(String groupName, String counterName) {
return reporter.getCounter(groupName, counterName);
}
@Override
public void progress() {
reporter.progress();
}
@Override
public void setStatus(String status) {
reporter.setStatus(status);
}
public OutputCommitter getOutputCommitter() {
return committer;
}
}
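The write, getCounter, progress, and setStatus methods above are what user code actually touches at runtime. The following hedged sketch (the counter group/name and status strings are illustrative only) shows a new-API Mapper using them while processing a long record:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SlowRecordMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.setStatus("processing record at offset " + key);
    for (String part : value.toString().split(",")) {
      // ... some expensive per-part computation would run here ...
      context.progress(); // keep the framework from timing the task out
    }
    context.getCounter("Example", "RecordsSeen").increment(1);
    context.write(value, NullWritable.get());
  }
}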
The TaskInputOutputContext class extends TaskAttemptContext and implements the Progressable interface. First, the code of the Progressable interface:
package org.apache.hadoop.util;
/**
* A facility for reporting progress.
*
* <p>Clients and/or applications can use the provided <code>Progressable</code>
* to explicitly report progress to the Hadoop framework. This is especially
 * important for operations which take a significant amount of time, since,
 * in lieu of the reported progress, the framework has to assume that an error
 * has occurred and time out the operation.</p>
*/
public interface Progressable {
/**
* Report progress to the Hadoop framework.
*/
public void progress();
}
Now the code of the TaskAttemptContext class:
package org.apache.hadoop.mapreduce;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Progressable;
/**
* The context for task attempts.
*/
public class TaskAttemptContext extends JobContext implements Progressable {
private final TaskAttemptID taskId;
private String status = "";

public TaskAttemptContext(Configuration conf,
TaskAttemptID taskId) {
super(conf, taskId.getJobID());
this.taskId = taskId;
}
/**
* Get the unique name for this task attempt.
*/
public TaskAttemptID getTaskAttemptID() {
return taskId;
}

/**
* Set the current status of the task to the given string.
*/
public void setStatus(String msg) throws IOException {
status = msg;
}
/**
* Get the last set status message.
* @return the current status message
*/
public String getStatus() {
return status;
}
/**
* Report progress. The subtypes actually do work in this method.
*/
public void progress() {
}
}
TaskAttemptContext extends the JobContext class; finally, the source code of JobContext:
package org.apache.hadoop.mapreduce;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
/**
* A read-only view of the job that is provided to the tasks while they
* are running.
*/
public class JobContext {
// Put all of the attribute names in here so that Job and JobContext are
// consistent.
protected static final String INPUT_FORMAT_CLASS_ATTR =
"mapreduce.inputformat.class";
protected static final String MAP_CLASS_ATTR = "mapreduce.map.class";
protected static final String COMBINE_CLASS_ATTR = "mapreduce.combine.class";
protected static final String REDUCE_CLASS_ATTR = "mapreduce.reduce.class";
protected static final String OUTPUT_FORMAT_CLASS_ATTR =
"mapreduce.outputformat.class";
protected static final String PARTITIONER_CLASS_ATTR =
"mapreduce.partitioner.class"; protected final org.apache.hadoop.mapred.JobConf conf;
private final JobID jobId; public JobContext(Configuration conf, JobID jobId) {
this.conf = new org.apache.hadoop.mapred.JobConf(conf);
this.jobId = jobId;
}
/**
* Return the configuration for the job.
* @return the shared configuration object
*/
public Configuration getConfiguration() {
return conf;
}

/**
* Get the unique ID for the job.
* @return the object with the job id
*/
public JobID getJobID() {
return jobId;
}
/**
 * Get the configured number of reduce tasks for this job. Defaults to
* <code>1</code>.
* @return the number of reduce tasks for this job.
*/
public int getNumReduceTasks() {
return conf.getNumReduceTasks();
}
/**
* Get the current working directory for the default file system.
*
* @return the directory name.
*/
public Path getWorkingDirectory() throws IOException {
return conf.getWorkingDirectory();
}
/**
* Get the key class for the job output data.
* @return the key class for the job output data.
*/
public Class<?> getOutputKeyClass() {
return conf.getOutputKeyClass();
}
/**
* Get the value class for job outputs.
* @return the value class for job outputs.
*/
public Class<?> getOutputValueClass() {
return conf.getOutputValueClass();
}
/**
* Get the key class for the map output data. If it is not set, use the
* (final) output key class. This allows the map output key class to be
* different than the final output key class.
* @return the map output key class.
*/
public Class<?> getMapOutputKeyClass() {
return conf.getMapOutputKeyClass();
}
/**
* Get the value class for the map output data. If it is not set, use the
* (final) output value class This allows the map output value class to be
* different than the final output value class.
*
* @return the map output value class.
*/
public Class<?> getMapOutputValueClass() {
return conf.getMapOutputValueClass();
}
/**
* Get the user-specified job name. This is only used to identify the
* job to the user.
*
* @return the job's name, defaulting to "".
*/
public String getJobName() {
return conf.getJobName();
}
/**
* Get the {@link InputFormat} class for the job.
*
* @return the {@link InputFormat} class for the job.
*/
@SuppressWarnings("unchecked")
public Class<? extends InputFormat<?,?>> getInputFormatClass()
throws ClassNotFoundException {
return (Class<? extends InputFormat<?,?>>)
conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
}
/**
* Get the {@link Mapper} class for the job.
*
* @return the {@link Mapper} class for the job.
*/
@SuppressWarnings("unchecked")
public Class<? extends Mapper<?,?,?,?>> getMapperClass()
throws ClassNotFoundException {
return (Class<? extends Mapper<?,?,?,?>>)
conf.getClass(MAP_CLASS_ATTR, Mapper.class);
}
/**
* Get the combiner class for the job.
*
* @return the combiner class for the job.
*/
@SuppressWarnings("unchecked")
public Class<? extends Reducer<?,?,?,?>> getCombinerClass()
throws ClassNotFoundException {
return (Class<? extends Reducer<?,?,?,?>>)
conf.getClass(COMBINE_CLASS_ATTR, null);
}
/**
* Get the {@link Reducer} class for the job.
*
* @return the {@link Reducer} class for the job.
*/
@SuppressWarnings("unchecked")
public Class<? extends Reducer<?,?,?,?>> getReducerClass()
throws ClassNotFoundException {
return (Class<? extends Reducer<?,?,?,?>>)
conf.getClass(REDUCE_CLASS_ATTR, Reducer.class);
}
/**
* Get the {@link OutputFormat} class for the job.
*
* @return the {@link OutputFormat} class for the job.
*/
@SuppressWarnings("unchecked")
public Class<? extends OutputFormat<?,?>> getOutputFormatClass()
throws ClassNotFoundException {
return (Class<? extends OutputFormat<?,?>>)
conf.getClass(OUTPUT_FORMAT_CLASS_ATTR, TextOutputFormat.class);
}
/**
* Get the {@link Partitioner} class for the job.
*
* @return the {@link Partitioner} class for the job.
*/
@SuppressWarnings("unchecked")
public Class<? extends Partitioner<?,?>> getPartitionerClass()
throws ClassNotFoundException {
return (Class<? extends Partitioner<?,?>>)
conf.getClass(PARTITIONER_CLASS_ATTR, HashPartitioner.class);
}
/**
* Get the {@link RawComparator} comparator used to compare keys.
*
* @return the {@link RawComparator} comparator used to compare keys.
*/
public RawComparator<?> getSortComparator() {
return conf.getOutputKeyComparator();
}
/**
* Get the pathname of the job's jar.
* @return the pathname
*/
public String getJar() {
return conf.getJar();
}
/**
* Get the user defined {@link RawComparator} comparator for
* grouping keys of inputs to the reduce.
*
* @return comparator set by the user for grouping values.
* @see Job#setGroupingComparatorClass(Class) for details.
*/
public RawComparator<?> getGroupingComparator() {
return conf.getOutputValueGroupingComparator();
}
}
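The getter methods above (getMapperClass, getInputFormatClass, and so on) read back what a driver program sets. As a hedged sketch (input/output paths come from the command line, and the stock TokenCounterMapper/IntSumReducer from org.apache.hadoop.mapreduce.lib are assumed available), a minimal new-API word-count driver looks like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count"); // pre-0.21 constructor style
    job.setJarByClass(WordCountDriver.class);
    // these setters store class names under the mapreduce.*.class
    // attributes that JobContext's getters read back
    job.setMapperClass(TokenCounterMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}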
References
《Hadoop技术内幕 深入理解MapReduce架构设计与实现原理》