MapReduce job explained

In the last post we saw how to run a MapReduce job on Hadoop. Now we're going to analyze how a MapReduce program works. If you don't know what MapReduce is, the short answer is "MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster" (from Wikipedia).

Let's take a look at the source code: we can find a Java main method that is called from Hadoop, and two inner static classes, the mapper and the reducer. The code for the mapper is:

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    // the constant count emitted for every word
    private final static IntWritable one = new IntWritable(1);

    // reused for every token, so no new object is allocated per word
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

As we can see, this class extends Mapper, which - as its JavaDoc says - maps input key/value pairs to a set of intermediate key/value pairs. When the job starts, the Hadoop framework passes each mapper a chunk of data (a subset of the whole dataset) to process. The output of the mappers will be the input of the reducers (that's not the complete story, but we'll get there in another post). Mapper uses Java generics to specify what kind of data it will process: in this example, our class extends Mapper and specifies Object and Text as the classes of the input key/value pairs, and Text and IntWritable as the classes of the key/value pairs it outputs to the reducers (we'll see the details of those classes in a moment).
Let's examine the code: there's only one overridden method, map(), which takes the key/value pair and the Hadoop context as arguments. Every time Hadoop calls this method, the key is the offset in the file at which the value starts, and the value is a line of the text file we're reading.
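To make the input of map() more concrete, here is a minimal plain-Java sketch of how a text file can be seen as a sequence of (offset, line) records. It uses no Hadoop classes: the class and method names are made up for illustration, and it assumes UTF-8 text with Unix newlines.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: this is not how Hadoop reads files internally.
class LineOffsetSketch {

    // Returns every line of the text, keyed by the byte offset at which it starts.
    static Map<Long, String> toRecords(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        if (text.isEmpty()) {
            return records;
        }
        long offset = 0;
        for (String line : text.split("\n")) {
            records.put(offset, line);
            // advance past the line's bytes plus the newline character
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // each printed pair corresponds to one call of map(key, value, context)
        toRecords("to be or\nnot to be\n")
                .forEach((offset, line) -> System.out.println(offset + " -> " + line));
    }
}
```

In the word count example the key is never used: only the line (the value) matters.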
Hadoop has some basic types that are optimized for network serialization; here is a table with a few of them:

Java type	Hadoop type
Integer	IntWritable
Long	LongWritable
Double	DoubleWritable
String	Text
Map	MapWritable
Array	ArrayWritable
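What "optimized for network serialization" means can be shown without Hadoop. Writable types serialize themselves by writing their raw fields to a DataOutput, so an int takes 4 bytes. The sketch below compares that with standard Java serialization of an Integer. The class and method names are hypothetical, and the size of the Java-serialized form is only printed, not assumed.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Illustration only: plain java.io, no Hadoop classes.
class SerializationSizeSketch {

    // Size of an int written as raw data, the style used by Writable types.
    static int compactSize(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Size of the same value written with standard Java object serialization.
    static int javaSerializedSize(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(Integer.valueOf(value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("raw int: " + compactSize(42) + " bytes");
        System.out.println("serialized Integer: " + javaSerializedSize(42) + " bytes");
    }
}
```

The object stream also carries a header and a class description, which is overhead we don't want to pay for every one of the millions of pairs moved between mappers and reducers.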

Now it's easy to understand what this method does: for every line of the book it receives, it uses a StringTokenizer to split the line into single words; then it sets each word in the Text object, maps it to the value of 1, and writes the pair to the Hadoop context, which delivers it to the reducers.
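The same logic can be tried outside of Hadoop. The following is a minimal sketch in plain Java, where a list of entries stands in for the Hadoop context; MapStepSketch and its map method are hypothetical names, not part of any Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Illustration only: mimics what TokenizerMapper emits for one line.
class MapStepSketch {

    // Splits the line on whitespace and emits a (word, 1) pair for every token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            emitted.add(new SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        // repeated words are emitted once per occurrence, each with the value 1
        System.out.println(map("to be or not to be"));
    }
}
```

Note that the mapper doesn't sum anything: a word that appears twice in a line is simply emitted twice.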

Let's now look at the reducer:

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {

    // reused for every key, so no new object is allocated per word
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

This time the first two type parameters of the Reducer are the same as the last two of the TokenizerMapper class; that's because - as we said - the mappers output the data that the reducer takes as input. Accordingly, the overridden reduce method receives a Text key and an Iterable of IntWritable values. The Hadoop framework takes care of calling this method once for every key that comes from the mappers; as we saw before, the keys are the words of the file we're counting.
The reduce method now has to sum all the occurrences of a single word, so it initializes a sum variable to 0 and then loops over all the values for that specific key that it receives from the mappers. For every value, it adds to the sum variable the count mapped to that key. At the end of the loop, when all the occurrences of that word are counted, the method sets the total into an IntWritable object and gives it to the Hadoop context to be output to the user.
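Here is a minimal plain-Java sketch of this step, under the assumption that the grouping of values by key (which Hadoop does for us between the map and the reduce phase) can be imitated with a sorted map. All names are hypothetical and no Hadoop class is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: imitates grouping by key followed by IntSumReducer.
class ReduceStepSketch {

    // Collects a 1 for every occurrence of a word, grouped by word and sorted by key.
    static Map<String, List<Integer>> groupByKey(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Same logic as the reduce method: sums all the values of one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    // Calls reduce once per key, as the framework would do.
    static Map<String, Integer> run(List<String> words) {
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : groupByKey(words).entrySet()) {
            output.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(run(Arrays.asList("to", "be", "or", "not", "to", "be")));
    }
}
```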

We're now at the main method of the class, which is the one that is called by Hadoop when it's executed as a JAR file.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }

    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

} // end of the enclosing WordCount class

In the method, we first set up a Configuration object and let GenericOptionsParser strip the generic Hadoop options from the command line; then we check the number of remaining arguments. If the number of arguments is correct, we create a Job object and set a few values to make it work. Let's dive into the details:

setJarByClass: sets the JAR by finding where a given class came from. This needs an explanation: Hadoop distributes the code to execute to the cluster as a JAR file; instead of specifying the name of the JAR, we give Hadoop a class and it locates the JAR that contains that class
setMapperClass: sets the class that will be executed as the mapper
setCombinerClass: sets the class that will be executed as the combiner (we'll explain what a combiner is in a future post)
setReducerClass: sets the class that will be executed as the reducer
setOutputKeyClass: sets the class that will be used as the key for outputting data to the user
setOutputValueClass: sets the class that will be used as the value for outputting data to the user
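You may have noticed that the same IntSumReducer class is set both as combiner and as reducer. Without anticipating the details, the sketch below gives a hint of why that is possible here: summing partial counts chunk by chunk and then summing the partial results gives the same totals as counting everything at once. It's plain Java with hypothetical names, and it only assumes that a combiner pre-aggregates the output of a single mapper.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Illustration only: no Hadoop classes, just the arithmetic behind the idea.
class CombinerSketch {

    // Word counts for one chunk of text: what one mapper plus a summing combiner would hand over.
    static Map<String, Integer> localCounts(String chunk) {
        Map<String, Integer> counts = new HashMap<>();
        StringTokenizer itr = new StringTokenizer(chunk);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    // Sums the partial counts coming from several chunks, as the reducer would.
    static Map<String, Integer> merge(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> totals.merge(word, count, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> first = localCounts("to be or");
        Map<String, Integer> second = localCounts("not to be");
        // same totals as counting "to be or not to be" in one go
        System.out.println(merge(Arrays.asList(first, second)));
    }
}
```

This works because addition doesn't care about how the values are grouped or ordered; an operation without that property could not reuse its reducer this way.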

Then we tell Hadoop where it can find the input with the FileInputFormat.addInputPath() method and where it has to write the output with the FileOutputFormat.setOutputPath() method. The last method call is waitForCompletion(), which submits the job to the cluster and waits for it to finish.

Now that the mechanism of a MapReduce job is clearer, we can start playing with it.

from: http://andreaiacono.blogspot.com/2014/02/mapreduce-job-explained.html
