In the last post and in the preceding one we saw how to write a MapReduce program for finding the top-n items of a data set. The difference between the two was that the first program (which we call basic) emitted every single item read from input to the reducers, while the second (which we call enhanced) made a partial computation and emitted only a subset of the input. The enhanced top-n optimizes network transmissions (the fewer the key-value pairs emitted, the less bandwidth is used to transmit them from mapper to reducer) and reduces the number of keys shuffled and sorted; but this comes at the cost of rewriting the mapper.

If we look at the code of the mapper of the enhanced top-n, we can see that it implements the idea behind the reducer: it uses a Map to make a partial count of the words and emits every word only once; looking at the reducer's code, we see that it implements the same idea. If we could execute the code of the reducer of the basic top-n after the mapper has run on every machine (with its subset of data), we would obtain exactly the same result as rewriting the mapper as in the enhanced version. This is exactly what Hadoop combiners do: they're executed just after the mapper on every machine to improve performance. To tell Hadoop which class to use as a combiner, we can use the Job.setCombinerClass() method.
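
Wiring a combiner into a job driver looks like this (a minimal sketch: TopNMapper and TopNReducer are hypothetical stand-ins for the classes from the previous posts, and the key/value types assume a word/count output):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopNDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "top-n with combiner");
        job.setJarByClass(TopNDriver.class);
        job.setMapperClass(TopNMapper.class);
        // Reuse the reducer as a combiner: Hadoop runs it on each mapper's
        // output before the shuffle, cutting the data sent over the network.
        job.setCombinerClass(TopNReducer.class);
        job.setReducerClass(TopNReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}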

Caution: using the reducer as a combiner works only if the function we're computing is both commutative (a + b = b + a) and associative (a + (b + c) = (a + b) + c). 
Let's look at an example. Suppose we're analyzing the traffic of a website and we have an input file with the number of visits per day, formatted like this (YYYYMMDD value):

20140401 100
20140331 1000
20140330 1300
20140329 5100
20140328 1200

We want to find the day with the highest number of visits.
Let's say that we have two mappers; the first one receives the first three lines and the second receives the last two. If we write the mapper to emit every line, the reducer will evaluate something like this:

max(100, 1000, 1300, 5100, 1200) -> 5100

and the max is 5100. 
If we use the reducer as a combiner, the reducer will evaluate something like this:

max( max(100, 1000, 1300), max(5100, 1200)) -> max( 1300, 5100) -> 5100

because each of the two mappers will evaluate the max function locally. In this case the result will be 5100 as well, since the function we're evaluating (the max function) is both commutative and associative.
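
To make this concrete, here's a minimal sketch of such a reducer (MaxVisitsReducer is a hypothetical name, and for brevity it tracks only the maximum count, not the day it belongs to); the mapper is assumed to emit every visit count under a single constant key, and the same class can be registered both as the reducer and as the combiner:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Safe to use as a combiner too: max is commutative and associative,
// so taking the max of partial maxima gives the same global result.
public class MaxVisitsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(key, new IntWritable(max));
    }
}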

Now let's say that we need to compute the average number of visits per day. If we write the mapper to emit every line of the input file, the reducer will evaluate this:

mean(100, 1000, 1300, 5100, 1200) -> 1740

which is 1740. 
If we use the reducer as a combiner, the reducer will evaluate something like this:

mean( mean(100, 1000, 1300), mean(5100, 1200)) -> mean( 800, 3150) -> 1975

because each of the two mappers will evaluate the mean function locally. In this case the result will be 1975, which is obviously wrong.

So, if we're computing a commutative and associative function and we want to improve the performance of our job, we can use our reducer as a combiner; if we want to improve performance but we're computing a function that is not commutative and associative, we have to rewrite the mapper or write a new combiner from scratch.
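
For the mean, the usual trick for such a custom combiner is to make the intermediate values carry partial sums and counts instead of partial means: sums and counts add up associatively, and only the reducer performs the final division. A rough sketch, assuming the mapper emits each visit count as the Text value "visits,1" (all class names here are hypothetical, and the two classes would live in separate files):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner: folds the mapper's "visits,1" pairs into one "sum,count" pair
// per key. Emitting sums and counts (not means) keeps the computation
// associative, so combining partial results stays correct.
public class MeanCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            sum += Long.parseLong(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        context.write(key, new Text(sum + "," + count));
    }
}

// Reducer: same accumulation, but emits the final mean.
public class MeanReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            sum += Long.parseLong(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        context.write(key, new DoubleWritable((double) sum / count));
    }
}

Note that the combiner's output types must match its input types, because Hadoop may run the combiner zero, one, or several times on a mapper's output; a combiner whose output a second combiner pass couldn't consume would break the job.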

from: http://andreaiacono.blogspot.com/2014/03/hadoop-combiners.html
