Hadoop combiners are a very powerful tool to speed up our computations. We already saw what a combiner is in a previous post, and we have also seen another form of optimization in this post. Let's put it all together to get the broader picture.
Combiners are an optimization that Hadoop can apply to perform a local reduction: the idea is to aggregate the key-value pairs directly on the mapper, to avoid transmitting all of them to the reducers.
Let's get back to the Top20 example from the previous post, which finds the 20 most used words in a text. The Hadoop output of this job is shown below:

...
Map input records=4239
Map output records=37817
Map output bytes=359621
Input split bytes=118
Combine input records=0
Combine output records=0
Reduce input groups=4987
Reduce shuffle bytes=435261
Reduce input records=37817
Reduce output records=20
...

As we can see from the counters above, without a combiner we have 4239 input lines for the mappers and 37817 key-value pairs emitted (one pair for every word occurrence in the text). Since no combiner is defined, the combiner input and output records are 0, and so the records received by the reducers are exactly those emitted by the mappers: 37817.

Let's define a simple combiner:

    public static class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // computes the number of occurrences of a single word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

As we can see, the code has the same logic as the reducer, since its goal is the same: reducing key/value pairs.
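For the combiner to be invoked, it has to be registered on the job in the driver; assuming the usual org.apache.hadoop.mapreduce.Job instance is called job, the call is simply:

    job.setCombinerClass(WordCountCombiner.class);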
Running the job with the combiner set gives us this result:

...
Map input records=4239
Map output records=37817
Map output bytes=359621
Input split bytes=116
Combine input records=37817
Combine output records=20

Reduce input groups=20
Reduce shuffle bytes=194
Reduce input records=20
Reduce output records=20
...

Looking at the output from Hadoop, we see that the combiner now has 37817 input records: all the records emitted by the mappers were passed to the combiner, and the combiner emitted 20 records, which is exactly the number of records received by the reducers.
Wow, that's a great result! We avoided the transmission of a lot of data: just 20 records instead of the 37817 we had without the combiner.

But there's a big disadvantage to using combiners: since a combiner is an optimization, Hadoop does not guarantee its execution. So, what can we do to ensure a reduction at the mapper level? Simple: we can put the logic of the reducer inside the mapper!

This is exactly what we've done in the mapper of this post. This pattern is called the "in-mapper combiner": the reduce work starts at the mapper level, so that the number of key-value pairs sent to the reducers is minimized.
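The original post's mapper isn't reproduced here, but here is a minimal sketch of what an in-mapper combiner mapper might look like for the counting part; the class name and the whitespace tokenization are assumptions made for illustration:

    public static class InMapperCombinerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        // accumulates the word counts in memory for the whole map task
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void map(LongWritable key, Text value, Context context) {
            for (String word : value.toString().split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // emit the aggregated counts only once, after the map task has read all of its input
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
            }
        }
    }

The trade-off of this pattern is memory: the in-memory map must hold every distinct key seen by the map task, so it fits best when the key space is bounded, as it is for word counting.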
Let's look at the Hadoop output with this pattern (in-mapper combiner, without the stand-alone combiner):

...
Map input records=4239
Map output records=4987
Map output bytes=61522
Input split bytes=118
Combine input records=0
Combine output records=0

Reduce input groups=4987
Reduce shuffle bytes=71502
Reduce input records=4987
Reduce output records=20
...

Compared to the run without any combining, this mapper emits only 4987 records instead of 37817, and those 4987 records are what the reducers receive. A big reduction, even if not as big as the one obtained with the stand-alone combiner.
And what happens if we couple the in-mapper combiner pattern with the stand-alone combiner? Well, we get the best of both:

...
Map input records=4239
Map output records=4987
Map output bytes=61522
Input split bytes=116
Combine input records=4987
Combine output records=20

Reduce input groups=20
Reduce shuffle bytes=194
Reduce input records=20
Reduce output records=20
...

In this last case we get the best performance: the mapper already emits a reduced number of records, and the combiner (if it's executed) shrinks the data to be shuffled even further. The only downside of this approach I can think of is that it takes more time to code.
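To give a rough idea of how the two techniques are wired together, here is a sketch of the driver configuration; Top20Driver, Top20Reducer and the other class names are placeholders, not the code of the original post:

    Job job = Job.getInstance(new Configuration(), "top 20 words");
    job.setJarByClass(Top20Driver.class);
    job.setMapperClass(InMapperCombinerMapper.class);  // pre-aggregates inside the mapper
    job.setCombinerClass(WordCountCombiner.class);     // shrinks the map output further, if Hadoop runs it
    job.setReducerClass(Top20Reducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);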

from: http://andreaiacono.blogspot.com/2014/05/more-about-hadoop-combiners.html
