It's the other way round. Number of mappers is decided based on the number of splits. In reality it is the job of InputFormat, which you are using, to create the splits. You do not have any idea about the number of mappers until number of splits has been decided. And, it's not always that splits will be created based on the HDFS block size. It totally depends on the logic inside the getSplits() method of your InputFormat.

To better understand this, assume you are processing data stored in your MySQL using MR. Since there is no concept of blocks in this case, the theory that splits are always created based on the HDFS block fails. Right? What about splits creation then? One possibility is to create splits based on ranges of rows in your MySQL table (and this is what DBInputFormat does, an input format for reading data from a relational database). Suppose you have 100 rows. Then you might have 5 splits of 20 rows each.

It is only for the InputFormats based on FileInputFormat (an InputFormat for handling data stored in files) that the splits are created based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. If you have a file smaller than the HDFS block size, you'll get only 1 mapper for that file. If you want to have some different behavior, you can use mapred.min.split.size. But it again depends solely on the getSplits() of your InputFormat.

There is a fundamental difference between MR split and HDFS block and folks often get confused by this. A block is a physical piece of data while a split is just a logical piece which is going to be fed to a mapper. A split does not contain the input data, it is just a reference to the data. Then what is a split? A split basically has 2 things : a length in bytes and a set of storage locations, which are just hostname strings.

Coming back to your question. Hadoop allows much more than 200 mappers. Having said that, it doesn't make much sense to have 200 mappers for just 500MB of data. Always remember that when you talk about Hadoop, you are dealing with very huge data. Sending just 2.5 MB data to each mapper would be an overkill. And yes, if there are no free CPU slots then some mappers may run after the completion of current mappers. But the MR framework is very intelligent and tries its best to avoid these kind of situation. If the machine where data to processed is present, doesn't have any free CPU slots, the data will be moved to a nearby node, where free slots are available, and get processed.

HTH

How many mappers/reducers should be set when configuring Hadoop cluster?

There is no formula. It depends on how many cores and how much memory do you have. The number of mapper + number of reducer should not exceed the number of cores in general. Keep in mind that the machine is also running Task Tracker and Data Node daemons. One of the general suggestion is more mappers than reducers. If I were you, I would run one of my typical jobs with reasonable amount of data to try it out.

For a normal 7200rpm disk, 2-3 mappers is a good number. For you system, with 48G mem and 16 cpu thread, I/O will likely to be the problem. I suggest you to get multiple disk for each node and set them up as JBOD.

Quoting from "Hadoop The Definite Guide, 3rd edition", page 306

Because MapReduce jobs are normally  I/O-bound, it makes sense to have more tasks than processors to get better  utilization.

The amount of oversubscription depends on the CPU utilization of jobs  you run, but a good rule of thumb is to have a factor of between one and two more  tasks (counting both map and reduce tasks) than processors.

A processor in the quote above is equivalent to one logical core.

But this is just in theory, and most likely each use case is different than another, some tests need to be performed. But this number can be a good start to test with.

No. of mappers is decided in accordance with the data locality principle as described earlier. Data Locality principle : Hadoop tries its best to run map tasks on nodes where the data is present locally to optimize on the network and inter-node communication latency. As the input data is split into pieces and fed to different map tasks, it is desirable to have all the data fed to that map task available on a single node.Since HDFS only guarantees data having size equal to its block size (64M) to be present on one node, it is advised/advocated to have the split size equal to the HDFS block size so that the map task can take advantage of this data localization. Therefore, 64M of data per mapper. If we see some mappers running for a very small period of time, try to bring down the number of mappers and make them run longer for a minute or so.

No. of reducers should be slightly less than the number of reduce slots in the cluster (the concept of slots comes in with a pre-configuration in the job/task tracker properties while configuring the cluster) so that all the reducers finish in one wave and make full utilisation of the cluster resources.

mapred.tasktracker.reduce.tasks.maximum

mapred.tasktracker.map.tasks.maximum

in mapred-site.xml

This is applicable for all jobs. If you want to set for a specific one, you can use

mapred.reduce.tasks

mapred.map.tasks

Liyin Tang added a comment  - 13/Nov/10 01:16

I just finished converting common join into map join based on the file size.  There are 2 flags to control this optimization.

1) set hive.auto.convert.join = true; It means this optimization is enabled. By default right now, this flag is disabled in order not to break any existing test cases. Also I put 25 additional test cases, auto_join0.q - auto_join25.q, which covers this optimization code.

2) Set hive.hashtable.max.memory.usage = 0.9;  It means if the memory usage of local task is more than 90% of its heap size, then the local task will abort by itself. The Driver will know the local work fails and it won't submit the MapJoinTask (a Map Only MapRedTask)  to Hadoop, but instead, it will submit the originally CommonJoinTask to Hadoop to run.

3) Set hive.smalltable.filesize = 25000000L;  It means if the summary of the small table file size is less than 25M, then it will run the map join task. If not, just run the originally common join task.

MapReduce: number of mappers/reducers的更多相关文章

  1. Hadoop官方文档翻译——MapReduce Tutorial

    MapReduce Tutorial(个人指导) Purpose(目的) Prerequisites(必备条件) Overview(综述) Inputs and Outputs(输入输出) MapRe ...

  2. [转]Hive:简单查询不启用Mapreduce job而启用Fetch task

    转自:http://www.iteblog.com/archives/831 如果你想查询某个表的某一列,Hive默认是会启用MapReduce Job来完成这个任务,如下: hive> SEL ...

  3. MIT 6.824 lab1:mapreduce

    这是 MIT 6.824 课程 lab1 的学习总结,记录我在学习过程中的收获和踩的坑. 我的实验环境是 windows 10,所以对lab的code 做了一些环境上的修改,如果你仅仅对code 感兴 ...

  4. Number of dynamic partitions exceeded hive.exec.max.dynamic.partitions.pernode

    动态分区数太大的问题:[Fatal Error] Operator FS_2 (id=2): Number of dynamic partitions exceeded hive.exec.max.d ...

  5. Hive之简单查询不启用MapReduce

    假设你想查询某个表的某一列.Hive默认是会启用MapReduce Job来完毕这个任务,例如以下: 01 hive> SELECT id, money FROM m limit 10; 02 ...

  6. 011-HQL中级1-Hive快捷查询:不启用Mapreduce job启用Fetch task三种方式介绍

    如果你想查询某个表的某一列,Hive默认是会启用MapReduce Job来完成这个任务,如下: hive; Total MapReduce jobs Launching Job out since ...

  7. MapReduce 模式、算法和用例(MapReduce Patterns, Algorithms, and Use Cases)

    在新文章“MapReduce模式.算法和用例”中,Ilya Katsov提供了一个系统化的综述,阐述了能够应用MapReduce框架解决的问题. 文章开始描述了一个非常简单的.作为通用的并行计算框架的 ...

  8. 从零自学Hadoop(16):Hive数据导入导出,集群数据迁移上

    阅读目录 序 导入文件到Hive 将其他表的查询结果导入表 动态分区插入 将SQL语句的值插入到表中 模拟数据文件下载 系列索引 本文版权归mephisto和博客园共有,欢迎转载,但须保留此段声明,并 ...

  9. hive创建索引

    索引是hive0.7之后才有的功能,创建索引需要评估其合理性,因为创建索引也是要磁盘空间,维护起来也是需要代价的 创建索引 hive> create index [index_studentid ...

随机推荐

  1. Spring学习笔记--注入Bean属性

    这里通过一个MoonlightPoet类来演示了注入Bean属性property的效果. package com.moonlit.myspring; import java.util.List; im ...

  2. MVC Route路由

    由于某些原因,需要默认区域,所以需要对路由进行设置 具体实现如下: using System.Web; using System.Web.Mvc; using System.Web.Routing; ...

  3. 【MySQL】查询时强制区分大小写的方法

    MySQL默认的查询也不区分大小写.但作为用户信息,一旦用户名重复,又会浪费很多资源.再者,李逵.李鬼的多起来,侦辨起来很困难.要做到这一点,要么在建表时,明确大小写敏感(字段明确大小写敏感) sql ...

  4. 【linux】Crontab 定时任务 使用实例

    1 使用putty 登录linux 服务器 2 输入以下命令.查看已有的定时任务 crontab -l 3 输入  以下命令,进入定时任务文件 crontab -e 4  键盘 选择 i  键 进行输 ...

  5. SASS环境搭建及HBuilder中sass预编译配置

    ---------------------------------Ruby环境安装-------------------------------- 至于为什么要安装ruby环境请移步:https:// ...

  6. JS-记住用户名【cookie封装引申】

    <!DOCTYPE html> <html> <head> <meta charset="UTF-8"> <title> ...

  7. 高中生的IT之路-1.5西餐厅服务生

    之所以说漫长的求职,是因为培训结束后半年左右没有找到工作. 每次面试结束后,得到的都是“回去等消息”,然后就杳无音信了.一次次的面试,一次次的失败,一次次查找失败的原因.总结来看主要有两点:一是没有工 ...

  8. 代码片段,lucene基本操作(基于lucene4.10.2)

    1.最基本的创建索引: @Test public void testIndex(){ try { Directory directory = FSDirectory.open(new File(LUC ...

  9. c# Use NAudio Library to Convert MP3 audio into WAV audio(将Mp3格式转换成Wav格式)

    Have you been in need of converting mp3 audios to wav audios?  If so, the skill in this article prov ...

  10. WCF(四) 绑定

    绑定 是一个制定好的通道栈,包含了协议通道,传输通道和编码器.从功能上来看,一个绑定集成了通信模式.可靠性.安全性.事务传播和互操作性 绑定方式分两种:代码中和配置文件中绑定 1: 2: 3.配置ap ...