The prototypical MapReduce example counts the appearance of each word in a set of documents:[14]

function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += pc
    emit (word, sum)

Here, each document is split into words, and each word is counted by the map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce. Thus, this function just needs to sum all of its input values to find the total appearances of that word.
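
The word count above can be sketched in Python as a single-process toy. This is only an illustration under stated assumptions: the names `map_words`, `reduce_counts`, and `run_word_count` are hypothetical, the sample documents are made up, and a dictionary stands in for the framework's distributed grouping of pairs by key.

```python
from collections import defaultdict


def map_words(name, document):
    """Map step: emit (word, 1) for every word in the document."""
    for word in document.split():
        yield word, 1


def reduce_counts(word, partial_counts):
    """Reduce step: sum the partial counts collected for one word."""
    return word, sum(partial_counts)


def run_word_count(documents):
    """Toy driver (an assumption, not a real framework): group by key, then reduce."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_words(name, text):
            grouped[word].append(count)  # all pairs with the same key end up together
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


# Made-up sample input: document name -> document contents.
counts = run_word_count({"a.txt": "to be or not to be", "b.txt": "to do"})
```

In a real deployment the grouping step is the distributed shuffle, and each call to the reduce function may run on a different machine.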

As another example, imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age. In SQL, such a query could be expressed as:

  SELECT age, AVG(contacts)
    FROM social.person
GROUP BY age

Using MapReduce, the same computation could be achieved with the following functions:

function Map is
    input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
    for each social.person record in the K1 batch do
        let Y be the person's age
        let N be the number of contacts the person has
        produce one output record (Y,(N,1))
end function

function Reduce is
    input: age (in years) Y
    for each input record (Y,(N,C)) do
        Accumulate in S the sum of N*C
        Accumulate in Cnew the sum of C
    let A be S/Cnew
    produce one output record (Y,(A,Cnew))
end function

Each map output record carries an age and a quantity of contacts; each reduce step yields an age and the average of contacts for that age.
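
The average-contacts-by-age computation can be sketched in Python as follows. This is a minimal single-process illustration, not a distributed implementation: the function names (`map_batch`, `reduce_age`, `average_contacts_by_age`), the record layout, and the sample data are all assumptions made for the example. The point it shows is why the reducer carries a count alongside each average: weighting by the count lets partial averages be combined correctly.

```python
from collections import defaultdict


def map_batch(batch):
    """Map step: emit (age, (contacts, 1)) for each person record in a batch."""
    for person in batch:
        yield person["age"], (person["contacts"], 1)


def reduce_age(age, records):
    """Reduce step: merge (average, count) pairs into one count-weighted average."""
    total = 0  # S in the pseudocode: sum of N*C
    count = 0  # Cnew in the pseudocode: sum of C
    for n, c in records:
        total += n * c
        count += c
    return age, (total / count, count)


def average_contacts_by_age(batches):
    """Toy driver (hypothetical): group map output by age, then reduce each group."""
    grouped = defaultdict(list)
    for batch in batches:
        for age, value in map_batch(batch):
            grouped[age].append(value)
    return dict(reduce_age(age, recs) for age, recs in sorted(grouped.items()))


# Made-up sample data: two small batches of person records.
batches = [
    [{"age": 10, "contacts": 9}, {"age": 10, "contacts": 3}],
    [{"age": 10, "contacts": 6}, {"age": 25, "contacts": 4}],
]
result = average_contacts_by_age(batches)
```

Because `reduce_age` accepts and returns (average, count) pairs, its output can be fed back into another reduce call, which is what allows partial results from different machines to be merged.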


The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:

  • an input reader
  • a Map function
  • a partition function
  • a compare function
  • a Reduce function
  • an output writer
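
As a rough illustration of how these hot spots plug into the fixed part, the sketch below passes each one as a callback to a toy single-process driver. The driver name `run_job`, its parameter names, and the sample input are hypothetical and do not correspond to any real framework's API; the in-memory sort per partition merely stands in for the large distributed sort.

```python
from functools import cmp_to_key
from itertools import groupby


def run_job(input_reader, map_fn, partition_fn, compare_fn, reduce_fn,
            output_writer, num_reducers=2):
    """Toy driver: every argument except num_reducers is an application hot spot."""
    # Map phase: the partition function decides which reducer gets each pair.
    partitions = [[] for _ in range(num_reducers)]
    for key, value in input_reader():
        for k2, v2 in map_fn(key, value):
            partitions[partition_fn(k2, num_reducers)].append((k2, v2))
    # Frozen part: sort each partition by key using the compare function,
    # then hand each run of equal keys to the reduce function.
    for part in partitions:
        part.sort(key=cmp_to_key(lambda a, b: compare_fn(a[0], b[0])))
        for k2, group in groupby(part, key=lambda pair: pair[0]):
            output_writer(k2, reduce_fn(k2, [v for _, v in group]))


# Usage with made-up input: word count expressed through the six hot spots.
results = {}
run_job(
    input_reader=lambda: [("doc1", "b a b"), ("doc2", "a c")],
    map_fn=lambda name, text: [(w, 1) for w in text.split()],
    partition_fn=lambda key, n: sum(key.encode()) % n,  # deterministic toy hash
    compare_fn=lambda a, b: (a > b) - (a < b),
    reduce_fn=lambda key, values: sum(values),
    output_writer=results.__setitem__,
)
```

The partition function here uses a byte sum rather than Python's built-in `hash`, because string hashing in Python is randomized per process and would make the routing non-reproducible.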


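^ "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -"MapReduce: Simplified Data Processing on Large Clusters"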









MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.[1][2]

A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

The model is a specialization of the split-apply-combine strategy for data analysis.[3] It is inspired by the map and reduce functions commonly used in functional programming,[4] although their purpose in the MapReduce framework is not the same as in their original forms.[5] The key contributions of the MapReduce framework are not the actual map and reduce functions (which, for example, resemble the 1995 Message Passing Interface standard's[6] reduce[7] and scatter[8] operations), but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine. As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations.[9] The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.[10]

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that supports distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been genericized. By 2014, Google was no longer using MapReduce as its primary big data processing model,[11] and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.[12]
