In this post we'll see how to compute the mean of the maximum temperatures of every month for the city of Milan.
The temperature data is taken from http://archivio-meteo.distile.it/tabelle-dati-archivio-meteo/, but since the site only shows the data in tabular form, we had to sniff the HTTP conversation to find the URL the data actually come from, and to see that they arrive in JSON format.
Using Jackson, we can transform this JSON into a format that is simpler to use with Hadoop: CSV. The result of the conversion looks like this:

01012000,-4.0,5.0
02012000,-5.0,5.1
03012000,-5.0,7.7
04012000,-3.0,9.7
...

If you're curious to see how we transformed it, take a look at the source code.
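To give a rough idea of what that conversion looks like, here is a minimal sketch using Jackson's tree model. The input file name and the JSON field names ("giorno", "tmin", "tmax") are assumptions, not the actual payload: check the source code and the real JSON for the exact names.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.io.PrintWriter;

public class JsonToCsv {

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // reads the whole JSON document into a tree (file name is illustrative)
        JsonNode root = mapper.readTree(new File("milan.json"));

        try (PrintWriter out = new PrintWriter("milan.csv")) {
            // emits one CSV line per daily record: date,min,max
            // (the field names "giorno", "tmin" and "tmax" are assumptions)
            for (JsonNode day : root) {
                out.println(day.get("giorno").asText() + ","
                        + day.get("tmin").asDouble() + ","
                        + day.get("tmax").asDouble());
            }
        }
    }
}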

Let's look at the mapper class for this job:

public static class MeanMapper extends Mapper<Object, Text, Text, SumCount> {

    private final int DATE = 0;
    private final int MIN = 1;
    private final int MAX = 2;

    private Map<Text, List<Double>> maxMap = new HashMap<>();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        // gets the fields of the CSV line
        String[] values = value.toString().split(",");

        // defensive check
        if (values.length != 3) {
            return;
        }

        // gets date and max temperature
        String date = values[DATE];
        Text month = new Text(date.substring(2));
        Double max = Double.parseDouble(values[MAX]);

        // if not present, puts this month into the map
        if (!maxMap.containsKey(month)) {
            maxMap.put(month, new ArrayList<Double>());
        }

        // adds the max temperature for this day to the list of temperatures
        maxMap.get(month).add(max);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {

        // loops over the months collected in the map() method
        for (Text month : maxMap.keySet()) {

            List<Double> temperatures = maxMap.get(month);

            // computes the sum of the max temperatures for this month
            Double sum = 0d;
            for (Double max : temperatures) {
                sum += max;
            }

            // emits the month as the key and a SumCount as the value
            context.write(month, new SumCount(sum, temperatures.size()));
        }
    }
}

As we've seen in the previous posts (about optimization and combiners), in the mapper we first collect the values into a map, and once the input is over we loop over its keys to sum the values and emit them. Note that we use the SumCount class, a utility class that wraps the two values we need to compute a mean: the sum of all the values and the number of values.
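The post doesn't list SumCount itself, but from the way it's used above we can sketch a plausible implementation: it has to implement Hadoop's Writable interface to travel from the mappers to the reducers. Take this as a sketch consistent with the calls in the code, not necessarily the exact class from the source:

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class SumCount implements Writable {

    private DoubleWritable sum;
    private IntWritable count;

    public SumCount() {
        this.sum = new DoubleWritable(0d);
        this.count = new IntWritable(0);
    }

    public SumCount(Double sum, Integer count) {
        this.sum = new DoubleWritable(sum);
        this.count = new IntWritable(count);
    }

    public DoubleWritable getSum() {
        return sum;
    }

    public IntWritable getCount() {
        return count;
    }

    // sums another SumCount into this one (both the sums and the counts)
    public void addSumCount(SumCount other) {
        sum.set(sum.get() + other.getSum().get());
        count.set(count.get() + other.getCount().get());
    }

    @Override
    public void write(DataOutput out) throws IOException {
        sum.write(out);
        count.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sum.readFields(in);
        count.readFields(in);
    }
}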
A common error in this kind of computation is making the mapper emit the mean directly; let's see what can happen with a dataset like this:

01012000,0,10.0
02012000,0,20.0
03012000,0,2.0
04012000,0,4.0
05012000,0,3.0

and two mappers, which receive the first two and the last three lines respectively. The first mapper computes a mean of 15.0, given by (10.0 + 20.0) / 2. The second computes a mean of 3.0, given by (2.0 + 4.0 + 3.0) / 3. When the reducer receives these two values, it sums them and divides by two, so the final mean comes out as 9.0, given by (15.0 + 3.0) / 2. But the correct mean for the values in this example is 7.8, given by (10.0 + 20.0 + 2.0 + 4.0 + 3.0) / 5.
This error comes from the fact that any mapper can receive any number of lines, so the value it emits is only part of the information needed to compute a mean.

If, instead of emitting the mean, we emit the sum of the values together with the number of values, we overcome the problem: in the example above, the first mapper emits the pair (30.0, 2) and the second the pair (9.0, 3); summing the sums and dividing by the sum of the counts gives the correct result, (30.0 + 9.0) / (2 + 3) = 7.8.

Let's get back to our job and look at the reducer:

public static class MeanReducer extends Reducer<Text, SumCount, Text, DoubleWritable> {

    private Map<Text, SumCount> sumCountMap = new HashMap<>();

    @Override
    public void reduce(Text key, Iterable<SumCount> values, Context context) throws IOException, InterruptedException {

        SumCount totalSumCount = new SumCount();

        // loops over all the SumCount objects received for this month (the "key" param)
        for (SumCount sumCount : values) {

            // sums all of them
            totalSumCount.addSumCount(sumCount);
        }

        // puts the resulting SumCount into a map
        sumCountMap.put(new Text(key), totalSumCount);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {

        // loops over the months collected in the reduce() method
        for (Text month : sumCountMap.keySet()) {

            double sum = sumCountMap.get(month).getSum().get();
            int count = sumCountMap.get(month).getCount().get();

            // emits the month and the mean of the max temperatures for the month
            context.write(month, new DoubleWritable(sum / count));
        }
    }
}

The reducer is simpler because it just has to retrieve all the SumCount objects emitted by the mappers and add them together. Once all the input has been consumed, it loops over the map of SumCount objects and emits each month along with the mean of its max temperatures.
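To run the job end to end, we also need a driver that wires the mapper and the reducer together. Here's a hedged sketch of what it could look like; the class name, the job name and the argument handling are illustrative, not taken from the source code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MeanJob {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mean-max-temperature");
        job.setJarByClass(MeanJob.class);

        job.setMapperClass(MeanMapper.class);
        job.setReducerClass(MeanReducer.class);

        // mapper and reducer emit different value types,
        // so both pairs of output classes must be declared
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(SumCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}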

from: http://andreaiacono.blogspot.com/2014/04/computing-mean-with-mapreduce.html
