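A while ago I attended a MapReduce training course run by 炼数成金 (Dataguru). The homework examples from that course are fairly representative, which makes them ideal for explaining the common patterns.

There is also a textbook on MR from abroad that I found quite useful.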

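I. MapReduce use cases

What problems can MR solve? Generally speaking, its most common uses are log analysis and the sorting and processing of massive datasets. Recently my company has been using MR for offline, parallel analysis of large volumes of logs.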

II. How MapReduce works

If you are not familiar with how MR works, I recommend reading this blog post first: http://blog.csdn.net/athenaer/article/details/8203990
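As a rough mental model of the map, shuffle, and reduce phases, the flow can be imitated inside a single JVM. This is only a plain-Java sketch that does not use Hadoop at all; `MiniMapReduce` and its method names are made up for illustration, and the input format "key,value" is an assumption.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of the MapReduce data flow (not Hadoop code).
final class MiniMapReduce {

    // Map phase: turn one input line "key,value" into a (key, value) pair.
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0], Integer.parseInt(parts[1]));
    }

    // Shuffle phase: group all values by key; the TreeMap mimics sorted key order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values of one key into a single result (here a sum).
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in order over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }
}
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, but the shape of the computation is the same.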

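III. Common computation patterns

Here is an example. The DEPT and EMP tables live under Oracle's default user Scott. For convenience, they are written out here as two TXT files: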

1. Department table

DEPTNO,DNAME,LOC    // department number, department name, location

10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

2. Employee table

EMPNO,ENAME,JOB,HIREDATE,SAL,COMM,DEPTNO,MGR // employee number, name, job, hire date, salary, commission, department, manager
7369,SMITH,CLERK,1980-12-17 00:00:00.0,800,,20,7902
7499,ALLEN,SALESMAN,1981-02-20 00:00:00.0,1600,300,30,7698
7521,WARD,SALESMAN,1981-02-22 00:00:00.0,1250,500,30,7698
7566,JONES,MANAGER,1981-04-02 00:00:00.0,2975,,20,7839
7654,MARTIN,SALESMAN,1981-09-28 00:00:00.0,1250,1400,30,7698
7698,BLAKE,MANAGER,1981-05-01 00:00:00.0,2850,,30,7839
7782,CLARK,MANAGER,1981-06-09 00:00:00.0,2450, ,10,7839
7839,KING,PRESIDENT,1981-11-17 00:00:00.0,5000,,10,
7844,TURNER,SALESMAN,1981-09-08 00:00:00.0,1500,0,30,7698
7900,JAMES,CLERK,1981-12-03 00:00:00.0,950,,30,7698
7902,FORD,ANALYST,1981-12-03 00:00:00.0,3000,,20,7566
7934,MILLER,CLERK,1982-01-23 00:00:00.0,1300,,10,7782

3. Instantiating the rows as beans

The only real job of these two beans is to split the incoming string and extract their field values from it.

emp.java
public Emp(String inStr) {
    // A limit of -1 keeps trailing empty fields, so a row whose MGR is empty still yields 8 elements
    String[] split = inStr.split(",", -1);
    this.empno = split[0].trim();
    this.ename = split[1].trim();
    this.job = split[2].trim();
    this.hiredate = split[3].trim();
    this.sal = (split[4].trim().isEmpty() ? "0" : split[4].trim()); // a missing salary counts as 0
    this.comm = split[5].trim(); // may be empty: most employees have no commission
    this.deptno = split[6].trim();
    this.mgr = split[7].trim(); // empty for the president, who has no manager
}

dept.java

public Dept(String string) {
    String[] split = string.split(",");
    this.deptno = split[0];
    this.dname = split[1];
    this.loc = split[2];
}
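One detail that matters when parsing these rows: `String.split(",")` silently drops trailing empty strings, so a row that ends with an empty MGR field produces one element fewer than expected, while a negative limit keeps it. The small sketch below shows the difference; `SplitDemo` is a hypothetical class name used only for this illustration.

```java
// Demonstrates how String.split treats trailing empty fields.
final class SplitDemo {

    // Default split: trailing empty strings are removed from the result.
    static int fieldsDefault(String line) {
        return line.split(",").length;
    }

    // Negative limit: trailing empty strings are kept.
    static int fieldsKeepTrailing(String line) {
        return line.split(",", -1).length;
    }

    public static void main(String[] args) {
        String king = "7839,KING,PRESIDENT,1981-11-17 00:00:00.0,5000,,10,";
        System.out.println(fieldsDefault(king));      // 7: the empty MGR field is dropped
        System.out.println(fieldsKeepTrailing(king)); // 8: the empty MGR field is kept
    }
}
```

Note that the empty COMM field in the middle of the row survives in both cases; only empty fields at the very end are affected.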

4. Pattern analysis

4.1 Sum

Total salary of each department
public static class Map_1 extends MapReduceBase implements Mapper<Object, Text, Text, IntWritable> {
    public void map(Object key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        try {
            Emp emp = new Emp(value.toString());
            output.collect(new Text(emp.getDeptno()), new IntWritable(Integer.parseInt(emp.getSal()))); // { k=department number, v=employee salary }
        } catch (Exception e) {
            // Count the malformed line and log it instead of failing the task
            reporter.getCounter(ErrCount.LINESKIP).increment(1);
            WriteErrLine.write("./input/" + this.getClass().getSimpleName() + "err_lines", reporter.getCounter(ErrCount.LINESKIP).getCounter() + " " + value.toString());
        }
    }
}

public static class Reduce_1 extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum = sum + values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Execution result: [screenshot not preserved]
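As a cross-check on the sample data, the same aggregation (roughly `SELECT DEPTNO, SUM(SAL) FROM EMP GROUP BY DEPTNO` in SQL) can be computed in memory. The sketch below is plain Java rather than Hadoop code; `DeptSalarySum` is a hypothetical name, and the input rows are the DEPTNO and SAL columns copied from the employee table above.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory cross-check of the per-department salary sum (not Hadoop code).
final class DeptSalarySum {

    // Each input row is "DEPTNO,SAL", i.e. the key and value that Map_1 emits.
    static Map<String, Integer> sumByDept(String[] rows) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String row : rows) {
            String[] kv = row.split(",");
            sums.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] rows = {
            "20,800", "30,1600", "30,1250", "20,2975", "30,1250", "30,2850",
            "10,2450", "10,5000", "30,1500", "30,950", "20,3000", "10,1300"
        };
        System.out.println(sumByDept(rows)); // {10=8750, 20=6775, 30=9400}
    }
}
```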

4.3 Average

Headcount and average salary of each department
public static class Map_2 extends MapReduceBase implements Mapper<Object, Text, Text, IntWritable> {
    public void map(Object key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        try {
            Emp emp = new Emp(value.toString());
            output.collect(new Text(emp.getDeptno()), new IntWritable(Integer.parseInt(emp.getSal()))); // { k=department number, v=salary }
        } catch (Exception e) {
            reporter.getCounter(ErrCount.LINESKIP).increment(1);
            WriteErrLine.write("./input/" + this.getClass().getSimpleName() + "err_lines", reporter.getCounter(ErrCount.LINESKIP).getCounter() + " " + value.toString());
        }
    }
}

public static class Reduce_2 extends MapReduceBase implements Reducer<Text, IntWritable, Text, Text> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        double sum = 0; // total salary of the department
        int count = 0; // headcount
        while (values.hasNext()) {
            count++;
            sum = sum + values.next().get();
        }
        output.collect(key, new Text(count + " " + sum / count));
    }
}
Execution result: [screenshot not preserved]
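One design point worth keeping in mind: unlike a sum, an average cannot be obtained by averaging partial averages. If partial aggregation were ever introduced (for example through a combiner, which the code above does not use), the intermediate values would need to carry both a sum and a count. The sketch below illustrates this in plain Java; `SumCount` is a hypothetical helper and not part of the original job.

```java
// Hypothetical (sum, count) pair that can be merged safely across partial results.
final class SumCount {
    final long sum;
    final int count;

    SumCount(long sum, int count) {
        this.sum = sum;
        this.count = count;
    }

    // Merging two partial results adds both components.
    SumCount merge(SumCount other) {
        return new SumCount(sum + other.sum, count + other.count);
    }

    double average() {
        return (double) sum / count;
    }

    public static void main(String[] args) {
        // Department 30 salaries, split as if two mappers had each seen part of them.
        SumCount partA = new SumCount(1600 + 1250, 2);
        SumCount partB = new SumCount(1250 + 2850 + 1500 + 950, 4);
        System.out.println(partA.merge(partB).average());            // true average: 9400 / 6
        System.out.println((partA.average() + partB.average()) / 2); // average of averages: a different value
    }
}
```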

4.4 Grouped sorting

Name of the earliest-hired employee in each department
public static class Map_3 extends MapReduceBase implements Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        try {
            Emp emp = new Emp(value.toString());
            output.collect(new Text(emp.getDeptno()), new Text(emp.getHiredate() + "~" + emp.getEname())); // { k=department number, v=hire date~name }
        } catch (Exception e) {
            reporter.getCounter(ErrCount.LINESKIP).increment(1);
            WriteErrLine.write("./input/" + this.getClass().getSimpleName() + "err_lines", reporter.getCounter(ErrCount.LINESKIP).getCounter() + " " + value.toString());
        }
    }
}

public static class Reduce_3 extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // Use an explicit pattern: the locale-dependent DateFormat.getDateInstance() may not parse "1980-12-17"
        DateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        Date minDate = null;
        String minName = null; // name that belongs to minDate
        while (values.hasNext()) {
            try {
                String[] strings = values.next().toString().split("~"); // [hire date, name]
                Date d = sdf.parse(strings[0].substring(0, 10));
                if (minDate == null || d.before(minDate)) {
                    minDate = d;
                    minName = strings[1]; // keep the name in step with the earliest date
                }
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
        if (minDate != null) {
            output.collect(key, new Text(sdf.format(minDate) + " " + minName));
        }
    }
}

Execution result: [screenshot not preserved]
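The reducer logic for this pattern boils down to "track the minimum date and the name that goes with it". A minimal sketch of that idea in plain Java, using `java.time.LocalDate` instead of the older `Date` API, might look as follows. `EarliestHire` is a hypothetical name, and the input strings are assumed to have the `hiredate~name` shape that the mapper emits.

```java
import java.time.LocalDate;

// Picks the earliest "hiredate~name" value, mirroring what the reducer receives for one department.
final class EarliestHire {

    static String earliest(Iterable<String> values) {
        LocalDate minDate = null;
        String minName = null;
        for (String v : values) {
            String[] parts = v.split("~");
            // The first 10 characters are the ISO date, e.g. "1981-02-20".
            LocalDate d = LocalDate.parse(parts[0].substring(0, 10));
            if (minDate == null || d.isBefore(minDate)) {
                minDate = d;
                minName = parts[1]; // remember the name together with the date
            }
        }
        return minDate == null ? null : minDate + " " + minName;
    }
}
```

For department 20 in the sample data, for example, the earliest hire date belongs to SMITH (1980-12-17).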

4.5 Multi-table join

Total salary of the employees in each city
public static class Map_4 extends MapReduceBase implements Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        try {
            // The input file name tells us which table the current line belongs to
            String fileName = ((FileSplit) reporter.getInputSplit()).getPath().getName();
            if (fileName.equalsIgnoreCase("emp.txt")) {
                Emp emp = new Emp(value.toString());
                output.collect(new Text(emp.getDeptno()), new Text("A#" + emp.getSal())); // { k=department number, v=A#salary }
            }
            if (fileName.equalsIgnoreCase("dept.txt")) {
                Dept dept = new Dept(value.toString());
                output.collect(new Text(dept.getDeptno()), new Text("B#" + dept.getLoc())); // { k=department number, v=B#city }
            }
        } catch (Exception e) {
            reporter.getCounter(ErrCount.LINESKIP).increment(1);
            WriteErrLine.write("./input/" + this.getClass().getSimpleName() + "err_lines", reporter.getCounter(ErrCount.LINESKIP).getCounter() + " " + value.toString());
        }
    }
}

public static class Reduce_4 extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String deptV;
        Vector<String> empList = new Vector<String>(); // salary data from the EMP table
        Vector<String> deptList = new Vector<String>(); // location data from the DEPT table
        while (values.hasNext()) {
            deptV = values.next().toString();
            if (deptV.startsWith("A#")) {
                empList.add(deptV.substring(2));
            }
            if (deptV.startsWith("B#")) {
                deptList.add(deptV.substring(2));
            }
        }
        for (String location : deptList) {
            double sumSal = 0; // reset per location so totals never carry over
            for (String salary : empList) {
                // total salary of the employees in this city
                sumSal = Integer.parseInt(salary) + sumSal;
            }
            output.collect(new Text(location), new Text(Double.toString(sumSal)));
        }
    }
}
Execution result: [screenshot not preserved]
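The core of this reduce-side join is the tagging trick: values from the two tables arrive mixed together under the same department key, and the `A#` / `B#` prefixes tell them apart. The following plain-Java sketch handles the values of a single key; `TaggedJoin` is a hypothetical name, and it assumes at most one DEPT row per department number.

```java
import java.util.ArrayList;
import java.util.List;

// Reduce-side join for ONE department key: values are tagged "A#<salary>" (EMP) or "B#<city>" (DEPT).
final class TaggedJoin {

    static String joinOneKey(List<String> taggedValues) {
        List<Integer> salaries = new ArrayList<>();
        String city = null;
        for (String v : taggedValues) {
            if (v.startsWith("A#")) {
                salaries.add(Integer.parseInt(v.substring(2)));
            } else if (v.startsWith("B#")) {
                city = v.substring(2);
            }
        }
        if (city == null) {
            return null; // no matching DEPT row, so there is nothing to join against
        }
        int total = 0;
        for (int s : salaries) {
            total += s;
        }
        return city + " " + total;
    }
}
```

A department without employees, such as BOSTON in the sample data, still produces a row with a total of 0, because the DEPT side alone is enough to emit output.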

4.6 Single-table join (self-join)

Names and salaries of employees who earn more than their manager
public static class Map_5 extends MapReduceBase implements Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        try {
            Emp emp = new Emp(value.toString());
            output.collect(new Text(emp.getMgr()), new Text("A#" + emp.getEname() + "~" + emp.getSal())); // employee side { k=manager's employee number, v=employee name~salary }
            output.collect(new Text(emp.getEmpno()), new Text("B#" + emp.getEname() + "~" + emp.getSal())); // "manager" side { k=own employee number, v=name~salary }
        } catch (Exception e) {
            reporter.getCounter(ErrCount.LINESKIP).increment(1);
            WriteErrLine.write("./input/" + this.getClass().getSimpleName() + "err_lines", reporter.getCounter(ErrCount.LINESKIP).getCounter() + " " + value.toString());
        }
    }
}

public static class Reduce_5 extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String value;
        Vector<String> empList = new Vector<String>(); // employee side
        Vector<String> mgrList = new Vector<String>(); // manager side
        while (values.hasNext()) {
            value = values.next().toString();
            if (value.startsWith("A#")) {
                empList.add(value.substring(2));
            }
            if (value.startsWith("B#")) {
                mgrList.add(value.substring(2));
            }
        }
        String empName, empSal, mgrSal;
        for (String employee : empList) {
            for (String mgr : mgrList) {
                String[] empInfo = employee.split("~");
                empName = empInfo[0];
                empSal = empInfo[1];
                String[] mgrInfo = mgr.split("~");
                mgrSal = mgrInfo[1];
                // Emit only the employees who earn more than the manager of this group
                if (Integer.parseInt(empSal) > Integer.parseInt(mgrSal)) {
                    output.collect(key, new Text(empName + " " + empSal));
                }
            }
        }
    }
}

Execution result: [screenshot not preserved]
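The self-join works because every employee row is emitted twice: once under the manager's number (as a subordinate) and once under the employee's own number (as a potential manager). The sketch below imitates the grouping in memory with plain Java; `SelfJoin` is a hypothetical name, and each row is assumed to be already reduced to the four columns EMPNO, ENAME, SAL, MGR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the self-join: every row is emitted under MGR and under EMPNO.
final class SelfJoin {

    // Each row is {EMPNO, ENAME, SAL, MGR}. Returns "ENAME SAL" for employees who out-earn their manager.
    static List<String> paidMoreThanManager(String[][] rows) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] r : rows) {
            String payload = r[1] + "~" + r[2];
            groups.computeIfAbsent(r[3], k -> new ArrayList<>()).add("A#" + payload); // as subordinate
            groups.computeIfAbsent(r[0], k -> new ArrayList<>()).add("B#" + payload); // as manager
        }
        List<String> result = new ArrayList<>();
        for (List<String> group : groups.values()) {
            for (String mgr : group) {
                if (!mgr.startsWith("B#")) {
                    continue;
                }
                int mgrSal = Integer.parseInt(mgr.substring(2).split("~")[1]);
                for (String emp : group) {
                    if (!emp.startsWith("A#")) {
                        continue;
                    }
                    String[] e = emp.substring(2).split("~");
                    if (Integer.parseInt(e[1]) > mgrSal) {
                        result.add(e[0] + " " + e[1]);
                    }
                }
            }
        }
        return result;
    }
}
```

In the sample data, FORD (3000) reports to JONES (2975), so FORD is the one row that should come out.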

4.7 TOP N

Names and salaries of the three highest-paid employees
public static class Map_8 extends MapReduceBase implements Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        try {
            Emp emp = new Emp(value.toString());
            output.collect(new Text("1"), new Text(emp.getEname() + "~" + emp.getSal())); // { k=any constant string or number, v=employee name~salary }
        } catch (Exception e) {
            reporter.getCounter(ErrCount.LINESKIP).increment(1);
            WriteErrLine.write("./input/" + this.getClass().getSimpleName() + "err_lines", reporter.getCounter(ErrCount.LINESKIP).getCounter() + " " + value.toString());
        }
    }
}

public static class Reduce_8 extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // TreeMap keeps its keys sorted (ascending by default); with a reversed comparator
        // the highest salaries come first, so the first N entries are the top N.
        // Caveat: employees with the same salary share one key, so only the last one read is kept.
        Map<Integer, String> emp = new TreeMap<Integer, String>(Collections.reverseOrder());
        while (values.hasNext()) {
            String[] valStrings = values.next().toString().split("~");
            emp.put(Integer.parseInt(valStrings[1]), valStrings[0]);
        }
        int count = 0; // counter
        for (Iterator<Integer> keySet = emp.keySet().iterator(); keySet.hasNext();) {
            if (count < 3) { // N = 3
                Integer current_key = keySet.next();
                output.collect(new Text(emp.get(current_key)), new Text(current_key.toString())); // iterate over the keys, i.e. SAL
                count++;
            } else {
                break;
            }
        }
    }
}

Execution result: [screenshot not preserved]
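A map keyed by salary can hold only one employee per salary value. An alternative that tolerates ties and never stores more than N entries is a bounded min-heap. The sketch below is plain Java under that assumption; `TopN` is a hypothetical name, and the entries use the same `ENAME~SAL` shape as the mapper output.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Top-N selection with a bounded min-heap; entries are "ENAME~SAL" strings.
final class TopN {

    static int salaryOf(String entry) {
        return Integer.parseInt(entry.split("~")[1]);
    }

    private static final Comparator<String> BY_SALARY = Comparator.comparingInt(TopN::salaryOf);

    static List<String> top(Iterable<String> entries, int n) {
        // The heap root is always the smallest salary currently kept.
        PriorityQueue<String> heap = new PriorityQueue<>(BY_SALARY);
        for (String e : entries) {
            heap.add(e);
            if (heap.size() > n) {
                heap.poll(); // evict the smallest so at most n entries remain
            }
        }
        List<String> result = new ArrayList<>(heap);
        result.sort(BY_SALARY.reversed()); // highest salary first
        return result;
    }
}
```

On the sample data with N = 3 this keeps KING (5000), FORD (3000), and JONES (2975).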


4.8 Descending sort

List all employees by total income (salary + commission) from highest to lowest, showing each name and total income
public static class Map_9 extends MapReduceBase implements Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        try {
            Emp emp = new Emp(value.toString());
            // Most employees have no commission; treat an empty or blank COMM as 0 instead of skipping the row
            String comm = emp.getComm().trim();
            int totalSal = (comm.isEmpty() ? 0 : Integer.parseInt(comm)) + Integer.parseInt(emp.getSal());
            output.collect(new Text("1"), new Text(emp.getEname() + "~" + totalSal)); // { k=constant, v=name~total income }
        } catch (Exception e) {
            reporter.getCounter(ErrCount.LINESKIP).increment(1);
            WriteErrLine.write("./input/" + this.getClass().getSimpleName() + "err_lines", reporter.getCounter(ErrCount.LINESKIP).getCounter() + " " + value.toString());
        }
    }
}

public static class Reduce_9 extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // Caveat: employees with the same total income share one key, so only the last one read is kept.
        Map<Integer, String> emp = new TreeMap<Integer, String>(
                // custom comparator that sorts keys in descending order
                new Comparator<Integer>() {
                    public int compare(Integer o1, Integer o2) {
                        return o2.compareTo(o1);
                    }
                });
        while (values.hasNext()) {
            String[] valStrings = values.next().toString().split("~");
            emp.put(Integer.parseInt(valStrings[1]), valStrings[0]);
        }
        for (Iterator<Integer> keySet = emp.keySet().iterator(); keySet.hasNext();) {
            Integer current_key = keySet.next();
            output.collect(new Text(emp.get(current_key)), new Text(current_key.toString())); // iterate over the keys, i.e. total income
        }
    }
}

Execution result: [screenshot not preserved]
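Two details are easy to get wrong in this pattern: the commission column is empty for most employees, and two employees may have the same total income. A minimal plain-Java sketch that handles both is shown below; `IncomeSort` is a hypothetical name, and the rows are assumed to be reduced to `ENAME,SAL,COMM`.

```java
import java.util.ArrayList;
import java.util.List;

// Sorts "ENAME,SAL,COMM" rows by total income, highest first (plain Java, not Hadoop code).
final class IncomeSort {

    // An empty or blank commission counts as 0 instead of failing in Integer.parseInt.
    static int totalIncome(String sal, String comm) {
        String c = comm.trim();
        return Integer.parseInt(sal.trim()) + (c.isEmpty() ? 0 : Integer.parseInt(c));
    }

    static List<String> sortByIncomeDesc(List<String> rows) {
        List<String> result = new ArrayList<>();
        for (String row : rows) {
            String[] f = row.split(",", -1); // keep a trailing empty COMM field
            result.add(f[0] + " " + totalIncome(f[1], f[2]));
        }
        // A list keeps employees with equal income; a map keyed by income would not.
        result.sort((a, b) -> Integer.compare(
                Integer.parseInt(b.substring(b.lastIndexOf(' ') + 1)),
                Integer.parseInt(a.substring(a.lastIndexOf(' ') + 1))));
        return result;
    }
}
```

Because `List.sort` is stable, employees with equal income keep their input order relative to each other.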

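IV. Summary

Rewriting the computation patterns commonly used in SQL as MR jobs is fairly tedious, because a single line of SQL often takes a dozen or even dozens of lines of code to implement, which feels clumsy. When it comes to processing speed on large volumes of data, however, MR and SQL are not in the same league.
Still, it is undeniable that every technology has its own range of applicability, and MR is no silver bullet. Choose the appropriate technology based on the specific use case.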
