MapReduce Programming Summary

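(1) Getting key-value pairs to the map side is fairly simple. Each input split is handed to one MapTask, and the splits are computed by the InputFormat (usually a FileInputFormat subclass). By default a split is about one HDFS block, typically 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.

Each MapTask calls the map function N times. What determines N?

The class passed to job.setInputFormatClass(...) does. The default is TextInputFormat.class, which parses its split line by line: the key is the byte offset of the line, the value is the text of the line, and each line triggers one call to map.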
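The "one line, one map call" rule can be illustrated without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class LineRecordSketch and its records method are hypothetical names, and it assumes '\n' line endings and single-byte characters so that character counts equal byte offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of how a line-oriented input format turns a split into records.
// LineRecordSketch is a hypothetical name used for illustration; it is not a Hadoop class.
class LineRecordSketch {
    // One record: key = starting offset of the line, value = the line's text.
    static final class Record {
        final long offset;
        final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    // Splits the content into line records and tracks where each line starts.
    // Assumes '\n' line endings and single-byte characters.
    static List<Record> records(String splitContent) {
        List<Record> out = new ArrayList<>();
        long offset = 0;
        for (String line : splitContent.split("\n")) {
            out.add(new Record(offset, line));
            offset += line.length() + 1; // +1 for the newline character
        }
        return out;
    }

    public static void main(String[] args) {
        String content = "3\t3\n3\t2\n3\t1\n";
        // Each record stands for one map(offset, line) call.
        for (Record r : records(content)) {
            System.out.println("map(" + r.offset + ", " + r.line.replace("\t", "<TAB>") + ")");
        }
    }
}
```

For the three-line input above, the sketch yields three records, at offsets 0, 4 and 8, which corresponds to three map calls.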

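(2) Key-value handling on the reduce side is more involved. The second parameter of reduce is an Iterable<?>, so the input has the shape <key, list{value1, value2, ...}>, and each such entry corresponds to one call to reduce.

In other words, a single reduce call handles one key together with all of the values grouped under it.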

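(3) Where does this <key, list{value1, value2, ...}> input come from?

The MapReduce framework sorts map output by key, and the predefined key types (Text, LongWritable, and so on) ship with their own ordering.

It then merges the sorted output of the different MapTasks and brings equal keys together, producing <key, list{value1, value2, ...}> as the input to reduce.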
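As a rough illustration of that sort-and-merge step, the sketch below collects the output of two simulated map tasks into a sorted key-to-values map. It is plain Java under stated assumptions: ShuffleSketch, kv and shuffle are hypothetical names, and a TreeMap stands in for the framework's sort and merge. In real Hadoop the order of values within one key is not guaranteed.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Plain-Java simulation of the shuffle: sort by key, then gather equal keys.
// ShuffleSketch is a hypothetical name; it does not use any Hadoop API.
class ShuffleSketch {
    // Convenience factory for one map output pair.
    static Map.Entry<String, Long> kv(String key, long value) {
        return new AbstractMap.SimpleEntry<>(key, Long.valueOf(value));
    }

    // Merges the output of several map tasks into <key, list{value1, value2, ...}>,
    // with keys in sorted order (TreeMap plays the role of the framework's sort).
    static SortedMap<String, List<Long>> shuffle(List<List<Map.Entry<String, Long>>> mapTaskOutputs) {
        SortedMap<String, List<Long>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Long>> oneTask : mapTaskOutputs) {
            for (Map.Entry<String, Long> pair : oneTask) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> task1 = Arrays.asList(kv("hello", 1), kv("you", 1));
        List<Map.Entry<String, Long>> task2 = Arrays.asList(kv("hello", 1), kv("me", 1));
        // Each entry of the result stands for one reduce(key, values) call.
        shuffle(Arrays.asList(task1, task2))
                .forEach((key, values) -> System.out.println("reduce(" + key + ", " + values + ")"));
    }
}
```

Here the key "hello" appears in both simulated tasks, so it ends up with two values and is handled by a single reduce call, while "me" and "you" each get a call of their own.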

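(4) The number of MapTasks is determined by the number of splits. What determines the number of ReduceTasks?

Every key-value pair that map writes is passed to getPartition (HashPartitioner's by default) to obtain a partition number. The pair is written to the map-side disk file tagged with that number, so that each reduce task can later fetch its own partition.

The most important argument to getPartition, numReduceTasks, is set with job.setNumReduceTasks and defaults to 1.

So if it is left unset, HashPartitioner's getPartition always returns 0, which corresponds to a single ReduceTask.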
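HashPartitioner computes the partition as the key's hash code, with the sign bit masked off, modulo numReduceTasks. The plain-Java sketch below reproduces that arithmetic outside Hadoop; HashPartitionSketch and partitionFor are hypothetical names used only for illustration.

```java
// Plain-Java reproduction of the hash-partitioning arithmetic.
// HashPartitionSketch is a hypothetical name; it does not depend on Hadoop.
class HashPartitionSketch {
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result
    // is never negative even when hashCode() is.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        Object[] keys = {1, 2, 5, -1};
        for (Object key : keys) {
            System.out.println("key " + key
                    + ": 1 reducer -> partition " + partitionFor(key, 1)
                    + ", 3 reducers -> partition " + partitionFor(key, 3));
        }
    }
}
```

With one reduce task every key lands in partition 0, because any non-negative number modulo 1 is 0. With three reduce tasks the keys spread over partitions 0 to 2.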

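(5) That covers partitioning; now for grouping. Partitioning is decided on the map side, for each record that map emits. Grouping happens on the reduce side, over the merged output of many MapTasks, and every group lies inside a single partition.

What does grouping affect?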

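(6) Suppose the map output key is a custom type NewK2 with its own compareTo, and a grouping comparator is set.

The keys are still sorted with compareTo. On the reduce side, the grouping class MyGroupingComparator's compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) is then applied to adjacent keys in that sorted order, and keys that compare as equal fall into the same group.

The result is <key, list{value1, value2, ...}>, where one reduce call receives the values of every key in the group.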
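The effect of sorting on the full key while grouping on only part of it can be sketched in plain Java. The code below is an illustration under stated assumptions, not Hadoop code: GroupingSketch and reduceCalls are hypothetical names, each pair is {first, second}, the sort uses (first, second), and adjacent pairs with the same first form one group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of "sort by the whole key, group by its first field".
// GroupingSketch is a hypothetical name; it does not use any Hadoop API.
class GroupingSketch {
    // Each pair is {first, second}. Returns one string per simulated reduce call.
    static List<String> reduceCalls(long[][] pairs) {
        long[][] sorted = pairs.clone();
        // Sort comparator: by first, then by second.
        Arrays.sort(sorted, Comparator.<long[]>comparingLong(p -> p[0]).thenComparingLong(p -> p[1]));

        List<String> calls = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            long first = sorted[i][0];
            List<Long> values = new ArrayList<>();
            // Grouping comparator: adjacent keys with the same first share a group.
            while (i < sorted.length && sorted[i][0] == first) {
                values.add(sorted[i][1]);
                i++;
            }
            calls.add(first + " -> " + values);
        }
        return calls;
    }

    public static void main(String[] args) {
        long[][] pairs = {{3, 3}, {3, 2}, {3, 1}, {2, 2}, {2, 1}, {1, 1}};
        reduceCalls(pairs).forEach(System.out::println);
    }
}
```

The six pairs collapse into three groups, one per distinct first, and within each group the values come out in ascending order of second because the sort already ordered them. A reducer that keeps only the last value of each group therefore ends up with the largest second for that first.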

Here is an example:

package examples;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class GroupApp {
	static final String INPUT_PATH = "hdfs://192.168.2.100:9000/hello";
	static final String OUTPUT_PATH = "hdfs://192.168.2.100:9000/out";

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();

		// Delete the output directory if it already exists; the job fails on an existing output path.
		final FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
		final Path outPath = new Path(OUTPUT_PATH);
		if (fileSystem.exists(outPath)) {
			fileSystem.delete(outPath, true);
		}

		// Note: this constructor is deprecated in Hadoop 2.x in favor of Job.getInstance(conf, name).
		final Job job = new Job(conf, GroupApp.class.getSimpleName());
		job.setJarByClass(GroupApp.class);

		// Input: TextInputFormat gives one map() call per line.
		FileInputFormat.setInputPaths(job, INPUT_PATH);
		job.setInputFormatClass(TextInputFormat.class);

		// Map side: composite key NewK2, custom partitioner, three reduce tasks.
		job.setMapperClass(MyMapper.class);
		job.setMapOutputKeyClass(NewK2.class);
		job.setMapOutputValueClass(LongWritable.class);
		job.setPartitionerClass(MyPartitoner.class);
		job.setNumReduceTasks(3);

		// Reduce side: group keys by their first field only.
		job.setGroupingComparatorClass(MyGroupingComparator.class);
		job.setReducerClass(MyReducer.class);
		job.setOutputKeyClass(LongWritable.class);
		job.setOutputValueClass(LongWritable.class);
		FileOutputFormat.setOutputPath(job, outPath);

		// Exit with a non-zero status if the job fails.
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

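	// Routes keys by their first field: 1 -> partition 0, 2 -> partition 1, anything else -> partition 2.
	// The "% numReduceTasks" keeps the result valid when fewer than three reduce tasks are configured.
	static class MyPartitoner extends HashPartitioner<NewK2, LongWritable> {
		@Override
		public int getPartition(NewK2 key, LongWritable value, int numReduceTasks) {
			System.out.println("the getPartition() is called...");
			if (key.first == 1) {
				return 0 % numReduceTasks;
			} else if (key.first == 2) {
				return 1 % numReduceTasks;
			} else {
				return 2 % numReduceTasks;
			}
		}
	}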

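	// Composite map output key: two longs, serialized as 16 bytes (first, then second).
	static class NewK2 implements WritableComparable<NewK2> {
		long first = 0L;
		long second = 0L;

		// Hadoop needs the no-argument constructor to create instances during deserialization.
		public NewK2() {}

		public NewK2(long first, long second) {
			this.first = first;
			this.second = second;
		}

		@Override
		public void write(DataOutput out) throws IOException {
			out.writeLong(first);
			out.writeLong(second);
		}

		@Override
		public void readFields(DataInput in) throws IOException {
			first = in.readLong();
			second = in.readLong();
		}

		// Sort order: by first, then by second. Long.compare avoids the overflow
		// that casting a long difference to int can cause.
		@Override
		public int compareTo(NewK2 o) {
			System.out.println("the compareTo() is called...");
			final int byFirst = Long.compare(this.first, o.first);
			if (byFirst != 0) {
				return byFirst;
			}
			return Long.compare(this.second, o.second);
		}
	}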

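	// Grouping comparator: two keys belong to the same reduce group when their first fields are equal.
	static class MyGroupingComparator implements RawComparator<NewK2> {
		@Override
		public int compare(NewK2 o1, NewK2 o2) {
	//		System.out.println("the compare() is called...");
			return Long.compare(o1.first, o2.first);
		}

		// Compares the serialized keys directly. Only the leading 8 bytes (the first field)
		// are compared, so the second field is ignored.
		@Override
		public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
			System.out.println("the compare() is called...");
			return WritableComparator.compareBytes(b1, s1, 8, b2, s2, 8);
		}
	}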

	static class MyMapper extends Mapper<LongWritable, Text, NewK2, LongWritable> {
		// Each input line holds two tab-separated integers: first<TAB>second.
		@Override
		protected void map(LongWritable k1, Text v1, Context ctx) throws IOException, InterruptedException {
			System.out.println("the map() is called...");
			final String[] splited = v1.toString().split("\t");
			// The key carries both fields; the value repeats the second field for the reducer.
			NewK2 k2 = new NewK2(Long.parseLong(splited[0]), Long.parseLong(splited[1]));
			LongWritable v2 = new LongWritable(Long.parseLong(splited[1]));
			ctx.write(k2, v2);
		}
	}

	static class MyReducer extends Reducer<NewK2, LongWritable, LongWritable, LongWritable> {
		// Called once per group. Values arrive in key sort order (ascending second),
		// so the last value seen is the largest second for this first.
		@Override
		protected void reduce(NewK2 k2, Iterable<LongWritable> v2s, Context ctx) throws IOException, InterruptedException {
			System.out.println("the reduce() is called...");
			long v3 = 0; // local, so no state leaks from one reduce() call to the next
			for (LongWritable second : v2s) {
				v3 = second.get();
				// The framework reuses the key object while iterating, so k2.second follows the current value.
				System.out.println("<" + k2.first + "," + k2.second + ">, " + v3);
			}
			System.out.println("--------------------------------------------");
			System.out.println("the real reduce output...");
			System.out.println("<" + k2.first + "," + v3 + ">");
			ctx.write(new LongWritable(k2.first), new LongWritable(v3));
			System.out.println("--------------------------------------------");
		}
	}

}
