MapReduce Programming Summary: Running Multiple MapReduce Jobs

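When learning Hadoop, writing MapReduce programs is unavoidable. A simple analysis needs only a single MapReduce job; that case is not covered here, since examples are easy to find online. More complex analyses may need several Jobs, or several Mappers or Reducers, to do the computation.

Multi-Job (multi-MapReduce) programs take the following forms: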

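1. Iterative MapReduce

In the iterative style, the output of one MapReduce job usually serves as the input of the next. Often only the final result needs to be kept; intermediate data can be deleted or retained, depending on business needs.

Sample code: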

Configuration conf = new Configuration();

// first job
Job job1 = new Job(conf, "job1");
.....
FileInputFormat.addInputPath(job1, InputPath1);
FileOutputFormat.setOutputPath(job1, Outpath1);
// stop the chain if a job fails, since the next job reads its output
if (!job1.waitForCompletion(true)) System.exit(1);

// second job: reads the output of job1
Job job2 = new Job(conf, "job2");
.....
FileInputFormat.addInputPath(job2, Outpath1);
FileOutputFormat.setOutputPath(job2, Outpath2);
if (!job2.waitForCompletion(true)) System.exit(1);

// third job: reads the output of job2
Job job3 = new Job(conf, "job3");
.....
FileInputFormat.addInputPath(job3, Outpath2);
FileOutputFormat.setOutputPath(job3, Outpath3);
if (!job3.waitForCompletion(true)) System.exit(1);
.....

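The following shows how Mahout uses iterative MapReduce. The code block below is from Mahout's k-means implementation: the main function runs the MapReduce iterations in a while loop, and runIteration() is a single MapReduce pass.

Personally, though, I find two things unsatisfying about the current design of MapReduce iteration:

1. Every iteration creates all of its Jobs (and tasks) again, which is very expensive.

2. Every iteration writes its data to disk and reads it back, so the I/O and network transfer costs are high.

Twister and HaLoop seem to address these problems fairly well, but their level of abstraction is not high enough and the computations they support are limited.

Hopefully a future Hadoop release will support iterative algorithms better.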

// main function
while (!converged && iteration <= maxIterations) {
  log.info("K-Means Iteration {}", iteration);
  // point the output to a new directory per iteration
  Path clustersOut = new Path(output, AbstractCluster.CLUSTERS_DIR + iteration);
  converged = runIteration(conf, input, clustersIn, clustersOut, measure.getClass().getName(), delta);
  // now point the input to the old output directory
  clustersIn = clustersOut;
  iteration++;
}

private static boolean runIteration(Configuration conf,
                                    Path input,
                                    Path clustersIn,
                                    Path clustersOut,
                                    String measureClass,
                                    String convergenceDelta)
  throws IOException, InterruptedException, ClassNotFoundException {
  conf.set(KMeansConfigKeys.CLUSTER_PATH_KEY, clustersIn.toString());
  conf.set(KMeansConfigKeys.DISTANCE_MEASURE_KEY, measureClass);
  conf.set(KMeansConfigKeys.CLUSTER_CONVERGENCE_KEY, convergenceDelta);
  Job job = new Job(conf, "KMeans Driver running runIteration over clustersIn: " + clustersIn);
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(ClusterObservations.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Cluster.class);
  job.setInputFormatClass(SequenceFileInputFormat.class);
  job.setOutputFormatClass(SequenceFileOutputFormat.class);
  job.setMapperClass(KMeansMapper.class);
  job.setCombinerClass(KMeansCombiner.class);
  job.setReducerClass(KMeansReducer.class);
  FileInputFormat.addInputPath(job, input);
  FileOutputFormat.setOutputPath(job, clustersOut);
  job.setJarByClass(KMeansDriver.class);
  HadoopUtil.delete(conf, clustersOut);
  if (!job.waitForCompletion(true)) {
    throw new InterruptedException("K-Means Iteration failed processing " + clustersIn);
  }
  FileSystem fs = FileSystem.get(clustersOut.toUri(), conf);
  return isConverged(clustersOut, conf, fs);
}
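
To make the control flow of the driver loop easier to see without a cluster, here is a minimal plain-Java sketch of the same pattern: each pass takes the previous pass's output as its input and the loop stops on convergence or after a maximum number of iterations. It is not Mahout's code and uses no Hadoop classes; the class name IterativeDriverSketch, the one-dimensional data, and the in-memory arrays standing in for the clustersIn/clustersOut directories are all assumptions made for illustration.

```java
import java.util.Arrays;

class IterativeDriverSketch {
    /** One "job": assign each point to its nearest centre, then recompute the centres. */
    static double[] runIteration(double[] points, double[] centersIn) {
        double[] sum = new double[centersIn.length];
        int[] count = new int[centersIn.length];
        for (double p : points) {
            // "map" side: find the nearest centre for this point
            int best = 0;
            for (int c = 1; c < centersIn.length; c++) {
                if (Math.abs(p - centersIn[c]) < Math.abs(p - centersIn[best])) {
                    best = c;
                }
            }
            // "reduce" side: accumulate the point under that centre
            sum[best] += p;
            count[best]++;
        }
        double[] centersOut = new double[centersIn.length];
        for (int c = 0; c < centersIn.length; c++) {
            // a centre with no points keeps its old position
            centersOut[c] = count[c] == 0 ? centersIn[c] : sum[c] / count[c];
        }
        return centersOut;
    }

    /** Converged when no centre moved by more than delta. */
    static boolean isConverged(double[] before, double[] after, double delta) {
        for (int c = 0; c < before.length; c++) {
            if (Math.abs(before[c] - after[c]) > delta) {
                return false;
            }
        }
        return true;
    }

    /** Driver loop: the output of one iteration is the input of the next. */
    static double[] cluster(double[] points, double[] initial, double delta, int maxIterations) {
        double[] centersIn = initial;
        boolean converged = false;
        int iteration = 1;
        while (!converged && iteration <= maxIterations) {
            double[] centersOut = runIteration(points, centersIn);
            converged = isConverged(centersIn, centersOut, delta);
            centersIn = centersOut; // point the input at the old output
            iteration++;
        }
        return centersIn;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
        System.out.println(Arrays.toString(cluster(points, new double[] {0.0, 5.0}, 1e-9, 20)));
    }
}
```

In the real driver every call to runIteration submits a new Job and goes through the file system, which is exactly where the two costs listed earlier come from.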

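2. Dependency-based MapReduce: JobControl

The dependency-based style is implemented mainly with JobControl, which consists of two classes: Job (named ControlledJob in the newer org.apache.hadoop.mapreduce API) and JobControl. The Job class wraps a MapReduce job together with its dependencies; it monitors the state of the jobs it depends on and updates its own state accordingly.

JobControl contains a thread that periodically monitors and updates the state of each job, schedules jobs whose dependencies have completed, and submits jobs in the READY state. It also provides APIs to suspend, resume, and stop that thread.

Sample code: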

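Configuration job1conf = new Configuration();
Job job1 = new Job(job1conf, "Job1");
......... // other job1 settings

Configuration job2conf = new Configuration();
Job job2 = new Job(job2conf, "Job2");
......... // other job2 settings

Configuration job3conf = new Configuration();
Job job3 = new Job(job3conf, "Job3");
......... // other job3 settings

// wrap each Job in a ControlledJob (org.apache.hadoop.mapreduce.lib.jobcontrol) to declare dependencies
ControlledJob cjob1 = new ControlledJob(job1, null);
ControlledJob cjob2 = new ControlledJob(job2, null);
ControlledJob cjob3 = new ControlledJob(job3, null);
cjob3.addDependingJob(cjob1); // job3 depends on job1
cjob3.addDependingJob(cjob2); // job3 depends on job2

JobControl JC = new JobControl("123");
JC.addJob(cjob1); // add all three jobs to the JobControl
JC.addJob(cjob2);
JC.addJob(cjob3);

// JobControl.run() loops until stop() is called, so run it in its own thread
new Thread(JC).start();
while (!JC.allFinished()) {
    Thread.sleep(500);
}
JC.stop();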
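
The scheduling idea behind JobControl can be sketched in a few lines of plain Java: keep scanning the jobs, run any job whose dependencies have all succeeded, and mark a job as failed-by-dependency when something it depends on failed. This is only an illustration of the idea, not Hadoop's implementation; DependencySchedulerSketch, Node, and runAll are hypothetical names, the jobs are plain Runnables, and everything runs sequentially in one thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DependencySchedulerSketch {
    enum State { WAITING, SUCCESS, FAILED, DEPENDENT_FAILED }

    /** A unit of work plus the jobs it depends on. */
    static final class Node {
        final String name;
        final Runnable work;
        final List<Node> dependsOn = new ArrayList<>();
        State state = State.WAITING;

        Node(String name, Runnable work) {
            this.name = name;
            this.work = work;
        }

        void addDependingJob(Node other) {
            dependsOn.add(other);
        }
    }

    /** Scans the jobs until no more progress is possible; returns the names in execution order. */
    static List<String> runAll(List<Node> jobs) {
        List<String> order = new ArrayList<>();
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (Node job : jobs) {
                if (job.state != State.WAITING) {
                    continue;
                }
                boolean ready = true;
                boolean blocked = false;
                for (Node dep : job.dependsOn) {
                    if (dep.state == State.FAILED || dep.state == State.DEPENDENT_FAILED) {
                        blocked = true;
                    } else if (dep.state != State.SUCCESS) {
                        ready = false;
                    }
                }
                if (blocked) {
                    // a dependency failed, so this job can never run
                    job.state = State.DEPENDENT_FAILED;
                    progressed = true;
                } else if (ready) {
                    try {
                        job.work.run();
                        job.state = State.SUCCESS;
                        order.add(job.name);
                    } catch (RuntimeException e) {
                        job.state = State.FAILED;
                    }
                    progressed = true;
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Node job1 = new Node("job1", () -> System.out.println("running job1"));
        Node job2 = new Node("job2", () -> System.out.println("running job2"));
        Node job3 = new Node("job3", () -> System.out.println("running job3"));
        job3.addDependingJob(job1);
        job3.addDependingJob(job2);
        // job3 is listed first on purpose; it still runs last
        System.out.println(runAll(Arrays.asList(job3, job1, job2)));
    }
}
```

A job that takes part in a dependency cycle simply stays in WAITING here, because the loop ends as soon as a full scan makes no progress.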

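3. Linear chained MapReduce: ChainMapper/ChainReducer

ChainMapper/ChainReducer were introduced to handle linear chains of Mappers. The Map or Reduce phase contains several Mappers that behave like a Linux pipeline: the output of one Mapper is fed directly into the input of the next, forming a pipeline.

Note that within any single MapReduce job, the Map and Reduce phases may contain any number of Mappers, but there can be only one Reducer. A job that needs more than one Reducer therefore cannot be implemented with ChainMapper/ChainReducer.

Code: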

...
conf.setJobName("chain");
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

JobConf mapper1Conf = new JobConf(false);
JobConf mapper2Conf = new JobConf(false);
JobConf reduce1Conf = new JobConf(false);
JobConf mapper3Conf = new JobConf(false);
...
// arguments: job, class, input key/value types, output key/value types, byValue, local conf
// each stage's input types must match the previous stage's output types
ChainMapper.addMapper(conf, Mapper1.class, LongWritable.class, Text.class, Text.class, Text.class, true, mapper1Conf);
ChainMapper.addMapper(conf, Mapper2.class, Text.class, Text.class, LongWritable.class, Text.class, false, mapper2Conf);
// the single Reducer of the job (Reducer here stands for the user's own reducer class)
ChainReducer.setReducer(conf, Reducer.class, LongWritable.class, Text.class, Text.class, Text.class, true, reduce1Conf);
// Mappers added through ChainReducer run after the Reducer
ChainReducer.addMapper(conf, Mapper3.class, Text.class, Text.class, LongWritable.class, Text.class, true, mapper3Conf);
JobClient.runJob(conf);
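
The shape of such a job (one or more Mappers, then a single Reducer, then zero or more Mappers) can be illustrated in plain Java by treating each Mapper as a function from one record to a list of records. This is a sketch of the data flow only and uses none of the Hadoop classes; ChainSketch, chainMap, reduceCount, and the three sample stages are hypothetical names chosen for the example. Hadoop passes records down the chain one at a time, whereas the sketch processes a whole list per stage, which gives the same result for stateless stages like these.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

class ChainSketch {
    /** Pipes the records through the mappers in order; each stage reads the previous stage's output. */
    static List<String> chainMap(List<String> records, List<Function<String, List<String>>> mappers) {
        List<String> current = records;
        for (Function<String, List<String>> mapper : mappers) {
            List<String> next = new ArrayList<>();
            for (String record : current) {
                next.addAll(mapper.apply(record));
            }
            current = next;
        }
        return current;
    }

    /** The single reduce step: counts identical records, sorted by key as after a shuffle. */
    static List<String> reduceCount(List<String> records) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String record : records) {
            counts.merge(record, 1, Integer::sum);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(e.getKey() + "\t" + e.getValue());
        }
        return out;
    }

    /** Runs two mappers, one reducer, then one more mapper on the reducer's output. */
    static List<String> run(List<String> lines) {
        Function<String, List<String>> tokenize = line -> Arrays.asList(line.trim().split("\\s+"));
        Function<String, List<String>> lowercase = word -> Collections.singletonList(word.toLowerCase());
        Function<String, List<String>> dropSingles = kv ->
                kv.endsWith("\t1") ? Collections.<String>emptyList() : Collections.singletonList(kv);

        List<String> mapped = chainMap(lines, Arrays.asList(tokenize, lowercase)); // map chain
        List<String> reduced = reduceCount(mapped);                                // the one reducer
        return chainMap(reduced, Collections.singletonList(dropSingles));          // mapper after the reducer
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Hadoop map reduce", "hadoop MAP")));
    }
}
```

Because every stage after the first consumes what the previous one produced, the types have to line up from stage to stage, which is why the addMapper calls above spell out the input and output classes.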

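4. Sub-Job MapReduce

The sub-Job style is really a kind of iterative MapReduce; I single it out here. Put plainly, a parent Job contains several sub-Jobs.

In Nutch, Crawler is the parent Job. Its run method calls runTool to invoke the sub-Jobs, and runTool instantiates each sub-Job via reflection.

Here is how Nutch implements it: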

....
private NutchTool currentTool = null;
....
private Map<String, Object> runTool(Class<? extends NutchTool> toolClass,
        Map<String, Object> args) throws Exception {
    currentTool = (NutchTool) ReflectionUtils.newInstance(toolClass,
            getConf());
    return currentTool.run(args);
}
...
@Override
public Map<String, Object> run(Map<String, Object> args) throws Exception {
    results.clear();
    status.clear();
    String crawlId = (String) args.get(Nutch.ARG_CRAWL);
    if (crawlId != null) {
        getConf().set(Nutch.CRAWL_ID_KEY, crawlId);
    }
    String seedDir = null;
    String seedList = (String) args.get(Nutch.ARG_SEEDLIST);
    if (seedList != null) { // takes precedence
        String[] seeds = seedList.split("\\s+");
        // create tmp. dir
        String tmpSeedDir = getConf().get("hadoop.tmp.dir") + "/seed-"
                + System.currentTimeMillis();
        FileSystem fs = FileSystem.get(getConf());
        Path p = new Path(tmpSeedDir);
        fs.mkdirs(p);
        Path seedOut = new Path(p, "urls");
        OutputStream os = fs.create(seedOut);
        for (String s : seeds) {
            os.write(s.getBytes());
            os.write('\n');
        }
        os.flush();
        os.close();
        cleanSeedDir = true;
        seedDir = tmpSeedDir;
    } else {
        seedDir = (String) args.get(Nutch.ARG_SEEDDIR);
    }
    Integer depth = (Integer) args.get(Nutch.ARG_DEPTH);
    if (depth == null)
        depth = 1;
    boolean parse = getConf().getBoolean(FetcherJob.PARSE_KEY, false);
    String solrUrl = (String) args.get(Nutch.ARG_SOLR);
    int onePhase = 3;
    if (!parse)
        onePhase++;
    float totalPhases = depth * onePhase;
    if (seedDir != null)
        totalPhases++;
    float phase = 0;
    Map<String, Object> jobRes = null;
    LinkedHashMap<String, Object> subTools = new LinkedHashMap<String, Object>();
    status.put(Nutch.STAT_JOBS, subTools);
    results.put(Nutch.STAT_JOBS, subTools);
    // inject phase
    if (seedDir != null) {
        status.put(Nutch.STAT_PHASE, "inject");
        jobRes = runTool(InjectorJob.class, args);
        if (jobRes != null) {
            subTools.put("inject", jobRes);
        }
        status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
        if (cleanSeedDir && tmpSeedDir != null) {
            LOG.info(" - cleaning tmp seed list in " + tmpSeedDir);
            FileSystem.get(getConf()).delete(new Path(tmpSeedDir), true);
        }
    }
    if (shouldStop) {
        return results;
    }
    // run "depth" cycles
    for (int i = 0; i < depth; i++) {
        status.put(Nutch.STAT_PHASE, "generate " + i);
        jobRes = runTool(GeneratorJob.class, args);
        if (jobRes != null) {
            subTools.put("generate " + i, jobRes);
        }
        status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
        if (shouldStop) {
            return results;
        }
        status.put(Nutch.STAT_PHASE, "fetch " + i);
        jobRes = runTool(FetcherJob.class, args);
        if (jobRes != null) {
            subTools.put("fetch " + i, jobRes);
        }
        status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
        if (shouldStop) {
            return results;
        }
        if (!parse) {
            status.put(Nutch.STAT_PHASE, "parse " + i);
            jobRes = runTool(ParserJob.class, args);
            if (jobRes != null) {
                subTools.put("parse " + i, jobRes);
            }
            status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
            if (shouldStop) {
                return results;
            }
        }
        status.put(Nutch.STAT_PHASE, "updatedb " + i);
        jobRes = runTool(DbUpdaterJob.class, args);
        if (jobRes != null) {
            subTools.put("updatedb " + i, jobRes);
        }
        status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
        if (shouldStop) {
            return results;
        }
    }
    if (solrUrl != null) {
        status.put(Nutch.STAT_PHASE, "index");
        jobRes = runTool(SolrIndexerJob.class, args);
        if (jobRes != null) {
            subTools.put("index", jobRes);
        }
    }
    return results;
}
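
Stripped of the crawl-specific details, the parent/sub-Job pattern comes down to two things: a common interface for the sub-Jobs and a helper that instantiates a sub-Job class by reflection and runs it. The plain-Java sketch below shows just that. It is not Nutch code: the Tool interface stands in for NutchTool, the two sample tools and the "seeds" argument are invented for the example, and the standard getDeclaredConstructor().newInstance() call is used where Nutch uses Hadoop's ReflectionUtils.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SubJobDriverSketch {
    /** Hypothetical stand-in for NutchTool: every sub-job takes arguments and returns results. */
    interface Tool {
        Map<String, Object> run(Map<String, Object> args);
    }

    /** Sample sub-job: reports how many seeds it was given. */
    static class InjectTool implements Tool {
        public Map<String, Object> run(Map<String, Object> args) {
            Map<String, Object> result = new LinkedHashMap<>();
            result.put("injected", ((List<?>) args.get("seeds")).size());
            return result;
        }
    }

    /** Sample sub-job: echoes the seeds it would fetch. */
    static class FetchTool implements Tool {
        public Map<String, Object> run(Map<String, Object> args) {
            Map<String, Object> result = new LinkedHashMap<>();
            result.put("fetched", args.get("seeds"));
            return result;
        }
    }

    /** Creates the sub-job from its class object and runs it. */
    static Map<String, Object> runTool(Class<? extends Tool> toolClass, Map<String, Object> args) {
        try {
            Tool tool = toolClass.getDeclaredConstructor().newInstance();
            return tool.run(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + toolClass.getName(), e);
        }
    }

    /** Parent job: one inject phase, then "depth" fetch cycles, collecting each sub-job's result. */
    static Map<String, Object> runAll(Map<String, Object> args, int depth) {
        Map<String, Object> results = new LinkedHashMap<>();
        results.put("inject", runTool(InjectTool.class, args));
        for (int i = 0; i < depth; i++) {
            results.put("fetch " + i, runTool(FetchTool.class, args));
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, Object> toolArgs = new LinkedHashMap<>();
        toolArgs.put("seeds", Arrays.asList("http://example.com/a", "http://example.com/b"));
        System.out.println(runAll(toolArgs, 2));
    }
}
```

Passing the class rather than an instance lets the parent create a fresh sub-Job for every phase and cycle, as the repeated runTool calls in the Nutch code do.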
