Understanding the MultipleOutputs class

The MultipleOutputs class simplifies writing output data to multiple outputs.


Case one: writing to additional outputs other than the job's default output. Each additional output, or named output, may be configured with its own OutputFormat, its own key class, and its own value class.


Case two: writing data to different files whose paths are provided by the user.


MultipleOutputs supports counters, which are disabled by default. The counter group is the MultipleOutputs class name, and each counter is named after its output. Each counter holds the number of records written to that output name.
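In Hadoop itself the counters are switched on with MultipleOutputs.setCountersEnabled(job, true). The bookkeeping described above can be pictured with a small plain-Java model. This is only a sketch: the class OutputCounters and its methods are hypothetical names that are not part of Hadoop, and the group constant simply spells out the fully qualified MultipleOutputs class name.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of per-output record counters; this is NOT the Hadoop API.
class OutputCounters {
    // Assumed group name: the fully qualified MultipleOutputs class name.
    static final String GROUP = "org.apache.hadoop.mapreduce.lib.output.MultipleOutputs";

    private final boolean enabled;
    private final Map<String, Long> counts = new TreeMap<>();

    OutputCounters(boolean enabled) {
        this.enabled = enabled;
    }

    // Called once per record written to the given output name.
    void recordWritten(String outputName) {
        if (enabled) {
            counts.merge(outputName, 1L, Long::sum);
        }
    }

    // Number of records counted for the output name (0 if none, or if disabled).
    long get(String outputName) {
        return counts.getOrDefault(outputName, 0L);
    }

    public static void main(String[] args) {
        OutputCounters counters = new OutputCounters(true);
        counters.recordWritten("text");
        counters.recordWritten("seq");
        counters.recordWritten("seq");
        System.out.println(GROUP + ": text=" + counters.get("text") + ", seq=" + counters.get("seq"));
    }
}
```

With counters disabled (the default in the model, as in the text), recordWritten() does nothing and every count stays at zero.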


Usage pattern for job submission:

 // new Job() is deprecated in Hadoop 2.x in favour of Job.getInstance()
 Job job = new Job();
 FileInputFormat.setInputPaths(job, inDir);
 FileOutputFormat.setOutputPath(job, outDir);
 job.setMapperClass(MOMap.class);
 job.setReducerClass(MOReduce.class);
 ...

 // Define a named output 'text' that uses TextOutputFormat
 MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
   LongWritable.class, Text.class);

 // Define a named output 'seq' that uses SequenceFileOutputFormat
 MultipleOutputs.addNamedOutput(job, "seq",
   SequenceFileOutputFormat.class,
   LongWritable.class, Text.class);
 ...

 job.waitForCompletion(true);
 ...
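The role of addNamedOutput() can be illustrated with a plain-Java model: each named output is registered once with its format and key/value classes, and a later write to a name that was never registered is rejected. This is a sketch under assumptions, not Hadoop code. NamedOutputRegistry, Spec and lookup() are hypothetical names, the format is recorded as a plain string, and Long and String stand in for LongWritable and Text.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of a named-output registry; all names here are hypothetical.
class NamedOutputRegistry {
    // What a named output is configured with: a format and key/value classes.
    static final class Spec {
        final String format;
        final Class<?> keyClass;
        final Class<?> valueClass;

        Spec(String format, Class<?> keyClass, Class<?> valueClass) {
            this.format = format;
            this.keyClass = keyClass;
            this.valueClass = valueClass;
        }
    }

    private final Map<String, Spec> specs = new HashMap<>();

    // Registers a named output; registering the same name twice is an error.
    void addNamedOutput(String name, String format, Class<?> keyClass, Class<?> valueClass) {
        if (specs.containsKey(name)) {
            throw new IllegalArgumentException("Named output '" + name + "' already defined");
        }
        specs.put(name, new Spec(format, keyClass, valueClass));
    }

    // Returns the configuration of a named output, rejecting unknown names.
    Spec lookup(String name) {
        Spec spec = specs.get(name);
        if (spec == null) {
            throw new IllegalArgumentException("Named output '" + name + "' not defined");
        }
        return spec;
    }

    public static void main(String[] args) {
        NamedOutputRegistry registry = new NamedOutputRegistry();
        registry.addNamedOutput("text", "TextOutputFormat", Long.class, String.class);
        registry.addNamedOutput("seq", "SequenceFileOutputFormat", Long.class, String.class);
        System.out.println("seq uses " + registry.lookup("seq").format);
    }
}
```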

Usage in Reducer:

 String generateFileName(K k, V v) {
   return k.toString() + "_" + v.toString();
 }

 public class MOReduce extends
   Reducer<WritableComparable, Writable, WritableComparable, Writable> {

   // 1. Declare a MultipleOutputs field
   private MultipleOutputs mos;

   public void setup(Context context) {
     ...
     // 2. Initialize it in setup()
     mos = new MultipleOutputs(context);
   }

   public void reduce(WritableComparable key, Iterable<Writable> values,
       Context context)
       throws IOException, InterruptedException {
     ...
     // 3. Emit records from reduce() with the write() methods of MultipleOutputs

     // write(namedOutput, key, value)
     mos.write("text", key, new Text("Hello"));

     /*
      * write(namedOutput, key, value, baseOutputPath):
      *   namedOutput    - name of the named output
      *   key            - output key
      *   value          - output value
      *   baseOutputPath - base path of the output file
      */
     mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
     mos.write("seq", new LongWritable(2), new Text("Chau"), "seq_b");

     // write(key, value, baseOutputPath): no named output required
     mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
     ...
   }

   public void cleanup(Context context) throws IOException, InterruptedException {
     // 4. Close the MultipleOutputs output streams
     mos.close();
     ...
   }
 }
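The four steps above (declare, initialize, write, close) can be tried without a cluster using a plain-Java model that writes to local files. This is a sketch under assumptions: MultiFileWriter is a hypothetical class, not part of Hadoop, records are written as tab-separated text, and baseOutputPath is used directly as the file name without the task suffix that Hadoop appends.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of the setup/write/close lifecycle; it writes local files, not HDFS.
class MultiFileWriter implements AutoCloseable {
    private final Path outDir;
    private final Map<String, BufferedWriter> writers = new HashMap<>();

    MultiFileWriter(Path outDir) {
        this.outDir = outDir;
    }

    // Writes one tab-separated record, opening the target file on first use.
    void write(String key, String value, String baseOutputPath) throws IOException {
        BufferedWriter writer = writers.get(baseOutputPath);
        if (writer == null) {
            Path file = outDir.resolve(baseOutputPath);
            Files.createDirectories(file.getParent());
            writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
            writers.put(baseOutputPath, writer);
        }
        writer.write(key + "\t" + value);
        writer.newLine();
    }

    // Flushes and closes every open writer; skipping this can lose buffered records.
    @Override
    public void close() throws IOException {
        for (BufferedWriter writer : writers.values()) {
            writer.close();
        }
        writers.clear();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("mos-demo");
        try (MultiFileWriter out = new MultiFileWriter(dir)) {
            out.write("1", "Bye", "seq_a");
            out.write("2", "Chau", "seq_b");
        }
        System.out.println(Files.readAllLines(dir.resolve("seq_a"), StandardCharsets.UTF_8));
    }
}
```

The model keeps one open writer per base path, which is why a single close() call at the end matters: it is the only point where all buffered records are guaranteed to reach their files.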


When used in conjunction with org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat, MultipleOutputs can mimic the behaviour of MultipleTextOutputFormat and MultipleSequenceFileOutputFormat from the old Hadoop API, i.e. output can be written from the Reducer to more than one location.


Use MultipleOutputs.write(KEYOUT key, VALUEOUT value, String baseOutputPath) to write key and value to a path specified by baseOutputPath, with no need to specify a named output:


 // Declare the field
 private MultipleOutputs<Text, Text> out;

 public void setup(Context context) {
   // Initialize the field
   out = new MultipleOutputs<Text, Text>(context);
   ...
 }

 public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
   for (Text t : values) {
     /*
      * write(key, value, baseOutputPath):
      *   key            - output key
      *   value          - output value
      *   baseOutputPath - base path of the output file
      */
     out.write(key, t, generateFileName(<parameter list...>));
   }
 }

 protected void cleanup(Context context) throws IOException, InterruptedException {
   // Close the output streams
   out.close();
 }

Use your own code in generateFileName() to create a custom path to your results. '/' characters in baseOutputPath are translated into directory levels in your file system, so whether the path contains '/' decides whether subdirectories are created. Also, end your custom-generated path with "part" or similar; otherwise your output files will be named -00000, -00001, etc. No call to context.write() is necessary. See the example generateFileName() code below.

 private String generateFileName(Text k) {
   // expect Text k in format "Surname|Forename"
   String[] kStr = k.toString().split("\\|");
   String sName = kStr[0];
   String fName = kStr[1];

   // example for k = Smith|John
   // output written to /user/hadoop/path/to/output/Smith/John-r-00000 (etc)
   return sName + "/" + fName;
 }
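The naming rule in the comments above can be checked in plain Java. This is a minimal sketch, assuming the final file name is the base output path followed by a task-type letter and a five-digit partition number, as in the John-r-00000 example. BaseOutputPathDemo and outputFile() are hypothetical names, and generateFileName() here takes a String rather than a Text and adds input validation that the original does not have.

```java
import java.util.Locale;

// Plain-Java sketch of how a base output path becomes a file name; not Hadoop code.
class BaseOutputPathDemo {
    // Same idea as generateFileName() above, on a plain String with basic validation.
    static String generateFileName(String key) {
        String[] parts = key.split("\\|");
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException("expected Surname|Forename, got: " + key);
        }
        return parts[0] + "/" + parts[1];
    }

    // Appends the assumed task suffix, e.g. "-r-00000" for reducer partition 0.
    static String outputFile(String baseOutputPath, char taskType, int partition) {
        return String.format(Locale.ROOT, "%s-%c-%05d", baseOutputPath, taskType, partition);
    }

    public static void main(String[] args) {
        String base = generateFileName("Smith|John");
        // A base path with a file-name part yields Smith/John-r-00000.
        System.out.println(outputFile(base, 'r', 0));
        // A base path ending in '/' leaves only the suffix as the file name.
        System.out.println(outputFile("Smith/", 'r', 0));
    }
}
```

The second call shows why the text recommends ending the path with "part" or similar: a base path that stops at a directory separator produces a file named only by its suffix.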


Using MultipleOutputs in any of the ways above still creates a zero-sized default output file matching part-*-00000 (e.g. part-r-00000). To prevent this, use LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class); instead of job.setOutputFormatClass(TextOutputFormat.class); in your Hadoop job configuration.
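The difference between the eager and the lazy behaviour can be reproduced with local files. The sketch below is a plain-Java model of the idea, not the implementation of LazyOutputFormat: LazyFileWriter is a hypothetical class that postpones creating its file until the first record is written, so an output that never receives a record leaves no empty file behind.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Plain-Java model of lazy output creation; it is not LazyOutputFormat itself.
class LazyFileWriter implements AutoCloseable {
    private final Path file;
    private BufferedWriter writer; // stays null until the first record arrives

    LazyFileWriter(Path file) {
        this.file = file;
    }

    // Writes one line, creating the file only when the first line is written.
    void write(String line) throws IOException {
        if (writer == null) {
            writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
        }
        writer.write(line);
        writer.newLine();
    }

    // Closes the underlying writer if one was ever opened.
    @Override
    public void close() throws IOException {
        if (writer != null) {
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lazy-demo");
        Path eager = dir.resolve("eager-part");
        Path lazy = dir.resolve("lazy-part");

        // Eager: opening the writer creates an empty file even with no records.
        Files.newBufferedWriter(eager, StandardCharsets.UTF_8).close();

        // Lazy: nothing is written, so the file is never created.
        LazyFileWriter lazyWriter = new LazyFileWriter(lazy);
        lazyWriter.close();

        System.out.println("eager file exists: " + Files.exists(eager));
        System.out.println("lazy file exists: " + Files.exists(lazy));
    }
}
```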

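Summary: MR example, multi-file output with MultipleOutputs

To use the write methods that take a named output, register that output in the job configuration with MultipleOutputs.addNamedOutput(Job job, String namedOutput, Class<? extends OutputFormat> outputFormatClass, Class<?> keyClass, Class<?> valueClass).
The write method that takes no named output needs no such registration.
To stop the job from producing the empty default file (part-*-00000), configure it with LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);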
