MapReduce: Reduce Join

1. Introduction

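The main idea of a Reduce Join is as follows:
In the map phase, the map function reads both files, File1 and File2. To tell apart key/value pairs from the two sources, it attaches a tag to every record, for example tag=0 for records from File1 and tag=2 for records from File2. In other words, the main job of the map phase is to tag data from the different files. In the reduce phase, the reduce function receives, for each key, the value list containing the records from File1 and File2 that share that key, and joins the File1 records with the File2 records for that key (a Cartesian product). In other words, the actual join is performed in the reduce phase.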
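The tag-then-join flow described above can be sketched without a cluster. The following is a minimal in-memory illustration, not the Hadoop job itself: the class name `ReduceJoinSketch`, the helper `mapInto`, and the use of a `TreeMap` as a stand-in for the shuffle are all assumptions made for this sketch, and the inputs are hard-coded sample lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceJoinSketch {

    // A tagged value, as a mapper would emit it: the tag records the source file.
    static final class Tagged {
        final String tag;
        final String data;

        Tagged(String tag, String data) {
            this.tag = tag;
            this.data = data;
        }
    }

    // "Map" phase: tag every record and file it under its join key.
    // The map argument stands in for the shuffle, which groups values by key.
    static void mapInto(Map<Long, List<Tagged>> shuffle, String tag, List<String> lines) {
        for (String line : lines) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                continue; // no join key: skip the line
            }
            long key = Long.parseLong(line.substring(0, comma).trim());
            shuffle.computeIfAbsent(key, k -> new ArrayList<>())
                   .add(new Tagged(tag, line.substring(comma + 1)));
        }
    }

    // "Reduce" phase: for each key, split the values by tag and emit the
    // Cartesian product of the two sides (an inner join).
    static List<String> join(List<String> customers, List<String> orders) {
        Map<Long, List<Tagged>> shuffle = new TreeMap<>();
        mapInto(shuffle, "customer", customers);
        mapInto(shuffle, "order", orders);

        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, List<Tagged>> entry : shuffle.entrySet()) {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            for (Tagged t : entry.getValue()) {
                if ("customer".equals(t.tag)) {
                    left.add(t.data);
                } else {
                    right.add(t.data);
                }
            }
            for (String c : left) {
                for (String o : right) {
                    out.add(entry.getKey() + "," + c + "," + o);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList(
                "1,stephaie leung,555-555-5555",
                "2,edward kim,123-456-7890");
        List<String> orders = Arrays.asList(
                "2,C,32.00,30-Nov-2007",
                "1,B,88.25,20-May-2008");
        for (String row : join(customers, orders)) {
            System.out.println(row);
        }
    }
}
```

A key that appears on only one side produces no rows, because one of the two lists in the nested loop is empty. That is what makes this an inner join.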

In this example we assume the following two data files.
The file storing customer information: customers.csv

1,stephaie leung,555-555-5555
2,edward kim,123-456-7890
3,jose madriz,281-330-8004
4,david storkk,408-55-0000

The file storing order information: orders.csv

3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,D,25.02,22-Jan-2009

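The required final output is shown below. The job copies customer fields verbatim, so the names appear exactly as they are spelled in customers.csv. Customer 4 has no orders and therefore does not appear in the result.

1,stephaie leung,555-555-5555,B,88.25,20-May-2008
2,edward kim,123-456-7890,C,32.00,30-Nov-2007
3,jose madriz,281-330-8004,A,12.95,02-Jun-2008
3,jose madriz,281-330-8004,D,25.02,22-Jan-2009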

2. Code

Custom data type: used to tag the data from the different files

 package mapreduce.reducejoin;

 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;

 import org.apache.hadoop.io.Writable;

 /** Map output value: a record plus a tag naming the file it came from. */
 public class DataJoinWritable implements Writable {

     // tag marking the source: "customer" or "order"
     private String tag;

     // the record's fields, without the join key
     private String data;

     // Hadoop creates Writables by reflection, so a no-arg constructor is required
     public DataJoinWritable() {
     }

     public DataJoinWritable(String tag, String data) {
         this.set(tag, data);
     }

     public void set(String tag, String data) {
         this.setTag(tag);
         this.setData(data);
     }

     public String getTag() {
         return tag;
     }

     public void setTag(String tag) {
         this.tag = tag;
     }

     public String getData() {
         return data;
     }

     public void setData(String data) {
         this.data = data;
     }

     // serialize the fields; readFields must read them back in the same order
     @Override
     public void write(DataOutput out) throws IOException {
         out.writeUTF(this.getTag());
         out.writeUTF(this.getData());
     }

     // deserialize the fields in the order write() produced them
     @Override
     public void readFields(DataInput in) throws IOException {
         this.setTag(in.readUTF());
         this.setData(in.readUTF());
     }

     @Override
     public int hashCode() {
         final int prime = 31;
         int result = 1;
         result = prime * result + ((data == null) ? 0 : data.hashCode());
         result = prime * result + ((tag == null) ? 0 : tag.hashCode());
         return result;
     }

     @Override
     public boolean equals(Object obj) {
         if (this == obj)
             return true;
         if (obj == null)
             return false;
         if (getClass() != obj.getClass())
             return false;
         DataJoinWritable other = (DataJoinWritable) obj;
         if (data == null) {
             if (other.data != null)
                 return false;
         } else if (!data.equals(other.data))
             return false;
         if (tag == null) {
             if (other.tag != null)
                 return false;
         } else if (!tag.equals(other.tag))
             return false;
         return true;
     }

     @Override
     public String toString() {
         return tag + "," + data;
     }
 }
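The write/readFields pair above can be checked with a serialization round trip. The sketch below is an assumption-laden stand-in rather than the class itself: `TaggedRecordSketch` and `roundTrip` are hypothetical names, and the class deliberately does not implement Hadoop's `Writable` so that it runs with only `java.io`. It mirrors the same two `writeUTF` / `readUTF` calls in the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class TaggedRecordSketch {

    private String tag = "";
    private String data = "";

    void set(String tag, String data) {
        this.tag = tag;
        this.data = data;
    }

    String getTag() {
        return tag;
    }

    String getData() {
        return data;
    }

    // Same field order as DataJoinWritable.write: tag first, then data.
    void write(DataOutput out) throws IOException {
        out.writeUTF(tag);
        out.writeUTF(data);
    }

    // Must consume the fields in exactly the order write() produced them.
    void readFields(DataInput in) throws IOException {
        tag = in.readUTF();
        data = in.readUTF();
    }

    // Serialize one record to bytes and read it back into a fresh instance.
    static TaggedRecordSketch roundTrip(String tag, String data) {
        try {
            TaggedRecordSketch source = new TaggedRecordSketch();
            source.set(tag, data);

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                source.write(out);
            }

            TaggedRecordSketch copy = new TaggedRecordSketch();
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TaggedRecordSketch copy = roundTrip("order", "A,12.95,02-Jun-2008");
        System.out.println(copy.getTag() + " | " + copy.getData());
    }
}
```

Two properties of `writeUTF` are worth keeping in mind with this design: it throws a `NullPointerException` when given `null`, so both fields need to be set before the record is written, and it rejects strings whose encoded form exceeds 65535 bytes.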

The MapReduce code

 package mapreduce.reducejoin;

 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.List;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 public class DataJoinMapReduce extends Configured implements Tool {

     // step 1: Mapper
     public static class DataJoinMapper extends
             Mapper<LongWritable, Text, LongWritable, DataJoinWritable> {

         // map output key: the customer id, which is the join key
         private LongWritable mapOutputKey = new LongWritable();

         // map output value: the tagged record
         private DataJoinWritable mapOutputValue = new DataJoinWritable();

         @Override
         public void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
             // split the input line into fields
             String[] vals = value.toString().split(",");
             int length = vals.length;

             // customer lines have 3 fields, order lines have 4; skip anything else
             if ((3 != length) && (4 != length)) {
                 return;
             }

             // the first field is the customer id in both files
             long cid;
             try {
                 cid = Long.parseLong(vals[0].trim());
             } catch (NumberFormatException e) {
                 // skip lines whose first field is not a number
                 return;
             }
             mapOutputKey.set(cid);

             if (3 == length) {
                 // customer line: cid,name,phone
                 mapOutputValue.set("customer", vals[1] + "," + vals[2]);
             } else {
                 // order line: cid,orderId,price,date
                 mapOutputValue.set("order", vals[1] + "," + vals[2] + "," + vals[3]);
             }

             // output
             context.write(mapOutputKey, mapOutputValue);
         }
     }

     // step 2: Reducer
     public static class DataJoinReducer extends
             Reducer<LongWritable, DataJoinWritable, NullWritable, Text> {

         private Text outputValue = new Text();

         @Override
         protected void reduce(LongWritable key,
                 Iterable<DataJoinWritable> values, Context context)
                 throws IOException, InterruptedException {
             // assumes at most one customer record per customer id
             String customerInfo = null;
             List<String> orderList = new ArrayList<String>();

             // separate the values for this key by their tag
             for (DataJoinWritable value : values) {
                 if ("customer".equals(value.getTag())) {
                     customerInfo = value.getData();
                 } else if ("order".equals(value.getTag())) {
                     orderList.add(value.getData());
                 }
             }

             // inner join: orders without a matching customer produce no output
             if (customerInfo == null) {
                 return;
             }

             // emit one joined line per order
             for (String order : orderList) {
                 // set output value
                 outputValue.set(key.get() + "," + customerInfo + "," + order);
                 // output
                 context.write(NullWritable.get(), outputValue);
             }
         }
     }

     // step 3: Driver
     /**
      * Execute the command with the given arguments.
      *
      * @param args
      *            command specific arguments: input path, output path.
      * @return exit code.
      * @throws Exception
      */
     @Override
     public int run(String[] args) throws Exception {
         Configuration configuration = this.getConf();

         // set job
         Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
         job.setJarByClass(DataJoinMapReduce.class);

         // input: the directory holding customers.csv and orders.csv
         Path inpath = new Path(args[0]);
         FileInputFormat.addInputPath(job, inpath);

         // output: must not exist before the job runs
         Path outPath = new Path(args[1]);
         FileOutputFormat.setOutputPath(job, outPath);

         // Mapper
         job.setMapperClass(DataJoinMapper.class);
         job.setMapOutputKeyClass(LongWritable.class);
         job.setMapOutputValueClass(DataJoinWritable.class);

         // Reducer
         job.setReducerClass(DataJoinReducer.class);
         job.setOutputKeyClass(NullWritable.class);
         job.setOutputValueClass(Text.class);

         // submit job -> YARN
         boolean isSuccess = job.waitForCompletion(true);
         return isSuccess ? 0 : 1;
     }

     public static void main(String[] args) throws Exception {
         Configuration configuration = new Configuration();

         // hard-coded paths for this example; they replace any command-line arguments
         args = new String[] {
                 "hdfs://beifeng01:8020/user/beifeng01/mapreduce/input/reducejoin",
                 "hdfs://beifeng01:8020/user/beifeng01/mapreduce/output" };

         // run job
         int status = ToolRunner.run(configuration, new DataJoinMapReduce(),
                 args);

         // exit program
         System.exit(status);
     }
 }
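The mapper above decides the tag from the number of comma-separated fields: 3 for a customer line, 4 for an order line. That decision can be exercised as a plain function without a cluster. The sketch below is only an illustration: `LineTaggerSketch` and `tagLine` are hypothetical names, and returning `null` for a skipped line is an assumption of this sketch, not something the job does.

```java
class LineTaggerSketch {

    // Returns {joinKey, tag, data} for a well-formed line, or null when the
    // line should be skipped (wrong field count or a non-numeric id).
    static String[] tagLine(String line) {
        String[] vals = line.split(",");
        if (vals.length != 3 && vals.length != 4) {
            return null;
        }

        String key = vals[0].trim();
        try {
            Long.parseLong(key); // the join key must be a number
        } catch (NumberFormatException e) {
            return null;
        }

        if (vals.length == 3) {
            // customer line: cid,name,phone
            return new String[] { key, "customer", vals[1] + "," + vals[2] };
        }
        // order line: cid,orderId,price,date
        return new String[] { key, "order", vals[1] + "," + vals[2] + "," + vals[3] };
    }

    public static void main(String[] args) {
        String[] tagged = tagLine("3,A,12.95,02-Jun-2008");
        System.out.println(tagged[0] + " -> " + tagged[1] + " : " + tagged[2]);
    }
}
```

Tagging by field count is simple but fragile: it stops working as soon as both files have the same number of columns, or a field itself contains a comma.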

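Querying the result after running the job

[hadoop@beifeng01 hadoop-2.5.0-cdh5.3.6]$ bin/hdfs dfs -text /user/beifeng01/mapreduce/output/p*
1,stephaie leung,555-555-5555,B,88.25,20-May-2008
2,edward kim,123-456-7890,C,32.00,30-Nov-2007
3,jose madriz,281-330-8004,D,25.02,22-Jan-2009
3,jose madriz,281-330-8004,A,12.95,02-Jun-2008

The two rows for customer 3 appear in a different order than in the required output. The job does not sort the values within a key, so the order of rows that share a key is not guaranteed.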
