Learning Hadoop Together: Implementing a Join Between Two Tables

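So far we have only used MapReduce to process a single table (a file can be treated as a table; Hive and relational databases such as MySQL and Oracle all ultimately store their data in files). In practice we often have to process several tables at once, with different data stored in different files, so Hadoop also supports join operations similar to those of traditional relational databases. Higher-level frameworks in the Hadoop ecosystem such as Hive and Pig implement joins as well: you write statements in a SQL-like language, and they run as MapReduce jobs under the hood. This article shows how to implement a join directly in MapReduce, as groundwork for learning Hive later.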

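1. Map-side join.
The small table is loaded into memory before the map function runs, and each record of the large table is joined inside the mapper. Use case: one file is large and the other is small, small enough to be loaded into memory. If both files are large, loading one of them into memory fails with an out-of-memory (OOM) error. A map-side join uses the addCacheFile() method of the Job class to distribute the small file to every compute node, where it is then loaded into the node's memory.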

The following example implements a map-side join:
1. The employee table contains the following data:
name gender age dept_no
Tom male 30 1
Tony male 35 2
Lily female 28 1
Lucy female 32 3

2. The dept (department) table contains the following data:
dept_no dept_name
1 TSD
2 MCD
3 PPD
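
Before bringing Hadoop into the picture, the core idea of a map-side join can be sketched in plain Java: build a hash map from the small table, then stream the large table and look up each record's join key. This is only an illustration under assumptions: the class name MapSideJoinSketch and its methods are hypothetical, and the records are passed in as lists of strings rather than read from HDFS or the distributed cache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free illustration of the map-side (hash) join idea.
class MapSideJoinSketch {

    // Build the in-memory lookup table from the small table: dept_no -> dept_name.
    static Map<Integer, String> loadSmallTable(List<String> deptLines) {
        Map<Integer, String> deptData = new HashMap<>();
        for (String line : deptLines) {
            String[] fields = line.split(" ");
            deptData.put(Integer.parseInt(fields[0]), fields[1]);
        }
        return deptData;
    }

    // Stream the large table and join each record against the lookup table.
    static List<String> join(List<String> employeeLines, Map<Integer, String> deptData) {
        List<String> joined = new ArrayList<>();
        for (String line : employeeLines) {
            String[] fields = line.split(" ");
            // dept_no is the fourth field of an employee record.
            String deptName = deptData.get(Integer.parseInt(fields[3]));
            if (deptName != null) { // inner join: skip records without a match
                joined.add(line + " " + deptName);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> deptData =
                loadSmallTable(Arrays.asList("1 TSD", "2 MCD", "3 PPD"));
        List<String> employees = Arrays.asList(
                "Tom male 30 1", "Tony male 35 2", "Lily female 28 1", "Lucy female 32 3");
        for (String row : join(employees, deptData)) {
            System.out.println(row);
        }
    }
}
```

In the MapReduce version, loadSmallTable corresponds roughly to the mapper's setup() method and join corresponds to map(), which is called once per input record.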

The code is as follows:

 package join;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.*;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 import java.io.BufferedReader;
 import java.io.FileReader;
 import java.io.IOException;
 import java.net.URI;
 import java.util.HashMap;
 import java.util.Map;

 public class MapJoin extends Configured implements Tool {

     public static class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

         // In-memory copy of the small dept table: dept_no -> dept_name.
         private Map<Integer, String> deptData = new HashMap<Integer, String>();

         @Override
         protected void setup(Context context) throws IOException, InterruptedException {
             super.setup(context);
             // Read the small table from the distributed cache.
             // Note: getLocalCacheFiles() is deprecated in Hadoop 2.x in favor of getCacheFiles().
             Path[] files = context.getLocalCacheFiles();
             BufferedReader reader = new BufferedReader(new FileReader(files[0].toString()));
             try {
                 String str;
                 // Read the file line by line.
                 while ((str = reader.readLine()) != null) {
                     // Split each cached record on a single space.
                     String[] splits = str.split(" ");
                     // Keep only the fields we need. The map must stay small enough
                     // to fit in memory, otherwise the task fails with an OOM error.
                     deptData.put(Integer.parseInt(splits[0]), splits[1]);
                 }
             } finally {
                 // Errors are no longer swallowed: a bad cache file fails the task.
                 reader.close();
             }
         }

         @Override
         protected void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
             // One record of the large employee table, read from HDFS.
             String[] values = value.toString().split(" ");
             // Skip blank or malformed lines.
             if (values.length < 4) return;
             // The join key dept_no is the fourth field.
             int depNo = Integer.parseInt(values[3]);
             // Look up the department name in the in-memory table.
             String depName = deptData.get(depNo);
             // Inner join: skip employees whose dept_no has no match
             // (otherwise the literal string "null" would be appended).
             if (depName == null) return;
             String resultData = value.toString() + " " + depName;
             // Emit the joined record to the reducer via the context.
             context.write(new Text(resultData), NullWritable.get());
         }
     }

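     public static class MapJoinReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

         // The join is already done in the mapper. The reducer only writes each
         // distinct joined line once, in sorted key order.
         @Override
         public void reduce(Text key, Iterable<NullWritable> values, Context context)
                 throws IOException, InterruptedException {
             context.write(key, NullWritable.get());
         }
     }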

     @Override
     public int run(String[] args) throws Exception {
         // Use the configuration prepared by ToolRunner instead of creating a new one.
         Configuration conf = getConf();
         Job job = Job.getInstance(conf, "Map side join");
         // Add the small table (args[0]) to the distributed cache.
         job.addCacheFile(new URI(args[0]));
         job.setJarByClass(MapJoinMapper.class);
         // 1.1 Set the input path (the large table, args[1]) and the input format class.
         FileInputFormat.setInputPaths(job, new Path(args[1]));
         job.setInputFormatClass(TextInputFormat.class);
         // 1.2 Set the custom Mapper class and the key/value types of the map output.
         job.setMapperClass(MapJoinMapper.class);
         job.setMapOutputKeyClass(Text.class);
         job.setMapOutputValueClass(NullWritable.class);
         // 1.3 Set the number of reduce tasks.
         job.setNumReduceTasks(1);
         // Set the class that implements the reduce function.
         job.setReducerClass(MapJoinReducer.class);
         // Set the key type of the reduce output.
         job.setOutputKeyClass(Text.class);
         // Set the value type of the reduce output.
         job.setOutputValueClass(NullWritable.class);
         // Delete the output path (args[2]) if it already exists.
         Path mypath = new Path(args[2]);
         FileSystem hdfs = mypath.getFileSystem(conf);
         if (hdfs.exists(mypath)) {
             hdfs.delete(mypath, true);
         }
         FileOutputFormat.setOutputPath(job, mypath);
         return job.waitForCompletion(true) ? 0 : 1;
     }

     public static void main(String[] args) throws Exception {
         int exitCode = ToolRunner.run(new MapJoin(), args);
         System.exit(exitCode);
     }
 }

The script used to run the job is as follows:

 /usr/local/src/hadoop-2.6./bin/hadoop jar MapJoin.jar \
 hdfs://hadoop-master:8020/data/dept.txt \
 hdfs://hadoop-master:8020/data/employee.txt \
 hdfs://hadoop-master:8020/mapjoin_output

Output:

Lily female 28 1 TSD
Lucy female 32 3 PPD
Tom male 30 1 TSD
Tony male 35 2 MCD

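2. Reduce-side join.
The join is performed in the reduce phase. The idea: in the map phase, tag each record with the table it comes from, for example tag records from the employee table with a and records from the dept table with b. Then, in the reduce phase, compute the Cartesian product of the records from the two tables that share the same key. As illustrated in the figure of the original post, we join the employee table and the dept table on the dept_no field, so dept_no is used as the key.

In MapReduce, records with the same key are grouped together, so in the reduce function we only need to determine which table each record comes from. Records from the same table are not joined with each other.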
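
The tag, group, and cross-product steps can be simulated in plain Java before looking at the MapReduce program. The sketch below is only an illustration under assumptions: the class ReduceSideJoinSketch and its methods are hypothetical, and a TreeMap stands in for the shuffle that groups map output by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free simulation of a reduce-side join.
class ReduceSideJoinSketch {

    // "Map" step: key each record by dept_no and append the tag of its source table.
    static void mapPhase(List<String> lines, int keyIndex, String tag,
                         Map<String, List<String>> shuffled) {
        for (String line : lines) {
            String key = line.split(" ")[keyIndex];
            shuffled.computeIfAbsent(key, k -> new ArrayList<>()).add(line + " " + tag);
        }
    }

    // "Reduce" step for one key: split the values by tag, then emit the cross product.
    static List<String> reducePhase(List<String> taggedValues) {
        List<String> fromEmployee = new ArrayList<>();
        List<String> fromDept = new ArrayList<>();
        for (String value : taggedValues) {
            int lastSpace = value.lastIndexOf(' ');
            String record = value.substring(0, lastSpace);
            String tag = value.substring(lastSpace + 1);
            if ("a".equals(tag)) {
                fromEmployee.add(record);
            } else if ("b".equals(tag)) {
                fromDept.add(record);
            }
        }
        List<String> joined = new ArrayList<>();
        for (String emp : fromEmployee) {
            for (String dept : fromDept) {
                // Append dept_name (second field of the dept record).
                joined.add(emp + " " + dept.split(" ")[1]);
            }
        }
        return joined;
    }

    static List<String> join(List<String> employees, List<String> depts) {
        Map<String, List<String>> shuffled = new TreeMap<>(); // stands in for the shuffle
        mapPhase(employees, 3, "a", shuffled);
        mapPhase(depts, 0, "b", shuffled);
        List<String> out = new ArrayList<>();
        for (List<String> values : shuffled.values()) {
            out.addAll(reducePhase(values));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> employees = Arrays.asList(
                "Tom male 30 1", "Tony male 35 2", "Lily female 28 1", "Lucy female 32 3");
        List<String> depts = Arrays.asList("1 TSD", "2 MCD", "3 PPD");
        for (String row : join(employees, depts)) {
            System.out.println(row);
        }
    }
}
```

A key that appears in only one of the two tables produces an empty cross product, so the result is an inner join.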

The code is as follows:

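 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.FileSplit;
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.GenericOptionsParser;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.List;

 public class ReduceJoin extends Configured implements Tool {

     public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {

         @Override
         protected void map(LongWritable key, Text value, Context context)
                 throws IOException, InterruptedException {
             // Use the name of the input file to decide which table the record comes from:
             // employee records are tagged "a", dept records are tagged "b".
             // Only the file name is checked, so directory names cannot cause a false match.
             String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
             String line = value.toString();
             if (line.isEmpty()) return;
             String[] fields = line.split(" ");
             if (fileName.contains("employee")) {
                 if (fields.length < 4) return;
                 // dept_no is the fourth field of an employee record.
                 context.write(new Text(fields[3]), new Text(line + " a"));
             } else if (fileName.contains("dept")) {
                 if (fields.length < 2) return;
                 // dept_no is the first field of a dept record.
                 context.write(new Text(fields[0]), new Text(line + " b"));
             }
         }
     }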

     public static class JoinReducer extends Reducer<Text, Text, Text, NullWritable> {

         @Override
         protected void reduce(Text key, Iterable<Text> values, Context context)
                 throws IOException, InterruptedException {
             List<String[]> lista = new ArrayList<String[]>(); // employee records (tag a)
             List<String[]> listb = new ArrayList<String[]>(); // dept records (tag b)
             for (Text val : values) {
                 String[] str = val.toString().split(" ");
                 // The last field is the tag, so it tells us which file the record
                 // comes from: tag a goes into lista, tag b goes into listb.
                 String flag = str[str.length - 1];
                 if ("a".equals(flag)) {
                     lista.add(str);
                 } else if ("b".equals(flag)) {
                     listb.add(str);
                 }
             }
             // Cross product of the two lists for this key. If either list is
             // empty, nothing is emitted (inner join).
             for (String[] stra : lista) {
                 for (String[] strb : listb) {
                     String keyValue = stra[0] + " " + stra[1] + " " + stra[2] + " " + stra[3] + " " + strb[1];
                     context.write(new Text(keyValue), NullWritable.get());
                 }
             }
         }
     }

     @Override
     public int run(String[] args) throws Exception {
         Configuration conf = getConf();
         GenericOptionsParser optionparser = new GenericOptionsParser(conf, args);
         conf = optionparser.getConfiguration();
         Job job = Job.getInstance(conf, "Reduce side join");
         job.setJarByClass(ReduceJoin.class);
         // 1.1 Set the input paths and the input format class.
         // input_data is a custom property passed with -D: a comma-separated list of both tables.
         FileInputFormat.addInputPaths(job, conf.get("input_data"));
         job.setInputFormatClass(TextInputFormat.class);
         // 1.2 Set the custom Mapper class and the key/value types of the map output.
         job.setMapperClass(JoinMapper.class);
         job.setMapOutputKeyClass(Text.class);
         job.setMapOutputValueClass(Text.class);
         // 1.3 Set the number of reduce tasks.
         job.setNumReduceTasks(1);
         // Set the class that implements the reduce function.
         job.setReducerClass(JoinReducer.class);
         // Set the key type of the reduce output.
         job.setOutputKeyClass(Text.class);
         // Set the value type of the reduce output.
         job.setOutputValueClass(NullWritable.class);
         // Delete the output path (custom property output_dir) if it already exists.
         Path output_dir = new Path(conf.get("output_dir"));
         FileSystem hdfs = output_dir.getFileSystem(conf);
         if (hdfs.exists(output_dir)) {
             hdfs.delete(output_dir, true);
         }
         FileOutputFormat.setOutputPath(job, output_dir);
         return job.waitForCompletion(true) ? 0 : 1;
     }

     public static void main(String[] args) throws Exception {
         int exitCode = ToolRunner.run(new ReduceJoin(), args);
         System.exit(exitCode);
     }
 }

The shell script used to run the MapReduce job is as follows:

 /usr/local/src/hadoop-2.6./bin/hadoop jar ReduceJoin.jar \
 -Dinput_data=hdfs://hadoop-master:8020/data/dept.txt,hdfs://hadoop-master:8020/data/employee.txt \
 -Doutput_dir=hdfs://hadoop-master:8020/reducejoin_output

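Summary:
1. A map-side join runs faster than a reduce-side join, because a reduce-side join consumes a lot of resources in the shuffle phase, where every record of both tables is sorted and sent to the reducers. A map-side join keeps the small table in memory, so it is very efficient.
2. When one of the tables is small enough to be loaded into memory, a map-side join is recommended.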
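
The rule of thumb in point 2 could be expressed as a small helper. This is a rough sketch, not a tuning recommendation: the class JoinStrategySketch, the expansion factor of 4 (a Java HashMap takes more space than the raw file), and the 25% heap budget are all assumptions chosen purely for illustration.

```java
// Hypothetical helper that picks a join strategy from the small table's size.
class JoinStrategySketch {

    enum Strategy { MAP_SIDE, REDUCE_SIDE }

    // Assumption: the in-memory HashMap needs about 4x the on-disk size.
    static final long EXPANSION_FACTOR = 4;
    // Assumption: at most 25% of a map task's heap may be used for the lookup table.
    static final double HEAP_BUDGET = 0.25;

    // smallTableBytes: on-disk size of the smaller table.
    // heapBytes: heap available to a single map task.
    static Strategy choose(long smallTableBytes, long heapBytes) {
        boolean fits = smallTableBytes * EXPANSION_FACTOR <= (long) (heapBytes * HEAP_BUDGET);
        return fits ? Strategy.MAP_SIDE : Strategy.REDUCE_SIDE;
    }

    public static void main(String[] args) {
        // Uses the heap of the current JVM as a stand-in for the map task's heap.
        long heap = Runtime.getRuntime().maxMemory();
        System.out.println(choose(1024L * 1024L, heap));
    }
}
```

In a real job, the size of the small table could be obtained from the file system before the job is submitted, and the thresholds would need to be measured for the actual data.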
