Hadoop Multi-Table Join

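1. Example Description

　　Multi-table join is similar to single-table join: it also processes the raw data to extract the information of interest. The example follows.

　　The input consists of two files. One represents the factory table, with a factory name column and an address ID column; the other represents the address table, with an address name column and an address ID column. The task is to find the correspondence between factory names and address names in the input data and output a factory name to address name table.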

　　Sample input:

　　factory:

　　factoryname addressID
　　Beijing Red Star 1
　　Shenzhen Thunder 3
　　Guangzhou Honda 2
　　Beijing Rising 1
　　Guangzhou Development Bank 2
　　Tencent 3
　　Bank of Beijing 1

　　address:

　　addressID addressname
　　1 Beijing
　　2 Guangzhou
　　3 Shenzhen
　　4 Xian

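　　Sample output (rows for the same address may appear in a different order):

　　factoryname                   addressname
　　Beijing Red Star              Beijing
　　Beijing Rising                Beijing
　　Bank of Beijing               Beijing
　　Guangzhou Honda               Guangzhou
　　Guangzhou Development Bank    Guangzhou
　　Shenzhen Thunder              Shenzhen
　　Tencent                       Shenzhen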

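2. Design Approach

　　Multi-table join, like single-table join, resembles a natural join in a database. Compared with single-table join, the left table, the right table, and the join column are more clearly defined here, so the same processing approach can be used. Once Map has identified which table an input line belongs to, it splits the line, stores the join column value in the key, stores the other column together with a left/right table tag in the value, and emits the pair. Reduce receives all values that share a join key, parses each value, stores the left-table and right-table entries separately according to the tag, computes their Cartesian product, and writes the result directly.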
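　　The tag, group, and Cartesian-product steps described above can be tried out without a cluster. The following is a minimal sketch in plain Java, not the Hadoop job itself: the class `JoinSketch` and its methods are hypothetical names introduced for illustration, the rows are assumed to be already split into columns, and a `TreeMap` stands in for the shuffle that groups values by key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of a reduce-side join (no Hadoop involved).
public class JoinSketch {

    // "Map" + "shuffle": emit (joinKey, tag + otherColumn) for every row and group by key.
    // factoryRows: {factoryname, addressID}; addressRows: {addressID, addressname}.
    static Map<String, List<String>> mapAndShuffle(String[][] factoryRows, String[][] addressRows) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] row : factoryRows) {
            grouped.computeIfAbsent(row[1], k -> new ArrayList<>()).add("1+" + row[0]);
        }
        for (String[] row : addressRows) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add("2+" + row[1]);
        }
        return grouped;
    }

    // "Reduce" for one key: separate values by tag, then take the Cartesian product.
    static List<String> reduce(List<String> values) {
        List<String> factories = new ArrayList<>();
        List<String> addresses = new ArrayList<>();
        for (String v : values) {
            if (v.charAt(0) == '1') {
                factories.add(v.substring(2));
            } else {
                addresses.add(v.substring(2));
            }
        }
        List<String> out = new ArrayList<>();
        for (String f : factories) {
            for (String a : addresses) {
                out.add(f + "\t" + a);
            }
        }
        return out;
    }

    // Runs both steps and concatenates the per-key results in key order.
    static List<String> join(String[][] factoryRows, String[][] addressRows) {
        List<String> out = new ArrayList<>();
        for (List<String> values : mapAndShuffle(factoryRows, addressRows).values()) {
            out.addAll(reduce(values));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] factories = {
            {"Beijing Red Star", "1"}, {"Shenzhen Thunder", "3"}, {"Guangzhou Honda", "2"},
            {"Beijing Rising", "1"}, {"Guangzhou Development Bank", "2"},
            {"Tencent", "3"}, {"Bank of Beijing", "1"}
        };
        String[][] addresses = {
            {"1", "Beijing"}, {"2", "Guangzhou"}, {"3", "Shenzhen"}, {"4", "Xian"}
        };
        for (String line : join(factories, addresses)) {
            System.out.println(line);
        }
    }
}
```

　　Address 4 (Xian) has no matching factory, so its group contains only a right-table value and the product for that key is empty; this is why it does not appear in the joined result.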

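　　For a detailed analysis of this example, see the Hadoop single-table join post; the code is given below.

3. Program Code

　　The program code is as follows: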

 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.List;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.GenericOptionsParser;

 public class MTjoin {

     // Counts reduce() calls in this JVM so that the header row is written only once.
     // With more than one reduce task, each task writes its own header row.
     public static int time = 0;

     public static class Map extends Mapper<Object, Text, Text, Text> {

         // Map first decides whether the input line belongs to the left table (factory)
         // or the right table (address), then splits it into its two columns.
         // The join column becomes the key; the other column, prefixed with a
         // left/right table tag, becomes the value.
         @Override
         protected void map(Object key, Text value, Context context)
                 throws IOException, InterruptedException {
             String line = value.toString().trim();

             // Skip blank lines and the header row of either input file
             if (line.isEmpty() || line.contains("factoryname") || line.contains("addressID")) {
                 return;
             }

             if (Character.isDigit(line.charAt(0))) {
                 // Right table: "<addressID> <addressname>", split at the first space
                 int j = line.indexOf(' ');
                 if (j < 0) {
                     return;
                 }
                 context.write(new Text(line.substring(0, j)),
                         new Text("2+" + line.substring(j + 1)));
             } else {
                 // Left table: "<factoryname> <addressID>", split at the last space
                 // because factory names may themselves contain spaces
                 int j = line.lastIndexOf(' ');
                 if (j < 0) {
                     return;
                 }
                 context.write(new Text(line.substring(j + 1)),
                         new Text("1+" + line.substring(0, j)));
             }
         }
     }

     public static class Reduce extends Reducer<Text, Text, Text, Text> {

         // Reduce parses the Map output, stores the values of the left and right
         // tables separately, then computes and writes their Cartesian product.
         @Override
         protected void reduce(Text key, Iterable<Text> values, Context context)
                 throws IOException, InterruptedException {
             if (time == 0) {
                 // First line of the output file
                 context.write(new Text("factoryname"), new Text("addressname"));
                 time++;
             }

             // Lists instead of fixed-size arrays, so a key may have any number of rows
             List<String> factories = new ArrayList<String>();
             List<String> addresses = new ArrayList<String>();

             for (Text value : values) {
                 String record = value.toString();
                 if (record.length() < 2) {
                     continue;
                 }
                 // The first character is the table tag; the payload starts at index 2
                 if (record.charAt(0) == '1') {
                     // Left table
                     factories.add(record.substring(2));
                 } else {
                     // Right table
                     addresses.add(record.substring(2));
                 }
             }

             // Cartesian product; nothing is written when either side is empty
             for (String factory : factories) {
                 for (String address : addresses) {
                     context.write(new Text(factory), new Text(address));
                 }
             }
         }
     }

     public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
         Configuration conf = new Configuration();
         String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
         if (otherArgs.length != 2) {
             System.err.println("Usage: MTjoin <in> <out>");
             System.exit(2);
         }

         // Job.getInstance replaces the deprecated new Job(conf, name) constructor (Hadoop 2.x and later)
         Job job = Job.getInstance(conf, "multiple table join");
         job.setJarByClass(MTjoin.class);
         job.setMapperClass(Map.class);
         job.setReducerClass(Reduce.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(Text.class);

         // <in> is the directory that holds both the factory file and the address file
         FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
         FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }
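
　　The line-splitting step in Map is the part most sensitive to the input format, and it can be checked on its own without Hadoop. The sketch below is a hypothetical helper (`LineSplitter` is not part of `MTjoin`). It assumes columns are separated by single spaces, that the address ID is the last token of a factory line and the first token of an address line, and that factory names do not start with a digit; a name that does would be misclassified as an address line.

```java
// Hypothetical stand-alone version of the splitting rule, for quick checks without a cluster.
public class LineSplitter {

    // Returns {joinKey, taggedValue}, or null for header rows, blank lines,
    // and lines that contain no space to split on.
    static String[] split(String line) {
        line = line.trim();
        if (line.isEmpty() || line.contains("factoryname") || line.contains("addressID")) {
            return null;
        }
        if (Character.isDigit(line.charAt(0))) {
            // Address line: "<addressID> <addressname>", split at the first space
            int j = line.indexOf(' ');
            if (j < 0) {
                return null;
            }
            return new String[] {line.substring(0, j), "2+" + line.substring(j + 1)};
        }
        // Factory line: "<factoryname> <addressID>", split at the last space
        int j = line.lastIndexOf(' ');
        if (j < 0) {
            return null;
        }
        return new String[] {line.substring(j + 1), "1+" + line.substring(0, j)};
    }

    public static void main(String[] args) {
        String[] samples = {"addressID addressname", "Beijing Red Star 1", "3 Shenzhen", "Acme 12"};
        for (String s : samples) {
            String[] kv = split(s);
            System.out.println(s + " -> " + (kv == null ? "skipped" : kv[0] + " | " + kv[1]));
        }
    }
}
```

　　Splitting on the space rather than scanning for a single digit keeps multi-digit address IDs such as 12 intact.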
