Requirements

Data:

[Main table]: stored in log.txt

--------------------------------------------------------
Phone number    Brand type    Login time             Online duration
13512435454     1             2018-11-12 12:32:32    50
.......
--------------------------------------------------------

[Secondary table]: stored in type.txt

--------------------------------------------------------
Brand type (primary key)    Brand name
1                           动感地带
2                           xxxxx
.......
--------------------------------------------------------

Target output:

--------------------------------------------------------------------
Phone number    Brand type    Brand name    Login time             Online duration
13512435454     1             动感地带      2018-11-12 12:32:32    50
.......
--------------------------------------------------------------------

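Test data

type.txt (type table)

1    动感地带
2    全球通
3    神州行
4    神州大众卡
7    流量王

log.txt (log table)

13112345123    1    2018-11-11 00:00:00    50
13245612378    1    2018-11-11 12:32:45    18
13674589656    5    2018-11-12 13:25:15    66
13192258656    2    2018-11-14 07:05:15    12
13747958635    4    2018-11-15 09:12:59    47
13565412545    3    2018-11-16 13:04:09    19

Note: fields in both files are separated by TAB characters.

Target output

13245612378    1    动感地带    2018-11-11 12:32:45    18
13112345123    1    动感地带    2018-11-11 00:00:00    50
13192258656    2    全球通    2018-11-14 07:05:15    12
13565412545    3    神州行    2018-11-16 13:04:09    19
13747958635    4    神州大众卡    2018-11-15 09:12:59    47
13674589656    5    null    2018-11-12 13:25:15    66

This is a left outer join with log as the left table: brand type 5 has no row in the type table, so its brand name is null, and brand type 7 has no log rows, so it does not appear in the output. The order of the rows may differ between implementations.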

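Approach 1: Reduce-side join

Idea

Mapper phase: put type.txt and log.txt in the same input directory, and use the path of the input file to determine which table each record comes from.
- For a record from the type table, emit <brand type, "t" + brand name>
- For a record from the log table, emit <brand type, "l" + phone number + "\t" + login time + "\t" + online duration>
Reducer phase: because the Mapper output key is the brand type, all records from both tables that share a brand type are handled in a single call to reduce. Since brand type is the primary key of the type table, at most one value in that call comes from the type table. The reducer therefore makes one pass over the value list, remembering the brand name for the key and buffering the log-table records, and then iterates over the buffered log records, inserting the brand name into each row. If the key has no record in the type table, the brand name stays at its default of null, which gives the left outer join semantics.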
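
The tagging and grouping steps above can be tried out without a cluster. The following is a minimal plain-Java sketch, not the Hadoop job itself: the class name `ReduceSideJoinSketch`, its methods, and the ASCII brand names `BrandA` and `BrandG` are all hypothetical, and a `TreeMap` stands in for the shuffle that groups values by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join (hypothetical names, no Hadoop).
class ReduceSideJoinSketch {

    // Map phase: turn every input line into a {brandType, taggedValue} pair.
    static List<String[]> mapPhase(List<String> typeLines, List<String> logLines) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : typeLines) {
            String[] f = line.split("\t");
            if (f.length == 2) {
                pairs.add(new String[] {f[0], "t" + f[1]});
            }
        }
        for (String line : logLines) {
            String[] f = line.split("\t");
            if (f.length == 4) {
                pairs.add(new String[] {f[1], "l" + f[0] + "\t" + f[2] + "\t" + f[3]});
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group by brandType, then join inside each group.
    static List<String> reducePhase(List<String[]> pairs) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] p : pairs) {
            groups.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : groups.entrySet()) {
            String brandName = "null";              // default when no type row matches
            List<String> logs = new ArrayList<>();  // log rows buffered for this key
            for (String v : group.getValue()) {
                if (v.startsWith("t")) {
                    brandName = v.substring(1);
                } else if (v.startsWith("l")) {
                    logs.add(v.substring(1));
                }
            }
            for (String log : logs) {
                String[] f = log.split("\t");
                out.add(f[0] + "\t" + group.getKey() + "\t" + brandName + "\t" + f[1] + "\t" + f[2]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList("1\tBrandA", "7\tBrandG");
        List<String> logs = Arrays.asList(
                "13112345123\t1\t2018-11-11 00:00:00\t50",
                "13674589656\t5\t2018-11-12 13:25:15\t66");
        for (String row : reducePhase(mapPhase(types, logs))) {
            System.out.println(row);
        }
    }
}
```

Note that the per-key `logs` buffer is the part whose size grows with the number of log rows sharing a brand type.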

Code implementation

package test.linzch3;

import java.io.IOException;
import java.util.LinkedList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class LeftOuterJoin1 {

    private static class MyMapper extends Mapper<Object, Text, Text, Text>{
        private final Text outKey = new Text();
        private final Text outVal = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException{
            String line = value.toString();
            if(line == null || line.trim().equals("")) return; // skip empty records
            String filePath = ((FileSplit) context.getInputSplit()).getPath().toString();
            String[] values = line.split("\t");
            // Handle type.txt and log.txt according to the input file path;
            // prefix the value with "t" or "l" to mark which table it came from
            if(filePath.contains("type.txt") && values.length == 2){
                outKey.set(values[0]);
                outVal.set("t" + values[1]);
                context.write(outKey, outVal);
            }else if(filePath.contains("log.txt") && values.length == 4){
                outKey.set(values[1]);
                outVal.set("l" + values[0] + "\t" + values[2] + "\t" + values[3]);
                context.write(outKey, outVal);
            }
        }
    }

    private static class MyReducer extends Reducer<Text, Text, Text, Text> {
        private final LinkedList<String> logs = new LinkedList<String>();
        private String type = "";
        private final Text outKey = new Text();
        private final Text outVal = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            logs.clear();
            type = "null"; // default when the type table has no row for this key
            // The first character of each value tells whether it came from the type table or the log table
            for(Text tval:values){
                String val = tval.toString();
                if(val.startsWith("l"))
                    logs.add(val.substring(1));
                else if(val.startsWith("t"))
                    type = val.substring(1);
            }
            // Emit every buffered log row with the brand name inserted
            for(String log:logs){
                String[] fields = log.split("\t");
                outKey.set(fields[0]);
                outVal.set(key.toString() + "\t" + type + "\t" + fields[1] + "\t" + fields[2]);
                context.write(outKey, outVal);
            }
        }
    }

    public static void main(String[] args) throws Exception{
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: LeftOuterJoin1 <in> <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "Left outer join1");
        job.setJarByClass(LeftOuterJoin1.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        /* delete the output directory if it exists */
        Path out = new Path(otherArgs[otherArgs.length - 1]);
        FileSystem fileSystem = FileSystem.get(conf);
        if (fileSystem.exists(out)) {
            fileSystem.delete(out, true);
        }

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

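Approach 2: Map-side join

Idea

When one of the two tables being joined is small enough to fit comfortably in memory on each node, the small table can be added to the distributed cache (DistributedCache). The MapReduce framework then distributes the cached file to every node that runs a map task, and each map task reads its local copy. Completing the join on the map side reduces the amount of data sent over the network to the reduce side, which improves the efficiency of the whole job.
Assuming the type table is small, add type.txt to the distributed cache. During map processing, read the locally cached type.txt and insert the brand name for the matching brand type into each log-table row. No Reducer needs to be implemented.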
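
The core of this approach is an in-memory hash lookup per log row. A minimal plain-Java sketch under stated assumptions: `MapSideJoinSketch`, `loadTypes`, and `joinLine` are hypothetical names, a `List<String>` stands in for the cached file, and the ASCII brand name `BrandA` is a placeholder.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of a map-side (hash) join (hypothetical names, no Hadoop).
class MapSideJoinSketch {

    // Corresponds to setup(): load the small table into memory once per task.
    static Map<String, String> loadTypes(List<String> typeLines) {
        Map<String, String> types = new HashMap<>();
        for (String line : typeLines) {
            String[] f = line.split("\t");
            if (f.length == 2) {
                types.put(f[0], f[1]);
            }
        }
        return types;
    }

    // Corresponds to map(): join one log row; returns null for a malformed row.
    static String joinLine(String logLine, Map<String, String> types) {
        String[] f = logLine.split("\t");
        if (f.length != 4) {
            return null;
        }
        // A missing brand type yields a null lookup, which concatenates as "null".
        return f[0] + "\t" + f[1] + "\t" + types.get(f[1]) + "\t" + f[2] + "\t" + f[3];
    }

    public static void main(String[] args) {
        Map<String, String> types = loadTypes(Arrays.asList("1\tBrandA"));
        System.out.println(joinLine("13112345123\t1\t2018-11-11 00:00:00\t50", types));
        System.out.println(joinLine("13674589656\t5\t2018-11-12 13:25:15\t66", types));
    }
}
```

Each log row is joined independently of every other row, which is why no shuffle or reduce step is required.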

Code implementation

package test.linzch3;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LeftOuterJoin2 {

    private static class MyMapper extends Mapper<Object, Text, Text, Text>{
        private final Map<String, String> typeMaps = new HashMap<String, String>();
        private final Text outKey = new Text();
        private final Text outVal = new Text();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The cached file is reachable through the symlink "type.txt" in the task's working directory
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream("type.txt"), StandardCharsets.UTF_8))) {
                String line;
                while((line = br.readLine()) != null) {
                    // Load the cached type table into memory on the map side
                    String[] values = line.split("\t");
                    if(values.length != 2) continue;
                    typeMaps.put(values[0], values[1]);
                }
            }
        }

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException{
            String line = value.toString();
            if(line == null || line.trim().equals("")) return; // skip empty records
            String[] values = line.split("\t");
            if(values.length != 4) return; // skip malformed records
            outKey.set(values[0]);
            // typeMaps.get returns null for an unknown brand type, which is written as "null"
            outVal.set(values[1] + "\t" + typeMaps.get(values[1]) + "\t" + values[2] + "\t" + values[3]);
            context.write(outKey, outVal);
        }
    }

    private final static String FILE_IN_PATH = "hdfs://localhost:9000/user/hadoop/input2/log.txt";
    private final static String FILE_OUT_PATH = "hdfs://localhost:9000/user/hadoop/output2/";

    public static void main(String[] args) throws Exception{
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Left outer join2");
        // Add the distributed cache file; map tasks can then open it through the link name type.txt
        job.addCacheFile(new URI("hdfs://localhost:9000/user/hadoop/input2/type.txt"));
        job.setJarByClass(LeftOuterJoin2.class);
        job.setMapperClass(MyMapper.class);
        job.setNumReduceTasks(0); // map-only job: no shuffle or reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(FILE_IN_PATH));
        FileOutputFormat.setOutputPath(job, new Path(FILE_OUT_PATH));

        /* delete the output directory if it exists */
        Path out = new Path(FILE_OUT_PATH);
        FileSystem fileSystem = FileSystem.get(conf);
        if (fileSystem.exists(out)) {
            fileSystem.delete(out, true);
        }

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

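Approach 3: Secondary-sort implementation

Idea

Consider the two approaches above:
- Approach 1: on the Reducer side, suppose the incoming value list is very long. The reducer must scan the whole list to find the single brand name for the key (call it datum A) while buffering every log-table record in a LinkedList. Both time and memory costs grow with the length of the value list, which makes this unsuitable for large data volumes.
- Approach 2: reading the file directly in the Mapper is efficient, but only if the type table can be distributed to every node through the cache. Once both tables are too large to cache on every node, this approach no longer works.
- Summary: Approach 2 breaks down as soon as the file gets large, and there is no way to optimize around that. Approach 1, however, still has room for improvement. Its problem is the cost of buffering and scanning the whole value list, and the only reason for the scan is to find datum A. Is there a way to find that value without scanning? The answer is secondary sort.
Optimization
- Finding datum A without scanning is equivalent to knowing its position in the value list in advance. The two obvious positions are first and last. If it is first, each reduce call only needs to check whether the first value comes from the type table and then stream through the rest of the value list, writing output as it goes. The log-table records no longer need to be buffered, so both time and memory are improved.
- How can datum A be kept at the head of the value list? This is where the "magic" of MapReduce, the shuffle phase, comes in. The steps are:
  - Design a composite key: <record-type tag, brandType>. Both are ints, and tag is either 0 or 1 (0 for type-table records, 1 for log-table records). Keys sort by brandType first and then by tag, so within each brandType the type-table record comes first.
  - Implement a custom partitioner and grouping comparator, so that all records with the same brandType (regardless of source table) are handled together in a single reduce call on the same Reducer.
  - Mapper: works on the same principle as the Mapper in Approach 1, except that the source table is recorded in the tag of the key rather than as a prefix on the value.
  - Reducer: first check whether the first element of the value list comes from the type table (if not, datum A defaults to null), then iterate over the remaining values, which are all log-table records, and write each one with datum A inserted.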
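
The effect of the composite key can be illustrated with an ordinary in-memory sort. The following is a plain-Java sketch under stated assumptions, not the Hadoop job: `SecondarySortJoinSketch`, `Rec`, `partitionOf`, and `join` are hypothetical names, a sorted list stands in for the shuffle output, and each brand type is assumed to have at most one type-table record (it is the primary key).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of a secondary-sort join (hypothetical names, no Hadoop).
class SecondarySortJoinSketch {

    // One shuffled record: composite key (tag, brandType) plus its value.
    static final class Rec {
        final int tag;        // 0 = type table, 1 = log table
        final int brandType;
        final String value;   // brand name, or "phone\tloginTime\tduration"

        Rec(int tag, int brandType, String value) {
            this.tag = tag;
            this.brandType = brandType;
            this.value = value;
        }
    }

    // Partition on brandType only, so both tables meet on the same reducer.
    // Masking the sign bit keeps the result non-negative.
    static int partitionOf(int brandType, int numPartitions) {
        return (brandType & Integer.MAX_VALUE) % numPartitions;
    }

    static List<String> join(List<Rec> records) {
        // Sort by brandType, then tag: the type row (tag 0) leads each group.
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparingInt((Rec r) -> r.brandType)
                .thenComparingInt(r -> r.tag));

        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            int brand = sorted.get(i).brandType;
            String brandName = "null";
            // Only the first record of the group needs to be inspected.
            if (sorted.get(i).tag == 0) {
                brandName = sorted.get(i).value;
                i++;
            }
            // Stream the remaining log rows of this group without buffering them.
            while (i < sorted.size() && sorted.get(i).brandType == brand) {
                String[] f = sorted.get(i).value.split("\t");
                out.add(f[0] + "\t" + brand + "\t" + brandName + "\t" + f[1] + "\t" + f[2]);
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> records = Arrays.asList(
                new Rec(1, 1, "13112345123\t2018-11-11 00:00:00\t50"),
                new Rec(1, 5, "13674589656\t2018-11-12 13:25:15\t66"),
                new Rec(0, 1, "BrandA"));
        for (String row : join(records)) {
            System.out.println(row);
        }
    }
}
```

Compared with the buffering reducer of Approach 1, the only per-group state here is the single `brandName` string.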

Code implementation

package test.linzch3;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class LeftOuterJoin3 {

    /* Composite key: sorts by brandType first, then by tag (0 = type table, 1 = log table) */
    public static class CompositeKey implements WritableComparable<CompositeKey>{
        private int tag;
        private int brandType;

        public int getTag() {
            return tag;
        }

        public int getBrandType() {
            return brandType;
        }

        public void set(int tag_, int brandType_){
            tag = tag_;
            brandType = brandType_;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(tag);
            out.writeInt(brandType);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            tag = in.readInt();
            brandType = in.readInt();
        }

        @Override
        public int compareTo(CompositeKey other) {
            if(brandType != other.brandType)
                return brandType < other.brandType ? -1 : 1;
            else if(tag != other.tag)
                return tag < other.tag ? -1 : 1;
            else return 0;
        }
    }

    /* Partition on brandType only, so both tables' records for a brand reach the same Reducer */
    private static class MyPartitioner extends Partitioner<CompositeKey, Text>{
        @Override
        public int getPartition(CompositeKey key, Text value, int numPartitions) {
            // mask the sign bit so the result is never negative
            return (key.getBrandType() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    /* Group on brandType only, so one reduce call sees the type record followed by the log records */
    private static class MyGroupingComparator extends WritableComparator{
        protected MyGroupingComparator() {
            super(CompositeKey.class, true);
        }

        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            CompositeKey key1 = (CompositeKey) w1;
            CompositeKey key2 = (CompositeKey) w2;
            int l = key1.getBrandType();
            int r = key2.getBrandType();
            return l == r ? 0 : (l < r ? -1 : 1);
        }
    }

    private static class MyMapper extends Mapper<Object, Text, CompositeKey, Text>{
        private final CompositeKey outKey = new CompositeKey();
        private final Text outVal = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            if(line == null || line.trim().equals("")) return; // skip empty records
            String filePath = ((FileSplit) context.getInputSplit()).getPath().toString();
            String[] values = line.split("\t");
            // Handle type.txt and log.txt according to the input file path;
            // the key's tag (0 or 1) records which table the value came from
            if(filePath.contains("type.txt") && values.length == 2){
                outKey.set(0, Integer.parseInt(values[0]));
                outVal.set(values[1]);
                context.write(outKey, outVal);
            }else if(filePath.contains("log.txt") && values.length == 4){
                outKey.set(1, Integer.parseInt(values[1]));
                outVal.set(values[0] + "\t" + values[2] + "\t" + values[3]);
                context.write(outKey, outVal);
            }
        }
    }

    private static class MyReducer extends Reducer<CompositeKey, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outVal = new Text();

        @Override
        protected void reduce(CompositeKey key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Iterator<Text> it = values.iterator();
            String val = it.next().toString();
            String type = "null"; // default when the type table has no row for this brand
            // Hadoop reuses the key object and refreshes it as the iterator advances,
            // so here key carries the tag that belongs to the first value
            if(key.getTag() == 0){
                type = val;
            }else{
                writeJoined(key.getBrandType(), type, val, context);
            }
            // Every remaining value is a log record; stream it out without buffering
            while(it.hasNext()){
                writeJoined(key.getBrandType(), type, it.next().toString(), context);
            }
        }

        private void writeJoined(int brandType, String type, String log, Context context)
                throws IOException, InterruptedException {
            String[] fields = log.split("\t");
            outKey.set(fields[0]);
            outVal.set(brandType + "\t" + type + "\t" + fields[1] + "\t" + fields[2]);
            context.write(outKey, outVal);
        }
    }

    public static void main(String[] args) throws Exception{
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: LeftOuterJoin3 <in> <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "Left outer join3");
        job.setJarByClass(LeftOuterJoin3.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setPartitionerClass(MyPartitioner.class);
        job.setGroupingComparatorClass(MyGroupingComparator.class);
        job.setMapOutputKeyClass(CompositeKey.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        /* delete the output directory if it exists */
        Path out = new Path(otherArgs[otherArgs.length - 1]);
        FileSystem fileSystem = FileSystem.get(conf);
        if (fileSystem.exists(out)) {
            fileSystem.delete(out, true);
        }

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

