MapReduce in Practice (5): Implementing a Join Query

Requirement:

Use a MapReduce program to implement the join query of a SQL statement.

Order table (order):

id	date	pid	amount
1001	20150710	P0001	2
1002	20150710	P0001	3
1002	20150710	P0002	3
1003	20150710	P0003	4

Product table (product):

pid	pname	category_id	price
P0001	小米6	1000	2499
P0002	锤子T3	1001	2500
P0003	三星S8	1002	6999

Suppose the data volume is huge and both tables are stored as files in HDFS. We need a MapReduce program that carries out the following SQL query:

select a.id, a.date, b.pname, b.category_id, b.price from t_order a join t_product b on a.pid = b.pid

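Analysis:

Use the join condition as the map output key. Rows from both tables that satisfy the join condition are then sent to the same reduce task, each carrying information about which file it came from, and the reduce task stitches the rows together.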
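The idea above can be sketched in plain Java without any Hadoop dependency. This is only an illustration under stated assumptions: the class name ReduceSideJoinSketch and its join method are hypothetical, a TreeMap stands in for the shuffle, and the product name is romanized to keep the source ASCII-only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {
    /** Joins order lines (id,date,pid,amount) with product lines (pid,pname,category_id,price). */
    public static List<String> join(List<String> orders, List<String> products) {
        // "Map + shuffle": group every record by pid, tagging it with its source.
        // Each stored row is {tag, field0, field1, ...}.
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String line : orders) {
            String[] f = line.split(",");
            groups.computeIfAbsent(f[2], k -> new ArrayList<>())
                  .add(new String[] {"order", f[0], f[1], f[2], f[3]});
        }
        for (String line : products) {
            String[] f = line.split(",");
            groups.computeIfAbsent(f[0], k -> new ArrayList<>())
                  .add(new String[] {"product", f[0], f[1], f[2], f[3]});
        }
        // "Reduce": inside one pid group, attach the product row to each order row.
        List<String> out = new ArrayList<>();
        for (List<String[]> group : groups.values()) {
            String[] product = null;
            List<String[]> orderRows = new ArrayList<>();
            for (String[] row : group) {
                if (row[0].equals("product")) {
                    product = row;
                } else {
                    orderRows.add(row);
                }
            }
            if (product == null) {
                continue; // inner join: orders without a matching product are dropped
            }
            for (String[] o : orderRows) {
                // a.id, a.date, b.pname, b.category_id, b.price
                out.add(String.join(",", o[1], o[2], product[2], product[3], product[4]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> orders = Arrays.asList("1001,20150710,P0001,2", "1002,20150710,P0001,3");
        List<String> products = Arrays.asList("P0001,Xiaomi 6,1000,2499");
        join(orders, products).forEach(System.out::println);
    }
}
```

In the real job the grouping is done by the framework's shuffle, and the tag is derived from the input file name rather than stored as a field.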

Implementation:

First, convert the table data into the format we need:

order.txt:

1001,20150710,P0001,2
1002,20150710,P0001,3
1002,20150710,P0002,3
1003,20150710,P0003,4

product.txt:

P0001,小米6,1000,2499
P0002,锤子T3,1001,2500
P0003,三星S8,1002,6999

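Then upload both files to the /join/srcdata directory in HDFS.

Because the input files come in two formats, the map phase checks the file name and handles each file differently. Likewise, the reduce phase must tell apart the two kinds of records that share the same key (pid); it does so by checking whether id is the empty string.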
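The empty-string convention matters because the bean is serialized with writeUTF, which cannot write null. The following is a minimal sketch using only java.io; EmptyFieldSketch, roundTrip, and isOrderRecord are hypothetical names chosen for illustration, and only two fields are serialized to keep it short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class EmptyFieldSketch {
    /** Writes two fields with writeUTF and reads them back with readUTF. */
    public static String[] roundTrip(String id, String pname) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(id);    // throws NullPointerException if id is null
            out.writeUTF(pname);
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return new String[] {in.readUTF(), in.readUTF()};
        }
    }

    /** A record with a non-empty id came from the order file. */
    public static boolean isOrderRecord(String id) {
        return !"".equals(id);
    }

    public static void main(String[] args) throws IOException {
        String[] productSide = roundTrip("", "some product");
        System.out.println(isOrderRecord(productSide[0])); // product record, so false
        try {
            roundTrip(null, "some product");
        } catch (NullPointerException e) {
            System.out.println("null fields cannot be serialized with writeUTF");
        }
    }
}
```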

InfoBean.java:

package com.darrenchan.mr.bean;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

/**
 * Joined record: id date pid amount pname category_id price
 *
 * @author chenchi
 */
public class InfoBean implements Writable {
    private String id; // order id
    private String date;
    private String pid; // product id
    private String amount;
    private String pname;
    private String category_id;
    private String price;

    // Hadoop needs a no-argument constructor to deserialize the bean
    public InfoBean() {
    }

    public InfoBean(String id, String date, String pid, String amount, String pname, String category_id, String price) {
        super();
        this.id = id;
        this.date = date;
        this.pid = pid;
        this.amount = amount;
        this.pname = pname;
        this.category_id = category_id;
        this.price = price;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getDate() {
        return date;
    }

    public void setDate(String date) {
        this.date = date;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public String getAmount() {
        return amount;
    }

    public void setAmount(String amount) {
        this.amount = amount;
    }

    public String getPname() {
        return pname;
    }

    public void setPname(String pname) {
        this.pname = pname;
    }

    public String getCategory_id() {
        return category_id;
    }

    public void setCategory_id(String category_id) {
        this.category_id = category_id;
    }

    public String getPrice() {
        return price;
    }

    public void setPrice(String price) {
        this.price = price;
    }

    @Override
    public String toString() {
        return "InfoBean [id=" + id + ", date=" + date + ", pid=" + pid + ", amount=" + amount + ", pname=" + pname
                + ", category_id=" + category_id + ", price=" + price + "]";
    }

    /**
     * Fields must be read in the same order they are written:
     * id date pid amount pname category_id price
     */
    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readUTF();
        date = in.readUTF();
        pid = in.readUTF();
        amount = in.readUTF();
        pname = in.readUTF();
        category_id = in.readUTF();
        price = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeUTF(date);
        out.writeUTF(pid);
        out.writeUTF(amount);
        out.writeUTF(pname);
        out.writeUTF(category_id);
        out.writeUTF(price);
    }
}

Join.java:

package com.darrenchan.mr.join;

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.darrenchan.mr.bean.InfoBean;

public class Join {
    /**
     * Mapper class
     *
     * @author chenchi
     */
    public static class JoinMapper extends Mapper<LongWritable, Text, Text, InfoBean> {
        // Create one object up front and only change its values afterwards,
        // so the map method does not allocate a large number of InfoBean objects
        InfoBean infoBean = new InfoBean();
        Text text = new Text(); // same reason as above

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // First check the file name to see whether this is order data or product data
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String name = inputSplit.getPath().getName(); // file name
            if (name.startsWith("order")) { // from the order data
                String line = value.toString();
                String[] fields = line.split(",");
                String id = fields[0];
                String date = fields[1];
                String pid = fields[2];
                String amount = fields[3];

                infoBean.setId(id);
                infoBean.setDate(date);
                infoBean.setPid(pid);
                infoBean.setAmount(amount);
                // For order data, set the three product attributes to ""
                // They are not set to null because the bean is serialized and deserialized
                infoBean.setPname("");
                infoBean.setCategory_id("");
                infoBean.setPrice("");

                text.set(pid);
                context.write(text, infoBean);
            } else { // from the product data
                String line = value.toString();
                String[] fields = line.split(",");
                String pid = fields[0];
                String pname = fields[1];
                String category_id = fields[2];
                String price = fields[3];

                infoBean.setPname(pname);
                infoBean.setCategory_id(category_id);
                infoBean.setPrice(price);
                infoBean.setPid(pid);
                // For product data, set the three order attributes to ""
                // They are not set to null because the bean is serialized and deserialized
                infoBean.setId("");
                infoBean.setDate("");
                infoBean.setAmount("");

                text.set(pid);
                context.write(text, infoBean);
            }
        }
    }

    public static class JoinReducer extends Reducer<Text, InfoBean, InfoBean, NullWritable> {
        // In the order data, one pid may have multiple records
        // In the product data, one pid has exactly one record
        @Override
        protected void reduce(Text key, Iterable<InfoBean> values, Context context) throws IOException, InterruptedException {
            List<InfoBean> list = new ArrayList<InfoBean>(); // holds the order records
            InfoBean info = new InfoBean(); // holds the single product record
            for (InfoBean infoBean : values) {
                if (!"".equals(infoBean.getId())) { // from the order data
                    // Copy into a new object; the iterator reuses infoBean
                    InfoBean infoBean2 = new InfoBean();
                    try {
                        BeanUtils.copyProperties(infoBean2, infoBean);
                    } catch (IllegalAccessException | InvocationTargetException e) {
                        // Fail the task rather than emit a half-copied record
                        throw new IOException(e);
                    }
                    list.add(infoBean2);
                } else { // from the product data
                    try {
                        BeanUtils.copyProperties(info, infoBean);
                    } catch (IllegalAccessException | InvocationTargetException e) {
                        throw new IOException(e);
                    }
                }
            }
            // Fill the product attributes into every order record and emit it
            for (InfoBean infoBean : list) {
                infoBean.setPname(info.getPname());
                infoBean.setCategory_id(info.getCategory_id());
                infoBean.setPrice(info.getPrice());
                context.write(infoBean, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(Join.class);

        job.setMapperClass(JoinMapper.class);
        job.setReducerClass(JoinReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(InfoBean.class);

        job.setOutputKeyClass(InfoBean.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

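Note: pay attention to Iterable<InfoBean> values in the reduce method. Each value must be copied into a new object rather than stored by reference, because Hadoop reuses the same object and overwrites its contents as the iterator advances.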
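The pitfall can be reproduced without Hadoop. In the sketch below, ReusedValueSketch, Bean, and reusing are hypothetical names; the reusing method imitates the framework by handing out one shared object that is refilled on every call to next().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReusedValueSketch {
    /** Mutable holder standing in for InfoBean. */
    public static final class Bean {
        public String id;
    }

    /** Hands out the SAME Bean instance for every element, refilled in place. */
    public static Iterable<Bean> reusing(final List<String> ids) {
        final Bean shared = new Bean();
        return new Iterable<Bean>() {
            @Override
            public Iterator<Bean> iterator() {
                final Iterator<String> it = ids.iterator();
                return new Iterator<Bean>() {
                    @Override
                    public boolean hasNext() {
                        return it.hasNext();
                    }

                    @Override
                    public Bean next() {
                        shared.id = it.next();
                        return shared;
                    }
                };
            }
        };
    }

    /** Wrong: stores references, so every list entry points at the shared object. */
    public static List<String> collectByReference(List<String> ids) {
        List<Bean> kept = new ArrayList<>();
        for (Bean b : reusing(ids)) {
            kept.add(b);
        }
        return idsOf(kept);
    }

    /** Right: copies each value into a fresh object before storing it. */
    public static List<String> collectByCopy(List<String> ids) {
        List<Bean> kept = new ArrayList<>();
        for (Bean b : reusing(ids)) {
            Bean copy = new Bean();
            copy.id = b.id;
            kept.add(copy);
        }
        return idsOf(kept);
    }

    private static List<String> idsOf(List<Bean> beans) {
        List<String> ids = new ArrayList<>();
        for (Bean b : beans) {
            ids.add(b.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("1001", "1002");
        System.out.println(collectByReference(ids)); // [1002, 1002]
        System.out.println(collectByCopy(ids));      // [1001, 1002]
    }
}
```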

Run command (substitute your own jar file and output directory): hadoop jar <your-jar> com.darrenchan.mr.join.Join /join/srcdata <output-dir>

Result: [screenshot of the job output]

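However, this approach has a drawback. What is it?

Here the join is performed in the reduce phase, so the reduce side carries most of the processing load while the map nodes do very little work. Resource utilization is poor, and the reduce phase is prone to data skew. What is data skew? Suppose that in China a great many people buy the 小米6 and very few buy the 三星S8. The reduce task that aggregates the 小米6 pid then has a very heavy workload, while the one that handles the S8 pid has almost none, so the load is clearly unbalanced.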
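To make the skew concrete, the sketch below counts how many records each reducer would receive when one pid dominates. It is an illustration only: SkewSketch is a hypothetical class, and the partition formula imitates hash partitioning using String.hashCode, which is not necessarily the hash Hadoop computes for a Text key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkewSketch {
    /** Non-negative hash of the key modulo the number of reducers. */
    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Returns how many records each reducer would receive. */
    public static long[] load(List<String> pids, int numReducers) {
        long[] counts = new long[numReducers];
        for (String pid : pids) {
            counts[partition(pid, numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> pids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            pids.add("P0001"); // the popular product
        }
        pids.add("P0002");
        pids.add("P0003");
        // All 1000 records for P0001 land on a single reducer,
        // no matter how many reducers are configured.
        System.out.println(Arrays.toString(load(pids, 3)));
    }
}
```

Adding reducers does not help here, because every record with the same key is routed to the same partition.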

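So how do we solve this? With a map-side join.

We move the join logic to the map side, and the reduce phase can even be dropped entirely. The product table is usually small, so it can be loaded into memory ahead of time and looked up directly while the map method runs. This relies on MapReduce's distributed cache.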
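Stripped of the Hadoop plumbing, the map-side join is a hash lookup: build an in-memory table from the small input once, then probe it for every record of the large input. A minimal sketch, assuming hypothetical names (MapSideJoinSketch, loadProducts, joinOrder) and a romanized product name to keep the source ASCII-only:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {
    /** Builds the lookup table once, as setup() would: pid -> product fields. */
    public static Map<String, String[]> loadProducts(List<String> productLines) {
        Map<String, String[]> table = new HashMap<>();
        for (String line : productLines) {
            String[] f = line.split(",");
            table.put(f[0], f);
        }
        return table;
    }

    /** Joins one order line, as map() would; returns null when no product matches. */
    public static String joinOrder(String orderLine, Map<String, String[]> products) {
        String[] o = orderLine.split(",");
        String[] p = products.get(o[2]);
        if (p == null) {
            return null; // inner join: unmatched orders are skipped
        }
        // a.id, a.date, b.pname, b.category_id, b.price
        return String.join(",", o[0], o[1], p[1], p[2], p[3]);
    }

    public static void main(String[] args) {
        Map<String, String[]> products =
            loadProducts(Arrays.asList("P0001,Xiaomi 6,1000,2499"));
        System.out.println(joinOrder("1001,20150710,P0001,2", products));
    }
}
```

Each order is joined independently of all others, so no shuffle is needed and a hot pid no longer overloads a single task.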

The code is as follows:

package com.darrenchan.mr.mapedjoin;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.darrenchan.mr.bean.InfoBean;

public class MapedJoin {
    public static class MapedJoinMapper extends Mapper<LongWritable, Text, InfoBean, NullWritable> {
        // A map that holds the product table: pid -> whole product line
        private Map<String, String> map = new HashMap<>();
        // Create one object up front and only change its values afterwards,
        // so the map method does not allocate a large number of InfoBean objects
        InfoBean infoBean = new InfoBean();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The cached file is already in the task's working directory, so read it locally.
            // Decode as UTF-8 explicitly because the product names are not ASCII.
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream("product.txt"), StandardCharsets.UTF_8))) {
                String line = null;
                while ((line = br.readLine()) != null) {
                    String[] fields = line.split(",");
                    map.put(fields[0], line);
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Check the file name; the product data no longer needs to be read here
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String name = inputSplit.getPath().getName();
            if (name.startsWith("order")) {
                String line = value.toString();
                String[] fields = line.split(",");
                String id = fields[0];
                String date = fields[1];
                String pid = fields[2];
                String amount = fields[3];

                infoBean.setId(id);
                infoBean.setDate(date);
                infoBean.setPid(pid);
                infoBean.setAmount(amount);

                String product = map.get(pid);
                if (product == null) {
                    // Inner join: skip orders whose pid is not in the product table
                    return;
                }
                String[] splits = product.split(",");
                String pname = splits[1];
                String category_id = splits[2];
                String price = splits[3];

                infoBean.setPname(pname);
                infoBean.setCategory_id(category_id);
                infoBean.setPrice(price);

                context.write(infoBean, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(MapedJoin.class);

        job.setMapperClass(MapedJoinMapper.class);

        job.setMapOutputKeyClass(InfoBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        // The map-side join needs no reduce phase, so set the number of reduce tasks to 0;
        // even without a reducer class, one reduce task is started by default
        job.setNumReduceTasks(0);

        // Ways to cache a file on every node that runs a map task:
        /* job.addArchiveToClassPath(archive); */ // cache a jar on the classpath of the task nodes
        /* job.addFileToClassPath(file); */ // cache a regular file on the classpath of the task nodes
        /* job.addCacheArchive(uri); */ // cache an archive in the working directory of the task nodes
        /* job.addCacheFile(uri) */ // cache a regular file in the working directory of the task nodes

        // Cache the product table file in the working directory of the task nodes
        // so that it can be read locally
        job.addCacheFile(new URI("/join/srcdata/product.txt"));

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

The result is the same as above.
