An Analysis of Three MapReduce Join Approaches, with Examples

This article is quoted from Wu Chao's blog.

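Implementation Principles

1. Join on the reduce side.

Joining on the reduce side is the most common pattern for joining tables in the MapReduce framework. It works as follows.

Main work on the map side: tag each key/value pair with its source table (file) so that records from different sources can be told apart. Then emit the join field as the key, and the remaining fields plus the new tag as the value.

Main work on the reduce side: by the time reduce runs, the records have already been grouped by the join field. Within each group we only need to separate the records by source file (using the tag added in the map phase) and then take the Cartesian product of the two sides. The principle is simple; a complete example appears below.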
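Before the full Hadoop listing, the three steps described above (tag, group by join key, Cartesian product) can be sketched in plain Java. This is only an in-memory illustration and uses no Hadoop classes; the names `ReduceJoinSketch`, `Tagged`, `tag`, and `join` are hypothetical and do not come from the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of a reduce-side join; all names here are illustrative.
class ReduceJoinSketch {

    // A tagged record, as a mapper would emit it: join key, source flag, remaining fields.
    static final class Tagged {
        final String joinKey;
        final String flag; // "0" = left table, "1" = right table
        final String rest;

        Tagged(String joinKey, String flag, String rest) {
            this.joinKey = joinKey;
            this.flag = flag;
            this.rest = rest;
        }
    }

    // "Map" step: split a '|'-delimited line, pull out the join column, tag the rest.
    static Tagged tag(String line, String flag, int keyIndex) {
        String[] f = line.split("\\|");
        StringBuilder rest = new StringBuilder();
        for (int i = 0; i < f.length; i++) {
            if (i == keyIndex) continue;
            if (rest.length() > 0) rest.append('\t');
            rest.append(f[i]);
        }
        return new Tagged(f[keyIndex], flag, rest.toString());
    }

    // "Shuffle" and "reduce": group by key, split each group by flag, emit the Cartesian product.
    static List<String> join(List<String> leftLines, List<String> rightLines, int rightKeyIndex) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (String l : leftLines) {
            Tagged t = tag(l, "0", 0);
            groups.computeIfAbsent(t.joinKey, k -> new ArrayList<>()).add(t);
        }
        for (String r : rightLines) {
            Tagged t = tag(r, "1", rightKeyIndex);
            groups.computeIfAbsent(t.joinKey, k -> new ArrayList<>()).add(t);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> g : groups.entrySet()) {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            for (Tagged t : g.getValue()) {
                if ("0".equals(t.flag)) left.add(t.rest); else right.add(t.rest);
            }
            for (String l : left) {
                for (String r : right) {
                    out.add(g.getKey() + "\t" + l + "\t" + r);
                }
            }
        }
        return out;
    }
}
```

A key that appears on only one side produces no output in this sketch, because one of the two lists is empty and the nested loop never runs.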

(1) Define a custom value type:

package com.mr.reduceSizeJoin;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class CombineValues implements WritableComparable<CombineValues> {
    private Text joinKey;    // join key
    private Text flag;       // marks which file the record came from
    private Text secondPart; // every field other than the join key

    public void setJoinKey(Text joinKey) {
        this.joinKey = joinKey;
    }

    public void setFlag(Text flag) {
        this.flag = flag;
    }

    public void setSecondPart(Text secondPart) {
        this.secondPart = secondPart;
    }

    public Text getFlag() {
        return flag;
    }

    public Text getSecondPart() {
        return secondPart;
    }

    public Text getJoinKey() {
        return joinKey;
    }

    public CombineValues() {
        this.joinKey = new Text();
        this.flag = new Text();
        this.secondPart = new Text();
    }

    // Fields are written and read back in the same order.
    @Override
    public void write(DataOutput out) throws IOException {
        this.joinKey.write(out);
        this.flag.write(out);
        this.secondPart.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.joinKey.readFields(in);
        this.flag.readFields(in);
        this.secondPart.readFields(in);
    }

    @Override
    public int compareTo(CombineValues o) {
        return this.joinKey.compareTo(o.getJoinKey());
    }

    @Override
    public String toString() {
        return "[flag=" + this.flag.toString() + ",joinKey=" + this.joinKey.toString()
                + ",secondPart=" + this.secondPart.toString() + "]";
    }
}
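The point of `write` and `readFields` in the class above is that the fields must be read back in exactly the order they were written. A minimal sketch of that round trip, using only `java.io` and plain `String` fields in place of Hadoop's `Text` (the class `CombineValuesRoundTrip` and its `roundTrip` helper are hypothetical names for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Plain-Java stand-in for CombineValues: String fields replace Hadoop's Text.
class CombineValuesRoundTrip {
    String joinKey = "";
    String flag = "";
    String secondPart = "";

    // Serialize the three fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(joinKey);
        out.writeUTF(flag);
        out.writeUTF(secondPart);
    }

    // Deserialize in exactly the same order as write().
    void readFields(DataInput in) throws IOException {
        joinKey = in.readUTF();
        flag = in.readUTF();
        secondPart = in.readUTF();
    }

    // Serialize to a byte array and read the bytes back into a fresh object.
    static CombineValuesRoundTrip roundTrip(CombineValuesRoundTrip src) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            src.write(out);
            out.flush();
            CombineValuesRoundTrip copy = new CombineValuesRoundTrip();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If `readFields` read `flag` before `joinKey`, the copy would silently swap the two values, which is why the order in both methods has to match.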

(2) Main map and reduce code:

package com.mr.reduceSizeJoin;

import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * @author zengzhaozheng
 *
 * Purpose: the left outer join case of a reduce-side join.
 * Two files represent two tables; the join fields are table1.id and table2.cityID.
 * Note: the reducer emits only keys that occur in both tables, so the result
 * listed at the end is what an inner join produces.
 *
 * table1 (left table): tb_dim_city(id int,name string,orderid int,city_code,is_show)
 * Contents of tb_dim_city.dat (the delimiter is "|"):
 * id name orderid city_code is_show
 * 0 其他 9999 9999 0
 * 1 长春 1 901 1
 * 2 吉林 2 902 1
 * 3 四平 3 903 1
 * 4 松原 4 904 1
 * 5 通化 5 905 1
 * 6 辽源 6 906 1
 * 7 白城 7 907 1
 * 8 白山 8 908 1
 * 9 延吉 9 909 1
 * ------------------------- divider -------------------------
 * table2 (right table): tb_user_profiles(userID int,userName string,network string,double flow,cityID int)
 * Contents of tb_user_profiles.dat (the delimiter is "|"):
 * userID network flow cityID
 * 1 2G 123 1
 * 2 3G 333 2
 * 3 3G 555 1
 * 4 2G 777 3
 * 5 3G 666 4
 * ------------------------- divider -------------------------
 * Result:
 * 1 长春 1 901 1 1 2G 123
 * 1 长春 1 901 1 3 3G 555
 * 2 吉林 2 902 1 2 3G 333
 * 3 四平 3 903 1 4 2G 777
 * 4 松原 4 904 1 5 3G 666
 */

public class ReduceSideJoin_LeftOuterJoin extends Configured implements Tool {
    private static final Logger logger = LoggerFactory.getLogger(ReduceSideJoin_LeftOuterJoin.class);

    public static class LeftOutJoinMapper extends Mapper<Object, Text, Text, CombineValues> {
        private CombineValues combineValues = new CombineValues();
        private Text flag = new Text();
        private Text joinKey = new Text();
        private Text secondPart = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Path of the input file this split belongs to
            String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
            if (pathName.endsWith("tb_dim_city.dat")) {
                // Records from tb_dim_city.dat are tagged "0"
                String[] valueItems = value.toString().split("\\|");
                // Skip malformed records
                if (valueItems.length != 5) {
                    return;
                }
                flag.set("0");
                joinKey.set(valueItems[0]);
                secondPart.set(valueItems[1] + "\t" + valueItems[2] + "\t" + valueItems[3] + "\t" + valueItems[4]);
                combineValues.setFlag(flag);
                combineValues.setJoinKey(joinKey);
                combineValues.setSecondPart(secondPart);
                context.write(combineValues.getJoinKey(), combineValues);
            } else if (pathName.endsWith("tb_user_profiles.dat")) {
                // Records from tb_user_profiles.dat are tagged "1"
                String[] valueItems = value.toString().split("\\|");
                // Skip malformed records
                if (valueItems.length != 4) {
                    return;
                }
                flag.set("1");
                joinKey.set(valueItems[3]);
                secondPart.set(valueItems[0] + "\t" + valueItems[1] + "\t" + valueItems[2]);
                combineValues.setFlag(flag);
                combineValues.setJoinKey(joinKey);
                combineValues.setSecondPart(secondPart);
                context.write(combineValues.getJoinKey(), combineValues);
            }
        }
    }

    public static class LeftOutJoinReducer extends Reducer<Text, CombineValues, Text, Text> {
        // Left-table records of the current group
        private ArrayList<Text> leftTable = new ArrayList<Text>();
        // Right-table records of the current group
        private ArrayList<Text> rightTable = new ArrayList<Text>();
        private Text secondPar = null;
        private Text output = new Text();

        /**
         * reduce is called once per group.
         */
        @Override
        protected void reduce(Text key, Iterable<CombineValues> value, Context context)
                throws IOException, InterruptedException {
            leftTable.clear();
            rightTable.clear();
            /*
             * Store the records of the group separately, by source file.
             * Caveat: if a group holds too many records, the reduce phase may
             * run out of memory (OOM). Before tackling a distributed problem it
             * is best to understand how the data is distributed and choose the
             * approach that fits; this helps avoid OOM and excessive data skew.
             */
            for (CombineValues cv : value) {
                // Copy the value: Hadoop reuses the same object across iterations
                secondPar = new Text(cv.getSecondPart().toString());
                if ("0".equals(cv.getFlag().toString().trim())) {
                    // Left table: tb_dim_city
                    leftTable.add(secondPar);
                } else if ("1".equals(cv.getFlag().toString().trim())) {
                    // Right table: tb_user_profiles
                    rightTable.add(secondPar);
                }
            }
            logger.info("tb_dim_city:" + leftTable.toString());
            logger.info("tb_user_profiles:" + rightTable.toString());
            // Cartesian product of the two sides; nothing is emitted if either side is empty
            for (Text leftPart : leftTable) {
                for (Text rightPart : rightTable) {
                    output.set(leftPart + "\t" + rightPart);
                    context.write(key, output);
                }
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // job configuration
        Job job = new Job(conf, "LeftOutJoinMR");
        job.setJarByClass(ReduceSideJoin_LeftOuterJoin.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // map input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // reduce output path

        job.setMapperClass(LeftOutJoinMapper.class);
        job.setReducerClass(LeftOutJoinReducer.class);

        job.setInputFormatClass(TextInputFormat.class);   // input format
        job.setOutputFormatClass(TextOutputFormat.class); // the default output format

        // Map output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(CombineValues.class);

        // Reduce output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static void main(String[] args) {
        try {
            int returnCode = ToolRunner.run(new ReduceSideJoin_LeftOuterJoin(), args);
            System.exit(returnCode);
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        }
    }
}

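The comments in the code already explain the details and the input and output data, so here we focus on the shortcomings of the reduce join. The reason the reduce join exists is easy to see: the data set is split up, and each map task processes only part of it, so no single map task can see every record that shares a given join key. We therefore use the join key as the reduce-side grouping key, which brings all records with the same join key together for processing. The obvious drawback is that a large amount of data is transferred between the map and reduce sides during the shuffle phase, which makes this approach inefficient.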
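The reducer's comment also warns that a large group can cause OOM, because both sides of the group are buffered. One common mitigation, which the original code does not use, is to make the left-table records of each key arrive before the right-table records, buffer only the left side, and stream the right side. The sketch below simulates that ordering in plain Java by sorting on (join key, flag); in a real Hadoop job the ordering would have to be arranged through the shuffle's sort, which is not shown here. `StreamingReduceSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch: join with only the left side of each group held in memory.
class StreamingReduceSketch {

    // Each record is {joinKey, flag, rest}; flag "0" = left table, "1" = right table.
    static List<String> join(List<String[]> records) {
        List<String[]> sorted = new ArrayList<>(records);
        // Order by join key, then by flag, so a key's left rows precede its right rows.
        sorted.sort((a, b) -> {
            int c = a[0].compareTo(b[0]);
            return c != 0 ? c : a[1].compareTo(b[1]);
        });

        List<String> out = new ArrayList<>();
        List<String> leftBuffer = new ArrayList<>(); // the only per-group state
        String currentKey = null;
        for (String[] r : sorted) {
            if (!r[0].equals(currentKey)) { // a new group starts
                currentKey = r[0];
                leftBuffer.clear();
            }
            if ("0".equals(r[1])) {
                leftBuffer.add(r[2]);
            } else {
                // Right rows are joined against the buffer and never stored.
                for (String left : leftBuffer) {
                    out.add(currentKey + "\t" + left + "\t" + r[2]);
                }
            }
        }
        return out;
    }
}
```

This only helps when the buffered side is the small one for every key; it produces the same pairs as the two-list version, though within a key the rows come out grouped by right-table record rather than by left-table record.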

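2. Join on the map side.

Use case: one table is very small and the other is very large.

Usage: when submitting the job, first put the small table's file into the job's DistributedCache. Each task then reads the small table from the DistributedCache, parses each record into its join key and value, and keeps the result in memory (for example in a HashMap). It then scans the large table and checks, for each record, whether its join key has a matching entry in memory; if so, it emits the joined result directly.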
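The load-then-probe pattern just described can be sketched without Hadoop. The example below assumes the same "|"-delimited layouts as the sample tables (five fields for the small table, four for the large one, join key in the last column of the large table); `MapJoinSketch`, `loadSmallTable`, and `join` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a map-side (hash) join.
class MapJoinSketch {

    // Load the small table into memory, keyed by its first column (the role of setup()).
    static Map<String, String> loadSmallTable(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split("\\|", 5);
            if (f.length == 5) { // skip malformed lines
                table.put(f[0], f[1] + "\t" + f[2] + "\t" + f[3] + "\t" + f[4]);
            }
        }
        return table;
    }

    // Scan the large table once, probing the in-memory map per record (the role of map()).
    static List<String> join(Map<String, String> small, List<String> bigLines) {
        List<String> out = new ArrayList<>();
        for (String line : bigLines) {
            String[] f = line.split("\\|", 4);
            if (f.length != 4) continue; // skip malformed lines
            String match = small.get(f[3]);
            if (match != null) {
                out.add(f[3] + "\t" + match + "\t" + f[0] + "\t" + f[1] + "\t" + f[2]);
            }
        }
        return out;
    }
}
```

No grouping step is needed, which is why the Hadoop version can run with zero reduce tasks and avoid the shuffle entirely.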

The code is fairly simple:

package com.mr.mapSideJoin;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * @author zengzhaozheng
 *
 * Purpose: the left outer join case of a map-side join.
 * Two files represent two tables; the join fields are table1.id and table2.cityID.
 * Note: the mapper emits a record only when its cityID is found in memory, so
 * the result listed at the end is what an inner join produces.
 *
 * table1 (left table): tb_dim_city(id int,name string,orderid int,city_code,is_show)
 * tb_dim_city is assumed to have very few records.
 * Contents of tb_dim_city.dat (the delimiter is "|"):
 * id name orderid city_code is_show
 * 0 其他 9999 9999 0
 * 1 长春 1 901 1
 * 2 吉林 2 902 1
 * 3 四平 3 903 1
 * 4 松原 4 904 1
 * 5 通化 5 905 1
 * 6 辽源 6 906 1
 * 7 白城 7 907 1
 * 8 白山 8 908 1
 * 9 延吉 9 909 1
 * ------------------------- divider -------------------------
 * table2 (right table): tb_user_profiles(userID int,userName string,network string,double flow,cityID int)
 * Contents of tb_user_profiles.dat (the delimiter is "|"):
 * userID network flow cityID
 * 1 2G 123 1
 * 2 3G 333 2
 * 3 3G 555 1
 * 4 2G 777 3
 * 5 3G 666 4
 * ------------------------- divider -------------------------
 * Result:
 * 1 长春 1 901 1 1 2G 123
 * 1 长春 1 901 1 3 3G 555
 * 2 吉林 2 902 1 2 3G 333
 * 3 四平 3 903 1 4 2G 777
 * 4 松原 4 904 1 5 3G 666
 */

public class MapSideJoinMain extends Configured implements Tool {
    private static final Logger logger = LoggerFactory.getLogger(MapSideJoinMain.class);

    public static class LeftOutJoinMapper extends Mapper<Object, Text, Text, Text> {
        // In-memory copy of tb_dim_city: id -> remaining fields
        private HashMap<String, String> city_info = new HashMap<String, String>();
        private Text outPutKey = new Text();
        private Text outPutValue = new Text();
        private String mapInputStr = null;
        private String mapInputSpit[] = null;
        private String city_secondPart = null;

        /**
         * Runs once before each task starts. Here it fetches the tb_dim_city
         * file from the DistributedCache and loads its records into memory.
         */
        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // Local copies of this job's DistributedCache files
            Path[] distributePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            String cityInfo = null;
            for (Path p : distributePaths) {
                if (p.toString().endsWith("tb_dim_city.dat")) {
                    // Read the cached file into memory; the reader is closed when done
                    try (BufferedReader br = new BufferedReader(new FileReader(p.toString()))) {
                        while (null != (cityInfo = br.readLine())) {
                            String[] cityPart = cityInfo.split("\\|", 5);
                            if (cityPart.length == 5) {
                                city_info.put(cityPart[0],
                                        cityPart[1] + "\t" + cityPart[2] + "\t" + cityPart[3] + "\t" + cityPart[4]);
                            }
                        }
                    }
                }
            }
        }

        /**
         * The map side is simple: check whether the cityID of each
         * tb_user_profiles.dat record exists in the in-memory map.
         * That alone implements the map join.
         */
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Skip empty lines
            if (value == null || value.toString().equals("")) {
                return;
            }
            mapInputStr = value.toString();
            mapInputSpit = mapInputStr.split("\\|", 4);
            // Skip malformed records
            if (mapInputSpit.length != 4) {
                return;
            }
            // Look up the join field in the in-memory map
            city_secondPart = city_info.get(mapInputSpit[3]);
            if (city_secondPart != null) {
                this.outPutKey.set(mapInputSpit[3]);
                this.outPutValue.set(city_secondPart + "\t" + mapInputSpit[0] + "\t"
                        + mapInputSpit[1] + "\t" + mapInputSpit[2]);
                context.write(outPutKey, outPutValue);
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // job configuration
        // Register the small table as a cache file for this job
        DistributedCache.addCacheFile(new Path(args[1]).toUri(), conf);
        Job job = new Job(conf, "MapJoinMR");
        job.setNumReduceTasks(0); // map-only job

        FileInputFormat.addInputPath(job, new Path(args[0]));   // map input path
        FileOutputFormat.setOutputPath(job, new Path(args[2])); // output path

        job.setJarByClass(MapSideJoinMain.class);
        job.setMapperClass(LeftOutJoinMapper.class);

        job.setInputFormatClass(TextInputFormat.class);   // input format
        job.setOutputFormatClass(TextOutputFormat.class); // the default output format

        // Map output key type
        job.setMapOutputKeyClass(Text.class);

        // Job output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static void main(String[] args) {
        try {
            int returnCode = ToolRunner.run(new MapSideJoinMain(), args);
            System.exit(returnCode);
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        }
    }
}

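A few words about DistributedCache. DistributedCache is an implementation of a distributed cache. It plays an important role in the MapReduce framework and makes it possible to write fairly complex yet efficient distributed programs. In this example, the JobTracker obtains the job's list of DistributedCache resource URIs before the job starts, and the corresponding files are copied to every TaskTracker that runs a task of the job. Other aspects of how DistributedCache relates to a job, such as permissions, storage-path separation, and the public and private attributes, are not covered here.

There is also a more extreme form of map join that uses HBase for the lookups. It removes the memory limit entirely, so a map join can be used without that constraint, and its efficiency is also quite good.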

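3. SemiJoin.

A SemiJoin, or semi join, is on closer inspection a variant of the reduce join: it filters out some data on the map side so that only records that take part in the join cross the network, while records that do not take part are never transferred. This reduces the shuffle's network traffic and improves overall efficiency; everything else is the same as in the reduce join. More concretely: extract the join keys from the small table, distribute them to the relevant nodes through the DistributedCache, and load them into memory (for example into a HashSet). In the map phase, scan the tables being joined, drop every record whose join key is not in the in-memory HashSet, and let the remaining records go through the shuffle to the reduce side, where the join is performed exactly as in the reduce join. The complete code appears below.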
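The filtering step that distinguishes a semi join from a plain reduce join can be sketched in plain Java: collect the join keys of one table into a set, then keep only the records of the other table whose key is in that set. The class and method names (`SemiJoinFilterSketch`, `extractKeys`, `filter`) are hypothetical; only the surviving records would go on to the shuffle.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java sketch of the map-side filter used by a semi join.
class SemiJoinFilterSketch {

    // Collect the distinct join keys found in column keyIndex of '|'-delimited lines.
    static Set<String> extractKeys(List<String> lines, int keyIndex) {
        Set<String> keys = new HashSet<>();
        for (String line : lines) {
            String[] f = line.split("\\|");
            if (keyIndex < f.length) {
                keys.add(f[keyIndex]);
            }
        }
        return keys;
    }

    // Keep only the lines whose join column appears in the key set.
    static List<String> filter(List<String> lines, int keyIndex, Set<String> joinKeys) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split("\\|");
            if (keyIndex < f.length && joinKeys.contains(f[keyIndex])) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

The saving depends on how selective the key set is: if nearly every record's key is in the set, almost nothing is filtered and the job behaves like an ordinary reduce join.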

package com.mr.SemiJoin;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * @author zengzhaozheng
 *
 * Purpose: a semi join, that is, a reduce-side join (left outer join case)
 * that first filters records by join key on the map side.
 * Two files represent two tables; the join fields are table1.id and table2.cityID.
 * Note: the reducer emits only keys that occur in both tables, so the result
 * listed at the end is what an inner join produces.
 *
 * table1 (left table): tb_dim_city(id int,name string,orderid int,city_code,is_show)
 * Contents of tb_dim_city.dat (the delimiter is "|"):
 * id name orderid city_code is_show
 * 0 其他 9999 9999 0
 * 1 长春 1 901 1
 * 2 吉林 2 902 1
 * 3 四平 3 903 1
 * 4 松原 4 904 1
 * 5 通化 5 905 1
 * 6 辽源 6 906 1
 * 7 白城 7 907 1
 * 8 白山 8 908 1
 * 9 延吉 9 909 1
 * ------------------------- divider -------------------------
 * table2 (right table): tb_user_profiles(userID int,userName string,network string,double flow,cityID int)
 * Contents of tb_user_profiles.dat (the delimiter is "|"):
 * userID network flow cityID
 * 1 2G 123 1
 * 2 3G 333 2
 * 3 3G 555 1
 * 4 2G 777 3
 * 5 3G 666 4
 * ------------------------- divider -------------------------
 * Contents of joinKey.dat:
 * city_code
 * 1
 * 2
 * 3
 * 4
 * ------------------------- divider -------------------------
 * Result:
 * 1 长春 1 901 1 1 2G 123
 * 1 长春 1 901 1 3 3G 555
 * 2 吉林 2 902 1 2 3G 333
 * 3 四平 3 903 1 4 2G 777
 * 4 松原 4 904 1 5 3G 666
 */

public class SemiJoin extends Configured implements Tool {
    private static final Logger logger = LoggerFactory.getLogger(SemiJoin.class);

    public static class SemiJoinMapper extends Mapper<Object, Text, Text, CombineValues> {
        private CombineValues combineValues = new CombineValues();
        private HashSet<String> joinKeySet = new HashSet<String>();
        private Text flag = new Text();
        private Text joinKey = new Text();
        private Text secondPart = new Text();

        /**
         * Loads the keys that take part in the join from the DistributedCache
         * into memory, so that the map side can filter records by those keys.
         */
        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // Local copies of this job's DistributedCache files
            Path[] distributePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            String joinKeyStr = null;
            for (Path p : distributePaths) {
                if (p.toString().endsWith("joinKey.dat")) {
                    // Read the cached file into memory; the reader is closed when done
                    try (BufferedReader br = new BufferedReader(new FileReader(p.toString()))) {
                        while (null != (joinKeyStr = br.readLine())) {
                            joinKeySet.add(joinKeyStr);
                        }
                    }
                }
            }
        }

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Path of the input file this split belongs to
            String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
            if (pathName.endsWith("tb_dim_city.dat")) {
                // Records from tb_dim_city.dat are tagged "0"
                String[] valueItems = value.toString().split("\\|");
                // Skip malformed records
                if (valueItems.length != 5) {
                    return;
                }
                // Drop records that do not take part in the join
                if (joinKeySet.contains(valueItems[0])) {
                    flag.set("0");
                    joinKey.set(valueItems[0]);
                    secondPart.set(valueItems[1] + "\t" + valueItems[2] + "\t" + valueItems[3] + "\t" + valueItems[4]);
                    combineValues.setFlag(flag);
                    combineValues.setJoinKey(joinKey);
                    combineValues.setSecondPart(secondPart);
                    context.write(combineValues.getJoinKey(), combineValues);
                }
            } else if (pathName.endsWith("tb_user_profiles.dat")) {
                // Records from tb_user_profiles.dat are tagged "1"
                String[] valueItems = value.toString().split("\\|");
                // Skip malformed records
                if (valueItems.length != 4) {
                    return;
                }
                // Drop records that do not take part in the join
                if (joinKeySet.contains(valueItems[3])) {
                    flag.set("1");
                    joinKey.set(valueItems[3]);
                    secondPart.set(valueItems[0] + "\t" + valueItems[1] + "\t" + valueItems[2]);
                    combineValues.setFlag(flag);
                    combineValues.setJoinKey(joinKey);
                    combineValues.setSecondPart(secondPart);
                    context.write(combineValues.getJoinKey(), combineValues);
                }
            }
        }
    }

    public static class SemiJoinReducer extends Reducer<Text, CombineValues, Text, Text> {
        // Left-table records of the current group
        private ArrayList<Text> leftTable = new ArrayList<Text>();
        // Right-table records of the current group
        private ArrayList<Text> rightTable = new ArrayList<Text>();
        private Text secondPar = null;
        private Text output = new Text();

        /**
         * reduce is called once per group.
         */
        @Override
        protected void reduce(Text key, Iterable<CombineValues> value, Context context)
                throws IOException, InterruptedException {
            leftTable.clear();
            rightTable.clear();
            /*
             * Store the records of the group separately, by source file.
             * Caveat: if a group holds too many records, the reduce phase may
             * run out of memory (OOM). Before tackling a distributed problem it
             * is best to understand how the data is distributed and choose the
             * approach that fits; this helps avoid OOM and excessive data skew.
             */
            for (CombineValues cv : value) {
                // Copy the value: Hadoop reuses the same object across iterations
                secondPar = new Text(cv.getSecondPart().toString());
                if ("0".equals(cv.getFlag().toString().trim())) {
                    // Left table: tb_dim_city
                    leftTable.add(secondPar);
                } else if ("1".equals(cv.getFlag().toString().trim())) {
                    // Right table: tb_user_profiles
                    rightTable.add(secondPar);
                }
            }
            logger.info("tb_dim_city:" + leftTable.toString());
            logger.info("tb_user_profiles:" + rightTable.toString());
            // Cartesian product of the two sides
            for (Text leftPart : leftTable) {
                for (Text rightPart : rightTable) {
                    output.set(leftPart + "\t" + rightPart);
                    context.write(key, output);
                }
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // job configuration
        // Register the join-key file as a cache file for this job
        DistributedCache.addCacheFile(new Path(args[2]).toUri(), conf);
        Job job = new Job(conf, "LeftOutJoinMR");
        job.setJarByClass(SemiJoin.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // map input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // reduce output path

        // In the original listing this call had been swallowed by the preceding comment
        job.setMapperClass(SemiJoinMapper.class);
        job.setReducerClass(SemiJoinReducer.class);

        job.setInputFormatClass(TextInputFormat.class);   // input format
        job.setOutputFormatClass(TextOutputFormat.class); // the default output format

        // Map output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(CombineValues.class);

        // Reduce output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static void main(String[] args) {
        try {
            int returnCode = ToolRunner.run(new SemiJoin(), args);
            System.exit(returnCode);
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        }
    }
}

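Note that the SemiJoin also has a limited range of use: the extracted join keys must be held in memory, so the key set cannot be too large, or it can easily cause OOM on the map side.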
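When the key set is too large for a HashSet, one commonly used alternative (not part of the original article) is a Bloom filter: a fixed-size bit array that never reports a present key as absent but may occasionally let an absent key through. For a semi join that trade-off is acceptable, because a false positive only means an extra record reaches the reducer, where it finds no partner. The sketch below is a deliberately simple illustration; `BloomFilterSketch` and its double-hashing scheme based on `String.hashCode` are assumptions made for this example, not a production-grade design.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch; the hashing scheme is illustrative only.
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Bit position for the i-th hash of a key, via double hashing (h1 + i * h2).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // derived second hash, forced odd so it is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    // Record a key by setting all of its bit positions.
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

In the mapper, `joinKeySet.contains(...)` would be replaced by `mightContain(...)`; memory use is then fixed by the chosen bit-array size rather than by the number of keys.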

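Summary

These three join approaches suit different scenarios, and their efficiency differs considerably, mainly because of network transfer. Map join is the most efficient, followed by SemiJoin, and reduce join is the least efficient. In addition, when writing distributed big-data programs it is best to understand the overall distribution of the data first; this makes the code more efficient, keeps data skew to a minimum, and lets the code be better tailored to the data.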
