I. The two ways to join in MR (MapReduce):

1. Reduce side join (interview question)

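Reduce side join is the simplest way to join. The main idea is as follows:

In the map phase, the map function reads both files, File1 and File2. To tell the two sources of key/value pairs apart, it attaches a tag to every record, for example tag=1 for records from File1 and tag=2 for records from File2. In other words, the main job of the map phase is to tag records by source file; the shuffle phase then groups them by key automatically.

In the reduce phase, the reduce function receives the v2 list for each k2 (the values come from both File1 and File2) and, for that key, joins the File1 records with the File2 records (a Cartesian product). In other words, the actual join happens in the reduce phase.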

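This approach has two problems:

1. The map phase does not slim the data down, so the network transfer and sorting in the shuffle perform poorly.

2. The reducer computes the product of two collections, which consumes a lot of memory and can easily cause an OOM.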
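The tag, group, and cross-product steps described above can be sketched in plain Java without Hadoop. This is only an illustration: the class name ReduceSideJoinSketch, the "key<TAB>value" line format, and the in-memory TreeMap standing in for the shuffle are all assumptions, not part of the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of a reduce side join; no Hadoop involved (hypothetical sketch). */
final class ReduceSideJoinSketch {

    /** A value tagged with its source: tag 1 = File1, tag 2 = File2. */
    static final class Tagged {
        final int tag;
        final String value;

        Tagged(int tag, String value) {
            this.tag = tag;
            this.value = value;
        }
    }

    /** "Map" phase: tag each "key<TAB>value" line and group by key, as the shuffle would. */
    static void mapAndShuffle(List<String> lines, int tag, Map<String, List<Tagged>> groups) {
        for (String line : lines) {
            String[] kv = line.split("\t", 2);
            if (kv.length < 2) {
                continue; // skip malformed lines
            }
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(new Tagged(tag, kv[1]));
        }
    }

    /** "Reduce" phase: per key, split the values by tag and emit the Cartesian product. */
    static List<String> reduce(Map<String, List<Tagged>> groups) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : groups.entrySet()) {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            for (Tagged t : e.getValue()) {
                if (t.tag == 1) {
                    left.add(t.value);
                } else {
                    right.add(t.value);
                }
            }
            // Both lists are held in memory at once: this is where a large group can cause an OOM.
            for (String l : left) {
                for (String r : right) {
                    out.add(e.getKey() + "\t" + l + "\t" + r);
                }
            }
        }
        return out;
    }

    static List<String> join(List<String> file1, List<String> file2) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        mapAndShuffle(file1, 1, groups);
        mapAndShuffle(file2, 2, groups);
        return reduce(groups);
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList("1\thanmeimei", "2\tleilei", "3\tlucy");
        List<String> orders = Arrays.asList("1\t50", "1\t200", "3\t15");
        join(customers, orders).forEach(System.out::println);
    }
}
```

Note that every record of both inputs passes through mapAndShuffle, matched or not, which mirrors the first problem listed above.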

My summary post on reduce side join: http://www.cnblogs.com/DreamDrive/p/7692042.html

2. Map side join (interview question)

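Reduce side join exists because the map phase cannot see all the fields needed for the join: the records for one key may be spread across different map tasks. Reduce side join is very inefficient because the shuffle phase has to transfer a large amount of data.

Map side join is an optimization for the following scenario:

Of the two tables to be joined, one is very large and the other is small enough to fit in memory. We can then replicate the small table so that every map task holds a copy in memory (for example in a hash table), and scan only the large table: for each key/value record of the large table, look up the key in the hash table and, if a record with the same key exists, join the two and emit the result.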
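The hash-table lookup just described can be sketched in plain Java. This is a minimal illustration, not Hadoop code: the class name MapSideJoinSketch and the "key<TAB>rest" line format are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory imitation of a map side (hash) join; no Hadoop involved (hypothetical sketch). */
final class MapSideJoinSketch {

    /** Load the small table into a hash table, as each map task would do once in setup(). */
    static Map<String, String> loadSmallTable(List<String> smallTable) {
        Map<String, String> table = new HashMap<>();
        for (String line : smallTable) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2) {
                table.put(kv[0], kv[1]);
            }
        }
        return table;
    }

    /** Scan the large table once and emit a joined record for every key found in the hash table. */
    static List<String> join(Map<String, String> small, List<String> largeTable) {
        List<String> out = new ArrayList<>();
        for (String line : largeTable) {
            String[] kv = line.split("\t", 2);
            if (kv.length < 2) {
                continue; // skip malformed lines
            }
            String match = small.get(kv[0]);
            if (match != null) { // records without a match are dropped
                out.add(kv[0] + "\t" + kv[1] + "\t" + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> customers = loadSmallTable(Arrays.asList("1\thanmeimei", "3\tlucy"));
        List<String> orders = Arrays.asList("1\t50", "2\t1135", "3\t15");
        join(customers, orders).forEach(System.out::println);
    }
}
```

No grouping step is needed here: each large-table record is joined on its own, which is why the approach avoids the shuffle but requires the small table to fit in memory.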

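To support replicating files, Hadoop provides the DistributedCache class, which is used as follows:

(1) Call the static method DistributedCache.addCacheFile() to specify the file to replicate. Its argument is the file's URI (for a file on HDFS, something like hdfs://namenode:9000/home/XXX/file, where 9000 is the NameNode port you configured). Before the job starts, the framework fetches this list of URIs and copies the files to the local disk of each Container.

(2) Call DistributedCache.getLocalCacheFiles() to get the local paths of the files, then read them with the standard file I/O API.

Note that DistributedCache is deprecated in Hadoop 2.x; the newer API offers Job.addCacheFile(URI) and JobContext.getCacheFiles() instead.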

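Limitations of this approach:

It relies on Hadoop's DistributedCache to distribute the small dataset to every compute node, and every map node has to load that dataset into memory and index it by key.

The limitation is obvious: one of the datasets must be small enough to be loaded into memory on the map side for the join.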

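3. Ways around the limitations of Map Side Join:

① Use an in-memory server to extend the memory available to a node

For a map join, one dataset can be stored on a dedicated in-memory server. In map(), for each <key,value> input pair, fetch the matching data from the in-memory server by key and join.

② Use a BloomFilter to filter out records that cannot join

Build a BloomFilter in memory over one dataset. Before joining the other dataset, use the BloomFilter to test whether each key exists; if it does not, the record has nothing to join with and can be skipped. A BloomFilter can report false positives but never false negatives, so no matching record is lost.

③ Use the package MapReduce provides for joins

MapReduce ships a package designed specifically for joins. I have not studied it yet and do not know how to use it; I am only noting it here as a reminder.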

jar: hadoop-mapreduce-client-core.jar

package: org.apache.hadoop.mapreduce.lib.join
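To make option ② concrete, here is a minimal Bloom filter sketch in plain Java. It is not Hadoop's own implementation: the class name BloomFilterSketch, the bit-array size, and the double-hashing scheme derived from String.hashCode() are assumptions chosen for brevity.

```java
import java.util.BitSet;

/** A tiny Bloom filter over String keys (hypothetical sketch, not tuned for production). */
final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** i-th bit position for a key, using double hashing: h1 + i * h2 (mod size). */
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // force the step to be odd so it is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Record a join key of the dataset the filter is built over. */
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    /** false means the key is definitely absent; true means it may be present. */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        for (String custId : new String[] {"1", "2", "3"}) {
            filter.add(custId);
        }
        System.out.println(filter.mightContain("2"));
        System.out.println(filter.mightContain("42"));
    }
}
```

In a join, records for which mightContain() returns false can be dropped before any lookup; records that pass still need the real lookup, because a true result may be a false positive.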

4. Using Map Side Join in practice

We have customer data (customer) and order data (orders).

customer

Customer ID	Name	Address	Phone
1	hanmeimei	ShangHai	110
2	leilei	BeiJing	112
3	lucy	GuangZhou	119

orders

Order ID	Customer ID	Order amount (other fields are ignored)
1	1	50
2	1	200
3	3	15
4	3	350
5	3	58
6	1	42
7	1	352
8	2	1135
9	2	400
10	2	2000
11	2	300

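The task: join customer and orders on customer ID. The result must be grouped by customer ID and sorted by order ID; there are no requirements on the other fields.

Customer ID	Order ID	Order amount	Name	Address	Phone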
1	1	50	hanmeimei	ShangHai	110
1	2	200	hanmeimei	ShangHai	110
1	6	42	hanmeimei	ShangHai	110
1	7	352	hanmeimei	ShangHai	110
2	8	1135	leilei	BeiJing	112
2	9	400	leilei	BeiJing	112
2	10	2000	leilei	BeiJing	112
2	11	300	leilei	BeiJing	112
3	3	15	lucy	GuangZhou	119
3	4	350	lucy	GuangZhou	119
3	5	58	lucy	GuangZhou	119
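The ordering requirement (group by customer ID, then sort by order ID) amounts to a two-level comparison on a composite key. The sketch below shows that comparison in plain Java; the class name CompositeKeySortSketch and the use of Collections.sort in place of the MapReduce shuffle are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sorts (customer ID, order ID) pairs the way the required output is ordered (hypothetical sketch). */
final class CompositeKeySortSketch {

    static final class Key implements Comparable<Key> {
        final int custId;
        final int orderId;

        Key(int custId, int orderId) {
            this.custId = custId;
            this.orderId = orderId;
        }

        /** Compare by customer ID first; fall back to order ID only on a tie. */
        @Override
        public int compareTo(Key o) {
            int res = Integer.compare(custId, o.custId);
            return res != 0 ? res : Integer.compare(orderId, o.orderId);
        }

        @Override
        public String toString() {
            return custId + "\t" + orderId;
        }
    }

    /** Each input row is {orderId, custId}, the column order of the orders file. */
    static List<Key> sortedKeys(int[][] orders) {
        List<Key> keys = new ArrayList<>();
        for (int[] o : orders) {
            keys.add(new Key(o[1], o[0]));
        }
        Collections.sort(keys);
        return keys;
    }

    public static void main(String[] args) {
        int[][] orders = {{1, 1}, {2, 1}, {3, 3}, {4, 3}, {5, 3}, {6, 1},
                          {7, 1}, {8, 2}, {9, 2}, {10, 2}, {11, 2}};
        for (Key k : sortedKeys(orders)) {
            System.out.println(k);
        }
    }
}
```

In a MapReduce job the same effect comes from using such a key as the map output key, since the shuffle sorts map output by key.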

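1) When submitting the job, distribute the small dataset to every node through DistributedCache.
2) On the map side, read the data from DistributedCache and build an in-memory lookup table. If a dedicated in-memory server is used, load the data into that server instead, and each map node only needs to keep a small cache; if a BloomFilter is used to speed things up, build it here.
3) In map(), for each <key,value> pair, look up the key in the table built in step 2), join, and emit.

The code: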

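 import java.io.BufferedReader;
 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.net.URI;
 import java.util.HashMap;
 import java.util.Map;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.WritableComparable;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 public class MapSideJoin extends Configured implements Tool {
     // Location of the customer file on HDFS.
     private static final String CUSTOMER_CACHE_URL = "hdfs://hadoop1:9000/user/hadoop/mapreduce/cache/customer.txt";

     // Entity class for one row of the customer table
     private static class CustomerBean {
         private int custId;
         private String name;
         private String address;
         private String phone;

         public CustomerBean() {
         }

         public CustomerBean(int custId, String name, String address, String phone) {
             this.custId = custId;
             this.name = name;
             this.address = address;
             this.phone = phone;
         }

         public int getCustId() {
             return custId;
         }

         public String getName() {
             return name;
         }

         public String getAddress() {
             return address;
         }

         public String getPhone() {
             return phone;
         }
     }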

     // Map output key: (customer ID, order ID), sorted by customer ID first, then order ID
     private static class CustOrderMapOutKey implements WritableComparable<CustOrderMapOutKey> {
         private int custId;
         private int orderId;

         public void set(int custId, int orderId) {
             this.custId = custId;
             this.orderId = orderId;
         }

         public int getCustId() {
             return custId;
         }

         public int getOrderId() {
             return orderId;
         }

         @Override
         public void write(DataOutput out) throws IOException {
             out.writeInt(custId);
             out.writeInt(orderId);
         }

         @Override
         public void readFields(DataInput in) throws IOException {
             custId = in.readInt();
             orderId = in.readInt();
         }

         @Override
         public int compareTo(CustOrderMapOutKey o) {
             int res = Integer.compare(custId, o.custId);
             return res == 0 ? Integer.compare(orderId, o.orderId) : res;
         }

         @Override
         public boolean equals(Object obj) {
             if (obj instanceof CustOrderMapOutKey) {
                 CustOrderMapOutKey o = (CustOrderMapOutKey) obj;
                 return custId == o.custId && orderId == o.orderId;
             } else {
                 return false;
             }
         }

         // equals() is overridden, so hashCode() must be too. Hashing on custId only
         // also makes the default HashPartitioner send all orders of one customer
         // to the same reducer.
         @Override
         public int hashCode() {
             return custId;
         }

         @Override
         public String toString() {
             return custId + "\t" + orderId;
         }
     }

     private static class JoinMapper extends Mapper<LongWritable, Text, CustOrderMapOutKey, Text> {
         private final CustOrderMapOutKey outputKey = new CustOrderMapOutKey();
         private final Text outputValue = new Text();

         /**
          * In-memory copy of the customer table:
          * key = customer ID, value = the customer bean for that row.
          */
         private static final Map<Integer, CustomerBean> CUSTOMER_MAP = new HashMap<Integer, CustomerBean>();

         @Override
         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
             // Input format: order ID <TAB> customer ID <TAB> order amount
             String[] cols = value.toString().split("\t");
             if (cols.length < 3) {
                 return;
             }
             int custId = Integer.parseInt(cols[1]); // customer ID
             CustomerBean customerBean = CUSTOMER_MAP.get(custId);
             if (customerBean == null) { // no customer record to join with
                 return;
             }
             StringBuilder sb = new StringBuilder();
             sb.append(cols[2]).append("\t")
                 .append(customerBean.getName()).append("\t")
                 .append(customerBean.getAddress()).append("\t")
                 .append(customerBean.getPhone());
             outputValue.set(sb.toString());
             outputKey.set(custId, Integer.parseInt(cols[0]));
             context.write(outputKey, outputValue);
         }

         // Runs once per map task, before any call to map()
         @Override
         protected void setup(Context context) throws IOException, InterruptedException {
             FileSystem fs = FileSystem.get(URI.create(CUSTOMER_CACHE_URL), context.getConfiguration());
             // try-with-resources closes the HDFS stream and the reader when loading is done
             try (FSDataInputStream fdis = fs.open(new Path(CUSTOMER_CACHE_URL));
                  BufferedReader reader = new BufferedReader(new InputStreamReader(fdis, "UTF-8"))) {
                 String line = null;
                 String[] cols = null;
                 // Input format: customer ID <TAB> name <TAB> address <TAB> phone
                 while ((line = reader.readLine()) != null) {
                     cols = line.split("\t");
                     if (cols.length < 4) { // malformed line, skip it
                         continue;
                     }
                     CustomerBean bean = new CustomerBean(Integer.parseInt(cols[0]), cols[1], cols[2], cols[3]);
                     CUSTOMER_MAP.put(bean.getCustId(), bean);
                 }
             }
         }
     }

     /**
      * reduce
      */
     private static class JoinReducer extends Reducer<CustOrderMapOutKey, Text, CustOrderMapOutKey, Text> {
         @Override
         protected void reduce(CustOrderMapOutKey key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
             // Nothing to compute: the join was done in the mapper and the shuffle
             // has already sorted the keys, so just write each record out
             for (Text value : values) {
                 context.write(key, value);
             }
         }
     }

     /**
      * @param args input path and output path
      * @throws Exception if the job fails to run
      */
     public static void main(String[] args) throws Exception {
         if (args.length < 2) {
             System.err.println("Usage: <inpath> <outpath>");
             System.exit(2);
         }
         System.exit(ToolRunner.run(new Configuration(), new MapSideJoin(), args));
     }

     @Override
     public int run(String[] args) throws Exception {
         Configuration conf = getConf();
         Job job = Job.getInstance(conf, MapSideJoin.class.getSimpleName());
         job.setJarByClass(MapSideJoin.class);

         // Register the customer cache file
         job.addCacheFile(URI.create(CUSTOMER_CACHE_URL));

         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));

         // map settings
         job.setMapperClass(JoinMapper.class);
         job.setMapOutputKeyClass(CustOrderMapOutKey.class);
         job.setMapOutputValueClass(Text.class);

         // reduce settings
         job.setReducerClass(JoinReducer.class);
         job.setOutputKeyClass(CustOrderMapOutKey.class);
         job.setOutputValueClass(Text.class);

         boolean res = job.waitForCompletion(true);
         return res ? 0 : 1;
     }
 }

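The code above does not use the DistributedCache class: it registers the file with job.addCacheFile(), but setup() reads it straight from HDFS rather than from a local cached copy.

5. Another Map Side Join example, this time using DistributedCache: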

 import java.io.BufferedReader;
 import java.io.FileReader;
 import java.io.IOException;
 import java.util.HashMap;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.filecache.DistributedCache;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

 /**
  * Purpose:
  * Map side join of two tables (described by the original author as a left outer join).
  * The two files represent two tables, joined on table1.id = table2.cityID.
  * Note: the mapper below only emits records whose cityID is found in the
  * cached table, so the output is effectively an inner join.
  * table1 (left table): tb_dim_city
  * (id int, name string, orderid int, city_code int, is_show int),
  * assumed to contain very few records.
  * Contents of tb_dim_city.dat (the field separator in the file is "|"):
  * id     name  orderid  city_code  is_show
  * 0       其他        9999     9999         0
  * 1       长春        1        901          1
  * 2       吉林        2        902          1
  * 3       四平        3        903          1
  * 4       松原        4        904          1
  * 5       通化        5        905          1
  * 6       辽源        6        906          1
  * 7       白城        7        907          1
  * 8       白山        8        908          1
  * 9       延吉        9        909          1
  * ------------------------------------------------------------------
  * table2 (right table): tb_user_profiles
  * (userID int, userName string, network string, flow double, cityID int)
  * Contents of tb_user_profiles.dat (the field separator in the file is "|";
  * the sample has no userName column, and the mapper expects exactly 4 fields):
  * userID   network     flow    cityID
  * 1           2G       123      1
  * 2           3G       333      2
  * 3           3G       555      1
  * 4           2G       777      3
  * 5           3G       666      4
  * ..................................
  * ..................................
  * ------------------------------------------------------------------
  *  Result:
  *  1   长春  1   901 1   1   2G  123
  *  1   长春  1   901 1   3   3G  555
  *  2   吉林  2   902 1   2   3G  333
  *  3   四平  3   903 1   4   2G  777
  *  4   松原  4   904 1   5   3G  666
  */

 public class MapSideJoinMain extends Configured implements Tool {
     private static final Logger logger = LoggerFactory.getLogger(MapSideJoinMain.class);

     public static class LeftOutJoinMapper extends Mapper<Object, Text, Text, Text> {
         private HashMap<String, String> city_infoMap = new HashMap<String, String>();
         private Text outPutKey = new Text();
         private Text outPutValue = new Text();
         private String mapInputStr = null;
         private String mapInputSpit[] = null;
         private String city_secondPart = null;

         /**
          * Runs once before each task starts. It fetches the tb_dim_city file
          * from DistributedCache and loads its records into memory.
          */
         @Override
         protected void setup(Context context) throws IOException, InterruptedException {
             // Local copies of the files registered in DistributedCache for this job
             Path[] distributePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
             if (distributePaths == null) {
                 return; // no cache files were registered
             }
             String cityInfo = null;
             for (Path p : distributePaths) {
                 if (p.toString().endsWith("tb_dim_city.dat")) {
                     // Read the cached file into memory; try-with-resources closes the reader
                     try (BufferedReader br = new BufferedReader(new FileReader(p.toString()))) {
                         while (null != (cityInfo = br.readLine())) {
                             String[] cityPart = cityInfo.split("\\|", 5);
                             if (cityPart.length == 5) {
                                 city_infoMap.put(cityPart[0], cityPart[1] + "\t" + cityPart[2] + "\t" + cityPart[3] + "\t" + cityPart[4]);
                             }
                         }
                     }
                 }
             }
         }

         /**
          * The map side is simple: check whether the cityID of each
          * tb_user_profiles.dat record exists in the in-memory map.
          * That lookup is the whole map join.
          */
         @Override
         protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
             // Skip empty lines
             if (value == null || value.toString().equals("")) {
                 return;
             }
             mapInputStr = value.toString();
             mapInputSpit = mapInputStr.split("\\|", 4);
             // Skip malformed records
             if (mapInputSpit.length != 4) {
                 return;
             }
             // Check whether the join field exists in the map
             city_secondPart = city_infoMap.get(mapInputSpit[3]);
             if (city_secondPart != null) {
                 this.outPutKey.set(mapInputSpit[3]);
                 this.outPutValue.set(city_secondPart + "\t" + mapInputSpit[0] + "\t" + mapInputSpit[1] + "\t" + mapInputSpit[2]);
                 context.write(outPutKey, outPutValue);
             }
         }
     }

     /**
      * args[0]: input path (tb_user_profiles.dat), args[1]: file to cache
      * (tb_dim_city.dat), args[2]: output path.
      */
     @Override
     public int run(String[] args) throws Exception {
         Configuration conf = getConf(); // job configuration
         DistributedCache.addCacheFile(new Path(args[1]).toUri(), conf); // register the cache file for this job
         Job job = Job.getInstance(conf, "MapJoinMR");
         job.setNumReduceTasks(0); // map-only job: nothing is shuffled
         FileInputFormat.addInputPath(job, new Path(args[0])); // map input path
         FileOutputFormat.setOutputPath(job, new Path(args[2])); // job output path
         job.setJarByClass(MapSideJoinMain.class);
         job.setMapperClass(LeftOutJoinMapper.class);
         job.setInputFormatClass(TextInputFormat.class); // input format
         job.setOutputFormatClass(TextOutputFormat.class); // the default output format
         // Map output key and value types
         job.setMapOutputKeyClass(Text.class);
         job.setMapOutputValueClass(Text.class);
         // Job output key and value types
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(Text.class);
         job.waitForCompletion(true);
         return job.isSuccessful() ? 0 : 1;
     }

     public static void main(String[] args) {
         try {
             int returnCode = ToolRunner.run(new MapSideJoinMain(), args);
             System.exit(returnCode);
         } catch (Exception e) {
             logger.error(e.getMessage(), e);
             System.exit(1); // report the failure to the caller
         }
     }
 }

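6. SemiJoin

SemiJoin, the so-called semi-join, is on closer inspection a variant of reduce join: it filters out some of the data on the map side so that only records taking part in the join travel over the network, while records that do not take part are never transferred. This reduces the shuffle's network traffic and improves overall efficiency; everything else works exactly as in reduce join. More concretely: extract the keys of the small table that take part in the join, distribute them to the relevant nodes through DistributedCache, and load them into memory there (for example into a HashSet). In the map phase, scan the tables being joined and drop every record whose join key is not in the in-memory HashSet; the records that do take part go through the shuffle to the reduce side, where the join is performed as in reduce join. The code: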

 import java.io.BufferedReader;
 import java.io.FileReader;
 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.HashSet;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.filecache.DistributedCache;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.FileSplit;
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

 /**
  * @author zengzhaozheng
  *
  * Purpose:
  * Semi-join, a reduce side join with map-side filtering (described by the
  * original author as a left outer join).
  * The two files represent two tables, joined on table1.id = table2.cityID.
  * Note: the reducer below only emits left/right pairs, so the output is
  * effectively an inner join.
  * table1 (left table): tb_dim_city
  * (id int, name string, orderid int, city_code int, is_show int)
  * Contents of tb_dim_city.dat (the field separator in the file is "|"):
  * id     name  orderid  city_code  is_show
  * 0       其他        9999     9999         0
  * 1       长春        1        901          1
  * 2       吉林        2        902          1
  * 3       四平        3        903          1
  * 4       松原        4        904          1
  * 5       通化        5        905          1
  * 6       辽源        6        906          1
  * 7       白城        7        907          1
  * 8       白山        8        908          1
  * 9       延吉        9        909          1
  * ------------------------------------------------------------------
  * table2 (right table): tb_user_profiles
  * (userID int, userName string, network string, flow double, cityID int)
  * Contents of tb_user_profiles.dat (the field separator in the file is "|";
  * the sample has no userName column, and the mapper expects exactly 4 fields):
  * userID   network     flow    cityID
  * 1           2G       123      1
  * 2           3G       333      2
  * 3           3G       555      1
  * 4           2G       777      3
  * 5           3G       666      4
  * ..................................
  * ..................................
  * ------------------------------------------------------------------
  * Contents of joinKey.dat:
  * city_code
  * 1
  * 2
  * 3
  * 4
  * ------------------------------------------------------------------
  *  Result:
  *  1   长春  1   901 1   1   2G  123
  *  1   长春  1   901 1   3   3G  555
  *  2   吉林  2   902 1   2   3G  333
  *  3   四平  3   903 1   4   2G  777
  *  4   松原  4   904 1   5   3G  666
  */

 public class SemiJoin extends Configured implements Tool {
     private static final Logger logger = LoggerFactory.getLogger(SemiJoin.class);

     // CombineValues is a custom Writable holding flag, joinKey and secondPart;
     // its definition is not shown in this post.
     public static class SemiJoinMapper extends Mapper<Object, Text, Text, CombineValues> {
         private CombineValues combineValues = new CombineValues();
         private HashSet<String> joinKeySet = new HashSet<String>();
         private Text flag = new Text();
         private Text joinKey = new Text();
         private Text secondPart = new Text();

         /**
          * Load the join keys from DistributedCache into memory so that the map side
          * can keep only the records that take part in the join.
          */
         @Override
         protected void setup(Context context) throws IOException, InterruptedException {
             // Local copies of the files registered in DistributedCache for this job
             Path[] distributePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
             if (distributePaths == null) {
                 return; // no cache files were registered
             }
             String joinKeyStr = null;
             for (Path p : distributePaths) {
                 if (p.toString().endsWith("joinKey.dat")) {
                     // Read the cached file into memory; try-with-resources closes the reader
                     try (BufferedReader br = new BufferedReader(new FileReader(p.toString()))) {
                         while (null != (joinKeyStr = br.readLine())) {
                             joinKeySet.add(joinKeyStr);
                         }
                     }
                 }
             }
         }

         @Override
         protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
             // Path of the input file this record came from
             String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
             // Records from tb_dim_city.dat are tagged "0"
             if (pathName.endsWith("tb_dim_city.dat")) {
                 String[] valueItems = value.toString().split("\\|");
                 // Skip malformed records
                 if (valueItems.length != 5) {
                     return;
                 }
                 // Keep only records that take part in the join
                 if (joinKeySet.contains(valueItems[0])) {
                     flag.set("0");
                     joinKey.set(valueItems[0]);
                     secondPart.set(valueItems[1] + "\t" + valueItems[2] + "\t" + valueItems[3] + "\t" + valueItems[4]);
                     combineValues.setFlag(flag);
                     combineValues.setJoinKey(joinKey);
                     combineValues.setSecondPart(secondPart);
                     context.write(combineValues.getJoinKey(), combineValues);
                 }
             } // Records from tb_user_profiles.dat are tagged "1"
             else if (pathName.endsWith("tb_user_profiles.dat")) {
                 String[] valueItems = value.toString().split("\\|");
                 // Skip malformed records
                 if (valueItems.length != 4) {
                     return;
                 }
                 // Keep only records that take part in the join
                 if (joinKeySet.contains(valueItems[3])) {
                     flag.set("1");
                     joinKey.set(valueItems[3]);
                     secondPart.set(valueItems[0] + "\t" + valueItems[1] + "\t" + valueItems[2]);
                     combineValues.setFlag(flag);
                     combineValues.setJoinKey(joinKey);
                     combineValues.setSecondPart(secondPart);
                     context.write(combineValues.getJoinKey(), combineValues);
                 }
             }
         }
     }

     public static class SemiJoinReducer extends Reducer<Text, CombineValues, Text, Text> {
         // Left-table records of the current group
         private ArrayList<Text> leftTable = new ArrayList<Text>();
         // Right-table records of the current group
         private ArrayList<Text> rightTable = new ArrayList<Text>();
         private Text secondPar = null;
         private Text output = new Text();

         /**
          * reduce() is called once per group (one join key)
          */
         @Override
         protected void reduce(Text key, Iterable<CombineValues> value, Context context) throws IOException, InterruptedException {
             leftTable.clear();
             rightTable.clear();
             /**
              * Split the records of the group by source file.
              * Caveat: if a group contains too many records, the reduce phase may
              * run out of memory (OOM). Before tackling a distributed problem it is
              * best to understand how the data is distributed and pick the approach
              * that fits, which helps avoid both OOM and excessive data skew.
              */
             for (CombineValues cv : value) {
                 // Copy the value: Hadoop reuses the same object across iterations
                 secondPar = new Text(cv.getSecondPart().toString());
                 // Left table: tb_dim_city
                 if ("0".equals(cv.getFlag().toString().trim())) {
                     leftTable.add(secondPar);
                 }
                 // Right table: tb_user_profiles
                 else if ("1".equals(cv.getFlag().toString().trim())) {
                     rightTable.add(secondPar);
                 }
             }
             logger.info("tb_dim_city:" + leftTable.toString());
             logger.info("tb_user_profiles:" + rightTable.toString());
             for (Text leftPart : leftTable) {
                 for (Text rightPart : rightTable) {
                     output.set(leftPart + "\t" + rightPart);
                     context.write(key, output);
                 }
             }
         }
     }

     /**
      * args[0]: input path containing both tables, args[1]: output path,
      * args[2]: file to cache (joinKey.dat).
      */
     @Override
     public int run(String[] args) throws Exception {
         Configuration conf = getConf(); // job configuration
         DistributedCache.addCacheFile(new Path(args[2]).toUri(), conf); // register the join-key file
         Job job = Job.getInstance(conf, "LeftOutJoinMR");
         job.setJarByClass(SemiJoin.class);
         FileInputFormat.addInputPath(job, new Path(args[0])); // map input path
         FileOutputFormat.setOutputPath(job, new Path(args[1])); // reduce output path
         job.setMapperClass(SemiJoinMapper.class);
         job.setReducerClass(SemiJoinReducer.class);
         job.setInputFormatClass(TextInputFormat.class); // input format
         job.setOutputFormatClass(TextOutputFormat.class); // the default output format
         // Map output key and value types
         job.setMapOutputKeyClass(Text.class);
         job.setMapOutputValueClass(CombineValues.class);
         // Reduce output key and value types
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(Text.class);
         job.waitForCompletion(true);
         return job.isSuccessful() ? 0 : 1;
     }

     public static void main(String[] args) {
         try {
             int returnCode = ToolRunner.run(new SemiJoin(), args);
             System.exit(returnCode);
         } catch (Exception e) {
             logger.error(e.getMessage(), e);
             System.exit(1); // report the failure to the caller
         }
     }
 }

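Note that SemiJoin also has a limited range of application: the extracted join keys must be held in memory, so the key set cannot be too large, or the map side may run out of memory (OOM).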
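The map-side filtering at the heart of SemiJoin can be sketched in plain Java, independent of Hadoop. The class name SemiJoinFilterSketch and the sample records are assumptions made for illustration; only the idea of checking each record's join key against an in-memory HashSet comes from the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Map-side filter of a semi-join; no Hadoop involved (hypothetical sketch). */
final class SemiJoinFilterSketch {

    /** Keep only the "|"-separated records whose field at keyColumn is one of joinKeys. */
    static List<String> filter(List<String> records, int keyColumn, Set<String> joinKeys) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            String[] fields = record.split("\\|");
            if (fields.length > keyColumn && joinKeys.contains(fields[keyColumn])) {
                kept.add(record); // only these records would go through the shuffle
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> joinKeys = new HashSet<>(Arrays.asList("1", "2", "3", "4"));
        // userID|network|flow|cityID; the second record has a cityID outside the key set
        List<String> users = Arrays.asList("1|2G|123|1", "6|3G|42|7", "4|2G|777|3");
        List<String> kept = filter(users, 3, joinKeys);
        System.out.println(kept.size() + " of " + users.size() + " records reach the shuffle");
        kept.forEach(System.out::println);
    }
}
```

The memory cost is that of the key set alone, not of a whole table, which is what lets SemiJoin handle inputs that are too large for a map side join.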

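II. Summary

This post introduced three join approaches. They suit different scenarios, and their efficiency differs considerably, mainly because of network transfer. Map join is the most efficient, followed by SemiJoin, and reduce join is the least efficient. In addition, when writing distributed big-data programs it is best to understand the overall distribution of the data first: this improves the efficiency of the code, keeps data skew to a minimum, and lets the code be tailored to the data.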
