Hadoop Map-Side Join

Previous posts covered Hadoop's Map-side join and Reduce-side join. Today we look at another kind of join that sits between the two.

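SemiJoin, usually called a semi-join, filters out on the Map side the records that do not need to be joined, which greatly reduces the shuffle time on the Reduce side. With a plain Reduce-side join, a dataset may contain a large number of invalid records that the join does not need. Because nothing has preprocessed them, they are only recognized as invalid when the reduce function actually runs, and by then the shuffle, merge, and sort have already been done. Those invalid records therefore waste a great deal of network I/O and disk I/O, which hurts overall performance, and the more invalid data there is, the more pronounced the effect.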

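The semi-join is really a variant of the Reduce-side join: we simply filter out invalid records on the Map side, which shortens the shuffle in the reduce phase and yields a performance gain.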

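Concretely, it uses DistributedCache to distribute the small table to every node. In the setup function of the Map phase, the cached file is read and only the small table's join keys are stored in a HashSet. When the map function runs, each record is checked: if its join key is empty or is not in the HashSet, the record is considered invalid, so it is never partitioned, written to disk, or passed through the shuffle and sort of the reduce phase. This improves join performance to some extent. Note that if the small table's key set is still very large, the program may run out of memory (OOM), in which case another join strategy has to be considered.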
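The filtering step described above can be sketched without any Hadoop dependency. This is only an illustration under stated assumptions: the class name MapSideFilterSketch and its two helper methods are hypothetical, the rows are held in memory instead of being read from the DistributedCache, and the join key is assumed to be the first comma-separated column, as in the test data of this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the mapper's setup() + map() filtering logic.
class MapSideFilterSketch {

    // Mirrors setup(): keep only the join key (first column) of each small-table row.
    static Set<String> loadJoinKeys(List<String> smallTableLines) {
        Set<String> keys = new HashSet<String>();
        for (String line : smallTableLines) {
            // limit -1 keeps empty columns, so a line such as "," still has a first column
            keys.add(line.split(",", -1)[0]);
        }
        return keys;
    }

    // Mirrors map(): a row survives only if its key is non-empty and in the key set.
    static List<String> filter(List<String> bigTableLines, Set<String> keys) {
        List<String> kept = new ArrayList<String>();
        for (String line : bigTableLines) {
            String key = line.split(",", -1)[0];
            if (!key.isEmpty() && keys.contains(key)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> small = Arrays.asList(
                "1,三劫散仙,13575468248",
                "2,凤舞九天,18965235874",
                "3,忙忙碌碌,15986854789",
                "4,少林寺方丈,15698745862");
        List<String> big = Arrays.asList(
                "3,A,99,2013-03-05",
                "1,B,89,2013-02-05",
                "2,C,69,2013-03-09",
                "3,D,56,2013-06-07",
                "5,E,100,2013-09-09",
                "6,H,200,2014-01-10");
        List<String> kept = filter(big, loadJoinKeys(small));
        System.out.println(kept.size() + " of " + big.size() + " large-table rows would reach the shuffle");
        for (String row : kept) {
            System.out.println(row);
        }
    }
}
```

With the test data from this post, the rows keyed 5 and 6 are dropped before the shuffle. Only the keys are held in memory, never the full small-table rows, which is what keeps the memory footprint modest as long as the key set itself is small.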

The test data is as follows:
Simulated small-table data (a.txt):
1,三劫散仙,13575468248
2,凤舞九天,18965235874
3,忙忙碌碌,15986854789
4,少林寺方丈,15698745862

Simulated large-table data (b.txt):
3,A,99,2013-03-05
1,B,89,2013-02-05
2,C,69,2013-03-09
3,D,56,2013-06-07
5,E,100,2013-09-09
6,H,200,2014-01-10

The code is as follows:

package com.semijoin;
import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
/**
 * Semi-join (SemiJoin) implementation for Hadoop 1.2.
 *
 * @author qindongliang
 *
 * Big data discussion group: 376932160
 * Search technology discussion group: 324714439
 */
public class Semjoin {

    /**
     * Custom output entity emitted by the mapper.
     */
    private static class CombineEntity implements WritableComparable<CombineEntity> {

        private Text joinKey;     // join key
        private Text flag;        // flag marking the source file
        private Text secondPart;  // the rest of the record, excluding the key

        public CombineEntity() {
            this.joinKey = new Text();
            this.flag = new Text();
            this.secondPart = new Text();
        }

        public Text getJoinKey() {
            return joinKey;
        }

        public void setJoinKey(Text joinKey) {
            this.joinKey = joinKey;
        }

        public Text getFlag() {
            return flag;
        }

        public void setFlag(Text flag) {
            this.flag = flag;
        }

        public Text getSecondPart() {
            return secondPart;
        }

        public void setSecondPart(Text secondPart) {
            this.secondPart = secondPart;
        }

        // Fields must be read in the same order in which write() emits them
        @Override
        public void readFields(DataInput in) throws IOException {
            this.joinKey.readFields(in);
            this.flag.readFields(in);
            this.secondPart.readFields(in);
        }

        @Override
        public void write(DataOutput out) throws IOException {
            this.joinKey.write(out);
            this.flag.write(out);
            this.secondPart.write(out);
        }

        // Entities are ordered by join key only
        @Override
        public int compareTo(CombineEntity o) {
            return this.joinKey.compareTo(o.joinKey);
        }
    }
    private static class JMapper extends Mapper<LongWritable, Text, Text, CombineEntity> {

        private CombineEntity combine = new CombineEntity();
        private Text flag = new Text();
        private Text joinKey = new Text();
        private Text secondPart = new Text();

        /** Join keys of the small table. */
        private HashSet<String> joinKeySet = new HashSet<String>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Fetch the shared files from the DistributedCache
            Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (paths == null) {
                return; // nothing was cached
            }
            for (Path p : paths) {
                if (p.getName().endsWith("a.txt")) {
                    BufferedReader br = new BufferedReader(new FileReader(p.toString()));
                    try {
                        String temp;
                        while ((temp = br.readLine()) != null) {
                            String[] ss = temp.split(",");
                            joinKeySet.add(ss[0]); // store only the small table's key
                        }
                    } finally {
                        br.close(); // release the file handle
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Path of the input file this split belongs to
            String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
            if (pathName.endsWith("a.txt")) {
                String[] valueItems = value.toString().split(",");
                // Keep only records whose join key is in the key set
                if (joinKeySet.contains(valueItems[0])) {
                    flag.set("0");               // source flag: small table
                    joinKey.set(valueItems[0]);  // join key
                    secondPart.set(valueItems[1] + "\t" + valueItems[2]); // remaining columns
                    // Assemble the entity
                    combine.setFlag(flag);
                    combine.setJoinKey(joinKey);
                    combine.setSecondPart(secondPart);
                    // Emit
                    context.write(combine.getJoinKey(), combine);
                } else {
                    System.out.println("In a.txt:");
                    System.out.println("No such record in the small table; filtered out!");
                    for (String v : valueItems) {
                        System.out.print(v + " ");
                    }
                    return;
                }
            } else if (pathName.endsWith("b.txt")) {
                String[] valueItems = value.toString().split(",");
                // Check whether the join key is in the key set
                if (joinKeySet.contains(valueItems[0])) {
                    flag.set("1");               // source flag: large table
                    joinKey.set(valueItems[0]);  // join key
                    // Remaining columns; note that the two files have different column counts
                    secondPart.set(valueItems[1] + "\t" + valueItems[2] + "\t" + valueItems[3]);
                    // Assemble the entity
                    combine.setFlag(flag);
                    combine.setJoinKey(joinKey);
                    combine.setSecondPart(secondPart);
                    // Emit
                    context.write(combine.getJoinKey(), combine);
                } else {
                    // Filter the record out
                    System.out.println("In b.txt:");
                    System.out.println("No such record in the small table; filtered out!");
                    for (String v : valueItems) {
                        System.out.print(v + " ");
                    }
                    return;
                }
            }
        }
    }
    private static class JReduce extends Reducer<Text, CombineEntity, Text, Text> {

        // Left-table records of the current group
        private List<Text> leftTable = new ArrayList<Text>();
        // Right-table records of the current group
        private List<Text> rightTable = new ArrayList<Text>();
        private Text output = new Text();

        // Called once per group (one join key)
        @Override
        protected void reduce(Text key, Iterable<CombineEntity> values, Context context)
                throws IOException, InterruptedException {
            leftTable.clear();   // reset the per-group buffers
            rightTable.clear();
            /*
             * Put the records from the two files into separate lists.
             * Note that a very large group can cause an OOM error here.
             */
            for (CombineEntity ce : values) {
                // Copy the value: Hadoop reuses the same object across iterations
                Text secondPart = new Text(ce.getSecondPart().toString());
                if (ce.getFlag().toString().trim().equals("0")) {
                    leftTable.add(secondPart);   // left table
                } else if (ce.getFlag().toString().trim().equals("1")) {
                    rightTable.add(secondPart);  // right table
                }
            }
            // Cross product of the left and right records of this group
            for (Text left : leftTable) {
                for (Text right : rightTable) {
                    output.set(left + "\t" + right); // concatenate left and right data
                    context.write(key, output);
                }
            }
        }
    }
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Semjoin.class);
        conf.set("mapred.job.tracker", "192.168.75.130:9001");
        conf.setJar("tt.jar");
        // Small table to share
        String bpath = "hdfs://192.168.75.130:9000/root/dist/a.txt";
        // Add it to the distributed cache
        DistributedCache.addCacheFile(new URI(bpath), conf);
        Job job = new Job(conf, "aaaaa");
        job.setJarByClass(Semjoin.class);
        // Prints "mode: <JobTracker address>"
        System.out.println("模式： " + conf.get("mapred.job.tracker"));
        // Custom Map and Reduce classes
        job.setMapperClass(JMapper.class);
        job.setReducerClass(JReduce.class);
        // Map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(CombineEntity.class);
        // Reduce output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileSystem fs = FileSystem.get(conf);
        Path op = new Path("hdfs://192.168.75.130:9000/root/outputjoindbnew4");
        if (fs.exists(op)) {
            fs.delete(op, true);
            // Prints "the output path already exists and has been deleted"
            System.out.println("存在此输出路径，已删除！！！");
        }
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.75.130:9000/root/inputjoindb"));
        FileOutputFormat.setOutputPath(job, op);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
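To make the reduce step easier to follow in isolation, here is a minimal plain-Java sketch of the same grouping and cross product. It is not the Hadoop job itself: the class ReduceJoinSketch, the Group holder and the shuffle helper are hypothetical names, and a TreeMap stands in for the framework's shuffle and sort. Each input record is assumed to be a {joinKey, flag, secondPart} triple, matching what the mapper emits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for shuffle/sort followed by the reducer's cross product.
class ReduceJoinSketch {

    // Records of one join key: flag "0" rows on the left, flag "1" rows on the right.
    static class Group {
        final List<String> left = new ArrayList<String>();
        final List<String> right = new ArrayList<String>();
    }

    // Stand-in for shuffle + sort: bucket the tagged records by key, in sorted key order.
    static Map<String, Group> shuffle(List<String[]> taggedRecords) {
        Map<String, Group> groups = new TreeMap<String, Group>();
        for (String[] r : taggedRecords) { // r = {joinKey, flag, secondPart}
            Group g = groups.get(r[0]);
            if (g == null) {
                g = new Group();
                groups.put(r[0], g);
            }
            if ("0".equals(r[1])) {
                g.left.add(r[2]);
            } else if ("1".equals(r[1])) {
                g.right.add(r[2]);
            }
        }
        return groups;
    }

    // Per-group cross product; a key with rows on only one side produces no output.
    static List<String> join(Map<String, Group> groups) {
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, Group> e : groups.entrySet()) {
            for (String l : e.getValue().left) {
                for (String r : e.getValue().right) {
                    out.add(e.getKey() + "\t" + l + "\t" + r);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<String[]>();
        records.add(new String[] {"1", "0", "三劫散仙\t13575468248"});
        records.add(new String[] {"2", "0", "凤舞九天\t18965235874"});
        records.add(new String[] {"3", "0", "忙忙碌碌\t15986854789"});
        records.add(new String[] {"4", "0", "少林寺方丈\t15698745862"});
        records.add(new String[] {"3", "1", "A\t99\t2013-03-05"});
        records.add(new String[] {"1", "1", "B\t89\t2013-02-05"});
        records.add(new String[] {"2", "1", "C\t69\t2013-03-09"});
        records.add(new String[] {"3", "1", "D\t56\t2013-06-07"});
        for (String line : join(shuffle(records))) {
            System.out.println(line);
        }
    }
}
```

Key 4 has a small-table row but no large-table row, so its group yields nothing; key 3 has one left row and two right rows, so it yields two joined lines.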

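The run log is as follows (the first two lines are the driver's own output: the JobTracker address, then a notice that the existing output path was deleted):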

模式： 192.168.75.130:9001
存在此输出路径，已删除！！！
WARN - JobClient.copyAndConfigureFiles(746) | Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
INFO - FileInputFormat.listStatus(237) | Total input paths to process : 2
WARN - NativeCodeLoader.<clinit>(52) | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN - LoadSnappy.<clinit>(46) | Snappy native library not loaded
INFO - JobClient.monitorAndPrintJob(1380) | Running job: job_201404260312_0002
INFO - JobClient.monitorAndPrintJob(1393) | map 0% reduce 0%
INFO - JobClient.monitorAndPrintJob(1393) | map 50% reduce 0%
INFO - JobClient.monitorAndPrintJob(1393) | map 100% reduce 0%
INFO - JobClient.monitorAndPrintJob(1393) | map 100% reduce 33%
INFO - JobClient.monitorAndPrintJob(1393) | map 100% reduce 100%
INFO - JobClient.monitorAndPrintJob(1448) | Job complete: job_201404260312_0002
INFO - Counters.log(585) | Counters: 29
INFO - Counters.log(587) | Job Counters
INFO - Counters.log(589) | Launched reduce tasks=1
INFO - Counters.log(589) | SLOTS_MILLIS_MAPS=12445
INFO - Counters.log(589) | Total time spent by all reduces waiting after reserving slots (ms)=0
INFO - Counters.log(589) | Total time spent by all maps waiting after reserving slots (ms)=0
INFO - Counters.log(589) | Launched map tasks=2
INFO - Counters.log(589) | Data-local map tasks=2
INFO - Counters.log(589) | SLOTS_MILLIS_REDUCES=9801
INFO - Counters.log(587) | File Output Format Counters
INFO - Counters.log(589) | Bytes Written=172
INFO - Counters.log(587) | FileSystemCounters
INFO - Counters.log(589) | FILE_BYTES_READ=237
INFO - Counters.log(589) | HDFS_BYTES_READ=455
INFO - Counters.log(589) | FILE_BYTES_WRITTEN=169503
INFO - Counters.log(589) | HDFS_BYTES_WRITTEN=172
INFO - Counters.log(587) | File Input Format Counters
INFO - Counters.log(589) | Bytes Read=227
INFO - Counters.log(587) | Map-Reduce Framework
INFO - Counters.log(589) | Map output materialized bytes=243
INFO - Counters.log(589) | Map input records=10
INFO - Counters.log(589) | Reduce shuffle bytes=243
INFO - Counters.log(589) | Spilled Records=16
INFO - Counters.log(589) | Map output bytes=215
INFO - Counters.log(589) | Total committed heap usage (bytes)=336338944
INFO - Counters.log(589) | CPU time spent (ms)=1770
INFO - Counters.log(589) | Combine input records=0
INFO - Counters.log(589) | SPLIT_RAW_BYTES=228
INFO - Counters.log(589) | Reduce input records=8
INFO - Counters.log(589) | Reduce input groups=4
INFO - Counters.log(589) | Combine output records=0
INFO - Counters.log(589) | Physical memory (bytes) snapshot=442564608
INFO - Counters.log(589) | Reduce output records=4
INFO - Counters.log(589) | Virtual memory (bytes) snapshot=2184306688
INFO - Counters.log(589) | Map output records=8
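The record counters in this log can be cross-checked against the test data with a few lines of plain Java. This is a hedged sketch: CounterCheckSketch and expectedCounters are hypothetical names, the inputs are just the first-column join keys of a.txt and b.txt, and the formulas assume that every small-table row passes the filter (its key is in the set by construction) and that the reducer emits one line per left/right pair.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that derives the expected record counters from the join keys alone.
class CounterCheckSketch {

    // Count how many rows carry each join key.
    static Map<String, Integer> countKeys(List<String> keys) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String k : keys) {
            Integer n = counts.get(k);
            counts.put(k, n == null ? 1 : n + 1);
        }
        return counts;
    }

    // Returns {mapInput, mapOutput, reduceGroups, reduceOutput}.
    static int[] expectedCounters(List<String> smallKeys, List<String> bigKeys) {
        Map<String, Integer> small = countKeys(smallKeys);
        Map<String, Integer> big = countKeys(bigKeys);
        int mapInput = smallKeys.size() + bigKeys.size();
        int mapOutput = smallKeys.size(); // every small-table row passes the filter
        int reduceOutput = 0;
        for (Map.Entry<String, Integer> e : big.entrySet()) {
            Integer leftRows = small.get(e.getKey());
            if (leftRows != null) {
                mapOutput += e.getValue();                // large-table rows that survive
                reduceOutput += leftRows * e.getValue();  // cross product per key
            }
        }
        int reduceGroups = small.size(); // surviving keys are the distinct small-table keys
        return new int[] {mapInput, mapOutput, reduceGroups, reduceOutput};
    }

    public static void main(String[] args) {
        int[] c = expectedCounters(
                Arrays.asList("1", "2", "3", "4"),
                Arrays.asList("3", "1", "2", "3", "5", "6"));
        System.out.println("Map input records=" + c[0]);
        System.out.println("Map output records=" + c[1]);
        System.out.println("Reduce input groups=" + c[2]);
        System.out.println("Reduce output records=" + c[3]);
    }
}
```

For the test data this gives 10, 8, 4 and 4, which agrees with the Map input records, Map output records, Reduce input groups and Reduce output records counters in the log: the two large-table rows keyed 5 and 6 never leave the mapper.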

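The records filtered out on the Map side can be viewed in the JobTracker web UI on port 50030; the original post showed a screenshot of it here.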

The result of the run is as follows:

1 三劫散仙 13575468248 B 89 2013-02-05
2 凤舞九天 18965235874 C 69 2013-03-09
3 忙忙碌碌 15986854789 A 99 2013-03-05
3 忙忙碌碌 15986854789 D 56 2013-06-07

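That completes the semi-join, and the result is correct. Of the join strategies available in Hadoop, the Map-side join is the most efficient, but the right choice still depends on the actual situation.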
