Implementing a Join Query in Map/Reduce

There are two tables, data.txt and info.txt, whose fields are separated by \t.

The contents of data.txt are:

201001 1003 abc

201002 1005 def

201003 1006 ghi

201004 1003 jkl

201005 1004 mno

201006 1005 pqr

The contents of info.txt are:

1003 kaka

1004 da

1005 jue

1006 zhao

Expected output:

1003 201001 abc kaka

1003 201004 jkl kaka

1004 201005 mno da

1005 201002 def jue

1005 201006 pqr jue

1006 201003 ghi zhao
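To make the target concrete, here is a minimal plain-Java sketch (no Hadoop involved) that produces the expected output from the two sample files. It only pins down what the join should return; it is an in-memory lookup, not the reduce-side technique developed below. The class name ExpectedJoinOutput and the method join are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ExpectedJoinOutput {
    // Joins data rows (date, id, text) with info rows (id, name) on id.
    static List<String> join(List<String> dataLines, List<String> infoLines) {
        Map<String, String> nameById = new HashMap<>();
        for (String line : infoLines) {
            String[] f = line.split("\t");
            nameById.put(f[0], f[1]);
        }
        List<String> out = new ArrayList<>();
        for (String line : dataLines) {
            String[] f = line.split("\t");
            String name = nameById.get(f[1]);
            if (name != null) {
                // Output layout: id, date, text, name
                out.add(f[1] + "\t" + f[0] + "\t" + f[2] + "\t" + name);
            }
        }
        // Each line starts with id then date, both fixed width here,
        // so plain string order sorts by id and then by date.
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList(
                "201001\t1003\tabc", "201002\t1005\tdef", "201003\t1006\tghi",
                "201004\t1003\tjkl", "201005\t1004\tmno", "201006\t1005\tpqr");
        List<String> info = Arrays.asList(
                "1003\tkaka", "1004\tda", "1005\tjue", "1006\tzhao");
        join(data, info).forEach(System.out::println);
    }
}
```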

4. Map code

First the map code; I will paste it and then comment on it briefly.

// Needs: java.io.IOException, org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text,
// org.apache.hadoop.mapreduce.Mapper, org.apache.hadoop.mapreduce.lib.input.FileSplit
public static class Example_Join_01_Mapper extends Mapper<LongWritable, Text, TextPair, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Full path and name of the file this input split belongs to
        String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
        if (pathName.contains("data.txt")) {
            String[] values = value.toString().split("\t");
            if (values.length < 3) {
                // Malformed data record (fewer than 3 fields): discard it
                return;
            }
            // Well-formed record: tag it with 1
            TextPair tp = new TextPair(new Text(values[1]), new Text("1"));
            context.write(tp, new Text(values[0] + "\t" + values[2]));
        } else if (pathName.contains("info.txt")) {
            String[] values = value.toString().split("\t");
            if (values.length < 2) {
                // Malformed info record (fewer than 2 fields): discard it
                return;
            }
            // Well-formed record: tag it with 0
            TextPair tp = new TextPair(new Text(values[0]), new Text("0"));
            context.write(tp, new Text(values[1]));
        }
    }
}

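A few points to note here:

A. pathName is the full path of the file in HDFS (for example hdfs://M1:9000/MengYan/join/data/info.txt), so endsWith() can be used for the check.

B. The reference table, info.txt here, must come first, i.e. carry tag 0. Otherwise the desired result cannot be produced.

C. After map finishes, the intermediate output is: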

1003,0 kaka

1004,0 da

1005,0 jue

1006,0 zhao

1003,1 201001 abc

1003,1 201004 jkl

1004,1 201005 mno

1005,1 201002 def

1005,1 201006 pqr

1006,1 201003 ghi
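Before these records reach reduce, the framework sorts them by the composite key. The sketch below imitates that ordering in plain Java so it is easy to see why the tag-0 record ends up ahead of the tag-1 records for each id. The class CompositeKeySortDemo and its Rec type are hypothetical stand-ins for TextPair, and String comparison is assumed to behave like Text comparison for these ASCII values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CompositeKeySortDemo {
    // One map output record: join id, source tag ("0" = info, "1" = data), payload.
    static final class Rec {
        final String id;
        final String tag;
        final String value;

        Rec(String id, String tag, String value) {
            this.id = id;
            this.tag = tag;
            this.value = value;
        }

        @Override
        public String toString() {
            return id + "," + tag + " " + value;
        }
    }

    // Same ordering idea as TextPair.compareTo: first by id, then by tag.
    static final Comparator<Rec> KEY_ORDER =
            Comparator.comparing((Rec r) -> r.id).thenComparing(r -> r.tag);

    static List<Rec> sorted(List<Rec> recs) {
        List<Rec> copy = new ArrayList<>(recs);
        copy.sort(KEY_ORDER); // stable sort: records with equal keys keep their input order
        return copy;
    }

    public static void main(String[] args) {
        List<Rec> mapOutput = Arrays.asList(
                new Rec("1003", "1", "201001 abc"),
                new Rec("1004", "1", "201005 mno"),
                new Rec("1003", "0", "kaka"),
                new Rec("1004", "0", "da"),
                new Rec("1003", "1", "201004 jkl"));
        // Prints the info record of each id first, followed by its data records.
        sorted(mapOutput).forEach(System.out::println);
    }
}
```

Note that Hadoop gives no ordering guarantee between records whose composite keys are equal, so the relative order of the two 1003,1 records should not be relied on.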

5. Partitioning and grouping

1. The map output is then partitioned. Here is the code:

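public static class Example_Join_01_Partitioner extends Partitioner<TextPair, Text> {

    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
        // Partition on the first field only, so both tags of one id reach the same reduce.
        // Masking the sign bit keeps the result non-negative; Math.abs() would
        // still return a negative number for Integer.MIN_VALUE.
        return ((key.getFirst().hashCode() * 127) & Integer.MAX_VALUE) % numPartitions;
    }
}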

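(…in that case there are several partitions, but they are still processed in a single reduce, and the results are nevertheless ordered according to the partitioning rule.) After partitioning, the result looks roughly like this:

Same partition: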

1003,0 kaka

1003,1 201001 abc

1003,1 201004 jkl

Same partition:

1004,0 da

1004,1 201005 mno

Same partition:

1005,0 jue

1005,1 201002 def

1005,1 201006 pqr

Same partition:

1006,0 zhao

1006,1 201003 ghi
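The partitioning rule can be tried out without a cluster. The following sketch applies the same kind of formula to plain strings; it uses String.hashCode() instead of Text.hashCode(), so the actual partition numbers will differ from a real job, and PartitionDemo and partitionFor are hypothetical names. It masks the sign bit instead of calling Math.abs, because Math.abs(Integer.MIN_VALUE) is itself negative.

```java
class PartitionDemo {
    // Maps a join id to a partition number in [0, numPartitions).
    // The tag is deliberately not part of the input: records tagged 0 and 1
    // for the same id must land in the same partition.
    static int partitionFor(String id, int numPartitions) {
        int hash = id.hashCode() * 127;
        // Clearing the sign bit guarantees a non-negative value before the modulo.
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        for (String id : new String[] {"1003", "1004", "1005", "1006"}) {
            System.out.println(id + " -> partition " + partitionFor(id, numPartitions));
        }
    }
}
```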

2. Grouping. The code is as follows:

public static class Example_Join_01_Comparator extends WritableComparator {

    public Example_Join_01_Comparator() {
        // true: let the comparator create TextPair instances for deserialized keys
        super(TextPair.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        // Compare only the first field, so keys that differ only in the tag form one group
        TextPair t1 = (TextPair) a;
        TextPair t2 = (TextPair) b;
        return t1.getFirst().compareTo(t2.getFirst());
    }
}

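Grouping takes the data within one partition and groups it by a specified rule. Here the rule is the first field of the composite key: the second field is ignored, so all records for one id are iterated in a single reduce call. After grouping, the result is:

Same group: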

1003,0 kaka

1003,0 201001 abc

1003,0 201004 jkl

Same group:

1004,0 da

1004,0 201005 mno

Same group:

1005,0 jue

1005,0 201002 def

1005,0 201006 pqr

Same group:

1006,0 zhao

1006,0 201003 ghi
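Grouping can be pictured as one pass over the sorted records that starts a new group whenever the first field changes. The plain-Java sketch below mimics that behaviour under simplifying assumptions: records are {id, tag, value} string arrays that are already sorted, and GroupingDemo is a hypothetical name, not part of the job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GroupingDemo {
    // Splits sorted {id, tag, value} records into groups of values that share an id.
    // The tag (index 1) is ignored, just as the grouping comparator ignores it.
    static List<List<String>> group(List<String[]> sortedRecords) {
        List<List<String>> groups = new ArrayList<>();
        String currentId = null;
        for (String[] r : sortedRecords) {
            if (!r[0].equals(currentId)) {
                // The id changed: the values that follow belong to a new reduce call.
                groups.add(new ArrayList<>());
                currentId = r[0];
            }
            groups.get(groups.size() - 1).add(r[2]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> sorted = Arrays.asList(
                new String[] {"1003", "0", "kaka"},
                new String[] {"1003", "1", "201001 abc"},
                new String[] {"1003", "1", "201004 jkl"},
                new String[] {"1004", "0", "da"},
                new String[] {"1004", "1", "201005 mno"});
        // Prints one list per group, with the info value as the first element.
        for (List<String> g : group(sorted)) {
            System.out.println(g);
        }
    }
}
```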

6. The reduce step

The code is as follows:

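public static class Example_Join_01_Reduce extends Reducer<TextPair, Text, Text, Text> {

    @Override
    protected void reduce(TextPair key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // The join id is the first field of the composite key
        Text pid = key.getFirst();
        // Fetch the iterator once (java.util.Iterator) and reuse it
        Iterator<Text> it = values.iterator();
        // The info record (tag 0) sorts first, so the first value is the shared field.
        // Copy it to a String, because Hadoop reuses the Text instance while iterating.
        String desc = it.next().toString();
        // Every remaining value is a data record: append the shared field and emit
        while (it.hasNext()) {
            context.write(pid, new Text(it.next().toString() + "\t" + desc));
        }
    }
}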

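1. The code is fairly simple. First get the key ID, which is the first field of the key.

2. Get the shared field. After sorting and grouping, the shared field sits in the first position, so just take it.

3. Iterate over the remaining values and emit them.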
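The three steps above can be exercised on a single group without Hadoop. This is a sketch that uses String in place of Text and returns the output lines instead of writing to a Context; ReduceStepDemo is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ReduceStepDemo {
    // Joins one group: the first value is the shared info field,
    // every later value is a data record that gets the info field appended.
    static List<String> reduce(String id, Iterable<String> values) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = values.iterator();
        if (!it.hasNext()) {
            return out; // empty group: nothing to join
        }
        String desc = it.next();
        while (it.hasNext()) {
            out.add(id + "\t" + it.next() + "\t" + desc);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("kaka", "201001\tabc", "201004\tjkl");
        // Prints two tab-separated lines: 1003 201001 abc kaka, then 1003 201004 jkl kaka
        reduce("1003", group).forEach(System.out::println);
    }
}
```

A group that contains only the info record produces no output, which matches inner-join behaviour for ids that have no rows in data.txt.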

7. Other supporting code

1. First the TextPair code. There is not much to explain, so here it is:

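import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Composite key: first = join id, second = source tag ("0" or "1")
public class TextPair implements WritableComparable<TextPair> {

    private Text first;
    private Text second;

    // No-argument constructor, required so Hadoop can instantiate the key when deserializing
    public TextPair() {
        set(new Text(), new Text());
    }

    public TextPair(String first, String second) {
        set(new Text(first), new Text(second));
    }

    public TextPair(Text first, Text second) {
        set(first, second);
    }

    public void set(Text first, Text second) {
        this.first = first;
        this.second = second;
    }

    public Text getFirst() {
        return first;
    }

    public Text getSecond() {
        return second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    // Sort order: by first field, then by second field (so tag 0 precedes tag 1)
    @Override
    public int compareTo(TextPair tp) {
        int cmp = first.compareTo(tp.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(tp.second);
    }
}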

2. The job entry point

public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = new Configuration();
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    String[] otherArgs = parser.getRemainingArgs();
    // Check the arguments that remain after the generic Hadoop options have been parsed
    if (otherArgs.length < 3) {
        System.err.println("Usage: Example_Join_01 <in_path_one> <in_path_two> <output>");
        System.exit(2);
    }
    //conf.set("hadoop.job.ugi", "root,hadoop");
    // On Hadoop 2.x and later this constructor is deprecated; Job.getInstance(conf, "Example_Join_01") replaces it
    Job job = new Job(conf, "Example_Join_01");
    // Set the jar that contains the job
    job.setJarByClass(Example_Join_01.class);
    // Set the mapper
    job.setMapperClass(Example_Join_01_Mapper.class);
    // Set the map output types
    job.setMapOutputKeyClass(TextPair.class);
    job.setMapOutputValueClass(Text.class);
    // Set the partitioner
    job.setPartitionerClass(Example_Join_01_Partitioner.class);
    // After partitioning, group by the specified rule
    job.setGroupingComparatorClass(Example_Join_01_Comparator.class);
    // Set the reducer
    job.setReducerClass(Example_Join_01_Reduce.class);
    // Set the reduce output types
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Set the input and output directories
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
    // Run the job and exit when it completes
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

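8. Summary

1. This is a simple join query. As you can see, I determine the source of each record on the map side. In 0.19 you can in fact use MultipleInputs.addInputPath() instead, but it takes a JobConf as a parameter. That approach handles multiple data sources with multiple map classes. Each approach has its strengths and weaknesses.

2. If the tables are distinguished with tags such as 0 and 1, the reference table must come first. In this example info.txt is the reference table, so its tag is 0. If you use 1 instead (try it), then after grouping the reference table's value ends up in the last position of the iterator and can no longer be appended to every result row.

3. Partitioning does not wait for all maps to finish: it is applied as soon as part of the map output is complete. Likewise, grouping runs within one partition. Once a partition is complete, its grouping starts, without waiting for all the other partitions.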
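The tag-order point is easy to check in isolation. The sketch below sorts the values of one id by their tag, once with the reference table tagged 0 and once with it tagged 1, and shows where the reference value ends up. It is plain Java with hypothetical names (TagOrderDemo, orderByTag), not part of the job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TagOrderDemo {
    // Sorts {tag, value} pairs by tag and returns the values in that order,
    // which is the order a reduce call would iterate them in for one id.
    static List<String> orderByTag(List<String[]> tagged) {
        List<String[]> copy = new ArrayList<>(tagged);
        copy.sort(Comparator.comparing((String[] p) -> p[0]));
        List<String> values = new ArrayList<>();
        for (String[] p : copy) {
            values.add(p[1]);
        }
        return values;
    }

    public static void main(String[] args) {
        // Reference table tagged 0: its value comes first and can be appended to every data row.
        List<String> infoFirst = orderByTag(Arrays.asList(
                new String[] {"1", "201001 abc"},
                new String[] {"0", "kaka"},
                new String[] {"1", "201004 jkl"}));
        System.out.println(infoFirst);

        // Reference table tagged 1: its value comes last, so a reducer that treats
        // the first value as the shared field would pick a data row instead.
        List<String> infoLast = orderByTag(Arrays.asList(
                new String[] {"0", "201001 abc"},
                new String[] {"1", "kaka"},
                new String[] {"0", "201004 jkl"}));
        System.out.println(infoLast);
    }
}
```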
