HIVE JOIN

Overview

Hive implements several join strategies:

Common (Reduce-side) Join
Broadcast (Map-side) Join
Bucket Map Join
Sort Merge Bucket Join
Skew Join

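This post covers the first two.

The first is the common join. As the name suggests, it is the most common join implementation, but it is not very flexible and its performance is not great either.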

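A common join consists of a map stage, a shuffle stage, and a reduce stage. The map stage generates the join key and join value required by the join condition and writes them to intermediate files. The shuffle stage sorts these files by join key and merges rows with the same key into one file. The reduce stage performs the final merge and produces the result.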
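The three stages described above can be sketched in plain Python. This is only an illustrative simulation using the sample rows from this post, not Hive's actual implementation; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `common_join`) and the 0/1 table tags are assumptions made for the example, and it assumes non-null integer join keys.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the post: test_a(id, uid, city_id) and test_city(id, name).
TEST_A = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]
TEST_CITY = [(1, "beijing"), (2, "shanghai"), (3, "hangzhou")]


def map_phase(test_a, test_city):
    """Emit (join_key, table_tag, value) records; tag 0 = left table, 1 = right table."""
    records = []
    for id_, uid, city_id in test_a:
        records.append((city_id, 0, (id_, uid)))
    for id_, name in test_city:
        records.append((id_, 1, (name,)))
    return records


def shuffle_phase(records):
    """Sort by join key (then tag) and group the records that share a key."""
    ordered = sorted(records, key=itemgetter(0, 1))
    return [(key, list(group)) for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_phase(grouped):
    """Perform a left outer join inside each key group."""
    output = []
    for _key, group in grouped:
        left = [value for _, tag, value in group if tag == 0]
        right = [value for _, tag, value in group if tag == 1]
        for id_, uid in left:
            # A left row without a match is still emitted, with name = None.
            for (name,) in right or [(None,)]:
                output.append((id_, uid, name))
    return output


def common_join(test_a, test_city):
    """Chain the three simulated stages."""
    return reduce_phase(shuffle_phase(map_phase(test_a, test_city)))


print(common_join(TEST_A, TEST_CITY))
```

Tagging each record with its source table is what lets the reducer tell left rows from right rows after the shuffle has mixed them together under one key.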

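The second is the broadcast join. It eliminates the shuffle and reduce stages and performs the join in the map stage: the small table in the join is loaded into memory, and every mapper joins directly against that in-memory table, so the entire join completes in the map stage.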

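How the small table gets into memory takes some care. The small table is first loaded into memory and serialized to a hashtable file. When the map stage starts, this hashtable file is put into the distributed cache and distributed to the local disk of each mapper; each mapper then loads the hashtable file into memory and performs the join. With this optimization the small table only has to be read once, and if many mappers run on the same machine, that machine needs only one copy of the hashtable file.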
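The hashtable-file flow described above can be sketched the same way. This is a simplified simulation, not Hive's code: a temporary directory stands in for the distributed cache, `pickle` stands in for Hive's serialization format, and the names `build_hashtable_file`, `mapper`, `broadcast_join`, and the `num_mappers` parameter are all hypothetical.

```python
import os
import pickle
import tempfile

# Sample rows from the post: test_a(id, uid, city_id) and test_city(id, name).
TEST_A = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]
TEST_CITY = [(1, "beijing"), (2, "shanghai"), (3, "hangzhou")]


def build_hashtable_file(small_table, path):
    """Local task: hash the small table on its join key and dump it to a file."""
    table = {}
    for id_, name in small_table:
        table.setdefault(id_, []).append(name)
    with open(path, "wb") as fh:
        pickle.dump(table, fh)


def mapper(split, hashtable_path):
    """One map task: load the hashtable file, then probe it row by row."""
    with open(hashtable_path, "rb") as fh:
        table = pickle.load(fh)
    output = []
    for id_, uid, city_id in split:
        # Left outer join: an unmatched row is kept with name = None.
        for name in table.get(city_id, [None]):
            output.append((id_, uid, name))
    return output


def broadcast_join(big_table, small_table, num_mappers=2):
    """Build the hashtable file once, then let every simulated mapper reuse it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "hashtable.bin")
        build_hashtable_file(small_table, path)
        # Split the big table across mappers in round-robin fashion.
        splits = [big_table[i::num_mappers] for i in range(num_mappers)]
        rows = []
        for split in splits:
            rows.extend(mapper(split, path))
        return rows


print(broadcast_join(TEST_A, TEST_CITY))
```

No sort or grouping step appears anywhere: each mapper only does dictionary lookups, which is why the shuffle and reduce stages can be dropped.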

Inspecting the plan with EXPLAIN

Two tables were prepared: test_a and test_city.

The data in test_a:

test_a.id	test_a.uid	test_a.city_id
1	1	1
2	2	2
3	3	3

The data in test_city:

test_city.id	test_city.name
1	beijing
2	shanghai
3	hangzhou

LEFT JOIN

The SQL is:

explain
select a.id, a.uid, b.name
from
    temp.test_a as a
left join
    temp.test_city as b
on a.city_id = b.id;

Because the tables are small, Hive uses a map-side join. The plan is as follows:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables: // read data from files
        $hdt$_1:b
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:b
          TableScan // scan table test_city, reading it row by row
            alias: b
            Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
            Select Operator // select the columns
              expressions: id (type: bigint), name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator // my understanding is that these are the keys used for the hash table that goes into the distributed cache, but I am not certain
                keys:
                  0 _col2 (type: bigint)
                  1 _col0 (type: bigint)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), uid (type: bigint), city_id (type: bigint)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator // note that a map-side join is used here
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 _col2 (type: bigint)
                  1 _col0 (type: bigint)
                outputColumnNames: _col0, _col1, _col4
                Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: bigint), _col1 (type: bigint), _col4 (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

If you set

set hive.auto.convert.join=false;

the plan becomes a reduce-side join, the most widely used join implementation. The whole plan consists of two stages:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree: // map stage
          TableScan
            alias: a
            Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), uid (type: bigint), city_id (type: bigint)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator // map-side output operator; its rows are sent on to the reduce stage
                key expressions: _col2 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col2 (type: bigint)
                Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: bigint), _col1 (type: bigint)
          TableScan
            alias: b
            Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col0 (type: bigint)
                Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 _col2 (type: bigint)
            1 _col0 (type: bigint)
          outputColumnNames: _col0, _col1, _col4
          Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: bigint), _col4 (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
