A Hive left outer join problem

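Recently a BA user reported that two seemingly identical statements returned different result counts. This looked odd, and the user suspected a Hive bug.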

Query 1 returns 6071

select count(distinct reviewid) as dis_reviewcnt
from
  (select a.reviewid
   from bi.dpods_dp_reviewreport a
   left outer join bi.dpods_dp_reviewlog b
     on a.reviewid = b.reviewid and b.hp_statdate = '2013-07-24'
   where to_date(a.feedadddate) >= '2013-07-01' and a.hp_statdate = '2013-07-24'
  ) a

Query 2 returns 6443

select count(distinct reviewid) as dis_reviewcnt
from
  (select a.reviewid
   from bi.dpods_dp_reviewreport a
   left outer join bi.dpods_dp_reviewlog b
     on a.reviewid = b.reviewid and b.hp_statdate = '2013-07-24' and a.hp_statdate = '2013-07-24'
   where to_date(a.feedadddate) >= '2013-07-01'
  ) a

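Query 2 returns 372 more reviewids than Query 1, and those extra rows do not exist in the left table as the user meant to restrict it in the subquery (the '2013-07-24' partition).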

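The only difference between the two statements is where the partition filter on dpods_dp_reviewreport is placed (hp_statdate is the partition column): in one it follows WHERE, in the other it follows ON.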

At first glance the two should return the same data, but the catch lies precisely in the difference between WHERE and ON.

What follows WHERE is a filter condition. In Query 1, a.hp_statdate='2013-07-24' is applied by the Partition Pruner before the table scan, so only the data under partition '2013-07-24' is joined with dpods_dp_reviewlog.

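Query 2, by contrast, reads the data of every partition and then joins it with dpods_dp_reviewlog. According to the join condition, a left row is matched against the right table only when a.hp_statdate='2013-07-24'. In every other case, because this is a left outer join, the unmatched left row is still kept, with NULLs on the right side. Query 2 therefore returns every reviewid in the left table that passes the feedadddate filter, regardless of partition, which is why its result differs from Query 1.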
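The WHERE versus ON behavior described above is standard SQL outer join semantics, so it can be reproduced on a toy dataset. The following is a minimal sketch using Python's built-in sqlite3 module instead of Hive. The tables `report` and `log` and their rows are made up for illustration, and partition pruning is not modeled; only the row-level semantics are shown.

```python
import sqlite3

# Toy stand-ins for the left table (report) and right table (log).
# Table names and data are hypothetical, not the real bi.* tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE report (reviewid INTEGER, hp_statdate TEXT);
    CREATE TABLE log    (reviewid INTEGER, hp_statdate TEXT);
    INSERT INTO report VALUES (1, '2013-07-24'), (2, '2013-07-24'), (3, '2013-07-23');
    INSERT INTO log    VALUES (1, '2013-07-24');
""")

def count_distinct(sql):
    """Run a query that returns a single count and return that number."""
    return conn.execute(sql).fetchone()[0]

# Left-table predicate in WHERE: rows from other dates are filtered out.
in_where = count_distinct("""
    SELECT COUNT(DISTINCT a.reviewid)
    FROM report a
    LEFT OUTER JOIN log b
      ON a.reviewid = b.reviewid AND b.hp_statdate = '2013-07-24'
    WHERE a.hp_statdate = '2013-07-24'
""")

# Left-table predicate in ON: it only decides whether a left row finds a
# match; unmatched left rows are still emitted with NULLs on the right.
in_on = count_distinct("""
    SELECT COUNT(DISTINCT a.reviewid)
    FROM report a
    LEFT OUTER JOIN log b
      ON a.reviewid = b.reviewid
     AND b.hp_statdate = '2013-07-24'
     AND a.hp_statdate = '2013-07-24'
""")

print(in_where, in_on)  # 2 3
```

The row dated '2013-07-23' is dropped by the WHERE version but survives the ON version, which mirrors the gap between 6071 and 6443.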

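You can verify this with an experiment: remove a.hp_statdate='2013-07-24' from the ON clause of Query 2, leave everything else unchanged, and run the statement. The distinct review count is again 6443.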

select count(distinct reviewid) as dis_reviewcnt
from
  (select a.reviewid
   from bi.dpods_dp_reviewreport a
   left outer join bi.dpods_dp_reviewlog b
     on a.reviewid = b.reviewid and b.hp_statdate = '2013-07-24'
   where to_date(a.feedadddate) >= '2013-07-01'
  ) a
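The same experiment can be sketched on toy data to see why the count does not change. This again uses sqlite3 rather than Hive, with hypothetical tables `report` and `log`. Removing the left-table predicate from ON changes which left rows get a right-side match, but not which left rows are returned.

```python
import sqlite3

# Hypothetical toy data: reviewid 2 lives in a different left-side "partition".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE report (reviewid INTEGER, hp_statdate TEXT);
    CREATE TABLE log    (reviewid INTEGER, hp_statdate TEXT);
    INSERT INTO report VALUES (1, '2013-07-24'), (2, '2013-07-23');
    INSERT INTO log    VALUES (1, '2013-07-24'), (2, '2013-07-24');
""")

# Same join, with an optional extra predicate appended to the ON clause.
BASE = """
    SELECT a.reviewid, b.reviewid
    FROM report a
    LEFT OUTER JOIN log b
      ON a.reviewid = b.reviewid AND b.hp_statdate = '2013-07-24' {extra}
    ORDER BY a.reviewid
"""

with_left_pred = conn.execute(
    BASE.format(extra="AND a.hp_statdate = '2013-07-24'")
).fetchall()
without_left_pred = conn.execute(BASE.format(extra="")).fetchall()

print(with_left_pred)     # [(1, 1), (2, None)]
print(without_left_pred)  # [(1, 1), (2, 2)]

# The set of left-side reviewids is identical in both cases.
left_ids_with = {row[0] for row in with_left_pred}
left_ids_without = {row[0] for row in without_left_pred}
print(left_ids_with == left_ids_without)  # True
```

Only the right-hand column differs (NULL versus a match), so a count of distinct left-side reviewids comes out the same either way.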

Query plan for Query 1

ABSTRACT SYNTAX TREE:

  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewreport) a) (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewlog) b) (and (= (. (TOK_TABLE_OR_COL a) reviewid) (. (TOK_TABLE_OR_COL b) reviewid)) (= (. (TOK_TABLE_OR_COL b) hp_statdate) '2013-07-24')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) reviewid))) (TOK_WHERE (and (>= (TOK_FUNCTION to_date (. (TOK_TABLE_OR_COL a) feedadddate)) '2013-07-01') (= (. (TOK_TABLE_OR_COL a) hp_statdate) '2013-07-24'))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL reviewid)) dis_reviewcnt))))

STAGE DEPENDENCIES:

  Stage-5 is a root stage , consists of Stage-1

  Stage-1

  Stage-2 depends on stages: Stage-1

  Stage-0 is a root stage

STAGE PLANS:

  Stage: Stage-5

    Conditional Operator

  Stage: Stage-1

    Map Reduce

      Alias -> Map Operator Tree:

        a:a

          TableScan

            alias: a

            Filter Operator

              predicate:

                  expr: (to_date(feedadddate) >= '2013-07-01')

                  type: boolean

              Reduce Output Operator

                key expressions:

                      expr: reviewid

                      type: int

                sort order: +

                Map-reduce partition columns:

                      expr: reviewid

                      type: int

                tag: 0

                value expressions:

                      expr: feedadddate

                      type: string

                      expr: reviewid

                      type: int

                      expr: hp_statdate

                      type: string

        a:b

          TableScan

            alias: b

            Reduce Output Operator

              key expressions:

                    expr: reviewid

                    type: int

              sort order: +

              Map-reduce partition columns:

                    expr: reviewid

                    type: int

              tag: 1

      Reduce Operator Tree:

        Join Operator

          condition map:

               Left Outer Join0 to 1

          condition expressions:

            0 {VALUE._col5} {VALUE._col8} {VALUE._col17}

            1

          handleSkewJoin: false

          outputColumnNames: _col5, _col8, _col17

          Select Operator

            expressions:

                  expr: _col8

                  type: int

            outputColumnNames: _col0

            Select Operator

              expressions:

                    expr: _col0

                    type: int

              outputColumnNames: _col0

              Group By Operator

                aggregations:

                      expr: count(DISTINCT _col0)

                bucketGroup: false

                keys:

                      expr: _col0

                      type: int

                mode: hash

                outputColumnNames: _col0, _col1

                File Output Operator

                  compressed: true

                  GlobalTableId: 0

                  table:

                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat

                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2

    Map Reduce

      Alias -> Map Operator Tree:

        hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-10-59_408_7272696604651905662/-mr-10002

            Reduce Output Operator

              key expressions:

                    expr: _col0

                    type: int

              sort order: +

              tag: -1

              value expressions:

                    expr: _col1

                    type: bigint

      Reduce Operator Tree:

        Group By Operator

          aggregations:

                expr: count(DISTINCT KEY._col0:0._col0)

          bucketGroup: false

          mode: mergepartial

          outputColumnNames: _col0

          Select Operator

            expressions:

                  expr: _col0

                  type: bigint

            outputColumnNames: _col0

            File Output Operator

              compressed: false

              GlobalTableId: 0

              table:

                  input format: org.apache.hadoop.mapred.TextInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0

    Fetch Operator

      limit: -1

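Query plan for Query 2 (note the "filter predicates" entry under the Join Operator: the a.hp_statdate condition is evaluated inside the join instead of being used to filter the left table)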

ABSTRACT SYNTAX TREE:

  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewreport) a) (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewlog) b) (and (and (= (. (TOK_TABLE_OR_COL a) reviewid) (. (TOK_TABLE_OR_COL b) reviewid)) (= (. (TOK_TABLE_OR_COL b) hp_statdate) '2013-07-24')) (= (. (TOK_TABLE_OR_COL a) hp_statdate) '2013-07-24')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) reviewid))) (TOK_WHERE (>= (TOK_FUNCTION to_date (. (TOK_TABLE_OR_COL a) feedadddate)) '2013-07-01')))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL reviewid)) dis_reviewcnt))))

STAGE DEPENDENCIES:

  Stage-5 is a root stage , consists of Stage-1

  Stage-1

  Stage-2 depends on stages: Stage-1

  Stage-0 is a root stage

STAGE PLANS:

  Stage: Stage-5

    Conditional Operator

  Stage: Stage-1

    Map Reduce

      Alias -> Map Operator Tree:

        a:a

          TableScan

            alias: a

            Filter Operator

              predicate:

                  expr: (to_date(feedadddate) >= '2013-07-01')

                  type: boolean

              Reduce Output Operator

                key expressions:

                      expr: reviewid

                      type: int

                sort order: +

                Map-reduce partition columns:

                      expr: reviewid

                      type: int

                tag: 0

                value expressions:

                      expr: feedadddate

                      type: string

                      expr: reviewid

                      type: int

                      expr: hp_statdate

                      type: string

        a:b

          TableScan

            alias: b

            Reduce Output Operator

              key expressions:

                    expr: reviewid

                    type: int

              sort order: +

              Map-reduce partition columns:

                    expr: reviewid

                    type: int

              tag: 1

      Reduce Operator Tree:

        Join Operator

          condition map:

               Left Outer Join0 to 1

          condition expressions:

            0 {VALUE._col5} {VALUE._col8}

            1

          filter predicates:

            0 {(VALUE._col17 = '2013-07-24')}

            1

          handleSkewJoin: false

          outputColumnNames: _col5, _col8

          Select Operator

            expressions:

                  expr: _col8

                  type: int

            outputColumnNames: _col0

            Select Operator

              expressions:

                    expr: _col0

                    type: int

              outputColumnNames: _col0

              Group By Operator

                aggregations:

                      expr: count(DISTINCT _col0)

                bucketGroup: false

                keys:

                      expr: _col0

                      type: int

                mode: hash

                outputColumnNames: _col0, _col1

                File Output Operator

                  compressed: true

                  GlobalTableId: 0

                  table:

                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat

                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2

    Map Reduce

      Alias -> Map Operator Tree:

        hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-13-32_879_3623450294049807419/-mr-10002

            Reduce Output Operator

              key expressions:

                    expr: _col0

                    type: int

              sort order: +

              tag: -1

              value expressions:

                    expr: _col1

                    type: bigint

      Reduce Operator Tree:

        Group By Operator

          aggregations:

                expr: count(DISTINCT KEY._col0:0._col0)

          bucketGroup: false

          mode: mergepartial

          outputColumnNames: _col0

          Select Operator

            expressions:

                  expr: _col0

                  type: bigint

            outputColumnNames: _col0

            File Output Operator

              compressed: false

              GlobalTableId: 0

              table:

                  input format: org.apache.hadoop.mapred.TextInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0

    Fetch Operator

      limit: -1

Reference:

http://blog.sina.com.cn/s/blog_6ff05a2c01010oxp.html
