1. Project data with SELECT

The most common use case for Hive is to query data in Hadoop. To achieve this, we need to write and execute a SELECT statement. The typical work done by the SELECT statement is to project the whole row (with SELECT * ) or specified columns (with SELECT column1, column2, ... ) from a table, with or without conditions. Most simple SELECT statements will not trigger a YARN job. Instead, a simple fetch task is created just for dumping the data, much like the hdfs dfs -cat command. The SELECT statement is quite often used with the FROM and DISTINCT keywords. The FROM keyword, followed by a table, specifies where SELECT projects data from. The DISTINCT keyword used after SELECT ensures that only unique rows or unique combinations of columns are returned from the table. In addition, SELECT also supports columns combined with user-defined functions, IF() , or a CASE WHEN THEN ELSE END statement, and regular expressions. The following are examples of projecting data with a SELECT statement:

SELECT * FROM employee; -- Project the whole row
SELECT name FROM employee; -- Project specified columns

-- List all columns matching a Java regular expression
SET hive.support.quoted.identifiers = none; -- Enable this
SELECT `^work.*` FROM employee; -- All columns starting with work

SELECT DISTINCT name, work_place FROM employee;

SELECT
CASE WHEN gender_age.gender = 'Female' THEN 'Ms.'
ELSE 'Mr.'
END as title,
name,
IF(array_contains(work_place, 'New York'), 'US', 'CA') as country
FROM employee;

Multiple SELECT statements can work together to build a complex query using nested queries or common table expressions (CTEs). A nested query, which is also called a subquery, is a query projecting data from the result of another query. Nested queries can be rewritten as a CTE with the WITH and AS keywords. When using nested queries, an alias must be given for the inner query (see t1 in the following example), or else Hive will report an exception. The following are a few examples of using nested queries in HQL:

--1. A nested query example with the mandatory alias:
SELECT
name, gender_age.gender as gender
FROM (
SELECT * FROM employee WHERE gender_age.gender = 'Male'
) t1; -- t1 here is mandatory

--2. A nested query can be rewritten with CTE as follows.
--This is the recommended way of writing a complex single HQL query
WITH t1 as (
SELECT * FROM employee WHERE gender_age.gender = 'Male'
)
SELECT name, gender_age.gender as gender
FROM t1;
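
The equivalence of the two forms can be checked outside Hive. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive; the flat gender column (instead of the gender_age struct) and the sample rows are assumptions made only for illustration.

```python
import sqlite3

# SQLite stands in for Hive here. The flat "gender" column replaces the
# gender_age struct, and the rows are assumed sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Michael", "Male"), ("Will", "Male"),
     ("Shelley", "Female"), ("Lucy", "Female")],
)

# Nested query with the alias t1 for the inner query
nested_sql = """
    SELECT name, gender
    FROM (SELECT * FROM employee WHERE gender = 'Male') t1
    ORDER BY name
"""

# The same query rewritten as a CTE
cte_sql = """
    WITH t1 AS (SELECT * FROM employee WHERE gender = 'Male')
    SELECT name, gender
    FROM t1
    ORDER BY name
"""

nested_rows = conn.execute(nested_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(nested_rows == cte_rows)
```

Both statements select from the same filtered intermediate result, so they return identical rows; the CTE form only moves the inner query to the top, which tends to read better as queries grow.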

In addition, a special SELECT followed by a constant expression can work without the FROM table clause. It returns the result of the expression. This is equivalent to querying a dummy table with one dummy record.

SELECT concat('1','+','3','=',cast((1 + 3) as string)) as res;
+-------+
| res |
+-------+
| 1+3=4 |
+-------+

2. Filtering data with conditions

It is quite common to narrow down the result set by using a condition clause, such as LIMIT , WHERE , IN / NOT IN , and EXISTS / NOT EXISTS . The LIMIT keyword restricts the number of rows returned; without an ORDER BY clause, which rows are returned is arbitrary. Compared with LIMIT , WHERE is a more powerful and generic condition clause that limits the returned result set by expressions, functions, and nested queries, as in the following examples:

SELECT name FROM employee LIMIT 2;

SELECT name, work_place FROM employee WHERE name = 'Michael';

-- The conditions can be used together; LIMIT comes after WHERE
SELECT name, work_place FROM employee WHERE name = 'Michael' LIMIT 1;

IN / NOT IN is used as an expression to check whether values belong to a set specified by IN or NOT IN . With effect from Hive v2.1.0, IN and NOT IN statements support more than one column.

SELECT name FROM employee WHERE gender_age.age in (27, 30);

-- With multiple columns support after v2.1.0
SELECT
name, gender_age
FROM employee
WHERE (gender_age.gender, gender_age.age) IN
(('Female', 27), ('Male', 27 + 3)); -- Expressions are also supported

In addition, filtering data can also use a subquery in the WHERE clause with IN / NOT IN and EXISTS / NOT EXISTS . A subquery that uses EXISTS or NOT EXISTS must refer to both inner and outer expressions:

SELECT
name, gender_age.gender as gender
FROM
employee
WHERE name IN
(
SELECT
name
FROM
employee
WHERE
gender_age.gender = 'Male'
);

SELECT
name, gender_age.gender as gender
FROM
employee a
WHERE EXISTS (
SELECT *
FROM
employee b
WHERE
a.gender_age.gender = b.gender_age.gender AND
b.gender_age.gender = 'Male'
); -- This is like joining tables a and b on the gender column

There are additional restrictions for subqueries used in WHERE clauses:

  • Subqueries can only appear on the right-hand side of expressions in WHERE clauses
  • Nested subqueries are not allowed
  • IN / NOT IN with a subquery only supports a single column, although multiple columns are supported in regular (non-subquery) expressions
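
As the comment in the EXISTS example suggests, the IN and EXISTS forms above select the same rows. The following is a minimal sketch that checks this with Python's built-in sqlite3 module as a stand-in for Hive; the flat gender column and the sample rows are assumptions for illustration only.

```python
import sqlite3

# SQLite stands in for Hive here. The flat "gender" column replaces the
# gender_age struct, and the rows are assumed sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Michael", "Male"), ("Will", "Male"),
     ("Shelley", "Female"), ("Lucy", "Female")],
)

# Uncorrelated subquery with IN
in_sql = """
    SELECT name FROM employee
    WHERE name IN (SELECT name FROM employee WHERE gender = 'Male')
    ORDER BY name
"""

# Correlated subquery with EXISTS: it refers to both the outer table (a)
# and the inner table (b)
exists_sql = """
    SELECT name FROM employee a
    WHERE EXISTS (
        SELECT * FROM employee b
        WHERE a.gender = b.gender AND b.gender = 'Male'
    )
    ORDER BY name
"""

in_names = [row[0] for row in conn.execute(in_sql)]
exists_names = [row[0] for row in conn.execute(exists_sql)]
print(in_names, exists_names)
```

The EXISTS version keeps an outer row as soon as one matching inner row is found, which is why it behaves like a join on the gender column that returns each outer row at most once.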

3. Linking data with JOIN

JOIN is used to link rows from two or more tables together. Hive supports most SQL JOIN operations, such as INNER JOIN and OUTER JOIN . In addition, HQL supports some special joins, such as MapJoin and Semi-Join . In its earlier versions, Hive only supported equal joins. Since v2.2.0, unequal joins are also supported. However, you should be careful when using an unequal join unless you know what to expect, since an unequal join is likely to return many rows by producing a Cartesian product of the joined tables. When you want to restrict the output of a join, you should apply a WHERE clause after the join, as JOIN occurs before the WHERE clause. If possible, push filter conditions into the join conditions rather than the WHERE conditions so that data is filtered earlier. What's more, all types of left/right joins are not commutative and are always left/right associative, while INNER and FULL OUTER JOINs are both commutative and associative.

3.1 INNER JOIN

INNER JOIN or JOIN returns rows meeting the join conditions from both sides of joined tables. The JOIN keyword can also be omitted by comma-separated table names; this is called an implicit join . Here are examples of the HQL JOIN operation:

--1. First, prepare a table to join with and load data to it:
CREATE TABLE IF NOT EXISTS employee_hr (
name string,
employee_id int,
sin_number string,
start_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';

LOAD DATA INPATH '/tmp/hivedemo/data/employee_hr.txt'
OVERWRITE INTO TABLE employee_hr;

--2. Perform an INNER JOIN between two tables with equal and unequal join
--conditions, along with complex expressions as well as a post join WHERE
--condition. Usually, we need to add a table name or table alias before columns in
--the join condition, although Hive always tries to resolve them:
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph
ON emp.name = emph.name; -- Equal Join
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091|
| Will | 527-948-090|
| Lucy | 577-928-094|
+-----------+------------------+

SELECT
emp.name, emph.sin_number
FROM employee emp
-- Unequal join supported since v2.2.0 returns more rows
JOIN employee_hr emph
ON emp.name != emph.name;
+----------+-----------------+
| emp.name | emph.sin_number |
+----------+-----------------+
| Michael | 527-948-090|
| Michael | 647-968-598|
| Michael | 577-928-094|
| Will | 547-968-091|
| Will | 647-968-598|
| Will | 577-928-094|
| Shelley | 547-968-091|
| Shelley | 527-948-090|
| Shelley | 647-968-598|
| Shelley | 577-928-094|
| Lucy | 547-968-091|
| Lucy | 527-948-090|
| Lucy | 647-968-598|
+----------+-----------------+

-- Join with a complex expression in the join condition
-- This is also the way to implement a conditional join
-- Below, the row with name = 'Will' is conditionally ignored
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph
ON IF(emp.name = 'Will', '', emp.name) =
CASE WHEN emph.name = 'Will' THEN '' ELSE emph.name END;
+----------+-----------------+
| emp.name | emph.sin_number |
+----------+-----------------+
| Michael | 547-968-091|
| Lucy | 577-928-094|
+----------+-----------------+

-- Use where/limit to limit the output of join
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
WHERE emp.name = 'Will';
+----------+-----------------+
| emp.name | emph.sin_number |
+----------+-----------------+
| Will | 527-948-090|
+----------+-----------------+

--3. The JOIN operation can be performed on more tables (such as table A, B, and C) with sequence joins.
--The tables can either join from A to B and B to C, or join from A to B and A to C
SELECT
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emp.name = empi.name;
+-----------+-------------------+------------------+
| emp.name | empi.employee_id | emph.sin_number |
+-----------+-------------------+------------------+
| Michael | 100 | 547-968-091 |
| Will | 101 | 527-948-090 |
| Lucy | 103 | 577-928-094 |
+-----------+-------------------+------------------+

--4. Self-join is where one table joins itself. When doing such joins,
--a different alias should be given to distinguish the same table
SELECT
emp.name -- Use alias before column name
FROM employee as emp
JOIN employee as emp_b -- Here, use a different alias
ON emp.name = emp_b.name;
+-----------+
| emp.name |
+-----------+
| Michael |
| Will |
| Shelley |
| Lucy |
+-----------+

--5. Perform an implicit join without using the JOIN keyword.
--This is only applicable to the INNER JOIN
SELECT
emp.name, emph.sin_number
FROM
employee emp, employee_hr emph -- Only applies for inner join
WHERE emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| Lucy | 577-928-094 |
+-----------+------------------+

--6. The join condition uses different columns, which will create an additional job
SELECT
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
+-----------+-------------------+------------------+
| emp.name | empi.employee_id | emph.sin_number |
+-----------+-------------------+------------------+
| Michael | 100 | 547-968-091 |
| Will | 101 | 527-948-090 |
| Lucy | 103 | 577-928-094 |
+-----------+-------------------+------------------+

If JOIN uses different columns in its conditions, it will request an additional job to complete the join. If the JOIN operation uses the same column in the join conditions, it will join on this condition using one job.

When JOIN is performed between multiple tables, YARN/MapReduce jobs are created to process the data in HDFS. Each of the jobs is called a stage. Usually, it is suggested to put the big table at the end of the JOIN statement for better performance and to avoid Out Of Memory (OOM) exceptions. This is because the last table in the JOIN sequence is usually streamed through the reducers, whereas the others are buffered in the reducer by default. Also, a hint, /*+ STREAMTABLE(table_name) */ , can be specified to advise which table should be streamed, overriding the default decision, as in the following example:

SELECT /*+ STREAMTABLE(employee_hr) */
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;

3.2 OUTER JOIN

Besides INNER JOIN , HQL also supports regular OUTER JOIN and FULL JOIN . The logic of such a join is the same as what's in the SQL. The following table summarizes the differences between common joins. Here, we assume table_m has m rows and table_n has n rows with one-to-one mapping.

Join type | Logic | Rows returned
table_m JOIN table_n | This returns all rows matched in both tables. | m ∩ n
table_m LEFT JOIN table_n | This returns all rows in the left table and matched rows in the right table. If there is no match in the right table, it returns NULL in the right table. | m
table_m RIGHT JOIN table_n | This returns all rows in the right table and matched rows in the left table. If there is no match in the left table, it returns NULL in the left table. | n
table_m FULL JOIN table_n | This returns all rows in both tables and matched rows in both tables. If there is no match in the left or right table, it returns NULL instead. | m + n - m ∩ n
table_m CROSS JOIN table_n | This returns all row combinations in both the tables to produce a Cartesian product. | m*n

The following examples demonstrate the different OUTER JOINs:

SELECT
emp.name, emph.sin_number
FROM employee emp -- All rows in left table returned
LEFT JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| Shelley | NULL | -- NULL for mismatch
| Lucy | 577-928-094 |
+-----------+------------------+

SELECT
emp.name, emph.sin_number
FROM employee emp -- All rows in right table returned
RIGHT JOIN employee_hr emph
ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| NULL | 647-968-598 | -- NULL for mismatch
| Lucy | 577-928-094 |
+-----------+------------------+
4 rows selected (34.485 seconds)

SELECT
emp.name, emph.sin_number
FROM employee emp -- Rows from both side returned
FULL JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Lucy | 577-928-094 |
| Michael | 547-968-091 |
| Shelley | NULL | -- NULL for mismatch
| NULL | 647-968-598 | -- NULL for mismatch
| Will | 527-948-090 |
+-----------+------------------+

The CROSS JOIN statement does not have a join condition. It can also be written as a join without a condition or with an always-true condition, such as 1 = 1. In this way, we can join any datasets with cross joins. However, we should only consider using such joins when we have to link data that has no natural relationship, such as adding headers with a row count to a table. The following are three equivalent ways of writing CROSS JOIN :

SELECT
emp.name, emph.sin_number
FROM employee as emp
CROSS JOIN
employee_hr as emph;

SELECT
emp.name, emph.sin_number
FROM employee as emp
JOIN
employee_hr as emph;

SELECT
emp.name, emph.sin_number
FROM employee as emp
JOIN
employee_hr as emph
on 1=1;

Although Hive did not support unequal joins explicitly in earlier versions, there is a workaround that uses CROSS JOIN and WHERE , as in this example:

SELECT
emp.name, emph.sin_number
FROM employee emp
CROSS JOIN employee_hr emph
WHERE emp.name <> emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 527-948-090 |
| Michael | 647-968-598 |
| Michael | 577-928-094 |
| Will | 547-968-091 |
| Will | 647-968-598 |
| Will | 577-928-094 |
| Shelley | 547-968-091 |
| Shelley | 527-948-090 |
| Shelley | 647-968-598 |
| Shelley | 577-928-094 |
| Lucy | 547-968-091 |
| Lucy | 527-948-090 |
| Lucy | 647-968-598 |
+-----------+------------------+
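
The row counts in the join summary table and in the result sets above can be cross-checked with a small script. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive; the names and SIN numbers are taken from the sample output in this section, while the simplified single-table layout is an assumption. RIGHT JOIN is written as a LEFT JOIN with the tables swapped, and the FULL JOIN count is derived with the formula from the table, so the sketch does not depend on newer SQLite features.

```python
import sqlite3

# SQLite stands in for Hive here; the table layout is simplified.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT)")
conn.execute("CREATE TABLE employee_hr (name TEXT, sin_number TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?)",
    [("Michael",), ("Will",), ("Shelley",), ("Lucy",)],
)
conn.executemany(
    "INSERT INTO employee_hr VALUES (?, ?)",
    [("Michael", "547-968-091"), ("Will", "527-948-090"),
     ("Steven", "647-968-598"), ("Lucy", "577-928-094")],
)


def count(sql):
    """Return the single count(*) value produced by the statement."""
    return conn.execute(sql).fetchone()[0]


inner = count("SELECT count(*) FROM employee e "
              "JOIN employee_hr h ON e.name = h.name")
left = count("SELECT count(*) FROM employee e "
             "LEFT JOIN employee_hr h ON e.name = h.name")
# RIGHT JOIN expressed as a LEFT JOIN with the tables swapped
right = count("SELECT count(*) FROM employee_hr h "
              "LEFT JOIN employee e ON e.name = h.name")
cross = count("SELECT count(*) FROM employee e CROSS JOIN employee_hr h")
unequal = count("SELECT count(*) FROM employee e "
                "CROSS JOIN employee_hr h WHERE e.name <> h.name")
# FULL JOIN count derived as m + n - matched rows (one-to-one mapping)
full = left + right - inner

print(inner, left, right, full, cross, unequal)
```

With four rows on each side and three matching names, the unequal join keeps every combination of the Cartesian product except the three matching pairs, which is why it returns so many rows.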

3.3 Special joins

HQL also supports some special joins that we usually do not see in relational databases, such as MapJoin and Semi-join .

MapJoin means doing the join operation only in the map phase, without a reduce job. The MapJoin statement reads all the data from the small table into memory and broadcasts it to all mappers. During the map phase, the join operation is performed by comparing each row of data in the big table with the small tables against the join conditions. Because no reduce phase is needed, this kind of join usually has better performance. In newer versions of Hive, Hive automatically converts a join to a MapJoin at runtime if possible. However, you can also manually specify the broadcast table by providing a join hint, /*+ MAPJOIN(table_name) */ . In addition, MapJoin can be used for unequal joins to improve performance, since both MapJoin and WHERE are performed in the map phase. The following is an example of using a MapJoin hint with CROSS JOIN :

SELECT
/*+ MAPJOIN(employee) */ emp.name, emph.sin_number
FROM employee as emp
CROSS JOIN
employee_hr as emph
WHERE emp.name <> emph.name;
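
The idea behind a map-side join can be sketched in plain Python. This is only an illustration of the general broadcast hash join technique for an equal join, not Hive's actual implementation; the function name map_side_join and the dictionary-based row format are hypothetical.

```python
from collections import defaultdict


def map_side_join(big_rows, small_rows, key):
    """Equal join of two lists of dicts without a shuffle or reduce step.

    The small table is loaded fully into an in-memory hash table (the
    "broadcast" copy); the big table is then streamed row by row.
    """
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    for row in big_rows:  # streamed, never held in memory as a whole
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Sample rows based on the output shown in this section
employee = [{"name": n} for n in ("Michael", "Will", "Shelley", "Lucy")]
employee_hr = [
    {"name": "Michael", "sin_number": "547-968-091"},
    {"name": "Will", "sin_number": "527-948-090"},
    {"name": "Steven", "sin_number": "647-968-598"},
    {"name": "Lucy", "sin_number": "577-928-094"},
]

# employee plays the role of the small broadcast table
result = map_side_join(employee_hr, employee, "name")
print([row["name"] for row in result])
```

Because only the small table has to fit in memory, the cost of the join is dominated by a single pass over the big table, which is the reason such joins usually perform better.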

The MapJoin operation does not support the following:

  • Using MapJoin after UNION ALL , LATERAL VIEW , GROUP BY / JOIN / SORT BY / CLUSTER BY / DISTRIBUTE BY
  • Using MapJoin before UNION , JOIN , and another MapJoin

Bucket MapJoin is a special type of MapJoin that uses bucket columns (the columns specified by CLUSTERED BY in the CREATE TABLE statement) as the join condition. Instead of fetching the whole table, as done by the regular MapJoin , bucket MapJoin only fetches the required bucket data. To enable bucket MapJoin , we need to enable some settings and make sure the bucket numbers are multiples of each other. If both joined tables are sorted and bucketed with the same number of buckets, a sort-merge join can be performed instead of caching all small tables in memory:

SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

In addition, the LEFT SEMI JOIN statement is also a type of MapJoin . It is the same as a subquery with IN / EXISTS after v0.13.0 of Hive. However, it is not recommended for use since it is not part of standard SQL:

SELECT a.name
FROM employee as a
LEFT SEMI JOIN
employee_id as b
ON a.name = b.name;

4. Union

When we want to combine data with the same schema, we often use set operations. Regular set operations in relational databases are INTERSECT , MINUS , and UNION / UNION ALL . HQL only supports UNION and UNION ALL . The difference between them is that UNION ALL does not remove duplicate rows while UNION does. In addition, the corresponding columns of all unioned data sets must have the same name and data type, or else an implicit conversion will be done, which may cause a runtime exception. If ORDER BY , SORT BY , CLUSTER BY , DISTRIBUTE BY , or LIMIT are used, they are applied to the whole result set after the union:

SELECT a.name as nm
FROM employee a
UNION ALL -- Use column alias to make the same name for union
SELECT b.name as nm
FROM employee_hr b;
+-----------+
|nm |
+-----------+
| Michael |
| Will |
| Shelley |
| Lucy |
| Michael |
| Will |
| Steven |
| Lucy |
+-----------+

SELECT a.name as nm FROM employee a
UNION -- UNION removes duplicate names and is slower
SELECT b.name as nm FROM employee_hr b;
+----------+
|nm |
+----------+
| Lucy |
| Michael |
| Shelley |
| Steven |
| Will |
+----------+

-- ORDER BY applies to the unioned data
-- When you want to order only one data set,
-- use ORDER BY in a subquery
SELECT a.name as nm FROM employee a
UNION ALL
SELECT b.name as nm FROM employee_hr b
ORDER BY nm;
+----------+
|nm |
+----------+
| Lucy |
| Lucy |
| Michael |
| Michael |
| Shelley |
| Steven |
| Will |
| Will |
+----------+
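
The difference between the two operators, and the fact that ORDER BY applies to the whole unioned result, can be reproduced with a small script. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive; the names come from the sample output above, and the single-column table layout is an assumption.

```python
import sqlite3

# SQLite stands in for Hive here; the table layout is simplified.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT)")
conn.execute("CREATE TABLE employee_hr (name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?)",
    [("Michael",), ("Will",), ("Shelley",), ("Lucy",)],
)
conn.executemany(
    "INSERT INTO employee_hr VALUES (?)",
    [("Michael",), ("Will",), ("Steven",), ("Lucy",)],
)

# UNION ALL keeps duplicates; ORDER BY sorts the combined result
union_all = [row[0] for row in conn.execute(
    "SELECT name AS nm FROM employee "
    "UNION ALL "
    "SELECT name AS nm FROM employee_hr "
    "ORDER BY nm"
)]

# UNION removes duplicates
union_distinct = [row[0] for row in conn.execute(
    "SELECT name AS nm FROM employee "
    "UNION "
    "SELECT name AS nm FROM employee_hr "
    "ORDER BY nm"
)]

print(len(union_all), len(union_distinct))
```

UNION has to compare rows to remove duplicates, which is the extra work behind the "slower" remark in the example above.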

For other set operations that HQL does not support yet, such as INTERSECT and MINUS , we can use a join or a left join to implement them as follows:

-- Use join for set intersect
SELECT a.name
FROM employee a
JOIN employee_hr b
ON a.name = b.name;
+----------+
| a.name |
+----------+
| Michael |
| Will |
| Lucy |
+----------+

-- Use left join for set minus
SELECT a.name
FROM employee a
LEFT JOIN employee_hr b
ON a.name = b.name
WHERE b.name IS NULL;
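
To confirm that the join-based rewrites behave like real set operations, they can be compared against a database that has INTERSECT and EXCEPT built in. The following is a minimal sketch that uses Python's built-in sqlite3 module for this purpose (SQLite calls the minus operation EXCEPT); the names come from the sample output above, and the single-column table layout is an assumption.

```python
import sqlite3

# SQLite is used only as a reference implementation of the set operators.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT)")
conn.execute("CREATE TABLE employee_hr (name TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?)",
    [("Michael",), ("Will",), ("Shelley",), ("Lucy",)],
)
conn.executemany(
    "INSERT INTO employee_hr VALUES (?)",
    [("Michael",), ("Will",), ("Steven",), ("Lucy",)],
)


def names(sql):
    """Run the statement and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# Intersect: inner join versus the native operator
join_intersect = names(
    "SELECT a.name FROM employee a "
    "JOIN employee_hr b ON a.name = b.name ORDER BY a.name")
native_intersect = names(
    "SELECT name FROM employee INTERSECT "
    "SELECT name FROM employee_hr ORDER BY name")

# Minus: left join with an IS NULL filter versus the native operator
join_minus = names(
    "SELECT a.name FROM employee a "
    "LEFT JOIN employee_hr b ON a.name = b.name "
    "WHERE b.name IS NULL ORDER BY a.name")
native_minus = names(
    "SELECT name FROM employee EXCEPT "
    "SELECT name FROM employee_hr ORDER BY name")

print(join_intersect, join_minus)
```

The results match here because the names are unique in each table. If a name appeared more than once, the join-based versions would return duplicate rows, so a DISTINCT would be needed to keep true set semantics.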
