Hive Sql的日常使用笔记

date: 2019-03-22 17:02:37

updated: 2020-04-08 16:00:00

Hive Sql的日常使用笔记

1. distinct 和 group by

distinct 只能返回去重的列

group by 理论上是需要在 select 后面配合聚合函数（sum, avg, max, min, count）来使用的

select 后面没有在聚合函数中的列都需要写在 group by 的后面，eg：

select A1, A2, A3, max(B) from test group by A1, A2, A3

select distinct(A), B from test 和 select distinct A, B from test 返回的是一样的，返回的都是 A 和 B 的组合值没有重复的结果

如果需要去重的话，select 需要的列 from A, (select min(主键) from A group by 要去重的列)B where A.主键 = B.主键

2. left join

A left join B on A.a1 = B.b and A.a2 = 'val' 保留 A 的所有结果，同时 on 后面的结果只针对这一条 left join 语句，不会影响最后的筛选，所以如果要保留所有 A.a2 = 'val' 的数据的话，应该在最后写 where A.a2 = 'val'

3. if 和 case when

一层： if(A.a = 'key', val1, val2)

两层： if(A.a = 'key1', val1, if(A.a = 'key2', val2, val3))

case A.a
when 'key1' then 'val1'
when 'key2' then 'val2'
else 'val3'
end

4. regexp_replace

替换掉 A.a 字段中的点 regexp_replace(A.a, "\\.", "")

5. cast

cast(A.a as int)

6. concat

concat(A.a,'-01-01')

7. concat_ws 行转列

concat_ws('/', collect_set(label_name))label_name group by key

concat 会自动把参数转换为字符串进行拼接，但是 concat_ws 需要提前将数值 cast 成字符串

8. regexp

where trim(A.a) regexp '^[1-9 | A-E]'

9. from_unixtime

from_unixtime(unix_timestamp(),'yyyy-MM-dd hh:mm:ss') update_time

10. 索引

索引基数：数据列所包含的不同值的数量。索引的基数相对于数据表行数较高（即列中包含很多不同的值，重复的值很少），这个时候创建的索引效果更好。因为查询优化器如果发现某个值出现在表的数据行中的百分比很高的时候，可能会忽略索引直接进行全盘扫描。

11. row_number()

row_number() over(partition by 主键 order by 字段1 desc)as r

12.多个条件

if(boolean, res1, if(boolean, res2, if(boolean, res3,……)))

13.nvl

nvl(a,b) 如果a为null 那就取b 只能判断null 不能判断空

14.decimal

cast(round((sum(an) * 100 / sum(tn)),2) as decimal(20,2))
round((sum(an) * 100 / sum(tn)),2) 可能会出现 84.6599999999 的数据，可能只是 hue 上的 bug，需要再处理一下

15.补零

if(substr(bp, 1, 1) != 'X' and bp is not null and trim(bp) != 'null' and trim(bp) != '' and trim(bp) != '*', lpad(bp, 10, '0'), bp)as temp

16.推数编码出错

Incorrect string value: '\xF0\xAF\xA0\xA5' for column 'customer_txt' at row 45

以下几种方法有可能可以解决：

修改字符集为 utf8mb4
--default-character-set gbk 在 sqoop 最后转换为 gbk 编码
出错原因是 can't parse data ' ' 之类的话，可能是因为特殊字符，需要一个个排查，可以试一下 regexp_replace(field,'\0','')

17.去重

select distinct t1.objnr
from jest t1
where t1.stat = 'I0002'
and t1.objnr not in (select distinct objnr from jest where stat = 'I0076');
==> 
select * from
(select objnr, min(if(stat = 'I0002', 0, 1))a, max(if(stat = 'I0076', 1, 0))b
from jest group by objnr)t where t.a = 0 and t.b = 1;

18.插入数据

insert into table values ('05:00','6') ,('06:00','7'),
('07:00','8'), ('08:00','9'), ('09:00','10'), ('10:00','11'), ('11:00','12'), ('12:00','13'),
('13:00','14');

19.每隔N小时取数分组求和

通过截取字符串，获取时间，除以N保留整数位，分组

20.正则表达式的匹配

select regexp_replace('[3]','[','a')	不识别
select regexp_replace('[3]','\[','a')	不识别
select regexp_replace('[3]','\\[','a')	识别
select regexp_replace('[3]',']','a')	识别
select regexp_replace('[3]','\]','a')	识别
select regexp_replace('[3]','\\]','a')	识别
select regexp_replace('[3,.]',',','a')	识别
select regexp_replace('[3,.]','.','a')	aaaaa	.会识别为通配符，即所有字符
select regexp_replace('[3,.]','\.','a')	不识别
select regexp_replace('[3,.]','\\.','a')	[3,a]	识别

21.空字符串关联

(NULL) 在 join on 的时候不会被关联出来
'' ' ' 在 join on 的时候会相互关联，只要有匹配项就会关联出来
(NULL) 只在 left jion 的时候可以出来

22.lateral view explode 列转行

select gid, assist
from origin_ennenergy_intelligent_operation.wo_baseinfo
lateral view explode
(split(split(split(regexp_replace(assist_man, '"',""), "\\[")[1],"\\]")[0], ",")) aa as assist

23. userids 包含 gid

find_in_set(gid, userids)  # userids 需要用逗号分隔
instr(userids, gid) 会让 ('2,11', '1') 也被筛选出来

24.pmod(a, b) 取模（取余）运算

pmod(7, 4) -> 3
pmod(9, -4) -> -3
# 9 mod -4 商可以取 -2 或 -3，因为 -3 < -2，所以商取 -3，-4 * -3 = 12, 9 - 12 = -3，所以最终结果是 -3
pmod(-9, 4) -> 3
# -9 mod 4 商可以取 -2 或 -3，因为 -3 < -2，所以商取 -3，4 * -3 = -12, -9 - （-12） = 3，所以最终结果是 3

25.insert 和 insert overwrite

insert overwrite 本质是先覆盖原先的表，再把新的数据插入

26.insert 和 insert into

insert table user(id, name) values(1, 'zhangsan');
insert into table user values(1, 'zhangsan');

27.max(if) 或者 max(case when) 行转列

select id,
max(if(name = 'buss', pass, '')) as buss,
max(if(name = 'record', pass, '')) as record,
max(if(name = 'stock', pass, '')) as stock,
max(if(name = 'client', pass, '')) as client
from test0610
group by id;
select id,
MAX(CASE WHEN name = 'buss' THEN pass END) AS buss,
MAX(CASE WHEN name = 'record' THEN pass END) AS record,
MAX(CASE WHEN name = 'stock' THEN pass END) AS stock,
MAX(CASE WHEN name = 'client' THEN pass END) AS client
from test0610
group by id;

28.复制表结构但不复制数据

create table Test2 like Test1;

29.两个限制条件

要让每一条数据都满足 balance < 0，gas > 0，转换成 count(if(balance >= 0, 1, 0)) as a, count(if(gas <= 0, 1, 0)) as b，然后 where 中取 a = 0 and b = 0 的数据

30.join优化

多表join，其中有一个是最大的表

select /*+ STREAMTABLE(table1) */ tabl1.a, table1.b, table2.c
from table1 join table2 on ......

hive会假定查询中最后一个表是最大的表。在对每行记录进行连接操作的时候，会尝试将其他表缓存起来，然后扫描最后那个表进行计算。通过 /*+ STREAMTABLE(table1) */ 可以显式地指出最大的表是哪一个，hive会将其作为驱动表，哪怕它不是在 join 的最后面

多表join，其中有一个是最小的表

select /*+ MAPJOIN(table1) */ tabl1.a, table1.b, table2.c
from table1 join table2 on ......

如果所有表中只有一张表是小表，那么可以在最大的表通过mapper的时候将小表完全放到内存中。Hive可以在map端执行连接过程（称为map-side JOIN），这是因为hive可以和内存中的小表进行逐一匹配，从而省略掉常规连接操作所需要的reduce过程。

31.导出mysql数据库的注释

SELECT
	COLUMN_NAME 列名,
	COLUMN_TYPE 数据类型,
	DATA_TYPE 字段类型,
	CHARACTER_MAXIMUM_LENGTH 长度,
	IS_NULLABLE 是否为空,
	COLUMN_DEFAULT 默认值,
	COLUMN_COMMENT 备注
FROM
	INFORMATION_SCHEMA.COLUMNS
	table_schema = '库名'
	AND
-- 如果不写的话，默认会查询出所有表中的数据，这样可能就分不清到底哪些字段是哪张表中的了，所以还是建议写上要导出的名名称
	table_name = '表名'

32. count()

count(*)：所有行进行统计，包括NULL行
count(1)：所有行进行统计，包括NULL行
count(column)：统计column所有非NULL行

33. sum(if...) 和 count(if...)

在计数的时候，以下两种方式是等效的：

sum(if(条件判断, 1, 0))
count(if(条件判断, 字段, NULL))

原因在于 NULL 并不参加任何计算（但是升序排序的时候 NULL 是第一个，因为ASC码最小）。

34. sum() over(...)

sum() over(partition by year) 计算结果是按照 year 分组来 sum 的总和
sum() over(partition by year order by month) 计算结果是按照 year 分组，1月算的是1月的和，2月算的是 1 - 2月的和， 3月算的是 1 - 3月的和

同理适用于其他的窗口函数

35. count(distinct a) 不支持窗口函数，即不支持后面跟着 over(partiton by 1)

36. size 函数返回后面map或者array里面的元素个数

size(collect_set(字段)) over(partiton by 1)

37. 设置变量

set myvar = "test";
select ${hiveconf:myvar}

38. coalesce() 函数

依次查看每一个表达式，返回第一个非空值，如果都是空，最后返回空值

coalesce(expression_1, expression_2, ...,expression_n)

39. count(distinct) 优化

直接使用 count(distinct) 会把数据放入到一个 reducer 下进行计算，导致运行缓慢。

比如统计本月活跃用户个数

优化一：

-- 外层SELECT求和
SELECT
    SUM(mau_part) mau
FROM
(
    -- 内层SELECT分别进行COUNT(DISTINCT)计算
    SELECT
    substr(uuid, 1, 3) uuid_part,
    COUNT(DISTINCT substr(uuid, 4)) AS mau_part
    FROM detail_sdk_session
    WHERE partition_date >= '2020-01-01' AND partition_date <= nowdate
    GROUP BY substr(uuid, 1, 3)
) t;
∵ uuid 是唯一的 ∴ 最后的结果 每个uuid_part对应的mau_part都是1，最后求和就是用户总数
这个写法的好处是，利用 group by 将数据分散到不用的 reducer 进行 count(distinct)计算，提高查询效率

这里 uuid 只截取了3位，如果截取位数过多，假设 uuid 长度为30位，内层 group by 会产生 30^n 个分组，那么外层也需要进行 30^n 次求和

优化二：

--  第三层SELECT
SELECT
    SUM(s.mau_part) mau
FROM
(
    -- 第二层SELECT
    SELECT
        tag,
        COUNT(*) mau_part
    FROM
    (
        -- 第一层SELECT
        SELECT
            uuid,
            CAST(RAND() * 100 AS BIGINT) tag  -- 为去重后的uuid打上标记，标记为：0-100之间的整数。
        FROM detail_sdk_session
        WHERE partition_date >= '2020-01-01'
        AND partition_date <= nowdate
        AND uuid IS NOT NULL
        GROUP BY uuid   -- 通过GROUP BY，保证去重
    )t
  GROUP BY tag
)s;
第一层SELECT：对uuid进行去重，并为去重后的uuid打上整数标记
第二层SELECT：按照标记进行分组，统计每个分组下uuid的个数
第三层SELECT：对所有分组进行求和

如果数据量确实很大，也可以增加分组的个数。例如：CAST(RAND() * 1000 AS BIGINT) tag