[Hive - LanguageManual] GroupBy
Group By Syntax
groupByClause: GROUP BY groupByExpression (, groupByExpression)* groupByExpression: expression groupByQuery: SELECT expression (, expression)* FROM src groupByClause? |
Simple Examples
In order to count the number of rows in a table:
SELECT COUNT(*) FROM table2; |
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
In order to count the number of distinct users by gender one could write the following query:
INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count (DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender; |
Multiple aggregations can be done at the same time, however, no two aggregations can have different DISTINCT columns. For example, the following is possible because count(DISTINCT) and sum(DISTINCT) specify the same column:
INSERT OVERWRITE TABLE pv_gender_agg SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender; |
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
However, the following query is not allowed. We don't allow multiple DISTINCT expressions in the same query.
INSERT OVERWRITE TABLE pv_gender_agg SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip) FROM pv_users GROUP BY pv_users.gender; |
Select statement and group by clause
When using group by clause, the select statement can only include columns included in the group by clause. Of course, you can have as many aggregation functions (e.g. count
) in the select statement as well.
Let's take a simple example
CREATE TABLE t1(a INTEGER, b INTGER); |
A group by query on the above table could look like:
SELECT a, sum(b) FROM t1 GROUP BY a; |
The above query works because the select clause contains a
(the group by key) and an aggregation function (sum(b)
).
However, the query below DOES NOT work:
SELECT a, b FROM t1 GROUP BY a; |
This is because the select clause has an additional column (b
) that is not included in the group by clause (and it's not an aggregation function either). This is because, if the table t1
looked like:
a b ------ 100 1 100 2 100 3 |
Since the grouping is only done on a
, what value of b
should Hive display for the group a=100
? One can argue that it should be the first value or the lowest value but we all agree that there are multiple possible options. Hive does away with this guessing by making it invalid SQL (HQL, to be precise) to have a column in the select clause that is not included in the group by clause.
Advanced Features 高级特性
Multi-Group-By Inserts
The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilitites). (可将输出结果插入新表中或者直接覆盖到HDFS上的文件目录)
e.g. if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:
FROM pv_users INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count(DISTINCT pv_users.userid) GROUP BY pv_users.gender INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum' SELECT pv_users.age, count(DISTINCT pv_users.userid) GROUP BY pv_users.age; |
Map-side Aggregation for Group By
hive.map.aggr controls how we do aggregations. The default is false. If it is set to true, Hive will do the first-level aggregation directly in the map task.
This usually provides better efficiency, but may require more memory to run successfully.
hive.map.aggr控制如何聚合,默认是false,如果设置为true, Hive将会在map端做第一级的聚合,这通常提供更好的效果,但是要求更多的内存才能运行成功。
set hive.map.aggr= true ; SELECT COUNT(*) FROM table2; |
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
Grouping Sets, Cubes, Rollups, and the GROUPING__ID Function
Version
Icon
Grouping sets, CUBE and ROLLUP operators, and the GROUPING__ID function were added in Hive release 0.10.0.
See Enhanced Aggregation, Cube, Grouping and Rollup for information about these aggregation operators.
Also see the JIRAs:
- HIVE-2397 Support with rollup option for group by
- HIVE-3433 Implement CUBE and ROLLUP operators in Hive
- HIVE-3471 Implement grouping sets in Hive
- HIVE-3613 Implement grouping_id function
New in Hive release 0.11.0:
- HIVE-3552 HIVE-3552 performant manner for performing cubes/rollups/grouping sets for a high number of grouping set keys
[Hive - LanguageManual] GroupBy的更多相关文章
- [Hive - LanguageManual ] Windowing and Analytics Functions (待)
LanguageManual WindowingAndAnalytics Skip to end of metadata Added by Lefty Leverenz, last edi ...
- [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)
Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...
- [Hive - LanguageManual] Import/Export
LanguageManual ImportExport Skip to end of metadata Added by Carl Steinbach, last edited by Le ...
- [Hive - LanguageManual] DML: Load, Insert, Update, Delete
LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...
- [Hive - LanguageManual] Alter Table/Partition/Column
Alter Table/Partition/Column Alter Table Rename Table Alter Table Properties Alter Table Comment Add ...
- Hive LanguageManual DDL
hive语法规则LanguageManual DDL SQL DML 和 DDL 数据操作语言 (DML) 和 数据定义语言 (DDL) 一.数据库 增删改都在文档里说得也很明白,不重复造车轮 二.表 ...
- [Hive - LanguageManual ] ]SQL Standard Based Hive Authorization
Status of Hive Authorization before Hive 0.13 SQL Standards Based Hive Authorization (New in Hive 0. ...
- [Hive - LanguageManual] Hive Concurrency Model (待)
Hive Concurrency Model Hive Concurrency Model Use Cases Turn Off Concurrency Debugging Configuration ...
- [Hive - LanguageManual ] Explain (待)
EXPLAIN Syntax EXPLAIN Syntax Hive provides an EXPLAIN command that shows the execution plan for a q ...
随机推荐
- java调用phantomjs采集ajax加载生成的网页
java调用phantomjs采集ajax加载生成的网页 日前有采集需求,当我把所有的对应页面的链接都拿到手,准备开始根据链接去采集(写爬虫爬取)对应的终端页的时候,发觉用程序获取到的数据根本没有对应 ...
- 随机森林——Random Forests
[基础算法] Random Forests 2011 年 8 月 9 日 Random Forest(s),随机森林,又叫Random Trees[2][3],是一种由多棵决策树组合而成的联合预测模型 ...
- windows线程同步
一.前言 之前在项目中,由于需要使用到多线程,多线程能够提高执行的效率,同时也带来线程同步的问题,故特此总结如下. 二.windows线程同步机制 windows线程同步机制常用的有几种:Event. ...
- android线程池
线程池的基本思想还是一种对象池的思想,开辟一块内存空间,里面存放了众多(未死亡)的线程,池中线程执行调度由池管理器来处理.当有线程任务时,从池中取一个,执行完成后线程对象归池,这样可以避免反复创建线程 ...
- 宏HASH_GET_FIRST
/*******************************************************************//** Gets the first struct in a ...
- bzoj1934: [Shoi2007]Vote 善意的投票
最大流..建图方式都是玄学啊.. //Dinic是O(n2m)的. #include<cstdio> #include<cstring> #include<cctype& ...
- 【多端应用开发系列0.0.0——之总序】xy多端应用开发方案定制
[目录] 0.0.0 [多端应用开发系列之总序]服务器Json数据处理——Json数据概述 0.0.0 [因] 正在学习多客户端应用开发,挖个坑,把所用到的技术方案,用最简单直白的语言描述出来,写成一 ...
- java常量池
Java的堆是一个运行时数据区,类的(对象从中分配空间.这些对象通过new.newarray. anewarray和multianewarray等指令建立,它们不需要程序代码来显式的释放.堆是由垃圾回 ...
- 一天一个Java基础——泛型
这学期的新课——设计模式,由我仰慕已久的老师传授,可惜思维过快,第一节就被老师挑中上去敲代码,自此在心里烙下了阴影,都是Java基础欠下的债 这学期的新课——算法设计与分析,虽老师不爱与同学互动式的讲 ...
- Ch04-文字列表的设计
Ch04: 文字列表的设计 4.1 编号列表 语法: <OL> <li>编号1 <li>编号2 ...... </OL> 属性: 编号类型:type=1 ...