[HIve - LanguageManual] Sort/Distribute/Cluster/Order By
Syntax of Order By
The ORDER BY syntax in Hive QL is similar to the syntax of ORDER BY in SQL language.
colOrder: ( ASC | DESC ) orderBy: ORDER BY colName colOrder? ( ',' colName colOrder?)* query: SELECT expression ( ',' expression)* FROM src orderBy |
There are some limitations in the "order by" clause. In the strict mode (i.e., hive.mapred.mode=strict), the order by clause has to be followed by a "limit" clause. (如果hive.mapred.mode=strict,那么order by 必须和limit一起使用)The limit clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish.(排序数据是由一个reducer完成并且最终输出,如果没有limit限制那么,那么整个操作必须相当长的时间才能完成。)
Syntax of Sort By
The SORT BY syntax is similar to the syntax of ORDER BY in SQL language.
colOrder: ( ASC | DESC ) sortBy: SORT BY colName colOrder? ( ',' colName colOrder?)* query: SELECT expression ( ',' expression)* FROM src sortBy |
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.
Difference between Sort By and Order By
Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.
Note: It may be confusing as to the difference between SORT BY alone of a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field and SORT BY if there are multiple reducers partitions randomly in order to distribute data (and load) uniformly across the reducers.
Basically, the data in each reducer will be sorted according to the order that the user specified. The following example shows
SELECT key, value FROM src SORT BY key ASC, value DESC |
The query had 2 reducers, and the output of each is:
0 5 0 3 3 6 9 1 |
0 4 0 3 1 1 2 5 |
Setting Types for Sort By
After a transform, variable types are generally considered to be strings, meaning that numeric data will be sorted lexicographically. To overcome this, a second SELECT statement with casts can be used before using SORT BY.
FROM (FROM (FROM src SELECT TRANSFORM(value) USING 'mapper' AS value, count) mapped SELECT cast(value as double ) AS value, cast(count as int ) AS count SORT BY value, count) sorted SELECT TRANSFORM(value, count) USING 'reducer' AS whatever |
Syntax of Cluster By and Distribute By
Cluster By and Distribute By are used mainly with the Transform/Map-Reduce Scripts. But, it is sometimes useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries.
Cluster By is a short-cut for both Distribute By and Sort By.
Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However,Distribute By does not guarantee clustering or sorting properties on the distributed keys.
For example, we are Distributing By x on the following 5 rows to 2 reducer:
x1 x2 x4 x3 x1 |
Reducer 1 got
x1 x2 x1 |
Reducer 2 got
x4 x3 |
Note that all rows with the same key x1 is guaranteed to be distributed to the same reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent positions.
In contrast, if we use Cluster By x, the two reducers will further sort rows on x:
Reducer 1 got
x1 x1 x2 |
Reducer 2 got
x3 x4 |
Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.
SELECT col1, col2 FROM t1 CLUSTER BY col1 |
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC |
FROM ( FROM pv_users MAP ( pv_users.userid, pv_users.date ) USING 'map_script' AS c1, c2, c3 DISTRIBUTE BY c2 SORT BY c2, c1) map_output INSERT OVERWRITE TABLE pv_users_reduced REDUCE ( map_output.c1, map_output.c2, map_output.c3 ) USING 'reduce_script' AS date, count; |
[HIve - LanguageManual] Sort/Distribute/Cluster/Order By的更多相关文章
- hive中Sort By,Order By,Cluster By,Distribute By,Group By的区别
order by: hive中的order by 和传统sql中的order by 一样,对数据做全局排序,加上排序,会新启动一个job进行排序,会把所有数据放到同一个reduce中进行处理,不管数 ...
- [转]hive中order by,distribute by,sort by,cluster by
转至http://my.oschina.net/repine/blog/296562 order by,distribute by,sort by,cluster by 查询使用说明 1 2 3 4 ...
- [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)
Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...
- Hive LanguageManual DDL
hive语法规则LanguageManual DDL SQL DML 和 DDL 数据操作语言 (DML) 和 数据定义语言 (DDL) 一.数据库 增删改都在文档里说得也很明白,不重复造车轮 二.表 ...
- [HIve - LanguageManual] Joins
Hive Joins Hive Joins Join Syntax Examples MapJoin Restrictions Join Optimization Predicate Pushdown ...
- [HIve - LanguageManual] Transform [没懂]
Transform/Map-Reduce Syntax SQL Standard Based Authorization Disallows TRANSFORM TRANSFORM Examples ...
- [Hive - LanguageManual] Select base use
Select Syntax WHERE Clause ALL and DISTINCT Clauses Partition Based Queries HAVING Clause LIMIT Clau ...
- [Hive - LanguageManual] DML: Load, Insert, Update, Delete
LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...
- hive 中的Sort By、 Order By、Cluster By、Distribute By 区别
Order by: order by 会对输入做全局排序,因此只有一个reducer(多个reducer无法保证全局有序)只有一个reducer,会导致当输入规模较大时,需要较长的计算时间.在hive ...
随机推荐
- 企业用户2T(含秒传),普通用户20G
周鸿祎一定要看的建议(要求置顶):可以解决本次云盘事件的建议!!! 2016-10-23 20:23 | 复制链接 | 淘帖 461334 本帖最后由 cqthxin 于 2016-10-23 20: ...
- alias 命令
功能说明:设置指令的别名. 语 法:alias[别名]=[指令名称] 参 数 :若不加任何参数,则列出目前所有的别名设置. 举 例 :ermao@lost-desktop:~$ alias ...
- [原创]使用命令行工具提升cocos2d-x开发效率(一)之TexturePacker篇
TexturePacker是一个常用的制作sprite sheet的工具,它提供了很多实用的功能. 一般我们制作sprite sheet都是使用他的gui版本,纯手工操作,就像下面这张图示的一样. 刚 ...
- 在Ubuntu为Android硬件抽象层(HAL)模块编写JNI方法提供Java访问硬件服务接口(老罗学习笔记4)
在上两篇文章中,我们介绍了如何为Android系统的硬件编写驱动程序,包括如何在Linux内核空间实现内核驱动程序和在用户空间实现硬件抽象层接口.实现这两者的目的是为了向更上一层提供硬件访问接口,即为 ...
- HTML 中的meta标签中的http-equiv与name属性使用介绍
meta是html语言head区的一个辅助性标签.也许你认为这些代码可有可无.其实如果你能够用好meta标签,会给你带来意想不到的效果,meta标签的作用有:搜索引擎优化(SEO),定义页面使用语言, ...
- jQuery $.post $.ajax用法
jQuery $.post $.ajax用法 jQuery.post( url, [data], [callback], [type] ) :使用POST方式来进行异步请求 参数: url (Stri ...
- [Codeforces667A]Pouring Rain(数学,几何)
题目链接:http://codeforces.com/contest/667/problem/A 题意:一个杯子里有水,一个人在喝并且同时在往里倒.问这个人能不能喝完,多久能喝完. 把相关变量都量化成 ...
- [POJ3279]Fliptile(开关问题,枚举)
题目链接:http://poj.org/problem?id=3279 题解:http://www.cnblogs.com/helenawang/p/5538547.html /* ━━━━━┒ギリギ ...
- Redis VS Memcached
1. Redis & Memecached比较 内存管理 持久化 数据类型 客户端支持 并发性能 Memcached 预分配的内存池的方式 不支持持久化 支持简单的key-value存储 ...
- UVa 11774 (置换 找规律) Doom's Day
我看大多数人的博客只说了一句:找规律得答案为(n + m) / gcd(n, m) 不过神题的题解还须神人写.. We can associate at each cell a base 3-numb ...