[HIve - LanguageManual] Transform [没懂]
Transform/Map-Reduce Syntax
Users can also plug in their own custom mappers and reducers in the data stream by using features natively supported in the Hive 2.0 language. e.g. in order to run a custom mapper script - map_script - and a custom reducer script - reduce_script - the user can issue the following command which uses the TRANSFORM clause to embed the mapper and the reducer scripts.
By default, columns will be transformed to STRING and delimited by TAB before feeding to the user script; similarly, all NULL values will be converted to the literal string \N in order to differentiate NULL values from empty strings. The standard output of the user script will be treated as TAB-separated STRINGcolumns, any cell containing only \N will be re-interpreted as a NULL, and then the resulting STRING column will be cast to the data type specified in the table declaration in the usual way. User scripts can output debug information to standard error which will be shown on the task detail page on hadoop. These defaults can be overridden with ROW FORMAT ....
In windows, use "cmd /c your_script" instead of just "your_script"
Warning
Icon
It is your responsibility to sanitize any STRING columns prior to transformation. If your STRING column contains tabs, an identity transformer will not give you back what you started with! To help with this, see REGEXP_REPLACE and replace the tabs with some other character on their way into the TRANSFORM() call.
Warning
Icon
Formally, MAP ... and REDUCE ... are syntactic transformations of SELECT TRANSFORM ( ... ). In other words, they serve as comments or notes to the reader of the query. BEWARE: Use of these keywords may be dangerous as (e.g.) typing "REDUCE" does not force a reduce phase to occur and typing "MAP" does not force a new map phase!
Please also see Sort By / Cluster By / Distribute By and Larry Ogrodnek's blog post.
clusterBy: CLUSTER BY colName ( ',' colName)* distributeBy: DISTRIBUTE BY colName ( ',' colName)* sortBy: SORT BY colName (ASC | DESC)? ( ',' colName (ASC | DESC)?)* rowFormat : ROW FORMAT (DELIMITED [FIELDS TERMINATED BY char ] [COLLECTION ITEMS TERMINATED BY char ] [MAP KEYS TERMINATED BY char ] [ESCAPED BY char ] [LINES SEPARATED BY char ] | SERDE serde_name [WITH SERDEPROPERTIES property_name=property_value, property_name=property_value, ...]) outRowFormat : rowFormat inRowFormat : rowFormat outRecordReader : RECORDREADER className query: FROM ( FROM src MAP expression ( ',' expression)* (inRowFormat)? USING 'my_map_script' ( AS colName ( ',' colName)* )? (outRowFormat)? (outRecordReader)? ( clusterBy? | distributeBy? sortBy? ) src_alias ) REDUCE expression ( ',' expression)* (inRowFormat)? USING 'my_reduce_script' ( AS colName ( ',' colName)* )? (outRowFormat)? (outRecordReader)? FROM ( FROM src SELECT TRANSFORM '(' expression ( ',' expression)* ')' (inRowFormat)? USING 'my_map_script' ( AS colName ( ',' colName)* )? (outRowFormat)? (outRecordReader)? ( clusterBy? | distributeBy? sortBy? ) src_alias ) SELECT TRANSFORM '(' expression ( ',' expression)* ')' (inRowFormat)? USING 'my_reduce_script' ( AS colName ( ',' colName)* )? (outRowFormat)? (outRecordReader)? |
SQL Standard Based Authorization Disallows TRANSFORM
The TRANSFORM clause is disallowed when SQL standard based authorization is configured in Hive 0.13.0 and later releases (HIVE-6415).
TRANSFORM Examples
Example #1:
FROM ( FROM pv_users MAP pv_users.userid, pv_users.date USING 'map_script' AS dt, uid CLUSTER BY dt) map_output INSERT OVERWRITE TABLE pv_users_reduced REDUCE map_output.dt, map_output.uid USING 'reduce_script' AS date, count; FROM ( FROM pv_users SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS dt, uid CLUSTER BY dt) map_output INSERT OVERWRITE TABLE pv_users_reduced SELECT TRANSFORM(map_output.dt, map_output.uid) USING 'reduce_script' AS date, count; |
Example #2
FROM ( FROM src SELECT TRANSFORM(src.key, src.value) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe' USING '/bin/cat' AS (tkey, tvalue) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe' RECORDREADER 'org.apache.hadoop.hive.ql.exec.TypedBytesRecordReader' ) tmap INSERT OVERWRITE TABLE dest1 SELECT tkey, tvalue |
Schema-less Map-reduce Scripts
If there is no AS clause after USING my_script, Hive assumes that the output of the script contains 2 parts: key which is before the first tab, and value which is the rest after the first tab. Note that this is different from specifying AS key, value because in that case, value will only contain the portion between the first tab and the second tab if there are multiple tabs.
Note that we can directly do CLUSTER BY key without specifying the output schema of the scripts.
FROM ( FROM pv_users MAP pv_users.userid, pv_users.date USING 'map_script' CLUSTER BY key) map_output INSERT OVERWRITE TABLE pv_users_reduced REDUCE map_output.key, map_output.value USING 'reduce_script' AS date, count; |
Typing the output of TRANSFORM
The output fields from a script are typed as strings by default; for example in
SELECT TRANSFORM(stuff) USING 'script' AS thing1, thing2 |
They can be immediately casted with the syntax:
SELECT TRANSFORM(stuff) USING 'script' AS (thing1 INT, thing2 INT) |
[HIve - LanguageManual] Transform [没懂]的更多相关文章
- [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)
Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...
- Hive的Transform功能
Hive的TRANSFORM关键字提供了在SQL中调用自写脚本的功能,适合实现Hive中没有的功能又不想写UDF的情况.例如,按日期统计每天出现的uid数,通常用如下的SQL SELECT date, ...
- HIVE的transform函数的使用
Hive的TRANSFORM关键字提供了在SQL中调用自写脚本的功能,适合实现Hive中没有的功能又不想写UDF的情况.例如,按日期统计每天出现的uid数,通常用如下的SQL SELECT date, ...
- [Hive - LanguageManual ] ]SQL Standard Based Hive Authorization
Status of Hive Authorization before Hive 0.13 SQL Standards Based Hive Authorization (New in Hive 0. ...
- [Hive - LanguageManual ] Windowing and Analytics Functions (待)
LanguageManual WindowingAndAnalytics Skip to end of metadata Added by Lefty Leverenz, last edi ...
- [HIve - LanguageManual] Sort/Distribute/Cluster/Order By
Syntax of Order By Syntax of Sort By Difference between Sort By and Order By Setting Types for Sort ...
- [Hive - LanguageManual] Import/Export
LanguageManual ImportExport Skip to end of metadata Added by Carl Steinbach, last edited by Le ...
- [Hive - LanguageManual] DML: Load, Insert, Update, Delete
LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...
- [Hive - LanguageManual] Alter Table/Partition/Column
Alter Table/Partition/Column Alter Table Rename Table Alter Table Properties Alter Table Comment Add ...
随机推荐
- (转)SSI开发环境搭建
本文转自:http://blog.csdn.net/lifuxiangcaohui/article/details/7187869 先来点文字性的描述: MVC对于我们来说,已经不陌生了,它起源于20 ...
- 231. Power of Two
题目: Given an integer, write a function to determine if it is a power of two. 链接: http://leetcode.com ...
- Tomcat的SessionID引起的Session Fixation和Session Hijacking问题
上一篇说到<Spring MVC防御CSRF.XSS和SQL注入攻击>,今天说说SessionID带来的漏洞攻击问题.首先,什么是Session Fixation攻击和Session Hi ...
- JS模块化编程
AMD:异步模块定义,适合客户端环境,不会阻塞运行.客户端受网络影响比较大. CommonJs:适用于服务器端规范,可以同步加载,只受硬盘读写的影响.
- JSF开篇之Login案例
开发环境:Myeclipse+JDK5+MyEclipse Tomcat+jsf2.2.8 JSF看起来和STRUTS还是有些像的,刚开始还是遇到一点问题:资源包的存放路径及文件访问路径. 开发Log ...
- EINTR错误
慢系统调用(slow system call):此术语适用于那些可能永远阻塞的系统调用.永远阻塞的系统调用是指调用有可能永远无法返回,多数网络支持函数都属于这一类.如:若没有客户连接到服务器上,那么服 ...
- useradd和adduser的区别
1. 在root权限下,useradd只是创建了一个用户名,如 (useradd +用户名 ),它并没有在/home目录下创建同名文件夹,也没有创建密码,因此利用这个用户登录系统,是登录不了的,为了 ...
- Android_PendingIntent的使用
PendingIntent介绍 PendingIntent可以看作是对Intent的一个封装,但它不是立刻执行某个行为,而是满足某些条件或触发某些事件后才执行指定的行为. PendingInt ...
- HDU 1677
Nested Dolls Time Limit: 3000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others)Tota ...
- LA 3357 (递推 找规律) Pinary
n位不含前导零不含连续1的数共有fib(n)个,fib(n)为斐波那契数列. 所以可以预处理一下fib的前缀和,查找一下第n个数是k位数,然后再递归计算它是第k位数里的多少位. 举个例子,比如说要找第 ...