Transform/Map-Reduce Syntax

Users can also plug in their own custom mappers and reducers in the data stream by using features natively supported in the Hive 2.0 language. e.g. in order to run a custom mapper script - map_script - and a custom reducer script - reduce_script - the user can issue the following command which uses the TRANSFORM clause to embed the mapper and the reducer scripts.

By default, columns will be transformed to STRING and delimited by TAB before feeding to the user script; similarly, all NULL values will be converted to the literal string \N in order to differentiate NULL values from empty strings. The standard output of the user script will be treated as TAB-separated STRINGcolumns, any cell containing only \N will be re-interpreted as a NULL, and then the resulting STRING column will be cast to the data type specified in the table declaration in the usual way. User scripts can output debug information to standard error which will be shown on the task detail page on hadoop. These defaults can be overridden with ROW FORMAT ....

In windows, use "cmd /c your_script" instead of just "your_script"

Warning

Icon

It is your responsibility to sanitize any STRING columns prior to transformation. If your STRING column contains tabs, an identity transformer will not give you back what you started with! To help with this, see REGEXP_REPLACE and replace the tabs with some other character on their way into the TRANSFORM() call.

Warning

Icon

Formally, MAP ... and REDUCE ... are syntactic transformations of SELECT TRANSFORM ( ... ). In other words, they serve as comments or notes to the reader of the query. BEWARE: Use of these keywords may be dangerous as (e.g.) typing "REDUCE" does not force a reduce phase to occur and typing "MAP" does not force a new map phase!

Please also see Sort By / Cluster By / Distribute By and Larry Ogrodnek's blog post.

clusterBy: CLUSTER BY colName (',' colName)*
distributeBy: DISTRIBUTE BY colName (',' colName)*
sortBy: SORT BY colName (ASC | DESC)? (',' colName (ASC | DESC)?)*
 
rowFormat
  : ROW FORMAT
    (DELIMITED [FIELDS TERMINATED BY char]
               [COLLECTION ITEMS TERMINATED BY char]
               [MAP KEYS TERMINATED BY char]
               [ESCAPED BY char]
               [LINES SEPARATED BY char]
     |
     SERDE serde_name [WITH SERDEPROPERTIES
                            property_name=property_value,
                            property_name=property_value, ...])
 
outRowFormat : rowFormat
inRowFormat : rowFormat
outRecordReader : RECORDREADER className
 
query:
  FROM (
    FROM src
    MAP expression (',' expression)*
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  REDUCE expression (',' expression)*
    (inRowFormat)?
    USING 'my_reduce_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
 
  FROM (
    FROM src
    SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'my_reduce_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?

SQL Standard Based Authorization Disallows TRANSFORM

The TRANSFORM clause is disallowed when SQL standard based authorization is configured in Hive 0.13.0 and later releases (HIVE-6415).

TRANSFORM Examples

Example #1:

FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.dt, map_output.uid
  USING 'reduce_script'
  AS date, count;
FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  SELECT TRANSFORM(map_output.dt, map_output.uid)
  USING 'reduce_script'
  AS date, count;

Example #2

FROM (
  FROM src
  SELECT TRANSFORM(src.key, src.value) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
  USING '/bin/cat'
  AS (tkey, tvalue) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
  RECORDREADER 'org.apache.hadoop.hive.ql.exec.TypedBytesRecordReader'
) tmap
INSERT OVERWRITE TABLE dest1 SELECT tkey, tvalue

Schema-less Map-reduce Scripts

If there is no AS clause after USING my_script, Hive assumes that the output of the script contains 2 parts: key which is before the first tab, and value which is the rest after the first tab. Note that this is different from specifying AS key, value because in that case, value will only contain the portion between the first tab and the second tab if there are multiple tabs.

Note that we can directly do CLUSTER BY key without specifying the output schema of the scripts.

FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  CLUSTER BY key) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.key, map_output.value
  USING 'reduce_script'
  AS date, count;

Typing the output of TRANSFORM

The output fields from a script are typed as strings by default; for example in

SELECT TRANSFORM(stuff)
USING 'script'
AS thing1, thing2

They can be immediately casted with the syntax:

SELECT TRANSFORM(stuff)
USING 'script'
AS (thing1 INT, thing2 INT)

[HIve - LanguageManual] Transform [没懂]的更多相关文章

  1. [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)

    Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...

  2. Hive的Transform功能

    Hive的TRANSFORM关键字提供了在SQL中调用自写脚本的功能,适合实现Hive中没有的功能又不想写UDF的情况.例如,按日期统计每天出现的uid数,通常用如下的SQL SELECT date, ...

  3. HIVE的transform函数的使用

    Hive的TRANSFORM关键字提供了在SQL中调用自写脚本的功能,适合实现Hive中没有的功能又不想写UDF的情况.例如,按日期统计每天出现的uid数,通常用如下的SQL SELECT date, ...

  4. [Hive - LanguageManual ] ]SQL Standard Based Hive Authorization

    Status of Hive Authorization before Hive 0.13 SQL Standards Based Hive Authorization (New in Hive 0. ...

  5. [Hive - LanguageManual ] Windowing and Analytics Functions (待)

    LanguageManual WindowingAndAnalytics     Skip to end of metadata   Added by Lefty Leverenz, last edi ...

  6. [HIve - LanguageManual] Sort/Distribute/Cluster/Order By

    Syntax of Order By Syntax of Sort By Difference between Sort By and Order By Setting Types for Sort ...

  7. [Hive - LanguageManual] Import/Export

    LanguageManual ImportExport     Skip to end of metadata   Added by Carl Steinbach, last edited by Le ...

  8. [Hive - LanguageManual] DML: Load, Insert, Update, Delete

    LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...

  9. [Hive - LanguageManual] Alter Table/Partition/Column

    Alter Table/Partition/Column Alter Table Rename Table Alter Table Properties Alter Table Comment Add ...

随机推荐

  1. (转)SSI开发环境搭建

    本文转自:http://blog.csdn.net/lifuxiangcaohui/article/details/7187869 先来点文字性的描述: MVC对于我们来说,已经不陌生了,它起源于20 ...

  2. 231. Power of Two

    题目: Given an integer, write a function to determine if it is a power of two. 链接: http://leetcode.com ...

  3. Tomcat的SessionID引起的Session Fixation和Session Hijacking问题

    上一篇说到<Spring MVC防御CSRF.XSS和SQL注入攻击>,今天说说SessionID带来的漏洞攻击问题.首先,什么是Session Fixation攻击和Session Hi ...

  4. JS模块化编程

    AMD:异步模块定义,适合客户端环境,不会阻塞运行.客户端受网络影响比较大. CommonJs:适用于服务器端规范,可以同步加载,只受硬盘读写的影响.

  5. JSF开篇之Login案例

    开发环境:Myeclipse+JDK5+MyEclipse Tomcat+jsf2.2.8 JSF看起来和STRUTS还是有些像的,刚开始还是遇到一点问题:资源包的存放路径及文件访问路径. 开发Log ...

  6. EINTR错误

    慢系统调用(slow system call):此术语适用于那些可能永远阻塞的系统调用.永远阻塞的系统调用是指调用有可能永远无法返回,多数网络支持函数都属于这一类.如:若没有客户连接到服务器上,那么服 ...

  7. useradd和adduser的区别

    1. 在root权限下,useradd只是创建了一个用户名,如 (useradd  +用户名 ),它并没有在/home目录下创建同名文件夹,也没有创建密码,因此利用这个用户登录系统,是登录不了的,为了 ...

  8. Android_PendingIntent的使用

        PendingIntent介绍 PendingIntent可以看作是对Intent的一个封装,但它不是立刻执行某个行为,而是满足某些条件或触发某些事件后才执行指定的行为. PendingInt ...

  9. HDU 1677

    Nested Dolls Time Limit: 3000/1000 MS (Java/Others)    Memory Limit: 32768/32768 K (Java/Others)Tota ...

  10. LA 3357 (递推 找规律) Pinary

    n位不含前导零不含连续1的数共有fib(n)个,fib(n)为斐波那契数列. 所以可以预处理一下fib的前缀和,查找一下第n个数是k位数,然后再递归计算它是第k位数里的多少位. 举个例子,比如说要找第 ...