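UDF, UDAF, and UDTF in Hive

1. Basic operations in Hive

DDL, DML

2. Functions in Hive

User-Defined Functions, abbreviated UDF
UDF: one in, one out, e.g. upper, lower, substring (one record comes in, one record goes out)
UDAF: Aggregation (user-defined aggregate function), many in, one out, e.g. count, max, min, sum ...
UDTF: Table-Generating (user-defined table-generating function), one in, many out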
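
The three categories differ only in how many rows go in and how many come out. As a rough illustration, here is a plain-Python model of the three shapes. It is not Hive code, and the function names are made up for this sketch:

```python
# Plain-Python model of the three function shapes; this is not Hive code.
# All function names below are hypothetical, chosen for illustration.

def udf_upper(value):
    """UDF shape: one row in, one row out (like upper)."""
    return value.upper()

def udaf_max(values):
    """UDAF shape: many rows in, one row out (like max)."""
    return max(values)

def udtf_explode(array):
    """UDTF shape: one row in, many rows out (like explode)."""
    for element in array:
        yield element

rows = ["hello", "world"]
upper_rows = [udf_upper(r) for r in rows]       # still two rows
max_row = udaf_max(rows)                        # collapsed to a single value
exploded = list(udtf_explode(["a", "b", "c"]))  # one array became three rows
print(upper_rows, max_row, exploded)
```

The row counts are the point here: the list comprehension keeps one output per input, the aggregate returns a single value, and the generator fans one input out into several outputs.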

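3. Examples

show functions lists the functions the system supports.

Function examples: split(), explode()

Exercise: count word occurrences with Hive

explode turns an array into multiple rows of data.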

[hadoop@hadoop000 data]$ vi hive-wc.txt

hello,world,welcome

hello,welcome

hive> create table hive_wc(sentence string);

OK

Time taken: 1.083 seconds

hive> load data local inpath '/home/hadoop/data/hive-wc.txt' into table hive_wc;

Loading data to table default.hive_wc

Table default.hive_wc stats: [numFiles=, totalSize=]

OK

Time taken: 1.539 seconds

hive> select * from hive_wc;

OK

hello,world,welcome

hello,welcome

Time taken: 0.536 seconds, Fetched:  row(s)

hive> select split(sentence,",") from hive_wc;

OK

["hello","world","welcome"]

["hello","welcome"]

[""]

Time taken: 0.161 seconds, Fetched:  row(s)

Applying explode to the split result puts each word on its own row:

"hello"

"world"

"welcome"

"hello"

"welcome"
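
To make the two steps concrete, the sketch below models split followed by explode in plain Python, using the two sentences from hive-wc.txt. It is only a model of the behavior, not Hive code; note also that Hive's split takes a regular expression while Python's str.split takes a literal separator, which makes no difference for a plain comma:

```python
# Plain-Python model of split() followed by explode(); this is not Hive code.
sentences = ["hello,world,welcome", "hello,welcome"]  # the rows of hive_wc

def explode(rows_of_arrays):
    """Emit one output row per array element, across all input rows."""
    for array in rows_of_arrays:
        for element in array:
            yield element

arrays = [s.split(",") for s in sentences]  # split: one array per input row
words = list(explode(arrays))               # explode: one word per output row
print(arrays)
print(words)
```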

Do the whole word count in a single SQL statement:

hive> select word, count(1) as c

    > from (select explode(split(sentence,",")) as word from hive_wc) t

    > group by word ;

Query ID = hadoop_20180613094545_920c2e72--47eb-9a9c-5e5a30ebb1ae

Total jobs =

Launching Job  out of

Number of reduce tasks not specified. Estimated from input data size:

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1528851144815_0001, Tracking URL = http://hadoop000:8088/proxy/application_1528851144815_0001/

Kill Command = /home/hadoop/app/hadoop-2.6.-cdh5.7.0/bin/hadoop job  -kill job_1528851144815_0001

Hadoop job information for Stage-: number of mappers: ; number of reducers:

-- ::, Stage- map = %,  reduce = %

-- ::, Stage- map = %,  reduce = %, Cumulative CPU 2.42 sec

-- ::, Stage- map = %,  reduce = %, Cumulative CPU 4.31 sec

MapReduce Total cumulative CPU time:  seconds  msec

Ended Job = job_1528851144815_0001

MapReduce Jobs Launched:

Stage-Stage-: Map:   Reduce:    Cumulative CPU: 4.31 sec   HDFS Read:  HDFS Write:  SUCCESS

Total MapReduce CPU Time Spent:  seconds  msec

OK

hello	2

welcome	2

world	1

Time taken: 26.859 seconds, Fetched:  row(s)
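
The same computation can be checked by hand. The sketch below is a plain-Python stand-in for the query (inner query: explode plus split; outer query: group by plus count), run over the two sentences from hive-wc.txt. It models the logic only and says nothing about how Hive executes the job:

```python
from collections import Counter

sentences = ["hello,world,welcome", "hello,welcome"]  # the rows of hive_wc

# Inner query: explode(split(sentence, ",")) gives one word per row.
words = [word for s in sentences for word in s.split(",")]

# Outer query: group by word, then count the rows in each group.
counts = Counter(words)

for word in sorted(counts):
    print(word, counts[word])
```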

4. JSON data

File used: rating.json

Create a table rating_json, load the data, and look at the first ten rows:

hive> create table rating_json(json string);

OK

hive> load data local inpath '/home/hadoop/data/rating.json' into table rating_json;

Loading data to table default.rating_json

Table default.rating_json stats: [numFiles=, totalSize=]

OK

hive> select * from rating_json limit 10;

OK

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

{"movie":"","rate":"","time":"","userid":""}

Time taken: 0.195 seconds, Fetched:  row(s)

Process the JSON data with json_tuple, a UDTF introduced in Hive 0.7:

hive> select

    > json_tuple(json,"movie","rate","time","userid") as (movie,rate,time,userid)

    > from rating_json limit 10;

OK

Time taken: 0.189 seconds, Fetched:  row(s)
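
What json_tuple does to each row can be pictured as parsing the JSON string once and pulling out the requested keys as columns. The sketch below is a plain-Python model of that idea, not Hive's implementation. The sample records and their values are hypothetical, and returning None for a missing key or an unparseable string is an assumption of this model, standing in for NULL:

```python
import json

# Hypothetical sample records; the field values are made up for this sketch.
raw_rows = [
    '{"movie":"m1","rate":"5","time":"t1","userid":"u1"}',
    '{"movie":"m2","rate":"3","time":"t2","userid":"u1"}',
]

def json_tuple(raw, *keys):
    """Parse one JSON string and return the requested keys as a tuple.

    Assumption of this model: a missing key, or input that is not a JSON
    object, produces None (standing in for NULL).
    """
    try:
        obj = json.loads(raw)
    except ValueError:
        obj = None
    if not isinstance(obj, dict):
        return tuple(None for _ in keys)
    return tuple(obj.get(key) for key in keys)

parsed = [json_tuple(raw, "movie", "rate", "time", "userid") for raw in raw_rows]
for movie, rate, time, userid in parsed:
    print(movie, rate, time, userid)
```

Parsing once per row and extracting several keys together is the usual reason to prefer this shape over extracting each field with a separate call.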

5. Window functions (top N within each group):

[hadoop@hadoop000 data]$ more hive_row_number.txt

,,ruoze,M

,,jepson,M

,,wangwu,F

,,zhaoliu,F

,,tianqi,M

,,wangba,F

[hadoop@hadoop000 data]$

hive> create table hive_rownumber(id int,age int, name string, sex string)

    > row format delimited fields terminated by ',';

OK

Time taken: 0.451 seconds

hive> load data local inpath '/home/hadoop/data/hive_row_number.txt' into table hive_rownumber;

Loading data to table hive3.hive_rownumber

Table hive3.hive_rownumber stats: [numFiles=, totalSize=]

OK

Time taken: 1.381 seconds

hive> select * from hive_rownumber ;

OK

             ruoze   M

             jepson  M

             wangwu  F

             zhaoliu F

             tianqi  M

             wangba  F

Time taken: 0.455 seconds, Fetched:  row(s)

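Requirement: for each sex, return the two rows with the greatest age, i.e. a top-N query:

Analysis: order by sorts globally and cannot sort within groups; to sort within each group, use a window function (also called an analytic function).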

select id,age,name,sex

from

(select id,age,name,sex,

row_number() over(partition by sex order by age desc) as rank

from hive_rownumber) t

where rank<=2;

hive> select id,age,name,sex

    > from

    > (select id,age,name,sex,

    > row_number() over(partition by sex order by age desc) as rank

    > from hive_rownumber) t

    > where rank<=2;

Query ID = hadoop_20180614202525_9829dc42-3c37--8b12-89c416589ebc

Total jobs =

Launching Job  out of

Number of reduce tasks not specified. Estimated from input data size:

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1528975858636_0001, Tracking URL = http://hadoop000:/proxy/application_1528975858636_0001/

Kill Command = /home/hadoop/app/hadoop-2.6.-cdh5.7.0/bin/hadoop job  -kill job_1528975858636_0001

Hadoop job information for Stage-: number of mappers: ; number of reducers:

-- ::, Stage- map = %,  reduce = %

-- ::, Stage- map = %,  reduce = %, Cumulative CPU 1.48 sec

-- ::, Stage- map = %,  reduce = %, Cumulative CPU 3.86 sec

MapReduce Total cumulative CPU time:  seconds  msec

Ended Job = job_1528975858636_0001

MapReduce Jobs Launched:

Stage-Stage-: Map:   Reduce:    Cumulative CPU: 3.86 sec   HDFS Read:  HDFS Write:  SUCCESS

Total MapReduce CPU Time Spent:  seconds  msec

OK

             wangba  F

             wangwu  F

             tianqi  M

             jepson  M

Time taken: 29.262 seconds, Fetched:  row(s)
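
The logic of row_number() over(partition by sex order by age desc) can be traced by hand: sort by sex and then by age descending, number the rows inside each sex starting from 1, and keep the rows whose number is at most 2. The sketch below models that in plain Python; it is not Hive code. The names and sexes follow hive_row_number.txt, but the ids and ages are hypothetical values made up for this sketch:

```python
from itertools import groupby

# Columns: (id, age, name, sex). Names and sexes follow the sample file;
# the ids and ages are hypothetical, made up for this sketch.
rows = [
    (1, 18, "ruoze", "M"),
    (2, 19, "jepson", "M"),
    (3, 22, "wangwu", "F"),
    (4, 16, "zhaoliu", "F"),
    (5, 30, "tianqi", "M"),
    (6, 26, "wangba", "F"),
]

def top_n_per_group(rows, n):
    """Model of row_number() over (partition by sex order by age desc) <= n."""
    # Sorting by (sex, -age) places each partition together, oldest first.
    ordered = sorted(rows, key=lambda r: (r[3], -r[1]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r[3]):
        # enumerate plays the role of row_number(): 1, 2, 3, ... per partition
        for rank, row in enumerate(group, start=1):
            if rank <= n:  # the outer "where rank <= n" filter
                result.append(row)
    return result

top2 = top_n_per_group(rows, 2)
for row in top2:
    print(row)
```

The filter has to sit outside the numbering step, which is why the query wraps row_number() in a subquery and applies where rank<=2 in the outer select.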
