Adapted from 大数据田地: http://lxw1234.com/archives/2015/04/190.htm

Test data setup:

create external table test_data (
  cookieid string,
  createtime string, -- page visit time
  url string -- visited page
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile location '/user/jc_rc_ftp/test_data';

select * from test_data l;
+-------------+----------------------+---------+--+
| l.cookieid | l.createtime | l.url |
+-------------+----------------------+---------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 |
| cookie1 | 2015-04-10 10:50:05 | url6 |
| cookie1 | 2015-04-10 11:00:00 | url7 |
| cookie1 | 2015-04-10 10:10:00 | url4 |
| cookie1 | 2015-04-10 10:50:01 | url5 |
| cookie2 | 2015-04-10 10:00:02 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 |
| cookie2 | 2015-04-10 10:50:05 | url66 |
| cookie2 | 2015-04-10 11:00:00 | url77 |
| cookie2 | 2015-04-10 10:10:00 | url44 |
| cookie2 | 2015-04-10 10:50:01 | url55 |
+-------------+----------------------+---------+--+
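
Because the table is declared with FIELDS TERMINATED BY ',', each record in the underlying text file is one comma-separated line. A minimal sketch of producing such content in Python follows. It covers only the cookie1 rows listed above, and how the file reaches the table location (for example an HDFS upload) is left out as environment-specific.

```python
import csv
import io

# cookie1 rows from the listing above, in the order the table returned them.
rows = [
    ("cookie1", "2015-04-10 10:00:02", "url2"),
    ("cookie1", "2015-04-10 10:00:00", "url1"),
    ("cookie1", "2015-04-10 10:03:04", "1url3"),
    ("cookie1", "2015-04-10 10:50:05", "url6"),
    ("cookie1", "2015-04-10 11:00:00", "url7"),
    ("cookie1", "2015-04-10 10:10:00", "url4"),
    ("cookie1", "2015-04-10 10:50:01", "url5"),
]

# Write comma-delimited lines into an in-memory buffer (no header row,
# since the table definition does not skip one).
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
data = buf.getvalue()
print(data, end="")
```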

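LAG
LAG(col,n,DEFAULT) returns the value of col from the n-th row before the current row within the window.

The first argument is the column name. The second is the number of rows to look back (optional, defaults to 1). The third is the default value, returned when there is no n-th preceding row in the window; if it is omitted, NULL is returned.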

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid | createtime | url | rn | last_1_time | last_2_time |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | 1970-01-01 00:00:00 | NULL |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | 2015-04-10 10:00:00 | NULL |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | 1970-01-01 00:00:00 | NULL |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | 2015-04-10 10:00:00 | NULL |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
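
The LAG behavior shown above can be modeled in a few lines of Python. This is only a sketch of the semantics, not how Hive evaluates the function; the helper name lag and its parameters are made up for illustration, and the input is the first three cookie1 rows.

```python
from itertools import groupby
from operator import itemgetter

def lag(rows, key, order, col, n=1, default=None):
    """Model LAG(col, n, default) OVER (PARTITION BY key ORDER BY order).

    rows is a list of dicts; returns a list of (row, lagged_value) pairs.
    """
    out = []
    ordered = sorted(rows, key=itemgetter(key, order))
    for _, part in groupby(ordered, key=itemgetter(key)):
        part = list(part)
        for i, row in enumerate(part):
            # Look n rows back; use the default when that row does not exist.
            value = part[i - n][col] if i - n >= 0 else default
            out.append((row, value))
    return out

visits = [
    {"cookieid": "cookie1", "createtime": "2015-04-10 10:00:02", "url": "url2"},
    {"cookieid": "cookie1", "createtime": "2015-04-10 10:00:00", "url": "url1"},
    {"cookieid": "cookie1", "createtime": "2015-04-10 10:03:04", "url": "1url3"},
]

last_1_time = [v for _, v in lag(visits, "cookieid", "createtime",
                                 "createtime", 1, "1970-01-01 00:00:00")]
last_2_time = [v for _, v in lag(visits, "cookieid", "createtime",
                                 "createtime", 2)]
print(last_1_time)
print(last_2_time)
```

The two lists line up with the last_1_time and last_2_time columns of rows 1 to 3 for cookie1 in the table above, with Python's None standing in for NULL.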

LEAD

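The opposite of LAG.
LEAD(col,n,DEFAULT) returns the value of col from the n-th row after the current row within the window.
The first argument is the column name. The second is the number of rows to look ahead (optional, defaults to 1). The third is the default value, returned when there is no n-th following row in the window; if it is omitted, NULL is returned.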

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid | createtime | url | rn | next_1_time | next_2_time |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | 2015-04-10 10:00:02 | 2015-04-10 10:03:04 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | 2015-04-10 10:03:04 | 2015-04-10 10:10:00 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 2015-04-10 10:10:00 | 2015-04-10 10:50:01 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | 2015-04-10 10:50:01 | 2015-04-10 10:50:05 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | 2015-04-10 10:50:05 | 2015-04-10 11:00:00 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | 2015-04-10 11:00:00 | NULL |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | 1970-01-01 00:00:00 | NULL |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | 2015-04-10 10:00:02 | 2015-04-10 10:03:04 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | 2015-04-10 10:03:04 | 2015-04-10 10:10:00 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 2015-04-10 10:10:00 | 2015-04-10 10:50:01 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | 2015-04-10 10:50:01 | 2015-04-10 10:50:05 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | 2015-04-10 10:50:05 | 2015-04-10 11:00:00 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | 2015-04-10 11:00:00 | NULL |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | 1970-01-01 00:00:00 | NULL |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
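
One possible use of LEAD on this kind of data is estimating how long a visitor stayed on a page: the gap between a visit and the next visit by the same cookie. The sketch below models that in Python for four of the cookie1 timestamps. The function name dwell_seconds is hypothetical, and the calculation is an illustration rather than something the queries above compute.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def dwell_seconds(times):
    """For the visit times of one cookie, return seconds until the next visit.

    Models LEAD(createtime, 1) OVER (ORDER BY createtime): the last row has
    no following row, so its result is None.
    """
    ordered = sorted(datetime.strptime(t, FMT) for t in times)
    following = ordered[1:] + [None]  # shift up by one row, like LEAD(..., 1)
    return [
        (nxt - cur).total_seconds() if nxt is not None else None
        for cur, nxt in zip(ordered, following)
    ]

cookie1_times = [
    "2015-04-10 10:00:02",
    "2015-04-10 10:00:00",
    "2015-04-10 10:03:04",
    "2015-04-10 10:10:00",
]
gaps = dwell_seconds(cookie1_times)
print(gaps)
```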

FIRST_VALUE

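Returns the first value in the partition after sorting, evaluated from the start of the partition up to the current row.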

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | first1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | url1 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url1 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url1 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url1 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url1 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url11 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | url11 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url11 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url11 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url11 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url11 |
+-----------+----------------------+---------+-----+---------+--+

LAST_VALUE

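Returns the last value in the partition after sorting, evaluated from the start of the partition up to the current row. Because the window ends at the current row, with distinct createtime values the result is simply the current row's own url.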

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | last1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 |
+-----------+----------------------+---------+-----+---------+--+

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | last1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
+-----------+----------------------+---------+-----+---------+--+
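
Both results follow from the window frame: when ORDER BY is given without an explicit frame, Hive evaluates the function over the rows from the start of the partition to the current row. A rough Python model of that frame, assuming no two rows share the same createtime, might look like this (the function name is made up for illustration):

```python
def first_last_over_frames(urls):
    """Return (FIRST_VALUE, LAST_VALUE) for each row of an ordered partition.

    For row i the frame is rows 0..i, i.e. from the start of the partition
    up to and including the current row.
    """
    result = []
    for i in range(len(urls)):
        frame = urls[: i + 1]
        result.append((frame[0], frame[-1]))
    return result

# cookie1 urls already sorted by createtime ascending (first four rows).
asc = ["url1", "url2", "1url3", "url4"]
result = first_last_over_frames(asc)
for pair in result:
    print(pair)
```

The first element of each pair stays at url1 while the second tracks the current row, which is the pattern in the first1 and last1 columns above.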

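If ORDER BY is omitted, the row order within each partition is not defined by the query. In the runs below it appears to follow the order in which the records were read from the file, so FIRST_VALUE and LAST_VALUE return values that have nothing to do with visit time: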

SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM test_data;
+-----------+----------------------+---------+---------+--+
| cookieid | createtime | url | first2 |
+-----------+----------------------+---------+---------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 | url2 |
| cookie1 | 2015-04-10 10:50:01 | url5 | url2 |
| cookie1 | 2015-04-10 10:10:00 | url4 | url2 |
| cookie1 | 2015-04-10 11:00:00 | url7 | url2 |
| cookie1 | 2015-04-10 10:50:05 | url6 | url2 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 | url2 |
| cookie2 | 2015-04-10 10:50:01 | url55 | url55 |
| cookie2 | 2015-04-10 10:10:00 | url44 | url55 |
| cookie2 | 2015-04-10 11:00:00 | url77 | url55 |
| cookie2 | 2015-04-10 10:50:05 | url66 | url55 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | url55 |
| cookie2 | 2015-04-10 10:00:00 | url11 | url55 |
| cookie2 | 2015-04-10 10:00:02 | url22 | url55 |
+-----------+----------------------+---------+---------+--+
SELECT cookieid,
createtime,
url,
LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2
FROM test_data;
+-----------+----------------------+---------+--------+--+
| cookieid | createtime | url | last2 |
+-----------+----------------------+---------+--------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 | url1 |
| cookie1 | 2015-04-10 10:50:01 | url5 | url1 |
| cookie1 | 2015-04-10 10:10:00 | url4 | url1 |
| cookie1 | 2015-04-10 11:00:00 | url7 | url1 |
| cookie1 | 2015-04-10 10:50:05 | url6 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | url1 |
| cookie1 | 2015-04-10 10:00:00 | url1 | url1 |
| cookie2 | 2015-04-10 10:50:01 | url55 | url22 |
| cookie2 | 2015-04-10 10:10:00 | url44 | url22 |
| cookie2 | 2015-04-10 11:00:00 | url77 | url22 |
| cookie2 | 2015-04-10 10:50:05 | url66 | url22 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 | url22 |
| cookie2 | 2015-04-10 10:00:02 | url22 | url22 |
+-----------+----------------------+---------+--------+--+
14 rows selected (78.058 seconds)

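To get the last value of the whole sorted partition, a workaround is needed: take FIRST_VALUE over the descending order instead.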

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM test_data
ORDER BY cookieid,createtime;
+-----------+----------------------+---------+-----+---------+--------+--+
| cookieid | createtime | url | rn | last1 | last2 |
+-----------+----------------------+---------+-----+---------+--------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 | url7 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 | url7 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 | url7 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 | url7 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 | url7 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 | url7 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 | url7 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 | url77 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 | url77 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 | url77 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 | url77 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 | url77 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 | url77 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 | url77 |
+-----------+----------------------+---------+-----+---------+--------+--+
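
An alternative to reversing the sort order is to keep LAST_VALUE and widen the frame explicitly with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. The sketch below runs that query on SQLite through Python's sqlite3 module, not on Hive, and needs an SQLite build with window function support (3.25 or later). Hive's windowing syntax accepts the same frame clause, but the output here has only been reasoned through for SQLite. The column alias last_full is made up, and only a few of the test rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (cookieid TEXT, createtime TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO test_data VALUES (?, ?, ?)",
    [
        ("cookie1", "2015-04-10 10:00:02", "url2"),
        ("cookie1", "2015-04-10 10:00:00", "url1"),
        ("cookie1", "2015-04-10 11:00:00", "url7"),
        ("cookie2", "2015-04-10 10:00:00", "url11"),
        ("cookie2", "2015-04-10 11:00:00", "url77"),
    ],
)

# The explicit frame makes every row see the whole partition, so LAST_VALUE
# returns the url of the latest visit for that cookie.
query = """
SELECT cookieid,
       createtime,
       url,
       LAST_VALUE(url) OVER (
           PARTITION BY cookieid ORDER BY createtime
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS last_full
FROM test_data
ORDER BY cookieid, createtime
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```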
