spark 累加历史 + 统计全部 + 行转列
spark 累加历史主要用到了窗口函数,而进行全部统计,则需要用到rollup函数
1 应用场景:
1、我们需要统计用户的总使用时长(累加历史)
2、前台展现页面需要对多个维度进行查询,如:产品、地区等等
3、需要展现的表格头如: 产品、2015-04、2015-05、2015-06
2 原始数据:
product_code |event_date |duration |
-------------|-----------|---------|
1438 |2016-05-13 |165 |
1438 |2016-05-14 |595 |
1438 |2016-05-15 |105 |
1629 |2016-05-13 |12340 |
1629 |2016-05-14 |13850 |
1629 |2016-05-15 |227 |
3 业务场景实现
3.1 业务场景1:累加历史:
如数据源所示:我们已经有当天用户的使用时长,我们期望在进行统计的时候,14号能累加13号的,15号能累加14、13号的,以此类推
3.1.1 spark-sql实现
//spark sql 使用窗口函数累加历史数据
sqlContext.sql(
"""
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration
from userlogs_date
""").show
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
3.1.2 dataframe实现
//使用Column提供的over 函数,传入窗口操作
import org.apache.spark.sql.expressions._
val first_2_now_window = Window.partitionBy("pcode").orderBy("event_date")
df_userlogs_date.select(
$"pcode",
$"event_date",
sum($"duration").over(first_2_now_window).as("sum_duration")
).show +-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
3.1.3 扩展 累加一段时间范围内
实际业务中的累加逻辑远比上面复杂,比如,累加之前N天,累加前N天到后N天等等。以下我们来实现:
3.1.3.1 累加历史所有:
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue,0)
Window.partitionBy("pcode").orderBy("event_date")
上边四种写法完全相等 3.1.3.2 累加N天之前,假设N=3
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and current row) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3,0)
3.1.3.3 累加前N天,后M天: 假设N=3 M=5
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and 5 following ) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3,5)
3.1.3.4 累加该分区内所有行
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following ) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue,Long.MaxValue)
总结如下:
preceding:用于累加前N行(分区之内)。若是从分区第一行头开始,则为 unbounded。 N为:相对当前行向前的偏移量
following :与preceding相反,累加后N行(分区之内)。若是累加到该分区结束,则为 unbounded。N为:相对当前行向后的偏移量
current row:顾名思义,当前行,偏移量为0
说明:上边的前N,后M,以及current row均会累加该偏移量所在行
3.1.3.4 实测结果
累加历史:分区内当天及之前所有 写法1:select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
累加历史:分区内当天及之前所有 写法2:select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
累加当日和昨天:select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and current row) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 700|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 14077|
+-----+----------+------------+
累加当日、昨日、明日:select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and 1 following ) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 760|
| 1438|2016-05-14| 865|
| 1438|2016-05-15| 700|
| 1629|2016-05-13| 26190|
| 1629|2016-05-14| 26417|
| 1629|2016-05-15| 14077|
+-----+----------+------------+
累加分区内所有:当天和之前之后所有:select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following ) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 865|
| 1438|2016-05-14| 865|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 26417|
| 1629|2016-05-14| 26417|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
3.2 业务场景2:统计全部
3.2.1 spark sql实现
//spark sql 使用rollup添加all统计
sqlContext.sql(
"""
select pcode,event_date,sum(duration) as sum_duration
from userlogs_date_1
group by pcode,event_date with rollup
order by pcode,event_date
""").show() +-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| null| null| 27282|
| 1438| null| 865|
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 595|
| 1438|2016-05-15| 105|
| 1629| null| 26417|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 13850|
| 1629|2016-05-15| 227|
+-----+----------+------------+
3.2.2 dataframe函数实现
//使用dataframe提供的rollup函数,进行多维度all统计
df_userlogs_date.rollup($"pcode", $"event_date").agg(sum($"duration")).orderBy($"pcode", $"event_date") +-----+----------+-------------+
|pcode|event_date|sum(duration)|
+-----+----------+-------------+
| null| null| 27282|
| 1438| null| 865|
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 595|
| 1438|2016-05-15| 105|
| 1629| null| 26417|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 13850|
| 1629|2016-05-15| 227|
+-----+----------+-------------+
3.3 行转列 ->pivot
pivot目前还没有sql语法,先用df语法吧
val userlogs_date_all = sqlContext.sql("select dcode, pcode,event_date,sum(duration) as duration from userlogs group by dognum, pcode,event_date ")
userlogs_date_all.registerTempTable("userlogs_date_all")
val dates = userlogs_date_all.select($"event_date").map(row => row.getAs[String]("event_date")).distinct().collect().toList
userlogs_date_all.groupBy($"dcode", $"pcode").pivot("event_date", dates).sum("duration").na.fill().show +-----------------+-----+----------+----------+----------+----------+
| dcode|pcode|--|--|--|--|
+-----------------+-----+----------+----------+----------+----------+
| F2429186| | | | | |
| AI2342441| | | | | |
| A320018711| | | | | |
| H2635817| | | | | |
| D0288196| | | | | |
| Y0242218| | | | | |
| H2392574| | | | | |
| D2245588| | | | | |
| Y2514906| | | | | |
| H2540419| | | | | |
| R2231926| | | | | |
| H2684591| | | | | |
| A2548470| | | | | |
| GH000309| | | | | |
| H2293216| | | | | |
| R2170601| | | | | |
|B2365238;B2559538| | | | | |
| BQ005465| | | | | |
| AH2180324| | | | | |
| H0279306| | | | | |
+-----------------+-----+----------+----------+----------+----------+
附录
下面是这两个函数的官方api说明:
org.apache.spark.sql.scala
def rollup(col1: String, cols: String*): GroupedData
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions). // Compute the average for all numeric columns rolluped by department and group.
df.rollup("department", "group").avg() // Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
def rollup(cols: Column*): GroupedData
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
df.rollup($"department", $"group").avg() // Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
org.apache.spark.sql.Column.scala
def over(window: WindowSpec): Column
Define a windowing column. val w = Window.partitionBy("name").orderBy("id")
df.select(
sum("price").over(w.rangeBetween(Long.MinValue, 2)),
avg("price").over(w.rowsBetween(0, 4))
)
spark 累加历史 + 统计全部 + 行转列的更多相关文章
- SqlServer PIVOT函数快速实现行转列,UNPIVOT实现列转行
我们在写Sql语句的时候没经常会遇到将查询结果行转列,列转行的需求,拼接sql字符串,然后使用sp_executesql执行sql字符串是比较常规的一种做法.但是这样做实现起来非常复杂,而在SqlSe ...
- SqlServer PIVOT函数快速实现行转列,UNPIVOT实现列转行(转)
我们在写Sql语句的时候没经常会遇到将查询结果行转列,列转行的需求,拼接sql字符串,然后使用sp_executesql执行sql字符串是比较常规的一种做法.但是这样做实现起来非常复杂,而在SqlSe ...
- Sql 不确定列 行转列操作
做项目时,用到了汇总统计的行转列,且 表结构: 具体存储过程脚本如下: -- =============================================-- Author: -- C ...
- MySQL,排序,统计行转列
表 -- ------------------------------ Table structure for a-- ---------------------------- DROP TABLE ...
- Mysql 列转行统计查询 、行转列统计查询
-- ---------------------------- -- Table structure for `TabName` -- ---------------------------- D ...
- Spark基于自定义聚合函数实现【列转行、行转列】
一.分析 Spark提供了非常丰富的算子,可以实现大部分的逻辑处理,例如,要实现行转列,可以用hiveContext中支持的concat_ws(',', collect_set('字段'))实现.但是 ...
- Databricks 第11篇:Spark SQL 查询(行转列、列转行、Lateral View、排序)
本文分享在Azure Databricks中如何实现行转列和列转行. 一,行转列 在分组中,把每个分组中的某一列的数据连接在一起: collect_list:把一个分组中的列合成为数组,数据不去重,格 ...
- Linux Shell 统计一(行\列)数值的总和及行、列转换
(对一列数字求和) 在日常工作当中需要对文本过滤出来的数字进行求和运算,例如想统计一个MySQL分区表现在有多大 # ls -lsh AdPlateform#P#p*.ibd |grep G 2.6 ...
- 做图表统计你需要掌握SQL Server 行转列和列转行
说在前面 做一个数据统计和分析的项目,每天面对着各种数据,经过存储过程从源表计算汇总后需要写入中间结果表以提高数据使用效率,那么此时就需要用到行转列和列转行. 1.列转行 数据经过计算加工后会直接生成 ...
随机推荐
- MacBook Pro Retina 安装WIN7 - 对抗模糊及其它
最近对虚拟机里的WIN7受够了,把整个虚拟机都删了,准备装双系统. 安装过程还是很简单的,网上教程一大堆,就是通过MAC OS X自带的BootCamp工具来管理整个安装过程,我是用外置光驱安装的,没 ...
- 分区默认segment大小变化(64k—>8M)
_partition_large_extents和_index_partition_large_extents 参考: http://www.xifenfei.com/2013/08/%E5%88%8 ...
- com.mysql.jdbc.Driver to com.mysql.cj.jdbc.Driver
com.mysql.jdbc.Driver tocom.mysql.cj.jdbc.Driver MySQL :: MySQL Connector/J 8.0 Developer Guide :: 4 ...
- python为什么需要reload(sys)后设置编码
python在安装时,默认的编码是ascii,当程序中出现非ascii编码时,python的处理常常会报这样的错UnicodeDecodeError: 'ascii' codec can't deco ...
- Scala高级语法
一.隐式 implicit分类: (1)隐式参数 (2)隐式转换类型 (3)隐式类 特点:让代码变得更加灵活 (一)隐式参数 1.ImplicitTest object ImplicitTest { ...
- 存储5——逻辑卷管理LVM
1. LVM概念 LVM是 Logical Volume Manager(逻辑卷管理)的简写,它由Heinz Mauelshagen在Linux 2.4内核上实现.LVM将一个或多个硬盘的分区在逻辑上 ...
- Portugal 2 1 minute has Pipansihuan Germany and USA tacit or kick the ball
C Luo assists last moment so that Portugal "back to life", but with just two games to allo ...
- Spark Shuffle调优原理和最佳实践
对性能消耗的原理详解 在分布式系统中,数据分布在不同的节点上,每一个节点计算一部份数据,如果不对各个节点上独立的部份进行汇聚的话,我们计算不到最终的结果.我们需要利用分布式来发挥Spark本身并行计算 ...
- 【转】Deep Learning(深度学习)学习笔记整理系列之(一)
Deep Learning(深度学习)学习笔记整理系列 zouxy09@qq.com http://blog.csdn.net/zouxy09 作者:Zouxy version 1.0 2013-0 ...
- Frame 框架的创建
Qt 创建Frame框架的例子: QFrame * frm = new QFrame(this); //创建一个框架 frm->setFrameStyle(QFrame::StyledPanel ...