Summary of org.apache.spark.sql.functions

Test data:
　　id,name,age,comment,date
　　1,lyy,28,"aaa bbb",20180102020325

 scala> var data = spark.read.format("csv").option("header",true).load("file:///E:/liyanyan/data/test.csv")

 scala> data.printSchema
 root
  |-- id: string (nullable = true)
  |-- name: string (nullable = true)
  |-- age: string (nullable = true)
  |-- comment: string (nullable = true)
  |-- date: string (nullable = true)

 scala> data.select(split(data.col("comment")," ")(1).alias("t"),substring(data("date"),0,8)).withColumn("date1",date_format(current_date(),"yyyyMMdd")).show
 +---+---------------------+--------+
 |  t|substring(date, 0, 8)|   date1|
 +---+---------------------+--------+
 |bbb|             20180102|        |
 +---+---------------------+--------+

 scala> data.select(data("date")).withColumn("date1",from_unixtime(unix_timestamp(),"yyyyMMdd")).show
 +--------------+--------+
 |          date|   date1|
 +--------------+--------+
 |20180102020325|        |
 +--------------+--------+

 scala> data.select(substring(data("date"),0,8)).withColumn("date1",date_format(current_date(),"yyyyMMdd")).show
 +---------------------+--------+
 |substring(date, 0, 8)|   date1|
 +---------------------+--------+
 |             20180102|        |
 +---------------------+--------+

 scala> data.select(regexp_extract(data("comment"),"(\\w+)\\s+(\\w+)",2)).show
 +-----------------------------------------+
 |regexp_extract(comment, (\w+)\s+(\w+), 2)|
 +-----------------------------------------+
 |                                      bbb|
 +-----------------------------------------+

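Aggregate functions

approx_count_distinct

Approximate distinct count (an approximation of count_distinct)

avg

Average

collect_list

Collects the values of the given column into a list

collect_set

Collects the values of the given column into a set

corr

Pearson correlation coefficient of two columns

count

Count

countDistinct

Distinct count. SQL usage:

select count(distinct class)

covar_pop

Population covariance

covar_samp

Sample covariance

first

First element of the group

last

Last element of the group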

grouping

grouping_id

kurtosis

Kurtosis

skewness

Skewness

max

Maximum

min

Minimum

mean

Average (alias of avg)

stddev

Alias of stddev_samp

stddev_samp

Sample standard deviation

stddev_pop

Population standard deviation

sum

Sum

sumDistinct

Sum of distinct values. SQL usage:

select sum(distinct class)

var_pop

Population variance

var_samp

Unbiased sample variance

variance

Alias of var_samp
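
A few of the aggregates above can be combined in a single `groupBy(...).agg(...)` call. The following is a minimal sketch, not a definitive recipe: the `scores` DataFrame and its `class`, `subject` and `score` columns are made up for illustration, and the local `SparkSession` is only needed outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for illustration; in spark-shell the `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("agg-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: one row per (class, subject, score).
val scores = Seq(
  ("a", "math", 90), ("a", "art", 80),
  ("b", "math", 70), ("b", "math", 70)
).toDF("class", "subject", "score")

// One output row per class, several aggregates side by side.
val summary = scores.groupBy("class").agg(
  count("score").alias("n"),                   // number of rows
  countDistinct("subject").alias("subjects"),  // count(distinct subject)
  avg("score").alias("avg_score"),
  sum("score").alias("total"),
  collect_set("subject").alias("subject_set")  // duplicates removed
).orderBy("class")

summary.show(false)
```

Here `count` and `countDistinct` return longs, `avg` returns a double, and `sum` over an integer column returns a long.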

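Collection functions

array_contains(column,value)

Checks whether an array column contains the given element

explode

Expands an array or map into multiple rows

explode_outer

Like explode, but produces null when the array or map is empty or null.

posexplode

Like explode, with a position index.

posexplode_outer

Like explode_outer, with a position index.

from_json

Parses a JSON string into a StructType or ArrayType; there are several overloads, see the API documentation.

to_json

Converts to a JSON string; supports StructType, ArrayType of StructTypes, a MapType or ArrayType of MapTypes.

get_json_object(column,path)

Returns the JSON object at the given JSON path as a string.

select get_json_object('{"a":1,"b":2}','$.a');

[Introduction to JSON Path](http://blog.csdn.net/koflance/article/details/63262484)

json_tuple(column,fields)

Extracts the values of the given fields from JSON. select json_tuple('{"a":1,"b":2}','a','b');

map_keys

Returns an array of the map's keys

map_values

Returns an array of the map's values

size

Length of an array or map

sort_array(e: Column, asc: Boolean)

Sorts the elements of an array in natural order; ascending by default.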
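
The difference between explode and explode_outer, and the string result of get_json_object, are easiest to see on a tiny DataFrame. This is a sketch under assumptions: the `id`, `tags` and `js` columns are hypothetical, and the local session is only needed outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for illustration; in spark-shell the `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("collection-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: an array column and a JSON string column.
val df = Seq(
  (1, Seq("x", "y"), """{"a":1,"b":2}"""),
  (2, Seq.empty[String], """{"a":3,"b":4}""")
).toDF("id", "tags", "js")

// explode drops the row whose array is empty ...
val exploded = df.select($"id", explode($"tags").alias("tag"))
// ... while explode_outer keeps it, with a null tag.
val explodedOuter = df.select($"id", explode_outer($"tags").alias("tag"))

val extracted = df.select(
  $"id",
  size($"tags").alias("n_tags"),
  array_contains($"tags", "x").alias("has_x"),
  get_json_object($"js", "$.a").alias("a")  // returned as a string
)

exploded.show()
explodedOuter.show()
extracted.show()
```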

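Date and time functions

add_months(startDate: Column, numMonths: Int)

Adds n months to the given date

date_add(start: Column, days: Int)

The date n days after the given date, e.g. select date_add('2018-01-01', n)

date_sub(start: Column, days: Int)

The date n days before the given date

datediff(end: Column, start: Column)

Number of days between two dates

current_date()

Current date

current_timestamp()

Current timestamp, of type TimestampType

date_format(dateExpr: Column, format: String)

Formats a date as a string

dayofmonth(e: Column)

Day of the month; accepts date/timestamp/string

dayofyear(e: Column)

Day of the year; accepts date/timestamp/string

weekofyear(e: Column)

Week of the year; accepts date/timestamp/string

from_unixtime(ut: Column, f: String)

Converts a Unix timestamp (seconds) to a string in the given format

from_utc_timestamp(ts: Column, tz: String)

Interprets the timestamp as UTC and converts it to the given time zone

to_utc_timestamp(ts: Column, tz: String)

Interprets the timestamp as being in the given time zone and converts it to UTC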

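hour(e: Column)

Extracts the hour

minute(e: Column)

Extracts the minute

month(e: Column)

Extracts the month

quarter(e: Column)

Extracts the quarter

second(e: Column)

Extracts the second

year(e: Column): Extracts the year

last_day(e: Column)

Last day of the month that the given date belongs to

months_between(date1: Column, date2: Column)

Number of months between two dates

next_day(date: Column, dayOfWeek: String)

Returns the first date after the given date that falls on the given day of the week (the next Monday, Tuesday, ...). dayOfWeek is case-insensitive and accepts "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun".

to_date(e: Column)

Converts the column to DateType

trunc(date: Column, format: String)

Truncates a date to the unit given by format

unix_timestamp(s: Column, p: String)

Converts a time string in the given format to a Unix timestamp (seconds)

unix_timestamp(s: Column)

Same as above, with the default format yyyy-MM-dd HH:mm:ss

unix_timestamp(): Current Unix timestamp in seconds; implemented internally as unix_timestamp(current_timestamp(), yyyy-MM-dd HH:mm:ss)

window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String)

Time window function; assigns each row to one or more windows based on the given time column (TimestampType)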
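
Several of the date functions above are typically chained: parse a string, reformat it, then do date arithmetic. The sketch below reuses the `yyyyMMddHHmmss` value from the test data; the column names are illustrative assumptions, and the local session is only needed outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for illustration; in spark-shell the `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("date-sketch").getOrCreate()
import spark.implicits._

val df = Seq("20180102020325").toDF("date")

val parsed = df
  // string -> seconds since the epoch, interpreted in the session time zone
  .withColumn("ts", unix_timestamp($"date", "yyyyMMddHHmmss"))
  // seconds -> string in another format
  .withColumn("day", from_unixtime($"ts", "yyyy-MM-dd"))
  // date arithmetic on a DateType value
  .withColumn("plus7", date_add(to_date($"day"), 7))
  .withColumn("month_end", last_day(to_date($"day")))
  .withColumn("days_left", datediff(to_date(lit("2018-01-31")), to_date($"day")))

parsed.show(false)
```

Note that datediff takes the end date first and the start date second.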

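Math functions

cos,sin,tan

Cosine, sine and tangent of an angle (in radians)

sinh,tanh,cosh

Hyperbolic sine, tangent and cosine

acos,asin,atan,atan2

Inverse trigonometric functions: the angle whose cosine/sine/tangent is the given value

bin

Returns the binary representation of a long value as a string. For example, bin("12") returns "1100".

bround

Rounds using the HALF_EVEN mode of Decimal: a fraction above 0.5 rounds up, a fraction below 0.5 rounds down, and a fraction of exactly 0.5 rounds to the nearest even number.

round(e: Column, scale: Int)

Rounds to scale decimal places using the HALF_UP mode: a fraction of 0.5 or more rounds up, a fraction below 0.5 rounds down, i.e. the usual "round half up".

ceil

Rounds up

floor

Rounds down

cbrt

Computes the cube-root of the given value.

conv(num:Column, fromBase: Int, toBase: Int)

Converts a number (given as a string) from one base to another

log(base: Double, a: Column):$\log_{base}(a)$

log(a: Column):$\log_e(a)$

log10(a: Column):$\log_{10}(a)$

log2(a: Column):$\log_{2}(a)$

log1p(a: Column):$\log_{e}(a+1)$

pmod(dividend: Column, divisor: Column):Returns the positive value of dividend mod divisor.

pow(l: Double, r: Column):$l^r$; note that r is a column

pow(l: Column, r: Double):$l^r$; note that l is a column

pow(l: Column, r: Column):$l^r$; note that both l and r are columns

radians(e: Column): Converts degrees to radians

rint(e: Column):Returns the double value that is closest in value to the argument and is equal to a mathematical integer.

shiftLeft(e: Column, numBits: Int): Shifts left by numBits

shiftRight(e: Column, numBits: Int): Shifts right by numBits (signed)

shiftRightUnsigned(e: Column, numBits: Int): Shifts right by numBits (unsigned)

signum(e: Column): Returns the sign of the value

sqrt(e: Column): Square root

hex(column: Column): Converts to hexadecimal

unhex(column: Column): Inverse of hex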
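
The two rounding modes and the argument order of pow are easy to mix up, so a small sketch may help. The input values are chosen so that they are exactly representable as doubles; the column names are illustrative assumptions, and the local session is only needed outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for illustration; in spark-shell the `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("math-sketch").getOrCreate()
import spark.implicits._

// round uses HALF_UP, bround uses HALF_EVEN; they differ only on exact halves.
val rounded = Seq(0.5, 1.5, 2.5).toDF("v").select(
  $"v",
  round($"v", 0).alias("half_up"),
  bround($"v", 0).alias("half_even")
)

val misc = Seq(2.0).toDF("x").select(
  pow($"x", 10.0).alias("x_pow_10"),       // first argument raised to the second
  pow(10.0, $"x").alias("ten_pow_x"),
  log(2.0, lit(8.0)).alias("log2_of_8"),   // base comes first
  conv(lit("ff"), 16, 10).alias("ff_dec"), // base 16 -> base 10, as a string
  bin(lit(12L)).alias("bin_12")
)

rounded.show()
misc.show()
```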

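Miscellaneous (misc) functions

crc32(e: Column): Computes the CRC32 checksum, returned as bigint

hash(cols: Column*): Computes a hash code, returned as int

md5(e: Column): Computes the MD5 digest, returned as a 32-character hex string

sha1(e: Column): Computes the SHA-1 digest, returned as a 40-character hex string

sha2(e: Column, numBits: Int): Computes a digest from the SHA-2 family, returned as a hex string representing numBits bits. numBits must be 224, 256, 384, or 512.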
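
As a quick check of the digest lengths listed above, the following sketch hashes one string column with each function. The input value and column names are arbitrary assumptions, and the local session is only needed outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for illustration; in spark-shell the `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("digest-sketch").getOrCreate()
import spark.implicits._

val digests = Seq("spark").toDF("s").select(
  md5($"s").alias("md5"),            // 32 hex characters
  sha1($"s").alias("sha1"),          // 40 hex characters
  sha2($"s", 256).alias("sha256"),   // 256 bits = 64 hex characters
  crc32($"s").alias("crc32"),        // bigint
  hash($"s").alias("hash")           // int
)

digests.show(false)
```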

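Other non-aggregate functions

abs(e: Column)

Absolute value

array(cols: Column*)

Combines multiple columns into an array; the columns must all have the same type

map(cols: Column*):

Builds a map from multiple columns. The input columns must be given as key-value pairs (key1, value1, key2, value2, ...); all key columns must share one type and all value columns must share one type.

bitwiseNOT(e: Column):

Computes bitwise NOT.

broadcast[T](df: Dataset[T]): Dataset[T]:

Marks df for broadcasting, which is used to implement a broadcast join, e.g. left.join(broadcast(right), "joinKey")

coalesce(e: Column*):

Returns the first non-null value

col(colName: String):

Returns the Column named colName

column(colName: String):

Alias of col

expr(expr: String):

Parses the expression string expr into a Column and returns that Column.

greatest(exprs: Column*):

Returns the greatest value among the columns, skipping nulls

least(exprs: Column*):

Returns the least value among the columns, skipping nulls

input_file_name():

Returns the name of the file being read by the current Spark task

isnan(e: Column):

Checks whether the value is NaN (not a number)

isnull(e: Column):

Checks whether the value is null

lit(literal: Any):

Creates a Column from a literal value

typedLit[T](literal: T)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]):

Creates a Column from a literal value; the literal may be a Scala type such as List, Seq or Map.

monotonically_increasing_id():

Returns monotonically increasing, unique 64-bit integer IDs; the IDs are not consecutive across partitions.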

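nanvl(col1: Column, col2: Column):

Returns col2 if col1 is NaN, otherwise col1

negate(e: Column):

Negation; same as df.select( -df("amount") )

not(e: Column):

Logical inversion; same as df.filter( !df("isActive") )

rand():

Uniformly distributed random number in [0.0, 1.0)

rand(seed: Long):

Uniformly distributed random number in [0.0, 1.0), generated with the given seed

randn():

Random number drawn from the standard normal distribution

randn(seed: Long):

Same as above, generated with the given seed

spark_partition_id():

Returns the partition ID

struct(cols: Column*):

Combines multiple columns into a new struct column

when(condition: Column, value: Any):

Returns value when condition is true, e.g.

people.select(when(people("gender") === "male", 0)
  .when(people("gender") === "female", 1)
  .otherwise(2))

If there is no otherwise and none of the conditions match, null is returned.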
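
To make the null behaviour of when without otherwise concrete, here is a small sketch that also uses coalesce and lit. The `people` DataFrame and its values are hypothetical, and the local session is only needed outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for illustration; in spark-shell the `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("when-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data; bob has no age recorded.
val people = Seq[(String, String, Option[Int])](
  ("ann", "female", Some(30)),
  ("bob", "male", None),
  ("kim", "other", Some(25))
).toDF("name", "gender", "age")

val coded = people.select(
  $"name",
  when($"gender" === "male", 0)
    .when($"gender" === "female", 1)
    .otherwise(2).alias("gender_code"),
  // No otherwise: rows that match no branch get null.
  when($"gender" === "male", "M").alias("male_flag"),
  // coalesce picks the first non-null argument.
  coalesce($"age", lit(-1)).alias("age_or_default")
)

coded.show()
```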

Sort functions

asc(columnName: String): Ascending order

asc_nulls_first(columnName: String): Ascending order, nulls first

asc_nulls_last(columnName: String): Ascending order, nulls last

e.g.

df.sort(asc("dept"), desc("age"))

The corresponding descending functions are:

 desc,desc_nulls_first,desc_nulls_last
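
The null-placement variants only differ when the sort column contains nulls. A minimal sketch, assuming a made-up DataFrame with one missing age (the local session is only needed outside spark-shell):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for illustration; in spark-shell the `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("sort-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data; b has a null age.
val df = Seq[(String, Option[Int])](
  ("a", Some(3)), ("b", None), ("c", Some(1))
).toDF("name", "age")

// Ascending by age, nulls at the end.
val nullsLast = df.sort(asc_nulls_last("age")).select("name").as[String].collect()
// Descending by age, nulls at the front.
val nullsFirstDesc = df.sort(desc_nulls_first("age")).select("name").as[String].collect()

println(nullsLast.mkString(","))
println(nullsFirstDesc.mkString(","))
```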

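String functions

ascii(e: Column): Returns the ASCII code of the first character

base64(e: Column): Base64 encoding

unbase64(e: Column): Base64 decoding

concat(exprs: Column*): Concatenates multiple string columns

concat_ws(sep: String, exprs: Column*): Concatenates multiple string columns, using sep as the separator

decode(value: Column, charset: String): Decodes a binary value into a string using the given charset

encode(value: Column, charset: String): Encodes a string into binary using the given charset; supported charsets are 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'.

format_number(x: Column, d: Int): Formats the number x as a string like '#,###,###.##', rounded to d decimal places

format_string(format: String, arguments: Column*): Formats arguments according to format, printf-style.

initcap(e: Column): Capitalizes the first letter of each word

lower(e: Column): Converts to lower case

upper(e: Column): Converts to upper case

instr(str: Column, substring: String): Position of the first occurrence of substring in str

length(e: Column): Length of the string

levenshtein(l: Column, r: Column): Computes the edit distance (Levenshtein distance) between two strings

locate(substr: String, str: Column): Position of the first occurrence of substr in str; positions start at 1, and 0 means not found.

locate(substr: String, str: Column, pos: Int): Same as above, but searches from position pos onward.

lpad(str: Column, len: Int, pad: String): Left-pads str with pad up to length len. rpad is the right-padding counterpart.

ltrim(e: Column): Trims spaces from the left end; rtrim is the counterpart for the right end.

ltrim(e: Column, trimString: String): Trims the given characters from the left end; rtrim is the counterpart for the right end.

trim(e: Column, trimString: String): Trims the given characters from both ends

trim(e: Column): Trims spaces from both ends

regexp_extract(e: Column, exp: String, groupIdx: Int): Extracts the group with index groupIdx matched by the regular expression

regexp_replace(e: Column, pattern: Column, replacement: Column): Replaces the parts that match the regular expression; in this overload the pattern and the replacement are columns.

regexp_replace(e: Column, pattern: String, replacement: String): Replaces the parts that match the regular expression

repeat(str: Column, n: Int): Returns str repeated n times

reverse(str: Column): Reverses str

soundex(e: Column): Computes the Soundex code. Note: Soundex indexes names by their English pronunciation, so words that sound the same but are spelled differently map to the same code.

split(str: Column, pattern: String): Splits str around matches of pattern

substring(str: Column, pos: Int, len: Int): Returns the substring of str that starts at position pos and has length len.

substring_index(str: Column, delim: String, count: Int):Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.

translate(src: Column, matchingString: String, replaceString: String): Replaces each character of src that appears in matchingString with the character at the same position in replaceString.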
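
A handful of the string functions above, applied to the row from the test data at the top of this post. This is only a sketch: the output column names and the `www.apache.org` literal are illustrative assumptions, and the local session is only needed outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for illustration; in spark-shell the `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("string-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("lyy", "aaa bbb", "20180102020325")).toDF("name", "comment", "date")

val out = df.select(
  split($"comment", " ").getItem(1).alias("second_word"),            // array index is 0-based
  substring($"date", 1, 8).alias("day"),                             // string position is 1-based
  regexp_extract($"comment", "(\\w+)\\s+(\\w+)", 2).alias("group2"), // second capture group
  concat_ws("-", $"name", upper($"name")).alias("joined"),
  lpad($"name", 5, "*").alias("padded"),
  translate($"comment", "ab", "xy").alias("translated"),             // a -> x, b -> y, per character
  substring_index(lit("www.apache.org"), ".", 2).alias("before_2nd_dot")
)

out.show(false)
```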

UDF functions

user-defined function.

callUDF(udfName: String, cols: Column*): Calls a registered UDF by name

import org.apache.spark.sql._
val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", callUDF("simpleUDF", $"value"))

udf: Defines a UDF
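
For comparison with calling a registered UDF by name, `udf` wraps a Scala function into a value that can be applied to columns directly. A minimal sketch, assuming a made-up `square` function and sample values (the local session is only needed outside spark-shell):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for illustration; in spark-shell the `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
import spark.implicits._

// Hypothetical UDF: squares an integer column.
val square = udf((v: Int) => v * v)

val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")

// A UDF value is applied like any built-in column function.
val squared = df.select($"id", square($"value").alias("value_sq"))

squared.show()
```

Unlike `spark.udf.register`, a UDF defined this way is not visible to SQL strings; it can only be used through the DataFrame API.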

Window functions

cume_dist(): cumulative distribution of values within a window partition

currentRow(): returns the special frame boundary that represents the current row in the window partition.

rank(): Ranking; returns the rank of each row within its partition. Ties leave gaps in the ranking, e.g. 1, 2, 2, 4.

dense_rank(): Ranking; returns the rank of each row within its partition. Ties do not leave gaps in the ranking, e.g. 1, 2, 2, 3.

row_number(): Row number; returns a sequential number for each row, e.g. 1, 2, 3, 4.

percent_rank():returns the relative rank (i.e. percentile) of rows within a window partition.

lag(e: Column, offset: Int, defaultValue: Any): returns the value that is offset rows before the current row

lead(e: Column, offset: Int, defaultValue: Any): returns the value that is offset rows after the current row

ntile(n: Int): returns the ntile group id (from 1 to n inclusive) in an ordered window partition.

unboundedFollowing():returns the special frame boundary that represents the last row in the window partition.
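
The three ranking functions only disagree when there are ties, which the sketch below sets up on purpose. The `dept`, `name` and `score` columns are hypothetical, and the local session is only needed outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Local session for illustration; in spark-shell the `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("window-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data with one tie (b and c both score 80).
val scores = Seq(
  ("sales", "a", 90), ("sales", "b", 80), ("sales", "c", 80), ("sales", "d", 70)
).toDF("dept", "name", "score")

// Rank within each department, highest score first.
val w = Window.partitionBy($"dept").orderBy($"score".desc)

val ranked = scores.select(
  $"name", $"score",
  rank().over(w).alias("rank"),             // ties leave a gap
  dense_rank().over(w).alias("dense_rank"), // ties leave no gap
  row_number().over(w).alias("row_number"), // always sequential
  lag($"score", 1).over(w).alias("prev_score")
)

ranked.orderBy("row_number").show()
```

Which of the tied rows gets row number 2 and which gets 3 is not defined by the ordering, so only the numbers themselves are stable.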
