collect_set removes duplicate elements; collect_list keeps them. select gender, concat_ws(',', collect_set(children)), concat_ws(',', collect_list(children)) from Affairs group by gender // Create a view data.createOrReplaceTempView("Affairs") val df3 = spark.sql("…
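The difference between the two aggregates can be sketched with plain Scala collections. This is only a model of the per-group result, not Spark code: the sample rows are invented, and `distinct` keeps first-seen order, whereas Spark makes no promise about the element order of `collect_set`.

```scala
// Hypothetical (gender, children) rows; not taken from the Affairs data set.
val rows = Seq(("male", "2"), ("male", "2"), ("male", "0"),
               ("female", "1"), ("female", "1"))

// Model of concat_ws(',', collect_list(children)): every value is kept.
val asList: Map[String, String] =
  rows.groupBy(_._1).map { case (g, rs) => g -> rs.map(_._2).mkString(",") }

// Model of concat_ws(',', collect_set(children)): each distinct value once.
val asSet: Map[String, String] =
  rows.groupBy(_._1).map { case (g, rs) => g -> rs.map(_._2).distinct.mkString(",") }

// asList("male") is "2,2,0"; asSet("male") is "2,0"
```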
select gender, age, row_number() over(partition by gender order by age) as rowNumber, rank() over(partition by gender order by age) as ranks, dense_rank() over(partition by gender order by age) as denseRank, percent_rank…
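How the ranking functions in this query differ on ties is easiest to see on one small partition. The following is a plain-Scala model under the assumption of a single partition ordered by age; the ages are made up, and the code mirrors the standard SQL definitions rather than Spark's implementation.

```scala
// One partition (for example a single gender), with hypothetical ages.
val ages = Seq(27, 22, 32, 27).sorted          // order by age: 22, 27, 27, 32

// row_number: position in the ordered partition; ties get distinct numbers.
val rowNumber = ages.indices.map(_ + 1)

// rank: 1 + number of rows strictly smaller; ties share a rank and leave a gap.
val rank = ages.map(a => ages.count(_ < a) + 1)

// dense_rank: 1 + number of distinct smaller values; no gaps after ties.
val denseRank = ages.map(a => ages.distinct.count(_ < a) + 1)

// percent_rank: (rank - 1) / (rows in partition - 1); 0.0 for a one-row partition.
val percentRank =
  rank.map(r => if (ages.size > 1) (r - 1).toDouble / (ages.size - 1) else 0.0)

// rowNumber: 1, 2, 3, 4   rank: 1, 2, 2, 4   denseRank: 1, 2, 2, 3
```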
val df6 = spark.sql("select gender,children,max(age),avg(age),count(age) from Affairs group by Cube(gender,children) order by 1,2") df6.show +------+--------+--------+--------+----------+ |gender|children|max(age)|avg(age)|count(age)| +------+--…
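What `group by Cube(gender,children)` computes can be modelled in plain Scala: each row contributes to every subset of the grouping columns. The rows below are invented, `None` stands in for the NULL that SQL prints in a rolled-up column, and the `order by 1,2` of the query is not modelled.

```scala
// Hypothetical rows: (gender, children, age).
val data = Seq(("male", "yes", 30.0), ("male", "no", 20.0), ("female", "yes", 40.0))

// CUBE over two columns yields four grouping sets:
// (gender, children), (gender), (children), and the grand total ().
val keyed = data.flatMap { case (g, c, age) =>
  for {
    gk <- Seq(Some(g), None)   // keep gender, or roll it up
    ck <- Seq(Some(c), None)   // keep children, or roll it up
  } yield ((gk, ck), age)
}

// max(age), avg(age), count(age) per grouping key.
val cube = keyed.groupBy(_._1).map { case (k, rs) =>
  val xs = rs.map(_._2)
  k -> (xs.max, xs.sum / xs.size, xs.size)
}
```

With these three rows the model produces eight result rows: three for the full key, two per single column, and one grand total.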
val df4=spark.sql("SELECT mean(age),variance(age),stddev(age),corr(age,yearsmarried),skewness(age),kurtosis(age) FROM Affairs") df4.show +--------+------------------+------------------+-----------------------+-----------------+------------------…
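For readers who want to check such figures by hand, here is a plain-Scala sketch of four of the aggregates in this query. It assumes that `variance` and `stddev` are the sample versions (dividing by n - 1), which is how the Spark SQL function reference describes them; the numbers are invented, and skewness and kurtosis are left out.

```scala
// Hypothetical paired samples standing in for age and yearsmarried.
val age   = Seq(22.0, 27.0, 32.0, 37.0)
val years = Seq(1.0, 4.0, 7.0, 15.0)

def mean(xs: Seq[Double]): Double = xs.sum / xs.size

// Sample variance: squared deviations divided by n - 1 (assumed var_samp semantics).
def variance(xs: Seq[Double]): Double = {
  val m = mean(xs)
  xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
}

def stddev(xs: Seq[Double]): Double = math.sqrt(variance(xs))

// Pearson correlation coefficient of two equally long samples.
def corr(xs: Seq[Double], ys: Seq[Double]): Double = {
  val (mx, my) = (mean(xs), mean(ys))
  val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val sx  = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
  val sy  = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
  cov / (sx * sy)
}
```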
// Create a view data.createOrReplaceTempView("Affairs") val df1 = spark.sql("SELECT * FROM Affairs WHERE age BETWEEN 20 AND 25") df1: org.apache.spark.sql.DataFrame = [affairs: double, gender: string ... 7 more fields] // Subquery val df2 = spark.s…
Contents Part I. Gentle Overview of Big Data and Spark Overview 1. Basic architecture 2. Basic concepts 3. Example (can be skipped) Spark toolbox 1. Datasets: Type-Safe Structured APIs 2. Structured Streaming 3. Machine Learning and Advanced Analytics 4. Lower-Level APIs Part II. Structured APIs-DataFrames,…
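1. How to combine several rows of one column into a single field (cell), and a Hive approach to the top-N problem, as follows: Hive column-to-row conversion into JSON and into complex structures such as array, list, and set; extracting the top N in Hive with window functions. select ll, collect_list(n), -- convert the top N into a List or JSON with the help of collect_set(xx) collect_list(xx) collect_list(nn), collect_list(ll), coll…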
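The query in this snippet is cut off, so the following is only a plain-Scala model of the usual pattern it appears to describe: rank rows within each group, keep the first N, then gather them into one list per group. The group keys, scores, and the value of `n` are all hypothetical.

```scala
// Hypothetical rows: (group key, score).
val scores = Seq(("a", 5), ("a", 9), ("a", 7), ("b", 3), ("b", 8))
val n = 2

// Per group: order by score descending, keep the first n rows
// (the role a row_number() <= n filter plays in SQL),
// then collect the survivors into a single list, as collect_list would.
val topN: Map[String, Seq[Int]] =
  scores.groupBy(_._1).map { case (k, rs) =>
    k -> rs.map(_._2).sorted(Ordering[Int].reverse).take(n)
  }

// topN("a") is Seq(9, 7); topN("b") is Seq(8, 3)
```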
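Hive row/column conversion 1. Rows to columns (merge multiple rows into one column, grouped by the primary key). Function used: concat_ws(',', collect_set(column)). collect_list does not deduplicate; collect_set deduplicates. The column must be of type string. 1.1. Build the test data vi row_to_col.txt a b 1 a b 2 a b 3 c d 4 c d 5 c d 6 1.2. Create the table create table tmp_jia…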