Spark: Implementing Row-to-Column Conversion

Example Java code:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.split;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TestSparkSqlSplit {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate();

        // Build a one-row data set whose id column packs several "id,name" pairs.
        List<MyEntity> items = new ArrayList<MyEntity>();
        MyEntity myEntity = new MyEntity();
        myEntity.setId("scene_id1,scene_name1;scene_id2,scene_name2|id1");
        myEntity.setName("name");
        myEntity.setFields("other");
        items.add(myEntity);

        sparkSession.createDataFrame(items, MyEntity.class).createOrReplaceTempView("test");
        Dataset<Row> rows = sparkSession.sql("select * from test");

        // Keep the part before '|', split it on ';', and emit one row per element.
        rows = rows.withColumn("id", explode(split(split(col("id"), "\\|").getItem(0), ";")));

        // Split each "id,name" pair into two temporary columns.
        rows = rows.withColumn("id1", split(rows.col("id"), ",").getItem(0))
                .withColumn("name1", split(rows.col("id"), ",").getItem(1));

        // Overwrite the original columns, then drop the temporaries.
        rows = rows.withColumn("id", rows.col("id1"))
                .withColumn("name", rows.col("name1"));
        rows = rows.drop("id1", "name1");

        rows.show();
        sparkSession.stop();
    }
}

MyEntity.java

import java.io.Serializable;

public class MyEntity implements Serializable {
    private String id;
    private String name;
    private String fields;

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getFields() {
        return fields;
    }

    public void setFields(String fields) {
        this.fields = fields;
    }
}

Output:

+------+---------+-----------+
|fields|       id|       name|
+------+---------+-----------+
| other|scene_id1|scene_name1|
| other|scene_id2|scene_name2|
+------+---------+-----------+
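The nested split and explode calls in the Java example can be hard to trace. The following is a minimal sketch in plain Java, with no Spark dependency, that models what happens to the single id value. The class name ExplodeModel and the method explodePairs are hypothetical names introduced for illustration and are not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a plain-Java model of the column pipeline, not Spark code.
class ExplodeModel {
    /** Models explode(split(split(id, "\\|")[0], ";")) followed by a split on ",". */
    static List<String[]> explodePairs(String raw) {
        List<String[]> out = new ArrayList<>();
        if (raw == null) {
            return out; // exploding a null value yields no rows
        }
        // Keep the part before the first '|'.
        String head = raw.split("\\|", -1)[0];
        // One output row per ';'-separated element.
        for (String pair : head.split(";", -1)) {
            String[] kv = pair.split(",", -1);
            String id = kv[0];
            // A missing second item is modelled as null.
            String name = kv.length > 1 ? kv[1] : null;
            out.add(new String[] {id, name});
        }
        return out;
    }

    public static void main(String[] args) {
        String raw = "scene_id1,scene_name1;scene_id2,scene_name2|id1";
        for (String[] row : explodePairs(raw)) {
            System.out.println(row[0] + " | " + row[1]);
        }
        // prints:
        // scene_id1 | scene_name1
        // scene_id2 | scene_name2
    }
}
```

The two printed rows correspond to the id and name columns of the output table above; the trailing "id1" after the '|' is discarded by the first split.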

Scala implementation:

[boco@CDH- ~]$ spark2-shell
Setting default log level to "WARN".
...
Spark context available as 'sc' (master = yarn, app id = application_1552012317155_0189).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2..cloudera1
      /_/

Using Scala version 2.11. (Java HotSpot(TM) -Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(
     |   (1, "scene_id1,scene_name1;scene_id2,scene_name2", ""),
     |   (2, "scene_id1,scene_name1;scene_id2,scene_name2;scene_id3,scene_name3", ""),
     |   (3, "scene_id4,scene_name4;scene_id2,scene_name2", ""),
     |   (4, "scene_id6,scene_name6;scene_id5,scene_name5", "")
     | ).toDF("id", "int_id", "name");
df: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> df.show;
+---+--------------------+----+
| id|              int_id|name|
+---+--------------------+----+
|  1|scene_id1,scene_n...|    |
|  2|scene_id1,scene_n...|    |
|  3|scene_id4,scene_n...|    |
|  4|scene_id6,scene_n...|    |
+---+--------------------+----+

scala> df.withColumn("int_id", explode(split(col("int_id"), ";")));
res1: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> res1.show();
+---+--------------------+----+
| id|              int_id|name|
+---+--------------------+----+
|  1|scene_id1,scene_n...|    |
|  1|scene_id2,scene_n...|    |
|  2|scene_id1,scene_n...|    |
|  2|scene_id2,scene_n...|    |
|  2|scene_id3,scene_n...|    |
|  3|scene_id4,scene_n...|    |
|  3|scene_id2,scene_n...|    |
|  4|scene_id6,scene_n...|    |
|  4|scene_id5,scene_n...|    |
+---+--------------------+----+

scala> res1.withColumn("int_id", split(col("int_id"), ",")(0)).withColumn("name", split(col("int_id"), ",")(1));
res5: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> res5.show
+---+---------+----+
| id|   int_id|name|
+---+---------+----+
|  1|scene_id1|null|
|  1|scene_id2|null|
|  2|scene_id1|null|
|  2|scene_id2|null|
|  2|scene_id3|null|
|  3|scene_id4|null|
|  3|scene_id2|null|
|  4|scene_id6|null|
|  4|scene_id5|null|
+---+---------+----+

The name column is null here because int_id was overwritten before name was derived from it, so the second split no longer finds a comma. Deriving name first gives the expected result:

scala> res1.withColumn("name", split(col("int_id"), ",")(1)).withColumn("int_id", split(col("int_id"), ",")(0));
res7: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> res7.show
+---+---------+-----------+
| id|   int_id|       name|
+---+---------+-----------+
|  1|scene_id1|scene_name1|
|  1|scene_id2|scene_name2|
|  2|scene_id1|scene_name1|
|  2|scene_id2|scene_name2|
|  2|scene_id3|scene_name3|
|  3|scene_id4|scene_name4|
|  3|scene_id2|scene_name2|
|  4|scene_id6|scene_name6|
|  4|scene_id5|scene_name5|
+---+---------+-----------+
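The two withColumn chains above differ only in their order, yet the first one leaves name as null. A minimal sketch of the reason, in plain Java without Spark: each step reads the row as the previous step left it. The class WithColumnOrder and its methods are hypothetical names used only for this illustration, and the helper item assumes that an out-of-range index yields null, which matches the null values in the session output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of chained column rewrites; not Spark code.
class WithColumnOrder {
    /** Returns the n-th ','-separated item, or null if s is null or n is out of range. */
    static String item(String s, int n) {
        if (s == null) {
            return null;
        }
        String[] parts = s.split(",", -1);
        return n < parts.length ? parts[n] : null;
    }

    /** Overwrites int_id first, then derives name from the already-shortened value. */
    static Map<String, String> idFirst(String intId) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("int_id", intId);
        row.put("int_id", item(row.get("int_id"), 0)); // int_id is now "scene_id1"
        row.put("name", item(row.get("int_id"), 1));   // no second item left, so null
        return row;
    }

    /** Derives name while int_id still holds the full pair, then overwrites int_id. */
    static Map<String, String> nameFirst(String intId) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("int_id", intId);
        row.put("name", item(row.get("int_id"), 1));
        row.put("int_id", item(row.get("int_id"), 0));
        return row;
    }

    public static void main(String[] args) {
        System.out.println(idFirst("scene_id1,scene_name1"));   // {int_id=scene_id1, name=null}
        System.out.println(nameFirst("scene_id1,scene_name1")); // {int_id=scene_id1, name=scene_name1}
    }
}
```

An alternative that avoids the ordering concern is to write the split results into temporary columns and rename afterwards, as the Java example does with id1 and name1.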

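If int_id (string type) is null, do not compare it with null in a filter. Under SQL three-valued logic, col("int_id").notEqual(null) evaluates to NULL for every row, so the filter drops all data. The null value is not turned into an empty string: the second filter below, notEqual(""), removes the null row for the same reason, because comparing null with "" also yields NULL rather than true.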

scala> val df = Seq(
     |             (1, null,""),
     |             (2, "-1",""),
     |             (3, "scene_id4,scene_name4;scene_id2,scene_name2",""),
     |             (4, "scene_id6,scene_name6;scene_id5,scene_name5","")
     |           ).toDF("id", "int_id","name");
df: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> df.filter(col("int_id").notEqual(null).and(col("int_id").notEqual("-1")));
res5: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, int_id: string ... 1 more field]

scala> res5.show;
+---+------+----+
| id|int_id|name|
+---+------+----+
+---+------+----+

scala> df.filter(col("int_id").notEqual("").and(col("int_id").notEqual("-1")));
res7: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, int_id: string ... 1 more field]

scala> res7.show;
+---+--------------------+----+
| id|              int_id|name|
+---+--------------------+----+
|  3|scene_id4,scene_n...|    |
|  4|scene_id6,scene_n...|    |
+---+--------------------+----+
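The empty result of the first filter follows from how SQL treats comparisons with null. In Spark the usual way to express the intended condition is col("int_id").isNotNull() combined with the "-1" check. The sketch below is a plain-Java model of that behaviour, assuming standard SQL three-valued logic; NullFilterModel and its methods are hypothetical names for illustration only, with a Java null Boolean standing in for the SQL value NULL (unknown).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of SQL three-valued logic in a filter; not Spark code.
class NullFilterModel {
    /** SQL-style "a <> b": unknown (null) when either side is null. */
    static Boolean notEqual(String a, String b) {
        if (a == null || b == null) {
            return null;
        }
        return !a.equals(b);
    }

    /** SQL-style AND: false wins, otherwise unknown propagates. */
    static Boolean and(Boolean x, Boolean y) {
        if (Boolean.FALSE.equals(x) || Boolean.FALSE.equals(y)) {
            return false;
        }
        if (x == null || y == null) {
            return null;
        }
        return true;
    }

    /** Models filter(int_id <> null AND int_id <> "-1"): keeps only rows where the predicate is true. */
    static List<String> brokenFilter(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String v : values) {
            Boolean p = and(notEqual(v, null), notEqual(v, "-1"));
            if (Boolean.TRUE.equals(p)) {
                kept.add(v);
            }
        }
        return kept;
    }

    /** Models filter(int_id IS NOT NULL AND int_id <> "-1"). */
    static List<String> fixedFilter(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String v : values) {
            Boolean p = and(v != null, notEqual(v, "-1"));
            if (Boolean.TRUE.equals(p)) {
                kept.add(v);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList((String) null, "-1", "scene_id4,scene_name4", "scene_id6,scene_name6");
        System.out.println(brokenFilter(values)); // []
        System.out.println(fixedFilter(values));  // [scene_id4,scene_name4, scene_id6,scene_name6]
    }
}
```

The comparison against null is unknown for every row, so no row can satisfy the combined predicate; the explicit null check is always either true or false and avoids that.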

If int_id does not contain the delimiter used for the split, no rows are lost:

scala> val df = Seq(
     | (1, null,""),
     | (2, "-1",""),
     | (3, "scene_id4,scene_name4;scene_id2,scene_name2",""),
     | (4, "scene_id6,scene_name6;scene_id5,scene_name5","")
     | ).toDF("id", "int_id","name");
df: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> df.withColumn("name", split(col("int_id"), ",")(1)).withColumn("int_id", split(col("int_id"), ",")(0));
res0: org.apache.spark.sql.DataFrame = [id: int, int_id: string ... 1 more field]

scala> res0.show;
+---+---------+--------------------+
| id|   int_id|                name|
+---+---------+--------------------+
|  1|     null|                null|
|  2|       -1|                null|
|  3|scene_id4|scene_name4;scene...|
|  4|scene_id6|scene_name6;scene...|
+---+---------+--------------------+
