转化:

RDD、DataFrame、Dataset三者有许多共性,有各自适用的场景常常需要在三者之间转换

DataFrame/Dataset转RDD:

这个转换很简单

val rdd1=testDF.rdd
val rdd2=testDS.rdd RDD转DataFrame: import spark.implicits._
val testDF = rdd.map {line=>
(line._1,line._2)
}.toDF("col1","col2") 一般用元组把一行的数据写在一起,然后在toDF中指定字段名 RDD转Dataset:

import spark.implicits._
case class Coltest(col1:String,col2:Int)extends Serializable //定义字段名和类型
val testDS = rdd.map {line=>
Coltest(line._1,line._2)
}.toDS 可以注意到,定义每一行的类型(case class)时,已经给出了字段名和类型,后面只要往case class里面添加值即可 Dataset转DataFrame: 这个也很简单,因为只是把case class封装成Row import spark.implicits._
val testDF = testDS.toDF DataFrame转Dataset: import spark.implicits._
case class Coltest(col1:String,col2:Int)extends Serializable //定义字段名和类型
val testDS = testDF.as[Coltest] 这种方法就是在给出每一列的类型后,使用as方法,转成Dataset,这在数据类型是DataFrame又需要针对各个字段处理时极为方便
特别注意: 在使用一些特殊的操作时,一定要加上 import spark.implicits._ 不然toDF、toDS无法使用

package dataframe


import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}


//
// Explore interoperability between DataFrame and Dataset. Note that Dataset
// is covered in much greater detail in the 'dataset' directory.
//
object DatasetConversion {


case class Cust(id: Integer, name: String, sales: Double, discount: Double, state: String)


case class StateSales(state: String, sales: Double)


def main(args: Array[String]) {
val spark =
SparkSession.builder()
.appName("DataFrame-DatasetConversion")
.master("local[4]")
.getOrCreate()


import spark.implicits._


// create a sequence of case class objects
// (we defined the case class above)
val custs = Seq(
Cust(1, "Widget Co", 120000.00, 0.00, "AZ"),
Cust(2, "Acme Widgets", 410500.00, 500.00, "CA"),
Cust(3, "Widgetry", 410500.00, 200.00, "CA"),
Cust(4, "Widgets R Us", 410500.00, 0.0, "CA"),
Cust(5, "Ye Olde Widgete", 500.00, 0.0, "MA")
)


// Create the DataFrame without passing through an RDD
val customerDF : DataFrame = spark.createDataFrame(custs)
//
// println("*** DataFrame schema")
//
// customerDF.printSchema()
//
// println("*** DataFrame contents")
//
// customerDF.show()

// +---+---------------+--------+--------+-----+
//| id| name| sales|discount|state|
//+---+---------------+--------+--------+-----+
//| 1| Widget Co|120000.0| 0.0| AZ|
//| 2| Acme Widgets|410500.0| 500.0| CA|
//| 3| Widgetry|410500.0| 200.0| CA|
//| 4| Widgets R Us|410500.0| 0.0| CA|
//| 5|Ye Olde Widgete| 500.0| 0.0| MA|
//+---+---------------+--------+--------+-----+


//
// println("*** Select and filter the DataFrame")
//
val smallerDF =
customerDF.select("sales", "state").filter($"state".equalTo("CA"))
//
// smallerDF.show()

//
// +--------+-----+
//| sales|state|
//+--------+-----+
//|410500.0| CA|
//|410500.0| CA|
//|410500.0| CA|
//+--------+-----+

///////////////////////////////////////////////////////////////////////////////////

// Convert it to a Dataset by specifying the type of the rows -- use a case
// class because we have one and it's most convenient to work with. Notice
// you have to choose a case class that matches the remaining columns.
// BUT also notice that the columns keep their order from the DataFrame --
// later you'll see a Dataset[StateSales] of the same type where the
// columns have the opposite order, because of the way it was created.


val customerDS : Dataset[StateSales] = smallerDF.as[StateSales]
//
// println("*** Dataset schema")
//
// customerDS.printSchema()
//
// println("*** Dataset contents")
//
// customerDS.show()


// Select and other operations can be performed directly on a Dataset too,
// but be careful to read the documentation for Dataset -- there are
// "typed transformations", which produce a Dataset, and
// "untyped transformations", which produce a DataFrame. In particular,
// you need to project using a TypedColumn to gate a Dataset.


// val verySmallDS : Dataset[Double] = customerDS.select($"sales".as[Double])
//
// println("*** Dataset after projecting one column")
//
// verySmallDS.show()

//
//+--------+
//| sales|
//+--------+
//|410500.0|
//|410500.0|
//|410500.0|
//+--------+


// If you select multiple columns on a Dataset you end up with a Dataset
// of tuple type, but the columns keep their names.
val tupleDS : Dataset[(String, Double)] =
customerDS.select($"state".as[String], $"sales".as[Double])
//
// println("*** Dataset after projecting two columns -- tuple version")
//
// tupleDS.show()

//
//+-----+--------+
//|state| sales|
//+-----+--------+
//| CA|410500.0|
//| CA|410500.0|
//| CA|410500.0|
//+-----+--------+


// You can also cast back to a Dataset of a case class. Notice this time
// the columns have the opposite order than the last Dataset[StateSales]
// val betterDS: Dataset[StateSales] = tupleDS.as[StateSales]
//
// println("*** Dataset after projecting two columns -- case class version")
//
// betterDS.show()

//
//+-----+--------+
//|state| sales|
//+-----+--------+
//| CA|410500.0|
//| CA|410500.0|
//| CA|410500.0|
//+-----+--------+


// Converting back to a DataFrame without making other changes is really easy
// val backToDataFrame : DataFrame = tupleDS.toDF()
//
// println("*** This time as a DataFrame")
//
// backToDataFrame.show()
//

//+-----+--------+
//|state| sales|
//+-----+--------+
//| CA|410500.0|
//| CA|410500.0|
//| CA|410500.0|
//+-----+--------+


//
// // While converting back to a DataFrame you can rename the columns
val renamedDataFrame : DataFrame = tupleDS.toDF("MyState", "MySales")


println("*** Again as a DataFrame but with renamed columns")


renamedDataFrame.show()


// +-------+--------+
//|MyState| MySales|
//+-------+--------+
//| CA|410500.0|
//| CA|410500.0|
//| CA|410500.0|
//+-------+--------+


}
}

 

RDD、DataFrame、Dataset三者三者之间转换的更多相关文章

  1. APACHE SPARK 2.0 API IMPROVEMENTS: RDD, DATAFRAME, DATASET AND SQL

    What’s New, What’s Changed and How to get Started. Are you ready for Apache Spark 2.0? If you are ju ...

  2. spark的数据结构 RDD——DataFrame——DataSet区别

    转载自:http://blog.csdn.net/wo334499/article/details/51689549 RDD 优点: 编译时类型安全 编译时就能检查出类型错误 面向对象的编程风格 直接 ...

  3. sparkSQL中RDD——DataFrame——DataSet的区别

    spark中RDD.DataFrame.DataSet都是spark的数据集合抽象,RDD针对的是一个个对象,但是DF与DS中针对的是一个个Row RDD 优点: 编译时类型安全 编译时就能检查出类型 ...

  4. RDD, DataFrame or Dataset

    总结: 1.RDD是一个Java对象的集合.RDD的优点是更面向对象,代码更容易理解.但在需要在集群中传输数据时需要为每个对象保留数据及结构信息,这会导致数据的冗余,同时这会导致大量的GC. 2.Da ...

  5. spark rdd df dataset

    RDD.DataFrame.DataSet的区别和联系 共性: 1)都是spark中得弹性分布式数据集,轻量级 2)都是惰性机制,延迟计算 3)根据内存情况,自动缓存,加快计算速度 4)都有parti ...

  6. byte[] 、Bitmap与Drawbale 三者直接的转换

    经常遇到这种类似头疼的问题 byte[] .Bitmap与Drawbale 三者直接的转换 1.byte[] ->Bitmap Bitmap Bitmap = BitmapFactory.dec ...

  7. Spark入门之DataFrame/DataSet

    目录 Part I. Gentle Overview of Big Data and Spark Overview 1.基本架构 2.基本概念 3.例子(可跳过) Spark工具箱 1.Dataset ...

  8. C#中对象,字符串,dataTable、DataReader、DataSet,对象集合转换成Json字符串方法。

    C#中对象,字符串,dataTable.DataReader.DataSet,对象集合转换成Json字符串方法. public class ConvertJson { #region 私有方法 /// ...

  9. 关于不同进制数之间转换的数学推导【Written By KillerLegend】

    关于不同进制数之间转换的数学推导 涉及范围:正整数范围内二进制(Binary),八进制(Octonary),十进制(Decimal),十六进制(hexadecimal)之间的转换 数的进制有多种,比如 ...

随机推荐

  1. 于dm-0 dm-1

    dm是device mapper的意思,dm-0, dm-1的实体可以通过下面几个命令看出,lvm会把每个lv连接到一个/dev/dm-x的设备档,这个设备档并不是一个真正的磁盘,所以不会有分区表存在 ...

  2. EF-CodeFirst-数据库初始化

    数据库初始化 之前看到Code-First会自动根据域模型创建数据库,下图展示了一个数据库初始化工作流程,该工作流程基于从DbContext派生的上下文类的基础构造函数中传递的参数 如上图所示,上下文 ...

  3. LeetCode 762 Prime Number of Set Bits in Binary Representation 解题报告

    题目要求 Given two integers L and R, find the count of numbers in the range [L, R] (inclusive) having a ...

  4. 洛谷P2329 栅栏 [SCOI2005] 搜索

    正解:搜索 解题报告: 先放下传送门! 首先说下爆搜趴,就直接枚每个需求是否被满足以及如果满足切哪个板子,随便加个最优性剪枝,似乎是有80pts 然后思考优化 首先显然尽量满足需求比较小的,显然如果能 ...

  5. 使用Apache CXF根据wsdl文件生成代码

    1.去官网下载,我用的是apache-cxf-2.5.10.zip 2.解压 3.通过命令行进入Apache CXF的bin目录,如我的目录是D:\BIS\axis2\apache-cxf-2.7.1 ...

  6. EXPLAIN执行计划中要重点关注哪些要素(叶金荣)

    原文:http://mp.weixin.qq.com/s/CDKN_nPcIjzA_U5-xwAE5w 导读 EXPLAIN的结果中,有哪些关键信息值得注意呢? MySQL的EXPLAIN当然和ORA ...

  7. 异常 java.net.ConnectException: Connection refused: no further information

    java.net.ConnectException: Connection refused: no further information at sun.nio.ch.SocketChannelImp ...

  8. iOS webview注入JS

    - (void)webViewDidFinishLoad:(UIWebView *)webView { NSString *js = @"function imgAutoFit() { \ ...

  9. insert select带来的问题

    当使用 insert...select...进行记录的插入时,如果select的表是innodb类型的,不论insert的表是什么类型的表,都会对select的表的纪录进行锁定.对于那些从oracle ...

  10. Error, some other host already uses address 192.168.0.202错误解决方法

    Error, some other host already uses address 192.168.0.202错误解决方法 今天配置虚拟机网卡的时候遇到错误:Error, some other h ...