Spark Notes: Dataset and DataFrame
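Dataset: object-oriented; built from JVM objects, or converted from other formats
DataFrame: SQL-query oriented; built from a variety of data sources, or converted from other formats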
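Converting between RDD, Dataset, and DataFrame
(toDS() and toDF() on an RDD require import spark.implicits._)
1. RDD -> Dataset
val ds = rdd.toDS()
2. RDD -> DataFrame
val df = spark.read.json(rdd) // only when rdd is an RDD[String] of JSON records; otherwise use rdd.toDF()
3. Dataset -> RDD
val rdd = ds.rdd
4. Dataset -> DataFrame
val df = ds.toDF()
5. DataFrame -> RDD
val rdd = df.toJSON.rdd // an RDD[String] of JSON records; df.rdd gives an RDD[Row]
6. DataFrame -> Dataset
val ds = df.toJSON // a Dataset[String] of JSON records; df.as[T] gives a typed Dataset[T]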
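The six conversions above can be sketched end to end with typed results. This is a minimal sketch, assuming Spark 2.x or later on the classpath; the `Item` case class, the `ConversionSketch` object, and the sample values are hypothetical and exist only for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Item(name: String, age: Long)

object ConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("conversion sketch")
      .master("local[2]")
      .getOrCreate()
    // Brings the encoders and the toDS()/toDF() extension methods into scope.
    import spark.implicits._

    val items = Seq(Item("Andy", 32L), Item("Justin", 19L))
    val rdd: RDD[Item] = spark.sparkContext.parallelize(items)

    val ds: Dataset[Item] = rdd.toDS()         // 1. RDD -> Dataset
    val df: DataFrame = rdd.toDF()             // 2. RDD -> DataFrame
    val rddFromDs: RDD[Item] = ds.rdd          // 3. Dataset -> RDD
    val dfFromDs: DataFrame = ds.toDF()        // 4. Dataset -> DataFrame
    val rddFromDf: RDD[Row] = df.rdd           // 5. DataFrame -> RDD of Row
    val dsFromDf: Dataset[Item] = df.as[Item]  // 6. DataFrame -> typed Dataset

    // The round trips should preserve the records.
    assert(dsFromDf.collect().toSet == items.toSet)
    assert(rddFromDs.collect().toSet == items.toSet)
    assert(rddFromDf.count() == 2L)
    assert(dfFromDs.columns.toSeq == Seq("name", "age"))

    spark.stop()
  }
}
```

Compared with going through JSON strings, `df.rdd` and `df.as[Item]` keep the column types, so no re-parsing is needed after the conversion.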
DataFrameTest1.scala
package com.spark.dataframe

import org.apache.spark.{SparkConf, SparkContext}

class DataFrameTest1 {
}

object DataFrameTest1 {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
    val logFile = "e://temp.txt"
    val conf = new SparkConf().setAppName("test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    // Read the file as an RDD with at least 2 partitions and cache it, since it is scanned twice
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    sc.stop()
  }
}
DataFrameTest2.scala
package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class DataFrameTest2 {
}

object DataFrameTest2 {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .master("local[4]")
      .getOrCreate()
    val df = spark.read.json("E:\\spark\\datatemp\\people.json")
    df.show()
    // This import is needed to use the $-notation
    import spark.implicits._
    df.printSchema()
    df.select("name").show()
    df.filter("age > 21").show()
    df.select($"name", $"age" + 1).show()
    df.groupBy("age").count().show()
  }
}
DataFrameTest3.scala
package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class DataFrameTest3 {
}

object DataFrameTest3 {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .master("local[4]")
      .getOrCreate()
    val df = spark.read.json("E:\\spark\\datatemp\\people.json")
    // Register the DataFrame as a SQL temporary view
    df.createOrReplaceTempView("people")
    val sqlDF = spark.sql("select * from people")
    sqlDF.show()
    // Querying global_temp.people would first require df.createGlobalTempView("people")
    //spark.sql("select * from global_temp.people").show()
  }
}
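The commented-out `global_temp.people` query only resolves once a global temporary view has been registered. Below is a minimal sketch of the difference between the two kinds of view, assuming Spark 2.2 or later (for `createOrReplaceGlobalTempView`); the `GlobalTempViewSketch` object and the in-memory rows standing in for people.json are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object GlobalTempViewSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("global temp view sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical in-memory rows standing in for people.json.
    val df = Seq(("Andy", 32L), ("Justin", 19L)).toDF("name", "age")

    // Session-scoped view: visible only inside this SparkSession.
    df.createOrReplaceTempView("people")
    // Global view: registered in the system database global_temp and
    // shared by all sessions of the same Spark application.
    df.createOrReplaceGlobalTempView("people")

    spark.sql("select * from people").show()
    spark.sql("select * from global_temp.people").show()

    // A second session of the same application still sees the global view.
    val other = spark.newSession()
    assert(other.sql("select * from global_temp.people").count() == 2L)

    spark.stop()
  }
}
```

The global view must always be addressed with the `global_temp.` qualifier; the unqualified name `people` refers to the session-scoped view.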
DataSetTest1.scala
package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class DataSetTest1 {
}

case class Person(name: String, age: Long)

object DataSetTest1 {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .master("local[4]")
      .getOrCreate()
    // This import provides toDS() and the implicit Encoder that as[Person] needs
    import spark.implicits._
    // Build a Dataset from JVM objects
    val caseClassDS = Seq(Person("Andy", 32)).toDS()
    caseClassDS.show()
    // Build a Dataset by reading a DataFrame and mapping it onto the case class by column name
    val ds = spark.read.json("E:\\spark\\datatemp\\people.json").as[Person]
    ds.show()
  }
}
RDDToDataFrame.scala
package com.spark.dataframe

import org.apache.spark.sql.{Row, SparkSession}

class RDDToDataFrame {
}

// Shows two ways of converting an RDD to a DataFrame
object RDDToDataFrame {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
    val spark = SparkSession
      .builder()
      .appName("Rdd to DataFrame")
      .master("local[4]")
      .getOrCreate()
    // This import is needed for toDF() and for the implicit encoders used by map
    import spark.implicits._

    // Way 1: the record class can be defined ahead of time (Person, declared in DataSetTest1.scala)
    val peopleDF = spark.sparkContext
      .textFile("E:\\spark\\datatemp\\people.txt")
      .map(_.split(","))
      .map(attribute => Person(attribute(0), attribute(1).trim.toLong))
      .toDF()
    peopleDF.createOrReplaceTempView("people")
    val teenagerDF = spark.sql("select name, age from people where age between 13 and 19")
    // Columns of a Row can be accessed by index or by field name
    teenagerDF.map(teenager => "name:" + teenager(0)).show()
    teenagerDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
    // No pre-defined encoders for Dataset[Map[K,V]], define explicitly.
    // As an implicit value, it is picked up automatically wherever an Encoder of this type is required.
    implicit val mapEncoder: org.apache.spark.sql.Encoder[Map[String, Any]] =
      org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
    // Primitive types and case classes can be also defined as
    // implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()
    // row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
    teenagerDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect().foreach(println(_))
    // Array(Map("name" -> "Justin", "age" -> 19))

    //////////////////////////////////////////
    // Way 2: case classes cannot be defined ahead of time
    /*
     * When case classes cannot be defined ahead of time
     * (for example, the structure of records is encoded in a string,
     * or a text dataset will be parsed and fields will be projected differently for different users),
     * a DataFrame can be created programmatically with three steps.
     * 1. Create an RDD of Rows from the original RDD;
     * 2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
     * 3. Apply the schema to the RDD of Rows via createDataFrame method provided by SparkSession.
     * */
    import org.apache.spark.sql.types._
    // 1. Create the RDD
    val peopleRDD = spark.sparkContext.textFile("e:\\spark\\datatemp\\people.txt")
    // 2.1 Create a schema that matches the RDD (every field is a nullable string here)
    val schemaString = "name age"
    val fields = schemaString.split(" ")
      .map(fieldName => StructField(fieldName, StringType, nullable = true))
    val schema = StructType(fields)
    // 2.2 Convert the records of the RDD to Rows
    val rowRDD = peopleRDD
      .map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1).trim))
    // 3. Apply the schema to the RDD of Rows to get a DataFrame
    val peopleDF2 = spark.createDataFrame(rowRDD, schema)
    // This replaces the "people" view registered above
    peopleDF2.createOrReplaceTempView("people")
    val results = spark.sql("select name from people")
    results.show()
  }
}
GenericLoadAndSave.scala
package com.spark.dataframe

import org.apache.spark.sql.{SaveMode, SparkSession}

class GenericLoadAndSave {
}

object GenericLoadAndSave {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
    val spark = SparkSession
      .builder()
      .appName("Rdd to DataFrame")
      .master("local[4]")
      .getOrCreate()
    // This import is needed to use the $-notation
    import spark.implicits._

    // Save the data in parquet format (the default format of save())
    val userDF = spark.read.json("e:\\spark\\datatemp\\people.json")
    //userDF.select("name","age").write.save("e:\\spark\\datasave\\nameAndAge.parquet")
    // Same save, with the save mode set to overwrite
    userDF.select("name", "age").write.mode(SaveMode.Overwrite).save("e:\\spark\\datasave\\nameAndAge.parquet")

    // The data source format can be specified explicitly (json, parquet, jdbc, orc, libsvm, csv, text)
    val peopleDF = spark.read.format("json").load("e:\\spark\\datatemp\\people.json")
    //peopleDF.select("name","age").write.format("json").save("e:\\spark\\datasave\\peopleNameAndAge.json")
    // Same save, with the save mode set to overwrite
    peopleDF.select("name", "age").write.mode(SaveMode.Overwrite).format("json").save("e:\\spark\\datasave\\peopleNameAndAge.json")

    // Build a DataFrame by reading from a parquet data source
    val peopleDF2 = spark.read.format("parquet").load("E:\\spark\\datasave\\nameAndAge.parquet\\")
    //+"part-00000-*.snappy.parquet")
    // Appending the file name above pins the load to one file. In fact, parquet can
    // discover and infer partition information from the directory path on its own.
    println("------------------")
    peopleDF2.select("name", "age").show()

    // The line below fails. A likely cause: saveAsTable expects a table name, not a filesystem path.
    //userDF.select("name","age").write.saveAsTable("e:\\spark\\datasave\\peopleSaveAsTable")
    // When running SQL directly on a file, the path is quoted with backticks, not single quotes.
    //val sqlDF = spark.sql("SELECT * FROM parquet.`E:\\spark\\datasave\\nameAndAge.parquet\\part-00000-c8740fc5-cba8-4ebe-a7a8-9cec3da7dfa2.snappy.parquet`")
    //sqlDF.show()
  }
}
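The two commented-out statements at the end of GenericLoadAndSave can be made to work as sketched below. This is a sketch under assumptions, not the original author's fix: the table name `people_name_and_age`, the `SaveAsTableSketch` object, the temporary directories, and the in-memory rows are all hypothetical, and the session runs without Hive support, so the table lives in the in-memory catalog with its files under `spark.sql.warehouse.dir`.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

object SaveAsTableSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical scratch locations, so the sketch does not touch real data.
    val warehouse = Files.createTempDirectory("spark-warehouse").toUri.toString
    val outDir = Files.createTempDirectory("spark-out")
      .resolve("nameAndAge.parquet").toUri.toString

    val spark = SparkSession.builder()
      .appName("saveAsTable sketch")
      .master("local[2]")
      .config("spark.sql.warehouse.dir", warehouse)
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows standing in for people.json.
    val userDF = Seq(("Andy", 32L), ("Justin", 19L)).toDF("name", "age")

    // saveAsTable takes a table identifier; Spark decides where the files go.
    userDF.write.mode(SaveMode.Overwrite).saveAsTable("people_name_and_age")
    assert(spark.table("people_name_and_age").count() == 2L)

    // SQL directly on files: the format name, a dot, then the path in backticks.
    userDF.write.mode(SaveMode.Overwrite).parquet(outDir)
    val sqlDF = spark.sql(s"SELECT name, age FROM parquet.`$outDir`")
    assert(sqlDF.count() == 2L)

    spark.stop()
  }
}
```

Pointing the query at the output directory rather than at one `part-*.parquet` file avoids hard-coding the generated file name, which changes on every write.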
ReadFromParquet.scala
package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class ReadFromParquet {
}

object ReadFromParquet {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
    val spark = SparkSession
      .builder()
      .appName("Rdd to DataFrame")
      .master("local[4]")
      .getOrCreate()
    // This import is needed to use the $-notation
    import spark.implicits._
    // Build a DataFrame by reading from a parquet data source
    val peopleDF2 = spark.read.format("parquet").load("E:\\spark\\datasave\\people")
    /*
     * Directory layout:
     * people
     * |- country=china
     *    |- data.parquet
     * |- country=us
     *    |- data.parquet
     *
     * Each data.parquet holds the name and age of people. Combined with the country
     * taken from the directory path, the resulting table has these columns:
     * +-------+----+-------+
     * |   name| age|country|
     * +-------+----+-------+
     * */
    peopleDF2.show()
  }
}
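A directory layout like the one described in ReadFromParquet does not have to be built by hand. The sketch below writes it with `partitionBy` and reads it back, assuming Spark 2.x or later; the `PartitionDiscoverySketch` object, the temporary directory, and the sample rows are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionDiscoverySketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical scratch directory standing in for E:\spark\datasave\people.
    val base = Files.createTempDirectory("people").toUri.toString

    val spark = SparkSession.builder()
      .appName("partition discovery sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows; country becomes the partition column.
    val people = Seq(("Andy", 32L, "us"), ("Li", 25L, "china"))
      .toDF("name", "age", "country")

    // Writes one sub-directory per distinct value: country=us, country=china.
    // The data files inside hold only name and age.
    people.write.mode(SaveMode.Overwrite).partitionBy("country").parquet(base)

    // Loading the root directory lets Spark recover country from the paths.
    val loaded = spark.read.parquet(base)
    loaded.printSchema()
    assert(loaded.columns.toSet == Set("name", "age", "country"))
    assert(loaded.filter($"country" === "china").count() == 1L)

    spark.stop()
  }
}
```

Loading a single `country=...` sub-directory instead of the root would return only name and age, because the partition column is inferred relative to the path that is loaded.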
SchemaMerge.scala
package com.spark.dataframe

import org.apache.spark.sql.{SaveMode, SparkSession}

class SchemaMerge {
}

object SchemaMerge {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
    val spark = SparkSession
      .builder()
      .appName("Rdd to DataFrame")
      .master("local[4]")
      .getOrCreate()
    // This import is needed for toDF()
    import spark.implicits._

    // First partition directory: columns (value, square)
    val squaresDF = spark.sparkContext.makeRDD(1 to 5)
      .map(i => (i, i * i))
      .toDF("value", "square")
    squaresDF.write.mode(SaveMode.Overwrite).parquet("E:\\spark\\datasave\\schemamerge\\test_table\\key=1")

    // Second partition directory: columns (value, cube)
    val cubesDF = spark.sparkContext.makeRDD(1 to 5)
      .map(i => (i, i * i * i))
      .toDF("value", "cube")
    cubesDF.write.mode(SaveMode.Overwrite).parquet("E:\\spark\\datasave\\schemamerge\\test_table\\key=2")

    // Read the parent directory with schema merging enabled: the result combines the
    // columns of both files with the partition column key taken from the directory names
    val mergedDF = spark.read.option("mergeSchema", "true")
      .parquet("E:\\spark\\datasave\\schemamerge\\test_table\\")
    mergedDF.printSchema()
    mergedDF.show()
  }
}