DataSet:面向对象的,从JVM进行构建,或从其它格式进行转化

DataFrame:面向SQL查询,从多种数据源进行构建,或从其它格式进行转化

RDD DataSet DataFrame互转

1.RDD -> Dataset
val ds = rdd.toDS() 2.RDD -> DataFrame
val df = spark.read.json(rdd) 3.Dataset -> RDD
val rdd = ds.rdd 4.Dataset -> DataFrame
val df = ds.toDF() 5.DataFrame -> RDD
val rdd = df.toJSON.rdd 6.DataFrame -> Dataset
val ds = df.toJSON

DataFrameTest1.scala

package com.spark.dataframe

import org.apache.spark.{SparkConf, SparkContext}

class DataFrameTest1 {
} object DataFrameTest1{ def main(args : Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin");
val logFile = "e://temp.txt"
val conf = new SparkConf().setAppName("test").setMaster("local[4]")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile,2).cache() val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count() println(s"Lines with a: $numAs , Line with b : $numBs") sc.stop()
}
}

DataFrameTest2.scala

package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class DataFrameTest2 {
} object DataFrameTest2{ def main(args : Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.master("local[4]")
.getOrCreate() val df = spark.read.json("E:\\spark\\datatemp\\people.json")
df.show() // This import is needed to use the $-notation
import spark.implicits._
df.printSchema()
df.select("name").show()
df.filter("age>21").show()
df.select($"name",$"age"+1).show() df.groupBy("age").count().show() }
}

DataFrameTest3.scala

package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class DataFrameTest3 {
} object DataFrameTest3{ def main(args : Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.master("local[4]")
.getOrCreate() val df = spark.read.json("E:\\spark\\datatemp\\people.json")
// 将DataFrame注册为sql temporary view
df.createOrReplaceTempView("people") val sqlDF = spark.sql("select * from people")
sqlDF.show()
//spark.sql("select * from global_temp.people").show() }
}

DataSetTest1.scala

package com.spark.dataframe

import org.apache.spark.sql.SparkSession

class DataSetTest1 {
} case class Person(name: String, age: Long) object DataSetTest1 {
def main(args : Array[String]): Unit ={ System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._ val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show() val ds = spark.read.json("E:\\spark\\datatemp\\people.json").as[Person]
ds.show() }
}

RDDToDataFrame.scala

package com.spark.dataframe

import org.apache.spark.sql.{Row, SparkSession}

class RDDToDataFrame {
} //介绍两种将RDD转换为DataFrame的方式
object RDDToDataFrame{
def main(args : Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Rdd to DataFrame")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._ // 数据读取类可以提前定义,Person
val peopleDF =spark.sparkContext
.textFile("E:\\spark\\datatemp\\people.txt")
.map(_.split(","))
.map(attribute => Person(attribute(0),attribute(1).trim.toInt))
.toDF() peopleDF.createOrReplaceTempView("people") val teenagerDF = spark.sql("select name, age from people where age between 13 and 19")
teenagerDF.map(teenager=> "name:"+teenager(0)).show()
teenagerDF.map(teenager => "Name: "+teenager.getAs[String]("name")).show() // No pre-defined encoders for Dataset[Map[K,V]], define explicitly //隐式参数,后面需要Encoder类型的参数时时候则自动调用
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String,Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder() // row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagerDF.map(teenager => teenager.getValuesMap[Any](List("name","age"))).collect().foreach(println(_))
// Array(Map("name" -> "Justin", "age" -> 19)) //////////////////////////////////////////
//case classes 不能提前定义
/*
* When case classes cannot be defined ahead of time
* (for example, the structure of records is encoded in a string,
* or a text dataset will be parsed and fields will be projected differently for different users),
* a DataFrame can be created programmatically with three steps.
* 1. Create an RDD of Rows from the original RDD;
* 2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
* 3. Apply the schema to the RDD of Rows via createDataFrame method provided by SparkSession.
* */
import org.apache.spark.sql.types._ //1. 创建RDD
val peopleRDD = spark.sparkContext.textFile("e:\\spark\\datatemp\\people.txt")
//2.1 创建和RDD相匹配的schema
val schemaString = "name age"
val fields = schemaString.split(" ")
.map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields) //2.2. 将RDD进行格式化
val rowRDD = peopleRDD
.map(_.split(","))
.map(attributes => Row(attributes(0),attributes(1).trim)) //3. 将RDD转换为DF
val peopleDF2 = spark.createDataFrame(rowRDD, schema)
peopleDF2.createOrReplaceTempView("people")
val results = spark.sql("select name from people") results.show() }
}

GenericLoadAndSave.scala

package com.spark.dataframe
import org.apache.spark.sql.{SaveMode, SparkSession}
class GenericLoadAndSave {
} object GenericLoadAndSave{
def main(args: Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Rdd to DataFrame")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._ //保存为parquet格式的数据
val userDF = spark.read.json("e:\\spark\\datatemp\\people.json")
//userDF.select("name","age").write.save("e:\\spark\\datasave\\nameAndAge.parquet")
//数据保存时的模式设置为append
userDF.select("name","age").write.mode(SaveMode.Overwrite).save("e:\\spark\\datasave\\nameAndAge.parquet") //数据源的格式可以指定为 (json, parquet, jdbc, orc, libsvm, csv, text)
val peopleDF = spark.read.format("json").load("e:\\spark\\datatemp\\people.json")
//peopleDF.select("name","age").write.format("json").save("e:\\spark\\datasave\\peopleNameAndAge.json")
//数据保存时的模式设置为overwrite
peopleDF.select("name","age").write.mode(SaveMode.Overwrite).format("json").save("e:\\spark\\datasave\\peopleNameAndAge.json") //从parquet格式的数据源中读取数据构建DataFrame
val peopleDF2 = spark.read.format("parquet").load("E:\\spark\\datasave\\nameAndAge.parquet\\")
//+"part-00000-*.snappy.parquet") //这行加上便于精准定位。事实上parquet可以根据文件路径自行发现和推断分区信息
System.out.println("------------------")
peopleDF2.select("name","age").show() //userDF.select("name","age").write.saveAsTable("e:\\spark\\datasave\\peopleSaveAsTable") //代码有错误,原因暂时未知
//val sqlDF = spark.sql("SELECT * FROM parquet.'E:\\spark\\datasave\\nameAndAge.parquet\\part-00000-c8740fc5-cba8-4ebe-a7a8-9cec3da7dfa2.snappy.parquet'")
//sqlDF.show()
}
}

ReadFromParquet.scala

package com.spark.dataframe
import org.apache.spark.sql.{SaveMode, SparkSession} class ReadFromParquet {
} object ReadFromParquet{
def main(args: Array[String]): Unit ={
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Rdd to DataFrame")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._
//从parquet格式的数据源中读取数据构建DataFrame
val peopleDF2 = spark.read.format("parquet").load("E:\\spark\\datasave\\people") /*
* 目录结构为:
* people
* |- country=china
* |-data.parquet
* |- country=us
* |-data.parquet
*
* data.parquet内包含people的name和age。加上文件路径中的country信息,最终得到的表结构为:
* +-------+----+-------+
* | name| age|country|
* +-------+----+-------+
* */
peopleDF2.show()
}
}

SchemaMerge.scala

package com.spark.dataframe
import org.apache.spark.sql.{SaveMode, SparkSession} class SchemaMerge {
} object SchemaMerge{
def main(args: Array[String]) {
System.setProperty("hadoop.home.dir", "E:\\spark\\hadoophome\\hadoop-common-2.2.0-bin")
val spark = SparkSession
.builder()
.appName("Rdd to DataFrame")
.master("local[4]")
.getOrCreate() // This import is needed to use the $-notation
import spark.implicits._ val squaresDF = spark.sparkContext.makeRDD(1 to 5)
.map(i=>(i,i*i))
.toDF("value","square") squaresDF.write.mode(SaveMode.Overwrite).parquet("E:\\spark\\datasave\\schemamerge\\test_table\\key=1") val cubesDF = spark.sparkContext.makeRDD(1 to 5)
.map(i => (i,i*i*i))
.toDF("value","cube")
cubesDF.write.mode(SaveMode.Overwrite).parquet("E:\\spark\\datasave\\schemamerge\\test_table\\key=2") val mergedDF = spark.read.option("mergeSchema","true")
.parquet("E:\\spark\\datasave\\schemamerge\\test_table\\") mergedDF.printSchema()
mergedDF.show()
}
}

结果:

Spark笔记-DataSet,DataFrame的更多相关文章

  1. Spark提高篇——RDD/DataSet/DataFrame(一)

    该部分分为两篇,分别介绍RDD与Dataset/DataFrame: 一.RDD 二.DataSet/DataFrame 先来看下官网对RDD.DataSet.DataFrame的解释: 1.RDD ...

  2. Spark提高篇——RDD/DataSet/DataFrame(二)

    该部分分为两篇,分别介绍RDD与Dataset/DataFrame: 一.RDD 二.DataSet/DataFrame 该篇主要介绍DataSet与DataFrame. 一.生成DataFrame ...

  3. spark算子之DataFrame和DataSet

    前言 传统的RDD相对于mapreduce和storm提供了丰富强大的算子.在spark慢慢步入DataFrame到DataSet的今天,在算子的类型基本不变的情况下,这两个数据集提供了更为强大的的功 ...

  4. spark结构化数据处理:Spark SQL、DataFrame和Dataset

    本文讲解Spark的结构化数据处理,主要包括:Spark SQL.DataFrame.Dataset以及Spark SQL服务等相关内容.本文主要讲解Spark 1.6.x的结构化数据处理相关东东,但 ...

  5. spark RDD、DataFrame、DataSet之间的相互转化

    这三个数据集看似经常用,但是真正归纳总结的时候,很容易说不出来 三个之间的关系与区别参考我的另一篇blog  http://www.cnblogs.com/xjh713/p/7309507.html ...

  6. Spark Dataset DataFrame空值null,NaN判断和处理

    Spark Dataset DataFrame空值null,NaN判断和处理 import org.apache.spark.sql.SparkSession import org.apache.sp ...

  7. Spark Dataset DataFrame 操作

    Spark Dataset DataFrame 操作 相关博文参考 sparksql中dataframe的用法 一.Spark2 Dataset DataFrame空值null,NaN判断和处理 1. ...

  8. Spark SQL、DataFrame和Dataset——转载

    转载自:  Spark SQL.DataFrame和Datase

  9. RDD/Dataset/DataFrame互转

    1.RDD -> Dataset val ds = rdd.toDS() 2.RDD -> DataFrame val df = spark.read.json(rdd) 3.Datase ...

随机推荐

  1. Spider-one

    1. 爬虫是如何采集网页数据的: 网页的三大特征: -1. 每个网页都有自己的 URL(统一资源定位符)地址来进行网络定位. -2. 每个网页都使用 HTML(超文本标记语言)来描述页面信息. -3. ...

  2. HDU 6138 Fleet of the Eternal Throne(后缀自动机)

    题意 题目链接 Sol 真是狗血,被疯狂卡常的原因竟是 我们考虑暴力枚举每个串的前缀,看他能在\(x, y\)的后缀自动机中走多少步,对两者取个min即可 复杂度\(O(T 10^5 M)\)(好假啊 ...

  3. LVS主从部署配置和使用

    LVS是Linux Virtual Server的简写,意即Linux虚拟服务器,是一个虚拟的服务器集群系统.本项目在1998年5月由章文嵩博士成立,是中国国内最早出现的自由软件项目之一. LVS是L ...

  4. 微软 WPC 2014 合作伙伴keynote

    本周一,2014 微软WPC (Worldwide Partner Conference) 合作者伙伴大会在美国华盛顿开幕,微软除了介绍了Azure.云端化的Office 365和Windows Ph ...

  5. loadrunner 脚本优化-关联设置

    脚本优化-关联设置 by:授客 QQ:1033553122 关联的原理 关联也属于一钟特殊的参数化.一般参数化的参数来源于一个文件.一个定义的table.通过sql写的一个结果集等,但关联所获得的参数 ...

  6. 小程序 青少儿书画 利用engineercms作为服务端

    因为很多妈咪们喜欢发布自己宝宝的作品,享受哪些美好时刻,记录亲子创作过程. 为了方便妈咪们展示亲子创作,比如宝宝们画作,涂鸦,书法,作文,其他才艺,特利用engineercms作为服务端,重新设计了一 ...

  7. echart参数设置——曲线图

    { title: { text: '请求返回码分布', subtext: '实时数据' }, tooltip: { trigger: 'axis', position: function (point ...

  8. [20180423]flashback tablespace与snapshot standby.txt

    [20180423]flashback tablespace与snapshot standby.txt --//缺省建立表空间是打开flashback on,如果某个表空间flashback off, ...

  9. python第四十三天--第三模块考核

    面向对象: 概念:类,实例化,对象,实例 属性: 公有属性:在类中定义 成员属性:在方法中定义 私有属性:在方法中使用 __属性  定义 限制外部访问 方法: 普通方法 类方法: @classmeth ...

  10. ORM查询之基于对象的正向查询与反向查询

    一.为什么有正向查询和反向查询? 举例有两张表,一张表叫书籍表,一张表叫出版社表,他们关系是一对多的关系,书籍是多,出版社是一,因为一本书应该只有一个出版社对应,而出版社可以有多本书对应. 那么在实际 ...