Spark 学习笔记：（三）Spark SQL

参考：https://spark.apache.org/docs/latest/sql-programming-guide.html#overview

　　http://www.csdn.net/article/2015-04-03/2824407

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.

1)在Spark中，DataFrame是一种以RDD为基础的分布式数据集，类似于传统数据库中的二维表格。DataFrame与RDD的主要区别在于，前者带有schema元信息，即DataFrame所表示的二维表数据集的每一列都带有名称和类型。这使得Spark SQL得以洞察更多的结构信息，从而对藏于DataFrame背后的数据源以及作用于DataFrame之上的变换进行了针对性的优化，最终达到大幅提升运行时效率的目标。反观RDD，由于无从得知所存数据元素的具体内部结构，Spark Core只能在stage层面进行简单、通用的流水线优化。

2)A DataFrame can be operated on as normal RDDs and can also be registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data.

3)The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.

val df = sqlContext.sql("SELECT * FROM table")   //sql接口

创建DataFrames：

With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

Two different methods for converting existing RDDs into DataFrames.
- The first method uses reflection to infer the schema of an RDD that contains specific types of objects.

The case class defines the schema of the table=>The names of the arguments to the case class are read using reflection and become the names of the columns=>This RDD can be implicitly converted to a DataFrame and then be registered as a table=>Tables can be used in subsequent SQL statements.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
 
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)
 
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")
 
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 
// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

- Aprogrammatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct DataFrames when the columns and their types are not known until runtime.

From data sources：

val df = sqlContext.load("people.json", "json")
df.select("name", "age").save("namesAndAges.parquet", "parquet")

val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

Spark 学习笔记：（三）Spark SQL的更多相关文章

spark学习笔记总结-spark入门资料精化
Spark学习笔记 Spark简介 spark 可以很容易和yarn结合,直接调用HDFS.Hbase上面的数据,和hadoop结合.配置很容易. spark发展迅猛,框架比hadoop更加灵活实用. ...
Spark学习笔记-三种属性配置详细说明【转】
相关资料:Spark属性配置 http://www.cnblogs.com/chengxin1982/p/4023111.html 本文出处:转载自过往记忆(http://www.iteblog.c ...
Spark学习笔记-使用Spark History Server
在运行Spark应用程序的时候,driver会提供一个webUI给出应用程序的运行信息,但是该webUI随着应用程序的完成而关闭端口,也就是说,Spark应用程序运行完后,将无法查看应用程序的历史记 ...
Spark 学习笔记之 Spark history Server 搭建
在hdfs上建立文件夹/directory hadoop fs -mkdir /directory 进入conf目录 spark-env.sh 增加以下配置 export SPARK_HISTORY ...
Spark学习笔记之-Spark远程调试
Spark远程调试本例子介绍简单介绍spark一种远程调试方法,使用的IDE是IntelliJ IDEA. 1.了解jvm一些参数属性 -X ...
Spark学习笔记之SparkRDD
Spark学习笔记之SparkRDD 一. 基本概念 RDD(resilient distributed datasets)弹性分布式数据集. 来自于两方面 ① 内存集合和外部存储系统 ② ...
Spark学习笔记0——简单了解和技术架构
目录 Spark学习笔记0--简单了解和技术架构什么是Spark 技术架构和软件栈 Spark Core Spark SQL Spark Streaming MLlib GraphX 集群管理器受 ...
Oracle学习笔记三 SQL命令
SQL简介 SQL 支持下列类别的命令: 1.数据定义语言(DDL) 2.数据操纵语言(DML) 3.事务控制语言(TCL) 4.数据控制语言(DCL)
Spark学习笔记2（spark所需环境配置
Spark学习笔记2 配置spark所需环境 1.首先先把本地的maven的压缩包解压到本地文件夹中,安装好本地的maven客户端程序,版本没有什么要求不需要最新版的maven客户端. 解压完成之后 ...
Spark学习笔记3（IDEA编写scala代码并打包上传集群运行）
Spark学习笔记3 IDEA编写scala代码并打包上传集群运行我们在IDEA上的maven项目已经搭建完成了,现在可以写一个简单的spark代码并且打成jar包上传至集群,来检验一下我们的sp ...

随机推荐

Leetcode 407.接雨水
接雨水给定一个 m x n 的矩阵,其中的值均为正整数,代表二维高度图每个单元的高度,请计算图中形状最多能接多少体积的雨水. 说明: m 和 n 都是小于110的整数.每一个单位的高度都大于0 且小 ...
[automator学习篇]android 接口-- bluetooth
http://www.android-doc.com/guide/topics/connectivity/bluetooth.html 本地下载的sdk 目录: D:\AndroidSdk\s ...
Windows Server AppFabric
文章:Windows Server AppFabric简介介绍了AppFabric强大的功能.
公钥密码之RSA密码算法大素数判定：Miller-Rabin判定法！
公钥密码之RSA密码算法大素数判定:Miller-Rabin判定法! 先存档再说,以后实验报告还得打印上交. Miller-Rabin大素数判定对于学算法的人来讲不是什么难事,主要了解其原理. 先来灌 ...
iOS学习笔记16-数据库SQLite
一.数据库在项目开发中,通常都需要对数据进行离线缓存的处理,如新闻数据的离线缓存等.离线缓存一般都是把数据保存到项目的沙盒中.有以下几种方式: 归档:NSKeyedArchiver 偏好设置:NSU ...
wsgi 简介
原文博客地址 http://blog.csdn.net/on_1y/article/details/18803563
VMware虚拟机 NAT模式配置静态ip
前言:Ubuntu 16.04 VMware虚拟机 NAT模式配置静态ip,这个问题困扰我好长时间,桥接的静态ip我会了,然而用NAT 的方式配置集群会更好.(NAT 方式客户机之间的通讯不经过路由 ...
洛谷P2365 任务安排 [解法二斜率优化]
解法一:http://www.cnblogs.com/SilverNebula/p/5926253.html 解法二:斜率优化在解法一中有这样的方程:dp[i]=min(dp[i],dp[j]+(s ...
Yii 之cookie的使用
public function actionIndex(){ //设置cookie(注意这里用的是响应组件) $cookies = \YII::$app->response->cookie ...
Maven单元测试
// SKU码:系列前3位+6位年月日+3位序号(自动生产,取数据库中当天最大的,没有就赋值位001) // 订单编号:BRD+6位年月日+5位序号 // // 退单号:BRT+6位年月日+3位序号 ...

Spark 学习笔记：（三）Spark SQL

Spark 学习笔记：（三）Spark SQL的更多相关文章

随机推荐

热门专题