A First Look at Spark SQL

Spark SQL is a Spark module used mainly for processing structured data. Its core programming abstraction is the DataFrame.

DataFrame: it can be built from many sources, including structured data files, Hive tables, external relational databases, and existing RDDs.
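As a minimal sketch of two of these sources, the snippet below builds one DataFrame from JSON records and one from an RDD. It assumes Spark 2.2 or later running with a local master; the `SparkSession` entry point, the application name, and the in-memory records (a stand-in for a JSON file) are assumptions made for illustration and are not part of the original setup.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.2+ on the classpath, run with a local master.
val spark = SparkSession.builder()
  .appName("dataframe-sources-sketch") // hypothetical application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Source 1: JSON records (an in-memory stand-in for a file such as students.json).
val jsonLines = Seq(
  """{"id":1, "name":"leo", "age":18}""",
  """{"id":2, "name":"jack", "age":19}"""
).toDS()
val fromJson = spark.read.json(jsonLines)

// Source 2: an existing RDD of tuples, with column names supplied by hand.
val rdd = spark.sparkContext.parallelize(Seq((1, "leo", 18), (2, "jack", 19)))
val fromRdd = rdd.toDF("id", "name", "age")

// Both results are DataFrames with a schema; JSON inference orders columns alphabetically.
fromJson.printSchema()
fromRdd.printSchema()
```

Hive tables and JDBC sources go through the same reader interface but need a metastore or a database connection, so they are left out of this sketch.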

Creating a DataFrame

Data file students.json

{"id":1, "name":"leo", "age":18}
{"id":2, "name":"jack", "age":19}
{"id":3, "name":"marry", "age":17}

Creating a DataFrame in spark-shell

// Upload the file to an HDFS directory
hadoop@master:~/wujiadong$ hadoop fs -put students.json /student/2016113012/spark
// Start the Spark shell
hadoop@slave01:~$ spark-shell
// Import SQLContext
scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
// Create a SQLContext object to work with the data
scala> val sql = new SQLContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sql: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@27acd9a7
// Read the data
scala> val students = sql.read.json("hdfs://master:9000/student/2016113012/spark/students.json")
students: org.apache.spark.sql.DataFrame = [age: bigint, id: bigint ... 1 more field]
// Display the data
scala> students.show
+---+---+-----+
|age| id| name|
+---+---+-----+
| 18|  1|  leo|
| 19|  2| jack|
| 17|  3|marry|
+---+---+-----+

Common DataFrame operations

scala> students.show
+---+---+-----+
|age| id| name|
+---+---+-----+
| 18|  1|  leo|
| 19|  2| jack|
| 17|  3|marry|
+---+---+-----+

scala> students.printSchema
root
 |-- age: long (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

scala> students.select("name").show
+-----+
| name|
+-----+
|  leo|
| jack|
|marry|
+-----+

scala> students.select(students("name"),students("age")+1).show
+-----+---------+
| name|(age + 1)|
+-----+---------+
|  leo|       19|
| jack|       20|
|marry|       18|
+-----+---------+

scala> students.filter(students("age")>18).show
+---+---+----+
|age| id|name|
+---+---+----+
| 19|  2|jack|
+---+---+----+

scala> students.groupBy("age").count().show
+---+-----+
|age|count|
+---+-----+
| 19|    1|
| 17|    1|
| 18|    1|
+---+-----+
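The filter and groupBy operations above can also be written as SQL against a temporary view. The following is a rough sketch, assuming Spark 2.0 or later and a local master; the view name `students`, the alias `cnt`, and the in-memory copy of the three records are choices made here for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.0+ on the classpath, run with a local master.
val spark = SparkSession.builder()
  .appName("dataframe-sql-sketch") // hypothetical application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// The same three records as students.json, built in memory to keep the sketch self-contained.
val students = Seq((1L, "leo", 18L), (2L, "jack", 19L), (3L, "marry", 17L))
  .toDF("id", "name", "age")

// createOrReplaceTempView is the Spark 2.0+ name; Spark 1.x uses registerTempTable.
students.createOrReplaceTempView("students")

// SQL counterpart of students.filter(students("age") > 18)
val older = spark.sql("SELECT age, id, name FROM students WHERE age > 18")

// SQL counterpart of students.groupBy("age").count()
val byAge = spark.sql("SELECT age, COUNT(*) AS cnt FROM students GROUP BY age")

older.show()
byAge.show()
```

Both forms return DataFrames, so either style can be used for the same query and the two can be mixed freely.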

Two ways to convert an RDD into a DataFrame

1) Reflection-based approach

package wujiadong_sparkSQL

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Administrator on 2017/3/5.
  */
object RDDDataFrameReflection {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdddatafromareflection")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val fileRDD = sc.textFile("hdfs://master:9000/student/2016113012/data/students.txt")
    val lineRDD = fileRDD.map(line => line.split(","))
    // Map each line of the RDD onto the case class
    val studentsRDD = lineRDD.map(x => Students(x(0).toInt, x(1), x(2).toInt))
    // In Scala, the reflection-based RDD-to-DataFrame conversion requires importing this implicit conversion by hand
    import sqlContext.implicits._
    val studentsDF = studentsRDD.toDF()
    // Register a temporary table (deprecated since Spark 2.0 in favour of createOrReplaceTempView)
    studentsDF.registerTempTable("t_students")
    val df = sqlContext.sql("select * from t_students")
    df.rdd.foreach(row => println(row(0) + "," + row(1) + "," + row(2)))
    df.rdd.saveAsTextFile("hdfs://master:9000/student/2016113012/data/out")
  }
}

// The case class must be defined outside main
case class Students(id: Int, name: String, age: Int)

Run output

hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.RDDDataFrameReflection  --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
17/03/05 22:46:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/05 22:46:48 INFO Slf4jLogger: Slf4jLogger started
17/03/05 22:46:48 INFO Remoting: Starting remoting
17/03/05 22:46:49 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:34921]
17/03/05 22:46:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/03/05 22:46:51 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/03/05 22:47:00 INFO FileInputFormat: Total input paths to process : 1
17/03/05 22:47:07 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/03/05 22:47:07 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/03/05 22:47:07 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/03/05 22:47:07 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/03/05 22:47:07 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
1,leo,17
2,marry,17
3,jack,18
4,tom,19
17/03/05 22:47:10 INFO FileOutputCommitter: Saved output of task 'attempt_201703052247_0001_m_000000_1' to hdfs://master:9000/student/2016113012/data/out/_temporary/0/task_201703052247_0001_m_000000

2) Programmatic interface approach

package wujiadong_sparkSQL

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by Administrator on 2017/3/5.
  */
object RDDDataFrameBianchen {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDDataFrameBianchen")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Create an RDD from the given path
    val studentsRDD = sc.textFile("hdfs://master:9000/student/2016113012/data/students.txt").map(_.split(","))
    // Map the RDD to an RDD of Row objects
    val rowRDD = studentsRDD.map(x => Row(x(0).toInt, x(1), x(2).toInt))
    // Build the schema metadata programmatically
    val schema = StructType(
      List(
        StructField("id", IntegerType, true),
        StructField("name", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // Apply the schema to the Row RDD
    val studentsDF = sqlContext.createDataFrame(rowRDD, schema)
    // Register a temporary table (deprecated since Spark 2.0 in favour of createOrReplaceTempView)
    studentsDF.registerTempTable("t_students")
    val df = sqlContext.sql("select * from t_students order by age")
    df.rdd.collect().foreach(row => println(row))
  }
}

Run output

hadoop@master:~/wujiadong$ spark-submit --class wujiadong_sparkSQL.RDDDataFrameBianchen --executor-memory 500m --total-executor-cores 2 /home/hadoop/wujiadong/wujiadong.spark.jar
17/03/06 11:07:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/06 11:07:27 INFO Slf4jLogger: Slf4jLogger started
17/03/06 11:07:27 INFO Remoting: Starting remoting
17/03/06 11:07:28 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.131:49756]
17/03/06 11:07:32 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/03/06 11:07:38 INFO FileInputFormat: Total input paths to process : 1
17/03/06 11:07:44 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/03/06 11:07:44 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/03/06 11:07:44 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/03/06 11:07:44 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/03/06 11:07:44 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
[1,leo,17]
[2,marry,17]
[3,jack,18]
[4,tom,19]
17/03/06 11:07:47 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/03/06 11:07:47 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
17/03/06 11:07:47 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
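The programmatic approach is mostly useful when the columns are not known until runtime, so no case class can be written in advance. The sketch below illustrates that case under stated assumptions: Spark 2.0 or later with a local master, a hypothetical `columnSpec` string describing the columns, and in-memory lines standing in for students.txt (their content is inferred from the run output above).

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: Spark 2.0+ on the classpath, run with a local master.
val spark = SparkSession.builder()
  .appName("dynamic-schema-sketch") // hypothetical application name
  .master("local[*]")
  .getOrCreate()

// Hypothetical runtime description of the columns, e.g. read from a config file.
val columnSpec = "id:int,name:string,age:int"

// Build one StructField per "name:type" pair; only int and string are handled here.
val schema = StructType(columnSpec.split(",").map { spec =>
  val Array(name, typeName) = spec.split(":")
  val dataType = if (typeName == "int") IntegerType else StringType
  StructField(name, dataType, nullable = true)
})

// In-memory stand-in for the lines of students.txt.
val lines = spark.sparkContext.parallelize(
  Seq("1,leo,17", "2,marry,17", "3,jack,18", "4,tom,19"))

// Convert each text field according to the declared column type.
val rows = lines.map(_.split(",")).map { fields =>
  Row.fromSeq(fields.zip(schema.fields).map {
    case (value, field) => if (field.dataType == IntegerType) value.toInt else value
  }.toSeq)
}

val studentsDF = spark.createDataFrame(rows, schema)
studentsDF.printSchema()
studentsDF.orderBy("age").show()
```

With the reflection-based approach the schema is fixed at compile time by the case class; here it is derived from data, at the cost of converting each field by hand.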

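DataFrame vs. RDD

1) In Spark, a DataFrame is a distributed dataset built on top of RDDs, similar to a two-dimensional table in a traditional database.

2) The main difference between a DataFrame and an RDD is that the former carries schema metadata: every column of the two-dimensional table that a DataFrame represents has a name and a type.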
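The difference can be seen by holding the same records both ways. This is a small sketch, assuming Spark 2.0 or later with a local master; the sample tuples and variable names are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.0+ on the classpath, run with a local master.
val spark = SparkSession.builder()
  .appName("rdd-vs-dataframe-sketch") // hypothetical application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// An RDD only knows its element type; fields are reached by position.
val rdd = spark.sparkContext.parallelize(Seq((1, "leo", 18), (2, "jack", 19)))
val agesFromRdd = rdd.map(_._3).collect()

// The DataFrame carries a name and a type for every column.
val df = rdd.toDF("id", "name", "age")
df.printSchema()
val agesFromDf = df.select("age").as[Int].collect()

// The underlying RDD of Row objects is still reachable from the DataFrame.
val backToRdd = df.rdd
```

Because the column names and types are known to Spark, queries such as `df.select("age")` can be expressed by name rather than by tuple position.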

References

http://9269309.blog.51cto.com/9259309/1851673

http://blog.csdn.net/ronaldo4511/article/details/53406069

http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
