1.准备数据employee.txt

,Gong Shaocheng,
,Li Dachao,
,Qiu Xin,
,Cheng Jiangzhong,
,Wo Binggang,

将数据放入hdfs

[root@jfp3- spark-studio]# hdfs dfs -put employee.txt /user/spark_studio

2.启动spark shell

[root@jfp3- spark-1.0.-bin-hadoop2]# ./bin/spark-shell --master spark://192.168.0.71:7077

3.编写脚本

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._ case class Employee(employeeId: Int, name: String, departmentId: Int) // Create an RDD of Employee objects and register it as a table.
val employees = sc.textFile("hdfs://jfp3-1:8020/user/spark_studio/employee.txt").map(_.split(",")).map(p => Employee(p(), p(), p().trim.toInt))
employees.registerAsTable("employee") // SQL statements can be run by using the sql methods provided by sqlContext.
val fsis = sql("SELECT name FROM employee WHERE departmentId = 1") // The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
fsis.map(t => "Name: " + t()).collect().foreach(println)

4.运行

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@ scala> import sqlContext._
import sqlContext._ scala> case class Employee(employeeId: String, name: String, departmentId: Int)
defined class Employee scala> val employees = sc.textFile("hdfs://jfp3-1:8020/user/spark_studio/employee.txt").map(_.split(",")).map(p => Employee(p(), p(), p().trim.toInt))
// :: INFO MemoryStore: ensureFreeSpace() called with curMem=, maxMem=
// :: INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 135.5 KB, free 294.8 MB)
employees: org.apache.spark.rdd.RDD[Employee] = MappedRDD[] at map at <console>: scala> employees.registerAsTable("employee") scala> val fsis = sql("SELECT name FROM employee WHERE departmentId = 1")
// :: INFO Analyzer: Max iterations () reached for batch MultiInstanceRelations
// :: INFO Analyzer: Max iterations () reached for batch CaseInsensitiveAttributeReferences
// :: INFO SQLContext$$anon$: Max iterations () reached for batch Add exchange
// :: INFO SQLContext$$anon$: Max iterations () reached for batch Prepare Expressions
fsis: org.apache.spark.sql.SchemaRDD =
SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
Project [name#:]
Filter (departmentId#: = )
ExistingRdd [employeeId#,name#,departmentId#], MapPartitionsRDD[] at mapPartitions at basicOperators.scala: scala> fsis.map(t => "Name: " + t()).collect().foreach(println)
// :: INFO FileInputFormat: Total input paths to process :
// :: INFO SparkContext: Starting job: collect at <console>:
// :: INFO DAGScheduler: Got job (collect at <console>:) with output partitions (allowLocal=false)
// :: INFO DAGScheduler: Final stage: Stage (collect at <console>:)
// :: INFO DAGScheduler: Parents of final stage: List()
// :: INFO DAGScheduler: Missing parents: List()
// :: INFO DAGScheduler: Submitting Stage (MappedRDD[] at map at <console>:), which has no missing parents
// :: INFO DAGScheduler: Submitting missing tasks from Stage (MappedRDD[] at map at <console>:)
// :: INFO TaskSchedulerImpl: Adding task set 0.0 with tasks
// :: INFO TaskSetManager: Starting task 0.0: as TID on executor : jfp3- (NODE_LOCAL)
// :: INFO TaskSetManager: Serialized task 0.0: as bytes in ms
// :: INFO TaskSetManager: Starting task 0.0: as TID on executor : jfp3- (NODE_LOCAL)
// :: INFO TaskSetManager: Serialized task 0.0: as bytes in ms
// :: INFO TaskSetManager: Finished TID in ms on jfp3- (progress: /)
// :: INFO TaskSetManager: Finished TID in ms on jfp3- (progress: /)
// :: INFO DAGScheduler: Completed ResultTask(, )
// :: INFO DAGScheduler: Completed ResultTask(, )
// :: INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: Stage (collect at <console>:) finished in 1.284 s
// :: INFO SparkContext: Job finished: collect at <console>:, took 1.386154401 s
Name: Gong Shaocheng
Name: Li Dachao
Name: Qiu Xin

5.将数据存为parquet格式,并运行sql

scala> val parquetFile = sqlContext.parquetFile("hdfs://jfp3-1:8020/user/spark_studio/employee.parquet")
// :: INFO Analyzer: Max iterations () reached for batch MultiInstanceRelations
// :: INFO Analyzer: Max iterations () reached for batch CaseInsensitiveAttributeReferences
// :: INFO SQLContext$$anon$: Max iterations () reached for batch Add exchange
// :: INFO SQLContext$$anon$: Max iterations () reached for batch Prepare Expressions
parquetFile: org.apache.spark.sql.SchemaRDD =
SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
ParquetTableScan [employeeId#,name#,departmentId#], (ParquetRelation hdfs://jfp3-1:8020/user/spark_studio/employee.parquet), None scala> parquetFile.registerAsTable("parquetFile") scala> val telcos = sql("SELECT name FROM parquetFile WHERE departmentId = 3")
// :: INFO Analyzer: Max iterations () reached for batch MultiInstanceRelations
// :: INFO Analyzer: Max iterations () reached for batch CaseInsensitiveAttributeReferences
// :: INFO SQLContext$$anon$: Max iterations () reached for batch Add exchange
// :: INFO SQLContext$$anon$: Max iterations () reached for batch Prepare Expressions
// :: INFO MemoryStore: ensureFreeSpace() called with curMem=, maxMem=
// :: INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 176.3 KB, free 294.6 MB)
telcos: org.apache.spark.sql.SchemaRDD =
SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
Project [name#:]
Filter (departmentId#: = )
ParquetTableScan [name#,departmentId#], (ParquetRelation hdfs://jfp3-1:8020/user/spark_studio/employee.parquet), None scala> telcos.collect().foreach(println)
// :: INFO FileInputFormat: Total input paths to process :
// :: INFO ParquetInputFormat: Total input paths to process :
// :: INFO ParquetFileReader: reading summary file: hdfs://jfp3-1:8020/user/spark_studio/employee.parquet/_metadata
// :: INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
// :: INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
// :: INFO SparkContext: Starting job: collect at <console>:
// :: INFO DAGScheduler: Got job (collect at <console>:) with output partitions (allowLocal=false)
// :: INFO DAGScheduler: Final stage: Stage (collect at <console>:)
// :: INFO DAGScheduler: Parents of final stage: List()
// :: INFO DAGScheduler: Missing parents: List()
// :: INFO DAGScheduler: Submitting Stage (SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
Project [name#:]
Filter (departmentId#: = )
ParquetTableScan [name#,departmentId#], (ParquetRelation hdfs://jfp3-1:8020/user/spark_studio/employee.parquet), None), which has no missing parents
// :: INFO DAGScheduler: Submitting missing tasks from Stage (SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
Project [name#:]
Filter (departmentId#: = )
ParquetTableScan [name#,departmentId#], (ParquetRelation hdfs://jfp3-1:8020/user/spark_studio/employee.parquet), None)
// :: INFO TaskSchedulerImpl: Adding task set 2.0 with tasks
// :: INFO TaskSetManager: Starting task 2.0: as TID on executor : jfp3- (NODE_LOCAL)
// :: INFO TaskSetManager: Serialized task 2.0: as bytes in ms
// :: INFO TaskSetManager: Starting task 2.0: as TID on executor : jfp3- (NODE_LOCAL)
// :: INFO TaskSetManager: Serialized task 2.0: as bytes in ms
// :: INFO DAGScheduler: Completed ResultTask(, )
// :: INFO TaskSetManager: Finished TID in ms on jfp3- (progress: /)
// :: INFO DAGScheduler: Completed ResultTask(, )
// :: INFO TaskSetManager: Finished TID in ms on jfp3- (progress: /)
// :: INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: Stage (collect at <console>:) finished in 2.177 s
// :: INFO SparkContext: Job finished: collect at <console>:, took 2.210887848 s
[Wo Binggang]

6. DSL syntax支持

scala> all.collect().foreach(println)
// :: INFO SparkContext: Starting job: collect at <console>:
// :: INFO DAGScheduler: Got job (collect at <console>:) with output partitions (allowLocal=false)
// :: INFO DAGScheduler: Final stage: Stage (collect at <console>:)
// :: INFO DAGScheduler: Parents of final stage: List()
// :: INFO DAGScheduler: Missing parents: List()
// :: INFO DAGScheduler: Submitting Stage (SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
Project [name#:]
Filter (departmentId#: >= )
ExistingRdd [employeeId#,name#,departmentId#], MapPartitionsRDD[] at mapPartitions at basicOperators.scala:), which has no missing parents
// :: INFO DAGScheduler: Submitting missing tasks from Stage (SchemaRDD[] at RDD at SchemaRDD.scala:
== Query Plan ==
Project [name#:]
Filter (departmentId#: >= )
ExistingRdd [employeeId#,name#,departmentId#], MapPartitionsRDD[] at mapPartitions at basicOperators.scala:)
// :: INFO TaskSchedulerImpl: Adding task set 6.0 with tasks
// :: INFO TaskSetManager: Starting task 6.0: as TID on executor : jfp3- (NODE_LOCAL)
// :: INFO TaskSetManager: Serialized task 6.0: as bytes in ms
// :: INFO TaskSetManager: Starting task 6.0: as TID on executor : jfp3- (NODE_LOCAL)
// :: INFO TaskSetManager: Serialized task 6.0: as bytes in ms
// :: INFO TaskSetManager: Finished TID in ms on jfp3- (progress: /)
// :: INFO DAGScheduler: Completed ResultTask(, )
// :: INFO DAGScheduler: Completed ResultTask(, )
// :: INFO TaskSetManager: Finished TID in ms on jfp3- (progress: /)
// :: INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: Stage (collect at <console>:) finished in 0.039 s
// :: INFO SparkContext: Job finished: collect at <console>:, took 0.052556716 s
[Gong Shaocheng]
[Li Dachao]
[Qiu Xin]
[Cheng Jiangzhong]
[Wo Binggang]

SparkSQL之旅的更多相关文章

  1. sparkSQL实战详解

    摘要   如果要想真正的掌握sparkSQL编程,首先要对sparkSQL的整体框架以及sparkSQL到底能帮助我们解决什么问题有一个整体的认识,然后就是对各个层级关系有一个清晰的认识后,才能真正的 ...

  2. hadoop学习之旅1

    大数据介绍 大数据本质也是数据,但是又有了新的特征,包括数据来源广.数据格式多样化(结构化数据.非结构化数据.Excel文件.文本文件等).数据量大(最少也是TB级别的.甚至可能是PB级别).数据增长 ...

  3. 一条Sql的Spark之旅

    背景 ​ SQL作为一门标准的.通用的.简单的DSL,在大数据分析中有着越来越重要的地位;Spark在批处理引擎领域当前也是处于绝对的地位,而Spark2.0中的SparkSQL也支持ANSI-SQL ...

  4. Linq之旅:Linq入门详解(Linq to Objects)

    示例代码下载:Linq之旅:Linq入门详解(Linq to Objects) 本博文详细介绍 .NET 3.5 中引入的重要功能:Language Integrated Query(LINQ,语言集 ...

  5. WCF学习之旅—第三个示例之四(三十)

           上接WCF学习之旅—第三个示例之一(二十七)               WCF学习之旅—第三个示例之二(二十八)              WCF学习之旅—第三个示例之三(二十九)   ...

  6. 【C#代码实战】群蚁算法理论与实践全攻略——旅行商等路径优化问题的新方法

    若干年前读研的时候,学院有一个教授,专门做群蚁算法的,很厉害,偶尔了解了一点点.感觉也是生物智能的一个体现,和遗传算法.神经网络有异曲同工之妙.只不过当时没有实际需求学习,所以没去研究.最近有一个这样 ...

  7. Hadoop学习之旅二:HDFS

    本文基于Hadoop1.X 概述 分布式文件系统主要用来解决如下几个问题: 读写大文件 加速运算 对于某些体积巨大的文件,比如其大小超过了计算机文件系统所能存放的最大限制或者是其大小甚至超过了计算机整 ...

  8. .NET跨平台之旅:在生产环境中上线第一个运行于Linux上的ASP.NET Core站点

    2016年7月10日,我们在生产环境中上线了第一个运行于Linux上的ASP.NET Core站点,这是一个简单的提供后端服务的ASP.NET Core Web API站点. 项目是在Windows上 ...

  9. 【Knockout.js 学习体验之旅】(3)模板绑定

    本文是[Knockout.js 学习体验之旅]系列文章的第3篇,所有demo均基于目前knockout.js的最新版本(3.4.0).小茄才识有限,文中若有不当之处,还望大家指出. 目录: [Knoc ...

随机推荐

  1. 记录一下我使用的vim的配置文件

    还不是很完美: "au BufReadPost * if line("'\"") > 0|if line("'\"") &l ...

  2. 十六、Java基础---------集合框架之Set

    写在前面的话,这篇文章在昨天就写好了,今天打开的时候一不小心将第二天的文章粘贴到了这篇文章,很不幸的是除了标题之外依然面目全非,今天带着沉痛的心情再来写这篇文章! 上篇文章介绍了Collection体 ...

  3. 数据库连接driverClass和jdbcUrl大全

    一.MySQL: driverClass:com.mysql.jdbc.Driver                         org.gjt.mm.mysql.Driver jdbcUrl:j ...

  4. 构建高性能的ASP.NET应用程序

    看见大标题的时候,也许各位看官会自然而然的联想到如何在设计阶段考虑系统性能问题,如何编写高性能的程序代码.关于这一点,大家可以在MSDN和相关网站上找到非常多的介绍,不过大多是防患于未难,提供的是在设 ...

  5. sublime设置备份

    Settings-user { "font_face": "Consolas", "font_size": 13, "line_p ...

  6. windows进程详解

    1:系统必要进程system process    进程文件: [system process] or [system process]进程名称: Windows内存处理系统进程描述: Windows ...

  7. python 学习笔记十 rabbitmq(进阶篇)

    RabbitMQ MQ全称为Message Queue, 消息队列(MQ)是一种应用程序对应用程序的通信方法.应用程序通过读写出入队列的消息(针对应用程序的数据)来通信,而无需专用连接来链接它们.消 ...

  8. Windows server 2012 AD DS 搭建步骤

    服务器版本:Windows server 2012 1.  配置网络,由于本机会搭建DNS服务器,因此首选DNS服务器设置为127.0.0.1 2.  打开服务器管理器 3.  点击添加角色和功能,下 ...

  9. Bootstrap_导航

    一.标签形tab导航 标签形导航,也称为选项卡导航. 标签形导航是通过“.nav-tabs”样式来实现.在制作标签形导航时需要在原导航“.nav”上追加此类名. <ul class=" ...

  10. Kanzi Studio中的概念

    Kanzi Studio是Kanzi的UI编辑器,功能非常强大.在使用Kanzi Stadio之前,首先要先熟悉编辑器中的概念. Kanzi Studio中主要分project窗格,property窗 ...