Spark 2.0
Apache Spark 2.0: Faster, Easier, and Smarter
http://blog.madhukaraphatak.com/categories/spark-two/
https://amplab.cs.berkeley.edu/technical-preview-of-apache-spark-2-0-easier-faster-and-smarter/
Dataset - New Abstraction of Spark
For long, RDD was the standard abstraction of Spark.
But from Spark 2.0, Dataset will become the new abstraction layer for spark. Though RDD API will be available, it will become low level API, used mostly for runtime and library development. All user land code will be written against the Dataset abstraction and it’s subset Dataframe API.
2.0中,最关键的是在RDD这个low level抽象层上,又加了一组DataSet的high level的抽象层,让用户可以跟方便的开发
From Definition, ” A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each dataset also has an untyped view called a DataFrame, which is a Dataset of Row. “
which sounds similar to RDD definition
” RDD represents an immutable,partitioned collection of elements that can be operated on in parallel “
Dataset is a superset of Dataframe API which is released in Spark 1.3.
Dataframe是一种特殊化的Dataset,Dataframe = Dataset[row]
SparkSession - New entry point of Spark
In earlier versions of spark, spark context was entry point for Spark. As RDD was main API, it was created and manipulated using context API’s. For every other API,we needed to use different contexts.For streaming, we needed StreamingContext, for SQL sqlContext and for hive HiveContext. But as DataSet and Dataframe API’s are becoming new standard API’s we need an entry point build for them. So in Spark 2.0, we have a new entry point for DataSet and Dataframe API’s called as Spark Session.
因为有了新的抽象层,所以需要加新的入口,SparkSession
就像RDD对应于SparkContext
更强大的SQL支持
On the SQL side, we have significantly expanded the SQL capabilities of Spark, with the introduction of a new ANSI SQL parser and support for subqueries.
Spark 2.0 can run all the 99 TPC-DS queries, which require many of the SQL:2003 features.
Tungsten 2.0
Spark 2.0 ships with the second generation Tungsten engine.
This engine builds upon ideas from modern compilers and MPP databases and applies them to data processing.
The main idea is to emit optimized bytecode at runtime that collapses the entire query into a single function, eliminating virtual function calls and leveraging CPU registers for intermediate data. We call this technique “whole-stage code generation.”
在内存管理和基于CPU的性能优化上,faster
Structured Streaming
Spark 2.0’s Structured Streaming APIs is a novel way to approach streaming.
It stems from the realization that the simplest way to compute answers on streams of data is to not having to reason about the fact that it is a stream.
意思就是说,不要意识到流,流就是个无限的DataFrames
这个是典型的spark的思路,batch是一切的根本,和Flink截然相反
Spark 2.0的更多相关文章
- Spark 1.0.0 横空出世 Spark on Yarn 部署(Hadoop 2.4)
就在昨天,北京时间5月30日20点多.Spark 1.0.0最终公布了:Spark 1.0.0 released 依据官网描写叙述,Spark 1.0.0支持SQL编写:Spark SQL Progr ...
- APACHE SPARK 2.0 API IMPROVEMENTS: RDD, DATAFRAME, DATASET AND SQL
What’s New, What’s Changed and How to get Started. Are you ready for Apache Spark 2.0? If you are ju ...
- Apache Spark 3.0 将内置支持 GPU 调度
如今大数据和机器学习已经有了很大的结合,在机器学习里面,因为计算迭代的时间可能会很长,开发人员一般会选择使用 GPU.FPGA 或 TPU 来加速计算.在 Apache Hadoop 3.1 版本里面 ...
- spark 2.0.0集群安装与hive on spark配置
1. 环境准备: JDK1.8 hive 2.3.4 hadoop 2.7.3 hbase 1.3.3 scala 2.11.12 mysql5.7 2. 下载spark2.0.0 cd /home/ ...
- Spark 2.0 PCA主成份分析
PCA在Spark2.0中用法比较简单,只需要设置: .setInputCol(“features”)//保证输入是特征值向量 .setOutputCol(“pcaFeatures”)//输出 .se ...
- Apache Spark 2.0三种API的传说:RDD、DataFrame和Dataset
Apache Spark吸引广大社区开发者的一个重要原因是:Apache Spark提供极其简单.易用的APIs,支持跨多种语言(比如:Scala.Java.Python和R)来操作大数据. 本文主要 ...
- Spark 2.0 DataFrame map操作中Unable to find encoder for type stored in a Dataset.问题的分析与解决
转载:http://blog.csdn.net/sparkexpert/article/details/52871000 随着新版本的spark已经逐渐稳定,最近拟将原有框架升级到spark 2.0. ...
- Spark 2.0.0 SPARK-SQL returns NPE Error
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerExceptionSerialization trace:underlyin ...
- Apache Spark 3.0 预览版正式发布,多项重大功能发布
2019年11月08日 数砖的 Xingbo Jiang 大佬给社区发了一封邮件,宣布 Apache Spark 3.0 预览版正式发布,这个版本主要是为了对即将发布的 Apache Spark 3. ...
随机推荐
- hdu 2795 线段树(纵向)
注意h的范围和n的范围,纵向建立线段树 题意:h*w的木板,放进一些1*L的物品,求每次放空间能容纳且最上边的位子思路:每次找到最大值的位子,然后减去L线段树功能:query:区间求最大值的位子(直接 ...
- 电赛总结(三)——DA芯片总结
一.AD7890 1.特性参数 (1)高速12位DA,转换速度5.9us (2)具有8个通道. (3)串行通信 2.芯片管脚图 3.管脚功能 管脚名称 功能 AGND 模拟地 SMODE 控制端,&q ...
- P and V
上次,我们已经说过死锁的形成原因以及防止方法了,都知道,之所以会发生死锁现象,原因之一是进程执行所申请的资源得不到满足,而陷入无限期的循环等待现象,而在这里我们说的进程其实是并发进程,也就是一组,至少 ...
- Java配置环境变量、方法和原因
首先,你应该已经安装了 java 的 JDK 了,笔者安装的是:jdk-7u7-windows-x64 接下来主要讲怎么配置 java 的环境变量,也是为了以后哪天自己忘记了做个备份 1.进入“计算机 ...
- spark streaming中使用checkpoint
从官方的Programming Guides中看到的 我理解streaming中的checkpoint有两种,一种指的是metadata的checkpoint,用于恢复你的streaming:一种是r ...
- LinQ的一些基本语句
LINQ(Language Integrated Query)即语言集成查询.它是一种语言特性和API,使得你可以使用统一的方式编写各种查询,查询的对象包括xml.对象集合.SqlServer数据库等 ...
- C#图片处理示例(裁剪,缩放,清晰度,水印)
C#图片处理示例(裁剪,缩放,清晰度,水印) 吴剑 2011-02-20 原创文章,转载必需注明出处:http://www.cnblogs.com/wu-jian/ 前言 需求源自项目中的一些应用,比 ...
- 在VMware Workstation上安装Kali Linux
在VMware Workstation上安装Kali Linux VMware Workstation是一款功能强大的桌面虚拟计算机软件.该软件允许用户在单一的桌面上同时运行不同的操作系统,并且可以进 ...
- Java集合的线程安全用法
线程安全的集合包含2个问题 1.多线程并发修改一 个 集合 怎么办? 2.如果迭代的过程中 集合 被修改了怎么办? a.一个线程在迭代,另一个线程在修改 b.在同一个线程内用同一个迭代器对象进行迭代. ...
- TYVJ P1098 任务安排 Label:倒推dp 不懂
描述 N个任务排成一个序列在一台机器上等待完成(顺序不得改变),这N个任务被分成若干批,每批包含相邻的若干任务.从时刻0开始,这些任务被分批加工,第i个任务单独完成所需的时间是Ti.在每批任务开始前, ...