Summary:
1. An RDD is a collection of Java objects. RDDs are more object-oriented and the code is easier to understand, but when data has to be shipped across the cluster, the data and its structure information must be carried for every object, which leads to redundancy and to a lot of GC.
2. DataFrame was introduced in 1.3. It holds two pieces of information, the data and the schema, and the data is stored as raw values rather than as Java objects. It is harder to understand, its Java support is weak, and it is not strongly typed, so some errors are only discovered at runtime. Its advantage is that the data never has to be loaded into Java objects, which reduces GC and greatly improves the efficiency of shipping data across the cluster and serializing it locally.
3. Dataset appeared as a preview in 1.6 and only became stable in 2.0. It tries to combine the advantages of RDD and DataFrame. In 2.0 the positioning of Dataset is: (1) DataFrame is just a type alias; the real implementation is Dataset. (2) For Python and R, which are not type-safe languages, DataFrame remains the primary programming interface.

  • Unifying DataFrames and Datasets in Scala/Java: Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. Both the typed methods (e.g. map, filter, groupByKey) and the untyped methods (e.g. select, groupBy) are available on the Dataset class. Also, this new combined Dataset interface is the abstraction used for Structured Streaming. Since compile-time type-safety in Python and R is not a language feature, the concept of Dataset does not apply to these languages’ APIs. Instead, DataFrame remains the primary programming abstraction, which is analogous to the single-node data frame notion in these languages. Get a peek from a Dataset API notebook. (A minimal sketch of this unified API follows after this list.)

  • DataFrame-based Machine Learning API emerges as the primary ML API: With Spark 2.0, the spark.ml package, with its “pipeline” APIs, will emerge as the primary machine learning API. While the original spark.mllib package is preserved, future development will focus on the DataFrame-based API.
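
A minimal Scala sketch of the unified API described above (it assumes Spark 2.0+, a local SparkSession, and a hypothetical people.json file with name and age fields):

import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Illustrative only: the SparkSession settings, the Person class and the input file are assumptions.
case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
import spark.implicits._

val people: Dataset[Person] = spark.read.json("people.json").as[Person]
people.filter(_.age > 21).show()                // typed method: the lambda sees Person objects
val names: Dataset[Row] = people.select("name") // untyped method: returns a DataFrame, i.e. Dataset[Row]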

There Are Now 3 Apache Spark APIs. Here’s How to Choose the Right One

See Apache Spark 2.0 API Improvements: RDD, DataFrame, DataSet and SQL here.

Apache Spark is evolving at a rapid pace, including changes and additions to core APIs. One of the most disruptive areas of change is around the representation of data sets. Spark 1.0 used the RDD API, but in the past twelve months two new, alternative and incompatible APIs have been introduced. Spark 1.3 introduced the radically different DataFrame API, and the recently released Spark 1.6 introduces a preview of the new Dataset API.

Many existing Spark developers will be wondering whether to jump from RDDs directly to the Dataset API, or whether to first move to the DataFrame API. Newcomers to Spark will have to choose which API to start learning with.

This article provides an overview of each of these APIs, and outlines the strengths and weaknesses of each one. A companion github repository provides working examples that are a good starting point for experimentation with the approaches outlined in this article.

The RDD (Resilient Distributed Dataset) API has been in Spark since the 1.0 release. This interface and its Java equivalent, JavaRDD, will be familiar to any developers who have worked through the standard Spark tutorials. From a developer’s perspective, an RDD is simply a set of Java or Scala objects representing data.

The RDD API provides many transformation methods, such as map(), filter(), and reduce(), for performing computations on the data. Each of these methods results in a new RDD representing the transformed data. However, these methods are just defining the operations to be performed and the transformations are not performed until an action method is called. Examples of action methods are collect() and saveAsObjectFile().

Example of RDD transformations and actions

Scala:

rdd.filter(_.age < 21)              // transformation
   .map(_.last)                     // transformation
   .saveAsObjectFile("under21.bin") // action

Java:

rdd.filter(p -> p.getAge() < 21)    // transformation
   .map(p -> p.getLast())           // transformation
   .saveAsObjectFile("under21.bin"); // action

The main advantage of RDDs is that they are simple and well understood because they deal with concrete classes, providing a familiar object-oriented programming style with compile-time type-safety. For example, given an RDD containing instances of Person we can filter by age by referencing the age attribute of each Person object:

Example: Filter by attribute with RDD

Scala:

rdd.filter(_.age > 21)

Java:

rdd.filter(person -> person.getAge() > 21)

With the DataFrame API, the code refers to data attributes by name, so it is not possible for the compiler to catch errors. If an attribute name is incorrect, the error will only be detected at runtime, when the query plan is created.
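
A hedged sketch of this failure mode (the people.json file and its name/age columns are assumptions for illustration):

val df = sqlContext.read.json("people.json") // columns: name, age

df.filter("age > 21")  // fine: "age" is resolved against the schema when the plan is built
df.filter("aege > 21") // compiles, but fails at runtime with an AnalysisException (cannot resolve 'aege')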

Another downside of the DataFrame API is that it is very Scala-centric and, while it does support Java, the support is limited. For example, when creating a DataFrame from an existing RDD of Java objects, Spark’s Catalyst optimizer cannot infer the schema and assumes that any objects in the DataFrame implement the scala.Product interface. Scala case classes work out of the box because they implement this interface.
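
For Scala, a minimal sketch of why case classes work (assuming an existing SparkContext sc and SQLContext sqlContext; the field names are illustrative):

// A case class implements scala.Product, so Catalyst can infer the schema from its fields.
case class ScalaPerson(first: String, last: String, age: Int)

import sqlContext.implicits._
val peopleDf = sc.parallelize(Seq(ScalaPerson("Jane", "Doe", 19))).toDF()
peopleDf.printSchema() // first: string, last: string, age: int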

Dataset API

The Dataset API, released as an API preview in Spark 1.6, aims to provide the best of both worlds: the familiar object-oriented programming style and compile-time type-safety of the RDD API, but with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API.

When it comes to serializing data, the Dataset API has the concept of encoders which translate between JVM representations (objects) and Spark’s internal binary format. Spark has built-in encoders which are very advanced in that they generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to de-serialize an entire object. Spark does not yet provide an API for implementing custom encoders, but that is planned for a future release.
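
For reference, the built-in encoders can also be obtained explicitly; this is a sketch only, assuming the ScalaPerson case class and JavaPerson bean used in the examples below:

import org.apache.spark.sql.{Encoder, Encoders}

val scalaEncoder: Encoder[ScalaPerson] = Encoders.product[ScalaPerson]      // case classes and tuples
val javaEncoder: Encoder[JavaPerson] = Encoders.bean(classOf[JavaPerson])   // bean-compliant Java classes
val longEncoder: Encoder[java.lang.Long] = Encoders.LONG                    // boxed primitives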

Additionally, the Dataset API is designed to work equally well with both Java and Scala. When working with Java objects, it is important that they are fully bean-compliant. In writing the examples to accompany this article, we ran into errors when trying to create a Dataset in Java from a list of Java objects that were not fully bean-compliant.

Example: Creating Dataset from a list of objects

Scala

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val sampleData: Seq[ScalaPerson] = ScalaData.sampleData()
val dataset = sqlContext.createDataset(sampleData)

Java

JavaSparkContext sc = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(sc);
List<JavaPerson> data = JavaData.sampleData();
Dataset<JavaPerson> dataset = sqlContext.createDataset(data, Encoders.bean(JavaPerson.class));

Transformations with the Dataset API look very much like the RDD API and deal with the Person class rather than an abstraction of a row.

Example: Filter by attribute with Dataset

Scala

dataset.filter(_.age < 21)

Java

dataset.filter(person -> person.getAge() < 21);

Despite the similarity with RDD code, this code is building a query plan, rather than dealing with individual objects, and if age is the only attribute accessed, then the rest of the object’s data will not be read from off-heap storage.
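
Because the filter only builds a plan, you can ask Spark to print that plan without executing anything; a small sketch against the dataset from the examples above:

dataset.filter(_.age < 21) // still only a query plan; nothing has run yet
       .explain(true)      // prints the analyzed, optimized and physical plans; an action such as show() would execute the query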

Conclusions

If you are developing primarily in Java then it is worth considering a move to Scala before adopting the DataFrame or Dataset APIs. Although there is an effort to support Java, Spark is written in Scala and the code often makes assumptions that make it hard (but not impossible) to deal with Java objects.

If you are developing in Scala and need your code to go into production with Spark 1.6.0 then the DataFrame API is clearly the most stable option available and currently offers the best performance.

However, the Dataset API preview looks very promising and provides a more natural way to code. Given the rapid evolution of Spark it is likely that this API will mature very quickly through 2016 and become the de-facto API for developing new applications.

See Apache Spark 2.0 API Improvements: RDD, DataFrame, DataSet and SQL here.
