For details on RDDs, see the Spark paper; below we walk through the source code.
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Internally, each RDD is characterized by five main properties:

- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

The operations on RDDs are split across the following classes (a usage sketch follows):

org.apache.spark.rdd.RDD (basic): contains the basic operations available on all RDDs, such as `map`, `filter`, and `persist`.

org.apache.spark.rdd.PairRDDFunctions: contains operations available only on RDDs of key-value pairs, such as `groupByKey` and `join`

org.apache.spark.rdd.DoubleRDDFunctions: contains operations available only on RDDs of Doubles

org.apache.spark.rdd.SequenceFileRDDFunctions: contains operations available on RDDs that can be saved as SequenceFiles
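The pair/double/sequence-file operations are attached to a plain RDD through implicit conversions (imported from the SparkContext companion object in this version of the code base). A small usage sketch, assuming a SparkContext named `sc` and a hypothetical input path:

```scala
import org.apache.spark.SparkContext._   // brings in the implicit conversions to PairRDDFunctions etc.

// basic operations, defined directly on RDD[T]
val words  = sc.textFile("hdfs://namenode:9000/tmp/input.txt").flatMap(_.split(" "))
val longer = words.filter(_.length > 3)

// key-value operations, available only after the implicit conversion to PairRDDFunctions
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// operations on RDD[Double], available via DoubleRDDFunctions
val meanLen = words.map(_.length.toDouble).mean()
```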

 

RDD is, first of all, a generic class: T is the type of the elements it holds, and all data processing goes through Iterator[T].
It is constructed from a SparkContext and its dependency list, Seq[Dependency[_]] deps.
The interfaces RDD exposes already give a good idea of what an RDD is:
1. An RDD is a chunk of data, possibly a large one, so it cannot be assumed to fit in a single machine's memory; it has to be split into partitions spread across the memory of the machines in the cluster.
Hence getPartitions, partitioner (how the data is partitioned), and getPreferredLocations (how locality is taken into account for each partition).

The definition of Partition is very simple: it holds only an index, not the data.

```scala
trait Partition extends Serializable {
  /**
   * Get the split's index within its parent RDD
   */
  def index: Int

  // A better default implementation of HashCode
  override def hashCode(): Int = index
}
```

2. RDDs are related to one another: an RDD turns its parent RDDs' data into its own data through its compute logic, so there is a lineage (cause-and-effect) relationship between RDDs.

All of the dependencies can be retrieved via getDependencies.

3. An RDD can be persisted; the most common case is cache, i.e. StorageLevel.MEMORY_ONLY.

4. An RDD can be checkpointed to make failover more efficient: when the RDD chain is long, relying purely on replaying the lineage is rather inefficient (a usage sketch of persist and checkpoint follows this list).

5. RDD.iterator produces the Iterator[T] used to iterate over the actual data.

6. All kinds of transformations and actions can be performed on an RDD.
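Before the source listing, a small usage sketch of points 3 and 4 (the SparkContext `sc` and the HDFS paths are hypothetical):

```scala
// where checkpoint files will be written
sc.setCheckpointDir("hdfs://namenode:9000/tmp/spark-checkpoint")

val logs   = sc.textFile("hdfs://namenode:9000/logs/app.log")
val errors = logs.filter(_.contains("ERROR"))

errors.cache()        // point 3: persist with the default StorageLevel.MEMORY_ONLY
errors.checkpoint()   // point 4: cut the lineage by materializing to reliable storage

errors.count()        // first action: computes the partitions, caches them, then writes the checkpoint
errors.count()        // second action: served from the cache, nothing is recomputed
```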

```scala
abstract class RDD[T: ClassManifest](
    @transient private var sc: SparkContext,        // @transient: not serialized
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {

  /** Auxiliary constructor for RDDs with a one-to-one dependency on a single parent;
   *  this covers many cases (filter, map, ...).
   *
   *  Construct an RDD with just a one-to-one dependency on one parent */
  def this(@transient oneParent: RDD[_]) =
    this(oneParent.context, List(new OneToOneDependency(oneParent)))  // only one parent here, so the parent RDD object is passed in directly

  // =======================================================================
  // Methods that should be implemented by subclasses of RDD
  // =======================================================================

  /** Implemented by subclasses to compute a given partition. */
  def compute(split: Partition, context: TaskContext): Iterator[T]

  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getPartitions: Array[Partition]

  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getDependencies: Seq[Dependency[_]] = deps

  /** Optionally overridden by subclasses to specify placement preferences. */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  /** Optionally overridden by subclasses to specify how they are partitioned. */
  val partitioner: Option[Partitioner] = None

  // =======================================================================
  // Methods and fields available on all RDDs
  // =======================================================================

  /** The SparkContext that created this RDD. */
  def sparkContext: SparkContext = sc

  /** A unique ID for this RDD (within its SparkContext). */
  val id: Int = sc.newRddId()

  /** A friendly name for this RDD */
  var name: String = null

  /**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet.
   */
  def persist(newLevel: StorageLevel): RDD[T] = {
    // TODO: Handle changes of StorageLevel
    if (storageLevel != StorageLevel.NONE && newLevel != storageLevel) {
      throw new UnsupportedOperationException(
        "Cannot change storage level of an RDD after it was already assigned a level")
    }
    storageLevel = newLevel
    // Register the RDD with the SparkContext
    sc.persistentRdds(id) = this
    this
  }

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def persist(): RDD[T] = persist(StorageLevel.MEMORY_ONLY)

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def cache(): RDD[T] = persist()

  /** Get the RDD's current storage level, or StorageLevel.NONE if none is set. */
  def getStorageLevel = storageLevel

  // Our dependencies and partitions will be gotten by calling subclass's methods below, and will
  // be overwritten when we're checkpointed
  private var dependencies_ : Seq[Dependency[_]] = null
  @transient private var partitions_ : Array[Partition] = null

  /** An Option holding our checkpoint RDD, if we are checkpointed.
   *  Checkpointing writes the RDD to files on disk to make failover more efficient (replaying the
   *  lineage is the alternative). If checkpointRDD exists, the RDD's data can be read from it
   *  directly instead of being recomputed. */
  private def checkpointRDD: Option[RDD[T]] = checkpointData.flatMap(_.checkpointRDD)
```

```scala
  /**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   * This should ''not'' be called by users directly, but is available for implementors of custom
   * subclasses of RDD.
   */
  /* This is the core of data access in an RDD: a Partition holds only an index, not the data itself.
   * So how is the data obtained? See the storage module.
   * In cacheManager.getOrCompute, the RDD and partition index are mapped to the corresponding
   * block, and the data is read from it. */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      // a StorageLevel other than NONE means this RDD has been persisted, so read it directly
      SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
    } else {
      // not persisted: recompute it, or read it from the checkpoint
      computeOrReadCheckpoint(split, context)
    }
  }

  // Transformations (return a new RDD)
  // ...... the various transformation interfaces: map, union, ...

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassManifest](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))
```

```scala
  // Actions (launch a job to return a value to the user program)
  // ...... the various action interfaces: count, collect, ...

  /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = {
    // runJob is only ever called from actions, which is why transformations are lazy
    sc.runJob(this, (iter: Iterator[T]) => {
      var result = 0L
      while (iter.hasNext) {
        result += 1L
        iter.next()
      }
      result
    }).sum
  }
```
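A small sketch (hypothetical data) of what that comment means in practice: the transformations only build new RDD objects, and sc.runJob is not invoked until an action runs.

```scala
val nums    = sc.parallelize(1 to 1000, 4)   // nothing is computed yet
val evens   = nums.filter(_ % 2 == 0)        // still lazy: just builds a FilteredRDD
val doubled = evens.map(_ * 2)               // still lazy: just builds a MappedRDD
val n       = doubled.count()                // action: sc.runJob is called here
```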

```scala
  // =======================================================================
  // Other internal methods and fields
  // =======================================================================

  /** Returns the first parent RDD */
  protected[spark] def firstParent[U: ClassManifest] = {
    dependencies.head.rdd.asInstanceOf[RDD[U]]
  }

  // ................
}
```
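To see how little a subclass has to provide, here is a minimal, hypothetical RDD (the names SimpleRangeRDD and RangePartition are invented for illustration): it only implements getPartitions and compute, passes Nil as its dependencies, and inherits the default partitioner (None) and preferred locations (Nil), matching the two optional properties listed at the top.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// a Partition only needs an index
private class RangePartition(val index: Int) extends Partition

// produces the numbers 0 until n, split into numSlices partitions
private class SimpleRangeRDD(sc: SparkContext, n: Int, numSlices: Int)
  extends RDD[Int](sc, Nil) {                  // no parent RDDs, so deps is Nil

  override def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices)(i => new RangePartition(i))

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val i = split.index
    (i * n / numSlices until (i + 1) * n / numSlices).iterator
  }
}
```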

 

Here we only discuss some of the basic RDDs; PairRDD will be covered separately.

FilteredRDD

One-to-one dependency.

filter wraps the current RDD in a FilteredRDD, passing the current RDD as the first argument and the function f as the second; the return value is the filtered RDD.

```scala
/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = new FilteredRDD(this, sc.clean(f))
```

In compute, the filter operation is simply applied to the parent RDD's Iterator[T].

```scala
// filter is a typical one-to-one dependency, so the auxiliary constructor is used
private[spark] class FilteredRDD[T: ClassManifest](
    prev: RDD[T],      // the parent RDD
    f: T => Boolean)   // f, the predicate used to filter
  extends RDD[T](prev) {

  // firstParent takes the first RDD out of deps, i.e. the prev RDD passed in;
  // with a one-to-one dependency, parent and child have identical partition information
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override val partitioner = prev.partitioner   // Since filter cannot change a partition's keys

  // compute is where the RDD's data is actually produced
  override def compute(split: Partition, context: TaskContext) =
    firstParent[T].iterator(split, context).filter(f)
}
```
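MappedRDD, which map creates above, presumably follows exactly the same one-to-one pattern; a hedged sketch (not quoted from the source) for comparison:

```scala
private[spark] class MappedRDD[U: ClassManifest, T: ClassManifest](
    prev: RDD[T],
    f: T => U)
  extends RDD[U](prev) {

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // the only real difference from FilteredRDD: map the parent's iterator instead of filtering it
  override def compute(split: Partition, context: TaskContext) =
    firstParent[T].iterator(split, context).map(f)
}
```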

 

UnionRDD

Range dependency, which is still narrow.

First look at how union is used: UnionRDD takes the SparkContext as its first argument and an array of the two RDDs as its second, and the return value is the new RDD produced by unioning them.

```scala
/**
 * Return the union of this RDD and another one. Any identical elements will appear multiple
 * times (use `.distinct()` to eliminate them).
 */
def union(other: RDD[T]): RDD[T] = new UnionRDD(sc, Array(this, other))
```

 

UnionPartition is defined first. What characterizes union is that it merely gathers the partitions of several RDDs into one RDD; the partitions themselves do not change, so the parent partitions can be reused directly.

It takes three parameters:

- idx: the partition id, i.e. its index within the UnionRDD
- rdd: the parent RDD
- splitIndex: the index of the partition in the parent RDD

```scala
private[spark] class UnionPartition[T: ClassManifest](idx: Int, rdd: RDD[T], splitIndex: Int)
  extends Partition {

  // take the corresponding partition out of the parent RDD and reuse it
  var split: Partition = rdd.partitions(splitIndex)

  // the Iterator can be reused as well
  def iterator(context: TaskContext) = rdd.iterator(split, context)

  def preferredLocations() = rdd.preferredLocations(split)

  // the partition id is new: once several RDDs are merged, the index necessarily changes
  override val index: Int = idx
}
```

The definition of UnionRDD:

```scala
class UnionRDD[T: ClassManifest](
    sc: SparkContext,
    @transient var rdds: Seq[RDD[T]])  // the parent RDD Seq
  extends RDD[T](sc, Nil) {            // Nil since we implement getDependencies

  override def getPartitions: Array[Partition] = {
    // the UnionRDD's partition count is the sum of the partition counts of all parent RDDs
    val array = new Array[Partition](rdds.map(_.partitions.size).sum)
    var pos = 0
    for (rdd <- rdds; split <- rdd.partitions) {
      array(pos) = new UnionPartition(pos, rdd, split.index)  // create all the UnionPartitions
      pos += 1
    }
    array
  }

  override def getDependencies: Seq[Dependency[_]] = {
    val deps = new ArrayBuffer[Dependency[_]]
    var pos = 0
    for (rdd <- rdds) {
      deps += new RangeDependency(rdd, 0, pos, rdd.partitions.size)  // create one RangeDependency per parent
      pos += rdd.partitions.size  // with a RangeDependency, pos advances by the whole range size
    }
    deps
  }

  // Union's compute is trivial: it just delegates to the parent partition's iterator
  override def compute(s: Partition, context: TaskContext): Iterator[T] =
    s.asInstanceOf[UnionPartition[T]].iterator(context)

  override def getPreferredLocations(s: Partition): Seq[String] =
    s.asInstanceOf[UnionPartition[T]].preferredLocations()
}
```
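A small sketch (hypothetical data) of the bookkeeping above:

```scala
val a = sc.parallelize(1 to 100, 2)      // 2 partitions
val b = sc.parallelize(101 to 200, 3)    // 3 partitions
val u = a.union(b)

u.partitions.size     // 5: the sum of the parents' partition counts
u.dependencies.size   // 2: one RangeDependency per parent RDD
```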
