spark 笔记 6: RDD

了解RDD之前，必读UCB的论文，个人认为这是最好的资料，没有之一。

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
* partitioned collection of elements that can be operated on in parallel. This class contains the

* basic operations available on all RDDs, such as `map`, `filter`, and `persist`. In addition,* Internally, each

* [[org.apache.spark.rdd.PairRDDFunctions]] contains operations available only on RDDs of key-value
* pairs, such as `groupByKey` and `join`;
* [[org.apache.spark.rdd.DoubleRDDFunctions]] contains operations available only on RDDs of
* Doubles; and
* [[org.apache.spark.rdd.SequenceFileRDDFunctions]] contains operations available on RDDs that
* can be saved as SequenceFiles.
* These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]
* through implicit conversions when you `import org.apache.spark.SparkContext._`.

*RDD is characterized by five main properties:
*
*  - A list of partitions
*  - A function for computing each split
*  - A list of dependencies on other RDDs
*  - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
*  - Optionally, a list of preferred locations to compute each split on (e.g. block locations for
*    an HDFS file)

* All of the scheduling and execution in Spark is done based on these methods, allowing each RDD
* to implement its own way of computing itself. Indeed, users can implement custom RDDs (e.g. for
* reading data from a new storage system) by overriding these functions. Please refer to the
* [[http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf Spark paper]] for more details
* on RDD internals.
*/
abstract class RDD[T: ClassTag](
    @transient private var sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {


RDD是spark中最基础的数据表达形式，它的compute方法用来产生partition。由子类实现。
/**
 * :: DeveloperApi ::
 * Implemented by subclasses to compute a given partition.
 */
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]


RDD的persist是一个主要的功能，它负责将RDD以某个存储级别保留给后续的计算流程使用，是的迭代计算高效。
/**
 * Set this RDD's storage level to persist its values across operations after the first time
 * it is computed. This can only be used to assign a new storage level if the RDD does not
 * have a storage level set yet..
 */
def persist(newLevel: StorageLevel): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  sc.persistRDD(this)
  // Register the RDD with the ContextCleaner for automatic GC-based cleanup
  sc.cleaner.foreach(_.registerRDDForCleanup(this))
  storageLevel = newLevel
  this
}

RDD可以设置本地化优先策略，这是在使用Hadoop做存储时提高性能的主要手段。
/**
 * Get the preferred locations of a partition (as hostnames), taking into account whether the
 * RDD is checkpointed.
 */
final def preferredLocations(split: Partition): Seq[String] = {
  checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
    getPreferredLocations(split)
  }
}

RDD可以转化为其他的RDD，map/flatMap/filter是三个最常用的转化方式
// Transformations (return a new RDD)


/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

/**
 *  Return a new RDD by first applying a function to all elements of this
 *  RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] =
  new FlatMappedRDD(this, sc.clean(f))

/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = new FilteredRDD(this, sc.clean(f))


注意，大部分时候RDD是推迟计算的，也就是在做transformation时，其实只是记录“如何做”，而真正的转化，是等到“Actions”来出发的。这样做的优势是使得串行化成为可能，这是spark性能高于hadoop的主要原因之一。
// Actions (launch a job to return a value to the user program)



/**
 * Applies a function f to all elements of this RDD.
 */
def foreach(f: T => Unit) {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
/**
 * Reduces the elements of this RDD using the specified commutative and
 * associative binary operator.
 */
def reduce(f: (T, T) => T): T = {
  val cleanF = sc.clean(f)
  val reducePartition: Iterator[T] => Option[T] = iter => {
    if (iter.hasNext) {
      Some(iter.reduceLeft(cleanF))
    } else {
      None
    }
  }
  var jobResult: Option[T] = None
  val mergeResult = (index: Int, taskResult: Option[T]) => {
    if (taskResult.isDefined) {
      jobResult = jobResult match {
        case Some(value) => Some(f(value, taskResult.get))
        case None => taskResult
      }
    }
  }
  sc.runJob(this, reducePartition, mergeResult)
  // Get the final result out of our Option, or throw an exception if the RDD was empty
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}


/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

/**
 * Returns the top K (largest) elements from this RDD as defined by the specified
 * implicit Ordering[T]. This does the opposite of [[takeOrdered]]. For example:
 * {{{
 *   sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
 *   // returns Array(12)
 *
 *   sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
 *   // returns Array(6, 5)
 * }}}
 *
 * @param num the number of top elements to return
 * @param ord the implicit ordering for T
 * @return an array of top elements
 */
def top(num: Int)(implicit ord: Ordering[T]): Array[T] = takeOrdered(num)(ord.reverse)


RDD的checkpoint功能意义也很重大，因为它会将RDD存到可靠存储设备，所以在这个RDD之前的历史记录就可以不用记录了（因为这个RDD已经是可靠的，不需要更老的历史了）。对于RDD以来很长的应用，选择合适的checkpiont显得格外重要。
/**
 * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
 * directory set with SparkContext.setCheckpointDir() and all references to its parent
 * RDDs will be removed. This function must be called before any job has been
 * executed on this RDD. It is strongly recommended that this RDD is persisted in
 * memory, otherwise saving it on a file will require recomputation.
 */
def checkpoint() {
  if (context.checkpointDir.isEmpty) {
    throw new Exception("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    checkpointData = Some(new RDDCheckpointData(this))
    checkpointData.get.markForCheckpoint()
  }
}


这个调试函数会打印绝大部分的RDD的状态和信息。
/** A description of this RDD and its recursive dependencies for debugging. */
def toDebugString: String = {

RDD的转换示意图：

来自为知笔记(Wiz)

spark 笔记 6: RDD的更多相关文章

Spark笔记：RDD基本操作（下）
上一篇里我提到可以把RDD当作一个数组,这样我们在学习spark的API时候很多问题就能很好理解了.上篇文章里的API也都是基于RDD是数组的数据模型而进行操作的. Spark是一个计算框架,是对ma ...
Spark笔记：RDD基本操作（上）
本文主要是讲解spark里RDD的基础操作.RDD是spark特有的数据模型,谈到RDD就会提到什么弹性分布式数据集,什么有向无环图,本文暂时不去展开这些高深概念,在阅读本文时候,大家可以就把RDD当 ...
Spark学习笔记3——RDD（下）
目录 Spark学习笔记3--RDD(下) 向Spark传递函数通过匿名内部类通过具名类传递通过带参数的 Java 函数类传递通过 lambda 表达式传递(仅限于 Java 8 及以上) 常 ...
Spark学习笔记2——RDD（上）
目录 Spark学习笔记2--RDD(上) RDD是什么? 例子创建 RDD 并行化方式读取外部数据集方式 RDD 操作转化操作行动操作惰性求值 Spark学习笔记2--RDD(上) 笔记摘 ...
Spark学习笔记之RDD中的Transformation和Action函数
总算可以开始写第一篇技术博客了,就从学习Spark开始吧.之前阅读了很多关于Spark的文章,对Spark的工作机制及编程模型有了一定了解,下面把Spark中对RDD的常用操作函数做一下总结,以pys ...
spark 中的RDD编程 -以下基于Java api
1.RDD介绍: RDD,弹性分布式数据集,即分布式的元素集合.在spark中,对所有数据的操作不外乎是创建RDD.转化已有的RDD以及调用RDD操作进行求值.在这一切的背后,Spark会自动 ...
spark笔记环境配置
spark笔记 spark简介 saprk 有六个核心组件: SparkCore.SparkSQL.SparkStreaming.StructedStreaming.MLlib,Graphx Spar ...
spark 笔记 2： Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf ucb关于spark的论文,对spark中核心组件RDD最原始.本质的理解, ...
Spark计算模型-RDD介绍
在Spark集群背后,有一个非常重要的分布式数据架构,即弹性分布式数据集(Resilient Distributed DataSet,RDD),它是逻辑集中的实体,在集群中的多台集群上进行数据分区.通 ...

随机推荐

27 Python 装饰器
一. 我们先写一个玩游戏的步骤 # def play(): # print("双击LOL") # print("选择狂战士") # print("进草 ...
dedecms织梦副栏目名称和链接调用
https://blog.csdn.net/qq_41805834/article/details/79552859
为Qtcreator 编译的程序添加管理员权限
(1)创建资源文件 myapp.rc 1 24 uac.manifest (2)创建文件uac.manifest <?xml version="1.0" encoding=& ...
Linux驱动开发之LED驱动
首先讲下字符设备控制技术 : 大部分驱动程序除了需要提供读写设备的能力外,还需要具备控制设备的能力.比如: 改变波特率. 在用户空间,使用ioctl系统调用来控制设备,原型如下:int ioctl(i ...
10 Zabbix4.4.1系统告警“Zabbix server is not running”
点击返回:自学Zabbix之路点击返回:自学Zabbix4.0之路点击返回:自学zabbix集锦 Zabbix4.4.1系统告警“Zabbix server is not running” 第一步 ...
Kruskal重构树+LCA || BZOJ 3732: Network
题面:https://www.lydsy.com/JudgeOnline/problem.php?id=3732 题解:Kruskal重构树板子代码: #include<cstdio> ...
MonkeyRunner的简介与综合实践
官方介绍: Monkeyrunner工具提供了一个API,用于编写可从Android代码外部控制Android设备或模拟器的程序.使用monkeyrunner,您可以编写一个Python程序来安装An ...
第三章指令-- 30 指令-使用钩子函数的第二个binding参数拿到传递的值
<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8&quo ...
js 字符串格式化为时间格式
首先介绍一下我遇到的坑,找了几个关于字符串转时间的,他们都可以就我用的时候不行. 我的原因,我的字符串是MYSQL拿出来的不是标准的时间格式,是不会转成功的. 解决思路:先将字符串转为标准时间格式的字 ...
noi.ac NA537 【Graph】
本来以为过了...然后FST了... 吐槽:nmdGraph为什么不连通... 这题想法其实非常\(na\ddot{\imath}ve\),就是对于一个连通块先钦点一个点为根,颜色是\(1\),考虑到 ...

spark 笔记 6: RDD

spark 笔记 6: RDD的更多相关文章

随机推荐

热门专题