A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous sequence of RDDs (of the same type) representing a continuous stream of data.
A DStream is essentially a discretized stream: the stream is broken up into a list of RDDs, so the basic operations are still built on top of RDDs.
The basic definition of DStream is shown below. Compared with an ordinary RDD, time is a far more important factor for a DStream:
the interval used to slice the stream into RDDs, the time at which the stream starts, how long the DStream must remember each RDD, the time key that each RDD corresponds to, and so on.
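As a quick orientation, here is a minimal usage sketch (not taken from the listings below; host, port and the intervals are placeholders): a driver program builds DStreams through a StreamingContext and chains RDD-like operations such as map and reduceByKeyAndWindow on them.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // implicit conversions for pair DStreams

val conf = new SparkConf().setAppName("DStreamSketch")
val ssc = new StreamingContext(conf, Seconds(2))       // every 2-second batch becomes one RDD

val lines = ssc.socketTextStream("localhost", 9999)    // an input DStream from live data
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
counts.print()

ssc.start()
ssc.awaitTermination()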

The DStream abstract definition

/**
 * A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous
 * sequence of RDDs (of the same type) representing a continuous stream of data (see
 * org.apache.spark.rdd.RDD in the Spark core documentation for more details on RDDs).
 * DStreams can either be created from live data (such as, data from TCP sockets, Kafka, Flume,
 * etc.) using a [[org.apache.spark.streaming.StreamingContext]] or it can be generated by
 * transforming existing DStreams using operations such as `map`,
 * `window` and `reduceByKeyAndWindow`. While a Spark Streaming program is running, each DStream
 * periodically generates a RDD, either from live data or by transforming the RDD generated by a
 * parent DStream.
 *
 * This class contains the basic operations available on all DStreams, such as `map`, `filter` and
 * `window`. In addition, [[org.apache.spark.streaming.dstream.PairDStreamFunctions]] contains
 * operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and
 * `join`. These operations are automatically available on any DStream of pairs
 * (e.g., DStream[(Int, Int)] through implicit conversions when
 * `org.apache.spark.streaming.StreamingContext._` is imported.
 *
 * DStreams internally is characterized by a few basic properties:
 *  - A list of other DStreams that the DStream depends on
 *  - A time interval at which the DStream generates an RDD
 *  - A function that is used to generate an RDD after each time interval
 */

abstract class DStream[T: ClassTag] (
    @transient private[streaming] var ssc: StreamingContext
  ) extends Serializable with Logging {

  // =======================================================================
  // Methods that should be implemented by subclasses of DStream
  // =======================================================================

  /** Time interval after which the DStream generates a RDD */
  def slideDuration: Duration // the interval at which the stream is sliced into RDDs

  /** List of parent DStreams on which this DStream depends on */
  def dependencies: List[DStream[_]] // like RDDs, DStreams also have dependency relationships

  /** Method that generates a RDD for the given time */
  def compute (validTime: Time): Option[RDD[T]] // the logic that generates the RDD

  // =======================================================================
  // Methods and fields available on all DStreams
  // =======================================================================

  // RDDs generated, marked as private[streaming] so that testsuites can access it
  @transient
  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]] () // the core structure: a DStream is a hash map of RDDs keyed by time

  // Time zero for the DStream
  private[streaming] var zeroTime: Time = null // the time at which the stream starts

  // Duration for which the DStream will remember each RDD created
  private[streaming] var rememberDuration: Duration = null // the stream is unbounded and a DStream cannot keep every RDD, so it only remembers RDDs for this duration

  // Storage level of the RDDs in the stream
  private[streaming] var storageLevel: StorageLevel = StorageLevel.NONE

  // Checkpoint details
  private[streaming] val mustCheckpoint = false
  private[streaming] var checkpointDuration: Duration = null
  private[streaming] val checkpointData = new DStreamCheckpointData(this)

  // Reference to whole DStream graph
  private[streaming] var graph: DStreamGraph = null // the DStreamGraph

  // Duration for which the DStream requires its parent DStream to remember each RDD created
  private[streaming] def parentRememberDuration = rememberDuration

  /** Return the StreamingContext associated with this DStream */
  def context = ssc

  /** Persist the RDDs of this DStream with the given storage level */
  def persist(level: StorageLevel): DStream[T] = {
    this.storageLevel = level
    this
  }

  /** Persist RDDs of this DStream with the default storage level (MEMORY_ONLY_SER) */
  def persist(): DStream[T] = persist(StorageLevel.MEMORY_ONLY_SER)

  /** Persist RDDs of this DStream with the default storage level (MEMORY_ONLY_SER) */
  def cache(): DStream[T] = persist()

  /**
   * Enable periodic checkpointing of RDDs of this DStream
   * @param interval Time interval after which generated RDD will be checkpointed
   */
  def checkpoint(interval: Duration): DStream[T] = {
    persist()
    checkpointDuration = interval
    this
  }
}
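A brief usage sketch of the persistence and checkpointing knobs defined above (the stream name and intervals are hypothetical; the checkpoint interval is expected to line up with the batch/slide interval):

val lines = ssc.socketTextStream("localhost", 9999)
lines.persist(StorageLevel.MEMORY_AND_DISK_SER) // override the MEMORY_ONLY_SER default
lines.checkpoint(Seconds(20))                   // checkpoint the generated RDDs every 20 seconds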

getOrCompute
Note that this only produces the RDD object; no real computation happens here. The actual computation is performed only when runJob is called.
A Spark RDD does not hold concrete data itself; it only defines the workflow (the dependencies) and the processing logic.

/**
 * Retrieve a precomputed RDD of this DStream, or computes the RDD. This is an internal
 * method that should not be called directly.
 */
private[streaming] def getOrCompute(time: Time): Option[RDD[T]] = {
  // If this DStream was not initialized (i.e., zeroTime not set), then do it
  // If RDD was already generated, then retrieve it from HashMap
  generatedRDDs.get(time) match {

    // If an RDD was already generated and is being reused, then
    // probably all RDDs in this DStream will be reused and hence should be cached
    case Some(oldRDD) => Some(oldRDD)

    // if RDD was not generated, and if the time is valid
    // (based on sliding time of this DStream), then generate the RDD
    case None => { // needs to be computed
      if (isTimeValid(time)) { // invalid means: time <= zeroTime || !(time - zeroTime).isMultipleOf(slideDuration)
        compute(time) match { // use compute to generate the RDD object
          case Some(newRDD) =>
            if (storageLevel != StorageLevel.NONE) {
              newRDD.persist(storageLevel) // apply the configured persist level
            }
            if (checkpointDuration != null &&
                (time - zeroTime).isMultipleOf(checkpointDuration)) {
              newRDD.checkpoint() // mark the RDD for checkpointing
            }
            generatedRDDs.put(time, newRDD) // put the generated RDD object into generatedRDDs
            Some(newRDD)
          case None =>
            None
        }
      } else {
        None
      }
    }
  }
}
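For reference, the isTimeValid check mentioned in the comment above boils down to the quoted condition. A hedged sketch based on that comment (the actual Spark method additionally throws if the DStream has not been initialized):

// A time is valid only if it lies after zeroTime and falls exactly on a
// slideDuration boundary relative to zeroTime.
private[streaming] def isTimeValid(time: Time): Boolean = {
  zeroTime != null && time > zeroTime && (time - zeroTime).isMultipleOf(slideDuration)
}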


generateJob
The RDD object produced by getOrCompute then needs to be wrapped into a Job.
The key part of a Job is its jobFunc, which simply submits a job to the Spark cluster.
Here only an emptyFunc is used; the concrete output logic has to be overridden by the specific output DStream, as sketched after the listing below.

/**
 * Generate a SparkStreaming job for the given time. This is an internal method that
 * should not be called directly. This default implementation creates a job
 * that materializes the corresponding RDD. Subclasses of DStream may override this
 * to generate their own jobs.
 */
private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) => {
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    }
    case None => None
  }
}
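For comparison, an output DStream overrides generateJob so that its own output action runs instead of emptyFunc. A rough, simplified sketch of how ForEachDStream does this (reconstructed from memory, not from the post's listings):

// ForEachDStream holds a user-supplied foreachFunc: (RDD[T], Time) => Unit
override def generateJob(time: Time): Option[Job] = {
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => foreachFunc(rdd, time) // the real output action, e.g. print or saveAsTextFiles
      Some(new Job(time, jobFunc))
    case None => None
  }
}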

clearMetadata
Clears out-of-date RDD objects; it also unpersists them (when enabled) and calls clearMetadata on its dependencies.

/**
 * Clear metadata that are older than `rememberDuration` of this DStream.
 * This is an internal method that should not be called directly. This default
 * implementation clears the old generated RDDs. Subclasses of DStream may override
 * this to clear their own metadata along with the generated RDDs.
 */
private[streaming] def clearMetadata(time: Time) {
  val oldRDDs = generatedRDDs.filter(_._1 <= (time - rememberDuration))
  generatedRDDs --= oldRDDs.keys
  if (ssc.conf.getBoolean("spark.streaming.unpersist", false)) {
    oldRDDs.values.foreach(_.unpersist(false))
  }
  dependencies.foreach(_.clearMetadata(time))
}
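Unpersisting old RDDs is gated by the spark.streaming.unpersist flag read above, which defaults to false in the version shown. A small sketch of turning it on when building the SparkConf:

val conf = new SparkConf()
  .setAppName("DStreamSketch")
  .set("spark.streaming.unpersist", "true") // let clearMetadata also unpersist RDDs older than rememberDuration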

Concrete DStream definitions

FilteredDStream

package org.apache.spark.streaming.dstream

private[streaming]
class FilteredDStream[T: ClassTag](
    parent: DStream[T],
    filterFunc: T => Boolean
  ) extends DStream[T](parent.ssc) {

  override def dependencies = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[T]] = {
    parent.getOrCompute(validTime).map(_.filter(filterFunc))
  }
}
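FilteredDStream is what DStream.filter returns; a sketch of the wrapper (quoted from memory, following the same pattern as the other single-parent DStreams):

/** Return a new DStream containing only the elements that satisfy a predicate. */
def filter(filterFunc: T => Boolean): DStream[T] = new FilteredDStream(this, filterFunc)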

WindowedDStream

private[streaming]
class WindowedDStream[T: ClassTag](
    parent: DStream[T],
    _windowDuration: Duration,
    _slideDuration: Duration)
  extends DStream[T](parent.ssc) {

  // Persist parent level by default, as those RDDs are going to be obviously reused.
  parent.persist(StorageLevel.MEMORY_ONLY_SER) // persist the parent by default, because the parent RDDs are read repeatedly as the window slides

  def windowDuration: Duration = _windowDuration // window size

  override def dependencies = List(parent)

  override def slideDuration: Duration = _slideDuration // how far the window slides

  override def parentRememberDuration: Duration = rememberDuration + windowDuration // ensures the parent's remember duration exceeds windowDuration

  override def persist(level: StorageLevel): DStream[T] = {
    // Do not let this windowed DStream be persisted as windowed (union-ed) RDDs share underlying
    // RDDs and persisting the windowed RDDs would store numerous copies of the underlying data.
    // Instead control the persistence of the parent DStream.
    // In other words, do not persist the windowed RDDs directly; persist the parent RDDs instead,
    // because neighbouring window RDDs share a lot of data and persisting them would waste space.
    parent.persist(level)
    this
  }

  override def compute(validTime: Time): Option[RDD[T]] = {
    val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime) // compute the window interval
    val rddsInWindow = parent.slice(currentWindow)
    val windowRDD = if (rddsInWindow.flatMap(_.partitioner).distinct.length == 1) {
      new PartitionerAwareUnionRDD(ssc.sc, rddsInWindow)
    } else {
      new UnionRDD(ssc.sc, rddsInWindow) // essentially a union of the parent DStream's RDDs that fall inside the window
    }
    Some(windowRDD)
  }
}
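A small worked example of how compute assembles a window (the batch, window and slide values are hypothetical): with a 2-second batch interval, window(Seconds(10), Seconds(4)) produces a new RDD every 4 seconds by unioning the last 5 parent RDDs.

val lines = ssc.socketTextStream("localhost", 9999)   // one parent RDD every 2 seconds
val windowed = lines.window(Seconds(10), Seconds(4))   // windowDuration = 10s, slideDuration = 4s

// At validTime = 20s the computed interval is
//   [validTime - windowDuration + parent.slideDuration, validTime] = [12s, 20s],
// i.e. the parent RDDs for times 12, 14, 16, 18 and 20 are unioned into one window RDD.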

ShuffledDStream

private[streaming]
class ShuffledDStream[K: ClassTag, V: ClassTag, C: ClassTag](
    parent: DStream[(K, V)],
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiner: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true
  ) extends DStream[(K, C)] (parent.ssc) {

  override def dependencies = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[(K, C)]] = {
    parent.getOrCompute(validTime) match {
      case Some(rdd) => Some(rdd.combineByKey[C](
        createCombiner, mergeValue, mergeCombiner, partitioner, mapSideCombine))
      case None => None
    }
  }
}

PairDStreamFunctions
Take groupByKey as an example: it works just like its counterpart in core Spark and is implemented on top of combineByKey.
The distinctive addition is groupByKeyAndWindow, which first uses WindowedDStream to union the RDDs inside the window and then applies combineByKey; usage sketches follow the listings below.

/**
 * Extra functions available on DStream of (key, value) pairs through an implicit conversion.
 * Import `org.apache.spark.streaming.StreamingContext._` at the top of your program to use
 * these functions.
 */
class PairDStreamFunctions[K: ClassTag, V: ClassTag](self: DStream[(K, V)])
  extends Serializable {

  private[streaming] def ssc = self.ssc

  private[streaming] def defaultPartitioner(numPartitions: Int = self.ssc.sc.defaultParallelism) = {
    new HashPartitioner(numPartitions)
  }

  /**
   * Return a new DStream by applying `groupByKey` on each RDD. The supplied
   * [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
   */
  def groupByKey(partitioner: Partitioner): DStream[(K, Seq[V])] = {
    val createCombiner = (v: V) => ArrayBuffer[V](v)
    val mergeValue = (c: ArrayBuffer[V], v: V) => (c += v)
    val mergeCombiner = (c1: ArrayBuffer[V], c2: ArrayBuffer[V]) => (c1 ++ c2)
    combineByKey(createCombiner, mergeValue, mergeCombiner, partitioner)
      .asInstanceOf[DStream[(K, Seq[V])]]
  }

  /**
   * Combine elements of each key in DStream's RDDs using custom functions. This is similar to the
   * combineByKey for RDDs. Please refer to combineByKey in
   * org.apache.spark.rdd.PairRDDFunctions in the Spark core documentation for more information.
   */
  def combineByKey[C: ClassTag](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiner: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true): DStream[(K, C)] = {
    new ShuffledDStream[K, V, C](self, createCombiner, mergeValue, mergeCombiner, partitioner,
      mapSideCombine)
  }
}
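A usage sketch of how these functions surface on a pair DStream through the implicit conversion (the words stream is a placeholder DStream[String]):

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext._   // brings PairDStreamFunctions into scope
import org.apache.spark.streaming.dstream.DStream

val pairs: DStream[(String, Int)] = words.map(word => (word, 1))
// Under the hood this builds a ShuffledDStream via combineByKey, exactly as shown above
val grouped: DStream[(String, Seq[Int])] = pairs.groupByKey(new HashPartitioner(4))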

groupByKeyAndWindow

/**
 * Create a new DStream by applying `groupByKey` over a sliding window on `this` DStream.
 * Similar to `DStream.groupByKey()`, but applies it over a sliding window.
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 * @param partitioner    partitioner for controlling the partitioning of each RDD in the new
 *                       DStream.
 */
def groupByKeyAndWindow(
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner
  ): DStream[(K, Seq[V])] = {
  val createCombiner = (v: Seq[V]) => new ArrayBuffer[V] ++= v
  val mergeValue = (buf: ArrayBuffer[V], v: Seq[V]) => buf ++= v
  val mergeCombiner = (buf1: ArrayBuffer[V], buf2: ArrayBuffer[V]) => buf1 ++= buf2
  self.groupByKey(partitioner)
      .window(windowDuration, slideDuration) // DStream.window wraps the current DStream in a WindowedDStream, see the code below
      .combineByKey[ArrayBuffer[V]](createCombiner, mergeValue, mergeCombiner, partitioner)
      .asInstanceOf[DStream[(K, Seq[V])]]
}

/**
 * Return a new DStream in which each RDD contains all the elements in seen in a
 * sliding window of time over this DStream.
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 */
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = {
  new WindowedDStream(this, windowDuration, slideDuration)
}
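A usage sketch of groupByKeyAndWindow, reusing the hypothetical pairs stream and imports from the earlier sketch (window and slide values are placeholders; both must be multiples of the batch interval):

// Every 10 seconds, group the values of the last 30 seconds by key.
// Internally: groupByKey -> window(...) wraps the result in a WindowedDStream -> combineByKey.
val windowedGroups: DStream[(String, Seq[Int])] =
  pairs.groupByKeyAndWindow(Seconds(30), Seconds(10), new HashPartitioner(4))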

updateStateByKey

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated. Note, that
 *                   this function may generate a different a tuple with a different key
 *                   than the input key. It is up to the developer to decide whether to
 *                   remember the partitioner despite the key being changed.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream
 * @param rememberPartitioner Whether to remember the paritioner object in the generated RDDs.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean
  ): DStream[(K, S)] = {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner)
}
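A usage sketch of the overload listed above, building a running word count on the hypothetical pairs stream from the earlier sketches (the checkpoint directory is a placeholder; simpler overloads taking a (Seq[V], Option[S]) => Option[S] function also exist):

// For each key the iterator provides (key, values seen in this batch, previous state);
// we return the updated (key, state) pairs.
val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) =>
  iter.map { case (word, newCounts, prevTotal) =>
    (word, newCounts.sum + prevTotal.getOrElse(0))
  }

val runningCounts: DStream[(String, Int)] =
  pairs.updateStateByKey[Int](updateFunc, new HashPartitioner(4), true)

ssc.checkpoint("/tmp/streaming-checkpoint") // required: StateDStream sets mustCheckpoint = true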

StateDStream
An ordinary DStream derives its current RDD solely from the parent RDD via compute.
What makes StateDStream special is that, besides the parent RDD, it also consults the previous RDD. This only arises in the streaming scenario; only here do RDDs have a temporal relationship with one another.
The previous RDD is getOrCompute(validTime - slideDuration), i.e. the RDD at the preceding time interval in this DStream's generatedRDDs.
The processing function, val finalFunc = (iterator: Iterator[(K, (Seq[V], Seq[S]))]) => { ... }, works with three pieces of information: the key, the values from the parent RDD, and the value from the previous RDD.
The processing function must also handle the cases where the parent RDD or the previous RDD is missing.

Note that StateDStream persists and checkpoints its RDDs by default.

private[streaming]
class StateDStream[K: ClassTag, V: ClassTag, S: ClassTag](
    parent: DStream[(K, V)],
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    preservePartitioning: Boolean
  ) extends DStream[(K, S)](parent.ssc) {

  super.persist(StorageLevel.MEMORY_ONLY_SER) // persist in memory by default, because the next state RDD will need it

  override def dependencies = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override val mustCheckpoint = true // checkpointing is mandatory, since the state must be preserved

  override def compute(validTime: Time): Option[RDD[(K, S)]] = {

    // Try to get the previous state RDD
    getOrCompute(validTime - slideDuration) match {

      case Some(prevStateRDD) => { // If previous state RDD exists

        // Try to get the parent RDD
        parent.getOrCompute(validTime) match { // the case where both the previous RDD and the parent RDD exist
          case Some(parentRDD) => { // If parent RDD exists, then compute as usual

            // Define the function for the mapPartition operation on cogrouped RDD;
            // first map the cogrouped tuple to tuples of required type,
            // and then apply the update function
            val updateFuncLocal = updateFunc
            val finalFunc = (iterator: Iterator[(K, (Seq[V], Seq[S]))]) => {
              val i = iterator.map(t => {
                (t._1, t._2._1, t._2._2.headOption)
              })
              updateFuncLocal(i)
            }
            val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner) // `(k, a) cogroup (k, b)` produces k -> (Seq of as, Seq of bs)
            val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
            Some(stateRDD)
          }
          case None => { // If parent RDD does not exist

            // Re-apply the update function to the old state RDD
            val updateFuncLocal = updateFunc
            val finalFunc = (iterator: Iterator[(K, S)]) => {
              val i = iterator.map(t => (t._1, Seq[V](), Option(t._2))) // simply pass an empty Seq[V]() in place of the parent RDD's values
              updateFuncLocal(i)
            }
            val stateRDD = prevStateRDD.mapPartitions(finalFunc, preservePartitioning)
            Some(stateRDD)
          }
        }
      }

      case None => { // If previous session RDD does not exist (first input data)

        // Try to get the parent RDD
        parent.getOrCompute(validTime) match {
          case Some(parentRDD) => { // If parent RDD exists, then compute as usual; no previous RDD means this is the first state RDD

            // Define the function for the mapPartition operation on grouped RDD;
            // first map the grouped tuple to tuples of required type,
            // and then apply the update function
            val updateFuncLocal = updateFunc
            val finalFunc = (iterator: Iterator[(K, Seq[V])]) => {
              updateFuncLocal(iterator.map(tuple => (tuple._1, tuple._2, None))) // pass None for the previous state
            }

            val groupedRDD = parentRDD.groupByKey(partitioner)
            val sessionRDD = groupedRDD.mapPartitions(finalFunc, preservePartitioning)
            // logDebug("Generating state RDD for time " + validTime + " (first)")
            Some(sessionRDD)
          }
          case None => { // If parent RDD does not exist, then nothing to do! With neither a previous nor a parent RDD there is nothing to compute
            // logDebug("Not generating state RDD (no previous state, no parent)")
            None
          }
        }
      }
    }
  }
}

TransformedDStream
This is a fairly general operation: a user-defined transformFunc computes the current RDD from a set of parent RDDs.
Note that these parent DStreams must belong to the same StreamingContext and have the same slideDuration.
The DStream interface exposes two variants, transform and transformWith; see the source below (a usage sketch follows the listings).

private[streaming]
class TransformedDStream[U: ClassTag] (
    parents: Seq[DStream[_]],
    transformFunc: (Seq[RDD[_]], Time) => RDD[U]
  ) extends DStream[U](parents.head.ssc) {

  require(parents.length > 0, "List of DStreams to transform is empty")
  require(parents.map(_.ssc).distinct.size == 1, "Some of the DStreams have different contexts")
  require(parents.map(_.slideDuration).distinct.size == 1,
    "Some of the DStreams have different slide durations")

  override def dependencies = parents.toList

  override def slideDuration: Duration = parents.head.slideDuration

  override def compute(validTime: Time): Option[RDD[U]] = {
    val parentRDDs = parents.map(_.getOrCompute(validTime).orNull).toSeq
    Some(transformFunc(parentRDDs, validTime))
  }
}

/**
 * Return a new DStream in which each RDD is generated by applying a function
 * on each RDD of 'this' DStream.
 */
def transform[U: ClassTag](transformFunc: (RDD[T], Time) => RDD[U]): DStream[U] = {
  val cleanedF = context.sparkContext.clean(transformFunc)
  val realTransformFunc = (rdds: Seq[RDD[_]], time: Time) => {
    assert(rdds.length == 1)
    cleanedF(rdds.head.asInstanceOf[RDD[T]], time)
  }
  new TransformedDStream[U](Seq(this), realTransformFunc) // just this, a single parent DStream
}

/**
 * Return a new DStream in which each RDD is generated by applying a function
 * on each RDD of 'this' DStream and 'other' DStream.
 */
def transformWith[U: ClassTag, V: ClassTag](
    other: DStream[U], transformFunc: (RDD[T], RDD[U], Time) => RDD[V]
  ): DStream[V] = {
  val cleanedF = ssc.sparkContext.clean(transformFunc)
  val realTransformFunc = (rdds: Seq[RDD[_]], time: Time) => {
    assert(rdds.length == 2)
    val rdd1 = rdds(0).asInstanceOf[RDD[T]]
    val rdd2 = rdds(1).asInstanceOf[RDD[U]]
    cleanedF(rdd1, rdd2, time)
  }
  new TransformedDStream[V](Seq(this, other), realTransformFunc) // this and other, multiple parent DStreams
}
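A usage sketch of the two variants, reusing the hypothetical pairs stream from the earlier sketches (otherPairs is another placeholder DStream[(String, Int)] on the same StreamingContext):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// transform: run arbitrary RDD code on each batch, e.g. sort each batch by count
val sortedPerBatch = pairs.transform((rdd, time) => rdd.sortBy(_._2, ascending = false))

// transformWith: combine the per-batch RDDs of two DStreams (same context, same slideDuration)
val merged = pairs.transformWith(otherPairs,
  (rdd1: RDD[(String, Int)], rdd2: RDD[(String, Int)], time: Time) => rdd1.union(rdd2))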
