DStream-04: How window functions work, with a source walkthrough

DStream has two kinds of window implementation: the general-purpose WindowedDStream, and ReducedWindowedDStream, which is optimized for windowed aggregation.

Demo

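import org.apache.spark.SparkConf

import org.apache.spark.streaming.{Durations, Seconds, StreamingContext}

object SocketWordCountDstreamReduceByWindow {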

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf()

      .setAppName("SocketWordCountDstream")

      .setMaster("local[3]")

    val sparkStreamContext = new StreamingContext(sparkConf,Seconds(5))

    val sparkContext = sparkStreamContext.sparkContext

    sparkContext.setLogLevel("WARN")

    val dstream = sparkStreamContext.socketTextStream("localhost",9090)

    val v = dstream.flatMap(_.split(" "))

      .map((_,1))

        .reduceByKeyAndWindow((p:Int,c:Int)=>{

          p + c

        },Durations.seconds(15),Durations.seconds(5))

    v.foreachRDD(...)

    sparkStreamContext.start()

    sparkStreamContext.awaitTermination()

  }

}

Source code

DStream

Background

Every DStream stores the RDD it generates for each batch in a map, so the references are held in memory.

// Map that holds the generated RDDs, keyed by batch time

@transient

private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()

 private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {

    // If RDD was already generated, then retrieve it from HashMap,

    // or else compute the RDD

    // In short: a cache hit is returned as-is, a miss falls through to compute

    generatedRDDs.get(time).orElse {

      if (isTimeValid(time)) {

        val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {

          SparkHadoopWriterUtils.disableOutputSpecValidation.withValue(true) {

            compute(time)

          }

        }

        rddOption.foreach { case newRDD =>

          // Register the generated RDD for caching and checkpointing

          if (storageLevel != StorageLevel.NONE) {

            newRDD.persist(storageLevel)

            logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")

          }

          if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {

            newRDD.checkpoint()

            logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")

          }

          // Store the new RDD in the map

          generatedRDDs.put(time, newRDD)

        }

        rddOption

      } else {

        None

      }

    }

  }
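The caching behaviour of getOrCompute can be modelled without Spark. The following is a minimal sketch, not Spark's implementation: CachedStream, computeFn and computeCalls are hypothetical names, a Long stands in for Time, and any value type stands in for an RDD.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a DStream: remembers one value per batch time.
final class CachedStream[T](computeFn: Long => Option[T]) {
  private val generated = mutable.HashMap.empty[Long, T]
  var computeCalls = 0 // counts cache misses, for illustration only

  def getOrCompute(time: Long): Option[T] =
    generated.get(time).orElse {
      // Cache miss: compute, then remember the result if there is one.
      computeCalls += 1
      val result = computeFn(time)
      result.foreach(value => generated.put(time, value))
      result
    }
}
```

Calling getOrCompute twice with the same time runs computeFn only once, which is what lets a window stream re-read earlier batches cheaply.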

The map is also cleaned up when a batch completes.

private[streaming] def clearMetadata(time: Time) {

  // Drop every RDD at or before (current batch time - rememberDuration)

  // e.g. time = 100030, rememberDuration = 10, generatedRDDs = {100020 -> rdd}: 100020 is removed

  val oldRDDs = generatedRDDs.filter(_._1 <= (time - rememberDuration))

  generatedRDDs --= oldRDDs.keys

  if (unpersistData) {

    logDebug(s"Unpersisting old RDDs: ${oldRDDs.values.map(_.id).mkString(", ")}")

    oldRDDs.values.foreach { rdd =>

      rdd.unpersist(false)

      // Explicitly remove blocks of BlockRDD

      rdd match {

        case b: BlockRDD[_] =>

          logInfo(s"Removing blocks of RDD $b of time $time")

          b.removeBlocks()

        case _ =>

      }

    }

  }

  dependencies.foreach(_.clearMetadata(time))

}
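The retention rule in clearMetadata comes down to one filter. As a rough sketch (RetentionSketch and expired are hypothetical names, and plain Long values stand in for Time and Duration):

```scala
object RetentionSketch {
  // Batch times that would be dropped: every key at or before
  // (time - rememberDuration).
  def expired(keys: Set[Long], time: Long, rememberDuration: Long): Set[Long] =
    keys.filter(_ <= time - rememberDuration)
}
```

With time = 100030 and rememberDuration = 10, the key 100020 is dropped and 100030 is kept. A larger rememberDuration, such as the one a window stream asks of its parent, keeps older keys alive for longer.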

Cleanup runs from the end of the pipeline backwards: each child DStream is cleared first, then its parents.

// By default rememberDuration = slideDuration, which is the batchInterval.

private[streaming] var rememberDuration: Duration = null

// So the parent inherits the child's rememberDuration

private[streaming] def parentRememberDuration = rememberDuration

When a DStream is initialized:

private[streaming] def initialize(time: Time) {

  var minRememberDuration = slideDuration

  // checkpointDuration is null for most DStreams

  if (checkpointDuration != null && minRememberDuration <= checkpointDuration) {

    minRememberDuration = checkpointDuration * 2

  }

  if (rememberDuration == null || rememberDuration < minRememberDuration) {

    rememberDuration = minRememberDuration

  }

  // Initialize the dependencies

  dependencies.foreach(_.initialize(zeroTime))

}
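The branch logic that settles rememberDuration can be isolated as a pure function. This is only a sketch of the lines shown above: RememberSketch is a hypothetical name, durations are plain milliseconds, and Option replaces the nullable Duration fields.

```scala
object RememberSketch {
  // slide: the stream's slideDuration
  // checkpoint: checkpointDuration, if one is set
  // current: a rememberDuration that was set before initialize, if any
  def rememberDuration(slide: Long, checkpoint: Option[Long], current: Option[Long]): Long = {
    val minRemember = checkpoint match {
      case Some(cp) if slide <= cp => cp * 2 // keep two checkpoint intervals
      case _                       => slide
    }
    // Keep an existing value only if it is at least the minimum.
    current.filter(_ >= minRemember).getOrElse(minRemember)
  }
}
```

For a stream with a 1000 ms slide and no checkpointing this yields 1000; with a 10000 ms checkpoint interval it yields 20000.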

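// Abstract method

// The official comment describes slideDuration as the time interval after which the DStream generates an RDD.

// By default slideDuration is the batchInterval. For a WindowedDStream it is the slide interval, which is also how often that stream produces an RDD.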

def slideDuration: Duration

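Take an example with interval = 1s, window = 4s, slide = 2s. Every interval (1s) the JobGenerator posts a GenerateJobs event, which calls getOrCompute on the last DStream. Each compute first calls parent.getOrCompute, so the call recurses back to the first DStream. getOrCompute, however, checks whether the batch's Time - zeroTime is a multiple of slideDuration (zeroTime is when the stream started; the first batch is at zeroTime + batchInterval). Only if it is does compute run; otherwise getOrCompute returns None.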

private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {

  // If RDD was already generated, then retrieve it from HashMap,

  // or else compute the RDD

  generatedRDDs.get(time).orElse {

    // Compute the RDD if time is valid (e.g. correct time in a sliding window)

    // of RDD generation, else generate nothing.

    if (isTimeValid(time)) {

		....

        rdd

      }

      rddOption

    } else {

      None

    }

  }

}

Validating the time. zeroTime is one batchInterval before the first batch, so the first valid time is zeroTime + batchInterval.

private[streaming] def isTimeValid(time: Time): Boolean = {

  if (!isInitialized) {

    throw new SparkException (this + " has not been initialized")

  } else if (time <= zeroTime || ! (time - zeroTime).isMultipleOf(slideDuration)) {

    logInfo(s"Time $time is invalid as zeroTime is $zeroTime" +

      s" , slideDuration is $slideDuration and difference is ${time - zeroTime}")

    false

  } else {

    logDebug(s"Time $time is valid")

    true

  }

}
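The check reduces to two conditions, which a small sketch makes easy to try out. TimeValiditySketch is a hypothetical name, and Long values replace Time and Duration.

```scala
object TimeValiditySketch {
  // A time is valid when it is strictly after zeroTime and lies a whole
  // number of slide intervals after it.
  def isTimeValid(time: Long, zeroTime: Long, slide: Long): Boolean =
    time > zeroTime && (time - zeroTime) % slide == 0
}
```

With zeroTime = 1582800004 and slide = 2, the times 1582800005 and 1582800007 are rejected, while 1582800006 and 1582800008 are accepted. zeroTime itself is never valid.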

For example, the common MappedDStream simply uses its parent's slideDuration:

override def slideDuration: Duration = parent.slideDuration

For the first DStream in the chain, slideDuration = batchDuration:

override def slideDuration: Duration = {

  if (ssc == null) throw new Exception("ssc is null")

  if (ssc.graph.batchDuration == null) throw new Exception("batchDuration is null")

  ssc.graph.batchDuration

}

InputDStream

InputDStream (an input stream, i.e. the data source) is the parent class of DirectKafkaInputDStream, FileInputDStream, and others. Every input stream inherits this method.

override def slideDuration: Duration = {

  if (ssc == null) throw new Exception("ssc is null")

  if (ssc.graph.batchDuration == null) throw new Exception("batchDuration is null")

  // The batch interval passed to new StreamingContext

  ssc.graph.batchDuration

}

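So by default slideDuration = batchInterval, the interval between batches.

It follows that rememberDuration = batchInterval and parentRememberDuration = batchInterval: by default only the RDD of the latest batch is retained.

Now to the main topic.

The entry point in the source is reduceByKeyAndWindow.

PairDStreamFunctions

It is really just dstream.reduceByKey().window().reduceByKey(). Next, look inside window().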

def reduceByKeyAndWindow(

    reduceFunc: (V, V) => V,

    windowDuration: Duration,

    slideDuration: Duration,

    partitioner: Partitioner

  ): DStream[(K, V)] = ssc.withScope {

  self.reduceByKey(reduceFunc, partitioner)

      .window(windowDuration, slideDuration)

      .reduceByKey(reduceFunc, partitioner)

}
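Why is it safe to reduce each batch before windowing and then reduce again? For an associative function the two-stage result equals a single reduce over all the raw pairs. The sketch below checks that on plain collections. TwoStageReduceSketch and its helpers are hypothetical names, and Seq and Map stand in for RDDs.

```scala
object TwoStageReduceSketch {
  type Batch = Seq[(String, Int)]

  // Plain-collection analogue of reduceByKey.
  def reduceByKey(pairs: Seq[(String, Int)])(f: (Int, Int) => Int): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // reduceByKey per batch, union over the window, then reduceByKey again.
  def windowed(batches: Seq[Batch])(f: (Int, Int) => Int): Map[String, Int] = {
    val preReduced = batches.map(batch => reduceByKey(batch)(f).toSeq)
    reduceByKey(preReduced.flatten)(f)
  }
}
```

The per-batch reduce shrinks each batch to one pair per key, so the union over the window handles far fewer records than the raw input would.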

DStream

Every DStream has a window function:

def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {

  new WindowedDStream(this, windowDuration, slideDuration)

}

WindowedDStream

Follow the call into the WindowedDStream class:

class WindowedDStream[T: ClassTag](

    parent: DStream[T],

    _windowDuration: Duration,

    _slideDuration: Duration)

  extends DStream[T](parent.ssc) {

  // Persist parent level by default, as those RDDs are going to be obviously reused.

  // Persist the parent DStream by default, because its RDDs will be reused across windows

   parent.persist(StorageLevel.MEMORY_ONLY_SER)

  // Window length

  def windowDuration: Duration = _windowDuration

  override def dependencies: List[DStream[_]] = List(parent)

  // Here slideDuration is not parent.slideDuration but the slide interval passed to window().

  override def slideDuration: Duration = _slideDuration

  // parentRememberDuration is rememberDuration + windowDuration (rememberDuration defaults to slideDuration)

  override def parentRememberDuration: Duration = rememberDuration + windowDuration

  override def compute(validTime: Time): Option[RDD[T]] = {

    val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)

    // Fetch the parent RDDs that fall inside the window

    val rddsInWindow = parent.slice(currentWindow)

    // Finally, union those RDDs

    Some(ssc.sc.union(rddsInWindow))

  }

}

DStream

def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = ssc.withScope {

  if (!isInitialized) {

    throw new SparkException(this + " has not been initialized")

  }

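  // windowStream.parent is an ordinary DStream, so slideDuration = batchInterval

  // zeroTime is when the stream started; the first batch is at zeroTime + batchInterval

  // Check whether (toTime - zeroTime) is a multiple of batchInterval; if not, round down so it lines up with a batch time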

  val alignedToTime = if ((toTime - zeroTime).isMultipleOf(slideDuration)) {

    toTime

  } else {

    logWarning(s"toTime ($toTime) is not a multiple of slideDuration ($slideDuration)")

    toTime.floor(slideDuration, zeroTime)

  }

  // Likewise for fromTime: check whether (fromTime - zeroTime) is a multiple of batchInterval

  val alignedFromTime = if ((fromTime - zeroTime).isMultipleOf(slideDuration)) {

    fromTime

  } else {

    logWarning(s"fromTime ($fromTime) is not a multiple of slideDuration ($slideDuration)")

    fromTime.floor(slideDuration, zeroTime)

  }

  logInfo(s"Slicing from $fromTime to $toTime" +

    s" (aligned to $alignedFromTime and $alignedToTime)")

  alignedFromTime.to(alignedToTime, slideDuration).flatMap { time =>

    // Step through the range in slideDuration increments to get the earlier batch times,

    // then call getOrCompute for each one, which returns the RDD already cached in memory.

    if (time >= zeroTime) getOrCompute(time) else None

  }

}
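The alignment and stepping in slice can be sketched with plain numbers. This assumes that floor rounds a time down to a whole number of slide intervals counted from zeroTime. SliceSketch and its methods are hypothetical names, and Long values replace Time and Duration.

```scala
object SliceSketch {
  // Round `time` down onto the grid zeroTime, zeroTime + slide, ...
  def floor(time: Long, slide: Long, zeroTime: Long): Long =
    ((time - zeroTime) / slide) * slide + zeroTime

  // The times for which slice would call getOrCompute. A time equal to
  // zeroTime is still listed here; getOrCompute rejects it afterwards.
  def sliceTimes(from: Long, to: Long, slide: Long, zeroTime: Long): Seq[Long] = {
    val alignedFrom = floor(from, slide, zeroTime)
    val alignedTo = floor(to, slide, zeroTime)
    (alignedFrom to alignedTo by slide).filter(_ >= zeroTime)
  }
}
```

A time that is already on the grid is returned unchanged by floor, so the sketch skips the isMultipleOf branch that the real method uses to decide whether to log a warning.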

Example

Take the same example again: interval = 1s, window = 4s, slide = 2s, with zeroTime = 1582800004.000.

isTimeValid holds when Time is later than zeroTime and (Time - zeroTime) is a multiple of slide.

After 1s, Time = 1582800005: the call reaches WindowedDStream first, isTimeValid = false, so the result is None.

After 2s, Time = 1582800006: the call reaches WindowedDStream first, isTimeValid = true:

1. val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)

First argument: 1582800006 - 4 + 1 = 1582800003

Second argument: 1582800006

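2. parent.slice(1582800003, 1582800006) asks for the RDDs at four times: 1582800003, 1582800004, 1582800005, and 1582800006. Neither 1582800003 nor 1582800004 is later than zeroTime, so they are not valid batch times and yield nothing. That leaves getOrCompute(1582800005) and getOrCompute(1582800006). So although the WindowedDStream returned None at 1582800005, the RDD for 1582800005 is picked up at 1582800006, and the same holds for every later window.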
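The trace above can be replayed for any tick. The sketch below is a simplified model and not Spark code: WindowTraceSketch and parentBatchesRead are hypothetical names, times are whole seconds, and the parent is assumed to produce one RDD per batch interval.

```scala
object WindowTraceSketch {
  private def valid(time: Long, zero: Long, slide: Long): Boolean =
    time > zero && (time - zero) % slide == 0

  // None when the window stream does not fire at `time`; otherwise the
  // parent batch times whose RDDs end up in the union.
  def parentBatchesRead(time: Long, zero: Long, batch: Long,
                        window: Long, slide: Long): Option[Seq[Long]] =
    if (!valid(time, zero, slide)) None
    else {
      val begin = time - window + batch
      Some((begin to time by batch).filter(t => valid(t, zero, batch)))
    }
}
```

With zero = 1582800004, batch = 1, window = 4 and slide = 2, the tick at 1582800005 produces None, the tick at 1582800006 reads batches 1582800005 and 1582800006, and the tick at 1582800008 is the first to read a full window of four batches.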

Demo 2

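Consider a long window where the reduceByKey().window().reduceByKey() computation is slow. Recomputing all 4 batches' RDDs for every window is wasteful (provided the aggregation is not something like deduplication, which cannot be undone).

Assume batchInterval = 1s, window = 4s, slide = 1s.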

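The numbers below stand for timestamps.

The first window computes over the RDDs of 1, 2, 3, 4

The second window computes over the RDDs of 2, 3, 4, 5

The third window computes over the RDDs of 3, 4, 5, 6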

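The pattern is that each result can be reused by the next window, which saves computation. How is it reused?

Take the first result (1, 2, 3, 4), subtract the RDD of 1, and add the RDD of 5.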
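A quick numeric check of this idea, using a sum as the aggregation. This is an illustrative sketch only: IncrementalWindowSketch is a hypothetical name, and perBatch(i) is the already-reduced value of the batch at timestamp i.

```scala
object IncrementalWindowSketch {
  // Recompute the window ending at `end` from scratch.
  def naive(perBatch: IndexedSeq[Int], end: Int, window: Int): Int =
    perBatch.slice(end - window + 1, end + 1).sum

  // Reuse the previous window: remove what left, add what entered.
  def incremental(prev: Int, leaving: Seq[Int], entering: Seq[Int]): Int =
    prev - leaving.sum + entering.sum
}
```

Take perBatch = Vector(0, 3, 1, 4, 1, 5, 9), where index 0 is unused. The window ending at 4 sums to 9. Subtracting batch 1 (value 3) and adding batch 5 (value 5) gives 11, the same as recomputing over batches 2 to 5. The incremental path touches two batches instead of four, and the saving grows with the window length.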

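import org.apache.spark.SparkConf

import org.apache.spark.streaming.{Durations, Seconds, StreamingContext}

object SocketWordCountDstreamReduceByWindowOptimization {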

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf()

      .setAppName("SocketWordCountDstream")

      .setMaster("local[3]")

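    val sparkStreamContext = new StreamingContext(sparkConf,Seconds(1))

    // ReducedWindowedDStream sets mustCheckpoint = true, so a checkpoint directory is required.

    // The path below is only a placeholder.

    sparkStreamContext.checkpoint("/tmp/spark-streaming-checkpoint")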

    val sparkContext = sparkStreamContext.sparkContext

    sparkContext.setLogLevel("WARN")

    val dstream = sparkStreamContext.socketTextStream("localhost",9090)

    val v = dstream.flatMap(_.split(" "))

      .map((_,1))

        .reduceByKeyAndWindow((p:Int,c:Int)=>{

          p + c

        },(c:Int,p:Int)=>{

          c - p

        },Durations.seconds(4),Durations.seconds(2))

    v.foreachRDD(rdd => {

      ....

    })

    sparkStreamContext.start()

    sparkStreamContext.awaitTermination()

  }

}

PairDStreamFunctions

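Two key functions are passed in: the first (reduceFunc) adds the slide/batchInterval RDDs that enter the window, and the second (invReduceFunc) subtracts the slide/batchInterval RDDs that leave it.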

def reduceByKeyAndWindow(

    reduceFunc: (V, V) => V,

    invReduceFunc: (V, V) => V,

    windowDuration: Duration,

    slideDuration: Duration,

    partitioner: Partitioner,

    filterFunc: ((K, V)) => Boolean

  ): DStream[(K, V)] = ssc.withScope {

  val cleanedReduceFunc = ssc.sc.clean(reduceFunc)

  val cleanedInvReduceFunc = ssc.sc.clean(invReduceFunc)

  val cleanedFilterFunc = if (filterFunc != null) Some(ssc.sc.clean(filterFunc)) else None

  new ReducedWindowedDStream[K, V](

    self, cleanedReduceFunc, cleanedInvReduceFunc, cleanedFilterFunc,

    windowDuration, slideDuration, partitioner

  )

}

ReducedWindowedDStream

The key members are the same as in WindowedDStream:

def windowDuration: Duration = _windowDuration

override def dependencies: List[DStream[_]] = List(reducedStream)

override def slideDuration: Duration = _slideDuration

// This stream must be checkpointed

override val mustCheckpoint = true

override def parentRememberDuration: Duration = rememberDuration + windowDuration

How do we get the previous window's RDD? It is in the generatedRDDs map, keyed by time - slideDuration, the time at which the previous window was produced.

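For example, batchInterval = 1s, window = 4s, slide = 2s

Current time: 100006

Previous window: [100001, 100002, 100003, 100004]

First: the RDDs to subtract cover [100001, 100002] (closed-interval notation, likewise below)

Second: the RDDs to add cover [100005, 100006]

Expected current window: [100003, 100006]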
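These intervals follow mechanically from the four parameters. A minimal sketch of the arithmetic is shown here. ReducedWindowIntervals, Interval and ranges are hypothetical names modelled on the Spark fields, with Long values for the times.

```scala
object ReducedWindowIntervals {
  // Closed interval [begin, end].
  final case class Interval(begin: Long, end: Long) {
    def -(d: Long): Interval = Interval(begin - d, end - d)
  }

  // Returns (previous window, range to subtract, range to add).
  def ranges(now: Long, batch: Long, window: Long, slide: Long): (Interval, Interval, Interval) = {
    val current = Interval(now - window + batch, now)
    val previous = current - slide
    val leaving = Interval(previous.begin, current.begin - batch)
    val entering = Interval(previous.end + batch, current.end)
    (previous, leaving, entering)
  }
}
```

For now = 100006, batch = 1, window = 4, slide = 2 this gives the previous window [100001, 100004], the range to subtract [100001, 100002], and the range to add [100005, 100006], matching the numbers worked out by hand.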

// reducedStream takes the place of the parent DStream; it only applies a per-batch reduceByKey first.

private val reducedStream = parent.reduceByKey(reduceFunc, partitioner)

override def compute(validTime: Time): Option[RDD[(K, V)]] = {

  val reduceF = reduceFunc

  val invReduceF = invReduceFunc

  val currentTime = validTime

  // This evaluates to [100006 - 4 + 1, 100006] => [100003, 100006]

  val currentWindow = new Interval(currentTime - windowDuration + parent.slideDuration,

    currentTime)

  // This evaluates to [100003 - 2, 100006 - 2] => [100001, 100004], the previous window

  val previousWindow = currentWindow - slideDuration

  //  _____________________________

  // |  previous window   _________|___________________

  // |___________________|       current window        |  --------------> Time

  //                     |_____________________________|

  //

  // |________ _________|          |________ _________|

  //          |                             |

  //          V                             V

  //       old RDDs                     new RDDs

  //

  // Get the RDDs of the reduced values in "old time steps"

  // This evaluates to [100001, 100003 - 1] => [100001, 100002], matching the estimate above

  val oldRDDs =

    reducedStream.slice(previousWindow.beginTime, currentWindow.beginTime - parent.slideDuration)

  logDebug("# old RDDs = " + oldRDDs.size)

  // Get the RDDs of the reduced values in "new time steps"

  // This evaluates to [100004 + 1, 100006] => [100005, 100006]; note parent.slideDuration = batchInterval

  val newRDDs =

    reducedStream.slice(previousWindow.endTime + parent.slideDuration, currentWindow.endTime)

  logDebug("# new RDDs = " + newRDDs.size)

  // Get the RDD of the reduced value of the previous window

  // Get the previous window's RDD

  val previousWindowRDD =

    getOrCompute(previousWindow.endTime).getOrElse(ssc.sc.makeRDD(Seq[(K, V)]()))

  // Make the list of RDDs that needs to cogrouped together for reducing their reduced values

  val allRDDs = new ArrayBuffer[RDD[(K, V)]]() += previousWindowRDD ++= oldRDDs ++= newRDDs

  // Cogroup the reduced RDDs and merge the reduced valuesRDD

  // Cogroup the three groups of RDDs (previous window, old, new) by key

  val cogroupedRDD = new CoGroupedRDD[K](allRDDs.toSeq.asInstanceOf[Seq[RDD[(K, _)]]],

    partitioner)

  // val mergeValuesFunc = mergeValues(oldRDDs.size, newRDDs.size) _

  val numOldValues = oldRDDs.size

  val numNewValues = newRDDs.size

  val mergeValues = (arrayOfValues: Array[Iterable[V]]) => {

    if (arrayOfValues.length != 1 + numOldValues + numNewValues) {

      throw new Exception("Unexpected number of sequences of reduced values")

    }

    // Getting reduced values "old time steps" that will be removed from current window

    // Collect the values from oldRDDs that must be subtracted

   val oldValues = (1 to numOldValues).map(i => arrayOfValues(i)).filter(!_.isEmpty).map(_.head)

    // Getting reduced values "new time steps"

    // Collect the values from newRDDs that must be added

    val newValues =

      (1 to numNewValues).map(i => arrayOfValues(numOldValues + i)).filter(!_.isEmpty).map(_.head)

    // Check whether the previous window has a value for this key

    if (arrayOfValues(0).isEmpty) {

      // If previous window's reduce value does not exist, then at least new values should exist

      if (newValues.isEmpty) {

        throw new Exception("Neither previous window has value for key, nor new values found. " +

          "Are you sure your key class hashes consistently?")

      }

      // Reduce the new values

      // The previous window has no value for this key, so reduce only the new values

      newValues.reduce(reduceF) // return

    } else {

      // Get the previous window's reduced value

      var tempValue = arrayOfValues(0).head

      // If old values exists, then inverse reduce then from previous value

      if (!oldValues.isEmpty) {

        // Subtract oldRDDs: more precisely, apply invReduceF to the previous window's value and the reduced oldValues

        tempValue = invReduceF(tempValue, oldValues.reduce(reduceF))

      }

      // If new values exists, then reduce them with previous value

      if (!newValues.isEmpty) {

        // Add newRDDs: more precisely, apply reduceF to the running value and the reduced newValues

        tempValue = reduceF(tempValue, newValues.reduce(reduceF))

      }

      tempValue // return

    }

  }

  // Apply the function above to get the current window's values

  val mergedValuesRDD = cogroupedRDD.asInstanceOf[RDD[(K, Array[Iterable[V]])]]

    .mapValues(mergeValues)

  if (filterFunc.isDefined) {

    Some(mergedValuesRDD.filter(filterFunc.get))

  } else {

    Some(mergedValuesRDD)

  }

}
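To see mergeValues at work on a single key, here is a stripped-down model that runs without Spark. It is a sketch under assumptions: MergeValuesSketch and merge are hypothetical names, the value type is fixed to Int, and each Iterable holds at most one already-reduced value (empty means the key was absent from that RDD).

```scala
object MergeValuesSketch {
  // values(0): previous window; values(1..numOld): batches leaving the
  // window; values(numOld + 1 ..): batches entering it.
  def merge(values: Array[Iterable[Int]], numOld: Int, numNew: Int)(
      reduceF: (Int, Int) => Int, invReduceF: (Int, Int) => Int): Int = {
    require(values.length == 1 + numOld + numNew, "unexpected number of value groups")
    val oldValues = (1 to numOld).map(i => values(i)).filter(_.nonEmpty).map(_.head)
    val newValues = (1 to numNew).map(i => values(numOld + i)).filter(_.nonEmpty).map(_.head)
    if (values(0).isEmpty) {
      // Key is new to the window: only the entering values count.
      require(newValues.nonEmpty, "key has neither a previous nor a new value")
      newValues.reduce(reduceF)
    } else {
      val previous = values(0).head
      val afterRemoval =
        if (oldValues.nonEmpty) invReduceF(previous, oldValues.reduce(reduceF)) else previous
      if (newValues.nonEmpty) reduceF(afterRemoval, newValues.reduce(reduceF)) else afterRemoval
    }
  }
}
```

Suppose a key had a count of 10 in the previous window, counts 3 and 2 in the two leaving batches, and a count of 4 in one of the two entering batches. With addition and subtraction the result is 10 - (3 + 2) + 4 = 9. The sketch also shows why invReduceF must be a true inverse of reduceF: the subtraction step is only correct if it undoes exactly what the earlier additions did.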