14：Spark Streaming源码解读之State管理之updateStateByKey和mapWithState解密

首先简单解释一下什么是state(状态)管理？我们以wordcount为例。每个batchInterval会计算当前batch的单词计数，那如果需要计算从流开始到目前为止的单词出现的次数，该如计算呢？SparkStreaming提供了两种方法：updateStateByKey和mapWithState 。mapWithState 是1.6版本新增功能，目前属于实验阶段。mapWithState具官方说性能较updateStateByKey提升10倍。那么我们来看看他们到底是如何实现的。

一、updateStateByKey 解析

1.1 updateStateByKey 的使用实例

首先看一个updateStateByKey函数使用的例子：

object UpdateStateByKeyDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("UpdateStateByKeyDemo")
    val ssc = new StreamingContext(conf,Seconds(20))
    //要使用updateStateByKey方法，必须设置Checkpoint。
    ssc.checkpoint("/checkpoint/")
    val socketLines = ssc.socketTextStream("localhost",9999)
 
    socketLines.flatMap(_.split(",")).map(word=>(word,1))
      .updateStateByKey( (currValues:Seq[Int],preValue:Option[Int]) =>{
           val currValue = currValues.sum //将目前值相加
           Some(currValue + preValue.getOrElse(0)) //目前值的和加上历史值
    }).print()
 
    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
 
  }
}

代码很简单，关键地方写了详细的注释。

1.2 updateStateByKey 方法源码分析

我们知道map返回的是MappedDStream，而MappedDStream并没有updateStateByKey方法，并且它的父类DStream中也没有该方法。但是DStream的伴生对象中有一个隐式转换函数

  implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])
      (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null):
    PairDStreamFunctions[K, V] = {
    new PairDStreamFunctions[K, V](stream)
  }

PairDStreamFunction 中updateStateByKey的源码如下：

  def updateStateByKey[S: ClassTag](
      updateFunc: (Seq[V], Option[S]) => Option[S]
    ): DStream[(K, S)] = ssc.withScope {
    updateStateByKey(updateFunc, defaultPartitioner())
  }

其中updateFunc就要传入的参数，他是一个函数，Seq[V]表示当前key对应的所有值，Option[S] 是当前key的历史状态，返回的是新的状态。

最终会调用下面的方法：

  def updateStateByKey[S: ClassTag](
      updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
      partitioner: Partitioner,
      rememberPartitioner: Boolean
    ): DStream[(K, S)] = ssc.withScope {
     new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
  }

在这里面new出了一个StateDStream对象。在其compute方法中，会先获取上一个batch计算出的RDD（包含了至程序开始到上一个batch单词的累计计数），然后在获取本次batch中StateDStream的父类计算出的RDD（本次batch的单词计数）分别是prevStateRDD和parentRDD，然后在调用 computeUsingPreviousRDD 方法：

  private [this] def computeUsingPreviousRDD (
    parentRDD: RDD[(K, V)], prevStateRDD: RDD[(K, S)]) = {
    // Define the function for the mapPartition operation on cogrouped RDD;
    // first map the cogrouped tuple to tuples of required type,
    // and then apply the update function
    val updateFuncLocal = updateFunc
    val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
      val i = iterator.map { t =>
        val itr = t._2._2.iterator
        val headOption = if (itr.hasNext) Some(itr.next()) else None
        (t._1, t._2._1.toSeq, headOption)
      }
      updateFuncLocal(i)
    }
    val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
    val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
    Some(stateRDD)
  }

两个RDD进行cogroup然后应用updateStateByKey传入的函数。cogroup的性能是比较低下的。

二、mapWithState方法解析

2.1 mapWithState方法使用实例：

object StatefulNetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(".")

    // Initial state RDD for mapWithState operation
    val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

    // Create a ReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited test (eg. generated by 'nc')
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))

    // Update the cumulative count using mapWithState
    // This will give a DStream made of state (which is the cumulative count of the words)
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      val output = (word, sum)
      state.update(sum)
      output
    }

    val stateDstream = wordDstream.mapWithState(
      StateSpec.function(mappingFunc).initialState(initialRDD))
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

mapWithState接收的参数是一个StateSpec对象。在StateSpec中封装了状态管理的函数

mapWithState函数中创建了MapWithStateDStreamImpl对象

  def mapWithState[StateType: ClassTag, MappedType: ClassTag](
      spec: StateSpec[K, V, StateType, MappedType]
    ): MapWithStateDStream[K, V, StateType, MappedType] = {
    new MapWithStateDStreamImpl[K, V, StateType, MappedType](
      self,
      spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
    )
  }

MapWithStateDStreamImpl 中创建了一个InternalMapWithStateDStream类型对象internalStream，在MapWithStateDStreamImpl的compute方法中调用了internalStream的getOrCompute方法。

/** Internal implementation of the [[MapWithStateDStream]] */
private[streaming] class MapWithStateDStreamImpl[
    KeyType: ClassTag, ValueType: ClassTag, StateType: ClassTag, MappedType: ClassTag](
    dataStream: DStream[(KeyType, ValueType)],
    spec: StateSpecImpl[KeyType, ValueType, StateType, MappedType])
  extends MapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream.context) {

  private val internalStream =
    new InternalMapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream, spec)

  override def slideDuration: Duration = internalStream.slideDuration

  override def dependencies: List[DStream[_]] = List(internalStream)

  override def compute(validTime: Time): Option[RDD[MappedType]] = {
    internalStream.getOrCompute(validTime).map { _.flatMap[MappedType] { _.mappedData } }
  }

InternalMapWithStateDStream中没有getOrCompute方法，这里调用的是其父类 DStream 的getOrCpmpute方法，该方法中最终会调用InternalMapWithStateDStream的Compute方法：

  /** Method that generates a RDD for the given time */
  override def compute(validTime: Time): Option[RDD[MapWithStateRDDRecord[K, S, E]]] = {
    // Get the previous state or create a new empty state RDD
    val prevStateRDD = getOrCompute(validTime - slideDuration) match {
      case Some(rdd) =>
        if (rdd.partitioner != Some(partitioner)) {
          // If the RDD is not partitioned the right way, let us repartition it using the
          // partition index as the key. This is to ensure that state RDD is always partitioned
          // before creating another state RDD using it
          MapWithStateRDD.createFromRDD[K, V, S, E](
            rdd.flatMap { _.stateMap.getAll() }, partitioner, validTime)
        } else {
          rdd
        }
      case None =>
        MapWithStateRDD.createFromPairRDD[K, V, S, E](
          spec.getInitialStateRDD().getOrElse(new EmptyRDD[(K, S)](ssc.sparkContext)),
          partitioner,
          validTime
        )
    }


    // Compute the new state RDD with previous state RDD and partitioned data RDD
    // Even if there is no data RDD, use an empty one to create a new state RDD
    val dataRDD = parent.getOrCompute(validTime).getOrElse {
      context.sparkContext.emptyRDD[(K, V)]
    }
    val partitionedDataRDD = dataRDD.partitionBy(partitioner)
    val timeoutThresholdTime = spec.getTimeoutInterval().map { interval =>
      (validTime - interval).milliseconds
    }
    Some(new MapWithStateRDD(
      prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime))
  }

根据给定的时间生成一个MapWithStateRDD，首先获取了先前状态的RDD：preStateRDD和当前时间的RDD:dataRDD，然后对dataRDD基于先前状态RDD的分区器进行重新分区获取partitionedDataRDD。最后将preStateRDD，partitionedDataRDD和用户定义的函数mappingFunction传给新生成的MapWithStateRDD对象返回。

下面看一下MapWithStateRDD的compute方法：

  override def compute(
      partition: Partition, context: TaskContext): Iterator[MapWithStateRDDRecord[K, S, E]] = {

    val stateRDDPartition = partition.asInstanceOf[MapWithStateRDDPartition]
    val prevStateRDDIterator = prevStateRDD.iterator(
      stateRDDPartition.previousSessionRDDPartition, context)
    val dataIterator = partitionedDataRDD.iterator(
      stateRDDPartition.partitionedDataRDDPartition, context)
//prevRecord 代表一个分区的数据
    val prevRecord = if (prevStateRDDIterator.hasNext) Some(prevStateRDDIterator.next()) else None
    val newRecord = MapWithStateRDDRecord.updateRecordWithData(
      prevRecord,
      dataIterator,
      mappingFunction,
      batchTime,
      timeoutThresholdTime,
      removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
    )
    Iterator(newRecord)
  }

MapWithStateRDDRecord 对应MapWithStateRDD 的一个分区：

private[streaming] case class MapWithStateRDDRecord[K, S, E](
    var stateMap: StateMap[K, S], var mappedData: Seq[E])

其中stateMap存储了key的状态，mappedData存储了mapping function函数的返回值

看一下MapWithStateRDDRecord的updateRecordWithData方法

 def updateRecordWithData[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    prevRecord: Option[MapWithStateRDDRecord[K, S, E]],
    dataIterator: Iterator[(K, V)],
    mappingFunction: (Time, K, Option[V], State[S]) => Option[E],
    batchTime: Time,
    timeoutThresholdTime: Option[Long],
    removeTimedoutData: Boolean
  ): MapWithStateRDDRecord[K, S, E] = {


    // 创建一个新的 state map 从过去的Recoord中复制 (如果存在) 否则创建一下空的StateMap对象
    val newStateMap = prevRecord.map { _.stateMap.copy() }. getOrElse { new EmptyStateMap[K, S]() }

    val mappedData = new ArrayBuffer[E]
//状态
    val wrappedState = new StateImpl[S]()

    // Call the mapping function on each record in the data iterator, and accordingly
    // update the states touched, and collect the data returned by the mapping function
    dataIterator.foreach { case (key, value) =>
    //获取key对应的状态
      wrappedState.wrap(newStateMap.get(key))
//调用mappingFunction获取返回值
      val returned = mappingFunction(batchTime, key, Some(value), wrappedState)
    //维护newStateMap的值
      if (wrappedState.isRemoved) {
        newStateMap.remove(key)
      } else if (wrappedState.isUpdated
          || (wrappedState.exists && timeoutThresholdTime.isDefined)) {
        newStateMap.put(key, wrappedState.get(), batchTime.milliseconds)
      }
      mappedData ++= returned
    }

    // Get the timed out state records, call the mapping function on each and collect the
    // data returned
    if (removeTimedoutData && timeoutThresholdTime.isDefined) {
      newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
        wrappedState.wrapTimingOutState(state)
        val returned = mappingFunction(batchTime, key, None, wrappedState)
        mappedData ++= returned
        newStateMap.remove(key)
      }
    }

    MapWithStateRDDRecord(newStateMap, mappedData)
  }

最终返回MapWithStateRDDRecord对象交个MapWithStateRDD的compute函数，MapWithStateRDD的compute函数将其封装成Iterator返回。

来自为知笔记(Wiz)

14：Spark Streaming源码解读之State管理之updateStateByKey和mapWithState解密的更多相关文章

Spark Streaming源码解读之State管理之UpdataStateByKey和MapWithState解密
本期内容 : UpdateStateByKey解密 MapWithState解密 Spark Streaming是实现State状态管理因素: 01. Spark Streaming是按照整个Bach ...
Spark Streaming源码解读之JobScheduler内幕实现和深度思考
本期内容 : JobScheduler内幕实现 JobScheduler深度思考 JobScheduler 是整个Spark Streaming调度的核心,需要设置多线程,一条用于接收数据不断的循环, ...
Spark Streaming源码解读之流数据不断接收和全生命周期彻底研究和思考
本节的主要内容: 一.数据接受架构和设计模式二.接受数据的源码解读 Spark Streaming不断持续的接收数据,具有Receiver的Spark 应用程序的考虑. Receiver和Drive ...
15、Spark Streaming源码解读之No Receivers彻底思考
在前几期文章里讲了带Receiver的Spark Streaming 应用的相关源码解读,但是现在开发Spark Streaming的应用越来越多的采用No Receivers(Direct Appr ...
Spark Streaming源码解读之流数据不断接收全生命周期彻底研究和思考
本期内容 : 数据接收架构设计模式数据接收源码彻底研究一.Spark Streaming数据接收设计模式 Spark Streaming接收数据也相似MVC架构: 1. Mode相当于Rece ...
Spark Streaming源码解读之Receiver生成全生命周期彻底研究和思考
本期内容 : Receiver启动的方式设想 Receiver启动源码彻底分析多个输入源输入启动,Receiver启动失败,只要我们的集群存在就希望Receiver启动成功,运行过程中基于每个Tea ...
Spark Streaming源码解读之生成全生命周期彻底研究与思考
本期内容 : DStream与RDD关系彻底研究 Streaming中RDD的生成彻底研究问题的提出 : 1. RDD是怎么生成的,依靠什么生成 2.执行时是否与Spark Core上的RDD执行有 ...
Spark Streaming源码解读之Job动态生成和深度思考
本期内容 : Spark Streaming Job生成深度思考 Spark Streaming Job生成源码解析 Spark Core中的Job就是一个运行的作业,就是具体做的某一件事,这里的JO ...
16.Spark Streaming源码解读之数据清理机制解析
原创文章,转载请注明:转载自听风居士博客(http://www.cnblogs.com/zhouyf/) 本期内容: 一.Spark Streaming 数据清理总览二.Spark Streami ...

随机推荐

前端PHP入门-029-文件操作-掌握级别
人生只有三天,活在昨天的人迷惑:活在明天的人等待:活在今天的人最踏实.你永远无法预测意外和明天哪个来得更早,所以,我们能做的,就是尽最大的努力过好今天.请记住:今天永远是昨天死去的人所期待的明天. f ...
Oracle中查询当前数据库中的所有表空间和对应的数据文件语句命令
Oracle中查询当前数据库中的所有表空间和对应的数据文件语句命令 ------------------------------------------------------------------ ...
Tool1—安装配置Windows Live Writer
详细步骤请看:http://home.cnblogs.com/group/topic/8550.html . Windows Live Writer手工配置步骤(在博客园配置时输入用户名与密码会自动完 ...
数据库-Core Data
技术博客http://www.cnblogs.com/ChenYilong/ 新浪微博http://weibo.com/luohanchenyilong 数据库-Core Data 技术博客htt ...
css3背景色过渡
<!DOCTYPE html><html lang="zh-cmn-Hans"><head><meta charset="utf ...
Django之ModelForm（一）
要说ModelForm,那就先说Form吧! 先给出一个Form示例: models.py from django.db import models class UserType(models.Mod ...
Linux下的lds链接脚本详解【转】
转自:http://www.cnblogs.com/li-hao/p/4107964.html 转载自:http://linux.chinaunix.net/techdoc/beginner/2009 ...
用VIM查看编辑二进制文件
用VIM查看编辑二进制文件 vim可以很方便地编辑二进制文件,个人认为它比emacs的二进制编辑方式更好用.vim中二进制文件的编辑是先通过外部程序xxd来把文件dump成其二进制的文本形式,然后就可 ...
64_p9
python2-termcolor-1.1.0-11.fc26.noarch.rpm 12-Feb-2017 14:05 13610 python2-terminado-0.6-2.fc26.noar ...
在ubuntu中安装puppeteer
https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md 早些时候puppeteer刚出来,在vps上 ...

14：Spark Streaming源码解读之State管理之updateStateByKey和mapWithState解密

14：Spark Streaming源码解读之State管理之updateStateByKey和mapWithState解密的更多相关文章

随机推荐

热门专题