1. updateStateByKey 解释:

    以DStream中的数据进行按key做reduce操作,然后对各个批次的数据进行累加

    在有新的数据信息进入或更新时。能够让用户保持想要的不论什么状。使用这个功能须要完毕两步:

    1) 定义状态:能够是随意数据类型

    2) 定义状态更新函数:用一个函数指定怎样使用先前的状态。从输入流中的新值更新状态。

    对于有状态操作,要不断的把当前和历史的时间切片的RDD累加计算,随着时间的流失,计算的数据规模会变得越来越大。

  2. updateStateByKey源代码:

    /**

    • Return a new “state” DStream where the state for each key is updated by applying
    • the given function on the previous state of the key and the new values of the key.
    • org.apache.spark.Partitioner is used to control the partitioning of each RDD.
    • @param updateFunc State update function. If this function returns None, then
    • corresponding state key-value pair will be eliminated.
    • @param partitioner Partitioner for controlling the partitioning of each RDD in the new
    • DStream.
    • @param initialRDD initial state value of each key.
    • @tparam S State type

      */

      def updateStateByKey[S: ClassTag](

      updateFunc: (Seq[V], Option[S]) => Option[S],

      partitioner: Partitioner,

      initialRDD: RDD[(K, S)]

      ): DStream[(K, S)] = {

      val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {

      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))

      }

      updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)

      }
  3. 代码实现

    • StatefulNetworkWordCount

      object StatefulNetworkWordCount {
      def main(args: Array[String]) {
      if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
      } Logger.getLogger("org.apache.spark").setLevel(Level.WARN) val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum val previousCount = state.getOrElse(0) Some(currentCount + previousCount)
      } val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
      } val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local")
      // Create the context with a 1 second batch size
      val ssc = new StreamingContext(sparkConf, Seconds(1))
      ssc.checkpoint(".") // Initial RDD input to updateStateByKey
      val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1))) // Create a ReceiverInputDStream on target ip:port and count the
      // words in input stream of \n delimited test (eg. generated by 'nc')
      val lines = ssc.socketTextStream(args(0), args(1).toInt)
      val words = lines.flatMap(_.split(" "))
      val wordDstream = words.map(x => (x, 1)) // Update the cumulative count using updateStateByKey
      // This will give a Dstream made of state (which is the cumulative count of the words)
      val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
      new HashPartitioner (ssc.sparkContext.defaultParallelism), true, initialRDD)
      stateDstream.print()
      ssc.start()
      ssc.awaitTermination()
      }
      }
    • NetworkWordCount

import org.apache.spark.SparkConf
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ object NetworkWordCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: NetworkWordCount <hostname> <port>")
System.exit(1)
} val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(10))
//使用updateStateByKey前须要设置checkpoint
ssc.checkpoint("hdfs://master:8020/spark/checkpoint") val addFunc = (currValues: Seq[Int], prevValueState: Option[Int]) => {
//通过Spark内部的reduceByKey按key规约。然后这里传入某key当前批次的Seq/List,再计算当前批次的总和
val currentCount = currValues.sum
// 已累加的值
val previousCount = prevValueState.getOrElse(0)
// 返回累加后的结果。是一个Option[Int]类型
Some(currentCount + previousCount)
} val lines = ssc.socketTextStream(args(0), args(1).toInt)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1)) //val currWordCounts = pairs.reduceByKey(_ + _)
//currWordCounts.print() val totalWordCounts = pairs.updateStateByKey[Int](addFunc)
totalWordCounts.print() ssc.start()
ssc.awaitTermination()
}
}
  • WebPagePopularityValueCalculator
package com.spark.streaming

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext} /**
* ━━━━━━神兽出没━━━━━━
*    ┏┓   ┏┓
*   ┏┛┻━━━┛┻┓
*   ┃       ┃
*   ┃   ━   ┃
*   ┃ ┳┛ ┗┳ ┃
*   ┃       ┃
*   ┃   ┻   ┃
*   ┃       ┃
*   ┗━┓   ┏━┛
*     ┃   ┃神兽保佑, 永无BUG!
*      ┃   ┃Code is far away from bug with the animal protecting
*     ┃   ┗━━━┓
*     ┃       ┣┓
*     ┃       ┏┛
*     ┗┓┓┏━┳┓┏┛
*      ┃┫┫ ┃┫┫
*      ┗┻┛ ┗┻┛
* ━━━━━━感觉萌萌哒━━━━━━
* Module Desc:
* User: wangyue
* DateTime: 15-11-9上午10:50
*/
object WebPagePopularityValueCalculator { private val checkpointDir = "popularity-data-checkpoint"
private val msgConsumerGroup = "user-behavior-topic-message-consumer-group" def main(args: Array[String]) { if (args.length < 2) {
println("Usage:WebPagePopularityValueCalculator zkserver1:2181, zkserver2: 2181, zkserver3: 2181 consumeMsgDataTimeInterval (secs) ")
System.exit(1)
} val Array(zkServers, processingInterval) = args
val conf = new SparkConf().setAppName("Web Page Popularity Value Calculator") val ssc = new StreamingContext(conf, Seconds(processingInterval.toInt))
//using updateStateByKey asks for enabling checkpoint
ssc.checkpoint(checkpointDir) val kafkaStream = KafkaUtils.createStream(
//Spark streaming context
ssc,
//zookeeper quorum. e.g zkserver1:2181,zkserver2:2181,...
zkServers,
//kafka message consumer group ID
msgConsumerGroup,
//Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread
Map("user-behavior-topic" -> 3))
val msgDataRDD = kafkaStream.map(_._2) //for debug use only
//println("Coming data in this interval...")
//msgDataRDD.print()
// e.g page37|5|1.5119122|-1
val popularityData = msgDataRDD.map { msgLine => {
val dataArr: Array[String] = msgLine.split("\\|")
val pageID = dataArr(0)
//calculate the popularity value
val popValue: Double = dataArr(1).toFloat * 0.8 + dataArr(2).toFloat * 0.8 + dataArr(3).toFloat * 1
(pageID, popValue)
}
} //sum the previous popularity value and current value
//定义一个匿名函数去把网页热度上一次的计算结果值和新计算的值相加,得到最新的热度值。 val updatePopularityValue = (iterator: Iterator[(String, Seq[Double], Option[Double])]) => {
iterator.flatMap(t => {
val newValue: Double = t._2.sum
val stateValue: Double = t._3.getOrElse(0);
Some(newValue + stateValue)
}.map(sumedValue => (t._1, sumedValue)))
} val initialRDD = ssc.sparkContext.parallelize(List(("page1", 0.00))) //调用 updateStateByKey 原语并传入上面定义的匿名函数更新网页热度值。
val stateDStream = popularityData.updateStateByKey[Double](updatePopularityValue,
new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD) //set the checkpoint interval to avoid too frequently data checkpoint which may
//may significantly reduce operation throughput
stateDStream.checkpoint(Duration(8 * processingInterval.toInt * 1000)) //after calculation, we need to sort the result and only show the top 10 hot pages
//最后得到最新结果后,须要对结果进行排序。最后打印热度值最高的 10 个网页。 stateDStream.foreachRDD { rdd => {
val sortedData = rdd.map { case (k, v) => (v, k) }.sortByKey(false)
val topKData = sortedData.take(10).map { case (v, k) => (k, v) }
topKData.foreach(x => {
println(x)
})
}
} ssc.start()
ssc.awaitTermination()
}
}

參考文章:

http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

https://github.com/apache/spark/blob/branch-1.3/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala

http://stackoverflow.com/questions/28998408/spark-streaming-example-calls-updatestatebykey-with-additional-parameters

http://stackoverflow.com/questions/27535668/spark-streaming-groupbykey-and-updatestatebykey-implementation

尊重原创,未经同意不得转载:

http://blog.csdn.net/stark_summer/article/details/47666337

spark streaming updateStateByKey 使用方法的更多相关文章

  1. Spark Streaming updateStateByKey案例实战和内幕源码解密

    本节课程主要分二个部分: 一.Spark Streaming updateStateByKey案例实战二.Spark Streaming updateStateByKey源码解密 第一部分: upda ...

  2. spark streaming updateStateByKey 用法

    object NetworkWordCount { def main(args: Array[String]) { ) { System.err.println("Usage: Networ ...

  3. Spark Streaming updateStateByKey和mapWithState源码解密

    本篇从二个方面进行源码分析: 一.updateStateByKey解密 二.mapWithState解密 通过对Spark研究角度来研究jvm.分布式.图计算.架构设计.软件工程思想,可以学到很多东西 ...

  4. 55、Spark Streaming:updateStateByKey以及基于缓存的实时wordcount程序

    一.updateStateByKey 1.概述 SparkStreaming 7*24 小时不间断的运行,有时需要管理一些状态,比如wordCount,每个batch的数据不是独立的而是需要累加的,这 ...

  5. Spark Streaming状态管理函数updateStateByKey和mapWithState

    Spark Streaming状态管理函数updateStateByKey和mapWithState 一.状态管理函数 二.mapWithState 2.1关于mapWithState 2.2mapW ...

  6. spark streaming - kafka updateStateByKey 统计用户消费金额

    场景 餐厅老板想要统计每个用户来他的店里总共消费了多少金额,我们可以使用updateStateByKey来实现 从kafka接收用户消费json数据,统计每分钟用户的消费情况,并且统计所有时间所有用户 ...

  7. Spark Streaming中空batches处理的两种方法(转)

    原文链接:Spark Streaming中空batches处理的两种方法 Spark Streaming是近实时(near real time)的小批处理系统.对给定的时间间隔(interval),S ...

  8. Spark之 Spark Streaming整合kafka(并演示reduceByKeyAndWindow、updateStateByKey算子使用)

    Kafka0.8版本基于receiver接受器去接受kafka topic中的数据(并演示reduceByKeyAndWindow的使用) 依赖 <dependency> <grou ...

  9. kafka broker Leader -1引起spark Streaming不能消费的故障解决方法

    一.问题描述:Kafka生产集群中有一台机器cdh-003由于物理故障原因挂掉了,并且系统起不来了,使得线上的spark Streaming实时任务不能正常消费,重启实时任务都不行.查看kafka t ...

随机推荐

  1. Django基于JWT实现微信小程序的登录和鉴权

    什么是JWT? JWT,全称Json Web Token,用于作为JSON对象在各方之间安全地传输信息.该信息可以被验证和信任,因为它是数字签名的. 与Session的区别 一.Session是在服务 ...

  2. jQuery封装的选项卡方法

    ********************************************************2018/3/15更新********************************* ...

  3. Less文件的建立

    1.打开DreamWeaver 2.选择 新建 Less       文件名为.less,保存于Less文件夹中 3.声明文档头:@charset "utf-8"; 4.将Less ...

  4. Windows下环境变量显示、设置或删除操作详情

    显示.设置或删除 cmd.exe 环境变量. SET [variable=[string]] variable 指定环境变量名. string 指定要指派给变量的一系列字符串. 要显示当前环境变量,键 ...

  5. [文章转载]-我的Java后端书架-江南白衣

    我的Java后端书架 (2016年暮春3.0版) 04月 24, 2016 | Filed under 技术 书架主要针对Java后端开发. 3.0版把一些后来买的.看的书添补进来,又或删掉或降级一些 ...

  6. Django 更新字段

    Django在1.7以后的版本提供数据迁移命令,用来在修改模型中的字段,更新到数据库 1. python manager.py makemigrations 命令用来创建迁移文件版本的 2. pyth ...

  7. Android控件的继承关系

    1.View,ViewGroup >View: }1.所有高级UI组件都继承View类而实现的 }2.一个View在屏幕上占据一块矩形区域 }3. 负责渲染 }4.负责处理发生的事件 }5.设置 ...

  8. Python 之web动态服务器

    webServer.py代码如下: import socket import sys from multiprocessing import Process class WSGIServer(obje ...

  9. 内核调试-perf introduction

    perf概念 perf_event Perf_events是目前在Linux上使用广泛的profiling/tracing工具,除了本身是内核(kernel)的组成部分以外,还提供了用户空间(user ...

  10. 腾讯云,搭建LNMP环境

    LNMP代表的就是:Linux系统下Nginx+MySQL+PHP这种网站服务器架构. Linux是一类Unix计算机操作系统的统称,是目前最流行的免费操作系统.代表版本有:debian.centos ...