Using updateStateByKey in Spark Streaming

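What updateStateByKey does:

It reduces the data in a DStream by key, then accumulates the results across batches.

It lets you maintain arbitrary state and keep updating it as new data arrives. Using it takes two steps:

1) Define the state: the state can be of any data type.

2) Define the state update function: a function that specifies how to update the state from the previous state and the new values from the input stream.

For stateful operations, the RDDs of the current and historical time slices must be accumulated continuously, so the amount of data being computed keeps growing as time passes.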
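
To make the two steps concrete, here is a minimal sketch in plain Scala (no Spark required) that applies a word-count update function to three hypothetical batches by hand. The names `StateUpdateSketch`, `step`, and `batches`, and the sample words, are assumptions made for illustration. In a real job Spark does the per-key grouping and keeps the state between batches.

```scala
object StateUpdateSketch {
  // Step 1: the state is an Int (the running count per word).
  // Step 2: the update function combines the previous state with the new values.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) => Some(values.sum + state.getOrElse(0))

  // Apply one batch of (word, 1) pairs to the current state map.
  // Keys missing from the batch are still updated, with an empty Seq,
  // which mirrors how updateStateByKey calls the function for every known key.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      updateFunc(grouped.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical input: three batches of (word, 1) pairs
    val batches = Seq(
      Seq("hello" -> 1, "world" -> 1),
      Seq("hello" -> 1),
      Seq("spark" -> 1, "hello" -> 1)
    )
    val finalState = batches.foldLeft(Map.empty[String, Int])(step)
    println(finalState)
  }
}
```

After the three batches the state holds hello -> 3, world -> 1 and spark -> 1. Note that "world" keeps its count even though it only appeared in the first batch, which is why the state grows over time unless keys are removed.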

updateStateByKey source code:

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of the key.
 * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
 * @param updateFunc State update function. If this function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream.
 * @param initialRDD initial state value of each key.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner,
    initialRDD: RDD[(K, S)]
  ): DStream[(K, S)] = {
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
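
The doc comment says that returning None removes a key's state, and the wrapper shows why: flatMap over an Option emits nothing for None. The sketch below is plain Scala with hypothetical names (`EvictionSketch`, `expireIdle`, `wrap`, and the sample keys). It assumes a policy where a key is dropped as soon as a batch brings no new values for it. That is one possible way to stop the state from growing without bound, not something the original code does.

```scala
object EvictionSketch {
  // Hypothetical policy: drop a key's state when a batch has no new values for it.
  val expireIdle: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) =>
      if (values.isEmpty) None
      else Some(values.sum + state.getOrElse(0))

  // Same shape as newUpdateFunc in the Spark source above, written for String keys.
  def wrap(f: (Seq[Int], Option[Int]) => Option[Int])
      : Iterator[(String, Seq[Int], Option[Int])] => Iterator[(String, Int)] =
    iterator => iterator.flatMap(t => f(t._2, t._3).map(s => (t._1, s)))

  def main(args: Array[String]): Unit = {
    val input: Iterator[(String, Seq[Int], Option[Int])] = Iterator(
      ("active", Seq(1, 1), Some(5)),     // has new values: state becomes 7
      ("idle", Seq.empty[Int], Some(3)),  // no new values: state is dropped
      ("fresh", Seq(1), None)             // first appearance: state becomes 1
    )
    // Prints (active,7) and (fresh,1); "idle" produces no output
    wrap(expireIdle)(input).foreach(println)
  }
}
```

A real job would more likely expire keys after a timeout than after a single empty batch, which would mean storing a timestamp or counter in the state alongside the value.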

Code examples

StatefulNetworkWordCount

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Brings in the pair-DStream implicits on older Spark 1.x releases
import org.apache.spark.streaming.StreamingContext._

object StatefulNetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    // Reduce Spark's log output so the printed counts are easy to read
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

    // Add this batch's values for a key to the key's previous running total
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    // Iterator-based form required by the updateStateByKey overload used below
    val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
    }

    // The socket receiver occupies one thread, so local mode needs at least two
    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local[2]")

    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint(".")

    // Initial RDD input to updateStateByKey
    val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

    // Create a ReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))

    // Update the cumulative count using updateStateByKey
    // This will give a DStream made of state (which is the cumulative count of the words)
    val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)

    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

NetworkWordCount

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(10))

    // A checkpoint directory must be set before using updateStateByKey
    ssc.checkpoint("hdfs://master:8020/spark/checkpoint")

    val addFunc = (currValues: Seq[Int], prevValueState: Option[Int]) => {
      // Spark groups the values by key internally; currValues holds this key's
      // values from the current batch, so summing them gives the batch total
      val currentCount = currValues.sum
      // The total accumulated so far
      val previousCount = prevValueState.getOrElse(0)
      // Return the new running total as an Option[Int]
      Some(currentCount + previousCount)
    }

    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))

    //val currWordCounts = pairs.reduceByKey(_ + _)
    //currWordCounts.print()

    val totalWordCounts = pairs.updateStateByKey[Int](addFunc)
    totalWordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

WebPagePopularityValueCalculator

package com.spark.streaming

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

/**
 * Module Desc:
 * User: wangyue
 * DateTime: 15-11-9 10:50 AM
 */

object WebPagePopularityValueCalculator {
  private val checkpointDir = "popularity-data-checkpoint"
  private val msgConsumerGroup = "user-behavior-topic-message-consumer-group"

  def main(args: Array[String]) {
    if (args.length < 2) {
      println("Usage: WebPagePopularityValueCalculator zkserver1:2181,zkserver2:2181,zkserver3:2181 consumeMsgDataTimeInterval(secs)")
      System.exit(1)
    }

    val Array(zkServers, processingInterval) = args
    val conf = new SparkConf().setAppName("Web Page Popularity Value Calculator")
    val ssc = new StreamingContext(conf, Seconds(processingInterval.toInt))

    // updateStateByKey requires checkpointing to be enabled
    ssc.checkpoint(checkpointDir)

    val kafkaStream = KafkaUtils.createStream(
      // Spark streaming context
      ssc,
      // ZooKeeper quorum, e.g. zkserver1:2181,zkserver2:2181,...
      zkServers,
      // Kafka message consumer group ID
      msgConsumerGroup,
      // Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread
      Map("user-behavior-topic" -> 3))

    // Keep only the message payload; the first tuple element is the Kafka key
    val msgDataRDD = kafkaStream.map(_._2)

    // for debug use only
    //println("Coming data in this interval...")
    //msgDataRDD.print()

    // e.g. page37|5|1.5119122|-1
    val popularityData = msgDataRDD.map { msgLine =>
      val dataArr: Array[String] = msgLine.split("\\|")
      val pageID = dataArr(0)
      // calculate the popularity value
      val popValue: Double = dataArr(1).toFloat * 0.8 + dataArr(2).toFloat * 0.8 + dataArr(3).toFloat * 1
      (pageID, popValue)
    }

    // Anonymous function that adds a page's previously computed popularity value
    // to the newly computed one, giving the latest popularity value
    val updatePopularityValue = (iterator: Iterator[(String, Seq[Double], Option[Double])]) => {
      iterator.flatMap { t =>
        val newValue: Double = t._2.sum
        val stateValue: Double = t._3.getOrElse(0.0)
        Some(newValue + stateValue).map(summedValue => (t._1, summedValue))
      }
    }

    val initialRDD = ssc.sparkContext.parallelize(List(("page1", 0.00)))

    // Call updateStateByKey with the function defined above to update the popularity values
    val stateDStream = popularityData.updateStateByKey[Double](updatePopularityValue,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)

    // Set the checkpoint interval to avoid checkpointing too frequently,
    // which may significantly reduce operation throughput
    stateDStream.checkpoint(Duration(8 * processingInterval.toInt * 1000))

    // After the calculation, sort the result and print only the 10 pages
    // with the highest popularity values
    stateDStream.foreachRDD { rdd =>
      val sortedData = rdd.map { case (k, v) => (v, k) }.sortByKey(false)
      val topKData = sortedData.take(10).map { case (v, k) => (k, v) }
      topKData.foreach(x => println(x))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
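
As a quick check of the popularity formula, the sketch below runs the same parsing and weighting on the sample message from the code comment, in plain Scala without Kafka or Spark. The original does not say what the three numeric fields mean, so they are referred to only by position. `PopularitySketch` and `parse` are illustrative names, and the sketch uses toDouble where the job uses toFloat.

```scala
object PopularitySketch {
  // Parse "pageID|f1|f2|f3" and weight the fields as the job above does:
  // f1 * 0.8 + f2 * 0.8 + f3 * 1
  // Assumption: toDouble is used here instead of the original toFloat.
  def parse(msgLine: String): (String, Double) = {
    val dataArr = msgLine.split("\\|")
    val popValue =
      dataArr(1).toDouble * 0.8 + dataArr(2).toDouble * 0.8 + dataArr(3).toDouble * 1
    (dataArr(0), popValue)
  }

  def main(args: Array[String]): Unit = {
    val (page, value) = parse("page37|5|1.5119122|-1")
    // 5 * 0.8 + 1.5119122 * 0.8 - 1 is roughly 4.2095
    println(s"$page -> $value")
  }
}
```

Each batch contributes values like this one per message, and updateStateByKey then adds the batch's values for a page to that page's running total.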

References:

http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

https://github.com/apache/spark/blob/branch-1.3/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala

http://stackoverflow.com/questions/28998408/spark-streaming-example-calls-updatestatebykey-with-additional-parameters

http://stackoverflow.com/questions/27535668/spark-streaming-groupbykey-and-updatestatebykey-implementation

Please respect the original work; do not repost without permission:

http://blog.csdn.net/stark_summer/article/details/47666337
