What problem does flatMapGroupsWithState solve?

Why flatMapGroupsWithState was introduced into Spark Structured Streaming (it is supported since Spark 2.2.0):

1) It can implement aggregation (agg) logic;

2) As of the latest release, Spark 2.3.2, Structured Streaming still does not support chaining multiple agg operations on a streaming Dataset. flatMapGroupsWithState can take the place of agg, and it may also be used before an agg when the sink runs in Append mode.

Note: although it may be used before an agg, the precondition is that the output (sink) mode is Append.

A usage example of flatMapGroupsWithState (found online):

Reference: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KeyValueGroupedDataset-flatMapGroupsWithState.html

Note: the sample code below implements "select deviceId, count(0) as count from tbName group by deviceId".
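For comparison, here is a minimal sketch of the same count written with the built-in streaming aggregation, using the `signals` Dataset defined in step 2 below. Without a watermark, such a streaming aggregation can only run in Update or Complete output mode, never Append, which is exactly the limitation flatMapGroupsWithState sidesteps. This sketch is illustrative and not part of the original example:

  // Sketch: built-in equivalent of the flatMapGroupsWithState example below.
  // Without a watermark this streaming aggregation is rejected in Append mode,
  // so it has to be written out with output mode "update" (or "complete").
  val counts = signals.groupBy($"deviceId").count()
  counts.writeStream.
    format("console").
    outputMode("update"). // "append" would throw an AnalysisException here
    start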

1) Define a Signal entity class (on Spark 2.3.0):

  scala> spark.version
  res0: String = 2.3.0-SNAPSHOT

  import java.sql.Timestamp
  type DeviceId = Int
  case class Signal(timestamp: java.sql.Timestamp, value: Long, deviceId: DeviceId)

2) Generate some test data with the Rate source (a random real-time stream), and inspect the execution plan:

  // input stream
  import org.apache.spark.sql.functions._
  val signals = spark.
    readStream.
    format("rate").
    option("rowsPerSecond", 1).
    load.
    withColumn("value", $"value" % 10).                   // <-- randomize the values (just for fun)
    withColumn("deviceId", rint(rand() * 10) cast "int"). // <-- 10 devices randomly assigned to values
    as[Signal]                                            // <-- convert to our type (from "unpleasant" Row)

  scala> signals.explain
  == Physical Plan ==
  *Project [timestamp#0, (value#1L % 10) AS value#5L, cast(ROUND((rand(4440296395341152993) * 10.0)) as int) AS deviceId#9]
  +- StreamingRelation rate, [timestamp#0, value#1L]

3) Group the Rate-source stream by key and implement the aggregation with flatMapGroupsWithState:

  // stream processing using the flatMapGroupsWithState operator
  val device: Signal => DeviceId = { case Signal(_, _, deviceId) => deviceId }
  val signalsByDevice = signals.groupByKey(device)

  import org.apache.spark.sql.streaming.GroupState
  type Key = Int
  type Count = Long
  type State = Map[Key, Count]
  case class EventsCounted(deviceId: DeviceId, count: Long)

  def countValuesPerKey(deviceId: Int, signalsPerDevice: Iterator[Signal],
      state: GroupState[State]): Iterator[EventsCounted] = {
    val values = signalsPerDevice.toList
    println(s"Device: $deviceId")
    println(s"Signals (${values.size}):")
    values.zipWithIndex.foreach { case (v, idx) => println(s"$idx. $v") }
    println(s"State: $state")

    // update the state with the count of elements for the key
    val initialState: State = Map(deviceId -> 0)
    val oldState = state.getOption.getOrElse(initialState)
    // the name highlights that the state is for this key only
    val newValue = oldState(deviceId) + values.size
    val newState = Map(deviceId -> newValue)
    state.update(newState)

    // do not return signalsPerDevice here: it was already consumed by toList above.
    // Iterators are one-pass data structures, so returning it would be a very
    // subtle error that yields no elements.
    Iterator(EventsCounted(deviceId, newValue))
  }

  import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}

  val signalCounter = signalsByDevice.flatMapGroupsWithState(
    outputMode = OutputMode.Append,
    timeoutConf = GroupStateTimeout.NoTimeout)(func = countValuesPerKey)
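The example above keeps per-device state forever because it uses GroupStateTimeout.NoTimeout. As a hedged variant (an assumption based on the public GroupState API, not part of the original example), the state could be expired with a processing-time timeout; note that the callback is then also invoked for timed-out groups, with an empty iterator:

  // Variant sketch (assumption, not in the original example): expire idle
  // per-device state after 1 minute of processing time.
  def countWithTimeout(deviceId: Int, events: Iterator[Signal],
      state: GroupState[State]): Iterator[EventsCounted] = {
    if (state.hasTimedOut) {
      state.remove()  // the group saw no data before the timeout: drop its state
      Iterator.empty
    } else {
      val old = state.getOption.getOrElse(Map(deviceId -> 0L))
      val newValue = old(deviceId) + events.size
      state.update(Map(deviceId -> newValue))
      state.setTimeoutDuration("1 minute") // must be re-armed on every invocation
      Iterator(EventsCounted(deviceId, newValue))
    }
  }

  val countsWithTimeout = signalsByDevice.flatMapGroupsWithState(
    outputMode = OutputMode.Append,
    timeoutConf = GroupStateTimeout.ProcessingTimeTimeout)(func = countWithTimeout)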

4) Print the aggregation results with the Console sink:

  import org.apache.spark.sql.streaming.{OutputMode, Trigger}
  import scala.concurrent.duration._
  val sq = signalCounter.
    writeStream.
    format("console").
    option("truncate", false).
    trigger(Trigger.ProcessingTime(10.seconds)).
    outputMode(OutputMode.Append).
    start
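The sink's output mode has to line up with the operator's: both are Append here. A hedged sketch of the Update-mode pairing (an assumption based on the flatMapGroupsWithState contract, not something the original article runs):

  // Variant sketch (assumption): operator and query both in Update mode.
  val updCounter = signalsByDevice.flatMapGroupsWithState(
    outputMode = OutputMode.Update,
    timeoutConf = GroupStateTimeout.NoTimeout)(func = countValuesPerKey)
  val updQuery = updCounter.writeStream.
    format("console").
    outputMode(OutputMode.Update). // must match the operator's output mode
    start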

5) Console output:

  ...
  -------------------------------------------
  Batch:
  -------------------------------------------
  +--------+-----+
  |deviceId|count|
  +--------+-----+
  +--------+-----+
  ...
  // :: INFO StreamExecution: Streaming query made progress: {
    "id" : "a43822a6-500b-4f02-9133-53e9d39eedbf",
    "runId" : "79cb037e-0f28-4faf-a03e-2572b4301afe",
    "name" : null,
    "timestamp" : "2017-08-21T06:57:26.719Z",
    "batchId" : ,
    "numInputRows" : ,
    "processedRowsPerSecond" : 0.0,
    "durationMs" : {
      "addBatch" : ,
      "getBatch" : ,
      "getOffset" : ,
      "queryPlanning" : ,
      "triggerExecution" : ,
      "walCommit" :
    },
    "stateOperators" : [ {
      "numRowsTotal" : ,
      "numRowsUpdated" : ,
      "memoryUsedBytes" :
    } ],
    "sources" : [ {
      "description" : "RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8]",
      "startOffset" : null,
      "endOffset" : ,
      "numInputRows" : ,
      "processedRowsPerSecond" : 0.0
    } ],
    "sink" : {
      "description" : "ConsoleSink[numRows=20, truncate=false]"
    }
  }
  // :: DEBUG StreamExecution: batch committed
  ...
  -------------------------------------------
  Batch:
  -------------------------------------------
  Device:
  Signals ():
  . Signal(-- ::27.682,,)
  State: GroupState(<undefined>)
  Device:
  Signals ():
  . Signal(-- ::26.682,,)
  State: GroupState(<undefined>)
  Device:
  Signals ():
  . Signal(-- ::28.682,,)
  State: GroupState(<undefined>)
  +--------+-----+
  |deviceId|count|
  +--------+-----+
  |        |     |
  |        |     |
  |        |     |
  +--------+-----+
  ...
  // :: INFO StreamExecution: Streaming query made progress: {
    "id" : "a43822a6-500b-4f02-9133-53e9d39eedbf",
    "runId" : "79cb037e-0f28-4faf-a03e-2572b4301afe",
    "name" : null,
    "timestamp" : "2017-08-21T06:57:30.004Z",
    "batchId" : ,
    "numInputRows" : ,
    "inputRowsPerSecond" : 0.91324200913242,
    "processedRowsPerSecond" : 2.2388059701492535,
    "durationMs" : {
      "addBatch" : ,
      "getBatch" : ,
      "getOffset" : ,
      "queryPlanning" : ,
      "triggerExecution" : ,
      "walCommit" :
    },
    "stateOperators" : [ {
      "numRowsTotal" : ,
      "numRowsUpdated" : ,
      "memoryUsedBytes" :
    } ],
    "sources" : [ {
      "description" : "RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8]",
      "startOffset" : ,
      "endOffset" : ,
      "numInputRows" : ,
      "inputRowsPerSecond" : 0.91324200913242,
      "processedRowsPerSecond" : 2.2388059701492535
    } ],
    "sink" : {
      "description" : "ConsoleSink[numRows=20, truncate=false]"
    }
  }
  // :: DEBUG StreamExecution: batch committed
  ...
  -------------------------------------------
  Batch:
  -------------------------------------------
  Device:
  Signals ():
  . Signal(-- ::36.682,,)
  State: GroupState(<undefined>)
  Device:
  Signals ():
  . Signal(-- ::32.682,,)
  . Signal(-- ::35.682,,)
  State: GroupState(Map( -> ))
  Device:
  Signals ():
  . Signal(-- ::34.682,,)
  State: GroupState(<undefined>)
  Device:
  Signals ():
  . Signal(-- ::29.682,,)
  State: GroupState(<undefined>)
  Device:
  Signals ():
  . Signal(-- ::31.682,,)
  . Signal(-- ::33.682,,)
  State: GroupState(Map( -> ))
  Device:
  Signals ():
  . Signal(-- ::30.682,,)
  . Signal(-- ::37.682,,)
  State: GroupState(Map( -> ))
  Device:
  Signals ():
  . Signal(-- ::38.682,,)
  State: GroupState(<undefined>)
  +--------+-----+
  |deviceId|count|
  +--------+-----+
  |        |     |
  |        |     |
  |        |     |
  |        |     |
  |        |     |
  |        |     |
  |        |     |
  +--------+-----+
  ...
  // :: INFO StreamExecution: Streaming query made progress: {
    "id" : "a43822a6-500b-4f02-9133-53e9d39eedbf",
    "runId" : "79cb037e-0f28-4faf-a03e-2572b4301afe",
    "name" : null,
    "timestamp" : "2017-08-21T06:57:40.005Z",
    "batchId" : ,
    "numInputRows" : ,
    "inputRowsPerSecond" : 0.9999000099990002,
    "processedRowsPerSecond" : 9.242144177449168,
    "durationMs" : {
      "addBatch" : ,
      "getBatch" : ,
      "getOffset" : ,
      "queryPlanning" : ,
      "triggerExecution" : ,
      "walCommit" :
    },
    "stateOperators" : [ {
      "numRowsTotal" : ,
      "numRowsUpdated" : ,
      "memoryUsedBytes" :
    } ],
    "sources" : [ {
      "description" : "RateSource[rowsPerSecond=1, rampUpTimeSeconds=0, numPartitions=8]",
      "startOffset" : ,
      "endOffset" : ,
      "numInputRows" : ,
      "inputRowsPerSecond" : 0.9999000099990002,
      "processedRowsPerSecond" : 9.242144177449168
    } ],
    "sink" : {
      "description" : "ConsoleSink[numRows=20, truncate=false]"
    }
  }
  // :: DEBUG StreamExecution: batch committed

  // In the end...
  sq.stop

  // Use stateOperators to access the stats
  scala> println(sq.lastProgress.stateOperators(0).prettyJson)
  {
    "numRowsTotal" : ,
    "numRowsUpdated" : ,
    "memoryUsedBytes" :
  }
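
One practical point beyond the spark-shell demo: the per-group state above survives a restart only through the query's checkpoint, so any restartable deployment needs a checkpointLocation. A minimal hedged sketch (the path is a placeholder):

  // Sketch (assumption): make the query restartable; flatMapGroupsWithState
  // state is recovered from the checkpoint directory on restart.
  val durable = signalCounter.writeStream.
    format("console").
    option("checkpointLocation", "/tmp/signal-counter-ckpt"). // placeholder path
    outputMode(OutputMode.Append).
    start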
