Structured Streaming Programming Guide
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
http://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
You can express your streaming computation the same way you would express a batch computation on static data.
The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine.
Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs.
In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.
You can use the DataFrame API to express queries exactly as you would on a static data source; they run on the same optimized Spark SQL engine as batch jobs,
and exactly-once fault tolerance is guaranteed through checkpointing and Write Ahead Logs.

The only conceptual change is swapping the DStream abstraction for a DataFrame, i.e. a table.
This makes structured operations available,
and the processing is essentially the same as for batch data.

As the quick example below shows, the streaming code differs very little from its batch counterpart.
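A minimal sketch of the guide's streaming word count (the socket host/port are placeholders for a local test source):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("StructuredWordCount")
  .getOrCreate()
import spark.implicits._

// Streaming DataFrame of lines arriving on a socket.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same Dataset/DataFrame operations a batch word count would use.
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

// Start the query, printing the full result table on each trigger.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```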
The overall process: the input stream is treated as an unbounded input table, newly arrived records become rows appended to it, and on each trigger the query runs incrementally to update the result table.

Note that the output mode here is complete: because the query contains an aggregation, each trigger has to emit the statistics accumulated up to now.
The output mode can be one of the following:
The “Output” is defined as what gets written out to the external storage. The output can be defined in different modes
Complete Mode - The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table.
Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change.
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (not available yet in Spark 2.0). Note that this is different from the Complete Mode in that this mode does not output the rows that are not changed.
Complete mode is what the example above used.
Append mode emits only the new rows on each trigger, which suits queries without aggregation; a sketch of both follows.
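A hedged sketch of selecting an output mode, reusing `wordCounts` and `lines` from the word-count example above:

```scala
// Complete: rewrite the entire result table every trigger
// (required here because the query aggregates).
wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Append: only rows added since the last trigger; valid only when
// existing result rows never change, e.g. a non-aggregated stream.
lines.writeStream
  .outputMode("append")
  .format("console")
  .start()
```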
Window Operations on Event Time

Spark regards event time as naturally supported: event time is simply a column in the DataFrame, so a windowed aggregation is just a group-by over a window on that column.
Late data can also be handled, because results are emitted incrementally: a late record simply updates the counts of the older window it belongs to.
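A sketch of a sliding event-time window count, assuming a streaming DataFrame `events` with illustrative columns (timestamp: Timestamp, word: String):

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._ // for the $"..." column syntax

// Count words per 10-minute window, sliding every 5 minutes.
// The window is just another grouping column derived from event time.
val windowedCounts = events.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
```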
Fault Tolerance Semantics
Delivering end-to-end exactly-once semantics was one of the key goals behind the design of Structured Streaming.
To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
The premises are that the source can replay data from a recorded offset and that the sink is idempotent. Given those, it suffices to record the offset ranges in the Write Ahead Log and the state in checkpoints to achieve exactly-once, because each trigger is in essence a batch.
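A sketch of enabling recovery: pointing the query at a checkpoint directory (the HDFS path below is purely illustrative) makes Spark persist the offset log and aggregation state there, so a restarted query resumes where it left off:

```scala
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "hdfs:///tmp/wordcount-checkpoint") // illustrative path
  .start()
```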