https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

 

这篇文章,首先要说清的一个问题是,给‘Streaming’正名

What is streaming?

The crux of the problem is that many things that ought to be described bywhat they are (e.g., unbounded data processing, approximate results, etc.), have come to be described colloquially by how they historically have been accomplished (i.e., via streaming execution engines).

当前我们对Streaming的定义是不准确的,导致我们对Streaming会有些误解
比如,认为Streaming就意味着Low-latency, approximate,lack of precision

这个问题的症结在于,我们把一样东西的本质是什么和这样东西被完成到什么程度给混淆了

所以这里作者给出streaming的定义,

I prefer to isolate the term streaming to a very specific meaning: a type of data processing engine that is designed with infinite data sets in mind. Nothing more.

而对于常常出现的和streaming相关的词,也加以区别定义

Unbounded data: A type of ever-growing, essentially infinite data set.
这个词用于描述数据集本身的特性,而Streaming用于描述processing engine

Unbounded data processing: An ongoing mode of data processing, applied to the aforementioned type of unbounded data.
which is at best misleading:repeated runs of batch engines have been used to process unbounded data since batch systems were first conceived
batch engine也可以用于repeated的处理Unbounded data
同样Streaming engine也可以用于处理Bounded data
所以这个词并不等同于Streaming

Low-latency, approximate, and/or speculative results:

作者认为只是,batch engine在设计时没有考虑要针对low-latency的场景,batch也可以做到low-latency,也可以得出approximate或speculative结果
反之,streaming也可以balance low-latency来达到准确的结果

So,

From here on out, any time I use the term “streaming,” you can safely assume I mean an execution engine designed for unbounded data sets, and nothing more.

 

What streaming can do?

近期流计算的兴起于Twitter’s Nathan Marz (creator of Storm)的Storm,当然也带给Streaming以low-latency, inaccurate/speculative results这样的标签

为了提供eventually correct results,Marz提出Lambda Architecture. 这种架构虽然看上去很简单,但是给出一种balance一致性和可用性的思路;

当然问题也很明显,你需要维护streaming和batch两个pipeline,这个代价是很大的。

作者表示对于这种架构 a bit unsavory。

Unsurprisingly, I was a huge fan of Jay KrepsQuestioning the Lambda Architecture post when it came out.

所以下位出场的是linkedin的Jay Krep,他提出的是基于Kafka的Kappa Architecture,

该架构也很简单,但给出将两个pipeline合并成一个pipeline的思路;更关键的这个方案用well-designed streaming system替代了batch pipeline,这个对于作者是有很大启发的

作者对这个架构的评价,I’m not convinced that notion itself requires a name, but I fully support the idea in principle.

 

Quite honestly, I’d take things a step further.
I would argue that well-designed streaming systems actually provide a strict superset of batch functionality.

作者进步提出,Streaming是Batch的超集,即这个时代已经不需要batch了,该退休了

Steaming要击败Batch,只需要做到两件事,

Correctness — This gets you parity with batch.

只要做到这点,就至少可以等同于batch

At the core, correctness boils down to consistent storage.
Streaming systems need a method for checkpointing persistent state over time (something Kreps has talked about in his Why local state is a fundamental primitive in stream processing post), and it must be well-designed enough to remain consistent in light of machine failures.

 

If you’re curious to learn more about what it takes to get strong consistency in a streaming system, I recommend you check out theMillWheel and Spark Streaming papers.

 

Tools for reasoning about time — This gets you beyond batch.

做到这点就可以超越batch

Good tools for reasoning about time are essential for dealing with unbounded, unordered data of varying event-time skew.

这是作者的重点,讨论如何处理unbounded, unordered data

因为在现实中,我们往往需要安装event-time来处理数据,而不是按照process-time

 

In the context of unbounded data, disorder and variable skew induce a completeness problem for event time windows:
lacking a predictable mapping between processing time and event time, how can you determine when you’ve observed all the data for a given event time X? For many real-world data sources, you simply can’t. The vast majority of data processing systems in use today rely on some notion of completeness, which puts them at a severe disadvantage when applied to unbounded data sets.

这个问题会在102中详细的描述,其实就是dataflow论文里面的内容

 

Data processing patterns

最终,作者描述下当前的数据处理的patterns

Bounded data

 

Unbounded data — batch

Fixed windows

 

Sessions

这个和上面fixed windows的区别,人为的划分fixed windows会切断sessions,如图中红色

 

Unbounded data — streaming

现实中,unbounded data往往有两个特点,

  • Highly unordered with respect to event times, meaning you need some sort of time-based shuffle in your pipeline if you want to analyze the data in the context in which they occurred.
  • Of varying event time skew, meaning you can’t just assume you’ll always see most of the data for a given event time X within some constant epsilon of time Y.

对于这样的数据,处理的方式有如下几类,

Time-agnostic

Time-agnostic processing is used in cases where time is essentially irrelevant — i.e., all relevant logic is data driven.

这个最简单,时间无关的应用,所以stateless的情况,比如map或filter都属于这个case

这种场景没啥好说的,任何Streaming平台都可以很好的处理

 

Approximation algorithms

The second major category of approaches is approximation algorithms, such as approximate Top-N, streaming K-means, etc.

 

Windowing by processing time

There are a few nice properties of processing time windowing:

  • It’s simple
  • Judging window completeness is straightforward.
  • If you’re wanting to infer information about the source as it is observed, processing time windowing is exactly what you want.

 

Windowing by event time

Event time windowing is what you use when you need to observe a data source in finite chunks that reflect the times at which those events actually happened.

It’s the gold standard of windowing. Sadly, most data processing systems in use today lack native support for it.

这种方式是作者所采用的,他认为是gold standard of windowing,而当前的system往往都是native不支持的,原因是比较困难,这也是作者的主要贡献,102中会详细描述

 

Of course, powerful semantics rarely come for free, and event time windows are no exception. Event time windows have two notable drawbacks due to the fact that windows must often live longer (in processing time) than the actual length of the window itself:

Buffering: Due to extended window lifetimes, more buffering of data is required.

Completeness: Given that we often have no good way of knowing when we’ve seen all the data for a given window, how do we know when the results for the window are ready to materialize? In truth, we simply don’t.

The world beyond batch: Streaming 101的更多相关文章

  1. Streaming 101

    开宗明义!本文根据Google Beam大神Tyler Akidau的系列文章<The world beyond batch: Streaming 101>(批处理之外的流式世界)整理而成 ...

  2. 《Streaming Systems》第一章: Streaming 101

    数据的价值在其产生之后,将随着时间的流逝逐渐降低.因此,为了获得最大化的数据价值,尽可能实时.快速地处理新产生的数据就显得尤为重要.实时数据处理将在越来越多的场景中体现出更大的价值所在 -- 实时即未 ...

  3. Flink-v1.12官方网站翻译-P017-Execution Mode (Batch/Streaming)

    执行模式(批处理/流处理) DataStream API 支持不同的运行时执行模式,您可以根据用例的要求和作业的特点从中选择.DataStream API 有一种 "经典 "的执行 ...

  4. All the Apache Streaming Projects: An Exploratory Guide

    The speed at which data is generated, consumed, processed, and analyzed is increasing at an unbeliev ...

  5. Streaming-大数据的未来

    分享一篇关于实时流式计算的经典文章,这篇文章名为Streaming 101: The world beyond batch 那么流计算如何超越批处理呢? 从这几个方面说明:实时流计算系统,数据处理模式 ...

  6. 实时计算大数据处理的基石-Google Dataflow

    ​ 此文选自Google大神Tyler Akidau的另一篇文章:Streaming 102: The world beyond batch ​ 欢迎回来!如果您错过了我以前的帖子,Streaming ...

  7. Mongodb Manual阅读笔记:CH2 Mongodb CRUD 操作

    2 Mongodb CRUD 操作 Mongodb Manual阅读笔记:CH2 Mongodb CRUD 操作Mongodb Manual阅读笔记:CH3 数据模型(Data Models)Mong ...

  8. Flink Program Guide (3) -- Event Time (DataStream API编程指导 -- For Java)

    Event Time 本文翻译自DataStream API Docs v1.2的Event Time ------------------------------------------------ ...

  9. 初探Apache Beam

    文章作者:luxianghao 文章来源:http://www.cnblogs.com/luxianghao/p/9010748.html  转载请注明,谢谢合作. 免责声明:文章内容仅代表个人观点, ...

随机推荐

  1. 用VMware9 安装 mac 10.8和10.9搜集的资料

    VMware9虚拟机安装MAC OS X Mountain Lion 10.8.2详细图文教程 http://diybbs.zol.com.cn/1/34037_699.html vmware too ...

  2. Harris角点

    1. 不同类型的角点 在现实世界中,角点对应于物体的拐角,道路的十字路口.丁字路口等.从图像分析的角度来定义角点可以有以下两种定义: 角点可以是两个边缘的角点: 角点是邻域内具有两个主方向的特征点: ...

  3. codeforces Round #263(div2) D. Appleman and Tree 树形dp

    题意: 给出一棵树,每个节点都被标记了黑或白色,要求把这棵树的其中k条变切换,划分成k+1棵子树,每颗子树必须有1个黑色节点,求有多少种划分方法. 题解: 树形dp dp[x][0]表示是以x为根的树 ...

  4. 【Tyvj1038】忠诚 线段树

    题目描述 老管家是一个聪明能干的人.他为财主工作了整整10年,财主为了让自已账目更加清楚.要求管家每天记k次账,由于管家聪明能干,因而管家总是让财主十分满意.但是由于一些人的挑拨,财主还是对管家产生了 ...

  5. 【HTML5】Web Workers

    什么是 Web Worker? 当在 HTML 页面中执行脚本时,页面的状态是不可响应的,直到脚本已完成. web worker 是运行在后台的 JavaScript,独立于其他脚本,不会影响页面的性 ...

  6. poj 1115 Lifting the Stone 计算多边形的中心

    Lifting the Stone Time Limit:1000MS     Memory Limit:32768KB     64bit IO Format:%I64d & %I64u S ...

  7. eclispe报错PermGen space

    最近使用eclipse做开发,使用的服务器是tomcat,但在启动时报了Caused by: java.lang.OutOfMemoryError: PermGen space的异常. 这个错误很常见 ...

  8. http://jingyan.baidu.com/article/86112f13582848273797879b.html

    http://jingyan.baidu.com/article/86112f13582848273797879b.html

  9. WPF之TextBox

    1. TextBox实现文字垂直居中 TextBox纵向长度比较长但文字字体比较小的时候,在输入时就会发现文字不是垂直居中的. 而使用中我们发现,TextBox虽然可以设置文字的水平对齐方式,但却没有 ...

  10. 【wikioi】1012 最大公约数和最小公倍数问题

    题目链接 算法:辗转相除(欧几里得) gcd(a, b)是a和b最小公倍数, lcm(a, b)是a和b的最大公倍数 gcd(a, b) == gcd(b, a%b) 时间复杂度: O(lgb) 具体 ...