13 Stream Processing Patterns for building Streaming and Realtime Applications
Source: https://iwringer.wordpress.com/2015/08/03/patterns-for-streaming-realtime-analytics/
Introduction
In more and more use cases, we want to react to data fast, rather than storing it on disk and periodically processing and acting on it. This is done using realtime analytics.
Realtime analytics comes in two flavors.
- Realtime Interactive/Ad-hoc Analytics: users issue ad-hoc dynamic queries, and the system responds interactively. Examples of such tools are Druid, SAP Hana, VoltDB, MemSQL, and Apache Drill.
- Realtime Streaming Analytics / Stream Processing: users issue static queries once, the queries do not change, and the system processes data as it arrives, without storing it. This is supported by stream processors; examples of such tools are WSO2 Stream Processor and Apache Flink. (See What is Stream Processing? for more details.)
Realtime Interactive Analytics allows users to explore a large data set by issuing ad-hoc queries. Queries should respond within 10 seconds, which is considered the upper bound for acceptable human interaction. In contrast, this tutorial focuses on Stream Processing: processing data as it arrives, without storing it, and reacting to that data very fast, often within a few milliseconds. Such technologies are not new. The history goes back to active databases (2000+), stream processing (e.g., Aurora (2003), Borealis (2005+), and later Apache Storm), distributed streaming operators (2005), and complex event processing. Between 2015 and 2018, most of these technologies converged under the theme of Stream Processing (see CEP vs. Stream Processing for more information).
When thinking about realtime analytics, many people think only about counting use cases. Counting use cases are only the tip of the iceberg of real-life realtime use cases. Since the input data arrives as a data stream, a time dimension is always present in the data. This time dimension allows us to implement many powerful use cases. For example, the Streaming SQL supported by many stream processors provides operators like windows, joins, and temporal event sequence detection.
Stream processing technologies like Apache Samza and Apache Storm have received much attention under the theme of large-scale streaming analytics. However, these tools force every programmer to design and implement realtime analytics processing from first principles.
For example, if users need a time window, they have to implement it from first principles. This is like every programmer implementing their own list data structure.
Since 2016, a new idea called Streaming SQL has emerged. We call a language that lets users write SQL-like queries against streaming data a “Streaming SQL” language. Almost all stream processors now support Streaming SQL.
However, writing streaming applications requires very different thinking patterns from writing code in a language like Java. A better understanding of the common patterns in stream processing lets us understand the domain better and build tools that handle those scenarios. This tutorial describes 13 common realtime streaming analytics patterns and how to implement them. In the discussion, we draw heavily from real-life use cases built with stream processing and other technologies.
Realtime Streaming Analytics Patterns
Before looking at the patterns, let’s first agree on the terminology. Stream Processing accepts input as a set of streams where each stream consists of many events ordered in time. Each event has many attributes, but all events in the same stream have the same set of attributes or schema.
Pattern 1: Preprocessing
Preprocessing is often done as a projection from one data stream to another or through filtering. Potential operations include:
- Filtering and removing some events
- Reshaping a stream by removing, renaming, or adding new attributes
- Splitting and combining attributes in a stream
- Transforming attributes
For example, from a Twitter data stream, we might choose to extract only the fields author, timestamp, and location, and then filter events based on the location of the author.
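As a minimal illustration, here is a plain-Python sketch of this projection-plus-filter step. It is not tied to any particular stream processor, and the event shape and field names are assumptions for the example:

```python
# A minimal preprocessing sketch: project a raw tweet event down to
# (author, timestamp, location) and keep only events from a chosen location.
# The event shape and field names are assumptions for illustration.

def preprocess(events, wanted_location):
    for event in events:
        projected = {
            "author": event["user"],           # rename: user -> author
            "timestamp": event["created_at"],
            "location": event["location"],
        }
        if projected["location"] == wanted_location:  # filter step
            yield projected

raw_stream = [
    {"user": "alice", "created_at": 1, "location": "London", "text": "hi"},
    {"user": "bob", "created_at": 2, "location": "Paris", "text": "yo"},
]
print(list(preprocess(raw_stream, "London")))
```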
Pattern 2: Alerts and Thresholds
This pattern detects a condition and generates an alert when it is met (e.g., an alarm on high temperature). The alert can be based on a simple value or on a more complex condition such as the rate of increase.
For example, in the TFL (Transport for London) demo video based on transit data from London, we trigger a speed alert when a bus exceeds a given speed limit.
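A threshold alert reduces to a simple per-event check. Here is a sketch; the 30 m/s limit and the event fields are assumptions for illustration:

```python
# A sketch of a threshold alert over a stream of (bus_id, speed) events.
SPEED_LIMIT = 30.0  # assumed limit, in m/s

def speed_alerts(events, limit=SPEED_LIMIT):
    for event in events:
        if event["speed"] > limit:  # the alert condition
            yield {"bus": event["bus_id"], "speed": event["speed"], "alert": "SPEEDING"}

events = [{"bus_id": "B12", "speed": 25.0}, {"bus_id": "B7", "speed": 33.5}]
for alert in speed_alerts(events):
    print(alert)
```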
Pattern 3: Simple Counting and Counting with Windows
This pattern covers aggregate functions like min, max, percentiles, etc., which can be computed without storing all the data (e.g., counting the number of failed transactions).
However, counts are often used with a time window attached to them (e.g., the failure count over the last hour). There are many types of windows: sliding vs. batch (tumbling) windows, and time vs. length windows. There are four main variations.
- Time, sliding window: keeps each event for the given time period and produces an output whenever an event is added or removed.
- Time, batch window: also called a tumbling window; it only produces output at the end of each time window.
- Length, sliding window: same as the time, sliding window, but keeps a window of n events instead of selecting them by time.
- Length, batch window: same as the time, batch window, but keeps a window of n events instead of selecting them by time.
There are special windows like decaying windows and unique windows. Please refer to Stream Processing 101: From SQL to Streaming SQL in 10 Minutes for more details.
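To make the variations concrete, here is a small, self-contained Python sketch of two of them: a length, sliding window and a time, batch (tumbling) window. The window sizes and event shapes are assumptions for illustration:

```python
from collections import deque

def length_sliding_sum(events, n=3):
    """Length, sliding: emit an aggregate (here, a sum) over the last n events."""
    window = deque(maxlen=n)
    for _, value in events:
        window.append(value)
        yield sum(window)  # output on every new event

def time_batch_count(events, size=60):
    """Time, batch (tumbling): emit one count per non-overlapping time window."""
    current_end, count = None, 0
    for ts, _ in events:
        if current_end is None:
            current_end = ts + size
        while ts >= current_end:  # window expired: emit and advance
            yield count
            count, current_end = 0, current_end + size
        count += 1
    if count:
        yield count  # flush the final, partial window

events = [(0, 1), (10, 2), (70, 3), (75, 4), (130, 5)]  # (timestamp_sec, value)
print(list(length_sliding_sum(events)))  # [1, 3, 6, 9, 12]
print(list(time_batch_count(events)))    # [2, 2, 1]
```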
Pattern 4: Joining Event Streams
The main idea behind this pattern is to match up multiple data streams and create a new event stream. For example, let's assume we instrument a football game, with both the players and the ball carrying sensors that emit events with the current location and acceleration. We can use “joins” to detect when a player has kicked the ball. To that end, we can join the ball location stream and the player location stream on the condition that a player is within one meter of the ball and the ball's acceleration has increased by more than 55 m/s².
Other use cases include combining data from two sensors and detecting the proximity of two vehicles. Please refer to Stream Processing 101: From SQL to Streaming SQL in 10 Minutes for more details.
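Here is a sketch of the kick-detection join in plain Python. The 1 m and 55 m/s² thresholds come from the example above; the event shapes and the simple same-timestamp matching are assumptions:

```python
import math

def detect_kicks(ball_events, player_events, max_dist=1.0, min_accel=55.0):
    # Index player positions by timestamp so the join is a lookup.
    players_by_ts = {}
    for p in player_events:
        players_by_ts.setdefault(p["ts"], []).append(p)
    for b in ball_events:
        if b["accel"] <= min_accel:          # join condition 1: acceleration
            continue
        for p in players_by_ts.get(b["ts"], []):  # join on timestamp
            if math.hypot(b["x"] - p["x"], b["y"] - p["y"]) <= max_dist:
                yield {"ts": b["ts"], "player": p["player_id"]}  # kick event

ball = [{"ts": 1, "x": 0.0, "y": 0.0, "accel": 60.0}]
players = [{"ts": 1, "x": 0.5, "y": 0.2, "player_id": "p9"}]
print(list(detect_kicks(ball, players)))
```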
Pattern 5: Data Correlation, Missing Events, and Erroneous Data
This pattern has a lot in common with Pattern 4: here too we match up multiple streams. In addition, we also correlate data within the same stream. Such correlation is needed because different sensors can send events at different rates, and many use cases require this fundamental operator.
Following are some possible scenarios.
- Matching up two data streams that send events at different rates
- Detecting a missing event in a data stream (e.g., detecting a customer request that has not been responded to within 1 hour of its reception; see the sketch after this list)
- Detecting erroneous data (e.g., detecting failed sensors using a set of sensors that monitor overlapping regions, and using the redundant data to find the erroneous sensors and remove their readings from further processing)
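As an example of the second scenario, here is a minimal Python sketch that flags requests with no response within one hour. The event tuples and the assumption that events can be ordered by time are simplifications for illustration:

```python
TIMEOUT = 3600  # one hour, in seconds

def missing_responses(events, timeout=TIMEOUT):
    pending = {}  # request_id -> timestamp of the request
    for ts, kind, req_id in sorted(events):  # assume events can be time-ordered
        # Expire anything that has waited longer than the timeout.
        for rid, t0 in list(pending.items()):
            if ts - t0 > timeout:
                yield {"request": rid, "alert": "NO_RESPONSE"}
                del pending[rid]
        if kind == "request":
            pending[req_id] = ts
        elif kind == "response":
            pending.pop(req_id, None)  # matched: no longer missing
    # A production system would also expire pending requests on a timer,
    # not only when a later event arrives.

events = [(0, "request", "r1"), (100, "request", "r2"),
          (200, "response", "r1"), (4000, "request", "r3")]
print(list(missing_responses(events)))  # r2 is flagged
```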
Pattern 6: Interacting with Databases
Often we need to combine realtime data with historical data stored on disk. Following are a few examples; a minimal lookup sketch follows the list.
- When a transaction happens, look up the customer's age using the customer ID from the customer database, to be used for fraud detection (enrichment)
- Checking a transaction against blacklists and whitelists stored in the database
- Receiving an input from the user (e.g., the daily discount amount may be updated in the database, and the query will pick it up automatically without human intervention)
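Here is a sketch of the first example, using Python's built-in sqlite3 module as a stand-in database. The table and field names are assumptions for illustration:

```python
import sqlite3

# Stand-in customer database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, age INTEGER)")
db.execute("INSERT INTO customers VALUES ('c1', 34), ('c2', 61)")

def enrich(transactions, conn):
    for txn in transactions:
        row = conn.execute(
            "SELECT age FROM customers WHERE id = ?", (txn["customer_id"],)
        ).fetchone()
        txn["age"] = row[0] if row else None  # attach the looked-up attribute
        yield txn

stream = [{"customer_id": "c1", "amount": 120.0}]
print(list(enrich(stream, db)))
```

In a real pipeline the lookup would usually be cached, since a round trip to the database per event is often the bottleneck.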
Pattern 7: Detecting Temporal Event Sequence Patterns
With regular expressions over strings, we detect a pattern of characters within a sequence of characters. Similarly, given a sequence of events, we can write a regular-expression-like pattern to detect a temporal sequence of events arranged in time, where each event, or a condition about an event, plays the role a character plays in a string.
A frequently cited example, although a bit simplistic, is that a thief, having stolen a credit card, would first try a small transaction to make sure the card works and then make a large transaction. Here the small transaction followed by a large transaction is a temporal sequence of events arranged in time, and it can be detected using a regular expression written on top of the event sequence.
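Here is a minimal Python sketch of that rule, keeping only the state the pattern needs. The small/large thresholds, the 24-hour window, and the event fields are assumptions for illustration:

```python
SMALL, LARGE, WITHIN = 10.0, 1000.0, 24 * 3600

def small_then_large(events, small=SMALL, large=LARGE, within=WITHIN):
    last_small = {}  # card -> timestamp of its most recent small transaction
    for e in events:
        card, ts, amount = e["card"], e["ts"], e["amount"]
        if amount >= large:
            t0 = last_small.get(card)
            if t0 is not None and ts - t0 <= within:  # sequence matched
                yield {"card": card, "alert": "SMALL_THEN_LARGE"}
        elif amount <= small:
            last_small[card] = ts  # remember the "small" state per card

events = [{"card": "x", "ts": 0, "amount": 2.0},
          {"card": "x", "ts": 600, "amount": 1500.0}]
print(list(small_then_large(events)))
```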
Such temporal sequence patterns are very powerful. For example, the following video shows realtime analytics performed on data collected from a real football game. The dataset was taken from the DEBS 2013 Grand Challenge.
In the video, we used patterns over the event sequence to detect ball possession: the time period during which a specific player controlled the ball. A player possesses the ball from the time he hits it until someone else hits it. This condition can be written as a regular expression: a hit by me, followed by any number of hits by me, followed by a hit by someone else. (We already discussed how to detect hits on the ball under Pattern 4: Joining Event Streams.)
Please refer to Stream Processing 101: From SQL to Streaming SQL in 10 Minutes for more details.
Pattern 8: Tracking
The eighth pattern tracks something over space and time and detects given conditions. Following are a few examples:
- Tracking a fleet of vehicles, making sure that they adhere to speed limits, routes, and geo-fences.
- Tracking wildlife, making sure the animals are alive (they will not move if they are dead) and that they do not leave the reservation
- Tracking airline luggage and making sure it is not sent to the wrong destination
- Tracking a logistics network and figuring out bottlenecks and unexpected conditions
For example, the TFL demo we discussed under Pattern 2 shows an application that tracks and monitors London buses using the open data feeds exposed by TFL (Transport for London).
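A minimal form of this pattern is a geo-fence check. Here is a Python sketch; the rectangular fence bounds and the event fields are assumptions for illustration:

```python
FENCE = {"min_x": 0.0, "max_x": 10.0, "min_y": 0.0, "max_y": 10.0}

def geofence_alerts(events, fence=FENCE):
    for e in events:
        inside = (fence["min_x"] <= e["x"] <= fence["max_x"]
                  and fence["min_y"] <= e["y"] <= fence["max_y"])
        if not inside:  # position has left the allowed region
            yield {"vehicle": e["vehicle_id"], "alert": "OUT_OF_FENCE"}

events = [{"vehicle_id": "bus-1", "x": 5.0, "y": 5.0},
          {"vehicle_id": "bus-2", "x": 12.0, "y": 3.0}]
print(list(geofence_alerts(events)))
```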
Pattern 9: Detecting Trends
We often encounter time series data. Detecting patterns in time series data and bringing them to an operator's attention are common use cases.
Following are some examples of trends; a minimal rise/fall detection sketch appears after the use-case list below.
- Rise, fall
- Turn (a switch from a rise to a fall, or vice versa)
- Outliers
- Complex trends, like a triple bottom, etc.
These trends are useful in a wide variety of use cases such as
- Stock markets and Algorithmic trading
- Enforcing SLA (Service Level Agreement), Auto Scaling, and Load Balancing
- Predictive maintenance (e.g., predicting that the hard disk will fill up within the next week)
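Here is the promised sketch of the simplest case: detecting turns (direction changes) in a numeric series. This is a toy sketch; real trend detection would usually smooth the series first:

```python
def detect_turns(values):
    prev_dir = 0  # +1 rising, -1 falling, 0 unknown
    for a, b in zip(values, values[1:]):
        direction = (b > a) - (b < a)  # sign of the step
        if direction and prev_dir and direction != prev_dir:
            yield {"at": b, "trend": "TURN"}  # rise->fall or fall->rise
        if direction:
            prev_dir = direction

series = [1, 2, 3, 5, 4, 3, 4]
print(list(detect_turns(series)))  # a rise->fall turn, then a fall->rise turn
```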
Pattern 10: Running the same Query in Batch and Realtime Pipelines
This pattern runs the same query in both the realtime and batch pipelines. It is often used to fill the gap left in the data by batch processing. For example, if a batch run takes 15 minutes, its results would lack the data for the last 15 minutes.
The idea of this pattern, which is sometimes called the “Lambda Architecture,” is to use realtime analytics to fill that gap. Jay Kreps's article “Questioning the Lambda Architecture” discusses this pattern in detail.
Pattern 11: Detecting and switching to Detailed Analysis
The main idea of the pattern is to detect a condition that suggests an anomaly, and then analyze it further using historical data. This pattern is used in use cases where we cannot analyze all the data in full detail; instead, we analyze only the anomalous cases in full detail. Following are a few examples; a minimal sketch follows the list.
- Use basic rules to detect fraud (e.g., a large transaction), then pull all transactions made against that credit card over a longer period (e.g., 3 months of data) from a batch pipeline and run a detailed analysis
- While monitoring weather, detect conditions like high temperature or low pressure in a given region, and then start a high-resolution localized forecast for that region
- Detect good customers, for example through an expenditure of more than $1000 within a month, and then run a detailed model to decide whether to offer them a deal
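Here is a Python sketch of the first example, where a cheap streaming rule triggers an expensive lookback. The threshold, the in-memory history store, and the stand-in detailed_analysis() function are assumptions for illustration:

```python
HISTORY = {"c1": [12.0, 30.0, 18.0], "c2": [5.0]}  # stand-in batch store
THRESHOLD = 1000.0

def detailed_analysis(card, history):
    # Stand-in for a real batch job over e.g. 3 months of transactions.
    return {"card": card, "avg": sum(history) / len(history)}

def monitor(events, threshold=THRESHOLD):
    for e in events:
        if e["amount"] > threshold:  # cheap streaming rule fires first
            history = HISTORY.get(e["card"], [])
            if history:
                yield detailed_analysis(e["card"], history)  # costly drill-down

print(list(monitor([{"card": "c1", "amount": 2500.0}])))
```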
Pattern 12: Using a Model
The idea is to train a model (often a machine learning model) and then use it in the realtime pipeline to make decisions. For example, you can build a model using R, export it as PMML (Predictive Model Markup Language), and use it within your realtime pipeline.
Examples include fraud detection, segmentation, predicting the next value, and predicting churn. Also see the InfoQ article Machine Learning Techniques for Predictive Maintenance for a detailed example of this pattern.
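Here is a Python sketch of scoring stream events with a pre-trained model. The hand-coded logistic function stands in for a model exported from R as PMML; the coefficients, features, and cutoff are assumptions for illustration:

```python
import math

# Pre-trained coefficients (stand-ins for a PMML-exported model).
WEIGHTS = {"amount": 0.004, "foreign": 2.0}
BIAS = -5.0

def score(event):
    z = BIAS + sum(WEIGHTS[k] * event[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic: probability of fraud

def flag_fraud(events, cutoff=0.9):
    for e in events:
        p = score(e)
        if p > cutoff:
            yield {"event": e, "fraud_score": round(p, 3)}

stream = [{"amount": 50.0, "foreign": 0}, {"amount": 2000.0, "foreign": 1}]
print(list(flag_fraud(stream)))  # only the second event is flagged
```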
Pattern 13: Online Control
There are many use cases where we need to control something online. The classical use cases are autopilots, self-driving vehicles, and robotics. These involve problems like current situation awareness, predicting the next value(s), and deciding on corrective actions.
You can implement most of these use cases with a stream processor that supports a Streaming SQL language. Please refer to Stream Processing 101: From SQL to Streaming SQL in 10 Minutes for a detailed discussion of Streaming SQL. You can try out the above patterns with WSO2 Stream Processor, which is freely available under the Apache License 2.0. You can also find other stream processors in What are the best stream processing solutions out there?
This post is based on a tutorial given at DEBS 2015 (the 9th ACM International Conference on Distributed Event-Based Systems) describing a set of realtime analytics patterns. We have since edited the content to capture newer trends such as Streaming SQL.
You can find details about the pattern implementations in the following slide deck, and the source code at https://github.com/suhothayan/DEBS-2015-Realtime-Analytics-Patterns. Although the Streaming SQL syntax closely follows the most recent release of WSO2 SP, there are minor differences in the syntax; please refer to the WSO2 SP user guide for the most recent syntax.
- Stream Processing 101: From SQL to Streaming SQL in 10 Minutes, https://wso2.com/library/articles/2018/02/stream-processing-101-from-sql-to-streaming-sql-in-ten-minutes/
- Machine Learning Techniques for Predictive Maintenance, https://www.infoq.com/articles/machine-learning-techniques-predictive-maintenance