13 Stream Processing Patterns for building Streaming and Realtime Applications
Original: https://iwringer.wordpress.com/2015/08/03/patterns-for-streaming-realtime-analytics/
Introduction
In more and more use cases, we want to react to data as soon as it arrives, rather than storing it on disk and periodically processing and acting on it. This is done using realtime analytics.
Realtime analytics comes in two flavors.
- Realtime Interactive/Ad-hoc Analytics: users issue ad-hoc dynamic queries, and the system responds interactively. Examples of such tools are Druid, SAP Hana, VoltDB, MemSQL, and Apache Drill.
- Realtime Streaming Analytics / Stream Processing: users issue static queries once, and the system processes data as it comes in, without storing it. This is supported by Stream Processors; examples of such tools are WSO2 Stream Processor and Apache Flink. (See What is Stream Processing? for more details.)
Realtime Interactive Analytics lets users explore a large data set by issuing ad-hoc queries. Queries should respond within 10 seconds, which is considered the upper bound for acceptable human interaction. In contrast, this tutorial focuses on Stream Processing: processing data as it comes in, without storing it, and reacting to it very fast, often within a few milliseconds. Such technologies are not new. Their history goes back to Active Databases (2000+), stream processing engines (e.g., Aurora (2003), Borealis (2005+), and later Apache Storm), Distributed Streaming Operators (2005), and Complex Event Processing. Between 2015 and 2018, most of these technologies converged under the theme Stream Processing (see CEP vs. Stream Processing for more information).
When thinking about realtime analytics, many think only about counting use cases. Counting use cases are only the tip of the iceberg of real-life realtime use cases. Since the input data arrives as a data stream, a time dimension is always present in the data. This time dimension lets us implement many powerful use cases. For example, the Streaming SQL supported by many Stream Processors provides operators like windows, joins, and temporal event sequence detection.
Stream processing technologies like Apache Samza and Apache Storm have received much attention under the theme of large-scale streaming analytics. However, these tools force every programmer to design and implement realtime analytics processing from first principles.
For example, if users need a time window, they have to implement it from first principles. This is like every programmer implementing their own list data structure.
Since 2016, a new idea called Streaming SQL has emerged. We call a language that enables users to write SQL-like queries against streaming data a “Streaming SQL” language. Almost all Stream Processors now support Streaming SQL.
However, writing streaming applications requires very different thinking patterns from writing code in a language like Java. A better understanding of the common patterns in stream processing lets us understand the domain better and build tools that handle those scenarios. This tutorial describes 13 common realtime streaming analytics patterns and how to implement them. In the discussion, we will draw heavily from real-life use cases built with Stream Processing and other technologies.
Realtime Streaming Analytics Patterns
Before looking at the patterns, let’s first agree on the terminology. Stream Processing accepts input as a set of streams where each stream consists of many events ordered in time. Each event has many attributes, but all events in the same stream have the same set of attributes or schema.
Pattern 1: Preprocessing
Preprocessing is often done as a projection from one data stream to another, or through filtering. Potential operations include:
- Filtering and removing some events
- Reshaping a stream by removing, renaming, or adding new attributes to a stream
- Splitting and combining attributes in a stream
- Transforming attributes
For example, from a Twitter data stream, we might choose to extract only the fields author, timestamp, and location, and then filter events based on the location of the author. A minimal Streaming SQL sketch is shown below.
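Here is a hedged sketch in Siddhi-style Streaming SQL (the language family used by WSO2 SP); the TweetStream schema and the 'London' filter value are illustrative assumptions, not part of the original post:

```sql
-- Assumed input schema (illustrative, not from the original post)
define stream TweetStream (author string, ts long, location string, text string);

-- Project the attributes we need and keep only tweets from London
from TweetStream[location == 'London']
select author, ts, location
insert into LondonTweetStream;
```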
Pattern 2: Alerts and Thresholds
This pattern detects a condition and generates alerts based on it (e.g., an alarm on high temperature). These alerts can be based on a simple value or on more complex conditions such as the rate of increase.
For example, in the TFL (Transport for London) demo video, which is based on transit data from London, we trigger a speed alert when a bus has exceeded a given speed limit; a sketch of such a query follows.
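A minimal Siddhi-style sketch of such an alert, assuming a bus event stream with a speed attribute (the schema and the 60 km/h limit are illustrative):

```sql
-- Assumed input schema (illustrative)
define stream BusStream (busId string, latitude double, longitude double, speed double);

-- Emit an alert event whenever a bus exceeds the speed limit
from BusStream[speed > 60.0]
select busId, speed
insert into SpeedAlertStream;
```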
Pattern 3: Simple Counting and Counting with Windows
This pattern includes aggregate functions like min, max, percentiles, etc., which can be computed without storing all the raw data (e.g., counting the number of failed transactions).
However, counts are often used with a time window attached to them (e.g., the failure count over the last hour). There are many types of windows: sliding vs. batch (tumbling) windows, and time vs. length windows. There are four main variations.
- Time, sliding window: keeps each event for the given time window and produces an output whenever an event is added to or removed from the window.
- Time, batch window: also called a tumbling window, it only produces output at the end of the time window.
- Length, sliding window: same as the time, sliding window, but keeps a window of n events instead of selecting them by time.
- Length, batch window: same as the time, batch window, but keeps a window of n events instead of selecting them by time.
There are also special windows like decaying windows and unique windows. Please refer to Stream Processing 101: From SQL to Streaming SQL in 10 Minutes for more details.
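As a sketch, the "failure count over the last hour" example can be written with a sliding time window as follows (Siddhi-style; the TransactionStream schema is an illustrative assumption, and swapping #window.time for #window.timeBatch gives the tumbling variant):

```sql
define stream TransactionStream (txId string, status string, amount double);

-- Sliding one-hour window over failed transactions; the count is updated
-- as events enter and leave the window
from TransactionStream[status == 'FAILED']#window.time(1 hour)
select count() as failureCount
insert into FailureCountStream;
```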
Pattern 4: Joining Event Streams
The main idea behind this pattern is to match up multiple data streams and create a new event stream. For example, let's assume we are watching a football game where both the players and the ball have sensors that emit events with their current location and acceleration. We can use joins to detect when a player has kicked the ball. To that end, we can join the ball location stream and the player stream on the condition that they are within one meter of each other and the ball's acceleration has increased by more than 55 m/s².
Other use cases include combining data from two sensors and detecting the proximity of two vehicles. Please refer to Stream Processing 101: From SQL to Streaming SQL in 10 Minutes for more details.
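A rough Siddhi-style sketch of the kick detection join; the schemas, thresholds, and the inline squared-distance check are illustrative assumptions (a real query would likely use a geo or math extension for the distance):

```sql
define stream BallStream (x double, y double, acceleration double);
define stream PlayerStream (playerId string, x double, y double);

-- When the ball's acceleration jumps above 55 m/s^2, join its latest event
-- with the latest player events that are within about one meter of it
from BallStream[acceleration > 55.0]#window.length(1) as b
  join PlayerStream#window.length(1) as p
  on (b.x - p.x) * (b.x - p.x) + (b.y - p.y) * (b.y - p.y) < 1.0
select p.playerId as playerId, b.acceleration as acceleration
insert into KickStream;
```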
Pattern 5: Data Correlation, Missing Events, and Erroneous Data
This pattern has a lot in common with pattern 4: here too we match up multiple streams. In addition, we also correlate data within the same stream. This is needed because different sensors can send events at different rates, and many use cases require this fundamental operator.
Following are some possible scenarios.
- Matching up two data streams that send events at different speeds
- Detecting a missing event in a data stream (e.g., detecting a customer request that has not been responded to within 1 hour of its reception; see the sketch after this list)
- Detecting erroneous data (e.g., detecting failed sensors by using a set of sensors that monitor overlapping regions, using that redundant data to find erroneous sensors and remove their data from further processing)
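A sketch of the missing-event scenario using an absent pattern; the stream schemas are illustrative assumptions, and the not ... for construct follows newer Siddhi releases, so check your Stream Processor's documentation for the exact syntax:

```sql
define stream RequestStream (requestId string, customerId string);
define stream ResponseStream (requestId string, agentId string);

-- Alert when a request sees no matching response within one hour
from every r = RequestStream
  -> not ResponseStream[requestId == r.requestId] for 1 hour
select r.requestId as requestId, r.customerId as customerId
insert into UnansweredRequestStream;
```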
Pattern 6: Interacting with Databases
Often we need to combine realtime data with historical data stored on disk. Following are a few examples.
- When a transaction happens, look up the customer's age using the customer ID from the customer database, to be used for fraud detection (enrichment; see the sketch after this list)
- Checking a transaction against blacklists and whitelists in the database
- Receiving an input from the user (e.g., a daily discount amount may be updated in the database, and the query will pick it up automatically, without human intervention)
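A minimal sketch of the enrichment case as a stream-to-table join, Siddhi-style; the table and stream definitions are illustrative assumptions (in practice the table would typically be backed by an external datastore):

```sql
define stream TransactionStream (txId string, customerId string, amount double);
define table CustomerTable (customerId string, age int);

-- Enrich each incoming transaction with the customer's age from the table
from TransactionStream as t join CustomerTable as c
  on t.customerId == c.customerId
select t.txId as txId, t.amount as amount, c.age as age
insert into EnrichedTransactionStream;
```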
Pattern 7: Detecting Temporal Event Sequence Patterns
With regular expressions, we detect a pattern of characters within a sequence of characters. Similarly, given a sequence of events, we can write a regular-expression-like query to detect a temporal sequence of events arranged in time, where each event, or a condition on an event, plays the role a character plays in a string.
A frequently cited example, although a bit simplistic, is that a thief, having stolen a credit card, would try a small transaction to make sure the card works and then do a large transaction. Here, the small transaction followed by a large transaction is a temporal sequence of events arranged in time, and it can be detected using a regular expression written on top of an event sequence.
Such temporal sequence patterns are very powerful. For example, the following video shows realtime analytics performed on data collected from a real football game. The dataset was taken from the DEBS 2013 Grand Challenge.
In the video, we used patterns on event sequences to detect ball possession: the time period during which a specific player controlled the ball. A player possesses the ball from the moment he hits it until someone else hits it. This condition can be written as a regular expression: a hit by me, followed by any number of hits by me, followed by a hit by someone else. (We already discussed how to detect hits on the ball under Pattern 4: Joins.)
Please refer to Stream Processing 101: From SQL to Streaming SQL in 10 Minutes for more details.
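As a sketch of the credit card example above, a Siddhi-style pattern query might look like the following; the schema, the amounts, and the one-day window are illustrative assumptions:

```sql
define stream CardTransactionStream (cardNo string, amount double);

-- A small probe transaction followed, on the same card and within a day,
-- by a large transaction is flagged as suspicious
from every small = CardTransactionStream[amount < 10.0]
  -> large = CardTransactionStream[cardNo == small.cardNo and amount > 1000.0]
  within 1 day
select small.cardNo as cardNo, large.amount as amount
insert into SuspiciousTransactionStream;
```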
Pattern 8: Tracking
The eighth pattern tracks something over space and time and detects given conditions. Following are a few examples.
- Tracking a fleet of vehicles, making sure they adhere to speed limits, routes, and geo-fences
- Tracking wildlife, making sure the animals are alive (they will not move if they are dead) and that they do not leave the reservation
- Tracking airline luggage and making sure it is not sent to the wrong destination
- Tracking a logistics network, figuring out bottlenecks and unexpected conditions
For example, the TFL demo we discussed under pattern 2 shows an application that tracks and monitors London buses using the open data feeds exposed by TFL (Transport for London).
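A minimal sketch of a geo-fence check on such a feed, Siddhi-style; the rectangular bounding box and the stream schema are illustrative assumptions (a real deployment would likely use a proper geo extension):

```sql
define stream VehicleStream (vehicleId string, latitude double, longitude double);

-- Alert when a vehicle reports a position outside a rectangular geo-fence
from VehicleStream[latitude < 51.28 or latitude > 51.70
                   or longitude < -0.51 or longitude > 0.33]
select vehicleId, latitude, longitude
insert into GeoFenceAlertStream;
```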
Pattern 9: Detecting Trends
We often encounter time series data. Detecting patterns in time series data and bringing them to an operator's attention are common use cases.
Following are some examples of trends.
- Rise, fall
- Turn (a switch from a rise to a fall; see the sketch at the end of this pattern)
- Outliers
- Complex trends like a triple bottom, etc.
These trends are useful in a wide variety of use cases, such as
- Stock markets and algorithmic trading
- Enforcing SLAs (Service Level Agreements), auto-scaling, and load balancing
- Predictive maintenance (e.g., predicting that a hard disk will fill up within the next week)
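As a sketch, a rise followed by a turn can be detected with a Siddhi-style event sequence; the stream schema is an illustrative assumption, and the e2[last] indexing follows Siddhi's sequence syntax:

```sql
define stream StockStream (symbol string, price double);

-- Match a run of prices that stay at or above a starting price (e1, e2+),
-- followed by the first price that drops below the run's last value (e3)
from every e1 = StockStream,
     e2 = StockStream[price >= e1.price]+,
     e3 = StockStream[price < e2[last].price]
select e1.symbol as symbol, e2[last].price as peakPrice
insert into PeakStream;
```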
Pattern 10: Running the same Query in Batch and Realtime Pipelines
This pattern runs the same query in both the realtime and the batch pipeline. It is often used to fill the gap left in the data by batch processing. For example, if batch processing takes 15 minutes, the results would lack the data from the last 15 minutes.
The idea of this pattern, which is sometimes called the "Lambda Architecture", is to use realtime analytics to fill that gap. Jay Kreps's article "Questioning the Lambda Architecture" discusses this pattern in detail.
Pattern 11: Detecting and switching to Detailed Analysis
The main idea of this pattern is to detect a condition that suggests an anomaly, and then analyze it further using historical data. This pattern is used in use cases where we cannot analyze all the data in full detail; instead, we analyze anomalous cases in full detail. Following are a few examples.
- Use basic rules to detect fraud (e.g., a large transaction), then pull all transactions made with that credit card over a longer time period (e.g., 3 months of data) from a batch pipeline and run a detailed analysis
- While monitoring weather, detect conditions like high temperature or low pressure in a given region, then start a high-resolution localized forecast for that region
- Detect good customers, for example those who spend more than $1000 within a month, then run a detailed model to decide whether to offer them a deal
Pattern 12: Using a Model
The idea is to train a model (often a machine learning model) and then use it within the realtime pipeline to make decisions. For example, you can build a model using R, export it as PMML (Predictive Model Markup Language), and use it within your realtime pipeline.
Examples include fraud detection, segmentation, predicting the next value, and predicting churn. Also see the InfoQ article Machine Learning Techniques for Predictive Maintenance for a detailed example of this pattern.
Pattern 13: Online Control
There are many use cases where we need to control something online. The classical use cases are autopilots, self-driving vehicles, and robotics. These involve problems like being aware of the current situation, predicting the next value(s), and deciding on corrective actions.
You can implement most of these use cases with a Stream Processor that supports a Streaming SQL language. Please refer to Stream Processing 101: From SQL to Streaming SQL in 10 Minutes for a detailed discussion of Streaming SQL. You can try out the above patterns with WSO2 Stream Processor, which is freely available under the Apache License 2.0. You can also find other Stream Processors in What are the best stream processing solutions out there?
This post was initially based on a tutorial given at DEBS 2015 (the 9th ACM International Conference on Distributed Event-Based Systems) describing a set of realtime analytics patterns. We have since edited the content to capture trends such as Streaming SQL.
You can find details about the pattern implementations in the following slide deck, and the source code at https://github.com/suhothayan/DEBS-2015-Realtime-Analytics-Patterns. Although the Streaming SQL syntax closely follows the most recent release of WSO2 SP, there are minor differences; please refer to the WSO2 SP user guide for the most recent syntax.
- Stream Processing 101: From SQL to Streaming SQL in 10 Minutes, https://wso2.com/library/articles/2018/02/stream-processing-101-from-sql-to-streaming-sql-in-ten-minutes/
- Machine Learning Techniques for Predictive Maintenance, https://www.infoq.com/articles/machine-learning-techniques-predictive-maintenance