https://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-telco-industry

Kappa architecture and Bayesian models yield quick, accurate analytics in cloud monitoring systems.

May 19, 2016

Ever-growing volumes of data, shorter time constraints, and an increasing need for accuracy are defining the new analytics environment. In the telecom industry, traditional user and network data co­exists with machine-­to-­machine (M2M) traffic, media data, social activities, and so on. In terms of volume, this can be referred to as an “explosion” of data. This is a great business opportunity for telco operators and a key angle to take full advantage of current infrastructure investments (4G, LTE).

In this blog post, we will describe an approach to quickly ingest and analyze large volumes of streaming data, the Kappa architecture, as well as how to build a Bayesian online-­learning model to detect novelties in a complex environment. Note that novelty does not necessarily imply an undesired situation;  it indicates a change from previously known behaviours.

We apply both Kappa and the Bayesian model to a use case using a data stream originating from a telco cloud monitoring system. The stream is composed of telemetry and log events. It is high­ volume, as many physical servers and virtual machines are monitored simultaneously.

The proposed method quickly detects anomalies with high accuracy while adapting (learning) over time to new system normals, making it a desirable tool for considerably reducing maintenance costs associated with operability of large computing infrastructures.

What is “Kappa architecture”?

In a 2014 blog post, Jay Kreps accurately coined the term Kappa architecture by pointing out the pitfalls of the Lambda architecture and proposing a potential software evolution. To understand the differences between the two, let's first observe what the Lambda architecture looks like:

Figure 1. Lambda architecture. Figure courtesy of Ignacio Mulas Viela and Nicolas Seyvet.

As shown in Figure 1, the Lambda architecture is composed of three layers: a batch layer, a real­-time (or streaming) layer, and a serving layer. Both the batch and real­-time layers receive a copy of the event, in parallel. The serving layer then aggregates and merges computation results from both layers into a complete answer.

The batch layer (aka, historical layer) has two major tasks: managing historical data and re­computing results such as machine learning models. Computations are based on iterating over the entire historical data set. Since the data set can be large, this produces accurate results at the cost of high latency due to high computation time.

The real­-time layer( speed layer, streaming layer) provides low-­latency results in near real-­time fashion. It performs updates using incremental algorithms, thus significantly reducing computation costs, often at the expense of accuracy.

The Kappa architecture simplifies the Lambda architecture by removing the batch layer and replacing it with a streaming layer. To understand how this is possible, one must first understand that a batch is a data set with a start and an end (bounded), while a stream has no start or end and is infinite (unbounded). Because a batch is a bounded stream, one can conclude that batch processing is a subset of stream processing.  Hence, the Lambda batch layer results can also be obtained by using a streaming engine. This simplification reduces the architecture to a single streaming engine capable of ingesting the needed volumes of data to handle both batch and real-time processing. Overall system complexity significantly decreases with Kappa architecture. See Figure 2:

Figure 2. Kappa architecture. Figure courtesy of Ignacio Mulas Viela and Nicolas Seyvet.

Intrinsically, there are four main principles in the Kappa architecture:

O'Reilly Data Newsletter

Receive weekly insight from industry insiders—plus exclusive content, offers, and more on the topic of data.

 
Your Email
 
Country
 
 
 
  1. Everything is a stream: Batch operations become a subset of streaming operations. Hence, everything can be treated as a stream.
  2. Immutable data sources: Raw data (data source) is persisted and views are derived, but a state can always be recomputed as the initial record is never changed.
  3. Single analytics framework: Keep it short and simple (KISS) principle. A single analytics engine is required. Code, maintenance, and upgrades are considerably reduced.
  4. Replay functionality: Computations and results can evolve by replaying the historical data from a stream.

In order to respect principle four, the data pipeline must guarantee that events stay in order from generation to ingestion. This is critical to guarantee consistency of results, as this guarantees deterministic computation results. Running the same data twice through a computation must produce the same result.

These four principles do, however, put constraints on building the analytics pipeline.

Building the analytics pipeline

Let ́s start concretizing how we can build such a data pipeline and identify the sorts of components required.

The first component is a scalable, distributed messaging system with events ordering and at-least-once delivery guarantees. Kafka can connect the output of one process to the input of another via a publish­-subscribe mechanism. Using it, we can build something similar to the Unix pipe systems where the output produced by one command is the input to the next.

The second component is a scalable stream analytics engine. Inspired by the Google Dataflow paper, Flink, at its core, is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. One of its most interesting API features allows usage of the event timestamp to build time windows for computations.

The third and fourth components are a real-­time analytics store, Elasticsearch, and a powerful visualisation tool, Kibana. Those two components are not critical, but they’re useful to store and display raw data and results.

Mapping the Kappa architecture to its implementation, Figure 3 illustrates the resulting data pipeline:

Figure 3. Kappa architecture reflected in a data pipeline. Figure courtesy of Ignacio Mulas Viela and Nicolas Seyvet.

This pipeline creates a composable environment where outputs of different jobs can be re­used as inputs to another. Each job can thus be reduced to a simple, well-defined role. The composability allows for fast development of new features. In addition, data ordering and delivery are guaranteed, making results consistent. Finally, event timestamps can be used to build time windows for computations.

Applying the above to our telco use case, each physical host and virtual machine (VM) telemetry and log event is collected and sent to Kafka. We use collectd on the hosts, and ceilometer on the VMs for telemetry, and logstash-forwarder for logs. Kafka then delivers this data to different Flink jobs that transform and process the data. This monitoring gives us both the physical and virtual resource views of the system.

With the data pipeline in place, let’s look at how a Bayesian model can be used to detect novelties in a telco cloud.

Incorporating a Bayesian model to do advanced analytics

To detect novelties, we use a Bayesian model. In this context, novelties are defined as unpredicted situations that differ from previous observations. The main idea behind Bayesian statistics is to compare statistical distributions and determine how similar or different they are. The goal here is to:

  • determine the distribution of parameters to detect an anomaly,
  • compare new samples for each parameter against calculated distributions and determine if the obtained value is expected or not,
  • and combine all parameters to determine if there is an anomaly.

Let’s dive into the math to explain how we can perform this operation in our analytics framework. Considering the anomaly A, a new sample z, θ observed parameters, P(θ) the probability distribution of the parameter, A(z|θ) the probability that z is an anomaly, X the samples, the Bayesian Principal Anomaly can be written as:

A (z | X) = ∫A(θ)P(θ|X)

A principal anomaly as defined above is valid also for multi­variate distributions. The approach taken evaluates the anomaly for each variable separately, and then combines them into a total anomaly value.

An anomaly detector considers only a small part of the variables, and typically only a single variable with a simple distribution like Poisson or Gauss, can be called a micro­model. A micromodel with gaussian distribution will look like this:

Figure 4. Micromodel with gaussian distribution. Figure courtesy of Ignacio Mulas Viela and Nicolas Seyvet.

An array of micro­models can then be formed, with one micro­model per variable (or small set of variables). Such an array can be called a component. The anomaly values from the individual detectors then have to be combined into one anomaly value for the whole component. The combination depends on the use case. Since accuracy is important (avoid false positives) and parameters can be assumed to be fairly independent from one another, then the principal anomaly for the component can be calculated as the maximum of the micro­model anomalies, but scaled down to meet the correct false alarm rate (i.e., weighted influence of components to improve the accuracy of the principal anomaly detection).

However, there may be many different “normal” situations. For example, the normal system behavior may vary within weekdays or time of day. Then, it may be necessary to model this with several components, where each component learns the distribution of one cluster. When a new sample arrives, it is tested by each component. If it is considered anomalous by all components, it is considered as anomalous. If any component finds the sample normal, then it is normal.

Learning Path

Learn the entire process of designing and building data applications.

Start learning

Applying this to our use case, we used this detector to spot errors or deviations from normal operations in a telco cloud. Each parameter θ is any of the captured metrics or logs resulting in many micromodels. By keeping a history of past models, and computing a principal anomaly for the component, we can find statistically relevant novelties. These novelties could come from configuration errors, a new error in the infrastructure, or simply a new state of the overall system (i.e., a new set of virtual machines).

Using the number of generated logs (or log frequency) appears to be the most significant feature to detect novelties. By modeling the statistical function of generated logs over time (or log frequency), the model can spot errors or novelties accurately. For example, let’s consider the case where a database becomes unavailable. At that time, any applications depending on it start logging recurring errors, (e.g., “Database X is unreachable...”). This raises the log frequency, which triggers a novelty in our model and detector.

The overall data pipeline, combining the transformations mentioned before, will look like this:

Figure 5. Data pipeline with combination of analytics and Bayesian anomaly detector. Figure courtesy of Ignacio Mulas Viela and Nicolas Seyvet.

This data pipeline receives the raw data, extracts statistical information (such as log frequencies per machine), applies the Bayesian anomaly detector over the interesting features (statistical and raw), and outputs novelties whenever they are found.

Conclusion

In this blog post, we have presented an approach using the Kappa architecture and a self-training (online) Bayesian model to yield quick, accurate analytics.

The Kappa architecture allows us to develop a new generation of analytics systems. Remember, this architecture has four main principles: data is immutable, everything is a steam, a single stream engine is used, and data can be replayed. It simplifies both the software systems and the development and maintenance of machine learning models. Those principles can easily be applied to most use cases.

The Bayesian model quickly detects novelties in our cloud. This type of online learning has the advantage of adapting over time to new situations, but one of its main challenges is a lack of ready­-to-­use algorithms. However, the analytics landscape is evolving fast and we are confident that a richer environment can be expected in the near future.

Ignacio Mulas Viela and Nicolas Seyvet will speak in greater detail on this topic as well as on a telco data set from a running cloud during their session Kappa architecture in the telecom industry at Strata + Hadoop World London 2016, May 31 to June 3, 2016.

Applying the Kappa architecture in the telco industry的更多相关文章

  1. AnalyticDB实现和特点浅析

    目录 AnalyticDB介绍与背景 AnalyticDB详细解析 架构设计 数据分区 读写分离和读写流程 其他特性介绍 混合(列-行)存储引擎 索引 小结 本篇主要是根据AnalyticDB的论文, ...

  2. Pattern: Microservice Architecture

    Microservice Architecture pattern http://microservices.io/patterns/microservices.html Context You ar ...

  3. 大数据处理中的Lambda架构和Kappa架构

    首先我们来看一个典型的互联网大数据平台的架构,如下图所示: 在这张架构图中,大数据平台里面向用户的在线业务处理组件用褐色标示出来,这部分是属于互联网在线应用的部分,其他蓝色的部分属于大数据相关组件,使 ...

  4. Java中实现SAX解析xml文件到MySQL数据库

    大致步骤: 1.Java bean 2.DBHelper.java 3.重写DefaultHandler中的方法:MyHander.java 4.循环写数据库:SAXParserDemo.java ① ...

  5. DotNet 资源大全中文版(Awesome最新版)

    Awesome系列的.Net资源整理.awesome-dotnet是由quozd发起和维护.内容包括:编译器.压缩.应用框架.应用模板.加密.数据库.反编译.IDE.日志.风格指南等. 算法与数据结构 ...

  6. A Complete List of .NET Open Source Developer Projects

    http://scottge.net/2015/07/08/a-complete-list-of-net-open-source-developer-projects/?utm_source=tuic ...

  7. The world beyond batch: Streaming 101

    https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 https://www.oreilly.com/ideas/the ...

  8. .NET 开源开发项目【翻译】

    原文地址 本文列出了 .NET 开源开发项目(open source developer projects).意在包括对开发过程的所有方面有所帮组的项目.对于消费项目(consumer project ...

  9. 20 Interesting WPF Projects on CodePlex

    20 Interesting WPF Projects on CodePlex (Some for Silverlight too) Pete Brown - 22 November 2010   I ...

随机推荐

  1. Unity3D笔记 英保通二

    一.访问另一个物体 1.代码中定义一个public的物体 例如:var target:Transform; 在面板上直接拖拽一个物体赋值给target 2.通过GameObject.Find(&quo ...

  2. svn和git的优缺点

    git官网api: https://git-scm.com/docs 一. 集中式vs分布式 1. Subversion属于集中式的版本控制系统集中式的版本控制系统都有一个单一的集中管理的服务器,保存 ...

  3. python3安装builtwith

    >>> import builtwith Traceback (most recent call last): File , in <module> File excep ...

  4. 最简单,最实用的数据库CHM文档生成工具——DBCHM

    DBCHM支持SqlServer/MySql/Oracle/PostgreSQL等数据库的表列批注维护管理. DBCHM有以下几个功能 表,列的批注可以编辑保存到数据库. 表,列的批注支持通过pdm文 ...

  5. sftp本地上传和远程下载

    1.  打开SecureCRT 连接相应的主机 2.  打开会话后,使用快捷键 alt + p,进入 sftp> 界面 3.  查看 sftp 相应的命令 help 4.  常用命令 (1)查看 ...

  6. Centos6.10安装tomcat

    1.  下载tomcat 2.  解压到相应的路径下 tar -xzvf apache-tomcat-8.5.34.tar.gz 3.  启动tomcat # 进入"apache-tomca ...

  7. 四种数据库随机获取N条数据的方法

    1.SQL Server: SELECT TOP  n  *  FROM  tableName ORDER BY NEWID(); 2.ORACLE: SELECT * FROM (SELECT * ...

  8. The Unique MST----poj1679次小生成树

    题目链接:http://poj.org/problem?id=1679 判断最小生成数是否唯一:如果唯一这权值和次小生成树不同,否则相同: #include<stdio.h> #inclu ...

  9. Java GUI程序设计

    在实际应用中,我们见到的许多应用界面都属于GUI图形型用户界面.如:我们点击QQ图标,就会弹出一个QQ登陆界面的对话框.这个QQ图标就可以被称作图形化的用户界面. 其实,用户界面的类型分为两类:Com ...

  10. 【Loadrunner】Error -26601: Decompression function 错误解决、27728报错解决方案

       一. Error -26601: Decompression function 错误解决 Action2.c(30): Error -26601: Decompression function ...