could accomplish with Flink back at Twitter.

I had an application in mind that I knew I could make more efficient by a huge factor if I could use the stateful processing guarantees available in Flink, so I set out to build a prototype to do exactly that. The resulting prototype computed a more accurate result than the existing system while using less than 1% of its resources. The better accuracy came from the fact that Flink provides exactly-once processing guarantees, whereas the existing system only provided at-most-once. The efficiency improvements came from several places, but the largest was the elimination of a large key-value store cluster that the existing system needed. This prototype earned my team first prize in the infrastructure category at Twitter’s December 2015 Hack Week competition!

During the course of my work with Flink I also developed a good sense of its performance capabilities, so I was very interested when I read the recently published Yahoo! benchmark comparing Storm, Flink, and Spark. The benchmark measures the latency of the frameworks under relatively low throughput scenarios and establishes that both Flink and Storm can achieve sub-second latencies in these scenarios, while Spark Streaming has much higher latency. However, the throughput numbers given for Flink did not line up with what I knew was possible from my own experience, so I decided to dig into this as well. I re-ran the Yahoo! benchmarks myself, along with a couple of variants that use Flink’s own features to compute the windowed aggregates directly in Flink with full exactly-once semantics, and came up with much different throughput numbers while still maintaining sub-second latencies. The extended benchmarks are available on GitHub.

In the rest of this post I will go into detail about my own benchmarking of Flink and Storm and also describe the new architecture, enabled by Flink, that turned out to be such a huge win for my prototype system at Twitter.

During the process of working with Flink and building this application on top of it I’ve come to realize just how advanced a system Apache Flink™ actually is. As a result of this whole process and working closely with the team behind Apache Flink™ I’m very happy to report that I’ve decided to join data Artisans to continue this work full time!

Benchmarking: Comparing Flink and Storm

For background, the task in the Yahoo! benchmark is to consume ad impressions from Kafka, look up which ad campaign each ad corresponds to (from Redis), and compute the number of ad views in each 10-second window, grouped by campaign. The final results of the 10-second windows are written to Redis for storage, along with early updates on those windows every second. This is the same benchmark we discuss in this section of the post.
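To make the task concrete, here is a minimal plain-Python sketch of the counting logic. It is not the benchmark code: the `AD_TO_CAMPAIGN` dictionary is a hypothetical stand-in for the Redis lookup, and the event tuple layout and the `"view"` / `"click"` event types are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_MS = 10_000  # 10-second tumbling windows, as in the benchmark

# Hypothetical stand-in for the Redis ad -> campaign lookup.
AD_TO_CAMPAIGN = {"ad-1": "campaign-A", "ad-2": "campaign-A", "ad-3": "campaign-B"}


def window_start(event_time_ms):
    """Align an event timestamp to the start of its 10-second window."""
    return event_time_ms - (event_time_ms % WINDOW_MS)


def count_views(events):
    """Count 'view' events per (campaign, window start)."""
    counts = defaultdict(int)
    for ad_id, event_type, event_time_ms in events:
        if event_type != "view":  # only ad views are counted
            continue
        campaign = AD_TO_CAMPAIGN[ad_id]
        counts[(campaign, window_start(event_time_ms))] += 1
    return dict(counts)


# Assumed event layout: (ad id, event type, event time in milliseconds).
events = [
    ("ad-1", "view", 1_000),
    ("ad-2", "view", 9_999),
    ("ad-3", "click", 5_000),   # filtered out: not a view
    ("ad-3", "view", 10_000),   # first timestamp of the second window
    ("ad-1", "view", 12_500),
]
result = count_views(events)
```

Each key in `result` identifies one campaign in one window, which is the unit that gets written to Redis in the benchmark.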

All experiments referenced in this post were run on a cluster with the setup listed below. It is close to the setup used in the Yahoo! experiments, with one exception: in the Yahoo! experiments the compute nodes running the Flink and Storm workers were connected by a 1 GigE interconnect, whereas in our setup they were connected by a 10 GigE interconnect. The connection between the Kafka cluster and the compute nodes, however, was still just 1 GigE. Here are the exact hardware specs:

  • 10 Kafka brokers with 2 partitions each
  • 10 compute machines (Flink / Storm)
  • Each machine has 1 Xeon E3-1230-V2@3.30GHz CPU (4 cores w/ hyperthreading) and 32 GB RAM (only 8GB allocated to JVMs)
  • 10 GigE Ethernet between compute nodes
  • 1 GigE Ethernet between Kafka cluster and Flink/Storm nodes

Fault Tolerance and Throughput

We used the benchmark programs from the Yahoo! streaming benchmark to measure the maximum throughput of each system while maintaining the best possible fault tolerance.

  • For Storm, we turned acknowledgements on, to make the spouts re-send lost tuples upon failures. This, however, does not prevent lost state in the event of a Storm worker failure. The messaging guarantees provided are at-least-once which means there can be tuple replays leading to overcounting. In addition to that the actual state being accumulated on each node as the 10 second aggregates are computed is lost whenever there is a failure. This leaves the possibility of both lost values and duplicates in the final results.
  • For Flink, we changed the job to use Flink’s built-in windowing mechanism. Starting with version 0.10, Flink supports windows on event time. We use Flink’s window trigger API to emit the current window to Redis when the window is complete (every 10 seconds) and in addition we do an early update of the window every second to meet the SLA requirement defined in the Yahoo! benchmark. We also have Flink’s fault tolerance mechanism enabled (with checkpoints every second) which means the window state is recovered in the event of any failure and also guarantees exactly-once semantics with regard to the aggregated counts we are computing. Said another way this means the results we are computing here are the same whether there are failures along the way or not. This is true exactly-once semantics.
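The difference between the two guarantees can be shown with a toy simulation. This is a simplified single-process model written for illustration, not Storm's acking mechanism or Flink's actual checkpointing algorithm; the function names and the crash and replay offsets are hypothetical. The point it shows is that replaying input without rolling state back to a matching snapshot gives wrong counts, while restoring the read offset and the state together gives the same result as a failure-free run.

```python
from collections import Counter

events = ["A", "B", "A", "A", "B", "A"]  # campaign ids in log order


def run_replay_only(events, crash_after, replay_from, state_survives):
    """Replay input after a crash without a consistent state snapshot."""
    counts = Counter()
    for e in events[:crash_after]:
        counts[e] += 1
    if not state_survives:
        counts = Counter()  # in-memory aggregates on the failed worker are gone
    for e in events[replay_from:]:  # the source re-sends from an earlier offset
        counts[e] += 1
    return counts


def run_with_checkpoints(events, checkpoint_every, crash_after):
    """Snapshot (offset, counts) together; on a crash, restore both."""
    counts = Counter()
    snapshot = (0, Counter())  # (next offset to read, copy of the state)
    offset = 0
    crashed = False
    while offset < len(events):
        if not crashed and offset == crash_after:
            crashed = True
            offset, saved = snapshot
            counts = Counter(saved)  # state rolls back together with the offset
            continue
        counts[events[offset]] += 1
        offset += 1
        if offset % checkpoint_every == 0:
            snapshot = (offset, Counter(counts))
    return counts


overcounted = run_replay_only(events, crash_after=4, replay_from=2, state_survives=True)
undercounted = run_replay_only(events, crash_after=4, replay_from=2, state_survives=False)
exact = run_with_checkpoints(events, checkpoint_every=2, crash_after=5)
```

With these inputs the true counts are 4 for `A` and 2 for `B`. The replay-only runs end up above or below that, depending on whether the in-memory state survived, while the checkpointed run matches it.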

Below is a diagram giving an overview of the system used in the benchmark. On the left you see the original Flink job as reported in the Yahoo! benchmark. There is custom code to compute and cache the windows locally along with a separate user thread to flush results to Redis periodically. This is a direct port of the Storm job to Flink so it doesn’t take advantage of Flink’s window API for computing windows. On the right you see a similar diagram except in this case the job uses the Flink window API which fault-tolerantly manages the windows and emits them downstream based on user-specified triggers. This version is both simpler and provides better fault-tolerance guarantees: exactly-once.

The chart below shows the maximum throughput we were able to get out of the systems. There are a few interesting points:

  • We were actually able to get 400K events/second out of Storm (compared to 170K in the Yahoo! benchmark), presumably because of the difference in CPUs we used and potentially the 10 GigE links between the worker machines. Storm was still not able to saturate the network links to Kafka, however.
  • Flink started to saturate the network links to Kafka at around 3 million events/sec and thus the throughput was limited at that point. To see how much Flink could handle on these nodes we moved the data generator into the Flink topology. The topology then had 3 operators: (datagen) -> (map-filter) -> (window). In this configuration we were able to get to a throughput of 15 million events/second.
  • We could not see a measurable throughput difference in Flink when switching fault tolerance on or off. Since the state is comparably small, the checkpoints cost very little.

Winning Twitter Hack Week: Eliminating the key-value store bottleneck

While the above results are interesting by themselves, things get even more interesting with applications that use large windows and many distinct keys. The starting job during Hack Week was a pipeline with over a million events per second and windows of one hour, each window containing hundreds of millions of distinct keys.

Where the Yahoo! streaming benchmark job writes 100 windows per second to the key-value store (100 distinct ad campaigns), the above-mentioned stream needs to update millions of entries in the key-value store per second. Scaling that type of streaming application quickly becomes an exercise in scaling out the database. In addition, unless state is managed by the streaming system with exactly-once guarantees, you can’t actually do accurate counting in the face of failures. The system in use at Twitter suffered from the same issues as the Yahoo! benchmark system.

The solution we came up with during Hack Week to circumvent the key-value store is shown in the sketch below. It is somewhat similar to the streaming job with Flink’s windows above, but instead of periodically writing current windows into a database (to make them accessible), we directly exposed the in-flight windows to be queried. That way, only final windows need to be written to the database (once per hour). This ends up reducing the load on the database dramatically while still making the state queryable immediately as it’s computed.

To query the window contents, we wrote a custom Flink operator that computes the windows and in addition runs an Akka actor system for the queries. Flink computes the windows and checkpoints them as part of its fault tolerance mechanism, while Akka acts like an RPC system that answers the state queries (“get state for key k, time t”).
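The idea of answering “get state for key k, time t” directly from in-flight windows can be sketched in a few lines of plain Python. This is only an illustration of the data structure: the class and method names are hypothetical, there is no Akka or RPC layer, and checkpointing is left out entirely.

```python
from collections import defaultdict

WINDOW_MS = 3_600_000  # one-hour windows, as in the Hack Week pipeline


class QueryableWindowState:
    """Hypothetical in-process window store: counts per (window start, key)."""

    def __init__(self):
        self._windows = defaultdict(lambda: defaultdict(int))

    def add(self, key, event_time_ms):
        """Update the in-flight window that the event falls into."""
        start = event_time_ms - event_time_ms % WINDOW_MS
        self._windows[start][key] += 1

    def query(self, key, time_ms):
        """Answer 'get state for key k, time t' from the in-flight windows."""
        start = time_ms - time_ms % WINDOW_MS
        window = self._windows.get(start)
        if window is None:
            return 0
        return window.get(key, 0)

    def close_window(self, start):
        """Drop a finished window and return its final counts for the database write."""
        return dict(self._windows.pop(start, {}))


state = QueryableWindowState()
state.add("k1", 10)
state.add("k1", 20)
state.add("k2", 3_600_001)

in_flight = state.query("k1", 1_000)   # readable before the window has closed
final_counts = state.close_window(0)   # written to the database once per window
```

Queries read the current counts while the window is still open, and only `close_window` produces something that needs to go to the database.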

Streaming application querying state in the stream processor

To show the value of this approach in the context of the Yahoo! benchmark we created another variant in which we have 1,000,000 campaigns and store and update the windows directly in the key-value store. When you do this the key-value store very quickly becomes the bottleneck in the streaming application. In our benchmark this bottleneck occurred at around 280,000 events/sec. Beyond that there was no way to increase throughput further without either scaling up the key-value store dramatically or getting rid of it altogether.

We chose to get rid of it altogether. This is the power of having fault-tolerant local state. The stream processor itself becomes the key-value store and updating state becomes fast and cheap in-memory processing rather than communication across the network to a remote store.
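A small sketch makes the difference in store traffic visible. The `CountingStore` class below is a hypothetical in-memory stand-in for a remote key-value store that simply counts writes; the two job functions are illustrative and not taken from the benchmark code.

```python
from collections import defaultdict


class CountingStore:
    """Hypothetical stand-in for a remote key-value store that counts writes."""

    def __init__(self):
        self.data = {}
        self.writes = 0

    def get(self, key, default=0):
        return self.data.get(key, default)

    def put(self, key, value):
        self.data[key] = value
        self.writes += 1


def remote_state_job(events, store):
    """Keep the window in the store: one read-modify-write per event."""
    for key in events:
        store.put(key, store.get(key) + 1)


def local_state_job(events, store):
    """Keep the window in local memory: one store write per key at window close."""
    local = defaultdict(int)
    for key in events:
        local[key] += 1  # cheap in-memory update
    for key, count in local.items():
        store.put(key, count)


events = ["c1", "c2", "c1", "c1", "c3", "c2"]

remote_store = CountingStore()
remote_state_job(events, remote_store)

local_store = CountingStore()
local_state_job(events, local_store)
```

Both stores end up with the same counts, but the remote-state variant issues one write per event while the local-state variant issues one write per distinct key in the window.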

Using this model we were able to completely eliminate the key-value store bottleneck and achieve a throughput of 15,000,000 events/sec all while making the data directly queryable instantly as it’s processed. This is the new architecture that the new generation of stream processing technology such as Flink enables.

Latency

While we were mainly looking into throughput and fault tolerance in these experiments, we also evaluated latency. Because the programs are very different from each other, the latencies should not be compared to each other directly. They merely stand to show that all approaches discussed here achieve acceptably low (sub-second) latency.

  • Both the Storm job and the first Flink job write their results to Redis, and latencies are analyzed with the same formula and scripts as in the Yahoo! streaming benchmark. The scripts only see the final write of a window to Redis. For Storm, we measured a latency distribution comparable to the one reported in the Yahoo! streaming benchmark. The majority of the latency is caused by the delay of the thread that periodically flushes the results to Redis.
  • Flink’s final window write is triggered when the window is closed by the event time watermark. Under default settings, that watermark is generated more frequently than once per second, leading to a bit more compact latency distribution. We also observed that the latency stays low under high throughput.
  • The Flink-with-state-query job has no form of periodic flush or delay at all; hence, the latency here is naturally lower than in the other jobs, ranging between 1 and 240 milliseconds. The high-percentile latencies come most likely from garbage collection and locks during checkpoints and can probably be brought even lower by optimizing the data structure holding the state (currently a simple Java HashMap).
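The watermark-driven window write mentioned above can be illustrated with a small sketch. It is a simplified model, not Flink's trigger implementation: the function name is hypothetical, and the firing rule used here (a window fires once the watermark reaches the last timestamp the window covers) is an assumption of the sketch.

```python
WINDOW_MS = 10_000  # 10-second windows


def fire_ready_windows(open_windows, watermark_ms):
    """Split windows into (fired, still_open) for the given watermark.

    A window starting at `start` covers timestamps start .. start + WINDOW_MS - 1
    and fires once the watermark has reached that last timestamp.
    """
    fired = {}
    still_open = {}
    for start, count in open_windows.items():
        if watermark_ms >= start + WINDOW_MS - 1:
            fired[start] = count
        else:
            still_open[start] = count
    return fired, still_open


open_windows = {0: 5, 10_000: 2}  # window start -> current count

not_yet = fire_ready_windows(open_windows, 9_998)
first_done = fire_ready_windows(open_windows, 9_999)
both_done = fire_ready_windows(open_windows, 25_000)
```

In this model the delay between the end of a window and its final write is bounded by how often the watermark advances, which is why frequent watermarks lead to a more compact latency distribution.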

Our takeaway from these experiments

While Storm paved much of the way for open source streaming, Flink and Storm represent, really, two different generations of stream processing technology. Flink can be used similarly to Storm (as in the Yahoo! benchmark), but comes with features that support new approaches to building streaming applications. Embracing these new approaches can lead to huge wins in many dimensions:

  • Higher efficiency: The difference in throughput between Storm and Flink is huge. This translates directly to either scaling down to fewer machines or being able to handle much larger problems.
  • Fault tolerance and consistency: Flink provides exactly-once semantics where Storm only provides at-least-once semantics. When better guarantees come at high cost, many deployments deactivate them, and applications often cannot rely on them. With Flink, however, exactly-once guarantees are cheap and applications can take them for granted. This is the same argument as for NoSQL databases, where better consistency guarantees led to wider applicability. With these guarantees a whole new realm of stream processing applications becomes possible.
  • Exploiting local state: When building streaming applications, fault-tolerant local state is very powerful. It eliminates the need for distributed operations/transactions with external systems such as key-value stores which are often the bottleneck in practice. Exploiting the local state in Flink like we did, we were able to build the query abstraction that lifted a good part of the database work into the stream processor. This allowed us tremendous throughput and also allowed queries immediate access to the computed state.

Special thanks to Steve Cosenza (@scosenza), Dan Richelson (@nooga) and Jason Carey (@jmcarey) for all their help with the Twitter Hack Week project.
