This is a guest post from Xiaowei Jiang, Senior Director of Alibaba’s search infrastructure team. The post is adapted from Alibaba’s presentation at Flink Forward 2016, and you can see the original talk from the conference here.

Alibaba is the largest e-commerce retailer in the world. Our annual sales in 2015 totalled $394 billion--more than eBay and Amazon combined. Alibaba Search, our personalized search and recommendation platform, is a critical entry point for our customers and is responsible for much of our online revenue, so the search infrastructure team is constantly exploring ways to improve the product. What makes for a great search engine on an e-commerce site? Results that, in real time, are as relevant and accurate as possible for each user.

At Alibaba’s scale, this is a non-trivial problem, and it’s difficult to find technologies that are capable of handling our use cases. Apache Flink® is one such technology, and Alibaba is using Blink, a system based on Flink, to power critical aspects of its search infrastructure and to deliver relevance and accuracy to end users.

In this post, I’ll walk through Flink’s role in Alibaba search and outline the reasons we chose to work with Flink on the search infrastructure team. I’ll also discuss how we adapted Flink to meet our unique requirements with Blink and how we are working together with data Artisans and the Flink community to contribute these changes back to Flink. We plan to transition our system from Blink to vanilla Apache Flink once we have successfully merged our modifications into the open source project.

Part 1: Flink in Alibaba Search

Document Creation

The first step in providing users with a world-class search engine is building the documents that will be available for search. In Alibaba’s case, the document is made up of millions of product listings and related product data. Search document creation is a challenge because product data is stored in many different places, and it’s up to the search infrastructure team to bring together all relevant information to create a complete search document. Generally speaking, this is a 3-stage process:

  1. Synchronize all product data from disparate sources (e.g. MySQL, distributed file systems) into one HBase cluster.
  2. Join data from different tables together using business logic to create a final, searchable document. This is an HBase table that we call our ‘Result’ table.
  3. Export this HBase table as a file or as a set of updates.

All 3 of these stages actually run on 2 different pipelines in a classical ‘lambda architecture’: a full-build pipeline and an incremental build pipeline.

  • In the full-build pipeline, we process all data sources, and this is traditionally a batch job.
  • In the incremental pipeline, we process updates that occur after the batch job is finished. For instance, a seller can modify a price or description, or inventory availability might change. This information must be reflected in search results as quickly as possible. The incremental-build pipeline is traditionally a streaming job.
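To make the two-pipeline idea concrete, here is a minimal Python sketch in which one piece of business logic serves both the full build and the incremental build. The record shape `(product_id, field, value)` and all function names are hypothetical and chosen only for illustration; this is not Alibaba's code, and a plain dict stands in for the HBase ‘Result’ table.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record shape: (product_id, field_name, value).
Update = Tuple[str, str, object]


def apply_update(result_table: Dict[str, dict], update: Update) -> None:
    """Shared business logic: merge one field into a product's search document."""
    product_id, field, value = update
    result_table.setdefault(product_id, {})[field] = value


def full_build(source_rows: Iterable[Update]) -> Dict[str, dict]:
    """Batch pipeline: process every row from every source once."""
    result_table: Dict[str, dict] = {}
    for row in source_rows:
        apply_update(result_table, row)
    return result_table


def incremental_build(result_table: Dict[str, dict], updates: Iterable[Update]) -> None:
    """Streaming pipeline: apply changes that arrive after the batch job finished."""
    for update in updates:
        apply_update(result_table, update)


# Full build over all sources, then two later updates from sellers and inventory.
table = full_build([("p1", "price", 100), ("p1", "stock", 5), ("p2", "price", 40)])
incremental_build(table, [("p1", "price", 90), ("p2", "stock", 0)])
print(table)
```

Because both pipelines call the same `apply_update`, the business logic lives in one place, which is the property the ‘Agility’ requirement later in this post asks for.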

Real-time A/B testing of search algorithms

Our engineers test different search algorithms on a regular basis and need to be able to evaluate performance as quickly as possible. Right now, this evaluation happens once a day, but we’d like to do the analysis in real time, so we used Blink to build a real-time A/B testing framework. Online logs (impressions, clicks, transactions) are collected, processed by a parser and a filter, and then joined together using some business logic. Next, the data is aggregated, and the aggregated result is pushed to Druid; inside Druid, it’s possible to write a query to perform complex OLAP analysis on the data and see how well different algorithms are performing.
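The parse, filter, join, and aggregate steps described above can be sketched on a handful of in-memory log lines. The log format (`event,request_id,algorithm`) and the function names are assumptions made for this example; the real pipeline runs as a Blink streaming job and writes to Druid rather than returning a dict.

```python
from collections import defaultdict

# Hypothetical raw log lines: "event_type,request_id,algorithm"
raw_logs = [
    "impression,r1,A",
    "impression,r2,B",
    "click,r1,A",
    "impression,r3,A",
    "malformed line",
    "click,r3,A",
    "transaction,r3,A",
]


def parse(line):
    """Parser: turn a raw line into a record, or None if it is malformed."""
    parts = line.split(",")
    if len(parts) != 3:
        return None
    return {"event": parts[0], "request_id": parts[1], "algo": parts[2]}


def run_pipeline(lines):
    # Parse, then filter out records that failed to parse.
    events = [e for e in map(parse, lines) if e is not None]

    # Join: bring together all events that belong to the same search request.
    kinds_by_request = defaultdict(set)
    algo_of_request = {}
    for e in events:
        kinds_by_request[e["request_id"]].add(e["event"])
        algo_of_request[e["request_id"]] = e["algo"]

    # Aggregate: count requests per algorithm that reached each stage.
    stats = defaultdict(lambda: {"impressions": 0, "clicks": 0, "transactions": 0})
    for request_id, kinds in kinds_by_request.items():
        s = stats[algo_of_request[request_id]]
        s["impressions"] += int("impression" in kinds)
        s["clicks"] += int("click" in kinds)
        s["transactions"] += int("transaction" in kinds)
    return dict(stats)


print(run_pipeline(raw_logs))
```

The per-algorithm counts are the kind of aggregated result that would then be pushed to an OLAP store for comparison across algorithms.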

Online machine learning

There are a couple of applications here. First, we’ll discuss real-time feature updates. Some of the features used in Alibaba’s search ranking are product CTR, product inventory, and total number of clicks. These data change over time, and if we can use the most recent data available, we’ll be able to offer a more relevant search ranking to our users. Our Flink pipeline provides us with online feature updates and has given a significant boost to conversion rate.

Second, there are specific days of the year (such as Singles Day) when products are heavily discounted--sometimes by as much as 50%--and user behavior changes dramatically as a result. Transaction volume is huge, often many times higher than what we see on a normal day. Our previously trained models are useless in this scenario, so we use our logs and a Flink streaming job to power online machine learning, building models that take the real-time data into account. The result is a much higher conversion rate on these uncommon, but very important, sale days.

Part 2: Choosing a framework to solve the problem

When we chose Flink to power our search infrastructure, our evaluation included the following four categories. Flink met our requirements in all four.

  • Agility: It was our goal to be able to maintain one codebase for our entire (2-pipeline) search infrastructure process. And we wanted an API that was high-level enough for us to express our business logic.
  • Consistency: Changes to the seller or product databases must be reflected in final search results, and so the search infrastructure team requires at-least-once semantics (and for some other Flink use cases in the company, we have exactly-once requirements).
  • Low latency: When inventory availability changes, this must be reflected in search results very quickly; for example, we don’t want to give a high search ranking to a sold-out product.
  • Cost: Alibaba processes lots of data, and at our scale, an efficiency improvement results in significant cost savings. We need a framework that handles high throughput efficiently.

More broadly speaking, there are 2 ways to think about unified batch and stream processing. The first approach is to use batch as a starting point and then try to build streaming on top of batch. This likely won’t meet strict latency requirements, though, because micro-batching to simulate streaming incurs a fixed overhead per batch, so the overhead’s share of total processing time grows as you shrink the batch interval to reduce latency. At our scale, 1000s of tasks would need to be scheduled for each micro-batch, connections re-established, and state reloaded. So at some point, the micro-batch approach becomes too costly to make sense.
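A quick back-of-the-envelope calculation illustrates the overhead argument. The 200 ms fixed cost per micro-batch below is a made-up figure used only for illustration, not a measurement from any system.

```python
def overhead_fraction(fixed_overhead_ms: float, batch_interval_ms: float) -> float:
    """Share of each batch interval consumed by fixed per-batch overhead."""
    return min(1.0, fixed_overhead_ms / batch_interval_ms)


# Hypothetical fixed cost per micro-batch: scheduling tasks, reconnecting, reloading state.
FIXED_OVERHEAD_MS = 200

for interval_ms in (10_000, 1_000, 500, 200):
    share = overhead_fraction(FIXED_OVERHEAD_MS, interval_ms)
    print(f"batch interval {interval_ms:>6} ms -> {share:.0%} of the time is overhead")
```

Under this assumption, a 10-second interval loses 2% of its time to overhead, while a 200 ms interval leaves no time for useful work at all.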

Flink, on the other hand, uses streaming as a fundamental starting point and builds a batch solution on top of streaming, where a batch is basically a special case of a stream. With this approach, we don’t lose the benefit of our optimizations in batch mode--when a stream is finite, you can still do whatever optimization you’d like to do for batch processing.

Part 3: What is Blink?

Blink is a forked version of Flink that we have been maintaining to fit some of the unique requirements we have at Alibaba. At this point, Blink is running on a few different clusters, and each cluster has about 1000 machines, so large-scale performance is very important to us. Blink’s improvements generally cover two areas:

  • Making the Table API more complete so that we can have the same SQL for batch and streaming
  • A more robust YARN mode that’s still 100% compatible with Flink’s API and broader ecosystem

Table API

We first added support for user-defined functions to make it easy to bring our unique business logic into Flink. We also added a stream-to-stream join, which is a non-trivial task but relatively straightforward in Flink due to Flink’s first-class support for state. Next, we added a few different aggregations, the most interesting one probably being distinct_count, as well as windowing support. (Editor’s note: FLIP-11 covers a range of Table API and SQL improvements for Flink related to the features listed above and is recommended reading for anyone interested in the topic.)

Next, we’ll cover runtime improvements, which we can break into four separate categories.
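As an aside on why state makes the stream-to-stream join mentioned above tractable, here is a minimal sketch of an inner equi-join that buffers both inputs in keyed state. The class and method names are hypothetical; this is neither Blink's nor Flink's implementation, and it omits the state expiry that any real unbounded join needs.

```python
from collections import defaultdict


class StreamJoin:
    """Inner equi-join of two streams, buffering each side in per-key state."""

    def __init__(self):
        self.left_state = defaultdict(list)   # key -> left values seen so far
        self.right_state = defaultdict(list)  # key -> right values seen so far

    def on_left(self, key, value):
        """Store a left record and emit matches against buffered right records."""
        self.left_state[key].append(value)
        return [(key, value, r) for r in self.right_state.get(key, [])]

    def on_right(self, key, value):
        """Store a right record and emit matches against buffered left records."""
        self.right_state[key].append(value)
        return [(key, l, value) for l in self.left_state.get(key, [])]


join = StreamJoin()
output = []
output += join.on_left("r1", "impression")   # nothing to match yet
output += join.on_right("r1", "click")       # matches the buffered impression
output += join.on_right("r2", "click")       # arrives before its impression
output += join.on_left("r2", "impression")   # matches the buffered click
print(output)
```

Because each side is kept in state, a match is produced regardless of which stream delivers its record first.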

Blink on YARN

When we started our project, Flink supported 2 cluster modes: standalone mode and Flink on YARN. In YARN mode, a job couldn’t request and release resources dynamically and instead needed to grab all required resources up front. Different jobs might also share the same JVM process, which favored resource utilization over resource isolation. Blink includes an architecture where every job has its own JobMaster to request and release resources as the job requires. Different jobs can’t run in the same Java process, which yields the best isolation between jobs and tasks. The Alibaba team is currently working with the Flink community to contribute this work back to the open source project, and the improvements are captured in FLIP-6 (which extends to other cluster managers in addition to YARN).

Operator Rescale

In production, our clients might need to change the parallelism of operators, but at the same time, they don’t want to lose state. When we started working on Blink, Flink did not support changing the parallelism of operators while maintaining state. Blink introduced the concept of “buckets” as the unit of state management. There are many more buckets than tasks, and each task will be assigned multiple buckets. When the parallelism changes, we’ll reassign buckets to tasks. Using this method, it’s possible to change the parallelism of operators and maintain state.
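The bucket idea can be sketched in a few lines of Python. This is an illustration of the general technique only: the bucket count, the CRC32 hash, and the contiguous-range assignment are assumptions made for the example, not details of Blink's implementation.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 128  # hypothetical; fixed for the lifetime of the job


def bucket_of(key: str) -> int:
    """Map a key to a bucket with a stable hash that never depends on parallelism."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def assign_buckets(parallelism: int) -> dict:
    """Give each task a contiguous range of buckets: {task_index: [bucket, ...]}."""
    assignment = {task: [] for task in range(parallelism)}
    for bucket in range(NUM_BUCKETS):
        assignment[bucket * parallelism // NUM_BUCKETS].append(bucket)
    return assignment


# State is stored per bucket, not per task.
state_by_bucket = defaultdict(dict)
for key, clicks in [("shoes", 3), ("phone", 7), ("book", 1)]:
    state_by_bucket[bucket_of(key)][key] = clicks


def state_for_task(assignment: dict, task: int) -> dict:
    """Collect the state a task owns under the given assignment."""
    owned = {}
    for bucket in assignment[task]:
        owned.update(state_by_bucket.get(bucket, {}))
    return owned


def total_state(parallelism: int) -> dict:
    """Union of all tasks' state; it should not depend on the parallelism."""
    assignment = assign_buckets(parallelism)
    merged = {}
    for task in range(parallelism):
        merged.update(state_for_task(assignment, task))
    return merged


# Rescaling from 2 tasks to 5 only changes which task owns which bucket.
print(total_state(2) == total_state(5))
```

Since a key's bucket never changes, rescaling moves whole buckets between tasks and no per-key state has to be re-hashed or split.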

(Editor’s note: the Flink community has concurrently solved this issue for Flink 1.2, and the feature is available in the latest version of the master branch. Flink’s notion of “key groups” is largely equivalent to the “buckets” mentioned above, but the implementation differs slightly in how the data structures back these buckets. For more information, check out FLINK-3755 in Jira.)

Incremental Checkpointing

In Flink, checkpointing happens in 2 stages: taking a snapshot of state locally, then persisting that snapshot to HDFS (or another storage system). With each checkpoint, the entire snapshot of state is written to HDFS. Our state was too large for this approach to be acceptable, so Blink stores only the state modified since the previous checkpoint in HDFS, which has greatly improved checkpointing efficiency. This modification enabled us to use large state in production.
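The difference between full and incremental checkpoints can be illustrated with a small sketch that tracks which keys changed since the last checkpoint and persists only those. The class below is hypothetical: a Python list stands in for HDFS, deletions are not handled, and it does not reflect how Blink actually stores or compacts its deltas.

```python
class IncrementalCheckpointer:
    """Keeps key/value state and persists only the entries changed since the last checkpoint."""

    def __init__(self):
        self.state = {}
        self.dirty = set()        # keys modified since the last checkpoint
        self.durable_store = []   # stand-in for HDFS: one delta per checkpoint

    def put(self, key, value):
        self.state[key] = value
        self.dirty.add(key)

    def checkpoint(self) -> int:
        """Persist only modified entries; return how many entries were written."""
        delta = {key: self.state[key] for key in self.dirty}
        self.durable_store.append(delta)
        self.dirty.clear()
        return len(delta)

    def restore(self) -> dict:
        """Rebuild the full state by replaying deltas in checkpoint order."""
        restored = {}
        for delta in self.durable_store:
            restored.update(delta)
        return restored


cp = IncrementalCheckpointer()
cp.put("a", 1)
cp.put("b", 2)
first = cp.checkpoint()    # writes both entries
cp.put("b", 3)
second = cp.checkpoint()   # writes only the changed entry
print(first, second, cp.restore())
```

The trade-off in this simple scheme is that recovery has to replay every delta, so a production system would also need some way to merge or compact old deltas.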

Asynchronous I/O

The production bottleneck for many of our jobs is accessing external storage like HBase. To solve this problem, we introduced Asynchronous I/O, which we’re working to contribute to the community and which is described in detail in FLIP-12. (Editor’s note: data Artisans thinks that FLIP-12 is substantial enough to have its own, separate writeup at some point in the near future. So we’ll only briefly introduce the idea here, and for the time being, you should check out the FLIP writeup if you’d like to learn more. At the time of publishing, the code has already been contributed to Flink.)
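The general idea behind asynchronous I/O, keeping many external lookups in flight instead of waiting for each one in turn, can be shown with Python's `asyncio`. This is an illustration of the concept only and not the API proposed in FLIP-12; `lookup` is a hypothetical stand-in for a remote store such as HBase, and the 50 ms latency is made up.

```python
import asyncio
import time


async def lookup(key: str) -> str:
    """Stand-in for a remote lookup; sleeps to simulate network latency."""
    await asyncio.sleep(0.05)
    return f"value-for-{key}"


async def enrich_one_at_a_time(keys):
    """Each record waits for the previous lookup to finish."""
    return [await lookup(k) for k in keys]


async def enrich_async(keys, max_in_flight: int = 10):
    """Keep up to max_in_flight lookups outstanding; results stay in input order."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(key):
        async with semaphore:
            return await lookup(key)

    return await asyncio.gather(*(bounded(k) for k in keys))


keys = [f"product-{i}" for i in range(20)]

start = time.perf_counter()
sequential = asyncio.run(enrich_one_at_a_time(keys))
sequential_s = time.perf_counter() - start

start = time.perf_counter()
concurrent = asyncio.run(enrich_async(keys))
concurrent_s = time.perf_counter() - start

print(f"one at a time: {sequential_s:.2f}s, async: {concurrent_s:.2f}s")
```

The cap on in-flight requests matters in practice: without it, a burst of records could overwhelm the external store.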

Part 4: What’s next for Flink at Alibaba?

We’ll continue to optimize our streaming jobs, specifically by handling temporary skew and slow machines better without negating the positive aspects of backpressure, and by recovering from failure faster. As was discussed by a number of different speakers at Flink Forward, we believe that Flink has great potential as a batch processor as well as a stream processor. We’re working to fully leverage Flink’s batch processing capabilities and hope to have a Flink batch mode in production in a couple of months.

Another popular conversation topic from the conference was streaming SQL, and we’re continuing to add further SQL support and Table API support in Flink. Alibaba’s business also continues to grow, meaning that our jobs get larger and larger, so it becomes increasingly important to make sure we can scale to even larger clusters. Most importantly, we look forward to continued collaboration with the community in order to contribute our work back to the open source project so that all Flink users can benefit from the work we’ve put into Blink. We look forward to updating you on our progress at Flink Forward 2017.
