Amundsen — Lyft’s data discovery & metadata engine
Reposted from: https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9
To increase the productivity of data scientists and research scientists at Lyft, we developed a data discovery application built on top of a metadata engine. The project, code-named Amundsen (after the Norwegian explorer Roald Amundsen), improves the productivity of our data users by providing a search interface for data, which looks something like this:
The problem
Data in our world has grown 40x over the last 10 years — see the chart below from United Nations Economic Commission for Europe (UNECE).
Data Growth Predictions. Source: UNECE. 2013
Unprecedented growth in Data volumes has led to 2 big challenges:
- Productivity — Whether it’s building a new model, instrumenting a new metric, or doing adhoc analysis, how can I most productively and effectively make use of this data?
- Compliance — When collecting data about a company’s users, how do organizations comply with increasing regulatory and compliance demands and uphold the trust of their users?
The key to solving these problems lies not in data, but in the metadata. And, to show you how, let’s go through a journey of how we solved a part of the productivity problem at Lyft using metadata.
Metadata is the holy grail of future applications
At its core, metadata is
a set of data that describes and gives information about other data.
There are 2 parts to metadata — a (usually smaller) set of data that describes another (usually larger) set of data.
1. A describing set of data — ABC¹ of metadata
Three broad types of metadata fit in this category:
Application Context — information needed by humans or applications to operate. This includes existence of data, and description, semantics, tags associated with the data.
Behavior — information about how the data is created and used over time. This includes information about ownership, creation, common usage patterns, people or processes that are frequent users, provenance and lineage.
Change — information about how the data is changing over time. This captures information about evolution of data (for example, schema evolution for a table) and the processes that create it (for example, the related ETL code for a table).
Capturing these three kinds of metadata and using them to drive applications is key to many applications of the future. The term “ABCs of metadata” is adopted from a paper on Ground by Joe Hellerstein, Vikram Sreekanti et al.
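To make the three categories concrete, here is a minimal sketch of how the ABCs might be modeled for a single table. The class and field names, the table `core.rides` and the owner address are all illustrative assumptions; they are not Amundsen's actual data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ApplicationContext:
    """'A': what humans or applications need in order to use the data."""
    name: str
    description: str = ""
    tags: List[str] = field(default_factory=list)


@dataclass
class Behavior:
    """'B': how the data is created and used over time."""
    owners: List[str] = field(default_factory=list)
    frequent_users: List[str] = field(default_factory=list)
    upstream_tables: List[str] = field(default_factory=list)  # lineage


@dataclass
class Change:
    """'C': how the data, and the code that produces it, evolve."""
    schema_versions: List[str] = field(default_factory=list)
    etl_code_refs: List[str] = field(default_factory=list)


@dataclass
class TableMetadata:
    context: ApplicationContext
    behavior: Behavior = field(default_factory=Behavior)
    change: Change = field(default_factory=Change)


# Hypothetical example table; every value here is made up for illustration.
rides = TableMetadata(
    context=ApplicationContext(
        name="core.rides",
        description="One row per completed ride.",
        tags=["core"],
    ),
)
rides.behavior.owners.append("data-platform@example.com")
rides.change.schema_versions.append("v1: ride_id, rider_id, fare")
```

Keeping the three groups as separate objects mirrors the fact that they usually come from different sources: people enter context by hand, while behavior and change can be derived from logs and version history.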
2. The data being described
Now let’s talk about what data is being described by the ABCs above. The short answer is any data within your organization. This includes, but is not limited to:
- Data Stores — tables, schemas, documents of structured data stores like Hive, Presto, MySQL, as well as unstructured data stores (like S3, Google Cloud Storage, etc.)
- Dashboards/reports — saved queries, reports and dashboards in BI/reporting tools like Tableau, Looker, Apache Superset, etc.
- Events/Schemas — Events and schemas stored in schema registries or tools like Segment.
- Streams — Streams/topics in Apache Kafka, AWS Kinesis, etc.
- Processing — ETL jobs, ML workflows, streaming jobs, etc.
- People — I don’t mean a software stack, I mean good old people like you and me who carry data in our head and in our organizational structure, so information like name, team, title, data resources frequently used, data resources bookmarked are all important pieces of information in this category.
This exact metadata can be used to make data users more productive by putting the relevant metadata at their fingertips.
Productivity
At the 50,000-foot level, the data scientist workflow looks like the following.
Typical Data Science Workflow. Source: Harvard Data Science Course - CS109
At Lyft, we observed that while we wanted the majority of the time to be spent on model development (aka prototyping) and productionalization, a lot of the time was being spent on data discovery.
Time spent in Data Science workflow
Data discovery includes finding the answer to questions like:
- Does this data exist? Where is it? What is the source of truth of that data? Do I have access to it?
- Who and/or which team is the owner? Who are the common users?
- Is there existing work I can re-use?
- Can I trust this data?
If they sound familiar, we feel you.
The idea for Amundsen was inspired a lot by search engines like Google — in fact, we often think of it as “Search for data” within the organization.
What we present below are mocks with fake data, to give you a sense for what using Amundsen feels like.
Landing page:
The entry point for the experience is a search box where you can type plain English to search for data, e.g. “election results” or “users”. If you don’t know what you are searching for, we present a list of popular tables in the organization for you to browse.
Search ranking:
Once you enter your search term, you are shown search results as follows.
The results show some in-line metadata: a description of the table as well as the date when the table was last updated. Results are chosen by fuzzy matching the entered text against a few metadata fields: table name, column name, table description and column descriptions. Search ranking uses an algorithm similar to PageRank, whereby highly queried tables show up higher, while those queried less show up later in the search results.
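The matching and ranking described above can be sketched roughly as follows. This is a simplified stand-in, not Amundsen's implementation: it uses Python's standard-library `difflib.SequenceMatcher` for fuzzy matching and sorts by a plain query count instead of a PageRank-style score. The `Table` class, the `query_count` field, the 0.8 threshold and the sample catalog are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List


@dataclass
class Table:
    name: str
    description: str = ""
    columns: List[str] = field(default_factory=list)
    column_descriptions: List[str] = field(default_factory=list)
    query_count: int = 0  # assumed popularity signal, e.g. from query logs


def _words(text: str) -> List[str]:
    """Lower-case and split on whitespace, dots and underscores."""
    return text.lower().replace("_", " ").replace(".", " ").split()


def fuzzy_score(term: str, text: str) -> float:
    """Average, over query words, of the best similarity to any word in text."""
    query_words, text_words = _words(term), _words(text)
    if not query_words or not text_words:
        return 0.0
    best = [
        max(SequenceMatcher(None, q, w).ratio() for w in text_words)
        for q in query_words
    ]
    return sum(best) / len(best)


def search(term: str, tables: List[Table], threshold: float = 0.8) -> List[Table]:
    """Keep tables with a close match in any field; most queried first."""
    hits = []
    for table in tables:
        fields = [table.name, table.description,
                  *table.columns, *table.column_descriptions]
        if max(fuzzy_score(term, f) for f in fields) >= threshold:
            hits.append(table)
    return sorted(hits, key=lambda t: t.query_count, reverse=True)


# Made-up catalog for illustration.
catalog = [
    Table("tmp.user_backup", "Backup of users", query_count=5),
    Table("core.rides", "Completed rides", ["ride_id", "fare"], query_count=5000),
    Table("core.users", "All registered users", ["user_id"], query_count=900),
]
print([t.name for t in search("users", catalog)])
```

Matching and ranking are kept as two separate steps: the fuzzy score only decides whether a table qualifies, and popularity alone decides the order, so a heavily used table is not pushed down by a slightly weaker text match.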
Detail page:
Once you have selected a result of choice, you get to the detail page which looks like below.
The detail page shows the name of the table along with its manually curated description. The column list, with descriptions, follows. A special blue arrow by a column indicates that it’s a popular column, thereby encouraging users to use it. On the right-hand pane, you see information about the Behavior of the table: who the owner is, who the frequent users are, and a general profile of the data showing how the count of records in the table is changing over time. You also see the tags associated with the table.
Information like descriptions and tags is manually entered by our users, while information like popular users is generated automatically by mining the audit logs.
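As a rough illustration of deriving frequent users from audit logs, the sketch below counts queries per user for one table. The record shape, a `(user, table)` pair per query, is an assumption; real audit logs would first need to be parsed into something like it.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed audit-log shape: one (user, table) pair per executed query.
AuditRecord = Tuple[str, str]


def frequent_users(records: Iterable[AuditRecord], table: str,
                   top_n: int = 3) -> List[str]:
    """Return the users who queried `table` most often, most frequent first."""
    counts = Counter(user for user, tbl in records if tbl == table)
    return [user for user, _ in counts.most_common(top_n)]


# Made-up log entries for illustration.
log = [
    ("alice", "core.rides"),
    ("bob", "core.rides"),
    ("alice", "core.rides"),
    ("carol", "core.users"),
    ("alice", "core.rides"),
    ("bob", "core.rides"),
    ("dave", "core.rides"),
]
print(frequent_users(log, "core.rides", top_n=2))
```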
The bottom of the same page contains a widget for users to leave us any feedback they may have.
Feedback widget
Clicking on a column reveals more stats about that column like below.
In the above, for the integer column, the stats show the count of records, null count, zero count, min, max, and average value over the last day of data, so data scientists can start to understand the shape of the data.
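The stats listed above are simple aggregates. A minimal sketch for one integer column, assuming the day's values are already loaded into a Python list with `None` standing in for NULL (in practice these would more likely be computed by a query against the table):

```python
from typing import Dict, List, Optional, Union

Number = Union[int, float]


def column_stats(values: List[Optional[int]]) -> Dict[str, Optional[Number]]:
    """Compute basic profile stats for an integer column; None means NULL."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "zero_count": sum(1 for v in non_null if v == 0),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "avg": sum(non_null) / len(non_null) if non_null else None,
    }


stats = column_stats([4, 0, None, 10, 6])
print(stats)
```

Min, max and average are taken over non-null values only, and are reported as `None` when the column has no non-null values at all.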
Lastly, the table detail page also contains a preview button. If you have access to the underlying data, it shows a preview from the latest daily partition of the data, like below.
Some trade offs
Discovery vs. Curation
We often have to strike a balance between discovery and curation. For example, if your organization had only a small number of data sets, each manually crafted by a set of Data Engineers, with every table well named under a well-defined schema, every field appropriately named, and the schema evolving in sync with the business, then you would have little need for discovery.
However, if you live in an organization that grew too fast, with lots of data, it’s unlikely that curation and best practices for schema design on their own are going to make your users productive.
Our approach combines both: a discovery (aka search) system, together with some best practices for naming and describing schemas, tables and fields.
Security vs democratization
Another important balance to strike is between security and democratization. Discovery platforms like the one described above democratize the discovery of data to everyone in the organization, while the Security & Privacy team has a mission to protect and safeguard sensitive data across the organization. The question then becomes how do you balance these two seemingly competing needs?
Our approach is to divide the metadata into a few categories and give different access to each of the categories. A good way of doing so is:
1. Existence and other fundamental metadata (like name and description of table and fields, owners, last updated, etc.)
This metadata is available to everyone, whether or not they have access to the data. The reason is that, to be productive, you need to know whether such a data set exists and whether it is what you are looking for. Ideally you can judge the fit from this fundamental metadata and, if it is what you are looking for, request access. The rare exception is when the mere existence of a table or a field reveals privileged information, like the countries you operate in; in that case, it’s better to fix the data model or security model than to rely on security by obscurity.
2. Richer metadata (like column stats, preview)
This metadata is only available to users who have access to the data. This is because these stats may reveal sensitive information to users, and hence should be considered privileged.
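The two tiers can be sketched as a simple filter over a table's metadata. The field names and the `users_with_access` set are illustrative assumptions; a real system would take the access decision from its authorization layer rather than from a hard-coded set.

```python
from typing import Any, Dict, Set

# Tier 1: visible to everyone in the organization.
FUNDAMENTAL_FIELDS = {"name", "description", "owners", "last_updated"}
# Tier 2: visible only to users who can already read the underlying data.
RESTRICTED_FIELDS = {"column_stats", "preview"}


def visible_metadata(metadata: Dict[str, Any], user: str,
                     users_with_access: Set[str]) -> Dict[str, Any]:
    """Return only the metadata fields this user is allowed to see."""
    allowed = set(FUNDAMENTAL_FIELDS)
    if user in users_with_access:
        allowed |= RESTRICTED_FIELDS
    return {key: value for key, value in metadata.items() if key in allowed}


# Made-up table metadata for illustration.
table = {
    "name": "core.rides",
    "description": "One row per completed ride.",
    "owners": ["data-platform"],
    "column_stats": {"fare": {"min": 0, "max": 250}},
    "preview": [{"ride_id": 1, "fare": 12}],
}
readers = {"alice"}

print(sorted(visible_metadata(table, "bob", readers)))
print(sorted(visible_metadata(table, "alice", readers)))
```

The filter works from an allow-list, so any field not explicitly placed in a tier stays hidden by default.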
Future
Amundsen has been very successful at Lyft, with a high adoption rate and Customer Satisfaction (CSAT) score. It has driven the time to discover an artifact down to 5% of the pre-Amundsen baseline. Users can now discover more data in a shorter time, and with a higher degree of trust.
The future as we see it lies in nailing down productivity even further by adding more features, but more importantly in unlocking a new use-case through all the great metadata already available in Amundsen — the use-case of compliance.
Compliance
While GDPR and newer privacy laws like the California Consumer Privacy Act (CCPA) affect the treatment of data in many ways, their provision of user data rights is one of the most impactful. Organizations must find ways to comply when users exercise these rights, such as the rights to access, correct and delete certain data.
These privacy laws typically provide for certain exceptions, such as the ability to keep certain information due to legal obligations, even in the face of a deletion request. Thus far, organizations have taken a variety of approaches to becoming compliant. Some have established manual processes for resolving the data service requests that come in, while others have quarantined personal data in one location/database, so that user rights management becomes easier.
However, those methods may fail to scale, both as the organization, its data and the use-cases on that data grow, and as the number of incoming data service requests grows.
One approach that scales is the one powered by metadata. It’s the approach where a tool like Amundsen is used to store and tag all personal data within the organization. Such a metadata-powered solution can help an organization remain compliant as the data and its use-cases or service requests grow.
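A minimal sketch of the tagging idea: if every column holding personal data carries a tag, a data service request can start from a lookup instead of a manual hunt. The tag name `personal_data`, the in-memory dictionary and the table and column names are assumptions for illustration; this post does not describe a compliance feature in Amundsen at this level of detail.

```python
from typing import Dict, List, Set, Tuple

Column = Tuple[str, str]  # (table name, column name)
TagCatalog = Dict[Column, Set[str]]


def tag_column(catalog: TagCatalog, table: str, column: str, tag: str) -> None:
    """Attach a tag to a column, creating its entry if needed."""
    catalog.setdefault((table, column), set()).add(tag)


def columns_with_tag(catalog: TagCatalog, tag: str) -> List[Column]:
    """List every column carrying the tag, in a stable sorted order."""
    return sorted(col for col, tags in catalog.items() if tag in tags)


# Made-up tags for illustration.
tags: TagCatalog = {}
tag_column(tags, "core.users", "email", "personal_data")
tag_column(tags, "core.rides", "pickup_address", "personal_data")
tag_column(tags, "core.rides", "fare", "finance")

# The columns a deletion or access request would need to visit.
print(columns_with_tag(tags, "personal_data"))
```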
Productivity
Currently we integrate with Hive, Presto and any other systems that integrate with the Hive metastore (e.g. Apache Impala, Spark, etc.). These are the upcoming items in our roadmap:
- Add people to Amundsen’s data graph by integrating with HR systems like Workday. Show commonly used and bookmarked data assets.
- Add dashboards and reports (e.g. Tableau, Looker, Apache Superset) to Amundsen.
- Add support for lineage across disparate data assets like dashboards and tables.
- Add events/schemas (e.g. schema registry) to Amundsen.
- Add streams (e.g. Apache Kafka, AWS Kinesis) to Amundsen.
Conclusion
With large amounts of data, success in using data to its fullest lies not in the data but in the metadata. Lyft has built a data discovery platform, Amundsen, which has worked really well in improving the productivity of its data scientists through faster data discovery.
At the same time, there’s a lot of value a metadata driven solution can provide in the space of compliance, in tracking personal data across the entire data infrastructure. We should expect a lot of investment in that area in the future.
Stay tuned for an upcoming blog post detailing the architecture of the data discovery application and the metadata engine that powers it!
Thanks to Max Beauchemin, Andrew Stahlman, Beto Dealmeida for reviewing the post.
Thanks to the engineers who made it possible (in alphabetical order): Alagappan Sethuraman, Daniel Won, Jin Chang, Tamika Tannis, Tao Feng, to Matt Spiel for design, and to the engineering and product leadership of Shenghu Yang and Philippe Mizrahi.