Amundsen — Lyft’s data discovery & metadata engine

Reposted from: https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9

To increase the productivity of data scientists and research scientists at Lyft, we developed a data discovery application built on top of a metadata engine. Code-named Amundsen (after the Norwegian explorer Roald Amundsen), it improves the productivity of our data users by providing a search interface for data, which looks something like this:

The problem

Data in our world has grown 40x over the last 10 years; see the chart below from the United Nations Economic Commission for Europe (UNECE).

Data Growth Predictions. Source: UNECE. 2013

Unprecedented growth in data volumes has led to two big challenges:

Productivity: Whether it’s building a new model, instrumenting a new metric, or doing ad hoc analysis, how can I most productively and effectively make use of this data?
Compliance: When collecting data about a company’s users, how do organizations comply with increasing regulatory and compliance demands and uphold the trust of their users?

The key to solving these problems lies not in data, but in the metadata. And, to show you how, let’s go through a journey of how we solved a part of the productivity problem at Lyft using metadata.

Metadata is the holy grail of future applications

At its core, metadata is

a set of data that describes and gives information about other data.

There are 2 parts to metadata — a (usually smaller) set of data that describes another (usually larger) set of data.

1. A describing set of data — ABC¹ of metadata

Three broad types of metadata fit in this category:

Application Context — information needed by humans or applications to operate. This includes existence of data, and description, semantics, tags associated with the data.

Behavior — information about how the data is created and used over time. This includes information about ownership, creation, common usage patterns, people or processes that are frequent users, provenance and lineage.

Change — information about how the data is changing over time. This captures information about evolution of data (for example, schema evolution for a table) and the processes that create it (for example, the related ETL code for a table).

Capturing these three kinds of metadata and using them to drive applications is key to many applications of the future. The term “ABCs of metadata” is adopted from a paper on Ground by Joe Hellerstein, Vikram Sreekanti et al.
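To make the three categories concrete, here is a minimal sketch of how the ABCs might be modeled for a single table. The class and field names are hypothetical and chosen for illustration; they are not Amundsen's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class ApplicationContext:
    """A: what the data is (existence, description, tags)."""
    name: str
    description: str = ""
    tags: list[str] = field(default_factory=list)


@dataclass
class Behavior:
    """B: how the data is created and used over time."""
    owners: list[str] = field(default_factory=list)
    frequent_users: list[str] = field(default_factory=list)
    upstream_tables: list[str] = field(default_factory=list)  # lineage


@dataclass
class Change:
    """C: how the data and the code that produces it evolve."""
    schema_versions: list[dict[str, str]] = field(default_factory=list)
    etl_code_ref: str = ""  # hypothetical pointer to the related ETL code


@dataclass
class TableMetadata:
    context: ApplicationContext
    behavior: Behavior = field(default_factory=Behavior)
    change: Change = field(default_factory=Change)


# Example: describe a table, then record one version of its schema.
rides = TableMetadata(
    context=ApplicationContext("rides", "One row per ride", ["core"]),
    behavior=Behavior(owners=["data-platform"]),
)
rides.change.schema_versions.append({"ride_id": "bigint"})
```

Keeping the three categories as separate records reflects that they tend to come from different sources: context is often entered by people, while behavior and change can be derived from logs and version history.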

2. The data being described

Now let’s talk about what data is being described by the ABCs above. The short answer is any data within your organization. This includes, but is not limited to:

Data Stores — tables, schemas, documents of structured data stores like Hive, Presto, MySQL, as well as unstructured data stores (like S3, Google Cloud Storage, etc.)
Dashboards/reports — saved queries, reports and dashboards in BI/reporting tools like Tableau, Looker, Apache Superset, etc.
Events/Schemas — Events and schemas stored in schema registries or tools like Segment.
Streams — Streams/topics in Apache Kafka, AWS Kinesis, etc.
Processing — ETL jobs, ML workflows, streaming jobs, etc.
People — I don’t mean a software stack, I mean good old people like you and me who carry data in our head and in our organizational structure, so information like name, team, title, data resources frequently used, data resources bookmarked are all important pieces of information in this category.

This same metadata can make data users more productive by putting the relevant information at their fingertips.

Productivity

At the 50,000-foot level, the data scientist workflow looks like the following.

Typical Data Science Workflow. Source: Harvard Data Science Course - CS109

At Lyft, we observed that while we wanted the majority of the time to be spent on model development (aka prototyping) and productionalization, a lot of it was being spent on data discovery.

Time spent in Data Science workflow

Data discovery includes finding the answer to questions like:

Does this data exist? Where is it? What is the source of truth of that data? Do I have access to it?
Who and/or which team is the owner? Who are the common users?
Is there existing work I can re-use?
Can I trust this data?

If these questions sound familiar, we feel you.

The idea for Amundsen was inspired a lot by search engines like Google — in fact, we often think of it as “Search for data” within the organization.

What we present below are mocks with fake data, to give you a sense for what using Amundsen feels like.

Landing page:

The entry point for the experience is a search box where you can type plain English to search for data, e.g. “election results” or “users”. If you don’t know what you are searching for, we present a list of popular tables in the organization for you to browse.

Search ranking:

Once you enter your search term, you are shown search results as follows.

The results show some in-line metadata: a description of the table as well as the date when the table was last updated. These results are chosen by fuzzy matching the entered text against a few metadata fields: table name, column names, table description and column descriptions. Search ranking uses an algorithm similar to PageRank, whereby highly queried tables show up higher, while those queried less show up later in the search results.
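As a rough illustration of the two ingredients described above (fuzzy matching against metadata fields, then ranking by how heavily a table is queried), here is a minimal sketch. It is not PageRank and not Lyft's implementation: the scoring blend, the `min_match` threshold, and the `query_count` field are all assumptions made for the example.

```python
import math
from difflib import SequenceMatcher


def fuzzy_score(term: str, text: str) -> float:
    """Best similarity (0..1) between the term and any word in text."""
    term = term.lower()
    words = text.lower().replace("_", " ").split()
    return max((SequenceMatcher(None, term, w).ratio() for w in words), default=0.0)


def search(term, tables, min_match=0.7):
    """Return names of matching tables, most relevant first."""
    scored = []
    for t in tables:
        # Match against table name, table description and column names.
        fields = [t["name"], t["description"], *t["columns"]]
        match = max(fuzzy_score(term, f) for f in fields)
        if match >= min_match:
            # Hypothetical blend: text match weighted by log-scaled query count.
            scored.append((match * math.log1p(t["query_count"]), t["name"]))
    return [name for _, name in sorted(scored, reverse=True)]


tables = [
    {"name": "users", "description": "All registered users",
     "columns": ["user_id"], "query_count": 900},
    {"name": "users_tmp", "description": "Scratch copy of users",
     "columns": ["user_id"], "query_count": 3},
    {"name": "rides", "description": "One row per ride",
     "columns": ["ride_id"], "query_count": 500},
]
print(search("user", tables))
```

Both `users` and `users_tmp` match the term equally well, so the heavily queried `users` table is listed first, while `rides` falls below the match threshold and is left out.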

Detail page:

Once you select a result, you get to the detail page, which looks like the one below.

The detail page shows the name of the table along with its manually curated description. The column list, along with descriptions, follows. A special blue arrow by a column indicates that it is a popular column, thereby encouraging users to use it. In the right-hand pane, you see information about the Behavior of the table: who the owner is, who the frequent users are, and a general profile of the data showing how the count of records in the table is changing over time. You also see the tags associated with the table.

Information like descriptions and tags is manually entered by our users, while information like frequent users is generated automatically by mining the audit logs.
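Deriving frequent users from audit logs can be sketched as a simple count per table. The log format below (one entry per query, with a `user` and the list of `tables` it touched) is a hypothetical simplification, not the format of any particular query engine's audit log.

```python
from collections import Counter, defaultdict


def frequent_users(audit_log, top_n=3):
    """Map each table to its top_n users by number of logged queries."""
    counts = defaultdict(Counter)
    for entry in audit_log:
        for table in entry["tables"]:
            counts[table][entry["user"]] += 1
    return {
        table: [user for user, _ in users.most_common(top_n)]
        for table, users in counts.items()
    }


audit_log = [
    {"user": "alice", "tables": ["rides", "users"]},
    {"user": "bob", "tables": ["rides"]},
    {"user": "alice", "tables": ["rides"]},
]
print(frequent_users(audit_log))
```

Here `alice` queried `rides` twice and `bob` once, so `alice` is listed first for that table.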

The bottom of the same page contains a widget for users to leave us any feedback they may have.

Feedback widget

Clicking on a column reveals more stats about that column, as shown below.

In the above, for the integer column, the stats show the count of records, null count, zero count, min, max, and average value over the last day of data, so data scientists can start to understand the shape of the data.
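The stats listed above are straightforward to compute. The sketch below does so over an in-memory list standing in for one day of values from an integer column, with `None` representing NULL; the function name and the choice to exclude NULLs from min, max and average are assumptions for illustration.

```python
def integer_column_stats(values):
    """Summarize one integer column; None stands for SQL NULL."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(present),
        "zero_count": sum(1 for v in present if v == 0),
        "min": min(present, default=None),
        "max": max(present, default=None),
        "avg": sum(present) / len(present) if present else None,
    }


stats = integer_column_stats([4, 0, None, 7, 1])
print(stats)
```

In practice these numbers would more likely come from an aggregate query against the latest partition than from values loaded into memory.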

Lastly, the table detail page also contains a preview button which, if you have access to the underlying data, shows you a preview from the latest daily partition of the data, like below.

Some trade-offs

Discovery vs. Curation

We often have to strike a balance between discovery and curation. For example, if your organization had only a small number of data sets, and each of them was manually crafted by a set of Data Engineers, and each table was well named, under a well defined schema, each field appropriately named and the schema evolved in sync with how the business evolved, then your need for discovery may not be as much in such a world.

However, if you live in an organization that grew too fast, with lots of data, it’s unlikely that curation and best practices for schema design on their own are going to make your users productive.

Our approach is a combination of both: a discovery (aka search) system, together with some best practices for the names and descriptions of schemas, tables and fields.

Security vs. Democratization

Another important balance to strike is between security and democratization. Discovery platforms like the one described above democratize the discovery of data to everyone in the organization, while the Security & Privacy team has a mission to protect and safeguard sensitive data across the organization. The question then becomes how do you balance these two seemingly competing needs?

Our approach is to divide the metadata into a few categories and give different access to each of the categories. A good way of doing so is:

1. Existence and other fundamental metadata (like name and description of table and fields, owners, last updated, etc.)

This metadata is made available to everyone, whether or not they have access to the data. The reason is that in order to be productive, you need to know whether such a data set exists and whether it is what you are looking for. Ideally you can figure out the fit using this fundamental metadata and, if it is what you are looking for, request access. The rare exception is when the existence of a table or a field reveals some privileged information, like the countries you operate in, in which case it’s better to fix the data model or security model than to rely on security by obscurity.

2. Richer metadata (like column stats, preview)

This metadata is only available to users who have access to the data. This is because these stats may reveal sensitive information to users, and hence should be considered privileged.
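The two tiers can be sketched as a filter applied before metadata is returned to a user. The field names and the boolean access flag below are hypothetical; a real system would look up the user's access to the underlying data in its authorization layer.

```python
# Tier 1: visible to everyone. Tier 2: only with access to the data itself.
FUNDAMENTAL_FIELDS = {"name", "description", "columns", "owners", "last_updated"}
RICH_FIELDS = {"column_stats", "preview"}


def visible_metadata(table, user_has_data_access):
    """Return only the metadata fields this user is allowed to see."""
    allowed = FUNDAMENTAL_FIELDS | (RICH_FIELDS if user_has_data_access else set())
    return {k: v for k, v in table.items() if k in allowed}


table = {
    "name": "rides",
    "description": "One row per ride",
    "owners": ["data-platform"],
    "column_stats": {"fare_cents": {"min": 0, "max": 25000}},
    "preview": [{"ride_id": 1, "fare_cents": 1250}],
}
print(sorted(visible_metadata(table, user_has_data_access=False)))
print(sorted(visible_metadata(table, user_has_data_access=True)))
```

Using an allowlist means any field not explicitly assigned to a tier stays hidden by default, which is the safer failure mode when new kinds of metadata are added.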

Future

Amundsen has been very successful at Lyft, with a high adoption rate and Customer Satisfaction (CSAT) score. It has driven the time to discover an artifact down to 5% of the pre-Amundsen baseline. Users can now discover more data in a shorter time, and with a higher degree of trust.

The future as we see it lies in nailing down productivity even further by adding more features, but more importantly in unlocking a new use-case through all the great metadata already available in Amundsen — the use-case of compliance.

Compliance

While GDPR and newer privacy laws like the California Consumer Privacy Act (CCPA) affect the treatment of data in many ways, their provision of user data rights is one of the most impactful. Organizations must find ways to comply with the exercise of these rights, such as the rights to access, correct and delete certain data.

These privacy laws typically provide for certain exceptions, such as the ability to keep certain information due to legal obligations, even in the face of a deletion request. Thus far, organizations have taken a varied number of approaches to becoming compliant. Some have established manual processes for resolving the data service requests that come in, while others have gone and quarantined personal data in one location/database, so user rights management becomes easier.

However, those methods may fail to scale, both as the organization, its data and the use cases built on that data grow, and as the number of incoming data service requests grows.

One approach that scales is the one powered by metadata: a tool like Amundsen is used to store and tag all personal data within the organization. Such a metadata-powered solution can help an organization remain compliant as the data, its use cases and the service requests grow.
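As a sketch of what tag-driven compliance could look like, the example below keeps a small catalog of columns with tags and lists every location that a data service request (for example, a deletion request) would need to cover. The `Column` class and the `personal_data` tag name are hypothetical and are not part of Amundsen.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    table: str
    name: str
    tags: set[str] = field(default_factory=set)


def personal_data_locations(catalog, tag="personal_data"):
    """List (table, column) pairs that a data service request must cover."""
    return sorted((c.table, c.name) for c in catalog if tag in c.tags)


catalog = [
    Column("users", "email", {"personal_data"}),
    Column("users", "signup_date"),
    Column("rides", "pickup_address", {"personal_data"}),
]
print(personal_data_locations(catalog))
```

Because the lookup is driven by tags rather than a hand-maintained list of tables, newly tagged columns are picked up automatically as the catalog grows.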

Productivity

Currently we integrate with Hive, Presto and any other systems that integrate with the Hive metastore (e.g. Apache Impala, Spark, etc.). These are the upcoming items in our roadmap:

Add people to Amundsen’s data graph by integrating with HR systems like Workday. Show commonly used and bookmarked data assets.
Add dashboards and reports (e.g. Tableau, Looker, Apache Superset) to Amundsen.
Add support for lineage across disparate data assets like dashboards and tables.
Add events/schemas (e.g. schema registry) to Amundsen.
Add streams (e.g. Apache Kafka, AWS Kinesis) to Amundsen.

Conclusion

With large amounts of data, success in using data to its fullest lies not in the data but in the metadata. Lyft has built a data discovery platform, Amundsen, which has worked really well in improving the productivity of its data scientists through faster data discovery.

At the same time, there’s a lot of value a metadata driven solution can provide in the space of compliance, in tracking personal data across the entire data infrastructure. We should expect a lot of investment in that area in the future.

Stay tuned for an upcoming blog post detailing the architecture of the data discovery application and the metadata engine that powers it!

Thanks to Max Beauchemin, Andrew Stahlman, Beto Dealmeida for reviewing the post.

Thanks to the engineers who made it possible (in alphabetical order): Alagappan Sethuraman, Daniel Won, Jin Chang, Tamika Tannis, Tao Feng, to Matt Spiel for design, and to the engineering and product leadership of Shenghu Yang and Philippe Mizrahi.