My journey introducing the data build tool (dbt) in projects’ analytical stacks
Reposted from: https://www.lantrns.co/my-journey-introducing-the-data-build-tool-dbt-in-projects-analytical-stacks/
I’m not sure I remember how, but a few weeks ago I had the good luck to stumble upon posts from Tristan Handy where he mentioned a tool his team built, simply called the Data Build Tool (dbt). I immediately took an interest, as I was struggling with ETL (Extract, Transform, Load) at that moment, and had the extreme good fortune of having 2 clients who were willing to let me experiment with it, and even eventually use it for their own data warehouses.
Starting out with such a tool can sometimes be a bit confusing, as you’re not sure how and where to start. You can envision how dbt will fit in your overall analytical stack, but you might need a little bit of guidance to get things started. Although the documentation is really good, I feel like it’s always a good thing to read about someone else’s experimenting before starting my own.
So I decided to put together this little article that just goes over my own experience working with dbt, what I’ve picked up along the way, and how it all fits in our overall data engineering journey. I’m no power user of dbt and this may seem superficial to more experienced users, but the goal here is to help new users by sharing my own journey so far.
Getting started
As I mentioned, I have the good fortune to be working with clients who are really keen on adopting the technologies that will make their analytical stack more robust and scalable. And that was actually one problem I was facing with one project. What had been deployed so far worked perfectly for the needs that were specified, but as new requirements came in, it was obvious that our solution wasn’t scalable. We were using python scripts to do the ETL stuff, and that was neither agile, nor DRY (Don’t Repeat Yourself), nor fun. So we wanted to improve our ETL approach, and dbt just fell on our lap right there and then.
Another project is with a startup that just started investing into their analytical infrastructure. This is pretty rare, but here I was with the opportunity to propose and build an analytical stack from scratch. Of course, I wanted dbt to be part of it.
So this is how I started looking more seriously at what dbt had to offer, how it could be used in a production environment, how we could test it and how it could handle future needs.
What’s interesting with dbt is how, instead of doing traditional ETL, you’re really taking advantage of the power of cloud-based data warehouses to do ELT instead.
Instead of adopting a traditional ETL approach that requires the maintenance of scripts that do all 3 steps (extract, transform and load data towards the data warehouse), we are moving to an ELT approach where data is first moved (extract + load) to a data warehouse and transformation is done on the data warehouse directly.
That has the benefit of reducing the complexity of scripts (we are now only dealing with pure SQL scripts), maintenance (all models are simple select statements that are run through a tool called dbt – the data build tool) and integrity (once business logic is added within a model, all further models down the chain will always reuse that same business logic).
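To make that concrete, a dbt model is literally just a SELECT statement saved in a .sql file. Here’s a minimal, hypothetical sketch (the table and column names are made up for illustration) of the kind of “pure SQL script” I mean:

```sql
-- models/staging/stg_orders.sql
-- A dbt model is just a SELECT; dbt wraps it in the DDL needed
-- to materialize it as a view or table in the warehouse.
SELECT
    order_id,
    customer_id,
    CAST(order_total AS decimal(10, 2)) AS order_amount,
    CAST(ordered_at AS timestamp)       AS ordered_at
FROM raw.shop_orders
```

No loops, no orchestration code, no bespoke Python – the warehouse does the heavy lifting.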
What is dbt and how does it fit
I should first mention that dbt is the work of the good folks at Fishtown Analytics. I believe that dbt was built while the co-founders were working for another company and that they kinda kept dbt with them as they built their new company (I might be completely wrong on this one). But the point is that they developed that tool, made it publicly available and are building a great community around it. My hat’s off to them – we should all aspire to bring such excellent tools to our data engineering community.
So, what is dbt?
I like to think of it as SQL on steroids.
Essentially, what we do is that we are building SQL models on top of source tables (coming from your production database, cloud services, website events, etc.) which are then referred to by further models, acting as building blocks towards a final model that’s part of a mart within your data warehouse.
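In code, that “referring” happens through dbt’s ref() function. A hypothetical pair of chained models (names are made up) might look like this – dbt resolves the reference and figures out the build order for you:

```sql
-- models/transform/daily_transactions.sql
-- {{ ref() }} declares a dependency on the stg_transactions model;
-- dbt builds models in dependency order based on these references.
SELECT
    transaction_date,
    SUM(amount) AS total_amount
FROM {{ ref('stg_transactions') }}
GROUP BY transaction_date
```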
For example, let’s say I want to build a fact table for transactions. But transactions may be coming from different sources. I would build models that refer to one another until I get to that final table. Here’s an example taken from a project I’m working on…
dbt – an example of data modeling
That graphic above might seem like a nightmare, but there is method to that madness (I think, lol). Essentially, you’re looking at models on the left that refer to tables in the staging area, a bunch of transformations in the middle, and the final entity at the far right. We’ll talk more about this below, as well as how to generate that graph.
Community
Now that we have a high-level understanding of what dbt is and how it fits within your analytical stack, I think we should talk a little bit about its community, because it does make working with dbt such a positive experience.
We should mention first that dbt is totally open source and you can go play inside the beast if that’s your thing.
Once you’ve satisfied your curiosity, you should definitely consider joining the dbt community on Slack. This is a pretty active community composed of members who are really helpful to one another. And the creators of dbt are also always present and ready to help whenever you stumble upon a problem. But what’s pretty cool also is that the discussions will definitely make you a better data engineer, as some of the questions/answers and the resources shared do open up your horizons to many other facets on how to build a modern analytical stack.
One other resource that’s starting to pick up steam is dbt’s Discourse environment. This is where you can find some pretty thorough answers to commonly asked questions. Or at least that’s the idea. This is still pretty new, but I think there’s a lot of potential here, and there are already some gems living there.
And last but not least, dbt’s documentation. This will of course be a resource you refer to often as you start working with dbt, but also as you get more sophisticated with the tool. The documentation is solid and pretty much answers all the questions you might have in regards to using the different features of this tool.
So, all in all, I think that this community is as important to the success of dbt as the tool itself. You can always count on the documentation, the already answered questions in Discourse, or the availability of other community members on Slack, to point you in the right direction.
First Steps
Alright, now that we’ve completed that grand tour of dbt, it’s about time we get our hands dirty.
To get things started, I would definitely recommend going through that dbt tutorial provided in the documentation. You’ll learn how to install dbt and create a project.
There are also useful guides, which you can consult as you discover those aspects of dbt. But I would definitely recommend going through the “Configuring models” guide, as this is an essential skill for working with dbt.
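As a taste of what that guide covers: a model’s materialization can be set right inside the model file with a config() block. A hedged sketch (whether you want a view or a table here depends entirely on your project):

```sql
-- Any model can override its materialization inline:
{{ config(materialized='table') }}

-- The rest of the file is the usual SELECT statement.
SELECT *
FROM {{ ref('stg_eventbrite_registrations') }}
```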
I would also recommend that you take 20 minutes of your time and listen to the “Project Walkthrough Video”, which actually confirmed to me that dbt was as easy as I suspected it to be. Up to that video, I was convinced that I was missing something that would make dbt complex to use, but nope. If you know SQL, you’ll be comfortable with dbt right from the start.
Hopefully, by this point you have a good understanding of dbt, how it fits within your architecture, how to set it up and have created your first project. Now what?
Well, you would need some data of course. Without going over the details, because this is out of scope for this article, let’s assume you have a few source tables in a Redshift/PostgreSQL database.
An example for me was to transform an export of Eventbrite registrations into a clean DW structure to be used in Tableau. My staging model would look something like this…
```sql
SELECT
    CAST('eventbrite' AS varchar)        AS source,
    CAST("Order #" AS integer)           AS registration_natural_key,
    CAST("Event ID" AS integer)          AS event_natural_key,
    CAST("Event Name" AS varchar(64))    AS event_name,
    CAST("Attendee #" AS integer)        AS attendee_natural_key,
    CAST("Email" AS varchar(128))        AS attendee_email,
    CAST("First Name" AS varchar(32))    AS attendee_first_name,
    CAST("Last Name" AS varchar(32))     AS attendee_last_name,
    CAST("Job Title" AS varchar(128))    AS attendee_job_title,
    CAST("Company" AS varchar(32))       AS company_name,
    CAST("Ticket Type" AS varchar(32))   AS tickets_type,
    CAST("Quantity" AS smallint)         AS tickets_quantity,
    CAST("Total Paid" AS decimal(6, 2))  AS transaction_amount,
    CAST("Currency" AS char(3))          AS currency,
    CAST("Order Date" AS timestamp)      AS registration_datetime
FROM eventbrite.registrations
```
I could then derive entities such as dim_event…
```sql
WITH registration AS (
    SELECT * FROM {{ ref('stg_eventbrite_registrations') }}
)

SELECT DISTINCT
    {{ dbt_utils.surrogate_key(
        'registration.source',
        'registration.event_natural_key',
        'registration.event_name'
    ) }} AS surrogate_key,
    registration.source,
    registration.event_natural_key AS natural_key,
    registration.event_name        AS name
FROM registration
```
dim_attendee…
```sql
WITH registration AS (
    SELECT * FROM {{ ref('stg_eventbrite_registrations') }}
)

SELECT DISTINCT
    {{ dbt_utils.surrogate_key(
        'registration.source',
        'registration.attendee_natural_key',
        'registration.attendee_email'
    ) }} AS surrogate_key,
    registration.source,
    registration.attendee_natural_key AS natural_key,
    registration.attendee_email       AS email,
    registration.attendee_first_name  AS first_name,
    registration.attendee_last_name   AS last_name,
    registration.attendee_job_title   AS job_title,
    registration.company_name
FROM registration
```
and fct_registration…
```sql
WITH registration AS (
    SELECT * FROM {{ ref('stg_eventbrite_registrations') }}
)

SELECT
    {{ dbt_utils.surrogate_key(
        'registration.source',
        'registration.registration_natural_key',
        'registration.attendee_natural_key',
        'registration.event_natural_key'
    ) }} AS surrogate_key,
    registration.source,
    registration.registration_natural_key AS natural_key,
    registration.attendee_natural_key,
    registration.event_natural_key,
    registration.tickets_quantity,
    registration.tickets_type,
    registration.transaction_amount,
    registration.currency,
    registration.registration_datetime
FROM registration
```
Sinter provides a tool to view the relationships between our models, which is a great thing to have at our disposal as our data warehouse becomes complex. For our current example, we end up with the following DAG (directed acyclic graph)…
dbt – another example of data modeling
And that’s it, we just created a first version of our data warehouse.
Good practices
The example above is pretty basic, but as things become more complex, you’ll need to start picking up some best practices. Some of them you can introduce right away, some you’ll learn as you go, and some will impose themselves upon you.
Here’s a non-exhaustive list of “good practices” I picked up during my first few projects.
Folder Structure
How you structure your data warehouse and how you structure your dbt/models folder are 2 entirely different things. There have been some good discussions around folder organization in dbt, but I think in the end it all depends on how you would like to work.
Here’s what works for me. I have 4 main folders:
- Staging – This is where I create my foundational models on top of the source tables that were exported to the data warehouse. I do a selection of fields, typecast, filter and rename the fields if necessary. You know, basic maintenance.
- Transform – This is where the bulk of the work happens. It’s where staging models are being used and transformed to become the final entities. You would join sources, add aggregates, filter, etc.
- Entities – This is where I have my final fact and dimension models.
- FinalModels (I usually have a more business-focused name, such as ‘Marketplace’) – This is where I create the main queries that are being used by BI tools. There are discussions in regards to building your final model within dbt or directly in your BI tool, but I personally like to have the “big ones” here in dbt with all other queries.
In regards to the Transform folder, even here I have a specific folder structure underneath. Here’s how it looks:
- EntityName
  - Source1
    - EntityName_Source1_1_FirstTransformation.sql
    - EntityName_Source1_2_SecondTransformation.sql
  - Source2
  - EntityName_1_MergeSources.sql
As you can see, I like to keep it structured by source and sequence. This really helps whenever I have to introduce new features or when debugging.
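This folder layout maps directly onto dbt’s project configuration, since materializations can be set per folder. A sketch of how the dbt_project.yml fragment for the structure above might look (the project name and materialization choices are assumptions, not a prescription):

```yaml
# dbt_project.yml (fragment) – one entry per top-level models folder
models:
  my_project:
    Staging:
      materialized: view       # cheap, always reflects fresh source data
    Transform:
      materialized: ephemeral  # compiled into downstream queries, not persisted
    Entities:
      materialized: table      # final facts/dimensions, rebuilt on each run
    Marketplace:
      materialized: table      # the "big ones" consumed by BI tools
```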
Testing Suite
Here’s something that I didn’t use first, but that was always in the back of my mind.
I actually was looking for a solution outside of dbt, because I had a specific test scenario in mind and didn’t know enough about dbt’s testing framework to get it working with dbt itself. So I asked my question on Slack directly and Tristan pointed me to custom schema tests.
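For the curious, a custom schema test is just a macro whose query selects the rows that fail; the test passes when the query returns zero rows. A hypothetical sketch using the syntax from current dbt docs (the test name and column are made up):

```sql
-- macros/test_is_positive.sql
-- A custom (generic) schema test: select the failing rows;
-- zero rows returned means the test passes.
{% test is_positive(model, column_name) %}

SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0

{% endtest %}
```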
And that kinda propelled me into the world of testing within the confines of dbt. dbt already has a few tests that can be used right out of the box, but there’s also the dbt-utils package, which provides other tests that can be used in your own testing suite.
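The out-of-the-box tests are declared in a .yml file alongside your models. Here’s a sketch of what that could look like for the fct_registration model above, mixing built-in tests with one from dbt-utils (which tests you actually want is, of course, project-specific):

```yaml
# models/schema.yml (fragment)
version: 2

models:
  - name: fct_registration
    columns:
      - name: surrogate_key
        tests:
          - unique      # built-in
          - not_null    # built-in
      - name: transaction_amount
        tests:
          - dbt_utils.accepted_range:  # from the dbt-utils package
              min_value: 0
```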
Right now, for one project, I have over 60 tests that I run in my test environment after making changes, and that run automatically every hour in the prod environment. Whenever a test fails in prod, an alert is sent to our Slack channel.
Amazing!
Data Quality Exploration
Once I’ve made changes in my dev environment, built my models and tested them, I like to use Tableau to actually visualize my Data Warehouse before it gets built in prod. Some sort of Quality Assessment.
I’m not doing anything fancy at this point, but I do like to plot my main entities’ data sets to make sure that no weird anomalies were introduced during development.
The testing suite that was discussed previously safeguards my data at the atomic level, but my Tableau QA workbook safeguards data quality at the aggregated level.
Work Flow
That brings me to my overall work flow. Here are the steps I take when introducing a new feature within a data warehouse:
- Research/Exploration into required fields and their sources
- Write up tests as acceptance criteria
- Build data models in dbt
- Testing in development environment
- Quality assessment with Tableau
- Commit code to git repository
- Pull new code in prod
- Build new ETL in prod
- Testing in prod environment
- Integrate new feature in final BI report
- Push to Sandbox for client approval
Extras
I did briefly touch on those previously, but there are a few extras worth knowing about:
- dbt-utils – This package is provided by the creators of dbt and packs in a few extras that are worth looking at. I personally use only a few of what’s provided (such as the ‘surrogate_key’ SQL helper, as seen in the models above), but there is a lot to explore here and that might also be useful.
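To demystify surrogate_key a little: conceptually, it hashes the concatenation of the column values (with NULLs coalesced away) so that the same inputs always produce the same key. Here’s a rough Python sketch of that idea – an illustration only, not dbt’s exact compiled SQL:

```python
import hashlib

def surrogate_key(*values, null_placeholder=""):
    """Mimic the idea behind dbt_utils.surrogate_key: coalesce each
    value to a placeholder, cast to string, concatenate, and MD5-hash,
    so identical inputs always yield the same deterministic key."""
    parts = [null_placeholder if v is None else str(v) for v in values]
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()

key = surrogate_key("eventbrite", 12345, "My Conference")
print(key)  # a deterministic 32-character hex digest
```

The point is determinism: rebuild the warehouse from scratch and every dimension row gets the same key, which is exactly what the models above rely on.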
- Sinter – Not sure how that service is associated with dbt and Fishtown Analytics, but their service is all about dbt. Essentially, they provide the means to run, schedule and manage your dbt jobs. Simple, yet really well made.
- dbt Graph Viewer – Actually provided by the folks at Sinter, this little tool is something I use frequently. Not only can you see the whole DAG of your project, but you can also focus on only one model and look at their parent and/or children models.
I should also quickly talk about macros, which are potentially a big thing, but that I haven’t gotten around to yet. Although I did just read that article by Claire Carroll on design patterns, which talks about keeping your code DRY, using Jinja’s templating language and the introduction of macros. Anyways, still a whole lot of stuff to discover.
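As a glimpse of what macros enable, here’s a made-up helper that would factor the repeated CASTs out of the staging model above into one reusable, DRY snippet (the macro name and types are purely illustrative):

```sql
-- macros/money.sql – a hypothetical helper to keep staging models DRY
{% macro money(column_name) %}
    CAST({{ column_name }} AS decimal(6, 2))
{% endmacro %}

-- Usage inside a model:
--   SELECT {{ money('"Total Paid"') }} AS transaction_amount, ...
```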
The Big Picture
As you can see, dbt is more than just a simple tool. I like to think of it as a core tool within your data warehouse building arsenal.
You can think of it as following the principles of the Unix Tools Philosophy, where the goal is to have an ensemble of tools that do one job only, but do it really well.
But something else is important in that philosophy: those tools should easily interact together (by the use of the pipe “|” symbol, which pipes output from one tool as input to another tool). That dbt keeps really close to SQL is what makes it so easy to adopt, but also so easy to interact with many other tools. It’s a common language that allows for outputs to become inputs for another tool.
But also, the choice of SQL is one that assures interoperability within a modular BI architecture. As is argued by Ajay Kulkarni, co-founder of TimeScaleDB:
“We believe that SQL has become the universal interface for data analysis.”
(Source: Why SQL is beating NoSQL, and what this means for the future of data)
There is a lot that’s been said and written about the modern BI architecture, with cloud-based services as its underlying infrastructure and modularity as a core principle. A tool such as dbt adds to that philosophy by being itself modular (and replaceable), while also being flexible enough to work with many different inputs (data sources) and outputs (data warehouses).
Here’s to your dbt journey, to the future development of dbt and the community that is being fostered around that project!