A Beginner’s Guide to DBT (data build tool) — Part 1: Introduction
This is a rather short introduction about DBT. Further explanation with examples will be available in my next episode.
Table of Contents
What is DBT?
The Power of DBT
What is DBT?
DBT (data build tool) is a command-line tool that enables data analysts and engineers to transform data in their warehouses simply by writing SELECT statements.
DBT does the T in ETL (Extract, Transform, Load) process — it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.
Why would I switch from SQL scripts to
DBTscripts considering the learning curve?
“We believe that analytics teams have a workflow problem. Too often, analysts operate in isolation, and this creates suboptimal outcomes. Knowledge is siloed. We too often rewrite analyses that a colleague had already written. We fail to grasp the nuances of datasets that we’re less familiar with. We differ in our calculations of a shared metric.”
In summary, the main focus here is cross-team collaboration with high quality code and high reliability. I came across a great story that also outlines these pains by the DataGuyStory.
The Power of DBT
DBT is basically just a folder of SQL queries that can use a few parameters for dependencies. The 4 key points of pros are below.
Built-in testing for data quality
DBT basically has two type of test built in: schema test (more common) and data test. The goal of writing DBT schema and data tests it to be an integrated part of the data architecture which saves time. Defining tests is a great way to confirm that your code is working correctly, and helps prevent regressions when your code changes.
Jinja is a templating language that you can combine with SQL in your DBT project. Macros in Jinja are reusable pieces of code — they are analogous to “functions” in other programming languages. Macros are defined in
.sql files, typically in your
There are a few applications on how you could use macros. The first one is using variables.
1. Using variables
Imagine you need to generate a report on a daily basis. The report is based on a
.sql with a filtering on date (WHERE order_date = “2021–06–06”). However, you are not paid to generate just a single report. In practice, there might be a dozen reports to be ready by a 10AM for the management teams. It then translates to hundreds
.sql files with those WHERE clause, to be changed on a daily basis. Visiting each of those
.sql to edit will be super time-consuming and error-prone.
Therefore, we should replace the actual date with a variable that could be used across multiple models within a package. That variables will be changed and passed into the DBT run command each day those reports are generated
2. Managing dependencies in SQL
A dependency is created between two objects when one object appears by name inside a SQL statement stored in another object.
Let’s go back to our report generation story above. Now, instead of running a single SQL to get your report, you have to run 3 different
.sqlfiles in a SPECIFIC order to get the final result. You can imagine multiplying it with the number of reports, that would be very brain draining. Sometimes you have to store the order of running those in another document ( think of it like Google Sheet). What happen if the access to the G Sheet is loosely controlled and someone accidentally (I hope) delete those steps. That is a perfect ingredient for disaster
Now with DBT, you can tell your program (which is DBT) to run those SQL in the order that you want, using the
ref function. And you edit it directly into the
.sql. Hence, you don’t really need another Google Sheet to mess up your workflow.
Without DBT, you need to manage dependencies among your models. In DBT, DBT automatically infers the dependencies between models.
Shockingly low learning curve for non-engineers
I learn DBT from scratch within 2 weeks. I started out skimming through DBT docs , DBT setting up guide, go through their
jaffle-shop example, create my own DBT project and work on BigQuery. The dataset I took is from Kaggle.
Once created, your DBT models need to run on a regular basis. There are many tools and framework available on the market for scheduling. It can be divided into two main categories.
I will focus on the more open-source — Airflow in the next episode.
- Me coming from a non-engineer background lack LOTS of knowledge on data warehousing, which made it struggling for me to understand the DBT documentation. What I did is to read some basic data warehousing concepts first, then jumped to the docs later.
- I encountered some difficulties in setting up the first DBT project. In my second article under DBT series, I will cover my common mistakes and my fast workaround.
- I don’t know what to do next after reading the setting up guide. I don’t see the big picture how DBT can fit into the work flow.
DBT simplifies collaboration between engineers and non-engineers with high quality code and high reliability
In my next episode on
dbt tutorial series, I will cover the quick setup guide using CLI, demo
dbt run and
dbt test, and share some tips and tricks for non-engineers like me to quickly get a hang of it. Let me know if you have any questions or comments in the comments section below.