Part 5b: How to fail fast with Airflow
A quick post on my debugging method
Previously — on my #dataseries
Part 5a: Schedule dbt models on Airflow
Table of Contents
- Debugging
- Common Types of Errors
  - Python Syntax
  - Semantic
- The long way
- The hacky way
- Airflow CLI
Debugging
Last night I was struggling to debug some problems in Airflow, but after a while I found a way to iterate quickly and fix them.
This debugging method let me follow the fail-fast principle and debug quickly, so today I sat down and wrote this short post to share it (though I suspect many of you are already using it). In my opinion, failing fast is very important in data analytics and can greatly improve an analyst's efficiency.
Common Types of Errors
After playing around with writing DAGs for a while, I realised there are two main types of errors in Python: (1) syntax errors and (2) semantic errors.
“Syntax errors are produced by Python when it is translating the source code into byte code.” These are easy to spot and fix, so we check them first.
“Semantic errors are problems with a program that runs without producing error messages but doesn’t do the right thing.” Semantic errors are the more challenging kind, I think.
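To make the distinction concrete, here is a toy illustration of my own (not from any of my dag files):

# (1) Syntax error: Python rejects this while parsing, before anything runs
#     def add(a, b)        <- missing colon, raises SyntaxError
#         return a + b

# (2) Semantic error: this parses and runs happily, but does the wrong thing
def add(a, b):
    return a - b          # bug: should be a + b, yet no error message appears

print(add(2, 3))          # prints -1 instead of 5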
Python Syntax
First, we check for Python syntax errors by running python3 your-dag-file.py in your terminal (make sure Python is installed). For example, I ran:
python3 my_ten_dag.py
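If you want a slightly stricter check than plain python3, you can also ask Airflow itself to parse your dags folder. This is a small sketch of my own (the file name check_dags.py is made up, and the exact behaviour can differ a little between Airflow versions):

# check_dags.py - load the dags folder the same way the scheduler does
from airflow.models import DagBag

dag_bag = DagBag()                                      # parses every file in your dags folder
for filename, error in dag_bag.import_errors.items():   # files that failed to import
    print(f"Failed to import {filename}:\n{error}")
print(f"Loaded {len(dag_bag.dags)} dag(s) successfully")

Running this inside the scheduler container surfaces import errors for every dag file at once, rather than one file at a time.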
Semantic
Second, we check for semantic errors, i.e. whether your code produces the result you actually want. I will demonstrate how I debug using the example from my previous post, Part 5a: Schedule dbt models on Airflow.
I will cover the long way, which is how I started out, followed by the hacky way, which took me a while (and some help) to realise.
The long way
If you notice, I named my dag file my_ten_dag.py, which means this is the tenth dag I created; the previous nine dag files all had semantic errors. I couldn’t run the dbt model successfully until the tenth version. (So if you are also playing around with Airflow and struggling, don’t give up 💪)
When I was debugging, I created a new dag file for each attempt and put them all in my dags folder, from my_first_dag.py all the way to my_ten_dag.py. Each iteration took around 10 minutes before I knew whether that dag version worked.
Hmm, at this point you might be wondering: “Why would we do this? Why not just edit a single dag file such as my_dag.py?”
The answer is: if you keep the same dag file name but make new changes, the logs in the Airflow UI will be overwritten, and it becomes hard to look back at the old errors.
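To make that concrete, this is roughly the shape of each versioned file. It is a minimal sketch of my own, with the dag_id, start date, and schedule made up (the real files are in Part 5a); the point is that every attempt gets its own file and its own dag_id, so old logs stay visible:

# my_ten_dag.py - sketch only; dag_id, start_date and schedule are my assumptions
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="my_ten_dag",                # my_first_dag, my_second_dag, ... one id per attempt
    start_date=datetime(2018, 6, 30),
    schedule_interval=None,             # trigger manually while debugging
)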
The hacky way
Then I realised you don’t need to create a different dag file for every version in order to debug. You can debug inside the Docker container from your terminal. Here’s how I did it.
Step 01: Open your terminal (I use iTerm2).
Step 02: Find the id of the Docker container that runs Airflow.
Tip: If you don’t know your container id, run this in your terminal to list all running containers; you will see the id of the one running the airflow scheduler.
docker container ls
Step 03: Open a shell inside the Docker container that runs the airflow scheduler. Run this:
docker exec -it <your docker container id> /bin/bash
After that, your prompt will show root@<your docker container id>.
Tip: Check which operators you use in Airflow. The most important thing here is that you don’t need to cd into any particular folder before testing a BashOperator’s command.
Then you can execute whatever you want there. In my case, for example, I ran the bash command specified in my dag file:
cd $DBT_PROFILES_DIR && dbt run --models shiba_ecommerce.daily_order_count --vars '{"ingestion_date": "2018-06-30"}'
The log output above, from the Docker container on my laptop, is exactly the same as the log I view in the Airflow UI.
One of my failed dag versions
As I mentioned above, the log in the Docker container and the log in the Airflow UI are exactly the same.
Key Takeaway: To save time, instead of putting every change into Airflow, running it, and viewing the log (each iteration took me at least 10 minutes), execute whatever you want to execute directly in your Docker container. Once it works, put it into your dag file and schedule a run on Airflow. That is much more efficient and saves you time when debugging!
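Once the command runs cleanly in the container, it drops straight into the dag file. Here is a sketch of what that could look like; the task_id is my own invention and the import path is the Airflow 1.x one (in Airflow 2.x the module is airflow.operators.bash):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="my_ten_dag",
    start_date=datetime(2018, 6, 30),
    schedule_interval=None,
)

run_dbt = BashOperator(
    task_id="dbt_daily_order_count",    # hypothetical task_id
    bash_command=(
        "cd $DBT_PROFILES_DIR && "
        "dbt run --models shiba_ecommerce.daily_order_count "
        "--vars '{\"ingestion_date\": \"2018-06-30\"}'"
    ),
    dag=dag,
)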
Airflow CLI
After you have managed to run the bash command and have created a working dag, you can test it one more time by running the Airflow CLI inside the airflow Docker container you shelled into. The full documentation of the command is here.
airflow run [-h] [-sd SUBDIR] [-m] [-f] [--pool POOL] [--cfg_path CFG_PATH]
[-l] [-A] [-i] [-I] [--ship_dag] [-p PICKLE] [-int]
dag_id task_id execution_date
This will save you a lot of time, because normally you have to turn on your dag and wait for the scheduler to schedule and run your tasks. On a big system with many dag files lining up to be run, that can take anywhere from 10 minutes to half an hour.