Part 5b: How to fail fast with Airflow
A quick post on my debugging method
Previously — on my #dataseries
Part 5a: Schedule dbt models on Airflow
Table of Contents
- Debugging
- Common Types of Errors
  - Python Syntax
  - Semantic
- The long way
- The hacky way
- Airflow CLI
Debugging
Last night I was struggling to debug some problems in Airflow, but after a while I found a way to iterate quickly and fix them.
This debugging method let me follow the fail-fast principle and debug quickly, so today I sat down and wrote this short post to share it (though I suspect many of you are already using it). In my opinion, failing fast is very important in data analytics and can greatly improve an analyst's efficiency.
Common Types of Errors
After playing around with writing DAGs for a while, I realised there are two main types of errors in Python: (1) syntax errors and (2) semantic errors.
“Syntax errors are produced by Python when it is translating the source code into byte code.” These are easy to spot and fix, so we check them first.
“Semantic errors are problems with a program that runs without producing error messages but doesn’t do the right thing.” Semantic errors are the more challenging kind, I think.
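To make the distinction concrete, here is a toy illustration of my own (not from any of my dag files):

# (1) Syntax error: Python rejects this while parsing, before anything runs
#     def add(a, b)        <- missing colon, raises SyntaxError
#         return a + b

# (2) Semantic error: this parses and runs happily, but does the wrong thing
def add(a, b):
    return a - b          # bug: should be a + b, yet no error message appears

print(add(2, 3))          # prints -1 instead of 5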
Python Syntax
First, we check for Python syntax errors by running python3 your-dag-file.py in your terminal (make sure Python is installed). For example, I ran:
python3 my_ten_dag.py
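If you want a slightly stricter check than plain python3, you can also ask Airflow itself to parse your dags folder. This is a small sketch of my own (the file name check_dags.py is made up, and the exact behaviour can differ a little between Airflow versions):

# check_dags.py - load the dags folder the same way the scheduler does
from airflow.models import DagBag

dag_bag = DagBag()                                      # parses every file in your dags folder
for filename, error in dag_bag.import_errors.items():   # files that failed to import
    print(f"Failed to import {filename}:\n{error}")
print(f"Loaded {len(dag_bag.dags)} dag(s) successfully")

Running this inside the scheduler container surfaces import errors for every dag file at once, rather than one file at a time.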
Semantic
Second, we check for semantic errors, i.e. whether your code produces the result you actually want. I will demonstrate how I debug using the example from my previous post, Part 5a: Schedule dbt models on Airflow.
I will cover the long way, which is how I started out, followed by the hacky way, which took me a while (and some help) to realise.
The long way
If you notice, I named my dag file my_ten_dag.py, which means this is the tenth dag I created; the previous nine dag files all had semantic errors. I couldn’t run the dbt model successfully until the tenth version. (So if you are also playing around with Airflow and struggling, don’t give up 💪)
When I was debugging, I created a new dag file for each attempt and put them all in my dags folder, from my_first_dag.py all the way to my_ten_dag.py. Each iteration took around 10 minutes before I knew whether that dag version worked.
Hmm, at this point you might be wondering: “Why would we do this? Why not just edit a single dag file such as my_dag.py?”
The answer is: if you keep the same dag file name but make new changes, the logs in the Airflow UI will be overwritten, and it becomes hard to look back at the old errors.
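To make that concrete, this is roughly the shape of each versioned file. It is a minimal sketch of my own, with the dag_id, start date, and schedule made up (the real files are in Part 5a); the point is that every attempt gets its own file and its own dag_id, so old logs stay visible:

# my_ten_dag.py - sketch only; dag_id, start_date and schedule are my assumptions
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="my_ten_dag",                # my_first_dag, my_second_dag, ... one id per attempt
    start_date=datetime(2018, 6, 30),
    schedule_interval=None,             # trigger manually while debugging
)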
The hacky way
Then I realised you don’t need to create a different dag file for every version in order to debug. You can debug inside the Docker container from your terminal. Here’s how I did it.
Step 01: Open your terminal (I use iTerm2).
Step 02: Find the id of the Docker container that runs Airflow.
Tip: If you don’t know your container id, run this in your terminal to list all running containers; you will see the id of the one running the airflow scheduler.
docker container ls
Step 03: Open a shell inside the Docker container that runs the airflow scheduler. Run this:
docker exec -it <your docker container id> /bin/bash
After that, your prompt will show root@<your docker container id>.
Tip: Check which operators you use in Airflow. The most important thing here is that you don’t need to cd into any particular folder before testing a BashOperator’s command.
Then you can execute whatever you want there. In my case, for example, I ran the bash command specified in my dag file:
cd $DBT_PROFILES_DIR && dbt run --models shiba_ecommerce.daily_order_count --vars '{"ingestion_date": "2018-06-30"}'
The log output above, from the Docker container on my laptop, is exactly the same as the log I view in the Airflow UI.
One of my failed dag versions
As I mentioned above, the log in the Docker container and the log in the Airflow UI are exactly the same.
Key Takeaway: To save time, instead of putting every change into Airflow, running it, and viewing the log (each iteration took me at least 10 minutes), execute whatever you want to execute directly in your Docker container. Once it works, put it into your dag file and schedule a run on Airflow. That is much more efficient and saves you time when debugging!
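Once the command runs cleanly in the container, it drops straight into the dag file. Here is a sketch of what that could look like; the task_id is my own invention and the import path is the Airflow 1.x one (in Airflow 2.x the module is airflow.operators.bash):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="my_ten_dag",
    start_date=datetime(2018, 6, 30),
    schedule_interval=None,
)

run_dbt = BashOperator(
    task_id="dbt_daily_order_count",    # hypothetical task_id
    bash_command=(
        "cd $DBT_PROFILES_DIR && "
        "dbt run --models shiba_ecommerce.daily_order_count "
        "--vars '{\"ingestion_date\": \"2018-06-30\"}'"
    ),
    dag=dag,
)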
Airflow CLI
After you have managed to run the bash command and have created a working dag, you can test it one more time by running the Airflow CLI inside the airflow Docker container you shelled into. The full documentation of the command is here.
airflow run [-h] [-sd SUBDIR] [-m] [-f] [--pool POOL] [--cfg_path CFG_PATH]
[-l] [-A] [-i] [-I] [--ship_dag] [-p PICKLE] [-int]
dag_id task_id execution_date
This will save you a lot of time, because normally you have to turn on your dag and wait for the scheduler to schedule and run your tasks. On a big system with many dag files lining up to be run, that can take anywhere from 10 minutes to half an hour.