Written by Yan Cescon Haeffner, Data Engineering, on 27/09/2021
7 minute read
We need to talk about data build tool — Hands On Orchestration
In the third (and last) part of our series about DBT, we will cover a hands-on implementation of an automated DBT project using Docker and Airflow.
So, now that we’ve discussed what DBT is and why you should use it along with your ELT stack, it’s finally time to talk about how to actually implement it in a real-life situation using an orchestration tool.
Even though you could always use the simplified implementation of DBT based on the DBT Cloud SaaS solution, sometimes we would like to keep everything running inside our own infrastructure (our data, our rules), right? To fill the gap left by using DBT manually as a command line tool, we will automate the execution of each project task through an orchestrator.
The orchestration will be handled by Airflow (and if you don’t know it yet, make sure to check this post first), using Docker containers to run and test the whole DBT model. By the end of this post, you should be able to automate your DBT project while merging its awesome features with Airflow’s useful Gantt charts and debugging tools.
First of all, in order to keep things as simple as possible for new DBT users, we will use this Fishtown Analytics tutorial as our starting point, so you should have your Jaffle Shop project sitting somewhere in your local file system before proceeding.
This whole hands-on has been tested on a Windows 10 machine, using the native Ubuntu virtualization from the Windows 10 app store to run Airflow.
Also, make sure to set your Airflow instance running with the Local Executor, or equivalent, in order to have parallelism enabled. By following this tutorial you should be ready to use Airflow for this exercise.
One last requirement is to have Docker Desktop installed on your system, because we will need to deploy and start containers to run our DBT tasks. To achieve that, you can follow this step-by-step guide!
The strategy behind this approach is really simple: we want to have a Docker Image that will allow us to launch an instance of our DBT project environment and execute a DBT command inside it.
In order to achieve that, we need to build a Dockerfile that will pack everything that is needed to run DBT, from a proper Python environment to the profile file used on the project.
Now you may be wondering why on earth you would pack your profile (a.k.a. your DWH credentials) into your container, right? For the purposes of this hands-on only, we will keep things as simple as possible and pack our BigQuery keyfile with our DBT project. Fortunately, we are able to read environment variables from within our container, which means that by setting our credentials as system variables we can pass them as parameters to our Docker image through Airflow’s Docker Operator. Please take into consideration that in a real-world project you should always keep your secrets (and keyfiles) safe.
Here’s the profiles.yml file we will use (make sure to update yours accordingly):
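The original gist is not reproduced here, but a minimal sketch of a BigQuery profiles.yml for this setup could look like the following. The keyfile path, dataset name, and the BIGQUERY_PROJECT environment variable are assumptions for illustration; adapt them to your own project:

```yaml
jaffle_shop:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: service-account
      # Resolved at runtime, so the value can be injected through
      # the Docker Operator's environment parameters
      project: "{{ env_var('BIGQUERY_PROJECT') }}"
      keyfile: /dbt/profile/keyfile.json
      dataset: jaffle_shop
      threads: 4
      timeout_seconds: 300
      location: US
```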
The final Dockerfile needed looks like the following:
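The gist itself is not embedded here, but a minimal sketch of such a Dockerfile, assuming a `profile` folder holding profiles.yml and the keyfile, could be:

```dockerfile
# Minimal sketch of a DBT project image (paths and base image are assumptions)
FROM python:3.8-slim

# Install DBT and its Python dependencies first, to cache this layer
COPY requirements.txt /dbt/requirements.txt
RUN pip install --no-cache-dir -r /dbt/requirements.txt

# Pack the whole project, including the profile folder and keyfile
COPY . /dbt
WORKDIR /dbt

# Tell DBT where to find profiles.yml inside the container
ENV DBT_PROFILES_DIR=/dbt/profile

# The actual command (dbt run, dbt test, ...) is passed in by Airflow
ENTRYPOINT ["dbt"]
```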
The requirements.txt file is described by the following gist:
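For a BigQuery-backed project, the requirements file can be as short as a single line; the exact version pin below is an assumption, so use whichever version your project was developed against:

```
# Pin the DBT adapter version your project was built with
dbt-bigquery==0.20.2
```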
Also, for the above Dockerfile to work you need to make sure that your folder structure looks like the one below:
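The referenced layout is roughly the following (folder and file names are the ones assumed throughout this hands-on):

```
jaffle_shop/
├── Dockerfile
├── requirements.txt
├── dbt_project.yml
├── models/
│   └── ...
├── profile/
│   ├── profiles.yml
│   └── keyfile.json
└── target/
    └── manifest.json
```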
Notice how your target folder should only contain the generated manifest.json file from your dbt run (or dbt docs generate) command. Also, make sure to create and populate your profile folder as described above.
The manifest.json file is not necessary for the Docker Image, but it will be used by the next step of our hands on, so make sure to keep it there.
Finally, open your terminal where those files are located and run the following command:
docker build -t dbt-jaffle-shop .
You should now be able to see your DBT project Docker image by running `docker images`.
Now that we have successfully containerized our project, let’s build the Airflow DAG that will handle its execution.
It would not make sense for us to have to write a new DAG by hand every time we update our DBT project, right?
To avoid that, we can make use of a powerful file that DBT uses to store project definitions: the manifest file, located inside the project’s target folder as a JSON file.
If you don’t see that file in your target folder, please run the following command within your DBT folder:
dbt docs generate
Make sure the manifest file is available in the target folder, then create a new Python script called generate_dbt_dag.py inside the parent folder of your DBT project. Open it and paste the following code inside it:
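The original gist is not embedded here, but a minimal sketch of what generate_dbt_dag.py could look like follows. It assumes the image built earlier is tagged dbt-jaffle-shop, that the manifest lives at target/manifest.json, and the function names (load_models, generate_tasks) are illustrative, not necessarily the originals:

```python
import json
from argparse import ArgumentParser
from pathlib import Path

# Skeleton of the generated DAG file; task definitions are injected below.
DAG_TEMPLATE = '''from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="dbt_jaffle_shop_dag",
    schedule_interval="{schedule}",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
{tasks}
'''


def load_models(manifest):
    """Map each dbt model name to the names of the models it depends on."""
    models = {}
    for node in manifest["nodes"].values():
        if node["resource_type"] != "model":
            continue
        upstream = [
            dep.split(".")[-1]
            for dep in node["depends_on"]["nodes"]
            if dep.startswith("model.")
        ]
        models[node["name"]] = upstream
    return models


def generate_tasks(models, image):
    """Emit one run task and one test task per model, wired run >> test."""
    lines = []
    for name in models:
        for verb in ("run", "test"):
            lines.append(
                f'{verb}_{name} = DockerOperator('
                f'task_id="{verb}_{name}", image="{image}", '
                f'command="dbt {verb} --select {name}", auto_remove=True)'
            )
        # Never test a model before it has been (re)built
        lines.append(f"run_{name} >> test_{name}")
    # A model only runs after all of its upstream models passed their tests
    for name, upstream in models.items():
        for dep in upstream:
            lines.append(f"test_{dep} >> run_{name}")
    return "\n".join("    " + line for line in lines)


def main():
    parser = ArgumentParser(description="Generate an Airflow DAG from a dbt manifest.")
    parser.add_argument("-schedule", default="@daily", help="cron schedule for the DAG")
    args, _ = parser.parse_known_args()

    manifest_path = Path("target/manifest.json")
    if not manifest_path.exists():
        print("target/manifest.json not found; run `dbt docs generate` first")
        return

    manifest = json.loads(manifest_path.read_text())
    tasks = generate_tasks(load_models(manifest), image="dbt-jaffle-shop")
    Path("dbt_jaffle_shop_dag.py").write_text(
        DAG_TEMPLATE.format(schedule=args.schedule, tasks=tasks)
    )


if __name__ == "__main__":
    main()
```

Running the script from the DBT project folder writes dbt_jaffle_shop_dag.py next to it, ready to be copied into the Airflow dags folder.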
Save the script and execute it with the following command:
python generate_dbt_dag.py -schedule "0 0 * * *"
This will generate the DAG and schedule it to run every day at midnight.
You should now be able to locate a new Python script named dbt_jaffle_shop_dag.py in the same folder. Move this new file into your Airflow dags folder so that it becomes available to the orchestrator.
Finally, if everything executed as expected, you should be able to see it in your Airflow Web Server in a few seconds.
Now that we successfully generated a DAG for our DBT project, let’s take a moment to understand what that script actually does, shall we?
The manifest.json file contains compiled information of your DBT project, from SQL queries to model relationships.
This means that some sensitive DWH architecture data could end up inside that file, so you should handle it with care and trim it down as much as possible to avoid exposing anything other than what your script really needs to create a DAG for the project.
The file’s information is then parsed and processed by our script to build task relationships as they are defined in our DBT project. Notice that we also define a single Airflow task to run all tests related to a specific node of our project; this automates test execution and separates it from model-building tasks, which improves our data reliability by ensuring that no further model will run if previous models contain invalid data.
Also, as mentioned before, this file contains a lot of information and, given the simplicity of this exercise, we are only using a fraction of what’s available… So make sure to explore this file and improve the provided script to create even better DAGs for your DBT projects!
This means we could take this approach even further and add the whole process to an automated CI/CD pipeline, which would test our manifest.json file and then proceed to create the associated DAG and deploy it to Airflow. Exciting, right?
One thing you should notice about our generate_dbt_dag.py script is that it uses the Docker Operator to run tasks on Airflow, so if you want to use your containerized DBT project with Kubernetes, please take some time to read the Kubernetes Executor documentation for Airflow. All you need to do is update the generate_tasks method from the script above to create tasks with a different (even custom) operator and you should be good to go!
So, time to hit the music and let the orchestrator play!
Head back to Airflow and turn the DAG on, which should automatically schedule a run for us. You can even open your Docker GUI and watch the containers it starts from now on.
If you set everything as described until now, you should see something like this:
We can see that not only are our tasks kicking in with parallelism on Airflow, they are also storing the container logs in our task logs, feeding us all the information that Airflow provides (like Gantt charts and log debugging).
If every task executed successfully, you should be able to see this fine piece of DAG graph view on your Airflow Webserver:
We’ve successfully developed and automated the execution of a DBT project using Docker containers and Airflow, complete with automated DAG generation and scheduling, while keeping everything inside our own infrastructure.
The Road Ahead
Thank you for taking the time to walk with me through this series about Data Build Tool. I hope it helped you somehow in your daily doses of data engineering and opened your mind to one of the amazing new tools of the modern data stack!
This series covered just a small portion of what the community behind DBT can offer you, so make sure to follow their updates and the official DBT GitHub repository!
Since everything is constantly changing in the data world, you definitely will not stop here and should really try to improve everything we did and learned so far. Who knows what amazing challenges await us ahead?
Once again, thank you for joining us and remember to always think beyond!