Managing our data using BigQuery, dbt and Github Actions
Tom Phillips • 2 March 2022
Tom Phillips • 2 March 2022
One of the first data engineering projects at Centre for Net Zero was setting up a data warehouse for the research team. Data cleaning is time consuming so we wanted to make the process of working with data as efficient as possible. We also wanted data testing capability, so we can trust our data and change models without fear of breaking them.
We decided on a modern extract-load-transform (ELT) workflow using BigQuery for our cloud data warehouse and dbt to transform the data within it. In this blog post we want to share our set-up dbt and automated it using Github Actions.
Our dbt project is on Github, so you can explore it for yourself and follow along with the links below.
We adopted the dbt project structure used by dbt Labs. Data flows through three distinct phases. First, tables loaded into the data warehouse are defined as sources. Next, sources are transformed into staging models. Finally, staging models become marts models.
Staging models are organised by data source. Staging models clean up sources and make them available for further use. Usually there is a one-to-one relationship between a source and staging model. Cleaning a dataset once saves a lot of time.
Marts are organised by research project. Inside marts are dimension models, which are tables where each row represents a “thing” and the columns are its attributes.
Google Cloud Platform organises resources by project. We use two projects for our data warehouse: production and development. Production is our source of truth, so we don’t want it littered with models that are works-in-progress.
We load datasets into BigQuery in the production project. So dbt can find them there, we specify the production project in the source configuration file of every source. In our local dbt profile we specify the development project.
When we run dbt locally, dbt reads source datasets from the production project but creates models in the development project. Within the development project, datasets are prefixed by the creator’s initials, so we don’t have any conflicts between users.
We use access control to make the production project read-only. Users have read-and-write permissions for the development project.
How do we get new models into production? We use Github Actions. The Github Actions runner is authenticated with a service account that, unlike everyone else, has write permissions for the production project.
When someone opens a PR, Github Actions starts the pull request workflow. This runs dbt against the development project. Instead of the creator’s initials, the schema is set to the short SHA of the last commit on the branch. Then we test everything except sources. If the tests fail, we find the dataset in the development project using the SHA and debug the problem.
Once tests have passed and the pull request is merged, Github Actions starts the main branch workflow. Here everything is run in the production project. First all the sources are tested to check that the data is as we expect, then dbt runs all the models, and finally tests everything except the sources. We do this so that we don’t create models on top of incorrect data.
Here I’ve explained how we organise our dbt project and configure dbt to use two Google Cloud projects for production and development environments. As our data warehouse grows we will see how our workflows scale (users with larger data warehouses might want to consider only running models that have changed). Our team is very happy with the setup so far – if you’re thinking about building a similar data warehouse, we encourage you to fork our repo and try it out for yourself.