5 Lessons I Have Learned Using dbt Core
Over the past few years I have managed quite a few dbt Core projects while working in various consulting and data team roles. While I really enjoy modeling data with dbt Core, it can be a serious headache when it’s implemented poorly. Here are 5 things I have learned that can help build a great project:
1. Follow the dbt Labs style guide
The first thing that I recommend is to follow the style guide published by dbt Labs. I think this style guide is great; it covers all aspects of a good project but is still pretty concise and easy to understand. Even if you don’t end up following the entire thing word for word, it’s a good starting point. I think it’s a good idea to give it a read over and see what rules you want to enforce in your project. Some of the areas of the style guide that I think are most important are:
- General project structure with stg and mart models
- Importing all of the source tables as CTEs at the top of the file
- Prefer CTEs over subqueries whenever possible
- Prioritize readability in code instead of focusing on fewer lines
- Full table names as aliases instead of single letters (o.count -> orders.count)
With that said, these may not be the points that you find most important. Whether you follow the style guide exactly as written or adapt it to fit your own use case, I think it is a great starting point for anyone writing models. The short sketch below shows a few of these conventions in practice.
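As a quick illustration, here is a minimal sketch of what a mart model following these conventions might look like (the stg_orders and stg_customers model names are hypothetical): imports at the top as CTEs, full table names instead of single-letter aliases, and readable, well-spaced SQL.

```sql
-- models/marts/orders_per_customer.sql (hypothetical model)
-- Import everything the model needs as CTEs at the top of the file.
with

orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
),

-- Use full table names instead of single-letter aliases
-- (customers.customer_id, not c.customer_id).
customer_orders as (
    select
        customers.customer_id,
        count(orders.order_id) as order_count
    from customers
    left join orders
        on customers.customer_id = orders.customer_id
    group by customers.customer_id
)

select * from customer_orders
```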
2. Don’t overdo it with Jinja or macros
Don’t get me wrong, Jinja and macros are both great. They are essential pieces of dbt Core that can be extremely helpful when used correctly. With that said, they add an extra layer of complexity to the code and shouldn’t be used unless it’s necessary. One of the great strengths of dbt Core is that it allows for complex data modeling while still mostly using pure SQL. Adding more complexity and moving things farther away from pure SQL comes with some downsides. The first downside is that it will be more difficult for a new engineer to understand the code; SQL is universally understood among data folks, but Jinja and macros may both take some upskilling to use. While it may not be a substantial upskilling for some engineers, it is something to consider. The second and arguably more influential reason is that it will make it much harder to migrate off dbt Core if that time ever does come. If you do decide to migrate someday, it could be a simple copy-paste job to another SQL-based platform, or you may need to refactor and rewrite all of your models because they contain languages and features that only exist in dbt Core. For these reasons, in the projects that I manage I usually do not implement complex Jinja or macros unless they are necessary and the value provided outweighs the negatives.
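To make the tradeoff concrete, here is a hypothetical sketch of the same transformation written first as plain SQL and then with a small Jinja macro (the cents_to_dollars macro, the stg_orders model, and the column names are made up for this example, and the ::numeric cast assumes a warehouse such as Snowflake or Postgres). The macro version is more reusable, but a new engineer now has to understand Jinja and track down the macro definition to know what the model actually compiles to.

```sql
-- Version 1, models/orders.sql: pure SQL, readable at a glance.
select
    order_id,
    amount_cents / 100.0 as amount_dollars
from {{ ref('stg_orders') }}
```

```sql
-- Version 2, macros/cents_to_dollars.sql: the same logic wrapped in a macro.
{% macro cents_to_dollars(column_name, scale=2) %}
    ({{ column_name }} / 100.0)::numeric(16, {{ scale }})
{% endmacro %}

-- models/orders.sql: shorter, but a reader now has to find and expand
-- the macro to know what this model compiles to.
select
    order_id,
    {{ cents_to_dollars('amount_cents') }} as amount_dollars
from {{ ref('stg_orders') }}
```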
3. Only Self-Manage With DevOps Support
There are a ton of choices when it comes to managing dbt Core. It is just a simple CLI tool, so engineers around the world have thought up countless ways to run it. I have actually written an entire article about my 3 favorite ways to manage dbt Core. Many companies decide to self-manage dbt Core for two reasons: cost reduction and reducing PHI risk. By far the most popular method for self-managing dbt Core is running it on one of the managed Airflow platforms like Astronomer or Google Cloud Composer. This method is pretty complicated and requires a lot of engineering that is usually outside the skill set of most analytics engineers. Although an analytics engineer should be able to write a DAG that executes a dbt Core project, they may not be able to set up and manage the infrastructure that the code executes on, manage a pipeline and storage for metadata, manage the CI/CD pipeline, and set up alerting and monitoring. To do this you will need support from a DevOps or cloud engineer. Even if your analytics engineers have the skills to manage all of this, in my opinion you want them focused on modeling data, not on managing the infrastructure that it runs on.
4. Don’t Build a Giant Spider Web
What I mean by this is that there are a lot of ways to organize a project. You can follow the style guide and use stg and mart folders, which I would recommend, or create something of your own design. However you decide to organize your project, just make sure you have a cohesive plan going into it. Whether it is two distinct layers like the dbt Labs style guide suggests or something else that works better for your use case, make sure you aren’t building a huge spider web of dependent models without any obvious explanation. Environments like this can be very difficult to support, especially for new engineers on the project. If the project is organized poorly, then debugging will be difficult and time-consuming, bringing in new developers will be challenging, and losing any key team members will be daunting. For the sake of this article I am not going to go into exactly how you should organize your project; just make sure you have a plan that future engineers will be able to understand when they look at it.
5. Don’t Use dbt Core for Use Cases Outside of Analytics
Need live streaming data? Doing a machine learning project? Working with OLTP systems? dbt Core is not what you need. After using dbt Core extensively and seeing it succeed in some scenarios and fail in others, I think it should only be used for use cases within analytics. I do not think that dbt Core is a general-purpose data engineering tool that can be used for many use cases, as some people suggest. I have found that it is remarkably good at modeling data for analytics and dashboarding, but not very good for use cases outside of this. In general, if you are doing a data project outside of analytics, there are probably better tools to use. You wouldn't use a hammer to hit a screw, would you? The same logic applies here; dbt Core is a great tool, but make sure you are using the correct tool for the job.