3 Tips for Troubleshooting Incremental Models in dbt Core

Eric Arsenault
4 min read · Jul 26, 2022


There is a lot to love about dbt Core. While it’s hard to pinpoint my favorite feature, incremental models are definitely near the top of the list. In my opinion, incremental models give a tremendous amount of functionality and flexibility while still being pretty simple to use. Like everything else in dbt Core, they are easy to set up but still leave a lot of room for error. In this post I will share a few tips I have learned while developing and maintaining incremental models in dbt Core.

1.) Build the model twice!

The first tip may seem like a simple one to those who already know it, but it is essential, and forgetting to do this can create a lot of angry stakeholders. When testing an incremental model in dbt Core, we must remember that there are actually two steps to the build: the initial full refresh and the subsequent incremental builds.

This is important because they function differently and ultimately run different code. When the table does a full refresh, it builds as a regular table and ignores the incremental logic you have defined in the model. This means that if there is an error in your incremental logic, the first run will still succeed, and dbt Core will not throw an error until the second run. This can cause a lot of confusion, and it is especially troublesome for models that run once per day overnight: the engineer manually tests the model during business hours and thinks it works, then gets a message from a stakeholder after hours saying that the model failed and the data is now stale :(
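In practice, this just means building the model twice before calling it done. A minimal sequence, assuming a model named my_incremental_model (swap in your own selector):

dbt run --select my_incremental_model --full-refresh
dbt run --select my_incremental_model

The first command builds the table from scratch and only exercises the full-refresh path; the second builds on top of the now-existing table and actually runs the is_incremental() branch of your code. Only once both runs succeed has the model really been tested.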

2.) Add some metadata to the table

Metadata is usually helpful in troubleshooting, and this is no different. Although dbt Core does not write any metadata by default, we can use the is_incremental() macro to build our own metadata fields. The fields you need will probably change depending on the use case, but here are a few of my favorites as an example. I have added them to a fake model for a simple key-based incremental table with one dimension and one measure.

{{ config(
    materialized='incremental',
    unique_key='key_dimension'
) }}

with stage_table as (

    select * from {{ ref('stg_data') }}

),

aggregated_data as (

    select
        key_dimension
        ,count(field) as measure
        ,max(event_date) as last_event_date

        -- This adds the last build time per row
        ,CURRENT_TIMESTAMP as build_time

        -- This adds the build source of the last build (full/incremental)
        ,{% if is_incremental() %}
            'incremental'
        {% else %}
            'full'
        {% endif %} as build_type

        -- This starts a counter to show the incremental build number per row
        ,{% if is_incremental() %}
            (select max(build_num) from {{ this }}) + 1
        {% else %}
            1
        {% endif %} as build_num

        -- This shows the date of the last full refresh
        ,{% if is_incremental() %}
            (select min(last_full_refresh) from {{ this }})
        {% else %}
            CURRENT_TIMESTAMP
        {% endif %} as last_full_refresh

    from stage_table
    group by 1

)

select * from aggregated_data

{% if is_incremental() %}
where last_event_date > (select max(last_event_date) from {{ this }})
{% endif %}
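Once the model has these fields, a quick query against the built table is usually the fastest way to see what the last run actually did. A sketch, assuming the model above is named agg_data (a made-up name for illustration):

select
    build_type
    ,build_num
    ,max(build_time) as last_build_time
    ,count(*) as row_count
from agg_data
group by 1, 2
order by build_num desc

If build_num stops incrementing even though new source data is arriving, or build_time never advances, that is usually the first hint that the incremental filter is not picking up new rows.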

3.) Remember schema change strategy and limitations

This last one is really a silent killer. Schema changes on incremental tables can cause a lot of headaches, especially if you are extracting the data with a tool that performs automatic schema updates, like Fivetran or Stitch.

It’s important to remember that schema changes are not handled the same way in incremental models as they are in regular tables/views that fully rebuild on every run. With incremental models there is an added layer of complexity: we must define a schema change strategy using ‘on_schema_change’ in the model config or dbt_project.yml file, and we must factor in the limitations of this parameter. The parameter has four options that are all pretty straightforward; here is a snippet from the dbt Core docs that explains the difference:

The possible values for on_schema_change are:

ignore: Default behavior.

fail: Triggers an error message when the source and target schemas diverge.

append_new_columns: Append new columns to the existing table. Note that this setting does not remove columns from the existing table that are not present in the new data.

sync_all_columns: Adds any new columns to the existing table, and removes any columns that are now missing. Note that this is inclusive of data type changes. On BigQuery, changing column types requires a full table scan; be mindful of the trade-offs when implementing.

Note: None of the on_schema_change behaviors backfill values in old records for newly added columns. If you need to populate those values, we recommend running manual updates, or triggering a --full-refresh.

So, what is the main takeaway here? To put it simply, dbt Core can handle schema changes on incremental tables in multiple ways, and depending on your use case you may choose a different strategy. With that said, there are some limitations. Under no circumstance will dbt Core backfill values for newly added columns in existing rows, and existing columns are only dropped if you opt into sync_all_columns. If you know that you will be experiencing schema changes and need them fully applied, you will need to schedule intermittent full refreshes so that the changes are completely reflected in your final table.
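For reference, here is roughly what setting the strategy looks like. The strategy values and the project/folder names below are placeholders, so use whatever fits your project. In the model itself:

{{ config(
    materialized='incremental',
    unique_key='key_dimension',
    on_schema_change='append_new_columns'
) }}

Or for a whole folder of models in dbt_project.yml:

models:
  my_project:
    marts:
      +on_schema_change: sync_all_columns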

Those are all of the tips for this post. If you have any questions or need help with your dbt Core project, feel free to reach out to me. Thanks for reading; I hope this was helpful.

Written by Eric Arsenault

Tech Lead | Analytics Engineering
