Development Testing: Open Source
Develop dbt models faster by testing as you code.
See how every change to dbt code affects the data produced in the modified model and downstream.
🔧 Interested in adding Datafold Team Cloud to your CI pipeline? Let's talk! ☎️
What is data-diff?
data-diff is an open source package that you can use to see the impact of your dbt code changes on your dbt models as you code.
Install data-diff
Navigate to your dbt project, and install data-diff and a database connector.
- Snowflake
- BigQuery
- Redshift
- PostgreSQL
- Databricks
- DuckDB
pip install data-diff 'data-diff[snowflake]' -U
pip install data-diff google-cloud-bigquery -U
Additional BigQuery details
For example, run:
gcloud auth application-default login
Before running:
dbt run --select <MODEL> && data-diff --dbt
pip install data-diff 'data-diff[redshift]' -U
pip install data-diff 'data-diff[postgres]' -U
pip install data-diff 'data-diff[databricks]' -U
pip install data-diff 'data-diff[duckdb]' -U
Configure your dbt project
These docs reflect the latest version of data-diff listed here!
Run pip install data-diff -U
to update!
Add the following variables to dbt_project.yml:
#dbt_project.yml
vars:
data_diff:
prod_database: my_default_database # default database for the prod target
prod_schema: my_default_schema # default schema for the prod target
prod_custom_schema: PROD_<custom_schema> # Optional: see dropdown below
Additional schema variable details
prod_custom_schema:
will vary based on how you have setup dbt. This variable is used when a model has a custom schema and becomes dynamic when the string literal <custom_schema>
is present. The <custom_schema>
substring is replaced with the custom schema for the model in order to support the various ways schema name generation can be overridden here -- also referred to as "advanced custom schemas".
Examples (not exhaustive)
Single production schema
If your prod environment looks like this ...
PROD.ANALYTICS
... your data-diff configuration should look like this:
vars:
data_diff:
prod_database: PROD
prod_schema: ANALYTICS
Some custom schemas in production with a prefix like “prod_”
If your prod environment looks like this ...
PROD.ANALYTICS
PROD.PROD_MARKETING
PROD.PROD_SALES
... your data-diff configuration should look like this:
vars:
data_diff:
prod_database: PROD
prod_schema: ANALYTICS
prod_custom_schema: PROD_<custom_schema>
Some custom schemas in production with no prefix
If your prod environment looks like this ...
PROD.ANALYTICS
PROD.MARKETING
PROD.SALES
... your data-diff configuration should look like this:
vars:
data_diff:
prod_database: PROD
prod_schema: ANALYTICS
prod_custom_schema: <custom_schema>
Then, identify primary keys in each model by adding tags, metadata, or uniqueness tests. Check out this page for more details on configuration.
Run with --dbt
Run your dbt model with data-diff --dbt
to see the impact that your model change had on the data.
# as one command
dbt run --select <MODEL> && data-diff --dbt
# or as separate commands
dbt run --select <MODEL>
data-diff --dbt
Optional configurations and flags
Running data-diff
on specific dbt models
Out of the box, data-diff --dbt
will diff all models that were built in your last dbt run
.
Beginning with data-diff
version 0.7.5, you can add a --select
flag to override the default behavior and specify which models you want to diff.
data-diff --dbt --select <models>
Handling very large dbt models
data-diff
will reach performance limitations on large dbt models. One strategy to reduce run time in this scenario is to add a filter, which is essentially a where
clause that is configured in that model's yml. This defines which rows will be diffed.