Development Testing: Open Source
Develop dbt models faster by testing as you code.
See how every change to dbt code affects the data produced in the modified model and downstream.
🔧 Interested in adding Datafold Team Cloud to your CI pipeline? Let's talk! ☎️
What is data-diff?
data-diff is an open source package that you can use to see the impact of your dbt code changes on your dbt models as you code.
Install data-diff
Navigate to your dbt project, and install data-diff and a database connector.
- Snowflake
- BigQuery
- Redshift
- PostgreSQL
- Databricks
- DuckDB
pip install data-diff 'data-diff[snowflake]' -U
pip install data-diff google-cloud-bigquery -U
Additional BigQuery details
For example, run:
gcloud auth application-default login
Before running:
dbt run --select <MODEL> && data-diff --dbt
pip install data-diff 'data-diff[redshift]' -U
pip install data-diff 'data-diff[postgres]' -U
pip install data-diff 'data-diff[databricks]' -U
pip install data-diff 'data-diff[duckdb]' -U
Configure your dbt project
These docs reflect the latest version of data-diff listed here!
Run pip install data-diff -U
to update!
There are two options for configuring data-diff --dbt.
Option 1: Add variables to the dbt_project.yml
# dbt_project.yml
vars:
data_diff:
prod_database: my_default_database # default database for the prod target
prod_schema: my_default_schema # default schema for the prod target
prod_custom_schema: PROD_<custom_schema> # Optional: see dropdown below
Additional schema variable details
prod_custom_schema:
will vary based on how you have setup dbt. This variable is used when a model has a custom schema and becomes dynamic when the string literal <custom_schema>
is present. The <custom_schema>
substring is replaced with the custom schema for the model in order to support the various ways schema name generation can be overridden here -- also referred to as "advanced custom schemas".
Examples (not exhaustive)
Single production schema
If your prod environment looks like this ...
PROD.ANALYTICS
... your data-diff configuration should look like this:
vars:
data_diff:
prod_database: PROD
prod_schema: ANALYTICS
Some custom schemas in production with a prefix like “prod_”
If your prod environment looks like this ...
PROD.ANALYTICS
PROD.PROD_MARKETING
PROD.PROD_SALES
... your data-diff configuration should look like this:
vars:
data_diff:
prod_database: PROD
prod_schema: ANALYTICS
prod_custom_schema: PROD_<custom_schema>
Some custom schemas in production with no prefix
If your prod environment looks like this ...
PROD.ANALYTICS
PROD.MARKETING
PROD.SALES
... your data-diff configuration should look like this:
vars:
data_diff:
prod_database: PROD
prod_schema: ANALYTICS
prod_custom_schema: <custom_schema>
Option 2: Specify a production manifest.json
using --state
Using the --state
option is highly recommended for dbt projects with multiple target database and schema configurations. For example, if you customized the generate_schema_name
macro, this is the best option for you.
Note:
dbt ls
is preferred overdbt compile
as it runs faster and data diffing does not require fully compiled models to work.
dbt ls -t prod # compile a manifest.json using the "prod" target
mv target/manifest.json prod_manifest.json # move the file up a directory and rename it to prod_manifest.json
dbt run # run your entire dbt project or only a subset of models with `dbt run --select <model_name>`
data-diff --dbt --state prod_manifest.json # run data-diff to compare your development results to the production database/schema results in the prod manifest
Then, identify primary keys in each model by adding tags, metadata, or uniqueness tests. Check out this page for more details on configuration.
Run with --dbt
Note: This is for Option 1 only. If you are using Option 2, you can skip this step.
Run your dbt model with data-diff --dbt
to see the impact that your model change had on the data.
# as one command
dbt run --select <MODEL> && data-diff --dbt
# or as separate commands
dbt run --select <MODEL>
data-diff --dbt
Optional configurations and flags
Running data-diff
on specific dbt models
Out of the box, data-diff --dbt
will diff all models that were built in your last dbt run
.
Beginning with data-diff
version 0.7.5, you can add a --select
flag to override the default behavior and specify which models you want to diff.
data-diff --dbt --select <models>
Handling very large dbt models
data-diff
will reach performance limitations on large dbt models. One strategy to reduce run time in this scenario is to add a filter, which is essentially a where
clause that is configured in that model's yml. This defines which rows will be diffed.
Another option is to limit the number of columns via the Include / Exclude Columns configuration.