Skip to main content

Development Testing: Open Source

Develop dbt models faster by testing as you code.

See how every change to dbt code affects the data produced in the modified model and downstream.

Team Cloud

🔧 Interested in adding Datafold Team Cloud to your CI pipeline? Let's talk! ☎️



What is data-diff?

data-diff is an open source package that you can use to see the impact of your dbt code changes on your dbt models as you code.

Install data-diff

Navigate to your dbt project, and install data-diff and a database connector.

pip install data-diff 'data-diff[snowflake]' -U

Configure your dbt project

info

These docs reflect the latest version of data-diff listed here!

Run pip install data-diff -U to update!

Add the following variables to dbt_project.yml:

#dbt_project.yml
vars:
data_diff:
prod_database: my_default_database # default database for the prod target
prod_schema: my_default_schema # default schema for the prod target
prod_custom_schema: PROD_<custom_schema> # Optional: see dropdown below
Additional schema variable details
The value for prod_custom_schema: will vary based on how you have setup dbt.

This variable is used when a model has a custom schema and becomes dynamic when the string literal <custom_schema> is present. The <custom_schema> substring is replaced with the custom schema for the model in order to support the various ways schema name generation can be overridden here -- also referred to as "advanced custom schemas".

Examples (not exhaustive)


Single production schema
If your prod environment looks like this ...
PROD.ANALYTICS
... your data-diff configuration should look like this:
vars:
  data_diff:
    prod_database: PROD
    prod_schema: ANALYTICS


Some custom schemas in production with a prefix like “prod_”
If your prod environment looks like this ...
PROD.ANALYTICS
PROD.PROD_MARKETING
PROD.PROD_SALES
... your data-diff configuration should look like this:
vars:
  data_diff:
    prod_database: PROD
    prod_schema: ANALYTICS
    prod_custom_schema: PROD_<custom_schema>


Some custom schemas in production with no prefix
If your prod environment looks like this ...
PROD.ANALYTICS
PROD.MARKETING
PROD.SALES
... your data-diff configuration should look like this:
vars:
  data_diff:
    prod_database: PROD
    prod_schema: ANALYTICS
    prod_custom_schema: <custom_schema>

Then, identify primary keys in each model by adding tags, metadata, or uniqueness tests. Check out this page for more details on configuration.

Run with --dbt

Run your dbt model with data-diff --dbt to see the impact that your model change had on the data.

# as one command
dbt run --select <MODEL> && data-diff --dbt
# or as separate commands
dbt run --select <MODEL>
data-diff --dbt

Optional configurations and flags

Running data-diff on specific dbt models

Out of the box, data-diff --dbt will diff all models that were built in your last dbt run.

Beginning with data-diff version 0.7.5, you can add a --select flag to override the default behavior and specify which models you want to diff.

data-diff --dbt --select <models>

Handling very large dbt models

data-diff will reach performance limitations on large dbt models. One strategy to reduce run time in this scenario is to add a filter, which is essentially a where clause that is configured in that model's yml. This defines which rows will be diffed.