Skip to main content

data-diff CLI


As of May 17, 2024, Datafold is no longer actively supporting or developing open source data-diff. For continued access to in-database and cross-database diffing, please check out a free trial of Datafold Cloud. We’re grateful to everyone who made contributions along the way. Please see our blog post for additional context on this decision.

Getting Started

1. Install data-diff

data-diff is an open source package that you can install with the necessary database connector.

pip install data-diff 'data-diff[snowflake]' -U

2. Configure your dbt project


These docs reflect the latest version of data-diff listed here!

Run pip install data-diff -U to update!

There are two options for configuring data-diff --dbt.

Option 1: Add variables to the dbt_project.yml

# dbt_project.yml
prod_database: my_default_database # default database for the prod target
prod_schema: my_default_schema # default schema for the prod target
prod_custom_schema: PROD_<custom_schema> # Optional: see dropdown below
Additional schema variable details
The value for prod_custom_schema: will vary based on how you have setup dbt.

This variable is used when a model has a custom schema and becomes dynamic when the string literal <custom_schema> is present. The <custom_schema> substring is replaced with the custom schema for the model in order to support the various ways schema name generation can be overridden here -- also referred to as "advanced custom schemas".

Examples (not exhaustive)

Single production schema
If your prod environment looks like this ...
... your data-diff configuration should look like this:
    prod_database: PROD
    prod_schema: ANALYTICS

Some custom schemas in production with a prefix like “prod_”
If your prod environment looks like this ...
... your data-diff configuration should look like this:
    prod_database: PROD
    prod_schema: ANALYTICS
    prod_custom_schema: PROD_<custom_schema>

Some custom schemas in production with no prefix
If your prod environment looks like this ...
... your data-diff configuration should look like this:
    prod_database: PROD
    prod_schema: ANALYTICS
    prod_custom_schema: <custom_schema>

Option 2: Specify a production manifest.json using --state

Using the --state option is highly recommended for dbt projects with multiple target database and schema configurations. For example, if you customized the generate_schema_name macro, this is the best option for you.

Note: dbt ls is preferred over dbt compile as it runs faster and data diffing does not require fully compiled models to work.

dbt ls -t prod # compile a manifest.json using the "prod" target
mv target/manifest.json prod_manifest.json # move the file up a directory and rename it to prod_manifest.json
dbt run # run your entire dbt project or only a subset of models with `dbt run --select <model_name>`
data-diff --dbt --state prod_manifest.json # run data-diff to compare your development results to the production database/schema results in the prod manifest

Then, identify primary keys in each model by adding tags, metadata, or uniqueness tests. Check out this page for more details on configuration.

3. Run with --dbt

Note: This is for Option 1 only. If you are using Option 2, you can skip this step.

Run your dbt model with data-diff --dbt to see the impact that your model change had on the data.

# as one command
dbt run --select <MODEL> && data-diff --dbt
# or as separate commands
dbt run --select <MODEL>
data-diff --dbt

Optional configurations and flags

Running data-diff on specific dbt models

Out of the box, data-diff --dbt will diff all models that were built in your last dbt run.

Beginning with data-diff version 0.7.5, you can add a --select flag to override the default behavior and specify which models you want to diff.

data-diff --dbt --select <models>

Handling very large dbt models

data-diff will reach performance limitations on large dbt models. One strategy to reduce run time in this scenario is to add a filter, which is essentially a where clause that is configured in that model's yml. This defines which rows will be diffed.

Another option is to limit the number of columns via the Include / Exclude Columns configuration.