dbt Integration
Open-Source data-diff
Use Datafold's open-source data-diff during development to see a summary of how your code changes impact your data.
Install data-diff
Navigate to your dbt project, and install data-diff and a database connector.
- Snowflake: pip install data-diff 'data-diff[snowflake]' -U
- BigQuery: pip install data-diff google-cloud-bigquery -U (see additional details below)
- Redshift: pip install data-diff 'data-diff[redshift]' -U
- PostgreSQL: pip install data-diff 'data-diff[postgres]' -U
- Databricks: pip install data-diff 'data-diff[databricks]' -U
- DuckDB: pip install data-diff 'data-diff[duckdb]' -U

Additional BigQuery details
For example, run:
gcloud auth application-default login
before running:
dbt run --select <MODEL> && data-diff --dbt
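If you work with more than one warehouse, the extras can be combined in a single install; this is standard pip behavior rather than anything data-diff-specific, and the extra names are the ones shown in the list above. A quick sanity check afterwards (a sketch; --help should print the CLI usage if the install succeeded):

# install data-diff with two connectors at once (extras are comma-separated)
pip install 'data-diff[snowflake,postgres]' -U
# confirm the CLI is available on your PATH
data-diff --help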
Configure your dbt Project
Add the following variables to dbt_project.yml and tag the primary keys for each model in schema.yml.
# dbt_project.yml
vars:
  data_diff:
    prod_database: <PROD_DATABASE_NAME>
    prod_schema: <PROD_SCHEMA_NAME> # optional (see below)
    custom_schemas: <True/False> # optional (see below)
Optional arguments
If you use schema environments, including custom schemas (the default dbt behavior), set the vars as in the example below:

vars:
  data_diff:
    prod_database: PROD_DATABASE_NAME
    prod_schema: PROD_SCHEMA_NAME
Then the path to the prod model will be rendered in one of two ways per model. When the model has a custom schema:
<prod_database>.<prod_schema>_<custom_schema>.<model>
Otherwise:
<prod_database>.<prod_schema>.<model>
You can add custom_schemas: False to force the path to use <prod_schema> every time, but this is an unlikely scenario.
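As a concrete illustration of the two renderings above (the database, schema, and model names here are hypothetical):

# dbt_project.yml (hypothetical values)
vars:
  data_diff:
    prod_database: ANALYTICS
    prod_schema: PROD

# a model with a custom schema of "marketing" resolves to:
#   ANALYTICS.PROD_marketing.<model>
# a model with no custom schema resolves to:
#   ANALYTICS.PROD.<model>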
Database Environments:
If you have modified the default dbt behavior to use databases as environments (prod_db, pr_db, dev_db), and each of those databases has the same schemas, set the vars as in the example below:
vars:
  data_diff:
    prod_database: PROD_DATABASE_NAME
    custom_schemas: False
Then the path to the prod model will always be:

<prod_database>.<same_schema_as_dev_model>.<model>
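For example (again with hypothetical names), if prod lives in the ANALYTICS_PROD database and your dev target builds the same schemas into ANALYTICS_DEV:

# dbt_project.yml (hypothetical values)
vars:
  data_diff:
    prod_database: ANALYTICS_PROD
    custom_schemas: False

# dev model:  ANALYTICS_DEV.core.dim_orgs
# prod model: ANALYTICS_PROD.core.dim_orgs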
# schema.yml
models:
  - name: <MODEL_NAME>
    columns:
      - name: <COLUMN_NAME>
        tags:
          - primary-key
To learn more about setting primary key tags, check out this section.
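A filled-in version of the snippet above, assuming a hypothetical dim_orgs model keyed on org_id:

# schema.yml (hypothetical model)
models:
  - name: dim_orgs
    columns:
      - name: org_id
        tags:
          - primary-key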
Run with --dbt
Run your dbt model with data-diff --dbt to see the impact that your model change had on the data.
# as one command
dbt run --select <MODEL> && data-diff --dbt
# or as separate commands
dbt run --select <MODEL>
data-diff --dbt
Datafold Cloud
Want to see even more? Datafold Cloud gives you insight into value-level changes.
Create an Account
Datafold Cloud is currently in private beta. Let us know if you are interested in participating.
Set up Datafold's open-source data-diff before configuring the cloud integration below.
Configure a Data Source
To configure a Data Source, navigate to Settings → Integrations → Data warehouses and click Add new integration and follow the prompts. For more information, check out our Data Source configuration guides.
After you Test and Save, add the Data Source ID (which can be found on the Data warehouses page) to your dbt_project.yml.
# dbt_project.yml
vars:
  data_diff:
    ...
    datasource_id: <DATA_SOURCE_ID>
Generate an API Key
To generate a personal API key, navigate to Settings → Account and click Create API Key.
Copy and export your API key as an environment variable. We suggest storing it in a file like .zshrc or .bash_profile, but you can also run the command below directly in your project.
export DATAFOLD_API_KEY=XXXXXXXXX
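One way to persist the key (a sketch assuming zsh; swap in .bash_profile or another profile file for your shell) is to append the export to your profile and reload it:

# add the key to your shell profile so new terminals pick it up
echo 'export DATAFOLD_API_KEY=XXXXXXXXX' >> ~/.zshrc
# reload the profile in the current session
source ~/.zshrc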
Run with --cloud
Run your dbt model with data-diff --dbt --cloud to see the impact that your model change had on the data.
# as one command
dbt run --select <MODEL> && data-diff --dbt --cloud
# or as separate commands
dbt run --select <MODEL>
data-diff --dbt --cloud