Skip to main content

dbt Integration

Open-Source data-diff

Use Datafold Open Source during development see a summary of how code changes impact your data.

Install data-diff

Navigate to your dbt project, and install data-diff and a database connector.

pip install data-diff 'data-diff[snowflake]' -U

Configure your dbt Project

Add the following variables to dbt_project.yml and tag the primary keys for each model in schema.yml.

#dbt_project.yml
vars:
data_diff:
prod_database: <PROD_DATABASE_NAME>
prod_schema: <PROD_SCHEMA_NAME> # optional (see below)
custom_schemas: <True/False> # optional (see below)
Optional arguments
Schema Environments:
If you utilize schema environments, including custom schemas (default dbt behavior), set the vars like the example below:
vars:
  data_diff:
    prod_database: PROD_DATABASE_NAME
    prod_schema: PROD_SCHEMA_NAME
Then the path to the prod model will rendered in one of two ways per model.

When the model has a custom schema:
  <prod_database>.<prod_schema>_<custom_schema>.<model>
Otherwise:
  <prod_database>.<prod_schema>.<model>

You can add custom_schemas: False to force the path to use <prod_schema> every time, but this is an unlikely scenario.

Database Environments:
If you have modified the default dbt behavior in order to use databases as environments (prod_db, pr_db, dev_db), and each of those dbs have the same schemas, set the vars like the example below:
vars:
  data_diff:
    prod_database: PROD_DATABASE_NAME
    custom_schemas: False
Then the path to the prod model will always be:
  <prod_database>.<same_schema_as_dev_model>.<model>
#schema.yaml
models:
- name: <MODEL_NAME>
columns:
- name: <COLUMN_NAME>
tags:
- primary-key

To learn more about setting primary key tags, check out this section.

Run with --dbt

Run your dbt model with data-diff --dbt to see the impact that your model change had on the data.

# as one command
dbt run --select <MODEL> && data-diff --dbt
# or as separate commands
dbt run --select <MODEL>
data-diff --dbt


Datafold Cloud

Want to see even more? Datafold Cloud gives you insight into value-level changes.

Create an Account

info

Datafold Cloud is currently in private beta. Let us know if you are interested in participating.

Set up Datafold's open-source data-diff before configuring the cloud integration below.

Configure a Data Source

To configure a Data Source, navigate to Settings Integrations Data warehouses and click Add new integration and follow the prompts. For more information, check out our Data Source configuration guides.

After you Test and Save, add the Data Source ID (which can be found on the Data warehouses page) to your dbt_project_yml.

# dbt_project.yml
vars:
data_diff:
...
datasource_id: <DATA_SOURCE_ID>

Generate an API Key

To generate a personal API key, navigate to Settings Account and click Create API Key.

Copy and export your API Key as an environment variable. We suggest storing it in a file like .zshrc or .bash_profile, but you can also run the command below directly in your project.

export DATAFOLD_API_KEY=XXXXXXXXX

Run with --cloud

Run your dbt model with --cloud to see the impact that your model change had on the data.

# as one command
dbt run --select <MODEL> && data-diff --dbt --cloud
# or as separate commands
dbt run --select <MODEL>
data-diff --dbt --cloud