Datafold SDK

The Datafold SDK allows you to accomplish certain actions using a thin programmatic wrapper around the Datafold REST API, in particular:

Custom CI Integrations: Submitting information to Datafold about what tables to diff in CI
dbt CI Integrations: Submitting dbt artifacts via CI runner
dbt development: Kick off data diffs from the command line while developing in your dbt project

Install

First, create and activate your virtual environment for Python:

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel

Now, you’re ready to install the Datafold SDK:

pip install datafold-sdk

CLI environment variables

To use the Datafold CLI, you need to set up some environment variables:

export DATAFOLD_API_KEY=XXXXXXXXX

If your Datafold app URL is different from the default app.datafold.com, set DATAFOLD_HOST to your custom domain:

export DATAFOLD_HOST=<CUSTOM_DATAFOLD_APP_DOMAIN>
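
For scripts that call the SDK from Python, the same two variables can be resolved in one place. The following is a minimal sketch, not part of the SDK: the helper name `resolve_datafold_settings` is hypothetical, and it leaves the host as `None` when `DATAFOLD_HOST` is unset, on the assumption that the default app URL then applies.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_datafold_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Optional[str], str]:
    """Return (host, api_key) read from environment-style variables.

    Hypothetical helper for illustration; not part of datafold-sdk.
    """
    env = os.environ if env is None else env

    # The API key is always required.
    api_key = (env.get("DATAFOLD_API_KEY") or "").strip()
    if not api_key:
        raise RuntimeError("DATAFOLD_API_KEY is not set")

    # The host is only needed for a custom domain; None means "use the default".
    host = (env.get("DATAFOLD_HOST") or "").strip() or None
    return host, api_key
```

Failing early on a missing key keeps a misconfigured CI job from reaching the API call with empty credentials.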

Custom CI Integrations

Please follow our CI orchestration docs to set up a custom CI integration leveraging the Datafold SDK.

dbt Core CI Integrations

When you set up Datafold CI diffing for a dbt Core project, we rely on the submission of manifest.json files to represent the production and staging versions of your dbt project. Please see our detailed docs on how to set up Datafold in CI for dbt Core, and reach out to our team if you have questions.

CLI

    datafold dbt upload \
    --ci-config-id <ci_config_id> \
    --run-type <run-type> \
    --target-folder <artifacts_path> \
    --commit-sha <git_sha>

Python

import os

from datafold.sdk.dbt import submit_artifacts

api_key = os.environ.get('DATAFOLD_API_KEY')

# only needed if your Datafold app url is not app.datafold.com
host = os.environ.get("DATAFOLD_HOST")

submit_artifacts(host=host,
                 api_key=api_key,
                 ci_config_id=<ci_config_id>,
                 run_type='<run-type>',
                 target_folder='<artifacts_path>',
                 commit_sha='<git_sha>')
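
If a CI script is written in Python but should invoke the CLI rather than import the SDK, the command shown in the CLI section can be assembled as an argument list. This is a sketch under that assumption; `build_upload_command` is a hypothetical helper, and only the flags already shown above are used.

```python
from typing import List


def build_upload_command(
    ci_config_id: int,
    run_type: str,
    target_folder: str,
    commit_sha: str,
) -> List[str]:
    """Build the argv list for `datafold dbt upload`.

    Hypothetical helper; it only mirrors the flags shown in the CLI example.
    """
    return [
        "datafold", "dbt", "upload",
        "--ci-config-id", str(ci_config_id),
        "--run-type", run_type,
        "--target-folder", target_folder,
        "--commit-sha", commit_sha,
    ]
```

Passing the returned list to `subprocess.run(cmd, check=True)` runs the upload without going through a shell, so paths and SHAs need no quoting.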

Diffing dbt models in development

It can be beneficial to diff between two dbt environments before opening a pull request. This can be done using the Datafold SDK from the command line:

datafold diff dbt

That command will compare data between your development and production environments. By default, all models that were built in the previous dbt run or dbt build command will be compared.
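
To see which models such a default would cover, the sketch below lists the models recorded in dbt's `target/run_results.json`. It assumes the standard dbt artifact layout (a `results` list whose entries carry `unique_id` and `status`). Whether the Datafold CLI reads exactly this file or applies exactly this filter is an assumption, and both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import List


def models_in_last_run(run_results: dict) -> List[str]:
    """Return unique_ids of models that built successfully in the last run.

    Assumption: only nodes whose unique_id starts with "model." and whose
    status is "success" are candidates for diffing.
    """
    return [
        result["unique_id"]
        for result in run_results.get("results", [])
        if result.get("unique_id", "").startswith("model.")
        and result.get("status") == "success"
    ]


def load_run_results(path: str = "target/run_results.json") -> dict:
    """Load the run results artifact written by `dbt run` or `dbt build`."""
    return json.loads(Path(path).read_text())
```

For example, `models_in_last_run(load_run_results())` run from the dbt project root prints the candidate list before any diff is started.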

Running Data Diffs before opening a pull request

It can be helpful to view Data Diff results in your ticket before creating a pull request. This enables faster code reviews by letting developers QA changes earlier. To do this, you can create a draft PR and run the following command:

dbt run && datafold diff dbt

This executes dbt locally and triggers a Data Diff to preview data changes without committing to Git. To automate this workflow, see our guide here.

Update your dbt_project.yml with configurations

Option 1: Add variables to the `dbt_project.yml`

# dbt_project.yml
vars:
  data_diff:
    prod_database: my_default_database # default database for the prod target
    prod_schema: my_default_schema # default schema for the prod target
    prod_custom_schema: PROD_<custom_schema> # Optional: see details below

Additional schema variable details

The value for prod_custom_schema will vary based on how you have set up dbt. This variable is used when a model has a custom schema, and it becomes dynamic when the string literal <custom_schema> is present: the <custom_schema> substring is replaced with the model's custom schema. This supports the various ways schema name generation can be overridden, also referred to as “advanced custom schemas”.

Examples (not exhaustive)

Single production schema

If your prod environment looks like this …

PROD.ANALYTICS

… your data-diff configuration should look like this:

vars:
  data_diff:
    prod_database: PROD
    prod_schema: ANALYTICS

Some custom schemas in production with a prefix like “prod_”

If your prod environment looks like this …

PROD.ANALYTICS
PROD.PROD_MARKETING
PROD.PROD_SALES

… your data-diff configuration should look like this:

vars:
  data_diff:
    prod_database: PROD
    prod_schema: ANALYTICS
    prod_custom_schema: PROD_<custom_schema>

Some custom schemas in production with no prefix

If your prod environment looks like this …

PROD.ANALYTICS
PROD.MARKETING
PROD.SALES

… your data-diff configuration should look like this:

vars:
  data_diff:
    prod_database: PROD
    prod_schema: ANALYTICS
    prod_custom_schema: <custom_schema>
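
The substitution rule described above can be sketched in a few lines of Python. This is an illustration of the described behavior, not the SDK's actual code: `resolve_prod_schema` is a hypothetical function, and falling back to `prod_schema` when a model has a custom schema but `prod_custom_schema` is unset is an assumption.

```python
from typing import Optional

PLACEHOLDER = "<custom_schema>"


def resolve_prod_schema(
    prod_schema: str,
    prod_custom_schema: Optional[str] = None,
    model_custom_schema: Optional[str] = None,
) -> str:
    """Return the production schema to compare a model against.

    Hypothetical illustration of the `<custom_schema>` substitution.
    """
    # Models without a custom schema live in the default prod schema.
    if not model_custom_schema:
        return prod_schema

    # Assumption: with no prod_custom_schema configured, use the default.
    if not prod_custom_schema:
        return prod_schema

    # Replace the literal placeholder with the model's custom schema.
    # A value without the placeholder is returned unchanged (static schema).
    return prod_custom_schema.replace(PLACEHOLDER, model_custom_schema)
```

With the prefixed configuration above, a model whose custom schema is `MARKETING` resolves to `PROD_MARKETING`. With the unprefixed configuration it resolves to `MARKETING`.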

Option 2: Specify a production `manifest.json` using `--state`

Using the --state option is highly recommended for dbt projects with multiple target database and schema configurations. For example, if you customized the generate_schema_name macro, this is the best option for you.

Note: dbt ls is preferred over dbt compile as it runs faster and data diffing does not require fully compiled models to work.

dbt ls -t prod # compile a manifest.json using the "prod" target
mv target/manifest.json prod_manifest.json # move the file up a directory and rename it to prod_manifest.json
dbt run # run your entire dbt project or only a subset of models with `dbt run --select <model_name>`
datafold diff dbt --state prod_manifest.json # run the diff to compare your development results to the production database/schema results in the prod manifest
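
To check what the production manifest contains before diffing, the sketch below maps each model to its production location. It relies on the standard dbt `manifest.json` layout (a `nodes` mapping whose entries carry `resource_type`, `database`, `schema`, `name`, and `alias`). The function names are hypothetical, and this is not how the Datafold CLI is necessarily implemented.

```python
import json
from pathlib import Path
from typing import Dict, Tuple


def prod_relations(manifest: dict) -> Dict[str, Tuple[str, str, str]]:
    """Map each model's unique_id to (database, schema, table).

    Because the manifest stores the resolved schema for every model, any
    custom generate_schema_name logic is already reflected in these values.
    """
    relations = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and so on
        table = node.get("alias") or node["name"]
        relations[unique_id] = (node["database"], node["schema"], table)
    return relations


def load_manifest(path: str = "prod_manifest.json") -> dict:
    """Load a manifest file such as the one produced by `dbt ls -t prod`."""
    return json.loads(Path(path).read_text())
```

Printing `prod_relations(load_manifest())` is a quick way to confirm that the manifest was generated against the intended production target.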

Add your Datafold data connection integration ID to your dbt_project.yml

To connect to your database, navigate to Settings → Integrations → Data connections, click Add new integration, and follow the prompts. After you Test and Save, add the data connection ID (shown under Integrations → Data connections) to your dbt_project.yml.

# dbt_project.yml
vars:
  data_diff:
      ...
      datasource_id: <DATA_SOURCE_ID>

The following optional arguments are available for the diff command:

Options	Description
`--version`	Print version info and exit.
`-w, --where EXPR`	An additional ‘where’ expression to restrict the search space. Beware of SQL Injection!
`--dbt-profiles-dir PATH`	Which directory to look in for the `profiles.yml` file. If not set, we follow the default `profiles.yml` location for the dbt version being used. Can also be set via the `DBT_PROFILES_DIR` environment variable.
`--dbt-project-dir PATH`	Which directory to look in for the `dbt_project.yml` file. Default is the current working directory and its parents.
`--select SELECTION or MODEL_NAME`	Select dbt resources to compare using dbt selection syntax in dbt versions >= 1.5. In versions < 1.5, it will naively search for a model with `MODEL_NAME` as the name.
`--state PATH`	Specify manifest to utilize for ‘prod’ comparison paths instead of using configuration.
`-pd, --prod-database TEXT`	Override the dbt production database configuration within `dbt_project.yml`.
`-ps, --prod-schema TEXT`	Override the dbt production schema configuration within `dbt_project.yml`.
`--help`	Show this message and exit.
