Datafold SDK
The Datafold SDK allows you to accomplish certain actions using a thin programmatic wrapper around the Datafold REST API, in particular:
- Custom CI Integrations: Submitting information to Datafold about what tables to diff in CI
- dbt CI Integrations: Submitting dbt artifacts via CI runner
- dbt development: Kick off data diffs from the command line while developing in your dbt project
Install
First, create and activate your virtual environment for Python:
Now, you’re ready to install the Datafold SDK:
CLI environment variables
To use the Datafold CLI, you need to set up some environment variables:
If your Datafold app URL is different from the default app.datafold.com, set the custom domain as the variable:
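As a rough illustration of how these settings fit together, the sketch below resolves them in Python. The variable names `DATAFOLD_API_KEY` and `DATAFOLD_HOST`, the `https://` prefix on the default URL, and the helper `resolve_datafold_settings` are all assumptions made for this example; they are not taken from this page, so confirm the exact names against your Datafold setup.

```python
import os
from typing import Mapping, Optional

# Default app URL named above; the https:// scheme is an assumption.
DEFAULT_HOST = "https://app.datafold.com"


def resolve_datafold_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the API key and host, falling back to the default host.

    Hypothetical helper, not part of the Datafold SDK. The variable names
    DATAFOLD_API_KEY and DATAFOLD_HOST are assumptions for this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get("DATAFOLD_API_KEY")
    if not api_key:
        raise RuntimeError("DATAFOLD_API_KEY is not set")
    # An unset or empty host falls back to the default; drop a trailing slash.
    host = (env.get("DATAFOLD_HOST") or DEFAULT_HOST).rstrip("/")
    return {"api_key": api_key, "host": host}
```

Calling the helper at the start of a script surfaces a missing key before any CLI command runs, which tends to be easier to debug than a failed request.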
Custom CI Integrations
Please follow our CI orchestration docs to set up a custom CI integration leveraging the Datafold SDK.
dbt Core CI Integrations
When you set up Datafold CI diffing for a dbt Core project, we rely on the submission of manifest.json files to represent the production and staging versions of your dbt project.
Please see our detailed docs on how to set up Datafold in CI for dbt Core, and reach out to our team if you have questions.
CLI
Python
Diffing dbt models in development
It can be beneficial to diff between two dbt environments before opening a pull request. This can be done using the Datafold SDK from the command line:
That command will compare data between your development and production environments. By default, all models that were built in the previous dbt run or dbt build command will be compared.
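To make the default selection more concrete, here is a minimal sketch of how "models built in the previous run" could be read from dbt's `run_results.json` artifact. It rests on two assumptions: that the selection is derived from that artifact, and that only results with status `success` count. This page does not describe the SDK's actual logic, and the helper name `models_from_run_results` is hypothetical.

```python
from typing import List


def models_from_run_results(run_results: dict) -> List[str]:
    """Return unique_ids of models that built successfully in the last run.

    Illustrative only: filtering on status == "success" is an assumption,
    not documented SDK behavior.
    """
    return [
        result["unique_id"]
        for result in run_results.get("results", [])
        # dbt prefixes model unique_ids with "model."; tests use "test.".
        if result.get("unique_id", "").startswith("model.")
        and result.get("status") == "success"
    ]
```

In a dbt project the input would typically come from parsing `target/run_results.json` with `json.load`.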
Running Data Diffs before opening a pull request
It can be helpful to view Data Diff results in your ticket before creating a pull request. This enables faster code reviews by letting developers QA changes earlier.
To do this, you can create a draft PR and run the following command:
This executes dbt locally and triggers a Data Diff to preview data changes without committing to Git. To automate this workflow, see our guide here.
Update your dbt_project.yml with configurations
Option 1: Add variables to the dbt_project.yml
Additional schema variable details
The value for prod_custom_schema: will vary based on how you have set up dbt.
This variable is used when a model has a custom schema and becomes dynamic when the string literal <custom_schema> is present. The <custom_schema> substring is replaced with the custom schema for the model in order to support the various ways schema name generation can be overridden in dbt (also referred to as “advanced custom schemas”).
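The substitution described above can be sketched in a few lines of Python. This is an illustration of the rule rather than the SDK's implementation. The function name, the schema values, and the choice to raise an error when a model has a custom schema but no prod_custom_schema is configured are all assumptions made for this example.

```python
from typing import Optional

PLACEHOLDER = "<custom_schema>"


def resolve_prod_schema(
    prod_schema: str,
    prod_custom_schema: Optional[str],
    model_custom_schema: Optional[str],
) -> str:
    """Pick the production schema for one model (illustrative only)."""
    if not model_custom_schema:
        # No custom schema on the model: use the default production schema.
        return prod_schema
    if prod_custom_schema is None:
        # Assumption for this sketch: a missing template is treated as an error.
        raise ValueError("model has a custom schema but prod_custom_schema is unset")
    # A value without the placeholder is returned unchanged (a static schema).
    return prod_custom_schema.replace(PLACEHOLDER, model_custom_schema)


# Hypothetical values, not taken from any real project:
print(resolve_prod_schema("analytics", "prod_<custom_schema>", "finance"))  # prod_finance
print(resolve_prod_schema("analytics", "<custom_schema>", "finance"))       # finance
print(resolve_prod_schema("analytics", "analytics", "finance"))             # analytics
```

The three calls correspond to a prefixed custom schema, an unprefixed custom schema, and a single production schema shared by all models.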
Examples (not exhaustive)
Single production schema
If your prod environment looks like this …
… your data-diff configuration should look like this:
Some custom schemas in production with a prefix like “prod_”
If your prod environment looks like this …
… your data-diff configuration should look like this:
Some custom schemas in production with no prefix
If your prod environment looks like this …
… your data-diff configuration should look like this:
Option 2: Specify a production manifest.json using --state
Using the --state option is highly recommended for dbt projects with multiple target database and schema configurations. For example, if you customized the generate_schema_name macro, this is the best option for you.
Note: dbt ls is preferred over dbt compile as it runs faster and data diffing does not require fully compiled models to work.
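One way to see why a production manifest copes with customized schema generation: each model node in a dbt manifest.json already records its resolved database, schema, and alias, so no naming rule has to be re-derived. The sketch below reads those fields from a parsed manifest. The helper name `prod_paths_from_manifest` is hypothetical, and this is not presented as how the SDK itself consumes the file.

```python
from typing import Dict, Tuple


def prod_paths_from_manifest(manifest: dict) -> Dict[str, Tuple[str, str, str]]:
    """Map each model's unique_id to its (database, schema, relation) path."""
    paths = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        # Skip tests, seeds, snapshots, and other non-model nodes.
        if node.get("resource_type") != "model":
            continue
        # alias is the relation name; fall back to the model name if absent.
        relation = node.get("alias") or node["name"]
        paths[unique_id] = (node["database"], node["schema"], relation)
    return paths
```

The input is the dictionary produced by `json.load` on the production manifest.json that you pass via --state.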
Add your Datafold data connection integration ID to your dbt_project.yml
To connect to your database, navigate to Settings → Integrations → Data connections, click Add new integration, and follow the prompts.
After you Test and Save, add the ID (which can be found on Settings → Integrations → Data connections) to your dbt_project.yml.
The following optional arguments are available:
Options | Description |
---|---|
--version | Print version info and exit. |
-w, --where EXPR | An additional ‘where’ expression to restrict the search space. Beware of SQL Injection! |
--dbt-profiles-dir PATH | Which directory to look in for the profiles.yml file. If not set, we follow the default profiles.yml location for the dbt version being used. Can also be set via the DBT_PROFILES_DIR environment variable. |
--dbt-project-dir PATH | Which directory to look in for the dbt_project.yml file. Default is the current working directory and its parents. |
--select SELECTION or MODEL_NAME | Select dbt resources to compare using dbt selection syntax in dbt versions >= 1.5. In versions < 1.5, it will naively search for a model with MODEL_NAME as the name. |
--state PATH | Specify manifest to utilize for ‘prod’ comparison paths instead of using configuration. |
-pd, --prod-database TEXT | Override the dbt production database configuration within dbt_project.yml. |
-ps, --prod-schema TEXT | Override the dbt production schema configuration within dbt_project.yml. |
--help | Show this message and exit. |