
Datafold SDK

Install

First, create and activate your virtual environment for Python:

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel

Now, you're ready to install the Datafold SDK:

pip install datafold-sdk

Configure

After selecting datafold-sdk from the available options, complete the configuration with the following information:

Field Name | Description
Repository | Select the repository that generates the webhooks and where pull/merge requests will be raised.
Data Connection | Select the data connection where the code that is changed in the repository will run.
Name | An identifier used in Datafold to identify this CI configuration.
Files to ignore | If defined, the files matching the pattern will be ignored in the PRs. The pattern uses the syntax of .gitignore.
Mark the CI check as failed on errors | If the checkbox is disabled, the errors in the CI runs will be reported back to GitHub/GitLab as successes, to keep the check "green" and not block the pull/merge request. This is enabled by default: the errors are reported as failures and may prevent pull/merge requests from being merged.
Require the datafold label to start CI | When this is selected, the Datafold CI process will only run when the 'datafold' label has been applied. This label needs to be created manually in GitHub or GitLab and the title or name must match 'datafold' exactly.
Sampling tolerance | The tolerance to apply in sampling for all data diffs.
Sampling confidence | The confidence to apply when sampling.
Sampling Threshold | Sampling will be disabled automatically if tables are smaller than the specified threshold. If unspecified, default values will be used depending on the Data Connection type.

CLI environment variables

To use the Datafold CLI, you need to set up some environment variables:

export DATAFOLD_API_KEY=XXXXXXXXX

If your Datafold app URL is different from the default app.datafold.com, set DATAFOLD_HOST to your custom domain:

export DATAFOLD_HOST=<CUSTOM_DATAFOLD_APP_DOMAIN>
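For scripts that call the SDK from Python, the same two variables can be read once at startup. The following is a minimal sketch, not part of the SDK: load_datafold_settings is a hypothetical helper, and it assumes that an unset DATAFOLD_HOST should simply be passed on as None so the default app is used.

```python
import os
from typing import Mapping, Optional, Tuple


def load_datafold_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (api_key, host) from the environment.

    Hypothetical helper: fails early if DATAFOLD_API_KEY is missing and
    returns host=None when DATAFOLD_HOST is not set.
    """
    env = os.environ if env is None else env

    api_key = env.get("DATAFOLD_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("DATAFOLD_API_KEY is not set")

    # DATAFOLD_HOST is only needed for a custom app domain.
    host = env.get("DATAFOLD_HOST", "").strip() or None
    return api_key, host


if __name__ == "__main__":
    try:
        _, host = load_datafold_settings()
        print(f"Datafold host override: {host}")
    except RuntimeError as exc:
        print(f"Configuration error: {exc}")
```

Failing early on a missing key keeps the error close to its cause instead of surfacing later as a failed API call.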

Examples

Run data diffs in Datafold Cloud

The following are optional arguments that can be specified when triggering a data diff run in the app:

Options | Description
--version | Print version info and exit.
-w, --where EXPR | An additional 'where' expression to restrict the search space. Beware of SQL Injection!
--dbt-profiles-dir PATH | Which directory to look in for the profiles.yml file. If not set, we follow the default profiles.yml location for the dbt version being used. Can also be set via the DBT_PROFILES_DIR environment variable.
--dbt-project-dir PATH | Which directory to look in for the dbt_project.yml file. Default is the current working directory and its parents.
--select SELECTION or MODEL_NAME | Select dbt resources to compare using dbt selection syntax in dbt versions >= 1.5. In versions < 1.5, it will naively search for a model with MODEL_NAME as the name.
--state PATH | Specify manifest to utilize for 'prod' comparison paths instead of using configuration.
-pd, --prod-database TEXT | Override the dbt production database configuration within dbt_project.yml.
-ps, --prod-schema TEXT | Override the dbt production schema configuration within dbt_project.yml.
--help | Show this message and exit.

CLI

To run a data diff in Datafold Cloud using dbt:

datafold diff dbt

Submit diffs for a CI run

The following arguments need to be specified when submitting dbt artifacts via the Datafold SDK (examples for Python and CLI below):

Argument | Description
ci_config_id | The ID of the CI config in Datafold (see the CI settings screen).
pr_num | The number of the pull request.
diffs | The file containing the diffs to submit. A JSON array is expected (see the example below). If no file is specified, the diffs are read from stdin.

Example JSON format for diffs file

The JSON file should define the production and pull request tables to compare, along with any primary keys and columns to include or exclude in the comparison.

[
  {
    "prod": "YOUR_PROJECT.PRODUCTION_TABLE_A",
    "pr": "YOUR_PROJECT.PR_TABLE_NUM",
    "pk": ["ID"],
    "include_columns": ["Column1", "Column2"],
    "exclude_columns": ["Column3"]
  },
  {
    "prod": "YOUR_PROJECT.PRODUCTION_TABLE_B",
    "pr": "YOUR_PROJECT.PR_TABLE_NUM",
    "pk": ["ID"],
    "include_columns": ["Column1"],
    "exclude_columns": []
  }
]
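A diffs file in this shape can also be generated and checked from Python before it is submitted. The sketch below is illustrative only: validate_diffs and write_diffs_file are hypothetical helpers, and the rules they enforce (prod and pr are required strings; pk, include_columns and exclude_columns are optional lists of strings) are assumptions inferred from the example above, not a documented schema.

```python
import json
from typing import Any, Dict, List

# Assumption: only "prod" and "pr" are mandatory; the list-valued keys are optional.
REQUIRED_KEYS = ("prod", "pr")
LIST_KEYS = ("pk", "include_columns", "exclude_columns")


def validate_diffs(diffs: Any) -> List[Dict[str, Any]]:
    """Return `diffs` unchanged if it matches the example shape, else raise ValueError."""
    if not isinstance(diffs, list):
        raise ValueError("diffs must be a JSON array")
    for index, entry in enumerate(diffs):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} must be a JSON object")
        for key in REQUIRED_KEYS:
            if not isinstance(entry.get(key), str) or not entry[key]:
                raise ValueError(f"entry {index}: '{key}' must be a non-empty string")
        for key in LIST_KEYS:
            value = entry.get(key, [])
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                raise ValueError(f"entry {index}: '{key}' must be a list of strings")
    return diffs


def write_diffs_file(path: str, diffs: Any) -> None:
    """Validate `diffs` and write it to `path` as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(validate_diffs(diffs), handle, indent=2)
```

For example, write_diffs_file("diffs.json", [{"prod": "YOUR_PROJECT.PRODUCTION_TABLE_A", "pr": "YOUR_PROJECT.PR_TABLE_NUM", "pk": ["ID"]}]) produces a file that can be passed as the diffs argument.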

CLI

To submit diffs for a CI run, replace ci_config_id, pr_num, and diffs_file with the appropriate values for your CI configuration ID, pull request number, and the path to your diffs JSON file.

datafold ci submit \
    --ci-config-id <ci_config_id> \
    --pr-num <pr_num> \
    --diffs <diffs_file>
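
When the submission is driven from a Python script rather than a shell, the same command can be assembled as an argument list and run with subprocess. This is a rough sketch under assumptions: build_submit_command and submit_diffs are hypothetical helpers, the datafold executable is assumed to be on PATH, and treating a non-zero exit code as failure is an assumption about the CLI rather than documented behaviour.

```python
import subprocess
from typing import List


def build_submit_command(ci_config_id: int, pr_num: int, diffs_file: str) -> List[str]:
    """Assemble the `datafold ci submit` invocation as an argument list."""
    return [
        "datafold", "ci", "submit",
        "--ci-config-id", str(ci_config_id),
        "--pr-num", str(pr_num),
        "--diffs", diffs_file,
    ]


def submit_diffs(ci_config_id: int, pr_num: int, diffs_file: str) -> int:
    """Run the command and return its exit code.

    The CLI reads DATAFOLD_API_KEY (and optionally DATAFOLD_HOST) from the
    environment, which subprocess inherits from the calling process.
    """
    command = build_submit_command(ci_config_id, pr_num, diffs_file)
    completed = subprocess.run(command, check=False)
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting problems with file paths. A call such as submit_diffs(1, 42, "diffs.json") uses placeholder values; substitute your own CI configuration ID and pull request number.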

Python

To submit diffs for a CI run, replace ci_config_id, pr_num, and diffs_file with the appropriate values for your CI configuration ID, pull request number, and the path to your diffs JSON file.

import os

from datafold_sdk.sdk.ci import run_diff

# The API key is read from the environment rather than hard-coded.
api_key = os.environ.get("DATAFOLD_API_KEY")

# Only needed if your Datafold app URL is not app.datafold.com
host = os.environ.get("DATAFOLD_HOST")

# Replace the <...> placeholders with your own values before running.
run_diff(
    host=host,
    api_key=api_key,
    ci_config_id=<ci_config_id>,
    pr_num=<pr_num>,
    diffs='<diffs_file>',
)