Datafold SDK
Install
First, create and activate your virtual environment for Python:
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel
Now, you’re ready to install the Datafold SDK:
pip install datafold-sdk
Configure
After selecting datafold-sdk from the available options, complete the configuration with the following information:
Field Name | Description |
---|---|
Repository | Select the repository that generates the webhooks and where pull/merge requests will be raised. |
Data Connection | Select the data connection where the code that is changed in the repository will run. |
Name | An identifier used in Datafold to identify this CI configuration. |
Files to ignore | If defined, the files matching the pattern will be ignored in the PRs. The pattern uses the syntax of .gitignore. |
Mark the CI check as failed on errors | If the checkbox is disabled, the errors in the CI runs will be reported back to GitHub/GitLab as successes, to keep the check “green” and not block the pull/merge request. This is enabled by default: the errors are reported as failures and may prevent pull/merge requests from being merged. |
Require the datafold label to start CI | When this is selected, the Datafold CI process will only run when the ‘datafold’ label has been applied. This label needs to be created manually in GitHub or GitLab and the title or name must match ‘datafold’ exactly. |
Sampling tolerance | The tolerance to apply in sampling for all data diffs. |
Sampling confidence | The confidence to apply when sampling. |
Sampling Threshold | Sampling will be disabled automatically if tables are smaller than the specified threshold. If unspecified, default values will be used depending on the Data Connection type. |
CLI environment variables
To use the Datafold CLI, you need to set up some environment variables:
export DATAFOLD_API_KEY=XXXXXXXXX
If your Datafold app URL is different from the default app.datafold.com, set the custom domain as the variable:
export DATAFOLD_HOST=<CUSTOM_DATAFOLD_APP_DOMAIN>
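A script that wraps the SDK can check for these variables before doing any work. The following is a minimal sketch, not part of the SDK: the helper name `load_datafold_settings` is hypothetical, and it assumes that an unset `DATAFOLD_HOST` means the default app.datafold.com should be used.

```python
import os


def load_datafold_settings(env=None):
    """Return (api_key, host) from a mapping of environment variables.

    Hypothetical helper for illustration; it is not provided by datafold-sdk.
    host is None when DATAFOLD_HOST is unset, which this sketch treats as
    "use the default app.datafold.com".
    """
    env = os.environ if env is None else env

    api_key = env.get("DATAFOLD_API_KEY")
    if not api_key:
        raise RuntimeError("DATAFOLD_API_KEY is not set")

    # An empty string is treated the same as an unset variable.
    host = env.get("DATAFOLD_HOST") or None
    return api_key, host
```

Calling `load_datafold_settings()` with no arguments reads from the current process environment; passing a dictionary makes the check easy to test.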
Examples
Run data diffs in Datafold
The following are optional arguments that can be specified when triggering a data diff run in the app:
Options | Description |
---|---|
--version | Print version info and exit. |
-w, --where EXPR | An additional ‘where’ expression to restrict the search space. Beware of SQL Injection! |
--dbt-profiles-dir PATH | Which directory to look in for the profiles.yml file. If not set, we follow the default profiles.yml location for the dbt version being used. Can also be set via the DBT_PROFILES_DIR environment variable. |
--dbt-project-dir PATH | Which directory to look in for the dbt_project.yml file. Default is the current working directory and its parents. |
--select SELECTION or MODEL_NAME | Select dbt resources to compare using dbt selection syntax in dbt versions >= 1.5. In versions < 1.5, it will naively search for a model with MODEL_NAME as the name. |
--state PATH | Specify manifest to utilize for ‘prod’ comparison paths instead of using configuration. |
-pd, --prod-database TEXT | Override the dbt production database configuration within dbt_project.yml. |
-ps, --prod-schema TEXT | Override the dbt production schema configuration within dbt_project.yml. |
--help | Show this message and exit. |
CLI
To run a data diff in Datafold using dbt:
datafold diff dbt
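When the command is launched from a script, building the argument list in code keeps the optional flags explicit. The sketch below is illustrative only: `build_diff_command` is a hypothetical helper, and it assumes that `datafold diff dbt` accepts the options listed in the table above.

```python
import shlex


def build_diff_command(select=None, state=None, prod_database=None,
                       prod_schema=None, where=None):
    """Build an argument list for `datafold diff dbt` (hypothetical helper).

    Assumption: the options from the table above are accepted by
    `datafold diff dbt`. Only options that are given are added.
    """
    cmd = ["datafold", "diff", "dbt"]
    if select:
        cmd += ["--select", select]
    if state:
        cmd += ["--state", state]
    if prod_database:
        cmd += ["--prod-database", prod_database]
    if prod_schema:
        cmd += ["--prod-schema", prod_schema]
    if where:
        # A list avoids shell quoting problems, but it does not validate
        # the SQL expression itself; never pass untrusted input here.
        cmd += ["--where", where]
    return cmd


if __name__ == "__main__":
    # Print the command in a copy-pastable, shell-quoted form.
    print(shlex.join(build_diff_command(select="my_model")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` so that no shell is involved in interpreting the arguments.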
Submit diffs for a CI run
The following arguments need to be specified when submitting dbt artifacts via the Datafold SDK (examples for Python and CLI below):
Argument | Description |
---|---|
ci_config_id | The ID of the CI config in Datafold (see CI settings screen) |
pr_num | The number of the pull request |
diffs | The file that lists the diffs to run. It must contain a JSON array (see the example below). If no file is specified, the input is read from stdin. |
Example JSON format for diffs file
The JSON file should define the production and pull request tables to compare, along with any primary keys and columns to include or exclude in the comparison.
[
  {
    "prod": "YOUR_PROJECT.PRODUCTION_TABLE_A",
    "pr": "YOUR_PROJECT.PR_TABLE_NUM",
    "pk": ["ID"],
    "include_columns": ["Column1", "Column2"],
    "exclude_columns": ["Column3"]
  },
  {
    "prod": "YOUR_PROJECT.PRODUCTION_TABLE_B",
    "pr": "YOUR_PROJECT.PR_TABLE_NUM",
    "pk": ["ID"],
    "include_columns": ["Column1"],
    "exclude_columns": []
  }
]
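If the list of tables is produced by a script, the file can be generated rather than written by hand. The following is a rough sketch using only the keys shown in the example above; the helper names `make_diff` and `render_diffs` are hypothetical and not part of the SDK.

```python
import json


def make_diff(prod, pr, pk, include_columns=None, exclude_columns=None):
    """Return one entry of the diffs array (hypothetical helper)."""
    return {
        "prod": prod,
        "pr": pr,
        "pk": list(pk),
        "include_columns": list(include_columns or []),
        "exclude_columns": list(exclude_columns or []),
    }


def render_diffs(diffs):
    """Serialize a list of diff entries as a JSON array string."""
    return json.dumps(list(diffs), indent=2)


if __name__ == "__main__":
    entries = [
        make_diff("YOUR_PROJECT.PRODUCTION_TABLE_A", "YOUR_PROJECT.PR_TABLE_NUM",
                  pk=["ID"], include_columns=["Column1", "Column2"],
                  exclude_columns=["Column3"]),
    ]
    # Write to stdout so the output can be redirected to a file or piped.
    print(render_diffs(entries))
```

Because the output goes to stdout, it can be redirected into a diffs file or piped to a command that reads the diffs from stdin.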
CLI
To submit diffs for a CI run, replace ci_config_id, pr_num, and diffs_file with the appropriate values for your CI configuration ID, pull request number, and the path to your diffs JSON file.
datafold ci submit \
    --ci-config-id <ci_config_id> \
    --pr-num <pr_num> \
    --diffs <diffs_file>
Python
To submit diffs for a CI run, replace ci_config_id, pr_num, and diffs_file with the appropriate values for your CI configuration ID, pull request number, and the path to your diffs JSON file.
import os

from datafold_sdk.sdk.ci import run_diff

api_key = os.environ.get("DATAFOLD_API_KEY")

# Only needed if your Datafold app URL is not app.datafold.com
host = os.environ.get("DATAFOLD_HOST")

# Replace the <...> placeholders with your own values before running.
run_diff(host=host,
         api_key=api_key,
         ci_config_id=<ci_config_id>,
         pr_num=<pr_num>,
         diffs='<diffs_file>')