API
Learn how to set up and configure Datafold’s API for CI/CD testing.
1. Create a repository integration
Integrate your code repository using the appropriate integration.
2. Create an API integration
In the Datafold app, create an API integration.
3. Set up the API integration
Complete the configuration by specifying the following fields:
Basic settings
Field Name | Description |
---|---|
Configuration name | Choose a name for your Datafold API integration. |
Repository | Select the repository you configured in step 1. |
Data Source | Select the data source your repository writes to. |
Advanced settings: Configuration
Field Name | Description |
---|---|
Diff Hightouch Models | Run data diffs for Hightouch models affected by your PR. |
CI fails on primary key issues | If null or duplicate primary keys exist, CI will fail. |
Pull Request Label | When this is selected, the Datafold CI process will only run when the ‘datafold’ label has been applied. |
CI Diff Threshold | Data diffs run automatically for a given CI run only if the number of diffs doesn’t exceed this threshold. |
Custom base branch | If defined, the Datafold CI process will only run on pull requests with the specified base branch. |
Files to ignore | Datafold CI diffs all changed models in the PR if at least one modified file doesn’t match the ignore pattern. If every modified file matches the ignore pattern, Datafold CI doesn’t run for that PR. (Additional details.) |
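The "Files to ignore" rule can be read as a simple predicate over the list of modified files. The sketch below only illustrates that rule. It assumes shell-style glob patterns via Python's `fnmatch`, which is a stand-in and not necessarily the pattern syntax Datafold uses, and `should_run_ci` is a hypothetical helper name.

```python
from fnmatch import fnmatch


def should_run_ci(modified_files, ignore_patterns):
    """Return True if at least one modified file matches none of the ignore patterns."""
    return any(
        # A file counts as "not ignored" when no pattern matches it.
        not any(fnmatch(path, pattern) for pattern in ignore_patterns)
        for path in modified_files
    )
```

With this reading, a PR that touches only ignored files yields `False`, while a PR with a single non-ignored file yields `True` and all changed models are diffed.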
Advanced settings: Sampling
Field Name | Description |
---|---|
Enable sampling | Enable sampling for data diffs to optimize analyzing large datasets. |
Sampling tolerance | The tolerance to apply in sampling for all data diffs. |
Sampling confidence | The confidence to apply when sampling. |
Sampling threshold | Sampling will be disabled automatically if tables are smaller than the specified threshold. If unspecified, default values will be used depending on the Data Source type. |
4. Obtain a Datafold API Key and CI config ID
Generate a new Datafold API Key and obtain the CI config ID from the CI API integration settings page.
You will need these values later on when setting up the CI Jobs.
5. Install Datafold SDK into your Python environment
pip install datafold-sdk
6. Configure your CI script(s) with the Datafold SDK
Configure your CI script(s) to run the Datafold SDK ci submit command. Adapt the example below to your specific use case.
datafold ci submit --ci-config-id <datafold_ci_config_id> --pr-num <pr_num> --diffs ./diffs.json
Since Datafold cannot infer which tables have changed, you’ll need to provide this information manually in a specific JSON file format. Datafold then determines which models to diff in a CI run based on the diffs.json file you pass to the Datafold SDK ci submit command.
[
  {
    "prod": "MY.PROD.TABLE", // Production table to compare PR changes against
    "pr": "MY.PR.TABLE", // Changed table containing data modifications in the PR
    "pk": ["MY", "PK", "LIST"], // Primary key; can be an empty array
    // These fields are not required and can be omitted from the JSON file:
    "include_columns": ["COLUMNS", "TO", "INCLUDE"],
    "exclude_columns": ["COLUMNS", "TO", "EXCLUDE"]
  }
]
Note: The JSON file is optional. For brevity, the rest of this guide uses the file approach, but you can achieve the same effect by passing the JSON through standard input (stdin), as shown here:
datafold ci submit \
    --ci-config-id <datafold_ci_config_id> \
    --pr-num <pr_num> <<- EOF
[{
    "prod": "MY.PROD.TABLE",
    "pr": "MY.PR.TABLE",
    "pk": ["MY", "PK", "LIST"]
}]
EOF
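If your CI logic already lives in Python, the stdin variant above can be driven with `subprocess`. This is a minimal sketch, not an official SDK interface: the helper names `build_submit_command` and `submit_diffs` are hypothetical, only the `--ci-config-id` and `--pr-num` flags from this guide are used, and the Datafold API key is assumed to be available to the CLI in the environment.

```python
import json
import subprocess


def build_submit_command(ci_config_id, pr_num):
    """Assemble the `datafold ci submit` call without --diffs, so the payload is read from stdin."""
    return [
        "datafold", "ci", "submit",
        "--ci-config-id", str(ci_config_id),
        "--pr-num", str(pr_num),
    ]


def submit_diffs(ci_config_id, pr_num, diffs):
    """Pipe the list of diff entries to the CLI as JSON; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(
        build_submit_command(ci_config_id, pr_num),
        input=json.dumps(diffs),  # serialized JSON array sent on stdin
        text=True,
        check=True,
    )
```

Keeping command construction separate from execution makes the argument list easy to log or unit test without invoking the CLI.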
Implementation details will vary depending on which CI tool you use. Please review the following instructions and examples for your organization’s CI tool.
NOTE
Populating the diffs.json file is specific to your use case and therefore out of scope for this guide. The only requirement is to adhere to the JSON schema structure explained above.
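How you decide which tables changed is up to you, but the final serialization step is mechanical. The sketch below builds entries with the field names from the schema above and writes them as a JSON array. The helper names and the example table names are hypothetical, and the output is standard JSON, so it contains none of the explanatory comments shown in the schema example.

```python
import json
from pathlib import Path


def make_diff_entry(prod, pr, pk, include_columns=None, exclude_columns=None):
    """Build one diffs.json entry using the field names from the schema above."""
    if not prod or not pr:
        raise ValueError("both 'prod' and 'pr' table names are required")
    entry = {"prod": prod, "pr": pr, "pk": list(pk)}
    # Optional fields are only written when provided.
    if include_columns:
        entry["include_columns"] = list(include_columns)
    if exclude_columns:
        entry["exclude_columns"] = list(exclude_columns)
    return entry


def write_diffs_file(entries, path="diffs.json"):
    """Serialize the entries as a JSON array and return the path written."""
    Path(path).write_text(json.dumps(entries, indent=2))
    return path


if __name__ == "__main__":
    # Hypothetical table names; replace with the tables your PR changed.
    example = [make_diff_entry("ANALYTICS.PROD.ORDERS", "ANALYTICS.PR_123.ORDERS", ["ORDER_ID"])]
    print(json.dumps(example, indent=2))
```

The resulting file path can then be passed to the `--diffs` option of the submit command.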
CI Implementation Tools
We’ve created guides and templates for three popular CI tools.
HAVING TROUBLE SETTING UP DATAFOLD IN CI?
We’re here to help! Please reach out and chat with a Datafold Solutions Engineer.
To add Datafold to your CI tool, add a datafold ci submit step to your PR CI job.
name: Datafold PR Job

# Run this job when a commit is pushed to any branch except main
on:
  pull_request:
  push:
    # A branches filter containing only negated patterns is not accepted by GitHub Actions,
    # so branches-ignore is used to exclude main
    branches-ignore:
      - main

jobs:
  run:
    runs-on: ubuntu-20.04 # your image will vary

    steps:
      - name: Install Datafold SDK
        run: pip install -q datafold-sdk
      # ...
      - name: Upload what to diff to Datafold
        run: datafold ci submit --ci-config-id <datafold_ci_config_id> --pr-num ${PR_NUM} --diffs <path_to_diffs_json_file>
        env:
          # env variables used by Datafold SDK internally
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          DATAFOLD_HOST: ${DATAFOLD_HOST}
          # For Dedicated Cloud/private deployments of Datafold, set DATAFOLD_HOST to your base URL
          # (e.g. "https://custom.url.datafold.com"), either as a string or a project variable
          # There are multiple ways to get the PR_NUM, this is just a simple example
          PR_NUM: ${{ github.event.number }}
Be sure to replace <datafold_ci_config_id> with the CI config ID value.
NOTE
Generating the file at <path_to_diffs_json_file> is beyond the scope of this guide, as it depends heavily on your specific use case. However, ensure that the generated file adheres to the required schema outlined above.
Finally, store your Datafold API Key as a secret named DATAFOLD_API_KEY in your GitHub repository settings.
Once you’ve completed these steps, Datafold will run data diffs between production and development data on the next GitHub Actions CI run.
Optional CI Configurations and Strategies
Skip Datafold in CI
To skip the Datafold step in CI, include the string datafold-skip-ci in the last commit message.
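If your own CI script does expensive work before calling Datafold, such as generating diffs.json, it may be useful to check for the same marker up front. The sketch below is an assumption about how such a guard could look in your script. It is not a description of how Datafold detects the string, and both helper names are hypothetical.

```python
import subprocess

SKIP_MARKER = "datafold-skip-ci"


def should_skip_datafold(commit_message):
    """Return True when the commit message contains the skip marker."""
    return SKIP_MARKER in commit_message


def last_commit_message():
    """Read the most recent commit message from the local git checkout."""
    result = subprocess.run(
        ["git", "log", "-1", "--pretty=%B"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A script would call `should_skip_datafold(last_commit_message())` and exit early when it returns `True`.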