1. Create a repository integration

Integrate your code repository using the appropriate integration.

2. Create an API integration

In the Datafold app, create an API integration.

3. Set up the API integration

Complete the configuration by specifying the following fields:

Basic settings

| Field Name | Description |
| --- | --- |
| Configuration name | Choose a name for your Datafold dbt integration. |
| Repository | Select the repository you configured in step 1. |
| Data Source | Select the data source your repository writes to. |

Advanced settings: Configuration

| Field Name | Description |
| --- | --- |
| Diff Hightouch Models | Run data diffs for Hightouch models affected by your PR. |
| CI fails on primary key issues | If null or duplicate primary keys exist, CI will fail. |
| Pull Request Label | When this is selected, the Datafold CI process will only run when the ‘datafold’ label has been applied. |
| CI Diff Threshold | Data diffs run automatically for a given CI run only if the number of diffs does not exceed this threshold. |
| Custom base branch | If defined, the Datafold CI process will only run on pull requests with the specified base branch. |
| Files to ignore | If at least one modified file does not match the ignore pattern, Datafold CI diffs all changed models in the PR. If every modified file matches the ignore pattern, Datafold CI does not run for the PR. |

Advanced settings: Sampling

| Field Name | Description |
| --- | --- |
| Enable sampling | Enable sampling for data diffs to optimize analyzing large datasets. |
| Sampling tolerance | The tolerance to apply in sampling for all data diffs. |
| Sampling confidence | The confidence to apply when sampling. |
| Sampling threshold | Sampling will be disabled automatically if tables are smaller than the specified threshold. If unspecified, default values will be used depending on the Data Source type. |

4. Obtain a Datafold API Key and CI config ID

Generate a new Datafold API Key and obtain the CI config ID from the CI API integration settings page.

You will need these values later on when setting up the CI Jobs.

5. Install Datafold SDK into your Python environment

pip install datafold-sdk

6. Configure your CI script(s) with the Datafold SDK

Configure your CI script(s) to call the Datafold SDK's ci submit command. Adapt the example below to your specific use case.

datafold ci submit --ci-config-id <datafold_ci_config_id> --pr-num <pr_num> --diffs ./diffs.json

Since Datafold cannot infer which tables have changed, you’ll need to provide this information manually in a specific JSON file format. Datafold then determines which models to diff in a CI run based on the diffs.json you pass to the Datafold SDK ci submit command.

[
  {
    "prod": "MY.PROD.TABLE", // Production table to compare PR changes against
    "pr": "MY.PR.TABLE", // Changed table containing data modifications in the PR
    "pk": ["MY", "PK", "LIST"], // Primary key; can be an empty array
    // These fields are not required and can be omitted from the JSON file:
    "include_columns": ["COLUMNS", "TO", "INCLUDE"],
    "exclude_columns": ["COLUMNS", "TO", "EXCLUDE"]
  }
]
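
The file format above can also be produced by a small script. The following is a minimal sketch rather than an official Datafold utility: the helper names `build_diff_entry` and `write_diffs_file` and the table names are hypothetical, and only the keys (`prod`, `pr`, `pk`, `include_columns`, `exclude_columns`) come from the format shown above. The `//` comments in the sample are explanatory only; the file written by `json.dumps` contains none.

```python
import json
from pathlib import Path


def build_diff_entry(prod, pr, pk, include_columns=None, exclude_columns=None):
    """Return one diffs.json entry; optional keys are omitted when not given."""
    entry = {"prod": prod, "pr": pr, "pk": list(pk)}
    if include_columns is not None:
        entry["include_columns"] = list(include_columns)
    if exclude_columns is not None:
        entry["exclude_columns"] = list(exclude_columns)
    return entry


def write_diffs_file(entries, path="diffs.json"):
    """Serialize the entries as a JSON array, the shape shown above."""
    target = Path(path)
    target.write_text(json.dumps(entries, indent=2) + "\n", encoding="utf-8")
    return target


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    entries = [build_diff_entry("MY.PROD.TABLE", "MY.PR.TABLE", ["MY", "PK", "LIST"])]
    write_diffs_file(entries)
```

How you decide which tables belong in `entries` depends on your pipeline; the sketch only covers serialization.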

Note: The JSON file is optional; you can achieve the same effect by passing the JSON on standard input (stdin), as shown here. For brevity, the rest of this guide uses the JSON file approach.

datafold ci submit \
    --ci-config-id <datafold_ci_config_id> \
    --pr-num <pr_num> <<- EOF
[{
        "prod": "MY.PROD.TABLE",
        "pr": "MY.PR.TABLE",
        "pk": ["MY", "PK", "LIST"]
}]
EOF
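
If your CI script is written in Python, the stdin variant can be driven with `subprocess`. This is a sketch under a few assumptions: the `datafold` executable is on `PATH`, the helper names `build_submit_command` and `submit_diffs` are hypothetical, and only the flags already shown in this guide are used.

```python
import json
import subprocess


def build_submit_command(ci_config_id, pr_num):
    """Build the argv for `datafold ci submit` without --diffs (stdin mode)."""
    return [
        "datafold", "ci", "submit",
        "--ci-config-id", str(ci_config_id),
        "--pr-num", str(pr_num),
    ]


def submit_diffs(ci_config_id, pr_num, diffs):
    """Pipe the list of diff entries to the command's standard input as JSON."""
    return subprocess.run(
        build_submit_command(ci_config_id, pr_num),
        input=json.dumps(diffs),
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed submission fail the CI step.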

Implementation details will vary depending on which CI tool you use. Please review the following instructions and examples for your organization’s CI tool.

NOTE

Populating the diffs.json file is specific to your use case and therefore out of scope for this guide. The only requirement is to adhere to the JSON schema structure explained above.
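
Before submitting, it can help to check a generated file against that structure. The sketch below encodes this guide's description of the format (required `prod`, `pr`, `pk`; optional `include_columns`, `exclude_columns`). It is an illustration, not Datafold's own validation, and the function name `validate_diffs` is hypothetical.

```python
REQUIRED_KEYS = {"prod", "pr", "pk"}
LIST_KEYS = ("pk", "include_columns", "exclude_columns")


def validate_diffs(diffs):
    """Return a list of problems found; an empty list means the shape looks right."""
    if not isinstance(diffs, list):
        return ["top-level value must be a JSON array"]
    problems = []
    for i, entry in enumerate(diffs):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: must be an object")
            continue
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
        # Table names are expected to be strings such as "MY.PROD.TABLE".
        for key in ("prod", "pr"):
            if key in entry and not isinstance(entry[key], str):
                problems.append(f"entry {i}: {key} must be a string")
        # pk and the optional column lists are arrays of strings; pk may be empty.
        for key in LIST_KEYS:
            value = entry.get(key, [])
            if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
                problems.append(f"entry {i}: {key} must be an array of strings")
    return problems
```

A CI script could load the file with `json.load`, call `validate_diffs`, and exit with a non-zero status if the returned list is not empty.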

CI Implementation Tools

We’ve created guides and templates for three popular CI tools.

HAVING TROUBLE SETTING UP DATAFOLD IN CI?

We’re here to help! Please reach out and chat with a Datafold Solutions Engineer.

To add Datafold to your CI tool, add a datafold ci submit step to your PR CI job.


name: Datafold PR Job

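# Run this job on pull requests and when a commit is pushed to any branch except main.
# (A branches filter that contains only negated patterns such as '!main' matches nothing,
# so branches-ignore is used to exclude main.)
on:
  pull_request:
  push:
    branches-ignore:
      - main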

jobs:
  run:
    runs-on: ubuntu-20.04 # your image will vary

    steps:

      - name: Install Datafold SDK
        run: pip install -q datafold-sdk
    # ...
      - name: Upload what to diff to Datafold
        run: datafold ci submit --ci-config-id <datafold_ci_config_id> --pr-num ${PR_NUM} --diffs <path_to_diffs_json_file>
        env:
          # Environment variables used internally by the Datafold SDK
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          # For Dedicated Cloud/private deployments of Datafold, set DATAFOLD_HOST
          # to your base URL (for example "https://custom.url.datafold.com"),
          # either as a string or as a project variable
          DATAFOLD_HOST: ${DATAFOLD_HOST}
          # There are multiple ways to get the PR number; this is just a simple example
          PR_NUM: ${{ github.event.number }}

Be sure to replace <datafold_ci_config_id> with the CI config ID value.
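
As the workflow comment notes, there are several ways to obtain the PR number. One alternative, sketched here with hypothetical helper names, is to read it in Python from the event payload file that GitHub Actions exposes through the `GITHUB_EVENT_PATH` environment variable. For `pull_request` events the payload carries the number; for `push` events it does not, so the function returns `None`.

```python
import json
import os


def pr_number_from_event(event):
    """Return the PR number from a parsed event payload, or None if absent."""
    number = event.get("number")
    if number is None:
        # Fall back to the nested pull_request object, if the payload has one.
        number = (event.get("pull_request") or {}).get("number")
    return number


def read_pr_number(event_path=None):
    """Load the payload file that GitHub Actions points to via GITHUB_EVENT_PATH."""
    event_path = event_path or os.environ["GITHUB_EVENT_PATH"]
    with open(event_path, encoding="utf-8") as fh:
        return pr_number_from_event(json.load(fh))
```

A script could skip the `datafold ci submit` call when `read_pr_number()` returns `None`, since there is no pull request to report on.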

NOTE

It is beyond the scope of this guide to provide guidance on generating the <path_to_diffs_json_file>, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above.

Finally, store your Datafold API Key as a secret named DATAFOLD_API_KEY in your GitHub repository settings.

Once you’ve completed these steps, Datafold will run data diffs between production and development data on the next GitHub Actions CI run.

Optional CI Configurations and Strategies

Skip Datafold in CI

  • To skip the Datafold step in CI, include the string datafold-skip-ci in the last commit message.
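
If your own CI script also needs to know whether the marker is present (for example, to avoid generating diffs.json at all), a check along these lines is one option. It is a sketch with hypothetical helper names that assumes `git` is available in the CI environment; it does not describe how Datafold itself detects the marker.

```python
import subprocess

SKIP_MARKER = "datafold-skip-ci"


def should_skip_datafold(commit_message):
    """True when the commit message contains the skip marker."""
    return SKIP_MARKER in commit_message


def last_commit_message():
    """Return the full message of the most recent commit on the checked-out ref."""
    result = subprocess.run(
        ["git", "log", "-1", "--pretty=%B"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```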