Set up Datafold’s integration with dbt Core to automate Data Diffs in your CI pipeline.
Field Name | Description |
---|---|
Configuration name | Choose a name for your Datafold dbt integration. |
Repository | Select your dbt project. |
Data Connection | Select the data connection your dbt project writes to. |
Primary key tag | Choose a string for tagging primary keys. |

Field Name | Description |
---|---|
Import dbt tags and descriptions | Import dbt metadata (including column and table descriptions, tags, and owners) to Datafold. |
Slim Diff | Data diffs will be run only for models changed in a pull request. See our guide to Slim Diff for configuration options. |
Diff Hightouch Models | Run Data Diffs for Hightouch models affected by your PR. |
CI fails on primary key issues | The existence of null or duplicate primary keys will cause CI to fail. |
Pull Request Label | When this is selected, the Datafold CI process will only run when the `datafold` label has been applied. |
CI Diff Threshold | Data Diffs will only be run automatically for a given CI run if the number of diffs doesn’t exceed this threshold. |
Branch commit selection strategy | Select “Latest” if your CI tool creates a merge commit (the default behavior for GitHub Actions). Choose “Merge base” if CI is run against the PR branch head (the default behavior for GitLab). |
Custom base branch | If defined, CI will run only on pull requests with the specified base branch. |
Columns to ignore | Use standard gitignore syntax to identify columns that Datafold should never diff for any table. This can improve performance for large datasets. Primary key columns will not be excluded even if they match the pattern. |
Files to ignore | If at least one modified file doesn’t match the ignore pattern, Datafold CI diffs all changed models in the PR. If all modified files should be ignored, Datafold CI does not run in the PR. (Additional details.) |
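
The "Files to ignore" rule above can be read as a small decision function. The following is a minimal sketch, not Datafold's implementation: the function name `should_run_datafold_ci` and the example paths are hypothetical, and matching is approximated with Python's `fnmatch`, which handles simple glob patterns and may differ from the pattern syntax Datafold accepts.

```python
from fnmatch import fnmatch


def should_run_datafold_ci(modified_files, ignore_patterns):
    """Return True if at least one modified file is NOT matched by any ignore pattern."""

    def is_ignored(path):
        # Assumption: simple glob matching stands in for the real pattern syntax.
        return any(fnmatch(path, pattern) for pattern in ignore_patterns)

    # If every modified file is ignored, CI does not run for the PR.
    return any(not is_ignored(path) for path in modified_files)


# Hypothetical example: documentation-only changes are ignored.
patterns = ["docs/*", "*.md"]
print(should_run_datafold_ci(["docs/index.md", "README.md"], patterns))      # False
print(should_run_datafold_ci(["README.md", "models/orders.sql"], patterns))  # True
```

Note that a single non-ignored file is enough to diff all changed models in the PR, not only the models in that file.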

Field Name | Description |
---|---|
Enable sampling | Enable sampling for data diffs to optimize analyzing large datasets. |
Sampling tolerance | The tolerance to apply in sampling for all data diffs. |
Sampling confidence | The confidence to apply when sampling. |
Sampling threshold | Sampling will be disabled automatically if tables are smaller than the specified threshold. If unspecified, default values will be used depending on the Data Connection type. |

Configure your CI scripts to upload `manifest.json` files.
The `datafold dbt upload` command takes this general form and arguments:

Upload `manifest.json` files in 2 scenarios:
- On the base/production branch. These `manifest.json` files represent the state of the dbt project on the base/production branch from which PRs are created.
- On the PR branch. These `manifest.json` files represent the state of the dbt project on the PR branch.

Using these `manifest.json` files, Datafold determines which dbt models to diff in a CI run.
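
To make the idea of comparing two manifests concrete, here is a simplified sketch. It is not Datafold's actual selection logic, which this page does not describe; it only assumes the standard dbt manifest layout, where each entry under `nodes` carries a `resource_type` and a `checksum` object. The function names and the `model.shop.*` identifiers are hypothetical.

```python
def load_model_checksums(manifest):
    """Map each model's unique_id to its file checksum from a parsed manifest."""
    return {
        unique_id: node.get("checksum", {}).get("checksum")
        for unique_id, node in manifest.get("nodes", {}).items()
        if node.get("resource_type") == "model"
    }


def changed_models(base_manifest, pr_manifest):
    """Return sorted unique_ids of models that are new or modified on the PR branch."""
    base = load_model_checksums(base_manifest)
    pr = load_model_checksums(pr_manifest)
    return sorted(
        uid for uid, checksum in pr.items()
        if uid not in base or base[uid] != checksum
    )


# Trimmed-down, hypothetical manifests (real files contain many more fields).
base_manifest = {"nodes": {
    "model.shop.orders": {"resource_type": "model", "checksum": {"checksum": "abc"}},
    "model.shop.customers": {"resource_type": "model", "checksum": {"checksum": "111"}},
    "seed.shop.countries": {"resource_type": "seed", "checksum": {"checksum": "s1"}},
}}
pr_manifest = {"nodes": {
    "model.shop.orders": {"resource_type": "model", "checksum": {"checksum": "def"}},
    "model.shop.customers": {"resource_type": "model", "checksum": {"checksum": "111"}},
    "model.shop.payments": {"resource_type": "model", "checksum": {"checksum": "999"}},
    "seed.shop.countries": {"resource_type": "seed", "checksum": {"checksum": "s2"}},
}}

print(changed_models(base_manifest, pr_manifest))
# ['model.shop.orders', 'model.shop.payments']
```

With real artifacts, each manifest could be loaded with `json.load` from the `manifest.json` file that dbt writes to its target folder.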
Implementation details vary depending on which CI tool you use. Please review these instructions and examples to help you configure updates to your organization’s CI scripts.

First, trigger the CI job that uploads the `manifest.json` files that represent the base/production state.
Then, open a new pull request with changes to a SQL file to trigger a CI run.

Add `datafold dbt upload` steps in two CI jobs:
- A job that builds a production `manifest.json`. This can be either your Production Job or a special Artifacts Job that runs on merge to main (explained below).
- A job that builds a PR `manifest.json`.

This ensures Datafold has the necessary `manifest.json` files, enabling us to run data diffs comparing production data to dev data.
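
Add the `datafold dbt upload` step to either your Production Job or an Artifacts Job.
- Production Job: If your dbt prod job kicks off on merges to the base branch, add a `datafold dbt upload` step after the `dbt build` step.
- Artifacts Job: Otherwise, use a dedicated job that runs on merge to main and uploads the `manifest.json` file to Datafold.

Include the `datafold dbt upload` step in your CI job that builds PR data.

Save your Datafold API key as `DATAFOLD_API_KEY` in your GitHub repository settings.

To skip Datafold for a CI run, include `datafold-skip-ci` in the last commit message.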

Use `${{ github.event.pull_request.head.sha }}` for the Pull Request Job instead of `${{ github.sha }}`, which is often mistakenly used. `${{ github.sha }}` defaults to the latest commit SHA on the branch and will not work correctly for pull requests.
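
If a CI step is scripted rather than written with workflow expressions, the same choice of SHA can be made from the runner's environment. This is a rough sketch under stated assumptions: `resolve_commit_sha` and `load_event` are hypothetical helper names, while `GITHUB_SHA` and `GITHUB_EVENT_PATH` are the environment variables GitHub Actions sets for the `github.sha` value and the event payload file.

```python
import json


def resolve_commit_sha(env, event):
    """Pick the commit SHA to report for the current GitHub Actions run."""
    pull_request = event.get("pull_request")
    if pull_request:
        # Same value as ${{ github.event.pull_request.head.sha }}.
        return pull_request["head"]["sha"]
    # Non-PR runs (for example, merges to the base branch) fall back to
    # GITHUB_SHA, the environment-variable form of ${{ github.sha }}.
    return env["GITHUB_SHA"]


def load_event(env):
    """Read the event payload that GitHub Actions writes to GITHUB_EVENT_PATH."""
    path = env.get("GITHUB_EVENT_PATH")
    if not path:
        return {}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical payloads, shaped like the relevant part of a real event.
pr_event = {"pull_request": {"head": {"sha": "head123"}}}
print(resolve_commit_sha({"GITHUB_SHA": "merge456"}, pr_event))  # head123
print(resolve_commit_sha({"GITHUB_SHA": "main789"}, {}))         # main789
```

On a runner, the call would be `resolve_commit_sha(os.environ, load_event(os.environ))` after importing `os`.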