dbt Core
Set up Datafold’s integration with dbt Core to automate Data Diffs in your CI pipeline.
Getting Started
To add Datafold to your continuous integration (CI) pipeline using dbt Core, follow these steps:
1. Create a dbt Core integration.
2. Set up the dbt Core integration.
Complete the configuration by specifying the following fields:
Basic settings
| Field Name | Description |
|---|---|
| Configuration name | Choose a name for your Datafold dbt integration. |
| Repository | Select your dbt project. |
| Data Connection | Select the data connection your dbt project writes to. |
| Primary key tag | Choose the tag string used to mark primary key columns in your dbt project (see the example below). |
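For example, assuming you entered primary-key in the Primary key tag field (the value is your choice; this sketch uses hypothetical model and column names), you would tag the primary key column in your model's schema.yml so Datafold can identify it when diffing:

```yaml
# models/schema.yml -- illustrative sketch; the tag value must match
# the string you entered in the "Primary key tag" field.
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tags:
          - primary-key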
Advanced settings: Configuration
| Field Name | Description |
|---|---|
| Import dbt tags and descriptions | Import dbt metadata (including column and table descriptions, tags, and owners) to Datafold. |
| Slim Diff | Data diffs will be run only for models changed in a pull request. See our guide to Slim Diff for configuration options. |
| Diff Hightouch Models | Run Data Diffs for Hightouch models affected by your PR. |
| CI fails on primary key issues | Null or duplicate primary keys will cause the CI check to fail. |
| Pull Request Label | When selected, the Datafold CI process runs only after the ‘datafold’ label has been applied to the pull request. |
| Branch commit selection strategy | Select “Latest” if your CI tool creates a merge commit (the default behavior for GitHub Actions). Choose “Merge base” if CI is run against the PR branch head (the default behavior for GitLab). |
| Files to ignore | If at least one modified file doesn’t match the ignore pattern, Datafold CI diffs all changed models in the PR. If all modified files match the ignore pattern, Datafold CI does not run in the PR. (Additional details.) |
Advanced settings: Sampling
| Field Name | Description |
|---|---|
| Enable sampling | Enable sampling for data diffs to speed up analysis of large datasets. |
| Sampling tolerance | The tolerance to apply in sampling for all data diffs. |
| Sampling confidence | The confidence to apply when sampling. |
| Sampling threshold | Sampling is disabled automatically for tables smaller than the specified threshold. If unspecified, a default value is used depending on the Data Connection type. |
3. Obtain a Datafold API Key and CI config ID.
In the dbt Core integration creation form, generate a new Datafold API Key and obtain the CI config ID.
4. Configure your CI script(s) with the Datafold SDK.
Using the Datafold SDK, configure your CI script(s) to upload dbt manifest.json files. The dbt Core integration creation form automatically generates scripts for integrating Datafold with your CI/CD pipeline.
Datafold determines which dbt models to diff in a CI run by comparing two manifest.json files generated by your production branch and your PR branch.
The following command will be incorporated into your CI script(s).
```bash
datafold dbt upload --ci-config-id <ci-config-id> --run-type <job-type> --commit-sha <commit-sha>
```
Implementation details vary depending on which CI tool you use. Please review the following instructions and examples for your organization’s CI tool.
5. Test your dbt Core integration.
Wait for the production manifest to be successfully uploaded.
Then, test your CI integration by opening a new pull request with changes to a SQL file to trigger the workflow.
CI Implementation Tools
We’ve created guides and templates for three popular CI tools.
HAVING TROUBLE SETTING UP DATAFOLD IN CI?
We’re here to help! Please reach out and chat with a Datafold Solutions Engineer.

To add Datafold to your CI tool, add `datafold dbt upload` steps in two CI jobs:

- Upload Production Artifacts: A CI job that builds a production `manifest.json`. This can be either your Production Job or a special Artifacts Job (explained below).
- Upload Pull Request Artifacts: A CI job that builds a PR `manifest.json`.

This ensures Datafold always has the necessary `manifest.json` files, enabling us to run data diffs comparing production data to dev data.
Upload Production Artifacts
Add the `datafold dbt upload` step to *either* your Production Job or an Artifacts Job.
Production Job
If your dbt prod job kicks off on merges to main/master, you can simply add a `datafold dbt upload` step after the `dbt build` step:
```yaml
name: Production Job

on:
  push: # Run the job on push to the main branch
    branches:
      - main

jobs:
  run:
    runs-on: ubuntu-20.04 # your image will vary
    steps:
      - name: Install Datafold SDK
        run: pip install -q datafold-sdk

      # ... your existing dbt build steps ...

      - name: Upload dbt artifacts to Datafold
        run: datafold dbt upload --ci-config-id <datafold_ci_config_id> --run-type production --commit-sha ${GIT_SHA}
        env:
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          GIT_SHA: "${{ github.sha }}"
          # Set the DATAFOLD_HOST var for Dedicated Cloud deployments of Datafold.
          # DATAFOLD_HOST: "https://custom.url.datafold.com" # set the base URL as an environment variable
```
Artifacts Job
Alternatively, if your dbt prod job does not run on merges to main/master and only runs on a schedule, we recommend creating an additional dedicated job that runs on merges to main/master.
The Artifacts Job’s entire purpose is generating and uploading a `manifest.json` file to Datafold to represent the state of your dbt project’s production branch.
This is the basic structure of an Artifacts Job:
```yaml
image:
  name: ghcr.io/dbt-labs/dbt-core:1.x # your name will vary
  entrypoint: [ "" ]

run_pipeline:
  stage: deploy
  before_script:
    - pip install -q datafold-sdk
  script:
    # Generate manifest.json
    - dbt ls --profiles-dir ./
    # Upload the `manifest.json` to Datafold
    - datafold dbt upload --ci-config-id <ci-config-id> --run-type production --commit-sha $CI_COMMIT_SHA
    # The <ci-config-id> value can be obtained from the Datafold application: Settings > Integrations > dbt Core/Cloud > the ID column
```
Upload Pull Request Artifacts
The `datafold dbt upload` step also needs to be added to the CI job that builds PR data.
DBT IN CI
If you don’t have a CI job that builds PR data, we can help you set this up. Please check out this step-by-step blog post, or book time to chat with a Datafold Solutions Engineer.

Pull Request Job
If you already have a Pull Request Job, adding the `datafold dbt upload` step is easy! Simply add it after the `dbt build` step:
```yaml
image:
  name: ghcr.io/dbt-labs/dbt-core:1.x # your name will vary
  entrypoint: [ "" ]

run_pipeline:
  stage: test
  before_script:
    - pip install -q datafold-sdk
  script:
    # Generate manifest.json
    - dbt build --profiles-dir ./
    # Upload the `manifest.json` to Datafold
    - datafold dbt upload --ci-config-id <ci-config-id> --run-type pull_request --commit-sha $CI_COMMIT_SHA
    # The <ci-config-id> value can be obtained from the Datafold application: Settings > Integrations > dbt Core/Cloud > the ID column
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
```
Be sure to replace `<ci-config-id>` with the value you obtained in step 3.
Store your Datafold API Key
Finally, store your Datafold API Key as a secret named `DATAFOLD_API_KEY` in your GitLab repository settings.
Once you’ve completed these steps, Datafold will run data diffs between production and development data on the next GitLab CI run.
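If your CI runs on GitHub Actions rather than GitLab, the Pull Request upload step is analogous. The following is a minimal sketch, assuming you already have a workflow that runs dbt build on pull requests (workflow, job, and step names are illustrative); note the pull_request run type and the use of the PR head SHA rather than github.sha:

```yaml
name: Pull Request Job

on:
  pull_request: # Run the job on pull requests

jobs:
  run:
    runs-on: ubuntu-20.04 # your image will vary
    steps:
      - name: Install Datafold SDK
        run: pip install -q datafold-sdk

      # ... your existing dbt build steps ...

      - name: Upload dbt artifacts to Datafold
        run: datafold dbt upload --ci-config-id <ci-config-id> --run-type pull_request --commit-sha ${GIT_SHA}
        env:
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          # Use the PR head SHA, not github.sha (see the note under "Programmatically trigger CI runs").
          GIT_SHA: "${{ github.event.pull_request.head.sha }}"
```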
CI for dbt multi-projects
When setting up CI for dbt multi-projects, each project should have its own dedicated CI integration to ensure that changes are validated independently.
CI for dbt multi-projects within a monorepo
When managing multiple dbt projects within a monorepo (a single repository), it’s essential to configure individual Datafold CI integrations for each project to ensure proper isolation. This approach prevents unintended triggering of CI processes for projects unrelated to the changes made. Here’s the recommended approach for setting it up in Datafold:
- Create separate CI integrations: Set up a separate CI integration in Datafold for each dbt project in the monorepo. Each integration should be configured to reference the same GitHub repository.
- Configure file filters: For each CI integration, define file filters to specify which files should trigger the CI run. These filters prevent CI runs from being initiated when files from other projects in the monorepo are updated (a complementary CI-side path-scoping sketch follows this list).
- Test and validate: Before deployment, test each CI integration to confirm that it triggers only when changes occur within its designated dbt project, and verify that modifications to files in one project do not inadvertently initiate CI processes for unrelated projects in the monorepo.
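In addition to Datafold's file filters, you can scope each project's CI workflow to its own directory so that a pull request only builds and uploads artifacts for the project it touches. A minimal GitHub Actions sketch, where the projects/project_a/ path is a hypothetical placeholder for your monorepo layout:

```yaml
# Illustrative sketch: a per-project workflow scoped to one dbt project in a monorepo.
name: project_a CI

on:
  pull_request:
    paths:
      - "projects/project_a/**" # run only when this project's files change

jobs:
  run:
    runs-on: ubuntu-20.04
    steps:
      - name: Install Datafold SDK
        run: pip install -q datafold-sdk
      # ... dbt build steps for project_a ...
      # followed by `datafold dbt upload` using project_a's own --ci-config-id
```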
Optional CI Configurations and Strategies
Skip Datafold in CI
To skip the Datafold step in CI, include the string `datafold-skip-ci` in the last commit message.
Programmatically trigger CI runs
The Datafold app relies on version control webhooks to trigger CI runs. When a Dedicated Cloud deployment sits behind a VPN, webhooks cannot reach the deployment because network access is restricted.
You can work around this by triggering CI runs via the datafold-sdk from your Actions/job runners, assuming they run on the same network.
Add a new Datafold SDK command after uploading the manifest in a PR job:
Important
When configuring your CI script, be sure to use `${{ github.event.pull_request.head.sha }}` for the Pull Request Job instead of `${{ github.sha }}`, which is often mistakenly used. `${{ github.sha }}` defaults to the latest commit SHA on the branch and will not work correctly for pull requests.
```yaml
- name: Trigger CI
  run: |
    set -ex
    datafold ci trigger --ci-config-id <datafold_ci_config_id> \
      --pr-num ${PR_NUM} \
      --base-branch ${BASE_BRANCH} \
      --base-sha ${BASE_SHA} \
      --pr-branch ${PR_BRANCH} \
      --pr-sha ${PR_SHA}
  env:
    DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
    DATAFOLD_HOST: ${{ secrets.DATAFOLD_HOST }}
    PR_NUM: ${{ github.event.number }}
    PR_BRANCH: ${{ github.event.pull_request.head.ref }}
    BASE_BRANCH: ${{ github.event.pull_request.base.ref }}
    PR_SHA: ${{ github.event.pull_request.head.sha }}
    BASE_SHA: ${{ github.event.pull_request.base.sha }}
```