dbt Core

Getting Started

To add Datafold to your CI using dbt Core:

1. Create a dbt Core integration.

Complete the configuration by specifying the following fields:

Field Name | Description
Repository | Select your dbt project.
Data Source | Select the data source your dbt project writes to.
Name | Choose any name for your Datafold dbt integration.
Primary key tag | Choose a string for tagging primary keys.
Import dbt tags and descriptions | Import dbt metadata (including column and table descriptions, tags, and owners) to Datafold.
Slim Diff | Data diffs will be run only for models changed in a pull request. See our guide to Slim Diff for configuration options.
Diff Hightouch Models | Run Data Diffs for Hightouch models affected by your PR.
CI fails on primary key issues | The existence of null or duplicate primary keys will cause CI to fail.
Pull Request Label | When this is selected, the Datafold CI process will only run when the 'datafold' label has been applied.
Branch commit selection strategy | Select "Latest" if your CI tool creates a merge commit (the default behavior for GitHub Actions). Choose "Merge base" if CI is run against the PR branch head (the default behavior for GitLab).
Files to ignore | If at least one modified file doesn’t match the ignore pattern, Datafold CI diffs all changed models in the PR. If all modified files should be ignored, Datafold CI does not run in the PR. (Additional details.)
Sampling tolerance | The tolerance to apply in sampling for all data diffs.
Sampling confidence | The confidence to apply when sampling.
Sampling Threshold | Sampling will be disabled automatically if tables are smaller than specified threshold. If unspecified, default values will be used depending on the Data Source type.
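The "Files to ignore" rule in the table above can be illustrated with a minimal Python sketch. It assumes shell-style glob patterns as implemented by Python's `fnmatch`; the pattern syntax Datafold actually accepts may differ, and the function name `should_run_datafold_ci` is hypothetical.

```python
from fnmatch import fnmatch


def should_run_datafold_ci(modified_files, ignore_patterns):
    """Return True if at least one modified file is NOT matched by any ignore pattern.

    Assumption: patterns are shell-style globs (fnmatch), which is only an
    approximation of whatever syntax Datafold supports.
    """
    def is_ignored(path):
        return any(fnmatch(path, pattern) for pattern in ignore_patterns)

    # One non-ignored file is enough for CI to run (and diff all changed models).
    return any(not is_ignored(path) for path in modified_files)


patterns = ["docs/*", "*.md"]

# Every modified file matches an ignore pattern, so CI would not run.
print(should_run_datafold_ci(["docs/setup.md", "README.md"], patterns))

# models/orders.sql is not ignored, so CI would run.
print(should_run_datafold_ci(["README.md", "models/orders.sql"], patterns))
```

The point the sketch makes is that the check is all-or-nothing: a single file outside the ignore patterns triggers diffs for every changed model in the PR, not only for that file.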

2. Obtain the CI config ID of your dbt Core integration.

3. Obtain a Datafold API Key.

4. Using the Datafold SDK, configure your CI script(s) to upload dbt manifest.json files.

Datafold determines which dbt models to diff in a CI run by comparing two manifest.json files: one generated from your production branch and one from your PR branch.
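To make the idea of comparing two manifests concrete, here is a rough Python sketch. It is not Datafold's implementation, which this guide does not describe. It assumes the standard dbt manifest layout (a top-level `nodes` mapping whose entries carry `resource_type` and a `checksum` object) and treats a model as changed when it is new or its file checksum differs. The function name `changed_models` and the sample data are hypothetical.

```python
def changed_models(prod_manifest, pr_manifest):
    """List model unique_ids that are new or whose file checksum differs in the PR.

    Simplification: only the file checksum is compared; config changes and
    downstream dependencies are ignored.
    """
    prod_nodes = prod_manifest.get("nodes", {})
    changed = []
    for unique_id, node in pr_manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, etc.
        pr_checksum = node.get("checksum", {}).get("checksum")
        prod_node = prod_nodes.get(unique_id)
        if prod_node is None or prod_node.get("checksum", {}).get("checksum") != pr_checksum:
            changed.append(unique_id)
    return sorted(changed)


# Tiny hand-written stand-ins for two manifest.json files (illustrative only).
prod = {"nodes": {
    "model.shop.orders": {"resource_type": "model", "checksum": {"name": "sha256", "checksum": "aaa"}},
    "model.shop.users": {"resource_type": "model", "checksum": {"name": "sha256", "checksum": "bbb"}},
}}
pr = {"nodes": {
    "model.shop.orders": {"resource_type": "model", "checksum": {"name": "sha256", "checksum": "ccc"}},
    "model.shop.users": {"resource_type": "model", "checksum": {"name": "sha256", "checksum": "bbb"}},
    "model.shop.payments": {"resource_type": "model", "checksum": {"name": "sha256", "checksum": "ddd"}},
}}

print(changed_models(prod, pr))
```

With real files, each manifest would be loaded with `json.load` from the `target/manifest.json` that dbt writes for the respective branch.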

The following command will be incorporated into your CI script(s).

datafold dbt upload --ci-config-id <your-ci_config-id> --run-type <job-type> --commit-sha <commit-sha>

Implementation details vary depending on which CI tool you use. Please review the following instructions and examples for your organization's CI tool.
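If your CI script is written in Python rather than YAML, the same command can be assembled programmatically. The sketch below only builds the argument list using the three flags shown above. The helper name `build_upload_command` is hypothetical, and restricting `--run-type` to `production` and `pull_request` is an assumption based on the two values that appear in this guide.

```python
import shlex

# Assumption: these are the only two run types used in this guide.
VALID_RUN_TYPES = ("production", "pull_request")


def build_upload_command(ci_config_id, run_type, commit_sha):
    """Build the argv list for `datafold dbt upload` from its three documented flags."""
    if run_type not in VALID_RUN_TYPES:
        raise ValueError(f"unexpected run type: {run_type!r}")
    return [
        "datafold", "dbt", "upload",
        "--ci-config-id", str(ci_config_id),
        "--run-type", run_type,
        "--commit-sha", commit_sha,
    ]


# 123 and abc1234 are placeholder values, not real identifiers.
print(shlex.join(build_upload_command(123, "production", "abc1234")))
```

The resulting list can be passed to `subprocess.run(..., check=True)` in an environment where the Datafold SDK is installed and `DATAFOLD_API_KEY` is set.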

CI Implementation Tools

We've created guides and templates for three popular CI tools.

Having trouble setting up Datafold in CI?

👋 We're here to help! Please reach out and chat with a Datafold Solutions Engineer. ☎️

To add Datafold to your CI tool, add datafold dbt upload steps in two CI jobs:

  • Upload Production Artifacts: A CI job that builds a production manifest.json. This can be either your Production Job or a special Artifacts Job (explained below).
  • Upload Pull Request Artifacts: A CI job that builds a PR manifest.json.

This ensures Datafold always has the necessary manifest.json files, enabling us to run data diffs comparing production data to dev data.

Upload Production Artifacts

Add the datafold dbt upload step to either your Production Job or an Artifacts Job.

Production Job

If your dbt prod job kicks off on merges to main/master, you can simply add a datafold dbt upload step after the dbt build step.

name: Production Job

on:
  push: # Run the job on push to the main branch
    branches:
      - main

jobs:
  run:
    runs-on: ubuntu-20.04 # your image will vary

    steps:

      - name: Install Datafold SDK
        run: pip install -q datafold-sdk
      # ...
      - name: Upload dbt artifacts to Datafold
        run: datafold dbt upload --ci-config-id <datafold_ci_config_id> --run-type production --commit-sha ${GIT_SHA}
        env:
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          GIT_SHA: "${{ github.sha }}"

Artifacts Job

Alternatively, if your dbt prod job does not run on merges to main/master and only runs on a schedule, we recommend creating an additional dedicated job that runs on merges to main/master.

The Artifacts Job's entire purpose is generating and uploading a manifest.json file to Datafold to represent the state of your dbt project's production branch.

This is the basic structure of an Artifacts Job:

name: Artifacts Job

on:
  push: # Run the job on push to the main branch
    branches:
      - main

jobs:
  run:
    runs-on: ubuntu-20.04 # your image will vary

    steps:

      # You should include the same (or similar) steps that exist in your production job, such as:
      # - Checkout your code base
      # - Install Python
      # - Install any additional requirements

      - name: Install Datafold SDK
        run: pip install -q datafold-sdk

      - name: Generate dbt manifest.json
        run: dbt ls # or dbt compile

      - name: Upload dbt artifacts to Datafold
        run: datafold dbt upload --ci-config-id <datafold_ci_config_id> --run-type production --commit-sha ${GIT_SHA}
        env:
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          GIT_SHA: "${{ github.sha }}"

Upload Pull Request Artifacts

The datafold dbt upload step also needs to be added to the CI job that builds PR data.

dbt in CI

🔧 If you don't have a CI job that builds PR data, we can help you set this up. Please check out this step-by-step blog post, or book time to chat with a Datafold Solutions Engineer. ☎️

Pull Request Job

If you already have a Pull Request Job, add the datafold dbt upload step after the dbt build step.

name: Pull Request Job

# Run this job when a commit is pushed to any branch except main
on:
  pull_request:
  push:
    # GitHub Actions requires branches-ignore when only excluding branches
    branches-ignore:
      - main

jobs:
  run:
    runs-on: ubuntu-20.04 # your image will vary

    steps:

      - name: Install Datafold SDK
        run: pip install -q datafold-sdk

      - name: Upload PR manifest.json to Datafold
        run: |
          datafold dbt upload --ci-config-id <datafold_ci_config_id> --run-type pull_request --commit-sha ${GIT_SHA}
        # The <datafold_ci_config_id> value can be obtained from the Datafold application: Settings > Integrations > dbt Core/Cloud > the ID column
        env:
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          GIT_SHA: "${{ github.event.pull_request.head.sha }}"

Be sure to replace <datafold_ci_config_id> with the CI config ID you obtained in step 2.

Store your Datafold API Key

Finally, store your Datafold API Key as a secret named DATAFOLD_API_KEY in your GitHub repository settings.

Once you've completed these steps, Datafold will run data diffs between production and development data on the next GitHub Actions CI run.

Optional CI Configurations and Strategies

Including datafold-skip-ci in the git commit message

  • If the last commit contains the string datafold-skip-ci, the Datafold step in CI will be skipped.
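As an illustration of the rule above, the check amounts to a substring test on the last commit message. This is a sketch, not the SDK's code. It assumes a case-sensitive match, and the function name `should_skip_datafold` is hypothetical.

```python
SKIP_MARKER = "datafold-skip-ci"


def should_skip_datafold(last_commit_message):
    """Return True if the last commit message contains the skip marker.

    Assumption: the match is a plain, case-sensitive substring test.
    """
    return SKIP_MARKER in last_commit_message


print(should_skip_datafold("Fix typo in README [datafold-skip-ci]"))
print(should_skip_datafold("Refactor orders model"))
```

Locally, the last commit message can be read with `git log -1 --pretty=%B` to check whether a push would skip the Datafold step.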