Note

This section of the docs is only relevant if the data used as inputs during the PR build are inconsistent with the data used as inputs during the last production build. Please contact support@datafold.com if you’d like to learn more.

What is data drift in CI?

Datafold is used in CI to illuminate the impact of a pull request’s proposed code change by comparing two versions of the data and identifying differences.

Data drift in CI happens when those data differences are caused by changes in upstream data sources rather than by the proposed code changes.

Data drift in CI adds “noise” to your CI testing analysis, making it hard to tell whether data differences are due to new code or to changes in the source data. Unless both versions rely on the same snapshot of upstream data, data drift can compromise your ability to see the true effect of the code changes.
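As a toy illustration of this noise, the sketch below compares a base transformation and a PR transformation, first on the same upstream snapshot and then on two different ones. The function names, rows, and "refund" rule are all hypothetical and are not part of Datafold or dbt; they only show how an upstream change ends up in the diff.

```python
def transform_base(rows):
    """Base-branch logic (hypothetical): total amount per customer."""
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0) + amount
    return totals


def transform_pr(rows):
    """PR-branch logic (hypothetical): same totals, but refunds are ignored."""
    totals = {}
    for customer, amount in rows:
        if amount >= 0:
            totals[customer] = totals.get(customer, 0) + amount
    return totals


def diff(a, b):
    """Return the keys whose values differ between two results."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


snapshot_monday = [("alice", 10), ("bob", 5), ("bob", -2)]
snapshot_tuesday = snapshot_monday + [("carol", 7)]  # a new upstream row arrived

# Same snapshot on both sides: only the code change shows up.
same_snapshot_diff = diff(transform_base(snapshot_monday), transform_pr(snapshot_monday))

# Different snapshots: the new upstream row is mixed into the diff.
drifted_diff = diff(transform_base(snapshot_monday), transform_pr(snapshot_tuesday))

print(same_snapshot_diff)  # ['bob']
print(drifted_diff)        # ['bob', 'carol']
```

In the second comparison, "carol" appears as a difference even though the code change has nothing to do with her rows. That extra entry is the drift.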

Tip

dbt users should implement Slim CI in dbt Core or dbt Cloud to prevent most instances of data drift. Because state deferral makes the CI build read its unmodified upstream models from production, Slim CI both reduces build time and removes most sources of drift. It does not eliminate drift completely, however: when the model being modified in the PR depends on a source, the CI build reads that source as it exists at CI time, which may differ from what the last production build read. In those cases, we recommend building twice in CI.
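For dbt Core, a Slim CI invocation typically combines the `state:modified+` selector with the `--defer` and `--state` flags. The sketch below only assembles that command line. The helper name and the `prod-artifacts` directory are assumptions for illustration; the directory is expected to hold the `manifest.json` from the last production run, and your project may need extra flags such as `--target`.

```python
import shlex


def slim_ci_command(prod_state_dir: str) -> list[str]:
    """Build a dbt Slim CI command (hypothetical helper).

    prod_state_dir is assumed to contain the production manifest.json.
    """
    return [
        "dbt", "build",
        "--select", "state:modified+",  # modified models and their downstreams
        "--defer",                      # resolve unbuilt upstreams from production
        "--state", prod_state_dir,      # where the production artifacts live
    ]


cmd = slim_ci_command("prod-artifacts")
print(shlex.join(cmd))
# dbt build --select state:modified+ --defer --state prod-artifacts
```

A CI job could pass `cmd` to `subprocess.run(cmd, check=True)` after downloading the production artifacts.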

Why prevent data drift in CI?

By eliminating data drift entirely, you can be confident that any differences detected in CI are driven only by your code, not unexpected data changes.

You can think of this as similar to a scientific experiment, where the control and treatment groups ideally start from identical baseline conditions, so that the treatment is the only variable that could cause different outcomes.

In practice, many organizations do not completely eliminate data drift and still derive value from the automatic data diffing and analysis that Datafold conducts in CI, despite the minor noise that remains.

Handling data drift

Build twice in CI

The most rigorous way to prevent data drift in CI is to set up two builds in CI: one build representing PR data and another representing production data, both based on an identical snapshot of upstream data.

  1. Create a fixed snapshot of the upstream data that both builds will use.
  2. Run two dbt builds in the CI pipeline: one using the PR branch of code, and another using the base branch of code. This creates two data environments that Datafold can compare.

Since both builds transform the same snapshot of upstream data, any detected differences will be due to the code changes alone, ensuring an accurate comparison with no false positives.
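The two steps above can be sketched as a pair of dbt invocations that differ only in their output target and read the same snapshot. This is a minimal sketch under stated assumptions, not a prescribed setup: the target names `ci_base` and `ci_pr`, the `upstream_schema` variable, and the snapshot schema name are hypothetical and would have to be defined in your own `profiles.yml` and source definitions. Checking out the base or PR branch before each build happens outside this snippet.

```python
import json


def two_build_commands(snapshot_schema: str,
                       base_target: str = "ci_base",
                       pr_target: str = "ci_pr") -> dict[str, list[str]]:
    """Return dbt commands for the base-branch and PR-branch builds.

    Both builds receive the same (hypothetical) `upstream_schema` variable,
    so they read the same fixed snapshot of upstream data.
    """
    # dbt's --vars flag accepts a YAML string; JSON is valid YAML.
    vars_arg = json.dumps({"upstream_schema": snapshot_schema})
    return {
        # Run after checking out the base branch.
        "base": ["dbt", "build", "--target", base_target, "--vars", vars_arg],
        # Run after checking out the PR branch.
        "pr": ["dbt", "build", "--target", pr_target, "--vars", vars_arg],
    }


commands = two_build_commands("raw_snapshot_ci")
for name, cmd in commands.items():
    print(name, cmd)
```

Writing each build to its own target keeps the two data environments separate, so they can be compared afterwards.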

In this architecture, CI does not use production data directly; it uses a snapshot or clone instead. This removes the noise that would otherwise be introduced if the most recent production job had used outdated upstream data.

By building two versions of the data in CI, you can ensure an “apples-to-apples” comparison that depends on the same version of upstream data.

If performance is a concern, you can use a reduced or filtered upstream data set to speed up the CI process while still providing rich insight into the data.
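If you reduce the upstream data, the reduction should be deterministic so that both builds still see exactly the same rows. One possible approach, shown here purely as an illustration, is to keep a row based on a hash of its key. The function name, the key format, and the 10% threshold are assumptions; in practice the same idea would usually be written as a filter in the SQL that creates the snapshot.

```python
import hashlib


def keep_in_sample(key: str, percent: int) -> bool:
    """Deterministically decide whether a row key belongs to the sample.

    The same key always gets the same answer, so repeated runs (and both
    CI builds) select an identical subset.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


keys = [f"order-{i}" for i in range(1000)]
sample_first_run = [k for k in keys if keep_in_sample(k, 10)]
sample_second_run = [k for k in keys if keep_in_sample(k, 10)]

# The subset is identical on every run, unlike a random sample.
print(sample_first_run == sample_second_run)  # True
```

A hash-based filter avoids the drift that a random sample would reintroduce, because a random sample could select different rows for each build.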

This method assumes the production build doesn’t involve multiple jobs that process different sets of models at different times.