Handling Data Drift

What is Data Drift in CI?

Data Drift in CI occurs when the two data transformation builds that Datafold compares in CI produce different outputs because the upstream data changed between the builds.

Why should Data Drift in CI be prevented?

Datafold in CI compares data representing the PR branch of code to data representing production.

If this comparison is between CI data and actual production data, and the two builds did not run at the same time, then differences in the data may be caused not by code changes but by changes in the upstream data. This is Data Drift.

This undermines the central goal of Datafold in CI, which is to illuminate the impact of the PR's proposed code change by comparing two versions of the data.
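
To make the problem concrete, here is a minimal toy sketch in plain Python. It does not use Datafold or dbt; the `transform` and `diff` functions and the sample rows are hypothetical and exist only for illustration. The same code is run against two versions of the upstream data, and the outputs still differ.

```python
# Toy illustration of Data Drift: identical code, different upstream data.
# All names and rows here are hypothetical, not part of any Datafold API.

def transform(orders):
    """Total order amount per customer."""
    totals = {}
    for customer, amount in orders:
        totals[customer] = totals.get(customer, 0) + amount
    return totals

def diff(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(set(a) | set(b))
        if a.get(k) != b.get(k)
    }

# Upstream data as it was when the production job last ran.
upstream_at_prod_run = [("alice", 10), ("bob", 5)]
# Upstream data when the CI build ran later: new rows have arrived.
upstream_at_ci_run = upstream_at_prod_run + [("bob", 7), ("carol", 3)]

prod_output = transform(upstream_at_prod_run)
pr_output = transform(upstream_at_ci_run)  # same code, newer data

drift = diff(prod_output, pr_output)
print(drift)  # {'bob': (5, 12), 'carol': (None, 3)}
```

Every difference reported here comes from the upstream data, not from a code change, which is exactly the noise a reviewer cannot tell apart from real impact.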

Tip: dbt users should implement Slim CI to prevent most instances of Data Drift.

Handling Data Drift with Datafold

The best way to prevent Data Drift is to build not once but twice in CI: one build representing the PR and another representing production, both reading the same version of the upstream data. These builds may use a reduced or filtered upstream data set to speed up the CI process while still providing rich insight into the data.

With this architecture, production data is not involved in Datafold diffing in CI.

[Figure: data-drift-solution]
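
The two-build approach can be sketched with a toy Python example. This is an assumption-laden illustration, not Datafold's or dbt's implementation: `build_base`, `build_pr`, `diff`, and the `snapshot` rows are all hypothetical stand-ins for the production-code build, the PR-code build, the comparison, and a (possibly filtered) upstream data set.

```python
from collections import defaultdict

# Toy sketch of "build twice in CI". All names are hypothetical.

def build_base(orders):
    """Stand-in for production code: sum all amounts per customer."""
    totals = defaultdict(int)
    for customer, amount in orders:
        totals[customer] += amount
    return dict(totals)

def build_pr(orders):
    """Stand-in for the PR's code: ignore refunds (negative amounts)."""
    totals = defaultdict(int)
    for customer, amount in orders:
        if amount > 0:
            totals[customer] += amount
    return dict(totals)

def diff(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(set(a) | set(b))
        if a.get(k) != b.get(k)
    }

# A single upstream snapshot shared by both CI builds.
# In practice this could be a reduced or filtered sample.
snapshot = [("alice", 10), ("bob", 5), ("bob", -2), ("carol", 3)]

base_output = build_base(snapshot)  # production logic, built in CI
pr_output = build_pr(snapshot)      # PR logic, built in CI

changes = diff(base_output, pr_output)
print(changes)  # {'bob': (3, 5)}
```

Because both builds read the same snapshot, the only reported difference is the one the PR's logic introduces: bob's refund is no longer subtracted.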

Benefits of this architecture

  • It avoids the noise that arises when PR CI data is compared to production data whose most recent job used older upstream data.
  • By building two versions of the data in CI, you can ensure an apples-to-apples comparison that depends on the same version of upstream data.
  • This approach ensures that Datafold diffing shows the impact of the code change alone, with no noise introduced by shifting upstream data.