Handling Data Drift
Datafold in CI compares PR branch data to production data to detect changes. This page explains how to keep changes in upstream data from showing up as data drift in that comparison.
Note
This section of the docs is only relevant if the data used as inputs during the PR build are inconsistent with the data used as inputs during the last production build. Please contact support@datafold.com if you’d like to learn more.
What is data drift in CI?
Data drift in CI occurs when the two data transformation builds that Datafold compares in CI produce different outputs because the upstream input data changed between the two builds.
Tip
dbt users should implement Slim CI to prevent most instances of data drift.
Why should data drift in CI be prevented?
If Datafold compares CI data to production data, and the two builds did not run at the same time, then differences in the data may be caused not by code changes but by differences in upstream data. That is data drift.
This undermines the central goal of Datafold in CI, which is to illuminate the impact of the PR’s proposed code change by comparing two versions of the data.
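To make the problem concrete, here is a minimal sketch in plain Python. It is not Datafold's implementation: `build_revenue`, `diff`, and the sample rows are hypothetical names and data invented for illustration. The same transformation code runs twice, once on the upstream data as it was at the production run and once on the upstream data as it is at the PR run, and the comparison still reports a difference.

```python
def build_revenue(orders):
    """Toy transformation (hypothetical): total order amount per customer."""
    totals = {}
    for order in orders:
        customer = order["customer"]
        totals[customer] = totals.get(customer, 0) + order["amount"]
    return totals


def diff(left, right):
    """Return {key: (left_value, right_value)} for keys whose values differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


# Upstream data as it looked when the last production job ran.
upstream_at_prod_run = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
]
# Upstream data at PR build time: one new order has arrived since then.
upstream_at_pr_run = upstream_at_prod_run + [{"customer": "b", "amount": 7}]

# The transformation code is identical in both builds.
production = build_revenue(upstream_at_prod_run)
pr_build = build_revenue(upstream_at_pr_run)

drift = diff(production, pr_build)
print(drift)  # {'b': (5, 12)}
```

The reported difference for customer `b` comes entirely from the new upstream row, not from any code change, so a reviewer cannot tell it apart from a real regression.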
Handling data drift with Datafold
The best way to prevent data drift is by building not once, but twice in CI: one build representing PR data, and another representing production. These builds may use a reduced or filtered upstream data set to speed up the CI process while still providing rich insight into the data.
With this architecture, production data is not involved in Datafold diffing in CI.
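The two-build approach can be sketched as follows, again in plain Python under stated assumptions. `sample_upstream`, `total_by_customer`, `diff_builds`, the `skip_refunds` flag standing in for the PR's code change, and the sample rows are all hypothetical. In a real pipeline each build would be a run of your transformation tool against the base branch and the PR branch.

```python
def sample_upstream(orders, min_day):
    """Reduced upstream data set (hypothetical filter): keep only recent rows."""
    return [o for o in orders if o["day"] >= min_day]


def total_by_customer(orders, skip_refunds):
    """Toy transformation; skip_refunds=True stands in for the PR's code change."""
    totals = {}
    for o in orders:
        if skip_refunds and o["refunded"]:
            continue
        totals[o["customer"]] = totals.get(o["customer"], 0) + o["amount"]
    return totals


def diff_builds(base, pr):
    """Return {key: (base_value, pr_value)} for keys whose values differ."""
    keys = sorted(set(base) | set(pr))
    return {k: (base.get(k), pr.get(k)) for k in keys if base.get(k) != pr.get(k)}


upstream = [
    {"customer": "a", "amount": 10, "day": 1, "refunded": False},
    {"customer": "a", "amount": 4, "day": 8, "refunded": False},
    {"customer": "b", "amount": 5, "day": 9, "refunded": True},
    {"customer": "b", "amount": 7, "day": 9, "refunded": False},
]

# Both CI builds read the same filtered snapshot of upstream data.
snapshot = sample_upstream(upstream, min_day=7)

base_build = total_by_customer(snapshot, skip_refunds=False)  # base branch code
pr_build = total_by_customer(snapshot, skip_refunds=True)     # PR branch code

code_change_impact = diff_builds(base_build, pr_build)
print(code_change_impact)  # {'b': (12, 7)}
```

Because both builds share one snapshot, the only difference reported (customer `b`) is attributable to the code change. Customer `a` is unchanged, and the older row filtered out of the snapshot never enters either build.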
Benefits of this architecture
- Comparing PR CI data to production can introduce noise if the most recent production job used outdated upstream data. Building both versions in CI avoids that comparison.
- By building two versions of the data in CI, you can ensure an apples-to-apples comparison that depends on the same version of upstream data.
- This approach ensures that Datafold diffing illustrates the impact of the code change alone, with no noise introduced by shifting upstream data.