How Datafold in CI works

A core component of Datafold Cloud is its integration into your Continuous Integration (CI) process. This is how Datafold creates Data Diffs for all SQL code changes, catching issues before they reach production.

What is CI?

Put simply, Continuous Integration (or CI) is a process for building and testing changes to your code before deploying to production.

Without CI

  • Updates are manually coordinated and become a complex synchronization chore.
  • Testing is done manually, if at all.
  • Code changes are released at a slower cadence, and with higher rates of failure.

With CI

  • Code changes are managed smoothly, and the process scales as your team and code base grow.
  • High-confidence test coverage is automated.
  • The quantity and quality of developer output both increase.

For Datafold to work in CI, you need to add a step that builds staging data to the CI process in your code repository system (e.g., GitHub).

What is Staging Data?

Staging data is created using the version of the code in your PR/MR branch, which contains the edits you're currently working on.

Prerequisite: Building staging data in CI

If you use dbt, you first need to add a dbt build step to CI before you can add Datafold. This can be done using dbt Cloud or dbt Core.

If you use another orchestrator such as Airflow, you should follow the steps in this blog to build staging data in CI, or reach out to our team for customized recommendations based on your infrastructure.

Creating production and staging data

Datafold in CI automatically identifies value-level data differences between production data and staging data.

Summarized Data Diff results are written directly to your PR as a comment. From there, by clicking into Datafold Cloud, you can access value-level differences, downstream impact on BI tools, and other context-rich information about the impact of your PR code changes.
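To make "value-level differences" concrete, here is a minimal sketch in Python of comparing a production row set to a staging row set by primary key. It is an illustration only, not Datafold's implementation: the function name `diff_rows`, the sample rows, and the `id` key column are all hypothetical, and a real diff runs against warehouse tables rather than in-memory lists.

```python
from typing import Any

Row = dict[str, Any]


def diff_rows(prod: list[Row], staging: list[Row], key: str) -> dict[str, Any]:
    """Compare two row sets by primary key and report value-level differences.

    Hypothetical helper for illustration; not part of any Datafold API.
    """
    prod_by_key = {row[key]: row for row in prod}
    staging_by_key = {row[key]: row for row in staging}

    # Keys present on only one side are whole-row additions or removals.
    removed = sorted(prod_by_key.keys() - staging_by_key.keys())
    added = sorted(staging_by_key.keys() - prod_by_key.keys())

    # For keys present on both sides, record each column whose value differs
    # as a (production value, staging value) pair.
    changed: dict[Any, dict[str, tuple[Any, Any]]] = {}
    for pk in sorted(prod_by_key.keys() & staging_by_key.keys()):
        before, after = prod_by_key[pk], staging_by_key[pk]
        columns = {
            col: (before.get(col), after.get(col))
            for col in sorted(before.keys() | after.keys())
            if before.get(col) != after.get(col)
        }
        if columns:
            changed[pk] = columns

    return {"added": added, "removed": removed, "changed": changed}


# Made-up sample data: production vs. the version built from the PR branch.
prod_rows = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 250},
]
staging_rows = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 275},
    {"id": 3, "amount": 50},
]

result = diff_rows(prod_rows, staging_rows, key="id")
print(result)
```

In this sample, row 3 exists only in staging and the `amount` of row 2 differs between the two versions, which is the kind of change a PR comment would summarize.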

Production data

The orchestrator that runs your SQL code (e.g., dbt, Airflow) builds and updates production data in your warehouse. This is the data that your dashboards, BI systems, and users depend on.

If you use dbt, we'll assume that you have a production job in dbt Cloud or dbt Core that builds or updates your dbt models in the warehouse on a schedule. Or, you might have a scheduled job in Airflow or another orchestrator that builds production data on a regular basis.

Staging data

For Datafold to run Data Diffs in CI, there should be a step in your CI process which builds a version of your data in a dedicated schema using the code in your PR/MR branch. Datafold compares this staging data to your production data when diffing.

Tip: You can use either dbt Cloud or dbt Core to add a step in your CI process that builds staging data.
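One common way to get a dedicated schema per PR/MR is to derive the schema name from the PR number, so concurrent PRs do not overwrite each other's staging data. The sketch below assumes such a convention; the function name `staging_schema_for` and the `pr_<number>` pattern are hypothetical choices, not something Datafold or dbt requires.

```python
import re


def staging_schema_for(pr_number: int, prefix: str = "pr") -> str:
    """Build a per-PR staging schema name such as 'pr_123'.

    Hypothetical naming convention, shown for illustration only.
    """
    if pr_number <= 0:
        raise ValueError("pr_number must be a positive integer")
    # Lowercase the prefix and replace anything that is not a letter,
    # digit, or underscore, so the result is a safe unquoted identifier.
    safe_prefix = re.sub(r"[^a-z0-9_]", "_", prefix.lower())
    return f"{safe_prefix}_{pr_number}"


print(staging_schema_for(123))
```

Your CI job would then pass this name to whatever builds the staging data, so each PR writes to its own schema.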

Comparing production and staging data

Once you have a job in CI that builds staging data, you'll be ready to get started with Datafold in CI!

We'll walk through the setup steps in more detail in the Getting Started section.

Datafold in CI for dbt users

While Datafold can be added to CI no matter what orchestrator you use, it's worth detailing exactly how this works with dbt, a popular and opinionated tool for which we have specific recommendations.

Here is how Datafold + dbt in CI works:

  • Two versions of your dbt project's manifest.json are submitted to Datafold: one representing the production code and one representing the PR/MR code.
    • This submission of dbt artifacts happens out-of-the-box with dbt Cloud.
    • dbt Core users can set this up by adding steps to their existing CI configuration in CircleCI, GitHub Actions, or GitLab.
  • Datafold uses these two versions of the manifest.json to identify code differences.
  • Datafold queries your warehouse, runs Data Diffs on the modified models, and identifies downstream impact on data apps like Looker, Tableau, Hightouch, and Mode.
    • Datafold diffs dbt models whether they are materialized as tables or as views.
    • Got a huge dbt project with many downstreams? Don't worry! You can set up Slim Diff or utilize other configuration options to manage scale, while ensuring critical models are diffed.
  • The results of the Data Diffs are then written directly to your code repository system (e.g., GitHub), and more details can be viewed in the Datafold Cloud application.
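
The manifest comparison in the steps above can be sketched roughly as follows. This is not Datafold's actual logic: it is a simplified illustration that assumes each manifest exposes a `nodes` map whose entries carry `resource_type` and `checksum` fields, plus a `child_map` of downstream dependencies, as dbt's manifest.json does. The function names and the trimmed sample manifests are hypothetical.

```python
from typing import Any

Manifest = dict[str, Any]


def modified_models(prod: Manifest, pr: Manifest) -> list[str]:
    """Return IDs of models that are new in the PR or whose checksum changed."""
    prod_nodes = prod.get("nodes", {})
    changed = []
    for node_id, node in pr.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        old = prod_nodes.get(node_id)
        if old is None or old.get("checksum") != node.get("checksum"):
            changed.append(node_id)
    return sorted(changed)


def downstream_of(manifest: Manifest, roots: list[str]) -> list[str]:
    """Walk child_map to collect everything downstream of the given nodes."""
    child_map = manifest.get("child_map", {})
    root_set = set(roots)
    seen: set[str] = set()
    stack = list(roots)
    while stack:
        current = stack.pop()
        for child in child_map.get(current, []):
            if child not in seen and child not in root_set:
                seen.add(child)
                stack.append(child)
    return sorted(seen)


def model(checksum: str) -> dict[str, Any]:
    """Build a trimmed, made-up model node for the sample manifests."""
    return {
        "resource_type": "model",
        "checksum": {"name": "sha256", "checksum": checksum},
    }


# Trimmed, made-up manifests: 'orders' is edited and 'refunds' is new in the PR.
prod_manifest = {
    "nodes": {
        "model.shop.orders": model("aaa"),
        "model.shop.revenue": model("bbb"),
    },
    "child_map": {"model.shop.orders": ["model.shop.revenue"]},
}
pr_manifest = {
    "nodes": {
        "model.shop.orders": model("ccc"),
        "model.shop.revenue": model("bbb"),
        "model.shop.refunds": model("ddd"),
    },
    "child_map": {
        "model.shop.orders": ["model.shop.revenue"],
        "model.shop.revenue": [],
        "model.shop.refunds": [],
    },
}

changed = modified_models(prod_manifest, pr_manifest)
impacted = downstream_of(pr_manifest, changed)
print("modified:", changed)
print("downstream:", impacted)
```

Here the modified models would be candidates for diffing, and the downstream list shows which unchanged models could still be affected by the PR.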