How Datafold in CI Works
Learn how Datafold integrates with your Continuous Integration (CI) process to create Data Diffs for all SQL code changes, catching issues before they make it into production.
What is CI?
Continuous Integration (or CI) is a process for building and testing changes to your code before deploying to production. This ensures early detection of potential issues and improves the quality of code deployment.
Without CI | With CI |
---|---|
Updates are manually coordinated and become a complex synchronization chore. | Smoothly manage code changes, and scale as your team and code base grow. |
Testing is done manually, if at all. | Automate high-confidence test coverage. |
Code changes are released at a slower cadence, and with higher rates of failure. | Boost the quantity and quality of developer output. |
Datafold in CI
For Datafold to work in CI, you need to add a step that builds in your CI process (e.g., GitHub).
Prerequisite: Building staging data in CI
If you’re using dbt, you’ll need to add a dbt build step to your CI pipeline first. This can be done through either dbt Cloud or dbt Core.
For other orchestrators like Airflow, follow this guide to build staging data in CI, or contact us for custom recommendations based on your infrastructure.
In this short clip, see how the Datafold bot automatically comments on your PR, highlighting data differences between the production and development versions of your code:
Creating production and staging data
When Datafold is integrated into your CI, it automatically detects and highlights value-level differences between production data and staging data.
These summarized Data Diff results are written directly in your pull request (PR) as a comment. From the comment, you can access the Datafold App to explore value-level differences, understand the impact on downstream BI tools, and other context-rich information about the impact of your PR code changes.
Production data
Production data refers to the data that your organization depends on for daily operations, such as powering dashboards and BI tools. Your orchestrator (e.g., dbt, Airflow) is responsible for running SQL code that builds and maintains this data in your warehouse.
If you use dbt, we’ll assume that you have a production job in either dbt Cloud or dbt Core that builds or updates your dbt models in the warehouse on a schedule. Or, you might have a scheduled job in Airflow or another orchestrator that builds production data on a regular basis.
Staging data
For Datafold to run Data Diffs in CI, you need a step in your CI process that builds staging data (a version of your data in a dedicated schema) using the code in your PR/MR branch. Datafold will compares this staging data against your production data when diffing.
Tip
You can use either dbt Cloud or dbt Core to add a step in your CI process that builds staging data.
- Setting up dbt in CI for dbt Cloud users
- Setting up dbt in CI for dbt Core users
- Building staging data in CI using Airflow
Comparing production and staging data
Once you have a job in CI that builds staging data, you’re ready to get started with Datafold in CI!
By comparing production and staging data, Datafold ensures that any code changes are thoroughly validated before being merged, helping to prevent data issues from reaching production.
We’ll walk through the setup steps in more detail in the Getting Started section.
Datafold in CI for dbt users
While Datafold can be added to CI no matter what orchestrator you use, it’s worth detailing exactly how this works with dbt, a popular and opinionated tool for which we have specific recommendations.
Here is how Datafold + dbt in CI works:
Submit dbt Project Manifests
Two versions of your dbt project’s manifest.json
are submitted to Datafold, representing the state of the production code and the PR/MR code.
- For dbt Cloud users, this submission happens automatically.
- dbt Core users need to add steps in their CI configuration (e.g., Circle CI, GitHub Actions, or GitLab) to submit the artifacts.
Identify Code Differences
Datafold compares the two versions of the manifest.json
to identify differences in the code.
Run Data Diffs
Datafold queries your data warehouse to run Data Diffs on the modified dbt models. It also identifies downstream assets (e.g., BI tools, reverse ETL pipelines) that might be impacted by the changes.
- Datafold can diff dbt models that are materialized as both tables and views.
- If your dbt project has many downstream dependencies, you can use Slim Diff or other configuration options to manage scale, ensuring critical models are prioritized.
Summarize Data Diffs in Pull Request
The results of the Data Diffs are summarized in a comment on your pull request (e.g., in GitHub). You can click the comment to view more detailed information in the Datafold application.