What's a data diff?

A data diff is the value-level comparison between two tables—used to identify critical changes to your data and guarantee data quality.
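To make "value-level comparison" concrete, here is a minimal sketch in plain Python. It is not how data-diff is implemented; the function name `value_level_diff`, the sample tables, and the column names are all hypothetical and exist only for illustration. Each table is modeled as a dictionary keyed by primary key.

```python
from typing import Any

Row = dict[str, Any]


def value_level_diff(left: dict[Any, Row], right: dict[Any, Row]) -> dict[str, Any]:
    """Compare two tables keyed by primary key (illustrative helper, not a data-diff API)."""
    # Keys present on only one side are whole-row additions or removals.
    removed = sorted(left.keys() - right.keys())
    added = sorted(right.keys() - left.keys())

    # For keys present on both sides, record each column whose value differs.
    changed: dict[Any, dict[str, tuple[Any, Any]]] = {}
    for key in sorted(left.keys() & right.keys()):
        columns = left[key].keys() | right[key].keys()
        diffs = {
            col: (left[key].get(col), right[key].get(col))
            for col in sorted(columns)
            if left[key].get(col) != right[key].get(col)
        }
        if diffs:
            changed[key] = diffs

    return {"removed": removed, "added": added, "changed": changed}


# Hypothetical sample data: a production table and a development build of it.
prod = {
    1: {"email": "a@example.com", "plan": "free"},
    2: {"email": "b@example.com", "plan": "pro"},
    3: {"email": "c@example.com", "plan": "pro"},
}
dev = {
    1: {"email": "a@example.com", "plan": "free"},
    2: {"email": "b@example.com", "plan": "team"},
    4: {"email": "d@example.com", "plan": "free"},
}

result = value_level_diff(prod, dev)
print(result)
```

The result separates rows that exist on only one side from rows whose individual column values changed, which is the distinction a value-level diff is meant to surface.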

Datafold's data-diff is a tool that compares datasets fast, within or across databases.

There is a lot you can do with data-diff: you can test SQL code by comparing development or staging environment data to production, or compare source and target data to identify discrepancies when moving data between databases.
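As a rough illustration of the source-versus-target use case, the sketch below uses only Python's standard `sqlite3` module with two separate in-memory databases standing in for the source and target systems. It does not use the data-diff package; the `orders` table, its columns, and the helper names are assumptions made for this example. Rows are reported with a `-` when they appear only in the source and a `+` when they appear only in the target.

```python
import sqlite3


def load(rows):
    """Create an in-memory database with a hypothetical `orders` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, status TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def fetch(conn):
    """Read every row of the table as a set of tuples."""
    return set(conn.execute("SELECT id, amount, status FROM orders"))


# Two independent databases: the target is missing one row and has one changed value.
source = load([(1, 10.0, "paid"), (2, 25.5, "paid"), (3, 7.0, "open")])
target = load([(1, 10.0, "paid"), (2, 25.5, "refunded")])

source_rows, target_rows = fetch(source), fetch(target)

# "-" marks rows only in the source, "+" marks rows only in the target.
discrepancies = sorted(
    [("-", row) for row in source_rows - target_rows]
    + [("+", row) for row in target_rows - source_rows],
    key=lambda d: (d[1][0], d[0]),  # order by primary key, then by sign
)

for sign, row in discrepancies:
    print(sign, row)
```

Pulling every row into memory like this only works for small tables; it is meant to show what a discrepancy report looks like, not how to diff at scale.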

Why diff data?

Just as diffing code and text is fundamental to software engineering and working with text documents, diffing data is essential to the data engineering workflow.

In data engineering, both data and the code that processes it are constantly evolving. Without the ability to easily diff data, understanding and tracking data changes becomes challenging. This slows down the development process and makes it harder to ensure data quality.

Is data-diff open-source?

Yes. We made data-diff an open-source Python package because we believe that diffing is a fundamental capability in data engineering that every engineer should have access to. It compares datasets within or across databases and works on a per-table basis.

Datafold Cloud is the enterprise-ready solution for data testing at scale. It includes more comprehensive, optimized, and automated diffing solutions, API access, and secure deployment options.

How does open-source data-diff compare to Datafold Cloud?

Datafold Cloud is to data-diff what GitHub is to git: an application that automates workflows on top of an enabling open-source technology.

Datafold's open-source data-diff is primarily designed for individual developers running ad hoc data diffs. Datafold Cloud caters to more complex and production-ready scenarios, including:

  • Automated and collaborative diffing and testing for data transformations in CI
  • Data diffing informed by column-level lineage, and validation of code changes with visibility into BI applications
  • Validating large data migrations or continuous replications with automated cross-database diffing capabilities

While data-diff unlocks an essential capability of diffing data for data engineers, Datafold Cloud provides end-to-end solutions for automating testing. Datafold Cloud incorporates data-diff within its platform, superpowered by features including column-level lineage, ML-based anomaly detection, and infrastructure support for enterprise scale.

Here's a high-level comparison of open-source data-diff and Datafold Cloud:

Database Support
Databases that are supported for source-destination diffing
  • Open-source data-diff: Community-supported adapters
  • Datafold Cloud: Any SQL database, inquire about specific support

Scale
Size of datasets supported for diffing
  • Open-source data-diff: Unlimited
  • Datafold Cloud: Unlimited with advanced performance optimization

Primary Key Data Type Support
Data types of primary keys that are supported for diffing
  • Open-source data-diff: Numerical
  • Datafold Cloud: Numerical, string, datetime, boolean, composite

Data Types Diffing Support
Data types that are supported for per-column diffing
  • Open-source data-diff: All data types
  • Datafold Cloud: All data types

Export Diff Results to Database
Materialize diffing results in your database of choice
Limited to in-database diffing

Value-level diffs
Investigate row-by-row column value differences between source and destination databases
  • Open-source data-diff: ✅ (JSON)
  • Datafold Cloud: ✅ (JSON & GUI)

Diff UI
Explore diffs visually and easily share them with your team and stakeholders

API Access
Automatically create diffs and receive results at scale using the Datafold REST API

Persisting Diff History
Persist the result history of diffs to know how your data and diffs have changed over time

Scheduled Checks
Run scheduled diffs for a defined list of tables

Alerting
Receive automatic alerts about detected discrepancies between tables (Coming Soon)
✅ (Coming soon)

Security and Compliance
Run diffs in secure and compliant environments
  • Open-source data-diff: N/A
  • Datafold Cloud: HIPAA, SOC2 Type II, GDPR compliant

Deployment Options
Deploy your diffs in secure environments that meet your security standards
  • Open-source data-diff: N/A
  • Datafold Cloud: Multi-tenant SaaS or Single-tenant in VPC

Support
Choose which channels offer the greatest support to your use cases and users
  • Open-source data-diff: Community-based
  • Datafold Cloud: Enterprise support from Datafold team members

SLA
The types of SLAs that exist to guarantee your team can diff and interact with diffs as expected
  • Open-source data-diff: N/A
  • Datafold Cloud: ✅ (Coming soon)