Data Diffing
A data diff is a value-level comparison between two tables—used to identify critical changes to your data and guarantee data quality.
Just as git diff compares versions of files to show which lines have been added, modified, or deleted, a data diff compares rows and columns in two tables to pinpoint specific data changes.
Datafold can compare data in tables, views, and SQL queries in databases and data lakes.
Datafold supports a wide range of basic data types across popular database systems like Snowflake, Databricks, BigQuery, Redshift, and PostgreSQL. Datafold can also diff data across legacy warehouses like Oracle, SQL Server, Teradata, IBM Netezza, MySQL, and more. See the full list of supported data types here.
Datafold cannot perform data diffs on unstructured data such as files. However, it supports diffing structured and semi-structured data in tabular formats, including JSON columns.
When comparing numerical columns, particularly those of the FLOAT type, it is beneficial to set tolerance levels so that inconsequential discrepancies are not flagged. This keeps the focus on meaningful differences rather than rounding noise.
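To make the idea concrete, here is a minimal sketch of a tolerance check; the function name and threshold are hypothetical illustrations, not Datafold's API:

```python
import math

# Hypothetical tolerance check for FLOAT columns: values within the
# tolerance are treated as equal, so rounding noise is not flagged.
TOLERANCE = 1e-6  # assumed threshold; in practice, tuned per column

def values_match(a: float, b: float, tol: float = TOLERANCE) -> bool:
    # rel_tol scales with magnitude; abs_tol handles values near zero.
    return math.isclose(a, b, rel_tol=tol, abs_tol=tol)

print(values_match(0.30000000000000004, 0.3))  # True: below tolerance
print(values_match(0.31, 0.3))                 # False: a real discrepancy
```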
When a change is detected, Datafold highlights the differences in the App or through PR comments, allowing data engineers and other users to review, validate, and approve these changes during the CI process.
When diffing data within the same physical database or data lake namespace, Datafold compares data by executing SQL queries directly in the target database. It combines JOIN-type queries with aggregate queries to provide detailed insight into differences at the row, value, and column levels, and to calculate differences in metrics and distributions.
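As an illustration of the JOIN-based approach, the sketch below diffs two toy tables in SQLite. The table and column names (prod_orders, dev_orders, id, amount) are made up, and this is a simplified stand-in for the queries Datafold actually generates:

```python
import sqlite3

# Two toy tables that share a primary key (id) but disagree on some rows.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE prod_orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE dev_orders  (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO prod_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO dev_orders  VALUES (1, 10.0), (2, 25.0), (4, 40.0);
""")

# A full outer join, emulated via LEFT JOIN + UNION ALL for portability,
# surfaces rows present on only one side or with differing values.
diff_sql = """
SELECT a.id AS id_a, b.id AS id_b, a.amount AS amount_a, b.amount AS amount_b
FROM prod_orders a LEFT JOIN dev_orders b ON a.id = b.id
WHERE b.id IS NULL OR a.amount <> b.amount
UNION ALL
SELECT a.id, b.id, a.amount, b.amount
FROM dev_orders b LEFT JOIN prod_orders a ON a.id = b.id
WHERE a.id IS NULL
"""
for row in con.execute(diff_sql):
    print(row)  # (2, 2, 20.0, 25.0), (3, None, 30.0, None), (None, 4, None, 40.0)
```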
To compare datasets between two different databases, Datafold leverages a proprietary stochastic checksumming algorithm that allows it to identify discrepancies down to individual primary keys and column values while minimizing the amount of data sent over the network. As a result, the comparison is mostly performed in-place, leveraging the underlying databases without the need to export the entire dataset to compare elsewhere.
Stochastic checksumming is a technique for verifying the integrity of large datasets by generating checksums (hashes) for randomly selected subsets of the data rather than the entire dataset. This provides a probabilistic assurance of data integrity, detecting corruption or changes with significantly less computational overhead than full data checksumming, which makes it particularly useful when processing the entire dataset is impractical due to size or resource constraints.
By performing most of the comparison in-place, within the underlying databases, Datafold avoids the need to export entire datasets for external comparison. This approach ensures accurate and scalable data verification while optimizing network and computational resources.
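For intuition, here is a minimal sketch of range-based checksumming with bisection. It illustrates the general principle of exchanging hashes instead of rows; it is not Datafold's proprietary algorithm, and the toy in-memory tables are assumptions for illustration:

```python
import hashlib

# Hash all rows in a key range into a single checksum. In a real system each
# database would compute this locally, so only checksums cross the network.
def segment_checksum(rows: dict, lo: int, hi: int) -> str:
    h = hashlib.sha256()
    for key in sorted(k for k in rows if lo <= k < hi):
        h.update(f"{key}:{rows[key]}".encode())
    return h.hexdigest()

def find_mismatched_keys(a: dict, b: dict, lo: int, hi: int) -> list:
    """Recursively bisect key ranges, drilling into mismatching segments only."""
    if segment_checksum(a, lo, hi) == segment_checksum(b, lo, hi):
        return []          # whole range matches: no rows need to be fetched
    if hi - lo == 1:
        return [lo]        # narrowed down to a single primary key
    mid = (lo + hi) // 2
    return find_mismatched_keys(a, b, lo, mid) + find_mismatched_keys(a, b, mid, hi)

source = {1: "a", 2: "b", 3: "c", 4: "d"}
target = {1: "a", 2: "B", 3: "c", 4: "d"}   # key 2 differs
print(find_mismatched_keys(source, target, 1, 5))  # -> [2]
```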
While the Datafold App UI provides advanced exploration of diff results, you can also materialize these results back to your database. This allows you to investigate further with SQL queries or maintain audit logs, providing flexibility in how you handle and review diff outcomes. Teams may additionally choose to download diff results as a CSV directly from the Datafold App to share with their team members.
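A hypothetical sketch of auditing materialized results with SQL follows; the table name (diff_results) and its columns are assumptions for illustration, since the real schema depends on your materialization settings:

```python
import sqlite3

# Toy stand-in for a materialized diff-results table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE diff_results (primary_key INTEGER, column_name TEXT,
                               value_a TEXT, value_b TEXT);
    INSERT INTO diff_results VALUES
        (2, 'amount', '20.0', '25.0'),
        (4, 'amount', NULL,   '40.0');
""")
query = """
    SELECT primary_key, column_name, value_a, value_b
    FROM diff_results
    WHERE value_a IS NOT value_b        -- null-safe inequality in SQLite
    ORDER BY primary_key
"""
for pk, col, a, b in con.execute(query):
    print(f"row {pk}, column {col}: {a!r} -> {b!r}")
```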