The basic inputs required to run a diff are the data connections, names/paths of the datasets to be compared, and the primary key (one or more columns that uniquely identify rows in the datasets).
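Because the primary key must uniquely identify rows, it can help to check a candidate key before running a diff. The following is a minimal sketch, not part of Datafold: the function name `duplicate_keys` and the sample column names are hypothetical, and rows are assumed to be plain dictionaries already loaded into memory.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return the key tuples that occur more than once in `rows`.

    `rows` is a list of dicts and `key_columns` is the candidate primary key
    (one or more column names). An empty result means the key is unique.
    """
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def demo_keys():
    # Hypothetical sample data: order lines identified by (order_id, line).
    rows = [
        {"order_id": 1, "line": 1},
        {"order_id": 1, "line": 2},
        {"order_id": 2, "line": 1},
    ]
    single = duplicate_keys(rows, ["order_id"])            # not unique on its own
    composite = duplicate_keys(rows, ["order_id", "line"])  # unique together
    return single, composite
```

In this sample, `order_id` alone repeats, so a composite key of both columns would be needed to identify rows.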

What types of data can data diffs compare?

Diffs can compare data in tables, views, and SQL queries (in relational databases and data lakes), as well as files such as CSV, Excel, and Parquet.

Datafold supports a wide range of basic data types across major database systems, including Snowflake, Databricks, BigQuery, Redshift, PostgreSQL, and many more.

Creating data diffs

Diffs can be created in several ways:

  • Interactively through the Datafold app
  • Programmatically via our REST API
  • As part of a Continuous Integration (CI) workflow for Deployment Testing

How in-database diffing works

When diffing data within the same physical database or data lake namespace, a diff compares the data by executing SQL queries in the target database. It uses several JOIN-type queries and aggregate queries to surface differences at the row, value, and column levels, and to calculate differences in metrics and distributions.
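The JOIN-based comparison described above can be illustrated with a small, self-contained sketch. It is not Datafold's actual query set: it uses Python's built-in `sqlite3` module as a stand-in for a warehouse, and the function `diff_tables` and the tables `orders_prod` and `orders_dev` are hypothetical names chosen for the example.

```python
import sqlite3


def diff_tables(conn, left, right, key, columns):
    """Compare two tables in the same database using JOIN queries.

    Returns three lists of primary-key values: rows only in `left`,
    rows only in `right`, and rows present in both whose values differ.
    Identifiers are interpolated directly, so this sketch assumes trusted names.
    """
    only_left = conn.execute(
        f"SELECT l.{key} FROM {left} l LEFT JOIN {right} r ON l.{key} = r.{key} "
        f"WHERE r.{key} IS NULL ORDER BY l.{key}"
    ).fetchall()
    only_right = conn.execute(
        f"SELECT r.{key} FROM {right} r LEFT JOIN {left} l ON l.{key} = r.{key} "
        f"WHERE l.{key} IS NULL ORDER BY r.{key}"
    ).fetchall()
    # `IS NOT` is SQLite's NULL-safe inequality; other engines spell this
    # differently (for example IS DISTINCT FROM).
    mismatch = " OR ".join(f"l.{col} IS NOT r.{col}" for col in columns)
    changed = conn.execute(
        f"SELECT l.{key} FROM {left} l JOIN {right} r ON l.{key} = r.{key} "
        f"WHERE {mismatch} ORDER BY l.{key}"
    ).fetchall()
    return (
        [row[0] for row in only_left],
        [row[0] for row in only_right],
        [row[0] for row in changed],
    )


def demo_in_database():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders_prod (id INTEGER PRIMARY KEY, amount REAL, status TEXT);
        CREATE TABLE orders_dev  (id INTEGER PRIMARY KEY, amount REAL, status TEXT);
        INSERT INTO orders_prod VALUES (1, 10.0, 'paid'), (2, 20.0, 'open'), (3, 30.0, NULL);
        INSERT INTO orders_dev  VALUES (1, 10.0, 'paid'), (2, 25.0, 'open'), (4, 40.0, 'paid');
        """
    )
    try:
        return diff_tables(conn, "orders_prod", "orders_dev", "id", ["amount", "status"])
    finally:
        conn.close()
```

Because every query runs inside the database, only the keys of differing rows are returned to the client. Aggregate queries for metrics and distributions would follow the same pattern but are left out of this sketch.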

How cross-database diffing works

When comparing data across databases, diffs use checksumming and interval search to find differences quickly and at minimal cost. A diff can assess the magnitude of differences and identify the specific rows, columns, and values that differ without copying the entire datasets over the network. This efficiency lets diffs scale to datasets as large as trillions of rows or terabytes in size.
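One way to picture checksumming combined with interval search is the sketch below. It is an illustration under stated assumptions, not Datafold's implementation: each "database" is simulated as a Python dict keyed by an integer primary key, `segment_checksum` stands in for an aggregate query that each database would run locally, and the names `diff_segment` and `threshold` are hypothetical.

```python
import hashlib


def segment_checksum(rows, lo, hi):
    """Return (row count, checksum) for keys in [lo, hi).

    Stands in for an aggregate query executed inside one database, so only
    these two small values would need to cross the network.
    """
    digest = hashlib.sha256()
    count = 0
    for key in sorted(k for k in rows if lo <= k < hi):
        digest.update(repr((key, rows[key])).encode())
        count += 1
    return count, digest.hexdigest()


def diff_segment(a, b, lo, hi, threshold=4):
    """Return the sorted keys in [lo, hi) whose rows differ between `a` and `b`."""
    sum_a = segment_checksum(a, lo, hi)
    sum_b = segment_checksum(b, lo, hi)
    if sum_a == sum_b:
        return []  # segment is identical on both sides: nothing to fetch
    if hi - lo <= threshold or max(sum_a[0], sum_b[0]) <= threshold:
        # Segment is small enough: fetch its rows and compare them directly.
        keys = {k for k in a if lo <= k < hi} | {k for k in b if lo <= k < hi}
        return sorted(k for k in keys if a.get(k) != b.get(k))
    # Otherwise split the key interval in two and search each half.
    mid = (lo + hi) // 2
    return diff_segment(a, b, lo, mid, threshold) + diff_segment(a, b, mid, hi, threshold)


def demo_cross_database():
    source = {i: f"row-{i}" for i in range(1000)}
    target = dict(source)
    target[137] = "changed"   # modified value
    del target[512]           # row missing in target
    target[1000] = "new"      # row missing in source
    return diff_segment(source, target, 0, 1001)
```

Segments whose checksums match are discarded without transferring any rows, so the amount of data moved depends on how many rows differ rather than on the total table size.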