Skip to main content

How Datafold diffs data

Data diff can compare data within and between databases (e.g., PostgreSQL <> Snowflake or between two MySQL instances), fast and without transfering the actual data over the network while providing full row/column/value-level detail into the discrepancies.

The basic inputs required to run a diff are the data connection, names of the datasets to be compared, and the primary key (a column or a combination of columns that uniquely identify a row in the dataset).

What types of data can data diff compare?

Data diff can compare data in tables, views, and SQL queries in relational databases and data lakes.

Datafold facilitates data diffing by supporting a wide range of basic data types across major database systems like BigQuery, PostgreSQL, Redshift, Databricks, and Snowflake.

Creating diffs

Diffs can be created in multiple ways, including:

  • Interactively through the Datafold App
  • Through the Datafold Cloud API
  • As part of a Continuous Integration (CI) workflow in Deployment Testing

How in-database diffing works

When diffing data within the same physical database or data lake namespace, data diff compares data by executing various SQL queries in the target database. It uses several JOIN-type queries and various aggregate queries to provide detailed insights into differences at the row, value, and column levels, and to calculate differences in metrics and distributions.

How cross-database diffing works

When comparing data across databases, data diff leverages checksumming and interval search to diff the data fast and at minimal cost. Data diff can quickly assess both the magnitude of differences and identify specific rows, columns, and values with differences without having to copy the entire dataset over the network. This efficiency makes it scalable for datasets as large as trillions of rows or terabytes in size.

Data ingestion and storage

  1. Metadata: Datafold requests and stores:

    • Table names
    • Column names
    • Queries executed in the data warehouse
  2. Data used for Data Diffs:

    • For in-database diffs, all the data that is visible in the app, including data samples, is fetched and stored.
    • For cross-database diffs, all the data that is visible in the app, including data samples, is fetched and stored. Larger amounts of data are fetched for comparison purposes, but only data samples are stored.
  3. Table profiling in Data Explorer: Datafold stores samples and distributions of data.