Cross-database diffing

Data diff can compare data across databases (e.g., PostgreSQL <> Snowflake, or between two MySQL instances) quickly and without transferring the actual data over the network, while providing full row-, column-, and value-level detail on the discrepancies. To learn how this works, scroll to the end of this guide for an in-depth explanation.

Creating a new data diff

Setting up a new data diff in Datafold Cloud is straightforward. You can configure your data diffs with the following parameters and options:

Datasets A and B

Data source

Pick your connected data sources.

Diff type

Choose how you want to compare your data:

  • Table: Select this to compare data directly from database tables
  • Query: Use this to compare results from specific SQL queries

Dataset

Choose the dataset you want to compare. This can be a table or a view in your relational database.

Filter

Insert your filter clause after the WHERE keyword to refine your dataset. For example: created_at > '2000-01-01' will only include data created after January 1, 2000.
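
In a Table-type diff, the filter is appended to the queries that data diff generates against your dataset. Conceptually, the effect is roughly the following (a simplified sketch; orders is a hypothetical table name):

-- only rows created after January 1, 2000 are included in the diff
SELECT *
FROM orders
WHERE created_at > '2000-01-01';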

Column remapping

Designate columns with the same data type and different column names to be compared. Data Diff will surface differences under the column name used in Dataset A.
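
For illustration (hypothetical column names): if Dataset A has user_id and Dataset B has customer_id, remapping compares A.user_id against B.customer_id and reports mismatches under user_id. With a Query-type diff, a similar effect can be achieved by aliasing the column in SQL:

-- Dataset B query: alias customer_id so it matches Dataset A's column name
SELECT customer_id AS user_id, email, updated_at
FROM customers;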

note

Datafold Cloud automatically handles differences in data types to ensure accurate comparisons. See our best practices below for how this is handled.

General

Primary key

The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. It does not need to be formally defined in the database or elsewhere; it is used only to identify rows uniquely while diffing.

Egress limit

The Egress Limit optimizes the diff process by terminating it once a predefined number of rows has been downloaded. The limit is set to 1,000,000 rows by default.

Note that when the Egress Limit is reached, the diffing process is not deterministic: it may not produce the same results on each run, because the order in which data is processed can vary.

The Egress Limit prevents redundant analysis in scenarios with minor, repetitive discrepancies, such as formatting differences (e.g., whitespace, rounding differences). For most use cases, it is impractical to continue diffing after it is known that datasets are substantially different.

Since the algorithm aims to detect and return every mismatched row/value, if the datasets have a large percentage of differing rows, the algorithm may be unable to take advantage of checksumming. This can cause a large amount of data to be pulled over the network, which slows down the diffing process, and increases the strain on the database.

Setting an Egress Limit prevents unwanted runtime and database load by stopping the operation early in cases of substantial dataset discrepancies. It is highly recommended to set an Egress Limit, taking into account these tradeoffs between cost/speed and rigor.

Columns

Columns to compare

Specify which columns to compare between datasets.

Note that comparing a large number of columns has performance implications, especially in wide tables with 30 or more columns. It is recommended to start by comparing only the primary key, or to select a limited subset of columns.

Advanced

Bisection factor

Sets the number of checksum segments that the dataset is divided into for each comparison iteration. Increasing this value may improve performance for extremely large datasets.

Bisection threshold

This parameter sets a threshold for the row count in each segment during a diff operation.

If the row count within a segment falls below this threshold, the diff tool will compare the data externally, outside of the warehouse environment.

Increasing this threshold may lead to performance improvements, particularly in datasets where a large number of differences is anticipated.

Materialize diff results to table

Create a detailed table from your diff results, indicating each row where differences occur. This table will include corresponding values from both datasets and flags showing whether each row matches or mismatches.

Results

Overview

Once your data diff is complete, Datafold Cloud provides a concise, high-level summary of the detected changes in the Overview tab. The top-level menu displays the diff status, job ID, creation and completion times, runtime, and data source.

Columns

The Columns tab displays a table with detailed column and type mappings from the two datasets being diffed, with status indicators for each column comparison (e.g., identical, percentage of values different). This provides a quick way to identify data inconsistencies and prioritize updates.

Primary Keys

This tab highlights rows that are unique to Dataset B in a data diff ("Rows exclusive to B"). As this identifies rows that exist only in Dataset B and not in Dataset A based on the primary key, it flags potential data discrepancies.

The Clone diffs and materialize results button allows you to rerun an existing data diff with results materialized in the warehouse, along with any other desired modifications.

Values

This tab displays rows where at least one column value differs between the datasets being compared. It is useful for quickly assessing the extent of discrepancies between the two datasets.

The Show filters button enables the following features:

  • Highlight characters: highlights value differences between tables
  • % of difference: filters and displays columns based on the specified percentage range of value differences

Best practices

Data diff is very efficient, often matching the performance of a COUNT(*) operation. When dealing with large datasets, however, it's crucial to approach diffing with specific optimization strategies in mind. The best practices below will help you get the most accurate and efficient results from your data diffs.

Handling data type differences

Datafold Cloud automatically manages data type differences during cross-database diffing. For example, when comparing decimals with different precisions (e.g., DECIMAL(38,15) in SQL Server and DECIMAL(38,19) in Snowflake), Datafold automatically casts values to a common precision before comparison, flagging any differences appropriately. Similarly, for timestamps with different precisions (e.g., milliseconds in SQL Server and nanoseconds in Snowflake), Datafold adjusts the precision as needed for accurate comparisons, simplifying the diffing process.
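
The normalization is conceptually equivalent to casting both sides to a common type before comparing. A rough sketch of the idea (not the SQL Datafold actually generates; orders, amount, and created_ts are hypothetical names):

-- Higher-precision side: cast down to the precision shared by both databases
SELECT
  CAST(amount AS DECIMAL(38,15))   AS amount,      -- common decimal precision
  CAST(created_ts AS TIMESTAMP(3)) AS created_ts   -- milliseconds, not nanoseconds
FROM orders;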

Optimizing OLTP databases: indexing best practices

When working with row-oriented transactional databases like PostgreSQL or MySQL, optimizing the database structure is crucial for efficient data diffing, especially for large tables. Here are some best practices to consider (a SQL sketch follows the list):

  • Create indexes on key columns:

    • It's essential to create indexes on the columns that will be compared, particularly the primary key columns defined in the data diffs.
    • Example: If your data diff involves primary key columns colA and colB, ensure that indexes are created for these specific columns.
  • Use separate indexes for primary key columns:

    • Indexes for primary key columns should be distinct and start with these columns, not as subsets of other indexes. Having a dedicated primary key index is critical for efficient diffing.
    • Example: Consider a primary key consisting of colA and colB. Ensure that the index is structured in the same order, like (colA, colB), to align with the primary key. An index with an order of (colB, colA) is strongly discouraged due to the impact on performance.
    • Example: If the index is defined as (colA, colB, colC) and the primary key is a combination of colA and colB, then when setting up the diff operation, ensure that the primary key is specified as colA, colB. If the order is reversed as colB, colA, the diffing process won’t be able to fully utilize indexing, potentially leading to slower performance.
  • Leverage compound indexes:

    • Compound indexes, which involve multiple columns, can significantly improve query performance during data diffs as they efficiently handle complex queries and filtering.
    • Example: An index defined as (colA, colB, colC) can be beneficial for diffing operations involving these columns, as it aligns with the order of columns in the primary key.
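
As a concrete illustration of the points above (using the hypothetical table orders and columns colA, colB, colC), a compound index that leads with the diff's primary key columns, in the same order they are specified in the diff, might look like:

-- Leads with the diff's primary key columns (colA, colB) in that order;
-- colC may additionally speed up diffs that compare that column.
CREATE INDEX idx_orders_diff ON orders (colA, colB, colC);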

Handling high percentage of differences

Data diff is optimized to perform best when the % of different rows/values is relatively low, to support common data validation scenarios like data replication and migration.

While the tool strives to maximize the database's computational power and minimize data transfer, in extreme cases with very high difference percentages (approaching 100%) it may end up transferring every row over the network, which is considerably slower.

In order to avoid long-running diffs, we recommend the following:

  1. Start with diffing primary keys only to identify row-level completeness first, before diffing all or more columns.

  2. Set the Egress Limit parameter to automatically stop the diffing process once a certain number of rows has been downloaded. This is especially useful in scenarios where a large number of discrepancies identified early on (e.g., a large percentage of missing or different rows) indicates that a detailed row-by-row diff may not be required, thereby saving time and computational resources.

Execute queries in parallel

Increase the number of concurrent connections to the database in Datafold Cloud. This enables queries to be executed in parallel, significantly accelerating the diff process.

Navigate to the Settings option in the left sidebar menu of Datafold Cloud. Adjust the max connections setting to increase the number of concurrent connections Datafold can establish with your data source. Note that the maximum allowable value for concurrent connections is 64.

Optimize column selection

The number of columns included in the diff directly impacts its speed: selecting fewer columns typically results in faster execution. To optimize performance, refine your column selection based on your specific use case:

  • Comprehensive verification: for in-depth analysis, include all columns in the diff. This method is the most thorough and is suitable for exhaustive data reviews, albeit time-intensive for wide tables.
  • Minimal verification: verify only the primary key and updated_at columns. This is efficient and sufficient if you trust the data, and is ideal for quickly checking row presence and basic data integrity.
  • Presence verification: if your main concern is just the presence of data (whether rows exist or have been removed), such as identifying missing hard deletes, verifying only the id column can be sufficient.
  • Hybrid verification: focus on key columns that are most critical to your operations or data integrity, such as monetary values in an amount column, while omitting large serialized or less critical columns like json_settings.

Managing primary key distribution

Significant gaps in the primary key column (e.g., tens of millions of consecutive missing rows) can decrease diff efficiency, because Datafold will execute queries for non-existent key ranges, which slows down the data diff.
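
To gauge how sparse a numeric key space is before diffing, a quick diagnostic query can help (a hypothetical example; orders and id are assumed names):

SELECT
  MIN(id)                            AS min_id,
  MAX(id)                            AS max_id,
  COUNT(*)                           AS row_count,
  (MAX(id) - MIN(id) + 1) - COUNT(*) AS missing_ids  -- total size of the gaps
FROM orders;

A missing_ids value in the tens of millions indicates large gaps that the diff will spend time probing.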

Handling very large tables

In cases of exceptionally large tables, consider increasing the bisection factor to avoid timeouts.

Similarly, if there are substantial changes within the table, raising the bisection threshold can be beneficial.

Handling different primary key types

As a general rule, primary keys should be of the same (or similar) type in both datasets for diffing to work properly. Comparing primary keys of different types (e.g., INT vs VARCHAR) will result in a type mismatch error. You can still diff such datasets by casting the primary key column to the same type in both datasets explicitly.

note

Indexes on the primary key typically cannot be utilized when the primary key is cast to a different type, which may result in slower diffing performance. Consider creating a separate index, such as an expression index in PostgreSQL, to improve performance.
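
For example (hypothetical names), if Dataset A stores the key as INT and Dataset B as VARCHAR, a Query-type diff can cast both keys to a common type, and a PostgreSQL expression index can keep lookups on the cast key fast:

-- Dataset A query: cast the integer key to text to match Dataset B
SELECT CAST(id AS TEXT) AS id, amount, updated_at
FROM orders;

-- Optional PostgreSQL expression index on the cast key
CREATE INDEX orders_id_text_idx ON orders ((CAST(id AS TEXT)));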

How cross-database diffing works

Cross-database diffing in Datafold uses checksumming and interval search techniques to rapidly and cost-effectively compare data across different databases. Data diff can quickly assess the magnitude of differences and identify the specific rows, columns, and values that differ, without copying the entire dataset over the network, which makes it efficient and scalable to trillion-row, terabyte-size datasets.

High-level algorithm explained

Data diff splits the table into smaller segments and then performs checksums on each segment in both databases. When checksums for a segment differ, Datafold further subdivides that segment, repeating the checksum process until it isolates the differing row(s).

This approach is usually as fast as a COUNT(*) query when differences are few or nonexistent, and is able to output each differing row! It does so by leveraging the database's compute power, which is far more efficient than querying for and comparing every row.
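
Conceptually, each checksum is a single aggregate query executed independently on both databases, so only one scalar per segment crosses the network. A simplified, MySQL-flavored sketch for one segment (the actual queries are generated by Datafold, are dialect-specific, and use a stronger hash than CRC32; orders, id, and updated_at are assumed names):

-- order-independent checksum of one key range
SELECT SUM(CRC32(CONCAT_WS('|', id, updated_at))) AS segment_checksum
FROM orders
WHERE id >= 200000 AND id < 300000;

If the two segment_checksum values match, the entire segment can be ruled out without transferring any rows.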

Example

Imagine an orders table with 1 million rows, continuously replicated from DB-1 to DB-2.

┌─────────────┐                        ┌─────────────┐
│     DB1     │                        │     DB2     │
├─────────────┤                        ├─────────────┤
│             │                        │             │
│             │                        │             │
│             │   ┌─────────────┐      │ table with  │
│ table with  ├───┤ replication ├─────▶│ ?maybe? all │
│lots of rows!│   └─────────────┘      │  the same   │
│             │                        │    rows.    │
│             │                        │             │
│             │                        │             │
│             │                        │             │
└─────────────┘                        └─────────────┘

To check if the tables match, data diff initially splits the table into 10 segments; the number of segments is a parameter known as the bisection factor.

Let’s say you choose to checksum the primary key id and the timestamp column updated_at, where updated_at is updated whenever the row changes. DB1 also has an index on updated_at.

Data diff first queries both databases for the MIN(id) and MAX(id) and splits the table into 10 segments of 100,000 keys each:

┌──────────────────────┐              ┌──────────────────────┐
│         DB1          │              │         DB2          │
├──────────────────────┤              ├──────────────────────┤
│      id=1..100k      │              │      id=1..100k      │
├──────────────────────┤              ├──────────────────────┤
│    id=100k..200k     │              │    id=100k..200k     │
├──────────────────────┤              ├──────────────────────┤
│    id=200k..300k     ├─────────────▶│    id=200k..300k     │
├──────────────────────┤              ├──────────────────────┤
│    id=300k..400k     │              │    id=300k..400k     │
├──────────────────────┤              ├──────────────────────┤
│         ...          │              │         ...          │
├──────────────────────┤              ├──────────────────────┤
│    id=900k..1000k    │              │    id=900k..1000k    │
└───────────────────▲──┘              └▲─────────────────────┘
                    ┃                  ┃
                    ┃                  ┃
                    ┃ checksum queries ┃
                    ┃                  ┃
                  ┌─┻──────────────────┻────┐
                  │        Data diff        │
                  └─────────────────────────┘


After running the checksum queries on both sides, we see that all segments are the same except one (id=100k..200k):

┌──────────────────────┐              ┌──────────────────────┐
│         DB1          │              │         DB2          │
├──────────────────────┤              ├──────────────────────┤
│    checksum=0102     │              │    checksum=0102     │
├──────────────────────┤  mismatch!   ├──────────────────────┤
│    checksum=ffff     ◀──────────────▶    checksum=aaab     │
├──────────────────────┤              ├──────────────────────┤
│    checksum=abab     │              │    checksum=abab     │
├──────────────────────┤              ├──────────────────────┤
│    checksum=f0f0     │              │    checksum=f0f0     │
├──────────────────────┤              ├──────────────────────┤
│         ...          │              │         ...          │
├──────────────────────┤              ├──────────────────────┤
│    checksum=9494     │              │    checksum=9494     │
└──────────────────────┘              └──────────────────────┘

Now data diff does exactly what it just did for the whole table, but for only this mismatched segment: it further splits it into 10 smaller segments!

Except this time, each segment contains fewer than 10,000 entries (100k / 10 = 10k), falling below the bisection threshold, so Datafold compares every row in the segment in memory in Datafold Cloud:

┌──────────────────────┐              ┌──────────────────────┐
│         DB1          │              │         DB2          │
├──────────────────────┤              ├──────────────────────┤
│    id=100k..110k     │              │    id=100k..110k     │
├──────────────────────┤              ├──────────────────────┤
│    id=110k..120k     │              │    id=110k..120k     │
├──────────────────────┤              ├──────────────────────┤
│    id=120k..130k     │              │    id=120k..130k     │
├──────────────────────┤              ├──────────────────────┤
│    id=130k..140k     │              │    id=130k..140k     │
├──────────────────────┤              ├──────────────────────┤
│         ...          │              │         ...          │
├──────────────────────┤              ├──────────────────────┤
│    id=190k..200k     │              │    id=190k..200k     │
└──────────────────────┘              └──────────────────────┘
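
At this point checksumming stops: for each segment below the bisection threshold, data diff fetches the rows themselves and compares them in memory. A minimal sketch of the per-segment fetch, run identically on both databases (using this example's orders table and columns):

SELECT id, updated_at
FROM orders
WHERE id >= 100000 AND id < 110000
ORDER BY id;

The two result sets are then compared row by row to find the differing tuples.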

Finally, data diff will output the (id, updated_at) tuple for each row that differs, pinpointing the exact source of discrepancies between datasets (note: this is the output for data-diff open source only):

(122001, 1653672821)

Data diff thus enables the precise identification of rows that have been altered or are inconsistent.