Cross-database diffing

Datafold Cloud's Data Diff can compare data across databases (e.g., PostgreSQL <> Snowflake, or between two SQL Server instances) efficiently and with minimal egress by leveraging stochastic in-database checksumming. This powerful algorithm provides full row-, column-, and value-level detail on discrepancies between data tables.

Creating a new data diff

Setting up a new data diff in Datafold Cloud is straightforward. You can configure your data diffs with the following parameters and options:

Datasets A and B

Data source

Pick your connected data sources.

Diff type

Choose how you want to compare your data:

  • Table: Select this to compare data directly from database tables
  • Query: Use this to compare results from specific SQL queries

Dataset

Choose the dataset you want to compare. This can be a table or a view in your relational database.

Filter

Insert your filter clause after the WHERE keyword to refine your dataset. For example: created_at > '2000-01-01' will only include data created after January 1, 2000.
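
For instance, with a hypothetical orders table, the filter above scopes the diff to a query roughly like the following (illustrative only; the exact SQL Datafold generates differs):

    -- Illustrative: the filter text is placed after the WHERE keyword
    SELECT *
    FROM orders
    WHERE created_at > '2000-01-01';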

Materialize Inputs

Select this option to improve diffing speed when the query is compute-heavy, when filters are applied to non-indexed columns, or when primary keys are transformed using concatenation, coalesce, or another function.
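
For instance, a diff whose primary key is derived with a function along these lines (hypothetical table and column names) is a good candidate for materializing inputs, because the transformed key cannot take advantage of existing indexes:

    -- Hypothetical transformed primary keys that benefit from materialized inputs
    SELECT
      COALESCE(order_id, legacy_order_id)      AS pk_coalesced,    -- coalesced key
      region || '-' || CAST(order_id AS TEXT)  AS pk_concatenated  -- concatenated key
    FROM orders;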

Column remapping

Designate columns with the same data type and different column names to be compared. Data Diff will surface differences under the column name used in Dataset A.

note

Datafold Cloud automatically handles differences in data types to ensure accurate comparisons. See our best practices below for how this is handled.

General

Primary key

The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere; it is used only for unique row identification during diffing.
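
If you are unsure whether your chosen columns uniquely identify rows, a quick check along these lines (hypothetical table and column names) can be run before the diff:

    -- Key combinations that appear more than once violate uniqueness
    SELECT colA, colB, COUNT(*) AS occurrences
    FROM my_table
    GROUP BY colA, colB
    HAVING COUNT(*) > 1;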

note

Textual primary keys do not support values outside the set of characters a-zA-Z0-9!"()*/^+-<>=. If these values exist, we recommend filtering them out before running the diff operation.
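
In PostgreSQL, for example, such rows could be excluded with a filter clause along these lines (adjust the column name and pattern to your data):

    -- Keep only keys composed of the supported characters
    id ~ '^[a-zA-Z0-9!"()*/^+<>=-]+$'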

Egress limit

The egress limit optimizes the diff process by terminating it once a predefined number of rows has been downloaded. By default, the limit is set to 1,000,000 rows.

When the egress limit is reached, the diffing process is not deterministic: the order in which data is processed may vary, so repeated runs may not produce the same results.

The egress limit prevents redundant analysis in scenarios with minor, repetitive discrepancies, such as formatting differences (e.g., whitespace, rounding differences). For most use cases, it is impractical to continue diffing after it is known that datasets are substantially different.

Since the algorithm aims to detect and return every mismatched row/value, if the datasets have a large percentage of differing rows, the algorithm may be unable to take advantage of checksumming. This can cause a large amount of data to be pulled over the network, which slows down the diffing process, and increases the strain on the database.

Setting an egress limit prevents unwanted runtime and database load by stopping the operation early in cases of substantial dataset discrepancies. It is highly recommended to set an egress limit, taking into account these tradeoffs between cost/speed and rigor.

Columns

Columns to compare

Specify which columns to compare between datasets.

Note that this has performance implications when comparing a large number of columns, especially in wide tables with 30 or more columns. It is recommended to initially focus on comparisons using only the primary key or to select a limited subset of columns.

Per-column diff limit

By setting a per-column diff limit, Data Diff stops identifying differences for any given column once the specified number of differences has been found in that column. Data Diff also stops searching for exclusive and duplicate primary keys after the limit is reached.

Setting a per-column diff limit enables your team to find data quality issues that arise during data reconciliation while minimizing compute and time spent searching for differences. Learn more about data reconciliation best practices here.

Row sampling

Enable sampling

Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets.

Even when sampling is enabled, checksumming still needs to be performed for unsampled primary keys. As a result, if many columns beyond the primary keys are involved, the time savings from sampling may be less significant.

Sampling tolerance

Sampling tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (e.g., nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling.

When sampling is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.

Default: 0.001%

Sampling confidence

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance.

To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.

Default: 99%

Sampling threshold

Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.

Sample size

This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:

  • The presence of duplicate primary keys in the datasets will likely increase this estimate, as duplicates inflate the perceived number of unique rows.
  • Applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope.

The number of rows we sample is not fixed; instead, we use a statistical approach called the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (compare every single row) to establish a baseline.
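
In PostgreSQL syntax, uniform random sampling can be sketched conceptually as follows; this is only an illustration of the idea, not the sampling query Datafold actually runs:

    -- Conceptual sketch: keep each row with a small, uniform probability
    SELECT id
    FROM my_table
    WHERE RANDOM() < 0.01;  -- roughly a 1% sample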

Example: Imagine there are two datasets we want to compare, dataset A and B. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:

  • sampling tolerance: 5%
  • sampling confidence: 95%

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.

Advanced

Materialize diff results to table

Create a detailed table from your diff results, indicating each row where differences occur. This table will include corresponding values from both datasets and flags showing whether each row matches or mismatches.

Results

Overview

Once your data diff is complete, Datafold Cloud provides a concise, high-level summary of the detected changes in the Overview tab. The top-level menu displays the diff status, job ID, creation and completed times, runtime, and data source.

Columns

The Columns tab displays a table with detailed column and type mappings from the two datasets being diffed, with status indicators for each column comparison (e.g., identical, percentage of values different). This provides a quick way to identify data inconsistencies and prioritize updates.

Primary Keys

This tab highlights rows that are unique to Dataset B in a data diff ("Rows exclusive to B"). As this identifies rows that exist only in Dataset B and not in Dataset A based on the primary key, it flags potential data discrepancies.

The Clone diffs and materialize results button allows you to rerun existing data diffs with results materialized in the warehouse, as well as any other desired modifications.

Values

This tab displays rows where at least one column value differs between the datasets being compared. It is useful for quickly assessing the extent of discrepancies between the two datasets.

The Show filters button enables the following features:

  • Highlight characters: highlight value differences between tables
  • % of difference: filters and displays columns based on the specified percentage range of value differences

Best practices

When dealing with large datasets, it's crucial to approach diffing with specific optimization strategies in mind. The following best practices will help you get the most accurate and efficient results from your data diffs.

Handling data type differences

Datafold Cloud automatically manages data type differences during cross-database diffing. For example, when comparing decimals with different precisions (e.g., DECIMAL(38,15) in SQL Server and DECIMAL(38,19) in Snowflake), Datafold automatically casts values to a common precision before comparison, flagging any differences appropriately. Similarly, for timestamps with different precisions (e.g., milliseconds in SQL Server and nanoseconds in Snowflake), Datafold adjusts the precision as needed for accurate comparisons, simplifying the diffing process.
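
Conceptually, this is equivalent to casting both sides to a shared type before comparing. Datafold does this for you, but a manual version might look like the following (hypothetical table and column names):

    -- Manual equivalent of aligning decimal precision before comparison;
    -- Datafold performs this casting automatically
    SELECT CAST(amount AS DECIMAL(38, 15)) AS amount_normalized
    FROM payments;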

Optimizing OLTP databases: indexing best practices

When working with row-oriented transactional databases like PostgreSQL, optimizing the database structure is crucial for efficient data diffing, especially for large tables. Here are some best practices to consider:

  • Create indexes on key columns:

    • It's essential to create indexes on the columns that will be compared, particularly the primary key columns defined in the data diffs.
    • Example: If your data diff involves primary key columns colA and colB, ensure that indexes are created for these specific columns (see the example index statements after this list).
  • Use separate indexes for primary key columns:

    • Indexes for primary key columns should be distinct and start with these columns, not as subsets of other indexes. Having a dedicated primary key index is critical for efficient diffing.
    • Example: Consider a primary key consisting of colA and colB. Ensure that the index is structured in the same order, like (colA, colB), to align with the primary key. An index with an order of (colB, colA) is strongly discouraged due to the impact on performance.
    • Example: If the index is defined as (colA, colB, colC) and the primary key is a combination of colA and colB, then when setting up the diff operation, ensure that the primary key is specified as colA, colB. If the order is reversed as colB, colA, the diffing process won’t be able to fully utilize indexing, potentially leading to slower performance.
  • Leverage compound indexes:

    • Compound indexes, which involve multiple columns, can significantly improve query performance during data diffs as they efficiently handle complex queries and filtering.
    • Example: An index defined as (colA, colB, colC) can be beneficial for diffing operations involving these columns, as it aligns with the order of columns in the primary key.
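
A minimal PostgreSQL sketch of the indexing guidance above, assuming a hypothetical orders table with a composite primary key of (colA, colB):

    -- Dedicated index that starts with the primary key columns, in order
    CREATE INDEX idx_orders_pk ON orders (colA, colB);

    -- Compound index that also covers diffs involving colC
    CREATE INDEX idx_orders_pk_colc ON orders (colA, colB, colC);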

Handling high percentage of differences

Data diff is optimized to perform best when the percent of different rows/values is relatively low, to support common data validation scenarios like data replication and migration.

While the tool strives to maximize the database's computational power and minimize data transfer, in extreme cases with very high difference percentages (approaching 100%), every row may need to be transferred over the network, which is considerably slower.

In order to avoid long-running diffs, we recommend the following:

  1. Start with diffing primary keys only to identify row-level completeness first, before diffing all or more columns.

  2. Set an egress limit to automatically stop the diffing process after a set number of rows has been downloaded over the network.

  3. Set a per-column diff limit to stop finding differences for each column after a set number are found. This is especially useful in data reconciliation where identifying a large number of discrepancies (e.g., large percentage of missing/different rows) early on indicates that a detailed row-by-row diff may not be required, thereby saving time and computational resources.

In the screenshot below, we see that exactly 4 differences were found in user_id, but “at least 4,704 differences” were found in total_runtime_seconds. user_id has a number of differences below the per-column diff limit, and so we state the exact number. On the other hand, total_runtime_seconds has a number of differences greater than the per-column diff limit, so we state “at least.” Note that due to our algorithm’s approach, we often find significantly more differences than the limit before diffing is halted, and in that scenario, we report the value that was found, while stating that more differences may exist.

Execute queries in parallel

Increase the number of concurrent connections to the database in Datafold Cloud. This enables queries to be executed in parallel, significantly accelerating the diff process.

Navigate to the Settings option in the left sidebar menu of Datafold Cloud. Adjust the max connections setting to increase the number of concurrent connections Datafold can establish with your data source. Note that the maximum allowable value for concurrent connections is 64.

Optimize column selection

The number of columns included in the diff directly impacts its speed: selecting fewer columns typically results in faster execution. To optimize performance, refine your column selection based on your specific use case:

  • Comprehensive verification: For in-depth analysis, include all columns in the diff. This method is the most thorough, suitable for exhaustive data reviews, albeit time-intensive for wide tables.
  • Minimal verification: Consider verifying only the primary key and updated_at columns. This is efficient and sufficient if you trust the data. It's ideal for quickly checking row presence and basic data integrity.
  • Presence verification: If your main concern is just the presence of data (whether data exists or has been removed), such as identifying missing hard deletes, verifying only the primary key column can be sufficient.
  • Hybrid verification: Focus on key columns that are most critical to your operations or data integrity, such as monetary values in an amount column, while omitting large serialized or less critical columns like json_settings.

Managing primary key distribution

Significant gaps in the primary key column can decrease diff efficiency (e.g., tens of millions of consecutive rows missing). Datafold will execute queries for non-existent row ranges, which can slow down the data diff.
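
A quick way to gauge whether a numeric primary key has large gaps (hypothetical table and column names) is to compare the key range against the row count:

    -- A key range much larger than the row count suggests large gaps
    SELECT MIN(id)                          AS min_id,
           MAX(id)                          AS max_id,
           COUNT(*)                         AS row_count,
           MAX(id) - MIN(id) + 1 - COUNT(*) AS approx_missing_keys
    FROM my_table;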

Handling different primary key types

As a general rule, primary keys should be of the same (or similar) type in both datasets for diffing to work properly. Comparing primary keys of different types (e.g., INT vs VARCHAR) will result in a type mismatch error. You can still diff such datasets by casting the primary key column to the same type in both datasets explicitly.

note

Indexes on the primary key typically cannot be utilized when the primary key is cast to a different type, which may result in slower diffing performance. Consider creating a separate index, such as an expression index in PostgreSQL, to improve performance.
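
For instance, an integer key on one side can be cast to text to match a VARCHAR key on the other, and a PostgreSQL expression index can keep the cast key efficient to scan (hypothetical table and column names):

    -- Cast the integer key to text so both sides share a type
    SELECT CAST(id AS TEXT) AS id_text
    FROM my_table;

    -- PostgreSQL expression index so the cast key can still use an index
    CREATE INDEX idx_my_table_id_text ON my_table ((id::TEXT));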
