In-database diffing
Creating a new data diff
Setting up a new data diff in Datafold Cloud is straightforward. You can configure your data diffs with the following parameters and options:
Dataset
Data source
Pick your connected data source(s).
Diff type
Choose how you want to compare your data:
- Table: Select this to compare data directly from database tables
- Query: Use this to compare results from specific SQL queries
Datafold can also diff views, materialized views, and dynamic tables (Snowflake-only) across both options too.
Dataset
Choose the dataset you want to compare. This can be a table or a view in your relational database.
Time travel point
If your database supports time travel, like Snowflake, you can query data at a specified timestamp. This is useful for tracking changes over time, conducting audits, or correcting mistakes from accidental data modifications. You can adjust the database's session parameters as needed for your query.
Filter
Insert your filter clause after the WHERE
keyword to refine your dataset. For example: created_at > '2000-01-01'
will only include data created after January 1, 2000.
Column remapping
When columns are the same data type but are named differently, column remapping allows you to align and compare them. This is useful when datasets have semantically identical columns with different names, such as userID
and user_id
. Datafold will surface any differences under the column name used in Dataset A.
General parameters
Primary key
The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere as it is used for unique row identification during diffing. Multiple columns support compound primary key definitions.
Time-series dimension column
If a time-series dimension is selected, this produces a Timeline plot of diff results over time to identify any time-based patterns.
This is useful for identifying trends or anomalies when a given column does not match between tables in a certain date range. By selecting a time-based column, you can visualize differences and patterns across time, measured as column match rates.
Materialize diff results to table
Create a detailed table from your diff results, indicating each row where differences occur. This table will include corresponding values from both datasets and flags showing whether each row matches or mismatches.
Materialize full diff result
For in-depth analysis, you can opt to materialize the full diff result. This disables sampling, allowing for a complete row-by-row comparison across datasets. Otherwise, Datafold defaults to diffing only a sample of the data.
Row sampling
Enable sampling
Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets.
Sampling tolerance
Sampling tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (like nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling.
When sampling is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.
Default: 0.001%
Sampling confidence
Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance.
To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.
Default: 99%
Sampling threshold
Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.
Sample size
This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:
- the presence of duplicate primary keys in the datasets will likely increase this estimate, as it inflates the perceived uniqueness of rows
- applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope
The number of rows we sample is not fixed; instead, we use a statistical approach called the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (compare every single row) to establish a baseline.
Example: Imagine there are two datasets we want to compare, dataset A and B. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:
- sampling tolerance: 5%
- sampling confidence: 95%
Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.
Tolerance for floats
An acceptable delta between numeric values is used to determine if they match. This is particularly useful for addressing rounding differences in long floating-point numbers.
Add tolerance by choosing a column name, mode, and value. For mode:
- Relative: Defines a percentage-based tolerance. For example, a 2% relative tolerance means no difference is noted if the absolute value of (A/B - 1) is less than or equal to 2%.
- Absolute: Sets a fixed numerical margin. For instance, an absolute tolerance of 0.5 means values are matched if the absolute difference between A and B is 0.5 or less.
Results
Overview
Once your data diff is complete, Datafold Cloud provides a concise, high-level summary of the detected changes in the Overview tab. The top-level menu displays the diff status, job ID, creation and completed times, runtime, and data source.
Columns
The Columns tab displays a table with detailed column and type mappings from the two datasets being diffed, with status indicators for each column comparison (e.g., identical, percentage of values different). This provides a quick way to identify data inconsistencies and prioritize updates.
Primary Keys
This tab highlights rows that are unique to Dataset B in a data diff ("Rows exclusive to B"). As this identifies rows that exist only in Dataset B and not in Dataset A based on the primary key, it flags potential data discrepancies.
The Show filters button allows you to filter these rows by selected column(s).
The Clone diffs and materialize results button allows you to rerun existing data diffs with results materialized in the warehouse, as well as any other desired modifications.
Column Profiles
Column Profiles displays aggregate statistics and distributions including averages, counts, ranges, and histogram charts representing column-level differences.
The Show filters button allows you to adjust chart values by relative (percentage) or absolute numbers.
Values
This tab displays rows where at least one column value differs between the datasets being compared. It is useful for quickly assessing the extent of discrepancies between the two datasets.
The Show filters button enables the following features:
- Highlight characters: highlight value differences between tables
- % of difference: filters and displays columns based on the specified percentage range of value differences
Timeline
The Timeline tab is a specialized feature that only appears if the time-series dimension column has been selected. It graphically represents data differences over time to highlight discrepancies. It only displays columns with data differences, and differences are presented as the share of mismatched data (percentage mismatched).
This feature offers enhanced clarity in pinpointing inconsistencies, supports informed decision-making through visual data representation, and increases efficiency in identifying and resolving data-related issues.
The Timeline feature is particularly useful in scenarios where an incremental model is mismanaged, leading to improper backfilling. It allows users to visually track the inconsistencies that arise over time due to the mismanagement. This graphical representation makes it easier to pinpoint the specific time frames where the errors occurred, facilitating a more targeted approach to rectify these issues.
It is also useful in correlating data differences with specific time intervals that coincide with changing data sources. When switching over or stitching together different data sources, there's often a shift in how data behaves over time. The Timeline graph helps flag the potential impact of the source change on data consistency and integrity.
Downstream Impact
This tab displays all associated BI and data app dependencies, such as dashboards and views, linked to the compared datasets. This helps visually illustrate the impact of data changes on downstream data assets.
Each listed dependency is shown with a link to its lineage diagram within Datafold Cloud's column-level lineage. You can you can filter by tables or columns within tables, or open this view in Lineage for further analysis.
Best practices
We share best practices that will help you get the most accurate and efficient results from your data diffs.
Comparing numeric columns: tolerance for floats
When comparing numerical columns or of FLOAT
type which is inherently noisy, it can be helpful to specify tolerance levels for differences below which the values are considered equal.
Set appropriate tolerance levels for floating-point comparisons to avoid flagging inconsequential differences.
Materialize diff results
While Datafold UI provides advanced exploration of diff results, sometimes it can be helpful to materialize diff results back to the database to investigate them further with SQL or for audit logging.
Optimizing diff performance at scale
Since data diff pushes down the compute to your database (which usually has sufficient capacity to store and compute the datasets in the first place), the diffing speed and scalability depends on the performance of the underlying SQL engine. In most cases, the diffing performance is comparable to typical transformation jobs and analytical queries running in the database and has scaled to trillions of rows.
When diffs run longer or consume more database resources than desired, consider the following measures:
- Enable Sampling to dramatically reduce the amount of data processed for in-database diffing.
Sampling can be helpful when diffing extremely large datasets. When sampling is enabled, Datafold compares a randomly chosen subset of the data. Sampling is the trade-off between the diff detail and time/cost of the diffing process. For most use cases, sampling does not reduce the information value of diff as it provides the magnitude and examples differences. Sampling is less ideal when you need to audit every changed value with 100% confidence which is rare in practice.
Although configuring sampling can seem overwhelming at first, a good rule of thumb is to stick to defaults and tweak them as needed. It is also helpful to tweak the parameters to see how they impact the sample size to make the trade off between performance and accuracy.
Add a SQL filter if you actually need to compare just a subset of data (e.g., for a particular city or last two weeks).
Optimize SQL queries to enhance the performance and efficiency of database operations, reduce execution time, minimize resource usage, and ensure faster retrieval of data diff results.
Leverage database performance by ensuring proper configuration to match the typical workload patterns of your diff operations. Many modern databases come with performance-enhancing features like query optimization, caching, and parallel processing.
Consider increasing resources available to Datafold in your data warehouse (e.g., for Snowflake, increase warehouse size).