Best Practices
We share best practices that will help you get the most accurate and efficient results from your data diffs.
Comparing numeric columns: tolerance for floats
When comparing numerical columns or of FLOAT
type which is inherently noisy, it can be helpful to specify tolerance levels for differences below which the values are considered equal.
Set appropriate tolerance levels for floating-point comparisons to avoid flagging inconsequential differences.
Materialize diff results
While Datafold UI provides advanced exploration of diff results, sometimes it can be helpful to materialize diff results back to the database to investigate them further with SQL or for audit logging.
Optimizing diff performance at scale
Since data diff pushes down the compute to your database (which usually has sufficient capacity to store and compute the datasets in the first place), the diffing speed and scalability depends on the performance of the underlying SQL engine. In most cases, the diffing performance is comparable to typical transformation jobs and analytical queries running in the database and has scaled to trillions of rows. When diffs run longer or consume more database resources than desired, consider the following measures:
- Enable Sampling to dramatically reduce the amount of data processed for in-database diffing. Sampling can be helpful when diffing extremely large datasets. When sampling is enabled, Datafold compares a randomly chosen subset of the data. Sampling is the tradeoff between the diff detail and time/cost of the diffing process. For most use cases, sampling does not reduce the informational value of data diffs as it still provides the magnitude and specific examples of differences (e.g., if 10% of sampled data show discrepancies, it suggests a similar proportion of differences across the entire dataset). Sampling is less ideal when you need to audit every changed value with 100% confidence, but this scenario is rare in practice.
Although configuring sampling can seem overwhelming at first, a good rule of thumb is to stick to defaults and tweak them as needed. It is also helpful to tweak the parameters to see how they impact the sample size to make the tradeoff between performance and accuracy.
-
Add a SQL filter if you actually need to compare just a subset of data (e.g., for a particular city or last two weeks).
-
Optimize SQL queries to enhance the performance and efficiency of database operations, reduce execution time, minimize resource usage, and ensure faster retrieval of data diff results.
-
Leverage database performance by ensuring proper configuration to match the typical workload patterns of your diff operations. Many modern databases come with performance-enhancing features like query optimization, caching, and parallel processing.
-
Consider increasing resources available to Datafold in your data warehouse (e.g., for Snowflake, increase warehouse size).