Setting up a new data diff in Datafold is straightforward.
You can configure your data diffs with the following parameters and options:
Pick your data connection(s).
Choose how you want to compare your data:
Datafold can also diff views, materialized views, and dynamic tables (Snowflake-only) across both options too.
Choose the dataset you want to compare, Main and Test. This can be a table or a view in your relational database.
If your database supports time travel, like Snowflake, you can query data at a specified timestamp. This is useful for tracking changes over time, conducting audits, or correcting mistakes from accidental data modifications. You can adjust the database’s session parameters as needed for your query. Supported time travel expressions:
Database | Timestamp | Negative Offset |
---|---|---|
BigQuery | ||
Snowflake |
Timestamp examples:
2024-01-01
2024-01-01 10:04:23
2024-01-01 10:04:23-09:00
2024-07-16T10:04:23+05:00
Negative offset examples (in seconds):
130
3600
Insert your filter clause after the WHERE
keyword to refine your dataset. For example: created_at >'2000-01-01
will only include data created after January 1, 2000.
When columns are the same data type but are named differently, column remapping allows you to align and compare them. This is useful when datasets have semantically identical columns with different names, such as userID
and user_id
. Datafold will surface any differences under the column name used in the Main dataset.
The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere as it is used for unique row identification during diffing. Multiple columns support compound primary key definitions.
If a time-series dimension is selected, this produces a Timeline plot of diff results over time to identify any time-based patterns. This is useful for identifying trends or anomalies when a given column does not match between tables in a certain date range. By selecting a time-based column, you can visualize differences and patterns across time, measured as column match rates.
Create a detailed table from your diff results, indicating each row where differences occur. This table will include corresponding values from both datasets and flags showing whether each row matches or mismatches.
For in-depth analysis, you can opt to materialize the full diff result. This disables sampling, allowing for a complete row-by-row comparison across datasets. Otherwise, Datafold defaults to diffing only a sample of the data.
Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets.
Sampling tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (like nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling. When sampling is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.
Default: 0.001%
Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance. To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.
Default: 99%
Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.
This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It’s important to note that this number is an estimate and can vary from the actual sample size due to several factors:
Example: Imagine there are two datasets we want to compare, Main and Test. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:
Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.
An acceptable delta between numeric values is used to determine if they match. This is particularly useful for addressing rounding differences in long floating-point numbers. Add tolerance by choosing a column name, mode, and value. For mode:
Setting up a new data diff in Datafold is straightforward.
You can configure your data diffs with the following parameters and options:
Pick your data connection(s).
Choose how you want to compare your data:
Datafold can also diff views, materialized views, and dynamic tables (Snowflake-only) across both options too.
Choose the dataset you want to compare, Main and Test. This can be a table or a view in your relational database.
If your database supports time travel, like Snowflake, you can query data at a specified timestamp. This is useful for tracking changes over time, conducting audits, or correcting mistakes from accidental data modifications. You can adjust the database’s session parameters as needed for your query. Supported time travel expressions:
Database | Timestamp | Negative Offset |
---|---|---|
BigQuery | ||
Snowflake |
Timestamp examples:
2024-01-01
2024-01-01 10:04:23
2024-01-01 10:04:23-09:00
2024-07-16T10:04:23+05:00
Negative offset examples (in seconds):
130
3600
Insert your filter clause after the WHERE
keyword to refine your dataset. For example: created_at >'2000-01-01
will only include data created after January 1, 2000.
When columns are the same data type but are named differently, column remapping allows you to align and compare them. This is useful when datasets have semantically identical columns with different names, such as userID
and user_id
. Datafold will surface any differences under the column name used in the Main dataset.
The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere as it is used for unique row identification during diffing. Multiple columns support compound primary key definitions.
If a time-series dimension is selected, this produces a Timeline plot of diff results over time to identify any time-based patterns. This is useful for identifying trends or anomalies when a given column does not match between tables in a certain date range. By selecting a time-based column, you can visualize differences and patterns across time, measured as column match rates.
Create a detailed table from your diff results, indicating each row where differences occur. This table will include corresponding values from both datasets and flags showing whether each row matches or mismatches.
For in-depth analysis, you can opt to materialize the full diff result. This disables sampling, allowing for a complete row-by-row comparison across datasets. Otherwise, Datafold defaults to diffing only a sample of the data.
Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets.
Sampling tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (like nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling. When sampling is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.
Default: 0.001%
Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance. To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.
Default: 99%
Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.
This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It’s important to note that this number is an estimate and can vary from the actual sample size due to several factors:
Example: Imagine there are two datasets we want to compare, Main and Test. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:
Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.
An acceptable delta between numeric values is used to determine if they match. This is particularly useful for addressing rounding differences in long floating-point numbers. Add tolerance by choosing a column name, mode, and value. For mode: