When dealing with large datasets, it’s crucial to approach diffing with specific optimization strategies in mind. We share best practices that will help you get the most accurate and efficient results from your data diffs.
Databases | vCPU | RAM, GB | Rows | Columns | Time full | Time sampled | Speedup | RDS type | Diff full | Diff sampled | Per-col noise |
---|---|---|---|---|---|---|---|---|---|---|---|
Oracle vs Snowflake | 2 | 2 | 1,000,000 | 1 | 0:00:33 | 0:00:27 | 1.22 | db.t3.small | 5399 | 5400 | 0 |
Oracle vs Snowflake | 8 | 32 | 1,000,000 | 1 | 0:07:23 | 0:00:18 | 24.61 | db.m5.2xlarge | 5422 | 5423 | 0.005 |
MySQL vs Snowflake | 2 | 8 | 1,000,000 | 1 | 0:00:57 | 0:00:24 | 2.38 | db.m5.large | 5409 | 5413 | 0 |
MySQL vs Snowflake | 2 | 8 | 1,000,000 | 29 | 0:40:00 | 0:02:14 | 17.91 | db.m5.large | 5412 | 5411 | 0 |
DECIMAL(38,15)
in SQL Server and DECIMAL(38,19)
in Snowflake), Datafold automatically casts values to a common precision before comparison, flagging any differences appropriately. Similarly, for timestamps with different precisions (e.g., milliseconds in SQL Server and nanoseconds in Snowflake), Datafold adjusts the precision as needed for accurate comparisons, simplifying the diffing process.
colA
and colB
, ensure that indexes are created for these specific columns.
colA
and colB
. Ensure that the index is structured in the same order, like (colA
, colB
), to align with the primary key. An index with an order of (colB
, colA
) is strongly discouraged due to the impact on performance.
colA
, colB
, colC
) and the primary key is a combination of colA
and colB
, then when setting up the diff operation, ensure that the primary key is specified as colA
, colB.
If the order is reversed as colB
, colA
, the diffing process won’t be able to fully utilize indexing, potentially leading to slower performance.
colA
, colB
, colC
) can be beneficial for diffing operations involving these columns, as it aligns with the order of columns in the primary key.
user_id
, but “at least 4,704 differences” were found in total_runtime_seconds
. user_id
has a number of differences below the per-column diff limit, and so we state the exact number. On the other hand, total_runtime_seconds
has a number of differences greater than the per-column diff limit, so we state “at least.” Note that due to our algorithm’s approach, we often find significantly more differences than the limit before diffing is halted, and in that scenario, we report the value that was found, while stating that more differences may exist.
updated_at
columns. This is efficient and sufficient if you need to validate rows have not been added or removed, and that updates are current between databases, but do not need to check for value-level differences between rows with common primary keys.amount
column, while omitting large serialized or less critical columns like json_settings
.INT
vs VARCHAR
) will result in a type mismatch error. You can still diff such datasets by casting the primary key column to the same type in both datasets explicitly.