Performance and Scalability
Datafold is highly scalable, supporting data teams working with billion-row datasets and thousands of data transformation/dbt models. It offers powerful performance optimization features such as SQL filtering, sampling, and Slim Diff, which allow you to focus on testing the datasets that are most critical to your business, ensuring efficient and targeted data quality validation.
Datafold pushes down compute to your database, and the performance of data diffs largely depends on the underlying SQL engine. Here are some in-app strategies to optimize performance:
-
Enable sampling: Sampling reduces the amount of data processed by comparing a randomly chosen subset. This approach balances diff detail with processing time and cost, suitable for most use cases.
-
Use SQL Filters: If you only need to compare a specific subset of data (e.g., for a particular city or a recent time period), adding a SQL filter can streamline the diff process.
-
Exclude columns/tables: When certain columns or tables are unnecessary for critical comparisons—such as temporary tables with dynamic values, metadata fields, or timestamp columns that always differ—you can exclude these to increase diff efficiency and speed.
You can exclude columns when you create a new Data Diff or when you clone an existing one:
To exclude them in your CI/CD pipeline, follow this guide to specify them in the Advanced settings of your CI/CD configuration in Datafold.
- Optimize SQL queries: Refactor your SQL queries to improve the efficiency of database operations, reducing execution time and resource usage.
- Leverage database performance features: Ensure your database is configured to match typical diff workload patterns. Utilize features like query optimization, caching, and parallel processing to boost performance.
- Increase data warehouse resources: If using a platform like Snowflake, consider increasing the size of your warehouse to allocate more resources to Datafold operations.