FAQ
Performance and Scalability
Datafold is highly scalable, supporting data teams working with billion-row datasets and thousands of data transformation/dbt models. Performance optimization features such as SQL filtering, sampling, and Slim Diff let you focus testing on the datasets that are most critical to your business, keeping data quality validation efficient and targeted.
Datafold pushes compute down to your database, so the performance of data diffs largely depends on the underlying SQL engine. Here are some strategies to optimize performance; short illustrative SQL sketches follow the list:
- Enable Sampling: Sampling reduces the amount of data processed by comparing a randomly chosen subset of rows. This balances diff detail against processing time and cost, and is suitable for most use cases.
- Use SQL Filters: If you only need to compare a specific subset of data (e.g., for a particular city or a recent time period), adding a SQL filter can streamline the diff process.
- Optimize SQL Queries: Refactor the queries that build the tables you diff so they scan less data and run more efficiently, reducing execution time and resource usage.
- Leverage Database Performance Features: Configure your database for the diff workload, and take advantage of features such as query optimization, caching, and parallel processing to boost performance.
- Increase Data Warehouse Resources: If using a platform like Snowflake, consider increasing the size of your warehouse to allocate more resources to Datafold operations.
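As an illustration of the sampling trade-off (not Datafold's internal implementation, which you configure in the app), here is what a row-level sample looks like in plain Snowflake SQL against a hypothetical `analytics.orders` table:

```sql
-- Illustration only: scan a ~1% random subset of rows instead of the full table.
-- analytics.orders is a hypothetical table; Datafold's sampling is configured in the app.
SELECT order_id, status, amount
FROM analytics.orders
SAMPLE BERNOULLI (1);  -- each row has roughly a 1% chance of being included
```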
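For SQL filters, the filter is a WHERE-style predicate applied to both sides of the diff so the comparison stays like-for-like. A minimal sketch, assuming hypothetical `created_at` and `city` columns (where exactly you enter the filter depends on your Datafold setup):

```sql
-- Hypothetical predicate: restrict the diff to recent rows for one city.
created_at >= DATEADD(day, -7, CURRENT_DATE)
AND city = 'Berlin'
```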
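One concrete database performance feature that helps diff-style workloads: clustering or partitioning large tables on columns used in diff filters keeps scans narrow. A hedged Snowflake sketch with a hypothetical table and column:

```sql
-- Hypothetical: cluster a large table on a commonly filtered column (e.g., the date
-- used in SQL filters) so diff queries prune micro-partitions instead of full scans.
ALTER TABLE analytics.orders CLUSTER BY (created_at);
```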
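And for warehouse sizing on Snowflake, resizing is a single statement; `DIFF_WH` is a hypothetical warehouse name, and the right size depends on your data volumes and budget:

```sql
-- Hypothetical: scale the warehouse up before a diff-heavy run, then back down.
ALTER WAREHOUSE DIFF_WH SET WAREHOUSE_SIZE = 'LARGE';
-- ... run data diffs ...
ALTER WAREHOUSE DIFF_WH SET WAREHOUSE_SIZE = 'SMALL';
```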