# Get Audit Logs

get /api/v1/audit_logs

# Create a DBT BI integration

post /api/v1/lineage/bi/dbt/

# Create a Hightouch integration

post /api/v1/lineage/bi/hightouch/

# Create a Looker integration

post /api/v1/lineage/bi/looker/

# Create a Mode Analytics integration

post /api/v1/lineage/bi/mode/

# Create a Power BI integration

post /api/v1/lineage/bi/powerbi/

# Create a Tableau integration

post /api/v1/lineage/bi/tableau/

# Get an integration

get /api/v1/lineage/bi/{bi_datasource_id}/

Returns the integration for Mode/Tableau/Looker/Hightouch by its id.

# List all integrations

get /api/v1/lineage/bi/

Returns all integrations for Mode/Tableau/Looker.

# Remove an integration

delete /api/v1/lineage/bi/{bi_datasource_id}/

# Rename a Power BI integration

put /api/v1/lineage/bi/powerbi/{bi_datasource_id}/

It can only update the name. Returns the integration with changed fields.

# Sync a BI integration

get /api/v1/lineage/bi/{bi_datasource_id}/sync/

Starts an unscheduled synchronization of the integration.

# Update a DBT BI integration

put /api/v1/lineage/bi/dbt/{bi_datasource_id}/

Returns the integration with changed fields.

# Update a Hightouch integration

put /api/v1/lineage/bi/hightouch/{bi_datasource_id}/

It can only update the schedule. Returns the integration with changed fields.

# Update a Looker integration

put /api/v1/lineage/bi/looker/{bi_datasource_id}/

It can only update the schedule. Returns the integration with changed fields.

# Update a Mode Analytics integration

put /api/v1/lineage/bi/mode/{bi_datasource_id}/

It can only update the schedule. Returns the integration with changed fields.

# Update a Tableau integration

put /api/v1/lineage/bi/tableau/{bi_datasource_id}/

It can only update the schedule. Returns the integration with changed fields.
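As a quick sketch of how these endpoints are typically called, the following uses only the Python standard library to hit the "List all integrations" endpoint. This is an illustration, not part of the Datafold SDK; it assumes the default `app.datafold.com` host and an API key passed in the `Authorization: Key ...` header described in the Introduction section.

```python
import json
import urllib.request

# Hypothetical helper (not from the Datafold SDK): call "List all integrations".
# BASE_URL assumes the default host; dedicated deployments may differ.
BASE_URL = "https://app.datafold.com"

def list_bi_integrations(api_key: str) -> list:
    """GET /api/v1/lineage/bi/ and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/api/v1/lineage/bi/",
        headers={"Authorization": f"Key {api_key}"},  # header format from the Introduction
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

The same header and base URL apply to the other endpoints above; only the path and HTTP method change.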
# List CI runs

get /api/v1/ci/{ci_config_id}/runs

# Trigger a PR/MR run

post /api/v1/ci/{ci_config_id}/trigger

# Upload PR/MR changes

post /api/v1/ci/{ci_config_id}/{pr_num}

# Create a data diff

post /api/v1/datadiffs

# Get a data diff

get /api/v1/datadiffs/{datadiff_id}

# Get a data diff summary

get /api/v1/datadiffs/{datadiff_id}/summary_results

# List data diffs

get /api/v1/datadiffs

All fields support multiple items, separated by commas.

Date fields also support ranges using the following syntax:

- `>DATETIME` = after DATETIME
- `DATETIME` = between DATETIME and DATETIME + 1 MINUTE
- `DATE` = start of that DATE until DATE + 1 DAY
- `DATETIME1<`

## Custom CI Integrations

Please follow [our CI orchestration docs](../integrations/orchestrators/custom-integrations) to set up a custom CI integration leveraging the Datafold SDK.

## dbt Core CI Integrations

When you set up Datafold CI diffing for a dbt Core project, we rely on the submission of `manifest.json` files to represent the production and staging versions of your dbt project. Please see our detailed docs on how to [set up Datafold in CI for dbt Core](../integrations/orchestrators/dbt-core), and reach out to our team if you have questions.

#### CLI

```bash
datafold dbt upload \
  --ci-config-id <ci_config_id> \
  --run-type <run_type> \
  --target-folder <target_folder> \
  --commit-sha <commit_sha>
```

#### Python

```python
import os
from datafold.sdk.dbt import submit_artifacts

api_key = os.environ.get('DATAFOLD_API_KEY')

# only needed if your Datafold app url is not app.datafold.com
host = os.environ.get("DATAFOLD_HOST")

submit_artifacts(host=host,
                 api_key=api_key,
                 ci_config_id=<ci_config_id>,
                 run_type='<run_type>',
                 target_folder='<target_folder>',
                 commit_sha='<commit_sha>')
```

## Diffing dbt models in development

It can be beneficial to diff between two dbt environments before opening a pull request. This can be done using the Datafold SDK from the command line:

```bash
datafold diff dbt
```

That command will compare data between your development and production environments.
By default, all models that were built in the previous `dbt run` or `dbt build` command will be compared.

### Running Data Diffs before opening a pull request

It can be helpful to view Data Diff results in your ticket before creating a pull request. This enables faster code reviews by letting developers QA changes earlier. To do this, you can create a draft PR and run the following command:

```bash
dbt run && datafold diff dbt
```

This executes dbt locally and triggers a Data Diff to preview data changes without committing to Git. To automate this workflow, see our guide [here](/faq/datafold-with-dbt#can-i-run-data-diffs-before-opening-a-pr).

### Update your dbt\_project.yml with configurations

#### Option 1: Add variables to the `dbt_project.yml`

```yaml
# dbt_project.yml
vars:
  data_diff:
    prod_database: my_default_database # default database for the prod target
    prod_schema: my_default_schema # default schema for the prod target
    prod_custom_schema: PROD_<custom_schema> # Optional: see dropdown below
```

**Additional schema variable details**

The value for `prod_custom_schema:` will vary based on how you have set up dbt. This variable is used when a model has a custom schema and becomes ***dynamic*** when the string literal `<custom_schema>` is present. The `<custom_schema>` substring is replaced with the custom schema for the model in order to support the various ways schema name generation can be overridden -- also referred to as "advanced custom schemas".

**Examples (not exhaustive)**

**Single production schema**

*If your prod environment looks like this ...*

```bash
PROD.ANALYTICS
```

*... your data-diff configuration should look like this:*

```yaml
vars:
  data_diff:
    prod_database: PROD
    prod_schema: ANALYTICS
```

**Some custom schemas in production with a prefix like "prod\_"**

*If your prod environment looks like this ...*

```bash
PROD.ANALYTICS
PROD.PROD_MARKETING
PROD.PROD_SALES
```

*... your data-diff configuration should look like this:*

```yaml
vars:
  data_diff:
    prod_database: PROD
    prod_schema: ANALYTICS
    prod_custom_schema: PROD_<custom_schema>
```

**Some custom schemas in production with no prefix**

*If your prod environment looks like this ...*

```yaml
PROD.ANALYTICS
PROD.MARKETING
PROD.SALES
```

*... your data-diff configuration should look like this:*

```yaml
vars:
  data_diff:
    prod_database: PROD
    prod_schema: ANALYTICS
    prod_custom_schema: <custom_schema>
```

#### Option 2: Specify a production `manifest.json` using `--state`

**Using the `--state` option is highly recommended for dbt projects with multiple target database and schema configurations. For example, if you customized the [`generate_schema_name`](https://docs.getdbt.com/docs/build/custom-schemas#understanding-custom-schemas) macro, this is the best option for you.**

> Note: `dbt ls` is preferred over `dbt compile` as it runs faster and data diffing does not require fully compiled models to work.

```bash
dbt ls -t prod # compile a manifest.json using the "prod" target
mv target/manifest.json prod_manifest.json # move the file up a directory and rename it to prod_manifest.json
dbt run # run your entire dbt project or only a subset of models with `dbt run --select`
data-diff --dbt --state prod_manifest.json # compare your development results to the production database/schema results in the prod manifest
```

#### Add your Datafold data connection integration ID to your dbt\_project.yml

To connect to your database, navigate to **Settings** → **Integrations** → **Data connections**, click **Add new integration**, and follow the prompts. After you **Test and Save**, add the ID (which can be found on Integrations > Data connections) to your **dbt\_project.yml**.

```yaml
# dbt_project.yml
vars:
  data_diff:
    ...
    datasource_id:
```

The following optional arguments are available:

| Options | Description |
| --- | --- |
| `--version` | Print version info and exit. |
| `-w, --where EXPR` | An additional 'where' expression to restrict the search space. Beware of SQL injection! |
| `--dbt-profiles-dir PATH` | Which directory to look in for the `profiles.yml` file. If not set, we follow the default `profiles.yml` location for the dbt version being used. Can also be set via the `DBT_PROFILES_DIR` environment variable. |
| `--dbt-project-dir PATH` | Which directory to look in for the `dbt_project.yml` file. Default is the current working directory and its parents. |
| `--select SELECTION or MODEL_NAME` | Select dbt resources to compare using dbt selection syntax in dbt versions >= 1.5. In versions \< 1.5, it will naively search for a model with `MODEL_NAME` as the name. |
| `--state PATH` | Specify a manifest to use for 'prod' comparison paths instead of using the configuration. |
| `-pd, --prod-database TEXT` | Override the dbt production database configuration within `dbt_project.yml`. |
| `-ps, --prod-schema TEXT` | Override the dbt production schema configuration within `dbt_project.yml`. |
| `--help` | Show this message and exit. |

# Introduction

Our REST API allows you to interact with Datafold programmatically. To use it, you'll need an API key. Follow the instructions below to get started.

## Create an API Key

Open the Datafold app, visit Settings > Account, and select **Create API Key**. Store your API key somewhere safe. If you lose it, you'll need to generate a new one.
![Create an API key](https://mintlify.s3-us-west-1.amazonaws.com/datafold/images/create-api-key.png)

## Use your API Key

When making requests to the Datafold API, you'll need to include the API key as a header in your HTTP request for authentication. The header should be named `Authorization`, and the value should be in the format:

```
Authorization: Key {API_KEY}
```

For example, if you're using cURL:

```bash
curl https://api.datafold.com/api/v1/... -H "Authorization: Key {API_KEY}"
```

## Datafold SDK

Rather than hit our REST API endpoints directly, we offer a convenient Python SDK for common development and deployment testing workflows. You can find more information about our SDK [here](/api-reference/datafold-sdk).

## Need help?

If you have any questions about how to use our REST API, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).

# Create a Diff Monitor

post /api/v1/monitors/create/diff

# Create a Metric Monitor

post /api/v1/monitors/create/metric

# Create a Schema Monitor

post /api/v1/monitors/create/schema

# Create a Test Monitor

post /api/v1/monitors/create/test

# Delete a Monitor

delete /api/v1/monitors/{id}

# Get Monitor

get /api/v1/monitors/{id}

# Get Monitor Run

get /api/v1/monitors/{id}/runs/{run_id}

# List Monitor Runs

get /api/v1/monitors/{id}/runs

# List Monitors

get /api/v1/monitors

# Toggle a Monitor

put /api/v1/monitors/{id}/toggle

# Trigger a run

post /api/v1/monitors/{id}/run

# Update a Monitor

patch /api/v1/monitors/{id}/update

# Best Practices

When dealing with large datasets, it's crucial to approach diffing with specific optimization strategies in mind. We share best practices that will help you get the most accurate and efficient results from your data diffs.
## Enable sampling

[Sampling](/data-diff/cross-database-diffing/creating-a-new-data-diff#row-sampling) can be helpful when diffing extremely large datasets, as it can result in a speedup of 2x to 20x or more. The extent of the speedup depends on various factors, including the scale of the data, instance sizes, and the number of data columns.

The following table illustrates the speedup achieved with sampling in different databases, with varying instance sizes and different numbers of data columns:

| Databases | vCPU | RAM, GB | Rows | Columns | Time full | Time sampled | Speedup | RDS type | Diff full | Diff sampled | Per-col noise |
| :-----------------: | :--: | :-----: | :-------: | :-----: | :-------: | :----------: | :-----: | :-----------: | :-------: | :----------: | :-----------: |
| Oracle vs Snowflake | 2 | 2 | 1,000,000 | 1 | 0:00:33 | 0:00:27 | 1.22 | db.t3.small | 5399 | 5400 | 0 |
| Oracle vs Snowflake | 8 | 32 | 1,000,000 | 1 | 0:07:23 | 0:00:18 | 24.61 | db.m5.2xlarge | 5422 | 5423 | 0.005 |
| MySQL vs Snowflake | 2 | 8 | 1,000,000 | 1 | 0:00:57 | 0:00:24 | 2.38 | db.m5.large | 5409 | 5413 | 0 |
| MySQL vs Snowflake | 2 | 8 | 1,000,000 | 29 | 0:40:00 | 0:02:14 | 17.91 | db.m5.large | 5412 | 5411 | 0 |

When sampling is enabled, Datafold compares a randomly chosen subset of the data. Sampling is a tradeoff between diff detail and the time/cost of the diffing process. For most use cases, sampling does not reduce the informational value of data diffs, as it still provides the magnitude and specific examples of differences (e.g., if 10% of sampled data show discrepancies, it suggests a similar proportion of differences across the entire dataset).

Although configuring sampling can seem overwhelming at first, a good rule of thumb is to select an initial value of 95% for the sampling confidence and adjust it as needed. Tweaking the parameters can be helpful to see how they impact the sample size and the tradeoff between performance and accuracy.
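The Speedup column in the table above is simply the full diff runtime divided by the sampled runtime. As a quick check in Python:

```python
# Reproduce the "Speedup" column above: full diff time divided by sampled time.
def seconds(hms: str) -> int:
    """Convert an H:MM:SS string (as in the table) to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

# Oracle vs Snowflake on db.m5.2xlarge: 0:07:23 full vs 0:00:18 sampled
speedup = seconds("0:07:23") / seconds("0:00:18")
print(round(speedup, 2))  # 24.61, matching the table
```

The same arithmetic reproduces the other rows, e.g. 0:40:00 vs 0:02:14 gives 17.91.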
## Handling data type differences

Datafold automatically manages data type differences during cross-database diffing. For example, when comparing decimals with different precisions (e.g., `DECIMAL(38,15)` in SQL Server and `DECIMAL(38,19)` in Snowflake), Datafold automatically casts values to a common precision before comparison, flagging any differences appropriately. Similarly, for timestamps with different precisions (e.g., milliseconds in SQL Server and nanoseconds in Snowflake), Datafold adjusts the precision as needed for accurate comparisons, simplifying the diffing process.

## Optimizing OLTP databases: indexing best practices

When working with row-oriented transactional databases like PostgreSQL, optimizing the database structure is crucial for efficient data diffing, especially for large tables. Here are some best practices to consider:

* **Create indexes on key columns**:
  * It's essential to create indexes on the columns that will be compared, particularly the primary key columns defined in the data diffs.
  * **Example**: If your data diff involves primary key columns `colA` and `colB`, ensure that indexes are created for these specific columns.
* **Use separate indexes for primary key columns**:
  * Indexes for primary key columns should be distinct and start with these columns, not as subsets of other indexes. Having a dedicated primary key index is critical for efficient diffing.
  * **Example**: Consider a primary key consisting of `colA` and `colB`. Ensure that the index is structured in the same order, like (`colA`, `colB`), to align with the primary key. An index with an order of (`colB`, `colA`) is strongly discouraged due to the impact on performance.
  * **Example**: If the index is defined as (`colA`, `colB`, `colC`) and the primary key is a combination of `colA` and `colB`, then when setting up the diff operation, ensure that the primary key is specified as `colA`, `colB`. If the order is reversed as `colB`, `colA`, the diffing process won't be able to fully utilize indexing, potentially leading to slower performance.
* **Leverage compound indexes**:
  * Compound indexes, which involve multiple columns, can significantly improve query performance during data diffs as they efficiently handle complex queries and filtering.
  * **Example**: An index defined as (`colA`, `colB`, `colC`) can be beneficial for diffing operations involving these columns, as it aligns with the order of columns in the primary key.

## Handling high percentage of differences

Data Diff is optimized to perform best when the percentage of different rows/values is relatively low, to support common data validation scenarios like data replication and migration. While the tool strives to maximize the database's computational power and minimize data transfer, in extreme cases with very high difference percentages (up to 100%), it may result in transferring every row over the network, which is considerably slower.

In order to avoid long-running diffs, we recommend the following:

* **Start with diffing [primary keys](/data-diff/cross-database-diffing/creating-a-new-data-diff#primary-key)** only to identify row-level completeness first, before diffing all or more columns.
* **Set an [egress](/data-diff/cross-database-diffing/creating-a-new-data-diff#primary-key) limit** to automatically stop the diffing process after a set number of rows are downloaded over the network.
* **Set a [per-column diff](/data-diff/cross-database-diffing/creating-a-new-data-diff#primary-key) limit** to stop finding differences for each column after a set number are found.
This is especially useful in data reconciliation, where identifying a large number of discrepancies (e.g., a large percentage of missing/different rows) early on indicates that a detailed row-by-row diff may not be required, thereby saving time and computational resources.

In the screenshot below, we see that exactly 4 differences were found in `user_id`, but "at least 4,704 differences" were found in `total_runtime_seconds`. `user_id` has a number of differences below the per-column diff limit, and so we state the exact number. On the other hand, `total_runtime_seconds` has a number of differences greater than the per-column diff limit, so we state "at least." Note that due to our algorithm's approach, we often find significantly more differences than the limit before diffing is halted; in that scenario, we report the value that was found, while stating that more differences may exist.

## Executing queries in parallel

Increase the number of concurrent connections to the database in Datafold. This enables queries to be executed in parallel, significantly accelerating the diff process.

Navigate to the **Settings** option in the left sidebar menu of Datafold. Adjust the **max connections** setting to increase the number of concurrent connections Datafold can establish with your data. Note that the maximum allowable value for concurrent connections is 64.

## Optimize column selection

The number of columns included in the diff directly impacts its speed: selecting fewer columns typically results in faster execution. To optimize performance, refine your column selection based on your specific use case:

* **Comprehensive verification**: For in-depth analysis, include all columns in the diff. This method is the most thorough, suitable for exhaustive data reviews, albeit time-intensive for wide tables.
* **Minimal verification**: Consider verifying only the primary key and `updated_at` columns. This is efficient and sufficient if you need to validate that rows have not been added or removed and that updates are current between databases, but do not need to check for value-level differences between rows with common primary keys.
* **Presence verification**: If your main concern is just the presence of data (whether data exists or has been removed), such as identifying missing hard deletes, verifying only the primary key column can be sufficient.
* **Hybrid verification**: Focus on key columns that are most critical to your operations or data integrity, such as monetary values in an `amount` column, while omitting large serialized or less critical columns like `json_settings`.

## Managing primary key distribution

Significant gaps in the primary key column can decrease diff efficiency (e.g., tens of millions of continuous rows missing). Datafold will execute queries for non-existent row ranges, which can slow down the data diff.

## Handling different primary key types

As a general rule, primary keys should be of the same (or similar) type in both datasets for diffing to work properly. Comparing primary keys of different types (e.g., `INT` vs `VARCHAR`) will result in a type mismatch error. You can still diff such datasets by casting the primary key column to the same type in both datasets explicitly.

Indexes on the primary key typically cannot be utilized when the primary key is cast to a different type. This may result in slower diffing performance. Consider creating a separate index, such as an [expression index in PostgreSQL](https://www.postgresql.org/docs/current/indexes-expressional.html), to improve performance.

# Creating a New Data Diff

Datafold's Data Diff can compare data across databases (e.g., PostgreSQL <> Snowflake, or between two SQL Server instances) efficiently and with minimal egress by leveraging stochastic in-database checksumming. This powerful algorithm provides full row-, column-, and value-level detail into discrepancies between data tables.

## Creating a new data diff

Setting up a new data diff in Datafold is straightforward. You can configure your data diffs with the following parameters and options:

### Source and Target datasets

#### Data connection

Pick your data connection(s).

#### Diff type

Choose how you want to compare your data:

* Table: Select this to compare data directly from database tables
* Query: Use this to compare results from specific SQL queries

#### Dataset

Choose the dataset you want to compare. This can be a table or a view in your relational database.

#### Filter

Insert your filter clause after the WHERE keyword to refine your dataset. For example: `created_at > '2000-01-01'` will only include data created after January 1, 2000.

### Materialize inputs

Select this option to improve diffing speed when the query is compute-heavy, when filters are applied to non-indexed columns, or when primary keys are transformed using concatenation, coalesce, or another function.

## Column remapping

Designate columns with the same data type but different column names to be compared. Data Diff will surface differences under the column name used in the Source dataset.

Datafold automatically handles differences in data types to ensure accurate comparisons. See our best practices below for how this is handled.

## General

### Primary key

The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere, as it is used only for unique row identification during diffing.

Textual primary keys do not support values outside the set of characters `a-zA-Z0-9!"()*/^+-<>=`. If these values exist, we recommend filtering them out before running the diff operation.
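As a hypothetical pre-flight check (the helper below is our own sketch, not part of the Datafold SDK), you could screen textual key values against that character set before running the diff:

```python
import re

# Allowed characters for textual primary keys, per the note above.
# Inside the character class, `-` is escaped; everything else is literal.
ALLOWED = re.compile(r'^[a-zA-Z0-9!"()*/^+\-<>=]+$')

def is_diffable_key(value: str) -> bool:
    """True if the value contains only characters Data Diff supports in textual keys."""
    return ALLOWED.fullmatch(value) is not None

keys = ["user-123", "order=42", "naïve key"]
clean = [k for k in keys if is_diffable_key(k)]  # drops "naïve key" (space, non-ASCII)
```

In practice the same screening is usually done with a `WHERE` filter on the diff itself rather than in client code.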
#### Egress limit

The egress limit optimizes the diff process by terminating it once a predefined number of rows are downloaded. The limit is set to 1,000,000 by default. When the egress limit is reached, the diffing process does not produce the same results each time it is run, as it is not deterministic (i.e., the order in which data is processed may vary).

The egress limit prevents redundant analysis in scenarios with minor, repetitive discrepancies, such as formatting differences (e.g., whitespace, rounding differences). For most use cases, it is impractical to continue diffing after it is known that the datasets are substantially different.

Since the algorithm aims to detect and return every mismatched row/value, if the datasets have a large percentage of differing rows, the algorithm may be unable to take advantage of checksumming. This can cause a large amount of data to be pulled over the network, which slows down the diffing process and increases the strain on the database. Setting an egress limit prevents unwanted runtime and database load by stopping the operation early in cases of substantial dataset discrepancies.

It is highly recommended to set an egress limit, taking into account these tradeoffs between cost/speed and rigor.

### Columns

#### Columns to compare

Specify which columns to compare between datasets. Note that this has performance implications when comparing a large number of columns, especially in wide tables with 30 or more columns. It is recommended to initially focus on comparisons using only the primary key, or to select a limited subset of columns.

#### Per-column diff limit

By setting a per-column diff limit, Data Diff will stop identifying differences for any column once the number of differences found reaches the limit. Data Diff will also stop searching for exclusive and duplicate primary keys after the limit is reached.
Setting a per-column diff limit enables your team to find data quality issues that arise during data reconciliation while minimizing compute and time spent searching for differences. Learn more about data reconciliation best practices here.

### Row sampling

#### Enable sampling

Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets.

Even when sampling is enabled, checksumming for unsampled primary keys still needs to be performed. As a result, if many columns beyond primary keys are involved, the time spent running cross-database diffs with sampling may be significantly reduced.

#### Sampling tolerance

Sampling tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (e.g., nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling.

When sampling is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.

Default: 0.001%

#### Sampling confidence

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance.

To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.

Default: 99%

#### Sampling threshold

Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value.
This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.

#### Sample size

This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:

* The presence of duplicate primary keys in the datasets will likely increase this estimate, as it inflates the perceived uniqueness of rows.
* Applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope.
* The number of rows we sample is not fixed; instead, we use a statistical approach based on the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (compare every single row) to establish a baseline.

Example: Imagine there are two datasets we want to compare, Source and Target. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:

* Sampling tolerance: 5%
* Sampling confidence: 95%

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate.

Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.
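To build intuition for how tolerance and confidence drive sample size, here is a simplified back-of-the-envelope calculation. It is not Datafold's actual estimator (which uses Poisson sampling and also accounts for duplicates and filters, so its numbers will differ); it only answers a narrower question: if we observe zero primary key errors in n sampled rows, how large must n be to claim, at the given confidence, that the true error rate is below the tolerance?

```python
import math

# Simplified sketch, not Datafold's estimator: with zero errors observed in n
# rows, the claim "error rate < tolerance" holds at the given confidence once
# (1 - tolerance)^n <= 1 - confidence.
def rough_sample_size(tolerance: float, confidence: float) -> int:
    return math.ceil(math.log(1 - confidence) / math.log(1 - tolerance))

print(rough_sample_size(0.05, 0.95))  # 59
```

Tightening either parameter grows n quickly, which is why the docs suggest starting at 95% confidence and adjusting from there.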
### Advanced

#### Materialize diff results to table

Create a detailed table from your diff results, indicating each row where differences occur. This table will include corresponding values from both datasets and flags showing whether each row matches or mismatches.

# How Cross-Database Diffing Works

Datafold's cross-database diffing algorithm efficiently compares datasets between different databases. To do so, Datafold leverages a proprietary [stochastic checksumming algorithm](../../faq/data-diffing#what-is-stochastic-checksumming) that allows it to identify discrepancies down to individual primary keys and column values while minimizing the amount of data sent over the network. As a result, the comparison is mostly performed in place, leveraging the underlying databases without the need to export the entire dataset to compare elsewhere.

# Results

Once your data diff is complete, Datafold provides a concise, high-level summary of the detected changes in the Overview tab.

## Overview

The top-level menu displays the diff status, job ID, creation and completion times, runtime, and data connection.

## Columns

The Columns tab displays a table with detailed column and type mappings from the two datasets being diffed, with status indicators for each column comparison (e.g., identical, percentage of values different). This provides a quick way to identify data inconsistencies and prioritize updates.

## Primary keys

This tab highlights rows that are unique to the Target dataset in a data diff ("Rows exclusive to Target"). As this identifies rows that exist only in the Target dataset and not in the Source dataset based on the primary key, it flags potential data discrepancies.

The **Clone diffs and materialize results** button allows you to rerun existing data diffs with results materialized in the warehouse, along with any other desired modifications.
## Values

This tab displays rows where at least one column value differs between the datasets being compared. It is useful for quickly assessing the extent of discrepancies between the two datasets.

The **Show filters** button enables the following features:

* Highlight characters: highlight value differences between tables
* % of difference: filters and displays columns based on the specified percentage range of value differences

# How Datafold Diffs Data

Data Diff can compare data within and between databases (e.g., PostgreSQL <> Snowflake or between two MySQL instances), fast and without transferring the actual data over the network, while providing full row-, column-, and value-level detail into the discrepancies.

The basic inputs required to run a diff are the data connection, the names of the datasets to be compared, and the primary key (a column or a combination of columns that uniquely identify a row in the dataset).

## What types of data can Data Diff compare?

Data Diff can compare data in tables, views, and SQL queries in relational databases and data lakes. Datafold facilitates data diffing by [supporting a wide range of basic data types](https://docs.datafold.com/api-reference/data-types) across major database systems like BigQuery, PostgreSQL, Redshift, Databricks, and Snowflake.

## Creating diffs

Diffs can be created in multiple ways, including:

* Interactively through the Datafold App
* Through the [Datafold API](https://docs.datafold.com/reference/cloud/rest-api)
* As part of a Continuous Integration (CI) workflow in Deployment Testing

## How in-database diffing works

When diffing data within the same physical database or data lake namespace, Data Diff compares data by executing various SQL queries in the target database. It uses several JOIN-type queries and various aggregate queries to provide detailed insights into differences at the row, value, and column levels, and to calculate differences in metrics and distributions.
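In spirit, the JOIN-type queries described above resemble a full outer join on the primary key that flags missing and mismatched values. The following is a hypothetical sketch, not Datafold's actual generated SQL; the table and column names are invented:

```python
# Hypothetical illustration of a JOIN-type diff query, not Datafold's actual SQL.
# Table names (analytics.orders_main / analytics.orders_test) and columns are made up.
diff_sql = """
SELECT
    COALESCE(a.id, b.id)                 AS pk,
    (a.id IS NULL)                       AS missing_in_a,
    (b.id IS NULL)                       AS missing_in_b,
    (a.amount IS DISTINCT FROM b.amount) AS amount_differs
FROM analytics.orders_main AS a
FULL OUTER JOIN analytics.orders_test AS b
  ON a.id = b.id
WHERE a.id IS NULL
   OR b.id IS NULL
   OR a.amount IS DISTINCT FROM b.amount
"""
```

The aggregate queries mentioned above would then count flags like `amount_differs` per column to produce the column-level summary.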
## How cross-database diffing works

When comparing data across databases, Data Diff leverages checksumming and interval search to diff the data quickly and at minimal cost. Data Diff can rapidly assess both the magnitude of differences and identify specific rows, columns, and values with differences **without having to copy the entire dataset over the network**. This efficiency makes it scalable for datasets as large as trillions of rows or terabytes in size.

# Best Practices

We share best practices that will help you get the most accurate and efficient results from your data diffs.

## Comparing numeric columns: tolerance for floats

When comparing numeric columns of `FLOAT` type, which is inherently noisy, it can be helpful to specify tolerance levels below which differences between values are considered equal. Set appropriate tolerance levels for floating-point comparisons to avoid flagging inconsequential differences.

## Materialize diff results

While the Datafold UI provides advanced exploration of diff results, it can sometimes be helpful to materialize diff results back to the database to investigate them further with SQL or for audit logging.

## Optimizing diff performance at scale

Since Data Diff pushes down the compute to your database (which usually has sufficient capacity to store and compute the datasets in the first place), diffing speed and scalability depend on the performance of the underlying SQL engine. In most cases, diffing performance is comparable to typical transformation jobs and analytical queries running in the database, and diffing has scaled to trillions of rows.

When diffs run longer or consume more database resources than desired, consider the following measures:

1. **Enable sampling** to dramatically reduce the amount of data processed for in-database diffing. Sampling can be helpful when diffing extremely large datasets. When sampling is enabled, Datafold compares a randomly chosen subset of the data.
Sampling is the tradeoff between the diff detail and time/cost of the diffing process. For most use cases, sampling does not reduce the informational value of data diffs as it still provides the magnitude and specific examples of differences (e.g., if 10% of sampled data show discrepancies, it suggests a similar proportion of differences across the entire dataset). Sampling is less ideal when you need to audit every changed value with 100% confidence, but this scenario is rare in practice. Although configuring sampling can seem overwhelming at first, a good rule of thumb is to select an initial value of 95% for the sampling confidence and adjust it as needed. Tweaking the parameters can be helpful to see how they impact the sample size and the tradeoff between performance and accuracy. 2. **Add a SQL filter** if you actually need to compare just a subset of data (e.g., for a particular city or last two weeks). 3. **Optimize SQL queries** to enhance the performance and efficiency of database operations, reduce execution time, minimize resource usage, and ensure faster retrieval of data diff results. 4. **Leverage database performance** by ensuring proper configuration to match the typical workload patterns of your diff operations. Many modern databases come with performance-enhancing features like query optimization, caching, and parallel processing. 5. Consider **increasing resources** available to Datafold in your data warehouse (e.g., for Snowflake, increase warehouse size). # Creating a New Data Diff Setting up a new data diff in Datafold is straightforward. You can configure your data diffs with the following parameters and options: ## Dataset ### Data connection Pick your data connection(s). 
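Besides the interactive flow described here, diffs can also be created programmatically through the REST API's `POST /api/v1/datadiffs` endpoint. The sketch below uses only the Python standard library; the payload field names and the `Key` authorization scheme are illustrative assumptions, so consult the API reference for the authoritative schema:

```python
import json
import urllib.request

# Hypothetical payload: the field names below are illustrative assumptions,
# not the authoritative request schema.
payload = {
    "data_source1_id": 1,   # connection for the Main dataset (assumed field)
    "data_source2_id": 1,   # connection for the Test dataset (assumed field)
    "table1": ["analytics", "prod", "dim_orgs"],
    "table2": ["analytics", "dev", "dim_orgs"],
    "pk_columns": ["org_id"],
}

def create_diff(host: str, api_key: str, payload: dict) -> dict:
    """POST a diff configuration to /api/v1/datadiffs and return the JSON response."""
    req = urllib.request.Request(
        f"{host}/api/v1/datadiffs",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Key {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires a valid API key):
# create_diff("https://app.datafold.com", "<YOUR_API_KEY>", payload)
```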
### Diff type

Choose how you want to compare your data:

* Table: Select this to compare data directly from database tables
* Query: Use this to compare results from specific SQL queries

Datafold can also diff views, materialized views, and dynamic tables (Snowflake-only) across both options.

### Dataset

Choose the datasets you want to compare: Main and Test. Each can be a table or a view in your relational database.

### Time travel point

If your database supports time travel, like [Snowflake](https://docs.snowflake.com/en/user-guide/data-time-travel#querying-historical-data), you can query data at a specified timestamp. This is useful for tracking changes over time, conducting audits, or correcting mistakes from accidental data modifications. You can adjust the database's session parameters as needed for your query.

Supported time travel expressions:

| Database  | Timestamp | Negative Offset |
| :-------: | :-------: | :-------------: |
| BigQuery  |           |                 |
| Snowflake |           |                 |

Timestamp examples:

* `2024-01-01`
* `2024-01-01 10:04:23`
* `2024-01-01 10:04:23-09:00`
* `2024-07-16T10:04:23+05:00`

Negative offset examples (in seconds):

* `130`
* `3600`

### Filter

Insert your filter clause after the `WHERE` keyword to refine your dataset. For example, `created_at > '2000-01-01'` will only include data created after January 1, 2000.

## Column remapping

When columns have the same data type but different names, column remapping allows you to align and compare them. This is useful when datasets have semantically identical columns with different names, such as `userID` and `user_id`. Datafold will surface any differences under the column name used in the Main dataset.

## General parameters

### Primary key

The primary key is one or more columns used to uniquely identify a row in the dataset during diffing.
The primary key (or keys) does not need to be formally defined in the database or elsewhere, as it is used only for unique row identification during diffing. Selecting multiple columns defines a compound primary key.

### Time-series dimension column

If a time-series dimension is selected, Datafold produces a Timeline plot of diff results over time to identify any time-based patterns. This is useful for spotting trends or anomalies when a given column does not match between tables in a certain date range. By selecting a time-based column, you can visualize differences and patterns across time, measured as column match rates.

### Materialize diff results to table

Create a detailed table from your diff results, indicating each row where differences occur. This table will include corresponding values from both datasets and flags showing whether each row matches or mismatches.

### Materialize full diff result

For in-depth analysis, you can opt to materialize the full diff result. This disables sampling, allowing for a complete row-by-row comparison across datasets. Otherwise, Datafold defaults to diffing only a sample of the data.

## Row sampling

### Enable sampling

Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets.

### Sampling tolerance

Sampling tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (such as nulls, duplicates, or primary keys exclusive to one dataset) before sampling is disabled. When sampling is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.

Default: 0.001%

### Sampling confidence

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance. Put simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.

Default: 99%

### Sampling threshold

Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. For smaller datasets, a complete comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.

### Sample size

This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:

* The presence of duplicate primary keys in the datasets will likely increase this estimate, as it inflates the perceived uniqueness of rows.
* Applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope.

The number of rows we sample is not fixed; instead, we use a statistical approach based on the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (comparing every single row) to establish a baseline.

Example: Imagine there are two datasets we want to compare, Main and Test.
Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:

* Sampling tolerance: 5%
* Sampling confidence: 95%

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.

## Tolerance for floats

An acceptable delta between numeric values used to determine whether they match. This is particularly useful for addressing rounding differences in long floating-point numbers. Add tolerance by choosing a column name, mode, and value. For mode:

* *Relative*: Defines a percentage-based tolerance. For example, a 2% relative tolerance means no difference is noted if the absolute value of (A/B - 1) is less than or equal to 2%.
* *Absolute*: Sets a fixed numerical margin. For instance, an absolute tolerance of 0.5 means values are matched if the absolute difference between A and B is 0.5 or less.

# Results

Once your data diff is complete, Datafold provides a concise, high-level summary of the detected changes in the Overview tab.

## Overview

The top-level menu displays the diff status, job ID, creation and completion times, runtime, and data connection.

## Columns

The Columns tab displays a table with detailed column and type mappings from the two datasets being diffed, with status indicators for each column comparison (e.g., identical, percentage of values different). This provides a quick way to identify data inconsistencies and prioritize updates.

## Primary keys

This tab highlights rows that are unique to the Test dataset in a data diff ("Rows exclusive to Test").
As this identifies rows that exist only in the Test dataset and not in the Main dataset based on the primary key, it flags potential data discrepancies.

The **Show filters** button allows you to filter these rows by selected column(s). The **Clone diffs and materialize results** button allows you to rerun existing data diffs with results materialized in the warehouse, along with any other desired modifications.

## Column Profiles

Column Profiles displays aggregate statistics and distributions, including averages, counts, ranges, and histogram charts representing column-level differences. The **Show filters** button allows you to adjust chart values by relative (percentage) or absolute numbers.

## Values

This tab displays rows where at least one column value differs between the datasets being compared. It is useful for quickly assessing the extent of discrepancies between the two datasets.

The **Show filters** button enables the following features:

* Highlight characters: highlight value differences between tables
* % of difference: filter and display columns based on the specified percentage range of value differences

## Timeline

The Timeline tab is a specialized feature that only appears if a time-series dimension column has been selected. It graphically represents data differences over time to highlight discrepancies. It only displays columns with data differences, presented as the share of mismatched data (percentage mismatched). This feature offers enhanced clarity in pinpointing inconsistencies, supports informed decision-making through visual data representation, and increases efficiency in identifying and resolving data-related issues.

The Timeline feature is particularly useful when an incremental model is mismanaged, leading to improper backfilling. It allows users to visually track the inconsistencies that arise over time due to the mismanagement. This graphical representation makes it easier to pinpoint the specific time frames where the errors occurred, facilitating a more targeted approach to rectifying these issues.

It is also useful in correlating data differences with specific time intervals that coincide with changing data connections. When switching over or stitching together different data connections, there's often a shift in how data behaves over time. The Timeline graph helps flag the potential impact of the source change on data consistency and integrity.

## Downstream Impact

This tab displays all associated BI and data app dependencies, such as dashboards and views, linked to the compared datasets. This helps visually illustrate the impact of data changes on downstream data assets. Each listed dependency is shown with a link to its lineage diagram within Datafold's [column-level lineage](https://docs.datafold.com/data-explorer/how-it-works). You can filter by tables or columns within tables, or [open this view](https://docs.datafold.com/data-explorer/how-it-works) in Data Explorer for further analysis.

# What's a Data Diff?

A data diff is a value-level comparison between two tables, used to identify critical changes to your data and guarantee data quality.

When you **git diff** your code, you're comparing two versions of your code files to see what has changed, such as lines added, removed, or modified. Similarly, a **data diff** compares two versions of a dataset or two databases to identify differences in individual cells of the data.

![what's a data diff](https://mintlify.s3-us-west-1.amazonaws.com/datafold/images/data_diff/what_is_data_diff.png)

## Why do I need to diff data?

Just as diffing code and text is fundamental to software engineering and working with text documents, diffing data is essential to the data engineering workflow. Why? In data engineering, both data and the code that processes it are constantly evolving.
Without the ability to easily diff data, understanding and tracking data changes becomes challenging. This slows down the development process and makes it harder to ensure data quality.

There is a lot you can do with data diffs:

* Test SQL code by comparing development or staging environment data to production
* Compare tables in source and target systems to identify discrepancies when migrating data between databases
* Detect value-level outliers, or unexpected changes, in data flowing through your ETL/ELT pipelines
* Verify that reports generated for regulatory compliance accurately reflect the underlying data by comparing report outputs with source data

## Why Datafold?

Data diffing is a fundamental capability in data engineering that every engineer should have access to. Datafold's [Data Diff](https://www.datafold.com/data-diff) offers an enterprise-ready solution for comparing datasets quickly, at scale, within or across databases. It includes comprehensive, optimized, and automated diffing solutions, API access, and secure deployment options.

Datafold provides end-to-end solutions for automating testing, including column-level lineage, ML-based anomaly detection, and enterprise-scale infrastructure support.
It caters to complex and production-ready scenarios, including:

* Automated and collaborative diffing and testing for data transformations in CI
* Data diffing informed by column-level lineage, and validation of code changes with visibility into BI applications
* Validating large data migrations or continuous replications with automated cross-database diffing capabilities

Here's a high-level overview of what Datafold offers:

| Feature Category | Datafold |
| :---: | :---: |
| **Database Support** *Databases that are supported for source-destination diff* | Any SQL database, inquire about specific support |
| **Scale** *Size of datasets supported for diffing* | Unlimited with advanced performance optimization |
| **Primary Key Data Type Support** *Data types of primary keys that are supported for diffing* | Numerical, string, datetime, boolean, composite |
| **Data Types Diffing Support** *Data types that are supported for per-column diffing* | All data types |
| **Export Diff Results to Database** *Materialize diffing results in your database of choice* | |
| **Value-level diffs** *Investigate row-by-row column value differences between source and destination databases* | (JSON & GUI) |
| **Diff UI** *Explore diffs visually and easily share them with your team and stakeholders* | |
| **API Access** *Automatically create diffs and receive results at scale using the Datafold REST API* | |
| **Persisting Diff History** *Persist the result history of diffs to know how your data and diffs have changed over time* | |
| **Scheduled Checks** *Run scheduled diffs for a defined list of tables* | |
| **Alerting** *Receive automatic alerts about detected discrepancies between tables within or across databases* | |
| **Security and Compliance** *Run diffs in secure and compliant environments* | HIPAA, SOC2 Type II, GDPR compliant |
| **Deployment Options** *Deploy your diffs in secure environments that meet your security standards* | Multi-tenant SaaS or Single-tenant in VPC |
| **Support** *Choose which channels offer the greatest support to your use cases and users* | Enterprise support from Datafold team members |
| **SLA** *The types of SLAs that exist to guarantee your team can diff and interact with diffs as expected* | (Coming soon) |

## Three ways to learn more

If you're new to Datafold or data diffing, here are three easy ways to get started:

1. **Explore our CI integration guides**: See how Datafold fits into your continuous integration (CI) pipeline by checking out our guides for [No-Code](../deployment-testing/getting-started/universal/no-code), [API](../deployment-testing/getting-started/universal/api), or [dbt](../integrations/orchestrators) integrations.
2. **Try it yourself**: Use your own data with our [14-day free trial](https://app.datafold.com/) and experience Datafold in action.
3. **Book a demo**: Get a deeper technical understanding of how Datafold integrates with your company's data infrastructure by [booking a demo](https://www.datafold.com/booktime) with our team.

# dbt Metadata Sync

Datafold can automatically ingest dbt metadata from your production environment and display it in Data Explorer.

**INFO** You can enable the metadata sync in your Orchestration settings. Please note that when this feature is enabled, user editing of table metadata is disabled.

### Model-level

The following model-level information can be synced:

* `description` is synchronized into the description field of the table in Lineage.
* The `owner` of the table is set to the user identified by `user@company.com`. This user must exist in Datafold with that email.
* The `foo` meta-information is added to the description field with the value `bar`.
* The tags `pii` and `abc` are applied to the table as tags.

Here's an example configuration in YAML format:

```yaml
models:
  - name: users
    description: "Description of the table"
    meta:
      owner: user@company.com
      foo: bar
    tags:
      - pii
      - abc
```

### Column-level

The following column-level information can be synced:

* The column `user_id` has two tags applied: `pk` and `id`.
* The metadata for `user_id` is ignored because it reflects the primary key tag.
* The `email` column has the description applied.
* The `email` column has the tag `pii` applied.
* The `email` column has extra metadata information in the description field: `type` with the value `email`.

Here's an example configuration for columns in YAML format:

```yaml
models:
  - name: users
    ...
    columns:
      - name: user_id
        tags:
          - pk
          - id
        meta:
          pk: true
      - name: email
        description: "The user's email"
        tags:
          - pii
        meta:
          type: email
```

# How It Works

Our **Data Explorer** offers a comprehensive overview of your data assets, including [Lineage](/data-explorer/lineage) and [Profiles](/data-explorer/profile). The UI visually maps workflows and tracks column-level or tabular lineage, helping users understand the impact of upstream changes.

You can filter data assets by Data Connections, Tags, Data Owners, and Asset Types (e.g., tables, columns, and BI-created assets such as views, reports, and syncs). You can also search directly to find specific data assets for lineage analysis.

After selecting a table or data asset, the UI will display a **graph of table-level lineage** by default. You can toggle between **Upstream** and **Downstream** perspectives and customize the lineage view by adjusting the **Max Depth** parameter to your preference.

# Lineage

Datafold offers column-level and tabular lineage views.

## Column-level lineage

Datafold's column-level lineage helps users trace and document the history, transformations, dependencies, and both downstream and upstream processes of a specific data column within an organization's data assets. This feature allows you to pinpoint the origins of data validation issues and comprehensively identify downstream data processes and applications.

To view column-level lineage, click on the **Columns** dropdown menu of the selected asset.

### Highlight path between assets

To highlight the column path between assets, click the specific column.
Reset the view by clicking the **Exit the selected path** button.

## Tabular lineage

Datafold also offers a tabular lineage view. You can sort lineage information by depth, asset type, identifier, and owner. Click on the **Actions** button for further options:

### Focus lineage on current node

Drill down into the data node or column of interest.

### Show SQL query

Access the SQL query associated with the selected column to understand how the data was queried from the source.

### Show usage details

Access detailed information about the column's read, write, and cumulative read counts (the sum of the column's read count and the read counts of its downstream columns) for the previous 7 days.

## Search and filters

Datafold offers powerful search and filtering capabilities to help users quickly locate specific data assets and isolate data connections of interest. In both the graphical and tabular lineage views, you can filter by tables or columns within tables, allowing you to go as granular as needed.

### Table filtering

Simply enter the table's name in the search bar to filter and display all relevant information associated with that table.

### Column filtering

To focus specifically on columns, you can search using a combination of keywords. For instance, searching "column table" will display columns associated with a table, while a query like "column dim customer" narrows the search to columns within the "dim customer" table.

## Settings

You can configure the settings for Lineage under Settings > Data Connections > Advanced Settings:

### Schema indexing schedule

Customize the frequency and timing of index updates on database schemas. The schedule is defined through a crontab expression.

### Table inclusion/exclusion

You can filter to include and/or exclude specific tables from Lineage. When the inclusion list is set, only the tables specified in the list will be visible in lineage and search results. When the inclusion list is not set, all tables will be visible by default, except for those explicitly specified in the exclusion list.

### Lineage update schedule

Customize the frequency and timing of scans of your data warehouse's query history to build and update the data lineage. The schedule is defined through a crontab expression.

## FAQ

Datafold computes column-level lineage by:

1. Ingesting, parsing, and analyzing SQL logs from your databases and data warehouses. This allows Datafold to infer dependencies between SQL statements, including those that create, modify, and read data.
2. Augmenting the metadata graph with data from various sources. This includes metadata from orchestration tools (e.g., dbt), BI tools, and user-provided documentation.

Currently, the schema of the Datafold GraphQL API, which we use to expose lineage information, is not yet stable and is considered to be in beta. Therefore, we do not include this API in our public documentation. If you would like to programmatically access lineage information, you can explore our GitHub repository with a few examples: [datafold/datafold-api-examples](https://github.com/datafold/datafold-api-examples). Simply clone the repository and follow the instructions provided in the `README.md` file.

# Profile

View a data profile that summarizes key table and column-level statistics, and any upstream dependencies.

# Cross-Database Diffing for Migrations

Validate migration parity with Datafold's cross-database diffing solution.

When migrating data from one system to another, ensuring that the data is accurately transferred and remains consistent is critical. Datafold's cross-database diffing provides a robust method to validate parity between the source and target databases. It compares data across databases, identifying discrepancies at the dataset, column, and row levels, ensuring full confidence in your migration process.
## How cross-database diffing works

* Datafold connects to any SQL source and target databases, similar to how BI tools do.
* Datafold does not need to extract the entirety of the datasets for comparison. Instead, it relies on stochastic checksumming to identify discrepancies and extracts only those for analysis.

### What kind of information does Datafold output?

Datafold's cross-database diffing produces the following results:

* **High-Level Summary:**
  * Total number of differing rows
  * Total number of rows (primary keys) that are present in one database but not the other
  * Aggregate schema differences
* **Schema Differences:** Per-column mapping of data types, column order, etc.
* **Primary Key Differences:** Sample of specific rows that are present in one database but not the other.
* **Value-Level Differences:** Sample of differing column values for each column with identified discrepancies.

The full dataset of differences can be downloaded or materialized to the warehouse.

### How does a user run a data diff?

Users can run data diffs through the following methods:

* Via Datafold's interactive UI
* Via the Datafold API
* On a schedule (as a monitor) with optional alerting via Slack, email, PagerDuty, etc.

### Can I run multiple data diffs at the same time?

Yes, users can run as many diffs as they would like, with concurrency limited by the underlying database.

### What if my data is changing and replicated live? How can I ensure a proper comparison?

In such cases, we recommend using watermarking—diffing data within a specified time window of row creation or update (e.g., based on an `updated_at` timestamp).

### What if the data types do not match between source and target?

Datafold performs best-effort type matching for cases where deterministic type casting is possible, e.g., comparing a `VARCHAR` column with a `STRING` column. When automatic type casting without information loss is not possible, the user can define type casting manually by diffing in Query mode.
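Best-effort type matching boils down to bringing values from different type systems to a common representation before comparison. The sketch below is a simplified illustration (not Datafold's implementation) of normalizing values into canonical strings so that, for example, an `INT` from one database and a `DECIMAL` from another holding the same value compare as equal; the `float_precision` knob is an illustrative assumption, not a Datafold setting:

```python
from datetime import date, datetime
from decimal import Decimal

def normalize(value, float_precision=6):
    """Render a value as a canonical string for cross-database comparison.

    float_precision is an illustrative parameter, not a Datafold setting.
    """
    if value is None:
        return "<null>"
    if isinstance(value, bool):  # check before numerics: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, (float, Decimal)):
        return f"{value:.{float_precision}f}"
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    return str(value)

# A Decimal from one database and a float from another normalize identically:
assert normalize(Decimal("1.5")) == normalize(1.5)  # both "1.500000"
assert normalize("abc") == "abc"
```

When no such lossless canonical form exists, manual casting in Query mode (as described above) is the fallback.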
### Can data diff help if the dataset in the source and target databases has a different shape/schema/column naming?

Yes, users can reshape input datasets by writing a SQL query and diffing in Query mode to bring the datasets to a comparable shape. Datafold also supports column remapping for datasets with different column names between tables.

## Learn more

To learn more, check out our guide on [how cross-database diffing works](../data-diff/cross-database-diffing/creating-a-new-data-diff) in Datafold, or explore our extensive [FAQ section](../faq/data-migration-automation) covering cross-database diffing and data migration.

# Datafold Migration Agent

Automatically migrate data environments of any scale and complexity with Datafold's Migration Agent.

Datafold provides a full-cycle migration automation solution for data teams, which includes code translation and cross-database reconciliation.

## How does DMA work?

Datafold performs complete SQL codebase translation and validation using an AI-powered architecture. This approach leverages a large language model (LLM) with a feedback loop optimized for achieving full parity between the migration source and target. DMA analyzes metadata, including schema, data types, and relationships, to ensure accuracy in translation.

![datafold migration agent architecture](https://mintlify.s3-us-west-1.amazonaws.com/datafold/images/data-migration/datafold_migration_agent.png)

Datafold provides a comprehensive report at the end of the migration. This report includes links to data diffs validating parity and highlighting any discrepancies at the dataset, column, and row levels between the source and target databases.

## Why migrate with DMA?

Unlike traditional deterministic transpilers, DMA offers several distinct benefits:

* **Full parity between source and target:** DMA ensures not just code that compiles, but code that delivers the same results in your new database, complete with explicit validation.
* **Flexible dialect handling:** DMA can adapt to any arbitrary input/output dialect without requiring a full grammar definition, which is especially valuable for legacy systems.
* **Self-correction capabilities:** The AI-driven DMA can account for and correct mistakes based on both compilation errors and data discrepancies.
* **Modernizing code structure:** DMA can convert complex stored procedures into clean, modern formats such as dbt projects, following best practices.

## Getting started with DMA

**Want to learn more?** If you're interested in diving deeper, please take a moment to [fill out our intake form](https://nw1wdkq3rlx.typeform.com/to/VC2TbEbz) to connect with the Datafold team.

1. Connect your source and target data sources to Datafold.
2. Provide Datafold access to your codebase, typically by installing the Datafold GitHub/GitLab/ADO app or via system catalog access for stored procedures.

Once you connect your source and target systems and Datafold ingests the codebase, DMA's translation process is supervised by the Datafold team. In most cases, no additional input is required from the customer. The migration process timeline depends on the technologies, scale, and complexity of the migration. After setup, migrations typically take several days to several weeks.

## Security

Datafold is SOC 2 Type II, GDPR, and HIPAA compliant. We offer flexible deployment options, including in-VPC setups in AWS, GCP, or Azure. The LLM infrastructure is local, ensuring no data is exposed to external subprocessors beyond the cloud provider. For VPC deployments, data stays entirely within the customer's private network.

## FAQ

For more information, please see our extensive [FAQ section](../faq/data-migration-automation).

# Datafold for Migration Automation

Datafold provides full-cycle migration automation with SQL code translation and cross-database validation for data warehouse, transformation framework, and hybrid migrations.
Datafold offers flexible migration validation options to fit your data migration workflow. Data teams can choose to leverage the full power of the [Datafold Migration Agent (DMA)](../data-migration-automation/datafold-migration-agent) alongside [cross-database diffing](../data-diff/how-datafold-diffs-data#how-cross-database-diffing-works), or use ad-hoc diffing exclusively for validation. ## Supported migrations Datafold supports a wide range of migrations to meet the needs of modern data teams. The platform enables smooth transitions between different databases and transformation frameworks, ensuring both code translation and data validation throughout the migration process. Datafold can handle: * **Data Warehouse Migrations:** Seamlessly migrate between data warehouses, for example, from PostgreSQL to Databricks. * **Data Transformation Framework Migrations:** Transition your transformation framework from legacy stored procedures to modern tools like dbt. * **Hybrid Migrations:** Migrate across a combination of data platforms and transformation frameworks. For example, moving from MySQL + stored procedures to Databricks + dbt. ## Migration options The AI-powered Datafold Migration Agent (DMA) provides automated SQL code translation and validation to simplify and automate data migrations. Teams can pair DMA with ad-hoc cross-database diffing to enhance the validation process with additional manual checks when necessary. **How it works:** * **Step 1:** Connect your legacy and new databases to Datafold, along with your codebase. * **Step 2:** DMA translates and validates SQL code automatically. * **Step 3:** Pair the DMA output with ad-hoc cross-database diffing to reconcile data between legacy and new databases. This combination streamlines the migration process, offering automatic validation with the flexibility of manual diffing for fine-tuned control. 
For teams that prefer to handle code translation manually or are working with third-party migrations, Datafold's ad-hoc cross-database diffing is available as a stand-alone validation tool.

**How it works:**

* Validate data across databases manually without using DMA for code translation.
* Run ad-hoc diffing as needed via the [Datafold REST API](../api-reference/introduction), or schedule it with [Monitors](../data-monitoring) for continuous validation.

This option gives you full control over the migration validation process, making it suitable for in-house or outsourced migrations.

# Monitor Types

Monitoring your data for unexpected changes is one of the cornerstones of data observability. Datafold supports all your monitoring needs through a variety of different monitor types:

1. [**Data Diff**](/data-monitoring/monitors/data-diff-monitors) → Detect differences between any two datasets, within or across databases
2. [**Metric**](/data-monitoring/monitors/metric-monitors) → Identify anomalies in standard metrics like row count, freshness, and cardinality, or in any custom metric
3. [**Data Test**](/data-monitoring/monitors/data-test-monitors) → Validate your data with business rules and see specific records that fail your tests
4. [**Schema Change**](/data-monitoring/monitors/schema-change-monitors) → Receive alerts when a table schema changes

If you need help creating your first few monitors, deciding which type of monitor to use in a particular situation, or developing an overall monitoring strategy, please reach out via email ([support@datafold.com](mailto:support@datafold.com)) and our team of experts will be happy to assist.

# Monitors as Code

Manage Datafold monitors via version-controlled YAML for greater scalability, governance, and flexibility in code-based workflows.

**INFO** Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization.
This is particularly useful if any of the following are true:

* You have (or plan to have) 100s or 1000s of monitors
* Your team is accustomed to managing things in code
* Strict governance and change management are important to you

## Getting started

**INFO** This section describes how to get started with GitHub Actions, but the same concepts apply to other hosted version control platforms like GitLab and Bitbucket. Contact us if you need help getting started.

### Set up version control integration

To start using monitors as code, you'll need to decide which repository will contain your YAML configuration. If you've already connected a repository to Datafold, you could use that. Or, follow the instructions [here](/integrations/code-repositories) to connect a new repository.

### Generate a Datafold API key

If you already have a Datafold API key, use it. Otherwise, you can create a new one in the app by visiting **Settings > Account** and selecting **Create API Key**.

### Create monitors config

In your chosen repository, create a new YAML file where you'll define your monitors config. For this example, we'll name the file `monitors.yaml` and place it in the root directory, but neither of these choices is a hard requirement. Leave the file blank for now—we'll come back to it in a moment.

### Add CI workflow

If you're using GitHub Actions, create a new YAML file under `.github/workflows/` using the following template.
Be sure to tailor it to your particular setup:

```yaml
name: Apply monitors as code config to Datafold

on:
  push:
    branches:
      - main # or master

jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.12
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install datafold-sdk
      - name: Update monitors
        run: datafold monitors provision monitors.yaml # use the correct file name/path
        env:
          DATAFOLD_HOST: https://app.datafold.com # different for dedicated deployments
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} # remember to add to secrets
```

### Create a monitor

Now return to your YAML configuration file to add your first monitor. Reference the list of examples below and select one that makes sense for your organization.

## Examples

**INFO** These examples are intended to serve as inspiration and don't demonstrate every possible configuration. Contact us if you have any questions.

### Data Diff

[Data Diff monitors](/data-monitoring/monitors/data-diff-monitors) detect differences between any two datasets, within or across databases.
```yaml
monitors:
  replication_test_example:
    name: 'Example of a custom name'
    description: 'Example of a custom description'
    type: diff
    enabled: true
    datadiff:
      dataset_a:
        connection_id: 734
        table: db.schema.table
        time_travel_point: '2020-01-01'
      dataset_b:
        connection_id: 736
        table: db.schema.table1
        time_travel_point: '2020-01-01'
      primary_key:
        - pk_column
      columns_to_compare:
        - col1
      materialize_results: true
      column_remapping:
        col1: col2
      sampling:
        rate: 0.1
      ignore_string_case: true
    schedule:
      interval:
        every: hour
  replication_test_example_with_thresholds:
    type: diff
    enabled: true
    datadiff:
      dataset_a:
        connection_id: 734
        table: db.schema.table
      dataset_b:
        connection_id: 736
        table: db.schema.table2
        materialize: false
        session_parameters:
          k: v
      primary_key:
        - pk_column
      egress_limit: 100
      per_column_diff_limit: 10
    schedule:
      interval:
        every: hour
    alert:
      different_rows_count: 100
      different_rows_percent: 10
  replication_test_example_with_thresholds_and_notifications:
    type: diff
    enabled: true
    datadiff:
      dataset_a:
        connection_id: 734
        table: db.schema.table
      dataset_b:
        connection_id: 736
        table: db.schema.table3
      primary_key:
        - pk_column
    schedule:
      interval:
        every: hour
    notifications:
      - type: email
        recipients:
          - valentin@datafold.com
      - type: slack
        integration: 123
        channel: datafold-alerts
      - type: pagerduty
        integration: 123
      - type: webhook
        integration: 123
    alert:
      different_rows_count: 100
      different_rows_percent: 10
```

### Metric

[Metric monitors](/data-monitoring/monitors/metric-monitors) identify anomalies in standard metrics like row count, freshness, and cardinality, or in any custom metric.
```yaml
monitors:
  table_metric_example:
    type: metric
    enabled: true
    connection_id: 736
    metric:
      type: table
      table: db.schema.table
      filter: deleted is false
      metric: freshness # see full list of options below
    alert:
      type: automatic
      sensitivity: 10
    schedule:
      interval:
        every: hour
  column_metric_example:
    type: metric
    enabled: true
    connection_id: 736
    metric:
      type: column
      table: db.schema.table
      column: some_col
      filter: deleted is false
      metric: sum # see full list of options below
    alert:
      type: absolute
      max: 100
      min: 0
    tags:
      - oncall
      - action-required
    schedule:
      interval:
        every: hour
```

#### Supported metrics

For more details on supported metrics, see the docs for [Metric monitors](/data-monitoring/monitors/metric-monitors#metric-types).

**Table metrics:**

* Freshness: `freshness`
* Row Count: `row_count`

**Column metrics:**

* Cardinality: `cardinality`
* Uniqueness: `uniqueness`
* Minimum: `minimum`
* Maximum: `maximum`
* Average: `average`
* Median: `median`
* Sum: `sum`
* Standard Deviation: `std_dev`
* Fill Rate: `fill_rate`

### Data Test

[Data Test monitors](/data-monitoring/monitors/data-test-monitors) validate your data with business rules and surface specific records that fail your tests.

```yaml
monitors:
  data_test_example:
    type: test
    enabled: true
    connection_id: 736
    query: select 1 from db.schema.table
    schedule:
      interval:
        every: hour
    tags:
      - team_1
```

### Schema Change

[Schema Change monitors](/data-monitoring/monitors/schema-change-monitors) detect when changes occur to a table's schema.

```yaml
monitors:
  schema_change_example:
    type: schema
    enabled: true
    connection_id: 736
    table: db.schema.table
    schedule:
      interval:
        every: hour
    tags:
      - team_2
```

## FAQ

Yes, it's not all or nothing. You can still create/manage monitors in the app even if you're defining others in code.

By default, nothing—it remains in the app.
However, you can add the `--dangling-monitors-strategy [delete|pause]` flag to your `provision` command to either delete or pause monitors that have been removed from your code. For example:

```bash
datafold monitors provision monitors.yaml --dangling-monitors-strategy delete
```

Note: this only applies to monitors that were created from code, not those created in the UI.

No, any monitors created from code will be read-only in the app (though they can still be cloned).

Yes, please contact us and we'll be happy to assist.

## Need help?

If you have any questions about how to use monitors as code, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).

# Data Diff Monitors

Data Diff monitors compare datasets across or within databases, identifying row and column discrepancies with customizable scheduling and notifications.

## Ways to create a data diff monitor

There are three ways to create a data diff monitor:

1. From the **Monitors** page, by clicking **Create new monitor** and then selecting **Data diff** as the monitor type.
2. Clone an existing monitor by clicking **Actions** and then **Clone** in the header menu. This will pre-fill the form with the existing monitor configuration.
3. Create a monitor directly from the data diff results by clicking **Actions** and **Create monitor**. This will pre-fill the configuration with the parent data diff settings, requiring updates only for the **Schedule** and **Notifications** sections.

Once a monitor is created and initial metrics are collected, you can set up [thresholds](/data-monitoring/monitors/data-diff-monitors#monitoring) for the two metrics.

## Create a new data diff monitor

Setting up a new diff monitor in Datafold is straightforward. You can configure it with the following parameters and options:

### General

Choose how you want to compare your data and whether the diff type is in-database or cross-database. Pick your data connections.
Then, choose the two datasets you want to compare. This can be a table or a view in your relational database. If you need to compare just a subset of data (e.g., for a particular city or the last two weeks), add a SQL filter.

Select **Materialize inputs** to improve diffing speed when the query is compute-heavy, when filters are applied to non-indexed columns, or when primary keys are transformed using concatenation, coalesce, or another function.

### Column remapping

When columns are the same data type but are named differently, column remapping allows you to align and compare them. This is useful when datasets have semantically identical columns with different names, such as `userID` and `user_id`. Datafold will surface any differences under the column name used in Dataset A.

### Diff settings

#### Primary key

The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere, as it is used only for unique row identification during diffing. Select multiple columns to define a compound primary key.

#### Egress limit

The egress limit optimizes the diff process by terminating it once a predefined number of rows has been downloaded. By default, the limit is set to 1,000,000 rows. When the egress limit is reached, the diffing process is not deterministic and does not produce the same results each time it is run (i.e., the order in which data is processed may vary).

The egress limit prevents redundant analysis in scenarios with minor, repetitive discrepancies, such as formatting differences (e.g., whitespace, rounding differences). For most use cases, it is impractical to continue diffing after it is known that the datasets are substantially different.

Since the algorithm aims to detect and return **every** mismatched row/value, if the datasets have a large percentage of differing rows, the algorithm may be unable to take advantage of checksumming.
This can cause a large amount of data to be pulled over the network, which slows down the diffing process and increases the strain on the database. Setting an egress limit prevents unwanted runtime and database load by stopping the operation early in cases of substantial dataset discrepancies. It is highly recommended to set an egress limit, taking into account these tradeoffs between cost/speed and rigor.

#### Per-column diff limit

When a per-column diff limit is set, Data Diff stops identifying differences for any given column once the specified number of differences is found. Data Diff also stops searching for exclusive and duplicate primary keys after the limit is reached. Setting a per-column diff limit enables your team to find data quality issues that arise during data reconciliation while minimizing the compute and time spent searching for differences.

#### Columns to compare

Determine whether to compare all columns or only specific ones. To optimize performance on large tables, it's recommended to exclude columns known to have unique values for every row, such as timestamp columns like `updated_at`, or apply filters to limit the comparison scope.

#### Materialize diff results

Choose whether to store diff results in a table.

#### Sampling

Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets. There are two ways to enable sampling in Monitors: [Tolerance](#tolerance) and [% of Rows](#-of-rows).

**TIP** When should I use sampling tolerance instead of percent of rows? Each has its specific use cases and benefits; please [see the FAQ section](#sampling-tolerance-vs--of-rows) for a more detailed breakdown.

##### Tolerance

Tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (like nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling.
When sampling tolerance is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.

Default: 0.001%

###### Sampling confidence

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance. To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.

Default: 99%

###### Sampling threshold

Sampling will be disabled if the total row count of the largest table is less than the threshold value.

###### Sample size

This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:

* The presence of duplicate primary keys in the datasets will likely increase this estimate, as it inflates the perceived uniqueness of rows
* Applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope

The number of rows we sample is not fixed; instead, we use a statistical approach based on the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (compare every single row) to establish a baseline.

Example: Imagine there are two datasets we want to compare, Main and Test. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset.
To do so, we set the following parameters:

* Sampling tolerance: 5%
* Sampling confidence: 95%

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.

##### % of rows

Percent of rows sampling defines the proportion of the dataset to be included in the sample by specifying a percentage of the total number of rows. For example, setting the sampling percentage to 0.1% means that only 0.1% of the total rows will be sampled for analysis or comparison.

When percent of rows sampling is enabled, a fixed percentage of rows is selected randomly from the dataset. This method simplifies the sampling process, making it easy to understand and configure without needing to adjust complex statistical parameters. However, it lacks the statistical assurances provided by methods like sampling tolerance. It doesn't dynamically adjust based on data characteristics or discrepancies but rather adheres strictly to the specified percentage, regardless of the dataset's variability.

This straightforward approach is ideal for scenarios where simplicity and quick setup are more important than precision and statistical confidence. It provides a basic yet effective way to estimate the dataset's characteristics or differences, suitable for less critical data validation tasks.

###### Sampling rate

This refers to the percentage of the total number of rows in the largest table that will be used to determine the sample size. This ensures that the sample size is proportionate to the size of the dataset, providing a representative subset for comparison.
For instance, if the largest table contains 1,000,000 rows and the sampling rate is set to 1%, the sample size will be 10,000 rows.

###### Sampling threshold

Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.

###### Sampling size

This parameter is the [same one used in sampling tolerance](#sample-size).

### Add scheduling

Customize the frequency and timing of monitor executions. You can choose a specific hourly or daily time in UTC, or input a cron tab expression for more complex scheduling:

### Add notifications

You can add notifications, sent through Slack or email, which indicate whether a monitor has been executed. Notifications are sent when either or both predefined thresholds are reached during a diff monitor run. You can set a maximum threshold for the:

* Number of different rows
* Percentage of different rows

## Results

The diff monitor run history shows the results from each run. Each run includes basic stats, along with metrics such as:

* **Total rows different:** the number of different rows according to data diff results.
* **Rows with different values:** the percentage of different rows relative to the total number of rows in dataset A, according to data diff results.

Note that the status `Different` doesn't automatically map into a notification/alert. Click the **Open Diff** link for more granular information about a specific Data Diff.
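To build intuition for the sampling parameters described above, here is a minimal back-of-the-envelope sketch in Python. The helper names are hypothetical, and the tolerance-based formula is the textbook detection-probability rule, not Datafold's actual Poisson-based algorithm; it only illustrates how a confidence level and a tolerance jointly imply a minimum sample size, next to the much simpler "% of rows" sizing:

```python
import math

def min_sample_size(confidence: float, tolerance: float) -> int:
    """Smallest n such that a uniform random sample of n rows contains at
    least one differing row with probability >= `confidence`, assuming the
    true rate of differing rows equals `tolerance`.

    Derived from: P(miss all differences) = (1 - tolerance)^n <= 1 - confidence.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - tolerance))

def percent_of_rows_sample(total_rows: int, rate_percent: float) -> int:
    """Sample size under the '% of rows' strategy: a fixed share of the
    largest table's row count."""
    return int(total_rows * rate_percent / 100)

# Tolerance-based sizing for 95% confidence and 5% tolerance:
print(min_sample_size(0.95, 0.05))             # 59
# '% of rows' sizing: 1% of a 1,000,000-row table:
print(percent_of_rows_sample(1_000_000, 1.0))  # 10000
```

Note how sharply the tolerance-based size grows as the tolerance shrinks: detecting a 0.001% difference rate with the same confidence requires hundreds of thousands of sampled rows, which is why sampling is disabled outright for small tables.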
## FAQ

Use sampling tolerance when you need statistical confidence in your results, as it is more efficient and stops sampling once a difference is confidently detected. This method is ideal for critical data validation tasks that require precise accuracy. On the other hand, use the percent of rows method for its simplicity and ease of use, especially in less critical scenarios where you just need a straightforward, quick sampling approach without worrying about statistical parameters. This method is perfect for general, easy-to-understand sampling needs.

If you have any questions about how to use Data Diff monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).

# Data Test Monitors

Data Test monitors validate your data against business rules written as SQL queries, materializing failed records in a temporary table for review in Datafold.

Data Test monitors allow you to validate your data with business rules written as custom SQL queries. Failed records are materialized to a temporary table in your warehouse, and you can view a sample of the records in Datafold.

**INFO** Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization.

Think of these monitors as pass/fail—either the query returns no records (pass) or it returns at least one record (fail). Data Tests are extremely flexible, but common use cases are validating custom business rules, referential integrity between tables, and data formatting (see examples below).

## Create a Data Test monitor

There are two ways to create a Data Test monitor:

1. Open the **Monitors** page, select **Create new monitor**, and then choose **Data Test**.
2. Clone an existing Data Test monitor by clicking **Actions** and then **Clone**. This will pre-fill the form with the existing monitor configuration.

## Set up your monitor

Give your monitor a descriptive name.
For example, if your test confirms that records in a table of transactions have an amount that's greater than zero, you might call it `Transaction amount > 0`.

Then, select your data connection and write your test. Keep in mind your query should return records that *fail* the test. Continuing with the example above, if transaction amounts should be *greater than zero*, your query should return records with amounts *less than or equal to zero*:

```sql
SELECT *
FROM transactions
WHERE amount <= 0
```

## Add a schedule

Customize the frequency and timing of monitor runs. You can choose a specific hourly or daily time in UTC, or input a cron tab expression for more complex scheduling:

## Add notifications

Receive notifications via Slack or email when at least one record fails your test:

## Example queries

### Custom business rule

Say your company defines active users as individuals who have signed into your application at least 3 times in the past week. You could write a test that validates this logic by checking for users marked as active who haven't reached this threshold:

```sql
SELECT *
FROM users
WHERE status = 'active'
  AND signins_last_7d < 3;
```

### Referential integrity

Now let's assume you have contacts and accounts in your warehouse. Every contact belongs to exactly one account, and you want to confirm that every account referenced in the contacts table exists in the accounts table.
In this case, you'd look for references to accounts that are missing from the accounts table:

```sql
SELECT *
FROM contacts
LEFT JOIN accounts ON contacts.account_id = accounts.id
WHERE contacts.account_id IS NOT NULL
  AND accounts.id IS NULL;
```

### Data formatting

Finally, if you wanted to validate that all phone numbers in your contacts table are 10 digits and only contain numbers, you'd return records that are not 10 digits or use non-numeric characters:

```sql
SELECT *
FROM contacts
WHERE LENGTH(phone_number) != 10
   OR phone_number REGEXP '[^0-9]';
```

## Attach CSVs to notifications

Datafold allows attaching a CSV of failed records to notifications in Slack, email, etc. This is useful if, for example, you have business users who don't have a Datafold license but need to know about records that fail your tests. This option is configured separately per notification destination as shown here:

![Attach CSVs to Data Tests notifications](https://mintlify.s3-us-west-1.amazonaws.com/datafold/images/data-test-csv-1.png)

CSV attachments are limited to the lesser of 1,000 rows or 1 MB in file size.

### Attaching CSVs in Slack

In order to attach CSVs to Slack notifications, you need to complete 1-2 additional steps:

1. If you installed the Datafold Slack app prior to October 2024, you'll need to reinstall the app by visiting **Settings > Integrations > Notifications**, selecting your Slack integration, then **Reinstall Slack integration**.
2. Invite the Datafold app to the channel you wish to send notifications to using the `/invite` command shown below:

![Invite Datafold app to Slack channel](https://mintlify.s3-us-west-1.amazonaws.com/datafold/images/data-test-csv-2.png)

## Run Tests in CI

Standard Data Tests run on a schedule against your production data. But often it's useful to test data before it gets to production as part of your deployment workflow. For this reason, Datafold supports running tests in CI.
Data Tests in CI work very similarly to our [Monitors as Code](/data-monitoring/monitors-as-code) feature, in the sense that you define your tests in a version-controlled YAML file. You then use the Datafold SDK to execute those tests as part of your CI workflow.

### Write your tests

First, create a new file (e.g. `tests.yaml`) in the root of your repository. Then write your tests using the same format described in our [Monitors as Code](/data-monitoring/monitors-as-code) docs, with two exceptions:

1. Add a `run_in_ci` flag to each test and set it to `true` (assuming you'd like to run the test)
2. (Optional) Add placeholders for variables that you'd like to populate dynamically when executing your tests

Here's an example:

```yaml
monitors:
  null_pk_test:
    type: test
    name: No NULL pk in the users table
    run_in_ci: true
    connection_id: 8
    query: select * from {{ schema }}.USERS where id is null
  duplicate_pk_test:
    type: test
    name: No duplicate pk in the users table
    run_in_ci: true
    connection_id: 8
    query: |
      select * from {{ schema }}.USERS
      where id in (
        select id from {{ schema }}.USERS
        group by id
        having count(*) > 1
      );
```

### Execute your tests

**INFO** This section describes how to get started with GitHub Actions, but the same concepts apply to other hosted version control platforms like GitLab and Bitbucket. Contact us if you need help getting started.

If you're using GitHub Actions, create a new YAML file under `.github/workflows/` using the following template.
Be sure to tailor it to your particular setup:

```yaml
on:
  push:
    branches:
      - main
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/checkout@v2
        with:
          token: ${{ secrets.GH_TOKEN }}
          repository: datafold/datafold-sdk
          path: datafold-sdk
          ref: data-tests-in-ci-demo
      - uses: actions/setup-python@v2
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Set schema env var in PR
        run: |
          echo "SCHEMA=ANALYTICS.PR" >> $GITHUB_ENV
        if: github.event_name == 'pull_request'
      - name: Set schema env var in main
        run: |
          echo "SCHEMA=ANALYTICS.CORE" >> $GITHUB_ENV
        if: github.event_name == 'push'
      - name: Run tests
        run: |
          datafold tests run --var schema:$SCHEMA --ci-config-id 1 tests.yaml # use the correct file name/path
        env:
          DATAFOLD_HOST: https://app.datafold.com # different for dedicated deployments
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} # remember to add to secrets
```

### View the results

When your CI workflow is triggered (e.g. by a pull request), you can view the terminal output for your test results:

## Need help?

If you have any questions about how to use Data Test monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).

# Metric Monitors

Metric monitors detect anomalies in your data using ML-based algorithms or manual thresholds, supporting standard and custom metrics for tables or columns.

**INFO** Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization.

Metric monitors allow you to perform anomaly detection—either automatically using our ML-based algorithm or by setting manual thresholds—on the following metric types:

1. Standard metrics (e.g. row count, freshness, and cardinality)
2. Custom metrics (e.g. sales volume per region)

## Create a Metric monitor

There are two ways to create a Metric monitor:

1. Open the **Monitors** page, select **Create new monitor**, and then choose **Metric**.
2. Clone an existing Metric monitor by clicking **Actions** and then **Clone**. This will pre-fill the form with the existing monitor configuration.

## Set up your monitor

Select your data connection, then choose the type of metric you'd like: **Table**, **Column**, or **Custom**. If you select table or column, you have the option to add a SQL filter to refine your dataset. For example, you could implement a 7-day rolling time window with the following: `timestamp >= dateadd(day, -7, current_timestamp)`. Please ensure the SQL is compatible with your selected data connection.

## Metric types

### Table metrics

| Metric    | Definition                        | Additional Notes                                                                                               |
| --------- | --------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| Freshness | Time since table was last updated | Measured in minutes. Derived from INFORMATION\_SCHEMA. Only supported for Snowflake, BigQuery, and Databricks. |
| Row Count | Total number of rows              |                                                                                                                |

### Column metrics

| Metric             | Definition                     | Supported Column Types | Additional Notes           |
| ------------------ | ------------------------------ | ---------------------- | -------------------------- |
| Cardinality        | Number of distinct values      | All types              |                            |
| Uniqueness         | Proportion of distinct values  | All types              | Proportion between 0 and 1 |
| Minimum            | Lowest numeric value           | Numeric columns        |                            |
| Maximum            | Highest numeric value          | Numeric columns        |                            |
| Average            | Mean value                     | Numeric columns        |                            |
| Median             | Median value (50th percentile) | Numeric columns        |                            |
| Sum                | Sum of all values              | Numeric columns        |                            |
| Standard Deviation | Measure of data spread         | Numeric columns        |                            |
| Fill Rate          | Proportion of non-null values  | All types              | Proportion between 0 and 1 |

### Custom metrics

Our custom metric framework is extremely flexible and supports several approaches to defining metrics. Depending on the approach you choose, your query should return some combination of the following columns:

* **Metric value (required):** a numeric column containing your *metric values*
* **Timestamp (optional):** a date/time column containing *timestamps* corresponding to your metric values
* **Group (optional):** a string column containing *groups/dimensions* for your metric

**INFO** The names and order of your columns don't matter. Datafold will automatically infer their meaning based on data type.

The following questions will help you decide which approach is best for you:

1. **Do you want to group your metric by the value of a column in your query?** For example, if your metric is *sales volume per day*, rather than looking at a single metric that encompasses all sales globally, it might be more informative to group by country. In this case, Datafold will automatically compute sales volume separately for each country to assist with root cause analysis when there's an unexpected change.
2. **Will your query return a single metric value (per group, if relevant) on every monitor run, or an entire time series?** We generally recommend starting with the simpler approach of providing a single metric value (per group) per monitor run. However, if you've already defined a time series elsewhere (e.g. in your BI tool) and simply want to copy/paste that query into Datafold, then you may prefer the latter approach. **INFO** Datafold will only log a single data point per timestamp per group, which means you should only send data for a particular time period once that period is complete.
3. **If your metric returns a single value per monitor run, will you provide your own timestamps or use the timestamps of monitor runs?** If your query returns a single value per run, we generally recommend letting Datafold provide timestamps based on monitor runs unless you have a compelling reason to provide your own. For example, if your metric always lags by one day, you could explicitly associate yesterday's date with each observation.

As you're writing your query, Datafold will let you know if the result set doesn't match one of the accepted patterns. If you have questions, please contact us and we'll be happy to help.

## Configure anomaly detection

Enable anomaly detection to get the most out of metric monitors. You have several options:

* **Automatic:** our automated anomaly detection uses machine learning to flag metric values that are out of the ordinary. Dial the sensitivity up or down depending on how many alerts you'd like to receive.
* **Manual:** specific thresholds beyond which you'd like the monitor to trigger an alert. **Fixed Values** are specific minimum and/or maximum values, while **Percent Change** measures the magnitude of change from one observation to the next.

## Add a schedule

Customize the frequency and timing of monitor runs.
You can choose a specific hourly or daily time in UTC, or input a cron expression for more complex scheduling:

## Add notifications

Send notifications via Slack or email when your monitor exceeds a threshold (automatic or manual):

## Need help?

If you have any questions about how to use Metric monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).

# Schema Change Monitors

Schema Change monitors notify you when a table's schema changes, such as when columns are added, removed, or data types are modified.

**INFO** Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization.

Schema Change monitors alert you when a table's schema changes in any of the following ways:

* Column added
* Column removed
* Data type changed

## Create a Schema Change monitor

There are two ways to create a Schema Change monitor:

1. Open the **Monitors** page, select **Create new monitor**, and then choose **Schema Change**.
2. Clone an existing Schema Change monitor by clicking **Actions** and then **Clone**. This will pre-fill the form with the existing monitor configuration.

## Set up your monitor

To set up a Schema Change monitor, simply select your data connection and the table you wish to monitor for changes.

## Add a schedule

Customize the frequency and timing of monitor runs. You can choose a specific hourly or daily time in UTC, or input a cron expression for more complex scheduling:

## Add notifications

Receive notifications via Slack or email when a schema change is detected:

## FAQ

**Don't data diffs already detect schema changes?**

Yes, but in a different context. While data diffs report on schema differences *between two tables at the same time* (unless you're using the time travel feature), Schema Change monitors alert you to schema changes for the *same table over time*.

## Need help?
If you have any questions about how to use Schema Change monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).

# Deployment Options

Datafold is a web-based application with multiple deployment options, including multi-tenant SaaS and dedicated cloud (either customer- or Datafold-hosted).

## SaaS / Multi-Tenant

Our standard multi-tenant deployment is a cost-efficient option for most teams and is available in two AWS regions:

| Region Name      | Region      | Sign-Up Page                                                               |
| :--------------- | :---------- | :------------------------------------------------------------------------- |
| US West (Oregon) | `us-west-2` | [https://app.datafold.com/org-signup](https://app.datafold.com/org-signup) |
| Europe (Ireland) | `eu-west-1` | [https://eu.datafold.com/org-signup](https://eu.datafold.com/org-signup)   |

For additional security, we provide the following options:

1. [IP Whitelisting](/security/securing-connections#ip-whitelisting): only allow access from specific IP addresses
2. [AWS PrivateLink](/security/securing-connections#private-link): set up a private network endpoint to access your RDS in the same region
3. [VPC Peering](/security/securing-connections#vpc-peering-saas): securely join two networks together
4. [SSH Tunnel](/security/securing-connections#ssh-tunnel): set up a secure tunnel between your network and Datafold, with the SSH server on your side
5. [IPSec Tunnel](/security/securing-connections#ipsec-tunnel): set up an IPSec tunnel between your network and Datafold

## Dedicated Cloud

We also offer a single-tenant deployment of the Datafold application in a dedicated Virtual Private Cloud (VPC). The options are (from least to most complex):

1. **Datafold-hosted, Datafold-managed**: the Cloud account belongs to Datafold and we manage the Datafold application for you.
2. **Customer-hosted, Datafold-managed**: the Cloud account belongs to you, but we manage the Datafold application for you.
3.
**Customer-hosted, Customer-managed**: the Cloud account belongs to you and you manage the Datafold application. Datafold does not have access.

Dedicated Cloud can be deployed to all major cloud providers:

* [AWS](/datafold-deployment/dedicated-cloud/aws)
* [GCP](/datafold-deployment/dedicated-cloud/gcp)
* [Azure](/datafold-deployment/dedicated-cloud/azure)

**VPC vs. VNet** We use the term VPC across all major cloud providers. However, Azure refers to this concept as a Virtual Network (VNet).

### Datafold Dedicated Cloud FAQ

Dedicated Cloud deployment may be the preferred deployment method for customers with special privacy and security requirements and for those in highly regulated domains. In a Dedicated Cloud deployment, the entire Datafold stack runs on dedicated cloud infrastructure and network, which usually means it:

1. Is not accessible from the public Internet (it sits behind the customer's VPN)
2. Uses an internal network to communicate with the customer's databases and other resources, so none of the data is sent over public networks

Datafold is deployed to the customer's cloud infrastructure but is fully managed by Datafold. The only DevOps involvement needed from the customer's side is to set up a cloud project and role (steps #1 and #2 below).

1. Customer creates a Datafold-specific namespace in their cloud account (subaccount in AWS / project in GCP / subscription or resource group in Azure)
2. Customer creates a Datafold-specific IAM resource with permissions to deploy into the Datafold-specific namespace
3. Datafold Infrastructure team provisions the Datafold stack on the customer's infrastructure using a fully automated Terraform procedure
4.
Customer and Datafold Infrastructure teams collaborate to implement the security and networking requirements; see [all available options](#additional-security-dedicated-cloud)

See cloud-specific instructions here:

* [AWS](/datafold-deployment/dedicated-cloud/aws)
* [GCP](/datafold-deployment/dedicated-cloud/gcp)
* [Azure](/datafold-deployment/dedicated-cloud/azure)

After the initial deployment, the Datafold team uses the same procedure to roll out software updates and perform maintenance to meet the uptime SLA.

Datafold is deployed in the customer's region of choice on AWS, GCP, or Azure, in an account that is owned and managed by Datafold. We collaborate to implement the security and networking requirements, ensuring that traffic either does not cross the public internet or, if it does, does so securely. All available options are listed below.

This deployment method follows the same process as the standard customer-hosted deployment (see above), but with a key difference: the customer is responsible for managing both the infrastructure and the application. Datafold engineers do not have any access to the deployment in this case.

We offer open-source projects that facilitate this deployment, with examples for every major cloud provider. You can find these projects on GitHub:

* [AWS](https://github.com/datafold/terraform-aws-datafold)
* [GCP](https://github.com/datafold/terraform-google-datafold)
* [Azure](https://github.com/datafold/terraform-azure-datafold)

Each of these projects uses a Helm chart for deploying the application. The Helm chart is also available on GitHub:

* [Helm Chart](https://github.com/datafold/helm-charts)

By providing these open-source projects, Datafold enables you to integrate the deployment into your own infrastructure, including existing clusters. This allows your infrastructure team to manage the deployment effectively.

**Deployment Secrets:** Datafold provides the necessary secrets for downloading images as part of the license agreement.
Without this agreement, the deployment will not complete successfully.

Because the Datafold application is deployed in a dedicated VPC, your databases/integrations are not directly accessible when they are not exposed to the public Internet. The following solutions enable secure connections to your databases/integrations without exposing them to the public Internet:

**AWS:**

1. [PrivateLink](/security/securing-connections?current-cloud=aws#private-link "PrivateLink")
2. [VPC Peering](/security/securing-connections#vpc-peering-dedicated-cloud "VPC Peering")
3. [SSH Tunnel](/security/securing-connections#ssh-tunnel "SSH Tunnel")
4. [IPSec Tunnel](/security/securing-connections#ipsec-tunnel "IPSec Tunnel")

**GCP:**

1. [Private Service Connect](/security/securing-connections?current-cloud=gcp#private-link "Private Service Connect")
2. [VPC Peering](/security/securing-connections#vpc-peering-dedicated-cloud "VPC Peering")
3. [SSH Tunnel](/security/securing-connections#ssh-tunnel "SSH Tunnel")

**Azure:**

1. [Private Link](/security/securing-connections?current-cloud=azure#private-link "Private Link")
2. [VNet Peering](/security/securing-connections#vpc-peering-dedicated-cloud "VNet Peering")
3. [SSH Tunnel](/security/securing-connections#ssh-tunnel "SSH Tunnel")

Please inquire with [sales@datafold.com](mailto:sales@datafold.com) about customer-managed deployment options.

# Datafold VPC Deployment on AWS

Learn how to deploy Datafold in a Virtual Private Cloud (VPC) on AWS.

**INFO** VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.

## Create a Domain Name (optional)

You can either choose to use your own domain (for example, `datafold.domain.tld`) or to use a Datafold-managed domain (for example, `yourcompany.dedicated.datafold.com`).

### Customer Managed Domain Name

Create a DNS A-record for the domain where Datafold will be hosted.
For the DNS record, there are two options:

* **Public-facing:** When the domain is publicly available, we will provide an SSL certificate for the endpoint.
* **Internal:** It is also possible to keep Datafold disconnected from the internet. This requires an internal DNS record (for example, in AWS Route 53) that points to the Datafold instance. You can provide your own certificate for setting up the SSL connection.

Once the deployment is complete, you will point that A-record to the IP address of the Datafold service.

## Give Datafold Access to AWS

To set up Datafold, you need to create a separate account within your organization where we can deploy Datafold. We follow the [best practices of AWS for allowing third-party access](https://docs.aws.amazon.com/IAM/latest/UserGuide/id%5Froles%5Fcommon-scenarios%5Fthird-party.html).

### Create a separate AWS account for Datafold

First, create a new account for Datafold. Go to **My Organization** to add an account:

Click **Add an AWS Account**:

You can name this account anything that helps identify it clearly. In our examples, we name it **Datafold**. Make sure that the email address of the owner isn't used by another account.

When you click the **Create AWS Account** button, you'll be returned to the organization screen and see a notification that the new account is being created. After refreshing a few minutes later, the account should appear in the organization's list.

### Grant Third-Party access to Datafold

To make sure that deployment runs as expected, your Datafold Support Engineer may need access to the Datafold-specific AWS account that you created. The access can be revoked after the deployment if needed.

To grant access, log into the account created in the previous step. You can switch to the newly created account using the [Switch Role page](https://signin.aws.amazon.com/switchrole):

By default, the role name is **OrganizationAccountAccessRole**.
Click **Switch Role** to log in to the Datafold account.

## Grant Access to Datafold

Next, we need to allow Datafold to access the account. We do this by allowing the Datafold AWS account to access your AWS workspace.

Go to the [IAM page](https://console.aws.amazon.com/iam/home) or type **IAM** in the search bar:

Go to the Roles page, and click the **Create Role** button:

Select **Another AWS Account**, and use account ID `710753145501`, which is Datafold's account ID. Select **Require MFA** and click **Next: Permissions**.

On the Permissions page, attach the **AdministratorAccess** permissions for Datafold to have control over the resources within the account, or see [Minimal IAM Permissions](#minimal-iam-permissions).

Next, you can set **Tags**; however, they are not a requirement.

Finally, give the role a name of your choice. Be careful not to duplicate the account name. If you named the account in an earlier step `Datafold`, you may want to name the role `Datafold-role`. Click **Create Role** to complete this step.

Now that the role is created, you should be routed back to a list of roles in your organization. Click on your newly created role to get a shareable link for the account and store this in your password manager. When setting up your deployment with a support engineer, Datafold will use this link to gain access to the account.

After validating the deployment with your support engineer, and making sure that everything works as it should, we will let you know when it's safe to revoke the credentials.

### Minimal IAM Permissions

Because we work in an account dedicated to Datafold, there is no direct access to your resources unless explicitly configured (e.g., VPC Peering). The following IAM policy is required to update and maintain the infrastructure.
```JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "acm:AddTagsToCertificate",
        "acm:DeleteCertificate",
        "acm:DescribeCertificate",
        "acm:GetCertificate",
        "acm:ListCertificates",
        "acm:ListTagsForCertificate",
        "acm:RemoveTagsFromCertificate",
        "acm:RequestCertificate",
        "acm:UpdateCertificateOptions",
        "autoscaling:*",
        "ec2:*",
        "eks:*",
        "elasticloadbalancing:*",
        "iam:GetPolicy",
        "iam:GetPolicyVersion",
        "iam:GetOpenIDConnectProvider",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:GetUserPolicy",
        "iam:GetUser",
        "iam:ListAccessKeys",
        "iam:ListAttachedRolePolicies",
        "iam:ListGroupsForUser",
        "iam:ListInstanceProfilesForRole",
        "iam:ListPolicies",
        "iam:ListPolicyVersions",
        "iam:ListRolePolicies",
        "iam:PassRole",
        "iam:TagOpenIDConnectProvider",
        "iam:TagPolicy",
        "iam:TagRole",
        "iam:TagUser",
        "kms:CreateAlias",
        "kms:CreateGrant",
        "kms:CreateKey",
        "kms:Decrypt",
        "kms:DeleteAlias",
        "kms:DescribeKey",
        "kms:DisableKey",
        "kms:GenerateDataKey",
        "kms:GetKeyPolicy",
        "kms:GetKeyRotationStatus",
        "kms:ListAliases",
        "kms:ListResourceTags",
        "kms:PutKeyPolicy",
        "kms:RevokeGrant",
        "kms:ScheduleKeyDeletion",
        "kms:TagResource",
        "logs:CreateLogGroup",
        "logs:DeleteLogGroup",
        "logs:DescribeLogGroups",
        "logs:ListTagsLogGroup",
        "logs:PutRetentionPolicy",
        "logs:TagResource",
        "rds:*",
        "s3:*"
      ],
      "Resource": "*"
    }
  ]
}
```

Some permissions we only need from time to time, for example, during the first deployment. Since those are IAM-related, we will ask for temporary permissions when required.
```JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:AttachRolePolicy",
        "iam:CreateAccessKey",
        "iam:CreateOpenIDConnectProvider",
        "iam:CreatePolicy",
        "iam:CreateRole",
        "iam:CreateUser",
        "iam:DeleteAccessKey",
        "iam:DeleteOpenIDConnectProvider",
        "iam:DeletePolicy",
        "iam:DeleteRole",
        "iam:DeleteRolePolicy",
        "iam:DeleteUser",
        "iam:DeleteUserPolicy",
        "iam:DetachRolePolicy",
        "iam:PutRolePolicy",
        "iam:PutUserPolicy"
      ],
      "Resource": "*"
    }
  ]
}
```

# Datafold VPC Deployment on Azure

Learn how to deploy Datafold in a Virtual Private Cloud (VPC) on Azure.

**INFO** VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.

## Create a Domain Name (optional)

You can either choose to use your own domain (for example, `datafold.domain.tld`) or to use a Datafold-managed domain (for example, `yourcompany.dedicated.datafold.com`).

### Customer Managed Domain Name

Create a DNS A-record for the domain where Datafold will be hosted. For the DNS record, there are two options:

* **Public-facing:** When the domain is publicly available, we will provide an SSL certificate for the endpoint.
* **Internal:** It is also possible to keep Datafold disconnected from the internet. This requires an internal DNS record that points to the Datafold instance. You can provide your own certificate for setting up the SSL connection.

Once the deployment is complete, you will point that A-record to the IP address of the Datafold service.

## Create a New Subscription

For isolation reasons, it is best practice to [create a new subscription](https://learn.microsoft.com/en-us/azure/cost-management-billing/manage/create-subscription) within your Microsoft Entra directory/tenant. Please call it something like `yourcompany-datafold` to make it easy to identify.

## Set IAM Permissions

Go to **Microsoft Entra ID** and navigate to **Users**.
Click **Add**, then **User**, then **Invite external user**, and add the Datafold engineers.

Next, assign permissions on the subscription:

* Navigate to the subscription you just created. Go to **Access control (IAM)** in the sidebar. Under **Add**, select **Add role assignment**.
* Under **Role**, navigate to **Privileged administrator roles** and select **Owner**.
* Under **Members**, click **Select members** and add the Datafold engineers.
* When you are done, select **Review + assign**.

# Datafold VPC Deployment on GCP

Learn how to deploy Datafold in a Virtual Private Cloud (VPC) on GCP.

**INFO** VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.

## Create a Domain Name (optional)

You can either choose to use your own domain (for example, `datafold.domain.tld`) or to use a Datafold-managed domain (for example, `yourcompany.dedicated.datafold.com`).

### Customer Managed Domain Name

Create a DNS A-record for the domain where Datafold will be hosted. For the DNS record, there are two options:

* **Public-facing:** When the domain is publicly available, we will provide an SSL certificate for the endpoint.
* **Internal:** It is also possible to keep Datafold disconnected from the internet. This requires an internal DNS record that points to the Datafold instance. You can provide your own certificate for setting up the SSL connection.

Once the deployment is complete, you will point that A-record to the IP address of the Datafold service.

## Create a New Project

For isolation reasons, it is best practice to [create a new project](https://console.cloud.google.com/projectcreate) within your GCP organization. Please call it something like `yourcompany-datafold` to make it easy to identify:

After a minute or so, you should receive confirmation that the project has been created.
Afterward, you should be able to see the new project.

## Set IAM Permissions

Navigate to the **IAM** tab in the sidebar and click **Grant Access** to invite Datafold to the project. Add your Datafold solutions engineer as a **principal**.

You have two options for assigning IAM permissions to the Datafold engineers:

1. Assign them as an **Owner** of your project.
2. Assign the extended set of [Minimal IAM Permissions](#minimal-iam-permissions).

The Owner role is only required temporarily while we configure and test the initial Datafold deployment. We'll let you know when it is OK to revoke this permission and grant us only the [Minimal IAM Permissions](#minimal-iam-permissions).

### Required APIs

The following GCP APIs need to be enabled additionally to run Datafold:

1. [Compute Engine API](https://console.cloud.google.com/apis/library/compute.googleapis.com)
2. [Secret Manager API](https://console.cloud.google.com/apis/api/secretmanager.googleapis.com)

The following GCP APIs that we use are already enabled by default when the project is created:

1. [Cloud Logging API](https://console.cloud.google.com/apis/api/logging.googleapis.com)
2. [Cloud Monitoring API](https://console.cloud.google.com/apis/api/monitoring.googleapis.com)
3. [Cloud Storage](https://console.cloud.google.com/apis/api/storage-component.googleapis.com)
4. [Service Networking API](https://console.cloud.google.com/apis/api/servicenetworking.googleapis.com)

Once access has been granted, make sure to notify Datafold so we can initiate the deployment.

### Minimal IAM Permissions

Because we work in a project dedicated to Datafold, there is no direct access to your resources unless explicitly configured (e.g., VPC Peering). The following IAM roles are required to update and maintain the infrastructure.
```Bash
cloudsql.admin
compute.loadBalancerAdmin
compute.networkAdmin
compute.securityAdmin
compute.storageAdmin
container.admin
container.clusterAdmin
iam.roleViewer
iam.serviceAccountUser
iap.tunnelResourceAccessor
storage.admin
viewer
```

Some roles we only need from time to time, for example, during the first deployment. Since those are IAM-related, we will ask for temporary permissions when required.

```Bash
iam.roleAdmin
iam.securityAdmin
iam.serviceAccountKeyAdmin
iam.serviceAccountAdmin
serviceusage.serviceUsageAdmin
```

# Best Practices

Explore best practices for CI/CD testing in Datafold.

Optimize time and cost by choosing which downstream tables to diff. Learn how to prevent and manage data drift in CI pipelines.

# Handling Data Drift

Ensuring Datafold in CI executes an apples-to-apples comparison between staging and production environments.

**Note** This section of the docs is only relevant if the data used as inputs during the PR build is inconsistent with the data used as inputs during the last production build. Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to learn more.

## What is data drift in CI?

Datafold is used in CI to illuminate the impact of a pull request's proposed code change by comparing two versions of the data and identifying differences. **Data drift in CI** happens when those data differences occur due to *changes in upstream data sources*, not because of proposed code changes.

Data drift in CI adds "noise" to your CI testing analysis, making it tricky to tell whether data differences are due to new code or changes in the source data. Unless both versions rely on the same snapshot of upstream data, data drift can compromise your ability to see the true effect of the code changes.
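To make this concrete, here is a toy illustration (the model logic, names, and data are invented, not Datafold's implementation): identical transformation code produces different outputs once an upstream row arrives between the production build and the PR build, creating a "diff" that has nothing to do with the code change.

```python
# Toy illustration of data drift in CI (names and data are invented).
# The "model" code is identical in both builds; only the upstream data changed.

def transform(orders):
    # Example transformation: total order value per user.
    totals = {}
    for user, amount in orders:
        totals[user] = totals.get(user, 0) + amount
    return totals

prod_snapshot = [("alice", 10), ("bob", 5)]            # upstream at prod build time
pr_snapshot = [("alice", 10), ("bob", 5), ("bob", 3)]  # a new row arrived since then

# Same code, different inputs -> the outputs differ even though the PR
# changed nothing. That difference is data drift, not a code effect.
print(transform(prod_snapshot))  # {'alice': 10, 'bob': 5}
print(transform(pr_snapshot))    # {'alice': 10, 'bob': 8}
```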
**Tip** dbt users should implement Slim CI in [dbt Core](https://www.datafold.com/blog/taking-your-dbt-ci-pipeline-to-the-next-level) or [dbt Cloud](https://www.datafold.com/blog/slim-ci-the-cost-effective-solution-for-successful-deployments-in-dbt-cloud) to prevent most instances of data drift. Slim CI reduces build time and eliminates most instances of data drift because, thanks to state deferral, the CI build depends on upstreams in production. However, Slim CI will not *completely* eliminate data drift in CI, specifically in cases where the model being modified in the PR depends on a source. In those cases, we recommend [**building twice in CI**](/deployment-testing/best-practices/handling-data-drift#build-twice-in-ci).

## Why prevent data drift in CI?

By eliminating data drift entirely, you can be confident that any differences detected in CI are driven only by your code, not unexpected data changes. You can think of this as similar to a scientific experiment, where the control and treatment groups ideally exist in identical baseline conditions, with the treatment as the only variable that could cause differential outcomes.

In practice, many organizations do not completely eliminate data drift and still derive value from the automatic data diffing and analysis conducted by Datafold in CI, despite the minor noise that does exist.

## Handling data drift

### Build twice in CI

The most rigorous way to prevent data drift in CI is to **set up two builds in CI**: one build representing PR data and another representing production data, both based on an identical snapshot of upstream data.

1. Create a fixed snapshot of the upstream data that both builds will use.
2. The CI pipeline executes two dbt builds: one using the PR branch of code, and another using the base branch of code. This creates two data environments which Datafold can compare.
Since both builds transform the same snapshot of upstream data, any detected differences will be due to the code changes alone, ensuring an accurate comparison with no false positives.

In this architecture, production data is not directly used in CI; rather, a snapshot or clone is. This eliminates any potential noise introduced if the most recent production job used outdated upstream data. By building two versions of the data in CI, you ensure an "apples-to-apples" comparison that depends on the same version of upstream data.

If performance is a concern, you can use a reduced or filtered upstream data set to speed up the CI process while still providing rich insight into the data. This method assumes the production build doesn't involve multiple jobs that process different sets of models at different times.

# Slim Diff

Choose which downstream tables to diff to optimize time, cost, and performance.

By default, Datafold diffs all modified models and their downstream models. However, it won't make sense for every organization to diff every downstream table on every code update; tradeoffs of time, cost, and risk must be considered. That's why we created Slim Diff. With Slim Diff enabled, Datafold will only diff models with dbt code changes in your pull request (PR).

## Setting up Slim Diff

In Datafold, enable Slim Diff by navigating to **Settings → Integrations → CI**, selecting your CI tool, opening **Advanced Settings**, and checking the **Slim Diff** box:

## Diffing only modified models

With this setting turned on, only the modified models will be diffed by default.

## Diff individual downstream models

Once Datafold has diffed only the modified models, you still have the option of diffing individual downstream models right within your PR.

## Diff all downstream models

You can also add the `datafold:diff-all-downstream` label within your PR, which will automatically diff *all* downstream models.
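The selection behavior described above can be sketched as follows (a simplified model for illustration, not Datafold's actual implementation; the model names are invented):

```python
# Simplified sketch of Slim Diff model selection (illustrative only).

def models_to_diff(modified, downstream, slim_diff, diff_all_downstream=False):
    """Return the set of models that would be diffed for a PR."""
    if not slim_diff or diff_all_downstream:
        # Default behavior, or the `datafold:diff-all-downstream` label:
        # diff modified models plus everything downstream of them.
        return sorted(set(modified) | set(downstream))
    # Slim Diff: only models with dbt code changes in the PR.
    return sorted(modified)

modified = ["stg_users"]
downstream = ["dim_users", "fct_sessions"]

print(models_to_diff(modified, downstream, slim_diff=True))
# ['stg_users']
print(models_to_diff(modified, downstream, slim_diff=True, diff_all_downstream=True))
# ['dim_users', 'fct_sessions', 'stg_users']
```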
## Explicitly define which models to always diff

Finally, with Slim Diff turned on, there might be certain models or subdirectories that you want to *always* diff when downstream. You can think of this as an exclusion to the Slim Diff behavior. Apply the `slim_diff: diff_when_downstream` meta tag to individual models or entire folders in your `dbt_project.yml` file:

```yaml
models:
  <project_name>:
    <folder_name>:
      +materialized: view
    <model_name>:
      +meta:
        datafold:
          datadiff:
            slim_diff: diff_when_downstream
    <folder_name>:
      +meta:
        datafold:
          datadiff:
            slim_diff: diff_when_downstream
```

These meta tags can also be added in individual yaml files or in config blocks. More details about using meta tags are available in [the dbt docs](https://docs.getdbt.com/reference/resource-configs/meta).

With this configuration in place, Slim Diff will prevent downstream models from being run *unless* they have been designated as exceptions with the `slim_diff: diff_when_downstream` dbt meta tag. As usual, once the PR has been opened, you'll still have the option of diffing individual downstream models that weren't diffed, or diffing all downstream models using the `datafold:diff-all-downstream` label.

# Configuration

Explore configuration options for CI/CD testing in Datafold.

Learn how Datafold infers primary keys for accurate Data Diffs. Map renamed columns in PRs to their production counterparts. Configure when Datafold runs in CI, including on-demand triggers. Set model-specific filters and configurations for CI runs.

# Column Remapping

Specify column renaming in your git commit message so Datafold can map renamed columns to their original counterparts in production for accurate comparison.

When your PR includes updates to column names, it's important to specify these updates in your git commit message using the following syntax. This allows Datafold to understand how renamed columns should be compared to the column in the production data with the original name.
## Example

By specifying column remapping in the commit message, instead of interpreting the change as removing one column and adding another:

Datafold will recognize that the column has been renamed:

## Syntax for column remapping

You can use any of the following syntax styles, as a single line in a commit message, to instruct Datafold in CI to remap a column from `oldcol` to `newcol`.

```Bash
# All models/tables in the PR:
datafold remap oldcol newcol
X-Datafold: rename oldcol newcol
/datafold renamed oldcol newcol
datafold: remapped oldcol newcol

# Filtered models/tables by shell-like glob:
datafold remap oldcol newcol model_NAME
X-Datafold: rename oldcol newcol TABLE
/datafold renamed oldcol newcol VIEW_*
```

## Chaining together column name updates

Commit messages can be chained together to reflect sequential changes. This means that a commit message does not lock you into renaming a column. For example, if your commit history looks like this:

Datafold will understand that the production column `name` has been renamed to `first_name` in the PR branch.

## Handling column renaming in git commits and PR comments

### Git commits

Git commits track changes on a change-by-change basis and linearize history assuming merged branches introduce new changes on top of the base/current branch (1st parent).

### PR comments

PR comments apply changes to the entire changeset.

### When to use git commits or PR comments?

When handling chained renames:

* **Git commits:** Sequential renames (`col1 > col2 > col3`) result in the final rename (`col1 > col3`).
* **PR comments:** It's best to specify the final result directly (`col1 > col3`). Sequential renames (`col1 > col2 > col3`) can also work, but specifying the final state simplifies understanding during review.
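How sequential commits collapse into a single final mapping can be sketched like this (a hypothetical helper for illustration, not part of the Datafold CLI):

```python
# Hypothetical helper showing how chained renames across commits collapse
# into one final mapping: col1 > col2 > col3 becomes col1 > col3.

def collapse_renames(renames):
    mapping = {}
    for old, new in renames:
        # If `old` is itself the result of an earlier rename, extend that chain.
        source = next((k for k, v in mapping.items() if v == old), old)
        mapping[source] = new
    return mapping

# Commits renamed `name` -> `full_name`, then `full_name` -> `first_name`:
print(collapse_renames([("name", "full_name"), ("full_name", "first_name")]))
# {'name': 'first_name'}
```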
| Aspect | Git Commits | PR Comments |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Tracking Changes** | Tracks changes on a change-by-change basis. | Applies changes to the entire changeset. |
| **History Linearization** | Linearizes history assuming merged branches introduce new changes on top of the base/current branch (1st parent). | N/A |
| **Chained Renames** | Sequential renames (col1 > col2 > col3) result in the final rename (col1 > col3). | It's best to specify the final result directly (col1 > col3). Sequential renames (col1 > col2 > col3) can also work, but specifying the final state simplifies understanding during review. |
| **Precedence** | Renames specified in git commits are applied in sequence unless overridden by subsequent commits. | PR comments take precedence over renames specified in git commits if applied during the review process. |

These guidelines ensure consistency and clarity when managing column renaming in collaborative development environments, leveraging Datafold's capabilities effectively.

# Running Data Diff for Specific PRs/MRs

By default, Datafold CI runs on every new pull/merge request and on commits to existing ones. To **only** run Datafold CI when the user explicitly requests it, you can set the **Run only when tagged** option in the Datafold app [CI settings](https://app.datafold.com/settings/integrations/ci), which will only allow Datafold CI to run if a `datafold` tag/label is assigned to the pull/merge request.

## Running data diff on specific file changes

By default, Datafold CI will run on any file change in the repo.
To skip Datafold CI runs for certain modified files (e.g., if the dbt code is placed in the same repo as non-dbt code), you can specify files to ignore. The pattern uses the syntax of `.gitignore`. Excluded files can be re-included using negation (`!`).

### Example

Let's say the dbt project is a folder in a repo that contains other code (e.g., Airflow). We want to run Datafold CI for changes to dbt models but skip it for other files. For that, we exclude all files in the repo except those in the `/dbt` folder. We also want to filter out `.md` files in the `/dbt` folder:

```Bash
*
!dbt/*
dbt/*.md
```

**SKIPPING SPECIFIC DBT MODELS** To skip diffing individual dbt models in CI, use the [never\_diff](/deployment-testing/configuration/model-specific-ci/excluding-models) option in the Datafold dbt yaml config.

# Running Data Diff on Specific Branches

By default, Datafold CI runs on every new pull/merge request and on commits to existing ones. You can set the **Custom base branch** option in the Datafold app [CI settings](https://app.datafold.com/settings/integrations/ci) to only run Datafold CI on pull requests that have a specific base branch.

This might be useful if you have multiple environments built from different branches. For example, `staging` and `production` environments built from `staging` and `main` branches respectively. Using this option, you can have two different CI configurations in Datafold, one for each environment, and only run CI for the corresponding branch.

# Diff Timeline

Specify a `time_column` to visualize match rates between tables for each column over time.

```yaml
models:
  - name: users
    meta:
      datafold:
        datadiff:
          time_column: created_at
```

# Excluding Models

Use `never_diff` to exclude a model or subdirectory of models from data diffs.

```yaml
models:
  - name: users
    meta:
      datafold:
        datadiff:
          never_diff: true
```

# Including/Excluding Columns

Specify columns to include in or exclude from the data diff using `include_columns` and `exclude_columns`.
```yaml
models:
  - name: users
    meta:
      datafold:
        datadiff:
          include_columns:
            - user_id
            - created_at
            - name
          exclude_columns:
            - full_name
```

# SQL Filters

Use dbt YAML configuration to set model-specific filters for Datafold CI. SQL filters can be helpful in two scenarios:

1. When **Production** and **Staging** environments are not built using the same data. For example, if **Staging** is built using a subset of production data, filters can be applied to ensure that both environments are on par and can be diffed.
2. To improve Datafold CI performance by reducing the volume of data compared, e.g., only comparing the last 3 months of data. SQL filters are an effective technique to speed up diffs by narrowing the data diffed.

A SQL filter adds a `WHERE` clause to allow you to filter data on both sides using standard SQL filter expressions. Filters can be added to dbt YAML under the `meta.datafold.datadiff.filter` tag:

```yaml
models:
  - name: users
    meta:
      datafold:
        datadiff:
          filter: "user_id > 2350 AND source_timestamp >= current_date() - 7"
```

# Time Travel

Use `prod_time_travel` and `pr_time_travel` to diff tables from specific points in time. If your database supports time travel, you can diff tables from a particular point in time by specifying `prod_time_travel` for a production model and `pr_time_travel` for a PR model.

```yaml
models:
  - name: users
    meta:
      datafold:
        datadiff:
          prod_time_travel:
            - 2022-02-07T00:00:00
          pr_time_travel:
            - 2022-02-07T00:00:00
```

# Primary Key Inference

Datafold requires a primary key to perform data diffs. Using dbt metadata, Datafold identifies the column to use as the primary key for accurate data diffs. Datafold supports composite primary keys, meaning that you can assign multiple columns that together make up the primary key.

## Metadata

The first option is setting the `primary-key` key in the dbt metadata.
There are [several ways to configure this](https://docs.getdbt.com/reference/resource-configs/meta) in your dbt project, using either the `meta` key in a yaml file or a model-specific config block.

```yaml
models:
  - name: users
    columns:
      - name: user_id
        meta:
          primary-key: true
      ## for compound primary keys, set all parts of the key as a primary-key ##
      # - name: company_id
      #   meta:
      #     primary-key: true
```

## Tags

If the primary key is not found in the metadata, Datafold will go through the [tags](https://docs.getdbt.com/reference/resource-properties/tags).

```yaml
models:
  - name: users
    columns:
      - name: user_id
        tags:
          - primary-key
      ## for compound primary keys, tag all parts of the key ##
      # - name: company_id
      #   tags:
      #     - primary-key
```

## Inferred

If the primary key isn't provided explicitly, Datafold will try to infer a primary key from dbt's uniqueness tests. If you have a single-column uniqueness test defined, it will use this column as the PK.

```yaml
models:
  - name: users
    columns:
      - name: user_id
        tests:
          - unique
```

Model-level uniqueness tests can also be used for inferring the PK.

```yaml
models:
  - name: sales
    columns:
      - name: col1
      - name: col2
    # ...
    tests:
      - unique:
          column_name: "col1 || col2"
          # or: column_name: "CONCAT(col1, col2)"
      # we also support the dbt_utils unique_combination_of_columns test
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_no
            - order_line
```

Keep in mind that this is a failover mechanism. If you change the uniqueness test, this will also impact the way Datafold performs the diff.

# Getting Started with CI/CD Testing

Learn how to set up CI/CD testing with Datafold by integrating your data connections, code repositories, and CI pipeline for automated testing.

**TEAM CLOUD**

Interested in adding Datafold Team Cloud to your CI pipeline? [Let's talk](https://calendly.com/d/zkz-63b-23q/see-a-demo?email=clay%20analytics%40datafold.com\&first_name=Clay\&last_name=Moeller\&a1=\&month=2024-07)!
## Getting Started with Deployment Testing

To get started, first set up your [data connection](https://docs.datafold.com/integrations/databases) to ensure that Datafold can access and monitor your data sources.

Next, integrate Datafold with your version control system by following the instructions for [code repositories](https://docs.datafold.com/integrations/code-repositories). This allows Datafold to track and test changes in your data pipelines.

Then, add Datafold to your continuous integration (CI) pipeline to enable automated deployment testing. You can do this through our universal [Fully-Automated](../deployment-testing/getting-started/universal/fully-automated), [No-Code](../deployment-testing/getting-started/universal/no-code), [API](../deployment-testing/getting-started/universal/api), or [dbt](../integrations/orchestrators) integrations.

Optionally, you can [connect data apps](https://docs.datafold.com/integrations/bi_data_apps) to extend your testing and monitoring to data applications like BI tools.

# API

Learn how to set up and configure Datafold's API for CI/CD testing.

## 1. Create a repository integration

Integrate your code repository using the appropriate [integration](/integrations/code-repositories).

## 2. Create an API integration

In the Datafold app, create an API integration.

## 3. Set up the API integration

Complete the configuration by specifying the following fields:

### Basic settings

| Field Name | Description |
| --- | --- |
| Configuration name | Choose a name for your Datafold integration. |
| Repository | Select the repository you configured in step 1. |
| Data Source | Select the data source your repository writes to. |
### Advanced settings: Configuration

| Field Name | Description |
| --- | --- |
| Diff Hightouch Models | Run data diffs for Hightouch models affected by your PR. |
| CI fails on primary key issues | If null or duplicate primary keys exist, CI will fail. |
| Pull Request Label | When this is selected, the Datafold CI process will only run when the `datafold` label has been applied. |
| CI Diff Threshold | Data diffs will only be run automatically for a given CI run if the number of diffs doesn't exceed this threshold. |
| Custom base branch | If defined, the Datafold CI process will only run on pull requests with the specified base branch. |
| Files to ignore | Datafold CI diffs all changed models in the PR if at least one modified file doesn't match the ignore pattern. Datafold CI doesn't run in the PR if all modified files should be ignored. ([Additional details.](/deployment-testing/configuration/datafold-ci/on-demand)) |

### Advanced settings: Sampling

| Field Name | Description |
| --- | --- |
| Enable sampling | Enable sampling for data diffs to optimize analyzing large datasets. |
| Sampling tolerance | The tolerance to apply in sampling for all data diffs. |
| Sampling confidence | The confidence to apply when sampling. |
| Sampling threshold | Sampling will be disabled automatically if tables are smaller than the specified threshold. If unspecified, default values will be used depending on the data source type. |

## 4. Obtain a Datafold API Key and CI config ID

Generate a new Datafold API Key and obtain the CI config ID from the CI API integration settings page. You will need these values later when setting up the CI jobs.

## 5. Install the Datafold SDK into your Python environment

```Bash
pip install datafold-sdk
```

## 6. Configure your CI script(s) with the Datafold SDK

Configure your CI script(s) to use the Datafold SDK `ci submit` command. The example below should be adapted to match your specific use case.

```Bash
datafold ci submit --ci-config-id <ci_config_id> --pr-num <pr_number> --diffs ./diffs.json
```

Since Datafold cannot infer which tables have changed, you'll need to provide this information manually in a specific JSON file format. Datafold can then determine which models to diff in a CI run based on the `diffs.json` you pass to the Datafold SDK `ci submit` command.

```json
[
  {
    "prod": "MY.PROD.TABLE", // Production table to compare PR changes against
    "pr": "MY.PR.TABLE", // Changed table containing data modifications in the PR
    "pk": ["MY", "PK", "LIST"], // Primary key; can be an empty array
    // These fields are not required and can be omitted from the JSON file:
    "include_columns": ["COLUMNS", "TO", "INCLUDE"],
    "exclude_columns": ["COLUMNS", "TO", "EXCLUDE"]
  }
]
```

Note: the JSON file is optional; you can achieve the same effect by passing the payload via standard input (stdin):

```Bash
datafold ci submit \
  --ci-config-id <ci_config_id> \
  --pr-num <pr_number> <<- EOF
[{
  "prod": "MY.PROD.TABLE",
  "pr": "MY.PR.TABLE",
  "pk": ["MY", "PK", "LIST"]
}]
EOF
```

Implementation details will vary depending on [which CI tool](#ci-implementation-tools) you use. Please review the following instructions and examples for your organization's CI tool.

**NOTE**

Populating the `diffs.json` file is specific to your use case and therefore out of scope for this guide.
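The payload itself is use-case specific, but as a rough illustration, here is one way it could be assembled in Python. The helper function and table names below are hypothetical, not part of the Datafold SDK:

```python
import json

# Illustrative only: assemble a diffs.json payload with the expected shape.
# build_diff_entry and the table names are hypothetical examples.
def build_diff_entry(prod, pr, pk, include_columns=None, exclude_columns=None):
    entry = {"prod": prod, "pr": pr, "pk": pk}
    # include_columns / exclude_columns are optional and omitted when unused
    if include_columns:
        entry["include_columns"] = include_columns
    if exclude_columns:
        entry["exclude_columns"] = exclude_columns
    return entry

diffs = [
    build_diff_entry("MY.PROD.TABLE", "MY.PR.TABLE", ["MY", "PK", "LIST"]),
]
payload = json.dumps(diffs, indent=2)  # write this string to diffs.json
```

Your CI job can write `payload` to a file and pass its path to `--diffs`, or pipe it to stdin as shown above.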
The only requirement is to adhere to the JSON schema structure explained above.

## CI Implementation Tools

We've created guides and templates for three popular CI tools.

**HAVING TROUBLE SETTING UP DATAFOLD IN CI?**

We're here to help! Please [reach out and chat with a Datafold Solutions Engineer](https://www.datafold.com/booktime).

To add Datafold to your CI tool, add a `datafold ci submit` step to your PR CI job.

```yaml
name: Datafold PR Job

# Run this job on pull requests, and on pushes to any branch except main
on:
  pull_request:
  push:
    branches-ignore:
      - main

jobs:
  run:
    runs-on: ubuntu-20.04 # your image will vary
    steps:
      - name: Install Datafold SDK
        run: pip install -q datafold-sdk

      # ...

      - name: Upload what to diff to Datafold
        run: datafold ci submit --ci-config-id <ci_config_id> --pr-num ${PR_NUM} --diffs ./diffs.json
        env:
          # env variables used by the Datafold SDK internally
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          # For Dedicated Cloud/private deployments of Datafold, also set
          # DATAFOLD_HOST to your base URL, e.g. "https://custom.url.datafold.com"
          # There are multiple ways to get the PR number; this is just a simple example
          PR_NUM: ${{ github.event.number }}
```

Be sure to replace `<ci_config_id>` with the [CI config ID](#4-obtain-a-datafold-api-key-and-ci-config-id) value.

**NOTE**

It is beyond the scope of this guide to provide guidance on generating the `diffs.json` file, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above.

Finally, store [your Datafold API Key](#4-obtain-a-datafold-api-key-and-ci-config-id) as a secret named `DATAFOLD_API_KEY` [in your GitHub repository settings](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository).

Once you've completed these steps, Datafold will run data diffs between production and development data on the next GitHub Actions CI run.
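Since the only hard requirement on the diffs file is that it matches the schema described above, it can be worth sanity-checking it in your CI job before calling `datafold ci submit`. The validator below is a minimal, illustrative sketch of our own, not part of the Datafold SDK:

```python
import json

# Keys from the diffs.json schema described above
REQUIRED_KEYS = {"prod", "pr", "pk"}
OPTIONAL_KEYS = {"include_columns", "exclude_columns"}

def validate_diffs_payload(text: str) -> list:
    """Check a diffs.json payload against the schema described above.
    Illustrative helper, not part of the Datafold SDK."""
    diffs = json.loads(text)
    if not isinstance(diffs, list):
        raise ValueError("top-level value must be a JSON array")
    for i, entry in enumerate(diffs):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {i} is missing keys: {sorted(missing)}")
        unknown = entry.keys() - REQUIRED_KEYS - OPTIONAL_KEYS
        if unknown:
            raise ValueError(f"entry {i} has unexpected keys: {sorted(unknown)}")
        if not isinstance(entry["pk"], list):  # pk may be an empty list
            raise ValueError(f"entry {i}: 'pk' must be an array")
    return diffs
```

Failing fast on a malformed file in your own job gives a clearer error message than a rejected submission.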
```yaml
version: 2.1

jobs:
  artifacts-job:
    docker:
      - image: cimg/python:3.9 # your image will vary
    steps:
      - checkout
      - run:
          name: "Install Datafold SDK"
          command: pip install -q datafold-sdk
      - run:
          # DATAFOLD_API_KEY (and, for Dedicated Cloud/private deployments of
          # Datafold, DATAFOLD_HOST set to your base URL, e.g.
          # "https://custom.url.datafold.com") are used by the Datafold SDK
          # internally; set them as project environment variables, per
          # https://circleci.com/docs/set-environment-variable/
          # CIRCLE_PULL_REQUEST holds the PR URL; there are multiple ways to
          # get the PR number, this is just a simple example
          name: "Upload what to diff to Datafold"
          command: datafold ci submit --ci-config-id <ci_config_id> --pr-num ${CIRCLE_PULL_REQUEST##*/} --diffs ./diffs.json
```

Be sure to replace `<ci_config_id>` with the [CI config ID](#4-obtain-a-datafold-api-key-and-ci-config-id) value.

**NOTE**

It is beyond the scope of this guide to provide guidance on generating the `diffs.json` file, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above.

Then, enable [**Only build pull requests**](https://circleci.com/docs/oss#only-build-pull-requests) in CircleCI. This ensures that CI runs on pull requests and production, but not on pushes to other branches.

Finally, store [your Datafold API Key](#4-obtain-a-datafold-api-key-and-ci-config-id) as an environment variable named `DATAFOLD_API_KEY` [in your CircleCI project settings](https://circleci.com/docs/set-environment-variable/).

Once you've completed these steps, Datafold will run data diffs between production and development data on the next CircleCI run.
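One CircleCI wrinkle: the built-in `CIRCLE_PULL_REQUEST` variable holds the pull request URL, not its number. If your job needs the numeric PR id for `--pr-num`, one way to derive it, assuming the usual `.../pull/<number>` URL shape, is sketched below in Python; in shell, the parameter expansion `${CIRCLE_PULL_REQUEST##*/}` achieves the same thing:

```python
def pr_number_from_url(pull_request_url: str) -> int:
    # CIRCLE_PULL_REQUEST typically looks like
    # https://github.com/org/repo/pull/123 (assumed shape);
    # the PR number is the last path segment.
    return int(pull_request_url.rstrip("/").rsplit("/", 1)[-1])
```

If your provider formats PR URLs differently, adjust the parsing accordingly.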
```yaml
image:
  name: ghcr.io/dbt-labs/dbt-core:1.x # your image will vary
  entrypoint: [ "" ]

# DATAFOLD_API_KEY (and, for Dedicated Cloud/private deployments of Datafold,
# DATAFOLD_HOST set to your base URL, e.g. "https://custom.url.datafold.com")
# are used by the Datafold SDK internally; define them as CI/CD variables in
# your GitLab project settings rather than in this file.

run_pipeline:
  stage: test
  before_script:
    - pip install -q datafold-sdk
  script:
    # Upload what to diff to Datafold; CI_MERGE_REQUEST_IID is the merge
    # request number within the project
    - datafold ci submit --ci-config-id <ci_config_id> --pr-num $CI_MERGE_REQUEST_IID --diffs ./diffs.json
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
```

Be sure to replace `<ci_config_id>` with the [CI config ID](#4-obtain-a-datafold-api-key-and-ci-config-id) value.

**NOTE**

It is beyond the scope of this guide to provide guidance on generating the `diffs.json` file, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above.

Finally, store [your Datafold API Key](#4-obtain-a-datafold-api-key-and-ci-config-id) as a CI/CD variable named `DATAFOLD_API_KEY` [in your GitLab project's settings](https://docs.gitlab.com/ee/ci/variables/).

Once you've completed these steps, Datafold will run data diffs between production and development data on the next GitLab CI run.

## Optional CI Configurations and Strategies

### Skip Datafold in CI

To skip the Datafold step in CI, include the string `datafold-skip-ci` in the last commit message.

# Fully-Automated

Automatically diff tables modified in a pull request with Datafold's Fully-Automated CI integration.

Our Fully-Automated CI integration enables you to automatically diff tables modified in a pull request so you know exactly how your data will change before going to production.
We do this by analyzing the SQL in any changed files, extracting the relevant table names, and diffing those tables between your staging and production environments. We then post the results of those diffs—including any downstream impact—to your pull request for all to see. All without manual intervention.

## Prerequisites

* Your code must be hosted in one of our supported version control integrations
* Your tables/views must be defined in SQL
* Your schema names must be parameterized ([see below](#4-parameterize-schema-names))
* You must be automatically generating staging data ([more info](/deployment-testing/how-it-works))

## Get Started

Get started in just a few easy steps.

### 1. Generate a Datafold API key

If you haven't already generated an API key (you only need one), visit Settings > Account and select **Create API Key**. Save the key somewhere safe, like a password manager, as you won't be able to view it later.

### 2. Set up a version control integration

Open the Datafold app and navigate to Settings > Integrations > Repositories to connect the repository that contains the code you'd like to automatically diff.

### 3. Add a step to your CI workflow

This example assumes you're using GitHub Actions, but the approach generalizes to any version control tool we support, including GitLab, Bitbucket, etc.
Either [create a new GitHub Action](https://docs.github.com/en/actions/writing-workflows/quickstart) or add the following steps to an existing one:

```yaml
- name: Install datafold-sdk
  run: pip install -q datafold-sdk

- name: Trigger Datafold CI
  run: |
    datafold ci auto trigger --ci-config-id $CI_CONF_ID --pr-num $PR_NUM --base-sha $BASE_SHA --pr-sha $PR_SHA --reference-params "$REFERENCE_PARAMS" --pr-params "$PR_PARAMS"
  env:
    DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
    CI_CONF_ID: 436
    PR_NUM: "${{ steps.findPr.outputs.pr }}"
    PR_SHA: "${{ github.event.pull_request.head.sha }}"
    BASE_SHA: ${{ github.event.pull_request.base.sha }}
    REFERENCE_PARAMS: '{ "target_schema": "nc_default" }'
    PR_PARAMS: "{ \"target_schema\": \"${{ env.TARGET_SCHEMA }}\" }"
```

### 4. Parameterize schema names

If it's not already the case, you'll need to parameterize the schema for any table paths you'd like Datafold to diff. For example, let's say you have a file called `dim_orgs.sql` that defines a table called `DIM_ORGS` in your warehouse. Your SQL should look something like this:

```sql
-- datafold: pk=org_id
CREATE OR REPLACE TABLE analytics.${target_schema}.dim_orgs AS (
  SELECT
    org_id,
    org_name,
    employee_count,
    created_at
  FROM analytics.${target_schema}.org_created
);
```

### 5. Provide primary keys (optional)

While this step is technically optional, we strongly recommend providing primary keys for any tables you'd like Datafold to diff. For Datafold to perform full value-level comparisons between staging and production tables, it needs to know the primary keys. To provide this information, place a comment above each query using the `-- datafold: pk=` syntax shown below:

```sql
-- datafold: pk=org_id
CREATE OR REPLACE TABLE analytics.${target_schema}.dim_orgs AS (
  SELECT
    org_id,
    ...
```

### 6. Create a pull request

When you create a pull request, Datafold will automatically detect it, attempt to diff any tables modified in the code, and post a summary as a comment in the PR. You can click through from the comment to view a more complete analysis of the changes in the Datafold app.

Happy diffing!

## Need help?

If you have any questions about Fully-Automated CI, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).

# No-Code

Set up Datafold's No-Code CI integration to create and manage Data Diffs without writing code.

With the No-Code integration, you create and manage the data diffs for your pull requests directly in the Datafold app, so no changes to your CI scripts are required.

## Getting Started

Get up and running with our No-Code CI integration in just a few steps.

### 1. Create a repository integration

Connect your code repository using the appropriate [integration](/integrations/code-repositories).

### 2. Create a No-Code integration

From the integrations page, create a new No-Code CI integration.

### 3. Set up the No-Code integration

Complete the configuration by specifying the following fields:

#### Basic settings

| Field Name | Description |
| --- | --- |
| Configuration name | Choose a name for your Datafold integration. |
| Repository | Select the repository you configured in step 1. |
| Data Connection | Select the data connection your repository writes to. |

#### Advanced settings

| Field Name | Description |
| --- | --- |
| Pull request label | When this is selected, the Datafold CI process will only run when the `datafold` label has been applied to your pull request. |
| Custom base branch | If provided, the Datafold CI process will only run on pull requests against the specified base branch. |

### 4. Create a pull request and add diffs

Datafold will automatically post a comment on your pull request with a link to generate a CI run that corresponds to the latest set of changes.

### 5. Add diffs to your CI run

Once in Datafold, add as many data diffs as you'd like to the CI run. If you need a refresher on how to configure data diffs, check out [our docs](/data-diff/in-database-diffing/creating-a-new-data-diff).

### 6. Add a summary to your pull request

Click **Save and Add Preview to PR** to post a summary to your pull request.

### 7. View the summary in your pull request

## Cloning diffs from the last CI run

If you make additional changes to your pull request, clicking the **Add data diff** button generates a new CI run in Datafold. From there, you can:

* Create a new Data Diff from scratch
* Clone diffs from the last CI run

You can also diff downstream tables by clicking the **Add Data Diff** button in the Downstream Impact table. This creates additional Data Diffs.

You can then post another summary to your pull request by clicking **Save and Add Preview to PR**.

# How Datafold in CI Works

Learn how Datafold integrates with your _Continuous Integration (CI)_ process to create Data Diffs for all SQL code changes, catching issues before they make it into production.

## What is CI?

Continuous Integration (CI) is a process for building and testing changes to your code before deploying to production. This ensures early detection of potential issues and improves the quality of code deployment.

| Without CI | With CI |
| --- | --- |
| Updates are manually coordinated and become a complex synchronization chore. | Smoothly manage code changes, and scale as your team and code base grow. |
| Testing is done manually, if at all. | Automate high-confidence test coverage. |
| Code changes are released at a slower cadence, and with higher rates of failure. | Boost the quantity and quality of developer output. |

### Datafold in CI

For Datafold to work in CI, you need to add a step that builds staging data in your CI process (e.g., GitHub Actions).

**Prerequisite: Building staging data in CI**

If you're using dbt, you'll need to add a dbt build step to your CI pipeline first. This can be done through either [dbt Cloud](https://www.datafold.com/blog/slim-ci-the-cost-effective-solution-for-successful-deployments-in-dbt-cloud) or [dbt Core](https://www.datafold.com/blog/accelerating-dbt-core-ci-cd-with-github-actions-a-step-by-step-guide). For other orchestrators like Airflow, follow [this guide](https://www.datafold.com/blog/datafold-in-ci-is-for-everyone) to build staging data in CI, or contact us for custom recommendations based on your infrastructure.

In this short clip, see how the Datafold bot automatically comments on your PR, highlighting data differences between the production and development versions of your code: