# Get Audit Logs
Source: https://docs.datafold.com/api-reference/audit-logs/get-audit-logs

get /api/v1/audit_logs


# Create a DBT BI integration
Source: https://docs.datafold.com/api-reference/bi/create-a-dbt-bi-integration

post /api/v1/lineage/bi/dbt/


# Create a Hightouch integration
Source: https://docs.datafold.com/api-reference/bi/create-a-hightouch-integration

post /api/v1/lineage/bi/hightouch/


# Create a Looker integration
Source: https://docs.datafold.com/api-reference/bi/create-a-looker-integration

post /api/v1/lineage/bi/looker/


# Create a Mode Analytics integration
Source: https://docs.datafold.com/api-reference/bi/create-a-mode-analytics-integration

post /api/v1/lineage/bi/mode/


# Create a Power BI integration
Source: https://docs.datafold.com/api-reference/bi/create-a-power-bi-integration

openapi-public.json post /api/v1/lineage/bi/powerbi/


# Create a Tableau integration
Source: https://docs.datafold.com/api-reference/bi/create-a-tableau-integration

post /api/v1/lineage/bi/tableau/


# Get an integration
Source: https://docs.datafold.com/api-reference/bi/get-an-integration

get /api/v1/lineage/bi/{bi_datasource_id}/
Returns the integration for Mode/Tableau/Looker/HighTouch by its id.


# List all integrations
Source: https://docs.datafold.com/api-reference/bi/list-all-integrations

get /api/v1/lineage/bi/
Return all integrations for Mode/Tableau/Looker


# Remove an integration
Source: https://docs.datafold.com/api-reference/bi/remove-an-integration

delete /api/v1/lineage/bi/{bi_datasource_id}/


# Rename a Power BI integration
Source: https://docs.datafold.com/api-reference/bi/rename-a-power-bi-integration

openapi-public.json put /api/v1/lineage/bi/powerbi/{bi_datasource_id}/
It can only update the name. Returns the integration with changed fields.


# Sync a BI integration
Source: https://docs.datafold.com/api-reference/bi/sync-a-bi-integration

get /api/v1/lineage/bi/{bi_datasource_id}/sync/
Start an unscheduled synchronization of the integration.


# Update a DBT BI integration
Source: https://docs.datafold.com/api-reference/bi/update-a-dbt-bi-integration

put /api/v1/lineage/bi/dbt/{bi_datasource_id}/
Returns the integration with changed fields.


# Update a Hightouch integration
Source: https://docs.datafold.com/api-reference/bi/update-a-hightouch-integration

put /api/v1/lineage/bi/hightouch/{bi_datasource_id}/
It can only update the schedule. Returns the integration with changed fields.


# Update a Looker integration
Source: https://docs.datafold.com/api-reference/bi/update-a-looker-integration

put /api/v1/lineage/bi/looker/{bi_datasource_id}/
It can only update the schedule. Returns the integration with changed fields.


# Update a Mode Analytics integration
Source: https://docs.datafold.com/api-reference/bi/update-a-mode-analytics-integration

put /api/v1/lineage/bi/mode/{bi_datasource_id}/
It can only update the schedule. Returns the integration with changed fields.


# Update a Tableau integration
Source: https://docs.datafold.com/api-reference/bi/update-a-tableau-integration

put /api/v1/lineage/bi/tableau/{bi_datasource_id}/
It can only update the schedule. Returns the integration with changed fields.


# List CI runs
Source: https://docs.datafold.com/api-reference/ci/list-ci-runs

get /api/v1/ci/{ci_config_id}/runs


# Trigger a PR/MR run
Source: https://docs.datafold.com/api-reference/ci/trigger-a-prmr-run

post /api/v1/ci/{ci_config_id}/trigger


# Upload PR/MR changes
Source: https://docs.datafold.com/api-reference/ci/upload-prmr-changes

post /api/v1/ci/{ci_config_id}/{pr_num}


# Create a data diff
Source: https://docs.datafold.com/api-reference/data-diffs/create-a-data-diff

post /api/v1/datadiffs


# Get a data diff
Source: https://docs.datafold.com/api-reference/data-diffs/get-a-data-diff

get /api/v1/datadiffs/{datadiff_id}


# Get a data diff summary
Source: https://docs.datafold.com/api-reference/data-diffs/get-a-data-diff-summary

get /api/v1/datadiffs/{datadiff_id}/summary_results


# List data diffs
Source: https://docs.datafold.com/api-reference/data-diffs/list-data-diffs

get /api/v1/datadiffs
All fields support multiple items, using just comma delimiter
Date fields also support ranges using the following syntax:

- ``<DATETIME`` = before DATETIME
- ``>DATETIME`` = after DATETIME
- ``DATETIME`` = between DATETIME and DATETIME + 1 MINUTE
- ``DATE`` = start of that DATE until DATE + 1 DAY
- ``DATETIME1<<DATETIME2`` = between DATETIME1 and DATETIME2
- ``DATE1<<DATE2`` = between DATE1 and DATE2


# Update a data diff
Source: https://docs.datafold.com/api-reference/data-diffs/update-a-data-diff

patch /api/v1/datadiffs/{datadiff_id}


# Create a data source
Source: https://docs.datafold.com/api-reference/data-sources/create-a-data-source

post /api/v1/data_sources


# Get a data source
Source: https://docs.datafold.com/api-reference/data-sources/get-a-data-source

get /api/v1/data_sources/{data_source_id}


# Get a data source summary
Source: https://docs.datafold.com/api-reference/data-sources/get-a-data-source-summary

get /api/v1/data_sources/{data_source_id}/summary


# Get data source testing results
Source: https://docs.datafold.com/api-reference/data-sources/get-data-source-testing-results

get /api/v1/data_sources/test/{job_id}


# List data source types
Source: https://docs.datafold.com/api-reference/data-sources/list-data-source-types

get /api/v1/data_sources/types


# List data sources
Source: https://docs.datafold.com/api-reference/data-sources/list-data-sources

get /api/v1/data_sources


# Test a data source connection
Source: https://docs.datafold.com/api-reference/data-sources/test-a-data-source-connection

post /api/v1/data_sources/{data_source_id}/test


# Data Types
Source: https://docs.datafold.com/api-reference/data-types

Datafold facilitates data diffing by supporting a wide range of basic data types across major database systems like BigQuery, PostgreSQL, Redshift, Databricks, and Snowflake.

The tables below provide an overview of these types, categorized as numeric, string, boolean, date/time, etc., tailored to each system's specifics, to help pinpoint compatible types and grasp their mappings across different databases. Note that data types not included in these tables are not currently supported by Datafold.

For **cross-database diffing**, it's important to note that while the tables list the primary data types supported by Datafold, you can still diff data with types outside these lists. Datafold samples values and refines them further, often implicitly supporting a broad range of types. In scenarios where a type is not explicitly supported, it may be interpreted as `Text()`. While this is usually sufficient, be aware that it can lead to false positives in some cases.

## Supported types for in-database diffing

### BigQuery

| Category   | Type                                                            |
| ---------- | --------------------------------------------------------------- |
| Numeric    | INTEGER, INT64, FLOAT, FLOAT64, NUMERIC, BIGNUMERIC, BIGDECIMAL |
| Boolean    | BOOLEAN, BOOL                                                   |
| String     | STRING                                                          |
| Date/Time  | DATE, DATETIME, TIMESTAMP, TIME                                 |
| Additional | JSON                                                            |

### PostgreSQL/Redshift

| Category  | Type                                                                                         |
| --------- | -------------------------------------------------------------------------------------------- |
| Numeric   | BIGINT, INT8, INTEGER, INT4, SMALLINT, INT2, REAL, FLOAT4, FLOAT8, NUMERIC, DOUBLE PRECISION |
| String    | CHARACTER, CHAR, BCHAR, CHARACTER VARYING, VARCHAR, NAME, TEXT, CHARACTER VARYING            |
| Boolean   | BOOLEAN, BOOL                                                                                |
| Date/Time | DATE, TIMESTAMPTZ, TIMESTAMP, TIME, TIMESTAMP WITH/OUT TIME ZONE, TIME WITH/OUT TIME ZONE    |
| UUID      | UUID                                                                                         |

### Databricks

| Category  | Type                                                         |
| --------- | ------------------------------------------------------------ |
| Numeric   | BIGINT, TINYINT, SMALLINT, INT, DOUBLE, DECIMAL, FLOAT, REAL |
| String    | CHAR, STRING, VARCHAR                                        |
| Boolean   | BOOLEAN                                                      |
| Date/Time | TIMESTAMP, DATE                                              |

### Snowflake

| Category   | Type                                                      |
| ---------- | --------------------------------------------------------- |
| Numeric    | NUMBER, FLOAT, REAL                                       |
| String     | TEXT, VARCHAR                                             |
| Boolean    | BOOLEAN                                                   |
| Date/Time  | DATE, TIMESTAMP\_LTZ, TIMESTAMP\_NTZ, TIMESTAMP\_TZ, TIME |
| Additional | BINARY, VARIANT, ARRAY, OBJECT                            |

## Supported types for cross-database diffing

**Note for cross-database diffing only:** while the tables list the primary data types supported by Datafold, you can still diff data with types outside these lists. Datafold samples values and refines them further, often implicitly supporting a broad range of types. In scenarios where a type is not explicitly supported, it may be interpreted as `Text()`. While this is usually sufficient, be aware that it can lead to false positives in some cases.

### BigQuery

| Category | Type                                                |
| -------- | --------------------------------------------------- |
| Dates    | TIMESTAMP, DATETIME                                 |
| Numbers  | INT64, INT32, NUMERIC, BIGNUMERIC, FLOAT64, FLOAT32 |
| Text     | STRING                                              |
| Boolean  | BOOL                                                |
| JSON     | JSON                                                |

### Databricks

| Category   | Type                                                   |
| ---------- | ------------------------------------------------------ |
| Numbers    | INT, SMALLINT, TINYINT, BIGINT, FLOAT, DOUBLE, DECIMAL |
| Timestamps | TIMESTAMP                                              |
| Text       | STRING                                                 |
| Boolean    | BOOLEAN                                                |

### MS SQL

| Category   | Type                                                                |
| ---------- | ------------------------------------------------------------------- |
| Timestamps | DATETIMEOFFSET, DATETIME, DATETIME2, SMALLDATETIME, DATE            |
| Numbers    | FLOAT, REAL, DECIMAL, MONEY, SMALLMONEY                             |
| Integers   | INT, BIGINT, TINYINT, SMALLINT                                      |
| Text       | VARCHAR, CHAR, TEXT, NTEXT, NVARCHAR, NCHAR, BINARY, VARBINARY, XML |
| UUID       | UNIQUEIDENTIFIER                                                    |
| Boolean    | BIT                                                                 |
| JSON       | JSON                                                                |

### MYSQL

| Category | Type                                                                   |
| -------- | ---------------------------------------------------------------------- |
| Dates    | DATETIME, TIMESTAMP, DATE                                              |
| Numbers  | DOUBLE, FLOAT, DECIMAL, INT, BIGINT, MEDIUMINT, SMALLINT, TINYINT      |
| Text     | VARCHAR, CHAR, VARBINARY, BINARY, TEXT, MEDIUMTEXT, LONGTEXT, TINYTEXT |
| Boolean  | BOOLEAN                                                                |

### Oracle

| Category   | Type                             |
| ---------- | -------------------------------- |
| Numbers    | NUMBER, FLOAT                    |
| Text       | CHAR, NCHAR, NVARCHAR2, VARCHAR2 |
| Timestamps | DATE                             |

### PostgreSQL and Redshift

| Category   | Type                                                                |
| ---------- | ------------------------------------------------------------------- |
| Timestamps | TIMESTAMPTZ, TIMESTAMP, TIMESTAMP WITH/OUT TIME ZONE, DATE          |
| Numbers    | DOUBLE PRECISION, REAL, DECIMAL, SMALLINT, INTEGER, NUMERIC, BIGINT |
| Text       | CHARACTER, CHARACTER VARYING, VARCHAR, TEXT, JSON, JSONB            |
| Boolean    | BOOLEAN                                                             |
| UUID       | UUID                                                                |

### Snowflake

| Category   | Type                                                |
| ---------- | --------------------------------------------------- |
| Timestamps | TIMESTAMP\_NTZ, TIMESTAMP\_LTZ, TIMESTAMP\_TZ, DATE |
| Numbers    | NUMBER, FLOAT                                       |
| Text       | TEXT                                                |
| Boolean    | BOOLEAN                                             |


# Datafold SDK
Source: https://docs.datafold.com/api-reference/datafold-sdk


The Datafold SDK allows you to accomplish certain actions using a thin programmatic wrapper around the Datafold REST API, in particular:

* **Custom CI Integrations**: Submitting information to Datafold about what tables to diff in CI
* **dbt CI Integrations**: Submitting dbt artifacts via CI runner
* **dbt development**: Kick off data diffs from the command line while developing in your dbt project

## Install

First, create and activate your virtual environment for Python:

```
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel
```

Now, you're ready to install the Datafold SDK:

```
pip install datafold-sdk
```

#### CLI environment variables

To use the Datafold CLI, you need to set up some environment variables:

```bash
export DATAFOLD_API_KEY=XXXXXXXXX
```

If your Datafold app URL is different from the default `app.datafold.com`, set the custom domain as the variable:

```bash
export DATAFOLD_HOST=<CUSTOM_DATAFOLD_APP_DOMAIN>
```

## Custom CI Integrations

Please follow [our CI orchestration docs](../integrations/orchestrators/custom-integrations) to set up a custom CI integration levering the Datafold SDK.

## dbt Core CI Integrations

When you set up Datafold CI diffing for a dbt Core project, we rely on the submission of `manifest.json` files to represent the production and staging versions of your dbt project.

Please see our detailed docs on how to [set up Datafold in CI for dbt Core](../integrations/orchestrators/dbt-core), and reach out to our team if you have questions.

#### CLI

```bash
    datafold dbt upload \
    --ci-config-id <ci_config_id> \
    --run-type <run-type> \
    --target-folder <artifacts_path> \
    --commit-sha <git_sha>
```

#### Python

```python
import os

from datafold.sdk.dbt import submit_artifacts

api_key = os.environ.get('DATAFOLD_API_KEY')

# only needed if your Datafold app url is not app.datafold.com
host = os.environ.get("DATAFOLD_HOST")

submit_artifacts(host=host,
                 api_key=api_key,
                 ci_config_id=<ci_config_id>,
                 run_type='<run-type>',
                 target_folder='<artifacts_path>',
                 commit_sha='<git_sha>')
```

## Diffing dbt models in development

It can be beneficial to diff between two dbt environments before opening a pull request. This can be done using the Datafold SDK from the command line:

```bash
datafold diff dbt
```

That command will compare data between your development and production environments. By default, all models that were built in the previous `dbt run` or `dbt build` command will be compared.

### Running Data Diffs before opening a pull request

It can be helpful to view Data Diff results in your ticket before creating a pull request. This enables faster code reviews by letting developers QA changes earlier.

To do this, you can create a draft PR and run the following command:

```
dbt run && datafold diff dbt
```

This executes dbt locally and triggers a Data Diff to preview data changes without committing to Git. To automate this workflow, see our guide [here](/faq/datafold-with-dbt#can-i-run-data-diffs-before-opening-a-pr).

### Update your dbt\_project.yml with configurations

#### Option 1: Add variables to the `dbt_project.yml`

```yaml
# dbt_project.yml
vars:
  data_diff:
    prod_database: my_default_database # default database for the prod target
    prod_schema: my_default_schema # default schema for the prod target
    prod_custom_schema: PROD_<custom_schema> # Optional: see dropdown below
```

**Additional schema variable details**
The value for `prod_custom_schema:` will vary based on how you have setup dbt.

This variable is used when a model has a custom schema and becomes ***dynamic*** when the string literal `<custom_schema>` is present. The `<custom_schema>` substring is replaced with the custom schema for the model in order to support the various ways schema name generation can be overridden <a href="https://docs.getdbt.com/docs/build/custom-schemas">here</a> -- also referred to as "advanced custom schemas".

**Examples (not exhaustive)**

**Single production schema**

*If your prod environment looks like this ...*

```bash
PROD.ANALYTICS
```

*... your data-diff configuration should look like this:*

```yaml
  vars:
      data_diff:
          prod_database: PROD
          prod_schema: ANALYTICS
```

**Some custom schemas in production with a prefix like "prod\_"**

*If your prod environment looks like this ...*

```bash
PROD.ANALYTICS
PROD.PROD_MARKETING
PROD.PROD_SALES
```

*... your data-diff configuration should look like this:*

```yaml
  vars:
      data_diff:
          prod_database: PROD
          prod_schema: ANALYTICS
          prod_custom_schema: PROD_<custom_schema>
```

**Some custom schemas in production with no prefix**

*If your prod environment looks like this ...*

```yaml
PROD.ANALYTICS
PROD.MARKETING
PROD.SALES
```

*... your data-diff configuration should look like this:*

```yaml
vars:
  data_diff:
    prod_database: PROD
    prod_scheam: ANALYTICS
    prod_custom_schema: <custom_schema>
```

#### Option 2: Specify a production `manifest.json` using `--state`

**Using the `--state` option is highly recommended for dbt projects with multiple target database and schema configurations. For example, if you customized the [`generate_schema_name`](https://docs.getdbt.com/docs/build/custom-schemas#understanding-custom-schemas) macro, this is the best option for you.**

> Note: `dbt ls` is preferred over `dbt compile` as it runs faster and data diffing does not require fully compiled models to work.

```bash
dbt ls -t prod # compile a manifest.json using the "prod" target
mv target/manifest.json prod_manifest.json # move the file up a directory and rename it to prod_manifest.json
dbt run # run your entire dbt project or only a subset of models with `dbt run --select <model_name>`
data-diff --dbt --state prod_manifest.json # run data-diff to compare your development results to the production database/schema results in the prod manifest
```

#### Add your Datafold data connection integration ID to your dbt\_project.yml

To connect to your database, navigate to **Settings** → **Integrations** → **Data connections** and click **Add new integration** and follow the prompts.

After you **Test and Save**, add the ID (which can be found on Integrations > Data connections) to your **dbt\_project.yml**.

```yaml
# dbt_project.yml
vars:
  data_diff:
      ...
      datasource_id: <DATA_SOURCE_ID>
```

The following optional arguments are available:

| Options                            | Description                                                                                                                                                                                                        |
| ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `--version`                        | Print version info and exit.                                                                                                                                                                                       |
| `-w, --where EXPR`                 | An additional 'where' expression to restrict the search space. Beware of SQL Injection!                                                                                                                            |
| `--dbt-profiles-dir PATH`          | Which directory to look in for the `profiles.yml` file. If not set, we follow the default `profiles.yml` location for the dbt version being used. Can also be set via the `DBT_PROFILES_DIR` environment variable. |
| `--dbt-project-dir PATH`           | Which directory to look in for the `dbt_project.yml` file. Default is the current working directory and its parents.                                                                                               |
| `--select SELECTION or MODEL_NAME` | Select dbt resources to compare using dbt selection syntax in dbt versions >= 1.5. In versions \< 1.5, it will naively search for a model with `MODEL_NAME` as the name.                                           |
| `--state PATH`                     | Specify manifest to utilize for 'prod' comparison paths instead of using configuration.                                                                                                                            |
| `-pd, --prod-database TEXT`        | Override the dbt production database configuration within `dbt_project.yml`.                                                                                                                                       |
| `-ps, --prod-schema TEXT`          | Override the dbt production schema configuration within `dbt_project.yml`.                                                                                                                                         |
| `--help`                           | Show this message and exit.                                                                                                                                                                                        |


# Get column downstreams
Source: https://docs.datafold.com/api-reference/explore/get-column-downstreams

openapi-public.json get /api/v1/explore/db/{data_connection_id}/columns/{column_path}/downstreams
Retrieve a list of columns or tables which depend on the given column.


# Get column upstreams
Source: https://docs.datafold.com/api-reference/explore/get-column-upstreams

openapi-public.json get /api/v1/explore/db/{data_connection_id}/columns/{column_path}/upstreams
Retrieve a list of columns or tables which the given column depends on.


# Get table downstreams
Source: https://docs.datafold.com/api-reference/explore/get-table-downstreams

openapi-public.json get /api/v1/explore/db/{data_connection_id}/tables/{table_path}/downstreams
Retrieve a list of tables which depend on the given table.


# Get table upstreams
Source: https://docs.datafold.com/api-reference/explore/get-table-upstreams

openapi-public.json get /api/v1/explore/db/{data_connection_id}/tables/{table_path}/upstreams
Retrieve a list of tables which the given table depends on.


# Introduction
Source: https://docs.datafold.com/api-reference/introduction


Our REST API allows you to interact with Datafold programmatically. To use it, you'll need an API key. Follow the instructions below to get started.

## Create an API Key

Open the Datafold app, visit Settings > Account, and select **Create API Key**.

<Note>
  Store your API key somewhere safe. If you lose it, you'll need to generate a new one.
</Note>

![Create an API key](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/create-api-key.png)

## Use your API Key

When making requests to the Datafold API, you'll need to include the API key as a header in your HTTP request for authentication. The header should be named `Authorization`, and the value should be in the format:

```
Authorization: Key {API_KEY}
```

For example, if you're using cURL:

```bash
curl https://api.datafold.com/api/v1/... -H "Authorization: Key {API_KEY}"
```

## Datafold SDK

Rather than hit our REST API endpoints directly, we offer a convenient Python SDK for common development and deployment testing workflows. You can find more information about our SDK [here](/api-reference/datafold-sdk).

## Need help?

If you have any questions about how to use our REST API, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).


# Create a Data Diff Monitor
Source: https://docs.datafold.com/api-reference/monitors/create-a-data-diff-monitor

openapi-public.json post /api/v1/monitors/create/diff


# Create a Data Test Monitor
Source: https://docs.datafold.com/api-reference/monitors/create-a-data-test-monitor

openapi-public.json post /api/v1/monitors/create/test


# Create a Metric Monitor
Source: https://docs.datafold.com/api-reference/monitors/create-a-metric-monitor

openapi-public.json post /api/v1/monitors/create/metric


# Create a Schema Change Monitor
Source: https://docs.datafold.com/api-reference/monitors/create-a-schema-change-monitor

openapi-public.json post /api/v1/monitors/create/schema


# Delete a Monitor
Source: https://docs.datafold.com/api-reference/monitors/delete-a-monitor

openapi-public.json delete /api/v1/monitors/{id}


# Get Monitor
Source: https://docs.datafold.com/api-reference/monitors/get-monitor

openapi-public.json get /api/v1/monitors/{id}


# Get Monitor Run
Source: https://docs.datafold.com/api-reference/monitors/get-monitor-run

openapi-public.json get /api/v1/monitors/{id}/runs/{run_id}


# List Monitor Runs
Source: https://docs.datafold.com/api-reference/monitors/list-monitor-runs

openapi-public.json get /api/v1/monitors/{id}/runs


# List Monitors
Source: https://docs.datafold.com/api-reference/monitors/list-monitors

openapi-public.json get /api/v1/monitors


# Toggle a Monitor
Source: https://docs.datafold.com/api-reference/monitors/toggle-a-monitor

openapi-public.json put /api/v1/monitors/{id}/toggle


# Trigger a run
Source: https://docs.datafold.com/api-reference/monitors/trigger-a-run

openapi-public.json post /api/v1/monitors/{id}/run


# Update a Monitor
Source: https://docs.datafold.com/api-reference/monitors/update-a-monitor

openapi-public.json patch /api/v1/monitors/{id}/update


# Best Practices
Source: https://docs.datafold.com/data-diff/cross-database-diffing/best-practices

When dealing with large datasets, it's crucial to approach diffing with specific optimization strategies in mind. We share best practices that will help you get the most accurate and efficient results from your data diffs.

## Enable sampling

[Sampling](/data-diff/cross-database-diffing/creating-a-new-data-diff#row-sampling) can be helpful when diffing between extremely large datasets as it can result in a speedup of 2x to 20x or more. The extent of the speedup depends on various factors, including the scale of the data, instance sizes, and the number of data columns.

The following table illustrates the speedup achieved with sampling in different databases, varying instance sizes, and different numbers of data columns:

|      Databases      | vCPU | RAM, GB |    Rows   | Columns | Time full | Time sampled | Speedup |    RDS type   | Diff full | Diff sampled | Per-col noise |
| :-----------------: | :--: | :-----: | :-------: | :-----: | :-------: | :----------: | :-----: | :-----------: | :-------: | :----------: | :-----------: |
| Oracle vs Snowflake |   2  |    2    | 1,000,000 |    1    |  0:00:33  |    0:00:27   |   1.22  |  db.t3.small  |    5399   |     5400     |       0       |
| Oracle vs Snowflake |   8  |    32   | 1,000,000 |    1    |  0:07:23  |    0:00:18   |  24.61  | db.m5.2xlarge |    5422   |     5423     |     0.005     |
|  MySQL vs Snowflake |   2  |    8    | 1,000,000 |    1    |  0:00:57  |    0:00:24   |   2.38  |  db.m5.large  |    5409   |     5413     |       0       |
|  MySQL vs Snowflake |   2  |    8    | 1,000,000 |    29   |  0:40:00  |    0:02:14   |  17.91  |  db.m5.large  |    5412   |     5411     |       0       |

When sampling is enabled, Datafold compares a randomly chosen subset of the data. Sampling is the tradeoff between the diff detail and time/cost of the diffing process. For most use cases, sampling does not reduce the informational value of data diffs as it still provides the magnitude and specific examples of differences (e.g., if 10% of sampled data show discrepancies, it suggests a similar proportion of differences across the entire dataset).

<Tip>
  Although configuring sampling can seem overwhelming at first, a good rule of thumb is to select an initial value of 95% for the sampling confidence and adjust it as needed. Tweaking the parameters can be helpful to see how they impact the sample size and the tradeoff between performance and accuracy.
</Tip>

## Handling data type differences

Datafold automatically manages data type differences during cross-database diffing. For example, when comparing decimals with different precisions (e.g., `DECIMAL(38,15)` in SQL Server and `DECIMAL(38,19)` in Snowflake), Datafold automatically casts values to a common precision before comparison, flagging any differences appropriately. Similarly, for timestamps with different precisions (e.g., milliseconds in SQL Server and nanoseconds in Snowflake), Datafold adjusts the precision as needed for accurate comparisons, simplifying the diffing process.

## Optimizing OLTP databases: indexing best practices

When working with row-oriented transactional databases like PostgreSQL, optimizing the database structure is crucial for efficient data diffing, especially for large tables. Here are some best practices to consider:

* **Create indexes on key columns**:

* It's essential to create indexes on the columns that will be compared, particularly the primary key columns defined in the data diffs.

* **Example**: If your data diff involves primary key columns `colA` and `colB`, ensure that indexes are created for these specific columns.

* **Use separate indexes for primary key columns:**

* Indexes for primary key columns should be distinct and start with these columns, not as subsets of other indexes. Having a dedicated primary key index is critical for efficient diffing.

* **Example**: Consider a primary key consisting of `colA` and `colB`. Ensure that the index is structured in the same order, like (`colA`, `colB`), to align with the primary key. An index with an order of (`colB`, `colA`) is strongly discouraged due to the impact on performance.

* **Example**: If the index is defined as (`colA`, `colB`, `colC`) and the primary key is a combination of `colA` and `colB`, then when setting up the diff operation, ensure that the primary key is specified as `colA`, `colB.` If the order is reversed as `colB`, `colA`, the diffing process won’t be able to fully utilize indexing, potentially leading to slower performance.

* **Leverage compound indexes**:

* Compound indexes, which involve multiple columns, can significantly improve query performance during data diffs as they efficiently handle complex queries and filtering.

* **Example**: An index defined as (`colA`, `colB`, `colC`) can be beneficial for diffing operations involving these columns, as it aligns with the order of columns in the primary key.

## Handling high percentage of differences

Data diff is optimized to perform best when the percent of different rows/values is relatively low, to support common data validation scenarios like data replication and migration.

While the tool strives to maximize the database's computational power and minimize data transfer, in extreme cases with very high difference percentages (up to 100%), it may result in transferring every row over the network, which is considerably slower.

In order to avoid long-running diffs, we recommend the following:

* **Start with diffing [primary keys](/data-diff/cross-database-diffing/creating-a-new-data-diff#primary-key)** only to identify row-level completeness first, before diffing all or more columns.
* **Set an [egress](/data-diff/cross-database-diffing/creating-a-new-data-diff#primary-key) limit** to automatically stop the diffing process after set number of rows are downloaded over the network.
* **Set a [per-column diff](/data-diff/cross-database-diffing/creating-a-new-data-diff#primary-key) limit** to stop finding differences for each column after a set number are found. This is especially useful in data reconciliation where identifying a large number of discrepancies (e.g., large percentage of missing/different rows) early on indicates that a detailed row-by-row diff may not be required, thereby saving time and computational resources.

In the screenshot below, we see that exactly 4 differences were found in `user_id`, but “at least 4,704 differences” were found in `total_runtime_seconds`. `user_id` has a number of differences below the per-column diff limit, and so we state the exact number. On the other hand, `total_runtime_seconds` has a number of differences greater than the per-column diff limit, so we state “at least.” Note that due to our algorithm’s approach, we often find significantly more differences than the limit before diffing is halted, and in that scenario, we report the value that was found, while stating that more differences may exist.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/screenshot.png" />
</Frame>

## Executing queries in parallel

Increase the number of concurrent connections to the database in Datafold. This enables queries to be executed in parallel, significantly accelerating the diff process.

Navigate to the **Settings** option in the left sidebar menu of Datafold. Adjust the **max connections** setting to increase the number of concurrent connections Datafold can establish with your data. Note that the maximum allowable value for concurrent connections is 64.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/connection.png" />
</Frame>

## Optimize column selection

The number of columns included in the diff directly impacts its speed: selecting fewer columns typically results in faster execution. To optimize performance, refine your column selection based on your specific use case:

* **Comprehensive verification**: For in-depth analysis, include all columns in the diff. This method is the most thorough, suitable for exhaustive data reviews, albeit time-intensive for wide tables.
* **Minimal verification**: Consider verifying only the primary key and `updated_at` columns. This is efficient and sufficient if you need to validate rows have not been added or removed, and that updates are current between databases, but do not need to check for value-level differences between rows with common primary keys.
* **Presence verification**: If your main concern is just the presence of data (whether data exists or has been removed), such as identifying missing hard deletes, verifying only the primary key column can be sufficient.
* **Hybrid verification**: Focus on key columns that are most critical to your operations or data integrity, such as monetary values in an `amount` column, while omitting large serialized or less critical columns like `json_settings`.

## Managing primary key distribution

Significant gaps in the primary key column can decrease diff efficiency (e.g., 10s of millions of continuous rows missing). Datafold will execute queries for non-existent row ranges, which can slow down the data diff.

## Handling different primary key types

As a general rule, primary keys should be of the same (or similar) type in both datasets for diffing to work properly. Comparing primary keys of different types (e.g., `INT` vs `VARCHAR`) will result in a type mismatch error. You can still diff such datasets by casting the primary key column to the same type in both datasets explicitly.

<Note>
  Indexes on the primary key typically cannot be utilized when the primary key is cast to a different type. This may result in slower diffing performance. Consider creating a separate index, such as [expression index in PostgreSQL](https://www.postgresql.org/docs/current/indexes-expressional.html), to improve performance.
</Note>

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data1.png" />
</Frame>


# Creating a New Data Diff
Source: https://docs.datafold.com/data-diff/cross-database-diffing/creating-a-new-data-diff

Datafold's Data Diff can compare data across databases (e.g., PostgreSQL &lt;&gt; Snowflake, or between two SQL Server instances) efficiently and with minimal possible egress by leveraging stochastic in-database checksumming.

This powerful algorithm provides full row-, column-, and value-level detail into discrepancies between data tables.

## Creating a new data diff

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/creating.png" />
</Frame>

Setting up a new data diff in Datafold is straightforward. You can configure your data diffs with the following parameters and options:

### Source and Target datasets

#### Data connection

Pick your data connection(s).

#### Diff type

Choose how you want to compare your data:

* Table: Select this to compare data directly from database tables
* Query: Use this to compare results from specific SQL queries

#### Dataset

Choose the dataset you want to compare. This can be a table or a view in your relational database.

#### Filter

Insert your filter clause after the WHERE keyword to refine your dataset. For example: `created_at > '2000-01-01'` will only include data created after January 1, 2000.

### Materialize inputs

Select this option to improve diffing speed when query is heavy on compute, or if filters are applied to non-indexed columns, or if primary keys are transformed using concatenation, coalesce, or another function.

## Column remapping

Designate columns with the same data type and different column names to be compared. Data Diff will surface differences under the column name used in the Source dataset.
<Note>Datafold automatically handles differences in data types to ensure accurate comparisons. See our best practices below for how this is handled.</Note>

## General

### Primary key

The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere as it is used for unique row identification during diffing.

<Note>
  Textual primary keys do not support values outside the set of characters `a-zA-Z0-9!"()*/^+-<>=`. If these values exist, we recommend filtering them out before running the diff operation.
</Note>

#### Egress limit

The egress limit optimizes the diff process by terminating it once a predefined number of rows are downloaded. The limit is set to 1,000,000 by default.

When the egress limit is reached, the diffing process does not produce the same results each time it is run, as it is not deterministic (i.e., the order in which data is processed may vary).

The egress limit prevents redundant analysis in scenarios with minor, repetitive discrepancies, such as formatting differences (e.g., whitespace, rounding differences). For most use cases, it is impractical to continue diffing after it is known that datasets are substantially different.

Since the algorithm aims to detect and return every mismatched row/value, if the datasets have a large percentage of differing rows, the algorithm may be unable to take advantage of checksumming. This can cause a large amount of data to be pulled over the network, which slows down the diffing process, and increases the strain on the database.

Setting an egress limit prevents unwanted runtime and database load by stopping the operation early in cases of substantial dataset discrepancies. It is highly recommended to set an egress limit, taking into account these tradeoffs between cost/speed and rigor.

### Columns

#### Columns to compare

Specify which columns to compare between datasets.
Note that this has performance implications when comparing a large number of columns, especially in wide tables with 30 or more columns. It is recommended to initially focus on comparisons using only the primary key or to select a limited subset of columns.

#### Per-column diff limit

By setting a per-column diff limit, Data Diff will stop identifying differences for any column after a number of differences is found, based on the limit. Data Diff will also stop searching for exclusive and duplicate primary keys after the limit is reached.
Setting a per-column diff limit enables your team to find data quality issues that arise during data reconciliation while minimizing compute and time spent searching for differences. Learn more about data reconciliation best practices here.

### Row sampling

#### Enable sampling

Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets.
Even when sampling is enabled, checksumming for unsampled primary keys still needs to be performed. As a result, if many columns beyond primary keys are involved, the time spent running cross-database differences with sampling may be significantly reduced.

#### Sampling tolerance

Sampling tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (e.g., nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling.
When sampling is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.
Default: 0.001%

#### Sampling confidence

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance.
To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.
Default: 99%

#### Sampling threshold

Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.

#### Sample size

This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:
The presence of duplicate primary keys in the datasets will likely increase this estimate, as it inflates the perceived uniqueness of rows.

* Applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope.
* The number of rows we sample is not fixed; instead, we use a statistical approach called the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (compare every single row) to establish a baseline.

Example: Imagine there are two datasets we want to compare, Source and Target. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:

* Sampling tolerance: 5%
* Sampling confidence: 95%

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.

### Advanced

#### Materialize diff results to table

Create a detailed table from your diff results, indicating each row where differences occur. This table will include corresponding values from both datasets and flags showing whether each row matches or mismatches.


# How Cross-Database Diffing Works
Source: https://docs.datafold.com/data-diff/cross-database-diffing/cross-database

Datafold's cross-database diffing algorithm efficiently compares datasets between different databases.

To compare datasets between two different databases, Datafold leverages a proprietary [stochastic checksumming algorithm](../../faq/data-diffing#what-is-stochastic-checksumming) that allows it to identify discrepancies down to individual primary keys and column values while minimizing the amount of data sent over the network.

As a result, the comparison is mostly performed in-place, leveraging the underlying databases without the need to export the entire dataset to compare elsewhere.


# Results
Source: https://docs.datafold.com/data-diff/cross-database-diffing/results

Once your data diff is complete, Datafold provides a concise, high-level summary of the detected changes in the Overview tab.

## Overview

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/overview.png" />
</Frame>

The top-level menu displays the diff status, job ID, creation and completed times, runtime, and data connection.

## Columns

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/columns.png" />
</Frame>

The Columns tab displays a table with detailed column and type mappings from the two datasets being diffed, with status indicators for each column comparison (e.g., identical, percentage of values different). This provides a quick way to identify data inconsistencies and prioritize updates.

## Primary keys

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/primarykeys.png" />
</Frame>

This tab highlights rows that are unique to the Target dataset in a data diff ("Rows exclusive to Target"). As this identifies rows that exist only in the Target dataset and not in the Source dataset based on the primary key, it flags potential data discrepancies.

The Clone **diffs and materialize results** button allows you to rerun existing data diffs with results materialized in the warehouse, as well as any other desired modifications.

## Values

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/values.png" />
</Frame>

This tab displays rows where at least one column value differs between the datasets being compared. It is useful for quickly assessing the extent of discrepancies between the two datasets.

The **Show filters** button enables the following features:

* Highlight characters: highlight value differences between tables
* % of difference: filters and displays columns based on the specified percentage range of value differences


# File Diffing
Source: https://docs.datafold.com/data-diff/file-diffing

Datafold allows you to diff files (e.g. CSV, Excel, Parquet, etc.) in a similar way to how you diff tables.

<Note>
  If you'd like to enable file diffing for your organization, please contact [support@datafold.com](mailto:support@datafold.com).
</Note>

In addition to diffing data in tables, views, and SQL queries, Datafold allows you to diff data in files hosted in cloud storage. For example, you can diff between an Excel file and a Snowflake table, or between a CSV file and an Excel file.

## Supported cloud storage providers

Datafold supports diffing files in the following cloud storage providers, with plans to support more in the future:

* Azure Blob Storage
* Azure Data Lake Storage (ADLS)
* More coming soon...

## Supported file types

Datafold supports diffing the following file types:

* Tabular text files (e.g. `.csv`, `.tsv`, `.txt`, `.dat`)
* Excel (`.xlsx`, `.xls`)
* Parquet (`.parquet`)

## Type-specific options

Depending on the type of file you're diffing, you'll have a few options to specify how you'd like to parse the file.

For example, when diffing a tabular text file, you can specify the delimiter and skip header/footer rows.

![File diffing options](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_diff/file-diffing/adls-file-diff-options.png)


# How Datafold Diffs Data
Source: https://docs.datafold.com/data-diff/how-datafold-diffs-data

Data diffs allow you to perform value-level comparisons between any two datasets within the same database, across different databases, or even between files.

The basic inputs required to run a diff are the data connections, names/paths of the datasets to be compared, and the primary key (one or more columns that uniquely identify rows in the datasets).

## What types of data can data diffs compare?

Diffs can compare data in tables, views, SQL queries (in relational databases and data lakes), and even files (e.g. CSV, Excel, Parquet, etc.).

Datafold facilitates data diffing by [supporting a wide range of basic data types](https://docs.datafold.com/api-reference/data-types) across major database systems like Snowflake, Databricks, BigQuery, Redshift, PostgreSQL, and many more.

## Creating data diffs

Diffs can be created in several ways:

* Interactively through the Datafold app
* Programmatically via our [REST API](/api-reference/data-diffs/create-a-data-diff)
* As part of a Continuous Integration (CI) workflow for [Deployment Testing](/deployment-testing/how-it-works)

## How in-database diffing works

When diffing data within the same physical database or data lake namespace, diffs compare data by executing various SQL queries in the target database. It uses several `JOIN`-type queries and various aggregate queries to provide detailed insights into differences at the row, value, and column levels, and to calculate differences in metrics and distributions.

## How cross-database diffing works

When comparing data across databases, diffs leverage checksumming and interval search to diff the data fast and at minimal cost. Diffs can quickly assess both the magnitude of differences and identify specific rows, columns, and values with differences **without having to copy the entire datasets over the network**. This efficiency makes it scalable for datasets as large as trillions of rows or terabytes in size.


# Best Practices
Source: https://docs.datafold.com/data-diff/in-database-diffing/best-practices

We share best practices that will help you get the most accurate and efficient results from your data diffs.

## Comparing numeric columns: tolerance for floats

When comparing numerical columns or of `FLOAT` type which is inherently noisy, it can be helpful to specify tolerance levels for differences below which the values are considered equal.
Set appropriate tolerance levels for floating-point comparisons to avoid flagging inconsequential differences.

## Materialize diff results

While Datafold UI provides advanced exploration of diff results, sometimes it can be helpful to materialize diff results back to the database to investigate them further with SQL or for audit logging.

## Optimizing diff performance at scale

Since data diff pushes down the compute to your database (which usually has sufficient capacity to store and compute the datasets in the first place), the diffing speed and scalability depends on the performance of the underlying SQL engine. In most cases, the diffing performance is comparable to typical transformation jobs and analytical queries running in the database and has scaled to trillions of rows.
When diffs run longer or consume more database resources than desired, consider the following measures:

1. **Enable Sampling** to dramatically reduce the amount of data processed for in-database diffing.
   Sampling can be helpful when diffing extremely large datasets. When sampling is enabled, Datafold compares a randomly chosen subset of the data. Sampling is the tradeoff between the diff detail and time/cost of the diffing process. For most use cases, sampling does not reduce the informational value of data diffs as it still provides the magnitude and specific examples of differences (e.g., if 10% of sampled data show discrepancies, it suggests a similar proportion of differences across the entire dataset).
   Sampling is less ideal when you need to audit every changed value with 100% confidence, but this scenario is rare in practice.

<Tip>
  Although configuring sampling can seem overwhelming at first, a good rule of thumb is to select an initial value of 95% for the sampling confidence and adjust it as needed. Tweaking the parameters can be helpful to see how they impact the sample size and the tradeoff between performance and accuracy.
</Tip>

2. **Add a SQL filter** if you actually need to compare just a subset of data (e.g., for a particular city or last two weeks).

3. **Optimize SQL queries** to enhance the performance and efficiency of database operations, reduce execution time, minimize resource usage, and ensure faster retrieval of data diff results.

4. **Leverage database performance** by ensuring proper configuration to match the typical workload patterns of your diff operations. Many modern databases come with performance-enhancing features like query optimization, caching, and parallel processing.

5. Consider **increasing resources** available to Datafold in your data warehouse (e.g., for Snowflake, increase warehouse size).


# Creating a New Data Diff
Source: https://docs.datafold.com/data-diff/in-database-diffing/creating-a-new-data-diff

Setting up a new data diff in Datafold is straightforward.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/in-diff.png" />
</Frame>

You can configure your data diffs with the following parameters and options:

## Dataset

### Data connection

Pick your data connection(s).

### Diff type

Choose how you want to compare your data:

* Table: Select this to compare data directly from database tables
* Query: Use this to compare results from specific SQL queries

Datafold can also diff views, materialized views, and dynamic tables (Snowflake-only) across both options too.

### Dataset

Choose the dataset you want to compare, Main and Test. This can be a table or a view in your relational database.

### Time travel point

If your database supports time travel, like [Snowflake](https://docs.snowflake.com/en/user-guide/data-time-travel#querying-historical-data), you can query data at a specified timestamp. This is useful for tracking changes over time, conducting audits, or correcting mistakes from accidental data modifications. You can adjust the database's session parameters as needed for your query.
Supported time travel expressions:

|  Database |           Timestamp          |        Negative Offset       |
| :-------: | :--------------------------: | :--------------------------: |
|  BigQuery | <Icon icon="square-check" /> |                              |
| Snowflake | <Icon icon="square-check" /> | <Icon icon="square-check" /> |

Timestamp examples:

* `2024-01-01`
* `2024-01-01 10:04:23`
* `2024-01-01 10:04:23-09:00`
* `2024-07-16T10:04:23+05:00`

Negative offset examples (in seconds):

* `130`
* `3600`

### Filter

Insert your filter clause after the `WHERE` keyword to refine your dataset. For example: `created_at >'2000-01-01` will only include data created after January 1, 2000.

## Column remapping

When columns are the same data type but are named differently, column remapping allows you to align and compare them. This is useful when datasets have semantically identical columns with different names, such as `userID` and `user_id`. Datafold will surface any differences under the column name used in the Main dataset.

## General parameters

### Primary key

The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere as it is used for unique row identification during diffing. Multiple columns support compound primary key definitions.

### Time-series dimension column

If a time-series dimension is selected, this produces a Timeline plot of diff results over time to identify any time-based patterns.
This is useful for identifying trends or anomalies when a given column does not match between tables in a certain date range. By selecting a time-based column, you can visualize differences and patterns across time, measured as column match rates.

<Frame type="glass">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/timeline.png" />
</Frame>

### Materialize diff results to table

Create a detailed table from your diff results, indicating each row where differences occur. This table will include corresponding values from both datasets and flags showing whether each row matches or mismatches.

### Materialize full diff result

For in-depth analysis, you can opt to materialize the full diff result. This disables sampling, allowing for a complete row-by-row comparison across datasets. Otherwise, Datafold defaults to diffing only a sample of the data.

## Row sampling

### Enable sampling

Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets.

### Sampling tolerance

Sampling tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (like nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling.
When sampling is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.

Default: 0.001%

### Sampling confidence

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance.
To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.

Default: 99%

### Sampling threshold

Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.

### Sample size

This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:

* the presence of duplicate primary keys in the datasets will likely increase this estimate, as it inflates the perceived uniqueness of rows
* applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope
* The number of rows we sample is not fixed; instead, we use a statistical approach called the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (compare every single row) to establish a baseline.

Example: Imagine there are two datasets we want to compare, Main and Test. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:

* sampling tolerance: 5%
* sampling confidence: 95%

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.

## Tolerance for floats

An acceptable delta between numeric values is used to determine if they match. This is particularly useful for addressing rounding differences in long floating-point numbers.
Add tolerance by choosing a column name, mode, and value. For mode:

* *Relative*: Defines a percentage-based tolerance. For example, a 2% relative tolerance means no difference is noted if the absolute value of (A/B - 1) is less than or equal to 2%.
* *Absolute*: Sets a fixed numerical margin. For instance, an absolute tolerance of 0.5 means values are matched if the absolute difference between A and B is 0.5 or less.


# Results
Source: https://docs.datafold.com/data-diff/in-database-diffing/results

Once your data diff is complete, Datafold provides a concise, high-level summary of the detected changes in the Overview tab

## Overview

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/in-overview.png" />
</Frame>

The top-level menu displays the diff status, job ID, creation and completed times, runtime, and data connection.

## Columns

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/in-columns.png" />
</Frame>

The Columns tab displays a table with detailed column and type mappings from the two datasets being diffed, with status indicators for each column comparison (e.g., identical, percentage of values different). This provides a quick way to identify data inconsistencies and prioritize updates.

## Primary keys

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/in-primary.png" />
</Frame>

This tab highlights rows that are unique to the Test dataset in a data diff ("Rows exclusive to Test"). As this identifies rows that exist only in the Test dataset and not in the Main dataset based on the primary key, it flags potential data discrepancies.

The **Show filters** button allows you to filter these rows by selected column(s).

The **Clone diffs and materialize** results button allows you to rerun existing data diffs with results materialized in the warehouse, as well as any other desired modifications.

## Column Profiles

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/in-profiles.png" />
</Frame>

Column Profiles displays aggregate statistics and distributions including averages, counts, ranges, and histogram charts representing column-level differences.

The **Show filters** button allows you to adjust chart values by relative (percentage) or absolute numbers.

## Values

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/in-values.png" />
</Frame>

This tab displays rows where at least one column value differs between the datasets being compared. It is useful for quickly assessing the extent of discrepancies between the two datasets.

The **Show** filters button enables the following features:

* Highlight characters: highlight value differences between tables
* % of difference: filters and displays columns based on the specified percentage range of value differences

## Timeline

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/timeline.png" />
</Frame>

The Timeline tab is a specialized feature that only appears if the time-series dimension column has been selected. It graphically represents data differences over time to highlight discrepancies. It only displays columns with data differences, and differences are presented as the share of mismatched data (percentage mismatched).

This feature offers enhanced clarity in pinpointing inconsistencies, supports informed decision-making through visual data representation, and increases efficiency in identifying and resolving data-related issues.

The Timeline feature is particularly useful in scenarios where an incremental model is mismanaged, leading to improper backfilling. It allows users to visually track the inconsistencies that arise over time due to the mismanagement. This graphical representation makes it easier to pinpoint the specific time frames where the errors occurred, facilitating a more targeted approach to rectify these issues.

It is also useful in correlating data differences with specific time intervals that coincide with changing data connections. When switching over or stitching together different data connections, there's often a shift in how data behaves over time. The Timeline graph helps flag the potential impact of the source change on data consistency and integrity.

## Downstream Impact

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/downstream.png" />
</Frame>

This tab displays all associated BI and data app dependencies, such as dashboards and views, linked to the compared datasets. This helps visually illustrate the impact of data changes on downstream data assets.

Each listed dependency is shown with a link to its lineage diagram within Datafold's [column-level lineage](https://docs.datafold.com/data-explorer/how-it-works). You can you can filter by tables or columns within tables, or [open this view](https://docs.datafold.com/data-explorer/how-it-works) in Data Explorer for further analysis.


# What's a Data Diff?
Source: https://docs.datafold.com/data-diff/what-is-data-diff

A data diff is the value-level comparison between two tables, used to identify critical changes to your data and guarantee data quality.

When you **git diff** your code, you’re comparing two versions of your code files to see what has changed, such as lines added, removed, or modified. Similarly, a **data diff** compares two versions of a dataset or two databases to identify differences in individual cells in the data.

![what's a data diff](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_diff/what_is_data_diff.png)

## Why do I need to diff data?

Just as diffing code and text is fundamental to software engineering and working with text documents, diffing data is essential to the data engineering workflow.

Why? In data engineering, both data and the code that processes it are constantly evolving. Without the ability to easily diff data, understanding and tracking data changes becomes challenging. This slows down the development process and makes it harder to ensure data quality.

There is a lot you can do with data diff:

* Test SQL code by comparing development or staging environment data to production
* Compare tables in source and target systems to identify discrepancies when migrating data between databases
* Detect value-level outliers, or unexpected changes, in data flowing through your ETL/ELT pipelines
* Verify that reports generated for regulatory compliance accurately reflect the underlying data by comparing report outputs with source data

## Why Datafold?

Data diffing is a fundamental capability in data engineering that every engineer should have access to.

Datafold's [Data Diff](https://www.datafold.com/data-diff) is a tool that compares datasets fast, within or across databases. Datafold offers an enterprise-ready solution for comparing datasets within or across databases at scale. It includes comprehensive, optimized, and automated diffing solutions, API access, and secure deployment options.

Here's how you can identify row-level discrepancies in Datafold:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data-diff.png" />
</Frame>

Datafold provides end-to-end solutions for automating testing, including column-level lineage, ML-based anomaly detection, and enterprise-scale infrastructure support. It caters to complex and production-ready scenarios, including:

* Automated and collaborative diffing and testing for data transformations in CI
* Data diffing informed by column-level lineage, and validation of code changes with visibility into BI applications
* Validating large data migrations or continuous replications with automated cross-database diffing capabilities

Here's a high-level overview of what Datafold offers:

|                                                        Feature Category                                                       |                     Datafold                     |
| :---------------------------------------------------------------------------------------------------------------------------: | :----------------------------------------------: |
|                      **Database Support**<br />*Databases that are supported for source-destination diff*                     | Any SQL database, inquire about specific support |
|                                    **Scale**<br />*Size of datasets supported for diffing*                                    | Unlimited with advanced performance optimization |
|               **Primary Key Data Type Support**<br />*Data types of primary keys that are supported for diffing*              |  Numerical, string, datetime, boolean, composite |
|                   **Data Types Diffing Support**<br />*Data types that are supported for per-column diffing*                  |                  All data types                  |
|               **Export Diff Results to Database**<br />*Materialize diffing results in your database of choice*               |           <Icon icon="square-check" />           |
|     **Value-level diffs**<br />*Investigate row-by-row column value differences between source and destination databases*     |     <Icon icon="square-check" /> (JSON & GUI)    |
|                **Diff UI**<br />*Explore diffs visually and easily share them with your team and stakeholders*                |           <Icon icon="square-check" />           |
|           **API Access**<br />*Automatically create diffs and receive results at scale using the Datafold REST API*           |           <Icon icon="square-check" />           |
| **Persisting Diff History**<br />*Persist the result history of diffs to know how your data and diffs have changed over time* |           <Icon icon="square-check" />           |
|                          **Scheduled Checks**<br />*Run scheduled diffs for a defined list of tables*                         |           <Icon icon="square-check" />           |
|      **Alerting**<br />*Receive automatic alerts about detected discrepancies between tables within or across databases*      |           <Icon icon="square-check" />           |
|                       **Security and Compliance**<br />*Run diffs in secure and compliant environments*                       |        HIPAA, SOC2 Type II, GDPR compliant       |
|            **Deployment Options**<br />*Deploy your diffs in secure environments that meet your security standards*           |     Multi-tenant SaaS or Single-tenant in VPC    |
|                **Support**<br />*Choose which channels offer the greatest support to your use cases and users*                |   Enterprise support from Datafold team members  |
|        **SLA**<br />*The types of SLAs that exist to guarantee your team can diff and interact with diffs as expected*        |    <Icon icon="square-check" /> (Coming soon)    |

## Three ways to learn more

If you're new to Datafold or data diffing, here are three easy ways to get started:

1. **Explore our CI integration guides**: See how Datafold fits into your continuous integration (CI) pipeline by checking out our guides for [No-Code](../deployment-testing/getting-started/universal/no-code), [API](../deployment-testing/getting-started/universal/api), or [dbt](../integrations/orchestrators) integrations.
2. **Try it yourself**: Use your own data with our [14-day free trial](https://app.datafold.com/) and experience Datafold in action.
3. **Book a demo**: Get a deeper technical understanding of how Datafold integrates with your company’s data infrastructure by [booking a demo](https://www.datafold.com/booktime) with our team.


# dbt Metadata Sync
Source: https://docs.datafold.com/data-explorer/best-practices/dbt-metadata-sync

Datafold can automatically ingest dbt metadata from your production environment and display it in Data Explorer.

<Note>
  **INFO**

  You can enable the metadata sync in your Orchestration settings.
</Note>

Please note that when this feature is enabled, user editing of table metadata is disabled.

### Model-level

The following model-level information can be synced:

* `description` is synchronized into the description field of the table into Lineage.
* The `owner` of the table is set to the user identified by the `user@company.com` field. This user must exist in Datafold with that email.
* The `foo` meta-information is added to the description field with the value `bar`.
* The tags `pii` and `bar` are applied to the table as tags.

Here's an example configuration in YAML format:

```Bash
models:
  - name: users
    description: "Description of the table"
    meta:
      owner: user@company.com
      foo: bar
    tags:
      - pii
      - abc

```

### Column-level

The following column-level information can be synced:

* The column `user_id` has two tags applied: `pk` and `id`.
* The metadata for `user_id` is ignored because it reflects the primary key tag.
* The `email` column has the description applied.
* The `email` column has the tag `pii` applied.
* The `email` column has extra metadata information in the description field: `type` with the value `email`.

Here's an example configuration for columns in YAML format:

```Bash
models:
  - name: users
    ...
    columns:
      - name: user_id
        tags:
          - pk
          - id
        meta:
          pk: true
      - name: email
        description: "The user's email"
        tags:
          - pii
        meta:
          type: email
```


# How It Works
Source: https://docs.datafold.com/data-explorer/how-it-works

The UI visually maps workflows and tracks column-level or tabular lineages, helping users understand the impact of upstream changes.

Our **Data Explorer** offers a comprehensive overview of your data assets, including [Lineage](/data-explorer/lineage) and [Profiles](/data-explorer/profile).

You can filter data assets by Data Connections, Tags, Data Owners, and Asset Types (e.g., tables, columns, and BI-created assets such as views, reports, and syncs). You can also search directly to find specific data assets for lineage analysis.

<Frame caption="Data App Lineage Overview">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_app_lineage_overview-3ffe6ae5a34c4e7d8918fae232eb1ed1.png" />
</Frame>

After selecting a table or data asset, the UI will display a **graph of table-level lineage** by default. You can toggle between **Upstream** and **Downstream** perspectives and customize the lineage view by adjusting the **Max Depth** parameter to your preference.

<Frame caption="Lineage Graph">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_app_lineage_graph-c33d286a7a8c4787f229e5590d32d2ff.png" />
</Frame>


# Lineage
Source: https://docs.datafold.com/data-explorer/lineage

Datafold offers a column-level and tabular lineage view.

## Column-level lineage

Datafold's column-level lineage helps users trace and document the history, transformations, dependencies, and both downstream and upstream processes of a specific data column within an organization's data assets. This feature allows you to pinpoint the origins of data validation issues and comprehensively identify downstream data processes and applications.

To view column-level lineage, click on the **Columns** dropdown menu of the selected asset.

<Frame caption="Lineage Graph Columns Dropdown">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_app_lineage_graph_columns-97a1aa84140dc7bd0b242eb70d6e5d81.png" />
</Frame>

### Highlight path between assets

To highlight the column path between assets, click the specific column. Reset the view by clicking the **Exit the selected path** button.

<Frame caption="Selected Path in Lineage Graph">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_app_lineage_selected_path-ae2fb8bab8b62f9479a6e148a399bc5a.png" />
</Frame>

## Tabular lineage

Datafold also offers a tabular lineage view.

You can sort lineage information by depth, asset type, identifier, and owner. Click on the **Actions** button for further options:

<Frame caption="Tabular Lineage Actions Dropdown">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/tabular_lineage_actions_dropdown-b583e8f59c747058d8a5c95ed6d12843.png" />
</Frame>

### Focus lineage on current node

Drill down onto the data node or column of interest.

### Show SQL query

Access the SQL query associated with the selected column to understand how the data was queried from the source:

<Frame caption="Show SQL Query in Tabular Lineage">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/tabular_lineage_actions_show_sql_code-da905cea8201954e3b0eb17f3fe108de.png" />
</Frame>

### Show usage details

Access detailed information about the column's read, write, and cumulative read (the sum of read count including read count of downstream columns) for the previous 7 days:

<Frame caption="Usage Details in Tabular Lineage">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/tabular_lineage_actions_usage_details-c0462ebd7bee2bc769169ff2b7640d56.png" />
</Frame>

## Search and filters

Datafold offers powerful search and filtering capabilities to help users quickly locate specific data assets and isolate data connections of interest.

In both the graphical and tabular lineage views, you can filter by tables or columns within tables, allowing you to go as granular as needed.

<Frame caption="Search and Filter in Tabular Lineage">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/tabular_lineage_actions_search-f5cef28a24e5a1603e5c04d9696a4ff8.png" />
</Frame>

### Table filtering

Simply enter the table's name in the search bar to filter and display all relevant information associated with that table.

### Column filtering

To focus specifically on columns, you can search using a combination of keywords. For instance, searching "column table" will display columns associated with a table, while a query like "column dim customer" narrows the search to columns within the "dim customer" table.

## Settings

You can configure the settings for Lineage under Settings > Data Connections > Advanced Settings:

<Frame caption="Lineage Advanced Settings">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/lineage_advanced_settings-8787d8f5b01d1a6572b2d73f486ad49a.png" />
</Frame>

### Schema indexing schedule

Customize the frequency and timing of when to update the indexes on database schemas. The schedule is defined through a cron tab expression.

### Table inclusion/exclusion

You can filter to include and/or exclude specific tables to be shown in Lineage.

When the inclusion list is set, only the tables specified in this list will be visible in the lineage and search results.

When the inclusion list is not set, all tables will be visible by default, except for those explicitly specified in the exclusion list.

### Lineage update schedule

Customize the frequency and timing of when to scan the query history of your data warehouse to build and update the data lineage. The schedule is defined through a cron tab expression.

## FAQ

<AccordionGroup>
  <Accordion title="How is lineage computed?">
    Datafold computes column-level lineage by:

    1. Ingesting, parsing and analyzing SQL logs from your databases and data warehouses. This allows Datafold to infer dependencies between SQL statements, including those that create, modify, and read data.
    2. Augmenting the metadata graph with data from various sources. This includes metadata from orchestration tools (e.g., dbt), BI tools, and user-provided documentation.
  </Accordion>

  <Accordion title="Is there a programmatic way to retrieve lineage?">
    Currently, the schema of the Datafold GraphQL API, which we use to expose lineage information, is not yet stable and is considered to be in beta. Therefore, we do not include this API in our public documentation.

    If you would like to programmatically access lineage information, you can explore our GitHub repository with a few examples: [datafold/datafold-api-examples](https://github.com/datafold/datafold-api-examples). Simply clone the repository and follow the instructions provided in the `README.md` file.
  </Accordion>
</AccordionGroup>


# Profile
Source: https://docs.datafold.com/data-explorer/profile

View a data profile that summarizes key table and column-level statistics, and any upstream dependencies.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_app_data_explorer_profile-347ca38a3ae0a32084fd9d02f3a0d667.png" />
</Frame>


# Cross-Database Diffing for Migrations
Source: https://docs.datafold.com/data-migration-automation/cross-database-diffing-migrations

Validate migration parity with Datafold's cross-database diffing solution.

When migrating data from one system to another, ensuring that the data is accurately transferred and remains consistent is critical. Datafold’s cross-database diffing provides a robust method to validate parity between the source and target databases. It compares data across databases, identifying discrepancies at the dataset, column, and row levels, ensuring full confidence in your migration process.

## How cross-database diffing works

* Datafold connects to any SQL source and target databases, similar to how BI tools do.
* Datafold does not need to extract the entirety of the datasets for comparisons. Instead, it relies on stochastic checksumming to identify discrepancies and only extract those for analysis.

### What kind of information does Datafold output?

Datafold’s cross-database diffing will produce the following results:

* **High-Level Summary:**
  * Total number of different rows
  * Total number of rows (primary keys) that are present in one database but not the other
  * Aggregate schema differences
* **Schema Differences:** Per-column mapping of data types, column order, etc.
* **Primary Key Differences:** Sample of specific rows that are present in one database but not the other.
* **Value-Level Differences:** Sample of differing column values for each column with identified discrepancies. The full dataset of differences can be downloaded or materialized to the warehouse.

### How does a user run a data diff?

Users can run data diffs through the following methods:

* Via Datafold’s interactive UI
* Via the Datafold API
* On a schedule (as a monitor) with optional alerting via Slack, email, PagerDuty, etc.

### Can I run multiple data diffs at the same time?

Yes, users can run as many diffs as they would like, with concurrency limited by the underlying database.

### What if my data is changing and replicated live, how can I ensure proper comparison?

In such cases, we recommend using watermarking—diffing data within a specified time window of row creation or update (e.g., `updated_at timestamp`).

### What if the data types do not match between source and target?

Datafold performs best-effort type matching for cases where deterministic type casting is possible, e.g., comparing `VARCHAR` type with `STRING` type. When automatic type casting without information loss is not possible, the user can define type casting manually using diffing in Query mode.

### Can data diff help if the dataset in the source and target databases has a different shape/schema/column naming?

Yes, users can reshape input datasets by writing a SQL query and diffing in Query mode to bring the dataset to a comparable shape. Datafold also supports column remapping for datasets with different column names between tables.

## Learn more

To learn more, check out our guide on [how cross-database diffing works](../data-diff/cross-database-diffing/creating-a-new-data-diff) in Datafold, or explore our extensive [FAQ section](../faq/data-migration-automation) covering cross-database diffing and data migration.


# Datafold Migration Agent
Source: https://docs.datafold.com/data-migration-automation/datafold-migration-agent

Automatically migrate data environments of any scale and complexity with Datafold's Migration Agent.

Datafold provides a full-cycle migration automation solution for data teams, which includes code translation and cross-database reconciliation.

## How does DMA work?

Datafold performs complete SQL codebase translation and validation using an AI-powered architecture. This approach leverages a large language model (LLM) with a feedback loop optimized for achieving full parity between the migration source and target. DMA analyzes metadata, including schema, data types, and relationships, to ensure accuracy in translation.

![datafold migration agent architecture](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data-migration/datafold_migration_agent.png)

Datafold provides a comprehensive report at the end of the migration. This report includes links to data diffs validating parity and highlighting any discrepancies at the dataset, column, and row levels between the source and target databases.

## Why migrate with DMA?

Unlike traditional deterministic transpilers, DMA offers several distinct benefits:

* **Full parity between source and target:** DMA ensures not just code that compiles, but code that delivers the same results in your new database, complete with explicit validation.
* **Flexible dialect handling:** DMA can adapt to any arbitrary input/output dialect without requiring a full grammar definition, which is especially valuable for legacy systems.
* **Self-correction capabilities:** The AI-driven DMA can account for and correct mistakes based on both compilation errors and data discrepancies.
* **Modernizing code structure:** DMA can convert complex stored procedures into clean, modern formats such as dbt projects, following best practices.

## Getting started with DMA

<Note>
  **Want to learn more?**

  If you're interested in diving deeper, please take a moment to [fill out our intake form](https://nw1wdkq3rlx.typeform.com/to/VC2TbEbz) to connect with the Datafold team.
</Note>

1. Connect your source and target data sources to Datafold.
2. Provide Datafold access to your codebase, typically by installing the Datafold GitHub/GitLab/ADO app or via system catalog access for stored procedures.

Once you connect your source and target systems and Datafold ingests the codebase, DMA's translation process is supervised by the Datafold team. In most cases, no additional input is required from the customer.

The migration process timeline depends on the technologies, scale, and complexity of the migration. After setup, migrations typically take several days to several weeks.

## Security

Datafold is SOC 2 Type II, GDPR, and HIPAA-compliant. We offer flexible deployment options, including in-VPC setups in AWS, GCP, or Azure. The LLM infrastructure is local, ensuring no data is exposed to external subprocessors beyond the cloud provider. For VPC deployments, data stays entirely within the customer’s private network.

## FAQ

For more information, please see our extensive [FAQ section](../faq/data-migration-automation).


# Datafold for Migration Automation
Source: https://docs.datafold.com/data-migration-automation/datafold-migration-automation

Datafold provides full-cycle migration automation with SQL code translation and cross-database validation for data warehouse, transformation framework, and hybrid migrations.

Datafold offers flexible migration validation options to fit your data migration workflow. Data teams can choose to leverage the full power of the [Datafold Migration Agent (DMA)](../data-migration-automation/datafold-migration-agent) alongside [cross-database diffing](../data-diff/how-datafold-diffs-data#how-cross-database-diffing-works), or use ad-hoc diffing exclusively for validation.

## Supported migrations

Datafold supports a wide range of migrations to meet the needs of modern data teams. The platform enables smooth transitions between different databases and transformation frameworks, ensuring both code translation and data validation throughout the migration process. Datafold can handle:

* **Data Warehouse Migrations:** Seamlessly migrate between data warehouses, for example, from PostgreSQL to Databricks.

* **Data Transformation Framework Migrations:** Transition your transformation framework from legacy stored procedures to modern tools like dbt.

* **Hybrid Migrations:** Migrate across a combination of data platforms and transformation frameworks. For example, moving from MySQL + stored procedures to Databricks + dbt.

## Migration options

<AccordionGroup>
  <Accordion title="Option 1: DMA + Ad-Hoc Diffing">
    The AI-powered Datafold Migration Agent (DMA) provides automated SQL code translation and validation to simplify and automate data migrations. Teams can pair DMA with ad-hoc cross-database diffing to enhance the validation process with additional manual checks when necessary.

    **How it works:**

    * **Step 1:** Connect your legacy and new databases to Datafold, along with your codebase.
    * **Step 2:** DMA translates and validates SQL code automatically.
    * **Step 3:** Pair the DMA output with ad-hoc cross-database diffing to reconcile data between legacy and new databases.

    This combination streamlines the migration process, offering automatic validation with the flexibility of manual diffing for fine-tuned control.
  </Accordion>

  <Accordion title="Option 2: Ad-Hoc Diffing Only">
    For teams that prefer to handle code translation manually or are working with third-party migrations, Datafold's ad-hoc cross-database diffing is available as a stand-alone validation tool.

    **How it works:**

    * Validate data across databases manually without using DMA for code translation.
    * Run ad-hoc diffing as needed, via the [Datafold REST API](../api-reference/introduction), or schedule it with [Monitors](../data-monitoring) for continuous validation.

    This option gives you full control over the migration validation process, making it suitable for in-house or outsourced migrations.
  </Accordion>
</AccordionGroup>


# Monitor Types
Source: https://docs.datafold.com/data-monitoring/monitor-types

Monitoring your data for unexpected changes is one of the cornerstones of data observability.

Datafold supports all your monitoring needs through a variety of different monitor types:

1. [**Data Diff**](/data-monitoring/monitors/data-diff-monitors) → Detect differences between any two datasets, within or across databases
2. [**Metric**](/data-monitoring/monitors/metric-monitors) → Identify anomalies in standard metrics like row count, freshness, and cardinality, or in any custom metric
3. [**Data Test**](/data-monitoring/monitors/data-test-monitors) → Validate your data with business rules and see specific records that fail your tests
4. [**Schema Change**](/data-monitoring/monitors/schema-change-monitors) → Receive alerts when a table schema changes

If you need help creating your first few monitors, deciding which type of monitor to use in a particular situation, or developing an overall monitoring strategy, please reach out via email ([support@datafold.com](mailto:support@datafold.com)) and our team of experts will be happy to assist.


# Monitors as Code
Source: https://docs.datafold.com/data-monitoring/monitors-as-code

Manage Datafold monitors via version-controlled YAML for greater scalability, governance, and flexibility in code-based workflows.

<Note>
  **INFO**

  Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization.
</Note>

This is particularly useful if any of the following are true:

* You have (or plan to have) 100s or 1000s of monitors
* Your team is accustomed to managing things in code
* Strict governance and change management are important to you

## Getting started

<Note>
  **INFO**

  This section describes how to get started with GitHub Actions, but the same concepts apply to other hosted version control platforms like GitLab and Bitbucket. Contact us if you need help getting started.
</Note>

### Set up version control integration

To start using monitors as code, you'll need to decide which repository will contain your YAML configuration.

If you've already connected a repository to Datafold, you could use that. Or, follow the instructions [here](/integrations/code-repositories) to connect a new repository.

### Generate a Datafold API key

If you've already got a Datafold API key, use it. Otherwise, you can create a new one in the app by visiting **Settings > Account** and selecting **Create API Key**.

### Create monitors config

In your chosen repository, create a new YAML file where you'll define your monitors config.

For this example, we'll name the file `monitors.yaml` and place it in the root directory, but neither of these choices are hard requirements.

Leave the file blank for now—we'll come back to it in a moment.

### Add CI workflow

If you're using GitHub Actions, create a new YAML file under `.github/workflows/` using the following template. Be sure to tailor it to your particular setup:

```yaml
name: Apply monitors as code config to Datafold

on:
  push:
    branches:
      - main # or master

jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.12
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install datafold-sdk
      - name: Update monitors
        run: datafold monitors provision monitors.yaml # use the correct file name/path
        env:
          DATAFOLD_HOST: https://app.datafold.com # different for dedicated deployments
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} # remember to add to secrets
```

### Create a monitor

Now return to your YAML configuration file to add your first monitor. Reference the list of examples below and select one that makes sense for your organization.

## Examples

<Note>
  **INFO**

  These examples are intended to serve as inspiration and don't demonstrate every possible configuration. Contact us if you have any questions.
</Note>

### Data Diff

[Data Diff monitors](/data-monitoring/monitors/data-diff-monitors) detect differences between any two datasets, within or across databases.

```yaml
monitors:
  replication_test_example:
    name: 'Example of a custom name'
    description: 'Example of a custom description'
    type: diff
    enabled: true
    datadiff:
      dataset_a:
        connection_id: 734
        table: db.schema.table
        time_travel_point: '2020-01-01'
      dataset_b:
        connection_id: 736
        table: db.schema.table1
        time_travel_point: '2020-01-01'
      primary_key:
        - pk_column
      columns_to_compare:
        - col1
      materialize_results: true
      column_remapping:
        col1: col2
      sampling:
        rate: 0.1
      ignore_string_case: true
    schedule:
      interval:
        every: hour

  replication_test_example_with_thresholds:
    type: diff
    enabled: true
    datadiff:
      dataset_a:
        connection_id: 734
        table: db.schema.table
      dataset_b:
        connection_id: 736
        table: db.schema.table2
        materialize: false
        session_parameters:
          k: v
      primary_key:
        - pk_column
      egress_limit: 100
      per_column_diff_limit: 10
    schedule:
      interval:
        every: hour
    alert:
      different_rows_count: 100
      different_rows_percent: 10

  replication_test_example_with_thresholds_and_notifications:
    type: diff
    enabled: true
    datadiff:
      dataset_a:
        connection_id: 734
        table: db.schema.table
      dataset_b:
        connection_id: 736
        table: db.schema.table3
      primary_key:
        - pk_column
    schedule:
      interval:
        every: hour
    notifications:
      - type: email
        recipients:
          - valentin@datafold.com
      - type: slack
        integration: 123
        channel: datafold-alerts
      - type: pagerduty
        integration: 123
      - type: webhook
        integration: 123
    alert:
      different_rows_count: 100
      different_rows_percent: 10
```

### Metric

[Metric monitors](/data-monitoring/monitors/metric-monitors) identify anomalies in standard metrics like row count, freshness, and cardinality, or in any custom metric.

```yaml
monitors:
  table_metric_example:
    type: metric
    enabled: true
    connection_id: 736
    metric:
      type: table
      table: db.schema.table
      filter: deleted is false
      metric: freshness # see full list of options below
    alert:
      type: automatic
      sensitivity: 10
    schedule:
      interval:
        every: day
        hour: 8 # 0-23 UTC

  column_metric_example:
    type: metric
    enabled: true
    connection_id: 736
    metric:
      type: column
      table: db.schema.table
      column: some_col
      filter: deleted is false
      metric: sum # see full list of options below
    alert:
      type: absolute
      max: 100
      min: 0
    tags:
      - oncall
      - action-required
    schedule:
      cron: 0 0 * * * # every day at midnight UTC
```

#### Supported metrics

For more details on supported metrics, see the docs for [Metric monitors](/data-monitoring/monitors/metric-monitors#metric-types).

**Table metrics:**

* Freshness: `freshness`
* Row Count: `row_count`

**Column metrics:**

* Cardinality: `cardinality`
* Uniqueness: `uniqueness`
* Minimum: `minimum`
* Maximum: `maximum`
* Average: `average`
* Median: `median`
* Sum: `sum`
* Standard Deviation: `std_dev`
* Fill Rate: `fill_rate`

### Data Test

[Data Test monitors](/data-monitoring/monitors/data-test-monitors) validate your data with business rules and surface specific records that fail your tests.

```yaml
monitors:
  custom_data_test_example:
    type: test
    enabled: true
    connection_id: 736
    query: select 1 from db.schema.table
    schedule:
      interval:
        every: hour
    tags:
      - team_1

  accepted_values_test_example:
    type: test
    enabled: true
    connection_id: 736
    test:
      type: accepted_values
      tables:
        - path: db.schema.table
          columns:
            - column_name
      variables:
          accepted_values:
            value:
              - 12
              - 15
            quote: false
    schedule:
      interval:
        every: hour

  numeric_range_test_example:
    type: test
    enabled: true
    connection_id: 736
    test:
      type: numeric_range
      tables:
        - path: db.schema.table
          columns:
            - column_name
      variables:
        maximum:
          value: 15
          quote: false
    schedule:
      interval:
        every: hour
```

**Supported variables by Standard Data Test (SDT) type**

| SDT Type              | Monitor-as-Code Type    | Supported Variables | Variable Type          |
| --------------------- | ----------------------- | ------------------- | ---------------------- |
| Unique                | `unique`                | -                   | -                      |
| Not Null              | `not_null`              | -                   | -                      |
| Accepted Values       | `accepted_values`       | `accepted_values`   | Collection with values |
| Referential Integrity | `referential_integrity` | -                   | -                      |
| Numeric Range         | `numeric_range`         | `minimum`           | Single value           |
|                       |                         | `maximum`           | Single value           |

### Schema Change

[Schema Change monitors](/data-monitoring/monitors/schema-change-monitors) detect when changes occur to a table's schema.

```yaml
monitors:
  schema_change_example:
    type: schema
    enabled: true
    connection_id: 736
    table: db.schema.table
    schedule:
      interval:
        every: day
        hour: 22 # 0-23 UTC
    tags:
      - team_2
```

## Bulk Manage with Wildcards

For certain monitor types—[Freshness](/data-monitoring/monitors/metric-monitors), [Row Count](/data-monitoring/monitors/metric-monitors), and [Schema Change](/data-monitoring/monitors/schema-change-monitors)—it's possible to create/manage many monitors at once using the following wildcard syntax:

```yaml
row_count_monitors:
  type: metric
  connection_id: 123
  metric:
    type: table
    metric: row_count
    # include all tables in the WAREHOUSE database
    include_tables: WAREHOUSE.*
    # exclude all tables in the INFORMATION_SCHEMA schema
    exclude_tables: WAREHOUSE.INFORMATION_SCHEMA.*
  schedule:
    interval:
      every: day
      hour: 10 # 0-23 UTC
```

This is particularly useful if you want to create the same monitor type for many tables in a particular database or schema. Note in the example above that you can specify both `include_tables` and `exclude_tables` to fine-tune your selection.

## FAQ

<AccordionGroup>
  <Accordion title="Can I still create/manage monitors in the app if I'm using monitors as code?">
    Yes, it's not all or nothing. You can still create/manage monitors in the app even if you're defining others in code.
  </Accordion>

  <Accordion title="What happens to a monitor in the app if it's removed from the code?">
    By default, nothing—it remains in the app. However, you can add the `--dangling-monitors-strategy [delete|pause]` flag to your `run` command to either delete or pause notifications if they're removed from your code. For example:

    ```bash
    datafold monitors provision monitors.yaml --dangling-monitors-strategy delete
    ```

    Note: this only applies to monitors that were created from code, not those created in the UI.
  </Accordion>

  <Accordion title="How do I delete or pause all of my monitors?">
    Add the `--dangling-monitors-strategy [delete|pause]` flag to your `run` command and replace the contents of your YAML file with the following:

    ```yaml
    monitors: {}
    ```

    Note that providing an empty YAML file will likely produce an error and not have the same effect.
  </Accordion>

  <Accordion title="Can I use the app to modify monitors managed in code?">
    No, any monitors created from code will be read-only in the app (though they can still be cloned).
  </Accordion>

  <Accordion title="Can I export monitors I've created in the app so I can manage them in code?">
    Yes, please contact us and we'll be happy to assist.
  </Accordion>
</AccordionGroup>

## Need help?

If you have any questions about how to use monitors as code, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).


# Data Diff Monitors
Source: https://docs.datafold.com/data-monitoring/monitors/data-diff-monitors

Data Diff monitors compare datasets across or within databases, identifying row and column discrepancies with customizable scheduling and notifications.

## Ways to create a data diff monitor

There are 3 ways to create a data diff monitor:

1. From the **Monitors** page by clicking **Create new monitor** and then selecting **Data diff** as a type of monitor.
2. Clone an existing monitor by clicking **Actions** and then **Clone** in the header menu. This will pre-fill the form with the existing monitor configuration.
3. Create a monitor directly from the data diff results by clicking **Actions** and **Create monitor**. This will pre-fill the configuration with the parent data diff settings, requiring updates only for the **Schedule** and **Notifications** sections.

Once a monitor is created and initial metrics collected, you can set up [thresholds](/data-monitoring/monitors/data-diff-monitors#monitoring) for the two metrics.

## Create a new data diff monitor

Setting up a new diff monitor in Datafold is straightforward. You can configure it with the following parameters and options:

### General

Choose how you want to compare your data and whether the diff type is in-database or cross-database.

Pick your data connections. Then, choose the two datasets you want to compare. This can be a table or a view in your relational database.

If you need to compare just a subset of data (e.g., for a particular city or last two weeks), add a SQL filter.

Select **Materialize inputs** to improve diffing speed when query is heavy on compute, or if filters are applied to non-indexed columns, or if primary keys are transformed using concatenation, coalesce, or another function.

<Frame caption="Data Diff General Settings">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_diff_settings_general-bdcc461b033ca6c91e5831339673e522.png" />
</Frame>

### Column remapping

When columns are the same data type but are named differently, column remapping allows you to align and compare them. This is useful when datasets have semantically identical columns with different names, such as `userID` and `user_id`. Datafold will surface any differences under the column name used in Dataset A.

<Frame caption="Column Remapping Settings">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_diff_settings_column_remapping-59552ddfda90200e3eba4ab4461757f8.png" />
</Frame>

### Diff settings

<Frame caption="Diff Settings">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_diff_settings_diff_settings-cd905714f5f522817bcd503f38a1fefc.png" />
</Frame>

#### Primary key

The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere as it is used for unique row identification during diffing. Multiple columns support compound primary key definitions.

#### Egress limit

The egress limit optimizes the diff process by terminating it once a predefined number of rows are downloaded. The limit is by default set to 1,000,000 by default.

When the egress limit is reached, the diffing process does not produce the same results each time it is run, as it is not deterministic (i.e., the order in which data is processed may vary).

The egress limit prevents redundant analysis in scenarios with minor, repetitive discrepancies, such as formatting differences (e.g., whitespace, rounding differences). For most use cases, it is impractical to continue diffing after it is known that datasets are substantially different.

Since the algorithm aims to detect and return **every** mismatched row/value, if the datasets have a large percentage of differing rows, the algorithm may be unable to take advantage of checksumming. This can cause a large amount of data to be pulled over the network, which slows down the diffing process, and increases the strain on the database.

Setting an egress limit prevents unwanted runtime and database load by stopping the operation early in cases of substantial dataset discrepancies. It is highly recommended to set an egress limit, taking into account these tradeoffs between cost/speed and rigor.

#### Per-column diff limit

By setting a per-column diff limit, Data Diff will stop identifying differences for any column after a number of differences is found, based on the limit. Data Diff will also stop searching for exclusive and duplicate primary keys after the limit is reached.

Setting a per-column diff limit enables your team to find data quality issues that arise during data reconciliation while minimizing compute and time spent searching for differences.

#### Columns to compare

Determine whether to compare all columns or select specific one(s). To optimize performance on large tables, it's recommended to exclude columns known to have unique values for every row, such as timestamp columns like "updated\_at," or apply filters to limit the comparison scope.

#### Materialize diff results

Choose whether to store diff results in a table.

#### Sampling

Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets.

There are two ways to enable sampling in Monitors: [Tolerance](#tolerance) and [% of Rows](#-of-rows).

<Tip>
  **TIP**

  When should I use sampling tolerance instead of percent of rows?

  Each has its specific use cases and benefits, please [see the FAQ section](#sampling-tolerance-vs--of-rows) for a more detailed breakdown.
</Tip>

##### Tolerance

Tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (like nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_diff_settings_sampling_tolerance-c48e5b3c707d1e6cec2a5cde995c6d01.png" />
</Frame>

When sampling tolerance is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.

Default: 0.001%

###### Sampling confidence

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance.

To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.

Default: 99%

###### Sampling threshold

Sampling will be disabled if total row count of the largest table is less that the threshold value.

###### Sample size

This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:

* The presence of duplicate primary keys in the datasets will likely increase this estimate, as it inflates the perceived uniqueness of rows
* Applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope

The number of rows we sample is not fixed; instead, we use a statistical approach called the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (compare every single row) to establish a baseline.

Example: Imagine there are two datasets we want to compare, Main and Test. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:

* Sampling tolerance: 5%
* Sampling confidence: 95%

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.

##### % of rows

Percent of rows sampling defines the proportion of the dataset to be included in the sample by specifying a percentage of the total number of rows. For example, setting the sampling percentage to 0.1% means that only 0.1% of the total rows will be sampled for analysis or comparison.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_diff_settings_sampling_percent-48c1e9b953a45b49213d5a9799842875.png" />
</Frame>

When percent of rows sampling is enabled, a fixed percentage of rows is selected randomly from the dataset. This method simplifies the sampling process, making it easy to understand and configure without needing to adjust complex statistical parameters. However, it lacks the statistical assurances provided by methods like sampling tolerance.

It doesn't dynamically adjust based on data characteristics or discrepancies but rather adheres strictly to the specified percentage, regardless of the dataset's variability. This straightforward approach is ideal for scenarios where simplicity and quick setup are more important than precision and statistical confidence. It provides a basic yet effective way to estimate the dataset's characteristics or differences, suitable for less critical data validation tasks.

###### Sampling rate

This refers to the percentage of the total number of rows in the largest table that will be used to determine the sample size. This ensures that the sample size is proportionate to the size of the dataset, providing a representative subset for comparison. For instance, if the largest table contains 1,000,000 rows and the sampling rate is set to 1%, the sample size will be 10,000 rows.

###### Sampling threshold

Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.

###### Sampling size

This parameter is the [same one used in sampling tolerance](#sample-size).

### Add a schedule

You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/monitors/schedule.png" />
</Frame>

### Add notifications

You can add notifications, sent through Slack or emails, which indicate whether a monitor has been executed.

Notifications are sent when either or both predefined thresholds are reached during a Diff Monitor. You can set a maximum threshold for the:

* Number of different rows
* Percentage of different rows

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/monitor_settings_notifications-c4bd20b39b0ec478ae4a5e46a0dce0e8.png" />
</Frame>

## Results

The diff monitor run history shows the results from each run.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_diff_results-0ed255b70929f8b7a501b59d042d7958.png" />
</Frame>

Each run includes basic stats, along with metrics such as:

* The total rows different: number of different rows according to data diff results.
* Rows with different values: percentage of different rows relative to the total number of rows in dataset A according to data diff results. Note that the status `Different` doesn't automatically map into a notification/alert.

Click the **Open Diff** link for more granular information about a specific Data Diff.

## FAQ

<AccordionGroup>
  <Accordion title="Sampling tolerance vs. % of rows">
    Use sampling tolerance when you need statistical confidence in your results, as it is more efficient and stops sampling once a difference is confidently detected. This method is ideal for critical data validation tasks that require precise accuracy.

    On the other hand, use the percent of rows method for its simplicity and ease of use, especially in less critical scenarios where you just need a straightforward, quick sampling approach without worrying about statistical parameters. This method is perfect for general, easy-to-understand sampling needs.
  </Accordion>

  <Accordion title="Need help?">
    If you have any questions about how to use Data Diff monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).
  </Accordion>
</AccordionGroup>


# Data Test Monitors
Source: https://docs.datafold.com/data-monitoring/monitors/data-test-monitors

Data Tests validate your data against off-the-shelf checks or custom business rules.

Data Test monitors allow you to validate your data using off-the-shelf checks for non-null or unique values, numeric ranges, accepted values, referential integrity, and more. Custom tests let you write custom SQL queries to validate your own business rules.

Think of Data Tests as pass/fail—either a test returns no records (pass) or it returns at least one record (fail). Failed records are viewable in the app, materialized to a temporary table in your warehouse, and can even be [attached to notifications as a CSV](/data-monitoring/monitors/data-test-monitors#attach-csvs-to-notifications).

## Create a Data Test monitor

There are two ways to create a Data Test monitor:

1. Open the **Monitors** page, select **Create new monitor**, and then choose **Data Test**.
2. Clone an existing Data Test monitor by clicking **Actions** and then **Clone**. This will pre-fill the form with the existing monitor configuration.

## Set up your monitor

Select your data connection, then choose whether you'd like to use a [Standard](/data-monitoring/monitors/data-test-monitors#standard-data-tests) or [Custom](/data-monitoring/monitors/data-test-monitors#custom-data-tests) test.

### Standard Data Tests

Standard tests allow you to validate your data against off-the-shelf checks for non-null or unique values, numeric ranges, accepted values, referential integrity, and more.

After choosing your data connection, select **Standard** and the specific test that you'd like to run. If you don't see the test you're looking for, you can always write a [Custom test](/data-monitoring/monitors/data-test-monitors#custom-data-tests).

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/monitors/standard_data_test_types.png" />
</Frame>

#### Quoting variables

Some test types (e.g. accepted values) require you to provide one or more values, which you may want to have quoted in the final SQL. The **Quote** flag, which is enabled by default, allows you to control this behavior. Here's an example.

Quoting **enabled** for `EXAMPLE_VALUE` (default):

```sql
SELECT *
FROM DB.SCHEMA.TABLE1
WHERE "COLUMN1" < 'EXAMPLE_VALUE';
```

Quoting **disabled** for `EXAMPLE_VALUE`:

```sql
SELECT *
FROM DB.SCHEMA.TABLE1
WHERE "COLUMN1" < EXAMPLE_VALUE;
```

### Custom Data Tests

When you need to test something that's not available in our [Standard tests](/data-monitoring/monitors/data-test-monitors#standard-data-tests), you can write a Custom test. Select your data connection, choose **Custom**, then write your SQL query.

Importantly, keep in mind that your query should return records that *fail* the test. Here are some examples to illustrate this.

**Custom business rule**

Say your company defines active users as individuals who have signed into your application at least 3 times in the past week. You could write a test that validates this logic by checking for users marked as active who *haven't* reached this threshold:

```sql
SELECT *
FROM users
WHERE status = 'active'
    AND signins_last_7d < 3;
```

**Data formatting**

If you wanted to validate that all phone numbers in your contacts table are 10 digits and only contain numbers, you'd return records that are not 10 digits or use non-numeric characters:

```sql
SELECT *
FROM contacts
WHERE LENGTH(phone_number) != 10
    OR phone_number REGEXP '[^0-9]';
```

## Add a schedule

You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/monitors/schedule.png" />
</Frame>

## Add notifications

Receive notifications via Slack or email when at least one record fails your test:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/monitor_settings_notifications-c4bd20b39b0ec478ae4a5e46a0dce0e8.png" />
</Frame>

## Attach CSVs to notifications

Datafold allows attaching a CSV of failed records to Slack and email notifications. This is useful if, for example, you have business users who don't have a Datafold license but need to know about records that fail your tests.

This option is configured separately per notification destination as shown here:

![Attach CSVs to Data Tests notifications](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data-test-csv-1.png)

<Note>
  CSV attachments are limited to the lesser of 1,000 rows or 1 MB in file size.
</Note>

### Attaching CSVs in Slack

In order to attach CSVs to Slack notifications, you need to complete 1-2 additional steps:

1. If you installed the Datafold Slack app prior to October 2024, you'll need to reinstall the app by visiting Settings > Integrations > Notifications, selecting your Slack integration, then **Reinstall Slack integration**.
2. Invite the Datafold app to the channel you wish to send notifications to using the `/invite` command shown below:

![Invite Datafold app to Slack channel](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data-test-csv-2.png)

## Run Tests in CI

Standard Data Tests run on a schedule against your production data. But often it's useful to test data before it gets to production as part of your deployment workflow. For this reason, Datafold supports running tests in CI.

Data Tests in CI work very similarly to our [Monitors as Code](/data-monitoring/monitors-as-code) feature, in the sense that you define your tests in a version-controled YAML file. You then use the Datafold SDK to execute those tests as part of your CI workflow.

### Write your tests

First, create a new file (e.g. `tests.yaml`) in the root of your repository. Then write your tests use the same format described in our [Monitors as Code](/data-monitoring/monitors-as-code) docs with two exceptions:

1. Add a `run_in_ci` flag to each test and set it to `true` (assuming you'd like to run the test)
2. (Optional) Add placeholders for variables that you'd like to populate dynamically when executing your tests

Here's an example:

```yaml
monitors:
  null_pk_test:
    type: test
    name: No NULL pk in the users table
    run_in_ci: true
    connection_id: 8
    query: select * from {{ schema }}.USERS where id is null

  duplicate_pk_test:
    type: test
    name: No duplicate pk in the users table
    run_in_ci: true
    connection_id: 8
    query: |
        select *
        from {{ schema }}.USERS
        where id in (
            select id
            from {{ schema }}.USERS
            group by id
            having count(*) > 1
        );
```

### Execute your tests

<Note>
  **INFO**

  This section describes how to get started with GitHub Actions, but the same concepts apply to other hosted version control platforms like GitLab and Bitbucket. Contact us if you need help getting started.
</Note>

If you're using GitHub Actions, create a new YAML file under `.github/workflows/` using the following template. Be sure to tailor it to your particular setup:

```yaml
  on:
    push:
      branches:
        - main
    pull_request:
  jobs:
    test:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v2
        - uses: actions/checkout@v2
          with:
            token: ${{ secrets.GH_TOKEN }}
            repository: datafold/datafold-sdk
            path: datafold-sdk
            ref: data-tests-in-ci-demo
        - uses: actions/setup-python@v2
          with:
            python-version: '3.12'
        - name: Install dependencies
          run: |
            python -m pip install --upgrade pip
            pip install -r requirements.txt
        - name: Set schema env var in PR
          run: |
            echo "SCHEMA=ANALYTICS.PR" >> $GITHUB_ENV
          if: github.event_name == 'pull_request'
        - name: Set schema env var in main
          run: |
            echo "SCHEMA=ANALYTICS.CORE" >> $GITHUB_ENV
          if: github.event_name == 'push'
        - name: Run tests
          run: |
            datafold tests run --var schema:$SCHEMA --ci-config-id 1 tests.yaml # use the correct file name/path
          env:
            DATAFOLD_HOST: https://app.datafold.com # different for dedicated deployments
            DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} # remember to add to secrets
```

### View the results

When your CI workflow is triggered (e.g. by a pull request), you can view the terminal output for your test results:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_tests_in_ci_output-9be8c97e4d32734e71edee4f201e0ffc.png" />
</Frame>

## Need help?

If you have any questions about how to use Data Test monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).


# Metric Monitors
Source: https://docs.datafold.com/data-monitoring/monitors/metric-monitors

Metric monitors detect anomalies in your data using ML-based algorithms or manual thresholds, supporting standard and custom metrics for tables or columns.

<Note>
  **INFO**

  Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization.
</Note>

Metric monitors allow you to perform anomaly detection—either automatically using our ML-based algorithm or by setting manual thresholds—on the following metric types:

1. Standard metrics (e.g. row count, freshness, and cardinality)
2. Custom metrics (e.g. sales volume per region)

## Create a Metric monitor

There are two ways to create a Metric Monitor:

1. Open the **Monitors** page, select **Create new monitor**, and then choose **Metric**.
2. Clone an existing Metric monitor by clicking **Actions** and then **Clone**. This will pre-fill the form with the existing monitor configuration.

## Set up your monitor

Select your data connection, then choose the type of metric you'd like: **Table**, **Column**, or **Custom**.

If you select table or column, you have the option to add a SQL filter to refine your dataset. For example, you could implement a 7-day rolling time window with the following: `timestamp >= dateadd(day, -7, current_timestamp)`. Please ensure the SQL is compatible with your selected data connection.

## Metric types

### Table metrics

| Metric    | Definition                        | Additional Notes                                                                                               |
| --------- | --------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| Freshness | Time since table was last updated | Measured in minutes. Derived from INFORMATION\_SCHEMA. Only supported for Snowflake, BigQuery, and Databricks. |
| Row Count | Total number of rows              |                                                                                                                |

### Column metrics

| Metric             | Definition                     | Supported Column Types | Additional Notes           |
| ------------------ | ------------------------------ | ---------------------- | -------------------------- |
| Cardinality        | Number of distinct values      | All types              |                            |
| Uniqueness         | Proportion of distinct values  | All types              | Proportion between 0 and 1 |
| Minimum            | Lowest numeric value           | Numeric columns        |                            |
| Maximum            | Highest numeric value          | Numeric columns        |                            |
| Average            | Mean value                     | Numeric columns        |                            |
| Median             | Median value (50th percentile) | Numeric columns        |                            |
| Sum                | Sum of all values              | Numeric columns        |                            |
| Standard Deviation | Measure of data spread         | Numeric columns        |                            |
| Fill Rate          | Proportion of non-null values  | All types              | Proportion between 0 and 1 |

### Custom metrics

Our custom metric framework is extremely flexible and supports several approaches to defining metrics. Depending on the approach you choose, your query should return some combination of the following columns:

* **Metric value (required)**: a numeric column containing your *metric values*
* **Timestamp (optional)**: a date/time column containing *timestamps* corresponding to your metric values
* **Group (optional)**: a string column containing *groups/dimensions* for your metric

<Note>
  **INFO**

  The names and order of your columns don't matter. Datafold will automatically infer their meaning based on data type.
</Note>

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/custom_metric_matrix-7f38681722ab77f7d52e0b9350af9ab9.png" />
</Frame>

The following questions will help you decide which approach is best for you:

1. **Do you want to group your metric by the value of a column in your query?** For example, if your metric is *sales volume per day*, rather than looking at a single metric that encompasses all sales globally, it might be more informative to group by country. In this case, Datafold will automatically compute sales volume separately for each country to assist with root cause analysis when there’s an unexpected change.
2. **Will your query return a single metric value (per group, if relevant) on every monitor run, or an entire time series?** We generally recommend starting with the simpler approach of providing a single metric value (per group) per monitor run. However, if you’ve already defined a time series elsewhere (e.g. in your BI tool) and simply want to copy/paste that query into Datafold, then you may prefer the latter approach.

<Note>
  **INFO**

  Datafold will only log a single data point per timestamp per group, which means you should only send data for a particular time period once that period is complete.
</Note>

1. **If your metric returns a single value per monitor run, will you provide your own timestamps or use the timestamps of monitor runs?** If your query returns a single value per run, we generally recommend letting Datafold provide timestamps based on monitor runs unless you have a compelling reason to provide your own. For example, if your metric always lags by one day, you could explicitly associate yesterday's date with each observation.

As you're writing your query, Datafold will let you know if the result set doesn't match one of the accepted patterns. If you have questions, please contact us and we'll be happy to help.

## Configure anomaly detection

Enable anomaly detection to get the most out of metric monitors. You have several options:

* **Automatic**: our automated anomaly detection uses machine learning to flag metric values that are out of the ordinary. Dial the sensitivity up or down depending on how many alerts you'd like to receive.
* **Manual**: specific thresholds beyond which you'd like the monitor to trigger an alert. **Fixed Values** are specific minimum and/or maximum values, while **Percent Change** measure the magnitude of change from one observation to the next.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/anomaly_detection_menu-bec86b18752d4a0a3081de8ce1983485.png" />
</Frame>

## Add a schedule

You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/monitors/schedule.png" />
</Frame>

## Add notifications

Send notifications via Slack or email when your monitor exceeds a threshold (automatic or manual):

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/monitor_settings_notifications-c4bd20b39b0ec478ae4a5e46a0dce0e8.png" />
</Frame>

## Need help?

If you have any questions about how to use Metric monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).


# Schema Change Monitors
Source: https://docs.datafold.com/data-monitoring/monitors/schema-change-monitors

Schema Change monitors notify you when a table’s schema changes, such as when columns are added, removed, or data types are modified.

<Note>
  **INFO**

  Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization.
</Note>

Schema change monitors alert you when a table’s schema changes in any of the following ways:

* Column added
* Column removed
* Data type changed

## Create a Schema Change monitor

There are two ways to create a Schema Change monitor:

1. Open the **Monitors** page, select **Create new monitor**, and then choose **Schema Change**.
2. Clone an existing Schema Change monitor by clicking **Actions** and then **Clone**. This will pre-fill the form with the existing monitor configuration.

## Set up your monitor

To set up a Schema Change monitor, simply select your data connection and the table you wish to monitor for changes.

## Add a schedule

You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/monitors/schedule.png" />
</Frame>

## Add notifications

Receive notifications via Slack or email when at least one record fails your test:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/monitor_settings_notifications-c4bd20b39b0ec478ae4a5e46a0dce0e8.png" />
</Frame>

## FAQ

<Accordion title="Don't data diffs detect schema changes too?">
  Yes, but in a different context. While data diffs report on schema differences *between two tables at the same time* (unless you’re using the time travel feature), data diff monitors alert you to schema changes for the *same table over time*.
</Accordion>

## Need help?

If you have any questions about how to use Schema Change monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).


# Deployment Options
Source: https://docs.datafold.com/datafold-deployment/datafold-deployment-options

Datafold is a web-based application with multiple deployment options, including multi-tenant SaaS and dedicated cloud (either customer- or Datafold-hosted).

## SaaS / Multi-Tenant

Our standard multi-tenant deployment is a cost-efficient option for most teams and is available in two AWS regions:

| Region Name      | Region      | Sign-Up Page                                                               |
| :--------------- | :---------- | :------------------------------------------------------------------------- |
| US West (Oregon) | `us-west-2` | [https://app.datafold.com/org-signup](https://app.datafold.com/org-signup) |
| Europe (Ireland) | `eu-west-1` | [https://eu.datafold.com/org-signup](https://eu.datafold.com/org-signup)   |

For additional security, we provide the following options:

1. [IP Whitelisting](/security/securing-connections#ip-whitelisting): only allow access to specific IP addresses
2. [AWS PrivateLink](/security/securing-connections#private-link): set up a limited network point to access your RDS in the same region
3. [VPC Peering](/security/securing-connections#vpc-peering-saas): securely join two networks together
4. [SSH Tunnel](/security/securing-connections#ssh-tunnel): set up a secure tunnel between your network and Datafold with the SSH server on your side
5. [IPSec Tunnel](/security/securing-connections#ipsec-tunnel): an IPSec tunnel setup

## Dedicated Cloud

We also offer a single-tenant deployment of the Datafold application in a dedicated Virtual Private Cloud (VPC). The options are (from least to most complex):

1. **Datafold-hosted, Datafold-managed**: the Cloud account belongs to Datafold and we manage the Datafold application for you.
2. **Customer-hosted, Datafold-managed**: the Cloud account belongs to you, but we manage the Datafold application for you.
3. **Customer-hosted, Customer-managed**: the Cloud account belongs to you and you manage the Datafold application. Datafold does not have access.

Dedicated Cloud can be deployed to all major cloud providers:

* [AWS](/datafold-deployment/dedicated-cloud/aws)
* [GCP](/datafold-deployment/dedicated-cloud/gcp)
* [Azure](/datafold-deployment/dedicated-cloud/azure)

<Tip>
  **VPC vs. VNet**

  We use the term VPC across all major cloud providers. However, Azure refers to this concept as a Virtual Network (VNet).
</Tip>

### Datafold Dedicated Cloud FAQ

<AccordionGroup>
  <Accordion title="What is the benefit of a Dedicated Cloud deployment?">
    Dedicated Cloud deployment may be the preferred deployment method by customers with special privacy and security concerns and in highly regulated domains. In a Dedicated Cloud deployment, the entire Datafold stack runs on dedicated cloud infrastructure and network, which usually means it is:

    1. Not accessible to public Internet (sits behind customer's VPN)
    2. Uses internal network to communicate with customer's databases and other resources – none of the data is sent using public networks
  </Accordion>

  <Accordion title="How does a Customer-hosted Dedicated Cloud deployment work?">
    Datafold is deployed to customer's cloud infrastructure but is fully managed by Datafold. The only DevOps involvement needed from the customer's side is to set up a cloud project and role (steps #1 and #2 below).

    1. Customer creates a Datafold-specific namespace in their cloud account (subaccount in AWS / project in GCP / subscription or resource group in Azure)
    2. Customer creates a Datafold-specific IAM resource with permissions to deploy the Datafold-specific namespace
    3. Datafold Infrastructure team provisions the Datafold stack on the customer's infrastructure using fully-automated procedure with Terraform
    4. Customer and Datafold Infrastructure teams collaborate to implement the security and networking requirements, see [all available options](#additional-security-dedicated-cloud)

    See cloud-specific instructions here:

    * [AWS](/datafold-deployment/dedicated-cloud/aws)
    * [GCP](/datafold-deployment/dedicated-cloud/gcp)
    * [Azure](/datafold-deployment/dedicated-cloud/azure)

    After the initial deployment, the Datafold team uses the same procedure to roll out software updates and perform maintenance to keep the uptime SLA.
  </Accordion>

  <Accordion title="How does a Datafold-hosted Dedicated Cloud deployment work?">
    Datafold is deployed in the customer's region of choice on AWS, GCP, or Azure that is owned and managed by Datafold. We collaborate to implement the security and networking requirements ensuring that traffic either does not cross the public internet or, if it does, does so securely. All available options are listed below.
  </Accordion>

  <Accordion title="How does a Customer-hosted, Customer-managed deployment work?">
    This deployment method follows the same process as the standard customer-hosted deployment (see above), but with a key difference: the customer is responsible for managing both the infrastructure and the application. Datafold engineers do not have any access to the deployment in this case.

    We offer open-source projects that facilitate this deployment, with examples for every major cloud provider. You can find these projects on GitHub:

    * [AWS](https://github.com/datafold/terraform-aws-datafold)
    * [GCP](https://github.com/datafold/terraform-google-datafold)
    * [Azure](https://github.com/datafold/terraform-azure-datafold)

    Each of these projects uses a Helm chart for deploying the application. The Helm chart is also available on GitHub:

    * [Helm Chart](https://github.com/datafold/helm-charts)

    By providing these open-source projects, Datafold enables you to integrate the deployment into your own infrastructure, including existing clusters. This allows your infrastructure team to manage the deployment effectively.

    <Tip>
      **Deployment Secrets:** Datafold provides the necessary secrets for downloading images as part of the license agreement. Without this agreement, the deployment will not complete successfully.
    </Tip>
  </Accordion>

  <Accordion title="What additional security and networking options are available?">
    Because the Datafold application is deployed in a dedicated VPC, your databases/integrations are not directly accessible when they are not exposed to the public Internet. The following solutions enable secure connections to your databases/integrations without exposing them to the public Internet:

    <Tabs>
      <Tab title="AWS">
        1. [PrivateLink](/security/securing-connections?current-cloud=aws#private-link "PrivateLink")
        2. [VPC Peering](/security/securing-connections#vpc-peering-dedicated-cloud "VPC Peering")
        3. [SSH Tunnel](/security/securing-connections#ssh-tunnel "SSH Tunnel")
        4. [IPSec Tunnel](/security/securing-connections#ipsec-tunnel "IPSec Tunnel")
      </Tab>

      <Tab title="GCP">
        1. [Private Service Connect](/security/securing-connections?current-cloud=gcp#private-link "Private Service Connect")
        2. [VPC Peering](/security/securing-connections#vpc-peering-dedicated-cloud "VPC Peering")
        3. [SSH Tunnel](/security/securing-connections#ssh-tunnel "SSH Tunnel")
      </Tab>

      <Tab title="Azure">
        1. [Private Link](/security/securing-connections?current-cloud=azure#private-link "Private Link")
        2. [VNet Peering](/security/securing-connections#vpc-peering-dedicated-cloud "VNet Peering")
        3. [SSH Tunnel](/security/securing-connections#ssh-tunnel "SSH Tunnel")
      </Tab>
    </Tabs>
  </Accordion>

  <Accordion title="Can Datafold be deployed and managed by the customer's internal team?">
    Please inquire with [sales@datafold.com](mailto:sales@datafold.com) about customer-managed deployment options.
  </Accordion>
</AccordionGroup>


# Datafold VPC Deployment on AWS
Source: https://docs.datafold.com/datafold-deployment/dedicated-cloud/aws

Learn how to deploy Datafold in a Virtual Private Cloud (VPC) on AWS.

<Note>
  **INFO**

  VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.
</Note>

## Create a Domain Name (optional)

You can either choose to use your domain (for example, `datafold.domain.tld`) or to use a Datafold managed domain (for example, `yourcompany.dedicated.datafold.com`).

### Customer Managed Domain Name

Create a DNS A-record for the domain where Datafold will be hosted. For the DNS record, there are two options:

* **Public-facing:** When the domain is publicly available, we will provide an SSL certificate for the endpoint.
* **Internal:** It is also possible to have Datafold disconnected from the internet. This would require an internal DNS (for example, AWS Route 53) record that points to the Datafold instance. It is possible to provide your own certificate for setting up the SSL connection.

Once the deployment is complete, you will point that A-record to the IP address of the Datafold service.

## Give Datafold Access to AWS

For setting up Datafold, it is required to set up a separate account within your organization where we can deploy Datafold. We're following the [best practices of AWS to allow third-party access](https://docs.aws.amazon.com/IAM/latest/UserGuide/id%5Froles%5Fcommon-scenarios%5Fthird-party.html).

### Create a separate AWS account for Datafold

First, create a new account for Datafold. Go to **My Organization** to add an account:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/onprem_aws_landing-329bb3e7015c52b1b3a9d4872ff71d66.png" />
</Frame>

Click **Add an AWS Account**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/onprem_aws_add_account-e8f6d7c449b5763c1962b0df7322ecf0.png" />
</Frame>

You can name this account anything that helps identify it clearly. In our examples, we name it **Datafold**. Make sure that the email address of the owner isn't used by another account.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/onprem_aws_account-41993b39bb1c092bd085a1727f5537e9.png" />
</Frame>

When you click the **Create AWS Account** button, you'll be returned back the organization screen, and see the notification that the new account is being created. After refresh a few minutes later, the account should appear in the organizations list.

### Grant Third-Party access to Datafold

To make sure that deployment runs as expected, your Datafold Support Engineer may need access to the Datafold-specific AWS account that you created. The access can be revoked after the deployment if needed.

To grant access, log into the account created in the previous step. You can switch to the newly created account using the [Switch Role page](https://signin.aws.amazon.com/switchrole):

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/onprem_aws_switch_role-f8ff2e8a925e444830b7db4afd41a14d.png" />
</Frame>

By default, the role name is **OrganizationAccountAccessRole**.

Click **Switch Role** to log in to the Datafold account.

## Grant Access to Datafold

Next, we need to allow Datafold to access the account. We do this by allowing the Datafold AWS account to access your AWS workspace. Go to the [IAM page](https://console.aws.amazon.com/iam/home) or type **IAM** in the search bar:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/onprem_aws_iam-dc7c1aa1e6e33ef4c37d46f0092c5268.png" />
</Frame>

Go to the Roles page, and click the **Create Role** button:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/onprem_aws_create_role-82ea0f25999413532214cae7b4cf1c89.png" />
</Frame>

Select **Another AWS Account**, and use account ID `710753145501`, which is Datafold's account ID. Select **Require MFA** and click **Next: Permissions**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/onprem_aws_role_config-11a94bf4eb4fb1921544b0824d65d223.png" />
</Frame>

On the Permissions page, attach the **AdministratorAccess** permissions for Datafold to have control over the resources within the account, or see [Minimal IAM Permissions](#minimal-iam-permissions).

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/onprem_aws_role_permissions-72630b3366b32c6e1a4986c52f98f439.png" />
</Frame>

Next, you can set **Tags**; however, they are not a requirement.

Finally, give the role a name of your choice. Be careful not to duplicate the account name. If you named the account in an earlier step `Datafold`, you may want to name the role `Datafold-role`.

Click **Create Role** to complete this step.

Now that the role is created, you should be routed back to a list of roles in your organization.

Click on your newly created role to get a sharable link for the account and store this in your password manager. When setting up your deployment with a support engineer, Datafold will use this link to gain access to the account.

After validating the deployment with your support engineer, and making sure that everything works as it should, we will let you know when it's clear to revoke the credentials.

### Minimal IAM Permissions

Because we work in a Account dedicated to Datafold, there is no direct access to your resources unless explicitly configured (e.g., VPC Peering). The following IAM policy are required to update and maintain the infrastructure.

```JSON
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "acm:AddTagsToCertificate",
                "acm:DeleteCertificate",
                "acm:DescribeCertificate",
                "acm:GetCertificate",
                "acm:ListCertificates",
                "acm:ListTagsForCertificate",
                "acm:RemoveTagsFromCertificate",
                "acm:RequestCertificate",
                "acm:UpdateCertificateOptions",
                "autoscaling:*",
                "ec2:*",
                "eks:*",
                "elasticloadbalancing:*",
                "iam:GetPolicy",
                "iam:GetPolicyVersion",
                "iam:GetOpenIDConnectProvider",
                "iam:GetRole",
                "iam:GetRolePolicy",
                "iam:GetUserPolicy",
                "iam:GetUser",
                "iam:ListAccessKeys",
                "iam:ListAttachedRolePolicies",
                "iam:ListGroupsForUser",
                "iam:ListInstanceProfilesForRole",
                "iam:ListPolicies",
                "iam:ListPolicyVersions",
                "iam:ListRolePolicies",
                "iam:PassRole",
                "iam:TagOpenIDConnectProvider",
                "iam:TagPolicy",
                "iam:TagRole",
                "iam:TagUser",
                "kms:CreateAlias",
                "kms:CreateGrant",
                "kms:CreateKey",
                "kms:Decrypt",
                "kms:DeleteAlias",
                "kms:DescribeKey",
                "kms:DisableKey",
                "kms:GenerateDataKey",
                "kms:GetKeyPolicy",
                "kms:GetKeyRotationStatus",
                "kms:ListAliases",
                "kms:ListResourceTags",
                "kms:PutKeyPolicy",
                "kms:RevokeGrant",
                "kms:ScheduleKeyDeletion",
                "kms:TagResource",
                "logs:CreateLogGroup",
                "logs:DeleteLogGroup",
                "logs:DescribeLogGroups",
                "logs:ListTagsLogGroup",
                "logs:PutRetentionPolicy",
                "logs:TagResource",
                "rds:*",
                "s3:*"
            ],
            "Resource": "*"
        }
    ]
}
```

Some policies we need from time to time. For example, when we do the first deployment. Since those are IAM-related, we will ask for temporary permissions when required.

```JSON
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:AttachRolePolicy",
                "iam:CreateAccessKey",
                "iam:CreateOpenIDConnectProvider",
                "iam:CreatePolicy",
                "iam:CreateRole",
                "iam:CreateUser",
                "iam:DeleteAccessKey",
                "iam:DeleteOpenIDConnectProvider",
                "iam:DeletePolicy",
                "iam:DeleteRole",
                "iam:DeleteRolePolicy",
                "iam:DeleteUser",
                "iam:DeleteUserPolicy",
                "iam:DetachRolePolicy",
                "iam:PutRolePolicy",
                "iam:PutUserPolicy"
            ],
            "Resource": "*"
        }
    ]
}
```


# Datafold VPC Deployment on Azure
Source: https://docs.datafold.com/datafold-deployment/dedicated-cloud/azure

Learn how to deploy Datafold in a Virtual Private Cloud (VPC) on Azure.

<Note>
  **INFO**

  VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.
</Note>

## Create a Domain Name (optional)

You can either choose to use your domain (for example, `datafold.domain.tld`) or to use a Datafold managed domain (for example, `yourcompany.dedicated.datafold.com`).

### Customer Managed Domain Name

Create a DNS A-record for the domain where Datafold will be hosted. For the DNS record, there are two options:

* **Public-facing:** When the domain is publicly available, we will provide an SSL certificate for the endpoint.
* **Internal:** It is also possible to have Datafold disconnected from the internet. This would require an internal DNS (for example, AWS Route 53) record that points to the Datafold instance. It is possible to provide your own certificate for setting up the SSL connection.

Once the deployment is complete, you will point that A-record to the IP address of the Datafold service.

## Create a New Subscription

For isolation reasons, it is best practice to [create a new subscription](https://learn.microsoft.com/en-us/azure/cost-management-billing/manage/create-subscription) within your Microsoft Entra directory/tenant. Please call it something like `yourcompany-datafold` to make it easy to identify.

## Set IAM Permissions

Go to **Microsoft Entra ID** and navigate to **Users**. Click **Add**, **User**, **Invite external user** and add the Datafold engineers.

Navigate to the subscription you just created and go to **Access control (IAM)** tab in the side bar.

* Navigate to the subscription you just created. Go to **Access control (IAM)**. Under **Add** select **Add role assignment**.
* Under **Role**, navigate to **Priviledged administrator roles** and select **Owner**.
* Under **Members**, click **Select members** and add the Datafold engineers.
* When you are done, select **Review + assign**.


# Datafold VPC Deployment on GCP
Source: https://docs.datafold.com/datafold-deployment/dedicated-cloud/gcp

Learn how to deploy Datafold in a Virtual Private Cloud (VPC) on GCP.

<Note>
  **INFO**

  VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.
</Note>

## Create a Domain Name (optional)

You can either choose to use your domain (for example, `datafold.domain.tld`) or to use a Datafold managed domain (for example, `yourcompany.dedicated.datafold.com`).

### Customer Managed Domain Name

Create a DNS A-record for the domain where Datafold will be hosted. For the DNS record, there are two options:

* **Public-facing:** When the domain is publicly available, we will provide an SSL certificate for the endpoint.
* **Internal:** It is also possible to have Datafold disconnected from the internet. This would require an internal DNS (for example, AWS Route 53) record that points to the Datafold instance. It is possible to provide your own certificate for setting up the SSL connection.

Once the deployment is complete, you will point that A-record to the IP address of the Datafold service.

## Create a New Project

For isolation reasons, it is best practice to [create a new project](https://console.cloud.google.com/projectcreate) within your GCP organization. Please call it something like `yourcompany-datafold` to make it easy to identify:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/onprem_gcp_create-2b10d24df91f7f09ff3bd8c216edb511.png" />
</Frame>

After a minute or so, you should receive confirmation that the project has been created. Afterward, you should be able to see the new project.

## Set IAM Permissions

Navigate to the **IAM** tab in the sidebar and click **Grant Access** to invite Datafold to the project.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/onprem_gcp_iam-7c29989550ec1f3636e6270d866fe740.png" />
</Frame>

Add your Datafold solutions engineer as a **principal**. You have two options for assigning IAM permissions to the Datafold Engineers.

1. Assign them as an **owner** of your project.
2. Assign the extended set of [Minimal IAM Permissions](#minimal-iam-permissions).

The owner role is only required temporarily while we configure and test the initial Datafold deployment. We'll inform you when it is ok to revoke this permission and provide us with only the [Minimal IAM Permissions](#minimal-iam-permissions).

### Required APIs

The following GCP APIs need to be additionally enabled to run Datafold:

1. [Compute Engine API](https://console.cloud.google.com/apis/library/compute.googleapis.com)
2. [Secret Manager API](https://console.cloud.google.com/apis/api/secretmanager.googleapis.com)

The following GCP APIs we use are already turned on by default when you created the project:

1. [Cloud Logging API](https://console.cloud.google.com/apis/api/logging.googleapis.com)
2. [Cloud Monitoring API](https://console.cloud.google.com/apis/api/monitoring.googleapis.com)
3. [Cloud Storage](https://console.cloud.google.com/apis/api/storage-component.googleapis.com)
4. [Service Networking API](https://console.cloud.google.com/apis/api/servicenetworking.googleapis.com)

Once the access has been granted, make sure to notify Datafold so we can initiate the deployment.

### Minimal IAM Permissions

Because we work in a Project dedicated to Datafold, there is no direct access to your resources unless explicitly configured (e.g., VPC Peering). The following IAM roles are required to update and maintain the infrastructure.

```Bash
cloudsql.admin
compute.loadBalancerAdmin
compute.networkAdmin
compute.securityAdmin
compute.storageAdmin
container.admin
container.clusterAdmin
iam.roleViewer
iam.serviceAccountUser
iap.tunnelResourceAccessor
storage.admin
viewer
```

Some roles we need from time to time. For example, when we do the first deployment. Since those are IAM-related, we will ask for temporary permissions when required.

```Bash
iam.roleAdmin
iam.securityAdmin
iam.serviceAccountKeyAdmin
iam.serviceAccountAdmin
serviceusage.serviceUsageAdmin
```


# Best Practices
Source: https://docs.datafold.com/deployment-testing/best-practices

Explore best practices for CI/CD testing in Datafold.

<CardGroup>
  <Card title="Slim Diff" href="/deployment-testing/best-practices/slim-diff" icon="file" horizontal>
    Optimize time and cost by choosing which downstream tables to diff.
  </Card>

  <Card title="Handling Data Drift" href="/deployment-testing/best-practices/handling-data-drift" icon="file" horizontal>
    Learn how to prevent and manage data drift in CI pipelines.
  </Card>
</CardGroup>


# Handling Data Drift
Source: https://docs.datafold.com/deployment-testing/best-practices/handling-data-drift

Ensuring Datafold in CI executes apples-to-apples comparison between staging and production environments.

<Note>
  **Note**

  This section of the docs is only relevant if the data used as inputs during the PR build are inconsistent with the data used as inputs during the last production build. Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to learn more.
</Note>

## What is data drift in CI?

Datafold is used in CI to illuminate the impact of a pull request's proposed code change by comparing two versions of the data and identifying differences.

**Data drift in CI** happens when those data differences occur due to *changes in upstream data sources*—not because of proposed code changes.

Data drift in CI adds "noise" to your CI testing analysis, making it tricky to tell if data differences are due to new code, or changes in the source data. Unless both versions rely on the same snapshot of upstream data, data drift can compromise your ability to see the true effect of the code changes.

<Tip>
  **Tip**

  dbt users should implement Slim CI in [dbt Core](https://www.datafold.com/blog/taking-your-dbt-ci-pipeline-to-the-next-level) or [dbt Cloud](https://www.datafold.com/blog/slim-ci-the-cost-effective-solution-for-successful-deployments-in-dbt-cloud) to prevent most instances of data drift. Slim CI reduces build time and eliminates most instances of data drift because the CI build depends on upstreams in production due to state deferral. However, Slim CI will not *completely* eliminate data drift in CI, specifically in cases where the model being modified in the PR depends on a source. In those cases, we recommend [**building twice in CI**](/deployment-testing/best-practices/handling-data-drift#build-twice-in-ci).
</Tip>

## Why prevent data drift in CI?

By eliminating data drift entirely, you can be confident that any differences detected in CI are driven only by your code, not unexpected data changes.

You can think of this as similar to a scientific experiment, where the control versus treatment groups ideally exist in identical baseline conditions, with the treatment as the only variable which would cause differential outcomes.

In practice, many organizations do not completely eliminate data drift, and still derive value from automatic data diffing and analysis conducted by Datafold in CI, in spite of minor noise that does exist.

## Handling data drift

We recommend two options for removing data drift to the greatest extent possible:

* [Build twice in CI](#build-twice-in-ci)
* [Build CI data from clone of prod sources](#build-ci-data-from-clone-of-prod-sources)

In both of these approaches, Datafold compares transformations of identical upstream data, so that any detected differences will be due to the code changes alone, ensuring an accurate comparison with no false positives.

By building two versions of the data in CI, you can ensure an "apples-to-apples" comparison that depends on the same version of upstream data.

When deciding between the two, choose the one that best matches your workflow:

| Workflow                                              | Approach                      | Why                                                                                           |
| ----------------------------------------------------- | ----------------------------- | --------------------------------------------------------------------------------------------- |
| Data changes frequently in production                 | Build twice in CI             | Isolates PR impact without waiting on recent production updates, using a consistent snapshot. |
| Production has complex orchestration or multiple jobs | Build CI data from prod clone | Allows a stable comparison by freezing upstream data from a fixed production state.           |
| Performance and speed are critical                    | Build CI data from prod clone | Limits CI build to a single snapshot, reducing the processing load on the pipeline.           |
| Simplified orchestration with minimal dependencies    | Build twice in CI             | Reduces the need to manage production snapshots by running both environments in CI.           |

### Build twice in CI

This method involves two CI builds: one representing PR data, and another representing production data, both based on an identical snapshot of upstream data.

1. Create a fixed snapshot of the upstream data that both builds will use.
2. The CI pipeline executes two builds: one using the PR branch of code, and another using the base branch of code.
3. Datafold compares these two data environments, both created in CI, and detects differences.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/deployment_testing/data-drift-architecture-diagram.png" />
</Frame>

<Note>
  If performance is a concern, you can use a reduced or filtered upstream data set to speed up the CI process while still providing rich insight into the data.
</Note>

<Note>
  This method assumes the production build doesn’t involve multiple jobs that process different sets of models at different times.
</Note>

### Build CI data from clone of prod sources

This method involves comparing a CI build based on a snapshot of the upstream source data *from the time of the last production build* to the production version of transformed data.

1. Update orchestration to create and store a snapshot of the upstream source data at the time of the production transformation job.
2. The CI pipeline executes a data transformation build using the PR branch of code, with the snapshotted upstream data as the upstream source.
3. Datafold compares the CI data environment with production data and detects differences.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/deployment_testing/data-drift-solution-clone-of-prod.png" />
</Frame>


# Slim Diff
Source: https://docs.datafold.com/deployment-testing/best-practices/slim-diff

Choose which downstream tables to diff to optimize time, cost, and performance.

By default, Datafold diffs all modified models and downstream models. However, it won't make sense for all organizations to diff every downstream table every time you make a code update. Tradeoffs of time, cost, and risk must be considered.

That's why we created Slim Diff.

With Slim Diff enabled, Datafold will only diff models with dbt code changes in your Pull Request (PR).

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/deployment_testing/slim-diff-diagram.png" />
</Frame>

## Setting up Slim Diff

In Datafold, Slim Diff can be enabled by adjusting your diff settings by navigating to Settings → Integrations → CI → Select your CI tool → Advanced Settings and check the Slim Diff box:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/deployment_testing/slim-diff.png" />
</Frame>

## Diffing only modified models

With this setting turned on, only the modified models will be diffed by default.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/208832523-c3552417-8975-4460-91ed-fd7b0df7d7b7.png" />
</Frame>

## Diff individual downstream models

Once Datafold has diffed only the modified models, you still have the option of diffing individual downstream models right within your PR.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/208832659-e3cdb9d9-c468-459f-85ff-990b2a68b57c.png" />
</Frame>

## Diff all downstream models

You can also add the `datafold:diff-all-downstream` label within your PR, which will automatically diff *all* downstream models.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/208833093-f853bde7-d12a-4b9f-b5ac-a4d8d9666076.png" />
</Frame>

## Explicitly define which models to always diff

Finally, with Slim Diff turned on, there might be certain models or subdirectories that you want to *always* diff when downstream. You can think of this as an exclusion to the Slim Diff behavior.

Apply the `slim_diff: diff_when_downstream` meta tag to individual models or entire folders in your `dbt_project.yml` file:

```Bash
models:
  <project_name>:
    <directory_name>:
      +materialized: view
      <model_name>:
        +meta:
          datafold:
            datadiff:
              slim_diff: diff_when_downstream

    <directory_name>:
      +meta:
        datafold:
          datadiff:
            slim_diff: diff_when_downstream
```

These meta tags can also be added in individual yaml files or in config blocks. More details about using meta tags are available in [the dbt docs](https://docs.getdbt.com/reference/resource-configs/meta).

With this configuration in place, Slim Diff will prevent downstream models from being run *unless* they have been designated as exceptions with the `slim_diff: diff_when_downstream` dbt meta tag.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/208833985-031a04fe-864a-4487-8a64-ec80e4c232e1.png" />
</Frame>

As usual, once the PR has been opened, you'll still have the option of diffing individual downstream models that weren't diffed, or diffing all downstream models using the `datafold:diff-all-downstream` label.


# Configuration
Source: https://docs.datafold.com/deployment-testing/configuration

Explore configuration options for CI/CD testing in Datafold.

<CardGroup>
  <Card title="Primary Key Inference" href="/deployment-testing/configuration/primary-key" icon="file" horizontal>
    Learn how Datafold infers primary keys for accurate Data Diffs.
  </Card>

  <Card title="Column Remapping" href="/deployment-testing/configuration/column-remapping" icon="file" horizontal>
    Map renamed columns in PRs to their production counterparts.
  </Card>

  <Card title="Datafold CI Triggers" href="/deployment-testing/configuration/datafold-ci/on-demand" icon="folder-open" horizontal>
    Configure when Datafold runs in CI, including on-demand triggers.
  </Card>

  <Card title="Model-specific CI Configs" href="/deployment-testing/configuration/model-specific-ci/sql-filters" icon="folder-open" horizontal>
    Set model-specific filters and configurations for CI runs.
  </Card>
</CardGroup>


# Column Remapping
Source: https://docs.datafold.com/deployment-testing/configuration/column-remapping

Specify column renaming in your git commit message so Datafold can map renamed columns to their original counterparts in production for accurate comparison.

When your PR includes updates to column names, it's important to specify these updates in your git commit message using the following syntax. This allows Datafold to understand how renamed columns should be compared to the column in the production data with the original name.

## Example

By specifying column remapping in the commit message, instead of interpreting the change as a removing one column and adding another:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/column_remapping_schema_difference_collapsed-f4fcb478c3a3e43b57f5b79b3f0bf15f.png" />
</Frame>

Datafold will recognize that the column has been renamed:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/column_remapping_no_schema_diff-d727e739b814160b72cde19f667ee7da.png" />
</Frame>

## Syntax for column remapping

You can use any of the following syntax styles as a single line to a commit message to instruct Datafold in CI to remap a column from `oldcol` to `newcol`.

```Bash
# All models/tables in the PR:
datafold remap oldcol newcol
X-Datafold: rename oldcol newcol
/datafold renamed oldcol newcol
datafold: remapped oldcol newcol

# Filtered models/tables by shell-like glob:
datafold remap oldcol newcol model_NAME
X-Datafold: rename oldcol newcol TABLE
/datafold renamed oldcol newcol VIEW_*

```

## Chaining together column name updates

Commit messages can be chained together to reflect sequential changes. This means that a commit message does not lock you in to renaming a column.

For example, if your commit history looks like this:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/column_remapping_commit_messages-8ef04d36b80ee9f509fdb976d4dcb16e.png" />
</Frame>

Datafold will understand that the production column `name` has been renamed to `first_name` in the PR branch.

## Handling column renaming in git commits and PR comments

### Git commits

Git commits track changes on a change-by-change basis and linearize history assuming merged branches introduce new changes on top of the base/current branch (1st parent).

### PR comments

PR comments apply changes to the entire changeset.

### When to use git commits or PR comments?

When handling chained renames:

* **Git commits:** Sequential renames (`col1 > col2 > col3`) result in the final rename (`col1 > col3`).
* **PR comments:** It's best to specify the final result directly (`col1 > col3`). Sequential renames (`col1 > col2 > col3`) can also work, but specifying the final state simplifies understanding during review.

| Aspect                    | Git Commits                                                                                                       | PR Comments                                                                                                                                                                                 |
| ------------------------- | ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Tracking Changes**      | Tracks changes on a change-by-change basis.                                                                       | Applies changes to the entire changeset.                                                                                                                                                    |
| **History Linearization** | Linearizes history assuming merged branches introduce new changes on top of the base/current branch (1st parent). | N/A                                                                                                                                                                                         |
| **Chained Renames**       | Sequential renames (col1 > col2 > col3) result in the final rename (col1 > col3).                                 | It's best to specify the final result directly (col1 > col3). Sequential renames (col1 > col2 > col3) can also work, but specifying the final state simplifies understanding during review. |
| **Precedence**            | Renames specified in git commits are applied in sequence unless overridden by subsequent commits.                 | PR comments take precedence over renames specified in git commits if applied during the review process.                                                                                     |

These guidelines ensure consistency and clarity when managing column renaming in collaborative development environments, leveraging Datafold's capabilities effectively.


# Running Data Diff for Specific PRs/MRs
Source: https://docs.datafold.com/deployment-testing/configuration/datafold-ci/on-demand

By default, Datafold CI runs on every new pull/merge request and commits to existing ones.

To **only** run Datafold CI when the user explicitly requests it, you can set **Run only when tagged** option in the Datafold app [CI settings](https://app.datafold.com/settings/integrations/ci) which will only allow Datafold CI to run if a `datafold` tag/label is assigned to the pull/merge request.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/datafold_label_in_pr-81dfcfe40ff4c9c43bb14fde407894d5.png" />
</Frame>

## Running data diff on specific file changes

By default, Datafold CI will run on any file change in the repo. To skip Datafold CI runs for certain modified files (e.g., if the dbt code is placed in the same repo with non-dbt code), you can specify files to ignore. The pattern uses the syntax of .gitignore. Excluded files can be re-included by using the negation.

### Example

Let's say the dbt project is a folder in a repo that contains other code (e.g., Airflow). We want to run Datafold CI for changes to dbt models but skip it for other files. For that, we exclude all files in the repo except those the /dbt folder. We also want to filter out `.md` files in the /dbt folder:

```Bash
*!dbt/*dbt/*.md
```

<Tip>
  **SKIPPING SPECIFIC DBT MODELS**

  To skip diffing individual dbt models in CI, use the [never\_diff](/deployment-testing/configuration/model-specific-ci/excluding-models) option in the Datafold dbt yaml config.
</Tip>


# Running Data Diff on Specific Branches
Source: https://docs.datafold.com/deployment-testing/configuration/datafold-ci/specifc

By default, Datafold CI runs on every new pull/merge request and commits to existing ones.

You can set **Custom base branch** option in the Datafold app [CI settings](https://app.datafold.com/settings/integrations/ci), to only run Datafold CI on pull requests that have a specific base branch. This might be useful if you have multiple environments built from different branches. For example, `staging` and `production` environments built from `staging` and `main` branches respectively. Using the option, you can have 2 different CI configurations in Datafold, one for each environment, and only run the CI for the corresponding branch.


# Diff Timeline
Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/diff-timeline

Specify a `time_column` to visualize match rates between tables for each column over time.

```Bash
models:
  - name: users
    meta:
      datafold:
        datadiff:
          time_column: created_at
```


# Excluding Models
Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/excluding-models

Use `never_diff` to exclude a model or subdirectory of models from data diffs.

```Bash
models:
  - name: users
    meta:
      datafold:
        datadiff:
          never_diff: true
```


# Including/Excluding Columns
Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/including-excluding-columns

Specify columns to include or exclude from the data diff using `include_columns` and `exclude_columns`.

```Bash
models:
  - name: users
    meta:
      datafold:
        datadiff:
          include_columns:
            - user_id
            - created_at
            - name
          exclude_columns:
            - full_name
```


# SQL Filters
Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/sql-filters

Use dbt YAML configuration to set model-specific filters for Datafold CI.

SQL filters can be helpful in two scenarios:

1. When **Production** and **Staging** environments are not built using the same data. For example, if **Staging** is built using a subset of production data, filters can be applied to ensure that both environments are on par and can be diffed.
2. To improve Datafold CI performance by reducing the volume of data compared, e.g., only comparing the last 3 months of data.

SQL filters are an effective technique to speed up diffs by narrowing the data diffed. A SQL filter adds a `WHERE` clause to allow you to filter data on both sides using standard SQL filter expressions. They can be added to dbt YAML under the `meta.datafold.datadiff.filter` tag:

```
models:
  - name: users
    meta:
      datafold:
        datadiff:
          filter: "user_id > 2350 AND source_timestamp >= current_date() - 7"
```


# Time Travel
Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/time-travel

Use `prod_time_travel` and `pr_time_travel` to diff tables from specific points in time.

If your database supports <Tooltip tip="The ability to query or compare data from a specific point in the past, often using a timestamp or version history. Commonly used in databases like Snowflake and Big Query, which store historical snapshots of data.">time travel</Tooltip>, you can diff tables from a particular point in time by specifying `prod_time_travel` for a production model and `pr_time_travel` for a PR model.

```Bash
models:
  - name: users
    meta:
      datafold:
        datadiff:
          prod_time_travel:
            - 2022-02-07T00:00:00
          pr_time_travel:
            - 2022-02-07T00:00:00
```


# Primary Key Inference
Source: https://docs.datafold.com/deployment-testing/configuration/primary-key

Datafold requires a primary key to perform data diffs. Using dbt metadata, Datafold identifies the column to use as the primary key for accurate data diffs.

Datafold supports composite primary keys, meaning that you can assign multiple columns that make up the primary key together.

## Metadata

The first option is setting the `primary-key` key in the dbt metadata. There are [several ways to configure this](https://docs.getdbt.com/reference/resource-configs/meta) in your dbt project using either the `meta` key in a yaml file or a model-specific config block.

```Bash
models:
  - name: users
    columns:
      - name: user_id
        meta:
          primary-key: true
    ## for compound primary keys, set all parts of the key as a primary-key ##
    # - name: company_id
    #   meta:
    #     primary-key: true
```

## Tags

If the primary key is not found in the metadata, it will go through the [tags](https://docs.getdbt.com/reference/resource-properties/tags).

```Bash
models:
  - name: users
    columns:
      - name: user_id
        tags:
          - primary-key
    ## for compound primary keys, tag all parts of the key ##
    # - name: company_id
    #   tags:
    #       - primary-key

```

## Inferred

If the primary key isn't provided explicitly, Datafold will try to infer a primary key from dbt's uniqueness tests. If you have a single column uniqueness test defined, it will use this column as the PK.

```Bash
models:
  - name: users
    columns:
      - name: user_id
        tests:
          - unique
```

Also, model-level uniqueness tests can be used for inferring the PK.

```Bash
models:
  - name: sales
    columns:
      - name: col1
      - name: col2
      ...
    tests:
      - unique:
          column_name: "col1 || col2"
          # or
          column_name: "CONCAT(col1, col2)"
      # we also support dbt_utils unique_combination_of_columns test
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_no
            - order_line
```

Keep in mind that this is a failover mechanism. If you change the uniqueness test, this will also impact the way Datafold performs the diff.


# Getting Started with CI/CD Testing
Source: https://docs.datafold.com/deployment-testing/getting-started

Learn how to set up CI/CD testing with Datafold by integrating your data connections, code repositories, and CI pipeline for automated testing.

<Tip>
  **TEAM CLOUD**

  <Icon icon="wrench" /> Interested in adding Datafold Team Cloud to your CI pipeline? [Let's talk](https://calendly.com/d/zkz-63b-23q/see-a-demo?email=clay%20analytics%40datafold.com\&first_name=Clay\&last_name=Moeller\&a1=\&month=2024-07)! <Icon icon="phone-rotary" />
</Tip>

## Getting Started with Deployment Testing

<Steps>
  <Step title="Set up your data connection">
    To get started, first set up your [data connection](https://docs.datafold.com/integrations/databases) to ensure that Datafold can access and monitor your data sources.
  </Step>

  <Step title="Integrate with code repositories">
    Next, integrate Datafold with your version control system by following the instructions for [code repositories](https://docs.datafold.com/integrations/code-repositories). This allows Datafold to track and test changes in your data pipelines.
  </Step>

  <Step title="Add Datafold to your CI pipeline">
    Add Datafold to your continuous integration (CI) pipeline to enable automated deployment testing. You can do this through our universal [Fully-Automated](../deployment-testing/getting-started/universal/fully-automated), [No-Code](../deployment-testing/getting-started/universal/no-code), [API](../deployment-testing/getting-started/universal/api), or [dbt](../integrations/orchestrators) integrations.
  </Step>

  <Step title="Optional: Connect data apps">
    Optionally, you can [connect data apps](https://docs.datafold.com/integrations/bi_data_apps) to extend your testing and monitoring to data applications like BI tools.
  </Step>
</Steps>


# API
Source: https://docs.datafold.com/deployment-testing/getting-started/universal/api

Learn how to set up and configure Datafold's API for CI/CD testing.

## 1. Create a repository integration

Integrate your code repository using the appropriate [integration](/integrations/code-repositories).

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_app_repo_integration-d436bfd0149ef5b49b3cd2baff167737.png" />
</Frame>

## 2. Create an API integration

In the Datafold app, create an API integration.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_app_ci_integration-63a004100ab880d71821d7f41a5aeebb.png" />
</Frame>

## 3. Set up the API integration

Complete the configuration by specifying the following fields:

### Basic settings

| Field Name         | Description                                               |
| ------------------ | --------------------------------------------------------- |
| Configuration name | Choose a name for your for your Datafold dbt integration. |
| Repository         | Select the repository you configured in step 1.           |
| Data Source        | Select the data source your repository writes to.         |

### Advanced settings: Configuration

| Field Name                     | Description                                                                                                                                                                                                                                                                |
| ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Diff Hightouch Models          | Run data diffs for Hightouch models affected by your PR.                                                                                                                                                                                                                   |
| CI fails on primary key issues | If null or duplicate primary keys exist, CI will fail.                                                                                                                                                                                                                     |
| Pull Request Label             | When this is selected, the Datafold CI process will only run when the 'datafold' label has been applied.                                                                                                                                                                   |
| CI Diff Threshold              | Data Diffs will only be run automatically for given CI Run if the number of diffs doesn't exceed this threshold.                                                                                                                                                           |
| Custom base branch             | If defined, the Datafold CI process will only run on pull requests with the specified base branch.                                                                                                                                                                         |
| Files to ignore                | Datafold CI diffs all changed models in the PR if at least one modified file doesn’t match the ignore pattern. Datafold CI doesn’t run in the PR if all modified files should be ignored. ([Additional details.](/deployment-testing/configuration/datafold-ci/on-demand)) |

### Advanced settings: Sampling

| Field Name          | Description                                                                                                                                                            |
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Enable sampling     | Enable sampling for data diffs to optimize analyzing large datasets.                                                                                                   |
| Sampling tolerance  | The tolerance to apply in sampling for all data diffs.                                                                                                                 |
| Sampling confidence | The confidence to apply when sampling.                                                                                                                                 |
| Sampling threshold  | Sampling will be disabled automatically if tables are smaller than specified threshold. If unspecified, default values will be used depending on the Data Source type. |

## 4. Obtain a Datafold API Key and CI config ID

Generate a new Datafold API Key and obtain the CI config ID from the CI API integration settings page:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/api_key_and_config_id-634de74aeb5f3904e366c412b4c61ba1.png" />
</Frame>

You will need these values later on when setting up the CI Jobs.

## 5. Install Datafold SDK into your Python environment

```Bash
pip install datafold-sdk
```

## 6. Configure your CI script(s) with the Datafold SDK

Using the Datafold SDK, configure your CI script(s) to use the Datafold SDK `ci submit` command. The example below should be adapted to match your specific use-case.

```Bash
datafold ci submit --ci-config-id <datafold_ci_config_id> --pr-num <pr_num> --diffs ./diffs.json
```

Since Datafold cannot infer which tables have changed, you'll need to manually provide this information in a specific `json` file format. Datafold can then determine which models to diff in a CI run based on the `diffs.json` you pass in to the Datafold SDK `ci submit` command.

```Bash
[
  {
    "prod": "MY.PROD.TABLE", // Production table to compare PR changes against
    "pr": "MY.PR.TABLE", // Changed table containing data modifications in the PR
    "pk": ["MY", "PK", "LIST"], // Primary key; can be an empty array
    // These fields are not required and can be omitted from the JSON file:
    "include_columns": ["COLUMNS", "TO", "INCLUDE"],
    "exclude_columns": ["COLUMNS", "TO", "EXCLUDE"]
  }
]
```

Note: The `JSON` file is optional and you can also achieve the same effect by using standard input (stdin) as shown here. However, for brevity, we'll use the `JSON` file approach in this example:

```Bash
datafold ci submit \
    --ci-config-id <datafold_ci_config_id> \
    --pr-num <pr_num> <<- EOF
[{
        "prod": "MY.PROD.TABLE",
        "pr": "MY.PR.TABLE",
        "pk": ["MY", "PK", "LIST"]
}]
```

Implementation details will vary depending on [which CI tool](#ci-implementation-tools) you use. Please review the following instructions and examples for your organization's CI tool.

<Info>
  **NOTE**

  Populating the `diffs.json` file is specific to your use case and therefore out of scope for this guide. The only requirement is to adhere to the `JSON` schema structure explained above.
</Info>

## CI Implementation Tools

We've created guides and templates for three popular CI tools.

<Tip>
  **HAVING TROUBLE SETTING UP DATAFOLD IN CI?**

  <Icon icon="hand-wave" /> We're here to help! Please [reach out and chat with a Datafold Solutions Engineer](https://www.datafold.com/booktime). <Icon icon="phone-rotary" />
</Tip>

To add Datafold to your CI tool, add `datafold ci submit` step in your PR CI job.

<Tabs>
  <Tab title="GitHub Actions">
    ```Bash
    name: Datafold PR Job

    # Run this job when a commit is pushed to any branch except main
    on:
      pull_request:
      push:
        branches:
          - '!main'

    jobs:
      run:
        runs-on: ubuntu-20.04 # your image will vary

        steps:

          - name: Install Datafold SDK
            run: pip install -q datafold-sdk
        # ...
          - name: Upload what to diff to Datafold
            run: datafold ci submit --ci-config-id <datafold_ci_config_id> --pr-num ${PR_NUM} --diffs <path_to_diffs_json_file>
            env:
              # env variables used by Datafold SDK internally
              DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
              DATAFOLD_HOST: ${DATAFOLD_HOST}
              # For Dedicated Cloud/private deployments of Datafold,
              # Set the "https://custom.url.datafold.com" variable as the base URL as an environment variable, either as a string or a project variable
              # There are multiple ways to get the PR_NUM, this is just a simple example
              PR_NUM: ${{ github.event.number }}
    ```

    Be sure to replace `<datafold_ci_config_id>` with the [CI config ID](#4-obtain-a-datafold-api-key-and-ci-config-id) value.

    <Info>
      **NOTE**

      It is beyond the scope of this guide to provide guidance on generating the `<path_to_diffs_json_file>`, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above.
    </Info>

    Finally, store [your Datafold API Key](#4-obtain-a-datafold-api-key-and-ci-config-id) as a secret named `DATAFOLD_API_KEY` [in your GitHub repository settings](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository).

    Once you've completed these steps, Datafold will run data diffs between production and development data on the next GitHub Actions CI run.
  </Tab>

  <Tab title="CircleCI">
    ```Bash
    version: 2.1

    jobs:
      artifacts-job:
        filters:
          branches:
            only: main # or master, or the name of your default branch
        docker:
          - image: cimg/python:3.9 # your image will vary
              env:
                # env variables used by Datafold SDK internally
                DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
                DATAFOLD_HOST: ${DATAFOLD_HOST}
                # For Dedicated Cloud/private deployments of Datafold,
                # Set the "https://custom.url.datafold.com" variable as the base URL as an environment variable, either as a string or a project variable, per https://circleci.com/docs/set-environment-variable/
                # There are multiple ways to get the PR_NUM, this is just a simple example
                PR_NUM: ${{ github.event.number }}
        steps:
          - checkout
          - run:
              name: "Install Datafold SDK"
              command: pip install -q datafold-sdk

          - run:
              name: "Upload what to diff to Datafold"
              command: datafold ci submit --ci-config-id <datafold_ci_config_id> --pr-num ${CIRCLE_PULL_REQUEST} --diffs <path_to_diffs_json_file>
    ```

    Be sure to replace `<datafold_ci_config_id>` with the [CI config ID](#4-obtain-a-datafold-api-key-and-ci-config-id) value.

    <Info>
      **NOTE**

      It is beyond the scope of this guide to provide guidance on generating the `<path_to_diffs_json_file>`, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above.
    </Info>

    Then, enable [**Only build pull requests**](https://circleci.com/docs/oss#only-build-pull-requests) in CircleCI. This ensures that CI runs on pull requests and production, but not on pushes to other branches.

    Finally, store [your Datafold API Key](#4-obtain-a-datafold-api-key-and-ci-config-id) as a secret named `DATAFOLD_API_KEY` [your CircleCI project settings.](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository).

    Once you've completed these steps, Datafold will run data diffs between production and development data on the next CircleCI run.
  </Tab>

  <Tab title="GitLab CI">
    ```Bash

    image:
    name: ghcr.io/dbt-labs/dbt-core:1.x # your name will vary
    entrypoint: [ "" ]
    variables:
      # env variables used by Datafold SDK internally
      DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
      DATAFOLD_HOST: ${DATAFOLD_HOST}
      # For Dedicated Cloud/private deployments of Datafold,
      # Set the "https://custom.url.datafold.com" variable as the base URL as an environment variable, either as a string or a project variable
      # There are multiple ways to get the PR_NUM, this is just a simple example
      PR_NUM: ${{ github.event.number }}

    run_pipeline:
      stage: test
      before_script:
        - pip install -q datafold-sdk

      script:
        # Upload what to diff to Datafold
        - datafold ci submit --ci-config-id <datafold_ci_config_id> --pr-num $CI_MERGE_REQUEST_ID --diffs <path_to_diffs_json_file>
     rules:
        - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    ```

    Be sure to replace `<datafold_ci_config_id>` with the [CI config ID](#4-obtain-a-datafold-api-key-and-ci-config-id) value.

    <Info>
      **NOTE**

      It is beyond the scope of this guide to provide guidance on generating the `<path_to_diffs_json_file>`, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above.
    </Info>

    Finally, store [your Datafold API Key](#4-obtain-a-datafold-api-key-and-ci-config-id) as a secret named `DATAFOLD_API_KEY` [in your GitLab project's settings](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository).

    Once you've completed these steps, Datafold will run data diffs between production and development data on the next GitLab CI run.
  </Tab>
</Tabs>

## Optional CI Configurations and Strategies

### Skip Datafold in CI

To skip the Datafold step in CI, include the string `datafold-skip-ci` in the last commit message.


# Fully-Automated
Source: https://docs.datafold.com/deployment-testing/getting-started/universal/fully-automated

Automatically diff tables modified in a pull request with Datafold's Fully-Automated CI integration.

Our Fully-Automated CI integration enables you to automatically diff tables modified in a pull request so you know exactly how your data will change before going to production.

We do this by analyzing the SQL in any changed files, extracting the relevant table names, and diffing those tables between your staging and production environments. We then post the results of those diffs—including any downstream impact—to your pull request for all to see. All without manual intervention.

## Prerequisites

* Your code must be hosted in one of our supported version control integrations
* Your tables/views must be defined in SQL
* Your schema names must be parameterized ([see below](#4-parameterize-schema-names))
* You must be automatically generating staging data ([more info](/deployment-testing/how-it-works))

## Get Started

Get started in just a few easy steps.

### 1. Generate a Datafold API key

If you haven't already generated an API key (you only need one), visit Settings > Account and select **Create API Key**. Save the key somewhere safe like a password manager, as you won't be able to view it later.

### 2. Set up a version control integration

Open the Datafold app and navigate to Settings > Integrations > Repositories to connect the repository that contains the code you'd like to automatically diff.

### 3. Add a step to your CI workflow

<Note>This example assumes you're using GitHub actions, but the approach generalizes to any version control tool we support including GitLab, Bitbucket, etc.</Note>

Either [create a new GitHub Action](https://docs.github.com/en/actions/writing-workflows/quickstart) or add the following steps to an existing one:

```yaml
- name: Install datafold-sdk
  run: pip install -q datafold-sdk

- name: Trigger Datafold CI
  run: |
    datafold ci auto trigger --ci-config-id $CI_CONF_ID --pr-num $PR_NUM 
    --base-sha $BASE_SHA --pr-sha $PR_SHA --reference-params "$REFERENCE_PARAMS" 
    --pr-params "$PR_PARAMS"
  env:
    DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
    CI_CONF_ID: 436
    PR_NUM:  "${{ steps.findPr.outputs.pr }}"
    PR_SHA: "${{ github.event.pull_request.head.sha }}"
    BASE_SHA: ${{ github.event.pull_request.base.sha }}
    REFERENCE_PARAMS: '{ "target_schema": "nc_default" }'
    PR_PARAMS: "{ \"target_schema\": \"${{ env.TARGET_SCHEMA }}\" }"
```

### 4. Parameterize schema names

If it's not already the case, you'll need to parameterize the schema for any table paths you'd like Datafold to diff. For example, let's say you have a file called `dim_orgs.sql` that defines a table called `DIM_ORGS` in your warehouse. Your SQL should look something like this:

```sql
-- datafold: pk=org_id
CREATE OR REPLACE TABLE analytics.${target_schema}.dim_orgs AS (
  SELECT
    org_id,
    org_name,
    employee_count,
    created_at
  FROM analytics.${target_schema}.org_created
);
```

### 5. Provide primary keys (optional)

<Note>While this step is technically optional, we strongly recommend providing primary keys for any tables you'd like Datafold to diff.</Note>

In order for Datafold to perform full value-level comparisons between staging and production tables, Datafold needs to know the primary keys. To provide this information, place a comment above each query using the `-- datafold: pk=<your_pk>` syntax shown below:

```sql
-- datafold: pk=org_id
CREATE OR REPLACE TABLE analytics.${target_schema}.dim_orgs AS (
  SELECT
    org_id,
...
```

### 6. Create a pull request

When you create a pull request, Datafold will automatically detect it, attempt to diff any tables modified in the code, and post a summary as a comment in the PR. You can click through on the comment to view a more complete analysis of the changes in the Datafold app. Happy diffing!

## Need help?

If you have any questions about Fully-Automated CI, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).


# No-Code
Source: https://docs.datafold.com/deployment-testing/getting-started/universal/no-code

Set up Datafold's No-Code CI integration to create and manage Data Diffs without writing code.

Monitors are easy to create and manage in the Datafold app. But for teams (or individual users) who prefer a more code-based approach, our monitors as code feature allows managing monitors via version-controlled YAML.

## Getting Started

Get up and running with our No-Code CI integration in just a few steps.

### 1. Create a repository integration

Connect your code repository using the appropriate [integration](/integrations/code-repositories).

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_app_repo_integration-d436bfd0149ef5b49b3cd2baff167737.png" />
</Frame>

### 2. Create a No-Code integration

From the integrations page, create a new No-Code CI integration.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_app_ci_integration-63a004100ab880d71821d7f41a5aeebb.png" />
</Frame>

### 3. Set up the No-Code integration

Complete the configuration by specifying the following fields:

#### Basic settings

| Field Name         | Description                                           |
| ------------------ | ----------------------------------------------------- |
| Configuration name | Choose a name for your Datafold integration.          |
| Repository         | Select the repository you configured in step 1.       |
| Data Connection    | Select the data connection your repository writes to. |

#### Advanced settings

| Field Name         | Description                                                                                                                   |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------- |
| Pull request label | When this is selected, the Datafold CI process will only run when the `datafold` label has been applied to your pull request. |
| Custom base branch | If provided, the Datafold CI process will only run on pull requests against the specified base branch.                        |

### 4. Create a pull request and add diffs

Datafold will automatically post a comment on your pull request with a link to generate a CI run that corresponds to the latest set of changes.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/1-7a001321004a1afa68a3bd74a4bb822d.png" />
</Frame>

### 5. Add diffs to your CI run

Once in Datafold, add as many pull requests as you'd like to the CI run. If you need a refresher on how to configure data diffs, check out [our docs](/data-diff/in-database-diffing/creating-a-new-data-diff).

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/4-800a438c5251d6888b83a1f9e3c811bb.png" />
</Frame>

### 6. Add a summary to your pull request

Click on **Save and Add Preview to PR** to post a summary to your pull request.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/2-cb3997ac8f47fe7d5ad651478f1fe7d8.png" />
</Frame>

### 7. View the summary in your pull request

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/3-33123cf19f9ff7f5fa1aef9952d8208d.png" />
</Frame>

## Cloning diffs from the last CI run

If you make additional changes to your pull request, clicking the **Add data diff** button generates a new CI run in Datafold. From there, you can:

* Create a new Data Diff from scratch
* Clone diffs from the last CI run

You can also diff downstream tables by clicking on the **Add Data Diff** button in the Downstream Impact table. This creates additional Data Diffs:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/5-c411b13fcdaebb9587fabcfcef92c740.png" />
</Frame>

You can then post another summary to your pull request by clicking **Save and Add Preview to PR**.


# How Datafold in CI Works
Source: https://docs.datafold.com/deployment-testing/how-it-works

Learn how Datafold integrates with your Continuous Integration (CI) process to create Data Diffs for all SQL code changes, catching issues before they make it into production.

## What is CI?

Continuous Integration (or CI) is a process for building and testing changes to your code before deploying to production. This ensures early detection of potential issues and improves the quality of code deployment.

| Without CI                                                                       | With CI                                                                  |
| -------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| Updates are manually coordinated and become a complex synchronization chore.     | Smoothly manage code changes, and scale as your team and code base grow. |
| Testing is done manually, if at all.                                             | Automate high-confidence test coverage.                                  |
| Code changes are released at a slower cadence, and with higher rates of failure. | Boost the quantity and quality of developer output.                      |

### Datafold in CI

For Datafold to work in CI, you need to add a step that builds <Tooltip tip="Staging data is created using the version of the code in your PR/MR branch, which contains the edits you're currently working on.">staging data</Tooltip> in your CI process (e.g., GitHub).

<Note>
  **Prerequisite: Building staging data in CI**

  If you're using dbt, you'll need to add a dbt build step to your CI pipeline first. This can be done through either [dbt Cloud](https://www.datafold.com/blog/slim-ci-the-cost-effective-solution-for-successful-deployments-in-dbt-cloud) or [dbt Core](https://www.datafold.com/blog/accelerating-dbt-core-ci-cd-with-github-actions-a-step-by-step-guide).

  For other orchestrators like Airflow, follow [this guide](https://www.datafold.com/blog/datafold-in-ci-is-for-everyone) to build staging data in CI, or contact us for custom recommendations based on your infrastructure.
</Note>

In this short clip, see how the Datafold bot automatically comments on your PR, highlighting data differences between the production and development versions of your code:

<Frame>
  <video src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/small-video-01.mp4" controls />
</Frame>

## Creating production and staging data

When Datafold is integrated into your CI, it automatically detects and highlights value-level differences between production data and staging data.

These summarized Data Diff results are written directly in your pull request (PR) as a comment. From the comment, you can access the Datafold App to explore value-level differences, understand the impact on downstream BI tools, and other context-rich information about the impact of your PR code changes.

### Production data

Production data refers to the data that your organization depends on for daily operations, such as powering dashboards and BI tools. Your orchestrator (e.g., dbt, Airflow) is responsible for running SQL code that builds and maintains this data in your warehouse.

If you use dbt, we'll assume that you have a production job in either [dbt Cloud](https://docs.getdbt.com/docs/deploy/dbt-cloud-job) or [dbt Core](https://docs.getdbt.com/docs/deploy/deployment-tools) that builds or updates your dbt models in the warehouse on a schedule. Or, you might have a scheduled job in Airflow or another orchestrator that builds production data on a regular basis.

### Staging data

For Datafold to run Data Diffs in CI, you need a step in your CI process that builds staging data (a version of your data in a dedicated schema) using the code in your PR/MR branch. Datafold will compares this staging data against your production data when diffing.

<Tip>
  **Tip**

  You can use either dbt Cloud or dbt Core to add a step in your CI process that builds staging data.
</Tip>

* [Setting up dbt in CI for dbt Cloud users](https://www.datafold.com/blog/slim-ci-the-cost-effective-solution-for-successful-deployments-in-dbt-cloud)
* [Setting up dbt in CI for dbt Core users](https://www.datafold.com/blog/accelerating-dbt-core-ci-cd-with-github-actions-a-step-by-step-guide)
* [Building staging data in CI using Airflow](https://www.datafold.com/blog/datafold-in-ci-is-for-everyone)

## Comparing production and staging data

Once you have a job in CI that builds staging data, you're ready to get started with Datafold in CI!

By comparing production and staging data, Datafold ensures that any code changes are thoroughly validated before being merged, helping to prevent data issues from reaching production.

We'll walk through the setup steps in more detail in the [Getting Started](/deployment-testing/getting-started) section.

### Datafold in CI for dbt users

While Datafold can be added to CI no matter what orchestrator you use, it's worth detailing exactly how this works with dbt, a popular and opinionated tool for which we have specific recommendations.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/Datafold_in_dbt_CI-750487e5bd8ef031a87c205cbc6e5fea.png" />
</Frame>

Here is how Datafold + dbt in CI works:

<Steps>
  <Step title="Submit dbt Project Manifests">
    Two versions of your dbt project's `manifest.json` are submitted to Datafold, representing the state of the production code and the PR/MR code.

    * For dbt Cloud users, this submission happens automatically.
    * dbt Core users need to add steps in their CI configuration (e.g., Circle CI, GitHub Actions, or GitLab) to submit the artifacts.
  </Step>

  <Step title="Identify Code Differences">
    Datafold compares the two versions of the `manifest.json` to identify differences in the code.
  </Step>

  <Step title="Run Data Diffs">
    Datafold queries your data warehouse to run Data Diffs on the modified dbt models. It also identifies downstream assets (e.g., BI tools, reverse ETL pipelines) that might be impacted by the changes.

    * Datafold can diff dbt models that are materialized as both tables and views.
    * If your dbt project has many downstream dependencies, you can use [Slim Diff](/deployment-testing/best-practices/slim-diff) or other [configuration options](/deployment-testing/configuration) to manage scale, ensuring critical models are prioritized.
  </Step>

  <Step title="Summarize Data Diffs in Pull Request">
    The results of the Data Diffs are summarized in a comment on your pull request (e.g., in GitHub). You can click the comment to view more detailed information in the Datafold application.
  </Step>
</Steps>


# CI/CD Testing
Source: https://docs.datafold.com/faq/ci-cd-testing


<AccordionGroup>
  <Accordion title="What if my staging/dev environment contains only a subset of data from production?">
    You can use [SQL filters](/deployment-testing/configuration/model-specific-ci/sql-filters) to ensure that Datafold compares equivalent subsets of data between your staging/dev and production environments, allowing for accurate data quality checks despite the difference in data volume.
  </Accordion>

  <Accordion title="Can I use Datafold in development?">
    Yes, you can use Datafold in development. It helps catch data quality issues early by comparing data changes in your development environment before they reach production. This proactive approach ensures that errors and inconsistencies are identified and resolved during the development process, enhancing overall data reliability and preventing potential issues in production. Data teams can leverage the Datafold SDK to run data diffs from the command line while developing and testing data models.
  </Accordion>

  <Accordion title="How does Datafold handle data drift?">
    Data drift in CI occurs when the two data transformation builds that are compared by Datafold in CI have differing data outputs due to the upstream data changing over time.

    We have a few recommended strategies for dealing with data drift [in our docs here](/deployment-testing/best-practices/handling-data-drift).
  </Accordion>

  <Accordion title="Can I run Data Diffs before opening a PR?">
    Some teams want to show Data Diff results in their tickets *before* creating a pull request. This speeds up code reviews as developers can QA code changes before requesting a PR review.

    If you use dbt, we explain [how you can automate this workflow here](/faq/datafold-with-dbt#can-i-run-data-diffs-before-opening-a-pr).
  </Accordion>
</AccordionGroup>


# Data Diffing
Source: https://docs.datafold.com/faq/data-diffing


<AccordionGroup>
  <Accordion title="What’s a data diff?">
    A [data diff](/data-diff/what-is-data-diff) is a value-level comparison between two tables—used to identify critical changes to your data and guarantee data quality.

    Similar to how git diff highlights changes in code by comparing different versions of files to show what lines have been added, modified, or deleted, a data diff compares rows and columns in two tables to pinpoint specific data changes.
  </Accordion>

  <Accordion title="What types of data can Datafold compare?">
    Datafold can compare data in tables, views, and SQL queries in databases and data lakes.

    Datafold facilitates data diffing by supporting a wide range of basic data types across popular database systems like Snowflake, Databricks, BigQuery, Redshift, and PostgreSQL. Datafold can also diff data across legacy warehouses like Oracle, SQL Server, Teradata, IBM Netezza, MySQL, and more. See the full list of [supported data types here](/api-reference/data-types).
  </Accordion>

  <Accordion title="Can you data diff unstructured data?">
    No, Datafold cannot perform data diffs on unstructured data such as files. However, it supports diffing structured and semi-structured data in tabular formats, including `JSON` columns.

    See the full list of [supported data types here](/api-reference/data-types).
  </Accordion>

  <Accordion title="How should I compare numeric columns, especially those with floating-point values?">
    When comparing numerical columns or columns of the `FLOAT` type, it is beneficial to [set tolerance levels for differences](/data-diff/in-database-diffing/creating-a-new-data-diff#tolerance-for-floats) to avoid flagging inconsequential discrepancies. This practice ensures that only meaningful differences are highlighted, maintaining the focus on significant changes.
  </Accordion>

  <Accordion title="Can you explain how Datafold handles expected changes?">
    When a change is detected, Datafold highlights the differences in the App or through PR comments, allowing data engineers and other users to review, validate, and approve these changes during the CI process.
  </Accordion>

  <Accordion title="How does Datafold’s in-database diffing work?">
    When diffing data within the same physical database or data lake namespace, data diff compares data by executing various SQL queries in the target database. It uses several JOIN-type queries and various aggregate queries to provide detailed insights into differences at the row, value, and column levels, and to calculate differences in metrics and distributions.
  </Accordion>

  <Accordion title="How does Datafold’s cross-database diffing work?">
    To compare datasets between two different databases, Datafold leverages a proprietary stochastic checksumming algorithm that allows it to identify discrepancies down to individual primary keys and column values while minimizing the amount of data sent over the network. As a result, the comparison is mostly performed in-place, leveraging the underlying databases without the need to export the entire dataset to compare elsewhere.
  </Accordion>

  <Accordion title="What is stochastic checksumming?">
    Stochastic checksumming is a technique used to verify the integrity of large datasets by generating checksums (hashes) for randomly selected subsets of the data rather than the entire dataset. This method provides a probabilistic assurance of data integrity, allowing for efficient detection of data corruption or changes with significantly reduced computational overhead compared to full data checksumming.

    Stochastic checksumming is particularly useful in scenarios where processing the entire dataset is impractical due to size or resource constraints. We use stochastic checksumming to compare datasets between two different databases efficiently. This proprietary algorithm generates checksums for randomly selected subsets of the data, allowing it to identify discrepancies down to individual primary keys and column values with minimal data transfer over the network.

    By performing most of the comparison in-place, within the underlying databases, Datafold avoids the need to export entire datasets for external comparison. This approach ensures accurate and scalable data verification while optimizing network and computational resources.
  </Accordion>

  <Accordion title="Can I materialize diff results back to my database?">
    Yes, while the Datafold App UI provides advanced exploration of diff results, you can also materialize these results back to your database. This allows you to further investigate with SQL queries or maintain audit logs, providing flexibility in how you handle and review diff outcomes. Teams may additionally choose to download diff results as a CSV directly from the Datafold App to share with their team members.
  </Accordion>
</AccordionGroup>


# Data Migration Automation
Source: https://docs.datafold.com/faq/data-migration-automation


<AccordionGroup>
  <Accordion title="How does DMA work?">
    Datafold performs complete SQL codebase translation and validation. It uses an AI agent architecture that performs the translation leveraging an LLM model with a feedback loop optimized for achieving full parity between migration source and target. DMA takes into account metadata, including schema, data types, and relationships in the source system.
  </Accordion>

  <Accordion title="How is this approach different from other tools on the market?">
    DMA offers several key advantages over deterministic transpilers that rely on static code parsing with predefined grammars:

    * **Full parity between source and target:** DMA not only returns code that compiles, but code that produces the same result in your new database with explicit validation.
    * **Flexible dialect handling:** Ability to adapt to any arbitrary dialect for input/output without the need to provide full grammar, which is especially valuable for numerous legacy systems and their versions.
    * **Self-correction capabilities:** DMA can self-correct mistakes, taking into account compilation errors and data discrepancies.
    * **Modernizing code structure:** DMA can convert convoluted stored procedures into dbt projects following best practices.
  </Accordion>

  <Accordion title="How do I know if the output is correct?">
    Upon delivery, customers get a comprehensive report with links to data diffs validating parity and discrepancies (if any) on dataset-, column-, and row-level between source and target.
  </Accordion>

  <Accordion title="How does my team use DMA?">
    Once source and target systems are connected and Datafold ingests the code base, translations with DMA are automatically supervised by the Datafold team. In most cases, no input is required from the customer.
  </Accordion>

  <Accordion title="What do I need to start working with DMA?">
    Connect source and target data sources to Datafold. Provide Datafold access to the codebase (usually by installing the Datafold GitHub/GitLab/ADO app or via system catalog for stored procedures).
  </Accordion>

  <Accordion title="What are the security implications of using DMA?">
    Datafold is SOC 2 Type II, GDPR, and HIPAA-compliant and provides flexible deployment options, including in-VPC deployment in AWS, GCP, or Azure. The LLM infrastructure relies on local models and does not expose data to any sub-processor besides the cloud provider. In case of a VPC deployment, none of the data leaves the customer’s private network.
  </Accordion>

  <Accordion title="How long will it take to translate?">
    After the initial setup, the migration process can take several days to several weeks, depending on the source and target technologies, scale, and complexity.
  </Accordion>

  <Accordion title="What if I want to change data model/definitions?">
    DMA is an ideal fit for lift-and-shift migrations with parity between source and target as the goal. Some customization is possible and needs to be scoped on a case-by-case basis.
  </Accordion>

  <Accordion title="How does cross-database diffing work?">
    Datafold connects to any SQL source and target databases, similar to how BI tools do.

    Datafold does not need to extract the entirety of the datasets for comparisons. Instead, Datafold relies on [stochastic checksumming](../faq/data-diffing#what-is-stochastic-checksumming) to identify discrepancies and only extracts those for analysis.
  </Accordion>

  <Accordion title="What kind of information does Datafold output?">
    Datafold’s cross-database diffing will produce the following results:

    * **High-Level Summary:**
      * Total number of different rows
      * Total number of rows (primary keys) that are present in one database but not the other
      * Aggregate schema differences
    * **Schema Differences:** Per-column mapping of data types, column order, etc.
    * **Primary Key Differences:** Sample of specific rows that are present in one database but not the other
    * **Value-Level Differences:** Sample of differing column values for each column with identified discrepancies; full dataset of differences can be downloaded or materialized to the warehouse
  </Accordion>

  <Accordion title="How does a user run a data diff?">
    * Via Datafold’s interactive UI
    * Via the Datafold API
    * On schedule (as a monitor) with optional alerting via Slack, email, PagerDuty, etc.
  </Accordion>

  <Accordion title="Can I run multiple data diffs at the same time?">
    Yes, users can run as many diffs as they would like with concurrency limited by the underlying database.
  </Accordion>

  <Accordion title="What if my data is changing and replicated live, how can I ensure proper comparison?">
    In such cases, we recommend using watermarking—diffing data within a specified time window of row creation/update (e.g., `updated_at timestamp`).
  </Accordion>

  <Accordion title="What if the data types do not match between source and target?">
    Datafold performs best-effort type matching for cases where deterministic type casting is possible, e.g., comparing `VARCHAR` type with `STRING` type. When automatic type casting without information loss is not possible, the user can define type casting manually using diffing in Query mode.
  </Accordion>

  <Accordion title="Can data diff help if the dataset in the source and target databases has a different shape/schema/column naming?">
    Users can reshape input datasets by writing a SQL query and diffing in Query mode to bring the dataset to a comparable shape. Datafold also supports column remapping for datasets with different column names between tables.
  </Accordion>
</AccordionGroup>


# Data Monitoring and Observability
Source: https://docs.datafold.com/faq/data-monitoring-observability


<Accordion title="How does Datafold compare to data observability tools?">
  Most data observability tools focus on monitoring metrics (e.g., null counts, row counts) in the data warehouse. But catching data quality issues in the data warehouse is usually too late: the bad data has already affected downstream processes and negatively impacted the business.

  Our platform focuses on prevention rather than detection of data quality issues. By [integrating deeply into your CI process](/deployment-testing/how-it-works), Datafold's [Data Diff](/data-diff/what-is-data-diff) helps data teams fix potential regressions during development and deployment, before bad code and data get into the production environment.

  Our [Data Monitors](/data-monitoring/monitor-types) make it easy to monitor production data to catch issues early before they are propagated through the warehouse to business stakeholders.

  This proactive data quality strategy not only enhances the reliability and accuracy of your data pipelines but also reduces the risk of disruptions and the need for reactive troubleshooting.
</Accordion>


# Data Reconciliation
Source: https://docs.datafold.com/faq/data-reconciliation


<AccordionGroup>
  <Accordion title="How does cross-database diffing work?">
    Datafold’s data diff connects to source and target databases and performs fast, accurate and detailed comparison of datasets providing aggregate summaries, column, and value-level insights into any discrepancies.

    1. Datafold connects to any SQL source and target databases, similar to how BI tools do.
    2. Datafold does not need to extract the entirety of the datasets for comparisons. Instead, Datafold relies on [stochastic checksumming](/faq/data-diffing#what-is-stochastic-checksumming) to identify discrepancies and only extract those for analysis.
  </Accordion>

  <Accordion title="What kind of information does Datafold output?">
    Datafold’s cross-database diffing will produce the following results:

    1. High-Level Summary:
       * Total number of different rows
       * Total number of rows (primary keys) that are present in one database, but not the other
       * Aggregate schema differences
    2. Schema Differences: Per-column mapping of data types, column order, etc.
    3. Primary Key Differences: Sample of specific rows that are present in one database, but not the other
    4. Value-Level Differences: Sample of differing values for each column with identified discrepancies; full dataset of differences can be downloaded or materialized to the warehouse

    You can check out [what the results look like in the App](/data-diff/cross-database-diffing/results).
  </Accordion>

  <Accordion title="How does a user run a data diff?">
    1. Via Datafold’s interactive UI
    2. Via the Datafold API
    3. On a schedule (as a monitor) with optional alerting via Slack, email, PagerDuty, etc.
  </Accordion>

  <Accordion title="Can I run multiple data diffs at the same time?">
    Yes, users can run as many diffs as they would like with concurrency limited by the underlying database.
  </Accordion>

  <Accordion title="How can I ensure accurate data comparison if my data is changing and being replicated in real-time?">
    In such cases, we recommend using watermarking – diffing data within a specified time window of row creation / update (e.g. `updated_at timestamp`).
  </Accordion>

  <Accordion title="What if the data types do not match between source and target?">
    Datafold performs best-effort type matching for cases when deterministic type casting is possible, e.g. comparing `VARCHAR` type with `STRING` type. When automatic type casting without information loss is not possible, the user can define type casting manually using diffing in Query mode.
  </Accordion>

  <Accordion title="Can data diff help if the source and target datasets have a different shape/schema/column naming?">
    Yes, users can reshape the input dataset by writing a SQL query and diffing in Query mode to bring the dataset to a shape that can be compared with another. Datafold also supports column remapping for datasets with different column names between tables.
  </Accordion>

  <Accordion title="How can data diffs be provisioned at scale, e.g. we need to create hundreds / thousands of data diffs?">
    To make the provisioning at scale easier, you can create data diffs via the [Datafold API](https://docs.datafold.com/reference/cloud/rest-api).
  </Accordion>
</AccordionGroup>


# Data Storage and Security
Source: https://docs.datafold.com/faq/data-storage-and-security


<Accordion title="What data does Datafold ingest and store?">
  Datafold ingests and stores various types of data to ensure accurate data quality checks and insights:

  * **Metadata**: This includes table names, column names, and queries executed in the data warehouse.
  * **Data for Data Diffs**:
    * For **in-database diffs**, all data visible in the app, including data samples, is fetched and stored.
    * For **cross-database diffs**, all data visible in the app, including data samples, is fetched and stored. Larger amounts of data are fetched for comparison purposes, but only data samples are stored.
  * **Table Profiling in Data Explorer**: Datafold stores samples and distributions of data to provide detailed profiling.
</Accordion>


# Integrating Datafold with dbt
Source: https://docs.datafold.com/faq/datafold-with-dbt


<AccordionGroup>
  <Accordion title="Why do I need Datafold if I already have dbt tests?">
    You need Datafold in addition to dbt tests because while dbt tests are effective for validating specific assertions about your data, they can't catch all issues, particularly unknown unknowns. Datafold identifies value-level differences between staging and production datasets, which dbt tests might miss.

    Unlike dbt tests, which require manual configuration and maintenance, Datafold automates this process, ensuring continuous and comprehensive data quality validation without additional overhead. This is all embedded within Datafold’s unified platform that offers end-to-end data quality testing with our [Column-level Lineage](/data-explorer/lineage) and [Data Monitors](/data-monitoring/monitor-types).

    Hence, we recommend combining dbt tests with Datafold to achieve complete test coverage that addresses both known and unknown data quality issues, providing a robust safeguard against potential data integrity problems in your CI pipeline.
  </Accordion>

  <Accordion title="What do I need to implement Datafold for dbt?">
    For dbt Core users, create an integration in Datafold, specify the necessary settings, obtain a Datafold API Key and CI config ID, and configure your CI scripts with the Datafold SDK to upload manifest.json files. Our detailed setup guide [can be found here](/integrations/orchestrators/dbt-core).

    For dbt Cloud users, set up dbt Cloud CI to run Pull Request jobs and create an Artifacts Job that generates production manifest.json on merges to main/master. Obtain your dbt Cloud access URL and a Service Token, then create a dbt Cloud integration in Datafold using these credentials. Configure the integration with your repository, data connection, primary key tag, and relevant jobs. Our detailed setup guide [can be found here](/integrations/orchestrators/dbt-cloud).
  </Accordion>

  <Accordion title="We currently have a dbt Cloud Slim CI job. Does Datafold work with the custom PR schema that dbt Cloud creates?">
    Yes, Datafold is fully compatible with the custom PR schema created by dbt Cloud for Slim CI jobs.
  </Accordion>

  <Accordion title="How can I optimize diff performance in dbt?">
    We outline effective strategies for efficient and scalable data diffing in our[performance and scalability guide](faq/performance-and-scalability#how-can-i-optimize-diff-performance-at-scale).

    For dbt-specific diff performance, you can exclude certain columns or tables from data diffs in your CI/CD pipeline by adjusting the **Advanced settings** in your Datafold CI/CD configuration. This helps reduce processing load by focusing diffs on only the most relevant columns.

    ![](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/faq/advanced_ci_columns_to_ignore.png)
  </Accordion>

  <Accordion title="Can I run Data Diffs before opening a PR?">
    Some teams want to show Data Diff results in their tickets *before* creating a pull request. This speeds up code reviews as developers can QA code changes before requesting a PR review.

    You can trigger a Data Diff by first creating a **draft PR** and then running the following command via the CLI:

    ```bash
    dbt run && datafold diff dbt
    ```

    This command runs `dbt` locally and then triggers a Data Diff, allowing you to preview data changes without pushing to Git.

    To automate this process of kicking off a Data Diff before pushing code to git, we recommend creating a GitHub Actions job for draft PRs. For example:

    ```
    name: Data Diff on draft dbt PR

    on:
      pull_request:
        types: [opened, reopened, synchronize]
        branches:
          - '!main'

    jobs:
      run:
        if: github.event.pull_request.draft == true  # Run only on draft PRs
        runs-on: ubuntu-latest

        steps:
          - name: Checkout Code
            uses: actions/checkout@v2

          - name: Set Up Python
            uses: actions/setup-python@v2
            with:
              python-version: '3.8'

          - name: Install requirements
            run: pip install -r requirements.txt  

          - name: Install dbt dependencies
            run: dbt deps

          # Update with your S3 bucket details
          - name: Grab production manifest from S3
            run: |
              aws s3 cp s3://advanced-ci-manifest-demo/manifest.json ./manifest.json
            env:
              AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
              AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
              AWS_REGION: us-east-1

          - name: Run dbt and Data Diff
            env:
              DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
            run: |
              dbt run
              datafold diff dbt
              
          # Optional: Submit artifacts to Datafold for more analysis or logging
          - name: Submit artifacts to Datafold
            run: |
              set -ex
              datafold dbt upload --ci-config-id 350 --run-type pull_request --commit-sha ${GIT_SHA}
            env:
              DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
              GIT_SHA: "${{ github.event.pull_request.head.sha }}"

    ```
  </Accordion>
</AccordionGroup>


# Overview
Source: https://docs.datafold.com/faq/overview

Get answers to the most common questions regarding our product.

Have a question that isn’t answered here? Feel free to reach out to us at [support@datafold.com](mailto:support@datafold.com), and we’ll be happy to assist you!

<CardGroup cols={2}>
  <Card title="Data Diffing" href="/faq/data-diffing" horizontal />

  <Card title="CI/CD Testing" href="/faq/ci-cd-testing" horizontal />

  <Card title="Data Migration Automation" href="/faq/data-migration-automation" horizontal />

  <Card title="Data Reconciliation" href="/faq/data-reconciliation" horizontal />

  <Card title="Data Monitoring & Observability" href="/faq/data-monitoring-observability" horizontal />

  <Card title="Datafold with dbt" href="/faq/datafold-with-dbt" horizontal />

  <Card title="Data Storage & Security" href="/faq/data-storage-and-security" horizontal />

  <Card title="Performance & Scalability" href="/faq/performance-and-scalability" horizontal />

  <Card title="Resource Management" href="/faq/resource-management" horizontal />
</CardGroup>


# Performance and Scalability
Source: https://docs.datafold.com/faq/performance-and-scalability


<AccordionGroup>
  <Accordion title="How scalable is Datafold?">
    Datafold is highly scalable, supporting data teams working with billion-row datasets and thousands of data transformation/dbt models. It offers powerful performance optimization features such as [SQL filtering](/deployment-testing/configuration/model-specific-ci/sql-filters), [sampling](/data-diff/cross-database-diffing/best-practices), and [Slim Diff](/deployment-testing/best-practices/slim-diff), which allow you to focus on testing the datasets that are most critical to your business, ensuring efficient and targeted data quality validation.
  </Accordion>

  <Accordion title="How can I optimize diff performance at scale?">
    Datafold pushes down compute to your database, and the performance of data diffs largely depends on the underlying SQL engine. Here are some in-app strategies to optimize performance:

    1. [Enable sampling](/data-diff/cross-database-diffing/best-practices): Sampling reduces the amount of data processed by comparing a randomly chosen subset. This approach balances diff detail with processing time and cost, suitable for most use cases.

    2. [Use SQL Filters](/deployment-testing/configuration/model-specific-ci/sql-filters): If you only need to compare a specific subset of data (e.g., for a particular city or a recent time period), adding a SQL filter can streamline the diff process.

    3. **Exclude columns/tables**: When certain columns or tables are unnecessary for critical comparisons—such as temporary tables with dynamic values, metadata fields, or timestamp columns that always differ—you can exclude these to increase diff efficiency and speed.

    You can exclude columns when you create a new Data Diff or when you clone an existing one:

    <Frame>
      ![](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/faq/new_diff_exclude_columns.png)
    </Frame>

    To exclude them in your CI/CD pipeline, [follow this guide](/integrations/orchestrators/dbt-core#advanced-settings-configuration) to specify them in the Advanced settings of your CI/CD configuration in Datafold.

    4. **Optimize SQL queries**: Refactor your SQL queries to improve the efficiency of database operations, reducing execution time and resource usage.
    5. **Leverage database performance features**: Ensure your database is configured to match typical diff workload patterns. Utilize features like query optimization, caching, and parallel processing to boost performance.
    6. **Increase data warehouse resources**: If using a platform like Snowflake, consider increasing the size of your warehouse to allocate more resources to Datafold operations.
  </Accordion>
</AccordionGroup>


# Resource Management
Source: https://docs.datafold.com/faq/resource-management


<Accordion title="What is Datafold’s resource consumption footprint? How will Datafold affect my data warehouse costs?">
  Recognizing the importance of efficient data reconciliation, we provide a number of strategies to make the diffing process as efficient as possible:

  **Efficient Algorithm**

  The diffing algorithm itself leverages [stochastic checksumming](/faq/data-diffing#what-is-stochastic-checksumming) which is optimized for efficiency at scale. It provides detailed comparison by pushing down the compute to both source and target databases without requiring the extraction of datasets for comparison.

  **Flexible Controls**

  Users can easily control the volume of data used in diffing by using:

  * [Filters](/deployment-testing/configuration/model-specific-ci/sql-filters): Focus on the most relevant part of the dataset
  * [Sampling](/data-diff/cross-database-diffing/best-practices): Set sampling as a percentage of rows or desired confidence level
  * [Slim Diff](/deployment-testing/best-practices/slim-diff): Selectively diff only the models that have dbt code changes in your pull request.

  **Workload Management**

  Users can apply controls to enforce low diffing footprint:

  * On the Datafold side: Set desired concurrency
  * On the database side: Most databases support workload management settings to ensure that Datafold does not consume more than X% CPU or Y% RAM

  Also, consider that using a data quality tool like Datafold to catch issues before production will reduce cost over time as it lowers the need for expensive reprocessing and troubleshooting. Datafold's features like filtering, sampling, and Slim Diff ensure that only relevant datasets are tested, minimizing the computational load on your data warehouse. This targeted approach can lead to more efficient resource usage and potentially lower data warehouse operation costs.
</Accordion>


# dbt Exposures
Source: https://docs.datafold.com/integrations/bi-data-apps/dbt

Incorporate dbt Exposures into your Datafold lineage.

In dbt, Exposures allow you to define downstream uses of your data (e.g., in dashboards). You can include dbt Exposures in lineage within Data Explorer using our dbt Exposures integration.

## Set up the integration

<Note>
  If you haven't aleady created a dbt CI integration, please start [there](/integrations/orchestrators/).
</Note>

1. Visit Settings > BI & Data Apps > Add new integration
2. Select "dbt Exposures"
3. Enter a name for the integration (this can be anything)
4. Select your existing dbt CI integration from the dropdown
5. Save the integration

![Add dbt Exposures integration](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt-exposures-add-integration.png)
![Configure dbt Exposures integration](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt-exposures-integration-config.png)

## View dbt Exposures in Data Explorer

<Note>
  Your dbt Exposures may not appear in lineage immediately after setting up the integration. To force an update, return to the integration settings and select "Sync now".
</Note>

When you visit Data Explorer, you'll now see the option to filter for dbt Exposures:

![Filter for dbt Exposures](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt-exposures-filters.png)

Your dbt Exposures will also appear in lineage:

![View dbt Exposures in lineage](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt-exposures-lineage.png)


# Hightouch
Source: https://docs.datafold.com/integrations/bi-data-apps/hightouch

Navigate to Settings > Integrations > Data Apps and add a Hightouch Integration.

## Create a Hightouch Integration

<Frame caption="Create Integration">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_apps_add_new_integration-8fa569d0d0beb7191934287bdcdda2f1.png" />
</Frame>

<Frame caption="Create Integration">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/hightouch_blank_integration_form-379e98ee744aa52224d2dd6ccd110a44.png" />
</Frame>

Complete the configuration by specifying the following fields:

| Field Name              | Description                                                                                                                                                                                                                                                                                                            |
| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Integration name        | An identifier used in Datafold to identify this Data App configuration.                                                                                                                                                                                                                                                |
| Workspace URL           | Then, grab your workspace URL, by navigating to **Settings** → **Workspace** tab → **Workspace slug** or by finding the workspace name in the search bar ([https://app.hightouch.io/](https://app.hightouch.io/) \<workspace\_slug/>).                                                                                 |
| API Key                 | Log into your [Hightouch account](https://app.hightouch.com/login) and navigate to **Settings** → **API keys** tab → **Add API key** to generate a new, unique API key.  <Icon icon="triangle-exclamation" /> Your API key will appear only once, so please copy and save it to your password manager for further use. |
| Data connection mapping | When the correct credentials are entered we will begin to populate data connections in Hightouch (on the left side) that will need to be mapped to data connections configured in Datafold (on the right side). See image below.                                                                                       |

<Frame caption="Create Integration">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/hightouch_data_source_match-3ed927400af746ec7b2b637b09cdd055.png" />
</Frame>

When completed, click **Submit**.

It may take some time to sync all the Hightouch entities to Datafold and for Data Explorer to populate. When completed, your Hightouch models and sync will appear in Data Explorer as search results.

<Frame caption="Create Integration">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/hightouch_sync_results-6865862cb8cd146928f7783fd2a67f56.png" />
</Frame>

<Tip>
  **TIP**

  [Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) explains how to find out when your data app integration is ready.
</Tip>


# Looker
Source: https://docs.datafold.com/integrations/bi-data-apps/looker


## Create a code repositories integration

[Create a code repositories integration](/integrations/code-repositories) that connects Datafold to your Looker repository.

## Create a Looker integration

Navigate to Settings > Integrations > Data Apps and add a Looker integration.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_apps_add_new_integration-8fa569d0d0beb7191934287bdcdda2f1.png" alt="Add New Integration" />
</Frame>

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/looker_blank_integration_form-2891846b6665064a633f376d99acbde0.png" alt="Looker Integration Form" />
</Frame>

Complete the configuration by specifying the following fields:

| Field Name              | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Integration name        | An identifier used in Datafold to identify this Data App configuration.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| Project Repository      | Select the same repository as used in your Looker project.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| API Host URL            | The Looker [API Host URL](https://cloud.google.com/looker/docs/admin-panel-platform-api#api%5Fhost%5Furl). It has the following format: https\://\<instance\_name>.cloud.looker.com:\<port>. The port defaults are 19999 (legacy) and 443 (new), see the [Looker Docs](https://cloud.google.com/looker/docs/api-getting-started#looker%5Fapi%5Fpath%5Fand%5Fport) for hints. Examples: Legacy ([https://datafold.cloud.looker.com:19999](https://datafold.cloud.looker.com:19999)), New ([https://datafold.cloud.looker.com:443](https://datafold.cloud.looker.com:443)) |
| Client ID               | Follow [these steps](https://cloud.google.com/looker/docs/api-auth#authentication%5Fwith%5Fan%5Fsdk) to generate Client ID and Client Secret. These are always user specific. We recommend using a group email for continuity. See [Looker User Minimum Access Policy](/integrations/bi-data-apps/looker#looker-user-minimum-access-policy) for the required permissions.                                                                                                                                                                                                |
| Client Secret           | See Client ID.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| Data connection mapping | When the correct credentials are entered we will begin to populate data connections in Looker (on the left side) that will need to be mapped to data connections configured in Datafold (on the right side). See image below.                                                                                                                                                                                                                                                                                                                                            |

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/looker_configuration-0410bbaf211f889bf36bb8f93d378500.png" alt="Looker Configuration" />
</Frame>

When completed, click **Submit**.

It may take some time to sync all the Looker entities to Datafold and for Data Explorer to populate. When completed, your Looker assets will appear in Data Explorer as search results.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/looker_sync_results-e610d030d6891b22cffbceeae9d9a8d1.png" alt="Looker Sync Results" />
</Frame>

<Tip>
  **TIP**

  [Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) explains how to find out when your data app integration is ready.
</Tip>

## Looker user: minimum access policy

The user linked to the API credentials needs the predefined Developer role, or you can create a custom role with these permissions:

* `access_data`
* `download_without_limit`
* `explore`
* `login_special_email`
* `manage_spaces`
* `see_drill_overlay`
* `see_lookml`
* `see_lookml_dashboards`
* `see_looks`
* `see_pdts`
* `see_sql`
* `see_user_dashboards`
* `send_to_integration`

## Database/schema connection context

### Database specification

Using the Fully Qualified Names in your Looker view files is not always possible. If a view references a table as`my_schema.my_table`, Datafold might have difficulty finding which database this table actually is in. There are multiple ways to guide Datafold to make a correct choice, as summarized in the table below.

<Note>
  **INFO**

  Priority #1 takes precedence over #2, and so forth.
</Note>

| # | Source, if defined                                                                                                                                                            | Example                     |
| - | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------- |
| 1 | datafold\_force\_database **User Attribute** in Looker                                                                                                                        | looker\_db                  |
| 2 | **Fully Qualified Names** in your Looker view files                                                                                                                           | my\_db.my\_schema.my\_table |
| 3 | datafold\_default\_database **User Attribute** in Looker                                                                                                                      | another\_looker\_db         |
| 4 | **Database** specified in Looker, at Database connection settings\_(We can only read these if Datafold connects to Looker via an admin user, which is probably suboptimal.)\_ | my\_db                      |
| 5 | **Database** specified in Datafold, at [Database Connection settings](/integrations/databases/)                                                                               | my\_db                      |

### Supported custom Looker user attributes

| User Attribute              | Impact                                                                                                   |
| --------------------------- | -------------------------------------------------------------------------------------------------------- |
| datafold\_force\_database   | Database to use in all cases, even if a fully qualified path in LookML refers to another database.       |
| datafold\_default\_database | Database to use if Looker View does not explictly specify a database.                                    |
| datafold\_default\_schema   | Schema to use if Looker view does not explicitly specify a schema (which equals a dataset for BigQuery). |
| datafold\_default\_host     | *(BigQuery only)* Default project name.                                                                  |

<Note>
  **INFO**

  Make sure attributes are:

  * Explicitly defined for the user in question (not just falling back to a default);
  * Not marked as hidden.
</Note>

## Integration limitations

Datafold lets you connect to Looker and extend our capabilities to your Looker Views, Explores, Looks, and Dashboards. But this is a new feature, so there are some things we don’t support yet:

* **PDT/Derived Tables**:Datafold only works with the tables that come from your data connections, but not with the [tables](https://cloud.google.com/looker/docs/derived-tables#important%5Fconsiderations%5Ffor%5Fimplementing%5Fpersisted%5Ftables) that Looker makes from your SQL queries.
* **Merge Queries**: Datafold supports the Queries and Looks that make up your Dashboards, but [Merge Queries](https://cloud.google.com/looker/docs/merged-results) are not one of them. For some use cases you could achieve the same by joining the underlying views with an explore.
* **Usage metrics and popularity**: Datafold shows you your Looker objects - such as dashboards, looks, and fields - but not how much you use or like them.

We are improving our Looker integration and adding more features soon. We welcome your feedback and suggestions.


# Mode
Source: https://docs.datafold.com/integrations/bi-data-apps/mode


## Obtain credentials from Mode

<Note>
  **INFO**

  To complete this integration, your **Mode** account must be a part of a [Mode Business Workspace](https://mode.com/compare-plans) in order to generate an API Token.
</Note>

<Note>
  **INFO**

  You need to have **Admin** privileges in your Mode Workspace to be able to create an API Token.
</Note>

In **Mode**, navigate to **Workspace Settings** → **Privacy & Security** → **API**.

Click the <Icon icon="gear" /> icon, and choose **Create new token**.

<Frame caption="Tokens">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/tokens-40fd2f50b2ec5d0acc295c11ae9e0548.png" />
</Frame>

Take note of:

* Token Name,
* Token Password,
* And the URL of the page that lists the tokens. It should look like this:

  [https://app.mode.com/organizations/\{workspace}/api\_keys](https://app.mode.com/organizations/\{workspace}/api_keys)

Take note of `{workspace}` part, we will need it when configuring Datafold.

## Configure Datafold

Navigate to **Settings** → **Integrations** → **BI & Data Apps**.

<Frame caption="Add New Integration">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_apps_add_new_integration-8fa569d0d0beb7191934287bdcdda2f1.png" />
</Frame>

Choose **Mode** Integration to add.

<Frame caption="Choose Type">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/choose-type-bf9b2d554dc7739742b1f007c9bef227.png" />
</Frame>

This will bring up **Mode** integration parameters.

<Frame caption="Create Integration">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/create-585f8b3a1e9f38c5bde10e6528e0c6b4.png" />
</Frame>

Complete the configuration by specifying the following fields:

| Field Name       | Description                                                             |
| ---------------- | ----------------------------------------------------------------------- |
| Integration name | An identifier used in Datafold to identify this Data App configuration. |
| Token            | API token, as generated above.                                          |
| Password         | API token password, as generated above.                                 |
| Workspace        | Workspace name obtained from your workspace URL.                        |

<Note>
  **INFO**

  **Workspace Name** field is not marked as required on this screen. That's for backwards compatibility: the legacy type of Mode API token, known as **Personal Token**, does not require that parameter. However, such tokens can no longer be created, so we're no longer providing instructions for them.
</Note>

When completed, click **Save**.

Datafold will try to connect to Mode and, if any issues with the connection arise you will be alerted.

Datafold will start to sync your reports. It can take some time to fetch all the reports, depending on how many of them there are.

<Tip>
  **TIP**

  [Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) explains how to find out when your data app integration is ready.
</Tip>

Now that Mode sync has completed — you can browse your Mode reports!

<Frame caption="Tokens">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/mode_sync_results-dd76443d59234b676d29c6999f278c48.png" />
</Frame>


# Power BI
Source: https://docs.datafold.com/integrations/bi-data-apps/power-bi

Include Power BI entities in Data Explorer and column-level lineage.

## Overview

Our Power BI integration can help you visualize column-level lineage dependencies between warehouse tables and Power BI entities using [Data Explorer](/data-explorer/how-it-works). Datafold supports the following Power BI entity types:

* Tables (with Columns)
* Reports (with Fields)
* Dashboards

## Set up the integration

<Steps>
  <Step title="Open Microsoft 365 admin center">
    Navigate to [**Microsoft 365 admin center** -> 👤 **Active users**](https://admin.microsoft.com/#/users) and choose the user that Datafold will authenticate under.

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/power-bi/microsoft-admin-user.png" />
    </Frame>

    As highlighted in the screenshot above, this user should have the **Power Platform Administrator** role assigned to it.
  </Step>

  <Step title="If the role is missing, assign it">
    Click **Manage roles**, enable the **Power Platform Administrator** role, and save changes.

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/power-bi/microsoft-role.png" />
    </Frame>
  </Step>

  <Step title="Configure Power BI API">
    Navigate to [Power BI Admin Portal](https://app.powerbi.com/admin-portal/tenantSettings?experience=power-bi) and enable the following two settings:

    * Enhance admin APIs responses with detailed metadata
    * Enhance admin APIs responses with DAX and mashup expressions

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/power-bi/admin-portal.png" />
    </Frame>
  </Step>

  <Step title="Create Power BI integration in Datafold">
    In the Datafold app, navigate to **Settings** -> **BI & Data Apps**, and click **+ Add new integration**. Choose **Power BI** from the list.

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/power-bi/create.png" />
    </Frame>
  </Step>

  <Step title="Fill in the name for your new integration">
    ...and then **Save**.

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/power-bi/new.png" />
    </Frame>

    On clicking **Save**, the system will redirect you to Power BI.
  </Step>

  <Step title="Sign in to Power BI">
    ...if not already signed in.

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/power-bi/sign-in.png" />
    </Frame>
  </Step>

  <Step title="Grant permissions to Datafold">
    Allow the Datafold integration to use Power BI. Depending on the roles configured for your user in the Admin center, you may require a confirmation from a **Global Administrator**. Follow the steps in the wizard.

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/power-bi/consent.png" />
    </Frame>
  </Step>

  <Step title="Integration is ready">
    You will be redirected back to Datafold and see a message that Power BI is successfully connected.

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/power-bi/success.png" />
    </Frame>
  </Step>

  <Step title="Power BI integration needs some time to sync">
    You can check out **Jobs** -> **BI & Data Apps** for the status of the sync job.

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/power-bi/jobs.png" />
    </Frame>

    See [Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) for more details.
  </Step>

  <Step title="Power BI entities are now searchable">
    When the sync is complete, you will see Power BI entities in **Data Explorer**.

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/power-bi/data-explorer.png" />
    </Frame>
  </Step>

  <Step title="Lineage is now available">
    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/power-bi/lineage.png" />
    </Frame>
  </Step>
</Steps>

## Need help?

If you have any questions about our Power BI integration, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).


# Tableau
Source: https://docs.datafold.com/integrations/bi-data-apps/tableau

Downstream Tableau entities show up in the **Dependencies** tab of data diff results.

## Overview

Our Tableau integration can help you visualise column-level lineage dependencies between warehouse tables and Tableau entities using [Data Explorer](/data-explorer/how-it-works).

Lineage from upstream data warehouses into Tableau is supported for the following data warehouse types:

* Snowflake
* Redshift
* Databricks
* BigQuery

Potentially impacted Tableau entity names are also automatically identified in the Datafold CI printout.

The following Tableau entities types will appear in Data Explorer, data diff results, and the Datafold CI printout:

* Tableau **Data Connections** and related fields;
* **Workbooks** and related fields;
* **Dashboards**.

<Info>
  To declutter <Icon icon="sparkles" /> the Datafold lineage, Datafold filters out Tableau Data Connections and Data Connections fields that have no downstream dependencies.
</Info>

If you're interested in learning more about the Datafold integration, [please reach out to our team](https://www.datafold.com/booktime).

## Set up your Tableau instance

To connect Datafold to Tableau, you will require the following credentials from your Tableau site:

* Server URL,
* Site Name,
* Token Name,
* Token Value.

## If you are using Tableau Server

**Tableau Server** is an installation of Tableau that you are managing on your company's own infrastructure and domain. This is an alternative to using a Tableau Cloud subscription.

* Make sure that the [metadata-services](https://help.tableau.com/current/server/en-us/cli%5Fmaintenance%5Ftsm.htm#cat%5Fenable) are enabled by running the following command:

```
tsm maintenance metadata-services enable

```

* Ensure that your Tableau Server instance is accessible to Datafold. Please get in touch with our team to set this up.

## Obtaining server URL & Site Name

These can be found from URL of your Tableau home page. For instance, if your home page is:

```
https://eu-west-1a.online.tableau.com/#/site/mysupersite/home

```

Then:

* **Server URL** is `https://eu-west-1a.online.tableau.com` (the hostname with `https` in front)
* **Site Name** is `mysupersite` (the part directly after `#/site/` and until the next `/`)

## Obtaining Token Name & Token Value[](#obtaining-token-name--token-value "Direct link to Obtaining Token Name & Token Value")

Ensure that **Personal Access Tokens** are enabled on your Tableau site. For that, navigate to **Settings** and there, on the **General** tab, search for `Personal Access Tokens`. That feature needs to be enabled — not necessarily for everyone but for the user for whom we will be creating the token Datafold will use.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/tableau_enable_personal_access_tokens-a099600ff7a46573c1a2a34cd805323f.png" alt="Enable Personal Access Tokens" />
</Frame>

Now that Personal Access Tokens are enabled, click on your user’s avatar in the top right, choose **My Account Settings** in the pop-up menu, and then search for **Personal Access Tokens** on your settings page.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/tableau_personal_access_token-c39d6a9b98100f46a893e473dd8608f9.png" alt="Personal Access Token" />
</Frame>

Input a desired name, say `datafold`, into the **Token Name** field, and click **Create Token**.

This will open a popup window. Click **Copy Secret** and save the copied value somewhere — you will use this when setting up Datafold. You can read more about personal access tokens on the official Tableau documentation [here](https://help.tableau.com/current/server/en-us/security%5Fpersonal%5Faccess%5Ftokens.htm).

## Create a Tableau Integration

Navigate to **<Icon icon="gear" /> Settings** → **Integrations** → **Data Apps**. Click **<Icon icon="plus" /> Add new integration**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_apps_add_new_integration-8fa569d0d0beb7191934287bdcdda2f1.png" alt="Add New Integration" />
</Frame>

A click on **Tableau** will lead you to the integration creation screen. Fill in the fields with data we obtained earlier. See the screenshot for hints.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/tableau_settings-a8bedc87ed42a07c156b097ddca43779.png" alt="Tableau Integration Settings" />
</Frame>

…and click **Save**.

## What's next?

The initial sync might take some time; it depends on the number of objects at your Tableau site. Eventually, Tableau entities — **Data Connections**, **Workbooks**, and **Dashboards** should appear at your **Lineage** tab.

<Tip>
  **TIP**

  [Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) explains how to find out when your data app integration is ready.
</Tip>

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/search-4538ec5be9e0ecce0829e7e7eee94bd9.png" alt="Search Tableau Entities" />
</Frame>

Clicking on a Tableau entity will lead you to the Lineage screen:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/lineage-cbcb37952c6d09346c7877038c9f3e39.png" alt="Tableau Lineage Screen" />
</Frame>

<Tip>
  **TIP**

  As you might have noticed on the screenshots above, Datafold does not display Tableau **Sheets**. Instead, we group, and deduplicate, all **Fields** of all **Sheets** within a **Workbook** and display them as **Fields** of the **Workbook**.

  On the screenshot directly above, `Demo Workbook` might include one **Sheet** with `Created At` field and another with `Sub Plan` field, but for our purposes we unite all of those fields beneath the **Workbook** — which makes the Lineage graph much less cluttered, and much easier to browse <Icon icon="face-smirking" />
</Tip>

## <Icon icon="siren-on" /> I changed something in Tableau — but Datafold does not reflect my changes

Datafold relies on Tableau API to retrieve information about Tableau data model. Due to caching, Tableau API might be a bit late to reflect a recently introduced change. If your changes are not being displayed at Datafold, please wait for a few hours — and they should appear <Icon icon="thumbs-up" />


# Tracking Jobs
Source: https://docs.datafold.com/integrations/bi-data-apps/tracking-jobs

Track the completion and success of your data app integration syncs.

To track the progress of your data app integration, go to the **<Icon icon="wrench" /> Jobs** tab in the left sidebar.

<Frame caption="Data App Jobs">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_app_jobs-46476d10e9860210c1889b5b9ff196f8.png" />
</Frame>

Your **Search** and **Lineage** features will be available once you see a job marked as `Done` for your integration on this screen.

<Note>
  **INFO**

  After the initial sync, Datafold will automatically re-sync every hour to keep your Data App assets up to date.
</Note>


# Integrate with Code Repositories
Source: https://docs.datafold.com/integrations/code-repositories

Connect your code repositories with Datafold.

<Info>
  **NOTE**

  To integrate with code repositories, first connect a [Data Connection](/integrations/databases).

  Next, go to **Settings** → **Repositories** and click **Add New Integration**. Then, choose your code repository provider.
</Info>

<CardGroup>
  <Card title="GitHub" icon="file" href="/integrations/code-repositories/github" horizontal />

  <Card title="GitLab" icon="file" href="/integrations/code-repositories/gitlab" horizontal />

  <Card title="Bitbucket" icon="file" href="/integrations/code-repositories/bitbucket" horizontal />

  <Card title="Azure DevOps" icon="file" href="/integrations/code-repositories/azure-devops" horizontal />
</CardGroup>


# Azure DevOps
Source: https://docs.datafold.com/integrations/code-repositories/azure-devops


## 1. Issue an Access Token

To get your [repository access token](https://learn.microsoft.com/en-us/azure/devops/organizations/accounts/use-personal-access-tokens-to-authenticate?view=azure-devops\&tabs=Windows#create-a-pat), navigate to your Azure DevOps settings and create a new token.

When configuring your token, enable following permissions:

* **Code** -> **Read & write**
* **Identity** -> **Read**

We need write access to the repository to post reports with Data Diff results to pull requests, and read access to identities to be able to properly display Azure DevOps users in the Datafold UI.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/azure_devops_access_token-7bd79728ae3447aa77f4246a1e66b249.png" />
</Frame>

## 2. Configure integration in Datafold

Navigate back to Datafold and fill in the configuration form.

* **Personal/project Access Token**: the token you created in step 1.
* **Organization**: your Azure DevOps organization name.
* **Project**: your Azure DevOps project name.
* **Repository**: your Azure DevOps repository name.

For example, if your Azure DevOps repository URL is `https://dev.azure.com/datafold/analytics/_git/dbt`:

* Your **Organization** is `datafold`
* your **Project** is `analytics`
* your **Repository** is `dbt`


# Bitbucket
Source: https://docs.datafold.com/integrations/code-repositories/bitbucket


## 1. Issue an Access Token

### Bitbucket Cloud

To get the [repository access token](https://support.atlassian.com/bitbucket-cloud/docs/create-a-repository-access-token/), navigate to your Bitbucket repository settings and create a new token.

When configuring your token, enable following permissions:

* **Pull requests** -> **Write**, so that Datafold can post reports with Data Diff results to pull requests.
* **Webhooks** -> **Read and write**, so that Datafold can configure all webhooks that we need automatically.

<Frame caption="Bitbucket Access Token">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/bitbucket_access_token-31e43bcafa70921b2f847623fbc149e5.png" />
</Frame>

### Bitbucket Data Center / Server

To get a [repository access token](https://confluence.atlassian.com/bitbucketserver/http-access-tokens-939515499.html), navigate to your Bitbucket repository settings and create a new token.

When configuring your token, enable **Repository admin** permissions.
We need admin access to the repository to be able to post reports with Data Diff results to pull requests, and also configure all necessary webhooks automatically.

<Frame caption="Bitbucket Server Access Token">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/bitbucket_server_access_token-c2504c12d9bef6081251b9eb6aa0b12b.png" />
</Frame>

## 2. Configure integration in Datafold

Navigate back to Datafold and fill in the configuration form.

### Bitbucket Cloud

* **Personal/project Access Token**: the token you created in step 1.
* **Repository**: your Bitbucket repository name.
  For example, if your Bitbucket project URL is `https://bitbucket.org/datafold/dbt/`, your Project Name is `datafold/dbt`.

### Bitbucket Data Center / Server

* **Personal/project Access Token**: the token you created in step 1.
* **Repository**: the full URL of your Bitbucket repository.
  For example, `https://bitbucket.myorg.com/projects/datafold/repos/dbt`.


# GitHub
Source: https://docs.datafold.com/integrations/code-repositories/github


<Note>
  **PREREQUISITES**

  * Datafold Admin role
  * Your GitHub account must be a member of the GitHub organization where the Datafold app is to be installed
  * Approval of your request to add the Datafold app to your repo must be granted by a GitHub repo admin or GitHub organization owner.
</Note>

To set up a new integration, click the repository field and select the **Install GitHub app** button.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/github_install_button-27ecc75b58ccbe7681aa70223cc0e21b.png" />
</Frame>

From here, GitHub will redirect you to login to your account and choose which organization you would like to connect. After choosing the right organization, you may choose to allow access to all repositories or specific ones.

Once complete, you will be redirected back to Datafold, where you can select the appropriate repository for connection.

<Tip>
  **TIP**

  If you lack permission to add the Datafold app, request approval from a GitHub admin.

  After installation, click **Refresh** to display the newly added repositories in the dropdown list.
</Tip>

To complete the setup, click **Save**!

<Note>
  **INFO**
  VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.
</Note>

## GitHub integration for VPC / single-tenant Datafold deployments

### Create a GitHub application

VPC clients of Datafold need to create their own GitHub app, rather than use the shared Datafold GitHub application.

Start by navigating to **Settings** → **Global Settings**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/onprem_github_settings-4ba347a4179f693ad9cf851188d3cd3c.png" />
</Frame>

To begin the set up process, enter the domain that was registered for the VPC deployment in [AWS](/datafold-deployment/dedicated-cloud/aws) or [GCP](/datafold-deployment/dedicated-cloud/gcp). Then, enter the name of the GitHub organization where you'd like to install the application. When filled, click **Create GitHub App**.

This will redirect the admin to GitHub, where they may need to authenticate. **The GitHub user must be an admin of the GitHub organization.**

After authentication, you should be directed to enter a description for the GitHub App. After entering the description, click **Create GitHub app**.

Once the application is created, you should be returned to the Datafold settings screen. The button should then have disappeared, and the details for the GitHub App should be visible.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/onprem_github_confirmation-040de7316a509d880b13d6be431da24d.png" />
</Frame>

### Making the GitHub application public

If you have a private GitHub instance with multiple organizations and want to use the Datafold app across all of them, you'll need to make the app public on your private server.

You can do so in GitHub by following these steps:

1. Navigate to the GitHub organization where the app was created.
2. Click **Settings**.
3. Go to **Developer Settings** → **GitHub Apps**.
4. Select the **Datafold app**.
5. Click **Advanced**, then **Make public**.

<Note>
  The app will be public **only on your private GitHub server**, ensuring it can be accessed across all your organizations.
</Note>

### Configure GitHub in Datafold

If you see this screen with all the details, you've successfully created a GitHub App! Now that the app is created, you have to install it using the [GitHub integration setup](/integrations/code-repositories/github).


# GitLab
Source: https://docs.datafold.com/integrations/code-repositories/gitlab


To get the [project access token](https://docs.gitlab.com/ee/user/project/settings/project%5Faccess%5Ftokens.html), navigate to your GitLab project settings and create a new token.

<Tip>
  **TIP**

  Project access tokens are preferred over personal tokens for security.
</Tip>

When configuring your token, select the **Maintainer** role and select the **api** scope.

<Frame caption="GitLab Access Token">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/gitlab_access_token-3c34d99f464332fd5e1a3dc672c89d36.png" />
</Frame>

**Project Name** is your Gitlab project URL after `gitlab.com/`. For example, if your Gitlab project URL is `https://gitlab.com/datafold/dbt/`, your Project Name is `datafold/dbt/`

Finally, navigate back to Datafold and enter the **Project Token** and the name of your **Project** before hitting **Save**:

<Frame caption="New GitLab Integration in Datafold">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_app_gitlab_new_integration-556436f49dbd3ab4da5448a17540abd9.png" />
</Frame>

If you want to change the GitLab URL, you can do so after setting up the integration. To do so, navigate to **Settings**, then **Org Settings**:

<Frame caption="Change GitLab URL">
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/data_app_gitlab_change_url-f8ee1e8babed20cf8cb78318526d9f3e.png" />
</Frame>


# Set Up Your Data Connection
Source: https://docs.datafold.com/integrations/databases

Set up your Data Connection with Datafold.

<Info>
  **NOTE**

  To set up your Data Connection, navigate to **Settings** → **Data Connection** and click **Add New Integration**.
</Info>

<CardGroup>
  <Card title="Amazon S3" icon="file" href="/integrations/databases/amazon-s3" horizontal />

  <Card title="Azure Data Lake Storage (ADLS)" icon="file" href="/integrations/databases/adls" horizontal />

  <Card title="Athena" icon="file" href="/integrations/databases/athena" horizontal />

  <Card title="BigQuery" icon="file" href="/integrations/databases/bigquery" horizontal />

  <Card title="Databricks" icon="file" href="/integrations/databases/databricks" horizontal />

  <Card title="Dremio" icon="file" href="/integrations/databases/dremio" horizontal />

  <Card title="Google Cloud Storage (GCS)" icon="file" href="/integrations/databases/google-cloud-storage" horizontal />

  <Card title="MySQL" icon="file" href="/integrations/databases/mysql" horizontal />

  <Card title="Netezza" icon="file" href="/integrations/databases/netezza" horizontal />

  <Card title="Oracle" icon="file" href="/integrations/databases/oracle" horizontal />

  <Card title="Snowflake" icon="file" href="/integrations/databases/snowflake" horizontal />

  <Card title="PostgreSQL" icon="file" href="/integrations/databases/postgresql" horizontal />

  <Card title="Redshift" icon="file" href="/integrations/databases/redshift" horizontal />

  <Card title="SAP HANA" icon="file" href="/integrations/databases/sap-hana" horizontal />

  <Card title="Microsoft SQL Server" icon="file" href="/integrations/databases/sql-server" horizontal />

  <Card title="Starburst" icon="file" href="/integrations/databases/starburst" horizontal />

  <Card title="Teradata" icon="file" href="/integrations/databases/teradata" horizontal />
</CardGroup>


# Azure Data Lake Storage (ADLS)
Source: https://docs.datafold.com/integrations/databases/adls


<Note>
  This integration supports both Azure Data Lake Storage and Azure Blob Storage.
</Note>

**Steps to complete:**

1. [Create an app and service principal in Microsoft Entra](#create-an-app-and-service-principal-in-microsoft-entra)
2. [Configure your data connection in Datafold](#configure-your-data-connection-in-datafold)
3. [Create your first file diff](#create-your-first-file-diff)

## Create an app and service principal in Microsoft Entra

Create an app and service principal in Entra using a client secret (not certificate). Check out [Microsoft's documentation](https://learn.microsoft.com/en-us/entra/architecture/service-accounts-principal) on this topic if you need help.

![Use client secret](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/adls-client-secret.png)

## Configure your data connection in Datafold

![ADLS Data Connection](https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/adls-connection.png)

| Field Name      | Description                                                                                              |
| --------------- | -------------------------------------------------------------------------------------------------------- |
| Connection name | The name you'd like to give to this connection in Datafold                                               |
| Account Name    | This is in the URL of any filepath in ADLS, e.g. `<account>.dfs.core.windows.net/<container>/<filepath>` |
| Client ID       | The client ID of the app you created in Microsoft Entra                                                  |
| Client Secret   | The client secret of the app you created in Microsoft Entra                                              |
| Tenant ID       | The tenant ID of the app you created in Microsoft Entra                                                  |

## Create your first file diff

For general guidance on how file diffs work in Datafold, check out our [file diffing docs](/data-diff/file-diffing).

When creating a diff, note that the file path you provide may differ depending on whether you're using ADLS or Blob Storage. For example:

* ADLS: `abfss://<my_filesystem>/<path>/<my_file>.<csv, xlsx, parquet, etc.>`
* Blob Storage: `az://<my_container>/<path>/<my_file>.<csv, xlsx, parquet, etc.>`


# Amazon S3
Source: https://docs.datafold.com/integrations/databases/amazon-s3


**Steps to complete:**

1. [Create a user with access to S3](/integrations/databases/google-cloud-storage#create-a-service-account)
2. [Assign the user to the S3 bucket](/integrations/databases/google-cloud-storage#service-account-access-and-permissions)
3. [Create an access key for the user](/integrations/databases/google-cloud-storage#generate-a-service-account-key)
4. [Configure your data connection in Datafold](/integrations/databases/google-cloud-storage#configure-in-datafold)

## Create a user with access to S3

To connect your Amazon S3 bucket, you will need to create a user for Datafold to use.

* Navigate to the [AWS Console](https://console.aws.amazon.com/).
* Click on the search bar in the top header, then find **IAM** service and click on it.
* Click on the **Users** item of the Access Management section.
* Click on the **Create user** button.
* Create a user named `Datafold`.
* Assign the user to the `AmazonS3FullAccess` policy.
* When done, keep ARN of the user handy as you'll need it in the next step.

## Assign the user to the S3 bucket

* Go to S3 panel and select the bucket.
* Click on the **Permissions** tab.
* Click on **Edit** next to the **Bucket Policy**.
* Add the following policy:
  ```json
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "arn:aws:iam:::user/Datafold" // Replace with your user's ARN
        },
        "Action": [
          "s3:GetObject",
          "s3:PutObject" // Optional: Only needed if you're planning to use this data connection as a destination for materialized diff results.
        ],
        "Resource": [
          "arn:aws:s3:::your-bucket-name/*", // Replace with your bucket's ARN
          "arn:aws:s3:::your-bucket-name" // Replace with your bucket's ARN
        ]
      }
    ]
  }
  ```

<Note>
  The Datafold user requires the following roles and permissions:

  * **s3:GetObject** for read access.
  * **s3:PutObject** for write access if you're planning to use this data connection as a destination for materialized diff results.
</Note>

## Create an access key for the user

Next, go back to the **IAM** page to generate a key for Datafold.

* Click on the **Users** page.
* Click on the **Datafold** user.
* Click on the **Security Credentials** tab.
* Click on **Create access key** and select **Create new access key**.
* Select **JSON** and click **Create**.

## Configure in Datafold

| Field Name                                                | Description                                                                                                                           |
| --------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| Connection name                                           | A name given to the data connection within Datafold                                                                                   |
| Bucket Name                                               | The name of the bucket you want to connect to.                                                                                        |
| Bucket region                                             | The region of the bucket you want to connect to.                                                                                      |
| Key ID                                                    | The key file generated in the [Create an access key for the user](#create-an-access-key-for-the-user) step                            |
| Secret Access Key                                         | The secret access key generated in the [Create an access key for the user](#create-an-access-key-for-the-user) step                   |
| Directory for writing diff results                        | Optional. The directory in the bucket where diff results will be written. Service account should have write access to this directory. |
| Default maximum number of rows to include in diff results | Optional. The maximum number of rows that a file with materialized results will contain.                                              |

Click **Create**. Your data connection is ready!


# Athena
Source: https://docs.datafold.com/integrations/databases/athena


**Steps to complete:**

1. [Create an S3 bucket](/integrations/databases/athena#create-s3-bucket)
2. [Run SQL Script for permissions](/integrations/databases/athena#run-sql-script)
3. [Configure your data connection in Datafold](/integrations/databases/athena#configure-in-datafold)

### Create an S3 bucket

If you don't already have an S3 bucket for your cluster, you'll need to create one. Datafold uses this bucket to create temporary tables and store data in it. You can learn how to create an S3 bucket in AWS by referring to the [AWS documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html).

### Run SQL Script and Create Schema for Datafold

To connect to AWS Athena, you must generate an `AWS Access Key ID` and an `AWS Secret Access Key`. These keys provide read-only access to all tables in all schemas and write access to the Datafold-specific schema for temporary tables. If you don't have these keys yet, follow the steps outlined in the [AWS documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id%5Fcredentials%5Faccess-keys.html).

Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in the your warehouse.

```
/* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing witin your data warehouse. */

CREATE SCHEMA IF NOT EXISTS awsdatacatlog.datafold_tmp;
```

### Configure in Datafold

| Field Name                  | Description                                                                    |
| --------------------------- | ------------------------------------------------------------------------------ |
| AWS Access Key ID           | Your AWS Access Key, which can be found in your AWS Account.                   |
| AWS Secret Access Key       | The AWS Secret Key (generate it in your AWS account if you don't have it yet). |
| S3 Staging Directory        | The S3 bucket where table data is stored.                                      |
| AWS Region                  | The region of your Athena cluster.                                             |
| Catalog                     | The catalog, which is typically awsdatacatalog by default.                     |
| Database                    | The database or schema with tables, typically default by default.              |
| Schema for Temporary Tables | The schema (datafold\_tmp) created in our SQL script.                          |

Click **Create** to complete the setup of your data connection in Datafold.


# BigQuery
Source: https://docs.datafold.com/integrations/databases/bigquery


**Steps to complete:**

1. [Create a Service Account](/integrations/databases/bigquery#create-a-service-account)
2. [Give the Service Account BigQuery Data Viewer, BigQuery Job User, BigQuery Resource Viewer access](/integrations/databases/bigquery#service-account-access-and-permissions)
3. [Create a temporary dataset and give BiqQuery Data Editor access to the service account](/integrations/databases/bigquery#create-a-temporary-dataset)
4. [Generate a Service Account JSON key](/integrations/databases/bigquery#generate-a-service-account-key)
5. [Configure your data connection in Datafold](/integrations/databases/bigquery#configure-in-datafold)

## Create a Service Account

To connect Datafold to your BigQuery project, you will need to create a *service account* for Datafold to use.

* Navigate to the [Google Developers Console](https://console.developers.google.com/), click on the drop-down to the left of the search bar, and select the project you want to connect to.
  * *Note: If you do not see your project, you may need to switch accounts.*
* Click on the hamburger menu in the upper left, then select **IAM & Admin** followed by **Service Accounts**.
* Create a service account named `Datafold`.

## Service Account Access and Permissions

The Datafold service account requires the following roles and permissions:

* **BigQuery Data Viewer** for read access on all the datasets in the project.
* **BigQuery Job User** to run queries.
* **BigQuery Resource Viewer** to fetch the query logs for parsing lineage.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/bigquery_permissions-a7a43ded62c06a55f0337cf36924dcf5.png" />
</Frame>

## Create a Temporary Dataset

Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse.

**Caution** - Make sure that the dataset lives in the same region as the rest of the data, otherwise, the dataset will not be found.

Let's navigate to BigQuery in the console and create a new dataset.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/bigquery_tempdataset-b7d4da9e04f4239b90067a7d4858f183.png" />
</Frame>

* Give the dataset a name like `datafold_tmp` and grant the Datafold service account the **BigQuery Data Editor** role.

## Generate a Service Account Key

Next, go back to the **IAM & Admin** page to generate a key for Datafold.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/bigquery_key-368911548a71c512d065b1a227dace96.png" />
</Frame>

We recommend using the json formatted key. After creating the key, it will be saved on your local machine.

## Configure in Datafold

| Field Name                  | Description                                                                                                                                                                                                                                            |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Name                        | A name given to the data connection within Datafold                                                                                                                                                                                                    |
| Project ID                  | Your BigQuery project ID. It can be found in the URL of your Google Developers Console: [https://console.developers.google.com/apis/library?project=MY\\\_PROJECT\\\_ID](https://console.developers.google.com/apis/library?project=MY\\_PROJECT\\_ID) |
| JSON Key File               | The key file generated in the [Generate a Service Account JSON key](/integrations/databases/bigquery#generate-a-service-account-key) step                                                                                                              |
| Schema for temporary tables | The schema name that was created in [Create a temporary dataset](/integrations/databases/bigquery#create-a-temporary-dataset). It should be formatted as \<project\_id>.datafold\_tmp                                                                  |
| Processing Location         | Which processing zone your project uses                                                                                                                                                                                                                |

Click **Create**. Your data connection is ready!


# Databricks
Source: https://docs.datafold.com/integrations/databases/databricks


**Steps to complete:**

1. [Generate a Personal Access Token](/integrations/databases/databricks#generate-a-personal-access-token)
2. [Retrieve SQL warehouse settings](/integrations/databases/databricks#retrieve-sql-warehouse-settings)
3. [Create schema for Datafold](/integrations/databases/databricks#create-schema-for-datafold)
4. [Configure your data connection in Datafold](/integrations/databases/databricks#configure-in-datafold)

## Generate a Personal Access Token

Visit **Settings** → **User Settings**, and then switch to **Personal Access Tokens** tab.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/databricks_new_token-a2d1a65a0105ce7ad38fca457967b07c.png" />
</Frame>

Then, click **Generate new token**. Save the generated token somewhere, you'll need it later on.

## Retrieve SQL warehouse settings

In **SQL** mode, navigate to **SQL Warehouses**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/databricks_sql_warehouse-80e1f70713a973cb310a7b1d4d32a409.png" />
</Frame>

Choose the preferred warehouse and copy the following fields values from its **Connection Details** tab:

* Server hostname
* HTTP path

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/databrick_connection_details-5b5208f53126fa0d4dd18dc21f3ffd61.png" />
</Frame>

## Create schema for Datafold

Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in the your warehouse.

## Configure in Datafold

| Field Name                   | Description                                                                                                                                                                                                                                      |
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Name                         | A name given to the data connection within Datafold                                                                                                                                                                                              |
| Host                         | The hostname retrieved in the Connection Details tab                                                                                                                                                                                             |
| HTTP Path                    | The HTTP Path retrieved in the Connection Details tab                                                                                                                                                                                            |
| Access Token                 | The token retrieved in [Generate a Personal Access Token](/integrations/databases/databricks#generate-a-personal-access-token)                                                                                                                   |
| Catalog                      | The catalog and schema name of your Databricks account. Formatted as catalog\_name.schema\_name (In most cases, catalog\_name is hive\_metastore.)                                                                                               |
| Dataset for temporary tables | Certain operations require Datafold to materialize intermediate results, which are stored in a dedicated schema. The input for this field should be in the catalog\_name.schema\_name format. (In most cases, catalog\_name is hive\_metastore.) |

Click **Create**. Your data connection is ready!


# Dremio
Source: https://docs.datafold.com/integrations/databases/dremio


<Note>
  **INFO**

  Column-level Lineage is not currently supported for Dremio.
</Note>

<Note>
  **INFO**

  Schemas for tables in external data sources need to be specified with quotes e.g., "Postgres prod.analytics.sales".
</Note>

**Steps to complete:**

1. [Configure user in Dremio](/integrations/databases/dremio#configure-user-in-dremio)
2. [Create schema for Datafold](/integrations/databases/dremio#create-schema-for-datafold)
3. [Configure your data connection in Datafold](/integrations/databases/dremio#configure-in-datafold)

## Configure user in Dremio

To connect to Dremio, create a user with read-only access to all data sources you wish to diff and generate an access token.

Temporary tables will be created in the `$scratch` schema that doesn't require special permissions.

## Create schema for Datafold

Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in the your warehouse.

## Configure in Datafold

| Field Name                  | Description                                                                                                                                                |
| --------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Connection name             | A name given to the data connection within Datafold.                                                                                                       |
| Host                        | The hostname for your Dremio instance (data.dremio.cloud for Dremio SaaS).                                                                                 |
| Port                        | Dremio endpoint port; default value is 433.                                                                                                                |
| Encryption                  | Should be checked for Dremio Cloud, possibly unchecked for local deployments.                                                                              |
| User ID                     | User ID as created in Dremio, typically an email address.                                                                                                  |
| Project ID                  | Dremio Project UID. If left blank, the default project will be used.                                                                                       |
| Token                       | Access token generated in Dremio.                                                                                                                          |
| Password                    | Alternatively, provide a password.                                                                                                                         |
| Schema for temporary views  | A Dremio space for temporary views.                                                                                                                        |
| Schema for temporary tables | \$scratch should suit most applications, or use "\<Datasource>.\<schema>" (with quotes) if you wish to create temporary tables in an external data source. |

Click **Create**. Your data connection is now ready!


# Google Cloud Storage (GCS)
Source: https://docs.datafold.com/integrations/databases/google-cloud-storage


**Steps to complete:**

1. [Create a Service Account](/integrations/databases/google-cloud-storage#create-a-service-account)
2. [Give the Service Account Storage Object Admin access](/integrations/databases/google-cloud-storage#service-account-access-and-permissions)
3. [Generate a Service Account JSON key](/integrations/databases/google-cloud-storage#generate-a-service-account-key)
4. [Configure your data connection in Datafold](/integrations/databases/google-cloud-storage#configure-in-datafold)

## Create a Service Account

To connect Datafold to your Google Cloud Storage bucket, you will need to create a *service account* for Datafold to use.

* Navigate to the [Google Cloud Console](https://console.cloud.google.com/), click on the drop-down to the left of the search bar, and select the project you want to connect to.
  * *Note: If you do not see your project, you may need to switch accounts.*
* Click on the hamburger menu in the upper left, then select **IAM & Admin** followed by **Service Accounts**.
* Create a service account named `Datafold`.

## Service Account Access and Permissions

The Datafold service account requires the following roles and permissions:

* **Storage Object Admin** for read and write access on all the datasets in the project.

## Generate a Service Account Key

Next, go back to the **IAM & Admin** page to generate a key for Datafold.

* Click on the **Service Accounts** page.
* Click on the **Datafold** service account.
* Click on the **Keys** tab.
* Click on **Add Key** and select **Create new key**.
* Select **JSON** and click **Create**.

We recommend using the JSON formatted key. After creating the key, it will be saved on your local machine.

## Configure in Datafold

| Field Name                                                | Description                                                                                                                                           |
| --------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| Connection name                                           | A name given to the data connection within Datafold                                                                                                   |
| Bucket Name                                               | The name of the bucket you want to connect to.                                                                                                        |
| Bucket region                                             | The region of the bucket you want to connect to.                                                                                                      |
| JSON Key File                                             | The key file generated in the [Generate a Service Account JSON key](/integrations/databases/google-cloud-storage#generate-a-service-account-key) step |
| Directory for writing diff results                        | Optional. The directory in the bucket where diff results will be written. Service account should have write access to this directory.                 |
| Default maximum number of rows to include in diff results | Optional. The maximum number of rows that a file with materialized results will contain.                                                              |

Click **Create**. Your data connection is ready!


# MySQL
Source: https://docs.datafold.com/integrations/databases/mysql


<Note>
  **INFO**

  Please contact [support@datafold.com](mailto:support@datafold.com) if you use a MySQL version \< 8.x.
</Note>

<Note>
  **INFO**

  Column-level Lineage is not currently supported for MySQL.
</Note>

**Steps to complete:**

1. [Run SQL script for permissions and create schema for Datafold](/integrations/databases/mysql#run-sql-script-and-create-schema-for-datafold)
2. [Configure your data connection in Datafold](/integrations/databases/mysql#configure-in-datafold)

### Run SQL script and create schema for Datafold

To connect to MySQL, create a user with read-only access to all tables you wish to diff. Include read and write access to a Datafold-specific dataset:

```Bash
-- Create a temporary dataset for Datafold to utilize
CREATE DATABASE IF NOT EXISTS datafold_tmp;

-- Create a Datafold user
CREATE USER 'datafold_user'@'%' IDENTIFIED BY 'SOMESECUREPASSWORD';

-- Grant read access to diff tables in YourSchema
GRANT SELECT ON `YourSchema`.* TO 'datafold_user'@'%';

-- Grant access to all tables in a datafold_tmp database
GRANT ALL ON `datafold_tmp`.* TO 'datafold_user'@'%';

-- Apply the changes
FLUSH PRIVILEGES;
```

Datafold utilizes a temporary dataset, named `datafold_tmp` in the above script, to materialize scratch work and keep data processing in the your warehouse.

### Configure in Datafold

| Field Name                   | Description                                                                     |
| ---------------------------- | ------------------------------------------------------------------------------- |
| Connection name              | A name given to the data connection within Datafold                             |
| Host                         | The hostname for your MySQL instance                                            |
| Port                         | MySQL connection port; default value is 3306                                    |
| Username                     | The user created in our SQL script, named datafold\_user                        |
| Password                     | The password created in our SQL script                                          |
| Database                     | The name of the MySQL database (schema) you want to connect to, e.g. YourSchema |
| Dataset for temporary tables | The datafold\_tmp database created in our SQL script                            |

Click **Create**. Your data connection is ready!


# Netezza
Source: https://docs.datafold.com/integrations/databases/netezza


<Note>
  **INFO**

  Column-level Lineage is not currently supported for Netezza.
</Note>

**Steps to complete:**

1. [Configure user in Netezza](#configure-user-in-tedadata)
2. [Create schema for Datafold](#create-schema-for-datafold)
3. [Configure your data connection in Datafold](#configure-in-datafold)

## Configure user in Netezza

To connect to Netezza, create a user with read-only access to all databases you may wish to diff.

## Create a temporary database for Datafold

Datafold requires a schema with full permissions to store temporary data.

## Configure in Datafold

| Field Name                  | Description                                                                                                                                   |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| Connection Name             | A name given to the data connection within Datafold.                                                                                          |
| Host                        | The hostname for your Netezza instance (e.g., nz-85dcf66c-69aa-4ba6-b7cb-827643da5a.us-east-1.data-warehouse.cloud.ibm.com for Netezza SaaS). |
| Port                        | Netezza endpoint port; the default value is 5480.                                                                                             |
| Encryption                  | Whether to use TLS.                                                                                                                           |
| User ID                     | User ID, e.g., DATAFOLD.                                                                                                                      |
| Password                    | Password from above.                                                                                                                          |
| Default DB                  | The database to connect to.                                                                                                                   |
| Schema for Temporary Tables | Use DATABASE.SCHEMA format.                                                                                                                   |

Click **Create**. Your data source is now ready!


# Oracle
Source: https://docs.datafold.com/integrations/databases/oracle


<Note>
  **INFO**

  Please contact [support@datafold.com](mailto:support@datafold.com) if you use an Oracle version \< 19.x.
</Note>

<Note>
  **INFO**

  Column-level Lineage is not currently supported for Oracle.
</Note>

**Steps to complete:**

1. [Run SQL script and create schema for Datafold](/integrations/databases/oracle#run-sql-script-and-create-schema-for-datafold)
2. [Configure your data connection in Datafold](/integrations/databases/oracle#configure-in-datafold)

## Run SQL script and create schema for datafold\_group

To connect to Oracle, create a user with read-only access to all tables you wish to diff. Include read and write access to a Datafold-specific temp schema:

```Bash
-- Switch container context (default is "XEPDB1")
ALTER SESSION SET CONTAINER = YOURCONTAINER;

-- Create a Datafold user/schema
CREATE USER DATAFOLD IDENTIFIED BY somesecurepassword;

-- Allow Datafold user to connect
GRANT CREATE SESSION TO DATAFOLD;

-- Allow user to create tables in DATAFOLD schema
GRANT CREATE TABLE TO DATAFOLD;

-- Grant read access to diff tables in your schema
GRANT SELECT ON "YOURSCHEMA"."YOURTABLE" TO DATAFOLD;

-- Grant access to DBMS_CRYPTO utilities (hashing functions, etc.)
GRANT EXECUTE ON SYS.DBMS_CRYPTO TO DATAFOLD;

-- Allow Datafold users/schemas to use disk space (adjust if needed)
GRANT UNLIMITED TABLESPACE TO DATAFOLD;

-- Apply the changes
COMMIT;

```

Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in the your warehouse.

## Configure in Datafold

| Field Name                  | Description                                                                                    |
| --------------------------- | ---------------------------------------------------------------------------------------------- |
| Name                        | A name given to the data connection within Datafold                                            |
| Host                        | The hostname address for your database                                                         |
| Port                        | Postgres connection port; default value is 1521                                                |
| User                        | The user role created in our SQL script, named DATAFOLD                                        |
| Password                    | The password created in our SQL script                                                         |
| Connection type             | Choose Service or SID depending on your connection type; default value is Service              |
| Service (or SID)            | The name of the database (Service or SID) you want to connect to, e.g. XEPDB1 or YOURCONTAINER |
| Schema for temporary tables | The user/schema created in our SQL script - DATAFOLD                                           |

Click **Create**. Your data connection is ready!


# PostgreSQL
Source: https://docs.datafold.com/integrations/databases/postgresql


<Note>
  **INFO**

  Column-level Lineage is supported for AWS Aurora and RDS Postgres and *requires* Cloudwatch to be configured.
</Note>

**Steps to complete:**

1. [Run SQL script and create schema for Datafold](/integrations/databases/postgresql#run-sql-script-and-create-schema-for-datafold)
2. [Configure your data connection in Datafold](/integrations/databases/postgresql#configure-in-datafold)

## Run SQL script and create schema for Datafold

To connect to Postgres, you need to create a user with read-only access to all tables in all schemas, write access to Datafold-specific schema for temporary tables:

```Bash
/* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse. */

CREATE SCHEMA datafold_tmp;

/* Create a datafold user */

CREATE ROLE datafold WITH LOGIN ENCRYPTED PASSWORD 'SOMESECUREPASSWORD';

/* Give the datafold role write access to the temporary schema */

GRANT ALL ON SCHEMA datafold_tmp TO datafold;

/* Make sure that the postgres user has read permissions on the tables */

GRANT USAGE ON SCHEMA <myschema> TO datafold;
GRANT SELECT ON ALL TABLES IN SCHEMA <myschema> TO datafold;

```

Datafold utilizes a temporary schema, named `datafold_tmp` in the above script, to materialize scratch work and keep data processing in the your warehouse.

## Configure in Datafold

| Field Name                  | Description                                                     |
| --------------------------- | --------------------------------------------------------------- |
| Name                        | A name given to the data connection within Datafold             |
| Host                        | The hostname address for your database; default value 127.0.0.1 |
| Port                        | Postgres connection port; default value is 5432                 |
| User                        | The user role created in our SQL script, named datafold         |
| Password                    | The password created in our SQL script                          |
| Database Name               | The name of the Postgres database you want to connect to        |
| Schema for temporary tables | The schema (datafold\_tmp) created in our SQL script            |

Click **Create**. Your data connection is ready!

***

## Column-level Lineage with Aurora & RDS

This will guide you through setting up Column-level Lineage with AWS Aurora & RDS using CloudWatch.

**Steps to complete:**

1. [Setup Postgres with Permissions](#run-sql-script)
2. [Increase the logging verbosity of Postgres](#increase-logging-verbosity) so Datafold can parse lineage
3. [Set up an account for fetching the logs from CloudWatch.](#connect-datafold-to-cloudwatch)
4. [Configure your data connection in Datafold](#configure-in-datafold)

### Run SQL Script

To connect to Postgres, you need to create a user with read-only access to all tables in all schemas, write access to Datafold-specific schema for temporary tables:

```Bash
/* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in the your warehouse. */

CREATE SCHEMA datafold_tmp;

/* Create a datafold user */

CREATE ROLE datafold WITH LOGIN ENCRYPTED PASSWORD 'SOMESECUREPASSWORD';

/* Give the datafole role write access to the temporary schema */

GRANT ALL ON SCHEMA datafold_tmp TO datafold;

/* Make sure that the postgres user has read permissions on the tables */

GRANT USAGE ON SCHEMA <myschema> TO datafold;
GRANT SELECT ON ALL TABLES IN SCHEMA <myschema> TO datafold;

```

### Increase logging verbosity

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/psql_aurora_dbs-89843982d984ed977c0254adca7a5fa0.png" />
</Frame>

Then, create a new `Parameter Group`. Database instances run with default parameters that do not include logging verbosity. To turn on the logging verbosity, you'll need to create a new Parameter Group. Hit **Parameter Groups** on the menu and create a new Parameter Group.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/psql_aurora_parameter_group-044563cebd48ae81a9d22ab2319d160e.png" />
</Frame>

Next, select the `aurora-postgresql10` parameter group family. This depends on the cluster that you're running. For Aurora serverless, this is the appropriate family.

Finally, set the `log_statement` enum field to `mod` - meaning that it will log all the DDL statements, plus data-modifying statements. Note: This field isn't set by default.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/psql_aurora_logstatement-6f0ba20fd7217047ae62fd01cbfa50d4.png" />
</Frame>

After saving the parameter group, go back to your database, and select the database cluster parameter group.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/psql_aurora_clustergroup-6a1c25e3eae1563130b7565a5b5f0ba7.png" />
</Frame>

### Connect Datafold to CloudWatch

Start by creating a new user to isolate the permissions as much as possible. Go to IAM and create a new user.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/psql_aurora_iam_user-0d82fc2408ab78e2bd94b20e0e2d363e.png" />
</Frame>

Next, create a new group named `CloudWatchLogsReadOnly` and attach the `CloudWatchLogsReadOnlyAccess` policy to it. Next, select the group.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/psql_aurora_user_permissions-f48596fc79a01aea8d9aadd3688381ce.png" />
</Frame>

When reviewing the user, it should have the freshly created group attached to it.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/psql_aurora_user_review-637256675791599a381ee290bd7e05b7.png" />
</Frame>

After confirming the new user you should be given the `Access Key` and `Secret Key`. Save these two codes securely to finish configurations on Datafold.

The last piece of information Datafold needs is the CloudWatch Log Group. You will find this in CloudWatch under the Log Group section in the sidebar. It will be formatted as `/aws/rds/cluster/<my_cluster_name>/postgresql`.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/psql_aurora_log_group-5dd6c4c2728cf4d55352976449d05c12.png" />
</Frame>

### Configure in Datafold

| Field Name                    | Description                                                                                                                             |
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| Name                          | A name given to the data connection within Datafold                                                                                     |
| Host                          | The hostname address for your database; default value 127.0.0.1                                                                         |
| Port                          | Postgres connection port; default value is 5432                                                                                         |
| User                          | The user role created in the SQL script; datafold                                                                                       |
| Password                      | The password created in the SQL permissions script                                                                                      |
| Database Name                 | The name of the Postgres database you want to connect to                                                                                |
| AWS Access Key                | The Access Key provided in the [Connect Datafold to CloudWatch](/integrations/databases/postgresql#connect-datafold-to-cloudwatch) step |
| AWS Secret                    | The Secret Key provided in the [Connect Datafold to CloudWatch](/integrations/databases/postgresql#connect-datafold-to-cloudwatch) step |
| Cloudwatch Postgres Log Group | The path of the Log Group; formatted as /aws/rds/cluster/\<my\_cluster\_name>/postgresql                                                |
| Schema for temporary tables   | The schema created in the SQL setup script; datafold\_tmp                                                                               |

Click **Create**. Your data connection is ready!


# Redshift
Source: https://docs.datafold.com/integrations/databases/redshift


**Steps to complete:**

1. [Run SQL script and create schema for Datafold](/integrations/databases/redshift#run-sql-script-and-create-schema-for-datafold)
2. [Configure your data connection in Datafold](/integrations/databases/redshift#configure-in-datafold)

## Run SQL script and create schema for Datafold

To connect to Amazon Redshift, you need to create a user with read-only access to all tables in all schemas, write access to Datafold-specific schema for temporary tables, and the ability to access SQL logs:

```
/* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in the your warehouse. */CREATE SCHEMA datafold_tmp;/* The Datafold user needs read access to ALL schemas; this requires superuser level privilege in Redshift */      CREATE USER datafold CREATEUSER PASSWORD 'SOMESECUREPASSWORD';/* The following permission allows Datafold to pull SQL logs and construct   column-level lineage */ALTER USER datafold WITH SYSLOG ACCESS UNRESTRICTED;
```

Datafold utilizes a temporary schema, named `datafold_tmp` in the above script, to materialize scratch work and keep data processing in the your warehouse.

## Configure in Datafold

| Field Name                  | Description                                                                                                                                         |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| Name                        | A name given to the data connection within Datafold                                                                                                 |
| Host                        | The hostname of your cluster. (Go to Redshift in your AWS console, select your cluster, the hostname is the endpoint listed at the top of the page) |
| Port                        | Redshift connection port; default value is 5439                                                                                                     |
| User                        | The user created in our SQL script, named `datafold`                                                                                                |
| Password                    | The password created in our SQL script                                                                                                              |
| Database Name               | The name of the Redshift database you want to connect to                                                                                            |
| Schema for temporary tables | The schema (`datafold_tmp`) created in our SQL script                                                                                               |

Click **Create**. Your data connection is ready!


# SAP HANA
Source: https://docs.datafold.com/integrations/databases/sap-hana


<Note>
  **INFO**

  Column-level Lineage is not currently supported for SAP HANA.
</Note>

**Steps to complete:**

1. [Create and authorize a user](#create-and-authorize-a-user)
2. [Create schema for Datafold](#create-schema-for-datafold)
3. [Configure in Datafold](#configure-in-datafold)

## Create and authorize a user

Create a new user `DATAFOLD` using SAP HANA Administration console (Systems-Security-Users). Specify password authentication, and set "Force password change on next logon" to "No". Grant MONITORING privileges for the databases to be diffed.

## Create schema for Datafold

Datafold utilizes a temporary schema to materialize scratch work and keep data processing in the your warehouse.

```
CREATE SCHEMA datafold_tmp OWNED BY DATAFOLD;

```

## Configure in Datafold

| Field Name                  | Description                                          |
| --------------------------- | ---------------------------------------------------- |
| Name                        | A name given to the data connection within Datafold. |
| Host                        | The hostname address for your database.              |
| Port                        | Sap HANA connection port; default value is 443.      |
| User                        | The user created above, named DATAFOLD.              |
| Password                    | The password for user DATAFOLD.                      |
| Schema for temporary tables | The schema created above, named datafold\_tmp        |

Click **Create**. Your data connection is ready!


# Snowflake
Source: https://docs.datafold.com/integrations/databases/snowflake


**NOTE**: Datafold needs permissions in your Snowflake dataset to read your table data. You will need to be a Snowflake *Admin* in order to grant the required permissions.

**Steps to complete:**

* [Create a user and role for Datafold](/integrations/databases/snowflake#create-a-user-and-role-for-datafold)
* [Setup password-based](/integrations/databases/snowflake#set-up-password-based-authentication) or [Use key-pair authentication](/integrations/databases/snowflake#use-key-pair-authentication)
* [Create a temporary schema](/integrations/databases/snowflake#create-schema-for-datafold)
* [Give the Datafold role access to your warehouse](/integrations/databases/snowflake#give-the-datafold-role-access)
* [Configure your data connection in Datafold](/integrations/databases/snowflake#configure-in-datafold)

## Create a user and role for Datafold

> A [full script](/integrations/databases/snowflake#full-script) can be found at the bottom of this page.

It is best practice to create a separate role for the Datafold integration (e.g., `DATAFOLDROLE`):

```
CREATE ROLE DATAFOLDROLE;
CREATE USER DATAFOLD DEFAULT_ROLE = "DATAFOLDROLE" MUST_CHANGE_PASSWORD = FALSE;
GRANT ROLE DATAFOLDROLE TO USER DATAFOLD;

```

To provide column-level lineage, Datafold needs to read & parse all SQL statements executed in your Snowflake account:

```
GRANT MONITOR EXECUTION ON ACCOUNT TO ROLE DATAFOLDROLE;
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE DATAFOLDROLE;

```

## Set up password-based authentication

Datafold supports username/password authentication, but also key-pair authentication.

```
ALTER USER DATAFOLD SET PASSWORD = 'SomethingSecret';

```

You can set the username/password in the Datafold web UI.

### Use key-pair authentication

If you would like to use key-pair authentication, go to **Settings** -> **Data Connections** -> **Your Snowflake Connection**, and change Authentication method from **Password** to **Key Pair**.
Generate and Download the Key Pair file, and use the value within the file when running the following command in Snowflake to set the key for this Snowflake role:

```
ALTER USER DATAFOLD SET rsa_public_key='...'

```

## Create schema for Datafold

Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in the your warehouse.

```
CREATE SCHEMA <database_name>.DATAFOLD_TMP;
GRANT ALL ON SCHEMA <database_name>.DATAFOLD_TMP TO DATAFOLDROLE;

```

## Give the Datafold role access

Datafold will only scan the tables that it has access to. The snippet below will give Datafold read access to a database. If you have more than one database that you want to use in Datafold, rerun the script below for each one.

```Bash
/* Repeat for every DATABASE to be usable in Datafold. This allows Datafold to
correctly discover, profile & diff each table */
GRANT USAGE ON WAREHOUSE <warehouse_name> TO ROLE DATAFOLDROLE;
GRANT USAGE ON DATABASE <database_name> TO ROLE DATAFOLDROLE;

GRANT USAGE ON ALL SCHEMAS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT USAGE ON FUTURE SCHEMAS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;

GRANT SELECT ON ALL TABLES IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE TABLES IN DATABASE <database_name> TO ROLE DATAFOLDROLE;

GRANT SELECT ON ALL VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;

GRANT SELECT ON ALL MATERIALIZED VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE MATERIALIZED VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;

GRANT ALL PRIVILEGES ON ALL DYNAMIC TABLES IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE DYNAMIC TABLES IN DATABASE <database_name> TO ROLE DATAFOLDROLE;

```

## Full Script

```Bash
--Step 1: Create a user and role for Datafold
CREATE ROLE DATAFOLDROLE;
CREATE USER DATAFOLD DEFAULT_ROLE = "DATAFOLDROLE" MUST_CHANGE_PASSWORD = FALSE;
GRANT ROLE DATAFOLDROLE TO USER DATAFOLD;

GRANT MONITOR EXECUTION ON ACCOUNT TO ROLE DATAFOLDROLE;
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE DATAFOLDROLE;

--Step 2a: Use password-based authentication
ALTER USER DATAFOLD SET PASSWORD = 'SomethingSecret';
--OR
--Step 2b: Use key-pair authentication
--ALTER USER DATAFOLD SET rsa_public_key='abc..'

--Step 3: Create schema for Datafold
CREATE SCHEMA <database_name>.DATAFOLD_TMP;
GRANT ALL ON SCHEMA <database_name>.DATAFOLD_TMP TO DATAFOLDROLE;

--Step 4: Give the Datafold role access to your data connection
/*
  Repeat for every DATABASE to be usable in Datafold. This allows Datafold to
  correctly discover, profile & diff each table
*/
GRANT USAGE ON WAREHOUSE <warehouse_name> TO ROLE DATAFOLDROLE;
GRANT USAGE ON DATABASE <database_name> TO ROLE DATAFOLDROLE;

GRANT USAGE ON ALL SCHEMAS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT USAGE ON FUTURE SCHEMAS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;

GRANT SELECT ON ALL TABLES IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE TABLES IN DATABASE <database_name> TO ROLE DATAFOLDROLE;

GRANT SELECT ON ALL VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;

GRANT SELECT ON ALL MATERIALIZED VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE MATERIALIZED VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;

```

## Validate Snowflake Grants for Datafold

Run these queries to validate that the grants have been set up correctly:

> Note: More results may be returned than shown in the screenshots below if you have granted access to multiple roles/users

Example Placeholders:

* `<database_name>` = `DEV`
* `<warehouse_name>` = `DEMO`

```
-- Validate database usage for the DATAFOLDROLE
SHOW GRANTS ON DATABASE <database_name>;
```

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/grants_on_database-cbcb5fd91f9d1ba6a680641fd1ed2cde.png" />
</Frame>

```
-- Validate warehouse usage for the DATAFOLDROLE
SHOW GRANTS ON WAREHOUSE <warehouse_name>;
```

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/grants_on_warehouse-45351631336f32ccbeeaa6646a1a9199.png" />
</Frame>

```
-- Validate schema permissions for the DATAFOLDROLE
SHOW GRANTS ON SCHEMA <database_name>.DATAFOLD_TMP;
```

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/grants_on_schema-6955ab0c695f05ab0fd245a231a20837.png" />
</Frame>

## A note on future grants

The above database grants will be insufficient if any future grants have been defined at the schema level, because [schema-level grants will override database-level grants](https://docs.snowflake.com/en/sql-reference/sql/grant-privilege#considerations). In that case, you will need to execute future grants for every existing *schema* that Datafold will operate on.

```Bash
GRANT SELECT ON FUTURE TABLES IN SCHEMA <database_name>.<schema_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA <database_name>.<schema_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE MATERIALIZED VIEWS IN SCHEMA <database_name>.<schema_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA <database_name>.<schema_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON ALL VIEWS IN SCHEMA <database_name>.<schema_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON ALL MATERIALIZED VIEWS IN SCHEMA <database_name>.<schema_name> TO ROLE DATAFOLDROLE;

```

## Configure in Datafold

| Field Name                  | Description                                                                                                                                                                                                                                                                                                                                                                                  |
| --------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Name                        | A name given to the data connection within Datafold                                                                                                                                                                                                                                                                                                                                          |
| Account identifier          | The Org name-Account name pair for your Snowflake account. This can be found in the browser address string. It may look like [https://orgname-accountname.snowflakecomputing.com](https://orgname-accountname.snowflakecomputing.com) or [https://app.snowflake.com/orgname/accountname](https://app.snowflake.com/orgname/accountname). In the setup form, enter \<orgname>-\<accountname>. |
| User                        | The username set in the [Setup password-based](/integrations/databases/snowflake#set-up-password-based-authentication) authentication section                                                                                                                                                                                                                                                |
| Password                    | The password set in the [Setup password-based](/integrations/databases/snowflake#set-up-password-based-authentication) authentication section                                                                                                                                                                                                                                                |
| Key Pair file               | The key file generated in the [Use key-pair authentication](/integrations/databases/snowflake#use-key-pair-authentication) section                                                                                                                                                                                                                                                           |
| Warehouse                   | The Snowflake warehouse name                                                                                                                                                                                                                                                                                                                                                                 |
| Schema for temporary tables | The schema name you created with our script (\<database\_name>.DATAFOLD\_TMP)                                                                                                                                                                                                                                                                                                                |
| Role                        | The role you created for Datafold (Typically DATAFOLDROLE)                                                                                                                                                                                                                                                                                                                                   |
| Default DB                  | A database the role above can access. If more than one database was added, whichever you prefer to be the default                                                                                                                                                                                                                                                                            |

> Note: Please review the documentation for the account name. Datafold uses Format 1 (Preferred): [https://docs.snowflake.com/en/user-guide/admin-account-identifier#using-an-account-locator-as-an-identifier](https://docs.snowflake.com/en/user-guide/admin-account-identifier#using-an-account-locator-as-an-identifier)

Click **Create**. Your data connection is ready!


# Microsoft SQL Server
Source: https://docs.datafold.com/integrations/databases/sql-server


<Note>
  **INFO**

  Column-level Lineage is not currently supported for Microsoft SQL Server.
</Note>

**Steps to complete:**

1. [Run SQL script and create schema for Datafold](/integrations/databases/sql-server#run-sql-script-and-create-schema-for-datafold)
2. [Configure your data connection in Datafold](/integrations/databases/sql-server#configure-in-datafold)

## Run SQL script and create schema for Datafold

To connect to Microsoft SQL Server, create a user with read-only access to all tables you wish to diff. Include read and write access to a Datafold-specific temp schema:

```Bash
/* Select the database that will contain the temp schema */
USE DatabaseName;

/* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in the your warehouse. */
CREATE SCHEMA datafold_tmp;

/* Create the Datafold user */
CREATE LOGIN DatafoldUser WITH PASSWORD = 'SOMESECUREPASSWORD';
CREATE USER DatafoldUser FOR LOGIN DatafoldUser;

/* Grant read access to diff tables */
GRANT SELECT ON SCHEMA::YourSchema TO DatafoldUser;

/* Grant read + write access to datafold_tmp schema */
GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA::datafold_tmp TO DatafoldUser;
```

## Configure in Datafold

| Field Name                   | Description                                                                                                      |
| ---------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| Connection name              | A name given to the data connection within Datafold                                                              |
| Host                         | The hostname for your SQL Server instance                                                                        |
| Port                         | SQL Server connection port; default value is 1433                                                                |
| Username                     | The user created in our SQL script, named DatafoldUser                                                           |
| Password                     | The password created in our SQL script                                                                           |
| Database                     | The name of the SQL Server database you want to connect to                                                       |
| Dataset for temporary tables | The schema created in our SQL script, in database.schema format: DatabaseName.datafold\_tmp in our script above. |

Click **Create**. Your data connection is ready!


# Starburst
Source: https://docs.datafold.com/integrations/databases/starburst


<Note>
  **INFO**

  Column-level Lineage is not currently supported for Starburst.
</Note>

**Steps to complete:**

1. [Configure user in Starburst](#configure-user-in-starburst)
2. [Create schema for Datafold](#create-schema-for-datafold)
3. [Configure your data connection in Datafold](#configure-in-datafold)

## Configure user in Starburst

To connect to Starburst, create a user with read-only access to all data sources you wish to diff and optionally generate an access token. Datafold requires a schema to be set up within one of the catalogs, typically hosted on platforms like Amazon S3 or similar services.

## Create schema for Datafold

Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in the your warehouse.

## Configure in Datafold

| Field Name                  | Description                                                                                                          |
| --------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| Connection name             | A name given to the data connection within Datafold.                                                                 |
| Host                        | The hostname for your Starburst instance (e.g., `sample-free-cluster.trino.galaxy.starburst.io` for Starburst SaaS). |
| Port                        | Starburst endpoint port; default value is 433.                                                                       |
| Encryption                  | Should be checked for Starburst Galaxy, possibly unchecked for local deployments.                                    |
| User ID                     | User ID as created in Starburst, typically an email address.                                                         |
| Token                       | Access token generated in Starburst.                                                                                 |
| Password                    | Alternatively, provide a password.                                                                                   |
| Schema for temporary tables | Use `<catalog>.<schema>` format.                                                                                     |

Click **Create**. Your data source is now ready!


# Teradata
Source: https://docs.datafold.com/integrations/databases/teradata


<Note>
  **INFO**

  Column-level Lineage is not currently supported for Teradata.
</Note>

**Steps to complete:**

1. [Install HASH\_MD5 user-defined function](#install-HASH%5FMD5-user-defined-function)
2. [Configure user in Teradata](#configure-user-in-tedadata)
3. [Create schema for Datafold](#create-schema-for-datafold)
4. [Configure your data connection in Datafold](#configure-in-datafold)

## Install HASH\_MD5 user-defined function

Follow the steps [here](https://downloads.teradata.com/download/extensibility/md5-message-digest-udf) for the database that Datafold will be connecting to (e.g., DB1).

Once completed, execute the following SQL commands in the database:

```Bash
ALTER FUNCTION HASH_MD5 COMPILE;
ALTER FUNCTION HASH_MD5 EXECUTE NOT PROTECTED;

```

## Configure user in Teradata

To connect to Teradata, create a user with read-only access to all databases you may wish to diff, including the login database:

```Bash
CREATE USER DATAFOLD AS PERMANENT=1000000000 BYTES PASSWORD= <PASSWORD> COLLATION = ASCII TIME ZONE ='GMT';
GRANT EXECUTE FUNTION ON DB1 TO DATAFOLD;
GRANT SELECT ON DB1 TO DATAFOLD;
...
GRANT SELECT ON DB9 TO DATAFOLD;

```

## Create a temporary database for Datafold

Datafold requires a database to store temporary data with full permissions:

```
CREATE DATABASE DATAFOLD_TMP AS PERMANENT=10000000000 BYTES;
GRANT ALL ON DATAFOLD_TMP TO DATAFOLD;

```

## Configure in Datafold

| Field Name                    | Description                                                                                                                |
| ----------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| Connection Name               | A name given to the data connection within Datafold.                                                                       |
| Host                          | The hostname for your Teradata instance (e.g., account-name-2e3ba8b32qac9d.env.clearscape.teradata.com for Teradata SaaS). |
| Port                          | Teradata endpoint port; the default value is 1025.                                                                         |
| User ID                       | User ID, e.g., DATAFOLD.                                                                                                   |
| Password                      | Password from above.                                                                                                       |
| Database                      | The connection database, e.g., DB1 from above.                                                                             |
| Database for Temporary Tables | The temporary database, e.g., DATAFOLD\_TMP from above.                                                                    |

Click **Create**. Your data source is now ready!


# Microsoft Teams
Source: https://docs.datafold.com/integrations/notifications/microsoft-teams

Receive notifications for monitors in Microsoft Teams.

## Prerequisites

* Microsoft Teams admin access or permissions to manage integrations
* A Datafold account with admin privileges

## Configure the Integration

1. In Datafold, go to Settings > Integrations > Notifications
2. Click "Add New Integration"
3. Select "Microsoft Teams"
4. You'll be automatically redirected to the Microsoft Office login page
5. Sign in using the Microsoft Office account with admin privileges
6. Click "Accept" to grant Datafold the necessary permissions
7. You'll be redirected back to Datafold
8. Open the Teams app in a separate browser tab
9. Next to the channel where you'd like to receive notifications, click "..." and select "Workflows"
10. Select the template called "Post to a channel when a webhook request is received"
11. Advance through the wizard (the defaults should be fine)
12. At the end of the wizard, copy the webhook URL
13. Return to Datafold and click "Add channel configuration"
14. Select the relevant Team and Channel, then paste the webhook URL
15. Repeat steps 8-14 for as many channels as you'd like
16. Save the integration settings in Datafold

You're all set! When you configure a monitor in Datafold, you'll now have the option to send notifications to the Teams channel(s) you configured.

## Monitors as Code Configuration

If you're using [monitors as code](/data-monitoring/monitors-as-code), you can configure Teams notifications by adding a `notifications` section to your monitor definition as follows:

```yaml
monitors:
  <monitor_name>:
    ...
    notifications:
      - type: teams
        integration: <integration_id>
        channel: <team_name>:<channel_name>
        mentions:
          - <tag_name>
          - <user_name>
          ...
```

* `<integration_id>` can be found in Datafold -> Settings -> Integrations -> Notifications -> \<your\_ms\_teams\_integration>

#### Full example

```yaml
monitors:
  uniqueness_test_example:
    type: test
    enabled: true
    connection_id: 1123
    test:
      type: unique
      tables:
        - path: DEV.DATA_DEV.USERS
          columns:
            - USERNAME
    schedule:
      interval:
        every: hour
    notifications:
      - type: teams
        integration: 23
        channel: Dev Team:Notifications Channel
        mentions:
          - NotifyDevCustomTag
          - Dima Cherenkov
```

## Need help?

If you have any questions about integrating with Microsoft Teams, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).


# Slack
Source: https://docs.datafold.com/integrations/notifications/slack

Receive notifications for monitors in Slack.

## Prerequisites

* Slack admin access or permissions to manage integrations
* A Datafold account with admin privileges

## Configure the Integration

1. In Datafold, go to Settings > Integrations > Notifications
2. Click "Add New Integration"
3. Select "Slack"
4. You'll be automatically redirected to Slack
5. If you're not already signed in, sign in to your Slack account
6. Click "Allow" to grant Datafold the necessary permissions
7. You'll be redirected back to Datafold

You're all set! When you configure a monitor in Datafold, you'll now have the option to send notifications to Slack.

## Monitors as Code Configuration

If you're using [monitors as code](/data-monitoring/monitors-as-code), you can configure Slack notifications by adding a `notifications` section to your monitor definition as follows:

```yaml
monitors:
  <monitor_name>:
    ...
    notifications:
      - type: slack
        integration: <integration_id>
        channel: <channel_name>
        mentions:
          - <user_name>
          - here
          - channel
          ...
```

* `<integration_id>` can be found in Datafold -> Settings -> Integrations -> Notifications -> \<your\_slack\_integration>

#### Full example

```yaml
monitors:
  uniqueness_test_example:
    type: test
    enabled: true
    connection_id: 1123
    test:
      type: unique
      tables:
        - path: DEV.DATA_DEV.USERS
          columns:
            - USERNAME
    schedule:
      interval:
        every: hour
    notifications:
      - type: slack
        integration: 13
        channel: dev-notifications
        mentions:
          - John Doe
          - channel
```

## Need help?

If you have any questions about integrating with Slack, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).


# OAuth Support
Source: https://docs.datafold.com/integrations/oauth

Set up OAuth App Connections in your supported data warehouses to securely execute data diffs on behalf of your users.

OAuth support empowers users to run data diffs based on their individual permissions and roles configured within the data warehouses. This ensures that data access is governed by existing security policies and protocols.

## How it works

### 1. Create a Data Diff

When you attempt to run a data diff, you will notice that it won't run without authentication:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/1-b9afb4d6ec25ca58b9a033ff1eaf6efb.png" />
</Frame>

### 2. Authorize the Data Diff

Authorize the data diff by clicking the **Authenticate** button. This will redirect you to the data warehouse for authentication:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/2-01bbf79b7aaf007bc33dc4652a825e31.png" />
</Frame>

Upon successful authentication, you will be redirected back.

### 3. The Data Diff is now running <Icon icon="party-horn" />

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/3-7d49f847dba2d6ebefe0215a7251d3e7.png" />
</Frame>

### 4. View the Data Diff results

The results reflect your permissions within the data warehouse:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/4-1e3cf172b19bd6616700f3c82f17b256.png" />
</Frame>

Note that running the same data diff, as a different user, renders different results:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/5-585c0ee49689bb8af229ad44eb260ace.png" />
</Frame>

The masked values represent the data retrieved from the data warehouse. We do not conduct any post-processing:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/6-f09d99fb5db326846be80a54d24606b0.png" />
</Frame>

By default, results are only visible for their authors. Users can still clone the data diffs, but the results might be different depending on their data warehouse access levels.

For example, as a different user, I won't be able to access the data diff results for Filip's data diff:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/7-0e23da80a3e63960a91301cdf38d8207.png" />
</Frame>

### 5. Sharing Data Diffs

Data diff sharing is a feature that enables you to share data diffs with other users. This is useful in scenarios such as compliance verification, where auditors can access specific data diffs without first requiring permissions to be set up in the data warehouse.

Sharing can be accessed via the **Actions** dropdown on the data diff page:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/1-2f00e7c34ec87bada9d464dcb97053df.png" />
</Frame>

Note that data diff sharing is disabled by default:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/2-badcc3a6ac297bc1c3ff27f8f4b6c9e0.png" />
</Frame>

It can be enabled under **Org Settings** by clicking on **Allow Data Diff sharing**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/3-889664da5c85c56985659d0c9e675340.png" />
</Frame>

Once enabled, you can share data diffs with other users:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/4-58827ded9574bddc7ef7ce0d4f156bf8.png" />
</Frame>

## Configuring OAuth

Navigate to **Settings** and click on your data connection. Then, click on **Advanced settings** and under **OAuth**, set the **Client Id** and **Client Secret** fields:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/1-6541ee9948bb173fe28a64cb72b7ba8d.png" />
</Frame>

## Example: Databricks

To create a new Databricks app connection:

1. Go to **Settings** and **App connections**.
2. Click **Add connection** in the top right of the screen.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/2-f59b84118a8979128d2476989b4f5262.png" />
</Frame>

3. Fill in the required fields:

Application Name:

```
Datafold OAuth connection

```

Redirect URLs:

```
https://app.datafold.com/api/internal/oauth_dwh/callback

```

<Note>
  **INFO**

  Datafold caches **access tokens** and using **refresh tokens** fetches new valid tokens in order to complete the diffs and reduce the number of times users need to authenticate against the data warehouses.

  One hour is sufficient for the access token.

  The refresh token will determine the frequency of user reauthentication, whether it's daily, weekly, or monthly.
</Note>

### 3. Click **Add** to obtain the **Client ID** and **Client Secret** <Icon icon="hand-sparkles" />

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/3-c59900f8bd662e3ee8036f40eb2fcc4d.png" />
</Frame>

### 4. Fill in the **Client ID** and **Client Secret** fields in Datafold's Data Connection advanced settings:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/4-75640ad5d18710fced1d22c108bbd0c9.png" />
</Frame>

### 5. Click **Test and save OAuth**

You will be redirected to Databricks to complete authentication. If you are already authenticated, you will be redirected back.

This notification signals a successful OAuth configuration:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/5-63f6c2f97041e030191e9abc5ca70637.png" />
</Frame>

### Additional steps for Databricks

To ensure that users have correct access rights to temporary tables (stored in **Dataset for temporary tables** provided in the **Basic settings** for the Databricks connection), follow these steps:

1. Update the permissions for the **Dataset for temporary tables** in Databricks.
2. Grant these permissions to Datafold users: **USE SCHEMA** and **CREATE TABLE**.

This will ensure that materialization results from data diffs are only readable by their authors.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/6-c4186dd5e91cd8aabf283649efe7461e.png" />
</Frame>

## Example: Snowflake

To create a new Snowflake app connection:

1. Go to Snowflake and run this SQL:

```Bash
CREATE SECURITY INTEGRATION DATAFOLD_OAUTH
TYPE = OAUTH
ENABLED = TRUE
OAUTH_CLIENT = CUSTOM
OAUTH_CLIENT_TYPE = 'CONFIDENTIAL'
OAUTH_REDIRECT_URI = 'https://app.datafold.com/api/internal/oauth_dwh/callback'
PRE_AUTHORIZED_ROLES_LIST=(<ROLENAME1>, <ROLENAME2>, ...)
OAUTH_ISSUE_REFRESH_TOKENS = TRUE
OAUTH_REFRESH_TOKEN_VALIDITY = 604800
OAUTH_ENFORCE_PKCE=TRUE;

```

It should result in this message:

<Warning>
  **CAUTION**

  * `PRE_AUTHORIZED_ROLES_LIST` must include all roles allowed to use the current security integration.
  * By default, `ACCOUNTADMIN`, `SECURITYADMIN`, and `ORGADMIN` are not allowed to be included in `PRE_AUTHORIZED_ROLES_LIST`.
</Warning>

<Note>
  **INFO**

  Datafold caches **access tokens** and uses **refresh tokens** to fetch new valid tokens in order to complete the diffs and reduce the number of times users need to authenticate against the data warehouses.

  `OAUTH_REFRESH_TOKEN_VALIDITY` can be in the range of 3600 (1 hour) to 7776000 (90 days).
</Note>

1. To retrieve `OAUTH_CLIENT_ID` and `OAUTH_CLIENT_SECRET`, run the following SQL:

```
select system$show_oauth_client_secrets('DATAFOLD_OAUTH');
```

### Example result:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/oauth_snowflake_client_creds-47b11899ea2d5df0fce5f17f1711dc62.png" />
</Frame>

1. Fill in the **Client ID** and **Client Secret** fields in Datafold's Data Connection advanced settings:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/4-75640ad5d18710fced1d22c108bbd0c9.png" />
</Frame>

2. Click **Test and save OAuth**

You will be redirected to Snowflake to complete authentication.

info

Your default Snowflake role will be used for the generated **access token**.

This notification signals a successful OAuth configuration:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/5-63f6c2f97041e030191e9abc5ca70637.png" />
</Frame>

### Additional steps for Snowflake

To guarantee correct access rights to temporary tables (stored in **Dataset for temporary tables** provided in the **Basic settings** for Snowflake connection):

* Grant the required privileges on the database and `TEMP` schema for all roles that will be using the OAuth flow. This must be done for all roles that will be utilizing the OAuth flow.

```Bash
GRANT USAGE ON WAREHOUSE <WH_NAME> TO ROLE <ROLENAME>;
GRANT USAGE ON DATABASE <DB_NAME> TO ROLE <ROLENAME>;
GRANT USAGE ON ALL SCHEMAS IN DATABASE <DB_NAME> TO ROLE <ROLENAME>;
GRANT USAGE ON FUTURE SCHEMAS IN DATABASE <DB_NAME> TO ROLE <ROLENAME>;
GRANT ALL ON SCHEMA <DB_NAME>.<TEMP_SCHEMA_NAME> TO ROLE <ROLENAME>;
```

* Revoke `SELECT` privileges for tables in the `TEMP` schema for all roles that will be using the OAuth flow (except for the `DATAFOLDROLE` role), if they were provided. This action must be performed for all roles utilizing the OAuth flow\..

```Bash
-- Revoke SELECT privileges for the TEMP SCHEMA
revoke SELECT ON ALL TABLES IN SCHEMA <DB_NAME>.<TEMP_SCHEMA_NAME> FROM ROLE <ROLENAME>;
revoke SELECT ON FUTURE TABLES IN SCHEMA <DB_NAME>.<TEMP_SCHEMA_NAME> FROM ROLE <ROLENAME>;
revoke SELECT ON ALL VIEWS IN SCHEMA <DB_NAME>.<TEMP_SCHEMA_NAME> FROM ROLE <ROLENAME>;
revoke SELECT ON FUTURE VIEWS IN SCHEMA <DB_NAME>.<TEMP_SCHEMA_NAME> FROM ROLE <ROLENAME>;
revoke SELECT ON ALL MATERIALIZED VIEWS IN SCHEMA <DB_NAME>.<TEMP_SCHEMA_NAME> FROM ROLE <ROLENAME>;
revoke SELECT ON FUTURE MATERIALIZED VIEWS IN SCHEMA <DB_NAME>.<TEMP_SCHEMA_NAME> FROM ROLE <ROLENAME>;
-- Revoke SELECT privileges for a Database
revoke SELECT ON ALL TABLES IN DATABASE <DB_NAME> FROM ROLE <ROLENAME>;
revoke SELECT ON FUTURE TABLES IN DATABASE <DB_NAME> FROM ROLE <ROLENAME>;
revoke SELECT ON ALL VIEWS IN DATABASE <DB_NAME> FROM ROLE <ROLENAME>;
revoke SELECT ON FUTURE VIEWS IN DATABASE <DB_NAME> FROM ROLE <ROLENAME>;
revoke SELECT ON ALL MATERIALIZED VIEWS IN DATABASE <DB_NAME> FROM ROLE <ROLENAME>;
revoke SELECT ON FUTURE MATERIALIZED VIEWS IN DATABASE <DB_NAME> FROM ROLE <ROLENAME>;
```

<Warning>
  **CAUTION**

  If one of the roles will have `FUTURE GRANTS` at the database level, this role will also will have `FUTURE GRANTS` on the `TEMP` schema.
</Warning>

## Example: Redshift

Redshift does not support OAuth2. To execute data diffs on behalf of a specific user, that user needs to provide their own credentials to Redshift.

1. Configure permissions on the Redshift side. Grant the necessary access rights to temporary tables (stored in the **Dataset for temporary tables** provided in the **Basic settings** for Redshift connection):

```Bash
GRANT USAGE on SCHEMA <TEMP_SCHEMA_NAME> to <USERNAME>;
GRANT CREATE on SCHEMA <TEMP_SCHEMA_NAME> to <USERNAME>;
```

1. As an Administrator, select the **Enabled** toggle in Datafold's Redshift Data Connection **Advanced settings**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/redshift_enable_toggle-df94b6b675d5b7a080b0569fed4943b0.png" />
</Frame>

Then, click the **Test and Save** button.

1. As a User, add your Redshift credentials into Datafold. Click on your Datafold username to **Edit Profile**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/redshift_credentials_ui-749a49c20ca40f8857831a49526473cc.png" />
</Frame>

Then, click **Add credentials** and select the required Redshift data connection from the **Data Connections** list:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/redshift_before_create_creds-09bad0b040c673230f8890d35e883533.png" />
</Frame>

Finally, provide your Redshift username and password, and configure the **Delete on** field (after this date, your credentials will be removed from Datafold):

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/redshift_create_creds-efc65762308b064bfd208a0b3f19c4b3.png" />
</Frame>

Click **Create credentials**.

## Example: BigQuery

1. To create a new Google Cloud OAuth 2.0 Client ID, go to the Google Cloud console, navigate to **APIs & Services**, then **Credentials**, and click **+ CREATE CREDENTIALS**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/gcloud_create_btn-15e16ea9d19edb6d0ad61835bd774970.png" />
</Frame>

Select **OAuth client ID**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/gcloud_create_type-6412ecce9428e3aaf21722c81daa0ac9.png" />
</Frame>

From the list of **Application type**, select **Web application**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/gcloud_select_type-a21194f9850db4fe9d6babea49a36ba9.png" />
</Frame>

Provide a name in the **Name** field:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/gcloud_oauth_name-b31b7e54a61b764134fd8f8bab61ccda.png" />
</Frame>

In **Authorized redirect URIs**, provide `https://app.datafold.com/api/internal/oauth_dwh/callback`:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/gcloud_redirect_uri-81f4b0fd9d93db76bf043170c6b027d6.png" />
</Frame>

Click **CREATE**. Then, download the OAuth Client credentials as a JSON file:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/gcloud_download_json1-1dfd1d02cbe9a84bd1124f24b28a293b.png" />
</Frame>

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/gcloud_download_json2-c6002a086c551afe7f06615bd1189ad9.png" />
</Frame>

1. Activate BigQuery OAuth in Datafold by uploading the JSON OAuth credentials in the **JSON OAuth keys file** section, in Datafold's BigQuery Data Connection **Advanced settings**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/gcloud_upload_json-72298dc179c0244871824afa1c0d1362.png" />
</Frame>

Click **Test and Save**.

### Additional steps for BigQuery

1. Create a new temporary schema (dataset) for each OAuth user.

Go to Google Cloud console, navigate to BigQuery, select your project in BigQuery, and click on **Create dataset**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/bq_create_dataset1-56868087d2d2829fcf35c92046361179.png" />
</Frame>

Provide `datafold_tmp_<username>` as the **Dataset ID** and set the same region as configured for other datasets. Click **CREATE DATASET**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/bq_create_dataset2-ae5cb54f7dfb02699c0a8c6baf991205.png" />
</Frame>

1. Configure permissions for `datafold_tmp_<username>`.

Grant read/write/create/delete permissions to the user for their `datafold_tmp_<username>` schema. This can be done by granting roles like **BigQuery Data Editor** or **BigQuery Data Owner** or any custom roles with the required permissions.

Go to Google Cloud console, navigate to BigQuery, select `datafold_tmp_<username>` dataset, and click **Create dataset** → **Manage Permissions**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/bq_config_schema_permissions1-c085ec5f619915d10c8e1819aa31420c.png" />
</Frame>

Click **+ ADD PRINCIPAL**, specify the user and role, then click **SAVE**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/bq_config_schema_permissions2-2cea1df20480853fe7f9aacf1e786280.png" />
</Frame>

caution

Ensure that only the specified user (excluding admins) has read/write/create/delete permissions on `datafold_tmp_<username>`.

1. Configure temporary schema in Datafold.

As a user, navigate to `https://app.datafold.com/users/me`. If the user lacks credentials for BigQuery, click on **+ Add credentials**, select BigQuery datasource from the list, and click **Create credentials**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/bq_datafold_temp_schema1-0ed07791aae93db489b72f56d9a8b956.png" />
</Frame>

The user will be redirected to `accounts.google.com` and then returned to the previous page:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/bq_datafold_temp_schema2-dc4418e6faa3aaceb9ecccd773618fd4.png" />
</Frame>

Select BigQuery credentials from the list, input the **Temporary Schema** field in the format `<project>.<datafold_tmp_<username>>`, and click **Update**:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/bq_datafold_temp_schema3-19e078cdc1acee794bfb92a2abb907a4.png" />
</Frame>

<Note>
  **INFO**

  Users can update BigQuery credentials only if they have the correct permissions for `<datafold_tmp_<username>`.
</Note>


# Integrate with Orchestrators
Source: https://docs.datafold.com/integrations/orchestrators

Integrate Datafold with dbt Core, dbt Cloud, Airflow, or custom orchestrators to streamline your data workflows with automated monitoring, testing, and seamless CI integration.

<Info>
  **NOTE**

  To integrate with dbt, first set up a [Data Connection](/integrations/databases) and integrate with [Code Repositories](/integrations/code-repositories).
  Then navigate to **Settings** → **dbt** and click **Add New Integration**.
</Info>

<CardGroup>
  <Card title="dbt Core" icon="file" href="/integrations/orchestrators/dbt-core" horizontal>
    Set up Datafold with dbt Core to enable automated data diffs and CI/CD integration.
  </Card>

  <Card title="dbt Cloud" icon="file" href="/integrations/orchestrators/dbt-cloud" horizontal>
    Integrate with dbt Cloud to enable automated data diffs and CI/CD integration.
  </Card>

  <Card title="Custom Integrations" icon="file" href="/integrations/orchestrators/custom-integrations" horizontal>
    Use Datafold's API and SDK to build custom CI integrations tailored to your workflow.
  </Card>
</CardGroup>


# Custom Integrations
Source: https://docs.datafold.com/integrations/orchestrators/custom-integrations

Integrate Datafold with your custom orchestration using the Datafold SDK and REST API.

<Info>
  To use the Datafold REST API, you should first create a Datafold API key in Settings > Account.
</Info>

## Install

Then, create your virtual environment for Python:

```
> python3 -m venv venv
> source venv/bin/activate
> pip install --upgrade pip setuptools wheel
```

Now, you're ready to install the Datafold SDK:

```
> pip install datafold-sdk
```

## Configure

Navigate in the Datafold UI to Settings > Integrations > CI. After selecting `datafold-sdk` from the available options, complete configuration with the following information:

| Field Name                               | Description                                                                                                                                                                                                                                                        |
| ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Repository                               | Select the repository that generates the webhooks and where pull / merge requests will be raised.                                                                                                                                                                  |
| Data Connection                          | Select the data connection where the code that is changed in the repository will run.                                                                                                                                                                              |
| Name                                     | An identifier used in Datafold to identify this CI configuration.                                                                                                                                                                                                  |
| Files to ignore                          | If defined, the files matching the pattern will be ignored in the PRs. The pattern uses the syntax of .gitignore. Excluded files can be re-included by using the negation; re-included files can be later re-excluded again to narrow down the filter.             |
| Mark the CI check as failed on errors    | If the checkbox is disabled, the errors in the CI runs will be reported back to GitHub/GitLab as successes, to keep the check "green" and not block the PR/MR. By default (enabled), the errors are reported as failures and may prevent PR/MRs from being merged. |
| Require the `datafold` label to start CI | When this is selected, the Datafold CI process will only run when the 'datafold' label has been applied. This label needs to be created manually in GitHub or GitLab and the title or name must match 'datafold' exactly.                                          |
| Sampling tolerance                       | The tolerance to apply in sampling for all data diffs.                                                                                                                                                                                                             |
| Sampling confidence                      | The confidence to apply when sampling.                                                                                                                                                                                                                             |
| Sampling Threshold                       | Sampling will be disabled automatically if tables are smaller than specified threshold. If unspecified, default values will be used depending on the Data Connection type.                                                                                         |

## Add commands to your custom orchestration

```bash
export DATAFOLD_API_KEY=XXXXXXXXX

# only needed if your Datafold app url is not app.datafold.com
export DATAFOLD_HOST=<CUSTOM_DATAFOLD_APP_DOMAIN>
```

To submit diffs for a CI run, replace `ci_config_id`, `pr_num`, and `diffs_file` with the appropriate values for your CI configuration ID, pull request number, and the path to your diffs `JSON` file.

#### CLI

```bash
datafold ci submit \
  --ci-config-id <ci_config_id> \
  --pr-num <pr_num> \
  --diffs <diffs_file> \
```

#### Python

```python
import os

from datafold_sdk.sdk.ci import run_diff

api_key = os.environ.get('DATAFOLD_API_KEY')

# Only needed if your Datafold app URL is not app.datafold.com
host = os.environ.get("DATAFOLD_HOST")

run_diff(host=host,
         api_key=api_key,
         ci_config_id=<ci_config_id>,
         pr_num=<pr_num>,
         diffs='<diffs_file>')
```

##### Example JSON format for diffs file

The `JSON` file should define the production and pull request tables to compare, along with any primary keys and columns to include or exclude in the comparison.

```json
[
  {
    "prod": "YOUR_PROJECT.PRODUCTION_TABLE_A",
    "pr": "YOUR_PROJECT.PR_TABLE_NUM",
    "pk": ["ID"],
    "include_columns": ["Column1", "Column2"],
    "exclude_columns": ["Column3"]
  },
  {
    "prod": "YOUR_PROJECT.PRODUCTION_TABLE_B",
    "pr": "YOUR_PROJECT.PR_TABLE_NUM",
    "pk": ["ID"],
    "include_columns": ["Column1"],
    "exclude_columns": []
  }
]
```


# dbt Cloud
Source: https://docs.datafold.com/integrations/orchestrators/dbt-cloud

Integrate Datafold with dbt Cloud to automate Data Diffs in your CI pipeline, leveraging dbt jobs to detect changes and ensure data quality before merging.

<Note>
  **NOTE**

  You will need a dbt **Team** account or higher to access the dbt Cloud API that Datafold uses to connect the accounts.
</Note>

## Prerequisites

### Set up dbt Cloud CI

In dbt Cloud, [set up dbt Cloud CI](https://docs.getdbt.com/docs/deploy/cloud-ci-job) so that your Pull Request job runs when you open or update a Pull Request. This job will provide Datafold information about the changes included in the PR.

### Create an Artifacts Job in dbt Cloud

The Artifacts job generates production `manifest.json` on merge to main/master, giving Datafold information about the state of production. The simplest method is to set up a dbt Cloud job that executes the `dbt ls` command on merge to main/master.

> Note: `dbt ls` is preferred over `dbt compile` as it runs faster and data diffing does not require fully compiled models to work.

Example dbt Cloud artifact job settings and successful run:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_cloud_artifacts_select_merge_job-590292c72209454e660444ea1a78fb5f.png" />
</Frame>

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_cloud_artifacts_job_settings-939f1ce3f456698459c9045115706775.png" />
</Frame>

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_ls_artifacts_job_example-2839e9104f9a64ca2833966db3900131.png" />
</Frame>

<Accordion title="Continuous Deployment">
  If you are interested in continuous deployment, you can use a [Merge Trigger Production Job](https://docs.datafold.com/cd#merge-trigger-production-job) instead of the Artifacts Job listed above.
</Accordion>

### dbt Cloud Access URL

You will need your [access url](https://docs.getdbt.com/docs/cloud/about-cloud/regions-ip-addresses) to connect Datafold to your dbt Cloud account.

### Add dbt Cloud Service Account Token

To connect Datafold to your dbt Cloud account, you will need to use a [Service Token](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens).

info

Please note that the use of User API Keys for this purpose is no longer recommended due to a [recent security update](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens) in dbt Cloud. [Learn more below](/integrations/orchestrators/dbt-cloud#deprecating-user-tokens)

1. Navigate to **Account Settings → Service Tokens → + New Token**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_cloud_add_service_token-2367d19382e6d25416b452ec5378bbfb.png" />
</Frame>

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_cloud_add_service_token_permission-9fbdbb501c79f8a0bdee4abbf7483270.png" />
</Frame>

1. Add a Permission Set and select `Member` or `Developer`.
2. Select `All Projects`, or check only the projects you intend to use with Datafold.
3. Save your changes.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_cloud_service_token-5a4c080cb6b778f030eaf02988c36978.png" />
</Frame>

1. Navigate to **Your Profile → API Access** and copy the token.

#### Deprecating User Tokens

dbt Cloud is transitioning away from the use of User API Keys for authentication. The User API Key will be replaced by account-scoped Personal Access Tokens (PATs).

This update will affect the functionality of certain API endpoints. Specifically, `/v2/accounts`, `/v3/accounts`, and `/whoami` (undocumented API) will no longer return information about all the accounts tied to a user. Instead, the response will be filtered to include only the context of the specific account in the request.

dbt Cloud users have until April 30, 2024, to implement this change. After this date, all user API keys will be scoped to an account. New customers are required to use the new account-scoped PATs.

For more information, please refer to the [dbt Cloud API Documentation](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens).

If you have any questions or require further assistance, please don't hesitate to contact our support team.

## Create a dbt Cloud Integration in the Datafold app

* Navigate to Settings > Integrations > CI and create a new dbt Cloud integration.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_cloud_setup-b9dab8af8ca813283d0aaa3b99556eb0.png" />
</Frame>

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_cloud_api_key-f3e2f3669695bdedf80f47fa1ccc91b3.png" />
</Frame>

## Configuration

### Basic Settings

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_cloud_basic_settings-022522ea2690dcc55c4bc7d3b1e4a411.png" />
</Frame>

* **Repository**: Select a repository that you set up in [the Code Repositories setup step](/integrations/code-repositories).
* **Data Connection**: Select a connection that you set up in [the Data Connections setup step](/integrations/databases).
* **Name**: This can be anything!
* **Primary key tag**: This is a text string that you may use to tag primary keys in your dbt project yaml. Note that to avoid the need for tagging, [primary keys can be inferred from dbt uniqueness tests](/deployment-testing/configuration/primary-key).
* **Account name**: This will be autofilled using your dbt API key.
* **Job that creates dbt artifacts**: This will be [the Artifacts Job that you created](#create-an-artifacts-job-in-dbt-cloud). Or, if you have a dbt production job that runs on each merge to main, select that job.
* **Job that builds pull requests**: This is the dbt CI job that is triggered when you open a Pull Request or Merge Request.

### Advanced Settings

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_cloud_advanced_settings-c862158fc664963c51377f0daaadaca3.png" />
</Frame>

* **Enable Datafold in CI/CD**: High-level switch to turn Datafold off or on in CI (but we hope you'll leave it on!).
* **Import dbt tags and descriptions**: Populate our Lineage tool with dbt metadata. ⚠️ This feature is in development. ⚠️
* **Slim Diff**: Only diff modified models in CI, instead of all models. [Please read more about Slim Diff](/deployment-testing/best-practices/slim-diff), which is highly configurable using dbt yaml, and each organization will need to set a strategy based on their data environment.
  * Downstream Hightouch models will be diffed even when Slim Diff is turned on.
* **Diff Hightouch Models**: Hightouch customers can see diffs of downstream Hightouch assets in Pull Requests.
* **CI fails on primary key issues**: The existence of null or duplicate primary keys causes the Datafold CI check to fail.
* **Pull Request Label**: For when you want Datafold to *only* run in CI when a label is manually applied in GitHub/GitLab.
* **CI Diff Threshold**: For when you want Datafold to *only* run automatically if the number of diffs doesn't exceed this threshold for a given CI run.
* **Files to ignore**: If at least one modified file doesn’t match the ignore pattern, Datafold CI diffs all changed models in the PR. If all modified files should be ignored, Datafold CI does not run in the PR. ([Additional details.](/deployment-testing/configuration/datafold-ci/on-demand))
* **Custom base branch**: For when you want Datafold to **only** run in CI when a PR is opened against a specific base branch. You might need this if you have multiple environments built from different branches. See [Custom branch](https://docs.getdbt.com/faqs/Environments/custom-branch-settings) in dbt Cloud docs.

Click save, and that's it! <Icon icon="party-horn" />

Now that you've set up a dbt Cloud integration, Datafold will diff your impacted tables whenever you push commits to a PR. A summary of the diff will appear in GitHub, and detailed results will appear in the Datafold app.


# dbt Core
Source: https://docs.datafold.com/integrations/orchestrators/dbt-core

Set up Datafold’s integration with dbt Core to automate Data Diffs in your CI pipeline.

<Note>
  **PREREQUISITES**

  * Create a [Data Connection Integration](/integrations/databases) where your dbt project data is built.
  * Create a [Code Repository Integration](/integrations/code-repositories) where your dbt project code is stored.
</Note>

## Getting started

To add Datafold to your continuous integration (CI) pipeline using dbt Core, follow these steps:

### 1. Create a dbt Core integration.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_core_integration-e8e0468f1f20a3436a47cb37d08639c0.png" />
</Frame>

### 2. Set up the dbt Core integration.

Complete the configuration by specifying the following fields:

#### Basic settings

| Field Name         | Description                                                                                |
| ------------------ | ------------------------------------------------------------------------------------------ |
| Configuration name | Choose a name for your for your Datafold dbt integration.                                  |
| Repository         | Select your dbt project.                                                                   |
| Data Connection    | Select the data connection your dbt project writes to.                                     |
| Primary key tag    | Choose a string for [tagging primary keys](/deployment-testing/configuration/primary-key). |

#### Advanced settings: Configuration

| Field Name                       | Description                                                                                                                                                                                                                                                                                                 |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Import dbt tags and descriptions | Import dbt metadata (including column and table descriptions, tags, and owners) to Datafold.                                                                                                                                                                                                                |
| Slim Diff                        | Data diffs will be run only for models changed in a pull request. See our [guide to Slim Diff](/deployment-testing/best-practices/slim-diff) for configuration options.                                                                                                                                     |
| Diff Hightouch Models            | Run Data Diffs for Hightouch models affected by your PR.                                                                                                                                                                                                                                                    |
| CI fails on primary key issues   | The existence of null or duplicate primary keys will cause CI to fail.                                                                                                                                                                                                                                      |
| Pull Request Label               | When this is selected, the Datafold CI process will only run when the `datafold` label has been applied.                                                                                                                                                                                                    |
| CI Diff Threshold                | Data Diffs will only be run automatically for a given CI run if the number of diffs doesn't exceed this threshold.                                                                                                                                                                                          |
| Branch commit selection strategy | Select "Latest" if your CI tool creates a merge commit (the default behavior for GitHub Actions). Choose "Merge base" if CI is run against the PR branch head (the default behavior for GitLab).                                                                                                            |
| Custom base branch               | If defined, CI will run only on pull requests with the specified base branch.                                                                                                                                                                                                                               |
| Columns to ignore                | Use standard gitignore syntax to identify columns that Datafold should never diff for any table. This can [improve performance](/faq/performance-and-scalability#how-can-i-optimize-diff-performance-at-scale) for large datasets. Primary key columns will not be excluded even if they match the pattern. |
| Files to ignore                  | If at least one modified file doesn’t match the ignore pattern, Datafold CI diffs all changed models in the PR. If all modified files should be ignored, Datafold CI does not run in the PR. ([Additional details.](/deployment-testing/configuration/datafold-ci/on-demand))                               |

#### Advanced settings: Sampling

Sampling allows you to compare large datasets more efficiently by checking only a randomly selected subset of the data rather than every row. By analyzing a smaller but statistically meaningful sample, Datafold can quickly estimate differences without the overhead of a full dataset comparison. To learn more about how sampling can result in a speedup of 2x to 20x or more, see our [best practices on sampling](/data-diff/cross-database-diffing/best-practices#enable-sampling).

| Field Name          | Description                                                                                                                                                                |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Enable sampling     | Enable sampling for data diffs to optimize analyzing large datasets.                                                                                                       |
| Sampling tolerance  | The tolerance to apply in sampling for all data diffs.                                                                                                                     |
| Sampling confidence | The confidence to apply when sampling.                                                                                                                                     |
| Sampling threshold  | Sampling will be disabled automatically if tables are smaller than specified threshold. If unspecified, default values will be used depending on the Data Connection type. |

### 3. Obtain an Datafold API Key and CI config ID.

After saving the settings in step 2, scroll down and generate a new Datafold API Key and obtain the CI config ID.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/api_key_and_config_id-634de74aeb5f3904e366c412b4c61ba1.png" />
</Frame>

### 4. Configure your CI script(s) with the Datafold SDK.

Using the Datafold SDK, configure your CI script(s) to upload dbt `manifest.json` files.

The `datafold dbt upload` command takes this general form and arguments:

```
datafold dbt upload --ci-config-id <your-ci_config-id> --run-type <job-type> --commit-sha <commit-sha>
```

You will need to configure orchestration to upload the dbt `manifest.json` files in 2 scenarios:

1. **On merges to main.** These `manifest.json` files represent the state of the dbt project on the base/production branch from which PRs are created.
2. **On updates to PRs.** These `manifest.json` files represent the state of the dbt project on the PR branch.

The dbt Core integration creation form automatically generates code snippets that can be added to CI runners.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_core_ci_config-b0a800dc1e86cfa3c0af64d97ca52af8.png" />
</Frame>

By storing and comparing these `manifest.json` files, Datafold determines which dbt models to diff in a CI run.

Implementation details vary depending on which CI tool you use. Please review [these instructions and examples](#ci-implementation-tools) to help you configure updates to your organization's CI scripts.

### 5. Test your dbt Core integration.

After updating your CI scripts, trigger jobs that will upload `manifest.json` files represent the base/production state.

Then, open a new pull request with changes to a SQL file to trigger a CI run.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/dbt_core_ci_config_test-7c78a7e83802c4273cdec4600a09e01d.png" />
</Frame>

## CI implementation tools

We've created guides and templates for three popular CI tools.

<Tip>
  **Having trouble setting up Datafold in CI?**

  We're here to help! Please reach out and [chat with a Datafold Solutions Engineer](https://www.datafold.com/booktime). <Icon icon="phone-rotary" />
</Tip>

To add Datafold to your CI tool, add `datafold dbt upload` steps in two CI jobs:

* **Upload Production Artifacts:** A CI job that build a production `manifest.json`. *This can be either your Production Job or a special Artifacts Job that runs on merge to main (explained below).*
* **Upload Pull Request Artifacts:** A CI job that builds a PR `manifest.json`.

This ensures Datafold always has the necessary `manifest.json` files, enabling us to run data diffs comparing production data to dev data.

<Tabs>
  <Tab title="GitHub Actions">
    **Upload Production Artifacts**

    Add the `datafold dbt upload` step to *either* your Production Job *or* an Artifacts Job.

    **Production Job**

    If your dbt prod job kicks off on merges to the base branch, add a `datafold dbt upload` step after the `dbt build` step.

    ```bash
    name: Production Job
    on:
      push:
        branches:
          - main
    jobs:
      run:
        runs-on: ubuntu-20.04
        steps:
          - name: Install Datafold SDK
            run: pip install -q datafold-sdk
          - name: Upload dbt artifacts to Datafold
            run: datafold dbt upload --ci-config-id <datafold_ci_config_id> --run-type production --commit-sha ${GIT_SHA}
            env:
              DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
              GIT_SHA: "${{ github.sha }}"
    ```

    **Artifacts Job**

    If your existing Production Job runs on a schedule and not on merges to the base branch, create a dedicated job that runs on merges to the base branch which generates and uploads a `manifest.json` file to Datafold.

    ```bash
    name: Artifacts Job
    on:
      push:
        branches:
          - main
    jobs:
      run:
        runs-on: ubuntu-20.04
        steps:
          - name: Install Datafold SDK
            run: pip install -q datafold-sdk
          - name: Generate dbt manifest.json
            run: dbt ls
          - name: Upload dbt artifacts to Datafold
            run: datafold dbt upload --ci-config-id <datafold_ci_config_id> --run-type production --commit-sha ${BASE_GIT_SHA}
            env:
              DATAFOLD_APIKEY: ${{ secrets.DATAFOLD_APIKEY }}
              BASE_GIT_SHA: "${{ github.sha }}"
    ```

    **Pull Request Artifacts**

    Include the `datafold dbt upload` step in your CI job that builds PR data.

    ```bash
    name: Pull Request Job
    on:
      pull_request:
      push:
        branches:
          - '!main'
    jobs:
      run:
        runs-on: ubuntu-20.04
        steps:
          - name: Install Datafold SDK
            run: pip install -q datafold-sdk
          - name: Upload PR manifest.json to Datafold
            run: |
              datafold dbt upload --ci-config-id <datafold_ci_config_id> --run-type pull_request --commit-sha ${PR_GIT_SHA}
            env:
              DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
              PR_GIT_SHA: "${{ github.event.pull_request.head.sha }}"
    ```

    **Store Datafold API Key**

    Save the API key as `DATAFOLD_API_KEY` in your [GitHub repository settings](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository).
  </Tab>

  <Tab title="CircleCI">
    **Upload Production Artifacts**

    Add the `datafold dbt upload` step to *either* your Production Job *or* an Artifacts Job.

    **Production Job**

    If your dbt prod job kicks off on merges to the base branch, add a `datafold dbt upload` step after the `dbt build` step.

    ```bash
    version: 2.1
    jobs:
      prod-job:
        filters:
          branches:
            only: main
        docker:
          - image: cimg/python:3.9
        steps:
          - checkout
          - run:
              name: "Install Datafold SDK"
              command: pip install -q datafold-sdk
          - run:
              name: "Build dbt project"
              command: dbt build
          - run:
              name: "Upload production manifest.json to Datafold"
              command: |
                datafold dbt upload --ci-config-id <datafold_ci_config_id> --run-type production --target-folder ./target/ --commit-sha ${CIRCLE_SHA1}
    ```

    **Artifacts Job**

    If your existing Production Job runs on a schedule and not on merges to the base branch, create a dedicated job that runs on merges to the base branch which generates and uploads a `manifest.json` file to Datafold.

    ```bash
    version: 2.1
    jobs:
      artifacts-job:
        filters:
          branches:
            only: main
        docker:
          - image: cimg/python:3.9
        steps:
          - checkout
          - run:
              name: "Install Datafold SDK"
              command: pip install -q datafold-sdk
          - run:
              name: "Generate manifest.json"
              command: dbt ls --profiles-dir ./
          - run:
              name: "Upload production manifest.json to Datafold"
              command: datafold dbt upload --ci-config-id <datafold_ci_config_id> --run-type production --target-folder ./target/ --commit-sha ${CIRCLE_SHA1}
    ```

    **Store Datafold API Key**

    Save the API key in the [CircleCI interface](https://circleci.com/docs/set-environment-variable/).
  </Tab>

  <Tab title="GitLab CI">
    **Upload Production Artifacts**

    Add the `datafold dbt upload` step to *either* your Production Job *or* an Artifacts Job.

    **Production Job**

    If your dbt prod job kicks off on merges to the base branch, add a `datafold dbt upload` step after the `dbt build` step.

    ```bash
    image:
      name: ghcr.io/dbt-labs/dbt-core:1.x
    run_pipeline:
      stage: deploy
      before_script:
        - pip install -q datafold-sdk
      script:
        - dbt build --profiles-dir ./
        - datafold dbt upload --ci-config-id <ci-config-id> --run-type production --commit-sha $CI_COMMIT_SHA
    ```

    **Artifacts Job**

    If your existing Production Job runs on a schedule and not on merges to the base branch, create a dedicated job that runs on merges to the base branch which generates and uploads a `manifest.json` file to Datafold.

    ```bash
    image:
      name: ghcr.io/dbt-labs/dbt-core:1.x
    run_pipeline:
      stage: deploy
      before_script:
        - pip install -q datafold-sdk
      script:
        - dbt ls --profiles-dir ./
        - datafold dbt upload --ci-config-id <ci-config-id> --run-type production --commit-sha $CI_COMMIT_SHA
    ```

    **Store Datafold API Key**

    Save the API key as `DATAFOLD_API_KEY` in [GitLab repository settings](https://docs.gitlab.com/ee/ci/yaml/index.html#secrets).
  </Tab>
</Tabs>

## CI for dbt multi-projects

When setting up CI for dbt multi-projects, each project should have its own dedicated CI integration to ensure that changes are validated independently.

## CI for dbt multi-projects within a monorepo

When managing multiple dbt projects within a monorepo (a single repository), it’s essential to configure individual Datafold CI integrations for each project to ensure proper isolation.

This approach prevents unintended triggering of CI processes for projects unrelated to the changes made. Here’s the recommended approach for setting it up in Datafold:

**1. Create separate CI integrations:** Create separate CI integrations within Datafold, one for each dbt project within the monorepo. Each integration should be configured to reference the same GitHub repository.

**2. Configure file filters**: For each CI integration, define file filters to specify which files should trigger the CI run. These filters prevent CI runs from being initiated when files from other projects in the monorepo are updated.

**3. Test and validate**: Before deployment, test each CI integration to validate that it triggers only when changes occur within its designated dbt project. Verify that modifications to files in one project do not inadvertently initiate CI processes for unrelated projects in the monorepo.

###

## Advanced configurations

### Skip Datafold in CI

To skip the Datafold step in CI, include the string `datafold-skip-ci` in the last commit message.

### Programmatically trigger CI runs

The Datafold app relies on the version control service webhooks to trigger the CI runs. When the dedicated cloud deployments is behind a VPN, webhooks cannot directly reach the deployment due to the network's restricted access.

We can overcome this by triggering the CI runs via the [datafold-sdk](/api-reference/datafold-sdk) in the Actions/Job Runners, assuming they're running in the same network.

Add a new Datafold SDK command after uploading the manifest in a PR job:

<Tip>
  **Important**

  When configuring your CI script, be sure to use `${{ github.event.pull_request.head.sha }}` for the **Pull Request Job** instead of `${{ github.sha }}`, which is often mistakenly used.

  `${{ github.sha }}` defaults to the latest commit SHA on the branch and **will not work correctly for pull requests**.
</Tip>

```Bash
  - -name: Trigger CI
    run: |
      set -ex
      datafold ci trigger --ci-config-id <datafold_ci_config_id> \
        --pr-num ${PR_NUM} \
        --base-branch ${BASE_BRANCH} \
        --base-sha ${BASE_SHA} \
        --pr-branch ${PR_BRANCH} \
        --pr-sha ${PR_SHA}
    env:
      DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
      DATAFOLD_HOST: ${{ secrets.DATAFOLD_HOST }}
      PR_NUM: ${{ github.event.number }}
      PR_BRANCH: ${{ github.event.pull_request.head.ref }}
      BASE_BRANCH: ${{ github.event.pull_request.base.ref }}
      PR_SHA: ${{ github.event.pull_request.head.sha }}
      BASE_SHA: ${{ github.event.pull_request.base.sha }}

```

### Running diffs before opening a PR

Some teams want to show Data Diff results in their tickets *before* creating a pull request. This speeds up code reviews as developers can QA code changes before requesting a PR review.

Check out how to automate this workflow [here](/faq/datafold-with-dbt#can-i-run-data-diffs-before-opening-a-pr).


# Compliance & Trust Center
Source: https://docs.datafold.com/security/compilance-trust-center


# Database OAuth
Source: https://docs.datafold.com/security/database-oauth

Datafold enables secure workflows like data diffs through OAuth, ensuring compliance with user-specific database permissions.

To improve data security and privacy, Datafold supports running workflows like data diffs through OAuth. This ensures queries are executed using the user's own database credentials, fully complying with granular access controls like data masking and object-level permissions.

The diagram below illustrates how the authentication flow proceeds:

1. Users authenticate using the configured OAuth provider.
2. Users can then create diffs between data sets that their user can access using OAuth database permissions.
3. During Continuous Integration (CI), Datafold executes diffs using a Service Account with the least privileges, thus masking sensitive/PII data.
4. If a user needs to see sensitive/PII data from a CI diff, and they have permission via OAuth to do so, they can rerun the diff, and then Datafold will authenticate the user using OAuth database permissions. Then, the user will have access to the data based on these permissions.

This structure ensures that diffs are executed with the user's database credentials with their configured roles and permissions. Data access permissions are thus fully managed by the database, and Datafold only passes through queries.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/rbac-with-sso-auth-flow-f641e578b9ee12f4ab09e5573125cb0a.png" />
</Frame>


# Securing Connections
Source: https://docs.datafold.com/security/securing-connections

Datafold supports multiple options to secure connections between your resources (e.g., databases and BI tools) and Datafold.

## Encryption

When you connect to Datafold to query your data in a database (e.g., BigQuery), communications are secured using HTTPS encryption.

## IP Whitelisting

If access to your data connection is restricted to IP addresses on an allowlist, you will need to manually add Datafold's addresses in order to use our product. Otherwise, you will receive a connection error when setting up your data connection.

For SaaS (app.datafold.com) deployments, whitelist the following IP addresses:

* `23.23.71.47`
* `35.166.223.86`
* `52.11.132.23`
* `54.71.177.163`
* `54.185.25.103`
* `54.210.34.216`

Note that at any given time, you will only see one of these addresses in use. However, the active IP address can change, so you should add them all to your IP whitelist to ensure no interruptions in service.

## Private Link

<Tabs>
  <Tab title="AWS">
    ### AWS PrivateLink

    AWS PrivateLink allows you to connect Datafold to your databases without exposing data to the internet. This option is available for both Datafold SaaS Cloud and all Datafold Dedicated Cloud options.

    The following diagram shows the architecture for a customer with a High Availability RDS setup:

    <Frame caption="SaaS with PrivateLink">
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/saas_with_privatelink-1294c819a7e75474a9eb736bfac2cc95.png" />
    </Frame>

    ### Setup

    <Info>
      Supported Databases

      The following setup assumes you have an RDS/Aurora database you want to connect to. Datafold also supports PrivateLink connections to other databases such as Snowflake, which should only be accessed from your VPC. Please contact [support@datafold.com](mailto:support@datafold.com) to get assistance with connecting to your specific database.
    </Info>

    Our support team will send you the following:

    * The role ARN to establish the PrivateLink connection.
    * Datafold SaaS Cloud VPC CIDR range.

    You need to do the following steps:

    1. Send us the region(s) where your database(s) are located.
    2. Create a VPC Endpoint Service and NLB.
       * The core concepts of this setup are described in this AWS blog: [Access Amazon RDS across VPCs using AWS PrivateLink and Network Load Balancer](https://aws.amazon.com/blogs/database/access-amazon-rds-across-vpcs-using-aws-privatelink-and-network-load-balancer/).
       * If your databases are HA, please implement the failover mechanics described in the blog.
         * A CloudFormation template can be found [here](https://github.com/aws-samples/amazon-rds-crossaccount-access/blob/main/CrossAccountRDSAccess.yml).
    3. Add the provided role ARN as 'Allowed Principal' on the VPC Endpoint Service.
    4. Allow ingress from the Datafold SaaS Cloud VPC.
    5. Send us the:
       * Service name(s), e.g. `com.amazonaws.vpce.us-west-2.vpce-svc-0cfd2f258c4395ad6`.
       * Availability Zone ID(s) used in the VPCE Service(s), e.g. `use1-az6` or `usw2-az3`.
       * RDS/Aurora hostname(s), e.g. `datafold.c2zezoge6btk.us-west-2.rds.amazonaws.com`.

    At the end, the database hostname used to configure the data source will be the original RDS/Aurora hostname. But with private DNS resolution, we will resolve the hostname to the VPC Endpoint. Our support team will let you know when everything is set up and you can accept the PrivateLink connection and start configuring the data source.

    ### Cross-Region PrivateLink

    Datafold SaaS Cloud supports cross-region PrivateLink for all North American regions. Datafold SaaS Cloud is located in `us-west-2`. Datafold manages the cross-region networking, allowing you to connect to a VPC Endpoint in the same region as your VPC Endpoint Service. For Datafold Dedicated Cloud customers, deployment occurs in your chosen region. If you need to connect to databases in multiple regions, Datafold also supports this through cross-region PrivateLink.

    The setup will be similar to the regular PrivateLink setup.
  </Tab>

  <Tab title="GCP">
    ### Private Service Connect

    Google Cloud's Private Service Connect is only available if both parties are in the same cloud region. This option is only available for Datafold Dedicated Cloud customers. The diagram below illustrates how the solution works:

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/gcp-psc-endpoint-overview-codelabs-b94101869413df5385a0a6406b9ff859.png" />
    </Frame>

    The basics of Private Service Connect are available [here](https://cloud.google.com/vpc/docs/private-service-connect).
  </Tab>

  <Tab title="Azure">
    ### Azure Private Link

    Azure Private Link is only available if both parties are in the same cloud region. This option is only available for Datafold Dedicated Cloud customers. The diagram below illustrates how the solution works:

    <Frame>
      <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/azure-cross-tenant-secure-access-private-endpoints-architecture-bcf92a6fe7e8007d278b3256c1ef666d.svg" />
    </Frame>

    The basics of Private Link are available [here](https://learn.microsoft.com/en-us/azure/private-link/private-link-overview).

    For Customer-Hosted Dedicated Cloud, achieving cross-tenant access requires using Private Link. The documentation can be accessed [here](https://learn.microsoft.com/en-us/azure/architecture/guide/networking/cross-tenant-secure-access-private-endpoints).
  </Tab>
</Tabs>

## VPC Peering (SaaS)

VPC Peering is easier to set up than Private Link, but a drawback is that both networks are joined and the IP ranges must not overlap. For Datafold SaaS Cloud, this setup is an AWS-only option.

The basics of VPC peering are covered [here](https://docs.aws.amazon.com/vpc/latest/peering/vpc-peering-basics.html).

To set up VPC peering, please contact [support@datafold.com](mailto:support@datafold.com) and provide us with the following information:

* AWS region where your database is hosted.
* ID of the VPC that you would like to connect.
* CIDR of the VPC.

If there are no address collisions, we'll send you a peering request and CIDR that we use on our end, and whitelist the CIDR range for your organization. You'll need to set up routing to this CIDR through the peering connection.

If you activate DNS on your side of the peering connection, you can use the private DNS hostname to connect. Otherwise, you need to use the IP.

## VPC Peering (Dedicated Cloud)

VPC Peering is a supported option for all cloud providers, both for Datafold-hosted and customer-hosted deployments. Basic information for each cloud provider can be found here:

* [AWS](https://docs.aws.amazon.com/vpc/latest/peering/vpc-peering-basics.html)
* [GCP](https://cloud.google.com/vpc/docs/vpc-peering)
* [Azure](https://learn.microsoft.com/en-us/azure/virtual-network/create-peering-different-subscriptions?tabs=create-peering-portal)

<Tip>
  **VPC vs VNet**

  We use the term VPC accross all major cloud providers. However, Azure calls this concept a Virtual Network (VNet).
</Tip>

## SSH Tunnel

To set up a tunnel, please contact our team at [support@datafold.com](mailto:support@datafold.com) and provide the following information:

* Hostname of your bastion host and port number used for SSH service.
* Hostname of and port number of your database.
* SSH fingerprint of the bastion host (optional).

We'll get back to you with:

* SSH public key that you need to add to `~/.ssh/authorized_hosts`.
* IP address and port to use for data connection configuration in the Datafold application.

## IPSec tunnel

Please contact our team at [support@datafold.com](mailto:support@datafold.com) for more information.


# Single Sign-On
Source: https://docs.datafold.com/security/single-sign-on

Set up Single Sign-On with one of the following options.

<CardGroup>
  <Card title="Okta (OIDC)" href="/security/single-sign-on/okta" icon="file" horizontal />

  <Card title="Google OAuth" href="/security/single-sign-on/google-oauth" icon="file" horizontal />

  <Card title="SAML" href="/security/single-sign-on/saml/" icon="folder-open" horizontal />
</CardGroup>

<Tip>
  **Tip**

  You can force all users to use the configured SSO provider by unchecking the *Allow non-admin users to login with email and password* checkbox under the organization settings.

  Admin users will still be able to login using email and password.

  <Frame>
    <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/disable_non_admin_email_password_login.png" />
  </Frame>
</Tip>

<Warning>
  **Caution**

  Ensure only authorized users keep using Datafold by setting up Okta webhooks or setting up credentials for the Microsoft Entra app if you're using Microsoft Entra ID (formerly known Azure Active Directory)

  This will disable non-admin users that don't have access to the configured SSO app.

  [Configure this for Okta](/security/single-sign-on/okta#synchronize-state-with-datafold-optional)

  [Configure this for Microsoft Entra ID](/security/single-sign-on/saml/examples/microsoft-entra-id-configuration#synchronize-user-with-datafold-optional)
</Warning>


# Google OAuth
Source: https://docs.datafold.com/security/single-sign-on/google-oauth


<Info>
  **NOTE**

  Google SSO is available for both SaaS and VPC installations of Datafold.
</Info>

## Datafold SaaS

For Datafold SaaS the setup only involves enabling Google SSO integration.

If Google SSO is already enabled for your organization you will see it in the **Settings** → **Integrations** → **SSO**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/google_oauth_list-5b0c84c5bdddde31c6e82cce055ba758.png" />
</Frame>

If this is not the case, create a new Google SSO Integration by clicking on the **Add new integration** button.

Enable the **Allow Google logins in organization** switch and click **Save**. That's it!
If you are not using Datafold SaaS, please see below.

## Create OAuth Client ID

To begin, navigate to the [Google admin console](https://console.cloud.google.com/apis/credentials?authuser=1%5C\&folder=%5C) for your organization, click **Create Credentials**, and select **OAuth Client ID**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/google_oauth_create_credential-95776eb793bc1a1115c5fe1b18d9203f.png" />
</Frame>

<Tip>
  **TIP**

  To configure OAuth, you may need to first configure your consent screen. We recommend selecting **Internal** to keep access limited to users in your Google workspace and organization.
</Tip>

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/google_oauth_consent_screen-6fd858ff5d1fc5fd43be910f055fe0ca.png" />
</Frame>

### Configure OAuth[](#configure-oauth "Direct link to Configure OAuth")

* **Application type**: "Web application"
* **Authorized JavaScript origins**: `https://<your.domain.name>`
* **Authorized redirect URIs**: `https://<your.domain.name>/oauth/google`

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/google_oauth_authorizations-002a9446a71ba66ba4d375d897c4cdf7.png" />
</Frame>

Finally, click **Create**. You will see a set of credentials that you will copy over to your Datafold Global Settings.

## Configure Google OAuth in Datafold

To finish the configuration, create a Google SSO Integration in Datafold.

To complete the integration in Datafold, create a new integration by navigating to **Settings** → **Integrations** → **SSO** → **Add new integration** → **Google**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/google_oauth_create-463bc5c0d7cd049e0ba4ae8fdd577185.png" />
</Frame>

* Enable the **Google OAuth** switch.
* Enter the **domain** or URL of your OAuth client Id on the respective field.
* Paste the **Client Secret** on the respective field.
* Enable the **Allow Google logins in Organization** switch.
* Finally, click **Save**.


# Okta (OIDC)
Source: https://docs.datafold.com/security/single-sign-on/okta


<Info>
  **NOTE**

  Okta SSO is available for both SaaS and dedicated cloud installations of Datafold.
</Info>

## Create Okta App Integration[](#create-okta-app-integration "Direct link to Create Okta App Integration")

<Note>
  **INFO**

  Creating an App Integration in Okta may require admin privileges.
</Note>

Start the integration by creating a web app integration in Okta.

Next, log in to Okta interface and navigate to **Applications** and click **Create App Integration**.

Then, in the configuration form, select **OpenId Connect (OIDC)** and **Web Application** as the Application Type.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/okta_create_new_app-0b8566bc3dd329ef3d80f849c0065fef.png" />
</Frame>

In the following section, you will set:

* **App integration name**: A name to identify the integration. We suggest you use `Datafold`.
* **Grant type**: Should be set to `Authorization code` automatically.
* **Sign-in redirect URI**:

<Tabs>
  <Tab title=" SaaS">
    The redirect URL should be `https://app.datafold.com/oauth/okta/client_id`, where `client_id` is the Client ID of the configuration.

    <Warning>
      **CAUTION**
      You will be given the Client ID after saving the integration and need to come back to update the client ID afterwards.
    </Warning>
  </Tab>

  <Tab title="Dedicated cloud installations of Datafold">
    The redirect URL should be `https://your-dns-name/oauth/okta`, replacing `your-dns-name` with the DNS name for your installation.
  </Tab>
</Tabs>

* **Sign-out redirect URIs**: Leave this empty.
* **Trusted Origins**: Leave this empty too.
* **Assignments**: Select `Skip group assignment for now`. Later you should assign the correct groups and users.
* Click "Save" to create the app integration in Okta.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/okta_redirect_uri-b64d1ac6c24ab8577bf8a52f14da842b.png" />
</Frame>

Once the save is successful, on the next screen, you'll be presented with Client ID and Client Secret. We need these IDs to update the redirect URLs that Datafold needs. We'll also apply the Client ID and Client Secret in the Datafold integration later.

* Edit "General settings"
* Scroll down to the **Login** section
* Update the **Sign-in redirect URI**. See above for details.
* Click "Save" to persist the changes.

## Set Up Okta-initiated login

<Tip>
  **TIP**

  Organization admins will always be able to log in with either password or Okta. Non-admin users will be required to log in through Okta once configured.
</Tip>

This step is optional and should be done at the discretion of the Okta administrator.

Users in your organization can log in to the application directly from the Okta end-user dashboard. To enable this feature, configure the integration as follows:

1. Edit "General settings"
2. Set **Login initiated by** to `Either Okta or App`.
3. Set **Application visibility** to `Display application icon to users`.
4. Set **Login flow** to `Redirect to app to initiate login (OIDC Compliant)`.
5. Set **Initiate login URI**:

<Tabs>
  <Tab title=" SaaS">
    * `https://app.datafold.com/login/sso/client-id?action=desired_action`
    * Replace `client-id` with the Client ID of the configuration, and
    * Replace `desired_action` with `signup` if you enabled users auto-creation, or `login` otherwise.
  </Tab>

  <Tab title="Dedicated cloud installations of Datafold">
    * `https://your-dns-name/login/sso/client-id?action=desired_action`
    * Replace `client-id` with the Client ID of the configuration, and
    * Replace `desired_action`with `signup` if you enabled users auto-creation, or `login` otherwise.
    * Replace `your-dns-name` with the DNS name for your installation.
  </Tab>
</Tabs>

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/okta_initiated_login-8a7541151582487dd21f8381207e25fd.png" />
</Frame>

1. Click "Save" to persist the changes.

The Okta configuration is now complete.

## Configure Okta in Datafold

To finish the configuration, create an Okta integration in Datafold.

To complete the integration in Datafold, create a new integration by navigating to **Settings** → **Integrations** → **SSO** → **Add new integration** → **Okta**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/okta_create-8269c208d4fa7df43a8c5ad99e675297.png" />
</Frame>

* Paste in your Okta **Client Id** and **Client Secret**.
* The **Metadata Url** of Okta OAuth server is `https://<okta-server-name>/.well-known/openid-configuration`, replace `okta-server-name` with the name of your Okta domain.
* If you'd like to auto-create users in Datafold that are authorized in Okta, enable the **Allow Okta to auto-create users in Organization** switch.
* Finally, click **Save**.

<Tip>
  **TIP**

  Users can either be explicitly invited in Datafold by an admin user, using the same email as used in Okta, or they can be auto-created. When the `signup` action is set in the login URI, authenticated users on Okta who have been assigned as a user in Okta of the Datafold application will then be able to login. If that user has not yet been invited, Datafold will then automatically create a user for them, since they're already authenticated by the Okta server of your domain. The user will then receive an email to confirm their email address.
</Tip>

## Synchronize state with Datafold \[Optional]

This step is essential if you want to ensure that users from your organization are automatically logged out when they are unassigned or deactivated in Okta.

1. Navigate to **Okta Admin panel** → **Workflow** → **Event Hooks**
2. Click **Create Event Hook**
3. Set **Name** to `Datafold`
4. Set **URL** to `https://app.datafold.com/hooks/oauth/okta/<client-id>`
5. Set **Authentication field** to `secret`
6. Go to Datafold and generate a secret token in **Settings** → **Integrations** → **SSO** → **Okta**. Click the **Generate** button, copy it by using the **Copy** button and click **Save**. Use the pasted code in the **Authentication secret** field in Okta.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/generate_token_input-3ef82f777565226aa5da10b52464549e.png" />
</Frame>

<Warning>
  **CAUTION**

  Keep this secret token safe as you won't be able to see after saving your Integration.
</Warning>

7. In **Subscribe to events** add events: `User suspended`, `User deactivated`, `Deactivate application`, `User unassigned from app`
8. Click **Save & Continue**

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/config_okta_event_hooks-ed108690a4e2e94d8158527dcc2f4196.png" />
</Frame>

. On **Verify Endpoint Ownership** click **Verify**

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/verify_okta_event_hooks-57c17ee772834faf39e6c7689743d1f5.png" />
</Frame>

* If the verification is successful, you have completed the setup.

## Testing the Okta integration

<Tabs>
  <Tab title="SaaS">
    * Visit [https://app.datafold.com](https://app.datafold.com)
    * Type in your email and wait up to five seconds.
    * The Okta button should switch from disabled to enabled.
    * Click the Okta login button.
    * The browser should be redirected to your Okta domain, authenticate the user there and be redirected back to the Datafold application.
  </Tab>

  <Tab title="Dedicated cloud installations of Datafold">
    * Visit `https://your-dns-name`, replacing your-dns-name with the domain name of your installation.
    * Type in your email and wait up to five seconds.
    * The Okta button should switch from disabled to enabled.
    * Click the Okta login button.
    * The browser should be redirected to your Okta domain, authenticate the user there and be redirected back to the Datafold application.
  </Tab>
</Tabs>

If this didn't work, pay close attention to any error messages, or contact `support@datafold.com`.


# SAML
Source: https://docs.datafold.com/security/single-sign-on/saml

SAML (Security Assertion Markup Language) is a protocol that enables secure user authentication by integrating Identity Providers (IdPs) with Service Providers (SPs).

<Info>
  **NOTE**

  SAML SSO is available for both SaaS and VPC installations of Datafold.
</Info>

In this case, Datafold is the service provider. The Identity Providers can be anything used by the organization (e.g., Google, Okta, Duo).

We also support SAML SSO [group provisioning](/security/single-sign-on/saml/group-provisioning).

## Generic SAML Identity Providers

<Tip>
  **TIP**

  We also provide SAML identity providers configurations for ([Okta](/security/single-sign-on/saml/examples/okta), [Microsoft Entra ID](/security/single-sign-on/saml/examples/microsoft-entra-id-configuration), and [Google](/security/single-sign-on/saml/examples/google))
</Tip>

To configure a SAML provider:

1. Go to `Datafold`. Create a new integration by navigating to **Settings** → **Integrations** → **SSO** → **Add new integration** → **SAML**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/saml_create-3716c6fe01352ea69c647a7856adf189.png" />
</Frame>

1. Go to the organization's `Identity Provider`, create a **SAML application** (sometimes called a **single sign-on** or **SSO** method).

If you have the option, enable the SAML Response signature and set it to **whole-response signing**.

1. Copy and paste the Service Provider URLs from the `Datafold` SAML Integration into the `Identity Provider`'s application setup. The only two mandatory fields are **Service Provider Entity ID** and the **Service Provider ACS URL**.

After creation, The `Identity Provider` will show you the metadata XML. It may be presented as raw XML, a URL to the XML, or an XML file to download.

<Info>
  **INFO**

  The Identity Providers sometimes provide additional parameters, such as SSO URLs, ACS URLs, SLO URLs, etc. We gather this information from the XML directly so these can be safely ignored.
</Info>

1. Paste either the **metadata XML** *or* **metadata URL** from your `Identity Provider` into the respective `Datafold` SAML integration fields.
2. Finally, click the **Save** button to create the integration.

After creation, the SAML login button will be available for Datafold users in your organization.

1. In your `Identity Provider`, activate the SAML application for all users or for select groups.

<Warning>
  **CAUTION**

  Only configured users in your identity provider will be able to login into Datafold *using* SAML SSO.
</Warning>

### Auto-create users in Datafold

Go to `Datafold` and navigate to **Settings** → **Integrations** → **SSO** → **SAML**.

Enable the **Allow SAML to auto-create users in Organization** switch and save the integration.

<Tabs>
  <Tab title="SaaS">
    If the **Allow SAML to auto-create users in Organization** switch from the SAML Integration in Datafold is enabled, identity provider-initiated logins will automatically create users in Datafold for authenticated users.
  </Tab>

  <Tab title="Dedicated cloud installations of Datafold">
    If the **Allow SAML to auto-create users in Organization** switch from the SAML Integration in Datafold is enabled, the SAML login button will always be enabled, and all authenticated users will be automatically created in Datafold.
  </Tab>
</Tabs>


# Google
Source: https://docs.datafold.com/security/single-sign-on/saml/examples/google


## Google as a SAML Identity Provider

Enable SAML in your Google Workspace. Check [Set up your own custom SAML app](https://support.google.com/a/answer/6087519?hl=en) for more details.

<Warning>
  **CAUTION**

  You need to be a **super-admin** in the Google Workspace to configure a SAML application.
</Warning>

* Go to `Google`, click on **Download Metadata** in the left sidebar and **copy** the XML.
* Select **Email** as the Name ID format.
* Select **Basic Information > Primary email** as the Name ID.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/saml_google_settings-22d5c2a5018e3bd88139f63f41aeb5a8.png" />
</Frame>

* Go to `Datafold` and create a new SSO integration. Navigate to **Settings** → **Integrations** → **Add new integration** → **SAML**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/saml_create-3716c6fe01352ea69c647a7856adf189.png" />
</Frame>

* Copy the read-only field **Service Provider ACS URL**, go to `Google` and paste it into **ACS URL**.
* Copy the read-only field **Service Provider Entity ID**, go to `Google` and paste it into **Entity ID**.
* Paste the **copied** XML into `Datafold`'s **Identity Provider Metadata XML** field.
* Click **Save** to create the integration.
* (Optional step) Configure the attribute mapping as follows:
  * **First Name** → `first_name`
  * **Last Name** → `last_name`

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/saml_google_mappings-60f0f4105c1debd2b14a95aa982727fa.png" />
</Frame>


# Microsoft Entra ID
Source: https://docs.datafold.com/security/single-sign-on/saml/examples/microsoft-entra-id-configuration


## Azure AD / Entra ID as a SAML Identity Provider

You can create an **Enterprise Application** and use that to configure access to Datafold. Click on **New application** and **Create your own application**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/AzureEntraIDSAMLEnterpriseApp-ac80b4305fc06a4a80a45532d718710a.png" />
</Frame>

**Copy** the **App Federation Metadata Url**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/AzureEntraIDSAMLEnterpriseAppInitialConfig-6d5935f0a7efeec4595856d5171c3182.png" />
</Frame>

Go to `Datafold` and create a new SSO integration. Navigate to **Settings** → **Integrations** → **Add new Integration** → **SAML**.

Paste the **copied** URL into **Identity Provider Metadata URL**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/saml_create-3716c6fe01352ea69c647a7856adf189.png" />
</Frame>

Go to `Azure` and edit the **Basic SAML Configuration** in your Enterprise App.

Copy from Datafold the read-only field **Service Provider ACS URL** and paste it into **Reply URL**.

Copy from Datafold the read-only field **Service Provider Entity ID** and paste it into **Identifier**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/AzureEntraIDSAMLEnterpriseAppSAMLConfig-f04cd556cd232163a85a3ff2e47e5e7e.png" />
</Frame>

Go to `Datafold` and click **Save** to create the SAML integration.

Next, edit the **Attributes & Claims**. By default, the **Unique User Identifier** is already correctly set to `user.userprincipalname`. If you have multiple domains (i.e., `@datafold.com` and `@datafoldonmicrosoft.com`), please make sure this maps correctly to the email addresses of the users in Datafold.

(Optional step) Add two attributes: `first_name` and `last_name`.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/AzureEntraIDSAMLEnterpriseAppSAMLAttribute-99692a9fa1d102a1eaa818d36c6b812e.png" />
</Frame>

Finally, edit the **SAML Certificates**. Set the signing option to **Sign SAML response and assertion**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/AzureEntraIDSAMLEnterpriseAppCertificates-c4582a0cf51f8dcdae03013810278e00.png" />
</Frame>

After you made sure you are added as a user to the Enterprise Application, log out from Datafold. Click on **Test** under **Test single sign-on with DatafoldSSO**.

## Synchronize user with Datafold \[Optional]

This step is essential if you want to ensure that users from your organization are disabled if they are no longer assigned to the configured Microsoft Entra App.

1. Navigate to App registrations → API permissions.
2. Add the following permissions: `Group.Read.All` and `User.ReadBasic.All`.
   2.1 Click `Add a permission`.
   2.2 Select Microsoft Graph.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/1-e2efd77a0267ffe5f9fb14ef6be44c1f.png" />
</Frame>

2.3 Select application permissions and add the required permissions.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/2-00a764fe8abf4ef520abeaf7ae07d49e.png" />

  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/3-eadbef3fd2f9c1d0326ed8a9721c16c2.png" />
</Frame>

3. Grant admin consent.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/4-40f90f212a27572e669806bc36325bc7.png" />
</Frame>

4. You should now see a <Icon icon="square-check" /> next to the permissions.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/5-257e23569930de31a6168ac10aaf5bf3.png" />
</Frame>

5. Generate a secret so that Datafold can interact with the API.
   5.1 Click `Certificates & secrets`.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/6-015ef3a0d51e4ee205d6bd5d5c888e8d.png" />
</Frame>

5.2 Click `New client secret`.
5.3 Type in a description and click `Add`.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/7-a95118698bae900f1620b47905433fc4.png" />
</Frame>

6. Go to `Datafold` and navigate to **Settings** → **Integrations** → **SSO** → **Add new Integration** and select the Microsoft Entra ID Logo.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/8-bfcf9d1f0679293415dad2a9b7c5ef6c.png" />
</Frame>

7. Paste in the four required fields:<br />
   7.1 Tenant ID - [you can find this in the overview page](https://learn.microsoft.com/en-us/entra/fundamentals/how-to-find-tenant)<br />
   7.2 Navigate to the application overview<br />
   7.3 Copy Object ID and paste it into Principal Id<br />
   7.4 Copy Application ID and paste it into Client Id<br />
   7.5 Copy the secret we created in the previous steps and paste it into Client Secret<br />
   7.6 Click **Save** to create the integration.<br />

If the update is successful, it means that the integration is valid. Users that do not have access to the configured application will be disabled and logged out in at most one hour.


# Okta
Source: https://docs.datafold.com/security/single-sign-on/saml/examples/okta


## Okta as a SAML Identity Provider

You can create an **Application** and use that to configure access to Datafold. Click on **Applications** and **Create App Integration**.

Select **SAML 2.0**

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/okta_saml_2_0-f56d1d05fe14ca913026c4618dc1518b.png" />
</Frame>

Enter "Datafold" in **App name** and click **Next**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/okta_saml_app_name-495773bcc2261378919673c58e49b91b.png" />
</Frame>

Go to `Datafold` and create a new SSO integration. Navigate to **Settings** → **Integrations** → **Add new Integration** → **SAML**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/saml_create-3716c6fe01352ea69c647a7856adf189.png" />
</Frame>

* Copy the read-only field **Service Provider ACS URL** and paste it into **Single sign-on URL**.
* Copy the read-only field **Service Provider Entity ID** and paste it into **Audience URI (SP Entity ID)**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/okta_saml_settings1-a3440ff6356c33c17b630039f9d0401f.jpeg" />
</Frame>

(Optional step) In **Attribute Statements (optional)** add fields:

* Name: `first_name`, Value: `user.firstName`
* Name: `last_name`, Value: `user.lastName`

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/okta_saml_attr_statements-e51a953c5ef2853fdbd6821d07322e9f.png" />
</Frame>

Click **Next** and **Finish**.

Go to `Okta` and copy the **Metadata URL** field from **Datafold** → **Sign On** → **Metadata details**.

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/okta_saml_meta_url-137efd7dd40576337ee02f984c8841bc.png" />
</Frame>

Go back to `Datafold` and paste it into **Identity Provider Metadata URL** field.

Finally, click **Save** to create the integration.

Navigate to **Settings** → **Integrations** → **SSO** → **SAML**.

If everything is correct, the **Identity Provider Metadata XML** field will contain XML.


# null
Source: https://docs.datafold.com/security/single-sign-on/saml/group-provisioning

Automatically sync group membership with your SAML Identity Provider (IdP).

## 1. Create desired groups in the IdP

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/okta_groups-61f1b6cf7b4075477ff1275ceeea6d09.png" />
</Frame>

## 2. Assign the desired users to groups

Assign the relevant users to groups reflecting their roles and permissions.

## 3. Configure the SAML SSO provider

Configure your SAML SSO provider to include a `groups` attribute. This attribute should list all the groups you want to sync.

```Bash
  <saml2:Attribute Name="groups" NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified"><saml2:AttributeValue xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="xs:string">datafold_admin</saml2:AttributeValue><saml2:AttributeValue xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="xs:string">datafold_read_write</saml2:AttributeValue></saml2:Attribute></saml2:AttributeStatement></saml2:Assertion></saml2p:Response>
```

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/saml_groups_attribute-00b426150ceab3149d619b067aee26fc.png" />
</Frame>

## 4. Map IdP groups to Datafold groups

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/datafold_group-f66ae2d5b9f378e444f70d1b5851dfaf.png" />
</Frame>

The `datafold_admin` group, created in the IdP through [step 1](#1-create-desired-groups-in-the-idp), will be automatically synced. Users in this IdP group will also be members of the corresponding group in Datafold.

**Note:** Manual Datafold user group memberships will be overridden upon the user's next login to Datafold. Therefore, group memberships should be managed exclusively within the IdP once the `groups` attribute is configured.

## Example configuration

Here's how you might configure three groups to map to the three default Datafold groups, `admin`, `default` and `viewonly`:

<Frame>
  <img src="https://mintlify.s3.us-west-1.amazonaws.com/datafold/images/datafold_groups-5e7f4e7afb9d99dee113a03b8599040a.png" />
</Frame>


# User Roles and Permissions
Source: https://docs.datafold.com/security/user-roles-and-permissions

Datafold uses role-based access control to manage user permissions and actions.

Datafold has three default roles:

| Role     | Description    | Permissions                                                                                          |
| -------- | -------------- | ---------------------------------------------------------------------------------------------------- |
| default  | Full user role | Create and modify monitors, create diffs, explore data and lineage                                   |
| admin    | Administrator  | Default permissions plus the ability to manage users and configurations such as database connections |
| viewonly | View-only role | View diffs and monitors without the ability to create or modify them                                 |


# FAQ
Source: https://docs.datafold.com/support/faq-redirect


# Support
Source: https://docs.datafold.com/support/support

Datafold offers multiple support channels to assist users with troubleshooting and inquiries.

## Datafold Support

* **Email**: Contact support at [support@datafold.com](mailto:support@datafold.com) for any assistance.
* **In-app Chat**: Reach out directly from the Datafold app via live chat for quick help.
* **Shared Slack Channel**: Collaborate with the Datafold team through a dedicated Slack channel (please inquire with your account executive to set up).
* **FAQ**: Explore our [Frequently Asked Questions](/faq/overview) for detailed answers to common queries and troubleshooting tips.

### Grant access to Datafold's team for troubleshooting

For faster resolution of support issues, you can temporarily grant Datafold Support access to your account. This enables a Datafold team member to view the same in-app context as you, minimizing back-and-forth communication.

To grant access:

1. Navigate to **Settings** → **Org Settings**.
2. Check the box next to *"Allow Datafold access to your account for troubleshooting purposes."*

To revoke access, simply uncheck the box at any time.

<Info>
  **Note:** Admin privileges are required to modify this setting in Org Settings.
</Info>


# Welcome
Source: https://docs.datafold.com/welcome

Datafold is the unified platform proactive data quality that combines automated data testing, data reconciliation, and observability to help data teams prevent data quality issues and accelerate their development velocity.

## Why Datafold?

Datafold automates the most error-prone and time-consuming aspects of the data engineering workflow by **preventing and detecting data quality issues**. In addition to standard observability features like monitoring, profiling, and lineage, we integrate deeply into the development cycle with automated CI/CD testing. This enables data teams to prevent bad code deployments and detect issues upstream of the data warehouse.

Whether it's for [CI/CD testing](deployment-testing/how-it-works) or [data migration automation](data-migration-automation), Datafold ensures data quality at every stage of the data pipeline.

## Key features

Data quality is a complex and multifaceted problem. Datafold’s unified platform helps embed proactive data quality testing in your workflows:

<CardGroup cols={2}>
  <Card title="Data Diffs" href="/data-diff/what-is-data-diff" horizontal>
    Use value-level data diffs to isolate and identify changes in your data. Catch unintended modifications before they disrupt production or downstream data usage.
  </Card>

  <Card title="Data Monitors" href="/data-monitoring/monitor-types" horizontal>
    Create monitors for data diffs, data quality metrics, SQL metrics, SQL rules, and schema changes to send alerts when inconsistencies are detected.
  </Card>

  <Card title="Datafold Migration Agent" href="/data-migration-automation" horizontal>
    Discover how DMA provides full-cycle migration automation with SQL code translation and cross-database validation.
  </Card>

  <Card title="Data Explorer & Column-Level Lineage" href="/data-explorer/how-it-works" horizontal>
    Learn how your data assets move and change across systems with column-level lineage, metadata, and profiles, to track the impacts of changes made upstream.
  </Card>
</CardGroup>

## Use cases

<CardGroup cols={3}>
  <Card title="CI/CD Data Testing" href="" horizontal>
    Catch data quality issues early with automated testing during development and deployment.
  </Card>

  <Card title="Accelerated Data Migrations" href="" horizontal>
    Speed up migrations with our full-cycle migration automation solution for data teams.
  </Card>

  <Card title="Data Monitoring & Observability" href="" horizontal>
    Shift monitoring upstream to proactively prevent disruptions and ensure data quality.
  </Card>
</CardGroup>

## Getting started

There are a few ways to get started with your first data diff:

<Steps>
  <Step title="Create a data diff" stepNumber="1">
    Once you’ve integrated a [data connection](/integrations) and [code repository](/integrations/code-repositories), you can run a new [in-database](/data-diff/in-database-diffing/creating-a-new-data-diff) or [cross-database](/data-diff/cross-database-diffing/creating-a-new-data-diff) data diff or explore your [data lineage](data-explorer/lineage).
  </Step>
</Steps>

<Steps>
  <Step title="Create automated monitors" stepNumber="2">
    Create [monitors](data-monitoring/monitor-types) to send alerts when data diffs fall outside predefined ranges.
  </Step>
</Steps>

<Steps>
  <Step title="Set up CI/CD testing" stepNumber="3">
    Get started with deployment testing through our universal ([No-Code](deployment-testing/getting-started/universal/no-code), [API](deployment-testing/getting-started/universal/api)) or [dbt](integrations/orchestrators/dbt-core) integrations.
  </Step>
</Steps>

## Learn more

Curious to learn more about why and how data quality matters? We wrote a whole guide (with illustrations of medieval castles, moats, and knights) called the [Data Quality Guide](https://www.datafold.com/data-quality-guide) which covers:

* A practical roadmap towards creating a robust data quality system
* Data quality metrics to keep, and metrics to ignore
* Nurturing a strong data quality culture within and beyond data teams