# Get Audit Logs
Source: https://docs.datafold.com/api-reference/audit-logs/get-audit-logs
get /api/v1/audit_logs
Retrieve audit logs for your Datafold organization via the API.
# Create a dbt BI integration
Source: https://docs.datafold.com/api-reference/bi/create-a-dbt-bi-integration
post /api/v1/lineage/bi/dbt/
Create a dbt BI integration for lineage tracking via the Datafold API.
# Create a Hightouch integration
Source: https://docs.datafold.com/api-reference/bi/create-a-hightouch-integration
post /api/v1/lineage/bi/hightouch/
Create a Hightouch integration for lineage tracking via the Datafold API.
# Create a Looker integration
Source: https://docs.datafold.com/api-reference/bi/create-a-looker-integration
post /api/v1/lineage/bi/looker/
Create a Looker BI integration for lineage tracking via the Datafold API.
# Create a Mode Analytics integration
Source: https://docs.datafold.com/api-reference/bi/create-a-mode-analytics-integration
post /api/v1/lineage/bi/mode/
Create a Mode Analytics BI integration for lineage tracking via the Datafold API.
# Create a Power BI integration
Source: https://docs.datafold.com/api-reference/bi/create-a-power-bi-integration
/openapi-public.json post /api/v1/lineage/bi/powerbi/
# Create a Tableau integration
Source: https://docs.datafold.com/api-reference/bi/create-a-tableau-integration
post /api/v1/lineage/bi/tableau/
Create a Tableau BI integration for lineage tracking via the Datafold API.
# Get an integration
Source: https://docs.datafold.com/api-reference/bi/get-an-integration
get /api/v1/lineage/bi/{bi_datasource_id}/
Retrieve details of a specific BI integration by ID via the Datafold API.
# List all integrations
Source: https://docs.datafold.com/api-reference/bi/list-all-integrations
get /api/v1/lineage/bi/
List all BI integrations configured in your Datafold organization via the API.
# Remove an integration
Source: https://docs.datafold.com/api-reference/bi/remove-an-integration
delete /api/v1/lineage/bi/{bi_datasource_id}/
Remove a BI integration by ID via the Datafold API.
# Sync a BI integration
Source: https://docs.datafold.com/api-reference/bi/sync-a-bi-integration
get /api/v1/lineage/bi/{bi_datasource_id}/sync/
Trigger a sync for a specific BI integration via the Datafold API.
# Update a dbt BI integration
Source: https://docs.datafold.com/api-reference/bi/update-a-dbt-bi-integration
put /api/v1/lineage/bi/dbt/{bi_datasource_id}/
Update an existing dbt BI integration via the Datafold API.
# Update a Hightouch integration
Source: https://docs.datafold.com/api-reference/bi/update-a-hightouch-integration
put /api/v1/lineage/bi/hightouch/{bi_datasource_id}/
Update an existing Hightouch integration via the Datafold API.
# Update a Looker integration
Source: https://docs.datafold.com/api-reference/bi/update-a-looker-integration
put /api/v1/lineage/bi/looker/{bi_datasource_id}/
Update an existing Looker BI integration via the Datafold API.
# Update a Mode Analytics integration
Source: https://docs.datafold.com/api-reference/bi/update-a-mode-analytics-integration
put /api/v1/lineage/bi/mode/{bi_datasource_id}/
Update an existing Mode Analytics BI integration via the Datafold API.
# Update a Power BI integration
Source: https://docs.datafold.com/api-reference/bi/update-a-power-bi-integration
/openapi-public.json put /api/v1/lineage/bi/powerbi/{bi_datasource_id}/
Updates the integration configuration. Returns the integration with changed fields.
# Update a Tableau integration
Source: https://docs.datafold.com/api-reference/bi/update-a-tableau-integration
put /api/v1/lineage/bi/tableau/{bi_datasource_id}/
Update an existing Tableau BI integration via the Datafold API.
# Get Org Spend
Source: https://docs.datafold.com/api-reference/bolt/get-org-spend
/openapi-public.json get /api/internal/bolt/org/spend
Daily/monthly LLM spend + effective caps for the caller's org.
# List CI runs
Source: https://docs.datafold.com/api-reference/ci/list-ci-runs
get /api/v1/ci/{ci_config_id}/runs
List all CI runs for a given CI configuration via the Datafold API.
# Trigger a PR/MR run
Source: https://docs.datafold.com/api-reference/ci/trigger-a-prmr-run
post /api/v1/ci/{ci_config_id}/trigger
Trigger a PR/MR diff run for a CI configuration via the Datafold API.
# Upload PR/MR changes
Source: https://docs.datafold.com/api-reference/ci/upload-prmr-changes
post /api/v1/ci/{ci_config_id}/{pr_num}
Upload PR/MR changes for a specific pull request via the Datafold API.
# Cancel a running data diff
Source: https://docs.datafold.com/api-reference/data-diffs/cancel-a-running-data-diff
/openapi-public.json post /api/v1/datadiffs/{datadiff_id}/cancel
Cancels a data diff that is currently queued or running.
This operation stops the diff execution and marks it as cancelled. If the diff has already
completed or been cancelled, this operation has no effect and returns the current status.
Use this to stop long-running diffs that are no longer needed or were started with incorrect parameters.
# Create a data diff
Source: https://docs.datafold.com/api-reference/data-diffs/create-a-data-diff
post /api/v1/datadiffs
Create a new data diff to compare datasets via the Datafold API.
# Get a data diff
Source: https://docs.datafold.com/api-reference/data-diffs/get-a-data-diff
get /api/v1/datadiffs/{datadiff_id}
Retrieve details of a specific data diff by ID via the Datafold API.
# Get a data diff summary
Source: https://docs.datafold.com/api-reference/data-diffs/get-a-data-diff-summary
get /api/v1/datadiffs/{datadiff_id}/summary_results
Get the summary results of a specific data diff via the Datafold API.
# Get a human-readable summary of a DataDiff comparison
Source: https://docs.datafold.com/api-reference/data-diffs/get-a-human-readable-summary-of-a-datadiff-comparison
/openapi-public.json get /api/v1/datadiffs/{datadiff_id}/summary
Retrieves a comprehensive, human-readable summary of a completed data diff.
This endpoint provides the most useful information for understanding diff results:
- Overall status and result (success/failure)
- Human-readable feedback explaining the differences found
- Key statistics (row counts, differences, match rates)
- Configuration details (tables compared, primary keys used)
- Error messages if the diff failed
Use this after a diff completes to get actionable insights. For diffs still running,
check status with get_datadiff first.
# List data diffs
Source: https://docs.datafold.com/api-reference/data-diffs/list-data-diffs
get /api/v1/datadiffs
List all data diffs in your Datafold organization via the API.
# Update a data diff
Source: https://docs.datafold.com/api-reference/data-diffs/update-a-data-diff
patch /api/v1/datadiffs/{datadiff_id}
Update the configuration of an existing data diff via the Datafold API.
# Create a data source
Source: https://docs.datafold.com/api-reference/data-sources/create-a-data-source
post /api/v1/data_sources
Create a new data source connection via the Datafold API.
# Execute a SQL query against a data source
Source: https://docs.datafold.com/api-reference/data-sources/execute-a-sql-query-against-a-data-source
/openapi-public.json post /api/v1/data_sources/{data_source_id}/query
Executes a SQL query against the specified data source and returns the results.
This endpoint allows you to run ad-hoc SQL queries for data exploration, validation, or analysis.
The query is executed using the data source's native query runner with the appropriate credentials.
**Streaming mode**: Use query parameter `?stream=true` or set `X-Stream-Response: true` header.
Streaming is only supported for certain data sources (e.g., Databricks).
When streaming, results are sent incrementally as valid JSON for memory efficiency.
Returns:
- Query results as rows with column metadata (name, type, description)
- Limited to a reasonable number of rows for performance
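As a rough illustration of the two modes, the sketch below builds (but does not send) a request for this endpoint using only the Python standard library. The helper name `build_query_request` is hypothetical, and the JSON body shape `{"query": ...}` is an assumption that should be checked against the OpenAPI schema; the path, the `?stream=true` parameter, and the `Authorization: Key ...` header come from this reference.

```python
import json
import urllib.parse
import urllib.request

DEFAULT_HOST = "https://app.datafold.com"


def build_query_request(data_source_id, sql, api_key, stream=False, host=DEFAULT_HOST):
    """Build a POST request for the query endpoint without sending it."""
    url = f"{host.rstrip('/')}/api/v1/data_sources/{data_source_id}/query"
    if stream:
        # Streaming can be requested through the query parameter.
        url += "?" + urllib.parse.urlencode({"stream": "true"})
    # ASSUMPTION: the request body field is named "query"; verify in the schema.
    body = json.dumps({"query": sql}).encode("utf-8")
    headers = {
        "Authorization": f"Key {api_key}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, method="POST", headers=headers)
```

Passing the result to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the URL and headers easy to inspect first.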
# Get a data source
Source: https://docs.datafold.com/api-reference/data-sources/get-a-data-source
get /api/v1/data_sources/{data_source_id}
Retrieve details of a specific data source by ID via the Datafold API.
# Get a data source summary
Source: https://docs.datafold.com/api-reference/data-sources/get-a-data-source-summary
get /api/v1/data_sources/{data_source_id}/summary
Get a summary of a specific data source by ID via the Datafold API.
# Get data source testing results
Source: https://docs.datafold.com/api-reference/data-sources/get-data-source-testing-results
get /api/v1/data_sources/test/{job_id}
Retrieve the testing results for a data source connection via the Datafold API.
# List data source types
Source: https://docs.datafold.com/api-reference/data-sources/list-data-source-types
get /api/v1/data_sources/types
List all supported data source types available in the Datafold API.
# List data sources
Source: https://docs.datafold.com/api-reference/data-sources/list-data-sources
get /api/v1/data_sources
List all configured data sources in your Datafold organization via the API.
# Test a data source connection
Source: https://docs.datafold.com/api-reference/data-sources/test-a-data-source-connection
post /api/v1/data_sources/{data_source_id}/test
Test the connection of a specific data source via the Datafold API.
# Datafold API
Source: https://docs.datafold.com/api-reference/datafold-api
Datafold REST API reference for programmatic access to data diffs, data sources, CI runs, monitors, BI integrations, and more.
# Datafold SDK
Source: https://docs.datafold.com/api-reference/datafold-sdk
Use the Datafold SDK for programmatic access to data diffs, CI artifact uploads, and integration with your data pipelines.
The Datafold SDK allows you to accomplish certain actions using a thin programmatic wrapper around the Datafold REST API, in particular:
* **Custom CI Integrations**: Submitting information to Datafold about what tables to diff in CI
* **dbt CI Integrations**: Submitting dbt artifacts via CI runner
* **dbt development**: Kicking off data diffs from the command line while developing in your dbt project
## Install
First, create and activate your virtual environment for Python:
```
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel
```
Now, you're ready to install the Datafold SDK:
```
pip install datafold-sdk
```
#### CLI environment variables
To use the Datafold CLI, you need to set up some environment variables:
```bash theme={null}
export DATAFOLD_API_KEY=XXXXXXXXX
```
If your Datafold app URL is different from the default `app.datafold.com`, set the custom domain as the variable:
```bash theme={null}
export DATAFOLD_HOST=
```
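A Python script can read the same variables before calling the SDK. The following is a minimal sketch, not part of the SDK: `load_settings` is a hypothetical helper that fails fast when `DATAFOLD_API_KEY` is missing and returns `None` for the host when `DATAFOLD_HOST` is unset, so that the default app URL applies.

```python
import os
from typing import Mapping, Optional, Tuple


def load_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, Optional[str]]:
    """Return (api_key, host) read from the environment (hypothetical helper)."""
    api_key = env.get("DATAFOLD_API_KEY")
    if not api_key:
        raise RuntimeError("DATAFOLD_API_KEY is not set")
    # An empty or missing DATAFOLD_HOST means "use the default app URL".
    host = env.get("DATAFOLD_HOST") or None
    return api_key, host
```

Accepting the environment as a parameter keeps the helper easy to exercise with a plain dictionary instead of real environment variables.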
## Custom CI Integrations
Please follow [our CI orchestration docs](../integrations/orchestrators/custom-integrations) to set up a custom CI integration leveraging the Datafold SDK.
## dbt Core CI Integrations
When you set up Datafold CI diffing for a dbt Core project, we rely on the submission of `manifest.json` files to represent the production and staging versions of your dbt project.
Please see our detailed docs on how to [set up Datafold in CI for dbt Core](../integrations/orchestrators/dbt-core), and reach out to our team if you have questions.
#### CLI
```bash theme={null}
datafold dbt upload \
--ci-config-id \
--run-type \
--target-folder \
--commit-sha
```
#### Python
```python theme={null}
import os
from datafold_sdk.sdk.dbt import submit_artifacts
api_key = os.environ.get('DATAFOLD_API_KEY')
# only needed if your Datafold app url is not app.datafold.com
host = os.environ.get("DATAFOLD_HOST")
submit_artifacts(host=host,
api_key=api_key,
ci_config_id=,
run_type='',
target_folder='',
commit_sha='')
```
## Diffing dbt models in development
It can be beneficial to diff between two dbt environments before opening a pull request. This can be done using the Datafold SDK from the command line:
```bash theme={null}
datafold diff dbt
```
That command will compare data between your development and production environments. By default, all models that were built in the previous `dbt run` or `dbt build` command will be compared.
### Running Data Diffs before opening a pull request
It can be helpful to view Data Diff results in your ticket before creating a pull request. This enables faster code reviews by letting developers QA changes earlier.
To do this, you can create a draft PR and run the following command:
```
dbt run && datafold diff dbt
```
This executes dbt locally and triggers a Data Diff to preview data changes without committing to Git. To automate this workflow, see our guide [here](/faq/datafold-with-dbt#can-i-run-data-diffs-before-opening-a-pr).
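If you prefer to drive the same two commands from a script, the `&&` behavior (stop as soon as a command fails) could be sketched as follows. `run_in_order` and `COMMANDS` are hypothetical names; only the two command lines themselves come from this page.

```python
import subprocess
from typing import Callable, List, Sequence

# The two commands from the text, in the order they must run.
COMMANDS: List[List[str]] = [["dbt", "run"], ["datafold", "diff", "dbt"]]


def run_in_order(
    commands: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Run each command in turn; return the first non-zero exit code, else 0."""
    for command in commands:
        exit_code = runner(command)
        if exit_code != 0:
            # Mirror shell `&&`: do not run later commands after a failure.
            return exit_code
    return 0
```

Calling `run_in_order(COMMANDS)` from the dbt project directory runs both steps. The `runner` parameter exists so the sequencing can be checked without invoking dbt.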
### Update your dbt\_project.yml with configurations
#### Option 1: Add variables to the `dbt_project.yml`
```yaml theme={null}
# dbt_project.yml
vars:
  data_diff:
    prod_database: my_default_database # default database for the prod target
    prod_schema: my_default_schema # default schema for the prod target
    prod_custom_schema: PROD_<custom_schema> # Optional: see details below
```
**Additional schema variable details**
The value for `prod_custom_schema:` will vary based on how you have set up dbt.
This variable is used when a model has a custom schema and becomes ***dynamic*** when the string literal `<custom_schema>` is present. The `<custom_schema>` substring is replaced with the model's custom schema, which supports the various ways schema name generation can be overridden in dbt, also referred to as "advanced custom schemas".
**Examples (not exhaustive)**
**Single production schema**
*If your prod environment looks like this ...*
```bash theme={null}
PROD.ANALYTICS
```
*... your data-diff configuration should look like this:*
```yaml theme={null}
vars:
  data_diff:
    prod_database: PROD
    prod_schema: ANALYTICS
```
**Some custom schemas in production with a prefix like "prod\_"**
*If your prod environment looks like this ...*
```bash theme={null}
PROD.ANALYTICS
PROD.PROD_MARKETING
PROD.PROD_SALES
```
*... your data-diff configuration should look like this:*
```yaml theme={null}
vars:
  data_diff:
    prod_database: PROD
    prod_schema: ANALYTICS
    prod_custom_schema: PROD_<custom_schema>
```
**Some custom schemas in production with no prefix**
*If your prod environment looks like this ...*
```yaml theme={null}
PROD.ANALYTICS
PROD.MARKETING
PROD.SALES
```
*... your data-diff configuration should look like this:*
```yaml theme={null}
vars:
  data_diff:
    prod_database: PROD
    prod_schema: ANALYTICS
    prod_custom_schema: <custom_schema>
```
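The substitution behind these three examples can be sketched in a few lines of Python. This is an illustration, not the SDK's actual code: `resolve_prod_schema` is a hypothetical function, the placeholder literal is assumed to be `<custom_schema>`, and falling back to `prod_schema` when a model has a custom schema but `prod_custom_schema` is unset is also an assumption.

```python
from typing import Optional

PLACEHOLDER = "<custom_schema>"  # assumed placeholder literal


def resolve_prod_schema(
    prod_schema: str,
    prod_custom_schema: Optional[str],
    model_custom_schema: Optional[str],
) -> str:
    """Pick the production schema to compare a model against (illustrative only)."""
    if not model_custom_schema:
        # Models without a custom schema live in the default prod schema.
        return prod_schema
    if not prod_custom_schema:
        # ASSUMPTION: with no template configured, fall back to the default.
        return prod_schema
    # Swap the placeholder for the model's own custom schema.
    return prod_custom_schema.replace(PLACEHOLDER, model_custom_schema)
```

With the prefixed configuration, a model whose custom schema is `MARKETING` resolves to `PROD_MARKETING`; with the unprefixed configuration, it resolves to `MARKETING`.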
#### Option 2: Specify a production `manifest.json` using `--state`
**Using the `--state` option is highly recommended for dbt projects with multiple target database and schema configurations. For example, if you customized the [`generate_schema_name`](https://docs.getdbt.com/docs/build/custom-schemas#understanding-custom-schemas) macro, this is the best option for you.**
> Note: `dbt ls` is preferred over `dbt compile` as it runs faster and data diffing does not require fully compiled models to work.
```bash theme={null}
dbt ls -t prod # compile a manifest.json using the "prod" target
mv target/manifest.json prod_manifest.json # move the file up a directory and rename it to prod_manifest.json
dbt run # run your entire dbt project or only a subset of models with `dbt run --select `
data-diff --dbt --state prod_manifest.json # run data-diff to compare your development results to the production database/schema results in the prod manifest
```
#### Add your Datafold data connection integration ID to your dbt\_project.yml
To connect to your database, navigate to **Settings** → **Integrations** → **Data connections**, click **Add new integration**, and follow the prompts.
After you **Test and Save**, add the ID (which can be found on Integrations > Data connections) to your **dbt\_project.yml**.
```yaml theme={null}
# dbt_project.yml
vars:
  data_diff:
    ...
    datasource_id:
```
The following optional arguments are available:
| Options | Description |
| ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `--version` | Print version info and exit. |
| `-w, --where EXPR` | An additional 'where' expression to restrict the search space. Beware of SQL Injection! |
| `--dbt-profiles-dir PATH` | Which directory to look in for the `profiles.yml` file. If not set, we follow the default `profiles.yml` location for the dbt version being used. Can also be set via the `DBT_PROFILES_DIR` environment variable. |
| `--dbt-project-dir PATH` | Which directory to look in for the `dbt_project.yml` file. Default is the current working directory and its parents. |
| `--select SELECTION or MODEL_NAME` | Select dbt resources to compare using dbt selection syntax in dbt versions >= 1.5. In versions \< 1.5, it will naively search for a model with `MODEL_NAME` as the name. |
| `--state PATH` | Specify manifest to utilize for 'prod' comparison paths instead of using configuration. |
| `-pd, --prod-database TEXT` | Override the dbt production database configuration within `dbt_project.yml`. |
| `-ps, --prod-schema TEXT` | Override the dbt production schema configuration within `dbt_project.yml`. |
| `--help` | Show this message and exit. |
# Get column downstreams
Source: https://docs.datafold.com/api-reference/explore/get-column-downstreams
/openapi-public.json get /api/v1/explore/db/{data_connection_id}/columns/{column_path}/downstreams
Retrieve a list of columns or tables which depend on the given column.
# Get column upstreams
Source: https://docs.datafold.com/api-reference/explore/get-column-upstreams
/openapi-public.json get /api/v1/explore/db/{data_connection_id}/columns/{column_path}/upstreams
Retrieve a list of columns or tables which the given column depends on.
# Get table downstreams
Source: https://docs.datafold.com/api-reference/explore/get-table-downstreams
/openapi-public.json get /api/v1/explore/db/{data_connection_id}/tables/{table_path}/downstreams
Retrieve a list of tables which depend on the given table.
# Get table upstreams
Source: https://docs.datafold.com/api-reference/explore/get-table-upstreams
/openapi-public.json get /api/v1/explore/db/{data_connection_id}/tables/{table_path}/upstreams
Retrieve a list of tables which the given table depends on.
# Introduction
Source: https://docs.datafold.com/api-reference/introduction
Get started with the Datafold REST API. Learn how to authenticate, obtain an API key, and make your first API call.
Our REST API allows you to interact with Datafold programmatically. To use it, you'll need an API key. Follow the instructions below to get started.
## Create an API Key
Open the Datafold app, visit Settings > Account, and select **Create API Key**.
Store your API key somewhere safe. If you lose it, you'll need to generate a new one.
For CI pipelines, dbt jobs, scripts, and other automation, prefer a [service account](/security/service-accounts) over a personal API key. Service accounts are machine identities owned by your organization — they survive when the original creator leaves the team and can hold many more keys (50 vs. 5 per human user).
## Use your API Key
When making requests to the Datafold API, you'll need to include the API key as a header in your HTTP request for authentication. The header should be named `Authorization`, and the value should be in the format:
```
Authorization: Key {API_KEY}
```
For example, if you're using cURL:
```bash theme={null}
curl https://app.datafold.com/api/v1/... -H "Authorization: Key {API_KEY}"
```
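The same header can be set from Python with the standard library alone. The sketch below is one possible way to do it, not an official client: `build_request` is a hypothetical helper, and it only constructs the request so you can inspect it before sending.

```python
import urllib.request

DEFAULT_HOST = "https://app.datafold.com"


def build_request(path: str, api_key: str, host: str = DEFAULT_HOST) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request."""
    url = host.rstrip("/") + "/" + path.lstrip("/")
    # The header name and "Key " prefix follow the format shown above.
    return urllib.request.Request(url, headers={"Authorization": f"Key {api_key}"})


# Example: a request for the "List data sources" endpoint.
request = build_request("/api/v1/data_sources", api_key="YOUR_API_KEY")
```

To send it, pass `request` to `urllib.request.urlopen` and decode the response body as JSON.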
## Datafold SDK
Rather than hit our REST API endpoints directly, we offer a convenient Python SDK for common development and deployment testing workflows. You can find more information about our SDK [here](/api-reference/datafold-sdk).
## Need help?
If you have any questions about how to use our REST API, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).
# MCP Server
Source: https://docs.datafold.com/api-reference/mcp-server-setup
Connect AI assistants to Datafold using the Model Context Protocol
## Overview
Datafold provides a public HTTP MCP (Model Context Protocol) server that enables AI assistants to interact with your Datafold data sources, run queries, and manage data diffs.
**Endpoint:** `https://app.datafold.com/mcp/`
## Prerequisites
Before setting up the MCP server, you need a Datafold API key.
Open the Datafold app and go to **Settings > Account**
Click **Create API Key** and store it securely
If you lose your API key, you'll need to generate a new one. See the [Introduction guide](/api-reference/introduction) for more details.
## Authentication
The MCP server authenticates requests with your Datafold API key, sent in the `Authorization` header with the `Key` prefix (not `Bearer`):
```
Authorization: Key YOUR_API_KEY
```
Include this header in your MCP client configuration.
***
## Setup by Client
Claude Desktop supports MCP servers through its developer configuration file.
**Steps:**
1. Open Claude Desktop
2. Go to **Settings > Developer > Edit Config**
3. Add the following to your list of MCP servers in `claude_desktop_config.json`:
```json theme={null}
{
  "mcpServers": {
    "datafold": {
      "command": "mcp-remote",
      "args": [
        "https://app.datafold.com/mcp/",
        "--header",
        "Authorization: Key YOUR_API_KEY"
      ]
    }
  }
}
```
4. Save the file and restart Claude Desktop
You need to have `mcp-remote` installed. Install it with `npm install -g mcp-remote` if you haven't already.
Claude Code supports MCP through a simple CLI command.
**Quick Setup:**
Run the following command in your terminal:
```bash theme={null}
claude mcp add --transport http --scope user \
datafold https://app.datafold.com/mcp/ \
--header "Authorization: Key YOUR_API_KEY"
```
**Options:**
* `--scope user` - Makes the server available across all projects
* `--header` - Adds authentication header
**Verify Installation:**
```bash theme={null}
claude mcp list
```
The Datafold MCP server will now be available in all your Claude Code sessions.
See the [Claude Code MCP documentation](https://code.claude.com/docs/en/mcp) for more configuration options.
Cursor supports MCP servers through project-specific or global configuration.
**Project-Specific Setup:**
Create `.cursor/mcp.json` in your project directory:
```json theme={null}
{
  "mcpServers": {
    "datafold": {
      "type": "http",
      "url": "https://app.datafold.com/mcp/",
      "headers": {
        "Authorization": "Key YOUR_API_KEY"
      }
    }
  }
}
```
Restart Cursor to load the configuration.
Cline is a VS Code extension that supports MCP servers.
**Steps:**
1. Install the Cline extension
2. Click the **MCP Servers** icon in Cline's navigation
3. Select **Configure** tab → **Advanced MCP Settings**
4. Add to `cline_mcp_settings.json`:
```json theme={null}
{
  "mcpServers": {
    "datafold": {
      "type": "http",
      "url": "https://app.datafold.com/mcp/",
      "headers": {
        "Authorization": "Key YOUR_API_KEY"
      },
      "timeout": 60000
    }
  }
}
```
5. Toggle the switch to enable the server
6. Close and reopen VS Code
Windsurf supports environment variable interpolation for secure credential storage.
**Configuration File:** `~/.codeium/windsurf/mcp_config.json`
**Basic Configuration:**
```json theme={null}
{
  "mcpServers": {
    "datafold": {
      "type": "http",
      "serverUrl": "https://app.datafold.com/mcp/",
      "headers": {
        "Authorization": "Key YOUR_API_KEY"
      }
    }
  }
}
```
**Using Environment Variables (Recommended):**
```json theme={null}
{
  "mcpServers": {
    "datafold": {
      "type": "http",
      "serverUrl": "https://app.datafold.com/mcp/",
      "headers": {
        "Authorization": "Key ${env:DATAFOLD_API_KEY}"
      }
    }
  }
}
```
Set `DATAFOLD_API_KEY` as an environment variable, then restart Windsurf.
Continue.dev supports MCP in agent mode through YAML configuration.
**Configuration Directory:** `.continue/mcpServers/`
Create `datafold.yaml`:
```yaml theme={null}
type: streamable-http
url: https://app.datafold.com/mcp/
headers:
  Authorization: "Key YOUR_API_KEY"
```
MCP can only be used in agent mode within Continue.dev.
Zed supports MCP through custom configuration in settings.
**Steps:**
1. Open Zed
2. Go to **Preferences > Settings** (`⌘,` on macOS)
3. Add a `context_servers` section:
```json theme={null}
{
  "context_servers": {
    "datafold": {
      "settings": {
        "url": "https://app.datafold.com/mcp/",
        "headers": {
          "Authorization": "Key YOUR_API_KEY"
        }
      }
    }
  }
}
```
4. Check the Agent Panel - the indicator should be green when active
OpenCode supports both local and remote MCP servers with automatic OAuth handling.
**Configuration Files:**
* Global: `~/.config/opencode/opencode.json`
* Project: `opencode.json` in project root (overrides global)
**Remote MCP Configuration:**
```json theme={null}
{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "datafold": {
      "type": "remote",
      "url": "https://app.datafold.com/mcp/",
      "headers": {
        "Authorization": "Key YOUR_API_KEY"
      },
      "enabled": true
    }
  }
}
```
**CLI Setup:**
You can also use the interactive CLI to add the server:
```bash theme={null}
opencode mcp add
```
Then follow the prompts to configure the Datafold remote MCP server.
See the [OpenCode MCP documentation](https://opencode.ai/docs/mcp-servers/) for advanced configuration.
Gemini CLI supports MCP servers with built-in OAuth 2.0 authentication.
**Configuration Files:**
* Global: `~/.gemini/settings.json`
* Project: `.gemini/settings.json` in project directory
**Add via CLI (Recommended):**
```bash theme={null}
gemini mcp add datafold --scope user \
--transport http \
--url https://app.datafold.com/mcp/ \
--header "Authorization: Key YOUR_API_KEY"
```
**Manual Configuration:**
Edit your settings file:
```json theme={null}
{
  "mcpServers": {
    "datafold": {
      "transport": "http",
      "url": "https://app.datafold.com/mcp/",
      "headers": {
        "Authorization": "Key YOUR_API_KEY"
      }
    }
  }
}
```
See the [Gemini CLI MCP documentation](https://geminicli.com/docs/tools/mcp-server/) for OAuth and advanced features.
Kiro supports both global and project-specific MCP configuration.
**Global Configuration:**
Edit `~/.kiro/settings/mcp.json`:
```json theme={null}
{
  "mcpServers": {
    "datafold": {
      "url": "https://app.datafold.com/mcp/",
      "headers": {
        "Authorization": "Key YOUR_API_KEY"
      }
    }
  }
}
```
**Project-Specific Configuration:**
Edit `.kiro/settings/mcp.json` in your project directory with the same format.
**Using Environment Variables (Recommended):**
```json theme={null}
{
  "mcpServers": {
    "datafold": {
      "url": "https://app.datafold.com/mcp/",
      "headers": {
        "Authorization": "Key ${API_TOKEN}"
      }
    }
  }
}
```
Changes apply automatically when you save the file.
See the [Kiro MCP documentation](https://kiro.dev/docs/mcp/configuration/) for advanced options.
***
## Verification
After configuring your MCP client, verify the connection:
1. Check that the server status shows as "active" or "connected"
2. Test with a simple query:
* "List my Datafold data sources"
* "Run a query against \[data source name]"
If successful, your AI assistant can now interact with Datafold through MCP.
***
## Troubleshooting
**Symptoms:** "Unable to connect to MCP server" or "No valid session ID"
**Solutions:**
* Verify your API key is valid
* Confirm the URL is exactly `https://app.datafold.com/mcp/`
* Check header format: `Authorization: Key YOUR_API_KEY`
* Regenerate your API key if issues persist
**Symptoms:** "401 Unauthorized" or "403 Forbidden"
**Solutions:**
* Generate a new API key if needed
* Verify the header uses `Key` prefix (not `Bearer`)
* Check your API key hasn't been revoked
**Symptoms:** MCP server doesn't appear or shows as inactive
**Solutions:**
* Restart your MCP client completely
* Validate JSON/YAML syntax
* Check client logs for specific errors
* Ensure configuration file is in the correct location
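Several of these checks can be automated before restarting a client. The sketch below is a hypothetical helper, `check_mcp_config`, that covers only JSON configurations using the `mcpServers` plus `headers` layout shown for Cursor, Cline, Windsurf, Gemini CLI, and Kiro. It does not apply to the Claude Desktop, Zed, OpenCode, or Continue.dev formats, and it does not expand environment-variable references.

```python
import json
from typing import List

EXPECTED_URL = "https://app.datafold.com/mcp/"


def check_mcp_config(text: str, server_name: str = "datafold") -> List[str]:
    """Return a list of problems found in an mcpServers-style JSON config."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    servers = config.get("mcpServers") if isinstance(config, dict) else None
    server = servers.get(server_name) if isinstance(servers, dict) else None
    if not isinstance(server, dict):
        return [f"no '{server_name}' entry under mcpServers"]
    problems = []
    # Windsurf uses "serverUrl"; the other listed clients use "url".
    url = server.get("url") or server.get("serverUrl")
    if url != EXPECTED_URL:
        problems.append(f"unexpected URL: {url!r}")
    headers = server.get("headers")
    auth = headers.get("Authorization", "") if isinstance(headers, dict) else ""
    if not isinstance(auth, str) or not auth.startswith("Key "):
        problems.append("Authorization header must start with 'Key '")
    return problems
```

An empty list means the file parsed and the URL and header prefix match what this page documents; it says nothing about whether the key itself is valid.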
***
## Best Practices
Never commit API keys to version control. Use environment variables or secure secret management.
Reference API keys through environment variables rather than hardcoding them in configuration files.
Regularly check Datafold audit logs to monitor API usage and detect anomalies.
Periodically rotate your API keys as part of security best practices.
***
## Support
If you need assistance:
* Check the [Datafold documentation](https://docs.datafold.com)
* Review your MCP client's documentation
* Contact support via Slack, in-app chat, or [support@datafold.com](mailto:support@datafold.com)
# Create a Data Diff Monitor
Source: https://docs.datafold.com/api-reference/monitors/create-a-data-diff-monitor
/openapi-public.json post /api/v1/monitors/create/diff
# Create a Data Test Monitor
Source: https://docs.datafold.com/api-reference/monitors/create-a-data-test-monitor
/openapi-public.json post /api/v1/monitors/create/test
# Create a Metric Monitor
Source: https://docs.datafold.com/api-reference/monitors/create-a-metric-monitor
/openapi-public.json post /api/v1/monitors/create/metric
# Create a Schema Change Monitor
Source: https://docs.datafold.com/api-reference/monitors/create-a-schema-change-monitor
/openapi-public.json post /api/v1/monitors/create/schema
# Delete a Monitor
Source: https://docs.datafold.com/api-reference/monitors/delete-a-monitor
/openapi-public.json delete /api/v1/monitors/{id}
# Get Monitor
Source: https://docs.datafold.com/api-reference/monitors/get-monitor
/openapi-public.json get /api/v1/monitors/{id}
# Get Monitor Run
Source: https://docs.datafold.com/api-reference/monitors/get-monitor-run
/openapi-public.json get /api/v1/monitors/{id}/runs/{run_id}
# List Monitor Runs
Source: https://docs.datafold.com/api-reference/monitors/list-monitor-runs
/openapi-public.json get /api/v1/monitors/{id}/runs
# List Monitors
Source: https://docs.datafold.com/api-reference/monitors/list-monitors
/openapi-public.json get /api/v1/monitors
# Toggle a Monitor
Source: https://docs.datafold.com/api-reference/monitors/toggle-a-monitor
/openapi-public.json put /api/v1/monitors/{id}/toggle
# Trigger a run
Source: https://docs.datafold.com/api-reference/monitors/trigger-a-run
/openapi-public.json post /api/v1/monitors/{id}/run
# Update a Monitor
Source: https://docs.datafold.com/api-reference/monitors/update-a-monitor
/openapi-public.json patch /api/v1/monitors/{id}/update
# Best Practices
Source: https://docs.datafold.com/data-diff/cross-database-diffing/best-practices
When dealing with large datasets, it's crucial to approach diffing with specific optimization strategies in mind. We share best practices that will help you get the most accurate and efficient results from your data diffs.
## Enable sampling
[Sampling](/data-diff/cross-database-diffing/creating-a-new-data-diff#row-sampling) can be helpful when diffing between extremely large datasets as it can result in a speedup of 2x to 20x or more. The extent of the speedup depends on various factors, including the scale of the data, instance sizes, and the number of data columns.
The following table illustrates the speedup achieved with sampling in different databases, varying instance sizes, and different numbers of data columns:
| Databases | vCPU | RAM, GB | Rows | Columns | Time full | Time sampled | Speedup | RDS type | Diff full | Diff sampled | Per-col noise |
| :-----------------: | :--: | :-----: | :-------: | :-----: | :-------: | :----------: | :-----: | :-----------: | :-------: | :----------: | :-----------: |
| Oracle vs Snowflake | 2 | 2 | 1,000,000 | 1 | 0:00:33 | 0:00:27 | 1.22 | db.t3.small | 5399 | 5400 | 0 |
| Oracle vs Snowflake | 8 | 32 | 1,000,000 | 1 | 0:07:23 | 0:00:18 | 24.61 | db.m5.2xlarge | 5422 | 5423 | 0.005 |
| MySQL vs Snowflake | 2 | 8 | 1,000,000 | 1 | 0:00:57 | 0:00:24 | 2.38 | db.m5.large | 5409 | 5413 | 0 |
| MySQL vs Snowflake | 2 | 8 | 1,000,000 | 29 | 0:40:00 | 0:02:14 | 17.91 | db.m5.large | 5412 | 5411 | 0 |
When sampling is enabled, Datafold compares a randomly chosen subset of the data. Sampling trades diff detail for lower time and cost. For most use cases, sampling does not reduce the informational value of data diffs, because it still provides the magnitude and specific examples of differences (e.g., if 10% of sampled rows show discrepancies, a similar proportion of differences is likely across the entire dataset).
Although configuring sampling can seem overwhelming at first, a good rule of thumb is to select an initial value of 95% for the sampling confidence and adjust it as needed. Tweaking the parameters can be helpful to see how they impact the sample size and the tradeoff between performance and accuracy.
## Handling data type differences
Datafold automatically manages data type differences during cross-database diffing. For example, when comparing decimals with different precisions (e.g., `DECIMAL(38,15)` in SQL Server and `DECIMAL(38,19)` in Snowflake), Datafold automatically casts values to a common precision before comparison, flagging any differences appropriately. Similarly, for timestamps with different precisions (e.g., milliseconds in SQL Server and nanoseconds in Snowflake), Datafold adjusts the precision as needed for accurate comparisons, simplifying the diffing process.
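As a rough illustration of what "casting to a common precision" means, the sketch below rounds two decimal values to the smaller of their two scales before comparing them. It is not Datafold's implementation: the helper names, the choice of the lower scale, and the rounding mode are all assumptions made for the example.

```python
from decimal import Decimal, ROUND_HALF_UP, localcontext


def to_common_scale(value: str, scale: int) -> Decimal:
    """Round a decimal string to `scale` fractional digits."""
    with localcontext() as ctx:
        ctx.prec = 40  # assumption: enough working precision for DECIMAL(38, s)
        quantum = Decimal(1).scaleb(-scale)  # e.g. scale=15 -> Decimal("1E-15")
        return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


def decimals_match(a: str, a_scale: int, b: str, b_scale: int) -> bool:
    """Compare two values at the smaller of the two scales (an assumption)."""
    scale = min(a_scale, b_scale)
    return to_common_scale(a, scale) == to_common_scale(b, scale)


# A DECIMAL(38,19) value and a DECIMAL(38,15) value that agree at 15 digits.
same = decimals_match("1.1234567890123456789", 19, "1.123456789012346", 15)
```

Comparing at the lower scale avoids reporting differences that exist only because one side stores more fractional digits than the other.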
## Optimizing OLTP databases: indexing best practices
When working with row-oriented transactional databases like PostgreSQL, optimizing the database structure is crucial for efficient data diffing, especially for large tables. Here are some best practices to consider:
* **Create indexes on key columns**:
  * It's essential to create indexes on the columns that will be compared, particularly the primary key columns defined in the data diffs.
  * **Example**: If your data diff involves primary key columns `colA` and `colB`, ensure that indexes are created for these specific columns.
* **Use separate indexes for primary key columns**:
  * Indexes for primary key columns should be distinct and start with these columns, not as subsets of other indexes. Having a dedicated primary key index is critical for efficient diffing.
  * **Example**: Consider a primary key consisting of `colA` and `colB`. Ensure that the index is structured in the same order, like (`colA`, `colB`), to align with the primary key. An index with an order of (`colB`, `colA`) is strongly discouraged due to the impact on performance.
  * **Example**: If the index is defined as (`colA`, `colB`, `colC`) and the primary key is a combination of `colA` and `colB`, specify the primary key as `colA`, `colB` when setting up the diff. If the order is reversed to `colB`, `colA`, the diffing process won't be able to fully utilize the index, potentially leading to slower performance.
* **Leverage compound indexes**:
  * Compound indexes, which involve multiple columns, can significantly improve query performance during data diffs as they efficiently handle complex queries and filtering.
  * **Example**: An index defined as (`colA`, `colB`, `colC`) can be beneficial for diffing operations involving these columns, as it aligns with the order of columns in the primary key.
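The effect of index column order can be observed with any query planner. The sketch below uses Python's built-in SQLite as a stand-in for an OLTP database such as PostgreSQL, so the plan text is SQLite-specific; the table name `events`, the index name `idx_events`, and the range query are hypothetical and are not the queries Datafold issues.

```python
import sqlite3


def plan_for(index_columns: str, query: str) -> str:
    """Create a table with one index and return SQLite's query plan text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (colA INTEGER, colB INTEGER, colC TEXT, payload TEXT)"
    )
    conn.execute(f"CREATE INDEX idx_events ON events ({index_columns})")
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    conn.close()
    # The last field of each plan row is the human-readable detail string.
    return " | ".join(row[-1] for row in rows)


# A key-range scan ordered by the primary key columns (colA, colB).
RANGE_QUERY = (
    "SELECT * FROM events WHERE colA BETWEEN 1 AND 1000 ORDER BY colA, colB"
)

aligned_plan = plan_for("colA, colB", RANGE_QUERY)   # index matches key order
reversed_plan = plan_for("colB, colA", RANGE_QUERY)  # index in reversed order
```

With the index in key order, the planner can seek directly into the requested range; with the reversed order it typically falls back to scanning the table and sorting. Running `EXPLAIN` in your own database is the reliable way to confirm which case applies.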
## Handling high percentage of differences
Data diff is optimized to perform best when the percent of different rows/values is relatively low, to support common data validation scenarios like data replication and migration.
While the tool strives to maximize use of the database's computational power and minimize data transfer, extreme cases with very high difference percentages (up to 100%) may require transferring every row over the network, which is considerably slower.
In order to avoid long-running diffs, we recommend the following:
* **Start with diffing [primary keys](/data-diff/cross-database-diffing/creating-a-new-data-diff#primary-key)** only to identify row-level completeness first, before diffing all or more columns.
* **Set an [egress](/data-diff/cross-database-diffing/creating-a-new-data-diff#primary-key) limit** to automatically stop the diffing process after a set number of rows are downloaded over the network.
* **Set a [per-column diff](/data-diff/cross-database-diffing/creating-a-new-data-diff#primary-key) limit** to stop finding differences for each column after a set number are found. This is especially useful in data reconciliation where identifying a large number of discrepancies (e.g., large percentage of missing/different rows) early on indicates that a detailed row-by-row diff may not be required, thereby saving time and computational resources.
In the screenshot below, we see that exactly 4 differences were found in `user_id`, but “at least 4,704 differences” were found in `total_runtime_seconds`. `user_id` has a number of differences below the per-column diff limit, and so we state the exact number. On the other hand, `total_runtime_seconds` has a number of differences greater than the per-column diff limit, so we state “at least.” Note that due to our algorithm’s approach, we often find significantly more differences than the limit before diffing is halted, and in that scenario, we report the value that was found, while stating that more differences may exist.
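The "exact count" versus "at least" reporting can be sketched as follows. This is a simplified illustration, not Datafold's algorithm: the function name, the row representation, and the rule that a column stops being checked once it reaches the limit are assumptions. As noted above, the real algorithm often finds more differences than the limit before halting.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def count_diffs_with_limit(
    pairs: Iterable[Tuple[Row, Row]], columns: List[str], limit: int
) -> Dict[str, str]:
    """Count differing values per column, stopping each column at `limit`."""
    counts = {col: 0 for col in columns}
    for left, right in pairs:
        # Only keep checking columns that have not reached the limit yet.
        active = [col for col in columns if counts[col] < limit]
        if not active:
            break  # every column has reached its limit
        for col in active:
            if left[col] != right[col]:
                counts[col] += 1
    # A column that hit the limit was cut short, so more differences may exist.
    return {
        col: f"at least {n}" if n >= limit else str(n)
        for col, n in counts.items()
    }


# Five matched row pairs: "a" differs once, "b" differs in every row.
pairs = [
    ({"a": i, "b": i}, {"a": i if i else -1, "b": i + 100}) for i in range(5)
]
report = count_diffs_with_limit(pairs, ["a", "b"], limit=3)
```

Here `a` stays below the limit and gets an exact count, while `b` reaches the limit and is reported as a lower bound.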
## Executing queries in parallel
Increase the number of concurrent connections to the database in Datafold. This enables queries to be executed in parallel, significantly accelerating the diff process.
Navigate to the **Settings** option in the left sidebar menu of Datafold. Adjust the **max connections** setting to increase the number of concurrent connections Datafold can establish with your data. Note that the maximum allowable value for concurrent connections is 64.
## Optimize column selection
The number of columns included in the diff directly impacts its speed: selecting fewer columns typically results in faster execution. To optimize performance, refine your column selection based on your specific use case:
* **Comprehensive verification**: For in-depth analysis, include all columns in the diff. This method is the most thorough, suitable for exhaustive data reviews, albeit time-intensive for wide tables.
* **Minimal verification**: Consider verifying only the primary key and `updated_at` columns. This is efficient and sufficient if you need to validate rows have not been added or removed, and that updates are current between databases, but do not need to check for value-level differences between rows with common primary keys.
* **Presence verification**: If your main concern is just the presence of data (whether data exists or has been removed), such as identifying missing hard deletes, verifying only the primary key column can be sufficient.
* **Hybrid verification**: Focus on key columns that are most critical to your operations or data integrity, such as monetary values in an `amount` column, while omitting large serialized or less critical columns like `json_settings`.
## Managing primary key distribution
Significant gaps in the primary key column (e.g., tens of millions of contiguous rows missing) can decrease diff efficiency. Datafold will still execute queries for the non-existent row ranges, which can slow down the data diff.
## Handling different primary key types
As a general rule, primary keys should be of the same (or similar) type in both datasets for diffing to work properly. Comparing primary keys of different types (e.g., `INT` vs `VARCHAR`) will result in a type mismatch error. You can still diff such datasets by casting the primary key column to the same type in both datasets explicitly.
Indexes on the primary key typically cannot be utilized when the primary key is cast to a different type. This may result in slower diffing performance. Consider creating a separate index, such as [expression index in PostgreSQL](https://www.postgresql.org/docs/current/indexes-expressional.html), to improve performance.
# Creating a New Data Diff
Source: https://docs.datafold.com/data-diff/cross-database-diffing/creating-a-new-data-diff
Datafold's Data Diff can compare data across databases (e.g., PostgreSQL <> Snowflake, or between two SQL Server instances) to validate migrations, meet regulatory and compliance requirements, or ensure data is flowing successfully from source to target.
This powerful algorithm provides full row-, column-, and value-level detail into discrepancies between data tables.
## Creating a new data diff
Setting up a new data diff in Datafold is straightforward. You can configure your data diffs with the following parameters and options:
### Source and Target datasets
#### Data connection
Pick your data connection(s).
#### Diff type
Choose how you want to compare your data:
* Table: Select this to compare data directly from database tables
* Query: Use this to compare results from specific SQL queries
#### Dataset
Choose the dataset you want to compare. This can be a table or a view in your relational database.
#### Filter
Insert your filter clause after the WHERE keyword to refine your dataset. For example: `created_at > '2000-01-01'` will only include data created after January 1, 2000.
### Materialize inputs
Select this option to improve diffing speed when the query is compute-heavy, when filters are applied to non-indexed columns, or when primary keys are transformed using concatenation, coalesce, or another function.
## Column remapping
Designate columns with the same data type and different column names to be compared. Data Diff will surface differences under the column name used in the Source dataset.
Datafold automatically handles differences in data types to ensure accurate comparisons. See our best practices below for how this is handled.
## General
### Primary key
The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere as it is used for unique row identification during diffing.
Textual primary keys do not support values outside the set of characters `a-zA-Z0-9!"()*/^+-<>=`. If these values exist, we recommend filtering them out before running the diff operation.
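To find offending keys before running a diff, a check along these lines may help. It is a minimal sketch that assumes every symbol in the documented set is a literal character (so `+-<` is read as three characters, not a range); the function names are hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: each symbol in the documented set is a literal character.
_ALLOWED = re.compile(r'[a-zA-Z0-9!"()*/^+\-<>=]*')


def is_supported_text_key(value: str) -> bool:
    """True when every character of `value` is in the documented set."""
    return _ALLOWED.fullmatch(value) is not None


def unsupported_keys(values: Iterable[str]) -> List[str]:
    """Return the keys that would need to be filtered out before diffing."""
    return [v for v in values if not is_supported_text_key(v)]


bad = unsupported_keys(["A1", "x@y", "k=v", "user_42"])
```

The same character class can be adapted into a filter clause in databases that support regular-expression matching.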
### Columns
#### Columns to compare
Specify which columns to compare between datasets.
Note that this has performance implications when comparing a large number of columns, especially in wide tables with 30 or more columns. It is recommended to initially focus on comparisons using only the primary key or to select a limited subset of columns.
### Row sampling
Use sampling to compare a subset of your data instead of the entire dataset. This is best for diffing large datasets. Sampling can be configured to select a percentage of rows to compare, or to ensure differences are found to a chosen degree of statistical confidence.
#### Sampling tolerance
Sampling tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (e.g., nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling.
When sampling is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.
Default: 0.001%
#### Sampling confidence
Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance.
To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.
Default: 99%
#### Sampling threshold
Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.
#### Sample size
This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:
* The presence of duplicate primary keys in the datasets will likely increase this estimate, as it inflates the perceived uniqueness of rows.
* Applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope.
* The number of rows we sample is not fixed; instead, we use a statistical approach called the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (compare every single row) to establish a baseline.
Example: Imagine there are two datasets we want to compare, Source and Target. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:
* Sampling tolerance: 5%
* Sampling confidence: 95%
Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.
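The documentation does not spell out the sampling procedure, but one textbook scheme consistent with "the number of rows we sample is not fixed" is Poisson (Bernoulli) sampling, in which each row is kept independently with a given probability. The sketch below illustrates that idea only; the function name, the `rate` and `seed` parameters, and the use of Python's `random` module are assumptions, not Datafold's implementation.

```python
import random
from typing import Iterable, List


def poisson_sample(keys: Iterable[int], rate: float, seed: int = 0) -> List[int]:
    """Keep each key independently with probability `rate`.

    The resulting sample size is random: it averages rate * N but is not fixed.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    return [key for key in keys if rng.random() < rate]


# Sampling about 1% of 100,000 primary keys yields roughly 1,000 rows,
# with the exact count varying from seed to seed.
sample = poisson_sample(range(100_000), rate=0.01, seed=1)
```

Because each row is decided independently, the scheme needs no prior pass over the full dataset, which fits the statement that no full diff is required to establish a baseline.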
### Advanced
#### Materialize diff results to table
Create a detailed table from your diff results, indicating each row where differences occur. This table will include corresponding values from both datasets and flags showing whether each row matches or mismatches.
# Results
Source: https://docs.datafold.com/data-diff/cross-database-diffing/results
Once your data diff is complete, Datafold provides a concise, high-level summary of the detected changes in the Overview tab.
## Overview
The top-level menu displays the diff status, job ID, creation and completed times, runtime, and data connection.
## Match Score
The Match Score is the percentage shown in the Overview tab. It summarizes how similar the two datasets are across rows, columns, and values in a single number between 0% and 100%.
### Formula
```
Match Score = max(0, total_cells - non_matching_cells) / total_cells
```
clamped to the range 0%–100%.
**Total cells** is the sum of cells in both tables:
```
total_cells = (rows_A × cols_A) + (rows_B × cols_B)
```
**Non-matching cells** is the sum of three contributions:
| Source | Contribution to non-matching cells |
| -------------------------------------------------------------- | --------------------------------------------------- |
| Value differences (cells in both tables with different values) | `value_diffs × 2` |
| Exclusive rows (rows present in only one table) | `exclusive_rows × shared_columns` |
| Exclusive columns (columns present in only one table) | `(extra_cols_A × rows_A) + (extra_cols_B × rows_B)` |
Value differences are multiplied by 2 because each differing cell is counted once on Table A's side of the denominator and once on Table B's side. This keeps the score on a consistent scale regardless of whether a discrepancy comes from a value change, a missing row, or a missing column.
### Example
Table A has 100 rows × 10 columns (1,000 cells). Table B is identical in shape and content except for 4 differing values.
* Total cells: 1,000 + 1,000 = **2,000**
* Non-matching cells: 4 × 2 = **8**
* Match Score: (2,000 − 8) / 2,000 = **99.6%**
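The formula and the worked example can be written out as a small function. This is a direct transcription of the formula in this section under hypothetical parameter names; it also applies the empty-table rule described under the edge cases, and it is not Datafold's source code.

```python
def match_score(
    rows_a: int,
    cols_a: int,
    rows_b: int,
    cols_b: int,
    value_diffs: int = 0,
    exclusive_rows: int = 0,
    shared_columns: int = 0,
    extra_cols_a: int = 0,
    extra_cols_b: int = 0,
) -> float:
    """Return the Match Score as a fraction between 0.0 and 1.0."""
    total_cells = rows_a * cols_a + rows_b * cols_b
    if total_cells == 0:
        return 0.0  # empty tables score 0%, not 100%
    non_matching = (
        value_diffs * 2                      # counted once per table
        + exclusive_rows * shared_columns    # rows present in only one table
        + extra_cols_a * rows_a              # columns present only in A
        + extra_cols_b * rows_b              # columns present only in B
    )
    return max(0, total_cells - non_matching) / total_cells


# The example above: two 100 x 10 tables with 4 differing values.
score = match_score(100, 10, 100, 10, value_diffs=4)
print(f"{score:.1%}")
```

The `max(0, ...)` term keeps the result from going negative when the non-matching count exceeds the total cell count.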
### Why an extra column lowers the score
Every contribution penalizes the score — not just value differences. A table with identical values but one extra column will not score 100%, because that column's cells exist on only one side.
For example, if Table A has the extra column (100 rows × 1 column = 100 cells), those 100 cells are added to the non-matching count, dropping the score to (2,000 − 100) / 2,000 = 95%.
### Edge cases
* **Empty tables**: if neither table has any cells, the Match Score is **0%**, not 100%. There is nothing to match.
* **Sampling**: when sampling is enabled, the Match Score is computed using the sampled row counts, not the full table sizes. The score reflects the sample.
## Columns
The Columns tab displays a table with detailed column and type mappings from the two datasets being diffed, with status indicators for each column comparison (e.g., identical, percentage of values different). This provides a quick way to identify data inconsistencies and prioritize updates.
## Primary keys
This tab highlights rows that are unique to the Target dataset in a data diff ("Rows exclusive to Target"). As this identifies rows that exist only in the Target dataset and not in the Source dataset based on the primary key, it flags potential data discrepancies.
The **Clone diffs and materialize results** button allows you to rerun existing data diffs with results materialized in the warehouse, as well as any other desired modifications.
## Values
This tab displays rows where at least one column value differs between the datasets being compared. It is useful for quickly assessing the extent of discrepancies between the two datasets.
The **Show filters** button enables the following features:
* Highlight characters: highlight value differences between tables
* % of difference: filters and displays columns based on the specified percentage range of value differences
# How Datafold Diffs Data
Source: https://docs.datafold.com/data-diff/how-datafold-diffs-data
Data diffs allow you to perform value-level comparisons between any two datasets within the same database, across different databases, or even between files.
The basic inputs required to run a diff are the data connections, names/paths of the datasets to be compared, and the primary key (one or more columns that uniquely identify rows in the datasets).
## What types of data can data diffs compare?
Diffs can compare data in tables, views, SQL queries (in relational databases and data lakes), and even files (e.g. CSV, Excel, Parquet, etc.).
Datafold facilitates data diffing by supporting a wide range of basic data types across major database systems like Snowflake, Databricks, BigQuery, Redshift, PostgreSQL, and many more.
## Creating data diffs
Diffs can be created in several ways:
* Interactively through the Datafold app
* Programmatically via our [REST API](/api-reference/data-diffs/create-a-data-diff)
* As part of a Continuous Integration (CI) workflow for [Deployment Testing](/deployment-testing/how-it-works)
## How in-database diffing works
When diffing data within the same physical database or data lake namespace, Datafold compares data by executing SQL queries in the target database. It uses several `JOIN`-type queries and various aggregate queries to provide detailed insights into differences at the row, value, and column levels, and to calculate differences in metrics and distributions.
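To give a feel for what `JOIN`-based diffing looks like, the sketch below runs three simple queries against an in-memory SQLite database: two anti-joins for rows exclusive to each table and an inner join for changed values. The table names, columns, and queries are hypothetical and far simpler than what Datafold actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE table_b (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO table_a VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO table_b VALUES (2, 20.0), (3, 35.0), (4, 40.0);
    """
)

# Rows whose primary key exists only in table_a (anti-join).
only_in_a = [r[0] for r in conn.execute(
    "SELECT a.id FROM table_a a LEFT JOIN table_b b ON a.id = b.id "
    "WHERE b.id IS NULL ORDER BY a.id"
)]

# Rows whose primary key exists only in table_b.
only_in_b = [r[0] for r in conn.execute(
    "SELECT b.id FROM table_b b LEFT JOIN table_a a ON a.id = b.id "
    "WHERE a.id IS NULL ORDER BY b.id"
)]

# Rows present in both tables whose values differ.
# SQLite's IS NOT is a NULL-safe inequality.
changed = [r[0] for r in conn.execute(
    "SELECT a.id FROM table_a a JOIN table_b b ON a.id = b.id "
    "WHERE a.amount IS NOT b.amount ORDER BY a.id"
)]
conn.close()
```

Because all three queries run inside the database, only the differing keys travel back to the client, which is the main reason in-database diffing scales with the warehouse rather than with the network.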
## How cross-database diffing works
Datasets from both data connections are co-located in a centralized database to execute comparisons and identify specific rows, columns, and values with differences. To perform diffs at massive scale and increased speed, users can apply sampling, filtering, and column selection.
# Best Practices
Source: https://docs.datafold.com/data-diff/in-database-diffing/best-practices
We share best practices that will help you get the most accurate and efficient results from your data diffs.
## Comparing numeric columns: tolerance for floats
When comparing numeric columns, especially those of `FLOAT` type, which is inherently noisy, it can be helpful to specify tolerance levels for differences below which the values are considered equal.
Set appropriate tolerance levels for floating-point comparisons to avoid flagging inconsequential differences.
## Materialize diff results
While the Datafold UI provides advanced exploration of diff results, sometimes it can be helpful to materialize diff results back to the database to investigate them further with SQL or for audit logging.
## Optimizing diff performance at scale
Since data diff pushes down the compute to your database (which usually has sufficient capacity to store and compute the datasets in the first place), the diffing speed and scalability depend on the performance of the underlying SQL engine. In most cases, the diffing performance is comparable to typical transformation jobs and analytical queries running in the database and has scaled to trillions of rows.
When diffs run longer or consume more database resources than desired, consider the following measures:
1. **Enable Sampling** to dramatically reduce the amount of data processed for in-database diffing.
Sampling can be helpful when diffing extremely large datasets. When sampling is enabled, Datafold compares a randomly chosen subset of the data. Sampling trades diff detail for lower time and cost of the diffing process. For most use cases, sampling does not reduce the informational value of data diffs, as it still provides the magnitude and specific examples of differences (e.g., if 10% of sampled data show discrepancies, it suggests a similar proportion of differences across the entire dataset).
Sampling is less ideal when you need to audit every changed value with 100% confidence, but this scenario is rare in practice.
Although configuring sampling can seem overwhelming at first, a good rule of thumb is to select an initial value of 95% for the sampling confidence and adjust it as needed. Tweaking the parameters can be helpful to see how they impact the sample size and the tradeoff between performance and accuracy.
2. **Add a SQL filter** if you actually need to compare just a subset of data (e.g., for a particular city or last two weeks).
3. **Optimize SQL queries** to enhance the performance and efficiency of database operations, reduce execution time, minimize resource usage, and ensure faster retrieval of data diff results.
4. **Leverage database performance** by ensuring proper configuration to match the typical workload patterns of your diff operations. Many modern databases come with performance-enhancing features like query optimization, caching, and parallel processing.
5. Consider **increasing resources** available to Datafold in your data warehouse (e.g., for Snowflake, increase warehouse size).
# Creating a New Data Diff
Source: https://docs.datafold.com/data-diff/in-database-diffing/creating-a-new-data-diff
Setting up a new data diff in Datafold is straightforward.
You can configure your data diffs with the following parameters and options:
## Dataset
### Data connection
Pick your data connection(s).
### Diff type
Choose how you want to compare your data:
* Table: Select this to compare data directly from database tables
* Query: Use this to compare results from specific SQL queries
Datafold can also diff views, materialized views, and dynamic tables (Snowflake only) with either option.
### Dataset
Choose the datasets you want to compare: Main and Test. Each can be a table or a view in your relational database.
### Time travel point
If your database supports time travel, like [Snowflake](https://docs.snowflake.com/en/user-guide/data-time-travel#querying-historical-data), you can query data at a specified timestamp. This is useful for tracking changes over time, conducting audits, or correcting mistakes from accidental data modifications. You can adjust the database's session parameters as needed for your query.
Supported time travel expressions:
| Database | Timestamp | Negative Offset |
| :-------: | :--------------------------: | :--------------------------: |
| BigQuery | | |
| Snowflake | | |
Timestamp examples:
* `2024-01-01`
* `2024-01-01 10:04:23`
* `2024-01-01 10:04:23-09:00`
* `2024-07-16T10:04:23+05:00`
Negative offset examples (in seconds):
* `130`
* `3600`
### Filter
Insert your filter clause after the `WHERE` keyword to refine your dataset. For example: `created_at > '2000-01-01'` will only include data created after January 1, 2000.
## Column remapping
When columns are the same data type but are named differently, column remapping allows you to align and compare them. This is useful when datasets have semantically identical columns with different names, such as `userID` and `user_id`. Datafold will surface any differences under the column name used in the Main dataset.
## General parameters
### Primary key
The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere as it is used for unique row identification during diffing. Multiple columns support compound primary key definitions.
### Time-series dimension column
If a time-series dimension is selected, this produces a Timeline plot of diff results over time to identify any time-based patterns.
This is useful for identifying trends or anomalies when a given column does not match between tables in a certain date range. By selecting a time-based column, you can visualize differences and patterns across time, measured as column match rates.
### Materialize diff results to table
Create a detailed table from your diff results, indicating each row where differences occur. This table will include corresponding values from both datasets and flags showing whether each row matches or mismatches.
### Materialize full diff result
For in-depth analysis, you can opt to materialize the full diff result. This disables sampling, allowing for a complete row-by-row comparison across datasets. Otherwise, Datafold defaults to diffing only a sample of the data.
## Row sampling
### Enable sampling
Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets.
### Sampling tolerance
Sampling tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (like nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling.
When sampling is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.
Default: 0.001%
### Sampling confidence
Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance.
To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.
Default: 99%
### Sampling threshold
Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.
### Sample size
This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:
* The presence of duplicate primary keys in the datasets will likely increase this estimate, as it inflates the perceived uniqueness of rows.
* Applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope.
* The number of rows we sample is not fixed; instead, we use a statistical approach called the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (compare every single row) to establish a baseline.
Example: Imagine there are two datasets we want to compare, Main and Test. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:
* sampling tolerance: 5%
* sampling confidence: 95%
Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.
## Tolerance for floats
An acceptable delta between numeric values is used to determine if they match. This is particularly useful for addressing rounding differences in long floating-point numbers.
Add tolerance by choosing a column name, mode, and value. For mode:
* *Relative*: Defines a percentage-based tolerance. For example, a 2% relative tolerance means no difference is noted if the absolute value of (A/B - 1) is less than or equal to 2%.
* *Absolute*: Sets a fixed numerical margin. For instance, an absolute tolerance of 0.5 means values are matched if the absolute difference between A and B is 0.5 or less.
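The two modes can be expressed as simple predicates. This is a minimal sketch of the definitions above; the function names are hypothetical, and the handling of `B = 0` in relative mode is an assumption, since the text does not cover that case.

```python
def within_relative(a: float, b: float, tolerance: float) -> bool:
    """True when |a / b - 1| <= tolerance (tolerance=0.02 means 2%)."""
    if b == 0:
        # Assumption: A/B is undefined here, so only exact zeros match.
        return a == 0
    return abs(a / b - 1) <= tolerance


def within_absolute(a: float, b: float, tolerance: float) -> bool:
    """True when |a - b| <= tolerance."""
    return abs(a - b) <= tolerance


# 101 vs 100 is within a 2% relative tolerance; 103 vs 100 is not.
relative_ok = within_relative(101.0, 100.0, 0.02)
relative_bad = within_relative(103.0, 100.0, 0.02)
# 1.0 vs 1.4 is within an absolute tolerance of 0.5.
absolute_ok = within_absolute(1.0, 1.4, 0.5)
```

Relative mode scales with the magnitude of the values, which suits columns spanning several orders of magnitude, while absolute mode suits columns with a known, fixed precision.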
# Results
Source: https://docs.datafold.com/data-diff/in-database-diffing/results
Once your data diff is complete, Datafold provides a concise, high-level summary of the detected changes in the Overview tab.
## Overview
The top-level menu displays the diff status, job ID, creation and completed times, runtime, and data connection.
## Match Score
The Match Score is the percentage shown in the Overview tab. It summarizes how similar the two datasets are across rows, columns, and values in a single number between 0% and 100%.
### Formula
```
Match Score = max(0, total_cells - non_matching_cells) / total_cells
```
clamped to the range 0%–100%.
**Total cells** is the sum of cells in both tables:
```
total_cells = (rows_A × cols_A) + (rows_B × cols_B)
```
**Non-matching cells** is the sum of three contributions:
| Source | Contribution to non-matching cells |
| -------------------------------------------------------------- | --------------------------------------------------- |
| Value differences (cells in both tables with different values) | `value_diffs × 2` |
| Exclusive rows (rows present in only one table) | `exclusive_rows × shared_columns` |
| Exclusive columns (columns present in only one table) | `(extra_cols_A × rows_A) + (extra_cols_B × rows_B)` |
Value differences are multiplied by 2 because each differing cell is counted once on Table A's side of the denominator and once on Table B's side. This keeps the score on a consistent scale regardless of whether a discrepancy comes from a value change, a missing row, or a missing column.
### Example
Table A has 100 rows × 10 columns (1,000 cells). Table B is identical in shape and content except for 4 differing values.
* Total cells: 1,000 + 1,000 = **2,000**
* Non-matching cells: 4 × 2 = **8**
* Match Score: (2,000 − 8) / 2,000 = **99.6%**
### Why an extra column lowers the score
Every contribution penalizes the score — not just value differences. A table with identical values but one extra column will not score 100%, because that column's cells exist on only one side.
For example, if Table A has the extra column (100 rows × 1 column = 100 cells), those 100 cells are added to the non-matching count, dropping the score to (2,000 − 100) / 2,000 = 95%.
### Edge cases
* **Empty tables**: if neither table has any cells, the Match Score is **0%**, not 100%. There is nothing to match.
* **Sampling**: when sampling is enabled, the Match Score is computed using the sampled row counts, not the full table sizes. The score reflects the sample.
## Columns
The Columns tab displays a table with detailed column and type mappings from the two datasets being diffed, with status indicators for each column comparison (e.g., identical, percentage of values different). This provides a quick way to identify data inconsistencies and prioritize updates.
## Primary keys
This tab highlights rows that are unique to the Test dataset in a data diff ("Rows exclusive to Test"). As this identifies rows that exist only in the Test dataset and not in the Main dataset based on the primary key, it flags potential data discrepancies.
The **Show filters** button allows you to filter these rows by selected column(s).
The **Clone diffs and materialize results** button allows you to rerun existing data diffs with results materialized in the warehouse, along with any other desired modifications.
## Column Profiles
Column Profiles displays aggregate statistics and distributions including averages, counts, ranges, and histogram charts representing column-level differences.
The **Show filters** button allows you to adjust chart values by relative (percentage) or absolute numbers.
## Values
This tab displays rows where at least one column value differs between the datasets being compared. It is useful for quickly assessing the extent of discrepancies between the two datasets.
The **Show filters** button enables the following features:
* Highlight characters: highlight value differences between tables
* % of difference: filters and displays columns based on the specified percentage range of value differences
## Timeline
The Timeline tab is a specialized feature that only appears if the time-series dimension column has been selected. It graphically represents data differences over time to highlight discrepancies. It only displays columns with data differences, and differences are presented as the share of mismatched data (percentage mismatched).
This feature offers enhanced clarity in pinpointing inconsistencies, supports informed decision-making through visual data representation, and increases efficiency in identifying and resolving data-related issues.
The Timeline feature is particularly useful in scenarios where an incremental model is mismanaged, leading to improper backfilling. It allows users to visually track the inconsistencies that arise over time due to the mismanagement. This graphical representation makes it easier to pinpoint the specific time frames where the errors occurred, facilitating a more targeted approach to rectify these issues.
It is also useful in correlating data differences with specific time intervals that coincide with changing data connections. When switching over or stitching together different data connections, there's often a shift in how data behaves over time. The Timeline graph helps flag the potential impact of the source change on data consistency and integrity.
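As a rough illustration of "share of mismatched data" over time, the sketch below groups rows into time buckets and computes the percentage of mismatched rows in each. The daily buckets, the per-row mismatch flag, and the function name are assumptions made for the example; they do not describe how Datafold computes the Timeline internally.

```python
from collections import defaultdict


def mismatch_share_by_bucket(rows):
    """rows: iterable of (bucket, is_mismatched) pairs -> {bucket: percent mismatched}."""
    totals = defaultdict(int)
    mismatched = defaultdict(int)
    for bucket, is_mismatched in rows:
        totals[bucket] += 1
        if is_mismatched:
            mismatched[bucket] += 1
    # Percentage of mismatched rows per bucket, in chronological (sorted) order.
    return {b: 100.0 * mismatched[b] / totals[b] for b in sorted(totals)}


# Hypothetical rows: (day of the time-series dimension, whether the row differs).
rows = [
    ("2024-01-01", False), ("2024-01-01", False),
    ("2024-01-02", True), ("2024-01-02", False),
    ("2024-01-03", True), ("2024-01-03", True),
]
shares = mismatch_share_by_bucket(rows)
print(shares)
```

In this made-up data the mismatch share climbs day over day, which is the kind of pattern a badly backfilled incremental model tends to produce.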
## Downstream Impact
This tab displays all associated BI and data app dependencies, such as dashboards and views, linked to the compared datasets. This helps visually illustrate the impact of data changes on downstream data assets.
Each listed dependency is shown with a link to its lineage diagram within Datafold's [column-level lineage](https://docs.datafold.com/data-explorer/how-it-works). You can filter by tables or columns within tables, or [open this view](https://docs.datafold.com/data-explorer/how-it-works) in Data Explorer for further analysis.
# What's a Data Diff?
Source: https://docs.datafold.com/data-diff/what-is-data-diff
A data diff is the value-level comparison between two tables, used to identify critical changes to your data and guarantee data quality.
Data Diffs functionality is available via [MCP](/api-reference/mcp-server-setup) — connect your AI assistant to Datafold and run diffs directly from your development environment.
When you **git diff** your code, you’re comparing two versions of your code files to see what has changed, such as lines added, removed, or modified. Similarly, a **data diff** compares two versions of a dataset or two databases to identify differences in individual cells in the data.
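The idea can be sketched in a few lines of Python. This is a toy, in-memory illustration under simple assumptions (both tables fit in memory as lists of dicts and share a single primary key column); the function name `data_diff` is hypothetical and this is not how Datafold implements diffing at scale.

```python
def data_diff(table_a, table_b, primary_key):
    """Compare two tables given as lists of dicts, matched on a primary key column."""
    rows_a = {row[primary_key]: row for row in table_a}
    rows_b = {row[primary_key]: row for row in table_b}

    # Rows whose primary key exists on only one side.
    exclusive_a = sorted(rows_a.keys() - rows_b.keys())
    exclusive_b = sorted(rows_b.keys() - rows_a.keys())

    # Cell-level differences for rows present on both sides.
    value_diffs = []
    for key in sorted(rows_a.keys() & rows_b.keys()):
        a, b = rows_a[key], rows_b[key]
        for column in sorted(a.keys() & b.keys()):
            if a[column] != b[column]:
                value_diffs.append((key, column, a[column], b[column]))
    return exclusive_a, exclusive_b, value_diffs


prod = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]
dev = [{"id": 2, "email": "B@example.com"}, {"id": 3, "email": "c@example.com"}]
only_prod, only_dev, diffs = data_diff(prod, dev, primary_key="id")
print(only_prod, only_dev, diffs)
```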
## Why do I need to diff data?
Just as diffing code and text is fundamental to software engineering and working with text documents, diffing data is essential to the data engineering workflow.
Why? In data engineering, both data and the code that processes it are constantly evolving. Without the ability to easily diff data, understanding and tracking data changes becomes challenging. This slows down the development process and makes it harder to ensure data quality.
There is a lot you can do with data diff:
* Test SQL code by comparing development or staging environment data to production
* Compare tables in source and target systems to identify discrepancies when migrating data between databases
* Detect value-level outliers, or unexpected changes, in data flowing through your ETL/ELT pipelines
* Verify that reports generated for regulatory compliance accurately reflect the underlying data by comparing report outputs with source data
## Why Datafold?
Data diffing is a fundamental capability in data engineering that every engineer should have access to.
Datafold's [Data Diff](https://www.datafold.com/data-diff) compares datasets fast, within or across databases. As part of Datafold's data quality power tools, Data Diff is fully interoperable with AI agents via [MCP](/datafold-mcp) — so your coding agents can run diffs, validate their own work, and reconcile data across sources programmatically. Datafold offers an enterprise-ready solution for comparing datasets at scale, with comprehensive diffing, API access, and secure deployment options.
Datafold provides end-to-end solutions for automating testing, including column-level lineage, ML-based anomaly detection, and enterprise-scale infrastructure support. It caters to complex and production-ready scenarios, including:
* Automated and collaborative diffing and testing for data transformations in CI
* Data diffing informed by column-level lineage, and validation of code changes with visibility into BI applications
* Validating large data migrations or continuous replications with automated cross-database diffing capabilities
Here's a high-level overview of what Datafold offers:
| Feature Category | Datafold |
| :---: | :---: |
| **Database Support**<br />*Databases that are supported for source-destination diff* | Any SQL database, inquire about specific support |
| **Scale**<br />*Size of datasets supported for diffing* | Unlimited with advanced performance optimization |
| **Primary Key Data Type Support**<br />*Data types of primary keys that are supported for diffing* | Numerical, string, datetime, boolean, composite |
| **Data Types Diffing Support**<br />*Data types that are supported for per-column diffing* | All data types |
| **Export Diff Results to Database**<br />*Materialize diffing results in your database of choice* | Yes |
| **Value-level diffs**<br />*Investigate row-by-row column value differences between source and destination databases* | Yes (JSON & GUI) |
| **Diff UI**<br />*Explore diffs visually and easily share them with your team and stakeholders* | Yes |
| **API Access**<br />*Automatically create diffs and receive results at scale using the Datafold REST API* | Yes |
| **Persisting Diff History**<br />*Persist the result history of diffs to know how your data and diffs have changed over time* | Yes |
| **Scheduled Checks**<br />*Run scheduled diffs for a defined list of tables* | Yes |
| **Alerting**<br />*Receive automatic alerts about detected discrepancies between tables within or across databases* | Yes |
| **Security and Compliance**<br />*Run diffs in secure and compliant environments* | HIPAA, SOC2 Type II, GDPR compliant |
| **Deployment Options**<br />*Deploy your diffs in secure environments that meet your security standards* | Multi-tenant SaaS or Single-tenant in VPC |
| **Support**<br />*Choose which channels offer the greatest support to your use cases and users* | Enterprise support from Datafold team members |
| **SLA**<br />*The types of SLAs that exist to guarantee your team can diff and interact with diffs as expected* | (Coming soon) |
## Three ways to learn more
If you're new to Datafold or data diffing, here are three easy ways to get started:
1. **Explore our CI integration guides**: See how Datafold fits into your continuous integration (CI) pipeline by checking out our guides for [No-Code](../deployment-testing/getting-started/universal/no-code), [API](../deployment-testing/getting-started/universal/api), or [dbt](../integrations/orchestrators) integrations.
2. **Try it yourself**: Use your own data with our [14-day free trial](https://app.datafold.com/) and experience Datafold in action.
3. **Book a demo**: Get a deeper technical understanding of how Datafold integrates with your company’s data infrastructure by [booking a demo](https://www.datafold.com/booktime) with our team.
# dbt Metadata Sync
Source: https://docs.datafold.com/data-explorer/best-practices/dbt-metadata-sync
Datafold can automatically ingest dbt metadata from your production environment and display it in Data Explorer.
**INFO**
You can enable the metadata sync in your Orchestration settings.
Please note that when this feature is enabled, user editing of table metadata is disabled.
### Model-level
The following model-level information can be synced:
* `description` is synchronized into the description field of the table in Lineage.
* The `owner` of the table is set to the user identified by the `owner` meta field (`user@company.com` in the example). This user must exist in Datafold with that email.
* The `foo` meta-information is added to the description field with the value `bar`.
* The tags `pii` and `abc` are applied to the table as tags.
Here's an example configuration in YAML format:
```yaml theme={null}
models:
  - name: users
    description: "Description of the table"
    meta:
      owner: user@company.com
      foo: bar
    tags:
      - pii
      - abc
```
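To illustrate the mapping, the following Python sketch splits a model entry (written as a Python dict that mirrors the YAML) into the pieces listed earlier in this section. The function name and the shape of the returned dict are hypothetical; how Datafold actually stores or renders these fields is not specified here.

```python
def model_metadata(model):
    """Split a dbt model entry into the model-level fields described in this section."""
    meta = dict(model.get("meta", {}))
    # `owner` is handled separately from the other meta keys.
    owner = meta.pop("owner", None)
    return {
        "description": model.get("description", ""),
        "owner": owner,
        "tags": list(model.get("tags", [])),
        # Remaining meta keys (e.g. foo: bar) are the ones surfaced in the description field.
        "extra_meta": meta,
    }


users = {
    "name": "users",
    "description": "Description of the table",
    "meta": {"owner": "user@company.com", "foo": "bar"},
    "tags": ["pii", "abc"],
}
synced = model_metadata(users)
print(synced)
```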
### Column-level
The following column-level information can be synced:
* The column `user_id` has two tags applied: `pk` and `id`.
* The metadata for `user_id` is ignored because it reflects the primary key tag.
* The `email` column has the description applied.
* The `email` column has the tag `pii` applied.
* The `email` column has extra metadata information in the description field: `type` with the value `email`.
Here's an example configuration for columns in YAML format:
```yaml theme={null}
models:
  - name: users
    ...
    columns:
      - name: user_id
        tags:
          - pk
          - id
        meta:
          pk: true
      - name: email
        description: "The user's email"
        tags:
          - pii
        meta:
          type: email
```
# How It Works
Source: https://docs.datafold.com/data-explorer/how-it-works
Datafold's Data Knowledge Graph maps your entire data ecosystem — lineage, business logic, usage, and ontology — providing essential context to your AI agents via MCP and helping you understand the impact of changes across systems.
Our **Data Explorer** offers a comprehensive overview of your data assets, including [Lineage](/data-explorer/lineage) and [Profiles](/data-explorer/profile). It is powered by the **Data Knowledge Graph (DKG)**, which automatically collects and unifies information about your data, analytical products, and data infrastructure — serving it to your AI agents via [MCP](/datafold-mcp) to provide essential context for any data-related task.
You can filter data assets by Data Connections, Tags, Data Owners, and Asset Types (e.g., tables, columns, and BI-created assets such as views, reports, and syncs). You can also search directly to find specific data assets for lineage analysis.
After selecting a table or data asset, the UI will display a **graph of table-level lineage** by default. You can toggle between **Upstream** and **Downstream** perspectives and customize the lineage view by adjusting the **Max Depth** parameter to your preference.
# Lineage
Source: https://docs.datafold.com/data-explorer/lineage
Datafold offers a column-level and tabular lineage view.
## Column-level lineage
Datafold's column-level lineage helps users trace and document the history, transformations, dependencies, and both downstream and upstream processes of a specific data column within an organization's data assets. This feature allows you to pinpoint the origins of data validation issues and comprehensively identify downstream data processes and applications.
To view column-level lineage, click on the **Columns** dropdown menu of the selected asset.
### Highlight path between assets
To highlight the column path between assets, click the specific column. Reset the view by clicking the **Exit the selected path** button.
## Tabular lineage
Datafold also offers a tabular lineage view.
You can sort lineage information by depth, asset type, identifier, and owner. Click on the **Actions** button for further options:
### Focus lineage on current node
Drill down onto the data node or column of interest.
### Show SQL query
Access the SQL query associated with the selected column to understand how the data was queried from the source:
### Show usage details
View the column's read count, write count, and cumulative read count for the previous 7 days. The cumulative read count is the column's own read count plus the read counts of its downstream columns.
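As an illustration of the cumulative read count, the sketch below walks a small, made-up lineage graph and adds a column's own reads to the reads of everything downstream of it. The column names, the counts, and the rule that each downstream column is counted once even when reachable through several paths are assumptions for the example.

```python
def cumulative_reads(column, reads, downstream):
    """Own read count plus the read counts of every column reachable downstream."""
    seen = set()
    stack = [column]
    total = 0
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # count each column once, even if several paths reach it
        seen.add(current)
        total += reads.get(current, 0)
        stack.extend(downstream.get(current, []))
    return total


# Hypothetical 7-day read counts and column-level lineage edges.
reads = {"raw.users.email": 5, "stg.users.email": 12, "dim.customer.email": 40}
downstream = {
    "raw.users.email": ["stg.users.email"],
    "stg.users.email": ["dim.customer.email"],
}
print(cumulative_reads("raw.users.email", reads, downstream))  # 57
```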
## Search and filters
Datafold offers powerful search and filtering capabilities to help users quickly locate specific data assets and isolate data connections of interest.
In both the graphical and tabular lineage views, you can filter by tables or columns within tables, allowing you to go as granular as needed.
### Table filtering
Simply enter the table's name in the search bar to filter and display all relevant information associated with that table.
### Column filtering
To focus specifically on columns, you can search using a combination of keywords. For instance, searching "column table" will display columns associated with a table, while a query like "column dim customer" narrows the search to columns within the "dim customer" table.
## Settings
You can configure the settings for Lineage under Settings > Data Connections > Advanced Settings:
### Schema indexing schedule
Customize how often, and at what time, Datafold updates its index of your database schemas. The schedule is defined with a cron expression.
### Table inclusion/exclusion
You can filter to include and/or exclude specific tables to be shown in Lineage.
When the inclusion list is set, only the tables specified in this list will be visible in the lineage and search results.
When the inclusion list is not set, all tables will be visible by default, except for those explicitly specified in the exclusion list.
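The two rules above can be expressed as a small predicate. This is a sketch of the described behavior only: the function name is hypothetical, table names are compared by exact match, and ignoring the exclusion list whenever an inclusion list is set is an assumption based on a literal reading of the rules.

```python
def is_visible(table, include=None, exclude=None):
    """Return True if a table should appear in Lineage and search results."""
    if include:
        # Inclusion list set: only the listed tables are visible.
        return table in include
    # No inclusion list: everything is visible except explicitly excluded tables.
    return table not in (exclude or [])


print(is_visible("analytics.orders", include=["analytics.orders"]))   # True
print(is_visible("analytics.users", include=["analytics.orders"]))    # False
print(is_visible("analytics.users", exclude=["analytics.tmp"]))       # True
```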
### Lineage update schedule
Customize how often, and at what time, Datafold scans the query history of your data warehouse to build and update the data lineage. The schedule is defined with a cron expression.
## FAQ
Datafold computes column-level lineage by:
1. Ingesting, parsing and analyzing SQL logs from your databases and data warehouses. This allows Datafold to infer dependencies between SQL statements, including those that create, modify, and read data.
2. Augmenting the metadata graph with data from various sources. This includes metadata from orchestration tools (e.g., dbt), BI tools, and user-provided documentation.
Currently, the schema of the Datafold GraphQL API, which we use to expose lineage information, is not yet stable and is considered to be in beta. Therefore, we do not include this API in our public documentation.
If you would like to programmatically access lineage information, you can explore our GitHub repository with a few examples: [datafold/datafold-api-examples](https://github.com/datafold/datafold-api-examples). Simply clone the repository and follow the instructions provided in the `README.md` file.
# Profile
Source: https://docs.datafold.com/data-explorer/profile
View a data profile that summarizes key table and column-level statistics, and any upstream dependencies.
# Datafold Migration Agent
Source: https://docs.datafold.com/data-migration-automation/datafold-migration-agent
The Data Migration Agent delivers guaranteed-outcome migrations with fixed price, timeline, and data parity — over 6x faster than traditional approaches.
The Data Migration Agent (DMA) delivers outcome-based migrations with guaranteed price, timeline, and data quality — powered by Datafold's AI agent architecture and [Data Knowledge Graph](/data-explorer/how-it-works). Unlike traditional service providers that bill by the hour, Datafold delivers managed outcomes: AI agents do the translation and validation work, elite engineers oversee quality, and customers pay for results.
## How does DMA work?
Datafold performs complete SQL codebase translation and validation using an AI-powered architecture. This approach leverages a large language model (LLM) with a feedback loop optimized for achieving full parity between the migration source and target. DMA analyzes metadata, including schema, data types, and relationships, to ensure accuracy in translation.
Datafold provides a comprehensive report at the end of the migration. This report includes links to data diffs validating parity and highlighting any discrepancies at the dataset, column, and row levels between the source and target databases.
## Why migrate with DMA?
Unlike traditional deterministic transpilers, DMA offers several distinct benefits:
* **Full parity between source and target:** DMA ensures not just code that compiles, but code that delivers the same results in your new database, complete with explicit validation.
* **Flexible dialect handling:** DMA can adapt to any arbitrary input/output dialect without requiring a full grammar definition, which is especially valuable for legacy systems.
* **Self-correction capabilities:** The AI-driven DMA can account for and correct mistakes based on both compilation errors and data discrepancies.
* **Modernizing code structure:** DMA can convert complex stored procedures into clean, modern formats such as dbt projects, following best practices.
## Getting started with DMA
**Want to learn more?**
If you're interested in diving deeper, please take a moment to [fill out our intake form](https://nw1wdkq3rlx.typeform.com/to/VC2TbEbz) to connect with the Datafold team.
1. Connect your source and target data sources to Datafold.
2. Provide Datafold access to your codebase, typically by installing the Datafold GitHub/GitLab/ADO app or via system catalog access for stored procedures.
Once you connect your source and target systems and Datafold ingests the codebase, DMA's translation process is supervised by the Datafold team. In most cases, no additional input is required from the customer.
The migration process timeline depends on the technologies, scale, and complexity of the migration. After setup, migrations typically take several days to several weeks.
## Security
Datafold is SOC 2 Type II, GDPR, and HIPAA-compliant. We offer flexible deployment options, including in-VPC setups in AWS, GCP, or Azure. The LLM infrastructure is local, ensuring no data is exposed to external subprocessors beyond the cloud provider. For VPC deployments, data stays entirely within the customer’s private network.
## FAQ
For more information, please see our extensive [FAQ section](../faq/data-migration-automation).
# Migration Automation
Source: https://docs.datafold.com/data-migration-automation/datafold-migration-automation
Modernize your data platform in weeks, not years. Datafold's Data Migration Agent delivers guaranteed-outcome migrations with fixed price, timeline, and data parity — over 6x faster and cheaper than traditional approaches.
Datafold offers flexible migration validation options to fit your data migration workflow. Data teams can choose to leverage the full power of the [Data Migration Agent (DMA)](../data-migration-automation/datafold-migration-agent) alongside [cross-database diffing](../data-diff/how-datafold-diffs-data#how-cross-database-diffing-works), or use ad-hoc diffing exclusively for validation.
## Supported migrations
Datafold supports a wide range of migrations to meet the needs of modern data teams. The platform enables smooth transitions between different databases and transformation frameworks, ensuring both code translation and data validation throughout the migration process. Datafold can handle:
* **Data Warehouse Migrations:** Seamlessly migrate between data warehouses, for example, from PostgreSQL to Databricks.
* **Data Transformation Framework Migrations:** Transition your transformation framework from legacy stored procedures to modern tools like dbt.
* **Hybrid Migrations:** Migrate across a combination of data platforms and transformation frameworks. For example, moving from MySQL + stored procedures to Databricks + dbt.
## Migration options
The AI-powered Datafold Migration Agent (DMA) provides automated SQL code translation and validation to simplify and automate data migrations. Teams can pair DMA with ad-hoc cross-database diffing to enhance the validation process with additional manual checks when necessary.
**How it works:**
* **Step 1:** Connect your legacy and new databases to Datafold, along with your codebase.
* **Step 2:** DMA translates and validates SQL code automatically.
* **Step 3:** Pair the DMA output with ad-hoc cross-database diffing to reconcile data between legacy and new databases.
This combination streamlines the migration process, offering automatic validation with the flexibility of manual diffing for fine-tuned control.
For teams that prefer to handle code translation manually or are working with third-party migrations, Datafold's ad-hoc cross-database diffing is available as a stand-alone validation tool.
**How it works:**
* Validate data across databases manually without using DMA for code translation.
* Run ad-hoc diffing as needed, via the [Datafold REST API](../api-reference/introduction), or schedule it with [Monitors](../data-monitoring) for continuous validation.
This option gives you full control over the migration validation process, making it suitable for in-house or outsourced migrations.
## Cross-database diffing for migrations
Both options above rely on cross-database diffing to validate data parity between source and target systems. Learn how to set up and run cross-database diffs in the [Cross-Database Diffing guide](/data-diff/cross-database-diffing/creating-a-new-data-diff).
# Monitor Types
Source: https://docs.datafold.com/data-monitoring/monitor-types
Monitoring your data for unexpected changes is one of the cornerstones of data observability.
Data Monitors functionality is available via [MCP](/api-reference/mcp-server-setup) — connect your AI assistant to Datafold and manage monitors directly from your development environment.
Datafold supports all your monitoring needs through a variety of different monitor types:
1. [**Data Diff**](/data-monitoring/monitors/data-diff-monitors) → Detect differences between any two datasets, within or across databases
2. [**Metric**](/data-monitoring/monitors/metric-monitors) → Identify anomalies in standard metrics like row count, freshness, and cardinality, or in any custom metric
3. [**Data Test**](/data-monitoring/monitors/data-test-monitors) → Validate your data with business rules and see specific records that fail your tests
4. [**Schema Change**](/data-monitoring/monitors/schema-change-monitors) → Receive alerts when a table schema changes
If you need help creating your first few monitors, deciding which type of monitor to use in a particular situation, or developing an overall monitoring strategy, please reach out via email ([support@datafold.com](mailto:support@datafold.com)) and our team of experts will be happy to assist.
# Monitors as Code
Source: https://docs.datafold.com/data-monitoring/monitors-as-code
Manage Datafold monitors via version-controlled YAML for greater scalability, governance, and flexibility in code-based workflows.
**INFO**
Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization.
Use an AI agent with [Datafold MCP](/api-reference/mcp-server-setup) to help generate and manage your monitors YAML configurations.
This is particularly useful if any of the following are true:
* You have (or plan to have) 100s or 1000s of monitors
* Your team is accustomed to managing things in code
* Strict governance and change management are important to you
## Getting started
**INFO**
This section describes how to get started with GitHub Actions, but the same concepts apply to other hosted version control platforms like GitLab and Bitbucket. Contact us if you need help getting started.
### Set up version control integration
To start using monitors as code, you'll need to decide which repository will contain your YAML configuration.
If you've already connected a repository to Datafold, you could use that. Or, follow the instructions [here](/integrations/code-repositories) to connect a new repository.
### Generate a Datafold API key
If you've already got a Datafold API key, use it. Otherwise, you can create a new one in the app by visiting **Settings > Account** and selecting **Create API Key**.
### Create monitors config
In your chosen repository, create a new YAML file where you'll define your monitors config.
For this example, we'll name the file `monitors.yaml` and place it in the root directory, but neither choice is a hard requirement.
Leave the file blank for now—we'll come back to it in a moment.
For autocomplete, inline documentation, and real-time validation of your monitors YAML, see the [monitors-schema](https://github.com/datafold/monitors-schema) repo. It provides a JSON Schema with setup instructions for VS Code, Cursor, IntelliJ, Neovim, and other editors.
### Add CI workflow
If you're using GitHub Actions, create a new YAML file under `.github/workflows/` using the following template. Be sure to tailor it to your particular setup:
```yaml theme={null}
name: Apply monitors as code config to Datafold

on:
  push:
    branches:
      - main # or master

jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.12' # quoted so YAML does not read it as a number
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install datafold-sdk
      - name: Update monitors
        run: datafold monitors provision monitors.yaml # use the correct file name/path
        env:
          DATAFOLD_HOST: https://app.datafold.com # different for dedicated deployments
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} # remember to add to secrets
```
### Create a monitor
Now return to your YAML configuration file to add your first monitor. Reference the list of examples below and select one that makes sense for your organization.
## Examples
**INFO**
These examples are intended to serve as inspiration and don't demonstrate every possible configuration. Contact us if you have any questions.
### Data Diff
[Data Diff monitors](/data-monitoring/monitors/data-diff-monitors) detect differences between any two datasets, within or across databases.
```yaml theme={null}
monitors:
  replication_test_example:
    name: 'Example of a custom name'
    description: 'Example of a custom description'
    type: diff
    enabled: true
    datadiff:
      diff_type: 'inmem'
      dataset_a:
        connection_id: 734
        table: db.schema.table
        time_travel_point: '2020-01-01'
        materialize: false
      dataset_b:
        connection_id: 736
        table: db.schema.table1
        time_travel_point: '2020-01-01'
        materialize: true
      primary_key:
        - pk_column
      columns_to_compare:
        - col1
      materialize_results: true
      materialize_results_to: 734
      column_remapping:
        col1: col2
      sampling:
        tolerance: 0.2
        confidence: 0.95
        threshold: 5000
      ignore_string_case: true
    schedule:
      interval:
        every: hour
  replication_test_example_with_thresholds:
    type: diff
    enabled: true
    datadiff:
      diff_type: 'inmem'
      dataset_a:
        connection_id: 734
        table: db.schema.table
      dataset_b:
        connection_id: 736
        table: db.schema.table2
        session_parameters:
          k: v
      primary_key:
        - pk_column
      tolerance:
        float:
          default:
            type: absolute
            value: 50
          column_tolerance:
            A:
              type: relative
              value: 20 # %
            B:
              type: absolute
              value: 30.0
    schedule:
      interval:
        every: hour
    alert:
      different_rows_count: 100
      different_rows_percent: 10
  replication_test_example_with_thresholds_and_notifications:
    type: diff
    enabled: true
    datadiff:
      diff_type: 'indb'
      dataset_a:
        connection_id: 734
        table: db.schema.table
      dataset_b:
        connection_id: 734
        table: db.schema.table3
      primary_key:
        - pk_column
      sampling:
        rate: 0.1
        threshold: 100000
      materialize_results: true
      tolerance:
        float:
          default:
            type: absolute
            value: 50
          column_tolerance:
            A:
              type: relative
              value: 20 # %
            B:
              type: absolute
              value: 30.0
    schedule:
      interval:
        every: hour
    notifications:
      - type: email
        recipients:
          - valentin@datafold.com
      - type: slack
        integration: 123
        channel: datafold-alerts
        mentions:
          - "here"
          - "channel"
        features:
          - attach_csv
          - notify_first_triggered_only
      - type: pagerduty
        integration: 124
      - type: webhook
        integration: 125
    alert:
      different_rows_count: 100
      different_rows_percent: 10
```
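The `tolerance` settings in the examples distinguish `absolute` from `relative` (percentage) thresholds for float columns. The Python sketch below shows one plausible reading of that distinction. The function name, the choice of dataset A's value as the baseline for relative comparisons, and the inclusive bounds are all assumptions; the exact comparison Datafold performs is not documented here.

```python
def within_tolerance(a, b, kind, value):
    """Return True if two floats count as equal under the given tolerance."""
    diff = abs(a - b)
    if kind == "absolute":
        return diff <= value
    if kind == "relative":
        # `value` is a percentage; the baseline is dataset A's value (assumption).
        if a == 0:
            return diff == 0
        return diff / abs(a) * 100 <= value
    raise ValueError(f"unknown tolerance type: {kind}")


# With the default tolerance above (absolute, 50), 100.0 vs 130.0 is not a difference.
print(within_tolerance(100.0, 130.0, "absolute", 50))   # True
# With column A's tolerance (relative, 20%), a 30% change is a difference.
print(within_tolerance(100.0, 130.0, "relative", 20))   # False
```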
### Metric
[Metric monitors](/data-monitoring/monitors/metric-monitors) identify anomalies in standard metrics like row count, freshness, and cardinality, or in any custom metric.
```yaml theme={null}
monitors:
  table_metric_example:
    type: metric
    enabled: true
    connection_id: 736
    metric:
      type: table
      table: db.schema.table
      filter: deleted is false
      metric: freshness # see full list of options below
    alert:
      type: automatic
      sensitivity: 10
    schedule:
      interval:
        every: day
        hour: 8 # 0-23 UTC
  column_metric_example:
    type: metric
    enabled: true
    connection_id: 736
    metric:
      type: column
      table: db.schema.table
      column: some_col
      filter: deleted is false
      metric: sum # see full list of options below
    alert:
      type: percentage
      increase: 30 # %
      decrease: 0
    tags:
      - oncall
      - action-required
    schedule:
      cron: 0 0 * * * # every day at midnight UTC
  custom_metric_example:
    name: custom metric example
    type: metric
    connection_id: 123
    notifications: []
    tags: []
    enabled: true
    metric:
      type: custom
      query: select * from table
      alert_on_missing_data: true
    alert:
      type: absolute
      max: 22.0
      min: 12.0
    schedule:
      interval:
        every: day
        type: daily
```
#### Supported metrics
For more details on supported metrics, see the docs for [Metric monitors](/data-monitoring/monitors/metric-monitors#metric-types).
**Table metrics:**
* Freshness: `freshness`
* Row Count: `row_count`
**Column metrics:**
* Cardinality: `cardinality`
* Uniqueness: `uniqueness`
* Minimum: `minimum`
* Maximum: `maximum`
* Average: `average`
* Median: `median`
* Sum: `sum`
* Standard Deviation: `std_dev`
* Fill Rate: `fill_rate`
### Data Test
[Data Test monitors](/data-monitoring/monitors/data-test-monitors) validate your data with business rules and surface specific records that fail your tests.
```yaml theme={null}
monitors:
  custom_data_test_example:
    type: test
    enabled: true
    connection_id: 736
    query: select 1 from db.schema.table
    schedule:
      interval:
        every: hour
    tags:
      - team_1
  accepted_values_test_example:
    type: test
    enabled: true
    connection_id: 736
    test:
      type: accepted_values
      tables:
        - path: db.schema.table
          columns:
            - column_name
      variables:
        accepted_values:
          value:
            - 12
            - 15
          quote: false
    schedule:
      interval:
        every: hour
  numeric_range_test_example:
    type: test
    enabled: true
    connection_id: 736
    test:
      type: numeric_range
      tables:
        - path: db.schema.table
          columns:
            - column_name
      variables:
        maximum:
          value: 15
          quote: false
    schedule:
      interval:
        every: hour
```
**Supported variables by Standard Data Test (SDT) type**
| SDT Type | Monitor-as-Code Type | Supported Variables | Variable Type |
| --------------------- | ----------------------- | ------------------- | ---------------------- |
| Unique | `unique` | - | - |
| Not Null | `not_null` | - | - |
| Accepted Values | `accepted_values` | `accepted_values` | Collection with values |
| Referential Integrity | `referential_integrity` | - | - |
| Numeric Range | `numeric_range` | `minimum` | Single value |
| | | `maximum` | Single value |
### Schema Change
[Schema Change monitors](/data-monitoring/monitors/schema-change-monitors) detect when changes occur to a table's schema.
```yaml theme={null}
monitors:
  schema_change_example:
    type: schema
    enabled: true
    connection_id: 736
    table: db.schema.table
    schedule:
      interval:
        every: day
        hour: 22 # 0-23 UTC
    tags:
      - team_2
```
## Bulk Manage with Wildcards
For certain monitor types—[Freshness](/data-monitoring/monitors/metric-monitors), [Row Count](/data-monitoring/monitors/metric-monitors), and [Schema Change](/data-monitoring/monitors/schema-change-monitors)—it's possible to create/manage many monitors at once using the following wildcard syntax:
```yaml theme={null}
row_count_monitors:
type: metric
connection_id: 123
metric:
type: table
metric: row_count
# include all tables in the WAREHOUSE database
include_tables: WAREHOUSE.*
# exclude all tables in the INFORMATION_SCHEMA schema
exclude_tables: WAREHOUSE.INFORMATION_SCHEMA.*
schedule:
interval:
every: day
hour: 10 # 0-23 UTC
```
This is particularly useful if you want to create the same monitor type for many tables in a particular database or schema. Note in the example above that you can specify both `include_tables` and `exclude_tables` to fine-tune your selection.
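To build intuition for how an include pattern and an exclude pattern combine, here is a minimal Python sketch. It assumes shell-style wildcard matching via the standard library's `fnmatch` module; Datafold's actual matching rules may differ, and `select_tables` and the table names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_tables(tables, include, exclude=None):
    """Return tables that match `include` and do not match `exclude`.

    Assumption: patterns behave like shell-style wildcards, where `*`
    matches any run of characters (including dots).
    """
    selected = []
    for table in tables:
        if not fnmatchcase(table, include):
            continue
        if exclude is not None and fnmatchcase(table, exclude):
            continue
        selected.append(table)
    return selected


# Hypothetical fully qualified table names.
tables = [
    "WAREHOUSE.PUBLIC.ORDERS",
    "WAREHOUSE.PUBLIC.USERS",
    "WAREHOUSE.INFORMATION_SCHEMA.TABLES",
    "STAGING.PUBLIC.ORDERS",
]

selected = select_tables(tables, "WAREHOUSE.*", "WAREHOUSE.INFORMATION_SCHEMA.*")
print(selected)
```

In this sketch, only the two `WAREHOUSE.PUBLIC` tables are selected: the `STAGING` table never matches the include pattern, and the `INFORMATION_SCHEMA` table is dropped by the exclude pattern.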
## FAQ
You can still create and manage monitors in the app even if you're defining others in code; it's not all or nothing.
By default, a monitor that you remove from your code remains in the app. To change this, add the `--dangling-monitors-strategy [delete|pause]` flag to your `provision` command to either delete or pause monitors once they're removed from your code. For example:
```bash theme={null}
datafold monitors provision monitors.yaml --dangling-monitors-strategy delete
```
Note: this only applies to monitors that were created from code, not those created in the UI.
To delete or pause every monitor that was created from code, add the `--dangling-monitors-strategy [delete|pause]` flag to your `provision` command and replace the contents of your YAML file with the following:
```yaml theme={null}
monitors: {}
```
Note that providing an empty YAML file will likely produce an error and not have the same effect.
Monitors created from code are read-only in the app (though they can still be cloned).
You can export all monitors from the app to manage them as code. There are two ways to do this:
1. Exporting all monitors: navigate to the Monitors list page and click the **View as Code** button.
2. Exporting a single monitor: go to the specific monitor, click **Actions**, and then select **View as Code**.
Note that when exporting monitors, pay attention to the `id` field in the YAML. If you want to preserve monitor history, keep the `id` field as this will update the original monitor to be managed as code. If you don't want to preserve your monitor history, **delete** the `id` field to create a new monitor as code while keeping the original monitor intact.
## Need help?
If you have any questions about how to use monitors as code, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).
# Data Diff Monitors
Source: https://docs.datafold.com/data-monitoring/monitors/data-diff-monitors
Data Diff monitors compare datasets across or within databases, identifying row and column discrepancies with customizable scheduling and notifications.
## Ways to create a data diff monitor
There are 3 ways to create a data diff monitor:
1. From the **Monitors** page by clicking **Create new monitor** and then selecting **Data diff** as a type of monitor.
2. Clone an existing monitor by clicking **Actions** and then **Clone** in the header menu. This will pre-fill the form with the existing monitor configuration.
3. Create a monitor directly from the data diff results by clicking **Actions** and **Create monitor**. This will pre-fill the configuration with the parent data diff settings, requiring updates only for the **Schedule** and **Notifications** sections.
Once a monitor is created and initial metrics collected, you can set up [thresholds](/data-monitoring/monitors/data-diff-monitors#monitoring) for the two metrics.
## Create a new data diff monitor
Setting up a new diff monitor in Datafold is straightforward. You can configure it with the following parameters and options:
### General
Choose how you want to compare your data and whether the diff type is in-database or cross-database.
Pick your data connections. Then, choose the two datasets you want to compare. This can be a table or a view in your relational database.
If you need to compare just a subset of data (e.g., for a particular city or last two weeks), add a SQL filter.
Select **Materialize inputs** to improve diffing speed when the query is compute-heavy, when filters are applied to non-indexed columns, or when primary keys are transformed using concatenation, coalesce, or another function.
### Column remapping
When columns are the same data type but are named differently, column remapping allows you to align and compare them. This is useful when datasets have semantically identical columns with different names, such as `userID` and `user_id`. Datafold will surface any differences under the column name used in Dataset A.
### Diff settings
#### Primary key
The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere as it is used for unique row identification during diffing. Multiple columns support compound primary key definitions.
#### Columns to compare
Determine whether to compare all columns or only specific ones. To optimize performance on large tables, it's recommended to exclude columns known to have unique values for every row, such as timestamp columns like `updated_at`, or to apply filters that limit the comparison scope.
#### Materialize diff results
Choose whether to store diff results in a table.
#### Sampling
Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets.
There are two ways to enable sampling in Monitors: [Tolerance](#tolerance) and [% of Rows](#-of-rows).
**TIP**
When should I use sampling tolerance instead of percent of rows?
Each has its own use cases and benefits. Please [see the FAQ section](#sampling-tolerance-vs--of-rows) for a more detailed breakdown.
##### Tolerance
Tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (like nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling.
When sampling tolerance is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.
Default: 0.001%
###### Sampling confidence
Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance.
To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.
Default: 99%
###### Sampling threshold
Sampling will be disabled if the total row count of the largest table is less than the threshold value.
###### Sample size
This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:
* The presence of duplicate primary keys in the datasets will likely increase this estimate, as it inflates the perceived uniqueness of rows
* Applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope
The number of rows we sample is not fixed; instead, we use a statistical approach called the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (compare every single row) to establish a baseline.
Example: Imagine there are two datasets we want to compare, Main and Test. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:
* Sampling tolerance: 5%
* Sampling confidence: 95%
Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.
##### % of rows
Percent of rows sampling defines the proportion of the dataset to be included in the sample by specifying a percentage of the total number of rows. For example, setting the sampling percentage to 0.1% means that only 0.1% of the total rows will be sampled for analysis or comparison.
When percent of rows sampling is enabled, a fixed percentage of rows is selected randomly from the dataset. This method simplifies the sampling process, making it easy to understand and configure without needing to adjust complex statistical parameters. However, it lacks the statistical assurances provided by methods like sampling tolerance.
It doesn't dynamically adjust based on data characteristics or discrepancies but rather adheres strictly to the specified percentage, regardless of the dataset's variability. This straightforward approach is ideal for scenarios where simplicity and quick setup are more important than precision and statistical confidence. It provides a basic yet effective way to estimate the dataset's characteristics or differences, suitable for less critical data validation tasks.
###### Sampling rate
This refers to the percentage of the total number of rows in the largest table that will be used to determine the sample size. This ensures that the sample size is proportionate to the size of the dataset, providing a representative subset for comparison. For instance, if the largest table contains 1,000,000 rows and the sampling rate is set to 1%, the sample size will be 10,000 rows.
###### Sampling threshold
Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.
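The interaction between the sampling rate and the sampling threshold can be sketched in a few lines of Python. This is an illustration of the arithmetic described above, not Datafold's implementation; the function name and the convention of returning `None` when sampling is disabled are assumptions.

```python
def percent_of_rows_sample_size(row_count_a, row_count_b,
                                sampling_rate_pct, sampling_threshold):
    """Estimate the sample size for percent-of-rows sampling.

    Returns None when sampling is disabled, i.e. when the largest table
    has fewer rows than the threshold (hypothetical convention).
    """
    largest = max(row_count_a, row_count_b)
    if largest < sampling_threshold:
        return None  # small tables are compared in full
    return round(largest * sampling_rate_pct / 100)


# The example from the text: 1,000,000 rows at a 1% sampling rate.
print(percent_of_rows_sample_size(1_000_000, 800_000, 1, 100_000))  # 10000

# Below the threshold, sampling is disabled and every row is compared.
print(percent_of_rows_sample_size(5_000, 4_000, 1, 100_000))  # None
```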
###### Sampling size
This parameter is the [same one used in sampling tolerance](#sample-size).
### Add a schedule
You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling.
### Add notifications
You can add notifications, sent through Slack or emails, which indicate whether a monitor has been executed.
Notifications are sent when either or both predefined thresholds are reached during a Diff Monitor run. You can set a maximum threshold for the:
* Number of different rows
* Percentage of different rows
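As a rough illustration of how the two thresholds combine, the sketch below reports which of them a run has reached. It assumes that "reached" means greater than or equal to the threshold and that the percentage is computed relative to the row count of dataset A; the function and parameter names are hypothetical.

```python
def diff_alert_reasons(different_rows, total_rows_a,
                       max_different_rows=None, max_different_pct=None):
    """Return the list of thresholds reached by a diff monitor run.

    Assumptions: a threshold is "reached" when the observed value is
    greater than or equal to it, and an unset threshold never fires.
    """
    reasons = []
    if max_different_rows is not None and different_rows >= max_different_rows:
        reasons.append("row_count")
    # Percentage of different rows relative to dataset A.
    pct = 100 * different_rows / total_rows_a if total_rows_a else 0.0
    if max_different_pct is not None and pct >= max_different_pct:
        reasons.append("percentage")
    return reasons


# 50 of 1,000 rows differ (5%): only the percentage threshold is reached.
print(diff_alert_reasons(50, 1000, max_different_rows=100, max_different_pct=2.0))
```

A notification would be sent whenever the returned list is non-empty, which covers the "either or both" case.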
## Results
The diff monitor run history shows the results from each run.
Each run includes basic stats, along with metrics such as:
* The total rows different: number of different rows according to data diff results.
* Rows with different values: percentage of different rows relative to the total number of rows in dataset A according to data diff results. Note that the status `Different` doesn't automatically map into a notification/alert.
Click the **Open Diff** link for more granular information about a specific Data Diff.
## FAQ
Use sampling tolerance when you need statistical confidence in your results, as it is more efficient and stops sampling once a difference is confidently detected. This method is ideal for critical data validation tasks that require precise accuracy.
On the other hand, use the percent of rows method for its simplicity and ease of use, especially in less critical scenarios where you just need a straightforward, quick sampling approach without worrying about statistical parameters. This method is perfect for general, easy-to-understand sampling needs.
If you have any questions about how to use Data Diff monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).
# Data Test Monitors
Source: https://docs.datafold.com/data-monitoring/monitors/data-test-monitors
Data Tests validate your data against off-the-shelf checks or custom business rules.
Data Test monitors allow you to validate your data using off-the-shelf checks for non-null or unique values, numeric ranges, accepted values, referential integrity, and more. Custom tests let you write custom SQL queries to validate your own business rules.
Think of Data Tests as pass/fail—either a test returns no records (pass) or it returns at least one record (fail). Failed records are viewable in the app, materialized to a temporary table in your warehouse, and can even be [attached to notifications as a CSV](/data-monitoring/monitors/data-test-monitors#attach-csvs-to-notifications).
## Create a Data Test monitor
There are two ways to create a Data Test monitor:
1. Open the **Monitors** page, select **Create new monitor**, and then choose **Data Test**.
2. Clone an existing Data Test monitor by clicking **Actions** and then **Clone**. This will pre-fill the form with the existing monitor configuration.
## Set up your monitor
Select your data connection, then choose whether you'd like to use a [Standard](/data-monitoring/monitors/data-test-monitors#standard-data-tests) or [Custom](/data-monitoring/monitors/data-test-monitors#custom-data-tests) test.
### Standard Data Tests
Standard tests allow you to validate your data against off-the-shelf checks for non-null or unique values, numeric ranges, accepted values, referential integrity, and more.
After choosing your data connection, select **Standard** and the specific test that you'd like to run. If you don't see the test you're looking for, you can always write a [Custom test](/data-monitoring/monitors/data-test-monitors#custom-data-tests).
#### Quoting variables
Some test types (e.g. accepted values) require you to provide one or more values, which you may want to have quoted in the final SQL. The **Quote** flag, which is enabled by default, allows you to control this behavior. Here's an example.
Quoting **enabled** for `EXAMPLE_VALUE` (default):
```sql theme={null}
SELECT *
FROM DB.SCHEMA.TABLE1
WHERE "COLUMN1" < 'EXAMPLE_VALUE';
```
Quoting **disabled** for `EXAMPLE_VALUE`:
```sql theme={null}
SELECT *
FROM DB.SCHEMA.TABLE1
WHERE "COLUMN1" < EXAMPLE_VALUE;
```
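The effect of the **Quote** flag can be mimicked with a small Python sketch. This only illustrates the rendering difference shown in the two SQL snippets above; the helper names are hypothetical, and the single-quote escaping is an added assumption rather than documented Datafold behavior.

```python
def render_value(value, quote=True):
    """Render a test variable for inclusion in SQL text."""
    text = str(value)
    if not quote:
        return text  # emitted verbatim, e.g. a number or an expression
    # Assumption: embedded single quotes are doubled, as in standard SQL.
    return "'" + text.replace("'", "''") + "'"


def render_condition(column, operator, value, quote=True):
    """Build a comparison such as "COLUMN1" < 'EXAMPLE_VALUE'."""
    return f'"{column}" {operator} {render_value(value, quote)}'


print(render_condition("COLUMN1", "<", "EXAMPLE_VALUE"))
print(render_condition("COLUMN1", "<", "EXAMPLE_VALUE", quote=False))
```

Disabling quoting is typically what you want for numeric values, as in the `quote: false` settings of the accepted values and numeric range examples in Monitors as Code.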
### Custom Data Tests
When you need to test something that's not available in our [Standard tests](/data-monitoring/monitors/data-test-monitors#standard-data-tests), you can write a Custom test. Select your data connection, choose **Custom**, then write your SQL query.
Importantly, keep in mind that your query should return records that *fail* the test. Here are some examples to illustrate this.
**Custom business rule**
Say your company defines active users as individuals who have signed into your application at least 3 times in the past week. You could write a test that validates this logic by checking for users marked as active who *haven't* reached this threshold:
```sql theme={null}
SELECT *
FROM users
WHERE status = 'active'
AND signins_last_7d < 3;
```
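To see the pass/fail semantics in action, the following sketch runs the same query against an in-memory SQLite database using Python's standard library. The table contents are made up for illustration; the point is that any returned row counts as a failed record.

```python
import sqlite3

# Build a throwaway table with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, status TEXT, signins_last_7d INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "active", 5), (2, "active", 1), (3, "inactive", 0)],
)

# The test query returns records that FAIL the business rule.
failed = conn.execute(
    "SELECT * FROM users WHERE status = 'active' AND signins_last_7d < 3"
).fetchall()

# The test passes only when no records are returned.
test_passed = len(failed) == 0
print(failed, test_passed)
```

Here user 2 is marked active with only one sign-in, so the query returns that row and the test fails.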
**Data formatting**
If you wanted to validate that all phone numbers in your contacts table are 10 digits and only contain numbers, you'd return records that are not 10 digits or use non-numeric characters:
```sql theme={null}
SELECT *
FROM contacts
WHERE LENGTH(phone_number) != 10
OR phone_number REGEXP '[^0-9]';
```
## Add a schedule
You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling.
## Add notifications
Receive notifications via Slack or email when at least one record fails your test.
## Attach CSVs to notifications
Datafold allows attaching a CSV of failed records to Slack and email notifications. This is useful if, for example, you have business users who don't have a Datafold license but need to know about records that fail your tests.
This option is configured separately for each notification destination.
CSV attachments are limited to the lesser of 1,000 rows or 1 MB in file size.
### Attaching CSVs in Slack
In order to attach CSVs to Slack notifications, you need to complete 1-2 additional steps:
1. If you installed the Datafold Slack app prior to October 2024, you'll need to reinstall the app by visiting Settings > Integrations > Notifications, selecting your Slack integration, then **Reinstall Slack integration**.
2. Invite the Datafold app to the channel you wish to send notifications to using the `/invite` command.
## Run Tests in CI
Standard Data Tests run on a schedule against your production data. But often it's useful to test data before it gets to production as part of your deployment workflow. For this reason, Datafold supports running tests in CI.
Data Tests in CI work very similarly to our [Monitors as Code](/data-monitoring/monitors-as-code) feature, in the sense that you define your tests in a version-controlled YAML file. You then use the Datafold SDK to execute those tests as part of your CI workflow.
### Write your tests
First, create a new file (e.g. `tests.yaml`) in the root of your repository. Then write your tests using the same format described in our [Monitors as Code](/data-monitoring/monitors-as-code) docs, with two exceptions:
1. Add a `run_in_ci` flag to each test and set it to `true` (assuming you'd like to run the test)
2. (Optional) Add placeholders for variables that you'd like to populate dynamically when executing your tests
Here's an example:
```yaml theme={null}
monitors:
  null_pk_test:
    type: test
    name: No NULL pk in the users table
    run_in_ci: true
    connection_id: 8
    query: select * from {{ schema }}.USERS where id is null
  duplicate_pk_test:
    type: test
    name: No duplicate pk in the users table
    run_in_ci: true
    connection_id: 8
    query: |
      select *
      from {{ schema }}.USERS
      where id in (
        select id
        from {{ schema }}.USERS
        group by id
        having count(*) > 1
      );
```
### Execute your tests
**INFO**
This section describes how to get started with GitHub Actions, but the same concepts apply to other hosted version control platforms like GitLab and Bitbucket. Contact us if you need help getting started.
If you're using GitHub Actions, create a new YAML file under `.github/workflows/` using the following template. Be sure to tailor it to your particular setup:
```yaml theme={null}
on:
  push:
    branches:
      - main
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/checkout@v2
        with:
          token: ${{ secrets.GH_TOKEN }}
          repository: datafold/datafold-sdk
          path: datafold-sdk
          ref: data-tests-in-ci-demo
      - uses: actions/setup-python@v2
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Set schema env var in PR
        run: |
          echo "SCHEMA=ANALYTICS.PR" >> $GITHUB_ENV
        if: github.event_name == 'pull_request'
      - name: Set schema env var in main
        run: |
          echo "SCHEMA=ANALYTICS.CORE" >> $GITHUB_ENV
        if: github.event_name == 'push'
      - name: Run tests
        run: |
          datafold tests run --var schema:$SCHEMA --ci-config-id 1 tests.yaml # use the correct file name/path
        env:
          DATAFOLD_HOST: https://app.datafold.com # different for dedicated deployments
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} # remember to add to secrets
```
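Conceptually, the `--var schema:$SCHEMA` argument supplies a value for the `{{ schema }}` placeholder in your test queries. The Python sketch below illustrates that substitution under simple assumptions; it is not the SDK's implementation, and `parse_vars` and `render_query` are hypothetical helpers.

```python
import re

# Matches placeholders such as {{ schema }} or {{schema}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def parse_vars(pairs):
    """Turn ["schema:ANALYTICS.PR"] into {"schema": "ANALYTICS.PR"}."""
    return dict(pair.split(":", 1) for pair in pairs)


def render_query(query, variables):
    """Replace every {{ name }} placeholder with its value."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value provided for placeholder '{name}'")
        return variables[name]

    return PLACEHOLDER.sub(substitute, query)


variables = parse_vars(["schema:ANALYTICS.PR"])
rendered = render_query(
    "select * from {{ schema }}.USERS where id is null", variables
)
print(rendered)
```

With the pull request value `ANALYTICS.PR`, the rendered query reads from `ANALYTICS.PR.USERS`; on a push to `main`, the same test would target `ANALYTICS.CORE.USERS`.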
### View the results
When your CI workflow is triggered (e.g. by a pull request), you can view your test results in the terminal output of the workflow run.
## Need help?
If you have any questions about how to use Data Test monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).
# Metric Monitors
Source: https://docs.datafold.com/data-monitoring/monitors/metric-monitors
Metric monitors detect anomalies in your data using ML-based algorithms or manual thresholds, supporting standard and custom metrics for tables or columns.
**INFO**
Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization.
Metric monitors allow you to perform anomaly detection—either automatically using our ML-based algorithm or by setting manual thresholds—on the following metric types:
1. Standard metrics (e.g. row count, freshness, and cardinality)
2. Custom metrics (e.g. sales volume per region)
## Create a Metric monitor
There are two ways to create a Metric monitor:
1. Open the **Monitors** page, select **Create new monitor**, and then choose **Metric**.
2. Clone an existing Metric monitor by clicking **Actions** and then **Clone**. This will pre-fill the form with the existing monitor configuration.
## Set up your monitor
Select your data connection, then choose the type of metric you'd like: **Table**, **Column**, or **Custom**.
If you select table or column, you have the option to add a SQL filter to refine your dataset. For example, you could implement a 7-day rolling time window with the following: `timestamp >= dateadd(day, -7, current_timestamp)`. Please ensure the SQL is compatible with your selected data connection.
## Metric types
### Table metrics
| Metric | Definition | Additional Notes |
| --------- | --------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| Freshness | Time since table was last updated | Measured in minutes. Derived from INFORMATION\_SCHEMA. Only supported for Snowflake, BigQuery, and Databricks. |
| Row Count | Total number of rows | |
### Column metrics
| Metric | Definition | Supported Column Types | Additional Notes |
| ------------------ | ------------------------------ | ---------------------- | -------------------------- |
| Cardinality | Number of distinct values | All types | |
| Uniqueness | Proportion of distinct values | All types | Proportion between 0 and 1 |
| Minimum | Lowest numeric value | Numeric columns | |
| Maximum | Highest numeric value | Numeric columns | |
| Average | Mean value | Numeric columns | |
| Median | Median value (50th percentile) | Numeric columns | |
| Sum | Sum of all values | Numeric columns | |
| Standard Deviation | Measure of data spread | Numeric columns | |
| Fill Rate | Proportion of non-null values | All types | Proportion between 0 and 1 |
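To make the definitions in the table concrete, here is a small Python sketch that computes a subset of these metrics for a single column of values. It is an illustration only: in particular, measuring uniqueness over non-null values is an assumption, and the function name is hypothetical.

```python
import statistics


def column_metrics(values):
    """Compute a subset of the column metrics for one column's values."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    metrics = {
        "cardinality": distinct,
        # Assumption: uniqueness is measured over non-null values.
        "uniqueness": distinct / len(non_null) if non_null else None,
        "fill_rate": len(non_null) / total if total else None,
    }
    if non_null:
        # Numeric metrics only make sense when there is data to aggregate.
        metrics.update(
            minimum=min(non_null),
            maximum=max(non_null),
            average=statistics.mean(non_null),
            median=statistics.median(non_null),
            sum=sum(non_null),
        )
    return metrics


metrics = column_metrics([10, 20, 20, None])
print(metrics["cardinality"], metrics["fill_rate"], metrics["median"], metrics["sum"])
```

For the sample column, three of four values are non-null (fill rate 0.75) and two distinct values are present (cardinality 2).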
### Custom metrics
Our custom metric framework is extremely flexible and supports several approaches to defining metrics. Depending on the approach you choose, your query should return some combination of the following columns:
* **Metric value (required)**: a numeric column containing your *metric values*
* **Timestamp (optional)**: a date/time column containing *timestamps* corresponding to your metric values
* **Group (optional)**: a string column containing *groups/dimensions* for your metric
**INFO**
The names and order of your columns don't matter. Datafold will automatically infer their meaning based on data type.
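One way to picture type-based inference is the sketch below, which assigns a role to each column of a result row from its Python type. The role names and the `infer_roles` helper are hypothetical, and Datafold's actual inference logic is not documented here.

```python
from datetime import date, datetime
from numbers import Number


def infer_role(value):
    """Guess a column's role from the type of one of its values."""
    if isinstance(value, (datetime, date)):
        return "timestamp"
    # bool is a Number subclass in Python, so exclude it explicitly.
    if isinstance(value, Number) and not isinstance(value, bool):
        return "metric_value"
    if isinstance(value, str):
        return "group"
    raise TypeError(f"unsupported column type: {type(value).__name__}")


def infer_roles(row):
    """Map each column name in a result row to its inferred role."""
    return {name: infer_role(value) for name, value in row.items()}


# Hypothetical result row: column names and order do not matter.
row = {"country": "DE", "day": date(2024, 1, 1), "sales": 1250.5}
roles = infer_roles(row)
print(roles)
```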
The following questions will help you decide which approach is best for you:
1. **Do you want to group your metric by the value of a column in your query?** For example, if your metric is *sales volume per day*, rather than looking at a single metric that encompasses all sales globally, it might be more informative to group by country. In this case, Datafold will automatically compute sales volume separately for each country to assist with root cause analysis when there’s an unexpected change.
2. **Will your query return a single metric value (per group, if relevant) on every monitor run, or an entire time series?** We generally recommend starting with the simpler approach of providing a single metric value (per group) per monitor run. However, if you’ve already defined a time series elsewhere (e.g. in your BI tool) and simply want to copy/paste that query into Datafold, then you may prefer the latter approach.
**INFO**
Datafold will only log a single data point per timestamp per group, which means you should only send data for a particular time period once that period is complete.
3. **If your metric returns a single value per monitor run, will you provide your own timestamps or use the timestamps of monitor runs?** If your query returns a single value per run, we generally recommend letting Datafold provide timestamps based on monitor runs unless you have a compelling reason to provide your own. For example, if your metric always lags by one day, you could explicitly associate yesterday's date with each observation.
As you're writing your query, Datafold will let you know if the result set doesn't match one of the accepted patterns. If you have questions, please contact us and we'll be happy to help.
## Configure anomaly detection
Enable anomaly detection to get the most out of metric monitors. You have several options:
* **Automatic**: our automated anomaly detection uses machine learning to flag metric values that are out of the ordinary. Dial the sensitivity up or down depending on how many alerts you'd like to receive.
* **Manual**: specific thresholds beyond which you'd like the monitor to trigger an alert. **Fixed Values** are specific minimum and/or maximum values, while **Percent Change** measures the magnitude of change from one observation to the next.
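The two manual threshold styles can be sketched as simple checks. This is a minimal illustration under stated assumptions: values strictly beyond a bound trigger an alert, percent change is computed relative to the previous observation, and the function and parameter names are hypothetical.

```python
def breaches_fixed_values(value, minimum=None, maximum=None):
    """Fixed Values: alert when the metric falls outside [minimum, maximum]."""
    if minimum is not None and value < minimum:
        return True
    if maximum is not None and value > maximum:
        return True
    return False


def breaches_percent_change(previous, current,
                            increase_pct=None, decrease_pct=None):
    """Percent Change: alert on large moves between consecutive observations."""
    if previous == 0:
        return False  # percent change is undefined; assumption: no alert
    change_pct = 100 * (current - previous) / abs(previous)
    if increase_pct is not None and change_pct > increase_pct:
        return True
    if decrease_pct is not None and -change_pct > decrease_pct:
        return True
    return False


# Fixed values with min 12.0 and max 22.0.
print(breaches_fixed_values(25.0, minimum=12.0, maximum=22.0))  # True
print(breaches_fixed_values(15.0, minimum=12.0, maximum=22.0))  # False

# A 40% jump exceeds a 30% allowed increase.
print(breaches_percent_change(100, 140, increase_pct=30))  # True
```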
## Add a schedule
You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling.
## Add notifications
Send notifications via Slack or email when your monitor exceeds a threshold (automatic or manual).
## Need help?
If you have any questions about how to use Metric monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).
# Schema Change Monitors
Source: https://docs.datafold.com/data-monitoring/monitors/schema-change-monitors
Schema Change monitors notify you when a table’s schema changes, such as when columns are added, removed, or data types are modified.
**INFO**
Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization.
Schema change monitors alert you when a table’s schema changes in any of the following ways:
* Column added
* Column removed
* Data type changed
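The three kinds of change can be illustrated by comparing two snapshots of a table's schema. The sketch below represents each snapshot as a mapping from column name to data type; the function name and sample schemas are hypothetical and do not reflect how Datafold stores schemas internally.

```python
def diff_schema(previous, current):
    """Compare two {column_name: data_type} snapshots of the same table."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    type_changed = sorted(
        column
        for column in set(previous) & set(current)
        if previous[column] != current[column]
    )
    return {"added": added, "removed": removed, "type_changed": type_changed}


# Hypothetical snapshots taken on two consecutive monitor runs.
yesterday = {"id": "INTEGER", "email": "VARCHAR", "signup_date": "DATE"}
today = {"id": "BIGINT", "email": "VARCHAR", "country": "VARCHAR"}

changes = diff_schema(yesterday, today)
print(changes)
```

Any non-empty list in the result corresponds to one of the change types listed above.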
## Create a Schema Change monitor
There are two ways to create a Schema Change monitor:
1. Open the **Monitors** page, select **Create new monitor**, and then choose **Schema Change**.
2. Clone an existing Schema Change monitor by clicking **Actions** and then **Clone**. This will pre-fill the form with the existing monitor configuration.
## Set up your monitor
To set up a Schema Change monitor, simply select your data connection and the table you wish to monitor for changes.
## Add a schedule
You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling.
## Add notifications
Receive notifications via Slack or email when a schema change is detected.
## FAQ
Data diffs also report schema differences, but in a different context. While data diffs report on schema differences *between two tables at the same time* (unless you’re using the time travel feature), Schema Change monitors alert you to schema changes for the *same table over time*.
## Need help?
If you have any questions about how to use Schema Change monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).
# Deployment Options
Source: https://docs.datafold.com/datafold-deployment/datafold-deployment-options
Datafold is a web-based application with multiple deployment options, including multi-tenant SaaS and dedicated cloud (either customer- or Datafold-hosted).
## SaaS / Multi-Tenant
Our standard multi-tenant deployment is a cost-efficient option for most teams and is available in two AWS regions:
| Region Name | Region | Sign-Up Page |
| :--------------- | :---------- | :------------------------------------------------------------------------- |
| US West (Oregon) | `us-west-2` | [https://app.datafold.com/org-signup](https://app.datafold.com/org-signup) |
| Europe (Ireland) | `eu-west-1` | [https://eu.datafold.com/org-signup](https://eu.datafold.com/org-signup) |
For additional security, we provide the following options:
1. [IP Whitelisting](/security/securing-connections#ip-whitelisting): only allow access to specific IP addresses
2. [AWS PrivateLink](/security/securing-connections#private-link): set up a limited network point to access your RDS in the same region
3. [VPC Peering](/security/securing-connections#vpc-peering-saas): securely join two networks together
4. [SSH Tunnel](/security/securing-connections#ssh-tunnel): set up a secure tunnel between your network and Datafold with the SSH server on your side
5. [IPSec Tunnel](/security/securing-connections#ipsec-tunnel): an IPSec tunnel setup
## Dedicated Cloud
We also offer a single-tenant deployment of the Datafold application in a dedicated Virtual Private Cloud (VPC). The options are (from least to most complex):
1. **Datafold-hosted, Datafold-managed**: the Cloud account belongs to Datafold and we manage the Datafold application for you.
2. **Customer-hosted, Datafold-managed**: the Cloud account belongs to you, but we manage the Datafold application for you.
3. **Customer-hosted, Customer-managed**: the Cloud account belongs to you and you manage the Datafold application. Datafold does not have access.
Dedicated Cloud can be deployed to all major cloud providers:
* [AWS](/datafold-deployment/dedicated-cloud/aws)
* [GCP](/datafold-deployment/dedicated-cloud/gcp)
* [Azure](/datafold-deployment/dedicated-cloud/azure)
**VPC vs. VNet**
We use the term VPC across all major cloud providers. However, Azure refers to this concept as a Virtual Network (VNet).
### Kubernetes Platform Dependencies
Dedicated Cloud deployments run on Kubernetes (EKS, GKE, or AKS). In addition to the cloud infrastructure described in the provider-specific guides, the following platform components must be deployed on the cluster **before** the Datafold application:
| Component | Purpose | Kubernetes Namespace |
| :------------------------------------------------------------------------ | :---------------------------------------------------------------------------------------- | :------------------- |
| [Zalando Postgres Operator](https://github.com/zalando/postgres-operator) | Manages PostgreSQL databases used by Temporal | `postgres-operator` |
| [Temporal](https://temporal.io/) | Workflow orchestration engine that powers Datafold's monitors, data diffs, and scheduling | `temporal` |
Temporal uses PostgreSQL (managed by the Zalando operator) as its persistence backend. The Datafold application connects to Temporal as a client to execute workflows.
For deployment instructions, see the [Datafold Helm Charts](https://github.com/datafold/helm-charts) repository.
### Datafold Dedicated Cloud FAQ
Dedicated Cloud deployment may be the preferred deployment method for customers with special privacy and security concerns and for those in highly regulated domains. In a Dedicated Cloud deployment, the entire Datafold stack runs on dedicated cloud infrastructure and network, which usually means that it:
1. Is not accessible from the public Internet (it sits behind the customer's VPN)
2. Uses the internal network to communicate with the customer's databases and other resources, so no data is sent over public networks
Datafold is deployed to the customer's cloud infrastructure but is fully managed by Datafold. The only DevOps involvement needed from the customer's side is to set up a cloud project and role (steps #1 and #2 below).
1. The customer creates a Datafold-specific namespace in their cloud account (subaccount in AWS / project in GCP / subscription or resource group in Azure)
2. The customer creates a Datafold-specific IAM resource with permissions to deploy the Datafold-specific namespace
3. The Datafold Infrastructure team provisions the Datafold stack on the customer's infrastructure using a fully automated Terraform procedure
4. The customer and Datafold Infrastructure teams collaborate to implement the security and networking requirements; see [all available options](#additional-security-dedicated-cloud)
See cloud-specific instructions here:
* [AWS](/datafold-deployment/dedicated-cloud/aws)
* [GCP](/datafold-deployment/dedicated-cloud/gcp)
* [Azure](/datafold-deployment/dedicated-cloud/azure)
After the initial deployment, the Datafold team uses the same procedure to roll out software updates and perform maintenance to meet the uptime SLA.
Datafold is deployed in the customer's region of choice on AWS, GCP, or Azure, in a cloud account that is owned and managed by Datafold. We collaborate to implement the security and networking requirements, ensuring that traffic either does not cross the public internet or, if it does, does so securely. All available options are listed below.
This deployment method follows the same process as the standard customer-hosted deployment (see above), but with a key difference: the customer is responsible for managing both the infrastructure and the application. Datafold engineers do not have any access to the deployment in this case.
We offer open-source projects that facilitate this deployment, with examples for every major cloud provider. You can find these projects on GitHub:
* [AWS](https://github.com/datafold/terraform-aws-datafold)
* [GCP](https://github.com/datafold/terraform-google-datafold)
* [Azure](https://github.com/datafold/terraform-azure-datafold)
Each of these projects uses a Helm chart for deploying the application. The Helm chart is also available on GitHub:
* [Helm Chart](https://github.com/datafold/helm-charts)
By providing these open-source projects, Datafold enables you to integrate the deployment into your own infrastructure, including existing clusters. This allows your infrastructure team to manage the deployment effectively.
**Deployment Secrets:** Datafold provides the necessary secrets for downloading images as part of the license agreement. Without this agreement, the deployment will not complete successfully.
**Platform Dependencies:** The Kubernetes cluster must have the [Zalando Postgres Operator](https://github.com/zalando/postgres-operator) and [Temporal](https://temporal.io/) running before the Datafold Helm chart can be deployed. See [Kubernetes Platform Dependencies](#kubernetes-platform-dependencies) for details.
Because the Datafold application is deployed in a dedicated VPC, your databases/integrations are not directly accessible when they are not exposed to the public Internet. The following solutions enable secure connections to your databases/integrations without exposing them to the public Internet:
**AWS**
1. [PrivateLink](/security/securing-connections?current-cloud=aws#private-link "PrivateLink")
2. [VPC Peering](/security/securing-connections#vpc-peering-dedicated-cloud "VPC Peering")
3. [SSH Tunnel](/security/securing-connections#ssh-tunnel "SSH Tunnel")
4. [IPSec Tunnel](/security/securing-connections#ipsec-tunnel "IPSec Tunnel")

**GCP**
1. [Private Service Connect](/security/securing-connections?current-cloud=gcp#private-link "Private Service Connect")
2. [VPC Peering](/security/securing-connections#vpc-peering-dedicated-cloud "VPC Peering")
3. [SSH Tunnel](/security/securing-connections#ssh-tunnel "SSH Tunnel")

**Azure**
1. [Private Link](/security/securing-connections?current-cloud=azure#private-link "Private Link")
2. [VNet Peering](/security/securing-connections#vpc-peering-dedicated-cloud "VNet Peering")
3. [SSH Tunnel](/security/securing-connections#ssh-tunnel "SSH Tunnel")
Please inquire with [sales@datafold.com](mailto:sales@datafold.com) about customer-managed deployment options.
# Datafold VPC Deployment on AWS
Source: https://docs.datafold.com/datafold-deployment/dedicated-cloud/aws
Learn how to deploy Datafold in a Virtual Private Cloud (VPC) on AWS.
**INFO**
VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.
## Create a Domain Name (optional)
You can either choose to use your domain (for example, `datafold.domain.tld`) or to use a Datafold managed domain (for example, `yourcompany.dedicated.datafold.com`).
### Customer Managed Domain Name
Create a DNS A-record for the domain where Datafold will be hosted. For the DNS record, there are two options:
* **Public-facing:** When the domain is publicly available, we will provide an SSL certificate for the endpoint.
* **Internal:** It is also possible to have Datafold disconnected from the internet. This would require an internal DNS (for example, AWS Route 53) record that points to the Datafold instance. It is possible to provide your own certificate for setting up the SSL connection.
Once the deployment is complete, you will point that A-record to the IP address of the Datafold service.
## Give Datafold Access to AWS
To set up Datafold, you need to create a separate account within your organization where we can deploy Datafold. We follow the [AWS best practices for third-party access](https://docs.aws.amazon.com/IAM/latest/UserGuide/id%5Froles%5Fcommon-scenarios%5Fthird-party.html).
### Create a separate AWS account for Datafold
First, create a new account for Datafold. Go to **My Organization** to add an account:
Click **Add an AWS Account**:
You can name this account anything that helps identify it clearly. In our examples, we name it **Datafold**. Make sure that the email address of the owner isn't used by another account.
When you click the **Create AWS Account** button, you'll be returned to the organization screen and see a notification that the new account is being created. After you refresh the page a few minutes later, the account should appear in the organizations list.
### Grant Third-Party access to Datafold
To make sure that deployment runs as expected, your Datafold Support Engineer may need access to the Datafold-specific AWS account that you created. The access can be revoked after the deployment if needed.
To grant access, log into the account created in the previous step. You can switch to the newly created account using the [Switch Role page](https://signin.aws.amazon.com/switchrole):
By default, the role name is **OrganizationAccountAccessRole**.
Click **Switch Role** to log in to the Datafold account.
## Grant Access to Datafold
Next, we need to allow Datafold to access the account. We do this by allowing the Datafold AWS account to access your AWS workspace. Go to the [IAM page](https://console.aws.amazon.com/iam/home) or type **IAM** in the search bar:
Go to the Roles page, and click the **Create Role** button:
Select **Another AWS Account**, and use account ID `710753145501`, which is Datafold's account ID. Select **Require MFA** and click **Next: Permissions**.
On the Permissions page, attach the **AdministratorAccess** permissions for Datafold to have control over the resources within the account, or see [Minimal IAM Permissions](#minimal-iam-permissions).
Next, you can set **Tags**; however, they are not a requirement.
Finally, give the role a name of your choice. Be careful not to duplicate the account name. If you named the account in an earlier step `Datafold`, you may want to name the role `Datafold-role`.
Click **Create Role** to complete this step.
Now that the role is created, you should be routed back to a list of roles in your organization.
Click on your newly created role to get a sharable link for the account and store this in your password manager. When setting up your deployment with a support engineer, Datafold will use this link to gain access to the account.
After validating the deployment with your support engineer, and making sure that everything works as it should, we will let you know when it's clear to revoke the credentials.
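For reference, the console steps above (trusting account `710753145501` with **Require MFA** selected) typically result in a trust policy shaped like the one built below. This is a sketch for review purposes, not the authoritative document: the function name is illustrative, and you should compare the output with the **Trust relationships** tab of the role you actually created.

```python
import json

DATAFOLD_ACCOUNT_ID = "710753145501"  # Datafold's account ID, as given above


def build_trust_policy(account_id: str, require_mfa: bool = True) -> dict:
    """Build a cross-account trust policy that lets `account_id` assume the role."""
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if require_mfa:
        # Condition the console adds when "Require MFA" is selected.
        statement["Condition"] = {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
    return {"Version": "2012-10-17", "Statement": [statement]}


print(json.dumps(build_trust_policy(DATAFOLD_ACCOUNT_ID), indent=2))
```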
### Minimal IAM Permissions
Because we work in an account dedicated to Datafold, there is no direct access to your resources unless explicitly configured (e.g., VPC Peering). The following IAM policy is required to update and maintain the infrastructure.
```JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "acm:AddTagsToCertificate",
        "acm:DeleteCertificate",
        "acm:DescribeCertificate",
        "acm:GetCertificate",
        "acm:ListCertificates",
        "acm:ListTagsForCertificate",
        "acm:RemoveTagsFromCertificate",
        "acm:RequestCertificate",
        "acm:UpdateCertificateOptions",
        "apigateway:DELETE",
        "apigateway:GET",
        "apigateway:PATCH",
        "apigateway:POST",
        "apigateway:PUT",
        "apigateway:UpdateRestApiPolicy",
        "autoscaling:*",
        "ec2:*",
        "eks:*",
        "elasticloadbalancing:*",
        "iam:GetPolicy",
        "iam:GetPolicyVersion",
        "iam:GetOpenIDConnectProvider",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:GetUserPolicy",
        "iam:GetUser",
        "iam:ListAccessKeys",
        "iam:ListAttachedRolePolicies",
        "iam:ListGroupsForUser",
        "iam:ListInstanceProfilesForRole",
        "iam:ListPolicies",
        "iam:ListPolicyVersions",
        "iam:ListRolePolicies",
        "iam:PassRole",
        "iam:TagOpenIDConnectProvider",
        "iam:TagPolicy",
        "iam:TagRole",
        "iam:TagUser",
        "kms:CreateAlias",
        "kms:CreateGrant",
        "kms:CreateKey",
        "kms:Decrypt",
        "kms:DeleteAlias",
        "kms:DescribeKey",
        "kms:DisableKey",
        "kms:EnableKeyRotation",
        "kms:GenerateDataKey",
        "kms:GetKeyPolicy",
        "kms:GetKeyRotationStatus",
        "kms:ListAliases",
        "kms:ListResourceTags",
        "kms:PutKeyPolicy",
        "kms:RevokeGrant",
        "kms:ScheduleKeyDeletion",
        "kms:TagResource",
        "logs:CreateLogGroup",
        "logs:DeleteLogGroup",
        "logs:DescribeLogGroups",
        "logs:ListTagsLogGroup",
        "logs:ListTagsForResource",
        "logs:PutRetentionPolicy",
        "logs:TagResource",
        "rds:*",
        "ssm:GetParameter",
        "secretsmanager:CreateSecret",
        "secretsmanager:DeleteSecret",
        "secretsmanager:DescribeSecret",
        "secretsmanager:GetResourcePolicy",
        "secretsmanager:PutSecretValue",
        "secretsmanager:TagResource",
        "s3:*"
      ],
      "Resource": "*"
    }
  ]
}
```
Some permissions are needed only occasionally, for example during the first deployment. Because they are IAM-related, we will ask for them temporarily when required.
```JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:AttachRolePolicy",
        "iam:CreateAccessKey",
        "iam:CreateOpenIDConnectProvider",
        "iam:CreatePolicy",
        "iam:CreateRole",
        "iam:CreateUser",
        "iam:DeleteAccessKey",
        "iam:DeleteOpenIDConnectProvider",
        "iam:DeletePolicy",
        "iam:DeleteRole",
        "iam:DeleteRolePolicy",
        "iam:DeleteUser",
        "iam:DeleteUserPolicy",
        "iam:DetachRolePolicy",
        "iam:PutRolePolicy",
        "iam:PutUserPolicy"
      ],
      "Resource": "*"
    }
  ]
}
```
It is easier to attach `PowerUserAccess` and then selectively add the IAM permissions given above.
`PowerUserAccess` does not grant most `account:*`, `organizations:*`, and `iam:*` actions, which is why the IAM permissions must be added separately.
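When reviewing which policy to attach, it can help to test whether a given policy document covers a specific action. The sketch below is a deliberately simplified matcher: it looks only at `Allow` statements and `Action` wildcards, and ignores `Deny`, `NotAction`, `Resource`, and conditions, so it is not a substitute for the IAM policy simulator. The function name and the policy excerpt are illustrative.

```python
from fnmatch import fnmatchcase


def allows(policy: dict, action: str) -> bool:
    """Return True if any Allow statement lists a pattern matching `action`."""
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        patterns = statement.get("Action", [])
        if isinstance(patterns, str):
            patterns = [patterns]
        # IAM action names are case-insensitive, so compare in lower case.
        if any(fnmatchcase(action.lower(), p.lower()) for p in patterns):
            return True
    return False


# Short excerpt of the minimal policy above, for illustration only.
excerpt = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["ec2:*", "iam:GetRole", "iam:PassRole"], "Resource": "*"}
    ],
}
print(allows(excerpt, "ec2:DescribeInstances"))  # True
print(allows(excerpt, "iam:CreateRole"))         # False
```

In this excerpt, `iam:CreateRole` is not covered, which matches the split above: role creation belongs to the temporary permission set.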
## Datafold AWS infrastructure details
This document provides detailed information about the AWS infrastructure components deployed by the Datafold Terraform module, explaining the architectural decisions and operational considerations for each component.
## EBS volumes
The Datafold application requires 3 volumes for persistent storage, each deployed as encrypted Elastic Block Store (EBS) volumes in the primary availability zone. This also means that pods cannot be deployed outside the availability zone of these volumes, because the nodes wouldn't be able to attach them.
**ClickHouse data volume** serves as the analytical database storage for Datafold. ClickHouse is a columnar database that excels at analytical queries. The default 40GB allocation usually provides sufficient space for typical deployments, but it can be scaled up based on data volume requirements. The GP3 volume type with 3000 IOPS ensures consistent performance for analytical workloads.
**ClickHouse Logs Volume** stores ClickHouse's internal logs and temporary data. The separate logs volume keeps log writes from consuming the IOPS and I/O throughput of the data volume.
**Redis Data Volume** provides persistent storage for Redis, which handles task distribution and distributed locks in the Datafold application. Redis is memory-first but benefits from persistence for data durability across restarts. The 50GB default size accommodates typical caching needs while remaining cost-effective.
All EBS volumes are encrypted using AWS KMS, managed by AWS, ensuring data security at rest. The volumes are deployed in the first availability zone to minimize latency and simplify backup strategies.
## Load balancer
The load balancer serves as the primary entry point for all external traffic to the Datafold application. The module offers 2 deployment strategies, each with different operational characteristics and trade-offs.
**External Load Balancer Deployment** (the default approach) creates an AWS Application Load Balancer through Terraform. This approach provides centralized control over load balancer configuration and integrates well with existing AWS infrastructure. The load balancer automatically handles SSL termination, health checks, and traffic distribution across Kubernetes pods. This method is ideal for organizations that prefer infrastructure-as-code management and want consistent load balancer configurations across environments.
**Kubernetes-Managed Load Balancer** deployment sets `deploy_lb = false` and relies on the AWS Load Balancer Controller running within the EKS cluster. This approach leverages Kubernetes-native load balancer management, allowing for dynamic scaling and easier integration with Kubernetes ingress resources. The controller automatically provisions and manages load balancers based on Kubernetes service definitions, which can be more flexible for applications that need to scale load balancer resources dynamically.
Both deployment options apply the currently recommended and strictest ELB security policy, `ELBSecurityPolicy-TLS13-1-2-Res-2021-06`, together with the recommended security settings.
The choice between these approaches often depends on operational preferences and existing infrastructure patterns. External deployment provides more predictable resource management, while Kubernetes-managed deployment offers greater flexibility for dynamic workloads.
**Security** A security group shared between the load balancer and the EKS nodes allows traffic to reach only the EKS nodes and nothing else. The load balancer allows traffic to land directly into the EKS private subnet.
**Certificate** The certificate can be pre-created by the customer and then attached, or a cloud-managed certificate can be created on the fly.
The application will not function without HTTPS, so a certificate is mandatory. After the certificate is created either manually or through this repository, it must be validated by the DNS administrator by adding a CNAME record. This puts the certificate in "Issued" state. The certificate cannot be found when it's still provisioning.
## EKS cluster
The Elastic Kubernetes Service (EKS) cluster forms the compute foundation for the Datafold application, providing a managed Kubernetes environment optimized for AWS infrastructure.
**Network Architecture** The entire cluster is deployed into private subnets. This means the data plane is not reachable from the Internet except through the load balancer. A NAT gateway allows the cluster to reach the internet (egress traffic) for downloading pod images, optionally sending Datadog logs and metrics, and retrieving the version to apply to the cluster from our portal. The control plane is accessible via a private endpoint using a PrivateLink setup from, for example, a VPN VPC elsewhere. This is a private+public endpoint, so the control plane can also be made accessible through the Internet, but then the appropriate CIDR restrictions should be put in place.
For a typical dedicated cloud deployment of Datafold, only around 100 IPs are needed. This assumes 3 r7a.2xlarge instances where one node runs ClickHouse+Redis, another node runs the application, and a third node may be put in place when version rollovers occur. This means a subnet of size /24 (256 addresses, of which 251 are usable because AWS reserves five per subnet) should be sufficient to run this application.
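The sizing estimate can be checked with a few lines of arithmetic. The sketch below uses Python's standard `ipaddress` module and the fact that AWS reserves five addresses in every subnet (the first four and the last one); the `10.0.0.0` ranges are placeholders, not values used by the Terraform module.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5   # first four addresses plus the last one
ESTIMATED_IPS_NEEDED = 100    # estimate quoted in the text above


def usable_addresses(cidr: str) -> int:
    """Return the number of addresses AWS leaves available in a subnet."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


# Placeholder CIDR ranges, chosen only to compare prefix lengths.
for cidr in ("10.0.0.0/24", "10.0.0.0/25", "10.0.0.0/26"):
    usable = usable_addresses(cidr)
    print(cidr, usable, usable >= ESTIMATED_IPS_NEEDED)
# 10.0.0.0/24 251 True
# 10.0.0.0/25 123 True
# 10.0.0.0/26 59 False
```

A /24 leaves comfortable headroom above the estimate, which is useful when extra nodes and pods appear during version rollovers.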
By default, the repository creates a VPC and subnets, but by specifying the VPC ID of an existing VPC, you can deploy the cluster and load balancer into existing network infrastructure. This matters for customers who run a different architecture, for example one without NAT gateways, with firewalls that inspect egress traffic, or with other DLP controls.
**Add-ons**
The cluster includes essential add-ons like CoreDNS for service discovery, the VPC CNI for networking, and the EBS CSI driver for persistent volume management. These components are automatically updated and maintained by AWS, reducing operational overhead.
The AWS load balancer controller and metrics-server are deployed separately via Helm charts in the application deployment, not through this Terraform infrastructure. The Load Balancer Controller manages at least the AWS target group that enables ingress for the Datafold application. Optionally, it may also manage the entire external load balancer.
**Node Management** supports up to three managed node groups, allowing for workload-specific resource allocation. Each node group can be configured with different instance types, enabling cost optimization and performance tuning for different application components. The cluster autoscaler automatically adjusts node count based on resource demands, ensuring efficient resource utilization while maintaining application availability. One typical way to deploy is to let the application pods go on a wider range of nodes, and set up tolerations and labels on the second node group, which are then selected by both Redis and ClickHouse. This is because Redis and ClickHouse have restrictions on the zone they must be present in because of their volumes, and ClickHouse is a bit more CPU intensive. This method optimizes CPU performance for the Datafold application.
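The placement pattern described above relies on two standard Kubernetes pod spec fields: `nodeSelector` to target the labeled node group and `tolerations` to permit scheduling onto its tainted nodes. The sketch below builds such a fragment; the label and taint key and value are hypothetical, and where the fragment goes in the Datafold Helm chart values is not covered here.

```python
import json

# Hypothetical label/taint key and value; use whatever your node group defines.
NODE_LABEL_KEY = "datafold.example/workload"
NODE_LABEL_VALUE = "stateful"


def scheduling_fragment(key: str, value: str) -> dict:
    """Build pod spec fields that pin a pod to a labeled, tainted node group."""
    return {
        # Only nodes carrying this label are eligible.
        "nodeSelector": {key: value},
        # Allow scheduling onto nodes tainted with key=value:NoSchedule.
        "tolerations": [
            {"key": key, "operator": "Equal", "value": value, "effect": "NoSchedule"}
        ],
    }


print(json.dumps(scheduling_fragment(NODE_LABEL_KEY, NODE_LABEL_VALUE), indent=2))
```

Applying the same fragment to both Redis and ClickHouse keeps them on the node group that sits in the availability zone of their volumes, while application pods without the toleration stay on the other nodes.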
**Security Features** include IAM Roles for Service Accounts (IRSA), which provide fine-grained IAM permissions to Kubernetes pods without requiring AWS credentials in container images. This approach enhances security by following the principle of least privilege and integrates seamlessly with AWS security services.
## IAM Roles and Permissions
The IAM architecture follows the principle of least privilege, providing specific permissions only where needed. Service accounts in Kubernetes are mapped to IAM roles using IRSA, enabling secure access to AWS services without embedding credentials in application code.
**EBS CSI Controller Role** enables the Kubernetes cluster to manage EBS volumes dynamically. This role allows pods to request persistent storage that's automatically provisioned and attached to the appropriate nodes or attach static volumes. The permissions are scoped to only the EBS operations needed for volume lifecycle management.
**Load Balancer Controller Role** provides the permissions necessary for Kubernetes to manage AWS load balancers. This includes creating target groups, registering and deregistering targets, and managing load balancer listeners. The controller can automatically provision load balancers based on Kubernetes service definitions, enabling seamless integration between Kubernetes and AWS networking.
**Cluster Autoscaler Role** allows the cluster to automatically scale node groups based on resource demands. This role can describe and modify Auto Scaling groups, enabling the cluster to add or remove nodes as needed. The autoscaler considers pod resource requests and node capacity when making scaling decisions.
**Datafold Roles** Datafold pre-defines one role per pod, and permissions can be assigned to each role as needed. At the moment, two specific roles are in use: one lets the ClickHouse pod make backups and store them on S3, and the other grants access to the Bedrock service for our AI offering.
These roles are automatically created and configured when the cluster is deployed, ensuring that the necessary permissions are in place for the cluster to function properly. The use of IRSA means that these permissions are automatically rotated and managed by AWS, reducing security risks associated with long-lived credentials.
## RDS database
The PostgreSQL Relational Database Service (RDS) instance serves as the primary relational database for the Datafold application, storing user data, configuration, and application state.
**Storage Configuration** starts with a 20GB initial allocation that can automatically scale up to 100GB based on usage patterns. This auto-scaling feature prevents storage-related outages while avoiding over-provisioning. For typical deployments, storage usage remains under 200GB, though some high-volume deployments may approach 400GB. The GP3 storage type provides consistent performance with configurable IOPS and throughput.
**High Availability** is intentionally disabled by default, meaning the database runs in a single availability zone. This configuration reduces costs and complexity while still providing excellent reliability. The database includes automated backups with 14-day retention, ensuring data can be recovered in case of failures. For organizations requiring higher availability, multi-AZ deployment can be enabled, though this significantly increases costs.
**Security and Encryption** always encrypts data at rest using AWS KMS. A dedicated KMS key is created for the database, providing better security isolation and audit capabilities compared to using the default AWS RDS key. The database is deployed in private subnets with security groups that restrict access to only the EKS cluster, ensuring network-level security.
The database configuration prioritizes operational simplicity and cost-effectiveness while maintaining the security and reliability required for production workloads. The combination of automated backups, encryption, and network isolation provides a robust foundation for the application's data storage needs.
# Datafold VPC Deployment on Azure
Source: https://docs.datafold.com/datafold-deployment/dedicated-cloud/azure
Learn how to deploy Datafold in a Virtual Private Cloud (VPC) on Azure.
**INFO**
VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.
## Create a Domain Name (optional)
You can either choose to use your domain (for example, `datafold.domain.tld`) or to use a Datafold managed domain (for example, `yourcompany.dedicated.datafold.com`).
### Customer Managed Domain Name
Create a DNS A-record for the domain where Datafold will be hosted. For the DNS record, there are two options:
* **Public-facing:** When the domain is publicly available, we will provide an SSL certificate for the endpoint.
* **Internal:** It is also possible to have Datafold disconnected from the internet. This would require an internal DNS (for example, Azure DNS) record that points to the Datafold instance. It is possible to provide your own certificate for setting up the SSL connection.
Once the deployment is complete, you will point that A-record to the IP address of the Datafold service.
## Create a New Subscription
For isolation reasons, it is best practice to [create a new subscription](https://learn.microsoft.com/en-us/azure/cost-management-billing/manage/create-subscription) within your Microsoft Entra directory/tenant. Please call it something like `yourcompany-datafold` to make it easy to identify.
## Set IAM Permissions
Go to **Microsoft Entra ID** and navigate to **Users**. Click **Add**, **User**, **Invite external user** and add the Datafold engineers.
Navigate to the subscription you just created and go to the **Access control (IAM)** tab in the sidebar.
* Under **Add**, select **Add role assignment**.
* Under **Role**, navigate to **Privileged administrator roles** and select **Owner**.
* Under **Members**, click **Select members** and add the Datafold engineers.
* When you are done, select **Review + assign**.
The Owner role is only required temporarily while we configure and test the initial Datafold deployment. We'll inform you when it is OK to revoke this permission.
### Required APIs
The following Azure APIs need to be enabled to run Datafold:
1. [Microsoft.ContainerService](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryFeaturedMenuItemBlade/selectedMenuItemId/home/searchQuery/Container%20Service)
2. [Microsoft.Network](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryFeaturedMenuItemBlade/selectedMenuItemId/home/searchQuery/Network)
3. [Microsoft.Compute](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryFeaturedMenuItemBlade/selectedMenuItemId/home/searchQuery/Compute)
4. [Microsoft.KeyVault](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryFeaturedMenuItemBlade/selectedMenuItemId/home/searchQuery/Key%20Vault)
5. [Microsoft.Storage](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryFeaturedMenuItemBlade/selectedMenuItemId/home/searchQuery/Storage)
6. [Microsoft.DBforPostgreSQL](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryFeaturedMenuItemBlade/selectedMenuItemId/home/searchQuery/PostgreSQL)
Once the access has been granted, make sure to notify Datafold so we can initiate the deployment.
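To confirm that the resource providers listed under Required APIs are registered in the subscription, you can inspect the output of `az provider list --output json`, which reports a `namespace` and a `registrationState` for each provider. The sketch below is an illustrative helper (the function name and the sample data are made up) that reports which required providers are not yet in the `Registered` state.

```python
import json

# Provider namespaces from the Required APIs list above.
REQUIRED_PROVIDERS = (
    "Microsoft.ContainerService",
    "Microsoft.Network",
    "Microsoft.Compute",
    "Microsoft.KeyVault",
    "Microsoft.Storage",
    "Microsoft.DBforPostgreSQL",
)


def unregistered_providers(az_json: str) -> list[str]:
    """Return required providers that are missing or not in the Registered state."""
    states = {
        provider["namespace"].lower(): provider.get("registrationState")
        for provider in json.loads(az_json)
    }
    return [name for name in REQUIRED_PROVIDERS if states.get(name.lower()) != "Registered"]


# Hypothetical, trimmed sample of `az provider list --output json`.
sample = json.dumps([
    {"namespace": "Microsoft.Compute", "registrationState": "Registered"},
    {"namespace": "Microsoft.Network", "registrationState": "Registered"},
    {"namespace": "Microsoft.Storage", "registrationState": "NotRegistered"},
])
print(unregistered_providers(sample))
```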
## Datafold Azure infrastructure details
This document provides detailed information about the Azure infrastructure components deployed by the Datafold Terraform module,
explaining the architectural decisions and operational considerations for each component.
## Managed disks
The Datafold application requires 3 managed disks for persistent storage, each deployed as encrypted Azure managed disks in the
primary availability zone. This also means that pods cannot be deployed outside the availability zone of these disks, because
the nodes wouldn't be able to attach them.
**ClickHouse data disk** serves as the analytical database storage for Datafold. ClickHouse is a columnar database that excels
at analytical queries. The default 40GB allocation usually provides sufficient space for typical deployments, but it can be
scaled up based on data volume requirements. The StandardSSD\_LRS disk type with configurable IOPS and throughput ensures
consistent performance for analytical workloads.
**ClickHouse logs disk** stores ClickHouse's internal logs and temporary data. The separate logs disk keeps log writes from
consuming the IOPS and I/O throughput of the data disk.
**Redis data disk** provides persistent storage for Redis, which handles task distribution and distributed locks in the Datafold
application. Redis is memory-first but benefits from persistence for data durability across restarts. The 50GB default size
accommodates typical caching needs while remaining cost-effective.
All managed disks are encrypted by default using Azure-managed encryption keys, ensuring data security at rest. The disks are
deployed in the first availability zone to minimize latency and simplify backup strategies. For Premium and Ultra SSD disk
types, IOPS and throughput can be configured to optimize performance for specific workloads.
## Application Gateway
The Application Gateway serves as the primary entry point for all external traffic to the Datafold application. The module
offers 2 deployment strategies, each with different operational characteristics and trade-offs.
**External Application Gateway Deployment** (the default approach) creates an Azure Application Gateway through Terraform.
This approach provides centralized control over load balancer configuration and integrates well with existing Azure
infrastructure. The Application Gateway automatically handles SSL termination, health checks, and traffic distribution across
Kubernetes pods. This method is ideal for organizations that prefer infrastructure-as-code management and want consistent
load balancer configurations across environments.
**Kubernetes-Managed Application Gateway** deployment sets `deploy_lb = false` and relies on the Azure Application Gateway
Ingress Controller (AGIC) running within the AKS cluster. This approach leverages Kubernetes-native load balancer management,
allowing for dynamic scaling and easier integration with Kubernetes ingress resources. The controller automatically provisions
and manages Application Gateways based on Kubernetes service definitions, which can be more flexible for applications that
need to scale load balancer resources dynamically.
Both deployment options apply the currently recommended and strictest SSL policy, `AppGwSslPolicy20220101S`, together with the
recommended security settings.
The choice between these approaches often depends on operational preferences and existing infrastructure patterns. External
deployment provides more predictable resource management, while Kubernetes-managed deployment offers greater flexibility for
dynamic workloads.
**Security** A network security group shared between the Application Gateway and the AKS nodes allows traffic to reach only
the AKS nodes and nothing else. The Application Gateway allows traffic to land directly into the AKS private subnet.
**Certificate** The certificate can be pre-created by the customer and then attached, or a cloud-managed certificate can be
created on the fly. The application will not function without HTTPS, so a certificate is mandatory. After the certificate is
created either manually or through this repository, it must be validated by the DNS administrator by adding a CNAME record.
This puts the certificate in "Issued" state. The certificate cannot be found when it's still provisioning.
## AKS cluster
The Azure Kubernetes Service (AKS) cluster forms the compute foundation for the Datafold application, providing a managed
Kubernetes environment optimized for Azure infrastructure.
**Network Architecture** The entire cluster is deployed into private subnets. This means the data plane is not reachable from
the Internet except through the Application Gateway. A NAT gateway allows the cluster to reach the internet (egress traffic)
for downloading pod images, optionally sending Datadog logs and metrics, and retrieving the version to apply to the cluster
from our portal. The control plane is accessible via a private endpoint using a Private Link setup from, for example, a VPN
VNet elsewhere. This is a private+public endpoint, so the control plane can also be made accessible through the Internet, but
then the appropriate CIDR restrictions should be put in place.
For a typical dedicated cloud deployment of Datafold, only around 100 IPs are needed. This assumes 3 Standard\_DS2\_v2 instances
where one node runs ClickHouse+Redis, another node runs the application, and a third node may be put in place when version
rollovers occur. This means a subnet of size /24 (256 addresses, of which 251 are usable because Azure reserves five per subnet)
should be sufficient to run this application.
By default, the repository creates a VNet and subnets, but by specifying the VNet ID of an already existing VNet, the cluster
and Application Gateway get deployed into existing network infrastructure. This matters for customers who run a different
architecture, for example one without NAT gateways, or one with egress-inspecting firewalls and other DLP controls.
**Add-ons**
The cluster includes several essential add-ons configured through Terraform:
**Workload Identity** is enabled to provide fine-grained IAM permissions to Kubernetes pods without requiring Azure credentials
in container images. This is essential for ClickHouse to access Azure Storage for backups and other services.
**Ingress Application Gateway** is integrated with the cluster to handle external traffic routing and SSL termination. The
Application Gateway Ingress Controller (AGIC) manages the Application Gateway configuration based on Kubernetes ingress resources.
**Storage Profile** includes the Azure Disk CSI driver for persistent volume management, file driver for Azure Files, and
snapshot controller for volume snapshots. These components enable dynamic provisioning and management of Azure storage resources.
**Node Management** supports up to three managed node pools, allowing for workload-specific resource allocation. Each node
pool can be configured with different VM sizes, enabling cost optimization and performance tuning for different application
components. The cluster autoscaler automatically adjusts node count based on resource demands, ensuring efficient resource
utilization while maintaining application availability. One typical way to deploy is to let the application pods go on a wider
range of nodes, and set up tolerations and labels on the second node pool, which are then selected by both Redis and
ClickHouse. This is because Redis and ClickHouse have restrictions on the zone they must be present in because of their
disks, and ClickHouse is a bit more CPU intensive. This method optimizes CPU performance for the Datafold application.
**Security Features** include Azure Workload Identity, which provides fine-grained IAM permissions to Kubernetes pods without
requiring Azure credentials in container images. This approach enhances security by following the principle of least privilege
and integrates seamlessly with Azure security services. The cluster also supports private clusters with restricted control
plane access and network policies for pod-to-pod communication control.
## IAM Roles and Permissions
The IAM architecture follows the principle of least privilege, providing specific permissions only where needed. Service
accounts in Kubernetes are mapped to IAM roles using Azure Workload Identity, enabling secure access to Azure services without
embedding credentials in application code.
**Azure Disk CSI Controller Role** enables the Kubernetes cluster to manage Azure managed disks dynamically. This role allows
pods to request persistent storage that's automatically provisioned and attached to the appropriate nodes or attach static
disks. The permissions are scoped to only the Azure Disk operations needed for disk lifecycle management.
**Application Gateway Ingress Controller Role** provides the permissions necessary for Kubernetes to manage Azure Application
Gateways. This includes creating backend address pools, registering and deregistering targets, and managing Application
Gateway listeners. The controller can automatically provision Application Gateways based on Kubernetes service definitions,
enabling seamless integration between Kubernetes and Azure networking.
**Cluster Autoscaler Role** allows the cluster to automatically scale node pools based on resource demands. This role can
describe and modify Virtual Machine Scale Sets, enabling the cluster to add or remove nodes as needed. The autoscaler considers
pod resource requests and node capacity when making scaling decisions.
**Datafold Roles** Datafold pre-defines a role for each pod; permissions are assigned to a role only when the pod needs them. At
the moment, we have two specific roles in use. One is for the ClickHouse pod to be able to make backups and store them on
Azure Storage. The other is for the use of the Azure OpenAI service for our AI offering.
These roles are automatically created and configured when the cluster is deployed, ensuring that the necessary permissions are
in place for the cluster to function properly. The use of Azure Workload Identity means that the underlying credentials are automatically
rotated and managed by Azure, reducing security risks associated with long-lived credentials.
## Azure Database for PostgreSQL
The Azure Database for PostgreSQL Flexible Server instance serves as the primary relational database for the Datafold
application, storing user data, configuration, and application state.
**Storage Configuration** starts with a 32GB initial allocation that can automatically scale up to 100GB based on usage
patterns. This auto-scaling feature prevents storage-related outages while avoiding over-provisioning. For typical deployments,
storage usage remains under 200GB, though some high-volume deployments may approach 400GB. The GP\_Standard storage type
provides consistent performance with configurable IOPS and throughput.
**High Availability** is intentionally disabled by default, meaning the database runs in a single availability zone. This
configuration reduces costs and complexity while still providing excellent reliability. The database includes automated backups
with 7-day retention, ensuring data can be recovered in case of failures. For organizations requiring higher availability,
multi-zone deployment can be enabled, though this significantly increases costs.
**Security and Encryption** Data at rest is always encrypted using Azure-managed encryption keys. The database is deployed in
private subnets with network security groups that restrict access to only the AKS cluster, ensuring network-level security.
The database supports Azure Private Link for secure, private connectivity from the VNet.
The database configuration prioritizes operational simplicity and cost-effectiveness while maintaining the security and
reliability required for production workloads. The combination of automated backups, encryption, and network isolation
provides a robust foundation for the application's data storage needs.
# Datafold VPC Deployment on GCP
Source: https://docs.datafold.com/datafold-deployment/dedicated-cloud/gcp
Learn how to deploy Datafold in a Virtual Private Cloud (VPC) on GCP.
**INFO**
VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.
## Create a Domain Name (optional)
You can either choose to use your domain (for example, `datafold.domain.tld`) or to use a Datafold managed domain (for example, `yourcompany.dedicated.datafold.com`).
### Customer Managed Domain Name
Create a DNS A-record for the domain where Datafold will be hosted. For the DNS record, there are two options:
* **Public-facing:** When the domain is publicly available, we will provide an SSL certificate for the endpoint.
* **Internal:** It is also possible to have Datafold disconnected from the internet. This would require an internal DNS record (for example, in Cloud DNS) that points to the Datafold instance. It is possible to provide your own certificate for setting up the SSL connection.
Once the deployment is complete, you will point that A-record to the IP address of the Datafold service.
## Create a New Project
For isolation reasons, it is best practice to [create a new project](https://console.cloud.google.com/projectcreate) within your GCP organization. Please call it something like `yourcompany-datafold` to make it easy to identify.
After a minute or so, you should receive confirmation that the project has been created. Afterward, you should be able to see the new project.
## Set IAM Permissions
Navigate to the **IAM** tab in the sidebar and click **Grant Access** to invite Datafold to the project.
Add your Datafold solutions engineer as a **principal**. You have two options for assigning IAM permissions to the Datafold engineers:
1. Assign them as an **owner** of your project.
2. Assign the extended set of [Minimal IAM Permissions](#minimal-iam-permissions).
The owner role is only required temporarily while we configure and test the initial Datafold deployment. We'll inform you when it is OK to revoke this permission and provide us with only the [Minimal IAM Permissions](#minimal-iam-permissions).
### Required APIs
The following GCP APIs must also be enabled to run Datafold:
1. [Compute Engine API](https://console.cloud.google.com/apis/library/compute.googleapis.com)
2. [Secret Manager API](https://console.cloud.google.com/apis/api/secretmanager.googleapis.com)
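If you prefer the command line to the console, the two APIs above can also be enabled with `gcloud services enable`. The sketch below only assembles and prints the command. It assumes the gcloud CLI is installed and authenticated, and that your project ID matches the placeholder `yourcompany-datafold`; the service names are taken from the API library links above.

```python
import shlex

# Service names taken from the API library URLs listed above.
REQUIRED_SERVICES = [
    "compute.googleapis.com",
    "secretmanager.googleapis.com",
]


def enable_services_command(project_id: str) -> list[str]:
    """Build the gcloud invocation that enables the required APIs."""
    return ["gcloud", "services", "enable", *REQUIRED_SERVICES, "--project", project_id]


# "yourcompany-datafold" is a placeholder; use your real project ID.
command = enable_services_command("yourcompany-datafold")
print(shlex.join(command))
```

Copy the printed command into a terminal, or pass `command` to `subprocess.run` if you want to execute it from a script.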
The following GCP APIs that we use are already enabled by default when you create the project:
1. [Cloud Logging API](https://console.cloud.google.com/apis/api/logging.googleapis.com)
2. [Cloud Monitoring API](https://console.cloud.google.com/apis/api/monitoring.googleapis.com)
3. [Cloud Storage](https://console.cloud.google.com/apis/api/storage-component.googleapis.com)
4. [Service Networking API](https://console.cloud.google.com/apis/api/servicenetworking.googleapis.com)
Once the access has been granted, make sure to notify Datafold so we can initiate the deployment.
### Minimal IAM Permissions
Because we work in a Project dedicated to Datafold, there is no direct access to your resources unless explicitly configured (e.g., VPC Peering). The following IAM roles are required to update and maintain the infrastructure.
```Bash theme={null}
Cloud SQL Admin
Compute Load Balancer Admin
Compute Network Admin
Compute Security Admin
Compute Storage Admin
IAP-secured Tunnel User
Kubernetes Engine Admin
Kubernetes Engine Cluster Admin
Role Viewer
Service Account User
Storage Admin
Viewer
```
We need some additional roles only occasionally, for example during the first deployment. Since those are IAM-related, we will ask for temporary permissions when required.
```Bash theme={null}
Role Administrator
Security Admin
Service Account Key Admin
Service Account Admin
Service Usage Admin
```
## Datafold Google Cloud infrastructure details
This document provides detailed information about the Google Cloud infrastructure components deployed
by the Datafold Terraform module, explaining the architectural decisions and operational considerations for each component.
## Persistent disks
The Datafold application requires 3 persistent disks for storage, each deployed as encrypted Google Compute Engine
persistent disks in the primary availability zone. This also means that pods cannot be deployed outside the availability
zone of these disks, because the nodes wouldn't be able to attach them.
**ClickHouse data disk** serves as the analytical database storage for Datafold. ClickHouse is a columnar database
that excels at analytical queries. The default 40GB allocation usually provides sufficient space for typical deployments,
but it can be scaled up based on data volume requirements. The pd-balanced disk type provides consistent
performance for analytical workloads with automatically managed IOPS and throughput.
**ClickHouse logs disk** stores ClickHouse's internal logs and temporary data. Keeping logs on a separate disk prevents
log writes from competing with the data disk for IOPS and throughput.
**Redis data disk** provides persistent storage for Redis, which handles task distribution and distributed locks in
the Datafold application. Redis is memory-first but benefits from persistence for data durability across restarts.
The 50GB default size accommodates typical caching needs while remaining cost-effective.
All persistent disks are encrypted by default using Google-managed encryption keys, ensuring data security at rest.
The disks are deployed in the first availability zone to minimize latency and simplify backup strategies.
## Load balancer
The load balancer serves as the primary entry point for all external traffic to the Datafold application.
The module offers 2 deployment strategies, each with different operational characteristics and trade-offs.
**External Load Balancer Deployment** (the default approach) creates a Google Cloud Load Balancer through Terraform.
This approach provides centralized control over load balancer configuration and integrates well with existing Google Cloud infrastructure.
The load balancer automatically handles SSL termination, health checks, and traffic distribution across Kubernetes pods.
This method is ideal for organizations that prefer infrastructure-as-code management and want consistent load balancer configurations across environments.
**Kubernetes-Managed Load Balancer** deployment sets `deploy_lb = false` and relies on the Google Cloud Load Balancer Controller
running within the GKE cluster. This approach leverages Kubernetes-native load balancer management, allowing for
dynamic scaling and easier integration with Kubernetes ingress resources. The controller automatically provisions and manages load balancers based on Kubernetes service definitions, which can be more flexible for applications that need to scale load balancer resources dynamically.
For external load balancers deployed through Kubernetes, the infrastructure developer needs to create SSL policies and
Cloud Armor policies separately and attach them to the load balancer through annotations. Internal load balancers cannot
have SSL policies or Cloud Armor applied. Our Helm charts support various deployment types including internal/external
load balancers with uploaded certificates or certificates stored in Kubernetes secrets.
The choice between these approaches often depends on operational preferences and existing infrastructure patterns.
External deployment provides more predictable resource management, while Kubernetes-managed deployment offers greater flexibility for dynamic workloads.
**Security** A firewall rule shared between the load balancer and the GKE nodes allows traffic to reach only the GKE nodes and nothing else.
The load balancer allows traffic to land directly into the GKE private subnet.
**Certificate** The certificate can be pre-created by the customer and then attached, or a Google-managed SSL certificate can be created on the fly.
The application will not function without HTTPS, so a certificate is mandatory. After the certificate is created either
manually or through this repository, it must be validated by the DNS administrator by adding an A record. This puts the
certificate in "ACTIVE" state. The certificate cannot be found when it's still provisioning.
## GKE cluster
The Google Kubernetes Engine (GKE) cluster forms the compute foundation for the Datafold application,
providing a managed Kubernetes environment optimized for Google Cloud infrastructure.
**Network Architecture** The entire cluster is deployed into private subnets. This means the data plane
is not reachable from the Internet except through the load balancer. A Cloud NAT allows the cluster to reach the
internet (egress traffic) for downloading pod images, optionally sending Datadog logs and metrics,
and retrieving the version to apply to the cluster from our portal. The control plane is accessible via a private endpoint
using a Private Service Connect setup from, for example, a VPN VPC elsewhere. This is a private+public endpoint,
so the control plane can also be made accessible through the Internet, but then the appropriate CIDR restrictions should be put in place.
For a typical dedicated cloud deployment of Datafold, only around 100 IPs are needed.
This assumes 3 e2-standard-8 instances where one node runs ClickHouse+Redis, another node runs the application,
and a third node may be put in place when version rollovers occur. This means a /24 subnet (256 addresses, 252 usable after the 4 that Google Cloud reserves in each primary range)
should be sufficient to run this application, but you can always apply a different CIDR per subnet if needed.
By default, the repository creates a VPC and subnets, but by specifying the VPC ID of an already existing VPC,
the cluster and load balancer get deployed into existing network infrastructure.
This matters for customers who run a different architecture, for example one without Cloud NAT, or one with egress-inspecting firewalls and other DLP controls.
**Add-ons**
The cluster includes essential add-ons like CoreDNS for service discovery, VPC-native networking for pod connectivity,
and the GCE persistent disk CSI driver for persistent volume management. These components are automatically updated
and maintained by Google, reducing operational overhead.
**Node Management** supports up to three managed node pools, allowing for workload-specific resource allocation.
Each node pool can be configured with different machine types, enabling cost optimization and performance tuning
for different application components. The cluster autoscaler automatically adjusts node count based on resource demands,
ensuring efficient resource utilization while maintaining application availability. One typical way to deploy
is to let the application pods go on a wider range of nodes, and set up tolerations and labels on the second node pool,
which are then selected by both Redis and ClickHouse. This is because Redis and ClickHouse have restrictions
on the zone they must be present in because of their disks, and ClickHouse is a bit more CPU intensive.
This method optimizes CPU performance for the Datafold application.
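The placement described above can be sketched with a node selector and a toleration. The example below is a simplified model of the Kubernetes scheduling rules (it only handles exact-match tolerations), and the label `pool=stateful` and taint `dedicated=stateful:NoSchedule` are hypothetical names, not values used by the Datafold Helm charts.

```python
# Hypothetical label and taint applied to the second node pool.
POOL_LABELS = {"pool": "stateful"}
POOL_TAINT = {"key": "dedicated", "value": "stateful", "effect": "NoSchedule"}

# Pod spec fragment for ClickHouse or Redis: select the pool and tolerate its taint.
stateful_pod = {
    "nodeSelector": dict(POOL_LABELS),
    "tolerations": [dict(POOL_TAINT, operator="Equal")],
}
# Application pods carry no selector or tolerations.
app_pod = {}


def can_schedule(pod, node_labels, node_taints):
    """Simplified check: selector labels must match and every taint must be tolerated."""
    selector = pod.get("nodeSelector", {})
    if any(node_labels.get(key) != value for key, value in selector.items()):
        return False
    tolerations = pod.get("tolerations", [])

    def tolerated(taint):
        return any(
            t["key"] == taint["key"]
            and t["value"] == taint["value"]
            and t["effect"] == taint["effect"]
            for t in tolerations
        )

    return all(tolerated(taint) for taint in node_taints)


stateful_node = (POOL_LABELS, [POOL_TAINT])
general_node = ({"pool": "general"}, [])

print(can_schedule(stateful_pod, *stateful_node))  # ClickHouse/Redis on the second pool
print(can_schedule(app_pod, *stateful_node))       # application pod on the tainted pool
```

The taint keeps application pods off the second pool, while the node selector keeps ClickHouse and Redis on it, in the zone where their disks live.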
**Security Features** include several critical security configurations:
* **Workload Identity** is enabled and configured with the project's workload pool, providing fine-grained IAM permissions to Kubernetes pods without requiring Google Cloud credentials in container images
* **Shielded nodes** are enabled with secure boot and integrity monitoring for enhanced node security
* **Binary authorization** is configured with project singleton policy enforcement to ensure only authorized container images can be deployed
* **Network policy** is enabled using Calico for pod-to-pod communication control
* **Private nodes** are enabled, ensuring all node traffic goes through the VPC network
These security features follow the principle of least privilege and integrate seamlessly with Google Cloud security services.
## IAM roles and permissions
The IAM architecture follows the principle of least privilege, providing specific permissions only where needed.
Service accounts in Kubernetes are mapped to IAM roles using Workload Identity, enabling secure access to Google
Cloud services without embedding credentials in application code.
**GKE service account** is created with basic permissions for logging, monitoring, and storage access.
This service account is used by the GKE nodes and provides the foundation for cluster operations.
**ClickHouse backup service account** is created with a custom role that allows ClickHouse to make backups and store them on Cloud Storage.
This service account uses Workload Identity to securely access Cloud Storage without embedding credentials.
**Datafold roles** Datafold pre-defines a role for each pod; permissions are assigned to a role only when the pod needs them.
At the moment, we have two specific roles in use. One is for the ClickHouse pod to be able to make backups and store them on Cloud Storage.
The other is for the use of the Vertex AI service for our AI offering.
These roles are automatically created and configured when the cluster is deployed, ensuring that the
necessary permissions are in place for the cluster to function properly. The Datafold and ClickHouse service accounts
authenticate using Workload Identity, which means the underlying credentials are automatically rotated and managed by Google, reducing security risks associated with long-lived credentials.
## Cloud SQL database
The PostgreSQL Cloud SQL instance serves as the primary relational database for the Datafold application,
storing user data, configuration, and application state.
**Storage configuration** starts with a 20GB initial allocation that can automatically scale up to 100GB based on usage patterns.
This auto-scaling feature prevents storage-related outages while avoiding over-provisioning.
For typical deployments, storage usage remains under 200GB, though some high-volume deployments may approach 400GB.
The pd-balanced storage type provides consistent performance with configurable IOPS and throughput.
**High availability** is intentionally disabled by default, meaning the database runs in a single availability zone.
This configuration reduces costs and complexity while still providing excellent reliability. The database includes
automated backups with 7-day retention, ensuring data can be recovered in case of failures. For organizations requiring higher availability,
multi-zone deployment can be enabled, though this significantly increases costs.
**Security and encryption** Data at rest is always encrypted, using Google-managed encryption keys by default.
The database is deployed in private subnets with firewall rules that restrict access to only the GKE cluster,
ensuring network-level security.
The database configuration prioritizes operational simplicity and cost-effectiveness while maintaining the security
and reliability required for production workloads. The combination of automated backups, encryption,
and network isolation provides a robust foundation for the application's data storage needs.
# MCP
Source: https://docs.datafold.com/datafold-mcp
Connect your AI agent to Datafold and interact with your data through natural language
MCP (Model Context Protocol) is the easiest way to interact with Datafold by empowering AI agents like Claude Code, Cursor, or Windsurf with Datafold's tools. Run data diffs, query data sources, manage monitors, and more — all through natural language without leaving your development environment.
## Quick Start
All you need is a Datafold API key and a one-line setup in your AI assistant.
Go to **Settings > Account** in the [Datafold app](https://app.datafold.com) and click **Create API Key**.
For **Claude Code**, run:
```bash theme={null}
claude mcp add --transport http --scope user \
datafold https://app.datafold.com/mcp/ \
--header "Authorization: Key YOUR_API_KEY"
```
For **Cursor**, create `.cursor/mcp.json` in your project:
```json theme={null}
{
"mcpServers": {
"datafold": {
"type": "http",
"url": "https://app.datafold.com/mcp/",
"headers": {
"Authorization": "Key YOUR_API_KEY"
}
}
}
}
```
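To avoid pasting the API key into a file by hand, the same configuration can be generated from an environment variable. This is a minimal sketch: the variable name `DATAFOLD_API_KEY` and the helper names are assumptions for illustration, while the JSON structure mirrors the Cursor example above.

```python
import json
import os
from pathlib import Path


def cursor_mcp_config(api_key: str) -> dict:
    """Return the Cursor MCP configuration shown above for the given key."""
    return {
        "mcpServers": {
            "datafold": {
                "type": "http",
                "url": "https://app.datafold.com/mcp/",
                "headers": {"Authorization": f"Key {api_key}"},
            }
        }
    }


def write_cursor_config(project_dir: Path, api_key: str) -> Path:
    """Write .cursor/mcp.json inside project_dir and return its path."""
    path = project_dir / ".cursor" / "mcp.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cursor_mcp_config(api_key), indent=2) + "\n")
    return path


if __name__ == "__main__":
    # DATAFOLD_API_KEY is a hypothetical variable name; falls back to the placeholder.
    key = os.environ.get("DATAFOLD_API_KEY", "YOUR_API_KEY")
    print(write_cursor_config(Path.cwd(), key))
```

Because the generated file contains the key in plain text, consider keeping `.cursor/mcp.json` out of version control.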
For other AI assistants, see the [full setup guide](/api-reference/mcp-server-setup).
Ask your AI assistant to interact with Datafold. For example:
* *"List my Datafold data sources"*
* *"Run a data diff between table A and table B"*
* *"Show me the latest monitor alerts"*
## Tool Visibility and Permissions
The tools available to your AI agent depend on the permissions of the API key's owner. Tools that require permissions the user doesn't have are automatically hidden from the agent.
To scope an agent to a specific set of tools, create a [custom group](/security/user-roles-and-permissions#custom-groups) with only the permissions you need, assign it to a [service account](/security/service-accounts), and use that service account's API key.
See [MCP Tool Permissions](/security/mcp-tool-permissions) for the exact permissions each MCP tool requires.
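Conceptually, tool visibility behaves like a filter over the permissions held by the API key's owner. The sketch below illustrates that idea only: the tool names and permission identifiers are invented for the example, and the real mapping is the one documented on the MCP Tool Permissions page.

```python
# Hypothetical tool names and permission identifiers, for illustration only.
TOOL_REQUIREMENTS = {
    "list_data_sources": {"data_sources:read"},
    "run_data_diff": {"data_sources:read", "diffs:create"},
    "list_monitor_alerts": {"monitors:read"},
}


def visible_tools(granted: set) -> list:
    """Return the tools whose required permissions are all granted."""
    return sorted(
        name for name, needed in TOOL_REQUIREMENTS.items() if needed <= granted
    )


# A service account in a narrowly scoped custom group sees fewer tools.
print(visible_tools({"data_sources:read"}))
print(visible_tools({"data_sources:read", "diffs:create", "monitors:read"}))
```

A tool is hidden as soon as one of its required permissions is missing, which is why a narrowly scoped service account is an effective way to limit what an agent can do.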
## Supported Clients
Datafold's MCP server works with any client that supports the [Model Context Protocol](https://modelcontextprotocol.io/), including Claude Desktop, Claude Code, Cursor, VS Code with Cline, Windsurf, Continue.dev, Zed, OpenCode, Gemini CLI, and Kiro.
See the [full setup guide](/api-reference/mcp-server-setup) for detailed setup instructions for all supported clients, troubleshooting, and best practices.
# AI Code Reviews
Source: https://docs.datafold.com/deployment-testing/ai-code-reviews
Get automated, AI-powered code reviews on every pull request to catch SQL and data pipeline issues before they reach production.
AI Code Reviews bring LLM-powered analysis directly into your CI pipeline, automatically reviewing every pull request for SQL and data pipeline best practice violations.
When combined with [Data Diffs](/deployment-testing/how-it-works), AI Code Reviews give your team both **code-level** and **data-level** validation on every PR — catching logic errors, anti-patterns, and unintended data changes before they reach production.
AI Code Reviews are an optional add-on to Datafold's CI integration. If you prefer to use only Data Diffs, no changes are needed — your existing CI setup continues to work as before.
## How It Works
1. When a pull request is created or updated, Datafold's CI runner detects the change and checks whether AI Code Reviews are enabled for your organization.
2. Datafold fetches the git diff, annotates it with line numbers, and identifies the affected files.
3. The code diff is sent to an LLM, which analyzes added lines for potential issues while considering the full context of removed and unchanged lines. The model identifies issues, explains them, and suggests specific code improvements referencing exact lines in the diff.
4. A review supervisor validates the findings, merging or refining them to reduce noise and ensure actionable feedback.
5. The AI-generated review is posted as a summary comment and inline annotations on the pull request, providing actionable feedback and suggested code changes.
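To make the line-number annotation step concrete, here is a minimal sketch of how a unified diff can be paired with line numbers in the new file. It is an illustration, not Datafold's implementation: it assumes a single-file diff, and the function name `annotate_diff` and the sample SQL are made up for the example.

```python
import re
from typing import Optional

# Matches a hunk header such as "@@ -1,3 +1,4 @@" and captures the new-file start line.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def annotate_diff(diff_text: str) -> list[tuple[Optional[int], str]]:
    """Pair each diff line with its line number in the new file.

    Headers and removed lines get None because they do not exist in the new file.
    """
    annotated = []
    new_line = None
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            new_line = int(match.group(1))
            annotated.append((None, line))
        elif new_line is None or line.startswith("-"):
            # File headers before the first hunk, and removed lines.
            annotated.append((None, line))
        else:
            # Added and context lines advance the new-file counter.
            annotated.append((new_line, line))
            new_line += 1
    return annotated


sample_diff = "\n".join([
    "--- a/models/orders.sql",
    "+++ b/models/orders.sql",
    "@@ -1,3 +1,4 @@",
    " select",
    "-    *",
    "+    order_id,",
    "+    amount",
    " from raw.orders",
])

added_lines = [
    (number, text[1:])
    for number, text in annotate_diff(sample_diff)
    if number is not None and text.startswith("+")
]
print(added_lines)
```

With line numbers attached, a reviewer (human or model) can reference the exact added lines while still seeing the removed and unchanged lines around them.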
## What AI Code Reviews Check
AI Code Reviews are tuned for SQL and data pipeline code. The LLM analyzes your changes for common issues, including:
* **SQL anti-patterns** — inefficient joins, missing filters, implicit type coercion
* **Data quality risks** — missing `WHERE` clauses on `DELETE`/`UPDATE`, unintended cross joins, `SELECT *` in production models
* **dbt best practices** — model naming conventions, ref usage, materialization choices
* **Schema changes** — column additions, removals, or type changes that may break downstream consumers
## AI Code Reviews + Data Diffs
AI Code Reviews and Data Diffs complement each other:
| | AI Code Reviews | Data Diffs |
| ------------------- | ----------------------------------------------------- | --------------------------------------------------------- |
| **What it checks** | The code itself (SQL, dbt, pipeline logic) | The actual data output (row and column-level differences) |
| **When it runs** | On the first CI run for each PR | After staging data is built |
| **What it catches** | Logic errors, anti-patterns, best practice violations | Unintended data changes, row count shifts, value drift |
Used together, they provide comprehensive validation — the AI reviews catch code issues early, while Data Diffs verify the actual data impact.
## Enabling AI Code Reviews
AI Code Reviews require Datafold's CI integration to be set up with your Git provider and data warehouse. To enable the feature:
1. Ensure your [Git provider](/integrations/code-repositories) and [data warehouse](/integrations/databases) are connected in Datafold.
2. Verify your CI configuration is set up under **Settings > CI/CD**.
3. Contact [Datafold support](mailto:support@datafold.com) to enable AI Code Reviews for your organization.
Once enabled, AI Code Reviews will automatically run on the first CI run for each new pull request — no additional CI pipeline changes are required.
## Using Data Diffs Only
If your team prefers to use only Data Diffs without AI Code Reviews, no action is needed. Your existing CI configuration will continue to run Data Diffs as before. AI Code Reviews are an opt-in feature and do not affect Data Diff behavior.
# Handling Data Drift
Source: https://docs.datafold.com/deployment-testing/best-practices/handling-data-drift
Ensuring Datafold in CI executes apples-to-apples comparison between staging and production environments.
**Note**
This section of the docs is only relevant if the data used as inputs during the PR build are inconsistent with the data used as inputs during the last production build. Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to learn more.
## What is data drift in CI?
Datafold is used in CI to illuminate the impact of a pull request's proposed code change by comparing two versions of the data and identifying differences.
**Data drift in CI** happens when those data differences occur due to *changes in upstream data sources*—not because of proposed code changes.
Data drift in CI adds "noise" to your CI testing analysis, making it tricky to tell if data differences are due to new code, or changes in the source data. Unless both versions rely on the same snapshot of upstream data, data drift can compromise your ability to see the true effect of the code changes.
**Tip**
dbt users should implement Slim CI in [dbt Core](https://www.datafold.com/blog/taking-your-dbt-ci-pipeline-to-the-next-level) or [dbt Cloud](https://www.datafold.com/blog/slim-ci-the-cost-effective-solution-for-successful-deployments-in-dbt-cloud) to prevent most instances of data drift. Slim CI reduces build time and eliminates most instances of data drift because the CI build depends on upstreams in production due to state deferral. However, Slim CI will not *completely* eliminate data drift in CI, specifically in cases where the model being modified in the PR depends on a source. In those cases, we recommend [**building twice in CI**](/deployment-testing/best-practices/handling-data-drift#build-twice-in-ci).
## Why prevent data drift in CI?
By eliminating data drift entirely, you can be confident that any differences detected in CI are driven only by your code, not unexpected data changes.
You can think of this as similar to a scientific experiment, where the control versus treatment groups ideally exist in identical baseline conditions, with the treatment as the only variable which would cause differential outcomes.
In practice, many organizations do not completely eliminate data drift, and still derive value from automatic data diffing and analysis conducted by Datafold in CI, in spite of minor noise that does exist.
## Handling data drift
We recommend two options for removing data drift to the greatest extent possible:
* [Build twice in CI](#build-twice-in-ci)
* [Build CI data from clone of prod sources](#build-ci-data-from-clone-of-prod-sources)
In both of these approaches, Datafold compares transformations of identical upstream data, so that any detected differences will be due to the code changes alone, ensuring an accurate comparison with no false positives.
By building two versions of the data in CI, you can ensure an "apples-to-apples" comparison that depends on the same version of upstream data.
When deciding between the two, choose the one that best matches your workflow:
| Workflow | Approach | Why |
| ----------------------------------------------------- | ----------------------------- | --------------------------------------------------------------------------------------------- |
| Data changes frequently in production | Build twice in CI | Isolates PR impact without waiting on recent production updates, using a consistent snapshot. |
| Production has complex orchestration or multiple jobs | Build CI data from prod clone | Allows a stable comparison by freezing upstream data from a fixed production state. |
| Performance and speed are critical | Build CI data from prod clone | Limits CI build to a single snapshot, reducing the processing load on the pipeline. |
| Simplified orchestration with minimal dependencies | Build twice in CI | Reduces the need to manage production snapshots by running both environments in CI. |
### Build twice in CI
This method involves two CI builds: one representing PR data, and another representing production data, both based on an identical snapshot of upstream data.
1. Create a fixed snapshot of the upstream data that both builds will use.
2. The CI pipeline executes two builds: one using the PR branch of code, and another using the base branch of code.
3. Datafold compares these two data environments, both created in CI, and detects differences.
If performance is a concern, you can use a reduced or filtered upstream data set to speed up the CI process while still providing rich insight into the data.
This method assumes the production build doesn’t involve multiple jobs that process different sets of models at different times.
### Build CI data from clone of prod sources
This method involves comparing a CI build based on a snapshot of the upstream source data *from the time of the last production build* to the production version of transformed data.
1. Update orchestration to create and store a snapshot of the upstream source data at the time of the production transformation job.
2. The CI pipeline executes a data transformation build using the PR branch of code, with the snapshotted upstream data as the upstream source.
3. Datafold compares the CI data environment with production data and detects differences.
# Slim Diff
Source: https://docs.datafold.com/deployment-testing/best-practices/slim-diff
Choose which downstream tables to diff to optimize time, cost, and performance.
By default, Datafold diffs all modified models and their downstream models. However, diffing every downstream table on every code update doesn't make sense for every organization; you have to weigh tradeoffs of time, cost, and risk.
That's why we created Slim Diff.
With Slim Diff enabled, Datafold will only diff models with dbt code changes in your Pull Request (PR).
## Setting up Slim Diff
In Datafold, enable Slim Diff by navigating to Settings → Integrations → CI → Select your CI tool → Advanced Settings and checking the Slim Diff box.
## Diffing only modified models
With this setting turned on, only the modified models will be diffed by default.
## Diff individual downstream models
Once Datafold has diffed only the modified models, you still have the option of diffing individual downstream models right within your PR.
## Diff all downstream models
You can also add the `datafold:diff-all-downstream` label within your PR, which will automatically diff *all* downstream models.
## Explicitly define which models to always diff
Finally, with Slim Diff turned on, there might be certain models or subdirectories that you want to *always* diff when they are downstream of a change. You can think of this as an exception to the Slim Diff behavior.
Apply the `slim_diff: diff_when_downstream` meta tag to individual models or entire folders in your `dbt_project.yml` file:
```Bash theme={null}
models:
:
:
+materialized: view
:
+meta:
datafold:
datadiff:
slim_diff: diff_when_downstream
:
+meta:
datafold:
datadiff:
slim_diff: diff_when_downstream
```
These meta tags can also be added in individual yaml files or in config blocks. More details about using meta tags are available in [the dbt docs](https://docs.getdbt.com/reference/resource-configs/meta).
With this configuration in place, Slim Diff will prevent downstream models from being diffed *unless* they have been designated as exceptions with the `slim_diff: diff_when_downstream` dbt meta tag.
As usual, once the PR has been opened, you'll still have the option of diffing individual downstream models that weren't diffed, or diffing all downstream models using the `datafold:diff-all-downstream` label.
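The selection rules described in this section can be summarized in a short sketch. This is only an illustration of the documented behavior, not Datafold's implementation; the function name, its parameters, and the model names in the usage note are hypothetical.

```python
def models_to_diff(modified, downstream, always_diff,
                   slim_diff=True, diff_all_label=False):
    """Return the set of models that would be diffed automatically.

    modified       -- models with dbt code changes in the PR
    downstream     -- models downstream of the modified ones
    always_diff    -- models tagged `slim_diff: diff_when_downstream`
    slim_diff      -- whether the Slim Diff setting is enabled
    diff_all_label -- whether the PR carries `datafold:diff-all-downstream`
    """
    modified = set(modified)
    downstream = set(downstream)
    always_diff = set(always_diff)

    # Without Slim Diff, or with the label applied, everything is diffed.
    if not slim_diff or diff_all_label:
        return modified | downstream

    # With Slim Diff: modified models, plus tagged downstream exceptions.
    return modified | (downstream & always_diff)
```

For example, `models_to_diff({"stg_orders"}, {"orders", "revenue"}, {"revenue"})` yields `stg_orders` and `revenue`; `orders` is skipped unless it is diffed individually or the label is added.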
# Configuration
Source: https://docs.datafold.com/deployment-testing/configuration
Explore configuration options for CI/CD testing in Datafold.
* Learn how Datafold infers primary keys for accurate Data Diffs.
* Map renamed columns in PRs to their production counterparts.
* Configure when Datafold runs in CI, including on-demand triggers.
* Set model-specific filters and configurations for CI runs.
# Column Remapping
Source: https://docs.datafold.com/deployment-testing/configuration/column-remapping
Specify column renaming in your git commit message so Datafold can map renamed columns to their original counterparts in production for accurate comparison.
When your PR includes updates to column names, it's important to specify these updates in your git commit message using the following syntax. This allows Datafold to understand how renamed columns should be compared to the column in the production data with the original name.
## Example
When you specify column remapping in the commit message, Datafold recognizes that the column has been renamed, instead of interpreting the change as removing one column and adding another.
## Syntax for column remapping
You can add any of the following syntax styles as a single line in a commit message to instruct Datafold in CI to remap a column from `oldcol` to `newcol`.
```Bash theme={null}
# All models/tables in the PR:
datafold remap oldcol newcol
X-Datafold: rename oldcol newcol
/datafold renamed oldcol newcol
datafold: remapped oldcol newcol
# Filtered models/tables by shell-like glob:
datafold remap oldcol newcol model_NAME
X-Datafold: rename oldcol newcol TABLE
/datafold renamed oldcol newcol VIEW_*
```
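To make the shape of these directives concrete, here is a minimal sketch of how such lines could be recognized and how a shell-like glob could limit a remap to certain tables. It is an assumption-laden illustration, not Datafold's parser: the regular expression covers only the spellings listed above, and `Remap`, `parse_remaps`, and `applies_to` are hypothetical names.

```python
import re
from fnmatch import fnmatchcase
from typing import NamedTuple, Optional


class Remap(NamedTuple):
    old: str
    new: str
    pattern: Optional[str]  # None means "all models/tables in the PR"


# Hypothetical pattern covering only the example spellings shown above.
_DIRECTIVE = re.compile(
    r"^\s*(?:/datafold|x-datafold:|datafold:?)\s+"
    r"(?:remap(?:ped)?|renamed?)\s+"
    r"(\S+)\s+(\S+)(?:\s+(\S+))?\s*$",
    re.IGNORECASE,
)


def parse_remaps(commit_message: str) -> list[Remap]:
    """Collect remap directives, one per matching line of the message."""
    remaps = []
    for line in commit_message.splitlines():
        match = _DIRECTIVE.match(line)
        if match:
            remaps.append(Remap(match.group(1), match.group(2), match.group(3)))
    return remaps


def applies_to(remap: Remap, table: str) -> bool:
    """A remap without a glob applies everywhere; otherwise match the glob."""
    return remap.pattern is None or fnmatchcase(table, remap.pattern)
```

The glob match here is case-sensitive (`fnmatchcase`); whether Datafold matches names case-sensitively is not stated in this guide.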
## Chaining together column name updates
Commit messages can be chained together to reflect sequential changes. This means that a commit message does not lock you in to renaming a column.
For example, if successive commits in your PR rename a column step by step, ending at `first_name`, Datafold will understand that the production column `name` has been renamed to `first_name` in the PR branch.
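The chaining behavior can be sketched as collapsing an ordered list of renames into a single mapping from production column to final column. This is an illustrative sketch only; `resolve_renames` and the intermediate column `fname` in the usage note are hypothetical and not part of Datafold.

```python
def resolve_renames(renames_in_order):
    """Collapse sequential (old, new) renames into {production_name: final_name}."""
    final_to_origin = {}
    for old, new in renames_in_order:
        # If `old` was itself the result of an earlier rename, keep tracing
        # back to the original production column.
        origin = final_to_origin.pop(old, old)
        final_to_origin[new] = origin
    # Drop columns that ended up back at their original name.
    return {
        origin: final
        for final, origin in final_to_origin.items()
        if origin != final
    }
```

Calling `resolve_renames([("name", "fname"), ("fname", "first_name")])` gives `{"name": "first_name"}`, and a later commit that renames the column back to `name` leaves no remapping at all, which is why an earlier commit message does not lock you in.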
## Handling column renaming in git commits and PR comments
### Git commits
Git commits track changes on a change-by-change basis and linearize history assuming merged branches introduce new changes on top of the base/current branch (1st parent).
### PR comments
PR comments apply changes to the entire changeset.
### When to use git commits or PR comments?
When handling chained renames:
* **Git commits:** Sequential renames (`col1 > col2 > col3`) result in the final rename (`col1 > col3`).
* **PR comments:** It's best to specify the final result directly (`col1 > col3`). Sequential renames (`col1 > col2 > col3`) can also work, but specifying the final state simplifies understanding during review.
| Aspect | Git Commits | PR Comments |
| ------------------------- | ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Tracking Changes** | Tracks changes on a change-by-change basis. | Applies changes to the entire changeset. |
| **History Linearization** | Linearizes history assuming merged branches introduce new changes on top of the base/current branch (1st parent). | N/A |
| **Chained Renames** | Sequential renames (col1 > col2 > col3) result in the final rename (col1 > col3). | It's best to specify the final result directly (col1 > col3). Sequential renames (col1 > col2 > col3) can also work, but specifying the final state simplifies understanding during review. |
| **Precedence** | Renames specified in git commits are applied in sequence unless overridden by subsequent commits. | PR comments take precedence over renames specified in git commits if applied during the review process. |
These guidelines ensure consistency and clarity when managing column renaming in collaborative development environments, leveraging Datafold's capabilities effectively.
# Running Data Diff for Specific PRs/MRs
Source: https://docs.datafold.com/deployment-testing/configuration/datafold-ci/on-demand
By default, Datafold CI runs on every new pull/merge request and commits to existing ones.
To run Datafold CI **only** when the user explicitly requests it, set the **Run only when tagged** option in the Datafold app [CI settings](https://app.datafold.com/settings/integrations/ci). With this option enabled, Datafold CI runs only if a `datafold` tag/label is assigned to the pull/merge request.
## Running data diff on specific file changes
By default, Datafold CI will run on any file change in the repo. To skip Datafold CI runs for certain modified files (e.g., if the dbt code is placed in the same repo with non-dbt code), you can specify files to ignore. The pattern uses `.gitignore` syntax. Excluded files can be re-included by using negation (`!`).
### Example
Let's say the dbt project is a folder in a repo that contains other code (e.g., Airflow). We want to run Datafold CI for changes to dbt models but skip it for other files. For that, we exclude all files in the repo except those in the `/dbt` folder. We also want to filter out `.md` files in the `/dbt` folder:
```Bash theme={null}
*
!dbt/*
dbt/*.md
```
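As a rough illustration of how these three patterns (`*`, `!dbt/*`, `dbt/*.md`) interact, the sketch below applies them in order, with the last matching pattern deciding the outcome, and then runs CI only if at least one changed file is not ignored. It approximates `.gitignore` matching with Python's `fnmatch` (where `*` also matches `/`), so it is a simplification rather than a faithful `.gitignore` or Datafold implementation; the function names and file paths are hypothetical.

```python
from fnmatch import fnmatchcase

IGNORE_PATTERNS = ["*", "!dbt/*", "dbt/*.md"]


def is_ignored(path, patterns=IGNORE_PATTERNS):
    """Apply patterns in order; the last pattern that matches wins."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(path, glob):
            # A plain pattern ignores the file; a negated one re-includes it.
            ignored = not negated
    return ignored


def should_run_ci(changed_files, patterns=IGNORE_PATTERNS):
    """Run CI if at least one changed file is not ignored."""
    return any(not is_ignored(path, patterns) for path in changed_files)
```

Under these assumptions, a PR touching only `airflow/dags/etl.py` and `dbt/README.md` would be skipped, while a PR that also touches `dbt/models/users.sql` would trigger a run.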
**SKIPPING SPECIFIC DBT MODELS**
To skip diffing individual dbt models in CI, use the [never\_diff](/deployment-testing/configuration/model-specific-ci/excluding-models) option in the Datafold dbt yaml config.
# Running Data Diff on Specific Branches
Source: https://docs.datafold.com/deployment-testing/configuration/datafold-ci/specifc
By default, Datafold CI runs on every new pull/merge request and commits to existing ones.
You can set the **Custom base branch** option in the Datafold app [CI settings](https://app.datafold.com/settings/integrations/ci) to run Datafold CI only on pull requests that have a specific base branch. This is useful if you have multiple environments built from different branches, for example, `staging` and `production` environments built from the `staging` and `main` branches respectively. With this option, you can have two different CI configurations in Datafold, one for each environment, and run CI only for the corresponding branch.
# Diff Timeline
Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/diff-timeline
Specify a `time_column` to visualize match rates between tables for each column over time.
```Bash theme={null}
models:
  - name: users
    meta:
      datafold:
        datadiff:
          time_column: created_at
```
# Excluding Models
Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/excluding-models
Use `never_diff` to exclude a model or subdirectory of models from data diffs.
```Bash theme={null}
models:
  - name: users
    meta:
      datafold:
        datadiff:
          never_diff: true
```
# Including/Excluding Columns
Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/including-excluding-columns
Specify columns to include or exclude from the data diff using `include_columns` and `exclude_columns`.
```Bash theme={null}
models:
  - name: users
    meta:
      datafold:
        datadiff:
          include_columns:
            - user_id
            - created_at
            - name
          exclude_columns:
            - full_name
```
# SQL Filters
Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/sql-filters
Use dbt YAML configuration to set model-specific filters for Datafold CI.
SQL filters can be helpful in two scenarios:
1. When **Production** and **Staging** environments are not built using the same data. For example, if **Staging** is built using a subset of production data, filters can be applied to ensure that both environments are on par and can be diffed.
2. To improve Datafold CI performance by reducing the volume of data compared, e.g., only comparing the last 3 months of data.
SQL filters are an effective technique to speed up diffs by narrowing the data diffed. A SQL filter adds a `WHERE` clause to allow you to filter data on both sides using standard SQL filter expressions. They can be added to dbt YAML under the `meta.datafold.datadiff.filter` tag:
```
models:
  - name: users
    meta:
      datafold:
        datadiff:
          filter: "user_id > 2350 AND source_timestamp >= current_date() - 7"
```
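Conceptually, the same filter expression is appended as a `WHERE` clause to the query for each side of the diff. The sketch below shows that idea only; the table names `prod.users` and `pr_123.users` and the helper `filtered_query` are hypothetical, and the actual SQL Datafold generates is not shown in this guide.

```python
def filtered_query(table, sql_filter):
    """Build a SELECT that restricts one side of the diff with the filter."""
    return f"SELECT * FROM {table} WHERE {sql_filter}"


sql_filter = "user_id > 2350 AND source_timestamp >= current_date() - 7"

# The same filter is applied to both sides, so only comparable rows are diffed.
prod_query = filtered_query("prod.users", sql_filter)
pr_query = filtered_query("pr_123.users", sql_filter)
```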
# Time Travel
Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/time-travel
Use `prod_time_travel` and `pr_time_travel` to diff tables from specific points in time.
If your database supports time travel, you can diff tables from a particular point in time by specifying `prod_time_travel` for a production model and `pr_time_travel` for a PR model.
```Bash theme={null}
models:
  - name: users
    meta:
      datafold:
        datadiff:
          prod_time_travel:
            - 2022-02-07T00:00:00
          pr_time_travel:
            - 2022-02-07T00:00:00
```
# Primary Key Inference
Source: https://docs.datafold.com/deployment-testing/configuration/primary-key
Datafold requires a primary key to perform data diffs. Using dbt metadata, Datafold identifies the column to use as the primary key for accurate data diffs.
Datafold supports composite primary keys, meaning that you can assign multiple columns that make up the primary key together.
## Metadata
The first option is setting the `primary-key` key in the dbt metadata. There are [several ways to configure this](https://docs.getdbt.com/reference/resource-configs/meta) in your dbt project using either the `meta` key in a yaml file or a model-specific config block.
```Bash theme={null}
models:
  - name: users
    columns:
      - name: user_id
        meta:
          primary-key: true
      ## for compound primary keys, set all parts of the key as a primary-key ##
      # - name: company_id
      #   meta:
      #     primary-key: true
```
## Tags
If the primary key is not found in the metadata, Datafold will look for it in the [tags](https://docs.getdbt.com/reference/resource-properties/tags).
```Bash theme={null}
models:
  - name: users
    columns:
      - name: user_id
        tags:
          - primary-key
      ## for compound primary keys, tag all parts of the key ##
      # - name: company_id
      #   tags:
      #     - primary-key
```
## Inferred
If the primary key isn't provided explicitly, Datafold will try to infer it from dbt's uniqueness tests. If you have a single-column uniqueness test defined, Datafold will use that column as the PK.
```Bash theme={null}
models:
  - name: users
    columns:
      - name: user_id
        tests:
          - unique
```
Model-level uniqueness tests can also be used to infer the PK.
```Bash theme={null}
models:
  - name: sales
    columns:
      - name: col1
      - name: col2
      ...
    tests:
      - unique:
          column_name: "col1 || col2"
          # or
          column_name: "CONCAT(col1, col2)"
      # we also support dbt_utils unique_combination_of_columns test
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_no
            - order_line
```
Keep in mind that this is a fallback mechanism. If you change the uniqueness test, this will also change the way Datafold performs the diff.
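The lookup order described on this page (metadata, then tags, then uniqueness tests) can be sketched as follows. This is a simplified illustration operating on a model definition loaded as a Python dictionary, not Datafold's actual logic. The function name is hypothetical, the handling of multiple single-column `unique` tests is an assumption, and model-level `unique` tests with an expression such as `col1 || col2` are deliberately left out.

```python
def resolve_primary_key(model):
    """Return the primary key columns for a dbt model definition (a dict)."""
    columns = model.get("columns", [])

    # 1. Columns flagged with `meta: {primary-key: true}`.
    pk = [c["name"] for c in columns if c.get("meta", {}).get("primary-key")]
    if pk:
        return pk

    # 2. Columns tagged `primary-key`.
    pk = [c["name"] for c in columns if "primary-key" in c.get("tags", [])]
    if pk:
        return pk

    # 3. Fallback: a single column with a `unique` test
    #    (assumption: only used when exactly one column has it).
    unique_cols = [c["name"] for c in columns if "unique" in c.get("tests", [])]
    if len(unique_cols) == 1:
        return unique_cols

    # 4. Fallback: model-level dbt_utils.unique_combination_of_columns test.
    key = "dbt_utils.unique_combination_of_columns"
    for test in model.get("tests", []):
        if isinstance(test, dict) and key in test:
            return list(test[key]["combination_of_columns"])

    return []
```

Because explicit metadata and tags are checked first, changing a uniqueness test only affects the diff when neither of the explicit options is set.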
# Getting Started with CI/CD Testing
Source: https://docs.datafold.com/deployment-testing/getting-started
Learn how to set up CI/CD testing with Datafold by integrating your data connections, code repositories, and CI pipeline for automated testing.
**TEAM CLOUD**
Interested in adding Datafold Team Cloud to your CI pipeline? [Let's talk](https://calendly.com/d/zkz-63b-23q/see-a-demo?email=clay%20analytics%40datafold.com\&first_name=Clay\&last_name=Moeller\&a1=\&month=2024-07)!
## Getting Started with Deployment Testing
To get started, first set up your [data connection](https://docs.datafold.com/integrations/databases) to ensure that Datafold can access and monitor your data sources.
Next, integrate Datafold with your version control system by following the instructions for [code repositories](https://docs.datafold.com/integrations/code-repositories). This allows Datafold to track and test changes in your data pipelines.
Add Datafold to your continuous integration (CI) pipeline to enable automated deployment testing. You can do this through our universal [Fully-Automated](../deployment-testing/getting-started/universal/fully-automated), [No-Code](../deployment-testing/getting-started/universal/no-code), [API](../deployment-testing/getting-started/universal/api), or [dbt](../integrations/orchestrators) integrations.
Optionally, you can [connect data apps](https://docs.datafold.com/integrations/bi_data_apps) to extend your testing and monitoring to data applications like BI tools.
# API
Source: https://docs.datafold.com/deployment-testing/getting-started/universal/api
Learn how to set up and configure Datafold's API for CI/CD testing.
## 1. Create a repository integration
Integrate your code repository using the appropriate [integration](/integrations/code-repositories).
## 2. Create an API integration
In the Datafold app, create an API integration.
## 3. Set up the API integration
Complete the configuration by specifying the following fields:
### Basic settings
| Field Name | Description |
| ------------------ | --------------------------------------------------------- |
| Configuration name | Choose a name for your Datafold integration.              |
| Repository | Select the repository you configured in step 1. |
| Data Source | Select the data source your repository writes to. |
### Advanced settings: Configuration
| Field Name | Description |
| ------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Diff Hightouch Models | Run data diffs for Hightouch models affected by your PR. |
| CI fails on primary key issues | If null or duplicate primary keys exist, CI will fail. |
| Pull Request Label | When this is selected, the Datafold CI process will only run when the 'datafold' label has been applied. |
| CI Diff Threshold              | Data Diffs will only be run automatically for a given CI run if the number of diffs doesn't exceed this threshold. |
| Custom base branch | If defined, the Datafold CI process will only run on pull requests with the specified base branch. |
| Files to ignore | Datafold CI diffs all changed models in the PR if at least one modified file doesn’t match the ignore pattern. Datafold CI doesn’t run in the PR if all modified files should be ignored. ([Additional details.](/deployment-testing/configuration/datafold-ci/on-demand)) |
### Advanced settings: Sampling
| Field Name | Description |
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Enable sampling | Enable sampling for data diffs to optimize analyzing large datasets. |
| Sampling tolerance | The tolerance to apply in sampling for all data diffs. |
| Sampling confidence | The confidence to apply when sampling. |
| Sampling threshold  | Sampling will be disabled automatically if tables are smaller than the specified threshold. If unspecified, default values will be used depending on the Data Source type. |
## 4. Obtain a Datafold API Key and CI config ID
Generate a new Datafold API Key and obtain the CI config ID from the CI API integration settings page.
You will need these values later on when setting up the CI Jobs.
For production CI use, we recommend creating a [service account](/security/service-accounts) API key instead of a personal one. Service-account keys belong to your organization rather than to an individual user, so CI keeps working if the original creator leaves the team.
## 5. Install Datafold SDK into your Python environment
```Bash theme={null}
pip install datafold-sdk
```
## 6. Configure your CI script(s) with the Datafold SDK
Configure your CI script(s) to call the Datafold SDK's `datafold ci submit` command. The example below should be adapted to match your specific use case.
```Bash theme={null}
datafold ci submit --ci-config-id --pr-num --diffs ./diffs.json
```
Since Datafold cannot infer which tables have changed, you'll need to manually provide this information in a specific `json` file format. Datafold can then determine which models to diff in a CI run based on the `diffs.json` you pass in to the Datafold SDK `ci submit` command.
```Bash theme={null}
[
    {
        "prod": "MY.PROD.TABLE", // Production table to compare PR changes against
        "pr": "MY.PR.TABLE", // Changed table containing data modifications in the PR
        "pk": ["MY", "PK", "LIST"], // Primary key; can be an empty array
        // These fields are not required and can be omitted from the JSON file:
        "include_columns": ["COLUMNS", "TO", "INCLUDE"],
        "exclude_columns": ["COLUMNS", "TO", "EXCLUDE"]
    }
]
```
Note: The `JSON` file is optional; you can achieve the same effect by passing the same content on standard input (stdin). For brevity, the rest of this guide uses the `JSON` file approach, but the stdin form looks like this:
```Bash theme={null}
datafold ci submit \
    --ci-config-id \
    --pr-num <<- EOF
[{
    "prod": "MY.PROD.TABLE",
    "pr": "MY.PR.TABLE",
    "pk": ["MY", "PK", "LIST"]
}]
EOF
```
Implementation details will vary depending on [which CI tool](#ci-implementation-tools) you use. Please review the following instructions and examples for your organization's CI tool.
**NOTE**
Populating the `diffs.json` file is specific to your use case and therefore out of scope for this guide. The only requirement is to adhere to the `JSON` schema structure explained above.
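Although generating the file depends on your setup, a minimal sketch of producing content in the schema described above may help as a starting point. The helper names `build_diff_entry` and `render_diffs_json` are hypothetical, and the table names are the placeholders from this guide. Note that standard JSON does not allow comments, so the `//` annotations in the schema example above should not appear in a real file; serializing with Python's `json` module avoids that problem.

```python
import json


def build_diff_entry(prod, pr, pk, include_columns=None, exclude_columns=None):
    """Build one entry of the diffs list in the documented schema."""
    entry = {"prod": prod, "pr": pr, "pk": list(pk)}
    # The column lists are optional, so only emit them when provided.
    if include_columns:
        entry["include_columns"] = list(include_columns)
    if exclude_columns:
        entry["exclude_columns"] = list(exclude_columns)
    return entry


def render_diffs_json(entries):
    """Serialize the entries as the JSON array expected by `--diffs`."""
    return json.dumps(list(entries), indent=2)


diffs_json = render_diffs_json([
    build_diff_entry("MY.PROD.TABLE", "MY.PR.TABLE", ["MY", "PK", "LIST"]),
])
```

Writing `diffs_json` to a file such as `./diffs.json` (for example with `pathlib.Path("diffs.json").write_text(diffs_json)`) gives you a file to pass to `--diffs`; the same string could instead be piped to the command on stdin.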
## CI Implementation Tools
We've created guides and templates for three popular CI tools: GitHub Actions, CircleCI, and GitLab CI.
**HAVING TROUBLE SETTING UP DATAFOLD IN CI?**
We're here to help! Please [reach out and chat with a Datafold Solutions Engineer](https://www.datafold.com/booktime).
To add Datafold to your CI tool, add a `datafold ci submit` step to your PR CI job.
```Bash theme={null}
name: Datafold PR Job

# Run this job when a commit is pushed to any branch except main
on:
  pull_request:
  push:
    # A branch filter with only negated patterns is not valid in GitHub Actions,
    # so exclude main with branches-ignore
    branches-ignore:
      - main

jobs:
  run:
    runs-on: ubuntu-20.04 # your image will vary
    steps:
      - name: Install Datafold SDK
        run: pip install -q datafold-sdk
      # ...
      - name: Upload what to diff to Datafold
        run: datafold ci submit --ci-config-id --pr-num ${PR_NUM} --diffs
        env:
          # env variables used by Datafold SDK internally
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          DATAFOLD_HOST: ${DATAFOLD_HOST}
          # For Dedicated Cloud/private deployments of Datafold,
          # set DATAFOLD_HOST to your base URL (e.g., "https://custom.url.datafold.com"),
          # either as a string or a project variable
          # There are multiple ways to get the PR_NUM, this is just a simple example
          PR_NUM: ${{ github.event.number }}
```
Be sure to pass your [CI config ID](#4-obtain-a-datafold-api-key-and-ci-config-id) value to `--ci-config-id`.
**NOTE**
It is beyond the scope of this guide to provide guidance on generating the diffs file passed to `--diffs`, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above.
Finally, store [your Datafold API Key](#4-obtain-a-datafold-api-key-and-ci-config-id) as a secret named `DATAFOLD_API_KEY` [in your GitHub repository settings](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository).
Once you've completed these steps, Datafold will run data diffs between production and development data on the next GitHub Actions CI run.
```Bash theme={null}
version: 2.1

jobs:
  artifacts-job:
    filters:
      branches:
        only: main # or master, or the name of your default branch
    docker:
      - image: cimg/python:3.9 # your image will vary
        env:
          # env variables used by Datafold SDK internally
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          DATAFOLD_HOST: ${DATAFOLD_HOST}
          # For Dedicated Cloud/private deployments of Datafold,
          # set DATAFOLD_HOST to your base URL (e.g., "https://custom.url.datafold.com"),
          # either as a string or a project variable, per https://circleci.com/docs/set-environment-variable/
          # There are multiple ways to get the PR_NUM, this is just a simple example
          PR_NUM: ${{ github.event.number }}
    steps:
      - checkout
      - run:
          name: "Install Datafold SDK"
          command: pip install -q datafold-sdk
      - run:
          name: "Upload what to diff to Datafold"
          command: datafold ci submit --ci-config-id --pr-num ${CIRCLE_PULL_REQUEST} --diffs
```
Be sure to pass your [CI config ID](#4-obtain-a-datafold-api-key-and-ci-config-id) value to `--ci-config-id`.
**NOTE**
It is beyond the scope of this guide to provide guidance on generating the diffs file passed to `--diffs`, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above.
Then, enable [**Only build pull requests**](https://circleci.com/docs/oss#only-build-pull-requests) in CircleCI. This ensures that CI runs on pull requests and production, but not on pushes to other branches.
Finally, store [your Datafold API Key](#4-obtain-a-datafold-api-key-and-ci-config-id) as a secret named `DATAFOLD_API_KEY` [in your CircleCI project settings](https://circleci.com/docs/set-environment-variable/).
Once you've completed these steps, Datafold will run data diffs between production and development data on the next CircleCI run.
```Bash theme={null}
image:
  name: ghcr.io/dbt-labs/dbt-core:1.x # your name will vary
  entrypoint: [ "" ]

variables:
  # env variables used by Datafold SDK internally
  DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
  DATAFOLD_HOST: ${DATAFOLD_HOST}
  # For Dedicated Cloud/private deployments of Datafold,
  # set DATAFOLD_HOST to your base URL (e.g., "https://custom.url.datafold.com"),
  # either as a string or a project variable
  # There are multiple ways to get the PR_NUM, this is just a simple example
  PR_NUM: ${{ github.event.number }}

run_pipeline:
  stage: test
  before_script:
    - pip install -q datafold-sdk
  script:
    # Upload what to diff to Datafold
    - datafold ci submit --ci-config-id --pr-num $CI_MERGE_REQUEST_ID --diffs
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
```
Be sure to pass your [CI config ID](#4-obtain-a-datafold-api-key-and-ci-config-id) value to `--ci-config-id`.
**NOTE**
It is beyond the scope of this guide to provide guidance on generating the diffs file passed to `--diffs`, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above.
Finally, store [your Datafold API Key](#4-obtain-a-datafold-api-key-and-ci-config-id) as a variable named `DATAFOLD_API_KEY` [in your GitLab project's settings](https://docs.gitlab.com/ee/ci/variables/).
Once you've completed these steps, Datafold will run data diffs between production and development data on the next GitLab CI run.
## Optional CI Configurations and Strategies
### Skip Datafold in CI
To skip the Datafold step in CI, include the string `datafold-skip-ci` in the last commit message.
# No-Code
Source: https://docs.datafold.com/deployment-testing/getting-started/universal/no-code
Set up Datafold's No-Code CI integration to create and manage Data Diffs without writing code.
Monitors are easy to create and manage in the Datafold app. But for teams (or individual users) who prefer a more code-based approach, our monitors as code feature allows managing monitors via version-controlled YAML.
## Getting Started
Get up and running with our No-Code CI integration in just a few steps.
### 1. Create a repository integration
Connect your code repository using the appropriate [integration](/integrations/code-repositories).
### 2. Create a No-Code integration
From the integrations page, create a new No-Code CI integration.
### 3. Set up the No-Code integration
Complete the configuration by specifying the following fields:
#### Basic settings
| Field Name | Description |
| ------------------ | ----------------------------------------------------- |
| Configuration name | Choose a name for your Datafold integration. |
| Repository | Select the repository you configured in step 1. |
| Data Connection | Select the data connection your repository writes to. |
#### Advanced settings
| Field Name | Description |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------- |
| Pull request label | When this is selected, the Datafold CI process will only run when the `datafold` label has been applied to your pull request. |
| Custom base branch | If provided, the Datafold CI process will only run on pull requests against the specified base branch. |
### 4. Create a pull request and add diffs
Datafold will automatically post a comment on your pull request with a link to generate a CI run that corresponds to the latest set of changes.
### 5. Add diffs to your CI run
Once in Datafold, add as many data diffs as you'd like to the CI run. If you need a refresher on how to configure data diffs, check out [our docs](/data-diff/in-database-diffing/creating-a-new-data-diff).
### 6. Add a summary to your pull request
Click on **Save and Add Preview to PR** to post a summary to your pull request.
### 7. View the summary in your pull request
## Cloning diffs from the last CI run
If you make additional changes to your pull request, clicking the **Add data diff** button generates a new CI run in Datafold. From there, you can:
* Create a new Data Diff from scratch
* Clone diffs from the last CI run
You can also diff downstream tables by clicking on the **Add Data Diff** button in the Downstream Impact table. This creates additional Data Diffs.
You can then post another summary to your pull request by clicking **Save and Add Preview to PR**.
# How Datafold in CI Works
Source: https://docs.datafold.com/deployment-testing/how-it-works
Learn how Datafold integrates with your Continuous Integration (CI) process with Data Diffs and AI Code Reviews, catching issues before they make it into production.
## What is CI?
Continuous Integration (or CI) is a process for building and testing changes to your code before deploying to production. This ensures early detection of potential issues and improves the quality of code deployment.
| Without CI | With CI |
| -------------------------------------------------------------------------------- | ------------------------------------------------------------------------ |
| Updates are manually coordinated and become a complex synchronization chore. | Smoothly manage code changes, and scale as your team and code base grow. |
| Testing is done manually, if at all. | Automate high-confidence test coverage. |
| Code changes are released at a slower cadence, and with higher rates of failure. | Boost the quantity and quality of developer output. |
### Datafold in CI
Datafold provides two complementary CI capabilities:
* **[Data Diffs](#comparing-production-and-staging-data)** — Automatically compare production and staging data to catch unintended data changes at the row and column level.
* **[AI Code Reviews](/deployment-testing/ai-code-reviews)** — Get LLM-powered analysis of your SQL and pipeline code to catch logic errors, anti-patterns, and best practice violations.
You can use both together for comprehensive validation, or use Data Diffs on their own.
### Data Diffs in CI
For Data Diffs to work in CI, you need to add a step that builds staging data in your CI process (e.g., GitHub).
**Prerequisite: Building staging data in CI**
If you're using dbt, you'll need to add a dbt build step to your CI pipeline first. This can be done through either [dbt Cloud](https://www.datafold.com/blog/slim-ci-the-cost-effective-solution-for-successful-deployments-in-dbt-cloud) or [dbt Core](https://www.datafold.com/blog/accelerating-dbt-core-ci-cd-with-github-actions-a-step-by-step-guide).
For other orchestrators like Airflow, follow [this guide](https://www.datafold.com/blog/datafold-in-ci-is-for-everyone) to build staging data in CI, or contact us for custom recommendations based on your infrastructure.
Once this is set up, the Datafold bot automatically comments on your PR, highlighting data differences between the production and development versions of your code.
## Creating production and staging data
When Datafold is integrated into your CI, it automatically detects and highlights value-level differences between production data and staging data.
These summarized Data Diff results are written directly in your pull request (PR) as a comment. From the comment, you can access the Datafold App to explore value-level differences, understand the impact on downstream BI tools, and other context-rich information about the impact of your PR code changes.
### Production data
Production data refers to the data that your organization depends on for daily operations, such as powering dashboards and BI tools. Your orchestrator (e.g., dbt, Airflow) is responsible for running SQL code that builds and maintains this data in your warehouse.
If you use dbt, we'll assume that you have a production job in either [dbt Cloud](https://docs.getdbt.com/docs/deploy/dbt-cloud-job) or [dbt Core](https://docs.getdbt.com/docs/deploy/deployment-tools) that builds or updates your dbt models in the warehouse on a schedule. Or, you might have a scheduled job in Airflow or another orchestrator that builds production data on a regular basis.
### Staging data
For Datafold to run Data Diffs in CI, you need a step in your CI process that builds staging data (a version of your data in a dedicated schema) using the code in your PR/MR branch. Datafold compares this staging data against your production data when diffing.
**Tip**
You can use either dbt Cloud or dbt Core to add a step in your CI process that builds staging data.
* [Setting up dbt in CI for dbt Cloud users](https://www.datafold.com/blog/slim-ci-the-cost-effective-solution-for-successful-deployments-in-dbt-cloud)
* [Setting up dbt in CI for dbt Core users](https://www.datafold.com/blog/accelerating-dbt-core-ci-cd-with-github-actions-a-step-by-step-guide)
* [Building staging data in CI using Airflow](https://www.datafold.com/blog/datafold-in-ci-is-for-everyone)
## Comparing production and staging data
Once you have a job in CI that builds staging data, you're ready to get started with Datafold in CI!
By comparing production and staging data, Datafold ensures that any code changes are thoroughly validated before being merged, helping to prevent data issues from reaching production.
We'll walk through the setup steps in more detail in the [Getting Started](/deployment-testing/getting-started) section.
### Datafold in CI for dbt users
While Datafold can be added to CI no matter what orchestrator you use, it's worth detailing exactly how this works with dbt, a popular and opinionated tool for which we have specific recommendations.
Here is how Datafold + dbt in CI works:
1. Two versions of your dbt project's `manifest.json` are submitted to Datafold, representing the state of the production code and the PR/MR code.
   * For dbt Cloud users, this submission happens automatically.
   * dbt Core users need to add steps in their CI configuration (e.g., Circle CI, GitHub Actions, or GitLab) to submit the artifacts.
2. Datafold compares the two versions of the `manifest.json` to identify differences in the code.
3. Datafold queries your data warehouse to run Data Diffs on the modified dbt models. It also identifies downstream assets (e.g., BI tools, reverse ETL pipelines) that might be impacted by the changes.
   * Datafold can diff dbt models that are materialized as both tables and views.
   * If your dbt project has many downstream dependencies, you can use [Slim Diff](/deployment-testing/best-practices/slim-diff) or other [configuration options](/deployment-testing/configuration) to manage scale, ensuring critical models are prioritized.
4. The results of the Data Diffs are summarized in a comment on your pull request (e.g., in GitHub). You can click the comment to view more detailed information in the Datafold application.
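As a rough illustration of the manifest comparison step, the sketch below classifies dbt models as added, removed, or modified by comparing the per-node checksums that dbt records in `manifest.json`. It is not Datafold's implementation; the function name `changed_models` and the tiny in-memory manifests are hypothetical stand-ins for two real artifacts.

```python
def changed_models(prod_manifest: dict, pr_manifest: dict) -> dict:
    """Classify dbt models as added, removed, or modified between two manifests."""

    def model_checksums(manifest: dict) -> dict:
        # Keep only model nodes and map unique_id -> content checksum.
        return {
            unique_id: node.get("checksum", {}).get("checksum")
            for unique_id, node in manifest.get("nodes", {}).items()
            if node.get("resource_type") == "model"
        }

    prod = model_checksums(prod_manifest)
    pr = model_checksums(pr_manifest)
    return {
        "added": sorted(pr.keys() - prod.keys()),
        "removed": sorted(prod.keys() - pr.keys()),
        "modified": sorted(k for k in pr.keys() & prod.keys() if pr[k] != prod[k]),
    }


def _model(checksum: str) -> dict:
    # Hypothetical helper that builds a minimal model node for the example.
    return {"resource_type": "model", "checksum": {"name": "sha256", "checksum": checksum}}


# Illustrative manifests: `orders` was edited and `payments` is new in the PR.
prod_manifest = {"nodes": {
    "model.shop.orders": _model("aaa"),
    "model.shop.users": _model("bbb"),
}}
pr_manifest = {"nodes": {
    "model.shop.orders": _model("ccc"),
    "model.shop.users": _model("bbb"),
    "model.shop.payments": _model("ddd"),
}}

print(changed_models(prod_manifest, pr_manifest))
```

With real artifacts, each manifest would be loaded with `json.load` from the production and PR builds. Only the models reported as added or modified would need a Data Diff.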
# CI/CD Testing
Source: https://docs.datafold.com/faq/ci-cd-testing
Frequently asked questions about Datafold's CI/CD testing integration, including staging environments, diff performance, and automated data quality checks.
You can use [SQL filters](/deployment-testing/configuration/model-specific-ci/sql-filters) to ensure that Datafold compares equivalent subsets of data between your staging/dev and production environments, allowing for accurate data quality checks despite the difference in data volume.
Yes, you can use Datafold in development. It helps catch data quality issues early by comparing data changes in your development environment before they reach production. This proactive approach ensures that errors and inconsistencies are identified and resolved during the development process, enhancing overall data reliability and preventing potential issues in production. Data teams can leverage the Datafold SDK to run data diffs from the command line while developing and testing data models.
Data drift in CI occurs when the two data transformation builds that are compared by Datafold in CI have differing data outputs due to the upstream data changing over time.
We have a few recommended strategies for dealing with data drift [in our docs here](/deployment-testing/best-practices/handling-data-drift).
Some teams want to show Data Diff results in their tickets *before* creating a pull request. This speeds up code reviews as developers can QA code changes before requesting a PR review.
If you use dbt, we explain [how you can automate this workflow here](/faq/datafold-with-dbt#can-i-run-data-diffs-before-opening-a-pr).
# Data Diffing
Source: https://docs.datafold.com/faq/data-diffing
Frequently asked questions about Datafold's data diffing capabilities, including supported databases, data types, performance, and use cases.
A [data diff](/data-diff/what-is-data-diff) is a value-level comparison between two tables—used to identify critical changes to your data and guarantee data quality.
Similar to how git diff highlights changes in code by comparing different versions of files to show what lines have been added, modified, or deleted, a data diff compares rows and columns in two tables to pinpoint specific data changes.
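To make the git diff analogy concrete, here is a minimal sketch of a value-level comparison between two small in-memory tables keyed by a primary key. It only illustrates the idea; the function `diff_tables` and the sample rows are hypothetical, and Datafold itself runs comparisons in the database rather than in Python.

```python
def diff_tables(left, right, key):
    """Compare two lists of row dicts that share a primary key column."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    # Rows whose primary key exists on one side only.
    only_left = sorted(left_by_key.keys() - right_by_key.keys())
    only_right = sorted(right_by_key.keys() - left_by_key.keys())

    # For shared keys, record which columns hold different values.
    changed = {}
    for k in sorted(left_by_key.keys() & right_by_key.keys()):
        l_row, r_row = left_by_key[k], right_by_key[k]
        columns = sorted(c for c in l_row.keys() | r_row.keys() if l_row.get(c) != r_row.get(c))
        if columns:
            changed[k] = columns

    return {"only_left": only_left, "only_right": only_right, "changed": changed}


prod = [
    {"id": 1, "status": "paid", "amount": 10},
    {"id": 2, "status": "open", "amount": 5},
    {"id": 3, "status": "paid", "amount": 8},
]
staging = [
    {"id": 1, "status": "paid", "amount": 10},
    {"id": 2, "status": "closed", "amount": 5},
    {"id": 4, "status": "open", "amount": 7},
]

print(diff_tables(prod, staging, key="id"))
```

In this example, row 3 exists only in production, row 4 exists only in staging, and row 2 differs in the `status` column.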
Datafold can compare data in tables, views, and SQL queries in databases and data lakes.
Datafold facilitates data diffing by supporting a wide range of basic data types across popular database systems like Snowflake, Databricks, BigQuery, Redshift, and PostgreSQL. Datafold can also diff data across legacy warehouses like Oracle, SQL Server, Teradata, IBM Netezza, MySQL, and more.
No, Datafold cannot perform data diffs on unstructured data such as files. However, it supports diffing structured and semi-structured data in tabular formats, including `JSON` columns.
When comparing numerical columns or columns of the `FLOAT` type, it is beneficial to [set tolerance levels for differences](/data-diff/in-database-diffing/creating-a-new-data-diff#tolerance-for-floats) to avoid flagging inconsequential discrepancies. This practice ensures that only meaningful differences are highlighted, maintaining the focus on significant changes.
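The effect of a tolerance can be sketched with Python's standard `math.isclose`. This is an illustration of the concept only, not how Datafold evaluates tolerances; the function name `values_differ` and the chosen thresholds are assumptions.

```python
import math


def values_differ(a: float, b: float, abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    """Return True when two floats differ by more than the allowed tolerance."""
    return not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)


prod_total = 0.1 + 0.2   # floating-point arithmetic leaves a tiny rounding error
staging_total = 0.3

# Exact comparison flags the rounding noise as a difference.
print(values_differ(prod_total, staging_total))

# A small absolute tolerance suppresses the noise...
print(values_differ(prod_total, staging_total, abs_tol=1e-9))

# ...while a genuine change is still reported.
print(values_differ(100.0, 100.5, abs_tol=1e-9))
```

An absolute tolerance suits columns with a known scale (for example, currency amounts), while a relative tolerance is often a better fit when values span several orders of magnitude.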
When a change is detected, Datafold highlights the differences in the App or through PR comments, allowing data engineers and other users to review, validate, and approve these changes during the CI process.
When diffing data within the same physical database or data lake namespace, data diff compares data by executing various SQL queries in the target database. It uses several JOIN-type queries and various aggregate queries to provide detailed insights into differences at the row, value, and column levels, and to calculate differences in metrics and distributions.
Datafold connects to any SQL source and target databases, similar to how BI tools do. Datasets from both data connections are co-located in a centralized database to execute comparisons and identify specific rows, columns, and values with differences. To perform diffs at massive scale and increased speed, users can apply sampling, filtering, and column selection.
Yes, while the Datafold App UI provides advanced exploration of diff results, you can also materialize these results back to your database. This allows you to further investigate with SQL queries or maintain audit logs, providing flexibility in how you handle and review diff outcomes. Teams may additionally choose to download diff results as a CSV directly from the Datafold App to share with their team members.
# Data Monitoring and Observability
Source: https://docs.datafold.com/faq/data-monitoring-observability
Frequently asked questions about Datafold's data monitoring and observability capabilities, including how it compares to other data observability tools.
Most data observability tools focus on monitoring metrics (e.g., null counts, row counts) in the data warehouse. But catching data quality issues in the data warehouse is usually too late: the bad data has already affected downstream processes and negatively impacted the business.
Our platform focuses on prevention rather than detection of data quality issues. By [integrating deeply into your CI process](/deployment-testing/how-it-works), Datafold's [Data Diff](/data-diff/what-is-data-diff) helps data teams fix potential regressions during development and deployment, before bad code and data get into the production environment.
Our [Data Monitors](/data-monitoring/monitor-types) make it easy to monitor production data to catch issues early before they are propagated through the warehouse to business stakeholders.
This proactive data quality strategy not only enhances the reliability and accuracy of your data pipelines but also reduces the risk of disruptions and the need for reactive troubleshooting.
# Data Reconciliation
Source: https://docs.datafold.com/faq/data-reconciliation
Frequently asked questions about cross-database data reconciliation with Datafold, including how diffing works, scaling, and handling schema differences.
Datafold connects to any SQL source and target databases, similar to how BI tools do. Datasets from both data connections are co-located in a centralized database to execute comparisons and identify specific rows, columns, and values with differences. To perform diffs at massive scale and increased speed, users can apply sampling, filtering, and column selection.
Datafold’s cross-database diffing will produce the following results:
1. High-Level Summary:
   * Total number of different rows
   * Total number of rows (primary keys) that are present in one database, but not the other
   * Aggregate schema differences
2. Schema Differences: Per-column mapping of data types, column order, etc.
3. Primary Key Differences: Sample of specific rows that are present in one database, but not the other
4. Value-Level Differences: Sample of differing values for each column with identified discrepancies; full dataset of differences can be downloaded or materialized to the warehouse
You can check out [what the results look like in the App](/data-diff/cross-database-diffing/results).
1. Via Datafold’s interactive UI
2. Via the Datafold API
3. On a schedule (as a monitor) with optional alerting via Slack, email, PagerDuty, etc.
Yes, users can run as many diffs as they would like with concurrency limited by the underlying database.
In such cases, we recommend using watermarking – diffing data within a specified time window of row creation / update (e.g. `updated_at timestamp`).
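A minimal sketch of the watermarking idea, assuming both tables carry an `updated_at` column: restrict both sides to the same fixed time window before comparing, so rows that changed after the cutoff do not show up as false differences. The function `within_watermark`, the window bounds, and the sample rows are hypothetical; in practice the same condition would usually be expressed as a SQL filter on each side of the diff.

```python
from datetime import datetime, timedelta, timezone


def within_watermark(rows, column, start, end):
    """Keep rows whose timestamp falls in the half-open window [start, end)."""
    return [row for row in rows if start <= row[column] < end]


# Compare only rows updated in the 7 days before a fixed cutoff.
end = datetime(2024, 1, 10, tzinfo=timezone.utc)
start = end - timedelta(days=7)

source = [
    {"id": 1, "updated_at": datetime(2024, 1, 5, 12, 0, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 9, 23, 0, tzinfo=timezone.utc)},
    # Written after the cutoff and not yet replicated to the target.
    {"id": 3, "updated_at": datetime(2024, 1, 10, 0, 5, tzinfo=timezone.utc)},
]
target = [
    {"id": 1, "updated_at": datetime(2024, 1, 5, 12, 0, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 9, 23, 0, tzinfo=timezone.utc)},
]

source_ids = [row["id"] for row in within_watermark(source, "updated_at", start, end)]
target_ids = [row["id"] for row in within_watermark(target, "updated_at", start, end)]
print(source_ids == target_ids)
```

Without the window, row 3 would be reported as missing from the target even though it is simply newer than the last replication run.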
Datafold performs best-effort type matching for cases when deterministic type casting is possible, e.g. comparing `VARCHAR` type with `STRING` type. When automatic type casting without information loss is not possible, the user can define type casting manually using diffing in Query mode.
Yes, users can reshape the input dataset by writing a SQL query and diffing in Query mode to bring the dataset to a shape that can be compared with another. Datafold also supports column remapping for datasets with different column names between tables.
To make the provisioning at scale easier, you can create data diffs via the [Datafold API](/api-reference/datafold-api).
# Data Storage and Security
Source: https://docs.datafold.com/faq/data-storage-and-security
Datafold ingests and stores various types of data to ensure accurate data quality checks and insights:
* **Metadata**: This includes table names, column names, and queries executed in the data warehouse.
* **Data for Data Diffs**:
  * For **in-database diffs**, all data visible in the app, including data samples, is fetched and stored.
  * For **cross-database diffs**, all data visible in the app, including data samples, is fetched and stored. Larger amounts of data are fetched for comparison purposes, but only data samples are stored.
* **Table Profiling in Data Explorer**: Datafold stores samples and distributions of data to provide detailed profiling.
# Integrating Datafold with dbt
Source: https://docs.datafold.com/faq/datafold-with-dbt
Frequently asked questions about using Datafold with dbt, including CI/CD setup for dbt Core and dbt Cloud, data diff performance, and testing workflows.
You need Datafold in addition to dbt tests because while dbt tests are effective for validating specific assertions about your data, they can't catch all issues, particularly unknown unknowns. Datafold identifies value-level differences between staging and production datasets, which dbt tests might miss.
Unlike dbt tests, which require manual configuration and maintenance, Datafold automates this process, ensuring continuous and comprehensive data quality validation without additional overhead. This is all embedded within Datafold’s unified platform that offers end-to-end data quality testing with our [Column-level Lineage](/data-explorer/lineage) and [Data Monitors](/data-monitoring/monitor-types).
Hence, we recommend combining dbt tests with Datafold to achieve complete test coverage that addresses both known and unknown data quality issues, providing a robust safeguard against potential data integrity problems in your CI pipeline.
For dbt Core users, create an integration in Datafold, specify the necessary settings, obtain a Datafold API Key and CI config ID, and configure your CI scripts with the Datafold SDK to upload manifest.json files. Our detailed setup guide [can be found here](/integrations/orchestrators/dbt-core).
For dbt Cloud users, set up dbt Cloud CI to run Pull Request jobs and create an Artifacts Job that generates production manifest.json on merges to main/master. Obtain your dbt Cloud access URL and a Service Token, then create a dbt Cloud integration in Datafold using these credentials. Configure the integration with your repository, data connection, primary key tag, and relevant jobs. Our detailed setup guide [can be found here](/integrations/orchestrators/dbt-cloud).
Yes, Datafold is fully compatible with the custom PR schema created by dbt Cloud for Slim CI jobs.
We outline effective strategies for efficient and scalable data diffing in our [performance and scalability guide](/faq/performance-and-scalability#how-can-i-optimize-diff-performance-at-scale).
For dbt-specific diff performance, you can exclude certain columns or tables from data diffs in your CI/CD pipeline by adjusting the **Advanced settings** in your Datafold CI/CD configuration. This helps reduce processing load by focusing diffs on only the most relevant columns.
Some teams want to show Data Diff results in their tickets *before* creating a pull request. This speeds up code reviews as developers can QA code changes before requesting a PR review.
You can trigger a Data Diff by first creating a **draft PR** and then running the following command via the CLI:
```bash theme={null}
dbt run && datafold diff dbt
```
This command runs `dbt` locally and then triggers a Data Diff, allowing you to preview data changes without pushing to Git.
To automate this process of kicking off a Data Diff before pushing code to git, we recommend creating a GitHub Actions job for draft PRs. For example:
```yaml theme={null}
name: Data Diff on draft dbt PR

on:
  pull_request:
    types: [opened, reopened, synchronize]
    # Skip PRs that target main. A lone negative pattern ('!main') under
    # `branches` is not valid, so `branches-ignore` is used instead.
    branches-ignore:
      - main

jobs:
  run:
    if: github.event.pull_request.draft == true # Run only on draft PRs
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set Up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'

      - name: Install requirements
        run: pip install -r requirements.txt

      - name: Install dbt dependencies
        run: dbt deps

      # Update with your S3 bucket details
      - name: Grab production manifest from S3
        run: |
          aws s3 cp s3://advanced-ci-manifest-demo/manifest.json ./manifest.json
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: us-east-1

      - name: Run dbt and Data Diff
        env:
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
        run: |
          dbt run
          datafold diff dbt

      # Optional: Submit artifacts to Datafold for more analysis or logging
      - name: Submit artifacts to Datafold
        run: |
          set -ex
          datafold dbt upload --ci-config-id 350 --run-type pull_request --commit-sha ${GIT_SHA}
        env:
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          GIT_SHA: "${{ github.event.pull_request.head.sha }}"
```
# Overview
Source: https://docs.datafold.com/faq/overview
Get answers to the most common questions regarding our product.
Have a question that isn’t answered here? Feel free to reach out to us at [support@datafold.com](mailto:support@datafold.com), and we’ll be happy to assist you!
# Performance and Scalability
Source: https://docs.datafold.com/faq/performance-and-scalability
Datafold is highly scalable, supporting data teams working with billion-row datasets and thousands of data transformation/dbt models. It offers powerful performance optimization features such as [SQL filtering](/deployment-testing/configuration/model-specific-ci/sql-filters), [sampling](/data-diff/cross-database-diffing/best-practices), and [Slim Diff](/deployment-testing/best-practices/slim-diff), which allow you to focus on testing the datasets that are most critical to your business, ensuring efficient and targeted data quality validation.
Datafold pushes down compute to your database, and the performance of data diffs largely depends on the underlying SQL engine. Here are some in-app strategies to optimize performance:
1. [Enable sampling](/data-diff/cross-database-diffing/best-practices): Sampling reduces the amount of data processed by comparing a randomly chosen subset. This approach balances diff detail with processing time and cost, suitable for most use cases.
2. [Use SQL Filters](/deployment-testing/configuration/model-specific-ci/sql-filters): If you only need to compare a specific subset of data (e.g., for a particular city or a recent time period), adding a SQL filter can streamline the diff process.
3. **Exclude columns/tables**: When certain columns or tables are unnecessary for critical comparisons—such as temporary tables with dynamic values, metadata fields, or timestamp columns that always differ—you can exclude these to increase diff efficiency and speed.
   You can exclude columns when you create a new Data Diff or when you clone an existing one.

   To exclude them in your CI/CD pipeline, [follow this guide](/integrations/orchestrators/dbt-core#advanced-settings-configuration) to specify them in the Advanced settings of your CI/CD configuration in Datafold.
4. **Optimize SQL queries**: Refactor your SQL queries to improve the efficiency of database operations, reducing execution time and resource usage.
5. **Leverage database performance features**: Ensure your database is configured to match typical diff workload patterns. Utilize features like query optimization, caching, and parallel processing to boost performance.
6. **Increase data warehouse resources**: If using a platform like Snowflake, consider increasing the size of your warehouse to allocate more resources to Datafold operations.
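To illustrate why sampling (strategy 1) cuts work while keeping results comparable, the sketch below selects a fixed percentage of rows by hashing the primary key, so both sides of a diff pick exactly the same keys. This hash-based approach is an assumption chosen for the example, not a description of how Datafold samples; `in_sample` is a hypothetical helper.

```python
import hashlib


def in_sample(primary_key, percent: float) -> bool:
    """Deterministically decide whether a primary key belongs to the sample."""
    digest = hashlib.sha256(str(primary_key).encode("utf-8")).digest()
    # Map the key to one of 10,000 buckets, giving 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


keys = range(1, 100_001)
sampled = [k for k in keys if in_sample(k, 10.0)]

# Roughly 10% of keys are kept, and the same keys are chosen on every run.
print(len(sampled) / len(keys))
```

Because the decision depends only on the key, a row that is sampled in the production table is also sampled in the staging table, which keeps row-level comparisons meaningful.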
# Resource Management
Source: https://docs.datafold.com/faq/resource-management
Frequently asked questions about Datafold's resource consumption, data warehouse cost impact, and performance optimization.
Recognizing the importance of efficient data reconciliation, we provide a number of strategies to make the diffing process as efficient as possible:
**Efficient Algorithm**
Datafold connects to any SQL source and target databases, similar to how BI tools do. Datasets from both data connections are co-located in a centralized database to execute comparisons and identify specific rows, columns, and values with differences. To perform diffs at massive scale and increased speed, users can apply sampling, filtering, and column selection.
**Flexible Controls**
Users can easily control the volume of data used in diffing by using:
* [Filters](/deployment-testing/configuration/model-specific-ci/sql-filters): Focus on the most relevant part of the dataset
* [Sampling](/data-diff/cross-database-diffing/best-practices): Set sampling as a percentage of rows or desired confidence level
* [Slim Diff](/deployment-testing/best-practices/slim-diff): Selectively diff only the models that have dbt code changes in your pull request.
**Workload Management**
Users can apply controls to enforce low diffing footprint:
* On the Datafold side: Set desired concurrency
* On the database side: Most databases support workload management settings to ensure that Datafold does not consume more than X% CPU or Y% RAM
Also, consider that using a data quality tool like Datafold to catch issues before production will reduce cost over time as it lowers the need for expensive reprocessing and troubleshooting. Datafold's features like filtering, sampling, and Slim Diff ensure that only relevant datasets are tested, minimizing the computational load on your data warehouse. This targeted approach can lead to more efficient resource usage and potentially lower data warehouse operation costs.
# Slack Bot
Source: https://docs.datafold.com/integrations/agents/slack-bot
Datafold Assistant — a conversational Slack bot that answers questions about your data using Datafold's MCP tools, scoped via a service account.
The **Datafold Assistant** brings Datafold's data context into Slack. Mention `@Datafold` in a thread (or DM it directly) to ask about your data sources, lineage, monitors, diffs, and anything else exposed through Datafold's MCP tools — without leaving the conversation.
Some features (such as lineage) require the **Knowledge Graph** feature to be enabled for your organization. Contact [support@datafold.com](mailto:support@datafold.com) if you're interested.
## What you can ask
The Assistant answers natural-language questions using the same [MCP tool surface](/datafold-mcp) as clients like Claude Code or Cursor. Typical questions in practice:
* "What does the `orders` table look like?" — schema, recent activity, downstream dependencies.
* "Show me the latest data diff for our staging dbt model and summarize the differences."
* "Which monitors fired this week, and which datasets do they cover?"
* "Find the Snowflake table that this Looker dashboard depends on."
* "Summarize the column-level lineage for `customers.email`."
The set of available tools evolves as new MCP capabilities ship. The operating model stays the same: the Assistant uses whatever your bound [service account](/security/service-accounts) has permission to use, nothing more.
You can also drop screenshots and small text files into the thread — the Assistant reads images and small UTF-8 text inline and uses them as context for its answer.
## Permissions model
The bot does **not** have its own catalog of permissions. It acts under the identity of a Datafold **service account** that you bind to the workspace at install time, and inherits exactly that account's access — to data sources, monitors, lineage, and the MCP tools governed by [tool permissions](/security/mcp-tool-permissions).
The intended way to scope the bot is therefore at the Datafold-permissions layer:
1. Create (or choose) a [**custom group**](/security/user-roles-and-permissions#custom-groups) with the MCP tool permissions you want the bot to use.
2. Create a [**service account**](/security/service-accounts) in that group.
3. Bind the service account to the Slack workspace during install.
To restrict the bot, narrow the service account's group. To expand its reach, add more permissions to the group or rebind to a different service account — re-binding rotates the bot's API key automatically, with no exposure window.
The Slack OAuth scopes the bot itself requests (read mentions, post messages, read attached files, etc.) only govern what the bot can do **inside Slack**. They are independent of Datafold permissions.
## Installation
**PREREQUISITES**
* Datafold **Admin** role
* Slack workspace **admin** access (or permission to install apps)
* A Datafold **service account** in a permission group of your choosing. You'll bind it during install — it's cleanest to set up the [custom group](/security/user-roles-and-permissions#custom-groups) and [service account](/security/service-accounts) first.
If **Settings → Integrations → Agents** is not available on your Datafold organization, reach out to the Datafold support team to enable it.
1. In Datafold, go to **Settings → Integrations → Agents → Slack Bot** and click **Install to Slack**. Slack will prompt you to choose a workspace and authorize the app's scopes. After approval, you'll be redirected back to Datafold.
2. Pick the service account whose permissions the bot should inherit and click **Save**. Datafold issues a fresh API key for the bot at this point. The key is stored encrypted and never displayed.

   Re-binding rotates the API key. You can switch the bot to a different service account at any time without exposure.
3. In Slack, invite `@Datafold` to a channel and mention it (or open a DM). Ask a question about your data and the bot will reply in-thread.
## Slack notifications
Installing the Assistant also provisions a **Slack notification destination** for the same workspace, so monitor alerts can be routed through this install without setting up Slack notifications separately. The new destination appears under **Settings → Integrations → Notifications** and is available immediately when configuring a monitor.
If you installed the Datafold Assistant before this capability shipped, re-install it from **Settings → Integrations → Agents → Slack Bot** to pick up the additional Slack scopes the notification path needs. The destination is created automatically on re-auth.
If a dedicated Slack integration is also configured under **Settings → Integrations → Notifications** for the same workspace, that one takes precedence and is used for monitor notifications instead of the Assistant-provisioned destination.
Disconnecting the Assistant from **Agents** also removes its auto-provisioned notification destination. A dedicated Slack notifications integration configured under **Notifications** is not affected.
## Saving context with 💾
Saving context requires the **Knowledge Graph** feature to be enabled for your organization. Reactions on workspaces in organizations without it are ignored.
Any message in a connected workspace can be added to Datafold's knowledge graph by reacting with `:floppy_disk:` (💾). The Assistant:
1. Saves the message text as a `Document` in the knowledge graph.
2. Extracts and links references to data sources, tables, and columns mentioned in the text.
3. Confirms with a ✅ reaction.
Saved Documents surface in future Assistant answers and anywhere else Datafold reads from the knowledge graph — useful for capturing tribal knowledge, ownership notes, deprecation announcements, or business context that isn't already in your warehouse metadata.
## Managing the integration
Each Slack workspace bound to your Datafold organization shows up under **Settings → Integrations → Agents → Slack Bot**. From there you can:
* **Re-bind** the service account if you want to change the bot's permission scope.
* **Disconnect** the workspace. Datafold will revoke the API key the bot was using and call Slack's `auth.revoke` so the bot token stops working on Slack's side as well. The bot stops responding immediately.
A single Datafold organization can install the Assistant into multiple Slack workspaces, each with its own service account binding. This is useful when, for example, the engineering and analytics workspaces should have different permission scopes.
## Limits and trust boundaries
* **Conversation memory.** The Assistant re-reads the most recent thread messages on every turn (capped at \~15 messages / 20,000 characters). Very long threads may not retain all earlier context — start a new thread for unrelated questions.
* **File attachments.** Images (PNG, JPEG, GIF, WebP) and small UTF-8 text files are read inline. Per-file caps: 4 MB for images, 64 KB for text. Per-turn caps: 24 MB total images, 256 KB total text. Other file types are skipped with a note in the reply.
* **Untrusted content.** The Assistant treats attachment content, prior thread messages, and saved Documents as **data**, not instructions. It will not follow embedded instructions to bypass its rules, reveal credentials, or take actions the user didn't ask for.
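The attachment caps listed above can be checked before posting with a small user-side helper like the sketch below. It is not part of Datafold or the Assistant: `check_attachments` is a hypothetical function, and interpreting 1 MB as 1024 × 1024 bytes is an assumption, since the limits are only documented in MB and KB.

```python
KB = 1024
MB = 1024 * KB  # assumption: binary units

PER_FILE_CAP = {"image": 4 * MB, "text": 64 * KB}
PER_TURN_CAP = {"image": 24 * MB, "text": 256 * KB}


def check_attachments(files):
    """Return a list of problems for (name, kind, size_in_bytes) attachments."""
    problems = []
    totals = {"image": 0, "text": 0}
    for name, kind, size in files:
        if kind not in PER_FILE_CAP:
            problems.append(f"{name}: unsupported type, will be skipped")
            continue
        if size > PER_FILE_CAP[kind]:
            problems.append(f"{name}: exceeds the per-file cap for {kind} files")
        totals[kind] += size
    for kind, total in totals.items():
        if total > PER_TURN_CAP[kind]:
            problems.append(f"{kind} attachments exceed the per-turn cap")
    return problems


files = [
    ("dashboard.png", "image", 3 * MB),
    ("lineage.png", "image", 5 * MB),
    ("notes.txt", "text", 10 * KB),
    ("report.pdf", "pdf", 200 * KB),
]
for problem in check_attachments(files):
    print(problem)
```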
# Hightouch
Source: https://docs.datafold.com/integrations/bi-data-apps/hightouch
Navigate to Settings > Integrations > Data Apps and add a Hightouch Integration.
## Create a Hightouch Integration
Complete the configuration by specifying the following fields:
| Field Name | Description |
| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Integration name | An identifier used in Datafold to identify this Data App configuration. |
| Workspace URL | Find your workspace URL by navigating to **Settings** → **Workspace** tab → **Workspace slug**, or by finding the workspace name in the search bar ([https://app.hightouch.io/](https://app.hightouch.io/)). |
| API Key | Log into your [Hightouch account](https://app.hightouch.com/login) and navigate to **Settings** → **API keys** tab → **Add API key** to generate a new, unique API key. Your API key will appear only once, so please copy and save it to your password manager for further use. |
| Data connection mapping | When the correct credentials are entered we will begin to populate data connections in Hightouch (on the left side) that will need to be mapped to data connections configured in Datafold (on the right side). See image below. |
When completed, click **Submit**.
It may take some time to sync all the Hightouch entities to Datafold and for Data Explorer to populate. When completed, your Hightouch models and syncs will appear in Data Explorer as search results.
**TIP**
[Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) explains how to find out when your data app integration is ready.
# Looker
Source: https://docs.datafold.com/integrations/bi-data-apps/looker
Integrate Datafold with Looker to track BI lineage and understand the downstream impact of data changes on your Looker dashboards and Explores.
## Create a code repositories integration
[Create a code repositories integration](/integrations/code-repositories) that connects Datafold to your Looker repository.
## Create a Looker integration
Navigate to Settings > Integrations > Data Apps and add a Looker integration.
Complete the configuration by specifying the following fields:
| Field Name | Description |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Integration name | An identifier used in Datafold to identify this Data App configuration. |
| Project Repository | Select the same repository as used in your Looker project. |
| API Host URL | The Looker [API Host URL](https://cloud.google.com/looker/docs/admin-panel-platform-api#api%5Fhost%5Furl). It has the following format: `https://<instance_name>.cloud.looker.com:<port>`. The default ports are 19999 (legacy) and 443 (new); see the [Looker Docs](https://cloud.google.com/looker/docs/api-getting-started#looker%5Fapi%5Fpath%5Fand%5Fport) for hints. Examples: Legacy ([https://datafold.cloud.looker.com:19999](https://datafold.cloud.looker.com:19999)), New ([https://datafold.cloud.looker.com:443](https://datafold.cloud.looker.com:443)) |
| Client ID | Follow [these steps](https://cloud.google.com/looker/docs/api-auth#authentication%5Fwith%5Fan%5Fsdk) to generate Client ID and Client Secret. These are always user specific. We recommend using a group email for continuity. See [Looker User Minimum Access Policy](/integrations/bi-data-apps/looker#looker-user-minimum-access-policy) for the required permissions. |
| Client Secret | See Client ID. |
| Data connection mapping | When the correct credentials are entered, Datafold will begin to populate data connections in Looker (on the left side) that need to be mapped to data connections configured in Datafold (on the right side). |
When completed, click **Submit**.
It may take some time to sync all the Looker entities to Datafold and for Data Explorer to populate. When completed, your Looker assets will appear in Data Explorer as search results.
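As a minimal sketch of the API Host URL format described in the table above, the helper below assembles the URL from an instance name and the two default ports. The function name `looker_api_host_url` and its `legacy` flag are hypothetical and exist only for illustration; self-hosted Looker instances may use a different hostname pattern.

```python
# Default ports named in the field description above.
LEGACY_PORT = 19999
NEW_PORT = 443


def looker_api_host_url(instance_name: str, legacy: bool = False) -> str:
    """Build an API Host URL for a Looker-hosted instance (hypothetical helper)."""
    if not instance_name or "." in instance_name or "/" in instance_name:
        raise ValueError("instance_name should be a bare subdomain, e.g. 'datafold'")
    port = LEGACY_PORT if legacy else NEW_PORT
    return f"https://{instance_name}.cloud.looker.com:{port}"


if __name__ == "__main__":
    print(looker_api_host_url("datafold", legacy=True))
    print(looker_api_host_url("datafold"))
```

If you are unsure which port applies to your instance, the Looker documentation linked in the table is the authority.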
**TIP**
[Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) explains how to find out when your data app integration is ready.
## Looker user: minimum access policy
The user linked to the API credentials needs the predefined Developer role, or you can create a custom role with these permissions:
* `access_data`
* `download_without_limit`
* `explore`
* `login_special_email`
* `manage_spaces`
* `see_drill_overlay`
* `see_lookml`
* `see_lookml_dashboards`
* `see_looks`
* `see_pdts`
* `see_sql`
* `see_user_dashboards`
* `send_to_integration`
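When building a custom role, it can help to compare the permissions you granted against the list above. The following is a rough sketch under the assumption that you have the role's permissions available as a list of strings (for example, copied from the Looker admin UI); `missing_permissions` is a hypothetical helper, not part of Datafold or Looker.

```python
# The permissions listed above for a custom Looker role.
REQUIRED_PERMISSIONS = frozenset({
    "access_data",
    "download_without_limit",
    "explore",
    "login_special_email",
    "manage_spaces",
    "see_drill_overlay",
    "see_lookml",
    "see_lookml_dashboards",
    "see_looks",
    "see_pdts",
    "see_sql",
    "see_user_dashboards",
    "send_to_integration",
})


def missing_permissions(granted):
    """Return the required permissions absent from `granted`, sorted by name."""
    return sorted(REQUIRED_PERMISSIONS - set(granted))


if __name__ == "__main__":
    # Hypothetical role that only has two of the permissions so far.
    print(missing_permissions(["access_data", "explore"]))
```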
## Database/schema connection context
### Database specification
Using Fully Qualified Names in your Looker view files is not always possible. If a view references a table as `my_schema.my_table`, Datafold might have difficulty determining which database the table is actually in. There are multiple ways to guide Datafold to the correct choice, as summarized in the table below.
**INFO**
Priority #1 takes precedence over #2, and so forth.
| # | Source, if defined | Example |
| - | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------- |
| 1 | datafold\_force\_database **User Attribute** in Looker | looker\_db |
| 2 | **Fully Qualified Names** in your Looker view files | my\_db.my\_schema.my\_table |
| 3 | datafold\_default\_database **User Attribute** in Looker | another\_looker\_db |
| 4 | **Database** specified in Looker, at Database connection settings *(We can only read these if Datafold connects to Looker via an admin user, which is probably suboptimal.)* | my\_db                      |
| 5 | **Database** specified in Datafold, at [Database Connection settings](/integrations/databases/) | my\_db |
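The priority order in the table can be read as "take the first source that is defined". The sketch below illustrates that reading only; it is not Datafold's implementation, and the function and parameter names are hypothetical.

```python
from typing import Optional


def resolve_database(
    force_attr: Optional[str] = None,              # 1: datafold_force_database user attribute
    fqn_database: Optional[str] = None,            # 2: database from a fully qualified name
    default_attr: Optional[str] = None,            # 3: datafold_default_database user attribute
    looker_connection_db: Optional[str] = None,    # 4: database in Looker connection settings
    datafold_connection_db: Optional[str] = None,  # 5: database in Datafold connection settings
) -> Optional[str]:
    """Return the first defined source, in the priority order of the table."""
    candidates = (
        force_attr,
        fqn_database,
        default_attr,
        looker_connection_db,
        datafold_connection_db,
    )
    for value in candidates:
        if value:
            return value
    return None


if __name__ == "__main__":
    # The force attribute wins even when the view uses a fully qualified name.
    print(resolve_database(force_attr="looker_db", fqn_database="my_db"))
    # Without it, the fully qualified name beats the default attribute.
    print(resolve_database(fqn_database="my_db", default_attr="another_looker_db"))
```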
### Supported custom Looker user attributes
| User Attribute | Impact |
| --------------------------- | -------------------------------------------------------------------------------------------------------- |
| datafold\_force\_database | Database to use in all cases, even if a fully qualified path in LookML refers to another database. |
| datafold\_default\_database | Database to use if Looker view does not explicitly specify a database.                                   |
| datafold\_default\_schema | Schema to use if Looker view does not explicitly specify a schema (which equals a dataset for BigQuery). |
| datafold\_default\_host | *(BigQuery only)* Default project name. |
**INFO**
Make sure attributes are:
* Explicitly defined for the user in question (not just falling back to a default);
* Not marked as hidden.
## Integration limitations
Datafold lets you connect to Looker and extend our capabilities to your Looker Views, Explores, Looks, and Dashboards. But this is a new feature, so there are some things we don’t support yet:
* **PDT/Derived Tables**: Datafold only works with the tables that come from your data connections, but not with the [tables](https://cloud.google.com/looker/docs/derived-tables#important%5Fconsiderations%5Ffor%5Fimplementing%5Fpersisted%5Ftables) that Looker makes from your SQL queries.
* **Merge Queries**: Datafold supports the Queries and Looks that make up your Dashboards, but [Merge Queries](https://cloud.google.com/looker/docs/merged-results) are not one of them. For some use cases you could achieve the same by joining the underlying views with an explore.
* **Usage metrics and popularity**: Datafold shows you your Looker objects - such as dashboards, looks, and fields - but not how much you use or like them.
We are improving our Looker integration and adding more features soon. We welcome your feedback and suggestions.
# Mode
Source: https://docs.datafold.com/integrations/bi-data-apps/mode
## Obtain credentials from Mode
**INFO**
To complete this integration, your **Mode** account must be a part of a [Mode Business Workspace](https://mode.com/compare-plans) in order to generate an API Token.
**INFO**
You need to have **Admin** privileges in your Mode Workspace to be able to create an API Token.
In **Mode**, navigate to **Workspace Settings** → **Privacy & Security** → **API**.
Click the icon, and choose **Create new token**.
Take note of:
* Token Name,
* Token Password,
* And the URL of the page that lists the tokens. It should look like this:
[https://app.mode.com/organizations/\{workspace}/api\_keys](https://app.mode.com/organizations/\{workspace}/api_keys)
Take note of the `{workspace}` part; you will need it when configuring Datafold.
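If you prefer to extract the workspace programmatically, the following sketch pulls it out of the token page URL shown above. The helper name `mode_workspace_from_url` and the example workspace `acme` are hypothetical.

```python
from urllib.parse import urlsplit


def mode_workspace_from_url(url: str) -> str:
    """Extract {workspace} from https://app.mode.com/organizations/{workspace}/api_keys."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) < 2 or segments[0] != "organizations":
        raise ValueError(f"not a Mode organization URL: {url!r}")
    return segments[1]


if __name__ == "__main__":
    # 'acme' is a made-up workspace name used only for this example.
    print(mode_workspace_from_url("https://app.mode.com/organizations/acme/api_keys"))
```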
## Configure Datafold
Navigate to **Settings** → **Integrations** → **BI & Data Apps**.
Choose the **Mode** integration to add.
This will bring up **Mode** integration parameters.
Complete the configuration by specifying the following fields:
| Field Name | Description |
| ---------------- | ----------------------------------------------------------------------- |
| Integration name | An identifier used in Datafold to identify this Data App configuration. |
| Token | API token, as generated above. |
| Password | API token password, as generated above. |
| Workspace | Workspace name obtained from your workspace URL. |
**INFO**
**Workspace Name** field is not marked as required on this screen. That's for backwards compatibility: the legacy type of Mode API token, known as **Personal Token**, does not require that parameter. However, such tokens can no longer be created, so we're no longer providing instructions for them.
When completed, click **Save**.
Datafold will try to connect to Mode and, if any issues with the connection arise, you will be alerted.
Datafold will start to sync your reports. It can take some time to fetch all the reports, depending on how many of them there are.
**TIP**
[Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) explains how to find out when your data app integration is ready.
Once the Mode sync has completed, you can browse your Mode reports!
# Power BI
Source: https://docs.datafold.com/integrations/bi-data-apps/power-bi
Include Power BI entities in Data Explorer and column-level lineage.
## Overview
Our Power BI integration can help you visualize column-level lineage dependencies between warehouse tables and Power BI entities using [Data Explorer](/data-explorer/how-it-works). Datafold supports the following Power BI entity types:
* Tables (with Columns)
* Reports (with Fields)
* Dashboards
## Choose your authentication method
Datafold supports two authentication methods for Power BI. Choose the one that best fits your organization's needs. Key difference:
* Delegated auth uses your user's identity and is tied to your account and permissions; it also requires you to be a Power Platform Administrator.
* Service Principal is an independent application identity that doesn't depend on any user, but it can be a bit more complicated to set up.
### Set up the integration (Delegated auth)
Navigate to [**Microsoft 365 admin center** -> **Active users**](https://admin.microsoft.com/#/users) and choose the user that Datafold will authenticate under.
This user should have the **Power Platform Administrator** role assigned to it.
Click **Manage roles**, enable the **Power Platform Administrator** role, and save changes.
Navigate to [Power BI Admin Portal](https://app.powerbi.com/admin-portal/tenantSettings?experience=power-bi) and enable the following two settings:
* Enhance admin APIs responses with detailed metadata
* Enhance admin APIs responses with DAX and mashup expressions
In the Datafold app, navigate to **Settings** -> **BI & Data Apps**, and click **+ Add new integration**. Choose **Power BI** from the list, and then click **Save**.
On clicking **Save**, the system will redirect you to Power BI, where you will need to sign in if you are not already signed in.
Allow the Datafold integration to use Power BI. Depending on the roles configured for your user in the Admin center, you may require a confirmation from a **Global Administrator**. Follow the steps in the wizard.
You will be redirected back to Datafold and see a message that Power BI is successfully connected.
### Set up the integration (Service Principal)
**Step 1: Register an application**
1. Go to [Microsoft Entra admin center - New Registration](https://entra.microsoft.com/?l=en.en-us#view/Microsoft_AAD_RegisteredApps/CreateApplicationBlade/quickStartType~/null/isMSAApp~/false)
2. Configure the application:
* **Name**: `Datafold Power BI Integration` (or similar)
* **Supported account types**: "Accounts in this organizational directory only"
* **Redirect URI**: Leave blank (not needed for Service Principal)
3. Click **Register**
4. Note the **Application (client) ID** and **Directory (tenant) ID** from the Overview page
**Step 2: Create a client secret**
1. In the App Registration, go to **Certificates & secrets**
2. Click **New client secret**
3. Add a description (e.g., "Datafold integration") and choose an expiration period
4. Click **Add**
5. **Important**: Copy the secret **Value** immediately—it won't be shown again
**Step 3: Create a security group**
1. Go to [Microsoft Entra admin center - Groups](https://entra.microsoft.com/?l=en.en-us#view/Microsoft_AAD_IAM/AddGroupBlade)
2. Click **New group**
3. Configure:
* **Group type**: Security
* **Group name**: `Power BI Service Principals` (or similar)
* **Group description**: "Service principals allowed to access Power BI APIs"
* **Membership type**: Assigned
4. In the **Members** section, click **Add members**
5. Search for and add your App Registration (by name or Client ID)
6. Click **Create**
**Step 4: Enable Power BI tenant settings**
1. Go to [Power BI Admin Portal](https://app.powerbi.com/admin-portal/tenantSettings)
2. Navigate to **Tenant settings**
3. Enable these settings and apply them to your security group (or to the whole organization, as you see fit):
* **Allow service principals to use Power BI APIs**
* **Allow service principals to use read-only admin APIs**
* **Enhance admin APIs responses with detailed metadata**
* **Enhance admin APIs responses with DAX and mashup expressions**
**Step 5: Grant workspace access**
You must explicitly grant access to each workspace you want Datafold to sync:
1. Open the Power BI workspace you want to sync
2. Click **Access** (or the gear icon -> Manage access)
3. Add your App Registration as an **Admin** or **Member**
4. Repeat for each workspace you want Datafold to access
**Step 6: Configure the integration in Datafold**
1. Go to Datafold -> **Settings** -> **BI & Data Apps**
2. Click **+ Add new integration** -> **Power BI**
3. Select **Service Principal** as the authentication type
4. Enter the credentials:
* **Client ID**: The Application (client) ID from Step 1
* **Client Secret**: The secret value from Step 2
* **Tenant ID**: The Directory (tenant) ID from Step 1
5. Click **Save**
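Before pasting the Service Principal credentials into Datafold, a quick local sanity check can catch copy-and-paste mistakes. The sketch below assumes only that the Application (client) ID and Directory (tenant) ID are GUIDs, as Microsoft Entra displays them; the class `ServicePrincipalCredentials` and the sample values are hypothetical, and the check makes no network calls.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ServicePrincipalCredentials:
    """Hypothetical container for the three values entered in Datafold."""

    client_id: str      # Application (client) ID
    client_secret: str  # the secret Value, not the secret ID
    tenant_id: str      # Directory (tenant) ID

    def validate(self) -> None:
        """Raise ValueError if a value is obviously malformed."""
        for name in ("client_id", "tenant_id"):
            try:
                uuid.UUID(getattr(self, name))
            except ValueError as exc:
                raise ValueError(f"{name} does not look like a GUID") from exc
        if not self.client_secret.strip():
            raise ValueError("client_secret is empty")


if __name__ == "__main__":
    # Made-up values for illustration only.
    creds = ServicePrincipalCredentials(
        client_id="11111111-2222-3333-4444-555555555555",
        client_secret="example-secret-value",
        tenant_id="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    )
    creds.validate()
    print("credentials look well-formed")
```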
## Verify the integration
You can check out **Jobs** -> **BI & Data Apps** for the status of the sync job.
See [Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) for more details.
When the sync is complete, you will see Power BI entities in **Data Explorer**.
## Need help?
If you have any questions about our Power BI integration, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).
# Tableau
Source: https://docs.datafold.com/integrations/bi-data-apps/tableau
Visualize downstream Tableau dependencies and understand how warehouse changes impact your BI layer.
## Overview
Our Tableau integration can help you visualize column-level lineage dependencies between warehouse tables and Tableau entities using [Data Explorer](/data-explorer/how-it-works).
**Note:** Lineage is only supported for Tableau assets in **Live** mode. Assets in **Extract** mode will not appear in Datafold lineage or dependency views.
Lineage from upstream data warehouses into Tableau is supported for the following data warehouse types:
* Snowflake
* Redshift
* Databricks
* BigQuery
Potentially impacted Tableau entity names are also automatically identified in the Datafold CI printout.
The following Tableau entity types will appear in Data Explorer, data diff results, and the Datafold CI printout:
* Tableau **Data Connections** and related fields;
* **Workbooks** and related fields;
* **Dashboards**.
To declutter the Datafold lineage, Datafold filters out Tableau Data Connections and Data Connections fields that have no downstream dependencies.
If you're interested in learning more about the Datafold integration, [please reach out to our team](https://www.datafold.com/booktime).
## Set up your Tableau instance
To connect Datafold to Tableau, you will require the following credentials from your Tableau site:
* Server URL,
* Site Name,
* Token Name,
* Token Value.
## If you are using Tableau Server
**Tableau Server** is an installation of Tableau that you are managing on your company's own infrastructure and domain. This is an alternative to using a Tableau Cloud subscription.
* Make sure that the [metadata-services](https://help.tableau.com/current/server/en-us/cli%5Fmaintenance%5Ftsm.htm#cat%5Fenable) are enabled by running the following command:
```
tsm maintenance metadata-services enable
```
* Ensure that your Tableau Server instance is accessible to Datafold. Please get in touch with our team to set this up.
## Obtaining server URL & Site Name
These can be found in the URL of your Tableau home page. For instance, if your home page is:
```
https://eu-west-1a.online.tableau.com/#/site/mysupersite/home
```
Then:
* **Server URL** is `https://eu-west-1a.online.tableau.com` (the hostname with `https` in front)
* **Site Name** is `mysupersite` (the part directly after `#/site/` and until the next `/`)
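The same extraction can be sketched in code. The helper below splits a Tableau home page URL into the two values described above; `parse_tableau_home_url` is a hypothetical name, and the sketch assumes the `#/site/<name>/...` fragment layout shown in the example.

```python
from urllib.parse import urlsplit


def parse_tableau_home_url(url: str):
    """Return (server_url, site_name) from a Tableau home page URL."""
    parts = urlsplit(url)
    server_url = f"{parts.scheme}://{parts.netloc}"
    # The site name lives in the URL fragment: '/site/<site_name>/home'.
    segments = parts.fragment.strip("/").split("/")
    if len(segments) < 2 or segments[0] != "site":
        raise ValueError(f"no '#/site/<name>' fragment in {url!r}")
    return server_url, segments[1]


if __name__ == "__main__":
    print(parse_tableau_home_url(
        "https://eu-west-1a.online.tableau.com/#/site/mysupersite/home"
    ))
```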
## Obtaining Token Name & Token Value
Ensure that **Personal Access Tokens** are enabled on your Tableau site. For that, navigate to **Settings** and there, on the **General** tab, search for `Personal Access Tokens`. That feature needs to be enabled — not necessarily for everyone but for the user for whom we will be creating the token Datafold will use.
Now that Personal Access Tokens are enabled, click on your user’s avatar in the top right, choose **My Account Settings** in the pop-up menu, and then search for **Personal Access Tokens** on your settings page.
Input a desired name, say `datafold`, into the **Token Name** field, and click **Create Token**.
This will open a popup window. Click **Copy Secret** and save the copied value somewhere — you will use this when setting up Datafold. You can read more about personal access tokens on the official Tableau documentation [here](https://help.tableau.com/current/server/en-us/security%5Fpersonal%5Faccess%5Ftokens.htm).
## Create a Tableau Integration
Navigate to **Settings** → **Integrations** → **Data Apps**. Click **Add new integration**.
A click on **Tableau** will lead you to the integration creation screen. Fill in the fields with the data obtained earlier, and click **Save**.
## What's next?
The initial sync might take some time; it depends on the number of objects at your Tableau site. Eventually, Tableau entities (**Data Connections**, **Workbooks**, and **Dashboards**) should appear in your **Lineage** tab.
**TIP**
[Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) explains how to find out when your data app integration is ready.
Clicking on a Tableau entity will lead you to the Lineage screen.
**TIP**
Datafold does not display Tableau **Sheets**. Instead, we group and deduplicate all **Fields** of all **Sheets** within a **Workbook** and display them as **Fields** of the **Workbook**.
For example, `Demo Workbook` might include one **Sheet** with a `Created At` field and another with a `Sub Plan` field, but for our purposes we unite all of those fields beneath the **Workbook**, which makes the Lineage graph much less cluttered and much easier to browse.
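To make the grouping concrete, here is an illustrative sketch of uniting and deduplicating sheet fields at the workbook level. It is not Datafold's implementation; the function name `workbook_fields` and the sheet names are hypothetical.

```python
from typing import Dict, List


def workbook_fields(sheets: Dict[str, List[str]]) -> List[str]:
    """Unite the fields of all sheets, dropping duplicates and keeping first-seen order."""
    merged: Dict[str, None] = {}
    for fields in sheets.values():
        for field in fields:
            merged.setdefault(field, None)
    return list(merged)


if __name__ == "__main__":
    # Hypothetical sheets inside 'Demo Workbook'.
    demo_workbook = {
        "Sheet 1": ["Created At", "Sub Plan"],
        "Sheet 2": ["Sub Plan", "Org Id"],
    }
    print(workbook_fields(demo_workbook))
```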
## FAQ
Lineage is only supported for Tableau assets in Live mode. Assets in Extract mode will not appear in Datafold lineage or dependency views.
Datafold retrieves Tableau metadata using the Tableau API, which may not immediately reflect recent changes due to internal caching. If your updates aren’t showing up in Datafold, give it a few hours — they should appear once Tableau refreshes its metadata.
# Tracking Jobs
Source: https://docs.datafold.com/integrations/bi-data-apps/tracking-jobs
Track the completion and success of your data app integration syncs.
To track the progress of your data app integration, go to the **Jobs** tab in the left sidebar.
Your **Search** and **Lineage** features will be available once you see a job marked as `Done` for your integration on this screen.
**INFO**
After the initial sync, Datafold will automatically re-sync every hour to keep your Data App assets up to date.
# Integrate with Code Repositories
Source: https://docs.datafold.com/integrations/code-repositories
Connect your code repositories with Datafold.
**NOTE**
To integrate with code repositories, first connect a [Data Connection](/integrations/databases).
Next, go to **Settings** → **Repositories** and click **Add New Integration**. Then, choose your code repository provider.
# Azure DevOps
Source: https://docs.datafold.com/integrations/code-repositories/azure-devops
## 1. Issue an Access Token
To get your [repository access token](https://learn.microsoft.com/en-us/azure/devops/organizations/accounts/use-personal-access-tokens-to-authenticate?view=azure-devops\&tabs=Windows#create-a-pat), navigate to your Azure DevOps settings and create a new token.
When configuring your token, enable the following permissions:
* **Code** -> **Read & write**
* **Identity** -> **Read**
We need write access to the repository to post reports with Data Diff results to pull requests, and read access to identities to be able to properly display Azure DevOps users in the Datafold UI.
## 2. Configure integration in Datafold
Navigate back to Datafold and fill in the configuration form.
* **Personal/project Access Token**: the token you created in step 1.
* **Organization**: your Azure DevOps organization name.
* **Project**: your Azure DevOps project name.
* **Repository**: your Azure DevOps repository name.
For example, if your Azure DevOps repository URL is `https://dev.azure.com/datafold/analytics/_git/dbt`:
* Your **Organization** is `datafold`
* Your **Project** is `analytics`
* Your **Repository** is `dbt`
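The mapping from repository URL to form fields can be sketched as follows, assuming the `https://dev.azure.com/<organization>/<project>/_git/<repository>` layout used in the example above. The helper name `parse_azure_devops_repo_url` is hypothetical.

```python
from urllib.parse import urlsplit


def parse_azure_devops_repo_url(url: str) -> dict:
    """Split an Azure DevOps repository URL into the Datafold form fields."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    # Expected path: /<organization>/<project>/_git/<repository>
    if len(segments) != 4 or segments[2] != "_git":
        raise ValueError(f"unexpected Azure DevOps repository URL: {url!r}")
    organization, project, _, repository = segments
    return {
        "organization": organization,
        "project": project,
        "repository": repository,
    }


if __name__ == "__main__":
    print(parse_azure_devops_repo_url("https://dev.azure.com/datafold/analytics/_git/dbt"))
```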
# Bitbucket
Source: https://docs.datafold.com/integrations/code-repositories/bitbucket
## 1. Issue an Access Token
### Bitbucket Cloud
To get the [repository access token](https://support.atlassian.com/bitbucket-cloud/docs/create-a-repository-access-token/), navigate to your Bitbucket repository settings and create a new token.
When configuring your token, enable the following permissions:
* **Pull requests** -> **Write**, so that Datafold can post reports with Data Diff results to pull requests.
* **Webhooks** -> **Read and write**, so that Datafold can configure all webhooks that we need automatically.
### Bitbucket Data Center / Server
To get a [repository access token](https://confluence.atlassian.com/bitbucketserver/http-access-tokens-939515499.html), navigate to your Bitbucket repository settings and create a new token.
When configuring your token, enable **Repository admin** permissions.
We need admin access to the repository to be able to post reports with Data Diff results to pull requests, and also configure all necessary webhooks automatically.
## 2. Configure integration in Datafold
Navigate back to Datafold and fill in the configuration form.
### Bitbucket Cloud
* **Personal/project Access Token**: the token you created in step 1.
* **Repository**: your Bitbucket repository name.
For example, if your Bitbucket repository URL is `https://bitbucket.org/datafold/dbt/`, your **Repository** is `datafold/dbt`.
### Bitbucket Data Center / Server
* **Personal/project Access Token**: the token you created in step 1.
* **Repository**: the full URL of your Bitbucket repository.
For example, `https://bitbucket.myorg.com/projects/datafold/repos/dbt`.
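The difference between the two **Repository** formats can be summarized in a short sketch: Bitbucket Cloud takes the two path segments after `bitbucket.org/`, while Data Center / Server takes the full URL. The helper `bitbucket_repository_field` is hypothetical, and treating any host other than `bitbucket.org` as Data Center / Server is an assumption made for this example.

```python
from urllib.parse import urlsplit


def bitbucket_repository_field(url: str) -> str:
    """Return the value to enter in the Repository field for a given URL."""
    parts = urlsplit(url)
    if parts.netloc == "bitbucket.org":
        # Bitbucket Cloud: the first two path segments, e.g. 'datafold/dbt'.
        segments = [s for s in parts.path.split("/") if s]
        if len(segments) < 2:
            raise ValueError(f"unexpected Bitbucket Cloud URL: {url!r}")
        return "/".join(segments[:2])
    # Assumed Data Center / Server: the full repository URL.
    return url.rstrip("/")


if __name__ == "__main__":
    print(bitbucket_repository_field("https://bitbucket.org/datafold/dbt/"))
    print(bitbucket_repository_field("https://bitbucket.myorg.com/projects/datafold/repos/dbt"))
```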
# GitHub
Source: https://docs.datafold.com/integrations/code-repositories/github
Connect Datafold to GitHub to enable automated data diffs on pull requests, CI/CD testing integration, and code-level lineage tracking.
**PREREQUISITES**
* Datafold Admin role
* Your GitHub account must be a member of the GitHub organization where the Datafold app is to be installed
* Approval of your request to add the Datafold app to your repo must be granted by a GitHub repo admin or GitHub organization owner.
To set up a new integration, click the repository field and select the **Install GitHub app** button.
From here, GitHub will redirect you to log in to your account and choose which organization you would like to connect. After choosing the right organization, you may choose to allow access to all repositories or specific ones.
Once complete, you will be redirected back to Datafold, where you can select the appropriate repository for connection.
**TIP**
If you lack permission to add the Datafold app, request approval from a GitHub admin.
After installation, click **Refresh** to display the newly added repositories in the dropdown list.
To complete the setup, click **Save**!
**INFO**
VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.
## GitHub integration for VPC / single-tenant Datafold deployments
### Create a GitHub application
VPC clients of Datafold need to create their own GitHub app, rather than use the shared Datafold GitHub application.
Start by navigating to **Settings** → **Global Settings**.
To begin the setup process, enter the domain that was registered for the VPC deployment in [AWS](/datafold-deployment/dedicated-cloud/aws) or [GCP](/datafold-deployment/dedicated-cloud/gcp). Then, enter the name of the GitHub organization where you'd like to install the application. When filled, click **Create GitHub App**.
This will redirect the admin to GitHub, where they may need to authenticate. **The GitHub user must be an admin of the GitHub organization.**
After authentication, you should be directed to enter a description for the GitHub App. After entering the description, click **Create GitHub app**.
Once the application is created, you should be returned to the Datafold settings screen. The button should then have disappeared, and the details for the GitHub App should be visible.
### Making the GitHub application public
If you have a private GitHub instance with multiple organizations and want to use the Datafold app across all of them, you'll need to make the app public on your private server.
You can do so in GitHub by following these steps:
1. Navigate to the GitHub organization where the app was created.
2. Click **Settings**.
3. Go to **Developer Settings** → **GitHub Apps**.
4. Select the **Datafold app**.
5. Click **Advanced**, then **Make public**.
The app will be public **only on your private GitHub server**, ensuring it can be accessed across all your organizations.
### Configure GitHub in Datafold
If you see this screen with all the details, you've successfully created a GitHub App! Now that the app is created, you have to install it using the [GitHub integration setup](/integrations/code-repositories/github).
# GitLab
Source: https://docs.datafold.com/integrations/code-repositories/gitlab
To get the [project access token](https://docs.gitlab.com/ee/user/project/settings/project%5Faccess%5Ftokens.html), navigate to your GitLab project settings and create a new token.
**TIP**
Project access tokens are preferred over personal tokens for security.
When configuring your token, select the **Maintainer** role and select the **api** scope.
**Project Name** is the part of your GitLab project URL after `gitlab.com/`. For example, if your GitLab project URL is `https://gitlab.com/datafold/dbt/`, your Project Name is `datafold/dbt/`.
Finally, navigate back to Datafold and enter the **Project Token** and the name of your **Project** before hitting **Save**.
If you want to change the GitLab URL, you can do so after setting up the integration by navigating to **Settings**, then **Org Settings**.
# Set Up Your Data Connection
Source: https://docs.datafold.com/integrations/databases
Set up your Data Connection with Datafold.
**NOTE**
To set up your Data Connection, navigate to **Settings** → **Data Connection** and click **Add New Integration**.
# Athena
Source: https://docs.datafold.com/integrations/databases/athena
**Steps to complete:**
1. [Create an S3 bucket](/integrations/databases/athena#create-s3-bucket)
2. [Run SQL Script for permissions](/integrations/databases/athena#run-sql-script)
3. [Configure your data connection in Datafold](/integrations/databases/athena#configure-in-datafold)
### Create an S3 bucket
If you don't already have an S3 bucket for your cluster, you'll need to create one. Datafold uses this bucket to create temporary tables and store data in it. You can learn how to create an S3 bucket in AWS by referring to the [AWS documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html).
### Run SQL Script and Create Schema for Datafold
To connect to AWS Athena, you must generate an `AWS Access Key ID` and an `AWS Secret Access Key`. These keys provide read-only access to all tables in all schemas and write access to the Datafold-specific schema for temporary tables. If you don't have these keys yet, follow the steps outlined in the [AWS documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id%5Fcredentials%5Faccess-keys.html).
Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse.
```
/* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing within your data warehouse. */
CREATE SCHEMA IF NOT EXISTS awsdatacatalog.datafold_tmp;
```
### Configure in Datafold
| Field Name | Description |
| --------------------------- | ------------------------------------------------------------------------------ |
| AWS Access Key ID | Your AWS Access Key, which can be found in your AWS Account. |
| AWS Secret Access Key | The AWS Secret Key (generate it in your AWS account if you don't have it yet). |
| S3 Staging Directory | The S3 bucket where table data is stored. |
| AWS Region | The region of your Athena cluster. |
| Catalog                     | The catalog, which is typically `awsdatacatalog`.                              |
| Database                    | The database or schema with tables, which is typically `default`.              |
| Schema for Temporary Tables | The schema (datafold\_tmp) created in our SQL script. |
Click **Create** to complete the setup of your data connection in Datafold.
# BigQuery
Source: https://docs.datafold.com/integrations/databases/bigquery
**Steps to complete:**
1. [Create a Service Account](/integrations/databases/bigquery#create-a-service-account)
2. [Give the Service Account BigQuery Data Viewer, BigQuery Job User, BigQuery Resource Viewer access](/integrations/databases/bigquery#service-account-access-and-permissions)
3. [Create a temporary dataset and give BigQuery Data Editor access to the service account](/integrations/databases/bigquery#create-a-temporary-dataset)
4. [Generate a Service Account JSON key](/integrations/databases/bigquery#generate-a-service-account-key)
5. [Configure your data connection in Datafold](/integrations/databases/bigquery#configure-in-datafold)
## Create a Service Account
To connect Datafold to your BigQuery project, you will need to create a *service account* for Datafold to use.
* Navigate to the [Google Developers Console](https://console.developers.google.com/), click on the drop-down to the left of the search bar, and select the project you want to connect to.
* *Note: If you do not see your project, you may need to switch accounts.*
* Click on the hamburger menu in the upper left, then select **IAM & Admin** followed by **Service Accounts**.
* Create a service account named `Datafold`.
## Service Account Access and Permissions
The Datafold service account requires the following roles and permissions:
* **BigQuery Data Viewer** for read access on all the datasets in the project.
* **BigQuery Job User** to run queries.
* **BigQuery Resource Viewer** to fetch the query logs for parsing lineage.
## Create a Temporary Dataset
Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse.
**Caution** - Make sure that the dataset lives in the same region as the rest of the data, otherwise, the dataset will not be found.
Let's navigate to BigQuery in the console and create a new dataset.
* Give the dataset a name like `datafold_tmp` and grant the Datafold service account the **BigQuery Data Editor** role.
## Generate a Service Account Key
Next, go back to the **IAM & Admin** page to generate a key for Datafold.
We recommend using the JSON-formatted key. After you create the key, it is saved to your local machine.
## Configure in Datafold
| Field Name | Description |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Name | A name given to the data connection within Datafold |
| Project ID                  | Your BigQuery project ID. It can be found in the URL of your Google Developers Console: `https://console.developers.google.com/apis/library?project=MY_PROJECT_ID` |
| JSON Key File | The key file generated in the [Generate a Service Account JSON key](/integrations/databases/bigquery#generate-a-service-account-key) step |
| Schema for temporary tables | The schema name that was created in [Create a temporary dataset](/integrations/databases/bigquery#create-a-temporary-dataset). It should be formatted as `<project_id>.datafold_tmp` |
| Processing Location | Which processing zone your project uses |
Click **Create**. Your data connection is ready!
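Before uploading the key file, you may want to confirm that it is a service account key and read the project ID from it. The sketch below checks a few fields that Google service account JSON keys contain and derives a temporary schema name. The helper name is hypothetical, and the project-ID-dot-dataset form of the schema is an assumption based on how BigQuery qualifies datasets.

```python
import json

# Fields expected in a Google service account JSON key.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def summarize_service_account_key(key_json: str, temp_dataset: str = "datafold_tmp") -> dict:
    """Validate a service account key and derive values for the Datafold form."""
    key = json.loads(key_json)
    missing = [field for field in REQUIRED_KEY_FIELDS if field not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError("key file is not a service account key")
    return {
        "project_id": key["project_id"],
        # Assumed format: project ID, a dot, then the temporary dataset name.
        "temp_schema": f"{key['project_id']}.{temp_dataset}",
    }


if __name__ == "__main__":
    # Made-up key contents for illustration; never print real private keys.
    sample = json.dumps({
        "type": "service_account",
        "project_id": "my-project",
        "private_key": "-----BEGIN PRIVATE KEY-----\n...",
        "client_email": "datafold@my-project.iam.gserviceaccount.com",
    })
    print(summarize_service_account_key(sample))
```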
# Databricks
Source: https://docs.datafold.com/integrations/databases/databricks
Connect Datafold to Databricks for data diffing, CI/CD testing, lineage, and migration validation. Includes setup instructions and required permissions.
**NOTE**: Datafold needs catalog-level permissions in your Databricks workspace to read and write table data, query system tables, and deploy migration bundles. You will need workspace admin access to create a service principal and grant the required permissions.
**Steps to complete:**
1. [Create a service principal and configure authentication](/integrations/databases/databricks#create-a-service-principal-and-configure-authentication)
2. [Retrieve SQL warehouse connection details](/integrations/databases/databricks#retrieve-sql-warehouse-connection-details)
3. [Grant permissions](/integrations/databases/databricks#grant-permissions)
4. [Configure your data connection in Datafold](/integrations/databases/databricks#configure-in-datafold)
## Create a service principal and configure authentication
Create a dedicated service principal for the Datafold integration. This is the identity Datafold will use to connect to your workspace.
1. Go to **Settings** → **Identity and access** → **Service principals**
2. Click **Add service principal** and give it a name (e.g., `datafold`)
3. Select the service principal, go to the **Secrets** tab, and click **Generate secret**
4. Save the **Client ID** and **Secret** — the secret is only shown once
OAuth secrets are valid for up to 730 days. You can have a maximum of 5 active secrets per service principal. Rotate secrets before expiry to avoid connection interruptions.
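As a rough planning aid, the sketch below computes a rotate-by date for a secret. The 730-day upper bound comes from the text above; the 30-day safety margin and the helper name `rotation_deadline` are assumptions made for illustration, not part of Databricks or Datafold.

```python
from datetime import date, timedelta

# Upper bound on secret lifetime stated above; your secret may be shorter-lived.
MAX_SECRET_LIFETIME_DAYS = 730


def rotation_deadline(created: date,
                      lifetime_days: int = MAX_SECRET_LIFETIME_DAYS,
                      margin_days: int = 30) -> date:
    """Return the date by which the secret should be rotated (hypothetical helper)."""
    if not 0 < lifetime_days <= MAX_SECRET_LIFETIME_DAYS:
        raise ValueError("lifetime_days must be between 1 and 730")
    if not 0 <= margin_days < lifetime_days:
        raise ValueError("margin_days must be shorter than the lifetime")
    expiry = created + timedelta(days=lifetime_days)
    # Rotate some days before expiry so the connection is never interrupted.
    return expiry - timedelta(days=margin_days)


# Example creation date, chosen only for illustration.
print(rotation_deadline(date(2025, 1, 1)))
```

Adjust `margin_days` to match however much lead time your team needs to update the connection in Datafold.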
Datafold also supports Personal Access Tokens as an alternative authentication method. PATs are considered legacy by Databricks — see the [Databricks authentication documentation](https://docs.databricks.com/en/dev-tools/auth/oauth-m2m.html) for details.
## Retrieve SQL warehouse connection details
Navigate to **SQL Warehouses** under the **SQL** section in the left sidebar.
Choose the preferred warehouse and copy the following fields from its **Connection Details** tab:
* **Server hostname**
* **HTTP path**
You also need to grant the service principal access to the SQL warehouse:
1. On the warehouse page, click the **Permissions** tab
2. Add the service principal and grant **Can Use** permission
## Grant permissions
Run the following SQL statements to grant Datafold the permissions it needs. Replace `<catalog>` and `<principal_id>` with your values. Replace `<schema>` with the schema where you want to store the DMA bundle volume (e.g., `default`).
The `<principal_id>` is the application ID (also called Client ID) of your service principal. In Databricks SQL, service principal identifiers must be enclosed in backticks.
```sql theme={null}
-- Catalog access
GRANT USE CATALOG ON CATALOG <catalog> TO `<principal_id>`;
GRANT USE SCHEMA ON CATALOG <catalog> TO `<principal_id>`;
GRANT SELECT ON CATALOG <catalog> TO `<principal_id>`;
GRANT CREATE TABLE ON CATALOG <catalog> TO `<principal_id>`;
GRANT MODIFY ON CATALOG <catalog> TO `<principal_id>`;
-- Temporary schema for Datafold
CREATE SCHEMA IF NOT EXISTS <catalog>.datafold_tmp;
-- System tables access
GRANT USE CATALOG ON CATALOG system TO `<principal_id>`;
GRANT USE SCHEMA ON CATALOG system TO `<principal_id>`;
GRANT SELECT ON CATALOG system TO `<principal_id>`;
-- UC Volume for DMA bundle deployment
CREATE VOLUME IF NOT EXISTS <catalog>.<schema>.datafold_bundles;
GRANT READ VOLUME ON VOLUME <catalog>.<schema>.datafold_bundles TO `<principal_id>`;
GRANT WRITE VOLUME ON VOLUME <catalog>.<schema>.datafold_bundles TO `<principal_id>`;
```
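If you need the same catalog-level grants for more than one catalog, a small script can render the statements for you. This is a minimal sketch: the helper names `quote_principal` and `catalog_grants`, and the example catalog and application ID, are hypothetical. It only produces SQL text; it does not connect to Databricks.

```python
def quote_principal(application_id: str) -> str:
    """Wrap a service principal application ID in backticks, as the text above requires."""
    if "`" in application_id:
        raise ValueError("application ID must not contain backticks")
    return f"`{application_id}`"


def catalog_grants(catalog: str, application_id: str) -> list[str]:
    """Render the catalog-level grants from the script above for one catalog."""
    principal = quote_principal(application_id)
    privileges = ["USE CATALOG", "USE SCHEMA", "SELECT", "CREATE TABLE", "MODIFY"]
    return [f"GRANT {p} ON CATALOG {catalog} TO {principal};" for p in privileges]


# Both values below are made up for illustration.
for statement in catalog_grants("demo", "00000000-0000-0000-0000-000000000000"):
    print(statement)
```

Review the generated statements before running them in a SQL editor with sufficient privileges.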
## Configure in Datafold
Select **M2M OAuth / Service Principal (Recommended)** as the authentication method and fill in the following fields:
| Field | Description |
| -------------------------------- | ---------------------------------------------------------------------------- |
| Connection name | A name for this data connection within Datafold |
| Host | The Server hostname from the warehouse Connection Details tab |
| HTTP path | The HTTP path from the warehouse Connection Details tab |
| Authentication method | Select **M2M OAuth / Service Principal (Recommended)** |
| Client ID | The Client ID of the service principal |
| Client Secret | The secret generated in the authentication step |
| Catalog | The default catalog name (e.g., `hive_metastore` or your Unity Catalog name) |
| Schema path for temporary tables | The temp schema as `<catalog>.datafold_tmp` (e.g., `demo.datafold_tmp`) |
Click **Create**. Your data connection is ready!
## Validate your setup
Run these queries to verify that permissions are configured correctly:
```sql theme={null}
-- Check all grants for the service principal
SHOW GRANTS TO `<principal_id>`;
```
```sql theme={null}
-- Verify warehouse access (run from the SQL warehouse the service principal will use)
SELECT 1;
```
```sql theme={null}
-- Verify temp schema access
SHOW TABLES IN <catalog>.datafold_tmp;
```
```sql theme={null}
-- Verify system table access (should return a row)
SELECT * FROM system.query.history LIMIT 1;
```
```sql theme={null}
-- Verify volume access
LIST '/Volumes/<catalog>/<schema>/datafold_bundles';
```
# Dremio
Source: https://docs.datafold.com/integrations/databases/dremio
**INFO**
Column-level Lineage is not currently supported for Dremio.
**INFO**
Schemas for tables in external data sources need to be specified with quotes e.g., "Postgres prod.analytics.sales".
**Steps to complete:**
1. [Configure user in Dremio](/integrations/databases/dremio#configure-user-in-dremio)
2. [Create schema for Datafold](/integrations/databases/dremio#create-schema-for-datafold)
3. [Configure your data connection in Datafold](/integrations/databases/dremio#configure-in-datafold)
## Configure user in Dremio
To connect to Dremio, create a user with read-only access to all data sources you wish to diff and generate an access token.
Temporary tables will be created in the `$scratch` schema, which doesn't require special permissions.
## Create schema for Datafold
Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse.
## Configure in Datafold
| Field Name | Description |
| --------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Connection name | A name given to the data connection within Datafold. |
| Host | The hostname for your Dremio instance (data.dremio.cloud for Dremio SaaS). |
| Port | Dremio endpoint port; default value is 443. |
| Encryption | Should be checked for Dremio Cloud, possibly unchecked for local deployments. |
| User ID | User ID as created in Dremio, typically an email address. |
| Project ID | Dremio Project UID. If left blank, the default project will be used. |
| Token | Access token generated in Dremio. |
| Password | Alternatively, provide a password. |
| Schema for temporary views | A Dremio space for temporary views. |
| Schema for temporary tables | `$scratch` should suit most applications, or use `"<datasource>.<schema>"` (with quotes) if you wish to create temporary tables in an external data source. |
Click **Create**. Your data connection is now ready!
# MySQL
Source: https://docs.datafold.com/integrations/databases/mysql
**INFO**
Please contact [support@datafold.com](mailto:support@datafold.com) if you use a MySQL version \< 8.x.
**INFO**
Column-level Lineage is not currently supported for MySQL.
**Steps to complete:**
1. [Run SQL script for permissions and create schema for Datafold](/integrations/databases/mysql#run-sql-script-and-create-schema-for-datafold)
2. [Configure your data connection in Datafold](/integrations/databases/mysql#configure-in-datafold)
### Run SQL script and create schema for Datafold
To connect to MySQL, create a user with read-only access to all tables you wish to diff. Include read and write access to a Datafold-specific dataset:
```sql theme={null}
-- Create a temporary dataset for Datafold to utilize
CREATE DATABASE IF NOT EXISTS datafold_tmp;
-- Create a Datafold user
CREATE USER 'datafold_user'@'%' IDENTIFIED BY 'SOMESECUREPASSWORD';
-- Grant read access to diff tables in YourSchema
GRANT SELECT ON `YourSchema`.* TO 'datafold_user'@'%';
-- Grant access to all tables in a datafold_tmp database
GRANT ALL ON `datafold_tmp`.* TO 'datafold_user'@'%';
-- Apply the changes
FLUSH PRIVILEGES;
```
Datafold utilizes a temporary dataset, named `datafold_tmp` in the above script, to materialize scratch work and keep data processing in your warehouse.
### Configure in Datafold
| Field Name | Description |
| ---------------------------- | ------------------------------------------------------------------------------- |
| Connection name | A name given to the data connection within Datafold |
| Host | The hostname for your MySQL instance |
| Port | MySQL connection port; default value is 3306 |
| Username | The user created in our SQL script, named datafold\_user |
| Password | The password created in our SQL script |
| Database | The name of the MySQL database (schema) you want to connect to, e.g. YourSchema |
| Dataset for temporary tables | The datafold\_tmp database created in our SQL script |
Click **Create**. Your data connection is ready!
# Netezza
Source: https://docs.datafold.com/integrations/databases/netezza
**INFO**
Column-level Lineage is not currently supported for Netezza.
**Steps to complete:**
1. [Configure user in Netezza](#configure-user-in-netezza)
2. [Create a temporary database for Datafold](#create-a-temporary-database-for-datafold)
3. [Configure your data connection in Datafold](#configure-in-datafold)
## Configure user in Netezza
To connect to Netezza, create a user with read-only access to all databases you may wish to diff.
## Create a temporary database for Datafold
Datafold requires a schema with full permissions to store temporary data.
## Configure in Datafold
| Field Name | Description |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| Connection Name | A name given to the data connection within Datafold. |
| Host | The hostname for your Netezza instance (e.g., nz-85dcf66c-69aa-4ba6-b7cb-827643da5a.us-east-1.data-warehouse.cloud.ibm.com for Netezza SaaS). |
| Port | Netezza endpoint port; the default value is 5480. |
| Encryption | Whether to use TLS. |
| User ID | User ID, e.g., DATAFOLD. |
| Password | Password from above. |
| Default DB | The database to connect to. |
| Schema for Temporary Tables | Use DATABASE.SCHEMA format. |
Click **Create**. Your data source is now ready!
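To illustrate the `DATABASE.SCHEMA` format expected by the **Schema for Temporary Tables** field, here is a minimal sketch that checks a value before you paste it into the form. The helper name `split_temp_schema` and the example value are hypothetical; Datafold performs its own validation.

```python
def split_temp_schema(value: str) -> tuple[str, str]:
    """Split a DATABASE.SCHEMA string into its two parts (hypothetical helper)."""
    parts = value.split(".")
    # Exactly two non-empty parts are expected: the database and the schema.
    if len(parts) != 2 or not all(p.strip() for p in parts):
        raise ValueError(f"expected DATABASE.SCHEMA, got {value!r}")
    database, schema = (p.strip() for p in parts)
    return database, schema


# Example value, made up for illustration.
print(split_temp_schema("ANALYTICS.DATAFOLD_TMP"))
```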
# Oracle
Source: https://docs.datafold.com/integrations/databases/oracle
Connect Datafold to Oracle Database for data diffing, reconciliation, and migration validation. Includes setup instructions and required permissions.
Oracle 19c and later are fully supported. If you use an Oracle version older than 19c, please contact [support@datafold.com](mailto:support@datafold.com) before proceeding.
Column-level Lineage is not currently supported for Oracle.
**Steps to complete:**
1. [Create a Datafold user in Oracle](#create-a-datafold-user)
2. [Grant read access to your data](#grant-read-access-to-your-data)
3. [Grant required system privileges](#grant-required-system-privileges)
4. [Configure the connection in Datafold](#configure-in-datafold)
> A [full script](#full-script) is available further down this page.
## Create a Datafold user
Datafold connects to Oracle using a dedicated database user. This user needs its own schema where Datafold can create temporary working tables and views during data diffs.
If your Oracle instance uses a multitenant architecture (CDB/PDB), first switch to the pluggable database where your data lives:
```sql theme={null}
-- Only needed for multitenant (CDB/PDB) setups. Skip if you use a single-tenant database.
-- Replace YOURPDB with the name of your pluggable database (e.g., XEPDB1).
ALTER SESSION SET CONTAINER = YOURPDB;
```
Then create the Datafold user:
```sql theme={null}
CREATE USER DATAFOLD IDENTIFIED BY <password>;
-- Allow Datafold to connect to the database
GRANT CREATE SESSION TO DATAFOLD;
-- Allow Datafold to create temporary working tables and views in its own schema
GRANT CREATE TABLE TO DATAFOLD;
GRANT CREATE VIEW TO DATAFOLD;
```
## Grant read access to your data
Datafold needs `SELECT` access on every table you want to diff. There are two approaches depending on how broad your access requirements are.
### Option A: Grant access to all tables (recommended for most migrations)
If Datafold should be able to diff any table in the database, grant the `SELECT ANY TABLE` system privilege:
```sql theme={null}
GRANT SELECT ANY TABLE TO DATAFOLD;
```
This is the simplest approach and avoids the need to update grants each time new tables are added.
### Option B: Grant access per schema
If you need to restrict Datafold to specific schemas, run the following block for each schema. Replace `YOURSCHEMA` with the schema name:
```sql theme={null}
BEGIN
FOR t IN (SELECT table_name FROM all_tables WHERE owner = 'YOURSCHEMA') LOOP
EXECUTE IMMEDIATE 'GRANT SELECT ON "YOURSCHEMA"."' || t.table_name || '" TO DATAFOLD';
END LOOP;
END;
/
```
Repeat for each schema that contains tables you want to diff.
Option B only grants access to tables that exist at the time you run the script. If new tables are added later, you will need to re-run the grant or add individual `GRANT SELECT` statements for those tables.
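When new tables appear after the initial setup, the individual `GRANT SELECT` statements can be generated rather than typed by hand. The sketch below assumes you already know the new table names; `grant_select_statements`, `quote_identifier`, and the example names are hypothetical. It only renders SQL text for you to review and run as a DBA.

```python
def quote_identifier(name: str) -> str:
    """Double-quote an Oracle identifier, matching the quoting in the PL/SQL block above."""
    if '"' in name:
        raise ValueError(f"identifier must not contain double quotes: {name!r}")
    return f'"{name}"'


def grant_select_statements(schema: str, tables: list[str],
                            grantee: str = "DATAFOLD") -> list[str]:
    """Render one GRANT SELECT statement per table in the given schema."""
    return [
        f"GRANT SELECT ON {quote_identifier(schema)}.{quote_identifier(t)} TO {grantee};"
        for t in tables
    ]


# Schema and table names below are made up for illustration.
for statement in grant_select_statements("YOURSCHEMA", ["ORDERS_2025", "CUSTOMERS_V2"]):
    print(statement)
```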
## Grant required system privileges
Datafold requires one additional privilege to function correctly.
### Tablespace quota (required)
During a diff, Datafold creates temporary tables in the `DATAFOLD` schema to store intermediate results. The user needs permission to consume disk space for these tables.
You can grant unlimited tablespace:
```sql theme={null}
GRANT UNLIMITED TABLESPACE TO DATAFOLD;
```
Or, if your DBA prefers to cap disk usage, assign a specific quota instead:
```sql theme={null}
-- Example: allow up to 1 GB on a specific tablespace
ALTER USER DATAFOLD QUOTA 1G ON <tablespace_name>;
```
## Full script
Copy and customize this script for your environment. See the comments for what to change.
```sql theme={null}
----------------------------------------------------------------------
-- Datafold Oracle setup script
-- Run as a DBA or user with GRANT privileges
----------------------------------------------------------------------
-- Step 1: Switch to your pluggable database (multitenant only — skip if single-tenant)
-- ALTER SESSION SET CONTAINER = YOURPDB;
-- Step 2: Create the Datafold user
CREATE USER DATAFOLD IDENTIFIED BY <password>;
GRANT CREATE SESSION TO DATAFOLD;
GRANT CREATE TABLE TO DATAFOLD;
GRANT CREATE VIEW TO DATAFOLD;
-- Step 3: Grant read access (choose Option A or Option B)
-- Option A: Access to all tables (simplest)
GRANT SELECT ANY TABLE TO DATAFOLD;
-- Option B: Access to specific schemas only (repeat per schema)
-- BEGIN
-- FOR t IN (SELECT table_name FROM all_tables WHERE owner = 'YOURSCHEMA') LOOP
-- EXECUTE IMMEDIATE 'GRANT SELECT ON "YOURSCHEMA"."' || t.table_name || '" TO DATAFOLD';
-- END LOOP;
-- END;
-- /
-- Step 4: Required system privileges
GRANT UNLIMITED TABLESPACE TO DATAFOLD;
-- Or, to cap disk usage:
-- ALTER USER DATAFOLD QUOTA 1G ON <tablespace_name>;
```
## Configure in Datafold
Once the Oracle user is created and grants are in place, add the connection in Datafold.
| Field Name | Description |
| ------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Name** | A display name for this connection within Datafold (e.g., "Oracle Production") |
| **Host** | The hostname or IP address of your Oracle database server |
| **Port** | The Oracle listener port (default: `1521`) |
| **User** | `DATAFOLD` (the user created above) |
| **Password** | The password you set when creating the `DATAFOLD` user |
| **Connection type** | Choose **Service** or **SID** depending on your Oracle setup. Use **Service** if unsure — it is the default for most modern Oracle configurations. |
| **Service (or SID)** | The Oracle service name or SID to connect to (e.g., `XEPDB1` or your database name) |
| **Schema for temporary tables** | `DATAFOLD` — this is the schema created automatically when you created the `DATAFOLD` user |
Click **Create**. Your data connection is ready!
## Troubleshooting
### `ORA-00942: table or view does not exist`
The Datafold user does not have `SELECT` access on the table being diffed. Verify that you ran the grants from [Grant read access to your data](#grant-read-access-to-your-data) for the correct schema.
### `ORA-28000: the account is locked`
The `DATAFOLD` user account has been locked, typically due to failed login attempts. Unlock it with:
```sql theme={null}
ALTER USER DATAFOLD ACCOUNT UNLOCK;
```
### `ORA-01950: no privileges on tablespace`
The `DATAFOLD` user does not have a tablespace quota. Run the tablespace grant from [Grant required system privileges](#grant-required-system-privileges).
# PostgreSQL
Source: https://docs.datafold.com/integrations/databases/postgresql
**INFO**
Column-level Lineage is supported for AWS Aurora and RDS Postgres and *requires* CloudWatch to be configured.
**Steps to complete:**
1. [Run SQL script and create schema for Datafold](/integrations/databases/postgresql#run-sql-script-and-create-schema-for-datafold)
2. [Configure your data connection in Datafold](/integrations/databases/postgresql#configure-in-datafold)
## Run SQL script and create schema for Datafold
To connect to Postgres, create a user with read-only access to all tables in all schemas and write access to a Datafold-specific schema for temporary tables:
```sql theme={null}
/* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse. */
CREATE SCHEMA datafold_tmp;
/* Create a datafold user */
CREATE ROLE datafold WITH LOGIN ENCRYPTED PASSWORD 'SOMESECUREPASSWORD';
/* Give the datafold role write access to the temporary schema */
GRANT ALL ON SCHEMA datafold_tmp TO datafold;
/* Make sure that the datafold role has read permissions on the tables */
GRANT USAGE ON SCHEMA <schema_name> TO datafold;
GRANT SELECT ON ALL TABLES IN SCHEMA <schema_name> TO datafold;
```
Datafold utilizes a temporary schema, named `datafold_tmp` in the above script, to materialize scratch work and keep data processing in your warehouse.
## Configure in Datafold
| Field Name | Description |
| --------------------------- | --------------------------------------------------------------- |
| Name | A name given to the data connection within Datafold |
| Host | The hostname address for your database; default value 127.0.0.1 |
| Port | Postgres connection port; default value is 5432 |
| User | The user role created in our SQL script, named datafold |
| Password | The password created in our SQL script |
| Database Name | The name of the Postgres database you want to connect to |
| Schema for temporary tables | The schema (datafold\_tmp) created in our SQL script |
Click **Create**. Your data connection is ready!
***
## Column-level Lineage with Aurora & RDS
This will guide you through setting up Column-level Lineage with AWS Aurora & RDS using CloudWatch.
**Steps to complete:**
1. [Set up Postgres with permissions](#run-sql-script)
2. [Increase the logging verbosity of Postgres](#increase-logging-verbosity) so Datafold can parse lineage
3. [Set up an account for fetching the logs from CloudWatch.](#connect-datafold-to-cloudwatch)
4. [Configure your data connection in Datafold](#configure-in-datafold)
### Run SQL Script
To connect to Postgres, create a user with read-only access to all tables in all schemas and write access to a Datafold-specific schema for temporary tables:
```sql theme={null}
/* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse. */
CREATE SCHEMA datafold_tmp;
/* Create a datafold user */
CREATE ROLE datafold WITH LOGIN ENCRYPTED PASSWORD 'SOMESECUREPASSWORD';
/* Give the datafold role write access to the temporary schema */
GRANT ALL ON SCHEMA datafold_tmp TO datafold;
/* Make sure that the datafold role has read permissions on the tables */
GRANT USAGE ON SCHEMA <schema_name> TO datafold;
GRANT SELECT ON ALL TABLES IN SCHEMA <schema_name> TO datafold;
```
### Increase logging verbosity
Database instances run with default parameters that do not include statement logging. To turn it on, create a new `Parameter Group`: click **Parameter Groups** in the menu and create a new Parameter Group.
Next, select the parameter group family. The right family depends on the cluster you are running; `aurora-postgresql10` is the appropriate family for Aurora Serverless.
Finally, set the `log_statement` enum field to `mod`, which logs all DDL statements plus data-modifying statements. Note: This field isn't set by default.
After saving the parameter group, go back to your database, and select the database cluster parameter group.
### Connect Datafold to CloudWatch
Start by creating a new user to isolate the permissions as much as possible. Go to IAM and create a new user.
Next, create a new group named `CloudWatchLogsReadOnly` and attach the `CloudWatchLogsReadOnlyAccess` policy to it. Next, select the group.
When reviewing the user, it should have the freshly created group attached to it.
After confirming the new user, you should be given the `Access Key` and `Secret Key`. Save both values securely; you will need them to finish the configuration in Datafold.
The last piece of information Datafold needs is the CloudWatch Log Group. You will find this in CloudWatch under the Log Group section in the sidebar. It will be formatted as `/aws/rds/cluster/<cluster_name>/postgresql`.
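The log group path is the cluster name placed between a fixed prefix and suffix. As a small sketch, the helper below builds that path; the function name `postgres_log_group`, the example cluster name, and the character check (letters, digits, and hyphens only) are assumptions for illustration. Always confirm the actual path in the CloudWatch console.

```python
import re


def postgres_log_group(cluster_name: str) -> str:
    """Build the CloudWatch log group path in the format described above."""
    # Assumed rule for this sketch: identifiers contain only letters, digits, and hyphens.
    if not re.fullmatch(r"[A-Za-z0-9-]+", cluster_name):
        raise ValueError(f"unexpected characters in cluster name: {cluster_name!r}")
    return f"/aws/rds/cluster/{cluster_name}/postgresql"


# Cluster name below is made up for illustration.
print(postgres_log_group("my-aurora-cluster"))
```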
### Configure in Datafold
| Field Name | Description |
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| Name | A name given to the data connection within Datafold |
| Host | The hostname address for your database; default value 127.0.0.1 |
| Port | Postgres connection port; default value is 5432 |
| User | The user role created in the SQL script; datafold |
| Password | The password created in the SQL permissions script |
| Database Name | The name of the Postgres database you want to connect to |
| AWS Access Key | The Access Key provided in the [Connect Datafold to CloudWatch](/integrations/databases/postgresql#connect-datafold-to-cloudwatch) step |
| AWS Secret | The Secret Key provided in the [Connect Datafold to CloudWatch](/integrations/databases/postgresql#connect-datafold-to-cloudwatch) step |
| Cloudwatch Postgres Log Group | The path of the Log Group; formatted as `/aws/rds/cluster/<cluster_name>/postgresql` |
| Schema for temporary tables | The schema created in the SQL setup script; datafold\_tmp |
Click **Create**. Your data connection is ready!
# Redshift
Source: https://docs.datafold.com/integrations/databases/redshift
**Steps to complete:**
1. [Run SQL script and create schema for Datafold](/integrations/databases/redshift#run-sql-script-and-create-schema-for-datafold)
2. [Configure your data connection in Datafold](/integrations/databases/redshift#configure-in-datafold)
## Run SQL script and create schema for Datafold
To connect to Amazon Redshift, you must create a user with the following permissions:
* **Read-only access** to all tables in all schemas
* **Write access** to a dedicated temporary schema for Datafold
* **Access to SQL logs** for lineage construction
Datafold uses a temporary dataset to materialize scratch work and keep data processing in your warehouse. Create the schema with:
```
CREATE SCHEMA datafold_tmp;
```
Next, create the Datafold user. To grant read access to all schemas, the user must have superuser-level privileges in Redshift:
```
CREATE USER datafold CREATEUSER PASSWORD 'SOMESECUREPASSWORD';
```
Grant unrestricted access to system logs so Datafold can build column-level lineage:
```
ALTER USER datafold WITH SYSLOG ACCESS UNRESTRICTED;
```
Datafold utilizes a temporary schema, named `datafold_tmp` in the above script, to materialize scratch work and keep data processing in your warehouse.
## Configure in Datafold
| Field Name | Description |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| Name | A name given to the data connection within Datafold |
| Host | The hostname of your cluster. (Go to Redshift in your AWS console, select your cluster, the hostname is the endpoint listed at the top of the page) |
| Port | Redshift connection port; default value is 5439 |
| User | The user created in our SQL script, named `datafold` |
| Password | The password created in our SQL script |
| Database Name | The name of the Redshift database you want to connect to |
| Schema for temporary tables | The schema (`datafold_tmp`) created in our SQL script |
Click **Create**. Your data connection is ready!
# SAP HANA
Source: https://docs.datafold.com/integrations/databases/sap-hana
**INFO**
Column-level Lineage is not currently supported for SAP HANA.
**Steps to complete:**
1. [Create and authorize a user](#create-and-authorize-a-user)
2. [Create schema for Datafold](#create-schema-for-datafold)
3. [Configure in Datafold](#configure-in-datafold)
## Create and authorize a user
Create a new user `DATAFOLD` using SAP HANA Administration console (Systems-Security-Users). Specify password authentication, and set "Force password change on next logon" to "No". Grant MONITORING privileges for the databases to be diffed.
## Create schema for Datafold
Datafold utilizes a temporary schema to materialize scratch work and keep data processing in your warehouse.
```
CREATE SCHEMA datafold_tmp OWNED BY DATAFOLD;
```
## Configure in Datafold
| Field Name | Description |
| --------------------------- | ---------------------------------------------------- |
| Name | A name given to the data connection within Datafold. |
| Host | The hostname address for your database. |
| Port | SAP HANA connection port; default value is 443. |
| User | The user created above, named DATAFOLD. |
| Password | The password for user DATAFOLD. |
| Schema for temporary tables | The schema created above, named datafold\_tmp |
Click **Create**. Your data connection is ready!
# Snowflake
Source: https://docs.datafold.com/integrations/databases/snowflake
Connect Datafold to Snowflake for data diffing, CI/CD testing, lineage, and migration validation. Includes setup instructions and required permissions.
**NOTE**: Datafold needs permissions in your Snowflake dataset to read your table data. You will need to be a Snowflake *Admin* in order to grant the required permissions.
**Steps to complete:**
* [Create a user and role for Datafold](/integrations/databases/snowflake#create-a-user-and-role-for-datafold)
* [Set up password-based authentication](/integrations/databases/snowflake#set-up-password-based-authentication) or [use key-pair authentication](/integrations/databases/snowflake#use-key-pair-authentication)
* [Create a temporary schema](/integrations/databases/snowflake#create-schema-for-datafold)
* [Give the Datafold role access to your warehouse](/integrations/databases/snowflake#give-the-datafold-role-access)
* [Configure your data connection in Datafold](/integrations/databases/snowflake#configure-in-datafold)
## Create a user and role for Datafold
> A [full script](/integrations/databases/snowflake#full-script) can be found at the bottom of this page.
It is best practice to create a separate role for the Datafold integration (e.g., `DATAFOLDROLE`):
```
CREATE ROLE DATAFOLDROLE;
CREATE USER DATAFOLD DEFAULT_ROLE = "DATAFOLDROLE" MUST_CHANGE_PASSWORD = FALSE;
GRANT ROLE DATAFOLDROLE TO USER DATAFOLD;
```
To provide column-level lineage, Datafold needs to read & parse all SQL statements executed in your Snowflake account:
```
GRANT MONITOR EXECUTION ON ACCOUNT TO ROLE DATAFOLDROLE;
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE DATAFOLDROLE;
```
## Set up password-based authentication
Datafold supports both username/password authentication and key-pair authentication. To use a password, set one for the Datafold user:
```
ALTER USER DATAFOLD SET PASSWORD = 'SomethingSecret';
```
You can set the username/password in the Datafold web UI.
### Use key-pair authentication
If you would like to use key-pair authentication, go to **Settings** -> **Data Connections** -> **Your Snowflake Connection**, and change Authentication method from **Password** to **Key Pair**.
Generate and download the Key Pair file, then use the value within the file when running the following command in Snowflake to set the public key for the Datafold user:
```
ALTER USER DATAFOLD SET rsa_public_key='...';
```
## Create schema for Datafold
Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse.
```
CREATE SCHEMA <database_name>.DATAFOLD_TMP;
GRANT ALL ON SCHEMA <database_name>.DATAFOLD_TMP TO DATAFOLDROLE;
```
## Give the Datafold role access
Datafold will only scan the tables that it has access to. The snippet below will give Datafold read access to a database. If you have more than one database that you want to use in Datafold, rerun the script below for each one.
```sql theme={null}
/* Repeat for every DATABASE to be usable in Datafold. This allows Datafold to
correctly discover, profile & diff each table */
GRANT USAGE ON WAREHOUSE <warehouse_name> TO ROLE DATAFOLDROLE;
GRANT USAGE ON DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT USAGE ON ALL SCHEMAS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT USAGE ON FUTURE SCHEMAS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON ALL TABLES IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE TABLES IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON ALL VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON ALL MATERIALIZED VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE MATERIALIZED VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT ALL PRIVILEGES ON ALL DYNAMIC TABLES IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE DYNAMIC TABLES IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
```
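Because the grants must be repeated for every database, it can help to render them from a list of database names. The following is a minimal sketch: `database_grants` and the example database names are hypothetical. It covers only the database, schema, table, view, and materialized view grants; the warehouse and dynamic table grants from the snippet above are left out for brevity. It only produces SQL text for review.

```python
OBJECT_TYPES = ["TABLES", "VIEWS", "MATERIALIZED VIEWS"]
SCOPES = ["ALL", "FUTURE"]


def database_grants(database: str, role: str = "DATAFOLDROLE") -> list[str]:
    """Render read-access grants for one database (hypothetical helper)."""
    statements = [f"GRANT USAGE ON DATABASE {database} TO ROLE {role};"]
    for scope in SCOPES:
        statements.append(
            f"GRANT USAGE ON {scope} SCHEMAS IN DATABASE {database} TO ROLE {role};"
        )
    for object_type in OBJECT_TYPES:
        for scope in SCOPES:
            statements.append(
                f"GRANT SELECT ON {scope} {object_type} IN DATABASE {database} TO ROLE {role};"
            )
    return statements


# Database names below are made up for illustration.
for db in ["DEV", "ANALYTICS"]:
    print("\n".join(database_grants(db)))
```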
## Full Script
```sql theme={null}
--Step 1: Create a user and role for Datafold
CREATE ROLE DATAFOLDROLE;
CREATE USER DATAFOLD DEFAULT_ROLE = "DATAFOLDROLE" MUST_CHANGE_PASSWORD = FALSE;
GRANT ROLE DATAFOLDROLE TO USER DATAFOLD;
GRANT MONITOR EXECUTION ON ACCOUNT TO ROLE DATAFOLDROLE;
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE DATAFOLDROLE;
--Step 2a: Use password-based authentication
ALTER USER DATAFOLD SET PASSWORD = 'SomethingSecret';
--OR
--Step 2b: Use key-pair authentication
--ALTER USER DATAFOLD SET rsa_public_key='abc..'
--Step 3: Create schema for Datafold
CREATE SCHEMA <database_name>.DATAFOLD_TMP;
GRANT ALL ON SCHEMA <database_name>.DATAFOLD_TMP TO DATAFOLDROLE;
--Step 4: Give the Datafold role access to your data connection
/*
Repeat for every DATABASE to be usable in Datafold. This allows Datafold to
correctly discover, profile & diff each table
*/
GRANT USAGE ON WAREHOUSE <warehouse_name> TO ROLE DATAFOLDROLE;
GRANT USAGE ON DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT USAGE ON ALL SCHEMAS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT USAGE ON FUTURE SCHEMAS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON ALL TABLES IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE TABLES IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON ALL VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON ALL MATERIALIZED VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE MATERIALIZED VIEWS IN DATABASE <database_name> TO ROLE DATAFOLDROLE;
```
## Validate Snowflake Grants for Datafold
Run these queries to validate that the grants have been set up correctly:
> Note: More results may be returned if you have granted access to multiple roles or users.
Example Placeholders:
* `<database_name>` = `DEV`
* `<warehouse_name>` = `DEMO`
```
-- Validate database usage for the DATAFOLDROLE
SHOW GRANTS ON DATABASE <database_name>;
```
```
-- Validate warehouse usage for the DATAFOLDROLE
SHOW GRANTS ON WAREHOUSE <warehouse_name>;
```
```
-- Validate schema permissions for the DATAFOLDROLE
SHOW GRANTS ON SCHEMA <database_name>.DATAFOLD_TMP;
```
## A note on future grants
The above database grants will be insufficient if any future grants have been defined at the schema level, because [schema-level grants will override database-level grants](https://docs.snowflake.com/en/sql-reference/sql/grant-privilege#considerations). In that case, you will need to execute future grants for every existing *schema* that Datafold will operate on.
```sql theme={null}
GRANT SELECT ON FUTURE TABLES IN SCHEMA <database_name>.<schema_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE VIEWS IN SCHEMA <database_name>.<schema_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON FUTURE MATERIALIZED VIEWS IN SCHEMA <database_name>.<schema_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA <database_name>.<schema_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON ALL VIEWS IN SCHEMA <database_name>.<schema_name> TO ROLE DATAFOLDROLE;
GRANT SELECT ON ALL MATERIALIZED VIEWS IN SCHEMA <database_name>.<schema_name> TO ROLE DATAFOLDROLE;
```
## Configure in Datafold
| Field Name | Description |
| --------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Name | A name given to the data connection within Datafold |
| Account identifier          | The Org name-Account name pair for your Snowflake account. This can be found in the browser address string. It may look like [https://orgname-accountname.snowflakecomputing.com](https://orgname-accountname.snowflakecomputing.com) or [https://app.snowflake.com/orgname/accountname](https://app.snowflake.com/orgname/accountname). In the setup form, enter `<orgname>-<accountname>`. |
| User | The username set in the [Setup password-based](/integrations/databases/snowflake#set-up-password-based-authentication) authentication section |
| Password | The password set in the [Setup password-based](/integrations/databases/snowflake#set-up-password-based-authentication) authentication section |
| Key Pair file | The key file generated in the [Use key-pair authentication](/integrations/databases/snowflake#use-key-pair-authentication) section |
| Warehouse | The Snowflake warehouse name |
| Schema for temporary tables | The schema name you created with our script (`<database_name>.DATAFOLD_TMP`) |
| Role | The role you created for Datafold (Typically DATAFOLDROLE) |
| Default DB | A database the role above can access. If more than one database was added, whichever you prefer to be the default |
> Note: Please review the documentation for the account name. Datafold uses Format 1 (Preferred): [https://docs.snowflake.com/en/user-guide/admin-account-identifier#using-an-account-locator-as-an-identifier](https://docs.snowflake.com/en/user-guide/admin-account-identifier#using-an-account-locator-as-an-identifier)
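As an illustration of the **Account identifier** field, the sketch below derives the `orgname-accountname` value from the two browser URL shapes listed in the table. It is an assumption-laden example rather than anything Datafold ships: the function name `account_identifier` is hypothetical, and other URL forms (for example account-locator or private-link hostnames) are not handled.

```python
from urllib.parse import urlparse

SNOWFLAKE_SUFFIX = ".snowflakecomputing.com"


def account_identifier(url):
    """Return 'orgname-accountname' for the two URL shapes described above."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Shape 1: https://app.snowflake.com/orgname/accountname
    if host == "app.snowflake.com":
        parts = [part for part in parsed.path.split("/") if part]
        if len(parts) < 2:
            raise ValueError(f"expected /orgname/accountname in {url!r}")
        return f"{parts[0]}-{parts[1]}"

    # Shape 2: https://orgname-accountname.snowflakecomputing.com
    if host.endswith(SNOWFLAKE_SUFFIX):
        return host[: -len(SNOWFLAKE_SUFFIX)]

    raise ValueError(f"unrecognized Snowflake URL: {url!r}")


if __name__ == "__main__":
    print(account_identifier("https://app.snowflake.com/orgname/accountname"))
```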
Click **Create**. Your data connection is ready!
# Microsoft SQL Server
Source: https://docs.datafold.com/integrations/databases/sql-server
Connect Datafold to Microsoft SQL Server for data diffing, reconciliation, and migration validation. Includes setup instructions and required permissions.
**INFO**
Column-level Lineage is not currently supported for Microsoft SQL Server.
**Steps to complete:**
1. [Run SQL script and create schema for Datafold](/integrations/databases/sql-server#run-sql-script-and-create-schema-for-datafold)
2. [Configure your data connection in Datafold](/integrations/databases/sql-server#configure-in-datafold)
## Run SQL script and create schema for Datafold
To connect to Microsoft SQL Server, create a user with read-only access to all tables you wish to diff. Include read and write access to a Datafold-specific temp schema:
```sql theme={null}
/* Select the database that will contain the temp schema */
USE DatabaseName;
/* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse. */
CREATE SCHEMA datafold_tmp;
/* Create the Datafold user */
CREATE LOGIN DatafoldUser WITH PASSWORD = 'SOMESECUREPASSWORD';
CREATE USER DatafoldUser FOR LOGIN DatafoldUser;
/* Allow the user to create views */
GRANT CREATE VIEW TO DatafoldUser;
/* Grant read access to diff tables */
GRANT SELECT ON SCHEMA::YourSchema TO DatafoldUser;
/* Grant read + write access to datafold_tmp schema */
GRANT CONTROL ON SCHEMA::datafold_tmp TO DatafoldUser;
```
## Configure in Datafold
| Field Name | Description |
| ---------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| Connection name | A name given to the data connection within Datafold |
| Host | The hostname for your SQL Server instance |
| Port | SQL Server connection port; default value is 1433 |
| Username | The user created in our SQL script, named DatafoldUser |
| Password | The password created in our SQL script |
| Database | The name of the SQL Server database you want to connect to |
| Dataset for temporary tables | The schema created in our SQL script, in `database.schema` format (`DatabaseName.datafold_tmp` in the script above). |
Click **Create**. Your data connection is ready!
# Starburst
Source: https://docs.datafold.com/integrations/databases/starburst
**INFO**
Column-level Lineage is not currently supported for Starburst.
**Steps to complete:**
1. [Configure user in Starburst](#configure-user-in-starburst)
2. [Create schema for Datafold](#create-schema-for-datafold)
3. [Configure your data connection in Datafold](#configure-in-datafold)
## Configure user in Starburst
To connect to Starburst, create a user with read-only access to all data sources you wish to diff and optionally generate an access token. Datafold requires a schema to be set up within one of the catalogs, typically hosted on platforms like Amazon S3 or similar services.
## Create schema for Datafold
Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse.
## Configure in Datafold
| Field Name | Description |
| --------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| Connection name | A name given to the data connection within Datafold. |
| Host | The hostname for your Starburst instance (e.g., `sample-free-cluster.trino.galaxy.starburst.io` for Starburst SaaS). |
| Port                        | Starburst endpoint port; default value is 443.                                                                       |
| Encryption | Should be checked for Starburst Galaxy, possibly unchecked for local deployments. |
| User ID | User ID as created in Starburst, typically an email address. |
| Token | Access token generated in Starburst. |
| Password | Alternatively, provide a password. |
| Schema for temporary tables | Use `<catalog>.<schema>` format.                                                                                     |
Click **Create**. Your data source is now ready!
# Teradata
Source: https://docs.datafold.com/integrations/databases/teradata
**INFO**
Column-level Lineage is not currently supported for Teradata.
**Steps to complete:**
1. [Configure user in Teradata](#configure-user-in-teradata)
2. [Create a temporary database for Datafold](#create-a-temporary-database-for-datafold)
3. [Configure data connection in Datafold](#configure-data-connection-in-datafold)
## Configure user in Teradata
To connect to Teradata, create a user with read-only access to all databases you may wish to diff, including the login database:
```
CREATE USER DATAFOLD AS PERMANENT=1000000000 BYTES PASSWORD= COLLATION = ASCII TIME ZONE ='GMT';
GRANT EXECUTE FUNCTION ON DB1 TO DATAFOLD;
GRANT SELECT ON DB1 TO DATAFOLD;
...
GRANT SELECT ON DB9 TO DATAFOLD;
```
## Create a temporary database for Datafold
Datafold requires a database to store temporary data with full permissions:
```
CREATE DATABASE DATAFOLD_TMP AS PERMANENT=10000000000 BYTES;
GRANT ALL ON DATAFOLD_TMP TO DATAFOLD;
```
## Configure data connection in Datafold
| Field Name | Description |
| ----------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| Connection Name | A name given to the data connection within Datafold. |
| Host | The hostname for your Teradata instance (e.g., account-name-2e3ba8b32qac9d.env.clearscape.teradata.com for Teradata SaaS). |
| Port | Teradata endpoint port; the default value is 1025. |
| User ID | User ID, e.g., DATAFOLD. |
| Password | Password from above. |
| Database | The connection database, e.g., DB1 from above. |
| Database for Temporary Tables | The temporary database, e.g., DATAFOLD\_TMP from above. |
Click **Create**. Your data connection is now ready!
# OAuth Support
Source: https://docs.datafold.com/integrations/oauth
Set up OAuth App Connections in your supported data warehouses to securely execute data diffs on behalf of your users.
This feature is currently supported for Databricks, Snowflake, Redshift, and BigQuery.
OAuth support empowers users to run data diffs based on their individual permissions and roles configured within the data warehouses. This ensures that data access is governed by existing security policies and protocols.
## Overview
The authentication flow proceeds as follows:
1. Users authenticate using the configured OAuth provider.
2. Users can then create diffs between datasets that they can access under their OAuth database permissions.
3. During Continuous Integration (CI), Datafold executes diffs using a Service Account with the least privileges, thus masking sensitive/PII data.
4. If a user needs to see sensitive/PII data from a CI diff, and they have permission via OAuth to do so, they can rerun the diff, and then Datafold will authenticate the user using OAuth database permissions. Then, the user will have access to the data based on these permissions.
This structure ensures that diffs are executed with the user's database credentials with their configured roles and permissions. Data access permissions are thus fully managed by the database, and Datafold only passes through queries.
## How it works
### 1. Create a Data Diff
When you attempt to run a data diff, you will notice that it won't run without authentication:
### 2. Authorize the Data Diff
Authorize the data diff by clicking the **Authenticate** button. This will redirect you to the data warehouse for authentication:
Upon successful authentication, you will be redirected back.
### 3. The Data Diff is now running
### 4. View the Data Diff results
The results reflect your permissions within the data warehouse:
Note that running the same data diff, as a different user, renders different results:
The masked values represent the data retrieved from the data warehouse. We do not conduct any post-processing:
By default, results are only visible to their authors. Users can still clone data diffs, but the results may differ depending on their data warehouse access levels.
For example, another user would not be able to access the data diff results from the previous example:
### 5. Sharing Data Diffs
Data diff sharing is a feature that enables you to share data diffs with other users. This is useful in scenarios such as compliance verification, where auditors can access specific data diffs without first requiring permissions to be set up in the data warehouse.
Sharing can be accessed via the **Actions** dropdown on the data diff page:
Note that data diff sharing is disabled by default:
It can be enabled under **Org Settings** by clicking on **Allow Data Diff sharing**:
Once enabled, you can share data diffs with other users:
## Configuring OAuth
Navigate to **Settings** and click on your data connection. Then, click on **Advanced settings** and under **OAuth**, set the **Client Id** and **Client Secret** fields:
## Example: Databricks
To create a new Databricks app connection:
1. Go to **Settings** and **App connections**.
2. Click **Add connection** in the top right of the screen.
3. Fill in the required fields:
Application Name:
```
Datafold OAuth connection
```
Redirect URLs:
```
https://app.datafold.com/api/internal/oauth_dwh/callback
```
Datafold caches **access tokens** and uses **refresh tokens** to fetch new valid tokens in order to complete the diffs and reduce the number of times users need to authenticate against the data warehouses.
One hour is sufficient for the access token.
The refresh token will determine the frequency of user reauthentication, whether it's daily, weekly, or monthly.
4. Click **Add** to obtain the **Client ID** and **Client Secret**.
5. Fill in the **Client ID** and **Client Secret** fields in Datafold's Data Connection advanced settings:
6. Click **Test and save OAuth**. You will be redirected to Databricks to complete authentication. If you are already authenticated, you will be redirected back. This notification signals a successful OAuth configuration:
### Additional steps for Databricks
To ensure that users have correct access rights to temporary tables (stored in **Dataset for temporary tables** provided in the **Basic settings** for the Databricks connection), follow these steps:
1. Update the permissions for the **Dataset for temporary tables** in Databricks.
2. Grant these permissions to Datafold users: **USE SCHEMA** and **CREATE TABLE**.
This will ensure that materialization results from data diffs are only readable by their authors.
## Example: Snowflake
To create a new Snowflake app connection:
1. Go to Snowflake and run this SQL:
```sql theme={null}
CREATE SECURITY INTEGRATION DATAFOLD_OAUTH
TYPE = OAUTH
ENABLED = TRUE
OAUTH_CLIENT = CUSTOM
OAUTH_CLIENT_TYPE = 'CONFIDENTIAL'
OAUTH_REDIRECT_URI = 'https://app.datafold.com/api/internal/oauth_dwh/callback'
PRE_AUTHORIZED_ROLES_LIST=(, , ...)
OAUTH_ISSUE_REFRESH_TOKENS = TRUE
OAUTH_REFRESH_TOKEN_VALIDITY = 604800
OAUTH_ENFORCE_PKCE=TRUE;
```
**CAUTION**
* `PRE_AUTHORIZED_ROLES_LIST` must include all roles allowed to use the current security integration.
* By default, `ACCOUNTADMIN`, `SECURITYADMIN`, and `ORGADMIN` are not allowed to be included in `PRE_AUTHORIZED_ROLES_LIST`.
Datafold caches **access tokens** and uses **refresh tokens** to fetch new valid tokens in order to complete the diffs and reduce the number of times users need to authenticate against the data warehouses.
`OAUTH_REFRESH_TOKEN_VALIDITY` can be in the range of 3600 (1 hour) to 7776000 (90 days).
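To pick a value for `OAUTH_REFRESH_TOKEN_VALIDITY`, it can help to convert a reauthentication period in days to seconds and check it against the range stated above. This is a small illustrative sketch; the function name `refresh_token_validity` is hypothetical, and the bounds are simply the ones quoted in this section.

```python
SECONDS_PER_DAY = 86_400
MIN_VALIDITY = 3_600       # 1 hour, lower bound quoted above
MAX_VALIDITY = 7_776_000   # 90 days, upper bound quoted above


def refresh_token_validity(days):
    """Convert a reauthentication period in days to seconds, checking the range."""
    seconds = round(days * SECONDS_PER_DAY)
    if not MIN_VALIDITY <= seconds <= MAX_VALIDITY:
        raise ValueError(
            f"{seconds} seconds is outside {MIN_VALIDITY}..{MAX_VALIDITY}"
        )
    return seconds


if __name__ == "__main__":
    # A weekly reauthentication period matches the value in the SQL above.
    print(refresh_token_validity(7))
```

With a weekly period the result is 604800, which is the value used in the `CREATE SECURITY INTEGRATION` statement in step 1.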
2. To retrieve `OAUTH_CLIENT_ID` and `OAUTH_CLIENT_SECRET`, run the following SQL:
```sql theme={null}
select system$show_oauth_client_secrets('DATAFOLD_OAUTH');
```
Example result:
3. Fill in the **Client ID** and **Client Secret** fields in Datafold's Data Connection advanced settings:
4. Click **Test and save OAuth**. You will be redirected to Snowflake to complete authentication.
Your default Snowflake role will be used for the generated **access token**.
This notification signals a successful OAuth configuration:
### Additional steps for Snowflake
To guarantee correct access rights to temporary tables (stored in **Schema for temporary tables** provided in the **Basic settings** for Snowflake connection):
* Grant the required privileges on the database and `TEMP` schema for all roles that will be using the OAuth flow.
```sql theme={null}
GRANT USAGE ON WAREHOUSE TO ROLE ;
GRANT USAGE ON DATABASE TO ROLE ;
GRANT USAGE ON ALL SCHEMAS IN DATABASE TO ROLE ;
GRANT USAGE ON FUTURE SCHEMAS IN DATABASE TO ROLE ;
GRANT ALL ON SCHEMA . TO ROLE ;
```
* Revoke `SELECT` privileges for tables in the `TEMP` schema for all roles that will be using the OAuth flow (except for the `DATAFOLDROLE` role), if they were provided. This action must be performed for all roles utilizing the OAuth flow.
```sql theme={null}
-- Revoke SELECT privileges for the TEMP SCHEMA
revoke SELECT ON ALL TABLES IN SCHEMA . FROM ROLE ;
revoke SELECT ON FUTURE TABLES IN SCHEMA . FROM ROLE ;
revoke SELECT ON ALL VIEWS IN SCHEMA . FROM ROLE ;
revoke SELECT ON FUTURE VIEWS IN SCHEMA . FROM ROLE ;
revoke SELECT ON ALL MATERIALIZED VIEWS IN SCHEMA . FROM ROLE ;
revoke SELECT ON FUTURE MATERIALIZED VIEWS IN SCHEMA . FROM ROLE ;
-- Revoke SELECT privileges for a Database
revoke SELECT ON ALL TABLES IN DATABASE FROM ROLE ;
revoke SELECT ON FUTURE TABLES IN DATABASE FROM ROLE ;
revoke SELECT ON ALL VIEWS IN DATABASE FROM ROLE ;
revoke SELECT ON FUTURE VIEWS IN DATABASE FROM ROLE ;
revoke SELECT ON ALL MATERIALIZED VIEWS IN DATABASE FROM ROLE ;
revoke SELECT ON FUTURE MATERIALIZED VIEWS IN DATABASE FROM ROLE ;
```
**CAUTION**
If one of the roles has `FUTURE GRANTS` at the database level, this role will also have `FUTURE GRANTS` on the `TEMP` schema.
## Example: Redshift
Redshift does not support OAuth2. To execute data diffs on behalf of a specific user, that user needs to provide their own credentials to Redshift.
1. Configure permissions on the Redshift side. Grant the necessary access rights to temporary tables (stored in the **Schema for temporary tables** provided in the **Basic settings** for Redshift connection):
```sql theme={null}
GRANT USAGE on SCHEMA to ;
GRANT CREATE on SCHEMA to ;
```
2. As an Administrator, select the **Enabled** toggle in Datafold's Redshift Data Connection **Advanced settings**:
Then, click the **Test and Save** button.
3. As a User, add your Redshift credentials into Datafold. Click on your Datafold username to **Edit Profile**:
Then, click **Add credentials** and select the required Redshift data connection from the **Data Connections** list:
Finally, provide your Redshift username and password, and configure the **Delete on** field (after this date, your credentials will be removed from Datafold):
Click **Create credentials**.
## Example: BigQuery
1. Create a new Google Cloud OAuth 2.0 Client ID. Go to the Google Cloud console, navigate to **APIs & Services**, then **Credentials**, and click **+ CREATE CREDENTIALS**:
Select **OAuth client ID**:
From the list of **Application type**, select **Web application**:
Provide a name in the **Name** field:
In **Authorized redirect URIs**, provide `https://app.datafold.com/api/internal/oauth_dwh/callback`:
Click **CREATE**. Then, download the OAuth Client credentials as a JSON file:
2. Activate BigQuery OAuth in Datafold by uploading the JSON OAuth credentials in the **JSON OAuth keys file** section, in Datafold's BigQuery Data Connection **Advanced settings**:
Click **Test and Save**.
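Before uploading the file, you may want to confirm that the downloaded credentials actually list the Datafold redirect URI. The sketch below is a hedged example: it assumes the JSON downloaded for a **Web application** client has a top-level `web` object with a `redirect_uris` list, and the function name `has_datafold_redirect` is hypothetical.

```python
import json

DATAFOLD_REDIRECT_URI = "https://app.datafold.com/api/internal/oauth_dwh/callback"


def has_datafold_redirect(client_config):
    """Return True if the parsed OAuth client JSON lists the Datafold redirect URI.

    Assumes the layout {"web": {"redirect_uris": [...], ...}}.
    """
    web = client_config.get("web", {})
    return DATAFOLD_REDIRECT_URI in web.get("redirect_uris", [])


if __name__ == "__main__":
    # Placeholder content standing in for the downloaded credentials file.
    sample = json.loads(
        '{"web": {"client_id": "placeholder", '
        '"redirect_uris": ["https://app.datafold.com/api/internal/oauth_dwh/callback"]}}'
    )
    print(has_datafold_redirect(sample))
```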
### Additional steps for BigQuery
1. Create a new temporary schema (dataset) for each OAuth user.
Go to Google Cloud console, navigate to BigQuery, select your project in BigQuery, and click on **Create dataset**:
Provide `datafold_tmp_` as the **Dataset ID** and set the same region as configured for other datasets. Click **CREATE DATASET**:
2. Configure permissions for `datafold_tmp_`.
Grant read/write/create/delete permissions to the user for their `datafold_tmp_` schema. This can be done by granting roles like **BigQuery Data Editor** or **BigQuery Data Owner** or any custom roles with the required permissions.
Go to Google Cloud console, navigate to BigQuery, select `datafold_tmp_` dataset, and click **Manage Permissions**:
Click **+ ADD PRINCIPAL**, specify the user and role, then click **SAVE**:
Ensure that only the specified user (excluding admins) has read/write/create/delete permissions on `datafold_tmp_`.
3. Configure temporary schema in Datafold.
As a user, navigate to `https://app.datafold.com/users/me`. If the user lacks credentials for BigQuery, click on **+ Add credentials**, select BigQuery datasource from the list, and click **Create credentials**:
The user will be redirected to `accounts.google.com` and then returned to the previous page:
Select BigQuery credentials from the list, input the **Temporary Schema** field in the format `<project>.<dataset>`, and click **Update**:
Users can update BigQuery credentials only if they have the correct permissions for the temporary schema.
# Integrate with Orchestrators
Source: https://docs.datafold.com/integrations/orchestrators
Integrate Datafold with dbt Core, dbt Cloud, Airflow, or custom orchestrators to streamline your data workflows with automated monitoring, testing, and seamless CI integration.
**NOTE**
To integrate with dbt, first set up a [Data Connection](/integrations/databases) and integrate with [Code Repositories](/integrations/code-repositories).
Then navigate to **Settings** → **dbt** and click **Add New Integration**.
Set up Datafold with dbt Core to enable automated data diffs and CI/CD integration.
Integrate with dbt Cloud to enable automated data diffs and CI/CD integration.
Use Datafold's API and SDK to build custom CI integrations tailored to your workflow.
# Custom Integrations
Source: https://docs.datafold.com/integrations/orchestrators/custom-integrations
Integrate Datafold with your custom orchestration using the Datafold SDK and REST API.
To use the Datafold REST API, you should first create a Datafold API key in Settings > Account.
For automated/unattended integrations, we recommend creating a [service account](/security/service-accounts) API key instead of a personal one. Service-account keys belong to your organization rather than to an individual user, so your integration keeps working if the original creator leaves the team.
## Install
First, create a Python virtual environment:
```bash theme={null}
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel
```
Now, you're ready to install the Datafold SDK:
```
pip install datafold-sdk
```
## Configure
Navigate in the Datafold UI to Settings > Integrations > CI. After selecting `datafold-sdk` from the available options, complete configuration with the following information:
| Field Name | Description |
| ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Repository | Select the repository that generates the webhooks and where pull / merge requests will be raised. |
| Data Connection | Select the data connection where the code that is changed in the repository will run. |
| Name | An identifier used in Datafold to identify this CI configuration. |
| Files to ignore | If defined, the files matching the pattern will be ignored in the PRs. The pattern uses the syntax of .gitignore. Excluded files can be re-included by using the negation; re-included files can be later re-excluded again to narrow down the filter. |
| Mark the CI check as failed on errors | If the checkbox is disabled, the errors in the CI runs will be reported back to GitHub/GitLab as successes, to keep the check "green" and not block the PR/MR. By default (enabled), the errors are reported as failures and may prevent PR/MRs from being merged. |
| Require the `datafold` label to start CI | When this is selected, the Datafold CI process will only run when the 'datafold' label has been applied. This label needs to be created manually in GitHub or GitLab and the title or name must match 'datafold' exactly. |
| Sampling tolerance | The tolerance to apply in sampling for all data diffs. |
| Sampling confidence | The confidence to apply when sampling. |
| Sampling Threshold | Sampling will be disabled automatically if tables are smaller than specified threshold. If unspecified, default values will be used depending on the Data Connection type. |
## Add commands to your custom orchestration
```bash theme={null}
export DATAFOLD_API_KEY=XXXXXXXXX
# only needed if your Datafold app url is not app.datafold.com
export DATAFOLD_HOST=
```
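In a Python-based orchestration, the same two variables can be read and checked up front so that a missing key fails early. This is a minimal sketch under stated assumptions: `datafold_settings` is a hypothetical helper name, and an unset `DATAFOLD_HOST` is passed along as `None`, mirroring how the Python example in this section reads it.

```python
import os


def datafold_settings(environ=os.environ):
    """Read the Datafold API key and optional host from environment variables."""
    api_key = environ.get("DATAFOLD_API_KEY")
    if not api_key:
        raise RuntimeError("DATAFOLD_API_KEY is not set")
    # DATAFOLD_HOST is only needed if your app URL is not app.datafold.com.
    host = environ.get("DATAFOLD_HOST") or None
    return api_key, host


if __name__ == "__main__":
    try:
        _, host = datafold_settings()
        print("API key found; host override:", host)
    except RuntimeError as error:
        print("Configuration problem:", error)
```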
To submit diffs for a CI run, replace `ci_config_id`, `pr_num`, and `diffs_file` with the appropriate values for your CI configuration ID, pull request number, and the path to your diffs `JSON` file.
#### CLI
```bash theme={null}
datafold ci submit \
    --ci-config-id <ci_config_id> \
    --pr-num <pr_num> \
    --diffs <diffs_file>
```
#### Python
```python theme={null}
import os
from datafold_sdk.sdk.ci import run_diff
api_key = os.environ.get('DATAFOLD_API_KEY')
# Only needed if your Datafold app URL is not app.datafold.com
host = os.environ.get("DATAFOLD_HOST")
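# Replace the <...> placeholders with your own values before running.
run_diff(host=host,
         api_key=api_key,
         ci_config_id=<ci_config_id>,
         pr_num=<pr_num>,
         diffs='<diffs_file>')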
```
##### Example JSON format for diffs file
The `JSON` file should define the production and pull request tables to compare, along with any primary keys and columns to include or exclude in the comparison.
```json theme={null}
[
{
"prod": "YOUR_PROJECT.PRODUCTION_TABLE_A",
"pr": "YOUR_PROJECT.PR_TABLE_NUM",
"pk": ["ID"],
"include_columns": ["Column1", "Column2"],
"exclude_columns": ["Column3"]
},
{
"prod": "YOUR_PROJECT.PRODUCTION_TABLE_B",
"pr": "YOUR_PROJECT.PR_TABLE_NUM",
"pk": ["ID"],
"include_columns": ["Column1"],
"exclude_columns": []
}
]
```
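If your orchestrator already knows which tables were rebuilt, the diffs file can be written programmatically. The sketch below produces JSON in the shape shown above; the helper names `build_diff_entry` and `write_diffs_file` and the output file name are hypothetical, and only the five keys from the example are used.

```python
import json
from pathlib import Path


def build_diff_entry(prod, pr, pk, include_columns=None, exclude_columns=None):
    """Build one entry using the keys from the example JSON above."""
    return {
        "prod": prod,
        "pr": pr,
        "pk": list(pk),
        "include_columns": list(include_columns or []),
        "exclude_columns": list(exclude_columns or []),
    }


def write_diffs_file(entries, path):
    """Write the list of entries as a JSON array and return the file path."""
    path = Path(path)
    path.write_text(json.dumps(entries, indent=2))
    return path


if __name__ == "__main__":
    entries = [
        build_diff_entry(
            "YOUR_PROJECT.PRODUCTION_TABLE_A",
            "YOUR_PROJECT.PR_TABLE_NUM",
            ["ID"],
            include_columns=["Column1", "Column2"],
            exclude_columns=["Column3"],
        )
    ]
    print(write_diffs_file(entries, "diffs.json"))
```

The resulting path is what you would supply as the diffs file when submitting the CI run.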
# dbt Cloud
Source: https://docs.datafold.com/integrations/orchestrators/dbt-cloud
Integrate Datafold with dbt Cloud to automate Data Diffs in your CI pipeline, leveraging dbt jobs to detect changes and ensure data quality before merging.
**NOTE**
You will need a dbt **Team** account or higher to access the dbt Cloud API that Datafold uses to connect the accounts.
## Prerequisites
### Set up dbt Cloud CI
In dbt Cloud, [set up dbt Cloud CI](https://docs.getdbt.com/docs/deploy/cloud-ci-job) so that your Pull Request job runs when you open or update a Pull Request. This job will provide Datafold information about the changes included in the PR.
### Create an Artifacts Job in dbt Cloud
The Artifacts job generates production `manifest.json` on merge to main/master, giving Datafold information about the state of production. The simplest method is to set up a dbt Cloud job that executes the `dbt ls` command on merge to main/master.
> Note: `dbt ls` is preferred over `dbt compile` as it runs faster and data diffing does not require fully compiled models to work.
Example dbt Cloud artifact job settings and successful run:
If you are interested in continuous deployment, you can use a Merge Trigger Production Job instead of the Artifacts Job listed above.
### dbt Cloud Access URL
You will need your [access url](https://docs.getdbt.com/docs/cloud/about-cloud/regions-ip-addresses) to connect Datafold to your dbt Cloud account.
### Add dbt Cloud Service Account Token
To connect Datafold to your dbt Cloud account, you will need to use a [Service Token](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens).
**INFO**
Please note that the use of User API Keys for this purpose is no longer recommended due to a [recent security update](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens) in dbt Cloud. [Learn more below](/integrations/orchestrators/dbt-cloud#deprecating-user-tokens).
1. Navigate to **Account Settings → Service Tokens → + New Token**.
1. Add a Permission Set and select `Member` or `Developer`.
2. Select `All Projects`, or check only the projects you intend to use with Datafold.
3. Save your changes.
1. Navigate to **Your Profile → API Access** and copy the token.
#### Deprecating User Tokens
dbt Cloud is transitioning away from the use of User API Keys for authentication. The User API Key will be replaced by account-scoped Personal Access Tokens (PATs).
This update will affect the functionality of certain API endpoints. Specifically, `/v2/accounts`, `/v3/accounts`, and `/whoami` (undocumented API) will no longer return information about all the accounts tied to a user. Instead, the response will be filtered to include only the context of the specific account in the request.
dbt Cloud users have until April 30, 2024, to implement this change. After this date, all user API keys will be scoped to an account. New customers are required to use the new account-scoped PATs.
For more information, please refer to the [dbt Cloud API Documentation](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens).
If you have any questions or require further assistance, please don't hesitate to contact our support team.
## Create a dbt Cloud Integration in the Datafold app
* Navigate to Settings > Integrations > CI and create a new dbt Cloud integration.
## Configuration
### Basic Settings
* **Repository**: Select a repository that you set up in [the Code Repositories setup step](/integrations/code-repositories).
* **Data Connection**: Select a connection that you set up in [the Data Connections setup step](/integrations/databases).
* **Name**: This can be anything!
* **Primary key tag**: This is a text string that you may use to tag primary keys in your dbt project yaml. Note that to avoid the need for tagging, [primary keys can be inferred from dbt uniqueness tests](/deployment-testing/configuration/primary-key).
* **Account name**: This will be autofilled using your dbt API key.
* **Job that creates dbt artifacts**: This will be [the Artifacts Job that you created](#create-an-artifacts-job-in-dbt-cloud). Or, if you have a dbt production job that runs on each merge to main, select that job.
* **Job that builds pull requests**: This is the dbt CI job that is triggered when you open a Pull Request or Merge Request.
### Advanced Settings
* **Enable Datafold in CI/CD**: High-level switch to turn Datafold off or on in CI (but we hope you'll leave it on!).
* **Import dbt tags and descriptions**: Populate our Lineage tool with dbt metadata. ⚠️ This feature is in development. ⚠️
* **Slim Diff**: Only diff modified models in CI, instead of all models. [Please read more about Slim Diff](/deployment-testing/best-practices/slim-diff), which is highly configurable using dbt yaml, and each organization will need to set a strategy based on their data environment.
* Downstream Hightouch models will be diffed even when Slim Diff is turned on.
* **Diff Hightouch Models**: Hightouch customers can see diffs of downstream Hightouch assets in Pull Requests.
* **CI fails on primary key issues**: The existence of null or duplicate primary keys causes the Datafold CI check to fail.
* **Pull Request Label**: For when you want Datafold to *only* run in CI when a label is manually applied in GitHub/GitLab.
* **CI Diff Threshold**: For when you want Datafold to *only* run automatically if the number of diffs doesn't exceed this threshold for a given CI run.
* **Files to ignore**: If at least one modified file doesn’t match the ignore pattern, Datafold CI diffs all changed models in the PR. If all modified files should be ignored, Datafold CI does not run in the PR. ([Additional details.](/deployment-testing/configuration/datafold-ci/on-demand))
* **Custom base branch**: For when you want Datafold to **only** run in CI when a PR is opened against a specific base branch. You might need this if you have multiple environments built from different branches. See [Custom branch](https://docs.getdbt.com/faqs/Environments/custom-branch-settings) in dbt Cloud docs.
Click save, and that's it!
Now that you've set up a dbt Cloud integration, Datafold will diff your impacted tables whenever you push commits to a PR. A summary of the diff will appear in GitHub, and detailed results will appear in the Datafold app.
# dbt Core
Source: https://docs.datafold.com/integrations/orchestrators/dbt-core
Set up Datafold’s integration with dbt Core to automate Data Diffs in your CI pipeline.
**PREREQUISITES**
* Create a [Data Connection Integration](/integrations/databases) where your dbt project data is built.
* Create a [Code Repository Integration](/integrations/code-repositories) where your dbt project code is stored.
## Getting started
To add Datafold to your continuous integration (CI) pipeline using dbt Core, follow these steps:
### 1. Create a dbt Core integration.
### 2. Set up the dbt Core integration.
Complete the configuration by specifying the following fields:
#### Basic settings
| Field Name | Description |
| ------------------ | ------------------------------------------------------------------------------------------ |
| Configuration name | Choose a name for your Datafold dbt integration.                                           |
| Repository | Select your dbt project. |
| Data Connection | Select the data connection your dbt project writes to. |
| Primary key tag | Choose a string for [tagging primary keys](/deployment-testing/configuration/primary-key). |
#### Advanced settings: Configuration
| Field Name | Description |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Import dbt tags and descriptions | Import dbt metadata (including column and table descriptions, tags, and owners) to Datafold. |
| Slim Diff | Data diffs will be run only for models changed in a pull request. See our [guide to Slim Diff](/deployment-testing/best-practices/slim-diff) for configuration options. |
| Diff Hightouch Models | Run Data Diffs for Hightouch models affected by your PR. |
| CI fails on primary key issues | The existence of null or duplicate primary keys will cause CI to fail. |
| Pull Request Label | When this is selected, the Datafold CI process will only run when the `datafold` label has been applied. |
| CI Diff Threshold | Data Diffs will only be run automatically for a given CI run if the number of diffs doesn't exceed this threshold. |
| Branch commit selection strategy | Select "Latest" if your CI tool creates a merge commit (the default behavior for GitHub Actions). Choose "Merge base" if CI is run against the PR branch head (the default behavior for GitLab). |
| Custom base branch | If defined, CI will run only on pull requests with the specified base branch. |
| Columns to ignore | Use standard gitignore syntax to identify columns that Datafold should never diff for any table. This can [improve performance](/faq/performance-and-scalability#how-can-i-optimize-diff-performance-at-scale) for large datasets. Primary key columns will not be excluded even if they match the pattern. |
| Files to ignore | If at least one modified file doesn’t match the ignore pattern, Datafold CI diffs all changed models in the PR. If all modified files should be ignored, Datafold CI does not run in the PR. ([Additional details.](/deployment-testing/configuration/datafold-ci/on-demand)) |
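The "Files to ignore" rule above can be read as a simple decision: CI runs unless every modified file is ignored. The Python sketch below illustrates that decision only. It approximates pattern matching with the standard library's `fnmatch`, which is an assumption and not full gitignore syntax, and the function name `should_run_ci` is hypothetical.

```python
from fnmatch import fnmatch


def should_run_ci(modified_files, ignore_patterns):
    """Return True when at least one modified file is NOT ignored.

    Assumption: fnmatch-style globbing stands in for the ignore syntax
    Datafold actually applies; this only illustrates the decision rule.
    """
    def is_ignored(path):
        return any(fnmatch(path, pattern) for pattern in ignore_patterns)

    return any(not is_ignored(path) for path in modified_files)


if __name__ == "__main__":
    # Only documentation changed: every file is ignored, so CI would not run.
    print(should_run_ci(["docs/readme.md"], ["docs/*"]))
    # One model changed as well: CI would diff all changed models in the PR.
    print(should_run_ci(["docs/readme.md", "models/orders.sql"], ["docs/*"]))
```

Under this reading, a PR touching both an ignored file and a model file still triggers diffs for all changed models.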
#### Advanced settings: Sampling
Sampling allows you to compare large datasets more efficiently by checking only a randomly selected subset of the data rather than every row. By analyzing a smaller but statistically meaningful sample, Datafold can quickly estimate differences without the overhead of a full dataset comparison. To learn more about how sampling can result in a speedup of 2x to 20x or more, see our [best practices on sampling](/data-diff/cross-database-diffing/best-practices#enable-sampling).
| Field Name | Description |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Enable sampling | Enable sampling for data diffs to optimize analyzing large datasets. |
| Sampling tolerance | The tolerance to apply in sampling for all data diffs. |
| Sampling confidence | The confidence to apply when sampling. |
| Sampling threshold | Sampling will be disabled automatically if tables are smaller than the specified threshold. If unspecified, default values will be used depending on the Data Connection type. |
### 3. Obtain a Datafold API Key and CI config ID.
After saving the settings in step 2, scroll down and generate a new Datafold API Key and obtain the CI config ID.
For production CI use, we recommend creating a [service account](/security/service-accounts) API key instead of a personal one. Service-account keys belong to your organization rather than to an individual user, so CI keeps working if the original creator leaves the team.
### 4. Configure your CI script(s) with the Datafold SDK.
Using the Datafold SDK, configure your CI script(s) to upload dbt `manifest.json` files.
The `datafold dbt upload` command takes this general form and arguments:
```
datafold dbt upload --ci-config-id --run-type --commit-sha
```
You will need to configure orchestration to upload the dbt `manifest.json` files in 2 scenarios:
1. **On merges to main.** These `manifest.json` files represent the state of the dbt project on the base/production branch from which PRs are created.
2. **On updates to PRs.** These `manifest.json` files represent the state of the dbt project on the PR branch.
The dbt Core integration creation form automatically generates code snippets that can be added to CI runners.
By storing and comparing these `manifest.json` files, Datafold determines which dbt models to diff in a CI run.
Implementation details vary depending on which CI tool you use. Please review [these instructions and examples](#ci-implementation-tools) to help you configure updates to your organization's CI scripts.
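As a rough illustration of the two upload scenarios, the Python sketch below assembles the `datafold dbt upload` argument list for either run type. It uses only the flags and run types shown in this guide; the function name `build_upload_command`, the config ID `123`, and the commit SHA are hypothetical placeholders.

```python
import shlex

# Run types that appear in this guide's examples.
RUN_TYPES_IN_GUIDE = ("production", "pull_request")


def build_upload_command(ci_config_id, run_type, commit_sha, target_folder=None):
    """Assemble the argument list for `datafold dbt upload`."""
    if run_type not in RUN_TYPES_IN_GUIDE:
        raise ValueError(f"run_type must be one of {RUN_TYPES_IN_GUIDE}, got {run_type!r}")
    command = [
        "datafold", "dbt", "upload",
        "--ci-config-id", str(ci_config_id),
        "--run-type", run_type,
        "--commit-sha", commit_sha,
    ]
    if target_folder is not None:
        # Optional flag used in the CircleCI examples in this guide.
        command += ["--target-folder", target_folder]
    return command


if __name__ == "__main__":
    # 123 and abc1234 are placeholder values for illustration only.
    print(shlex.join(build_upload_command(123, "pull_request", "abc1234")))
```

A CI script could pass the returned list to `subprocess.run(command, check=True)` with `DATAFOLD_API_KEY` set in the environment.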
### 5. Test your dbt Core integration.
After updating your CI scripts, trigger the jobs that upload the `manifest.json` files representing the base/production state.
Then, open a new pull request with changes to a SQL file to trigger a CI run.
## CI implementation tools
We've created guides and templates for three popular CI tools.
**Having trouble setting up Datafold in CI?**
We're here to help! Please reach out and [chat with a Datafold Solutions Engineer](https://www.datafold.com/booktime).
To add Datafold to your CI tool, add `datafold dbt upload` steps in two CI jobs:
* **Upload Production Artifacts:** A CI job that builds a production `manifest.json`. *This can be either your Production Job or a special Artifacts Job that runs on merge to main (explained below).*
* **Upload Pull Request Artifacts:** A CI job that builds a PR `manifest.json`.
This ensures Datafold always has the necessary `manifest.json` files, enabling us to run data diffs comparing production data to dev data.
**GitHub Actions**
**Upload Production Artifacts**
Add the `datafold dbt upload` step to *either* your Production Job *or* an Artifacts Job.
**Production Job**
If your dbt prod job kicks off on merges to the base branch, add a `datafold dbt upload` step after the `dbt build` step.
```bash theme={null}
name: Production Job
on:
  push:
    branches:
      - main
jobs:
  run:
    runs-on: ubuntu-20.04
    steps:
      - name: Install Datafold SDK
        run: pip install -q datafold-sdk
      - name: Upload dbt artifacts to Datafold
        run: datafold dbt upload --ci-config-id --run-type production --commit-sha ${GIT_SHA}
        env:
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          GIT_SHA: "${{ github.sha }}"
```
**Artifacts Job**
If your existing Production Job runs on a schedule and not on merges to the base branch, create a dedicated job that runs on merges to the base branch which generates and uploads a `manifest.json` file to Datafold.
```bash theme={null}
name: Artifacts Job
on:
  push:
    branches:
      - main
jobs:
  run:
    runs-on: ubuntu-20.04
    steps:
      - name: Install Datafold SDK
        run: pip install -q datafold-sdk
      - name: Generate dbt manifest.json
        run: dbt ls
      - name: Upload dbt artifacts to Datafold
        run: datafold dbt upload --ci-config-id --run-type production --commit-sha ${BASE_GIT_SHA}
        env:
          # Use the same secret name as the other jobs (DATAFOLD_API_KEY).
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          BASE_GIT_SHA: "${{ github.sha }}"
```
**Pull Request Artifacts**
Include the `datafold dbt upload` step in your CI job that builds PR data.
```bash theme={null}
name: Pull Request Job
on:
  pull_request:
  push:
    branches:
      - '!main'
jobs:
  run:
    runs-on: ubuntu-20.04
    steps:
      - name: Install Datafold SDK
        run: pip install -q datafold-sdk
      - name: Upload PR manifest.json to Datafold
        run: |
          datafold dbt upload --ci-config-id --run-type pull_request --commit-sha ${PR_GIT_SHA}
        env:
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
          PR_GIT_SHA: "${{ github.event.pull_request.head.sha }}"
```
**Store Datafold API Key**
Save the API key as `DATAFOLD_API_KEY` in your [GitHub repository settings](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository).
**CircleCI**
**Upload Production Artifacts**
Add the `datafold dbt upload` step to *either* your Production Job *or* an Artifacts Job.
**Production Job**
If your dbt prod job kicks off on merges to the base branch, add a `datafold dbt upload` step after the `dbt build` step.
```bash theme={null}
version: 2.1
jobs:
  prod-job:
    filters:
      branches:
        only: main
    docker:
      - image: cimg/python:3.9
    steps:
      - checkout
      - run:
          name: "Install Datafold SDK"
          command: pip install -q datafold-sdk
      - run:
          name: "Build dbt project"
          command: dbt build
      - run:
          name: "Upload production manifest.json to Datafold"
          command: |
            datafold dbt upload --ci-config-id --run-type production --target-folder ./target/ --commit-sha ${CIRCLE_SHA1}
```
**Artifacts Job**
If your existing Production Job runs on a schedule and not on merges to the base branch, create a dedicated job that runs on merges to the base branch which generates and uploads a `manifest.json` file to Datafold.
```bash theme={null}
version: 2.1
jobs:
  artifacts-job:
    filters:
      branches:
        only: main
    docker:
      - image: cimg/python:3.9
    steps:
      - checkout
      - run:
          name: "Install Datafold SDK"
          command: pip install -q datafold-sdk
      - run:
          name: "Generate manifest.json"
          command: dbt ls --profiles-dir ./
      - run:
          name: "Upload production manifest.json to Datafold"
          command: datafold dbt upload --ci-config-id --run-type production --target-folder ./target/ --commit-sha ${CIRCLE_SHA1}
```
**Store Datafold API Key**
Save the API key in the [CircleCI interface](https://circleci.com/docs/set-environment-variable/).
**GitLab**
**Upload Production Artifacts**
Add the `datafold dbt upload` step to *either* your Production Job *or* an Artifacts Job.
**Production Job**
If your dbt prod job kicks off on merges to the base branch, add a `datafold dbt upload` step after the `dbt build` step.
```bash theme={null}
image:
  name: ghcr.io/dbt-labs/dbt-core:1.x
run_pipeline:
  stage: deploy
  before_script:
    - pip install -q datafold-sdk
  script:
    - dbt build --profiles-dir ./
    - datafold dbt upload --ci-config-id --run-type production --commit-sha $CI_COMMIT_SHA
```
**Artifacts Job**
If your existing Production Job runs on a schedule and not on merges to the base branch, create a dedicated job that runs on merges to the base branch which generates and uploads a `manifest.json` file to Datafold.
```bash theme={null}
image:
  name: ghcr.io/dbt-labs/dbt-core:1.x
run_pipeline:
  stage: deploy
  before_script:
    - pip install -q datafold-sdk
  script:
    - dbt ls --profiles-dir ./
    - datafold dbt upload --ci-config-id --run-type production --commit-sha $CI_COMMIT_SHA
```
**Store Datafold API Key**
Save the API key as `DATAFOLD_API_KEY` in [GitLab repository settings](https://docs.gitlab.com/ee/ci/yaml/index.html#secrets).
## CI for dbt multi-projects
When setting up CI for dbt multi-projects, each project should have its own dedicated CI integration to ensure that changes are validated independently.
## CI for dbt multi-projects within a monorepo
When managing multiple dbt projects within a monorepo (a single repository), it’s essential to configure individual Datafold CI integrations for each project to ensure proper isolation.
This approach prevents unintended triggering of CI processes for projects unrelated to the changes made. Here’s the recommended approach for setting it up in Datafold:
**1. Create separate CI integrations:** Create separate CI integrations within Datafold, one for each dbt project within the monorepo. Each integration should be configured to reference the same GitHub repository.
**2. Configure file filters**: For each CI integration, define file filters to specify which files should trigger the CI run. These filters prevent CI runs from being initiated when files from other projects in the monorepo are updated.
**3. Test and validate**: Before deployment, test each CI integration to validate that it triggers only when changes occur within its designated dbt project. Verify that modifications to files in one project do not inadvertently initiate CI processes for unrelated projects in the monorepo.
## Advanced configurations
### Skip Datafold in CI
To skip the Datafold step in CI, include the string `datafold-skip-ci` in the last commit message.
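If your CI script should also skip its own Datafold-related steps (for example, the manifest upload), it can mirror the same check. The sketch below is a minimal illustration, assuming a plain, case-sensitive substring match on the last commit message; how Datafold itself performs the match is not specified here, and the function names are hypothetical.

```python
import subprocess

SKIP_MARKER = "datafold-skip-ci"


def should_skip_datafold(commit_message):
    """Return True when the commit message contains the skip marker."""
    return SKIP_MARKER in commit_message


def last_commit_message():
    """Read the most recent commit message from the current git checkout."""
    result = subprocess.run(
        ["git", "log", "-1", "--pretty=%B"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Example message only; in CI you would call last_commit_message().
    print(should_skip_datafold("Fix typo in docs [datafold-skip-ci]"))
```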
### Programmatically trigger CI runs
The Datafold app relies on webhooks from your version control service to trigger CI runs. When a Dedicated Cloud deployment sits behind a VPN, those webhooks cannot reach the deployment because of the network's restricted access.
We can overcome this by triggering the CI runs via the [datafold-sdk](/api-reference/datafold-sdk) in the Actions/Job Runners, assuming they're running in the same network.
Add a new Datafold SDK command after uploading the manifest in a PR job:
**Important**
When configuring your CI script, be sure to use `${{ github.event.pull_request.head.sha }}` for the **Pull Request Job** instead of `${{ github.sha }}`, which is often mistakenly used.
`${{ github.sha }}` defaults to the latest commit SHA on the branch and **will not work correctly for pull requests**.
```Bash theme={null}
- name: Trigger CI
  run: |
    set -ex
    datafold ci trigger --ci-config-id \
      --pr-num ${PR_NUM} \
      --base-branch ${BASE_BRANCH} \
      --base-sha ${BASE_SHA} \
      --pr-branch ${PR_BRANCH} \
      --pr-sha ${PR_SHA}
  env:
    DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
    DATAFOLD_HOST: ${{ secrets.DATAFOLD_HOST }}
    PR_NUM: ${{ github.event.number }}
    PR_BRANCH: ${{ github.event.pull_request.head.ref }}
    BASE_BRANCH: ${{ github.event.pull_request.base.ref }}
    PR_SHA: ${{ github.event.pull_request.head.sha }}
    BASE_SHA: ${{ github.event.pull_request.base.sha }}
```
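For teams that drive CI from a script rather than inline YAML, the same values can be read from the GitHub pull request event payload. The sketch below is one possible approach under stated assumptions: it builds the `datafold ci trigger` arguments from a payload dictionary, using the same fields as the step above. The function names and the config ID `7` are hypothetical; on GitHub Actions the payload file is typically available at the path in `GITHUB_EVENT_PATH`.

```python
import json


def load_event(path):
    """Load a GitHub event payload (JSON) from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def build_trigger_command(ci_config_id, event):
    """Assemble `datafold ci trigger` arguments from a pull_request event."""
    pull_request = event["pull_request"]
    return [
        "datafold", "ci", "trigger",
        "--ci-config-id", str(ci_config_id),
        "--pr-num", str(event["number"]),
        "--base-branch", pull_request["base"]["ref"],
        "--base-sha", pull_request["base"]["sha"],
        "--pr-branch", pull_request["head"]["ref"],
        # The PR head SHA, not the workflow's default github.sha.
        "--pr-sha", pull_request["head"]["sha"],
    ]


if __name__ == "__main__":
    # A trimmed, made-up payload with only the fields used above.
    sample_event = {
        "number": 42,
        "pull_request": {
            "head": {"ref": "feature/orders", "sha": "aaa111"},
            "base": {"ref": "main", "sha": "bbb222"},
        },
    }
    print(" ".join(build_trigger_command(7, sample_event)))
```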
### Running diffs before opening a PR
Some teams want to show Data Diff results in their tickets *before* creating a pull request. This speeds up code reviews as developers can QA code changes before requesting a PR review.
Check out how to automate this workflow [here](/faq/datafold-with-dbt#can-i-run-data-diffs-before-opening-a-pr).
# Compliance & Trust Center
Source: https://docs.datafold.com/security/compilance-trust-center
# MCP Tool Permissions
Source: https://docs.datafold.com/security/mcp-tool-permissions
Which permissions each MCP tool requires. Use this reference when scoping a service account's group for MCP use.
Each MCP tool requires one or more permissions. Service accounts inherit their group's permissions, so to use a tool the group must include all of its required permissions. Tools that require permissions the service account lacks are automatically hidden from the MCP client.
See [Custom groups](/security/user-roles-and-permissions#custom-groups) for how to set up a group, and [Service accounts](/security/service-accounts) for how to issue an API key.
## Minimum permissions to enable every MCP tool
To give a service account access to every MCP tool, create a custom group with the permissions below and assign your service account to it.
* **Cancel diffs** (`cancel_datadiff`)
* **Create diffs** (`create_datadiff`)
* **Edit monitors** (`edit_alert`)
* **List data sources** (`list_data_sources`)
* **List users** (`list_users`)
* **View diffs** (`view_datadiff`)
* **View knowledge graph** (`view_knowledge_graph`)
* **View monitors** (`view_monitor`)
## Tools by category
### Organization
| Tool | What it does | Required permissions | Requires feature |
| ------------------ | ---------------------------------------------------------------------- | ------------------------- | ---------------- |
| `list_org_members` | Retrieves all active members of the authenticated user's organization. | List users (`list_users`) | — |
### Data Sources
| Tool | What it does | Required permissions | Requires feature |
| -------------------- | ------------------------------------------------------------------------------- | --------------------------------------- | ---------------- |
| `get_dataset_schema` | Get the column schema for a table on a data source. | List data sources (`list_data_sources`) | — |
| `list_data_sources` | Retrieves all data sources accessible to the authenticated user. | List data sources (`list_data_sources`) | — |
| `run_query` | Executes a SQL query against the specified data source and returns the results. | List data sources (`list_data_sources`) | — |
| `search_tables` | Search for tables on a data source by name. | List data sources (`list_data_sources`) | — |
### Data Diffs
| Tool | What it does | Required permissions | Requires feature |
| ----------------------------- | ------------------------------------------------------------------------------------------- | -------------------------------- | ---------------- |
| `cancel_datadiff` | Cancel a running or queued data diff. | Cancel diffs (`cancel_datadiff`) | Data Diffs |
| `create_datadiff` | Launches a new data diff to compare two datasets (tables or queries). | Create diffs (`create_datadiff`) | Data Diffs |
| `get_datadiff_overview` | Retrieves a structured overview of a data diff, mirroring the UI's Overview tab. | View diffs (`view_datadiff`) | Data Diffs |
| `get_datadiff_result_section` | Retrieves detailed results for a specific section of a data diff, corresponding to UI tabs. | View diffs (`view_datadiff`) | Data Diffs |
| `list_datadiffs` | Lists existing data diffs for the organization, ordered by creation date (newest first). | View diffs (`view_datadiff`) | Data Diffs |
### Monitors
| Tool | What it does | Required permissions | Requires feature |
| ------------------------- | ------------------------------------------------------------------------------------- | ------------------------------ | ---------------- |
| `get_monitor` | Get full details for a monitor by ID. | View monitors (`view_monitor`) | Monitors |
| `get_monitor_as_code` | Export a monitor's complete configuration as YAML (monitors-as-code format). | View monitors (`view_monitor`) | Monitors |
| `get_monitor_run_results` | Get recent run history for a monitor, ordered by most recent first. | View monitors (`view_monitor`) | Monitors |
| `get_monitors_schema` | Returns the JSON Schema for the monitors-as-code YAML config format. | View monitors (`view_monitor`) | Monitors |
| `list_monitors` | Lists monitors for the organization, ordered by creation date (newest first). | View monitors (`view_monitor`) | Monitors |
| `provision_monitors` | Create, update, or delete monitors from a declarative YAML config (monitors-as-code). | Edit monitors (`edit_alert`) | Monitors |
| `trigger_monitor_run` | Manually trigger a monitor check. The check runs asynchronously. | Edit monitors (`edit_alert`) | Monitors |
### Knowledge Graph
| Tool | What it does | Required permissions | Requires feature |
| ------------------------------------ | ------------------------------------------------------------------- | --------------------------------------------- | ---------------- |
| `knowledge_graph_expand` | BFS expansion from a seed node in the knowledge graph. | View knowledge graph (`view_knowledge_graph`) | Knowledge Graph |
| `knowledge_graph_get_schema_details` | Return full definitions for one or more items in a schema category. | View knowledge graph (`view_knowledge_graph`) | Knowledge Graph |
| `knowledge_graph_run_query` | Execute a named query template against the knowledge graph. | View knowledge graph (`view_knowledge_graph`) | Knowledge Graph |
### Feedback
| Tool | What it does | Required permissions | Requires feature |
| ----------------- | ------------------------------------------------------------------------------------------------ | -------------------- | ---------------- |
| `submit_feedback` | Submit feedback to the Datafold team — report bugs, request features, or share general thoughts. | — | — |
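To preview which tools a given group would expose, the tables above can be expressed as a lookup. The Python sketch below is illustrative only: it copies the tool-to-permission mapping from this page, ignores the "Requires feature" column, and assumes a tool is visible exactly when all of its required permissions are granted. The function names are hypothetical.

```python
# Tool -> required permissions, transcribed from the tables on this page.
TOOL_PERMISSIONS = {
    "list_org_members": {"list_users"},
    "get_dataset_schema": {"list_data_sources"},
    "list_data_sources": {"list_data_sources"},
    "run_query": {"list_data_sources"},
    "search_tables": {"list_data_sources"},
    "cancel_datadiff": {"cancel_datadiff"},
    "create_datadiff": {"create_datadiff"},
    "get_datadiff_overview": {"view_datadiff"},
    "get_datadiff_result_section": {"view_datadiff"},
    "list_datadiffs": {"view_datadiff"},
    "get_monitor": {"view_monitor"},
    "get_monitor_as_code": {"view_monitor"},
    "get_monitor_run_results": {"view_monitor"},
    "get_monitors_schema": {"view_monitor"},
    "list_monitors": {"view_monitor"},
    "provision_monitors": {"edit_alert"},
    "trigger_monitor_run": {"edit_alert"},
    "knowledge_graph_expand": {"view_knowledge_graph"},
    "knowledge_graph_get_schema_details": {"view_knowledge_graph"},
    "knowledge_graph_run_query": {"view_knowledge_graph"},
    "submit_feedback": set(),  # no permission listed
}


def visible_tools(group_permissions):
    """Return the tools whose required permissions are all granted."""
    granted = set(group_permissions)
    return sorted(
        tool for tool, required in TOOL_PERMISSIONS.items() if required <= granted
    )


def minimum_permissions():
    """Return the union of permissions needed to enable every tool."""
    return set().union(*TOOL_PERMISSIONS.values())


if __name__ == "__main__":
    print(visible_tools({"view_datadiff"}))
    print(sorted(minimum_permissions()))
```

The union computed by `minimum_permissions()` matches the eight permissions listed under "Minimum permissions to enable every MCP tool".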
# Securing Connections
Source: https://docs.datafold.com/security/securing-connections
Datafold supports multiple options to secure connections between your resources (e.g., databases and BI tools) and Datafold.
## Encryption
When you connect to Datafold to query your data in a database (e.g., BigQuery), communications are secured using HTTPS encryption.
## IP whitelisting
If access to your data connection is restricted to IP addresses on an allowlist, you will need to manually add Datafold's addresses in order to use our product. Otherwise, you will receive a connection error when setting up your data connection.
For SaaS (app.datafold.com) deployments, whitelist the following IP addresses:
* `23.23.71.47`
* `35.166.223.86`
* `52.11.132.23`
* `54.71.177.163`
* `54.185.25.103`
* `54.210.34.216`
Note that at any given time, you will only see one of these addresses in use. However, the active IP address can change, so you should add them all to your IP whitelist to ensure no interruptions in service.
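Before testing a data connection, you may want to confirm that your allowlist covers every address above. The sketch below is a minimal check using Python's standard `ipaddress` module; the allowlist entries in the example are made up, and the function name `missing_from_allowlist` is hypothetical.

```python
import ipaddress

# SaaS (app.datafold.com) addresses listed on this page.
DATAFOLD_SAAS_IPS = [
    "23.23.71.47",
    "35.166.223.86",
    "52.11.132.23",
    "54.71.177.163",
    "54.185.25.103",
    "54.210.34.216",
]


def missing_from_allowlist(allowlist_entries):
    """Return the Datafold IPs not covered by any allowlist entry.

    Entries may be single addresses or CIDR ranges.
    """
    networks = [ipaddress.ip_network(entry, strict=False) for entry in allowlist_entries]
    return [
        ip
        for ip in DATAFOLD_SAAS_IPS
        if not any(ipaddress.ip_address(ip) in network for network in networks)
    ]


if __name__ == "__main__":
    # Example allowlist for illustration only.
    print(missing_from_allowlist(["23.23.71.47/32", "54.0.0.0/8"]))
```

An empty result means all six addresses are covered.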
## Private Link
### AWS PrivateLink
AWS PrivateLink allows you to connect Datafold to your databases without exposing data to the internet. This option is available for both Datafold SaaS Cloud and all Datafold Dedicated Cloud options.
The following diagram shows the architecture for a customer with a High Availability RDS setup:
### Setup
**Supported databases**
The following setup assumes you have an RDS/Aurora database you want to connect to. Datafold also supports PrivateLink connections to other databases such as Snowflake, which should only be accessed from your VPC. Please contact [support@datafold.com](mailto:support@datafold.com) to get assistance with connecting to your specific database.
Our support team will send you the following:
* The role ARN to establish the PrivateLink connection.
* Datafold SaaS Cloud VPC CIDR range.
You need to do the following steps:
1. Send us the region(s) where your database(s) are located.
2. Create a VPC Endpoint Service and NLB.
   * The core concepts of this setup are described in this AWS blog: [Access Amazon RDS across VPCs using AWS PrivateLink and Network Load Balancer](https://aws.amazon.com/blogs/database/access-amazon-rds-across-vpcs-using-aws-privatelink-and-network-load-balancer/).
   * If your databases are HA, please implement the failover mechanics described in the blog.
   * A CloudFormation template for inspiration can be found [here](https://github.com/aws-samples/amazon-rds-crossaccount-access/blob/main/CrossAccountRDSAccess.yml).
   * You'll need to create a Network Load Balancer that points to your database and a VPC Endpoint Service that exposes the NLB.
   * Configure security groups to allow traffic from Datafold's VPC to your database.
   * If your databases are HA (High Availability), implement automatic failover mechanics to ensure the NLB routes to the active database instance.
   * For detailed step-by-step instructions, see our [**AWS PrivateLink Setup Guide**](/security/aws_privatelink_setup).
3. Add the provided role ARN as 'Allowed Principal' on the VPC Endpoint Service.
4. Allow ingress from the Datafold SaaS Cloud VPC.
5. Send us the:
   * Service name(s), e.g. `com.amazonaws.vpce.us-west-2.vpce-svc-0cfd2f258c4395ad6`.
   * Availability Zone ID(s) used in the VPCE Service(s), e.g. `use1-az6` or `usw2-az3`.
   * RDS/Aurora hostname(s), e.g. `datafold.c2zezoge6btk.us-west-2.rds.amazonaws.com`.
At the end, the database hostname used to configure the data source will be the original RDS/Aurora hostname. But with private DNS resolution, we will resolve the hostname to the VPC Endpoint. Our support team will let you know when everything is set up and you can accept the PrivateLink connection and start configuring the data source.
**Detailed Instructions**
For comprehensive step-by-step instructions including security group configuration, target group setup, Lambda-based automatic failover for HA setups, and troubleshooting, see our [**AWS PrivateLink Setup Guide**](/security/aws_privatelink_setup).
### Cross-Region PrivateLink
Datafold SaaS Cloud supports cross-region PrivateLink for all North American regions. Datafold SaaS Cloud is located in `us-west-2`. Datafold manages the cross-region networking, allowing you to connect to a VPC Endpoint in the same region as your VPC Endpoint Service. For Datafold Dedicated Cloud customers, deployment occurs in your chosen region. If you need to connect to databases in multiple regions, Datafold also supports this through cross-region PrivateLink.
The setup will be similar to the regular PrivateLink setup.
### Private Service Connect
Google Cloud's Private Service Connect is only available if both parties are in the same cloud region. This option is only available for Datafold Dedicated Cloud customers. The diagram below illustrates how the solution works:
The basics of Private Service Connect are available [here](https://cloud.google.com/vpc/docs/private-service-connect).
### Azure Private Link
Azure Private Link is only available if both parties are in the same cloud region. This option is only available for Datafold Dedicated Cloud customers. The diagram below illustrates how the solution works:
The basics of Private Link are available [here](https://learn.microsoft.com/en-us/azure/private-link/private-link-overview).
For Customer-Hosted Dedicated Cloud, achieving cross-tenant access requires using Private Link. The documentation can be accessed [here](https://learn.microsoft.com/en-us/azure/architecture/guide/networking/cross-tenant-secure-access-private-endpoints).
## VPC Peering (SaaS)
VPC Peering is easier to set up than Private Link, but a drawback is that both networks are joined and the IP ranges must not overlap. For Datafold SaaS Cloud, this setup is an AWS-only option.
The basics of VPC peering are covered [here](https://docs.aws.amazon.com/vpc/latest/peering/vpc-peering-basics.html).
To set up VPC peering, please contact [support@datafold.com](mailto:support@datafold.com) and provide us with the following information:
* AWS region where your database is hosted.
* ID of the VPC that you would like to connect.
* CIDR of the VPC.
If there are no address collisions, we'll send you a peering request and CIDR that we use on our end, and whitelist the CIDR range for your organization. You'll need to set up routing to this CIDR through the peering connection.
If you activate DNS on your side of the peering connection, you can use the private DNS hostname to connect. Otherwise, you need to use the IP.
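Because peering requires non-overlapping IP ranges, it can help to check your VPC CIDR against a candidate range before contacting support. The sketch below uses Python's standard `ipaddress` module; the CIDR values are made-up examples, not Datafold's actual ranges, and the function name `cidrs_overlap` is hypothetical.

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True when the two CIDR ranges share any address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


if __name__ == "__main__":
    # Example ranges for illustration only.
    print(cidrs_overlap("10.0.0.0/16", "10.0.128.0/17"))  # overlapping ranges
    print(cidrs_overlap("10.0.0.0/16", "10.1.0.0/16"))    # disjoint ranges
```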
## VPC Peering (Dedicated Cloud)
VPC Peering is a supported option for all cloud providers, both for Datafold-hosted and customer-hosted deployments. Basic information for each cloud provider can be found here:
* [AWS](https://docs.aws.amazon.com/vpc/latest/peering/vpc-peering-basics.html)
* [GCP](https://cloud.google.com/vpc/docs/vpc-peering)
* [Azure](https://learn.microsoft.com/en-us/azure/virtual-network/create-peering-different-subscriptions?tabs=create-peering-portal)
**VPC vs VNet**
We use the term VPC across all major cloud providers. However, Azure calls this concept a Virtual Network (VNet).
## SSH Tunnel
To set up a tunnel, please contact our team at [support@datafold.com](mailto:support@datafold.com) and provide the following information:
* Hostname of your bastion host and port number used for SSH service.
* Hostname and port number of your database.
* SSH fingerprint of the bastion host (optional).
We'll get back to you with:
* SSH public key that you need to add to `~/.ssh/authorized_keys` on the bastion host.
* IP address and port to use for data connection configuration in the Datafold application.
## IPSec tunnel
Please contact our team at [support@datafold.com](mailto:support@datafold.com) for more information.
# Service Accounts
Source: https://docs.datafold.com/security/service-accounts
Machine identities for CI, integrations, and scripts. Service accounts own their own API keys, inherit permissions from groups, and are managed independently of human users.
Service accounts are organization-managed machine identities. They own API keys for automation — CI pipelines, dbt jobs, scripts, third-party integrations — without tying those keys to a human user who might leave the team.
## How service accounts differ from user accounts
| | User account | Service account |
| ------------------------------------------------ | --------------------- | --------------------------------- |
| Web UI login | Yes (password or SSO) | No — API keys only |
| Email | Real mailbox | Non-deliverable synthetic address |
| Invitation flow | Email invite required | Created directly by an admin |
| Permissions | Via groups or role | Via groups (required) |
| API keys per account | Up to 5 | Up to 50 |
| Visible in user lists, groups, and subscriptions | Yes | Hidden |
Because service accounts cannot log in interactively, they are invisible in the user directory and cannot be added to Slack/email subscription targets — they only exist to hold API keys.
## Create a service account
Only organization admins can manage service accounts.
1. Open the Datafold app and navigate to **Settings → Service Accounts**.
2. Click **Create Service Account**.
3. Fill in:
   * **Name**: a short identifier (e.g., `ci-bot`, `dbt-cloud-prod`). Shown in audit logs and the Service Accounts list.
   * **Description** *(optional)*: a free-form note to document what the account is used for.
   * **Groups**: at least one permission group is required. The service account inherits all permissions from the groups you select.
4. Click **Create**. The account appears in the list immediately. No API keys are issued yet.
A service account with no groups would have no permissions at all, so the form requires at least one. If you need a different permission set, create a new group under **Settings → Groups** first.
## Issue API keys
From the Service Accounts list, click **API Keys** on the relevant row. A service account can have up to **50 active API keys**, which is intentionally higher than the per-user limit so you can issue one key per CI environment, region, or deployment without collapsing them together.
Each key can be given:
* A **name** for identification.
* An optional **description** for your own reference.
* An optional **expiration** (in days). Leave unset for a non-expiring key.
The key value is shown **only once** at creation — copy it into your secret store immediately.
## Using a service account API key
Service account keys are used exactly the same way as personal API keys. Include the key in the `Authorization` header:
```bash theme={null}
curl https://app.datafold.com/api/v1/... -H "Authorization: Key {API_KEY}"
```
See the [API Introduction](/api-reference/introduction) for the full authentication reference.
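The same header can be set from a script. The Python sketch below builds (but does not send) a request using only the standard library. It assumes the SaaS host shown in the curl example and uses the audit logs endpoint (`GET /api/v1/audit_logs`) purely as an example path; the function name `build_request` and the `DATAFOLD_API_KEY` environment variable lookup are illustrative choices.

```python
import os
import urllib.request


def build_request(api_key, path="/api/v1/audit_logs", host="https://app.datafold.com"):
    """Build an authenticated GET request for the Datafold API."""
    return urllib.request.Request(
        host + path,
        headers={"Authorization": f"Key {api_key}"},
        method="GET",
    )


if __name__ == "__main__":
    # Falls back to a placeholder so the example runs without a real key.
    request = build_request(os.environ.get("DATAFOLD_API_KEY", "example-key"))
    print(request.full_url)
    # To send it: urllib.request.urlopen(request)
```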
## Lifecycle
| Action | Effect |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Disable** | All of the account's API keys immediately stop working. The account and its keys remain in place and can be re-enabled. Use this to pause an integration without losing the key set. |
| **Enable** | Restores a disabled service account. Existing keys resume working immediately. |
| **Edit** | Change the name, description, or permission groups. Permission changes take effect on the next authenticated request. |
| **Delete** | Permanently revokes all of the account's API keys and removes the account. This cannot be undone. |
Individual API keys can be revoked from the **API Keys** modal without affecting the account itself — useful when rotating a single leaked key.
## Permissions
A service account's permissions are the union of the groups assigned to it. To change what a service account can access:
* Change the account's group assignments under **Settings → Service Accounts** (admin-only).
* Or change what a group can access under **Settings → Groups** — every account (human or service) in that group picks up the change.
Service accounts ignore the `admin`, `default`, and `viewonly` role flags. Grant admin-level access only by putting the account in a group that has those permissions.
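The "union of groups" rule can be sketched in a few lines. The example below is illustrative: the group names are hypothetical, and the permission identifiers are borrowed from the MCP Tool Permissions page.

```python
def effective_permissions(assigned_groups, group_permissions):
    """Return the union of permissions across the assigned groups."""
    permissions = set()
    for group in assigned_groups:
        permissions |= group_permissions[group]
    return permissions


if __name__ == "__main__":
    # Hypothetical groups for illustration only.
    groups = {
        "ci-readers": {"view_datadiff", "list_data_sources"},
        "diff-runners": {"create_datadiff", "view_datadiff"},
    }
    print(sorted(effective_permissions(["ci-readers", "diff-runners"], groups)))
```

Removing a group from the account, or a permission from a group, shrinks the resulting set accordingly.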
## Best practices
* **One service account per automation surface.** A separate account for CI, for your ETL scheduler, and for each third-party integration makes it obvious in audit logs who did what, and lets you rotate or disable one surface without affecting the others.
* **Scope via groups.** Give a service account only the groups it needs — a CI-only account does not need access to production monitors.
* **Name keys after where they run.** `ci-bot` with keys named `github-actions-prod`, `github-actions-staging`, etc. makes rotation obvious.
* **Rotate keys on offboarding-equivalent events.** When a CI runner is decommissioned, a deployment region is retired, or a secret may have been exposed, revoke the specific key rather than the whole account.
* **Disable, don't delete, for temporary pauses.** Deletion is permanent; disable is reversible.
# Single Sign-On
Source: https://docs.datafold.com/security/single-sign-on
Set up Single Sign-On with one of the following options.
**Tip**
You can force all users to use the configured SSO provider by unchecking the *Allow non-admin users to login with email and password* checkbox under the organization settings.
Admin users will still be able to log in using email and password.
**Caution**
Ensure only authorized users keep using Datafold by setting up Okta webhooks or, if you're using Microsoft Entra ID (formerly known as Azure Active Directory), by setting up credentials for the Microsoft Entra app.
This will disable non-admin users that don't have access to the configured SSO app.
[Configure this for Okta](/security/single-sign-on/okta#synchronize-state-with-datafold-optional)
[Configure this for Microsoft Entra ID](/security/single-sign-on/saml/examples/microsoft-entra-id-configuration#synchronize-user-with-datafold-optional)
# Google OAuth
Source: https://docs.datafold.com/security/single-sign-on/google-oauth
Configure Google OAuth single sign-on (SSO) for Datafold. Step-by-step setup instructions for authenticating your team with Google.
**NOTE**
Google SSO is available for both SaaS and VPC installations of Datafold.
## Datafold SaaS
For Datafold SaaS, setup only involves enabling the Google SSO integration.
If Google SSO is already enabled for your organization, you will see it under **Settings** → **Integrations** → **SSO**.
If this is not the case, create a new Google SSO Integration by clicking on the **Add new integration** button.
Enable the **Allow Google logins in organization** switch and click **Save**. That's it!
If you are not using Datafold SaaS, please see below.
## Create OAuth Client ID
To begin, navigate to the [Google admin console](https://console.cloud.google.com/apis/credentials?authuser=1%5C\&folder=%5C) for your organization, click **Create Credentials**, and select **OAuth Client ID**.
**TIP**
To configure OAuth, you may need to first configure your consent screen. We recommend selecting **Internal** to keep access limited to users in your Google workspace and organization.
### Configure OAuth
* **Application type**: "Web application"
* **Authorized JavaScript origins**: `https://`
* **Authorized redirect URIs**: `https:///oauth/google`
Finally, click **Create**. You will see a set of credentials that you will copy over to your Datafold Global Settings.
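The two URL fields above follow a fixed pattern around the hostname of your Datafold installation. As a minimal sketch, the helper below derives both values from a hostname. The function name and the `datafold_host` parameter are hypothetical and only illustrate the pattern; substitute the hostname of your own installation.

```python
def google_oauth_settings(datafold_host: str) -> dict[str, str]:
    """Return the values for the Google OAuth client form.

    `datafold_host` is an assumed placeholder: the bare hostname of your
    Datafold installation, without scheme or path.
    """
    host = datafold_host.strip().rstrip("/")
    if not host or "/" in host:
        raise ValueError("expected a bare hostname such as 'datafold.example.com'")
    origin = f"https://{host}"
    return {
        # Goes into "Authorized JavaScript origins"
        "authorized_javascript_origin": origin,
        # Goes into "Authorized redirect URIs"
        "authorized_redirect_uri": f"{origin}/oauth/google",
    }


if __name__ == "__main__":
    for field, value in google_oauth_settings("datafold.example.com").items():
        print(f"{field}: {value}")
```

Rejecting values that contain a scheme or path avoids the common mistake of producing a doubled `https://` prefix in the redirect URI.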
## Configure Google OAuth in Datafold
To finish the configuration, create a Google SSO integration in Datafold by navigating to **Settings** → **Integrations** → **SSO** → **Add new integration** → **Google**.
* Enable the **Google OAuth** switch.
* Enter the **domain** or URL of your OAuth Client ID in the respective field.
* Paste the **Client Secret** in the respective field.
* Enable the **Allow Google logins in Organization** switch.
* Finally, click **Save**.
# Okta (OIDC)
Source: https://docs.datafold.com/security/single-sign-on/okta
Configure Okta OIDC single sign-on (SSO) for Datafold. Step-by-step setup instructions for authenticating your team with Okta.
**NOTE**
Okta SSO is available for both SaaS and dedicated cloud installations of Datafold.
## Create Okta App Integration
**INFO**
Creating an App Integration in Okta may require admin privileges.
Start by creating a web app integration in Okta.
Log in to the Okta interface, navigate to **Applications**, and click **Create App Integration**.
Then, in the configuration form, select **OpenID Connect (OIDC)** as the sign-in method and **Web Application** as the Application Type.
In the following section, you will set:
* **App integration name**: A name to identify the integration. We suggest you use `Datafold`.
* **Grant type**: Should be set to `Authorization code` automatically.
* **Sign-in redirect URI**:
**Datafold SaaS:** The redirect URL should be `https://app.datafold.com/oauth/okta/client_id`, where `client_id` is the Client ID of the configuration.
**CAUTION**
You will be given the Client ID after saving the integration and need to come back to update the client ID afterwards.
**Dedicated cloud installations:** The redirect URL should be `https://your-dns-name/oauth/okta`, replacing `your-dns-name` with the DNS name for your installation.
* **Sign-out redirect URIs**: Leave this empty.
* **Trusted Origins**: Leave this empty too.
* **Assignments**: Select `Skip group assignment for now`. Later you should assign the correct groups and users.
* Click "Save" to create the app integration in Okta.
Once the save is successful, the next screen presents the Client ID and Client Secret. You need the Client ID to update the sign-in redirect URI, and you will enter both values in the Datafold integration later.
* Edit "General settings"
* Scroll down to the **Login** section
* Update the **Sign-in redirect URI**. See above for details.
* Click "Save" to persist the changes.
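The sign-in redirect URI depends on the kind of installation. The sketch below builds it for both cases, assuming the `app.datafold.com` form (which embeds the Client ID) applies to Datafold SaaS and the `your-dns-name` form applies to dedicated installations. The function and parameter names are hypothetical.

```python
from typing import Optional

SAAS_HOST = "app.datafold.com"


def okta_sign_in_redirect_uri(
    client_id: Optional[str] = None,
    dns_name: Optional[str] = None,
) -> str:
    """Build the Okta sign-in redirect URI described above.

    Pass `dns_name` for a dedicated installation, or `client_id` for SaaS.
    Both parameter names are illustrative.
    """
    if dns_name:
        # Dedicated installation: the Client ID is not part of the path.
        return f"https://{dns_name}/oauth/okta"
    if not client_id:
        raise ValueError("SaaS redirect URIs need the Okta Client ID")
    # SaaS: the Client ID is only known after the first save in Okta,
    # which is why the URI has to be updated afterwards.
    return f"https://{SAAS_HOST}/oauth/okta/{client_id}"


if __name__ == "__main__":
    print(okta_sign_in_redirect_uri(client_id="0oa1example"))
    print(okta_sign_in_redirect_uri(dns_name="datafold.example.com"))
```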
## Set Up Okta-initiated login
**TIP**
Organization admins will always be able to log in with either password or Okta. Non-admin users will be required to log in through Okta once configured.
This step is optional and should be done at the discretion of the Okta administrator.
Users in your organization can log in to the application directly from the Okta end-user dashboard. To enable this feature, configure the integration as follows:
1. Edit "General settings"
2. Set **Login initiated by** to `Either Okta or App`.
3. Set **Application visibility** to `Display application icon to users`.
4. Set **Login flow** to `Redirect to app to initiate login (OIDC Compliant)`.
5. Set **Initiate login URI**:
   * For Datafold SaaS: `https://app.datafold.com/login/sso/client-id?action=desired_action`
     * Replace `client-id` with the Client ID of the configuration, and
     * Replace `desired_action` with `signup` if you enabled users auto-creation, or `login` otherwise.
   * For dedicated cloud installations: `https://your-dns-name/login/sso/client-id?action=desired_action`
     * Replace `client-id` with the Client ID of the configuration, and
     * Replace `desired_action` with `signup` if you enabled users auto-creation, or `login` otherwise.
     * Replace `your-dns-name` with the DNS name for your installation.
6. Click "Save" to persist the changes.
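The initiate login URI is assembled from three parts: the host, the Client ID, and the `action` query parameter. A minimal sketch of that assembly follows; the function and parameter names are hypothetical, and the default host assumes Datafold SaaS.

```python
from urllib.parse import urlencode


def okta_initiate_login_uri(
    client_id: str,
    auto_create_users: bool,
    dns_name: str = "app.datafold.com",
) -> str:
    """Build the Okta 'Initiate login URI' described in the steps above.

    `dns_name` defaults to the SaaS host; pass your own DNS name for a
    dedicated installation.
    """
    # `signup` lets Datafold auto-create users; `login` requires existing users.
    action = "signup" if auto_create_users else "login"
    query = urlencode({"action": action})
    return f"https://{dns_name}/login/sso/{client_id}?{query}"


if __name__ == "__main__":
    print(okta_initiate_login_uri("0oa1example", auto_create_users=True))
```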
The Okta configuration is now complete.
## Configure Okta in Datafold
To finish the configuration, create an Okta integration in Datafold by navigating to **Settings** → **Integrations** → **SSO** → **Add new integration** → **Okta**.
* Paste in your Okta **Client Id** and **Client Secret**.
* The **Metadata Url** of the Okta OAuth server is `https://okta-server-name/.well-known/openid-configuration`. Replace `okta-server-name` with the name of your Okta domain.
* If you'd like to auto-create users in Datafold that are authorized in Okta, enable the **Allow Okta to auto-create users in Organization** switch.
* Finally, click **Save**.
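Before pasting the Metadata Url, it can help to check that it points at a valid OpenID Connect discovery document. The sketch below builds the URL and checks a discovery document (already loaded as a dictionary) for a few endpoints defined by the OpenID Connect Discovery specification. The function names are hypothetical, the sample document is invented for illustration, and this is not a description of how Datafold validates the URL.

```python
REQUIRED_KEYS = ("issuer", "authorization_endpoint", "token_endpoint", "jwks_uri")


def okta_metadata_url(okta_domain: str) -> str:
    """Build the discovery URL for an Okta domain such as 'example.okta.com'."""
    domain = okta_domain.strip().removeprefix("https://").rstrip("/")
    return f"https://{domain}/.well-known/openid-configuration"


def missing_discovery_keys(document: dict) -> list[str]:
    """Return the discovery keys that are absent from the document."""
    return [key for key in REQUIRED_KEYS if key not in document]


if __name__ == "__main__":
    print(okta_metadata_url("example.okta.com"))
    # Invented sample document; a real one is fetched from the URL above.
    sample = {
        "issuer": "https://example.okta.com",
        "authorization_endpoint": "https://example.okta.com/oauth2/v1/authorize",
        "token_endpoint": "https://example.okta.com/oauth2/v1/token",
    }
    print("missing:", missing_discovery_keys(sample))
```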
**TIP**
Users can either be explicitly invited to Datafold by an admin, using the same email address as in Okta, or they can be auto-created. When the `signup` action is set in the login URI, users who are authenticated by Okta and assigned to the Datafold application in Okta will be able to log in. If such a user has not yet been invited, Datafold automatically creates a user for them, since they are already authenticated by your domain's Okta server. The user then receives an email to confirm their email address.
## Synchronize state with Datafold \[Optional]
This step is essential if you want to ensure that users from your organization are automatically logged out when they are unassigned or deactivated in Okta.
1. Navigate to **Okta Admin panel** → **Workflow** → **Event Hooks**
2. Click **Create Event Hook**
3. Set **Name** to `Datafold`
4. Set **URL** to `https://app.datafold.com/hooks/oauth/okta/`
5. Set **Authentication field** to `secret`
6. Go to Datafold and generate a secret token in **Settings** → **Integrations** → **SSO** → **Okta**. Click the **Generate** button, copy the token using the **Copy** button, and click **Save**. Paste the copied token into the **Authentication secret** field in Okta.
**CAUTION**
Keep this secret token safe, as you won't be able to see it again after saving your integration.
7. In **Subscribe to events** add events: `User suspended`, `User deactivated`, `Deactivate application`, `User unassigned from app`
8. Click **Save & Continue**
9. On **Verify Endpoint Ownership** click **Verify**
* If the verification is successful, you have completed the setup.
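For background on what the **Verify** step checks: according to Okta's event hook documentation, Okta sends a one-time request carrying an `x-okta-verification-challenge` header and expects the endpoint to echo that value back in a JSON body under the key `verification`. Datafold's endpoint handles this for you; the sketch below only illustrates the exchange, and the function name is hypothetical.

```python
def okta_verification_response(headers: dict[str, str]) -> dict[str, str]:
    """Return the JSON body an event hook endpoint sends back to Okta.

    Header names are matched case-insensitively, as HTTP requires.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    challenge = normalized.get("x-okta-verification-challenge")
    if challenge is None:
        raise KeyError("missing x-okta-verification-challenge header")
    # Echo the challenge back so Okta can confirm endpoint ownership.
    return {"verification": challenge}


if __name__ == "__main__":
    print(okta_verification_response({"X-Okta-Verification-Challenge": "abc123"}))
```

If verification fails, the usual causes are a mistyped URL or an authentication secret that does not match the one generated in Datafold.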
## Testing the Okta integration
For Datafold SaaS:
* Visit [https://app.datafold.com](https://app.datafold.com)
* Type in your email and wait up to five seconds.
* The Okta button should switch from disabled to enabled.
* Click the Okta login button.
* The browser should be redirected to your Okta domain, authenticate the user there and be redirected back to the Datafold application.

For dedicated cloud installations:
* Visit `https://your-dns-name`, replacing `your-dns-name` with the domain name of your installation.
* Type in your email and wait up to five seconds.
* The Okta button should switch from disabled to enabled.
* Click the Okta login button.
* The browser should be redirected to your Okta domain, authenticate the user there and be redirected back to the Datafold application.
If this didn't work, pay close attention to any error messages, or contact `support@datafold.com`.
# SAML
Source: https://docs.datafold.com/security/single-sign-on/saml
SAML (Security Assertion Markup Language) is a protocol that enables secure user authentication by integrating Identity Providers (IdPs) with Service Providers (SPs).
**NOTE**
SAML SSO is available for both SaaS and VPC installations of Datafold.
In this case, Datafold is the service provider. The Identity Providers can be anything used by the organization (e.g., Google, Okta, Duo).
We also support SAML SSO [group provisioning](/security/single-sign-on/saml/group-provisioning).
## Generic SAML Identity Providers
**TIP**
We also provide SAML identity provider configuration examples for [Okta](/security/single-sign-on/saml/examples/okta), [Microsoft Entra ID](/security/single-sign-on/saml/examples/microsoft-entra-id-configuration), and [Google](/security/single-sign-on/saml/examples/google).
To configure a SAML provider:
1. Go to `Datafold`. Create a new integration by navigating to **Settings** → **Integrations** → **SSO** → **Add new integration** → **SAML**.
2. Go to the organization's `Identity Provider` and create a **SAML application** (sometimes called a **single sign-on** or **SSO** method).
If you have the option, enable the SAML Response signature and set it to **whole-response signing**.
3. Copy and paste the Service Provider URLs from the `Datafold` SAML Integration into the `Identity Provider`'s application setup. The only two mandatory fields are the **Service Provider Entity ID** and the **Service Provider ACS URL**.
After creation, the `Identity Provider` will show you the metadata XML. It may be presented as raw XML, a URL to the XML, or an XML file to download.
**INFO**
The Identity Providers sometimes provide additional parameters, such as SSO URLs, ACS URLs, SLO URLs, etc. We gather this information from the XML directly so these can be safely ignored.
4. Paste either the **metadata XML** *or* **metadata URL** from your `Identity Provider` into the respective `Datafold` SAML integration fields.
5. Finally, click the **Save** button to create the integration.
After creation, the SAML login button will be available for Datafold users in your organization.
6. In your `Identity Provider`, activate the SAML application for all users or for select groups.
**CAUTION**
Only users configured in your identity provider will be able to log in to Datafold *using* SAML SSO.
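The steps above note that SSO URLs and similar parameters are read from the metadata XML itself. To see what the metadata contains before pasting it, a short script can list the entity ID and sign-on endpoints. This is a minimal sketch using the standard SAML 2.0 metadata namespace; the sample XML and the function name are invented for illustration, and it does not reflect how Datafold parses the document internally.

```python
import xml.etree.ElementTree as ET

# Standard SAML 2.0 metadata namespace.
MD_NS = {"md": "urn:oasis:names:tc:SAML:2.0:metadata"}

# Invented sample; real metadata comes from your Identity Provider.
SAMPLE_METADATA = """\
<md:EntityDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata"
                     entityID="https://idp.example.com/metadata">
  <md:IDPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <md:SingleSignOnService
        Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect"
        Location="https://idp.example.com/sso"/>
  </md:IDPSSODescriptor>
</md:EntityDescriptor>
"""


def summarize_idp_metadata(xml_text: str) -> dict:
    """Extract the IdP entity ID and its sign-on URLs keyed by binding."""
    root = ET.fromstring(xml_text)
    services = root.findall("md:IDPSSODescriptor/md:SingleSignOnService", MD_NS)
    return {
        "entity_id": root.get("entityID"),
        "sso_urls": {svc.get("Binding"): svc.get("Location") for svc in services},
    }


if __name__ == "__main__":
    print(summarize_idp_metadata(SAMPLE_METADATA))
```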
### Auto-create users in Datafold
Go to `Datafold` and navigate to **Settings** → **Integrations** → **SSO** → **SAML**.
Enable the **Allow SAML to auto-create users in Organization** switch and save the integration.
If the **Allow SAML to auto-create users in Organization** switch from the SAML Integration in Datafold is enabled, identity provider-initiated logins will automatically create users in Datafold for authenticated users.
If the **Allow SAML to auto-create users in Organization** switch from the SAML Integration in Datafold is enabled, the SAML login button will always be enabled, and all authenticated users will be automatically created in Datafold.
# Google
Source: https://docs.datafold.com/security/single-sign-on/saml/examples/google
## Google as a SAML Identity Provider
Enable SAML in your Google Workspace. Check [Set up your own custom SAML app](https://support.google.com/a/answer/6087519?hl=en) for more details.
**CAUTION**
You need to be a **super-admin** in the Google Workspace to configure a SAML application.
* Go to `Google`, click on **Download Metadata** in the left sidebar and **copy** the XML.
* Select **Email** as the Name ID format.
* Select **Basic Information > Primary email** as the Name ID.
* Go to `Datafold` and create a new SSO integration. Navigate to **Settings** → **Integrations** → **Add new integration** → **SAML**.
* Copy the read-only field **Service Provider ACS URL**, go to `Google` and paste it into **ACS URL**.
* Copy the read-only field **Service Provider Entity ID**, go to `Google` and paste it into **Entity ID**.
* Paste the **copied** XML into `Datafold`'s **Identity Provider Metadata XML** field.
* Click **Save** to create the integration.
* (Optional step) Configure the attribute mapping as follows:
* **First Name** → `first_name`
* **Last Name** → `last_name`
# Microsoft Entra ID
Source: https://docs.datafold.com/security/single-sign-on/saml/examples/microsoft-entra-id-configuration
Configure Microsoft Entra ID (Azure AD) as a SAML identity provider for Datafold SSO. Step-by-step setup and configuration guide.
## Azure AD / Entra ID as a SAML Identity Provider
You can create an **Enterprise Application** and use that to configure access to Datafold. Click on **New application** and **Create your own application**.
**Copy** the **App Federation Metadata Url**.
Go to `Datafold` and create a new SSO integration. Navigate to **Settings** → **Integrations** → **Add new Integration** → **SAML**.
Paste the **copied** URL into **Identity Provider Metadata URL**.
Go to `Azure` and edit the **Basic SAML Configuration** in your Enterprise App.
Copy from Datafold the read-only field **Service Provider ACS URL** and paste it into **Reply URL**.
Copy from Datafold the read-only field **Service Provider Entity ID** and paste it into **Identifier**.
Go to `Datafold` and click **Save** to create the SAML integration.
Next, edit the **Attributes & Claims**. By default, the **Unique User Identifier** is already correctly set to `user.userprincipalname`. If you have multiple domains (e.g., `@datafold.com` and `@datafoldonmicrosoft.com`), please make sure this maps correctly to the email addresses of the users in Datafold.
(Optional step) Add two attributes: `first_name` and `last_name`.
Finally, edit the **SAML Certificates**. Set the signing option to **Sign SAML response and assertion**.
After making sure you have been added as a user to the Enterprise Application, log out from Datafold. Click on **Test** under **Test single sign-on with DatafoldSSO**.
## Synchronize user with Datafold \[Optional]
This step is essential if you want to ensure that users from your organization are disabled if they are no longer assigned to the configured Microsoft Entra App.
1. Navigate to App registrations → API permissions.
2. Add the following permissions: `Group.Read.All` and `User.ReadBasic.All`.
2.1 Click `Add a permission`.
2.2 Select Microsoft Graph.
2.3 Select application permissions and add the required permissions.
3. Grant admin consent.
4. You should now see a status indicator next to the permissions confirming that consent was granted.
5. Generate a secret so that Datafold can interact with the API.
5.1 Click `Certificates & secrets`.
5.2 Click `New client secret`.
5.3 Type in a description and click `Add`.
6. Go to `Datafold` and navigate to **Settings** → **Integrations** → **SSO** → **Add new Integration** and select the Microsoft Entra ID Logo.
7. Paste in the four required fields:
7.1 Tenant ID - [you can find this in the overview page](https://learn.microsoft.com/en-us/entra/fundamentals/how-to-find-tenant)
7.2 Navigate to the application overview
7.3 Copy Application ID and paste it into Client Id
7.4 Copy the secret we created in the previous steps and paste it into Client Secret
7.5 Navigate to the enterprise application and copy Object ID and paste it into Principal Id.
7.6 Click **Save** to create the integration.
If the update is successful, it means that the integration is valid. Users that do not have access to the configured application will be disabled and logged out in at most one hour.
# Okta
Source: https://docs.datafold.com/security/single-sign-on/saml/examples/okta
## Okta as a SAML Identity Provider
You can create an **Application** and use that to configure access to Datafold. Click on **Applications** and **Create App Integration**.
Select **SAML 2.0**
Enter "Datafold" in **App name** and click **Next**.
Go to `Datafold` and create a new SSO integration. Navigate to **Settings** → **Integrations** → **Add new Integration** → **SAML**.
* Copy the read-only field **Service Provider ACS URL** and paste it into **Single sign-on URL**.
* Copy the read-only field **Service Provider Entity ID** and paste it into **Audience URI (SP Entity ID)**.
(Optional step) In **Attribute Statements (optional)** add fields:
* Name: `first_name`, Value: `user.firstName`
* Name: `last_name`, Value: `user.lastName`
Click **Next** and **Finish**.
Go to `Okta` and copy the **Metadata URL** field from **Datafold** → **Sign On** → **Metadata details**.
Go back to `Datafold` and paste it into **Identity Provider Metadata URL** field.
Finally, click **Save** to create the integration.
Navigate to **Settings** → **Integrations** → **SSO** → **SAML**.
If everything is correct, the **Identity Provider Metadata XML** field will contain XML.
# Group provisioning
Source: https://docs.datafold.com/security/single-sign-on/saml/group-provisioning
Automatically sync group membership with your SAML Identity Provider (IdP).
## 1. Create desired groups in the IdP
## 2. Assign the desired users to groups
Assign the relevant users to groups reflecting their roles and permissions.
## 3. Configure the SAML SSO provider
Configure your SAML SSO provider to include a `groups` attribute. This attribute should list all the groups you want to sync.
```Bash theme={null}
datafold_admin
datafold_read_write
```
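For orientation, a multi-valued `groups` attribute in a SAML assertion typically carries one `AttributeValue` element per group. The sketch below parses such an attribute statement with the standard SAML 2.0 assertion namespace. The sample XML is an assumption about what your IdP emits (exact prefixes and extra attributes vary by provider), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard SAML 2.0 assertion namespace.
SAML_NS = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}

# Assumed shape of the attribute statement; your IdP's output may differ.
SAMPLE_ATTRIBUTE_STATEMENT = """\
<saml:AttributeStatement xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
  <saml:Attribute Name="groups">
    <saml:AttributeValue>datafold_admin</saml:AttributeValue>
    <saml:AttributeValue>datafold_read_write</saml:AttributeValue>
  </saml:Attribute>
</saml:AttributeStatement>
"""


def groups_from_statement(xml_text: str) -> list[str]:
    """Return the values of the `groups` attribute, in document order."""
    root = ET.fromstring(xml_text)
    values = root.findall(
        "saml:Attribute[@Name='groups']/saml:AttributeValue", SAML_NS
    )
    return [(value.text or "").strip() for value in values]


if __name__ == "__main__":
    print(groups_from_statement(SAMPLE_ATTRIBUTE_STATEMENT))
```

Checking a captured assertion this way can confirm that the attribute is named exactly `groups` and that each group arrives as a separate value rather than one concatenated string.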
## 4. Map IdP groups to Datafold groups
The `datafold_admin` group, created in the IdP through [step 1](#1-create-desired-groups-in-the-idp), will be automatically synced. Users in this IdP group will also be members of the corresponding group in Datafold.
**Note:** Manual Datafold user group memberships will be overridden upon the user's next login to Datafold. Therefore, group memberships should be managed exclusively within the IdP once the `groups` attribute is configured.
## Example configuration
Here's how you might configure three groups to map to the three default Datafold groups, `admin`, `default` and `viewonly`:
# User Roles and Permissions
Source: https://docs.datafold.com/security/user-roles-and-permissions
Datafold uses role-based access control to manage user permissions and actions.
Datafold uses groups to control what users and service accounts can access. Every user belongs to one or more groups, and each group carries a set of permissions.
## Built-in groups
Every organization has three built-in groups that cannot be deleted or have their permissions modified:
| Group | Description | Permissions |
| -------- | -------------- | -------------------------------------------------------------------------------------- |
| admin | Administrator | All permissions, plus user and configuration management |
| default | Full user role | Create and modify monitors, create diffs, explore data, lineage, and knowledge graph |
| viewonly | View-only role | View diffs, monitors, and knowledge graph without the ability to create or modify them |
New users are automatically added to the **default** group. The first user in an organization is also added to the **admin** group.
## Custom groups
Admins can create custom groups with a tailored set of permissions. This is useful for:
* **Service accounts** that should only access specific tools (e.g., an MCP integration that only needs data source and knowledge graph access)
* **External partners** who should have limited access
* **Specialized roles** like "monitor operators" who can trigger monitor runs but not create diffs
To create a custom group:
1. Go to **Settings → Groups** and click **New Group**
2. Enter a name and select the permissions you want to grant
3. Click **Create**
To edit permissions on an existing custom group, click **Edit** on the group row, then toggle permissions in the checklist.
Built-in group permissions (admin, default, viewonly) cannot be modified. To restrict access, create a custom group with only the permissions you need.
## Permissions reference
Permissions are organized by category. A user's effective permissions are the union of all groups they belong to.
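The union rule can be sketched in a few lines. The group names and their permission sets below are hypothetical custom groups chosen for illustration; only the permission names come from the reference tables in this section.

```python
# Hypothetical custom groups; permission names match the reference tables.
GROUP_PERMISSIONS = {
    "diff-readers": {"View diffs"},
    "monitor-operators": {"View monitors", "Edit monitors"},
}


def effective_permissions(
    user_groups: list[str],
    group_permissions: dict[str, set[str]],
) -> set[str]:
    """Union the permissions of every group the user belongs to."""
    result: set[str] = set()
    for group in user_groups:
        # Unknown groups contribute nothing rather than raising.
        result |= group_permissions.get(group, set())
    return result


if __name__ == "__main__":
    perms = effective_permissions(
        ["diff-readers", "monitor-operators"], GROUP_PERMISSIONS
    )
    print(sorted(perms))
```

Because permissions only accumulate, removing access means removing the user from every group that grants the permission in question.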
### Organization
| Permission | Description |
| ----------------------- | ------------------------------------ |
| List users | View organization members |
| Edit table descriptions | Modify table and column descriptions |
| Edit tags | Create and modify tags |
### Data Sources
| Permission | Description |
| -------------------- | --------------------------------------- |
| List data sources | View and query connected data sources |
| Refresh schema | Trigger schema refresh on a data source |
| Run profiling | Run table profiling |
| Cancel profiling | Cancel running profiling jobs |
| Cancel schema fetch | Cancel schema fetch jobs |
| Cancel fetch history | Cancel fetch history jobs |
| Cancel BI sync | Cancel BI sync jobs |
### CI/CD
| Permission | Description |
| -------------------- | ------------------------------------- |
| Cancel CI run | Cancel a running CI check |
| Upload dbt artifacts | Upload dbt manifest and catalog files |
### Data Diffs
| Permission | Description |
| ------------- | ------------------------------------------ |
| View diffs | View existing data diffs and their results |
| Create diffs | Create new data diffs |
| Cancel diffs | Cancel running data diffs |
| Archive diffs | Archive completed diffs |
| Purge diffs | Permanently delete diffs |
### Monitors
| Permission | Description |
| ------------- | --------------------------------------------------- |
| View monitors | View monitors, their configuration, and run results |
| Edit monitors | Create, modify, provision, and trigger monitor runs |
### Knowledge Graph
| Permission | Description |
| -------------------- | ----------------------------------------------------------- |
| View knowledge graph | Query the knowledge graph, view schema, and explore lineage |
| Edit knowledge graph | Modify knowledge graph data (reserved for future use) |
## MCP tool visibility
When using the [Datafold MCP server](/datafold-mcp), the tools available to an AI agent are determined by the API key's user permissions. Tools that require permissions the user doesn't have are automatically hidden.
This means you can create a custom group with a limited set of permissions, assign it to a [service account](/security/service-accounts), and use that service account's API key to control exactly which MCP tools the agent can access.
For example, to give an agent access to only data sources and the knowledge graph:
1. Create a custom group with **List data sources** and **View knowledge graph** permissions
2. Create a service account assigned to that group
3. Use the service account's API key in your MCP client configuration
See [MCP Tool Permissions](/security/mcp-tool-permissions) for the exact permissions each MCP tool requires, plus the minimum set needed to enable every tool.
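The hiding behavior can be modeled as a subset check: a tool is visible only when every permission it requires is granted. In the sketch below, the tool names and their requirements are hypothetical and do not reflect the actual mapping, which is documented on the MCP Tool Permissions page.

```python
# Hypothetical tool names and requirements, for illustration only.
TOOL_REQUIREMENTS = {
    "list_data_sources": {"List data sources"},
    "query_knowledge_graph": {"View knowledge graph"},
    "create_diff": {"List data sources", "Create diffs"},
}


def visible_tools(
    granted: set[str],
    requirements: dict[str, set[str]] = TOOL_REQUIREMENTS,
) -> list[str]:
    """Return tools whose required permissions are all granted."""
    return sorted(
        tool for tool, needed in requirements.items() if needed <= granted
    )


if __name__ == "__main__":
    # Permissions from the example group above: data sources and knowledge graph.
    print(visible_tools({"List data sources", "View knowledge graph"}))
```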
## Data source access control
In addition to group-level permissions, Datafold supports per-data-source access control. Admins can restrict which groups can access specific data sources under **Settings → Integrations → \[Data Source] → Restrict Access**.
This provides an additional layer of control: a user may have the "List data sources" permission but only see data sources their groups are allowed to access.
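The two layers combine as a permission check followed by a per-source group check. The sketch below assumes that a data source with no restriction configured is visible to anyone holding the "List data sources" permission; that default, along with all names in the example, is an assumption made for illustration.

```python
from typing import Optional


def visible_data_sources(
    user_groups: set[str],
    permissions: set[str],
    restrictions: dict[str, Optional[set[str]]],
) -> list[str]:
    """Return the data sources a user can see.

    `restrictions` maps each data source to the groups allowed to access it,
    or None when no restriction is configured (assumed to mean unrestricted).
    """
    if "List data sources" not in permissions:
        return []
    visible = []
    for source, allowed_groups in restrictions.items():
        if allowed_groups is None or user_groups & allowed_groups:
            visible.append(source)
    return visible


if __name__ == "__main__":
    restrictions = {
        "warehouse_prod": {"finance"},  # restricted to one group
        "warehouse_dev": None,          # no restriction configured
    }
    print(visible_data_sources({"analysts"}, {"List data sources"}, restrictions))
```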
# FAQ
Source: https://docs.datafold.com/support/faq-redirect
# Support
Source: https://docs.datafold.com/support/support
Datafold offers multiple support channels to assist users with troubleshooting and inquiries.
## Datafold Support
* **Email**: Contact support at [support@datafold.com](mailto:support@datafold.com) for any assistance.
* **In-app Chat**: Reach out directly from the Datafold app via live chat for quick help.
* **Shared Slack Channel**: Collaborate with the Datafold team through a dedicated Slack channel (please inquire with your account executive to set up).
* **FAQ**: Explore our [Frequently Asked Questions](/faq/overview) for detailed answers to common queries and troubleshooting tips.
### Grant access to Datafold's team for troubleshooting
For faster resolution of support issues, you can temporarily grant Datafold Support access to your account. This enables a Datafold team member to view the same in-app context as you, minimizing back-and-forth communication.
To grant access:
1. Navigate to **Settings** → **Org Settings**.
2. Check the box next to *"Allow Datafold access to your account for troubleshooting purposes."*
To revoke access, simply uncheck the box at any time.
**Note:** Admin privileges are required to modify this setting in Org Settings.
# Datafold
Source: https://docs.datafold.com/welcome
Datafold is the data engineering automation platform that combines specialized AI agents with a context layer and data quality tools — so data teams and their coding agents ship higher-quality data faster, migrate with confidence, and optimize platform costs.
## Key features
The Data Migration Agent delivers guaranteed-outcome migrations with fixed price, timeline, and data parity — over 6x faster than traditional approaches.
The context layer for reliable AI-assisted data engineering — lineage, business logic, usage, and ontology served via MCP to your coding agents.
Value-level data diffs, monitors, and reconciliation power tools — exposed via MCP so your coding agents can validate their own work.
Connect your AI coding agent to Datafold and interact with your data through natural language — diffs, lineage, monitors, and more.
## Use cases
Modernize your data platform in weeks, not years, with AI-powered migration automation and cross-database validation.
Supercharge your coding agents with the Data Knowledge Graph and data quality tools via MCP.
Automatically test, data-diff, and validate every pull request before it reaches production.
## Data Knowledge Graph
**Private Beta** — The Data Knowledge Graph is currently in private beta. Contact the Datafold team at [sales@datafold.com](mailto:sales@datafold.com) to enable this for your organization.
The **Data Knowledge Graph (DKG)** automatically collects and unifies all information about your data ecosystem — lineage, business logic, usage statistics, BI connections, git history, and organizational knowledge — and serves it to your AI agents via MCP.
Unlike data catalogs that rely on manual curation, the DKG is sourced and maintained by AI, and optimized for consumption by the coding agents of your choice. It spans all your data sources and code bases, creating a comprehensive view of your entire data platform that is inaccessible to any single provider on their own.
The DKG powers Datafold's specialized agents (such as the Data Migration Agent) and supercharges external coding agents (Claude Code, Cursor, Windsurf) by providing the context they need to produce reliable results for any data engineering task.
## Getting started
There are a few ways to get started with your first data diff:
Once you’ve integrated a [data connection](/integrations) and [code repository](/integrations/code-repositories), you can run a new [in-database](/data-diff/in-database-diffing/creating-a-new-data-diff) or [cross-database](/data-diff/cross-database-diffing/creating-a-new-data-diff) data diff or explore your [data lineage](data-explorer/lineage).
Create [monitors](data-monitoring/monitor-types) to send alerts when data diffs fall outside predefined ranges.
Get started with deployment testing through our universal ([No-Code](deployment-testing/getting-started/universal/no-code), [API](deployment-testing/getting-started/universal/api)) or [dbt](integrations/orchestrators/dbt-core) integrations.
## Learn more
* [Connect your AI agent to Datafold via MCP](datafold-mcp) and start using data diffs, lineage, and monitors from your development environment
* Read our [Data Quality Guide](https://www.datafold.com/data-quality-guide) for a practical roadmap to building a robust data quality system
* [Book a demo](https://www.datafold.com/) to see how Datafold can automate your data engineering workflows