Skip to main content

Data Test Monitors

info

Please contact support@datafold.com if you'd like to enable this feature for your organization.

Data Test monitors allow you to validate your data with business rules written as custom SQL queries. Failed records are materialized to a temporary table in your warehouse, and you can view a sample of the records in Datafold.

Think of these monitors as pass/fail—either the query returns no records (pass) or it returns at least one record (fail). Data Tests are extremely flexible, but common use cases are validating custom business rules, referential integrity between tables, and data formatting (see examples below).

Create a Data Test monitor

There are two ways to create a Data Test monitor:

  1. Open the Monitors page, select Create new monitor, and then choose Data Test.
  2. Clone an existing Data Test monitor by clicking Actions and then Clone. This will pre-fill the form with the existing monitor configuration.

Set up your monitor

Give your monitor a descriptive name. For example, if your test confirms that records in a table of transactions have an amount that's greater than zero, you might call it Transaction amount > 0.

Then, select your data connection and write your test. Keep in mind your query should return records that fail the test.

Continuing with the example above, if transaction amounts should be greater than zero, your query should return records with amounts less than or equal to zero:

SELECT *
FROM transactions
WHERE amount <= 0

Add a schedule

Customize the frequency and timing of monitor runs. You can choose a specific hourly or daily time in UTC, or input a cron tab expression for more complex scheduling:

Add notifications

Receive notifications via Slack or email when at least one record fails your test:

Example queries

Custom business rule

Say your company defines active users as individuals who have signed into your application at least 3 times in the past week. You could write a test that validates this logic by checking for users marked as active who haven't reached this threshold:

SELECT *
FROM users
WHERE status = 'active'
AND signins_last_7d < 3;

Referential integrity

Now let’s assume you have contacts and accounts in your warehouse. Every contact belongs to exactly one account, and you want to confirm that every account referenced in the contacts table exists in the accounts table. In this case, you’d look for references to accounts that are missing from the accounts table:

SELECT *
FROM contacts
LEFT JOIN accounts
ON contacts.account_id = accounts.id
WHERE contacts.account_id IS NOT NULL
AND accounts.id IS NULL;

Data formatting

Finally, if you wanted to validate that all phone numbers in your contacts table are 10 digits and only contain numbers, you'd return records that are not 10 digits or use non-numeric characters:

SELECT *
FROM contacts
WHERE LENGTH(phone_number) != 10
OR phone_number REGEXP '[^0-9]';

Running Tests in CI

info

Data Tests in CI are an experimental feature. Please contact support@datafold.com if you'd like to use them.

Standard Data Test monitors run on a schedule against your production data. But often it's useful to test data before it gets to production as part of your deployment workflow. For this reason, Datafold supports running tests in CI.

Data Tests in CI work very similarly to our Monitors as Code feature, in the sense that you define your tests in a version-controled YAML file. You then use the Datafold SDK to execute those tests as part of your CI workflow.

Write your tests

First, create a new file (e.g. tests.yaml) in the root of your repository. Then write your tests use the same format described in our Monitors as Code docs with two exceptions:

  1. Add a run_in_ci flag to each test and set it to true (assuming you'd like to run the test)
  2. (Optional) Add placeholders for variables that you'd like to populate dynamically when executing your tests

Here's an example:

monitors:
null_pk_test:
type: test
name: No NULL pk in the users table
run_in_ci: true
connection_id: 8
query: select * from {{ schema }}.USERS where id is null

duplicate_pk_test:
type: test
name: No duplicate pk in the users table
run_in_ci: true
connection_id: 8
query: |
select *
from {{ schema }}.USERS
where id in (
select id
from {{ schema }}.USERS
group by id
having count(*) > 1
);

Execute your tests

info

This section describes how to get started with GitHub Actions, but the same concepts apply to other hosted version control platforms like GitLab and Bitbucket. Contact us if you need help getting started.

If you're using GitHub Actions, create a new YAML file under .github/workflows/ using the following template. Be sure to tailor it to your particular setup:

  on:
push:
branches:
- main
pull_request:
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v2
with:
token: ${{ secrets.GH_TOKEN }}
repository: datafold/datafold-sdk
path: datafold-sdk
ref: data-tests-in-ci-demo
- uses: actions/setup-python@v2
with:
python-version: '3.12'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt
- name: Set schema env var in PR
run: |
echo "SCHEMA=ANALYTICS.PR" >> $GITHUB_ENV
if: github.event_name == 'pull_request'
- name: Set schema env var in main
run: |
echo "SCHEMA=ANALYTICS.CORE" >> $GITHUB_ENV
if: github.event_name == 'push'
- name: Run tests
run: |
datafold tests run --var schema:$SCHEMA --ci-config-id 1 tests.yaml # use the correct file name/path
env:
DATAFOLD_HOST: https://app.datafold.com # different for dedicated deployments
DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} # remember to add to secrets

View the results

When your CI workflow is triggered (e.g. by a pull request), you can view the terminal output for your test results:

Need help?

If you have any questions about how to use Data Test monitors, please reach out to our team via Slack, in-app chat, or email us at support@datafold.com.