Data Test monitors allow you to validate your data with business rules written as custom SQL queries. Failed records are materialized to a temporary table in your warehouse, and you can view a sample of the records in Datafold.

INFO

Please contact support@datafold.com if you’d like to enable this feature for your organization.

Think of these monitors as pass/fail—either the query returns no records (pass) or it returns at least one record (fail). Data Tests are extremely flexible, but common use cases are validating custom business rules, referential integrity between tables, and data formatting (see examples below).

Create a Data Test monitor

There are two ways to create a Data Test monitor:

  1. Open the Monitors page, select Create new monitor, and then choose Data Test.
  2. Clone an existing Data Test monitor by clicking Actions and then Clone. This will pre-fill the form with the existing monitor configuration.

Set up your monitor

Give your monitor a descriptive name. For example, if your test confirms that records in a table of transactions have an amount that’s greater than zero, you might call it Transaction amount > 0.

Then, select your data connection and write your test. Keep in mind your query should return records that fail the test.

Continuing with the example above, if transaction amounts should be greater than zero, your query should return records with amounts less than or equal to zero:

SELECT *
FROM transactions
WHERE amount <= 0

Add a schedule

Customize the frequency and timing of monitor runs. You can choose a specific hourly or daily time in UTC, or input a cron tab expression for more complex scheduling:

Add notifications

Receive notifications via Slack or email when at least one record fails your test:

Example queries

Custom business rule

Say your company defines active users as individuals who have signed into your application at least 3 times in the past week. You could write a test that validates this logic by checking for users marked as active who haven’t reached this threshold:

SELECT *
FROM users
WHERE status = 'active'
    AND signins_last_7d < 3;

Referential integrity

Now let’s assume you have contacts and accounts in your warehouse. Every contact belongs to exactly one account, and you want to confirm that every account referenced in the contacts table exists in the accounts table. In this case, you’d look for references to accounts that are missing from the accounts table:

SELECT *
FROM contacts
LEFT JOIN accounts
    ON contacts.account_id = accounts.id
WHERE contacts.account_id IS NOT NULL
    AND accounts.id IS NULL;

Data formatting

Finally, if you wanted to validate that all phone numbers in your contacts table are 10 digits and only contain numbers, you’d return records that are not 10 digits or use non-numeric characters:

SELECT *
FROM contacts
WHERE LENGTH(phone_number) != 10
    OR phone_number REGEXP '[^0-9]';

Attach CSVs to notifications

Datafold allows attaching a CSV of failed records to notifications in Slack, email, etc. This is useful if, for example, you have business users who don’t have a Datafold license but need to know about records that fail your tests.

This option is configured separately per notification destination as shown here:

CSV attachments are limited to the lesser of 1,000 rows or 1 MB in file size.

Attaching CSVs in Slack

In order to attach CSVs to Slack notifications, you need to complete 1-2 additional steps:

  1. If you installed the Datafold Slack app prior to October 2024, you’ll need to reinstall the app by visiting Settings > Integrations > Notifications, selecting your Slack integration, then Reinstall Slack integration.
  2. Invite the Datafold app to the channel you wish to send notifications to using the /invite command shown below:

Run Tests in CI

Standard Data Tests run on a schedule against your production data. But often it’s useful to test data before it gets to production as part of your deployment workflow. For this reason, Datafold supports running tests in CI.

Data Tests in CI work very similarly to our Monitors as Code feature, in the sense that you define your tests in a version-controled YAML file. You then use the Datafold SDK to execute those tests as part of your CI workflow.

Write your tests

First, create a new file (e.g. tests.yaml) in the root of your repository. Then write your tests use the same format described in our Monitors as Code docs with two exceptions:

  1. Add a run_in_ci flag to each test and set it to true (assuming you’d like to run the test)
  2. (Optional) Add placeholders for variables that you’d like to populate dynamically when executing your tests

Here’s an example:

monitors:
  null_pk_test:
    type: test
    name: No NULL pk in the users table
    run_in_ci: true
    connection_id: 8
    query: select * from {{ schema }}.USERS where id is null

  duplicate_pk_test:
    type: test
    name: No duplicate pk in the users table
    run_in_ci: true
    connection_id: 8
    query: |
        select *
        from {{ schema }}.USERS
        where id in (
            select id
            from {{ schema }}.USERS
            group by id
            having count(*) > 1
        );

Execute your tests

INFO

This section describes how to get started with GitHub Actions, but the same concepts apply to other hosted version control platforms like GitLab and Bitbucket. Contact us if you need help getting started.

If you’re using GitHub Actions, create a new YAML file under .github/workflows/ using the following template. Be sure to tailor it to your particular setup:

  on:
    push:
      branches:
        - main
    pull_request:
  jobs:
    test:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v2
        - uses: actions/checkout@v2
          with:
            token: ${{ secrets.GH_TOKEN }}
            repository: datafold/datafold-sdk
            path: datafold-sdk
            ref: data-tests-in-ci-demo
        - uses: actions/setup-python@v2
          with:
            python-version: '3.12'
        - name: Install dependencies
          run: |
            python -m pip install --upgrade pip
            pip install -r requirements.txt
        - name: Set schema env var in PR
          run: |
            echo "SCHEMA=ANALYTICS.PR" >> $GITHUB_ENV
          if: github.event_name == 'pull_request'
        - name: Set schema env var in main
          run: |
            echo "SCHEMA=ANALYTICS.CORE" >> $GITHUB_ENV
          if: github.event_name == 'push'
        - name: Run tests
          run: |
            datafold tests run --var schema:$SCHEMA --ci-config-id 1 tests.yaml # use the correct file name/path
          env:
            DATAFOLD_HOST: https://app.datafold.com # different for dedicated deployments
            DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} # remember to add to secrets

View the results

When your CI workflow is triggered (e.g. by a pull request), you can view the terminal output for your test results:

Need help?

If you have any questions about how to use Data Test monitors, please reach out to our team via Slack, in-app chat, or email us at support@datafold.com.