Data Test monitors allow you to validate your data using off-the-shelf checks for non-null or unique values, numeric ranges, accepted values, referential integrity, and more. Custom tests let you write custom SQL queries to validate your own business rules.

Think of Data Tests as pass/fail—either a test returns no records (pass) or it returns at least one record (fail). Failed records are viewable in the app, materialized to a temporary table in your warehouse, and can even be attached to notifications as a CSV.

Create a Data Test monitor

There are two ways to create a Data Test monitor:

  1. Open the Monitors page, select Create new monitor, and then choose Data Test.
  2. Clone an existing Data Test monitor by clicking Actions and then Clone. This will pre-fill the form with the existing monitor configuration.

Set up your monitor

Select your data connection, then choose whether you’d like to use a Standard or Custom test.

Standard Data Tests

Standard tests allow you to validate your data against off-the-shelf checks for non-null or unique values, numeric ranges, accepted values, referential integrity, and more.

After choosing your data connection, select Standard and the specific test that you’d like to run. If you don’t see the test you’re looking for, you can always write a Custom test.

Quoting variables

Some test types (e.g. accepted values) require you to provide one or more values, which you may want to have quoted in the final SQL. The Quote flag, which is enabled by default, allows you to control this behavior. Here’s an example.

Quoting enabled for EXAMPLE_VALUE (default):

SELECT *
FROM DB.SCHEMA.TABLE1
WHERE "COLUMN1" < 'EXAMPLE_VALUE';

Quoting disabled for EXAMPLE_VALUE:

SELECT *
FROM DB.SCHEMA.TABLE1
WHERE "COLUMN1" < EXAMPLE_VALUE;

Custom Data Tests

When you need to test something that’s not available in our Standard tests, you can write a Custom test. Select your data connection, choose Custom, then write your SQL query.

Importantly, keep in mind that your query should return records that fail the test. Here are some examples to illustrate this.

Custom business rule

Say your company defines active users as individuals who have signed into your application at least 3 times in the past week. You could write a test that validates this logic by checking for users marked as active who haven’t reached this threshold:

SELECT *
FROM users
WHERE status = 'active'
    AND signins_last_7d < 3;

Data formatting

If you wanted to validate that all phone numbers in your contacts table are 10 digits and only contain numbers, you’d return records that are not 10 digits or use non-numeric characters:

SELECT *
FROM contacts
WHERE LENGTH(phone_number) != 10
    OR phone_number REGEXP '[^0-9]';

Add a schedule

You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling:

Add notifications

Receive notifications via Slack or email when at least one record fails your test:

Attach CSVs to notifications

Datafold allows attaching a CSV of failed records to Slack and email notifications. This is useful if, for example, you have business users who don’t have a Datafold license but need to know about records that fail your tests.

This option is configured separately per notification destination as shown here:

CSV attachments are limited to the lesser of 1,000 rows or 1 MB in file size.

Attaching CSVs in Slack

In order to attach CSVs to Slack notifications, you need to complete 1-2 additional steps:

  1. If you installed the Datafold Slack app prior to October 2024, you’ll need to reinstall the app by visiting Settings > Integrations > Notifications, selecting your Slack integration, then Reinstall Slack integration.
  2. Invite the Datafold app to the channel you wish to send notifications to using the /invite command shown below:

Run Tests in CI

Standard Data Tests run on a schedule against your production data. But often it’s useful to test data before it gets to production as part of your deployment workflow. For this reason, Datafold supports running tests in CI.

Data Tests in CI work very similarly to our Monitors as Code feature, in the sense that you define your tests in a version-controled YAML file. You then use the Datafold SDK to execute those tests as part of your CI workflow.

Write your tests

First, create a new file (e.g. tests.yaml) in the root of your repository. Then write your tests use the same format described in our Monitors as Code docs with two exceptions:

  1. Add a run_in_ci flag to each test and set it to true (assuming you’d like to run the test)
  2. (Optional) Add placeholders for variables that you’d like to populate dynamically when executing your tests

Here’s an example:

monitors:
  null_pk_test:
    type: test
    name: No NULL pk in the users table
    run_in_ci: true
    connection_id: 8
    query: select * from {{ schema }}.USERS where id is null

  duplicate_pk_test:
    type: test
    name: No duplicate pk in the users table
    run_in_ci: true
    connection_id: 8
    query: |
        select *
        from {{ schema }}.USERS
        where id in (
            select id
            from {{ schema }}.USERS
            group by id
            having count(*) > 1
        );

Execute your tests

INFO

This section describes how to get started with GitHub Actions, but the same concepts apply to other hosted version control platforms like GitLab and Bitbucket. Contact us if you need help getting started.

If you’re using GitHub Actions, create a new YAML file under .github/workflows/ using the following template. Be sure to tailor it to your particular setup:

  on:
    push:
      branches:
        - main
    pull_request:
  jobs:
    test:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v2
        - uses: actions/checkout@v2
          with:
            token: ${{ secrets.GH_TOKEN }}
            repository: datafold/datafold-sdk
            path: datafold-sdk
            ref: data-tests-in-ci-demo
        - uses: actions/setup-python@v2
          with:
            python-version: '3.12'
        - name: Install dependencies
          run: |
            python -m pip install --upgrade pip
            pip install -r requirements.txt
        - name: Set schema env var in PR
          run: |
            echo "SCHEMA=ANALYTICS.PR" >> $GITHUB_ENV
          if: github.event_name == 'pull_request'
        - name: Set schema env var in main
          run: |
            echo "SCHEMA=ANALYTICS.CORE" >> $GITHUB_ENV
          if: github.event_name == 'push'
        - name: Run tests
          run: |
            datafold tests run --var schema:$SCHEMA --ci-config-id 1 tests.yaml # use the correct file name/path
          env:
            DATAFOLD_HOST: https://app.datafold.com # different for dedicated deployments
            DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} # remember to add to secrets

View the results

When your CI workflow is triggered (e.g. by a pull request), you can view the terminal output for your test results:

Need help?

If you have any questions about how to use Data Test monitors, please reach out to our team via Slack, in-app chat, or email us at support@datafold.com.