Install

First, create and activate a Python virtual environment, then upgrade the packaging tools:

    > python3 -m venv venv
    > source venv/bin/activate
    > pip install --upgrade pip setuptools wheel

Now you’re ready to install the Datafold SDK:

    > pip install datafold-sdk

Configure

After selecting datafold-sdk from the available options, complete configuration with the following information:

Field Name | Description
--- | ---
Repository | Select the repository that generates the webhooks and where pull / merge requests will be raised.
Data Connection | Select the data connection where the code that is changed in the repository will run.
Name | An identifier used in Datafold to identify this CI configuration.
Files to ignore | If defined, files matching the pattern will be ignored in the PRs. The pattern uses .gitignore syntax. Excluded files can be re-included with a negation (!), and re-included files can later be excluded again to narrow the filter. For example, to ignore everything except the /dbt folder while still ignoring the .md files inside it, use the three patterns *, !dbt/* and dbt/*.md, one per line and in that order.
Mark the CI check as failed on errors | If the checkbox is disabled, errors in CI runs are reported back to GitHub/GitLab as successes, to keep the check “green” and not block the PR/MR. By default (enabled), errors are reported as failures and may prevent PRs/MRs from being merged.
Require the datafold label to start CI | When this is selected, the Datafold CI process only runs when the ‘datafold’ label has been applied. This label needs to be created manually in GitHub or GitLab, and its title or name must match ‘datafold’ exactly.
Sampling tolerance | The tolerance to apply in sampling for all data diffs.
Sampling confidence | The confidence to apply when sampling.
Sampling Threshold | Sampling is disabled automatically for tables smaller than the specified threshold. If unspecified, default values are used depending on the Data Connection type.
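
To make the "Files to ignore" example concrete, here is a minimal Python sketch of the last-matching-pattern-wins behaviour described above. It is an approximation only: it uses the standard library's fnmatch rather than a full .gitignore implementation, and the is_ignored helper and the sample paths are hypothetical, not part of Datafold.

```python
from fnmatch import fnmatchcase

# Patterns from the "Files to ignore" example, in order.
PATTERNS = ["*", "!dbt/*", "dbt/*.md"]


def is_ignored(path, patterns):
    """Return True if `path` ends up ignored (hypothetical helper).

    Simplified model of .gitignore: patterns are checked in order and the
    last one that matches decides; a leading "!" re-includes the path.
    Note that fnmatch's "*" also matches "/" characters, which real
    .gitignore matching does not always do.
    """
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(path, glob):
            ignored = not negated
    return ignored


for sample in ["README.md", "dbt/models/orders.sql", "dbt/README.md"]:
    print(sample, "ignored" if is_ignored(sample, PATTERNS) else "kept")
```

Under these simplified rules, only the non-.md files under dbt/ are kept; everything else is ignored.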

Example Uses

Submit dbt Artifacts

The following arguments need to be specified when submitting dbt artifacts via the Datafold SDK (examples for Python and CLI below):

  • ci_config_id: The id of your Orchestration config, which can be found in Settings > Integrations > Orchestration.
  • run-type: This can be either ‘pull_request’ or ‘production’, depending on whether you’re uploading dbt artifacts for a git commit SHA corresponding to production or pull request code.
  • artifacts_path: A path to dbt artifacts. Typically, these artifacts will be located in the ‘target’ folder of your dbt project. If your current working directory is the dbt project, you can use ‘./target/’ as your artifacts_path.
  • git_sha: The git commit SHA for which you will provide artifacts.
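
Before uploading, it can help to sanity-check these four values. The sketch below is one possible pre-flight check based only on the descriptions above; check_upload_args is a hypothetical helper, not part of the Datafold SDK, and it does not contact Datafold.

```python
from pathlib import Path

# The two run types named in the argument list above.
VALID_RUN_TYPES = {"pull_request", "production"}


def check_upload_args(ci_config_id, run_type, artifacts_path, git_sha):
    """Return a list of human-readable problems (empty if none found).

    Hypothetical helper: it only checks what the argument descriptions
    state, and makes no assumption about the type of ci_config_id.
    """
    problems = []
    if ci_config_id in (None, ""):
        problems.append("ci_config_id is missing")
    if run_type not in VALID_RUN_TYPES:
        problems.append(f"run_type must be one of {sorted(VALID_RUN_TYPES)}")
    if not Path(artifacts_path).is_dir():
        problems.append(f"artifacts_path is not a directory: {artifacts_path}")
    if not git_sha:
        problems.append("git_sha is missing")
    return problems


# Example: an unsupported run type is reported, the other values pass.
print(check_upload_args(1, "staging", ".", "abc123"))
```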

CLI

    export DATAFOLD_API_KEY=XXXXXXXXX
    # only needed if your Datafold app url is not app.datafold.com
    export DATAFOLD_HOST=<CUSTOM_DATAFOLD_APP_DOMAIN>
    datafold dbt upload \
    --ci-config-id <ci_config_id> \
    --run-type <run-type> \
    --target-folder <artifacts_path> \
    --commit-sha <git_sha>

Python

    import os

    from datafold.sdk.dbt import submit_artifacts

    api_key = os.environ.get("DATAFOLD_API_KEY")

    # only needed if your Datafold app url is not app.datafold.com
    host = os.environ.get("DATAFOLD_HOST")

    # Replace each <...> placeholder with your own value before running.
    submit_artifacts(
        host=host,
        api_key=api_key,
        ci_config_id=<ci_config_id>,
        run_type='<run-type>',
        target_folder='<artifacts_path>',
        commit_sha='<git_sha>',
    )
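
Because os.environ.get returns None when a variable is unset, a missing API key would only surface once the upload is attempted. As a sketch, the environment lookup can be wrapped so that it fails early; load_datafold_settings is a hypothetical helper name, and only the two environment variables shown above are assumed.

```python
import os


def load_datafold_settings(env=None):
    """Read Datafold credentials from the environment (hypothetical helper).

    Raises RuntimeError if DATAFOLD_API_KEY is unset or empty.
    DATAFOLD_HOST stays optional and is returned as None when unset,
    since it is only needed for a custom Datafold app domain.
    """
    env = os.environ if env is None else env
    api_key = env.get("DATAFOLD_API_KEY")
    if not api_key:
        raise RuntimeError("DATAFOLD_API_KEY is not set")
    return {"api_key": api_key, "host": env.get("DATAFOLD_HOST")}


# Example with an explicit mapping instead of the real environment.
settings = load_datafold_settings({"DATAFOLD_API_KEY": "XXXXXXXXX"})
print(settings["host"])
```

The returned values can then be passed as the api_key and host arguments of submit_artifacts.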