Datafold home pagelogo
  • Login
  • Request a Demo
  • Request a Demo
Documentation
API Reference
Frequently Asked Questions
  • About Datafold
  • Blog
  • FAQ
    • Overview
    • Data Diffing
    • CI/CD Testing
    • Data Migration Automation
    • Data Reconciliation
    • Data Monitoring and Observability
    • Integrating Datafold with dbt
    • Data Storage and Security
    • Performance and Scalability
    • Resource Management
    FAQ

    Integrating Datafold with dbt

    You need Datafold in addition to dbt tests because while dbt tests are effective for validating specific assertions about your data, they can’t catch all issues, particularly unknown unknowns. Datafold identifies value-level differences between staging and production datasets, which dbt tests might miss.

    Unlike dbt tests, which require manual configuration and maintenance, Datafold automates this process, ensuring continuous and comprehensive data quality validation without additional overhead. This is all embedded within Datafold’s unified platform that offers end-to-end data quality testing with our Column-level Lineage and Data Monitors.

    Hence, we recommend combining dbt tests with Datafold to achieve complete test coverage that addresses both known and unknown data quality issues, providing a robust safeguard against potential data integrity problems in your CI pipeline.

    For dbt Core users, create an integration in Datafold, specify the necessary settings, obtain a Datafold API Key and CI config ID, and configure your CI scripts with the Datafold SDK to upload manifest.json files. Our detailed setup guide can be found here.

    For dbt Cloud users, set up dbt Cloud CI to run Pull Request jobs and create an Artifacts Job that generates production manifest.json on merges to main/master. Obtain your dbt Cloud access URL and a Service Token, then create a dbt Cloud integration in Datafold using these credentials. Configure the integration with your repository, data connection, primary key tag, and relevant jobs. Our detailed setup guide can be found here.

    Yes, Datafold is fully compatible with the custom PR schema created by dbt Cloud for Slim CI jobs.

    We outline effective strategies for efficient and scalable data diffing in ourperformance and scalability guide.

    For dbt-specific diff performance, you can exclude certain columns or tables from data diffs in your CI/CD pipeline by adjusting the Advanced settings in your Datafold CI/CD configuration. This helps reduce processing load by focusing diffs on only the most relevant columns.

    Some teams want to show Data Diff results in their tickets before creating a pull request. This speeds up code reviews as developers can QA code changes before requesting a PR review.

    You can trigger a Data Diff by first creating a draft PR and then running the following command via the CLI:

    dbt run && datafold diff dbt
    

    This command runs dbt locally and then triggers a Data Diff, allowing you to preview data changes without pushing to Git.

    To automate this process of kicking off a Data Diff before pushing code to git, we recommend creating a GitHub Actions job for draft PRs. For example:

    name: Data Diff on draft dbt PR
    
    on:
      pull_request:
        types: [opened, reopened, synchronize]
        branches:
          - '!main'
    
    jobs:
      run:
        if: github.event.pull_request.draft == true  # Run only on draft PRs
        runs-on: ubuntu-latest
    
        steps:
          - name: Checkout Code
            uses: actions/checkout@v2
    
          - name: Set Up Python
            uses: actions/setup-python@v2
            with:
              python-version: '3.8'
    
          - name: Install requirements
            run: pip install -r requirements.txt  
    
          - name: Install dbt dependencies
            run: dbt deps
    
          # Update with your S3 bucket details
          - name: Grab production manifest from S3
            run: |
              aws s3 cp s3://advanced-ci-manifest-demo/manifest.json ./manifest.json
            env:
              AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
              AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
              AWS_REGION: us-east-1
    
          - name: Run dbt and Data Diff
            env:
              DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
            run: |
              dbt run
              datafold diff dbt
              
          # Optional: Submit artifacts to Datafold for more analysis or logging
          - name: Submit artifacts to Datafold
            run: |
              set -ex
              datafold dbt upload --ci-config-id 350 --run-type pull_request --commit-sha ${GIT_SHA}
            env:
              DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
              GIT_SHA: "${{ github.event.pull_request.head.sha }}"
    
    
    Data Monitoring and ObservabilityData Storage and Security
    linkedinxgithubyoutube
    Powered by Mintlify
    Assistant
    Responses are generated using AI and may contain mistakes.