Make sure Datafold Github integration is set up.
Datafold needs to know which column is the primary key of the table to perform the diff. When Datafold cannot determine the primary key of the two tables to diff, and will produce an error:
When setting up the CI integration, one of the steps is proving the primary-key tag:
This tag we can use in the dbt metadata to let Datafold know which column can be used to perform the diff. Datafold supports composite primary keys, meaning that you can assign multiple columns that make up the primary key together. There are three ways of doing this, which we'll discuss next:
The first one is setting the tag in the dbt metadata. We set the primary key tag to
primary-keyso we use this in the metadata.
models:- name: userscolumns:- name: user_idmeta:primary-key: true
If the primary key is not found in the metadata, it will go through the tags.
models:- name: userscolumns:- name: user_idtags:- primary-key
If the primary key isn't provided explicitly, Datafold will try to assume a pk from dbt's uniqueness tests. If you have a single column uniqueness test defined, it will use this column as the PK:
models:- name: userscolumns:- name: user_idtests:- unique
Also, model level uniqueness tests are used for inferring the PK:
models:- name: salescolumns:- name: order_no- name: order_line...tests:- unique:column_name: "order_no || order_line"
Finally, we also support
unique_combination_ofcolumns from the
models:- name: userscolumns:- name: order_no- name: order_line...tests:- dbt_utils.unique_combination_of_columns:combination_of_columns:- order_no- order_line
Keep in mind that this is a failover mechanism. If you change the uniqueness test, this will also impact the way Datafold performs the diff.
Datafold integrates very well with dbt, and also has the ability to ingest the metadata provided by dbt automatically. dbt models has metadata that can be synchronized from the production branch into the Datafold catalog. When a table has metadata being synchronized using dbt, user editing is no longer permitted for that entire table. This is to ensure that there is a single source of truth.
Metadata can be applied both on a table and column level:
models:- name: usersdescription: "Description of the table"meta:owner: [email protected]foo: bartags:- pii- abccolumns:- name: user_idtags:- pk- idmeta:pk: true- name: emaildescription: "The user's email"tags:- piimeta:type: email
There are two special meta types:
owner: Used to specify the owner of the table and applies the owner of the table in the catalog view
<pk_tag>: The tag/name that is configured to identify primary columns is not synchronized into the meta-information, but it is synchronized as a tag if it exists.
So for the above table:
description is synchronized into the description field of the table in the catalog.
owner of the table is set to the user identified by the
[email protected] field. This user must exist in Datafold with that email.
foo meta information is added to the description field with the value
bar are applied to the table as tags.
For the columns above:
user_id has two tags applied:
The metadata for
user_id is ignored, because it reflects the primary key tag.
type with the value
Metadata synchronization occurs in one of two methods:
meta_schedule is set for the dbt cloud integration. This will run according to the specified cron schedule, find the most recent dbt cloud production run, and synchronize the metadata from there.
It can also be configured to synchronize metadata whenever a push to production happens.