Skip to main content

CLI

Deprecated

As of May 17, 2024, Datafold is no longer actively supporting or developing open source data-diff. For continued access to in-database and cross-database diffing, please check out a free trial of Datafold Cloud. We’re grateful to everyone who made contributions along the way. Please see our blog post for additional context on this decision.

Options

CategoryConfig keyCLI switch                                       Descriptiondbt       joindiffhashdiff
help--helpShow help message and exit.
Schemakey_columns-k or --key-columnsName of the primary key column. If none provided, default is 'id'.
Schemaupdate_column-t or --update-columnName of updated_at/last_updated column.
Schemacolumns-c or --columnsNames of extra columns to compare. Can be used more than once in the same command. Accepts a name or a pattern, like in SQL. Example: -c col% -c another_col -c %foorb.r%
Schemaassume_unique_key--assume-unique-keySkip validating the uniqueness of the key column during joindiff, which is costly in non-cloud dbs.
Filteringmin_age--min-ageConsiders only rows older than specified. Useful for specifying replication lag. Example: --min-age=5min ignores rows from the last 5 minutes. Valid units: d, days, h, hours, min, minutes, mon, months, s, seconds, w, weeks, y, years
Filteringmax_age--max-ageConsiders only rows younger than specified. See --min-age.
Filteringwhere-w, --whereAn additional 'where' expression to restrict the search space.
Performancelimit-l or --limitMaximum number of differences to find (limits maximum bandwidth and runtime).
Performancethreads-j or --threadsNumber of worker threads to use per database. Default=1.
Performancealgorithm-a, --algorithmForce algorithm choice. Options: auto, joindiff, hashdiff
Performancebisection_threshold--bisection-thresholdMinimal size of segment to be split. Smaller segments will be downloaded and compared locally.
Performancebisection_factor--bisection-factorSegments per iteration. When set to 2, it performs binary search.
Outputstats-s or --statsPrint stats instead of a detailed diff.
Outputdebug-d or --debugPrint debug info.
Outputinteractive-i or --interactiveConfirm queries, implies --debug
Outputverbose-v or --verbosePrint extra info.
Outputjson--jsonPrint JSONL output for machine readability.
Outputsample_exclusive_rows--sample-exclusive-rowsSample several rows that only appear in one of the tables, but not the other. Use with -s.
Outputmaterialize_all_rows--materialize-all-rowsMaterialize every row, even if they are the same, instead of just the differing rows.
Outputmaterialize-m, --materializeMaterialize the diff results into a new table in the database. If a table exists by that name, it will be replaced. Use %t in the name to place a timestamp. Example: -m test_mat_%t
Outputtable_write_limit--table-write-limitMaximum number of rows to write when creating materialized or sample tables, per thread. Default=1000.
Settings--conf, --runSpecify the run and configuration from a TOML file.
Settingsno_tracking--no-trackingdata-diff sends home anonymous usage data. Use this to disable it.

Examples

dbt

For dbt examples, check out our documentation on Development Testing with Open Source.

joindiff

data-diff \
"snowflake://<username>:<password>@<ACCOUNT>/<DATABASE>/<SCHEMA_1>?warehouse=<WAREHOUSE>&role=<ROLE>" <TABLE_1> \
<SCHEMA_2>.<TABLE_2> \
-k org_id \
-c created_at -c is_internal \
-w "org_id != 1 and org_id < 2000" \
-m test_results_%t \
--materialize-all-rows \
--table-write-limit 10000

-m materializes the results into the specified table. %t will get replaced by the current timestamp.

Check out the rest of the options here.

hashdiff

data-diff \
postgresql://<username>:'<password>'@localhost:5432/<database> \
<table> \
"snowflake://<username>:<password>@<ACCOUNT>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
<TABLE> \
-k activity_id \
-c activity \
-w "event_timestamp < '2022-10-10'"

Check out the options available here.

Connection Methods

URI Strings

Note: Unless something is explicitly case sensitive (like your password), use all caps.

With password:

"snowflake://<USER>:<password>@<ACCOUNT>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>"

With SSO:

"snowflake://<USER>@<ACCOUNT>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>&authenticator=externalbrowser"

TOML Config File

There are two main sections to a TOML config file:

  1. Database Connection - define one or more databases that will be used by runs.
  2. Run Parameters - define default options that are inherited or overridden by specific runs.

Example

# datadiff.toml

## DATABASE CONNECTION ##
[database.postgres_connection]
driver = "postgresql"
database = "<PG_DATABASE>"
user = "<PG_USERNAME>"
password = ${PASSWORD_ENV_VARIABLE} # or "<PASSWORD_STRING>"

[database.snowflake_connection]
driver "snowflake"
database = "<SNOWFLAKE_DATABASE>"
user = "<SNOWFLAKE_USER>"
password = ${PASSWORD_ENV_VARIABLE} # or "<PASSWORD_STRING>"
# the info below is only required for snowflake
account = "<SNOWFLAKE_ACCOUNT>"
schema = "<SNOWFLAKE_SCHEMA>"
warehouse = "<SNOWFLAKE_WAREHOUSE>"
role = "<SNOWFLAKE_ROLE>"


## RUN PARAMETERS ##
[run.default]
verbose = true

[run.<RUN_NAME>]
# Source 1 ("left")
1.database = "postgres_connection"
1.table = "<TABLE_NAME_1>"

# Source 2 ("right")
2.database = "<snowflake_connection>"
2.table = "<TABLE_NAME_2>"

verbose = false

The following command line input uses the configuration defined above

data-diff \
--conf ~/config_files/datadiff.toml \
--run <RUN_NAME> \
-k <PRIMARY_KEY> \
-w "<TIMESTAMP_COL> < 'XXXX-XX-XX'"
note

When defining how a run connects to a database, you can use a URI string instead of a database defined in the "Database Connection" section:

  # Source 1 ("left")
1.database = "postgresql://<PG_DATABASE>:<PASSWORD>/"
1.table = "<TABLE_NAME>"

Inheritance and overriding parameters

CLI switches have the final say, and will override the settings defined by the configuration file.

Where should you store your TOML configuration file?

  • If you are using a single file with multiple configurations for multiple projects, store it in your home directory.
  • For project-specific TOML files, store the file in your project.
caution

Exclude the TOML file in your .gitignore, especially if it includes sensitive information such as passwords.