Skip to main content

CLI

Options

CategoryConfig keyCLI switch                                       Descriptiondbt       joindiffhashdiff
help--helpShow help message and exit.
Schemakey_columns-k or --key-columnsName of the primary key column. If none provided, default is 'id'.
Schemaupdate_column-t or --update-columnName of updated_at/last_updated column.
Schemacolumns-c or --columnsNames of extra columns to compare. Can be used more than once in the same command. Accepts a name or a pattern, like in SQL. Example: -c col% -c another_col -c %foorb.r%
Schemaassume_unique_key--assume-unique-keySkip validating the uniqueness of the key column during joindiff, which is costly in non-cloud dbs.
Filteringmin_age--min-ageConsiders only rows older than specified. Useful for specifying replication lag. Example: --min-age=5min ignores rows from the last 5 minutes. Valid units: d, days, h, hours, min, minutes, mon, months, s, seconds, w, weeks, y, years
Filteringmax_age--max-ageConsiders only rows younger than specified. See --min-age.
Filteringwhere-w, --whereAn additional 'where' expression to restrict the search space.
Performancelimit-l or --limitMaximum number of differences to find (limits maximum bandwidth and runtime).
Performancethreads-j or --threadsNumber of worker threads to use per database. Default=1.
Performancealgorithm-a, --algorithmForce algorithm choice. Options: auto, joindiff, hashdiff
Performancebisection_threshold--bisection-thresholdMinimal size of segment to be split. Smaller segments will be downloaded and compared locally.
Performancebisection_factor--bisection-factorSegments per iteration. When set to 2, it performs binary search.
Outputstats-s or --statsPrint stats instead of a detailed diff.
Outputdebug-d or --debugPrint debug info.
Outputinteractive-i or --interactiveConfirm queries, implies --debug
Outputverbose-v or --verbosePrint extra info.
Outputjson--jsonPrint JSONL output for machine readability.
Outputsample_exclusive_rows--sample-exclusive-rowsSample several rows that only appear in one of the tables, but not the other. Use with -s.
Outputmaterialize_all_rows--materialize-all-rowsMaterialize every row, even if they are the same, instead of just the differing rows.
Outputmaterialize-m, --materializeMaterialize the diff results into a new table in the database. If a table exists by that name, it will be replaced. Use %t in the name to place a timestamp. Example: -m test_mat_%t
Outputtable_write_limit--table-write-limitMaximum number of rows to write when creating materialized or sample tables, per thread. Default=1000.
Settings--conf, --runSpecify the run and configuration from a TOML file.
Settingsno_tracking--no-trackingdata-diff sends home anonymous usage data. Use this to disable it.

Examples

dbt

For dbt examples, check out our documentation on Development Testing with Open Source.

joindiff

data-diff \
"snowflake://<username>:<password>@<ACCOUNT>/<DATABASE>/<SCHEMA_1>?warehouse=<WAREHOUSE>&role=<ROLE>" <TABLE_1> \
<SCHEMA_2>.<TABLE_2> \
-k org_id \
-c created_at -c is_internal \
-w "org_id != 1 and org_id < 2000" \
-m test_results_%t \
--materialize-all-rows \
--table-write-limit 10000

-m materializes the results into the specified table. %t will get replaced by the current timestamp.

Check out the rest of the options here.

hashdiff

data-diff \
postgresql://<username>:'<password>'@localhost:5432/<database> \
<table> \
"snowflake://<username>:<password>@<ACCOUNT>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
<TABLE> \
-k activity_id \
-c activity \
-w "event_timestamp < '2022-10-10'"

Check out the options available here.

Connection Methods

URI Strings

Note: Unless something is explicitly case sensitive (like your password), use all caps.

With password:

"snowflake://<USER>:<password>@<ACCOUNT>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>"

With SSO:

"snowflake://<USER>@<ACCOUNT>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>&authenticator=externalbrowser"

TOML Config File

There are two main sections to a TOML config file:

  1. Database Connection - define one or more databases that will be used by runs.
  2. Run Parameters - define default options that are inherited or overridden by specific runs.

Example

# datadiff.toml

## DATABASE CONNECTION ##
[database.postgres_connection]
driver = "postgresql"
database = "<PG_DATABASE>"
user = "<PG_USERNAME>"
password = ${PASSWORD_ENV_VARIABLE} # or "<PASSWORD_STRING>"

[database.snowflake_connection]
driver "snowflake"
database = "<SNOWFLAKE_DATABASE>"
user = "<SNOWFLAKE_USER>"
password = ${PASSWORD_ENV_VARIABLE} # or "<PASSWORD_STRING>"
# the info below is only required for snowflake
account = "<SNOWFLAKE_ACCOUNT>"
schema = "<SNOWFLAKE_SCHEMA>"
warehouse = "<SNOWFLAKE_WAREHOUSE>"
role = "<SNOWFLAKE_ROLE>"


## RUN PARAMETERS ##
[run.default]
verbose = true

[run.<RUN_NAME>]
# Source 1 ("left")
1.database = "postgres_connection"
1.table = "<TABLE_NAME_1>"

# Source 2 ("right")
2.database = "<snowflake_connection>"
2.table = "<TABLE_NAME_2>"

verbose = false

The following command line input uses the configuration defined above

data-diff \
--conf ~/config_files/datadiff.toml \
--run <RUN_NAME> \
-k <PRIMARY_KEY> \
-w "<TIMESTAMP_COL> < 'XXXX-XX-XX'"
note

When defining how a run connects to a database, you can use a URI string instead of a database defined in the "Database Connection" section:

  # Source 1 ("left")
1.database = "postgresql://<PG_DATABASE>:<PASSWORD>/"
1.table = "<TABLE_NAME>"

Inheritance and overriding parameters

CLI switches have the final say, and will override the settings defined by the configuration file.

Where should you store your TOML configuration file?

  • If you are using a single file with multiple configurations for multiple projects, store it in your home directory.
  • For project-specific TOML files, store the file in your project.
caution

Exclude the TOML file in your .gitignore, especially if it includes sensitive information such as passwords.