Skip to main content

Diff Monitors

Ways to create a data diff monitor

There are 3 ways to create a data diff monitor:

  1. From the Monitors page by clicking Create new monitor and then selecting Data diff as a type of the monitor.
  2. Clone an existing monitor by clicking Actions and then Clone in the header menu. This will pre-fill the form with the existing monitor configuration.
  3. Create a monitor directly from the data diff results by clicking Actions and Create monitor. This will pre-fill the configuration with the parent data diff settings, requiring updates only for the Schedule and Notifications sections.

Once a monitor is created and initial metrics collected, you can set up thresholds for the two metrics.

Creating a new diff monitor from scratch

Setting up a new diff monitor in Datafold Cloud is straightforward. You can configure it with the following parameters and options:

General

Diff type

Choose how you want to compare your data and whether the diff type is in-database or cross-database.

Schedule

Customize the frequency and timing of monitor executions. You can choose a specific hourly or daily time in UTC, or input a cron tab expression for more complex scheduling.

You can add notifications, sent through Slack or emails, which indicate whether a monitor has been executed.

Dataset A and B

Pick your connected data source(s). Then, choose the two datasets you want to compare. This can be a table or a view in your relational database.

If you need to compare just a subset of data (e.g., for a particular city or last two weeks), add a SQL filter.

Column remapping

When columns are the same data type but are named differently, column remapping allows you to align and compare them. This is useful when datasets have semantically identical columns with different names, such as userID and user_id. Datafold will surface any differences under the column name used in Dataset A.

Diff settings

Primary key

The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere as it is used for unique row identification during diffing. Multiple columns support compound primary key definitions.

Egress limit

The egress limit optimizes the diff process by terminating it once a predefined number of rows are downloaded. The limit is by default set to 1,000,000 by default.

When the egress limit is reached, the diffing process does not produce the same results each time it is run, as it is not deterministic (i.e., the order in which data is processed may vary).

The egress limit prevents redundant analysis in scenarios with minor, repetitive discrepancies, such as formatting differences (e.g., whitespace, rounding differences). For most use cases, it is impractical to continue diffing after it is known that datasets are substantially different.

Since the algorithm aims to detect and return every mismatched row/value, if the datasets have a large percentage of differing rows, the algorithm may be unable to take advantage of checksumming. This can cause a large amount of data to be pulled over the network, which slows down the diffing process, and increases the strain on the database.

Setting an egress limit prevents unwanted runtime and database load by stopping the operation early in cases of substantial dataset discrepancies. It is highly recommended to set an egress limit, taking into account these tradeoffs between cost/speed and rigor.

Per-column diff limit

By setting a per-column diff limit, Data Diff will stop identifying differences for any column after a number of differences is found, based on the limit. Data Diff will also stop searching for exclusive and duplicate primary keys after the limit is reached.

Setting a per-column diff limit enables your team to find data quality issues that arise during data reconciliation while minimizing compute and time spent searching for differences.

Columns to compare

Determine whether to compare all columns or select specific one(s). To optimize performance on large tables, it's recommended to exclude columns known to have unique values for every row, such as timestamp columns like "updated_at," or apply filters to limit the comparison scope.

Advanced Diff settings

These settings can be added after a monitor has been created.

The bisection factor sets the number of checksum segments that the dataset is divided into per comparison iteration. Increasing this value may improve performance for extremely large data sets. For example, setting the factor at 32 means the dataset will be divided into 32 segments for comparison.

The bisection threshold defines the row count threshold for segments. If the row count falls below this threshold, the diff will compare data outside of the warehouse. Raising this value can enhance performance, especially for datasets with substantial expected differences. For example, setting the threshold at 16384 means the diff will compare data outside of the warehouse for segments with fewer than 16384 rows.

Monitoring

Notifications are sent when either or both predefined thresholds are reached during a Diff Monitor. You can set a maximum threshold for the:

  • Number of different rows
  • Percentage of different rows

Results

The diff monitor run history shows the results from each run. Each run includes basic stats, along with metrics such as:

  • The total rows different: number of different rows according to data diff results.
  • Rows with different values: percentage of different rows relative to the total number of rows in dataset A according to data diff results. Note that the status Different doesn't automatically map into a notification/alert.

Click the Open Diff link for more granular information about a specific Data Diff.