Data Diff
API documentation for comparing datasets programmatically using Data Diff.

Creating a new Diff

This method creates a new data diff job asynchronously and returns an id that can later be used to retrieve the diff results.
Endpoint
/api/v1/datadiffs
Method
POST
Request Content-Type
application/json
Response Content-Type
application/json
Example request body:

{
  "data_source1_id": 10,
  "data_source2_id": 10,
  "table1": ["df", "prod", "page_views"],
  "query2": "SELECT * FROM df.dev.page_views WHERE date > '2020-01-01'",
  "materialize_dataset1": true,
  "pk_columns": ["organization_id", "user_email"],
  "filter1": "date > '2020-01-01'",
  "sampling_tolerance": 0.0001,
  "sampling_confidence": 99.9,
  "diff_tolerances_per_column": [
    {"column_name": "col_name",
     "tolerance_value": 0.001,
     "tolerance_mode": "absolute"}
  ]
}
Parameters:

data_source1_id (required, integer): ID of the first data source. You can look it up in the Datafold UI: Settings -> Data Sources -> Connection; it is the last number in the URL.

data_source2_id (required, integer): ID of the second data source. It should be the same as data_source1_id, unless you are doing cross-database diffs.

table1 ([string]): Use this table as the first data source. The table is specified as a list of name segments. Names are case-sensitive.

query1 (string): Use the provided query as the first data source. Either table1 or query1 must be specified.

table2 ([string]): Use this table as the second data source.

query2 (string): Use the provided query as the second data source. Either table2 or query2 must be specified.

materialize_dataset1 (optional, bool): Force materialization of dataset 1. Useful if table1 is a view that needs to be materialized prior to diffing. If a query is used as a dataset, this flag is ignored.

materialize_dataset2 (optional, bool): Force materialization of dataset 2.

pk_columns (required, [string]): List of primary key columns. Names are case-sensitive.

include_columns (optional, [string]): Include only the specified columns, besides the PK columns.

exclude_columns (optional, [string]): Exclude these columns from the diff. Useful for excluding data transfer timestamps and the like.

time_column (optional, string): Time column to use for the Data Diff time view.

filter1 (optional, string): SQL filter for dataset 1. It is injected as-is into WHERE clauses during diffing.

filter2 (optional, string): SQL filter for dataset 2.

time_travel_point1 (optional, string): Selects a specific version of the table. Snowflake:
  • Relative, using a negative integer (e.g. -220) for an offset in seconds.
  • At a point in time, using a timestamp (e.g. 2021-08-13T00:00:00).
  • At a certain statement, using the query id.
BigQuery:
  • At a point in time, using a timestamp (e.g. 2021-08-13T00:00:00).

time_travel_point2 (optional, string): Same as above, for the second data source.

sampling_tolerance (optional, float): Ratio of the acceptable number of differing rows to the total number of rows. Used only with sampling.

sampling_confidence (optional, float): Confidence that the number of differing rows is within sampling_tolerance, in percent, e.g. 99.9.

diff_tolerances_per_column (optional, [object]): List of objects configuring a tolerance value per column.

diff_tolerances_per_column.column_name (required, string): The name of the column to which the tolerance applies. The column must be of type integer or float; the tolerance is ignored otherwise.

diff_tolerances_per_column.tolerance_value (required, float): The tolerance value to apply.

diff_tolerances_per_column.tolerance_mode (required, string): The tolerance mode to apply:
  • absolute: abs(b - a) < X
  • relative: abs((b - a) / a) < X
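The two tolerance modes amount to simple numeric predicates. The sketch below is only an illustration of the comparison they describe, not Datafold's actual implementation; within_tolerance is a hypothetical helper name:

```python
def within_tolerance(a, b, tolerance_value, tolerance_mode):
    """Return True if values a and b are considered equal under the tolerance."""
    if tolerance_mode == "absolute":
        # absolute: abs(b - a) < X
        return abs(b - a) < tolerance_value
    elif tolerance_mode == "relative":
        # relative: abs((b - a) / a) < X  (undefined when a == 0)
        return abs((b - a) / a) < tolerance_value
    raise ValueError(f"unknown tolerance_mode: {tolerance_mode}")

# 100.0005 vs 100.0 differs by 0.0005, which is under an absolute tolerance of 0.001
print(within_tolerance(100.0, 100.0005, 0.001, "absolute"))  # True
```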
The call returns all data from the request, with missing values filled in with either nulls or defaults. In addition, it returns the diff id that can be used later to fetch the diff results.
{
  "id": 69004,
  "user_id": 1,
  "data_source1_id": 7,
  "data_source2_id": 7,
  "table1": ["df", "prod", "page_views"],
  "table2": null,
  "query1": null,
  "query2": "SELECT * FROM df.dev.page_views WHERE date > '2020-01-01'",
  "filter1": null,
  "filter2": null,
  "pk_columns": ["col0"],
  "include_columns": [],
  "exclude_columns": [],
  "time_column": null,
  "time_aggregate": null,
  "time_interval_start": null,
  "time_interval_end": null,
  "sampling_tolerance": null,
  "sampling_confidence": null,
  "materialize_dataset1": false,
  "materialize_dataset2": false,
  "done": false,
  "result_statuses": {}
}
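The id field in this response is the handle for all subsequent calls. As a quick illustration (the diff dict below is just the example response above, abridged to the relevant field), the results URL is built from it:

```python
# Parsed JSON body of the POST response, abridged
diff = {"id": 69004, "done": False}

# URL of the summary results endpoint for this diff
results_url = f"https://app.datafold.com/api/v1/datadiffs/{diff['id']}/summary_results"
print(results_url)
# https://app.datafold.com/api/v1/datadiffs/69004/summary_results
```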

Getting Diff Results

This method retrieves the results of a previously created data diff.
Endpoint
/api/v1/datadiffs/<id>/summary_results
Method
GET
Response Content-Type
application/json
Immediately after the diff is submitted, the status will indicate that it is waiting or running:

{
  "status": "running"
}
When the diff task is done, the status will be set to success and additional fields will contain high-level statistics:

{
  "status": "success",

  "schema": {
    "columns_mismatched": [3, 4],
    "column_type_mismatches": 0,
    "column_reorders": 0,
    "column_counts": [20, 21]
  },
  "pks": {
    "total_rows": [17343000, 19294000],
    "nulls": [0, 0],
    "dupes": [0, 0],
    "exclusives": [0, 1951000],
    "distincts": [17343000, 19294000]
  },
  "values": {
    "total_rows": 17343000,
    "rows_with_differences": 13780000,
    "total_values": 277488000,
    "values_with_differences": 19505000,
    "compared_columns": 16,
    "columns_with_differences": 5
  }
}
schema.columns_mismatched: The number of columns across both compared datasets that didn't match. Datafold matches columns by name and type.

schema.column_type_mismatches: The number of columns across both compared datasets that had a matching name but different types.

schema.column_reorders: The approximate number of columns that were reordered (changed ordinal position between datasets).

schema.column_counts: Column counts in each dataset that were considered for diffing (after include/exclude filters).

pks.total_rows: The total number of rows in each dataset.

pks.nulls: The number of NULL primary keys in each dataset. For a composite primary key, if any part of it is NULL, the whole key is counted as NULL.

pks.dupes: The number of rows with duplicate primary keys in each dataset.

pks.exclusives: The number of primary keys exclusive to the first and to the second dataset.

pks.distincts: The number of distinct primary keys in each dataset.

values.total_rows: The total number of rows for which value comparison was done. Rows with exclusive, duplicate, and NULL primary keys are excluded from the comparison.

values.compared_columns: The number of columns compared between datasets (not counting primary keys). Mismatched columns (different names or types) are not compared.

values.total_values: The number of values (individual dataset cells) that were compared, excluding primary keys.

values.rows_with_differences: The number of rows with at least one value differing between the datasets.

values.columns_with_differences: The number of columns with at least one value differing between the datasets.

values.values_with_differences: The number of individual cells, across both rows and columns, that differ between datasets.
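A client will typically turn these raw counters into rates. A minimal sketch using the values block from the example payload above (the variable names are illustrative, not part of the API):

```python
# The "values" block from the example summary_results payload above
values = {
    "total_rows": 17343000,
    "rows_with_differences": 13780000,
    "total_values": 277488000,
    "values_with_differences": 19505000,
}

# Share of compared rows that have at least one differing value
row_diff_rate = values["rows_with_differences"] / values["total_rows"]
# Share of individual compared cells that differ
value_diff_rate = values["values_with_differences"] / values["total_values"]

print(f"rows differing: {row_diff_rate:.1%}, values differing: {value_diff_rate:.1%}")
# rows differing: 79.5%, values differing: 7.0%
```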
In case of error, the status will be set to error and the error field will contain the error description:

{
  "status": "error",
  "error": {
    "error_type": "NoSuchColumnException",
    "error_value": "unique_key"
  }
}

Example: using DataDiff with curl

$ cat >diff_request.json << EOF
{
  "data_source1_id": 138,
  "data_source2_id": 138,
  "table1": ["datadiff-demo", "public", "taxi_trips"],
  "table2": ["datadiff-demo", "dev", "taxi_trips"],
  "pk_columns": ["trip_id"]
}
EOF

$ curl -d "@diff_request.json" \
    -X POST -H 'Content-Type: application/json' \
    -H 'Authorization: Key YOUR_DATAFOLD_API_KEY' \
    'https://app.datafold.com/api/v1/datadiffs' | json_pp

# use the diff id returned by the previous request
$ curl -H 'Authorization: Key YOUR_DATAFOLD_API_KEY' \
    'https://app.datafold.com/api/v1/datadiffs/12345/summary_results' | json_pp

Example: using DataDiff with Python and requests lib

#!/usr/bin/env python3
import pprint
import time

import requests

HEADERS = {'Authorization': 'Key YOUR_DATAFOLD_API_KEY'}
DIFF_REQUEST = {
    'data_source1_id': 138,
    'data_source2_id': 138,
    'table1': ['datadiff-demo', 'public', 'taxi_trips'],
    'table2': ['datadiff-demo', 'dev', 'taxi_trips'],
    'pk_columns': ['unique_key'],
}

# Create a new Diff task
resp = requests.post(
    'https://app.datafold.com/api/v1/datadiffs',
    headers=HEADERS,
    json=DIFF_REQUEST,
)
diff = resp.json()

# Poll until we get results back
while True:
    resp = requests.get(
        f'https://app.datafold.com/api/v1/datadiffs/{diff["id"]}/'
        'summary_results',
        headers=HEADERS,
    )
    results = resp.json()

    if results['status'] == 'error':
        print('ERROR!', results['error'])
        break
    elif results['status'] == 'success':
        pprint.pprint(results)
        break

    time.sleep(20)
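For longer-running diffs, the bare polling loop above can be hardened with a deadline. A sketch under stated assumptions: wait_for_diff is a hypothetical helper, the timeout and poll interval are arbitrary choices rather than API requirements, and get_results stands in for any callable that returns the summary_results payload (such as a wrapper around the GET request shown above):

```python
import time


def wait_for_diff(get_results, timeout=600, poll_interval=20):
    """Poll get_results() until the diff reaches a terminal status or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        results = get_results()
        # "success" and "error" are the terminal statuses shown in this doc
        if results.get("status") in ("success", "error"):
            return results
        time.sleep(poll_interval)
    raise TimeoutError("diff did not finish before the timeout")


# Demonstration with a stub instead of a real HTTP call:
responses = iter([{"status": "running"}, {"status": "success"}])
print(wait_for_diff(lambda: next(responses), timeout=5, poll_interval=0))
# {'status': 'success'}
```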