Our MongoDB integration is still in beta. Some features, such as column-level lineage, are not yet supported. Please contact us if you need assistance.
Steps to complete:
  1. Configure user in MongoDB
  2. Configure your data connection in Datafold
  3. Diff your data

Configure user in MongoDB

To connect to MongoDB, create a user with read-only access to all databases you plan to diff.
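As a minimal sketch, a read-only user can be described with MongoDB's `createUser` command and the built-in `read` role, granted once per database. The helper name, the `datafold` user name, the placeholder password, and the database names below are hypothetical; adapt them to your deployment.

```python
def build_read_only_user_command(username, password, databases):
    """Build a MongoDB `createUser` command document that grants the
    built-in `read` role on each database in `databases`."""
    if not databases:
        raise ValueError("at least one database is required")
    return {
        "createUser": username,
        "pwd": password,
        "roles": [{"role": "read", "db": name} for name in databases],
    }


# Hypothetical names: replace with your own user and databases.
command = build_read_only_user_command(
    "datafold", "change-me", ["analytics", "sales"]
)
```

The resulting document is intended to be run against the authentication database (for example, admin) by an account that is allowed to manage users.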

Configure in Datafold

Field name | Description
Connection name | The name you’d like to assign to this connection in Datafold.
Host | The hostname for your MongoDB instance. For MongoDB Atlas, use your cluster hostname (e.g., cluster0.mongodb.net).
Port | MongoDB endpoint port (default value is 27017). Not required for MongoDB Atlas connections.
User ID | User ID (e.g., DATAFOLD).
Password | Password for the user provided above.
Database | Database to connect to.
Authentication database | Database name associated with the user credentials (e.g., admin).
Click Create. Your data connection is now ready!
MongoDB Atlas clusters (using *.mongodb.net domains) automatically use the SRV protocol for connection, which performs server discovery and load balancing. Other MongoDB deployments connect directly to the specified host and port.
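To illustrate the difference, the sketch below maps the connection fields onto the two standard MongoDB connection string formats. It is not Datafold's implementation: the function name and the rule "hosts ending in .mongodb.net use mongodb+srv" are assumptions drawn from the description above, and the sample hosts and password are made up.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, user, password, database, auth_db="admin", port=27017):
    """Return a MongoDB connection string for the given connection fields.

    Assumption: hosts under mongodb.net are Atlas clusters and use the
    mongodb+srv scheme, which does not take a port.
    """
    # Credentials must be percent-encoded in connection strings.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    options = f"authSource={quote_plus(auth_db)}"
    if host.lower().endswith(".mongodb.net"):
        return f"mongodb+srv://{credentials}@{host}/{database}?{options}"
    return f"mongodb://{credentials}@{host}:{port}/{database}?{options}"


atlas_uri = build_mongo_uri("cluster0.mongodb.net", "DATAFOLD", "p@ss", "shop")
direct_uri = build_mongo_uri("db.internal.example", "DATAFOLD", "p@ss", "shop")
```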

Diff your data

Write your MongoDB query

MongoDB works differently from our relational database integrations: you write native MongoDB queries wrapped in a JSON structure. Here’s how to diff your MongoDB data:
  1. Create a new data diff.
  2. Select your MongoDB data connection.
  3. Select Query diff. Only query diffs are supported for MongoDB. Table diffs are not available.
  4. Write a native MongoDB query using JSON format. All queries must include a collection name and an operation type (either find or aggregate).

Query structure

All MongoDB queries use a JSON wrapper with the following structure:
{
  "collection": "collection_name",
  "operation": "find" | "aggregate",
  ...operation-specific fields
}
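Before pasting a query into Datafold, it can help to check the wrapper locally. The sketch below only verifies what this page states (valid JSON, a collection name, and an operation of find or aggregate); the function name is hypothetical and Datafold's own validation may differ.

```python
import json

SUPPORTED_OPERATIONS = ("find", "aggregate")


def validate_query_wrapper(text):
    """Parse a query wrapper and check the two required fields.

    Returns the parsed dict, or raises ValueError describing the problem.
    """
    try:
        wrapper = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"query is not valid JSON: {exc}") from exc
    if not isinstance(wrapper, dict):
        raise ValueError("query must be a JSON object")
    collection = wrapper.get("collection")
    if not isinstance(collection, str) or not collection:
        raise ValueError('"collection" must be a non-empty string')
    if wrapper.get("operation") not in SUPPORTED_OPERATIONS:
        raise ValueError('"operation" must be "find" or "aggregate"')
    return wrapper


parsed = validate_query_wrapper('{"collection": "users", "operation": "find"}')
```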

Supported operations

We support two MongoDB operations: find and aggregate.
Primary keys are required for data diffs. Your query results must include the primary key field (typically _id in MongoDB) to run a diff. The find operation is the typical and recommended approach for data diffs, as it preserves document identity. The aggregate operation may not be suitable for diffs if the pipeline transforms or groups data in a way that loses the primary key.

Find operation

Use find to query documents with filters, projections, and sorting. This is the typical operation for data diffs:
{
  "collection": "users",
  "operation": "find",
  "query": {"age": {"$gt": 25}},
  "projection": {"_id": 1, "name": 1, "email": 1},
  "sort": [["created_at", -1]],
  "limit": 100
}
Find operation fields:
  • collection (required): The collection name.
  • operation (required): Must be "find".
  • query (optional): Filter criteria using MongoDB query operators (default: {}).
  • projection (optional): Fields to include or exclude. Important: Always include the primary key field (typically _id) in your projection for data diffs to work.
  • sort (optional): Sort specification as an array of [field, direction] pairs.
  • limit (optional): Maximum number of documents to return.
  • skip (optional): Number of documents to skip for pagination.
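The projection rule can be checked mechanically. The sketch below applies standard MongoDB projection semantics to a find projection: _id is returned unless explicitly excluded, and any other field must be listed in an inclusion projection or left out of an exclusion projection. The function name is hypothetical, and the check is simplified: it handles only top-level field names with 0/1 or true/false values, not dotted paths or projection operators.

```python
def projection_keeps_field(projection, field="_id"):
    """Return True if a find projection would return `field`."""
    if not projection:
        return True  # no projection: every field is returned
    if field in projection:
        return bool(projection[field])
    if field == "_id":
        return True  # _id is included by default
    # Inclusion vs. exclusion mode is set by the non-_id entries;
    # if only _id is listed, its own value decides the mode.
    others = [bool(v) for k, v in projection.items() if k != "_id"]
    is_inclusion = any(others) if others else bool(projection["_id"])
    return not is_inclusion


keeps_id = projection_keeps_field({"_id": 1, "name": 1, "email": 1})
drops_id = projection_keeps_field({"_id": 0, "name": 1})
```

For a collection keyed on another field, pass that name as the second argument, for example `projection_keeps_field(projection, "point_id")`.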

Aggregate operation

Use aggregate to run aggregation pipelines with multiple stages. Note: Aggregate queries may not be suitable for data diffs if the pipeline groups or transforms data in a way that removes the primary key:
{
  "collection": "orders",
  "operation": "aggregate",
  "pipeline": [
    {"$match": {"status": "completed"}},
    {"$group": {
      "_id": "$customerId",
      "total": {"$sum": "$amount"},
      "count": {"$sum": 1}
    }},
    {"$sort": {"total": -1}},
    {"$limit": 10}
  ]
}
Aggregate operation fields:
  • collection (required): The collection name.
  • operation (required): Must be "aggregate".
  • pipeline (required): Array of aggregation stages (e.g., $match, $group, $project, $lookup, $sort, $limit). Important: Ensure your pipeline preserves the primary key field if you need to run a data diff.
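One way to spot pipelines that may lose the primary key is to scan for stages that regroup, replace, or multiply documents. The sketch below is a heuristic under that assumption, not a Datafold feature: the function name and the list of flagged stages are illustrative and not exhaustive.

```python
# Stages whose output documents no longer map one-to-one onto source
# documents, so `_id` may be redefined, duplicated, or dropped.
IDENTITY_CHANGING_STAGES = (
    "$group", "$bucket", "$bucketAuto", "$sortByCount",
    "$count", "$facet", "$unwind",
)


def stages_affecting_primary_key(pipeline, key="_id"):
    """Return (index, stage_name) pairs for stages that may drop or
    redefine `key`."""
    flagged = []
    for index, stage in enumerate(pipeline):
        for name, spec in stage.items():
            if name in IDENTITY_CHANGING_STAGES:
                flagged.append((index, name))
            elif (name == "$project" and isinstance(spec, dict)
                  and spec.get(key) in (0, False)):
                # An explicit exclusion of the key in $project.
                flagged.append((index, name))
    return flagged


pipeline = [
    {"$match": {"status": "completed"}},
    {"$group": {"_id": "$customerId", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
flagged = stages_affecting_primary_key(pipeline)
```

A flagged stage is not necessarily a problem: after a $group, the group key becomes the new _id, which can still serve as the diff key if both sides of the diff are grouped the same way.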

Examples

Find with nested fields:
{
  "collection": "tracks_v1_1m",
  "operation": "find",
  "query": {"point_id": {"$lt": 100000}},
  "projection": {
    "point_id": 1,
    "device_id": 1,
    "timestamp": 1,
    "location.longitude": 1,
    "location.latitude": 1
  }
}
Aggregate with grouping:
{
  "collection": "sales",
  "operation": "aggregate",
  "pipeline": [
    {"$match": {"year": 2024}},
    {"$group": {
      "_id": "$product",
      "totalRevenue": {"$sum": "$amount"},
      "avgPrice": {"$avg": "$price"}
    }},
    {"$sort": {"totalRevenue": -1}}
  ]
}
  5. Configure the rest of your diff and run it.
Datafold automatically limits data transfer to 10 GB per query to ensure optimal performance. It estimates record size by sampling your data and applies the limit based on that estimate.