

MongoDB for Nested/Variable Data

When scraped documents have deep nesting, optional fields, or wildly varying shapes per source, MongoDB is often easier than coercing them into Postgres rows.

What you’ll learn

  • Identify when document storage beats relational storage.
  • Design indexes for nested-field queries.
  • Use upserts and bulk writes for idempotent ingest.

Postgres + JSONB handles most scraped data fine. MongoDB earns its keep when documents are deeply nested, schemas drift across sources, and you want to query inside nested arrays without writing PostgreSQL-specific JSON path syntax.

When Mongo wins

Workload                            | Postgres            | MongoDB
Structured rows (products, prices)  | Easier              | Fine
Wildly variable JSON per source     | JSONB works         | Native fit
Deep nested arrays you query a lot  | Awkward path syntax | First-class
Transactional consistency           | Strong (ACID)       | Supported, but needs a replica set
Aggregation pipelines               | SQL                 | Aggregation framework, powerful

A common pattern: relational core (Postgres) + Mongo for raw HTML snapshots, scraped review trees, or anything that's "a tree per item, hundreds of fields, half of them optional."

Document model

A scraped product with nested reviews and variant tree:

{
  "_id": ObjectId("..."),
  "source": "catalog108",
  "source_id": "P-1042",
  "url": "https://practice.scrapingcentral.com/products/1042",
  "title": "Stainless Blender",
  "price": { "amount_cents": 4999, "currency": "USD" },
  "variants": [
  { "color": "red", "sku": "1042-R", "in_stock": true },
  { "color": "blue", "sku": "1042-B", "in_stock": false }
  ],
  "reviews": [
  { "stars": 5, "text": "...", "author": "alex", "date": ISODate("...") }
  ],
  "scraped_at": ISODate("...")
}

In Postgres, you could store this as JSONB. In Mongo, it's the native shape: queries like "products with at least one out-of-stock variant" don't need path operators.
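As a sketch, that "at least one out-of-stock variant" filter is just a plain dict with a dotted path; the JSONB comparison in the comment uses Postgres's containment operator:

```python
# Filter matching products where ANY element of the "variants" array
# has in_stock == false. Dotted paths reach inside arrays automatically.
out_of_stock = {"variants.in_stock": False}

# With a pymongo collection this would be:
#   products.find(out_of_stock)
#
# Rough Postgres/JSONB equivalent, assuming the document lives in a
# JSONB column named "doc":
#   SELECT * FROM products
#   WHERE doc @> '{"variants": [{"in_stock": false}]}';
```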

Idempotent upsert

In Python with pymongo:

from datetime import datetime

from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017/")
products = client.scraping.products

products.update_one(
    {"source": item["source"], "source_id": item["source_id"]},
    {"$set": {**item, "scraped_at": datetime.utcnow()}},
    upsert=True,
)

For bulk:

ops = [
    UpdateOne(
        {"source": p["source"], "source_id": p["source_id"]},
        {"$set": {**p, "scraped_at": datetime.utcnow()}},
        upsert=True,
    )
    for p in batch
]
products.bulk_write(ops, ordered=False)

ordered=False lets the driver keep going after individual failures instead of stopping at the first one; that matters when some docs in the batch are invalid.
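When a batch partially fails, pymongo raises BulkWriteError, whose .details dict reports what succeeded and which writes failed. A minimal sketch of picking it apart (the helper name is ours, not pymongo's):

```python
def summarize_bulk_errors(details):
    """Pull the useful bits out of a BulkWriteError.details dict."""
    failed = details.get("writeErrors", [])
    return {
        "upserted": details.get("nUpserted", 0),
        "modified": details.get("nModified", 0),
        # (batch index, error code, message) per failed write
        "failed": [(e["index"], e.get("code"), e.get("errmsg", "")) for e in failed],
    }

# Usage with pymongo (not executed here):
#   from pymongo.errors import BulkWriteError
#   try:
#       products.bulk_write(ops, ordered=False)
#   except BulkWriteError as err:
#       print(summarize_bulk_errors(err.details))
```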

In PHP with the mongodb extension:

$manager = new MongoDB\Driver\Manager("mongodb://localhost:27017");
$bulk = new MongoDB\Driver\BulkWrite(['ordered' => false]);
foreach ($items as $item) {
    $bulk->update(
        ['source' => $item['source'], 'source_id' => $item['source_id']],
        ['$set' => $item + ['scraped_at' => new MongoDB\BSON\UTCDateTime()]],
        ['upsert' => true]
    );
}
$manager->executeBulkWrite('scraping.products', $bulk);

Indexes for nested queries

// Unique composite, same role as Postgres' UNIQUE (source, source_id)
db.products.createIndex({source: 1, source_id: 1}, {unique: true})

// Query by latest scrape
db.products.createIndex({scraped_at: -1})

// Query inside variants
db.products.createIndex({"variants.in_stock": 1})

// Text search across title and reviews.text
db.products.createIndex({title: "text", "reviews.text": "text"})

Mongo indexes nested fields with dotted notation. variants.in_stock lets you query db.products.find({"variants.in_stock": false}) efficiently.
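One subtlety worth knowing: when you combine two dotted conditions on the same array, different array elements may satisfy them independently. $elemMatch pins all conditions to a single element. Sketched as plain filter dicts:

```python
# Matches if SOME variant is red and SOME (possibly different) variant
# is out of stock:
loose = {"variants.color": "red", "variants.in_stock": False}

# Matches only if a SINGLE variant is both red and out of stock:
strict = {"variants": {"$elemMatch": {"color": "red", "in_stock": False}}}
```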

Aggregation pipeline, Mongo's superpower

Find the average price per source, counting only products with at least one in-stock variant:

db.products.aggregate([
  {$match: {"variants.in_stock": true}},
  {$group: {
    _id: "$source",
    avg_price: {$avg: "$price.amount_cents"},
    count: {$sum: 1}
  }},
  {$sort: {avg_price: -1}}
])

This kind of query in Postgres against JSONB is doable but uglier. For analytical work over messy nested data, the aggregation framework is genuinely nicer.
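From pymongo, the same pipeline is just Python data; a sketch against the products collection defined earlier:

```python
# Same stages as the shell version, as a list of plain dicts.
pipeline = [
    {"$match": {"variants.in_stock": True}},
    {"$group": {
        "_id": "$source",
        "avg_price": {"$avg": "$price.amount_cents"},
        "count": {"$sum": 1},
    }},
    {"$sort": {"avg_price": -1}},
]

# results = list(products.aggregate(pipeline))
```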

Schema drift handling

Because Mongo doesn't enforce schema, schema drift is "free" but quiet. Two practices keep this manageable:

  1. JSON Schema validation at the collection level. Mongo can enforce required fields and types without going full SQL:
db.createCollection("products", {
validator: { $jsonSchema: {
bsonType: "object",
required: ["source", "source_id", "url"],
properties: {
source: { bsonType: "string" },
price: { bsonType: "object", properties: { amount_cents: {bsonType: "int"} }}
}
}}
})
  2. Versioning the document shape. Add schema_version: 2 and migrate readers, not writers; old docs can stay until they're naturally rewritten.
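A migrate-on-read normalizer for that versioning scheme might look like this (the v1 shape here, a bare integer price, is a made-up example):

```python
CURRENT_VERSION = 2

def normalize(doc):
    """Upgrade older document shapes to the current one at read time."""
    version = doc.get("schema_version", 1)
    if version < 2:
        # Hypothetical v1 shape: price was a bare integer of cents.
        if isinstance(doc.get("price"), int):
            doc["price"] = {"amount_cents": doc["price"], "currency": "USD"}
        doc["schema_version"] = 2
    return doc
```

Readers call normalize() on every doc they load; writers never have to touch old documents.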

Sizing and sharding

  • A Mongo document is capped at 16MB. For HTML snapshots, store the body in S3 / GridFS and keep only metadata in the doc.
  • Replication via replica sets is standard; sharding is for ~TB-scale and beyond.
  • Indexes must fit in RAM for fast queries. Watch db.collection.stats().indexSizes.

When NOT to use Mongo

  • You need complex transactions spanning many collections. Postgres is still easier.
  • The data is fundamentally tabular. Don't reach for Mongo because you "want NoSQL."
  • Your team is more comfortable with SQL. Operations matter more than theoretical fit.

What to try

Run Mongo locally (docker run -d -p 27017:27017 mongo:7), load 1000 scraped products with nested variants and reviews, then write an aggregation pipeline answering "top 10 most-reviewed products per source." That single query in Mongo is one pipeline; in Postgres + JSONB it's a small essay.
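One way to sketch that "top 10 most-reviewed per source" pipeline uses the $topN accumulator (available since MongoDB 5.2, so fine on the mongo:7 image above); the stage layout here is our suggestion, not the only answer:

```python
# "Top 10 most-reviewed products per source" as one aggregation pipeline.
pipeline = [
    # Count reviews per product; $ifNull guards docs missing "reviews".
    {"$project": {
        "source": 1,
        "title": 1,
        "n_reviews": {"$size": {"$ifNull": ["$reviews", []]}},
    }},
    # Keep the 10 most-reviewed products within each source.
    {"$group": {
        "_id": "$source",
        "top": {"$topN": {
            "n": 10,
            "sortBy": {"n_reviews": -1},
            "output": {"title": "$title", "n_reviews": "$n_reviews"},
        }},
    }},
]

# results = list(products.aggregate(pipeline))
```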
