
4.13 · intermediate · 4 min read

Doctrine ORM for Scraped Data Persistence

How to model scraped entities, write efficient upserts, and avoid Doctrine's classic memory pitfalls at scrape-scale.

What you’ll learn

  • Design entities for scraped products with idempotent storage.
  • Batch inserts with EntityManager::clear() to avoid memory leaks.
  • Use UPSERT (INSERT ... ON CONFLICT) when ORM-level merge is too slow.

Doctrine is excellent for transactional domain models. It was not designed for inserting 100k rows in a tight loop. For scraping, you need to know which Doctrine patterns to embrace and which to bypass.

Entity design for scraped data

<?php
// src/Entity/Product.php
namespace App\Entity;

use Doctrine\ORM\Mapping as ORM;

#[ORM\Entity]
#[ORM\Table(name: 'products')]
#[ORM\UniqueConstraint(name: 'uniq_url', columns: ['url'])]
#[ORM\Index(name: 'idx_sku', columns: ['sku'])]
#[ORM\Index(name: 'idx_scraped_at', columns: ['scraped_at'])]
class Product
{
  #[ORM\Id, ORM\GeneratedValue, ORM\Column]
  private ?int $id = null;

  #[ORM\Column(length: 500)]
  private string $url;

  #[ORM\Column(length: 100, nullable: true)]
  private ?string $sku = null;

  #[ORM\Column(length: 500)]
  private string $title;

  #[ORM\Column(type: 'decimal', precision: 10, scale: 2, nullable: true)]
  private ?string $price = null;

  #[ORM\Column(type: 'json', nullable: true)]
  private ?array $rawData = null;

  #[ORM\Column]
  private \DateTimeImmutable $scrapedAt;

  // getters, setters...
}

Key choices for scraped entities:

  • Unique constraint on the natural key (URL or SKU). Lets the database enforce uniqueness; lets you UPSERT.
  • scraped_at indexed. Most queries are "what changed since X."
  • raw_data JSON column. Keep the full parsed blob for debugging or schema evolution.
  • decimal for prices. Float corrupts money.
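The unique constraint also gives you a safety net against concurrent workers. A minimal sketch (assuming an injected `EntityManagerInterface` and the `Product` entity above): let the database reject the duplicate and catch the violation rather than checking first.

```php
<?php
use Doctrine\DBAL\Exception\UniqueConstraintViolationException;

try {
    $em->persist($product);
    $em->flush();
} catch (UniqueConstraintViolationException $e) {
    // Another worker already inserted this URL. Note: after a flush()
    // failure the EntityManager is closed and must be reset before reuse
    // (e.g. via ManagerRegistry::resetManager()).
}
```

This is a fallback for races, not a replacement for the UPSERT strategies below.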

Insertion: the batch-and-clear pattern

The naive loop leaks memory. Every persisted entity stays in the EntityManager's UnitOfWork.

// BAD: one flush round trip per row (slow), and every entity
// stays managed in the UnitOfWork, so memory grows until OOM
foreach ($items as $item) {
  $p = new Product();
  $p->setUrl($item['url']);
  $this->em->persist($p);
  $this->em->flush();
}

Correct pattern:

$batchSize = 100;
foreach ($items as $i => $item) {
  $p = new Product();
  $p->setUrl($item['url']);
  $this->em->persist($p);

  if (($i + 1) % $batchSize === 0) {
    $this->em->flush();
    $this->em->clear();  // detach all entities, freeing memory
  }
}
$this->em->flush();
$this->em->clear();

clear() is the critical line. Without it, the UnitOfWork grows unbounded. With it, memory stays flat.
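You can verify this empirically by logging memory inside the batch branch of the loop above. A sketch, assuming a PSR-3 logger is available as `$this->logger`:

```php
if (($i + 1) % $batchSize === 0) {
    $this->em->flush();
    $this->em->clear();
    // With clear() this reading plateaus; without it, it climbs every batch.
    $this->logger->debug(sprintf(
        'batch %d, memory %.1f MB',
        ($i + 1) / $batchSize,
        memory_get_usage(true) / 1048576
    ));
}
```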

UPSERT for "scrape or update"

Scrapers re-fetch the same URLs over time. You want "insert if new, update if exists." Three approaches:

1. find(), then update or persist

$existing = $this->repo->findOneBy(['url' => $url]);
if ($existing) {
  $existing->setPrice($price);
  $existing->setScrapedAt(new \DateTimeImmutable());
} else {
  $p = new Product();
  $p->setUrl($url);
  $p->setPrice($price);
  $this->em->persist($p);
}
$this->em->flush();

Clean but slow at scale. Each iteration does a SELECT.
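A middle ground is to prefetch existing rows for a whole chunk in one query, then decide update-vs-insert in memory. A sketch, assuming `$chunk` is an array of scraped items and `$repo` is the Product repository (Doctrine expands an array value in `findBy()` into an `IN (...)` clause):

```php
$urls = array_column($chunk, 'url');

// One SELECT for the whole chunk instead of one per item.
$existing = [];
foreach ($repo->findBy(['url' => $urls]) as $product) {
    $existing[$product->getUrl()] = $product;
}

foreach ($chunk as $item) {
    if (isset($existing[$item['url']])) {
        $existing[$item['url']]->setPrice($item['price']);
        $existing[$item['url']]->setScrapedAt(new \DateTimeImmutable());
    } else {
        $p = new Product();
        $p->setUrl($item['url']);
        $p->setPrice($item['price']);
        $this->em->persist($p);
    }
}
$this->em->flush();
```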

2. Doctrine's merge(), discouraged

EntityManager::merge() was removed in Doctrine ORM 3. Don't use it.

3. Native UPSERT via SQL

Postgres:

$sql = <<<'SQL'
INSERT INTO products (url, title, price, scraped_at)
VALUES (:url, :title, :price, :scraped_at)
ON CONFLICT (url) DO UPDATE SET
  title = EXCLUDED.title,
  price = EXCLUDED.price,
  scraped_at = EXCLUDED.scraped_at
SQL;
$this->em->getConnection()->executeStatement($sql, [
  'url' => $url, 'title' => $title, 'price' => $price,
  'scraped_at' => (new \DateTimeImmutable())->format('Y-m-d H:i:s'),
]);

Skips the ORM entirely. Fastest. The cost: no entity events fire, no lifecycle hooks. For pure scrape-and-store, that's fine.

MySQL equivalent uses ON DUPLICATE KEY UPDATE.
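The same upsert translated for MySQL looks like this (note that the `VALUES()` function is deprecated since MySQL 8.0.20 in favor of row aliases, but it still works and is the most widely compatible form):

```sql
INSERT INTO products (url, title, price, scraped_at)
VALUES (:url, :title, :price, :scraped_at)
ON DUPLICATE KEY UPDATE
  title = VALUES(title),
  price = VALUES(price),
  scraped_at = VALUES(scraped_at);
```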

Bulk insert with prepared statements

For pure inserts (no upsert):

$conn = $this->em->getConnection();
$conn->beginTransaction();
try {
  $stmt = $conn->prepare(
    'INSERT INTO products (url, title, price, scraped_at) VALUES (?, ?, ?, ?)'
  );
  foreach ($items as $item) {
    $stmt->bindValue(1, $item['url']);
    $stmt->bindValue(2, $item['title']);
    $stmt->bindValue(3, $item['price']);
    $stmt->bindValue(4, (new \DateTimeImmutable())->format('Y-m-d H:i:s'));
    $stmt->executeStatement();
  }
  $conn->commit();
} catch (\Throwable $e) {
  $conn->rollBack();
  throw $e;
}

10-100x faster than ORM persistence. Use when the ORM features aren't needed.

DBAL when ORM is too much

Doctrine\DBAL\Connection gives you a low-level SQL client with prepared statements and transactions. For pure scraping pipelines, much of the persistence layer is better in DBAL than ORM. Mix freely: ORM for domain queries, DBAL for bulk writes.
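For one-off writes, DBAL's convenience helpers save you from hand-writing SQL entirely. A sketch using `Connection::insert()` and `Connection::update()`, which build parameterized statements for you:

```php
$conn = $this->em->getConnection();

// INSERT INTO products (url, title, price, scraped_at) VALUES (?, ?, ?, ?)
$conn->insert('products', [
    'url' => $url,
    'title' => $title,
    'price' => $price,
    'scraped_at' => (new \DateTimeImmutable())->format('Y-m-d H:i:s'),
]);

// UPDATE products SET price = ? WHERE url = ?
$conn->update('products', ['price' => $price], ['url' => $url]);
```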

Migrations

Use doctrine/doctrine-migrations-bundle. Don't run doctrine:schema:update in production; it's a dev-only tool.

php bin/console make:migration
# review the generated SQL
php bin/console doctrine:migrations:migrate

For scraped tables, common indexes you'll add over time:

CREATE INDEX idx_scraped_at ON products(scraped_at);
CREATE INDEX idx_category_price ON products(category_id, price);
CREATE INDEX idx_title_gin ON products USING gin (to_tsvector('english', title));  -- full-text
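Indexes like the GIN one aren't expressible through attribute mapping, so they go into a hand-written migration. A sketch (the class name is whatever the generator produced for you):

```php
<?php
declare(strict_types=1);

namespace DoctrineMigrations;

use Doctrine\DBAL\Schema\Schema;
use Doctrine\Migrations\AbstractMigration;

final class Version20240101000000 extends AbstractMigration
{
    public function up(Schema $schema): void
    {
        $this->addSql(
            "CREATE INDEX idx_title_gin ON products USING gin (to_tsvector('english', title))"
        );
    }

    public function down(Schema $schema): void
    {
        $this->addSql('DROP INDEX idx_title_gin');
    }
}
```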

Read-vs-write entity managers

For very large projects, run two entity managers: a "scraping" EM for writes (small connection pool, batch-friendly) and a "default" EM for reads. Configure both in doctrine.yaml. Most projects don't need this; it's mentioned here because it's the next-level pattern.
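A doctrine.yaml sketch of the two-EM setup (connection names, env vars, and mapping paths are illustrative, not prescriptive):

```yaml
doctrine:
  dbal:
    default_connection: default
    connections:
      default:
        url: '%env(resolve:DATABASE_URL)%'
      scraping:
        url: '%env(resolve:SCRAPING_DATABASE_URL)%'
  orm:
    default_entity_manager: default
    entity_managers:
      default:
        connection: default
        mappings:
          App:
            dir: '%kernel.project_dir%/src/Entity'
            prefix: 'App\Entity'
      scraping:
        connection: scraping
        mappings:
          App:
            dir: '%kernel.project_dir%/src/Entity'
            prefix: 'App\Entity'
```

Inject the write-side manager via `ManagerRegistry::getManager('scraping')`.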

Schema for change tracking

If you care about price history (covered in §4.54):

CREATE TABLE product_snapshots (
  id BIGSERIAL PRIMARY KEY,
  product_id INT REFERENCES products(id),
  price NUMERIC(10,2),
  in_stock BOOLEAN,
  scraped_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX ON product_snapshots(product_id, scraped_at);

The products row holds the current state. product_snapshots rows record history. Insert a snapshot only when price/stock actually changed (diff-only storage).
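A diff-only write can be sketched with plain DBAL, comparing the current price before inserting a snapshot (assumes `$productId` and `$newPrice` come from the scrape, with the price already normalized to the same string format the database returns):

```php
$conn = $this->em->getConnection();

$currentPrice = $conn->fetchOne(
    'SELECT price FROM products WHERE id = :id',
    ['id' => $productId]
);

// Insert history only when something actually changed.
if ($currentPrice !== $newPrice) {
    $conn->insert('product_snapshots', [
        'product_id' => $productId,
        'price' => $newPrice,
        'scraped_at' => (new \DateTimeImmutable())->format('Y-m-d H:i:sP'),
    ]);
}
```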

Hands-on lab

Build a Product entity for Catalog108. In a ScrapeProductHandler:

  1. Use UPSERT via native SQL on insert.
  2. After every 100 items, flush() and clear() the EntityManager.
  3. After 1000 inserts, check memory: memory_get_usage(true). It should be flat or near-flat.

If memory grows linearly, your clear() is missing. The pattern is: persist, flush, clear, repeat.


Quiz, check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.

Question 1 of 8

Why does a naive `persist + flush` loop over 100,000 scraped items run out of memory?
