
4.13 · intermediate · 4 min read

Doctrine ORM for Scraped Data Persistence

How to model scraped entities, write efficient upserts, and avoid Doctrine's classic memory pitfalls at scrape-scale.

What you’ll learn

  • Design entities for scraped products with idempotent storage.
  • Batch inserts with EntityManager::clear() to avoid memory leaks.
  • Use UPSERT (INSERT ... ON CONFLICT) when ORM-level merge is too slow.

Doctrine is excellent for transactional domain models. It was not designed for inserting 100k rows in a tight loop. For scraping, you need to know which Doctrine patterns to embrace and which to bypass.

Entity design for scraped data

<?php
// src/Entity/Product.php
namespace App\Entity;

use Doctrine\ORM\Mapping as ORM;

#[ORM\Entity]
#[ORM\Table(name: 'products')]
#[ORM\UniqueConstraint(name: 'uniq_url', columns: ['url'])]
#[ORM\Index(name: 'idx_sku', columns: ['sku'])]
#[ORM\Index(name: 'idx_scraped_at', columns: ['scraped_at'])]
class Product
{
  #[ORM\Id, ORM\GeneratedValue, ORM\Column]
  private ?int $id = null;

  #[ORM\Column(length: 500)]
  private string $url;

  #[ORM\Column(length: 100, nullable: true)]
  private ?string $sku = null;

  #[ORM\Column(length: 500)]
  private string $title;

  #[ORM\Column(type: 'decimal', precision: 10, scale: 2, nullable: true)]
  private ?string $price = null;

  #[ORM\Column(type: 'json', nullable: true)]
  private ?array $rawData = null;

  #[ORM\Column]
  private \DateTimeImmutable $scrapedAt;

  // getters, setters...
}

Key choices for scraped entities:

  • Unique constraint on the natural key (URL or SKU). Lets the database enforce uniqueness; lets you UPSERT.
  • scraped_at indexed. Most queries are "what changed since X."
  • raw_data JSON column. Keep the full parsed blob for debugging or schema evolution.
  • decimal for prices. Float corrupts money.
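The unique constraint also gives you a safety net against concurrent workers. A minimal sketch (assuming an injected `EntityManagerInterface` and the `Product` entity above): let the database reject the duplicate and catch the violation rather than checking first.

```php
<?php
use Doctrine\DBAL\Exception\UniqueConstraintViolationException;

try {
    $em->persist($product);
    $em->flush();
} catch (UniqueConstraintViolationException $e) {
    // Another worker already inserted this URL. Note: after a flush()
    // failure the EntityManager is closed and must be reset before reuse
    // (e.g. via ManagerRegistry::resetManager()).
}
```

This is a fallback for races, not a replacement for the UPSERT strategies below.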

Insertion: the batch-and-clear pattern

The naive loop leaks memory. Every persisted entity stays in the EntityManager's UnitOfWork.

// BAD: one flush round trip per row (slow), and every entity
// stays managed in the UnitOfWork, so memory grows until OOM
foreach ($items as $item) {
  $p = new Product();
  $p->setUrl($item['url']);
  $this->em->persist($p);
  $this->em->flush();
}

Correct pattern:

$batchSize = 100;
foreach ($items as $i => $item) {
  $p = new Product();
  $p->setUrl($item['url']);
  $this->em->persist($p);

  if (($i + 1) % $batchSize === 0) {
    $this->em->flush();
    $this->em->clear();  // detach all entities, freeing memory
  }
}
$this->em->flush();
$this->em->clear();

clear() is the critical line. Without it, the UnitOfWork grows unbounded. With it, memory stays flat.
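You can verify this empirically by logging memory inside the batch branch of the loop above. A sketch, assuming a PSR-3 logger is available as `$this->logger`:

```php
if (($i + 1) % $batchSize === 0) {
    $this->em->flush();
    $this->em->clear();
    // With clear() this reading plateaus; without it, it climbs every batch.
    $this->logger->debug(sprintf(
        'batch %d, memory %.1f MB',
        ($i + 1) / $batchSize,
        memory_get_usage(true) / 1048576
    ));
}
```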

UPSERT for "scrape or update"

Scrapers re-fetch the same URLs over time. You want "insert if new, update if exists." Three approaches:

1. find(), then update or persist

$existing = $this->repo->findOneBy(['url' => $url]);
if ($existing) {
  $existing->setPrice($price);
  $existing->setScrapedAt(new \DateTimeImmutable());
} else {
  $p = new Product();
  $p->setUrl($url);
  $p->setPrice($price);
  $this->em->persist($p);
}
$this->em->flush();

Clean but slow at scale. Each iteration does a SELECT.
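A middle ground is to prefetch existing rows for a whole chunk in one query, then decide update-vs-insert in memory. A sketch, assuming `$chunk` is an array of scraped items and `$repo` is the Product repository (Doctrine expands an array value in `findBy()` into an `IN (...)` clause):

```php
$urls = array_column($chunk, 'url');

// One SELECT for the whole chunk instead of one per item.
$existing = [];
foreach ($repo->findBy(['url' => $urls]) as $product) {
    $existing[$product->getUrl()] = $product;
}

foreach ($chunk as $item) {
    if (isset($existing[$item['url']])) {
        $existing[$item['url']]->setPrice($item['price']);
        $existing[$item['url']]->setScrapedAt(new \DateTimeImmutable());
    } else {
        $p = new Product();
        $p->setUrl($item['url']);
        $p->setPrice($item['price']);
        $this->em->persist($p);
    }
}
$this->em->flush();
```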

2. Doctrine's merge(), discouraged

EntityManager::merge() was removed in Doctrine ORM 3. Don't use it.

3. Native UPSERT via SQL

Postgres:

$sql = <<<'SQL'
INSERT INTO products (url, title, price, scraped_at)
VALUES (:url, :title, :price, :scraped_at)
ON CONFLICT (url) DO UPDATE SET
  title = EXCLUDED.title,
  price = EXCLUDED.price,
  scraped_at = EXCLUDED.scraped_at
SQL;
$this->em->getConnection()->executeStatement($sql, [
  'url' => $url, 'title' => $title, 'price' => $price,
  'scraped_at' => (new \DateTimeImmutable())->format('Y-m-d H:i:s'),
]);

Skips the ORM entirely. Fastest. The cost: no entity events fire, no lifecycle hooks. For pure scrape-and-store, that's fine.

MySQL equivalent uses ON DUPLICATE KEY UPDATE.
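The same upsert translated for MySQL looks like this (note that the `VALUES()` function is deprecated since MySQL 8.0.20 in favor of row aliases, but it still works and is the most widely compatible form):

```sql
INSERT INTO products (url, title, price, scraped_at)
VALUES (:url, :title, :price, :scraped_at)
ON DUPLICATE KEY UPDATE
  title = VALUES(title),
  price = VALUES(price),
  scraped_at = VALUES(scraped_at);
```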

Bulk insert with prepared statements

For pure inserts (no upsert):

$conn = $this->em->getConnection();
$conn->beginTransaction();
try {
  $stmt = $conn->prepare(
    'INSERT INTO products (url, title, price, scraped_at) VALUES (?, ?, ?, ?)'
  );
  foreach ($items as $item) {
    $stmt->bindValue(1, $item['url']);
    $stmt->bindValue(2, $item['title']);
    $stmt->bindValue(3, $item['price']);
    $stmt->bindValue(4, (new \DateTimeImmutable())->format('Y-m-d H:i:s'));
    $stmt->executeStatement();
  }
  $conn->commit();
} catch (\Throwable $e) {
  $conn->rollBack();
  throw $e;
}

10-100x faster than ORM persistence. Use when the ORM features aren't needed.

DBAL when ORM is too much

Doctrine\DBAL\Connection gives you a low-level SQL client with prepared statements and transactions. For pure scraping pipelines, much of the persistence layer is better in DBAL than ORM. Mix freely: ORM for domain queries, DBAL for bulk writes.
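For one-off writes, DBAL's convenience helpers save you from hand-writing SQL entirely. A sketch using `Connection::insert()` and `Connection::update()`, which build parameterized statements for you:

```php
$conn = $this->em->getConnection();

// INSERT INTO products (url, title, price, scraped_at) VALUES (?, ?, ?, ?)
$conn->insert('products', [
    'url' => $url,
    'title' => $title,
    'price' => $price,
    'scraped_at' => (new \DateTimeImmutable())->format('Y-m-d H:i:s'),
]);

// UPDATE products SET price = ? WHERE url = ?
$conn->update('products', ['price' => $price], ['url' => $url]);
```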

Migrations

Use doctrine/doctrine-migrations-bundle. Don't run doctrine:schema:update in production; it's a dev-only tool.

php bin/console make:migration
# review the generated SQL
php bin/console doctrine:migrations:migrate

For scraped tables, common indexes you'll add over time:

CREATE INDEX idx_scraped_at ON products(scraped_at);
CREATE INDEX idx_category_price ON products(category_id, price);
CREATE INDEX idx_title_gin ON products USING gin (to_tsvector('english', title));  -- full-text
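Indexes like the GIN one aren't expressible through attribute mapping, so they go into a hand-written migration. A sketch (the class name is whatever the generator produced for you):

```php
<?php
declare(strict_types=1);

namespace DoctrineMigrations;

use Doctrine\DBAL\Schema\Schema;
use Doctrine\Migrations\AbstractMigration;

final class Version20240101000000 extends AbstractMigration
{
    public function up(Schema $schema): void
    {
        $this->addSql(
            "CREATE INDEX idx_title_gin ON products USING gin (to_tsvector('english', title))"
        );
    }

    public function down(Schema $schema): void
    {
        $this->addSql('DROP INDEX idx_title_gin');
    }
}
```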

Read-vs-write entity managers

For very large projects, run two entity managers: a "scraping" EM for writes (small connection pool, batch-friendly) and a "default" EM for reads. Configure both in doctrine.yaml. Most projects don't need this; it's mentioned here because it's the next-level pattern.
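A doctrine.yaml sketch of the two-EM setup (connection names, env vars, and mapping paths are illustrative, not prescriptive):

```yaml
doctrine:
  dbal:
    default_connection: default
    connections:
      default:
        url: '%env(resolve:DATABASE_URL)%'
      scraping:
        url: '%env(resolve:SCRAPING_DATABASE_URL)%'
  orm:
    default_entity_manager: default
    entity_managers:
      default:
        connection: default
        mappings:
          App:
            dir: '%kernel.project_dir%/src/Entity'
            prefix: 'App\Entity'
      scraping:
        connection: scraping
        mappings:
          App:
            dir: '%kernel.project_dir%/src/Entity'
            prefix: 'App\Entity'
```

Inject the write-side manager via `ManagerRegistry::getManager('scraping')`.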

Schema for change tracking

If you care about price history (covered in §4.54):

CREATE TABLE product_snapshots (
  id BIGSERIAL PRIMARY KEY,
  product_id INT REFERENCES products(id),
  price NUMERIC(10,2),
  in_stock BOOLEAN,
  scraped_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX ON product_snapshots(product_id, scraped_at);

The products row holds the current state. product_snapshots rows record history. Insert a snapshot only when price/stock actually changed (diff-only storage).
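A diff-only write can be sketched with plain DBAL, comparing the current price before inserting a snapshot (assumes `$productId` and `$newPrice` come from the scrape, with the price already normalized to the same string format the database returns):

```php
$conn = $this->em->getConnection();

$currentPrice = $conn->fetchOne(
    'SELECT price FROM products WHERE id = :id',
    ['id' => $productId]
);

// Insert history only when something actually changed.
if ($currentPrice !== $newPrice) {
    $conn->insert('product_snapshots', [
        'product_id' => $productId,
        'price' => $newPrice,
        'scraped_at' => (new \DateTimeImmutable())->format('Y-m-d H:i:sP'),
    ]);
}
```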

Hands-on lab

Build a Product entity for Catalog108. In a ScrapeProductHandler:

  1. Use UPSERT via native SQL on insert.
  2. After every 100 items, flush() and clear() the EntityManager.
  3. After 1000 inserts, check memory: memory_get_usage(true). It should be flat or near-flat.

If memory grows linearly, your clear() is missing. The pattern is: persist, flush, clear, repeat.


Quiz, check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.

Question 1 of 8

Why does a naive `persist + flush` loop over 100,000 scraped items run out of memory?
