Doctrine ORM for Scraped Data Persistence
How to model scraped entities, write efficient upserts, and avoid Doctrine's classic memory pitfalls at scrape-scale.
What you’ll learn
- Design entities for scraped products with idempotent storage.
- Batch inserts with EntityManager::clear() to avoid memory leaks.
- Use UPSERT (INSERT ... ON CONFLICT) when ORM-level merge is too slow.
Doctrine is excellent for transactional domain models. It was not designed for inserting 100k rows in a tight loop. For scraping, you need to know which Doctrine patterns to embrace and which to bypass.
Entity design for scraped data
<?php
// src/Entity/Product.php
namespace App\Entity;

use Doctrine\ORM\Mapping as ORM;

#[ORM\Entity]
#[ORM\Table(name: 'products')]
#[ORM\UniqueConstraint(name: 'uniq_url', columns: ['url'])]
#[ORM\Index(name: 'idx_sku', columns: ['sku'])]
#[ORM\Index(name: 'idx_scraped_at', columns: ['scraped_at'])]
class Product
{
    #[ORM\Id, ORM\GeneratedValue, ORM\Column]
    private ?int $id = null;

    #[ORM\Column(length: 500)]
    private string $url;

    #[ORM\Column(length: 100, nullable: true)]
    private ?string $sku = null;

    #[ORM\Column(length: 500)]
    private string $title;

    #[ORM\Column(type: 'decimal', precision: 10, scale: 2, nullable: true)]
    private ?string $price = null;

    #[ORM\Column(type: 'json', nullable: true)]
    private ?array $rawData = null;

    #[ORM\Column]
    private \DateTimeImmutable $scrapedAt;

    // getters, setters...
}
Key choices for scraped entities:
- Unique constraint on the natural key (URL or SKU). Lets the database enforce uniqueness; lets you UPSERT.
- scraped_at indexed. Most queries are "what changed since X."
- raw_data JSON column. Keep the full parsed blob for debugging or schema evolution.
- decimal for prices. Float corrupts money.
Insertion: the batch-and-clear pattern
The naive loop leaks memory. Every persisted entity stays in the EntityManager's UnitOfWork.
// BAD: flushes every row and never clears the UnitOfWork; OOM at large scales
foreach ($items as $item) {
    $p = new Product();
    $p->setUrl($item['url']);
    $this->em->persist($p);
    $this->em->flush();
}
Correct pattern:
$batchSize = 100;
foreach ($items as $i => $item) {
    $p = new Product();
    $p->setUrl($item['url']);
    $this->em->persist($p);

    if (($i + 1) % $batchSize === 0) {
        $this->em->flush();
        $this->em->clear(); // detach all entities, freeing memory
    }
}
$this->em->flush(); // flush the final partial batch
$this->em->clear();
clear() is the critical line. Without it, the UnitOfWork grows unbounded. With it, memory stays flat.
UPSERT for "scrape or update"
Scrapers re-fetch the same URLs over time. You want "insert if new, update if exists." Three approaches:
1. find(), then update or persist
$existing = $this->repo->findOneBy(['url' => $url]);
if ($existing) {
    $existing->setPrice($price);
    $existing->setScrapedAt(new \DateTimeImmutable());
} else {
    $p = new Product();
    $p->setUrl($url);
    $p->setPrice($price);
    $this->em->persist($p);
}
$this->em->flush();
Clean but slow at scale. Each iteration does a SELECT.
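A middle ground, sketched here as an assumption rather than part of the lesson: prefetch the batch's existing URLs in one `WHERE url IN (...)` query, then split the batch in memory. The partitionByKnown() helper below is hypothetical:

```php
<?php
// Hypothetical helper: split a scraped batch into new rows and updates,
// given the set of URLs already stored (fetched once per batch with
// e.g. SELECT url FROM products WHERE url IN (...)).
function partitionByKnown(array $items, array $knownUrls): array
{
    $known = array_flip($knownUrls); // O(1) membership checks
    $new = [];
    $updates = [];
    foreach ($items as $item) {
        if (isset($known[$item['url']])) {
            $updates[] = $item;
        } else {
            $new[] = $item;
        }
    }
    return [$new, $updates];
}
```

One SELECT per batch instead of one per row; the new rows can then take a fast pure-insert path while updates are handled separately.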
2. Doctrine's merge(), discouraged
EntityManager::merge() was removed in Doctrine ORM 3. Don't use it.
3. Native UPSERT via SQL
Postgres:
$sql = <<<'SQL'
INSERT INTO products (url, title, price, scraped_at)
VALUES (:url, :title, :price, :scraped_at)
ON CONFLICT (url) DO UPDATE SET
title = EXCLUDED.title,
price = EXCLUDED.price,
scraped_at = EXCLUDED.scraped_at
SQL;
$this->em->getConnection()->executeStatement($sql, [
    'url' => $url,
    'title' => $title,
    'price' => $price,
    'scraped_at' => (new \DateTimeImmutable())->format('Y-m-d H:i:s'),
]);
Skips the ORM entirely. Fastest. The cost: no entity events fire, no lifecycle hooks. For pure scrape-and-store, that's fine.
MySQL equivalent uses ON DUPLICATE KEY UPDATE.
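For reference, a MySQL version of the same statement might look like this (column list assumed identical to the Postgres example above):

```sql
INSERT INTO products (url, title, price, scraped_at)
VALUES (:url, :title, :price, :scraped_at)
ON DUPLICATE KEY UPDATE
    title = VALUES(title),
    price = VALUES(price),
    scraped_at = VALUES(scraped_at);
```

Note that MySQL 8.0.19+ prefers the row-alias form (`VALUES (...) AS new` then `new.title`); the `VALUES()` function shown here still works but is deprecated in newer releases.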
Bulk insert with a prepared statement
For pure inserts (no upsert):
$conn = $this->em->getConnection();
$conn->beginTransaction();
try {
    $stmt = $conn->prepare(
        'INSERT INTO products (url, title, price, scraped_at) VALUES (?, ?, ?, ?)'
    );
    foreach ($items as $item) {
        $stmt->executeStatement([
            $item['url'],
            $item['title'],
            $item['price'],
            (new \DateTimeImmutable())->format('Y-m-d H:i:s'),
        ]);
    }
    $conn->commit();
} catch (\Throwable $e) {
    $conn->rollBack();
    throw $e;
}
10-100x faster than ORM persistence. Use when the ORM features aren't needed.
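You can cut round trips further by packing many rows into one multi-row INSERT. A sketch of the statement builder (buildBulkInsert() is a hypothetical helper; execute its result with $conn->executeStatement($sql, $params)):

```php
<?php
// Hypothetical helper: build one multi-row INSERT for a batch of scraped items.
// Returns [$sql, $params] ready for a single executeStatement() call.
function buildBulkInsert(array $rows): array
{
    $placeholders = [];
    $params = [];
    foreach ($rows as $row) {
        $placeholders[] = '(?, ?, ?, ?)';
        $params[] = $row['url'];
        $params[] = $row['title'];
        $params[] = $row['price'];
        $params[] = $row['scraped_at'];
    }
    $sql = 'INSERT INTO products (url, title, price, scraped_at) VALUES '
         . implode(', ', $placeholders);
    return [$sql, $params];
}
```

Keep batches to a few hundred rows so you stay well under per-statement placeholder limits (Postgres's wire protocol caps a statement at 65,535 bound parameters).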
DBAL when ORM is too much
Doctrine\DBAL\Connection gives you a low-level SQL client with prepared statements and transactions. For pure scraping pipelines, much of the persistence layer is better in DBAL than ORM. Mix freely: ORM for domain queries, DBAL for bulk writes.
Migrations
Use doctrine/doctrine-migrations-bundle. Don't rely on doctrine:schema:update in production; it's a dev-only convenience.
php bin/console make:migration
# review the generated SQL
php bin/console doctrine:migrations:migrate
For scraped tables, common indexes you'll add over time:
CREATE INDEX idx_scraped_at ON products(scraped_at);
CREATE INDEX idx_category_price ON products(category_id, price);
CREATE INDEX idx_title_gin ON products USING gin (to_tsvector('english', title)); -- full-text
Read-vs-write entity managers
For very large projects, run two entity managers: a "scraping" EM for writes (small connection pool, batch-friendly) and a "default" EM for reads. Configure this in doctrine.yaml. Most projects don't need it, but it's the next step when a single EM becomes a bottleneck.
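A minimal sketch of what that could look like in config/packages/doctrine.yaml (connection names and mapping options are illustrative, not prescribed by this lesson):

```yaml
doctrine:
    dbal:
        default_connection: default
        connections:
            default:
                url: '%env(resolve:DATABASE_URL)%'
            scraping:
                url: '%env(resolve:DATABASE_URL)%'
    orm:
        default_entity_manager: default
        entity_managers:
            default:
                connection: default
                mappings:
                    App:
                        type: attribute
                        is_bundle: false
                        dir: '%kernel.project_dir%/src/Entity'
                        prefix: 'App\Entity'
            scraping:
                connection: scraping
                mappings:
                    App:
                        type: attribute
                        is_bundle: false
                        dir: '%kernel.project_dir%/src/Entity'
                        prefix: 'App\Entity'
```

Inject the write-side EM where needed via the `doctrine.orm.scraping_entity_manager` service id that DoctrineBundle generates from this config.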
Schema for change tracking
If you care about price history (covered in §4.54):
CREATE TABLE product_snapshots (
id BIGSERIAL PRIMARY KEY,
product_id INT REFERENCES products(id),
price NUMERIC(10,2),
in_stock BOOLEAN,
scraped_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX ON product_snapshots(product_id, scraped_at);
The products row holds the current state. product_snapshots rows record history. Insert a snapshot only when price/stock actually changed (diff-only storage).
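The "only when changed" check can live in PHP just before the snapshot INSERT. A sketch under stated assumptions (shouldSnapshot() is a hypothetical helper; $current is the state from the products row, $scraped the fresh scrape):

```php
<?php
// Hypothetical helper: decide whether a new snapshot row is warranted.
// Prices are compared as decimal strings, the way Doctrine hydrates them.
function shouldSnapshot(?array $current, array $scraped): bool
{
    if ($current === null) {
        return true; // first sighting of this product: always record
    }
    // Record only when a tracked field actually changed (diff-only storage).
    return $current['price'] !== $scraped['price']
        || $current['in_stock'] !== $scraped['in_stock'];
}
```

When it returns true, insert into product_snapshots and update the products row; when false, just bump products.scraped_at.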
Hands-on lab
Build a Product entity for Catalog108. In a ScrapeProductHandler:
- Use UPSERT via native SQL on insert.
- After every 100 items, call $em->clear().
- After 1000 inserts, check memory with memory_get_usage(true). It should be flat or near-flat.
If memory grows linearly, your clear() is missing. The pattern is: persist, flush, clear, repeat.
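One way to run that memory check, as a small illustrative helper (not part of the lab scaffold), called after each flush()/clear() pair:

```php
<?php
// Illustrative helper: report resident memory after each batch, so a leak
// shows up as a steadily growing number instead of a surprise OOM.
function batchMemoryReport(int $batch): string
{
    $mib = memory_get_usage(true) / (1024 * 1024);
    return sprintf('batch %d: %.1f MiB', $batch, $mib);
}
```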
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.