PHP Parser Comparison: DOMDocument vs DomCrawler vs paquettg
Three popular PHP HTML parsers compared on the same page: DOMDocument (stdlib), Symfony DomCrawler, and paquettg/php-html-parser. Honest tradeoffs.
What you’ll learn
- Implement the same scrape with all three parsers on the same page.
- Compare verbosity, performance, encoding handling, and feature surface.
- Pick a default parser for your projects.
You've seen DOMDocument and DomCrawler. A third popular option in the PHP world is paquettg/php-html-parser, a pure-PHP, jQuery-like selector engine that some projects standardize on. This lesson runs the same scrape through all three on /blog, side by side.
The task
Scrape blog post cards from /blog: title, excerpt, author, link.
Option 1: DOMDocument + DOMXPath
<?php
$html = file_get_contents('https://practice.scrapingcentral.com/blog');
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
$posts = [];
foreach ($xpath->query('//article[contains(@class, "post")]') as $article) {
    $posts[] = [
        'title'   => trim($xpath->evaluate('string(.//h2)', $article)),
        'excerpt' => trim($xpath->evaluate('string(.//p[contains(@class, "excerpt")])', $article)),
        'author'  => trim($xpath->evaluate('string(.//*[contains(@class, "author")])', $article)),
        'url'     => $xpath->evaluate('string(.//h2/a/@href)', $article),
    ];
}
print_r($posts);
Pros: zero dependencies, ships with PHP. Cons: verbose, UTF-8 workaround required, libxml warning suppression boilerplate.
Option 2: Symfony DomCrawler
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://practice.scrapingcentral.com/blog');
$crawler = new Crawler($html);
$posts = $crawler->filter('article.post')->each(function (Crawler $article) {
    return [
        'title'   => $article->filter('h2')->text(''),
        'excerpt' => $article->filter('p.excerpt')->text(''),
        'author'  => $article->filter('.author')->text(''),
        'url'     => $article->filter('h2 a')->attr('href'),
    ];
});
print_r($posts);
Pros: clean, encoding handled, CSS + XPath, Symfony-integrated. Cons: Composer dependency, libxml-backed (so HTML5 quirks inherited).
Option 3: paquettg/php-html-parser
<?php
require 'vendor/autoload.php';
use PHPHtmlParser\Dom;
$dom = new Dom();
$dom->loadFromUrl('https://practice.scrapingcentral.com/blog');
$posts = [];
foreach ($dom->find('article.post') as $article) {
    $posts[] = [
        'title'   => trim($article->find('h2', 0)->text),
        'excerpt' => trim($article->find('p.excerpt', 0)->text),
        'author'  => trim($article->find('.author', 0)->text),
        'url'     => $article->find('h2 a', 0)->getAttribute('href'),
    ];
}
print_r($posts);
Install: composer require paquettg/php-html-parser.
Pros: very jQuery-like syntax, friendly for JS developers, no libxml. Cons: pure-PHP parser is slower on large documents; less actively maintained than DomCrawler; some edge-case HTML5 differences.
Feature comparison
| Concern | DOMDocument | DomCrawler | paquettg |
|---|---|---|---|
| Dependency | None (stdlib) | Composer (symfony/dom-crawler, symfony/css-selector) | Composer |
| CSS selectors | No (XPath only) | Yes | Yes |
| XPath | Yes | Yes | No |
| Encoding | Manual UTF-8 workaround | Automatic | Automatic |
| Speed | Fast (libxml) | Fast (libxml) | Slower (pure PHP) |
| HTML5 awareness | Limited | Limited (libxml) | Better |
| API style | Verbose, DOM-classic | Fluent, jQuery-ish | jQuery-clone |
| Form/session helpers | No | Via BrowserKit | No |
| Active maintenance | Stable | Active | Less active |
Performance rough numbers
On a typical 100KB HTML page, parsing + 10 queries:
- DOMDocument + DOMXPath, ~5-10 ms
- DomCrawler, ~6-12 ms (small overhead over raw DOMDocument)
- paquettg, ~30-80 ms (pure PHP, no libxml)
Differences only matter when you parse thousands of pages per minute. For 100 pages over coffee, all three feel identical.
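To reproduce numbers like these yourself, here's a minimal, network-free benchmark sketch. It times the DOMDocument + DOMXPath path against a synthetic page of post cards (the markup is made up); swap in a saved copy of /blog for realistic figures, and repeat the loop body with the other two parsers to compare.

```php
<?php
// Build a synthetic page of 500 post cards so the benchmark needs no network.
$cards = str_repeat(
    '<article class="post"><h2><a href="/post">Title</a></h2>'
    . '<p class="excerpt">Excerpt text</p><span class="author">Jane</span></article>',
    500
);
$html = '<html><body>' . $cards . '</body></html>';

$iterations = 50;
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);
    $count = $xpath->query('//article[contains(@class, "post")]')->length;
}
$msPerIteration = (microtime(true) - $start) * 1000 / $iterations;
printf("%d posts parsed, %.2f ms per iteration\n", $count, $msPerIteration);
```

Timing parse + query together, as here, matters: paquettg's gap versus libxml is mostly in the parse step, so a benchmark that parses once and queries many times will understate the difference.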
Memory characteristics
DOMDocument and DomCrawler share the libxml memory model, which is relatively compact. paquettg builds a PHP object for every node, so memory grows faster with document size. On very large pages (1MB+ of HTML, lots of elements), paquettg can struggle where libxml-based parsers shrug.
Encoding handling
This is where DOMDocument is most painful:
- DOMDocument: defaults to ISO-8859-1; you must prefix the markup with <?xml encoding="UTF-8"?> (or use PHP 8.4+'s Dom\HTMLDocument).
- DomCrawler: auto-detects from <meta charset>.
- paquettg: auto-detects.
If you handle a lot of non-English content, this alone is enough to skip raw DOMDocument.
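A quick demonstration of the failure mode (a minimal sketch; the sample string is made up):

```php
<?php
// UTF-8 text with no <meta charset>; libxml falls back to ISO-8859-1.
$html = '<p>café naïve Außenseiter</p>';

$broken = new DOMDocument();
$broken->loadHTML($html);                             // bytes misread as Latin-1

$fixed = new DOMDocument();
$fixed->loadHTML('<?xml encoding="UTF-8"?>' . $html); // the workaround

echo trim($broken->textContent) . "\n"; // mojibake (cafÃ© ...)
echo trim($fixed->textContent) . "\n";  // café naïve Außenseiter
```

On PHP 8.4+, Dom\HTMLDocument::createFromString handles UTF-8 correctly without the prefix hack.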
Form submission
- DOMDocument / paquettg: nothing built-in. You parse the form fields out manually and POST them yourself with cURL/Guzzle.
- DomCrawler + BrowserKit: the killer combo. selectButton('Submit')->form([...]) plus $browser->submit($form).
If your scraping involves logins, multi-step flows, or CSRF-protected forms, DomCrawler + BrowserKit pays for itself immediately.
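For contrast, here's roughly what the manual route looks like, the field-harvesting step that BrowserKit automates for you. The form markup and field names below are hypothetical:

```php
<?php
// Hypothetical login form (in practice you'd fetch the real page first).
$html = <<<'HTML'
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <button type="submit">Log in</button>
</form>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Harvest every named input, preserving hidden fields like the CSRF token.
$fields = [];
foreach ($xpath->query('//form//input[@name]') as $input) {
    $fields[$input->getAttribute('name')] = $input->getAttribute('value');
}
$fields['username'] = 'alice';  // fill in credentials
$fields['password'] = 'secret';

// You would now POST $fields to the form's action URL with cURL or Guzzle,
// carrying cookies between requests yourself.
print_r($fields);
```

It works, but every login flow you scrape repeats this boilerplate, and you own the cookie jar, redirects, and re-fetching on failure.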
XPath availability
XPath is more powerful than CSS for axis-based traversal and text-content matching (//dt[text()="Brand"]). DOMXPath and DomCrawler's filterXPath() give you full XPath; paquettg doesn't. Sites with complex semi-structured layouts are easier in XPath; pure CSS handles 90% but trips on the last 10%.
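For example, pulling a value out of a definition list by its label text, something pure CSS selectors can't express (the markup here is made up):

```php
<?php
$html = '<dl><dt>Brand</dt><dd>Acme</dd><dt>Model</dt><dd>X-100</dd></dl>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match a node by its text content, then hop to its sibling via an axis;
// neither step exists in CSS.
$brand = $xpath->evaluate('string(//dt[text()="Brand"]/following-sibling::dd[1])');
echo $brand . "\n"; // Acme
```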
Pick a default
For most PHP scraping work:
- Use DomCrawler as your default. It's actively maintained, ergonomic, fast, and integrates with BrowserKit for stateful flows.
- Use DOMDocument when you can't install Composer dependencies, or for the smallest one-off scripts.
- Use paquettg if your team strongly prefers a jQuery-like syntax and you're working with pages where speed isn't critical.
There's no wrong answer here: all three produce correct results. But DomCrawler is the one most likely to scale with your project.
Hands-on lab
Implement the blog scrape with all three parsers (you've already got the code above). Use microtime(true) to compare wall time over 50 iterations. Print the first 3 results from each to confirm they match. Notice the ergonomic differences: which felt easier? Which would you reach for next time?