
Lesson 1.20 · Intermediate · 4 min read

PHP Parser Comparison: DOMDocument vs DomCrawler vs paquettg

Three popular PHP HTML parsers compared on the same page: DOMDocument (stdlib), Symfony DomCrawler, and paquettg/php-html-parser. Honest tradeoffs.

What you’ll learn

  • Implement the same scrape with all three parsers on the same page.
  • Compare verbosity, performance, encoding handling, and feature surface.
  • Pick a default parser for your projects.

You've seen DOMDocument and DomCrawler. A third popular option in the PHP world is paquettg/php-html-parser, a pure-PHP, jQuery-like selector engine that some projects standardize on. This lesson runs the same scrape through all three on /blog, side by side.

The task

Scrape blog post cards from /blog: title, excerpt, author, link.

Option 1: DOMDocument + DOMXPath

<?php
$html = file_get_contents('https://practice.scrapingcentral.com/blog');
if ($html === false) {
    exit("Could not fetch /blog\n");
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);                    // collect parse warnings instead of printing them
$doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);  // the prefix makes libxml read the bytes as UTF-8
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$posts = [];

// string(...) evaluates to '' when nothing matches, so a missing field never throws.
foreach ($xpath->query('//article[contains(@class, "post")]') as $article) {
    $posts[] = [
        'title'   => trim($xpath->evaluate('string(.//h2)', $article)),
        'excerpt' => trim($xpath->evaluate('string(.//p[contains(@class, "excerpt")])', $article)),
        'author'  => trim($xpath->evaluate('string(.//*[contains(@class, "author")])', $article)),
        'url'     => $xpath->evaluate('string(.//h2/a/@href)', $article),
    ];
}

print_r($posts);

Pros: zero dependencies, ships with PHP. Cons: verbose, UTF-8 workaround required, libxml warning suppression boilerplate.

Option 2: Symfony DomCrawler

<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://practice.scrapingcentral.com/blog');
$crawler = new Crawler($html);

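$posts = $crawler->filter('article.post')->each(function (Crawler $article) {
    $link = $article->filter('h2 a');

    return [
        // text('') returns the default instead of throwing when nothing matches.
        'title'   => $article->filter('h2')->text(''),
        'excerpt' => $article->filter('p.excerpt')->text(''),
        'author'  => $article->filter('.author')->text(''),
        // attr() throws on an empty node list, so guard it to match the text() defaults.
        'url'     => $link->count() > 0 ? $link->attr('href') : '',
    ];
});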

print_r($posts);

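Install: composer require symfony/dom-crawler symfony/css-selector (filter() needs the CssSelector component; filterXPath() does not).

Pros: clean, encoding handled, CSS + XPath, Symfony-integrated. Cons: Composer dependency, libxml-backed (so it inherits libxml's HTML5 quirks).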

Option 3: paquettg/php-html-parser

<?php
require 'vendor/autoload.php';
use PHPHtmlParser\Dom;

$dom = new Dom();
$dom->loadFromUrl('https://practice.scrapingcentral.com/blog');

$posts = [];
foreach ($dom->find('article.post') as $article) {
    // Passing 0 as the second argument returns the first match rather than a collection.
    $posts[] = [
        'title'   => trim($article->find('h2', 0)->text),
        'excerpt' => trim($article->find('p.excerpt', 0)->text),
        'author'  => trim($article->find('.author', 0)->text),
        'url'     => $article->find('h2 a', 0)->getAttribute('href'),
    ];
}

print_r($posts);

Install: composer require paquettg/php-html-parser.

Pros: very jQuery-like syntax, friendly for JS developers, no libxml. Cons: pure-PHP parser is slower on large documents; less actively maintained than DomCrawler; some edge-case HTML5 differences.

Feature comparison

| Concern | DOMDocument | DomCrawler | paquettg |
|---|---|---|---|
| Dependency | None (stdlib) | Composer (symfony/*) | Composer |
| CSS selectors | No (XPath only) | Yes | Yes |
| XPath | Yes | Yes | No |
| Encoding | Manual UTF-8 workaround | Automatic | Automatic |
| Speed | Fast (libxml) | Fast (libxml) | Slow (pure PHP) |
| HTML5 awareness | Limited | Limited (libxml) | Better |
| API style | Verbose, DOM-classic | Fluent, jQuery-ish | jQuery-clone |
| Form/session helpers | No | Via BrowserKit | No |
| Active maintenance | Stable | Active | Less active |

Rough performance numbers

Rough figures for parsing a typical 100 KB HTML page and running 10 queries against it:

  • DOMDocument + DOMXPath: ~5-10 ms
  • DomCrawler: ~6-12 ms (small overhead over raw DOMDocument)
  • paquettg: ~30-80 ms (pure PHP, no libxml)

Differences only matter when you parse thousands of pages per minute. For 100 pages over coffee, all three feel identical.
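To check figures like these on your own machine, a small timing harness is enough. The sketch below is only an illustration: benchmark() is a hypothetical helper, the HTML is a synthetic page of 200 invented post cards rather than the real /blog, and only the DOMDocument callback is shown because it needs no Composer packages. The other two parsers can be wrapped in callbacks of the same shape.

```php
<?php
// Hypothetical helper: mean wall time in milliseconds for one call of $parse.
function benchmark(callable $parse, string $html, int $iterations = 50): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $parse($html);
    }
    return (microtime(true) - $start) * 1000 / $iterations;
}

// Synthetic stand-in for a fetched page: 200 identical post cards (invented markup).
$card = '<article class="post"><h2><a href="/blog/x">Title</a></h2>'
      . '<p class="excerpt">Excerpt</p><span class="author">Ann</span></article>';
$html = '<!DOCTYPE html><html><body>' . str_repeat($card, 200) . '</body></html>';

// Parse the page and count the cards, so the timing covers parsing plus one query.
$parseWithDomDocument = function (string $html): int {
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    return $xpath->query('//article[contains(@class, "post")]')->length;
};

$count  = $parseWithDomDocument($html);
$meanMs = benchmark($parseWithDomDocument, $html);
printf("DOMDocument: %d cards, %.3f ms per parse\n", $count, $meanMs);
```

Absolute numbers depend on PHP version, libxml version, and hardware, so compare the parsers against each other on the same machine rather than against the ranges above.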

Memory characteristics

DOMDocument and DomCrawler share libxml's memory model, which is relatively compact. paquettg builds a PHP object for every node, so memory grows faster with document size. On very large pages (1 MB+ of HTML, lots of elements), paquettg can struggle where the libxml-based parsers cope comfortably.

Encoding handling

This is where DOMDocument is most painful:

  • DOMDocument: loadHTML() assumes ISO-8859-1 unless told otherwise, so for UTF-8 input you must prefix <?xml encoding="UTF-8"?> (or use PHP 8.4+'s Dom\HTMLDocument).
  • DomCrawler: auto-detects from <meta charset>.
  • paquettg: auto-detects.

If you handle a lot of non-English content, this alone is enough to skip raw DOMDocument.
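The DOMDocument workaround can be seen on a tiny inline document. This is a minimal sketch, assuming a UTF-8 source file and an invented HTML snippet with no charset declaration; firstHeading() is a hypothetical helper, not part of any library.

```php
<?php
// Hypothetical helper: return the text of the first <h2>, with or without the UTF-8 hint.
function firstHeading(string $html, bool $forceUtf8): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // The XML declaration acts purely as an encoding hint for libxml.
    $doc->loadHTML(($forceUtf8 ? '<?xml encoding="UTF-8"?>' : '') . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc->getElementsByTagName('h2')->item(0)->textContent;
}

// Invented sample: UTF-8 bytes, no <meta charset> to help the parser.
$html = '<html><body><h2>Café Zoë, 東京</h2></body></html>';

$naive = firstHeading($html, false); // typically garbled: bytes read as a single-byte encoding
$fixed = firstHeading($html, true);  // read as UTF-8

echo "without prefix: $naive\n";
echo "with prefix:    $fixed\n";
```

The exact garbling of the unprefixed result can vary with the libxml version, which is one more reason to apply the prefix (or let DomCrawler handle it) instead of relying on the default.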

Form submission

  • DOMDocument / paquettg: nothing built-in. You parse the form fields out manually and POST them yourself with cURL/Guzzle.
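  • DomCrawler + BrowserKit: the strongest combination here. $crawler->selectButton('Submit')->form([...]) builds a Form object pre-filled with the page's existing field values, and $browser->submit($form) sends it.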

If your scraping involves logins, multi-step flows, or CSRF-protected forms, DomCrawler + BrowserKit pays for itself immediately.
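For comparison, here is roughly what the manual route looks like with plain DOMDocument. It is a sketch under stated assumptions: the login form, its field names, and the _csrf token are invented, and extractFormFields() is a hypothetical helper that only reads input elements (no select, textarea, or checkbox handling).

```php
<?php
// Invented login form standing in for a fetched page.
$html = <<<'HTML'
<html><body>
<form id="login" action="/login" method="post">
  <input type="hidden" name="_csrf" value="abc123">
  <input type="text" name="username" value="">
  <input type="password" name="password" value="">
  <button type="submit">Sign in</button>
</form>
</body></html>
HTML;

// Hypothetical helper: collect action, method, and named <input> values of one form.
function extractFormFields(string $html, string $formId): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $form = $xpath->query(sprintf('//form[@id="%s"]', $formId))->item(0);
    if (!$form instanceof DOMElement) {
        throw new RuntimeException("Form #$formId not found");
    }

    $fields = [];
    foreach ($xpath->query('.//input[@name]', $form) as $input) {
        // Hidden inputs such as CSRF tokens are carried over unchanged.
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }

    return [
        'action' => $form->getAttribute('action'),
        'method' => strtoupper($form->getAttribute('method') ?: 'GET'),
        'fields' => $fields,
    ];
}

$form = extractFormFields($html, 'login');
$form['fields']['username'] = 'demo';   // placeholder credentials
$form['fields']['password'] = 'secret';

// This is the body you would POST yourself with cURL or Guzzle.
$body = http_build_query($form['fields']);
echo $form['method'], ' ', $form['action'], "\n", $body, "\n";
```

You would still need to send the request, keep the session cookie, and resolve a relative action URL yourself, which is the bookkeeping BrowserKit takes off your hands.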

XPath availability

XPath is more powerful than CSS for axis-based traversal and text-content matching (//dt[text()="Brand"]). DOMXPath and DomCrawler's filterXPath() give you full XPath; paquettg doesn't. Sites with complex semi-structured layouts are easier in XPath; pure CSS handles 90% but trips on the last 10%.
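A short sketch of that text-matching pattern, using an invented product spec list rather than a real page: find the label by its text, then step along the following-sibling axis to the value.

```php
<?php
// Invented definition list standing in for a product spec table.
$html = <<<'HTML'
<html><body>
<dl class="specs">
  <dt>Brand</dt><dd>Acme</dd>
  <dt>Weight</dt><dd>1.2 kg</dd>
  <dt>Colour</dt><dd>Red</dd>
</dl>
</body></html>
HTML;

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// Match the <dt> by its text, then take the first <dd> that follows it.
$brand = $xpath->evaluate('string(//dt[text()="Brand"]/following-sibling::dd[1])');

// normalize-space() tolerates stray whitespace around the label.
$weight = $xpath->evaluate('string(//dt[normalize-space()="Weight"]/following-sibling::dd[1])');

echo $brand, "\n";   // Acme
echo $weight, "\n";  // 1.2 kg
```

The same expressions can be passed to DomCrawler's filterXPath() if you prefer to stay inside the Crawler API.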

Pick a default

For most PHP scraping work:

  • Use DomCrawler as your default. It's actively maintained, ergonomic, fast, and integrates with BrowserKit for stateful flows.
  • Use DOMDocument when you can't install Composer dependencies, or for the smallest one-off scripts.
  • Use paquettg if your team strongly prefers a jQuery-like syntax and you're working with pages where speed isn't critical.

There's no wrong answer: all three produce correct results. DomCrawler is simply the one most likely to scale with your project.

Hands-on lab

Implement the blog scrape with all three parsers (you've already got the code above). Use microtime(true) to compare wall time over 50 iterations. Print the first 3 results from each to confirm they match. Notice the ergonomic differences: which felt easier? Which would you reach for next time?


Practice this lesson on Catalog108, our first-party scraping sandbox.

