PHP Parser Comparison: DOMDocument vs DomCrawler vs paquettg
Three popular PHP HTML parsers compared on the same page: DOMDocument (stdlib), Symfony DomCrawler, and paquettg/php-html-parser. Honest tradeoffs.
What you’ll learn
- Implement the same scrape with all three parsers on the same page.
- Compare verbosity, performance, encoding handling, and feature surface.
- Pick a default parser for your projects.
You've seen DOMDocument and DomCrawler. A third popular option in the PHP world is paquettg/php-html-parser, a pure-PHP, jQuery-like selector engine that some projects standardize on. This lesson runs the same scrape through all three on /blog, side by side.
The task
Scrape blog post cards from /blog: title, excerpt, author, link.
Option 1: DOMDocument + DOMXPath
<?php
$html = file_get_contents('https://practice.scrapingcentral.com/blog');
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
$posts = [];
foreach ($xpath->query('//article[contains(@class, "post")]') as $article) {
    $posts[] = [
        'title'   => trim($xpath->evaluate('string(.//h2)', $article)),
        'excerpt' => trim($xpath->evaluate('string(.//p[contains(@class, "excerpt")])', $article)),
        'author'  => trim($xpath->evaluate('string(.//*[contains(@class, "author")])', $article)),
        'url'     => $xpath->evaluate('string(.//h2/a/@href)', $article),
    ];
}
print_r($posts);
Pros: zero dependencies, ships with PHP. Cons: verbose, UTF-8 workaround required, libxml warning suppression boilerplate.
Option 2: Symfony DomCrawler
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://practice.scrapingcentral.com/blog');
$crawler = new Crawler($html);
$posts = $crawler->filter('article.post')->each(function (Crawler $article) {
    return [
        'title'   => $article->filter('h2')->text(''),
        'excerpt' => $article->filter('p.excerpt')->text(''),
        'author'  => $article->filter('.author')->text(''),
        'url'     => $article->filter('h2 a')->attr('href'),
    ];
});
print_r($posts);
Pros: clean, encoding handled, CSS + XPath, Symfony-integrated. Cons: Composer dependency, libxml-backed (so HTML5 quirks inherited).
Option 3: paquettg/php-html-parser
<?php
require 'vendor/autoload.php';
use PHPHtmlParser\Dom;
$dom = new Dom();
$dom->loadFromUrl('https://practice.scrapingcentral.com/blog');
$posts = [];
foreach ($dom->find('article.post') as $article) {
    $posts[] = [
        'title'   => trim($article->find('h2', 0)->text),
        'excerpt' => trim($article->find('p.excerpt', 0)->text),
        'author'  => trim($article->find('.author', 0)->text),
        'url'     => $article->find('h2 a', 0)->getAttribute('href'),
    ];
}
print_r($posts);
Install: composer require paquettg/php-html-parser.
Pros: very jQuery-like syntax, friendly for JS developers, no libxml. Cons: pure-PHP parser is slower on large documents; less actively maintained than DomCrawler; some edge-case HTML5 differences.
Feature comparison
| Concern | DOMDocument | DomCrawler | paquettg |
|---|---|---|---|
| Dependency | None (stdlib) | Composer (symfony/dom-crawler, symfony/css-selector) | Composer |
| CSS selectors | No (XPath only) | Yes | Yes |
| XPath | Yes | Yes | No |
| Encoding | Manual UTF-8 workaround | Automatic | Automatic |
| Speed | Fast (libxml) | Fast (libxml) | Slower (pure PHP) |
| HTML5 awareness | Limited | Limited (libxml) | Better |
| API style | Verbose, DOM-classic | Fluent, jQuery-ish | jQuery-clone |
| Form/session helpers | No | Via BrowserKit | No |
| Active maintenance | Stable | Active | Less active |
Performance rough numbers
On a typical 100KB HTML page, parsing + 10 queries:
- DOMDocument + DOMXPath, ~5-10 ms
- DomCrawler, ~6-12 ms (small overhead over raw DOMDocument)
- paquettg, ~30-80 ms (pure PHP, no libxml)
Differences only matter when you parse thousands of pages per minute. For 100 pages over coffee, all three feel identical.
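To reproduce numbers like these yourself, here's a minimal, network-free benchmark sketch. It times the DOMDocument + DOMXPath path against a synthetic page of post cards (the markup is made up); swap in a saved copy of /blog for realistic figures, and repeat the loop body with the other two parsers to compare.

```php
<?php
// Build a synthetic page of 500 post cards so the benchmark needs no network.
$cards = str_repeat(
    '<article class="post"><h2><a href="/post">Title</a></h2>'
    . '<p class="excerpt">Excerpt text</p><span class="author">Jane</span></article>',
    500
);
$html = '<html><body>' . $cards . '</body></html>';

$iterations = 50;
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);
    $count = $xpath->query('//article[contains(@class, "post")]')->length;
}
$msPerIteration = (microtime(true) - $start) * 1000 / $iterations;
printf("%d posts parsed, %.2f ms per iteration\n", $count, $msPerIteration);
```

Timing parse + query together, as here, matters: paquettg's gap versus libxml is mostly in the parse step, so a benchmark that parses once and queries many times will understate the difference.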
Memory characteristics
DOMDocument and DomCrawler share the libxml memory model, which is relatively compact. paquettg builds a PHP object for every node, so memory grows faster with document size. On very large pages (1MB+ of HTML, lots of elements), paquettg can struggle where libxml-based parsers shrug.
Encoding handling
This is where DOMDocument is most painful:
- DOMDocument: defaults to ISO-8859-1; you must prefix the markup with <?xml encoding="UTF-8"?> (or use PHP 8.4+'s Dom\HTMLDocument).
- DomCrawler: auto-detects from <meta charset>.
- paquettg: auto-detects.
If you handle a lot of non-English content, this alone is enough to skip raw DOMDocument.
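A quick demonstration of the failure mode (a minimal sketch; the sample string is made up):

```php
<?php
// UTF-8 text with no <meta charset>; libxml falls back to ISO-8859-1.
$html = '<p>café naïve Außenseiter</p>';

$broken = new DOMDocument();
$broken->loadHTML($html);                             // bytes misread as Latin-1

$fixed = new DOMDocument();
$fixed->loadHTML('<?xml encoding="UTF-8"?>' . $html); // the workaround

echo trim($broken->textContent) . "\n"; // mojibake (cafÃ© ...)
echo trim($fixed->textContent) . "\n";  // café naïve Außenseiter
```

On PHP 8.4+, Dom\HTMLDocument::createFromString handles UTF-8 correctly without the prefix hack.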
Form submission
- DOMDocument / paquettg: nothing built-in. You parse the form fields out manually and POST them yourself with cURL/Guzzle.
- DomCrawler + BrowserKit: the killer combo. selectButton('Submit')->form([...]) plus $browser->submit($form).
If your scraping involves logins, multi-step flows, or CSRF-protected forms, DomCrawler + BrowserKit pays for itself immediately.
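For contrast, here's roughly what the manual route looks like, the field-harvesting step that BrowserKit automates for you. The form markup and field names below are hypothetical:

```php
<?php
// Hypothetical login form (in practice you'd fetch the real page first).
$html = <<<'HTML'
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <button type="submit">Log in</button>
</form>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Harvest every named input, preserving hidden fields like the CSRF token.
$fields = [];
foreach ($xpath->query('//form//input[@name]') as $input) {
    $fields[$input->getAttribute('name')] = $input->getAttribute('value');
}
$fields['username'] = 'alice';  // fill in credentials
$fields['password'] = 'secret';

// You would now POST $fields to the form's action URL with cURL or Guzzle,
// carrying cookies between requests yourself.
print_r($fields);
```

It works, but every login flow you scrape repeats this boilerplate, and you own the cookie jar, redirects, and re-fetching on failure.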
XPath availability
XPath is more powerful than CSS for axis-based traversal and text-content matching (//dt[text()="Brand"]). DOMXPath and DomCrawler's filterXPath() give you full XPath; paquettg doesn't. Sites with complex semi-structured layouts are easier in XPath; pure CSS handles 90% but trips on the last 10%.
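For example, pulling a value out of a definition list by its label text, something pure CSS selectors can't express (the markup here is made up):

```php
<?php
$html = '<dl><dt>Brand</dt><dd>Acme</dd><dt>Model</dt><dd>X-100</dd></dl>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match a node by its text content, then hop to its sibling via an axis;
// neither step exists in CSS.
$brand = $xpath->evaluate('string(//dt[text()="Brand"]/following-sibling::dd[1])');
echo $brand . "\n"; // Acme
```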
Pick a default
For most PHP scraping work:
- Use DomCrawler as your default. It's actively maintained, ergonomic, fast, and integrates with BrowserKit for stateful flows.
- Use DOMDocument when you can't install Composer dependencies, or for the smallest one-off scripts.
- Use paquettg if your team strongly prefers a jQuery-like syntax and you're working with pages where speed isn't critical.
There's no wrong answer here: all three produce correct results. But DomCrawler is the one most likely to scale with your project.
Hands-on lab
Implement the blog scrape with all three parsers (you've already got the code above). Use microtime(true) to compare wall time over 50 iterations. Print the first 3 results from each to confirm they match. Notice the ergonomic differences: which felt easier? Which would you reach for next time?