PHP DOMDocument and DOMXPath, Static Scraping

The native PHP HTML/XML parser. No Composer dependencies, bundled with PHP, supports DOM traversal and XPath queries.

DOMDocument and DOMXPath come from PHP's DOM extension, which is built on libxml2 and enabled by default (some Linux distributions package it separately, for example as php-xml). They're verbose compared to Guzzle + DomCrawler, but they have one big advantage: no Composer dependencies. For quick scripts, restricted hosts, or environments where Composer is unavailable, they're invaluable.

Loading HTML

<?php
$html = file_get_contents('https://practice.scrapingcentral.com/challenges/static/tables/simple');

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // suppress noisy parse warnings
$doc->loadHTML($html);
libxml_clear_errors();

The libxml_use_internal_errors(true) call is essential. Real-world HTML triggers dozens of libxml warnings (htmlParseEntityRef, Tag X invalid, etc.) that you don't want littering your output. Suppress, then clear after loading.
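If you would rather inspect the warnings than discard them, the buffered entries can be read with libxml_get_errors() before clearing. The following is a minimal sketch, assuming a small inline HTML string invented for illustration; which warnings appear (if any) depends on the libxml2 version in use.

```php
<?php
// Illustrative markup: an unescaped "&" and an HTML5 tag, both of which
// older libxml2 versions tend to complain about.
$html = '<html><body><section><p>Fish & chips</p></section></body></html>';

// Buffer warnings instead of emitting them; keep the previous setting.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Each buffered entry is a LibXMLError with level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
  printf("line %d: %s\n", $error->line, trim($error->message));
}

libxml_clear_errors();                  // empty the buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

// The document is still usable despite the warnings.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Restoring the previous setting matters mostly in larger applications, where other code may rely on the default error behaviour.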

A common encoding pitfall

When a page carries no charset declaration that libxml recognizes, DOMDocument::loadHTML falls back to ISO-8859-1. Most pages are UTF-8, so signal the encoding explicitly:

$doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);

Or, on PHP 8.4+, use the new Dom\HTMLDocument::createFromString($html), which treats input as UTF-8 unless the document says otherwise. For broad compatibility today, the XML processing instruction prefix is the workaround.
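To see the difference the prefix makes, the same UTF-8 fragment can be parsed twice, once with and once without the hint. This is a sketch with an invented sample string; the exact garbling in the unhinted case can vary with the libxml2 version, so only the hinted result should be relied on.

```php
<?php
// A UTF-8 fragment with no charset declaration of its own (sample data).
$html = '<p>Crème brûlée, 5 €</p>';

libxml_use_internal_errors(true);

// Without a hint, the parser may read the bytes as a single-byte encoding.
$plain = new DOMDocument();
$plain->loadHTML($html);

// With the processing-instruction prefix, the bytes are read as UTF-8.
$hinted = new DOMDocument();
$hinted->loadHTML('<?xml encoding="UTF-8"?>' . $html);

libxml_clear_errors();

$plainText  = $plain->getElementsByTagName('p')->item(0)->textContent;
$hintedText = $hinted->getElementsByTagName('p')->item(0)->textContent;

echo "no hint: $plainText\n";
echo "hinted:  $hintedText\n";
```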

Basic queries

// By tag name
$h1s = $doc->getElementsByTagName('h1');
foreach ($h1s as $h1) {
  echo $h1->textContent . "\n";
}

// By id
$nav = $doc->getElementById('nav');

// Iteration
foreach ($doc->getElementsByTagName('a') as $a) {
  echo $a->getAttribute('href') . "\n";
}

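getElementsByTagName returns a DOMNodeList (live, iterable). getElementById returns a single element or null. It generally works for documents parsed with loadHTML, but for XML documents it only finds attributes that a DTD declares as type ID (or that you registered with setIdAttribute). When it comes back null unexpectedly, fall back to XPath, e.g. //*[@id="nav"].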
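A defensive id lookup might try getElementById first and fall back to an attribute match in XPath. The sketch below assumes a small inline navigation list made up for the example.

```php
<?php
$html = '<html><body><ul id="nav">'
      . '<li><a href="/a">A</a></li><li><a href="/b">B</a></li>'
      . '</ul></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Try the direct lookup first, then fall back to an attribute match.
$nav = $doc->getElementById('nav');
if ($nav === null) {
  $xpath = new DOMXPath($doc);
  $nav = $xpath->query('//*[@id="nav"]')->item(0);
}

// getElementsByTagName also exists on elements, which scopes the search
// to the descendants of that element.
$hrefs = [];
foreach ($nav->getElementsByTagName('a') as $a) {
  $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```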

XPath: the workhorse

For anything beyond tag-name lookups, use DOMXPath:

$xpath = new DOMXPath($doc);

$cards = $xpath->query('//article[contains(@class, "product-card")]');
foreach ($cards as $card) {
  $name  = $xpath->evaluate('string(.//h2)', $card);
  $price = $xpath->evaluate('string(.//*[contains(@class, "price")])', $card);
  $url  = $xpath->evaluate('string(.//a/@href)', $card);
  echo "$name | $price | $url\n";
}

Two important methods:

$xpath->query($expression, $contextNode = null), returns a DOMNodeList (or false if the expression is malformed).
$xpath->evaluate($expression, $contextNode = null), returns a DOMNodeList OR a scalar (string, float, bool) depending on the XPath expression.

string(...) and count(...) are scalar XPath functions; use evaluate for those.

The context node is the second argument, with the same . semantics as Python's lxml: .//h2 inside a card iteration searches only that card, while //h2 would still search the whole document.
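The difference between the two methods is easiest to see side by side. The following sketch uses a two-card inline document invented for the example and shows a node-list query, three scalar evaluations, and a context-relative lookup.

```php
<?php
$html = '<html><body>
  <article class="product-card"><h2>Lamp</h2><span class="price">19.99</span></article>
  <article class="product-card"><h2>Desk</h2><span class="price">120.00</span></article>
</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// query() always yields a node list.
$cards = $xpath->query('//article[contains(@class, "product-card")]');

// evaluate() yields a scalar when the expression is a scalar function.
$count    = $xpath->evaluate('count(//article)');              // float
$first    = $xpath->evaluate('string(//article/h2)');          // string value of the first match
$hasPrice = $xpath->evaluate('boolean(//*[@class="price"])');  // bool

// Relative path plus context node: only the second card is searched.
$second = $xpath->evaluate('string(.//h2)', $cards->item(1));

// number() converts the text to a float inside XPath itself.
$price = $xpath->evaluate('number(.//*[@class="price"])', $cards->item(1));

var_dump($count, $first, $hasPrice, $second, $price);
```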

Extracting common things

Want	How
Text content	`$node->textContent` (recursive, includes descendants)
Attribute	`$node->getAttribute('href')`
Outer HTML	`$doc->saveHTML($node)`, gives the full element including its own opening and closing tags
Tag name	`$node->nodeName`
Children	`$node->childNodes`
Parent	`$node->parentNode`
Has attribute	`$node->hasAttribute('href')`

Note: there's no direct "inner HTML" property the way the browser DOM has one. $doc->saveHTML($node) is the closest, but it gives outer HTML (including the element itself). For inner-only, iterate over the node's children and concatenate the saveHTML output of each.
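The child-by-child approach can be wrapped in a small function. innerHtml below is a hypothetical helper name, not part of the DOM extension, and the markup is sample data.

```php
<?php
// Hypothetical helper: serialize each child node and join the pieces.
function innerHtml(DOMNode $node): string {
  $html = '';
  foreach ($node->childNodes as $child) {
    $html .= $node->ownerDocument->saveHTML($child);
  }
  return $html;
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<div id="box"><p>Hi <b>there</b></p><span>x</span></div>');
libxml_clear_errors();

$box = (new DOMXPath($doc))->query('//div[@id="box"]')->item(0);

$outer = $doc->saveHTML($box);  // includes the <div> wrapper
$inner = innerHtml($box);       // children only

echo "outer: ", $outer, "\n";
echo "inner: ", $inner, "\n";
```

Text nodes among the children are serialized as well, so mixed content such as "Hi <b>there</b>" keeps its text.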

Walking the tree

$row = $xpath->query('//tr[1]')->item(0);

// Previous and next siblings (raw DOM, includes whitespace text nodes)
$prev = $row->previousSibling;
$next = $row->nextSibling;

// Filter to element siblings
while ($next && $next->nodeType !== XML_ELEMENT_NODE) {
  $next = $next->nextSibling;
}

The whitespace-sibling problem from BeautifulSoup applies here too. Always check nodeType === XML_ELEMENT_NODE (constant value 1) when walking siblings.
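The sibling filter is worth packaging once rather than repeating the loop. In this sketch, nextElement is a hypothetical helper and the table markup is invented; the last line shows the same step written as an XPath axis, which only ever matches elements.

```php
<?php
// Hypothetical helper: skip text and comment nodes until an element appears.
function nextElement(DOMNode $node): ?DOMElement {
  $next = $node->nextSibling;
  while ($next !== null && $next->nodeType !== XML_ELEMENT_NODE) {
    $next = $next->nextSibling;
  }
  return $next;
}

$html = "<table>\n<tr><td>one</td></tr>\n<!-- spacer -->\n<tr><td>two</td></tr>\n</table>";

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$first  = $xpath->query('//tr')->item(0);
$second = nextElement($first);
echo trim($second->textContent), "\n";

// Same step as an XPath axis: following-sibling::* matches elements only.
$viaXpath = $xpath->query('following-sibling::*[1]', $first)->item(0);
```

On PHP 8.0 and later, DOMElement also exposes nextElementSibling and previousElementSibling properties, which make the helper unnecessary.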

A complete table-scraping example

<?php
$html = file_get_contents('https://practice.scrapingcentral.com/challenges/static/tables/simple');

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$rows = [];

// Headers
$headers = [];
foreach ($xpath->query('//table[1]/thead/tr/th') as $th) {
  $headers[] = trim($th->textContent);
}

// Rows
foreach ($xpath->query('//table[1]/tbody/tr') as $tr) {
  $row = [];
  $cells = $xpath->query('./td', $tr);
  foreach ($cells as $i => $td) {
    $row[$headers[$i] ?? $i] = trim($td->textContent);
  }
  $rows[] = $row;
}

print_r($rows);

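About 30 lines, no Composer, just the extensions bundled with PHP. The thead/tbody hierarchy in these queries is real on this page. Note that libxml's HTML parser, unlike a browser, does not insert a missing <tbody>: an XPath copied from browser devtools that includes /tbody returns nothing when the source omits the tag. When unsure, check the raw HTML or use a depth-agnostic path such as //table[1]//tr.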
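For pages where you cannot rely on thead and tbody being present, the queries can be written so they match rows at any depth. The following is a sketch against an invented table without those wrappers; it assumes the first table has no nested tables inside it.

```php
<?php
// Sample markup with no <thead>/<tbody> wrappers.
$html = '<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Bolt</td><td>40</td></tr>
  <tr><td>Nut</td><td>75</td></tr>
</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// (//table)[1] is the first table in the document; //th matches header
// cells at any depth beneath it, with or without a <thead>.
$headers = [];
foreach ($xpath->query('(//table)[1]//th') as $th) {
  $headers[] = trim($th->textContent);
}

// tr[td] keeps only rows that contain data cells, skipping the header row.
$rows = [];
foreach ($xpath->query('(//table)[1]//tr[td]') as $tr) {
  $row = [];
  foreach ($xpath->query('./td', $tr) as $i => $td) {
    $row[$headers[$i] ?? $i] = trim($td->textContent);
  }
  $rows[] = $row;
}

print_r($rows);
```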

Comparison to DomCrawler

DOMDocument is one layer below Symfony DomCrawler. In fact, DomCrawler wraps a DOMDocument and adds a fluent API. If you're not on Composer, raw DOMDocument is what you have; if you are, DomCrawler (Lesson 1.18) is far more pleasant.

When DOMDocument bites

Comments inside <script> confuse libxml's HTML mode. Strip them or use a different parser.
HTML5 tags (<section>, <article>, <nav>): libxml's HTML parser is HTML4-era, so it may log "Tag ... invalid" warnings for them, but it still accepts unknown tags as generic elements. Mostly fine in practice.
Self-closing tags in HTML (<br>, <img>) are sometimes serialized weirdly back out. Read with it, serialize with care.

For modern fully-spec'd HTML5 parsing in PHP, look at masterminds/html5 (a Composer package), but for the workhorse 90% of scraping, DOMDocument is fine.

Hands-on lab

Fetch /challenges/static/tables/simple and parse it with DOMDocument + DOMXPath. Extract the first table's headers and rows into a PHP array. Verify the row count and one example row. Then deliberately mis-set the encoding (no UTF-8 prefix) and confirm you get garbled non-ASCII characters in the output. Fixing it is part of the exercise.

Quiz, check your understanding

Why call `libxml_use_internal_errors(true)` before `$doc->loadHTML($html)`?