Lesson 1.17 · beginner · 4 min read

PHP DOMDocument and DOMXPath

The native PHP HTML/XML parser. No Composer dependencies, ships with every PHP install, supports DOM traversal and XPath queries.

What you’ll learn

  • Load HTML into a `DOMDocument` and silence its noisy warnings.
  • Query with `getElementsByTagName` and `getElementById`.
  • Run XPath queries via `DOMXPath`.
  • Extract text, attributes, and inner HTML reliably.

DOMDocument and DOMXPath belong to PHP's bundled DOM extension (built on libxml2). It is enabled by default, although some Linux distributions package it separately (for example as php-xml). The API is verbose compared to Guzzle + DomCrawler, but it has one big advantage: zero Composer dependencies. For quick scripts, restricted hosts, or environments where Composer is unavailable, that matters a lot.

Loading HTML

<?php
$html = file_get_contents('https://practice.scrapingcentral.com/challenges/static/tables/simple');

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // suppress noisy parse warnings
$doc->loadHTML($html);
libxml_clear_errors();

The libxml_use_internal_errors(true) call is essential. Real-world HTML triggers dozens of libxml warnings (htmlParseEntityRef, Tag X invalid, etc.) that you don't want littering your output. Suppress, then clear after loading.
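Suppressing warnings does not have to mean discarding them. The sketch below, which uses a made-up HTML string rather than the practice page, buffers the parse errors, reads them back with libxml_get_errors(), and then restores the previous error-handling setting. Which messages appear, if any, depends on the libxml2 version.

```php
<?php
// Hypothetical sample markup with a bare ampersand and a stray closing tag.
$html = '<html><body><p>Fish & chips</p></span></body></html>';

// Buffer libxml errors instead of emitting PHP warnings; keep the old setting.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // empty the buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

// Log the problems somewhere useful instead of letting them hit the output.
foreach ($errors as $error) {
    fprintf(STDERR, "line %d: %s\n", $error->line, trim($error->message));
}

echo $doc->getElementsByTagName('p')->item(0)->textContent . "\n";
```

Restoring the previous setting matters mostly in library code, where the caller may rely on warnings being emitted.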

A common encoding pitfall

DOMDocument::loadHTML falls back to ISO-8859-1 when it finds no charset declaration it recognizes. For UTF-8 pages, which is most pages, the safe move is to signal the encoding yourself:

$doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);

Or, on PHP 8.4+, use the new Dom\HTMLDocument::createFromString($html), which handles UTF-8 by default. For broad compatibility today, the XML-declaration-style prefix is the workaround.
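To see the prefix at work, here is a minimal sketch with a hypothetical UTF-8 snippet that declares no charset of its own. It assumes this source file is itself saved as UTF-8.

```php
<?php
// Hypothetical UTF-8 markup without a <meta charset> declaration.
$html = '<html><body><h1>Café Zoë</h1></body></html>';

libxml_use_internal_errors(true);

$doc = new DOMDocument();
// The prefix tells libxml to decode the bytes that follow as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
libxml_clear_errors();

$title = $doc->getElementsByTagName('h1')->item(0)->textContent;
echo $title . "\n";
```

Dropping the prefix from the loadHTML call and rerunning the script is a quick way to check how your PHP and libxml2 build treats undeclared UTF-8.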

Basic queries

// By tag name
$h1s = $doc->getElementsByTagName('h1');
foreach ($h1s as $h1) {
  echo $h1->textContent . "\n";
}

// By id
$nav = $doc->getElementById('nav');

// Iteration
foreach ($doc->getElementsByTagName('a') as $a) {
  echo $a->getAttribute('href') . "\n";
}

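getElementsByTagName returns a DOMNodeList that is live (it reflects later changes to the document) and iterable. getElementById returns a single element or null. In documents parsed with loadHTML, id attributes are normally registered as IDs at parse time; in XML documents the lookup only works when a DTD declares the attribute as an ID or you call setIdAttribute. If the lookup unexpectedly returns null, fall back to an XPath attribute match such as //*[@id="nav"].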
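A small sketch of that fallback pattern, using made-up markup: try getElementById first, and switch to an XPath attribute match only if it returns null.

```php
<?php
// Hypothetical markup; the id "nav" mirrors the lookup shown above.
$html = '<html><body><ul id="nav">'
      . '<li><a href="/a">A</a></li><li><a href="/b">B</a></li>'
      . '</ul></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Direct lookup first, XPath attribute match as the fallback.
$nav = $doc->getElementById('nav');
if ($nav === null) {
    $xpath = new DOMXPath($doc);
    $nav = $xpath->query('//*[@id="nav"]')->item(0);
}

$hrefs = [];
if ($nav instanceof DOMElement) {
    // Called on an element, getElementsByTagName searches only its subtree.
    foreach ($nav->getElementsByTagName('a') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
}

print_r($hrefs);
```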

XPath: the workhorse

For anything beyond tag-name lookups, use DOMXPath:

$xpath = new DOMXPath($doc);

$cards = $xpath->query('//article[contains(@class, "product-card")]');
foreach ($cards as $card) {
  $name  = $xpath->evaluate('string(.//h2)', $card);
  $price = $xpath->evaluate('string(.//*[contains(@class, "price")])', $card);
  $url   = $xpath->evaluate('string(.//a/@href)', $card);
  echo "$name | $price | $url\n";
}

Two important methods:

  • $xpath->query($expression, $contextNode = null): returns a DOMNodeList.
  • $xpath->evaluate($expression, $contextNode = null): returns a node list OR a scalar (string, number, bool), depending on the XPath expression.

string(...) and count(...) are XPath functions that return scalars; use evaluate for those.

The context node is the second argument, and a leading . refers to it, the same semantics as in Python's lxml. Inside the loop, .//h2 searches only the current card, whereas a bare //h2 would search the whole document even when a context node is passed.
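The sketch below exercises both ideas on hypothetical product markup: evaluate() returning a number for count(...), and context-relative lookups per card. It also swaps in a stricter class test than the plain contains() used above, because contains(@class, "price") would also match a class such as price-old.

```php
<?php
// Hypothetical markup; class names follow the example above.
$html = <<<'HTML'
<html><body>
  <article class="product-card"><h2>Lamp</h2><span class="price">$12</span></article>
  <article class="product-card featured"><h2>Desk</h2><span class="price-old">$90</span><span class="price">$75</span></article>
</body></html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// count(...) is a number in XPath, so evaluate() hands back a float.
$total = $xpath->evaluate('count(//article[contains(@class, "product-card")])');

// Whole-token class test: matches class="price" but not class="price-old".
$priceExpr = 'string(.//*[contains(concat(" ", normalize-space(@class), " "), " price ")])';

$prices = [];
foreach ($xpath->query('//article') as $card) {
    $name = $xpath->evaluate('string(.//h2)', $card);  // relative to this card only
    $prices[$name] = $xpath->evaluate($priceExpr, $card);
}

var_dump($total);
print_r($prices);
```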

Extracting common things

Want          | How
Text content  | $node->textContent (recursive, includes descendants)
Attribute     | $node->getAttribute('href')
Outer HTML    | $doc->saveHTML($node) (the element itself plus everything inside it)
Tag name      | $node->nodeName
Children      | $node->childNodes
Parent        | $node->parentNode
Has attribute | $node->hasAttribute('href')

Note: DOMDocument has no direct equivalent of the browser DOM's innerHTML. $doc->saveHTML($node) is the closest, but it gives outer HTML (including the element itself). For inner-only, iterate over the children and concatenate the saveHTML output of each.
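One way to write that inner-HTML loop is sketched below. The function name innerHtml is a hypothetical helper, not part of the DOM extension, and the markup is made up for the demonstration.

```php
<?php
// Hypothetical helper: concatenates the serialized form of each child node.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes a single child, text nodes included.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="box"><b>bold</b> and plain</div></body></html>');
libxml_clear_errors();

$box = (new DOMXPath($doc))->query('//div[@id="box"]')->item(0);

$outer = $doc->saveHTML($box);  // the <div> element plus its contents
$inner = innerHtml($box);       // only what sits inside the <div>

echo $outer . "\n";
echo $inner . "\n";
```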

Walking the tree

$row = $xpath->query('//tr[1]')->item(0);

// Previous and next siblings (raw DOM, includes whitespace text nodes)
$prev = $row->previousSibling;
$next = $row->nextSibling;

// Filter to element siblings
while ($next && $next->nodeType !== XML_ELEMENT_NODE) {
  $next = $next->nextSibling;
}

The whitespace-sibling problem from BeautifulSoup applies here too. Always check nodeType === XML_ELEMENT_NODE (constant value 1) when walking siblings.
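An alternative to the manual loop is to let XPath skip non-element nodes for you. In this sketch, on hypothetical table markup, the following-sibling::* axis matches only elements, and [1] picks the nearest one.

```php
<?php
// Hypothetical markup: a comment and whitespace sit between the two rows.
$html = <<<'HTML'
<html><body><table>
  <tr id="first"><td>1</td></tr>
  <!-- spacer -->
  <tr id="second"><td>2</td></tr>
</table></body></html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$row = $xpath->query('//tr[@id="first"]')->item(0);

// Raw DOM sibling: here a text or comment node, not the next row.
$rawNext = $row->nextSibling;

// Element-only sibling via XPath, evaluated relative to $row.
$nextElement = $xpath->query('following-sibling::*[1]', $row)->item(0);

echo $rawNext->nodeName . "\n";
echo $nextElement->getAttribute('id') . "\n";
```

On PHP 8.0 and later, DOMElement also exposes nextElementSibling and previousElementSibling properties, which do the same job without XPath.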

A complete table-scraping example

<?php
$html = file_get_contents('https://practice.scrapingcentral.com/challenges/static/tables/simple');

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$rows = [];

// Headers
$headers = [];
foreach ($xpath->query('//table[1]/thead/tr/th') as $th) {
  $headers[] = trim($th->textContent);
}

// Rows
foreach ($xpath->query('//table[1]/tbody/tr') as $tr) {
  $row = [];
  $cells = $xpath->query('./td', $tr);
  foreach ($cells as $i => $td) {
    $row[$headers[$i] ?? $i] = trim($td->textContent);
  }
  $rows[] = $row;
}

print_r($rows);

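About 30 lines, no Composer, just the extensions bundled with PHP. The thead/tbody steps in these XPath expressions work because the practice page's markup contains those elements. Unlike browsers, libxml's HTML parser generally does not insert a missing <tbody>, so a path copied from browser dev tools can return nothing against a table whose source omits it. Check the raw source, or match rows with //tr so the path works either way.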
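If you would rather not depend on whether <thead> or <tbody> is present in the parsed tree, the queries can be loosened. The sketch below uses a hypothetical table with neither element and matches rows at any depth inside the first table.

```php
<?php
// Hypothetical table whose markup has no <thead> or <tbody>.
$html = <<<'HTML'
<html><body><table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Lamp</td><td>12</td></tr>
  <tr><td>Desk</td><td>75</td></tr>
</table></body></html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// (//table)[1] is the first table in the document; //tr reaches rows at any depth.
$headers = [];
foreach ($xpath->query('(//table)[1]//tr/th') as $th) {
    $headers[] = trim($th->textContent);
}

// tr[td] keeps only rows that contain data cells, which skips the header row.
$rows = [];
foreach ($xpath->query('(//table)[1]//tr[td]') as $tr) {
    $row = [];
    foreach ($xpath->query('./td', $tr) as $i => $td) {
        $row[$headers[$i] ?? $i] = trim($td->textContent);
    }
    $rows[] = $row;
}

print_r($rows);
```

The trade-off is that //tr also descends into nested tables, so pages that nest tables inside cells need the stricter path.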

Comparison to DomCrawler

DOMDocument is one layer below Symfony DomCrawler. In fact, DomCrawler wraps a DOMDocument and adds a fluent API. If you're not on Composer, raw DOMDocument is what you have; if you are, DomCrawler (Lesson 1.18) is far more pleasant.

When DOMDocument bites

  • Comments inside <script> confuse libxml's HTML mode. Strip them or use a different parser.
  • HTML5 tags (<section>, <article>, <nav>): libxml's HTML parser is HTML4-era, so it may flag them as invalid, but it still keeps them as generic elements. Mostly fine in practice.
  • Void elements (<br>, <img>) are sometimes serialized oddly on the way back out. Parse with DOMDocument freely, but serialize with care.

For modern fully-spec'd HTML5 parsing in PHP, look at masterminds/html5 (a Composer package), but for the workhorse 90% of scraping, DOMDocument is fine.

Hands-on lab

Fetch /challenges/static/tables/simple and parse it with DOMDocument + DOMXPath. Extract the first table's headers and rows into a PHP array. Verify the row count and spot-check one example row. Then deliberately mis-set the encoding (no UTF-8 prefix) and confirm that non-ASCII characters come out garbled. Fixing it is part of the exercise.
