PHP DOMDocument and DOMXPath
The native PHP HTML/XML parser. No Composer dependencies, ships with every PHP install, supports DOM traversal and XPath queries.
What you’ll learn
- Load HTML into a `DOMDocument` and silence its noisy warnings.
- Query with `getElementsByTagName` and `getElementById`.
- Run XPath queries via `DOMXPath`.
- Extract text, attributes, and inner HTML reliably.
DOMDocument and DOMXPath come from PHP's dom extension, which is built on libxml2 and enabled by default in standard builds. They're verbose compared to Guzzle + DomCrawler, but they have one big advantage: zero dependencies. For quick scripts, restricted hosts, or environments where Composer is unavailable, they're invaluable.
Loading HTML
<?php
$html = file_get_contents('https://practice.scrapingcentral.com/challenges/static/tables/simple');
$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress noisy parse warnings
$doc->loadHTML($html);
libxml_clear_errors();
The libxml_use_internal_errors(true) call is essential. Real-world HTML triggers dozens of libxml warnings (htmlParseEntityRef, Tag X invalid, etc.) that you don't want littering your output. Suppress, then clear after loading.
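If you would rather inspect what libxml complained about than discard it, the buffered errors can be read before they are cleared. A minimal sketch, assuming a made-up inline HTML string instead of the live page; how many errors you see depends on the libxml2 version:

```php
<?php
// Hypothetical sample markup; real pages usually produce more warnings.
$html = '<html><body><section><p>Unclosed paragraph<div>Stray &amp text</div></section></body></html>';

// Buffer libxml errors instead of emitting PHP warnings, remembering the old setting.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line, and message fields.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the buffer and restore whatever error mode was active before.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Restoring the previous setting keeps the suppression from leaking into unrelated code that runs later in the same script.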
A common encoding pitfall
DOMDocument::loadHTML falls back to ISO-8859-1 when the markup does not declare a charset it recognizes. Most pages are UTF-8, so you usually have to signal it:
$doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
Or, on PHP 8.4+, use the new DOM\HTMLDocument::createFromString($html) which handles UTF-8 by default. For broad compatibility today, the XML processing instruction prefix is the workaround.
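The effect of the prefix is easiest to see side by side. The sketch below assumes a tiny made-up UTF-8 snippet with no charset declaration and loads it twice, once without and once with the prefix; the exact garbling in the first case may vary with the libxml2 version:

```php
<?php
// Hypothetical UTF-8 sample ("café"); there is no <meta charset>, so libxml has to guess.
$html = "<p>caf\u{00E9}</p>";

libxml_use_internal_errors(true);

// Without a hint, the bytes are likely to be misread as ISO-8859-1.
$plain = new DOMDocument();
$plain->loadHTML($html);
$withoutHint = $plain->getElementsByTagName('p')->item(0)->textContent;

// With the XML declaration prefix, libxml treats the input as UTF-8.
$hinted = new DOMDocument();
$hinted->loadHTML('<?xml encoding="UTF-8"?>' . $html);
$withHint = $hinted->getElementsByTagName('p')->item(0)->textContent;

libxml_clear_errors();

echo "without hint: $withoutHint\n";
echo "with hint:    $withHint\n";
```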
Basic queries
// By tag name
$h1s = $doc->getElementsByTagName('h1');
foreach ($h1s as $h1) {
    echo $h1->textContent . "\n";
}
// By id
$nav = $doc->getElementById('nav');
// Iteration
foreach ($doc->getElementsByTagName('a') as $a) {
    echo $a->getAttribute('href') . "\n";
}
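getElementsByTagName returns a DOMNodeList (live, iterable). getElementById returns a single element or null. In documents loaded with loadHTML, id attributes are generally recognized automatically. In documents loaded as XML, it returns null unless a DTD declares the attribute as an ID or you call setIdAttribute. When it doesn't find what you expect, fall back to XPath: //*[@id="nav"] works in both cases.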
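A small wrapper can try getElementById first and fall back to an attribute match. This is a sketch only: findById() is a helper invented for this example, and the markup is made up.

```php
<?php
// Hypothetical markup for illustration.
$html = '<html><body><ul id="nav"><li><a href="/home">Home</a></li></ul></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// findById() is not part of the DOM extension; it is defined here for the example.
function findById(DOMDocument $doc, string $id): ?DOMElement
{
    $node = $doc->getElementById($id);
    if ($node !== null) {
        return $node;
    }
    // Fallback: match the attribute directly. Assumes $id contains no double quotes.
    $xpath = new DOMXPath($doc);
    $match = $xpath->query('//*[@id="' . $id . '"]')->item(0);
    return $match instanceof DOMElement ? $match : null;
}

$nav = findById($doc, 'nav');
echo $nav !== null ? $nav->nodeName . "\n" : "not found\n";
```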
XPath: the workhorse
For anything beyond tag-name lookups, use DOMXPath:
$xpath = new DOMXPath($doc);
$cards = $xpath->query('//article[contains(@class, "product-card")]');
foreach ($cards as $card) {
    $name = $xpath->evaluate('string(.//h2)', $card);
    $price = $xpath->evaluate('string(.//*[contains(@class, "price")])', $card);
    $url = $xpath->evaluate('string(.//a/@href)', $card);
    echo "$name | $price | $url\n";
}
Two important methods:
- $xpath->query($expression, $contextNode = null) returns a DOMNodeList.
- $xpath->evaluate($expression, $contextNode = null) returns a node list OR a scalar (string, number, bool), depending on the XPath expression.
string(...) and count(...) are scalar XPath functions; use evaluate for those.
The context node is the second argument, with the same . semantics as Python's lxml: .//h2 inside a card iteration searches only that card.
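The difference between the two methods shows up in the PHP types they return. A minimal sketch on a made-up two-card document (the product names and prices are invented for illustration):

```php
<?php
// Hypothetical markup with two product cards.
$html = '<html><body>'
    . '<article class="product-card"><h2>Lamp</h2><span class="price">$20</span></article>'
    . '<article class="product-card"><h2>Desk</h2><span class="price">$150</span></article>'
    . '</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// query() yields a DOMNodeList of matching nodes.
$cards = $xpath->query('//article[contains(@class, "product-card")]');

// evaluate() with count() yields a float, because XPath numbers are doubles.
$total = $xpath->evaluate('count(//article)');

// evaluate() with string() and a context node yields a plain PHP string per card.
$names = [];
foreach ($cards as $card) {
    $names[] = $xpath->evaluate('string(.//h2)', $card);
}

var_dump($total);
print_r($names);
```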
Extracting common things
| Want | How |
|---|---|
| Text content | $node->textContent (recursive, includes descendants) |
| Attribute | $node->getAttribute('href') |
| Outer HTML | $doc->saveHTML($node), serializes the element itself plus its contents |
| Tag name | $node->nodeName |
| Children | $node->childNodes |
| Parent | $node->parentNode |
| Has attribute | $node->hasAttribute('href') |
Note: there's no direct "inner HTML" the way browser DOM has. $doc->saveHTML($node) is the closest, but it gives outer HTML (including the element itself). For inner-only, iterate children and concat their saveHTML.
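The child-concatenation approach might look like the sketch below. innerHtml() is a helper written for this example, not something the classic DOMDocument API provides, and the markup is made up:

```php
<?php
// innerHtml() is a hypothetical helper: it serializes each child and joins the results.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical markup for illustration.
$source = '<html><body><div class="box">Hello <b>world</b></div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();

$box = (new DOMXPath($doc))->query('//div[@class="box"]')->item(0);

$outer = $doc->saveHTML($box); // includes the <div> wrapper
$inner = innerHtml($box);      // only what is inside the <div>

echo $outer . "\n";
echo $inner . "\n";
```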
Walking the tree
$row = $xpath->query('//tr[1]')->item(0);
// Previous and next siblings (raw DOM, includes whitespace text nodes)
$prev = $row->previousSibling;
$next = $row->nextSibling;
// Filter to element siblings
while ($next && $next->nodeType !== XML_ELEMENT_NODE) {
    $next = $next->nextSibling;
}
The whitespace-sibling problem from BeautifulSoup applies here too. Always check nodeType === XML_ELEMENT_NODE (constant value 1) when walking siblings.
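An alternative to the manual loop is to let XPath skip non-element nodes: the following-sibling axis with a * node test only ever matches elements. A minimal sketch on a made-up two-row table:

```php
<?php
// Hypothetical table with newlines between rows.
$html = "<html><body><table>\n<tr><td>first</td></tr>\n<tr><td>second</td></tr>\n</table></body></html>";

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$row = $xpath->query('(//tr)[1]')->item(0);

// following-sibling::*[1] selects the nearest following element,
// ignoring any text or comment nodes in between.
$next = $xpath->query('following-sibling::*[1]', $row)->item(0);

echo trim($next->textContent) . "\n";
```

The same idea works backwards with preceding-sibling::*[1].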
A complete table-scraping example
<?php
$html = file_get_contents('https://practice.scrapingcentral.com/challenges/static/tables/simple');
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('<?xml encoding="UTF-8"?>' . $html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
$rows = [];
// Headers: (//table)[1] is the first table in the document;
// //table[1] would match every table that is the first table child of its parent.
$headers = [];
foreach ($xpath->query('(//table)[1]/thead/tr/th') as $th) {
    $headers[] = trim($th->textContent);
}
// Rows, keyed by header text where a header exists for the column
foreach ($xpath->query('(//table)[1]/tbody/tr') as $tr) {
    $row = [];
    $cells = $xpath->query('./td', $tr);
    foreach ($cells as $i => $td) {
        $row[$headers[$i] ?? $i] = trim($td->textContent);
    }
    $rows[] = $row;
}
print_r($rows);
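Under 30 lines, no Composer, just PHP's bundled extensions. The XPath assumes the page's markup contains real <thead> and <tbody> elements. Unlike browsers, libxml's HTML parser generally does not insert a <tbody> that the source omits, so a path copied from browser dev tools (which show the inserted <tbody>) can return nothing. Check the raw source before relying on tbody in a path.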
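When you can't be sure whether the source has a <tbody>, a descendant match avoids the question entirely. A sketch under the assumption of a simple, non-nested table (the sample markup is made up):

```php
<?php
// Hypothetical table with no <thead> or <tbody> in the source.
$html = '<table>'
    . '<tr><th>Name</th><th>Price</th></tr>'
    . '<tr><td>Lamp</td><td>20</td></tr>'
    . '<tr><td>Desk</td><td>150</td></tr>'
    . '</table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$table = $xpath->query('(//table)[1]')->item(0);

// .//tr/th and .//tr[td] match at any depth, so an intermediate <tbody> does not matter.
$headers = [];
foreach ($xpath->query('.//tr/th', $table) as $th) {
    $headers[] = trim($th->textContent);
}

$rows = [];
foreach ($xpath->query('.//tr[td]', $table) as $tr) {
    $row = [];
    foreach ($xpath->query('./td', $tr) as $i => $td) {
        $row[$headers[$i] ?? $i] = trim($td->textContent);
    }
    $rows[] = $row;
}

print_r($rows);
```

The trade-off is that .//tr also reaches into nested tables, so this form suits flat tables only.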
Comparison to DomCrawler
DOMDocument is one layer below Symfony DomCrawler. In fact, DomCrawler wraps a DOMDocument and adds a fluent API. If you're not on Composer, raw DOMDocument is what you have; if you are, DomCrawler (Lesson 1.18) is far more pleasant.
When DOMDocument bites
- Comments inside <script> confuse libxml's HTML mode. Strip them or use a different parser.
- HTML5 tags (<section>, <article>, <nav>): libxml's HTML parser is HTML4-era but accepts unknown tags as generic elements. Mostly fine in practice.
- Self-closing tags in HTML (<br>, <img>) are sometimes serialized weirdly back out. Read with it, serialize with care.
For modern fully-spec'd HTML5 parsing in PHP, look at masterminds/html5 (a Composer package), but for the workhorse 90% of scraping, DOMDocument is fine.
Hands-on lab
Fetch /challenges/static/tables/simple and parse it with DOMDocument + DOMXPath. Extract the first table's headers and rows into a PHP array. Verify the row count and spot-check one row. Then deliberately mis-set the encoding (no UTF-8 prefix) and confirm you get garbled non-ASCII characters in the output. Fixing it is part of the exercise.