Symfony DomCrawler: The Modern PHP Parser
DomCrawler wraps DOMDocument in a fluent, jQuery-like API, supports both CSS selectors and XPath, and is the de facto default HTML parser for any non-trivial PHP scraper.
What you’ll learn
- Install and instantiate a Crawler from HTML string or response.
- Filter elements with CSS (`filter`) and XPath (`filterXPath`).
- Extract text, attributes, and multiple values cleanly.
- Combine DomCrawler with Guzzle or Symfony HttpClient in a single pipeline.
DomCrawler is what raw DOMDocument should have been. It provides a jQuery-like fluent API, accepts both CSS and XPath, handles UTF-8 correctly, and integrates cleanly with Symfony BrowserKit and HttpClient. It's the modern default for PHP scraping.
Install
composer require symfony/dom-crawler symfony/css-selector
The css-selector package converts CSS selectors to XPath under the hood; it's what enables CSS input to filter(). Without it, filter() throws a LogicException and you're limited to filterXPath().
Building a Crawler
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://practice.scrapingcentral.com/challenges/static/lists/cards');
$crawler = new Crawler($html);
echo $crawler->count() . " top-level nodes\n";
echo $crawler->filter('h1')->text() . "\n";
That's it. The Crawler auto-detects the encoding from <meta charset> and decodes correctly; no DOMDocument workarounds needed.
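The encoding claim is easy to verify with an inline document. A minimal sketch (assumes symfony/dom-crawler and symfony/css-selector are installed via Composer; the markup is invented for illustration):

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// The markup declares its charset; the Crawler picks it up from
// <meta charset> and decodes the multibyte text correctly.
$html = '<html><head><meta charset="utf-8"></head>'
      . '<body><h1>Café Zürich, naïve façade</h1></body></html>';

$crawler = new Crawler($html);
echo $crawler->filter('h1')->text() . "\n"; // prints "Café Zürich, naïve façade"
```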
Combined with Guzzle:
use GuzzleHttp\Client;
$client = new Client(['base_uri' => 'https://practice.scrapingcentral.com']);
$response = $client->get('/challenges/static/lists/cards');
$crawler = new Crawler((string) $response->getBody());
Filtering: CSS and XPath
// CSS
$cards = $crawler->filter('article.product-card');
echo $cards->count() . " cards\n";
// XPath
$cards = $crawler->filterXPath('//article[contains(@class, "product-card")]');
// Mix and match, both return a Crawler instance you can chain
$titles = $crawler->filter('article.product-card')->filterXPath('.//h2');
filter and filterXPath both return a new Crawler representing the matched set. Chain them freely.
Iteration
foreach ($crawler->filter('article.product-card') as $domNode) {
// $domNode is a raw \DOMNode, useful for direct DOM operations
echo $domNode->nodeName . "\n";
}
// Or get a Crawler per node:
$crawler->filter('article.product-card')->each(function (Crawler $node) {
echo $node->filter('h2')->text() . "\n";
});
each() is the idiomatic per-node iteration with Crawler semantics. The closure can return a value; each() collects those return values into an array, giving map-style behavior.
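To make the map-style behavior concrete, a sketch against an inline list (the markup is invented; assumes the Composer packages from the Install step):

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<ul><li> alpha </li><li> beta </li></ul>');

// each() passes a per-node Crawler to the closure and collects
// the return values into a plain array.
$items = $crawler->filter('li')->each(fn (Crawler $li) => $li->text());

print_r($items); // ['alpha', 'beta'] (text() trims by default)
```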
Extraction
$node = $crawler->filter('h1')->first();
$node->text(); // trimmed text content
$node->text('default'); // default if not found (avoids exceptions)
$node->html(); // inner HTML
$node->outerHtml(); // outer HTML (Symfony 5.4+)
$node->attr('href'); // attribute value
$node->attr('href', 'fallback'); // with default
$node->nodeName(); // 'h1'
$node->count(); // 1 (or 0 if not found)
text() is whitespace-trimmed by default. For raw text including whitespace: text(null, false).
If filter() matched nothing and you call text(), you get an InvalidArgumentException. Either check ->count() > 0 first or use the $default argument.
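Both guards side by side; a sketch where the selector deliberately matches nothing (#no-such-id is an invented id):

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<div id="real">content</div>');
$missing = $crawler->filter('#no-such-id'); // empty Crawler, count() === 0

// Option 1: check before reading.
$value = $missing->count() > 0 ? $missing->text() : 'n/a';

// Option 2: let text() fall back to a default instead of throwing.
$value = $missing->text('n/a');

echo $value . "\n"; // n/a
```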
extract(): pull arrays of values
For "give me every href from these links":
$urls = $crawler->filter('a.product-link')->extract(['href']);
// ['/p/1', '/p/2', '/p/3'...]
// Multiple attributes at once:
$pairs = $crawler->filter('a.product-link')->extract(['_text', 'href']);
// [['Item 1', '/p/1'], ['Item 2', '/p/2']...]
The special pseudo-attribute _text gives the element's text. This is the cleanest way to pull a column-shaped list of data from a list of repeating elements.
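Reshaping those pairs into associative rows takes a single array_map; a sketch with invented markup:

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<a class="product-link" href="/p/1">Item 1</a>'
      . '<a class="product-link" href="/p/2">Item 2</a>';

$pairs = (new Crawler($html))
    ->filter('a.product-link')
    ->extract(['_text', 'href']);

// Turn [label, url] tuples into named columns.
$rows = array_map(
    fn (array $p) => ['label' => $p[0], 'url' => $p[1]],
    $pairs
);

print_r($rows);
```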
A complete card scrape
<?php
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://practice.scrapingcentral.com/challenges/static/lists/cards');
$crawler = new Crawler($html);
$cards = $crawler->filter('article.product-card')->each(function (Crawler $card) {
return [
'name' => $card->filter('h2')->text(),
'price' => $card->filter('.price')->text(''), // default '' if .price is missing
'url' => $card->filter('a')->attr('href'),
'tags' => $card->filter('.tag')->extract(['_text']),
];
});
print_r($cards);
That's a clean, readable scraper. No DOMDocument boilerplate, no UTF-8 workarounds, no XPath when CSS suffices.
Navigation methods
DomCrawler exposes parent/sibling/child operations:
$node->ancestors(); // all ancestor crawlers (parents() is deprecated since Symfony 5.3)
$node->children(); // direct children
$node->children('a'); // children filtered by selector
$node->siblings(); // sibling crawlers
$node->nextAll(); // following siblings
$node->previousAll(); // preceding siblings
$node->first();
$node->last();
$node->eq(2); // 3rd match (0-indexed)
For text-anchored navigation (find the <dd> after the <dt> with text "Brand"):
$crawler->filterXPath('//dt[normalize-space()="Brand"]/following-sibling::dd[1]')->text();
CSS can't filter on text content; drop to XPath for those cases.
Forms (preview)
DomCrawler can also parse and fill forms:
$form = $crawler->selectButton('Submit')->form();
$form['username'] = 'student';
$form['password'] = 'practice123';
This pairs perfectly with BrowserKit (Lesson 1.19), which submits the form for you. Together they let you replicate browser-style interaction without launching a browser.
Combining with Symfony HttpClient
use Symfony\Component\HttpClient\HttpClient;
use Symfony\Component\DomCrawler\Crawler;
$client = HttpClient::create();
$response = $client->request('GET', 'https://practice.scrapingcentral.com/blog');
$crawler = new Crawler($response->getContent());
foreach ($crawler->filter('article.post') as $post) {
$c = new Crawler($post);
echo $c->filter('h2 a')->text() . "\n";
}
The cleanest PHP-native stack: HttpClient for fetching, DomCrawler for parsing. Fewer dependencies, modern Symfony idioms throughout.
DomCrawler vs BeautifulSoup vs lxml
| Concern | DomCrawler | BeautifulSoup | lxml |
|---|---|---|---|
| CSS selectors | Yes (with symfony/css-selector) | Yes | Yes (with cssselect) |
| XPath | Yes | No (need lxml) | Yes |
| Encoding handling | Automatic | Depends on parser arg | Automatic on bytes |
| Performance | Fast (wraps libxml) | Slow (Python objects) | Fastest |
| Ergonomics | Fluent / jQuery-like | Pythonic | Lower-level |
For PHP, DomCrawler is the de facto right answer. The same arguments don't apply 1:1 to Python, but conceptually it's "BeautifulSoup with XPath baked in."
Hands-on lab
Scrape /challenges/static/lists/cards with DomCrawler. Use filter() for the cards, then per card extract the title, subtitle, and any visible tag list as an array. Try the same query with filterXPath() instead of filter() to feel the difference. Use extract(['_text', 'href']) on the card links to get a flat list of (label, URL) pairs.
Practice this lesson on Catalog108, our first-party scraping sandbox.
Open lab target → /challenges/static/lists/cards
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.