Symfony DomCrawler: The Modern PHP Parser
DomCrawler wraps DOMDocument in a fluent, jQuery-like API, supports both CSS selectors and XPath, and is the de facto default HTML parser for any non-trivial PHP scraper.
What you’ll learn
- Install and instantiate a Crawler from HTML string or response.
- Filter elements with CSS (`filter`) and XPath (`filterXPath`).
- Extract text, attributes, and multiple values cleanly.
- Combine DomCrawler with Guzzle or Symfony HttpClient in a single pipeline.
DomCrawler is what raw DOMDocument should have been. It provides a jQuery-like fluent API, accepts both CSS and XPath, handles UTF-8 correctly, and integrates cleanly with Symfony BrowserKit and HttpClient. It's the modern default for PHP scraping.
Install
composer require symfony/dom-crawler symfony/css-selector
The css-selector package converts CSS selectors to XPath under the hood; it's what enables CSS input to filter(). Without it, filter() throws a LogicException and you're limited to filterXPath().
Building a Crawler
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://practice.scrapingcentral.com/challenges/static/lists/cards');
$crawler = new Crawler($html);
echo $crawler->count() . " top-level nodes\n";
echo $crawler->filter('h1')->text() . "\n";
That's it. The Crawler auto-detects the encoding from <meta charset> and decodes correctly; no DOMDocument workarounds needed.
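The encoding claim is easy to verify with an inline document. A minimal sketch (assumes symfony/dom-crawler and symfony/css-selector are installed via Composer; the markup is invented for illustration):

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// The markup declares its charset; the Crawler picks it up from
// <meta charset> and decodes the multibyte text correctly.
$html = '<html><head><meta charset="utf-8"></head>'
      . '<body><h1>Café Zürich, naïve façade</h1></body></html>';

$crawler = new Crawler($html);
echo $crawler->filter('h1')->text() . "\n"; // prints "Café Zürich, naïve façade"
```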
Combined with Guzzle:
use GuzzleHttp\Client;
$client = new Client(['base_uri' => 'https://practice.scrapingcentral.com']);
$response = $client->get('/challenges/static/lists/cards');
$crawler = new Crawler((string) $response->getBody());
Filtering: CSS and XPath
// CSS
$cards = $crawler->filter('article.product-card');
echo $cards->count() . " cards\n";
// XPath
$cards = $crawler->filterXPath('//article[contains(@class, "product-card")]');
// Mix and match, both return a Crawler instance you can chain
$titles = $crawler->filter('article.product-card')->filterXPath('.//h2');
filter and filterXPath both return a new Crawler representing the matched set. Chain them freely.
Iteration
foreach ($crawler->filter('article.product-card') as $domNode) {
// $domNode is a raw \DOMNode, useful for direct DOM operations
echo $domNode->nodeName . "\n";
}
// Or get a Crawler per node:
$crawler->filter('article.product-card')->each(function (Crawler $node) {
echo $node->filter('h2')->text() . "\n";
});
each() is the idiomatic per-node iteration with Crawler semantics. The closure can return a value; each() collects those return values into an array, giving map-style behavior.
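To make the map-style behavior concrete, a sketch against an inline list (the markup is invented; assumes the Composer packages from the Install step):

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<ul><li> alpha </li><li> beta </li></ul>');

// each() passes a per-node Crawler to the closure and collects
// the return values into a plain array.
$items = $crawler->filter('li')->each(fn (Crawler $li) => $li->text());

print_r($items); // ['alpha', 'beta'] (text() trims by default)
```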
Extraction
$node = $crawler->filter('h1')->first();
$node->text(); // trimmed text content
$node->text('default'); // default if not found (avoids exceptions)
$node->html(); // inner HTML
$node->outerHtml(); // outer HTML (Symfony 5.4+)
$node->attr('href'); // attribute value
$node->attr('href', 'fallback'); // with default
$node->nodeName(); // 'h1'
$node->count(); // 1 (or 0 if not found)
text() is whitespace-trimmed by default. For raw text including whitespace: text(null, false).
If filter() matched nothing and you call text(), you get an InvalidArgumentException. Either check ->count() > 0 first or use the $default argument.
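Both guards side by side; a sketch where the selector deliberately matches nothing (#no-such-id is an invented id):

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<div id="real">content</div>');
$missing = $crawler->filter('#no-such-id'); // empty Crawler, count() === 0

// Option 1: check before reading.
$value = $missing->count() > 0 ? $missing->text() : 'n/a';

// Option 2: let text() fall back to a default instead of throwing.
$value = $missing->text('n/a');

echo $value . "\n"; // n/a
```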
extract(): pull arrays of values
For "give me every href from these links":
$urls = $crawler->filter('a.product-link')->extract(['href']);
// ['/p/1', '/p/2', '/p/3'...]
// Multiple attributes at once:
$pairs = $crawler->filter('a.product-link')->extract(['_text', 'href']);
// [['Item 1', '/p/1'], ['Item 2', '/p/2']...]
The special pseudo-attribute _text gives the element's text. This is the cleanest way to pull a column-shaped list of data from a list of repeating elements.
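Reshaping those pairs into associative rows takes a single array_map; a sketch with invented markup:

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<a class="product-link" href="/p/1">Item 1</a>'
      . '<a class="product-link" href="/p/2">Item 2</a>';

$pairs = (new Crawler($html))
    ->filter('a.product-link')
    ->extract(['_text', 'href']);

// Turn [label, url] tuples into named columns.
$rows = array_map(
    fn (array $p) => ['label' => $p[0], 'url' => $p[1]],
    $pairs
);

print_r($rows);
```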
A complete card scrape
<?php
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://practice.scrapingcentral.com/challenges/static/lists/cards');
$crawler = new Crawler($html);
$cards = $crawler->filter('article.product-card')->each(function (Crawler $card) {
return [
'name' => $card->filter('h2')->text(),
'price' => $card->filter('.price')->text(''), // default '' if .price is missing
'url' => $card->filter('a')->attr('href'),
'tags' => $card->filter('.tag')->extract(['_text']),
];
});
print_r($cards);
That's a clean, readable scraper. No DOMDocument boilerplate, no UTF-8 workarounds, no XPath when CSS suffices.
Navigation methods
DomCrawler exposes parent/sibling/child operations:
$node->ancestors(); // all ancestor crawlers (parents() is deprecated since Symfony 5.3)
$node->children(); // direct children
$node->children('a'); // children filtered by selector
$node->siblings(); // sibling crawlers
$node->nextAll(); // following siblings
$node->previousAll(); // preceding siblings
$node->first();
$node->last();
$node->eq(2); // 3rd match (0-indexed)
For text-anchored navigation (find the <dd> after the <dt> with text "Brand"):
$crawler->filterXPath('//dt[normalize-space()="Brand"]/following-sibling::dd[1]')->text();
CSS can't filter on text content; drop to XPath for those cases.
Forms (preview)
DomCrawler can also parse and fill forms:
$form = $crawler->selectButton('Submit')->form();
$form['username'] = 'student';
$form['password'] = 'practice123';
This pairs perfectly with BrowserKit (Lesson 1.19), which submits the form for you. Together they let you replicate browser-style interaction without launching a browser.
Combining with Symfony HttpClient
use Symfony\Component\HttpClient\HttpClient;
use Symfony\Component\DomCrawler\Crawler;
$client = HttpClient::create();
$response = $client->request('GET', 'https://practice.scrapingcentral.com/blog');
$crawler = new Crawler($response->getContent());
foreach ($crawler->filter('article.post') as $post) {
$c = new Crawler($post);
echo $c->filter('h2 a')->text() . "\n";
}
The cleanest PHP-native stack: HttpClient for fetching, DomCrawler for parsing. Fewer dependencies, modern Symfony idioms throughout.
DomCrawler vs BeautifulSoup vs lxml
| Concern | DomCrawler | BeautifulSoup | lxml |
|---|---|---|---|
| CSS selectors | Yes (with symfony/css-selector) | Yes | Yes (with cssselect) |
| XPath | Yes | No (need lxml) | Yes |
| Encoding handling | Automatic | Depends on parser arg | Automatic on bytes |
| Performance | Fast (wraps libxml) | Slow (Python objects) | Fastest |
| Ergonomics | Fluent / jQuery-like | Pythonic | Lower-level |
For PHP, DomCrawler is the de facto right answer. The same arguments don't apply 1:1 to Python, but conceptually it's "BeautifulSoup with XPath baked in."
Hands-on lab
Scrape /challenges/static/lists/cards with DomCrawler. Use filter() for the cards, then per card extract the title, subtitle, and any visible tag list as an array. Try the same query with filterXPath() instead of filter() to feel the difference. Use extract(['_text', 'href']) on the card links to get a flat list of (label, URL) pairs.
Practice this lesson on Catalog108, our first-party scraping sandbox.
Open lab target → /challenges/static/lists/cards
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.