
Lesson 1.18 · Beginner · 4 min read

Symfony DomCrawler, The Modern PHP Parser

DomCrawler wraps DOMDocument with a fluent jQuery-like API, supports both CSS and XPath, and is the default HTML parser for any non-trivial PHP scraper.

What you’ll learn

  • Install and instantiate a Crawler from HTML string or response.
  • Filter elements with CSS (`filter`) and XPath (`filterXPath`).
  • Extract text, attributes, and multiple values cleanly.
  • Combine DomCrawler with Guzzle or Symfony HttpClient in a single pipeline.

DomCrawler is what raw DOMDocument should have been. It provides a jQuery-like fluent API, accepts both CSS and XPath, handles UTF-8 correctly, and integrates cleanly with Symfony BrowserKit and HttpClient. It's the modern default for PHP scraping.

Install

composer require symfony/dom-crawler symfony/css-selector

The css-selector package is what enables CSS-selector input to filter(). Without it, you'd be stuck with filterXPath() only.
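You can watch that translation happen. Here's a minimal sketch, assuming both packages are installed via Composer, that uses CssSelectorConverter (the class filter() relies on internally) to show the XPath a CSS selector compiles to:

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\CssSelector\CssSelectorConverter;

$converter = new CssSelectorConverter();

// filter('article.product-card') is compiled to an XPath expression like
// descendant-or-self::article[...class test...] before libxml ever sees it
echo $converter->toXPath('article.product-card') . "\n";
echo $converter->toXPath('h1') . "\n";  // descendant-or-self::h1
```

This is also why every CSS query ultimately has an XPath equivalent: CSS is the convenience layer, XPath is what executes.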

Building a Crawler

<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://practice.scrapingcentral.com/challenges/static/lists/cards');
$crawler = new Crawler($html);

echo $crawler->count() . " top-level nodes\n";
echo $crawler->filter('h1')->text() . "\n";

That's it. The Crawler auto-detects the encoding from <meta charset> and decodes it correctly; none of the usual DOMDocument encoding workarounds are needed.
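A quick sketch of that encoding behavior, using an inline HTML string rather than a live fetch:

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// UTF-8 document with a declared charset: no mb_convert_encoding gymnastics
$html = <<<HTML
<!DOCTYPE html>
<html><head><meta charset="utf-8"></head>
<body><h1>Café Zürich, naïve résumé</h1></body></html>
HTML;

$crawler = new Crawler($html);
echo $crawler->filter('h1')->text() . "\n";  // Café Zürich, naïve résumé
```

With raw DOMDocument you'd typically have to re-encode the input before loadHTML() to keep those accents intact; Crawler reads the declared charset for you.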

Combined with Guzzle:

use GuzzleHttp\Client;

$client = new Client(['base_uri' => 'https://practice.scrapingcentral.com']);
$response = $client->get('/challenges/static/lists/cards');
$crawler = new Crawler((string) $response->getBody());

Filtering: CSS and XPath

// CSS
$cards = $crawler->filter('article.product-card');
echo $cards->count() . " cards\n";

// XPath
$cards = $crawler->filterXPath('//article[contains(@class, "product-card")]');

// Mix and match, both return a Crawler instance you can chain
$titles = $crawler->filter('article.product-card')->filterXPath('.//h2');

filter and filterXPath both return a new Crawler representing the matched set. Chain them freely.

Iteration

foreach ($crawler->filter('article.product-card') as $domNode) {
  // $domNode is a raw \DOMNode, useful for direct DOM operations
  echo $domNode->nodeName . "\n";
}

// Or get a Crawler per node:
$crawler->filter('article.product-card')->each(function (Crawler $node) {
  echo $node->filter('h2')->text() . "\n";
});

each() is the idiomatic way to iterate with Crawler semantics per node. The closure can return a value, and each() collects those return values into an array, i.e. map-style behavior.
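A tiny self-contained sketch of that collecting behavior, using an inline HTML fragment:

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<ul><li>PHP</li><li>Symfony</li><li>Guzzle</li></ul>');

// each() gathers the closure's return values, like array_map over the matches
$names = $crawler->filter('li')->each(fn (Crawler $li) => $li->text());

print_r($names);  // ['PHP', 'Symfony', 'Guzzle']
```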

Extraction

$node = $crawler->filter('h1')->first();

$node->text();  // trimmed text content
$node->text('default');  // default if not found (avoids exceptions)
$node->html();  // inner HTML
$node->outerHtml();  // outer HTML (Symfony 5.4+)
$node->attr('href');  // attribute value
$node->attr('href', 'fallback');  // with default
$node->nodeName();  // 'h1'
$node->count();  // 1 (or 0 if not found)

text() is whitespace-trimmed by default. For raw text including whitespace: text(null, false).

If filter() matched nothing and you call text(), you get an InvalidArgumentException. Either check ->count() > 0 first or use the $default argument.
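Both guard styles in one sketch, against an inline fragment that deliberately has no .price element:

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<div class="card"><h2>Widget</h2></div>');

// Style 1: check count() before calling text()
$price = $crawler->filter('.price');
$priceText = $price->count() > 0 ? $price->text() : 'n/a';

// Style 2: let text() supply the default instead of throwing
$priceText = $crawler->filter('.price')->text('n/a');

echo $priceText . "\n";  // n/a
```

Style 2 is terser; prefer Style 1 when an empty match should be treated as an error rather than silently defaulted.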

extract, pull arrays of values

For "give me every href from these links":

$urls = $crawler->filter('a.product-link')->extract(['href']);
// ['/p/1', '/p/2', '/p/3'...]

// Multiple attributes at once:
$pairs = $crawler->filter('a.product-link')->extract(['_text', 'href']);
// [['Item 1', '/p/1'], ['Item 2', '/p/2']...]

The special pseudo-attribute _text gives the element's text. This is the cleanest way to pull a column-shaped list of data from a list of repeating elements.

A complete card scrape

<?php
use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://practice.scrapingcentral.com/challenges/static/lists/cards');
$crawler = new Crawler($html);

$cards = $crawler->filter('article.product-card')->each(function (Crawler $card) {
  return [
    'name'  => $card->filter('h2')->text(),
    'price' => $card->filter('.price')->text(''),
    'url'   => $card->filter('a')->attr('href'),
    'tags'  => $card->filter('.tag')->extract(['_text']),
  ];
});

print_r($cards);

That's a clean, readable scraper. No DOMDocument boilerplate, no UTF-8 workarounds, no XPath when CSS suffices.

Navigation methods

DomCrawler exposes parent/sibling/child operations:

$node->ancestors();  // all ancestor crawlers (named parents() before Symfony 5.3)
$node->children();  // direct children
$node->children('a');  // children filtered by selector
$node->siblings();  // sibling crawlers
$node->nextAll();  // following siblings
$node->previousAll();  // preceding siblings
$node->first();
$node->last();
$node->eq(2);  // 3rd match (0-indexed)

For text-anchored navigation (find the <dd> after the <dt> with text "Brand"):

$crawler->filterXPath('//dt[normalize-space()="Brand"]/following-sibling::dd[1]')->text();

CSS selectors can't match on text content; drop down to XPath for those cases.
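DomCrawler hands XPath straight to ext-dom's libxml engine, so the same query also works with the stdlib alone. A sketch with plain DOMDocument and DOMXPath against an inline fragment, no Composer packages required:

```php
<?php
// The same text-anchored query with stdlib DOMDocument + DOMXPath
$html = '<dl><dt>Color</dt><dd>Blue</dd><dt>Brand</dt><dd>Acme</dd></dl>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Find the <dd> immediately following the <dt> whose text is "Brand"
$brand = $xpath->query('//dt[normalize-space()="Brand"]/following-sibling::dd[1]');

echo $brand->item(0)->textContent . "\n";  // Acme
```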

Forms (preview)

DomCrawler can also parse and fill forms:

$form = $crawler->selectButton('Submit')->form();
$form['username'] = 'student';
$form['password'] = 'practice123';

This pairs perfectly with BrowserKit (Lesson 1.19), which submits the form for you. Together they let you replicate browser-style interaction without launching a browser.

Combining with Symfony HttpClient

use Symfony\Component\HttpClient\HttpClient;
use Symfony\Component\DomCrawler\Crawler;

$client = HttpClient::create();
$response = $client->request('GET', 'https://practice.scrapingcentral.com/blog');
$crawler = new Crawler($response->getContent());

foreach ($crawler->filter('article.post') as $post) {
  $c = new Crawler($post);
  echo $c->filter('h2 a')->text() . "\n";
}

The cleanest PHP-native stack: HttpClient for fetching, DomCrawler for parsing. Fewer dependencies, modern Symfony idioms throughout.

DomCrawler vs BeautifulSoup vs lxml

| Concern | DomCrawler | BeautifulSoup | lxml |
| --- | --- | --- | --- |
| CSS selectors | Yes (with css-selector) | Yes | Yes (with cssselect) |
| XPath | Yes | No (needs lxml) | Yes |
| Encoding handling | Automatic | Depends on parser arg | Automatic on bytes |
| Performance | Fast (wraps libxml) | Slow (Python objects) | Fastest |
| Ergonomics | Fluent / jQuery-like | Pythonic | Lower-level |

For PHP, DomCrawler is the de facto right answer. The same arguments don't apply 1:1 to Python, but conceptually it's "BeautifulSoup with XPath baked in."

Hands-on lab

Scrape /challenges/static/lists/cards with DomCrawler. Use filter for the cards, then per-card extract title, subtitle, and any visible tag list as an array. Try the same query using filterXPath instead of filter to feel the difference. Use extract(['_text', 'href']) on the card links to get a flat list of (label, URL) pairs.


Quiz: check your understanding



Which Composer package enables `$crawler->filter('h1')` with CSS selectors?
