Goutte, The Original PHP Scraping Wrapper
Goutte was the go-to PHP scraper for a decade. It still works, it's still in many codebases, and its abstractions live on in modern Symfony. Why it matters and when to use it.
What you’ll learn
- Use Goutte (a.k.a. BrowserKit + HttpBrowser) for static scraping in legacy PHP code.
- Recognize where modern Symfony components replaced its functionality.
- Decide whether to keep Goutte or migrate.
If you're reading PHP scraping code older than about 2022, you'll see Goutte. Created by Symfony's Fabien Potencier, it was the de facto PHP scraper from roughly 2010 to 2020. The library is now archived; its functionality lives on in Symfony's BrowserKit + HttpBrowser components, but plenty of production code still uses the Goutte namespace.
This lesson is short because Goutte is largely a stable, mature, slightly dated wrapper. You should know it exists, what it does, and when to migrate.
What Goutte was
A web crawler that combined Guzzle (HTTP) with `Symfony\Component\DomCrawler` (HTML parsing) and `Symfony\Component\BrowserKit` (form handling, cookies). One client, three concerns:
```php
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'https://practice.scrapingcentral.com/products');
echo $crawler->filter('h1')->text();

// Forms
$form = $crawler->selectButton('Search')->form();
$crawler = $client->submit($form, ['q' => 'keyboard']);

foreach ($crawler->filter('.product-card') as $card) {
    echo $card->textContent . "\n";
}

// Following links
$link = $crawler->selectLink('Next page')->link();
$crawler = $client->click($link);
```
The API is delightful: readable, browser-like, and it never makes you think about HTTP. For static sites with forms and pagination, it's hard to beat.
What replaced it
Goutte was archived in 2023. Its functionality is now:
| Goutte feature | Modern Symfony equivalent |
|---|---|
| `Goutte\Client` | `Symfony\Component\BrowserKit\HttpBrowser` |
| HTTP requests | `Symfony\Component\HttpClient\HttpClient` |
| HTML parsing | `Symfony\Component\DomCrawler\Crawler` |
| Forms, links, cookies | BrowserKit (cookies, forms, links) |
```php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$browser = new HttpBrowser(HttpClient::create());
$crawler = $browser->request('GET', 'https://practice.scrapingcentral.com/products');
$crawler = $browser->submitForm('Search', ['q' => 'keyboard']);
$crawler = $browser->click($crawler->selectLink('Next page')->link());
```
That's the same code with slightly more explicit imports. If you're starting a new project, use HttpBrowser. If you're maintaining a Goutte project, the migration is straightforward: change the namespace; the calls are nearly identical.
When Goutte (or HttpBrowser) is the right tool
Strong fit when you need:

- Static HTML with forms. Goutte's form abstraction is exceptional: `$form['email'] = 'x@y.com'; $client->submit($form);`. Cleaner than building a POST body by hand (a short sketch follows this list).
- A browser-like cookie jar. Cookies persist automatically across `request()` calls. No manual `Set-Cookie` parsing.
- Link traversal. `$client->click($crawler->selectLink('Next')->link())` reads like "click the Next link", and is implemented that way.
- Legacy maintenance. Touching Goutte code in a 5-year-old project? Don't rewrite; the API still works.
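To make the first two bullets concrete, here's a minimal sketch of a login flow. The URL, button label, and field names are assumptions about a hypothetical login page; the point is the shape of the code.

```php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$browser = new HttpBrowser(HttpClient::create());

// Any Set-Cookie headers in this response land in the jar automatically.
$crawler = $browser->request('GET', 'https://practice.scrapingcentral.com/login');

// Fill the form like an array; no hand-built POST body.
// Button label and field names are hypothetical.
$form = $crawler->selectButton('Log in')->form();
$form['email'] = 'x@y.com';
$form['password'] = 'secret';

// The session cookie rides along on this and every later request.
$crawler = $browser->submit($form);
echo $crawler->filter('h1')->text();
```

The same calls work against the legacy `Goutte\Client`, which shares the same base API.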
Where it falls short
- JavaScript. Goutte does not run JS. For SPAs, you need Panther or Playwright.
- Concurrency. Sequential by design. For batched concurrent fetches, use HttpClient directly with `stream()` (see the sketch below).
- Modern conveniences. No retry strategies, no rate limiters, no scoped clients. You bolt those on yourself.
- No active development. The repo is archived. Bugs go to Symfony's BrowserKit; if Symfony doesn't have a Goutte-equivalent fix, you wait.
For new code, HttpBrowser is the better choice: same API, still supported.
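To make the concurrency point concrete, here's what the batched path looks like without BrowserKit. A minimal sketch: the page URLs are made up, and `RetryableHttpClient` is thrown in to show how you'd bolt on retries (a standard Symfony decorator, not something Goutte ever had).

```php
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\HttpClient\HttpClient;
use Symfony\Component\HttpClient\RetryableHttpClient;

// Decorating with RetryableHttpClient adds retry-on-failure
// (3 attempts by default), one of the conveniences Goutte lacks.
$client = new RetryableHttpClient(HttpClient::create());

// Hypothetical batch of pages to fetch concurrently.
$urls = [
    'https://practice.scrapingcentral.com/products?page=1',
    'https://practice.scrapingcentral.com/products?page=2',
    'https://practice.scrapingcentral.com/products?page=3',
];

// Responses are lazy, so all three requests go out together.
$responses = [];
foreach ($urls as $url) {
    $responses[] = $client->request('GET', $url);
}

// stream() yields chunks as they arrive, whichever response finishes first.
foreach ($client->stream($responses) as $response => $chunk) {
    if ($chunk->isLast()) {
        $crawler = new Crawler($response->getContent());
        echo $response->getInfo('url') . ': '
            . $crawler->filter('.product-card')->count() . " cards\n";
    }
}
```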
A complete pagination scrape with HttpBrowser
```php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$browser = new HttpBrowser(HttpClient::create([
    'headers' => ['User-Agent' => 'CatalogScraper/1.0'],
]));

$crawler = $browser->request('GET', 'https://practice.scrapingcentral.com/products');
$items = [];

do {
    foreach ($crawler->filter('.product-card') as $card) {
        $sub = new \Symfony\Component\DomCrawler\Crawler($card);
        $items[] = [
            'title' => $sub->filter('h3')->text(''),
            'price' => $sub->filter('.price')->text(''),
            'url'   => $sub->filter('a')->attr('href'),
        ];
    }

    try {
        $next = $crawler->selectLink('Next')->link();
        $crawler = $browser->click($next);
    } catch (\InvalidArgumentException) {
        break; // no Next link, done
    }
} while (true);

echo count($items) . " products\n";
```
Clean. Linear. No boilerplate. For static catalogues, this is genuinely the right amount of code.
Goutte vs HttpClient directly, when to skip HttpBrowser
If you don't need cookies, forms, or link traversal (just "fetch this URL, parse the HTML"), drop straight to:
```php
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\HttpClient\HttpClient;

$client = HttpClient::create();
$response = $client->request('GET', $url);
$crawler = new Crawler($response->getContent());
```
You skip the BrowserKit layer entirely: slightly more verbose, but one dependency fewer. For pure read-only, API-ish scrapes, this is leaner.
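One wrinkle worth knowing: `Crawler::filter()` needs the `symfony/css-selector` package for CSS selectors (otherwise use `filterXPath()`). With it installed, `each()` keeps the extraction tidy; a sketch, reusing the selectors from the pagination example above:

```php
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\HttpClient\HttpClient;

$client = HttpClient::create();
$response = $client->request('GET', 'https://practice.scrapingcentral.com/products');

// each() hands every match back as a Crawler, so there's no need
// to re-wrap DOM nodes by hand.
$crawler = new Crawler($response->getContent());
$items = $crawler->filter('.product-card')->each(
    fn (Crawler $card) => [
        'title' => $card->filter('h3')->text(''),
        'price' => $card->filter('.price')->text(''),
    ]
);

echo count($items) . " products\n";
```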
Migration checklist
If you're moving a Goutte project to modern Symfony:
- `composer require symfony/browser-kit symfony/http-client symfony/dom-crawler symfony/css-selector`
- `composer remove fabpot/goutte`
- Find/replace `Goutte\Client` → `Symfony\Component\BrowserKit\HttpBrowser`.
- Construction: `new HttpBrowser(HttpClient::create())` instead of `new Client()`.
- Run tests. Most projects need only those four changes.
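In practice the entire diff usually looks like this; a sketch of the only lines that change (the rest of the script, requests, form handling, Crawler calls, is untouched):

```php
// Old imports:
//   use Goutte\Client;
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

// Old construction:
//   $client = new Client();
$client = new HttpBrowser(HttpClient::create());

// Everything from here down stays the same.
$crawler = $client->request('GET', 'https://practice.scrapingcentral.com/products');
echo $crawler->filter('h1')->text();
```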
Hands-on lab
Take a small Goutte-style scrape against Catalog108:
- Write it with the legacy `Goutte\Client` API (in a sandbox project, since the library is archived).
- Rewrite the same script using `HttpBrowser` + `HttpClient`.
- Diff them; expect under 10 lines of meaningful change.
The exercise builds two intuitions: legacy code is maintainable; the migration path forward is short.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.