Roach PHP: A Scrapy-Inspired PHP Scraping Framework
Roach brings Scrapy's spider/pipeline architecture to PHP. This section covers when the framework is worth its overhead and where it fits relative to Symfony and Laravel.
What you’ll learn
- Build a Roach spider with rules, extensions, and pipelines.
- Compare Roach's architecture to Scrapy and to a from-scratch Symfony scraper.
- Decide whether Roach beats rolling your own with Symfony components.
Roach (roach-php/core), written by Kai Sassnowski, is the PHP framework that most closely mirrors Scrapy's mental model: spiders, middleware, item processors, and extensions. It works in standalone PHP and inside Laravel (via roach-php/laravel).
If you want Scrapy's ergonomics in PHP without rebuilding them yourself, Roach is the candidate.
Install
composer require roach-php/core
# or for Laravel
composer require roach-php/laravel
A minimal spider
<?php

// src/Spiders/ProductSpider.php

namespace App\Spiders;

use Generator;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class ProductSpider extends BasicSpider
{
    public array $startUrls = [
        'https://practice.scrapingcentral.com/products',
    ];

    public int $concurrency = 4;

    public int $requestDelay = 1;

    public function parse(Response $response): Generator
    {
        // each() hands the callback every matched node wrapped in its own
        // Crawler, so $card supports filter()/text() like the response itself.
        foreach ($response->filter('.product-card')->each(fn ($node) => $node) as $card) {
            yield $this->item([
                'title' => $card->filter('h3')->text(''),
                'price' => $card->filter('.price')->text(''),
                'url'   => $card->filter('a')->attr('href'),
            ]);
        }

        // Only follow pagination when a "next" link exists; calling attr()
        // on an empty node list would throw.
        $next = $response->filter('a.next');

        if ($next->count() > 0) {
            yield $this->request('GET', $next->attr('href'), 'parse');
        }
    }
}
The structure should feel familiar from §4.1–§4.7:
- startUrls: the seed URLs, like Scrapy's start_urls.
- parse(Response): the callback.
- yield $this->item(...): emit an item.
- yield $this->request(...): schedule a follow-up request.
Running
use RoachPHP\Roach;
$items = Roach::collectSpider(ProductSpider::class);
Or fire it off without collecting the results (in Laravel, integrate with queues):
Roach::startSpider(ProductSpider::class);
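Each collected element is an item object. A minimal sketch of consuming the results, assuming the get() accessor shown in the processor examples below:

use RoachPHP\Roach;

$items = Roach::collectSpider(ProductSpider::class);

foreach ($items as $item) {
    // Read back the fields the spider yielded.
    printf("%s (%s)\n", $item->get('title'), $item->get('price'));
}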
Extensions
Extensions plug into the engine lifecycle, analogous to Scrapy extensions/signals.
class StatsExtension implements ExtensionInterface
{
    public function handleStarting(StartEvent $e): void { /* ... */ }

    public function handleItemScraped(ItemScraped $e): void { /* ... */ }

    public function handleFinished(FinishedEvent $e): void { /* ... */ }
}
Register on the spider:
public array $extensions = [
    StatsExtension::class,
    LoggingExtension::class,
];
Item processors (pipelines)
class ValidateProcessor implements ItemProcessorInterface
{
    public function processItem(ItemInterface $item): ItemInterface
    {
        if (!$item->get('title')) {
            return $item->drop('missing title');
        }

        return $item;
    }
}

class StoreProcessor implements ItemProcessorInterface
{
    public function processItem(ItemInterface $item): ItemInterface
    {
        // persist to DB
        return $item;
    }
}
Register:
public array $itemProcessors = [
    ValidateProcessor::class,
    StoreProcessor::class,
];
Same mental model as Scrapy pipelines: chain of processors, drop early, store at the end.
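As a concrete stand-in for the // persist to DB stub, a minimal sketch of a store-at-the-end processor that appends each item to a JSON-lines file (the class name and file path are illustrative; the fields match the spider above):

class JsonLinesProcessor implements ItemProcessorInterface
{
    public function processItem(ItemInterface $item): ItemInterface
    {
        // One JSON object per line, using the fields the spider yielded.
        $row = [
            'title' => $item->get('title'),
            'price' => $item->get('price'),
            'url'   => $item->get('url'),
        ];

        file_put_contents('products.jsonl', json_encode($row) . PHP_EOL, FILE_APPEND);

        return $item;
    }
}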
Downloader middleware
Modify requests in flight:
class UserAgentMiddleware implements RequestMiddlewareInterface
{
    public function handleRequest(Request $request): Request
    {
        return $request->addHeader('User-Agent', 'RoachScraper/1.0');
    }
}
Register on the spider:
public array $downloaderMiddleware = [
    UserAgentMiddleware::class,
];
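Middleware can also stop a request from going out at all. A hedged sketch of a simple request budget, assuming Roach's Request exposes the same drop() helper that items do (treat that call as an assumption; the class name is illustrative):

class MaxRequestsMiddleware implements RequestMiddlewareInterface
{
    private int $sent = 0;

    public function handleRequest(Request $request): Request
    {
        // Cap the run at 100 outgoing requests; dropped requests never reach
        // the downloader. drop() is assumed to mirror the item API above.
        if (++$this->sent > 100) {
            return $request->drop('request budget exhausted');
        }

        return $request;
    }
}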
Concurrency model
Roach uses ReactPHP under the hood for true async I/O. concurrency = 4 means up to four in-flight requests at once; requestDelay = 1 enforces a minimum one-second gap between requests. The combination gives you "polite parallelism" out of the box.
For higher throughput, raise concurrency. For politer scraping, raise the delay. Same knobs Scrapy exposes, same trade-offs.
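You can also turn these knobs per run instead of editing the spider. A hedged sketch using the Overrides class Roach ships, assuming it accepts named constructor arguments for these settings and is accepted as the second argument to the run methods:

use RoachPHP\Roach;
use RoachPHP\Spider\Configuration\Overrides;

// Politer one-off run: fewer parallel requests, longer gap between them.
Roach::startSpider(
    ProductSpider::class,
    new Overrides(concurrency: 2, requestDelay: 5),
);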
Roach vs Scrapy
| Feature | Scrapy | Roach |
|---|---|---|
| Spiders | scrapy.Spider | BasicSpider |
| Pipelines | ITEM_PIPELINES | itemProcessors |
| Middleware | DOWNLOADER_MIDDLEWARES | downloaderMiddleware |
| Extensions / signals | Extensions + signals | Extensions (event-based) |
| Selectors | response.css/.xpath | response->filter() (DomCrawler) |
| Async | Twisted | ReactPHP |
| Ecosystem maturity | 15 years, huge | Younger but solid |
| Browser integration | scrapy-playwright | Not built-in (you'd integrate Panther manually) |
If you know Scrapy, Roach is a 30-minute learning curve. If you don't, Roach's documentation is smaller than Scrapy's, but bootstrapping a first spider is easier.
Roach vs rolling your own with Symfony
The honest comparison:
- Roach gives you scraping idioms. Less code for spider-style crawls.
- Symfony components give you general infrastructure. More flexible, more verbose.
For a project that is primarily a scraper:
- Many concurrent spiders.
- Pipelines for processing.
- Standard crawl shapes (paginated catalogue, sitemap).
Roach wins. It's purpose-built. You ship faster.
For a project that contains a scraper among other features:
- Symfony app with web UI, API, scheduled tasks.
- Scraper is one of many Console commands.
- Persistence in shared Doctrine entities.
Symfony components win. Roach inside that context would create two parallel architectures (Roach's spider lifecycle, Symfony's Messenger/Scheduler). Pick one.
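For contrast, the from-scratch shape with Symfony components; a minimal sketch of the same catalogue scrape using HttpClient and DomCrawler (no spider lifecycle, no pipelines, just procedural code you wire up yourself):

use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\HttpClient\HttpClient;

$client = HttpClient::create();
$html   = $client->request('GET', 'https://practice.scrapingcentral.com/products')->getContent();

$crawler  = new Crawler($html);
$products = $crawler->filter('.product-card')->each(static fn (Crawler $card) => [
    'title' => $card->filter('h3')->text(''),
    'price' => $card->filter('.price')->text(''),
]);

// Pagination, retries, politeness delays, and persistence are all on you.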
Laravel integration
roach-php/laravel ships an Artisan command:
php artisan roach:run ProductSpider
It uses Laravel's queues, container, and config. For Laravel projects that need scraping, Roach is the most natural choice: fewer abstractions to learn than wrapping Scrapy from PHP.
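If you want the crawl off the web request path, a minimal sketch of dispatching it from a standard Laravel queued job (the job class is illustrative; Roach::startSpider is the call shown earlier):

namespace App\Jobs;

use App\Spiders\ProductSpider;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use RoachPHP\Roach;

class RunProductSpider implements ShouldQueue
{
    use Dispatchable;
    use Queueable;

    public function handle(): void
    {
        // Runs the crawl on a queue worker instead of inside a web request.
        Roach::startSpider(ProductSpider::class);
    }
}

// Dispatch from anywhere in the app:
// RunProductSpider::dispatch();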
Limitations
- Browser automation isn't built in. Bring your own Panther/Playwright integration if you need JS rendering.
- The ecosystem is smaller. Third-party Roach middleware libraries are limited. You'll write more middleware yourself than in Scrapy.
- Community size is smaller. Stack Overflow answers are sparse; the maintainer is responsive, but it's not a hive of activity.
For "I need a Scrapy-like framework in PHP and I'm OK with a smaller ecosystem," Roach is excellent. Otherwise, you're often better off composing Symfony components yourself.
Hands-on lab
If you have a PHP environment:
- Install Roach in a fresh project: composer require roach-php/core.
- Build a spider against /products on Catalog108.
- Add a ValidateProcessor that drops items missing price.
- Add a downloader middleware that injects a User-Agent.
- Run and collect items.
If you completed the Scrapy lab earlier, compare side by side. Roach is the closest "feels like Scrapy" PHP framework on offer.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.