Sub-path 5 of 6
Production, Scale & Career
Run everything at scale, reliably.
Scrapy and Symfony for production scrapers. Async, proxies, fingerprinting, CAPTCHAs, distributed crawling, monitoring, deployment, and the legal/career framing that turns this into a livelihood.
~10 weeks part-time · 85 lessons
Lessons
- 4.1 · intermediate
Why Scrapy Beats Hand-Rolled Scripts
When a scraper outgrows a single file, Scrapy gives you the architecture for free. The case for adopting a framework, and when not to.
Lab: /products
- 4.2 · intermediate
Scrapy Architecture: Engine, Scheduler, Spiders, Pipelines, Middlewares
The six pieces inside Scrapy and how a request flows through them. Once you can draw this diagram, every Scrapy mystery becomes debuggable.
- 4.3 · intermediate
Items, ItemLoaders, Selectors
The three Scrapy primitives that make scraped data clean and consistent: typed Items, ItemLoaders for normalization, and Selectors for extraction.
Lab: /products
- 4.4 · intermediate
Item Pipelines: Validation, Deduplication, Storage
Chain processors that transform every scraped item: validate, dedupe, enrich, store. The Scrapy abstraction that scales from a CSV to a Postgres cluster.
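The pipeline contract can be sketched without the framework itself. This is a minimal, framework-free illustration of the pattern Scrapy pipelines follow: `process_item` either returns the item (passing it down the chain) or raises `DropItem` to discard it. The `DropItem` class here is a stand-in for `scrapy.exceptions.DropItem`, and the item dicts are invented examples.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class DedupePipeline:
    """Sketch of a Scrapy-style pipeline: validate, then dedupe by URL."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider=None):
        # Validation: discard items missing a required field.
        if not item.get("price"):
            raise DropItem("missing price")
        # Deduplication: discard URLs we have already processed.
        if item["url"] in self.seen:
            raise DropItem(f"duplicate: {item['url']}")
        self.seen.add(item["url"])
        return item

pipeline = DedupePipeline()
kept = []
for item in [
    {"url": "/p/1", "price": 9.99},
    {"url": "/p/1", "price": 9.99},   # duplicate -> dropped
    {"url": "/p/2", "price": None},   # fails validation -> dropped
]:
    try:
        kept.append(pipeline.process_item(item))
    except DropItem:
        pass
```

In real Scrapy the chain order comes from the `ITEM_PIPELINES` setting; the same shape extends naturally to an enrichment or storage stage.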
Lab: /products
- 4.5 · intermediate
Custom Middlewares for Headers, Proxies, Cookies
Three middleware patterns every production Scrapy project ships: User-Agent rotation, proxy injection, cookie session management.
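The User-Agent rotation pattern looks roughly like this. It is a framework-free sketch of Scrapy's downloader-middleware hook: `process_request` mutates the outgoing request before download, and returning `None` lets processing continue. The request here is a plain dict stand-in (Scrapy uses `scrapy.Request`), and the UA strings are hypothetical pool entries; a production pool would rotate whole coherent header sets, not bare UA strings.

```python
import random

# Hypothetical pool entries, shortened for the sketch.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

class UserAgentRotationMiddleware:
    """Sketch of a Scrapy-style downloader middleware."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    def process_request(self, request, spider=None):
        # Pick a UA per request; returning None means "continue the chain".
        request["headers"]["User-Agent"] = random.choice(self.user_agents)
        return None

request = {"url": "https://example.com", "headers": {}}
UserAgentRotationMiddleware(USER_AGENTS).process_request(request)
```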
Lab: /challenges/antibot/header-fingerprint
- 4.6 · intermediate
CrawlSpider, SitemapSpider, and Other Specialized Spiders
Scrapy ships specialized spider classes for whole-site crawls, sitemap traversal, and CSV/XML feed parsing. Knowing which to pick saves dozens of lines.
Lab: /sitemap.xml
- 4.7 · intermediate
scrapy-playwright, Hybrid Scrapy + Browser
Add a real browser to Scrapy for the pages that need JavaScript, without throwing away the framework. The bridge between Sub-Path 3 and production-scale scraping.
Lab: /challenges/dynamic/spa-pure
- 4.8 · intermediate
Why Symfony for Scraping Infrastructure
PHP isn't the obvious scraping language, but Symfony's component ecosystem is an unusually good fit for production scraping infrastructure. Here's why.
- 4.9 · intermediate
Symfony Console, Building Scraper CLI Commands
Every Symfony scraper starts as a Console command. Arguments, options, progress bars, dependency injection, the right way to ship a CLI scraper.
- 4.10 · intermediate
Symfony HttpClient in Production Context
Beyond `HttpClient::create()`: retries, timeouts, concurrent batching, decorators, and the mocking story for tests.
Lab: /api/products
- 4.11 · intermediate
Symfony Messenger, Async Jobs and Queues
Push scraping work off the main process into queue workers. Messenger is the Symfony component that makes distributed scraping straightforward.
Lab: /products
- 4.12 · intermediate
Symfony Scheduler, Cron-Style Scraping Inside Your App
Replace external cron with Symfony Scheduler. Recurring messages, missed-run handling, and the right way to schedule scrapers inside a Symfony app.
- 4.13 · intermediate
Doctrine ORM for Scraped Data Persistence
How to model scraped entities, write efficient upserts, and avoid Doctrine's classic memory pitfalls at scrape-scale.
Lab: /products
- 4.14 · intermediate
Symfony Panther in Production
Real browser automation from PHP. When Panther is the right tool, how to run it reliably, and where it falls short of Playwright.
Lab: /deals/live
- 4.15 · intermediate
Building a Scraping API with API Platform
Expose scraped data as a queryable REST/GraphQL API in a few hours. API Platform turns Doctrine entities into production endpoints with filters, pagination, and OpenAPI docs.
- 4.16 · intermediate
Symfony Lock and Rate Limiter for Polite Scraping
Two Symfony components that turn 'be polite to the target' from intention into enforcement. Distributed locks for one-scraper-per-domain; rate limiters for request-per-second caps.
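Symfony's RateLimiter component implements several policies, including a token bucket. The lesson covers the PHP component; as a language-neutral illustration of the model, here is a minimal token bucket in Python. This is a concept sketch, not the Symfony API, and the rate/capacity values are arbitrary.

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: tokens refill at a fixed rate,
    each request spends one, and an empty bucket means 'throttle'."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Tiny refill rate so the burst behaviour is observable immediately.
bucket = TokenBucket(rate=0.001, capacity=2)
decisions = [bucket.allow() for _ in range(3)]  # burst of 2, then throttled
```

The distributed-lock half of the lesson (one scraper per domain) is the same idea at a coarser grain: acquire before crawling, release after.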
Lab: /robots.txt
- 4.17 · intermediate
Symfony Serializer for Multi-Format Output
One serializer, many formats. Turn scraped DTOs into JSON, CSV, XML, YAML, or custom formats, without writing per-format code.
- 4.18 · intermediate
Goutte, The Original PHP Scraping Wrapper
Goutte was the go-to PHP scraper for a decade. It still works, it's still in many codebases, and its abstractions live on in modern Symfony. Why it matters and when to use it.
- 4.19 · intermediate
Roach PHP, A Scrapy-Inspired PHP Scraping Framework
Roach brings Scrapy's spider/pipeline architecture to PHP. When the framework is worth its overhead and where it fits relative to Symfony and Laravel.
- 4.20 · intermediate
When to Use Which PHP Framework
Symfony, Roach, Goutte/HttpBrowser, Laravel, raw Guzzle: a decision matrix for which to reach for, with honest trade-offs.
- 4.21 · intermediate
Python: asyncio, httpx, aiohttp for High Throughput
The async toolkit for Python scraping at scale. When to reach for asyncio over Scrapy, and how to write a clean async scraper without footguns.
Lab: /api/products
- 4.22 · intermediate
Python Concurrency Control: Semaphores and Rate Limits
Bound concurrency, enforce request rates, honour 429 backoff. Three primitives that turn an async scraper into a polite, well-behaved one.
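The concurrency-bounding primitive can be shown in a few lines. This sketch uses `asyncio.Semaphore` to cap in-flight requests; `fetch` is a stand-in for a real HTTP call (e.g. `httpx.AsyncClient.get`), and the URL pattern is invented.

```python
import asyncio

async def fetch(url, semaphore):
    """Stand-in for a real HTTP call."""
    async with semaphore:          # at most N coroutines past this point
        await asyncio.sleep(0.01)  # simulate network latency
        return url, 200

async def crawl(urls, max_concurrency=3):
    semaphore = asyncio.Semaphore(max_concurrency)
    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(fetch(u, semaphore) for u in urls))

results = asyncio.run(crawl([f"/api/products?page={i}" for i in range(10)]))
```

Rate limiting and 429 backoff layer on the same structure: sleep between acquisitions, and re-enqueue with a delay when the stand-in returns 429.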
Lab: /api/products
- 4.23 · intermediate
PHP: ReactPHP for Async Scraping
The async PHP runtime. ReactPHP's event loop, promises, and HTTP client, the PHP analog to Node.js or Python's asyncio.
- 4.24 · intermediate
PHP: Amp for Concurrent HTTP
The fiber-based async runtime for PHP. Amp v3 lets you write sync-looking code that runs concurrently, the most ergonomic PHP async model.
- 4.25 · intermediate
PHP: Symfony HttpClient Async Streaming
Often the easiest way to get PHP concurrency: Symfony HttpClient's `stream()` API. No fibers, no promises, just sync-looking code that multiplexes underneath.
- 4.26 · intermediate
Proxy Types: Datacenter, Residential, Mobile, ISP
The four proxy categories every scraper needs to understand. Cost, detectability, and the right use case for each.
- 4.27 · intermediate
Provider Comparison: Bright Data, Oxylabs, Smartproxy, IPRoyal, Others
An honest survey of the major proxy providers in 2026. Pricing tiers, target fit, ergonomic quirks. Not sponsored.
- 4.28 · intermediate
Rotating Proxy Strategies (Per-Request, Per-Session, Sticky)
Three rotation strategies and when to use each. The mismatch between rotation and session state is the #1 source of bans.
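Two of the three strategies fit in a short sketch: per-request rotation (a fresh proxy every call) versus sticky/per-session assignment (the same session id always maps to the same proxy, keeping cookies and IP consistent). The proxy addresses and session ids below are hypothetical.

```python
import hashlib
import itertools

PROXIES = ["proxy-a:8080", "proxy-b:8080", "proxy-c:8080"]  # hypothetical pool

# Per-request rotation: a different proxy on every call.
_rotation = itertools.cycle(PROXIES)

def per_request_proxy():
    return next(_rotation)

# Sticky/per-session: hash the session id so the mapping is stable
# across restarts, which is what keeps login cookies and IP aligned.
def sticky_proxy(session_id):
    digest = hashlib.sha256(session_id.encode()).digest()
    return PROXIES[digest[0] % len(PROXIES)]
```

The ban-causing mismatch the lesson describes is exactly using `per_request_proxy` for a logged-in session: the cookie says one visitor, the IP says many.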
- 4.29 · intermediate
Geographic Targeting
Country-, region-, and city-level proxy targeting. When geography matters, how to specify it correctly, and the gotchas every scraper hits.
- 4.30 · intermediate
Proxy Health Checks, Failover, and Cost Optimization
Production proxy management: detecting dead proxies, failing over, and cutting waste. The operational side of proxy infrastructure.
- 4.31 · intermediate
Building Your Own Lightweight Proxy Pool
When buying isn't appropriate or you need fine-grained control, here's the architecture for a self-managed proxy pool with rotation, health, and failover.
- 4.32 · advanced
Browser Fingerprinting, A Complete Map
Every dimension along which anti-bot vendors fingerprint your scraper. A reference map for what you're up against, and what's hardest to spoof.
Lab: /challenges/antibot/canvas-fingerprint
- 4.33 · advanced
User-Agent and Header Rotation (the Right Way)
Header rotation done badly is worse than not rotating. The principles that produce headers an anti-bot vendor can't distinguish from a real browser.
Lab: /challenges/antibot/header-fingerprint
- 4.34 · advanced
TLS / HTTP/2 Fingerprint Rotation
Below HTTP, TLS and HTTP/2 reveal which library is calling. The tooling that makes Python and PHP look like real browsers at the wire level.
Lab: /challenges/antibot/tls-fingerprint
- 4.35 · advanced
Defeating Cloudflare in 2026, Current Strategies
What actually works against Cloudflare's bot management in 2026. Tools, patterns, and honest assessment of the ongoing cat-and-mouse.
Lab: /challenges/antibot/js-challenge
- 4.36 · advanced
DataDome, PerimeterX, Akamai, Kasada: A Survey
A practical tour of the bot-management vendors you'll encounter besides Cloudflare. What each is known for and how scrapers approach them.
- 4.37 · intermediate
When to Give Up and Use a SERP/Scraping API Instead
Honest economics. When the cost of a commercial scraping API beats the cost of DIY. The signals that say 'stop fighting; pay someone.'
- 4.38 · intermediate
CAPTCHA Types in 2026
A current map of CAPTCHAs scrapers encounter. What each is, what it's checking, and which are practically solvable.
- 4.39 · intermediate
Third-Party Solvers: 2Captcha, CapSolver, Anti-Captcha
The main CAPTCHA-solving services in 2026. Pricing, API patterns, reliability differences.
- 4.40 · intermediate
Integrating CAPTCHA Solving in Python and PHP Scrapers
Wire a CAPTCHA solver into a real scraper. The patterns that handle detection, solving, retry, and token injection in Scrapy and Symfony.
Lab: /challenges/antibot/captcha-mock
- 4.41 · intermediate
Avoiding CAPTCHAs in the First Place (Cheaper, Always)
Every CAPTCHA you don't trigger is one you don't pay for, wait for, or fail at. The hygiene that keeps CAPTCHA rates low.
- 4.42 · advanced
When You've Outgrown a Single Machine
Signals that your scraper needs to become distributed. The architectural patterns and the cost of crossing that line.
- 4.43 · intermediate
Redis as a Task Queue (rq, custom)
Redis is the simplest queue for distributed scraping. The patterns from raw LPUSH/BLPOP up to RQ, and where each fits.
- 4.44 · advanced
Celery for Python Workers
Celery is the heavyweight Python distributed-task system. When its complexity earns its keep, and when rq or raw Redis is enough.
- 4.45 · intermediate
Symfony Messenger Multi-Worker Setup for PHP
Scale Symfony Messenger from one worker to many. Worker management with systemd, supervisor, and Docker.
- 4.46 · advanced
RabbitMQ for Complex Routing
When Redis lists aren't enough: RabbitMQ's exchanges, routing keys, fanout, and dead-letter queues. The heavyweight broker for complex pipelines.
- 4.47 · advanced
Frontera and Scrapy Cluster
Frontiers, the queue-of-URLs abstraction for huge crawls. Frontera and Scrapy Cluster are the two major implementations for Python.
- 4.48 · advanced
Coordinator/Worker Patterns
The classical distributed-scraping architecture. Coordinator schedules; workers execute. Three concrete patterns and the right one for your scale.
- 4.49 · intermediate
PostgreSQL for Structured Scraped Data
Postgres is the default sink for production scrapers. Schema design, upserts, JSONB for semi-structured fields, and the indexes that keep ingestion fast.
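The upsert pattern at the heart of this lesson can be sketched concisely. SQLite stands in below so the example is self-contained; the `INSERT ... ON CONFLICT ... DO UPDATE` shape shown is the same one PostgreSQL uses (Postgres adds `RETURNING`, JSONB columns, and so on). The table and item are invented examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        url   TEXT PRIMARY KEY,
        title TEXT,
        price REAL
    )
""")

def upsert(item):
    # Re-scraping the same URL updates the row instead of duplicating it.
    conn.execute(
        """
        INSERT INTO products (url, title, price)
        VALUES (:url, :title, :price)
        ON CONFLICT(url) DO UPDATE SET
            title = excluded.title,
            price = excluded.price
        """,
        item,
    )

upsert({"url": "/p/1", "title": "Widget", "price": 9.99})
upsert({"url": "/p/1", "title": "Widget", "price": 8.49})  # price changed on re-scrape
price = conn.execute("SELECT price FROM products WHERE url = '/p/1'").fetchone()[0]
```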
- 4.50 · intermediate
MongoDB for Nested/Variable Data
When scraped documents have deep nesting, optional fields, or wildly varying shapes per source, MongoDB is often easier than coercing them into Postgres rows.
- 4.51 · advanced
ClickHouse for Analytics-Scale Storage
When you're storing billions of scraped events and need second-level analytical queries, ClickHouse is the right tool. Schema design, ingestion patterns, and the pitfalls.
- 4.52 · intermediate
S3 + Parquet for Cold Storage
Object storage with columnar Parquet files is the cheapest durable home for scraped data you don't query daily. The patterns that make it efficient.
- 4.53 · advanced
Deduplication at Scale (Bloom Filters, Content Hashing)
Once you're scraping millions of URLs, naive sets blow up your RAM. Bloom filters, content hashes, and the right approximate-or-exact tradeoff.
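The core trade is easy to demonstrate: a Bloom filter gives constant memory and no false negatives, at the price of a tunable false-positive rate. This is a minimal pure-Python sketch (production crawls would use an optimized library or Redis-backed filter); the sizes are illustrative.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash positions per key over a fixed bit array."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k independent positions by salting one strong hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # All k bits set -> "probably seen"; any bit clear -> "definitely not".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

seen = BloomFilter()
seen.add("https://example.com/p/1")
```

Content hashing is the complementary exact technique: hash the page body and compare digests, trading memory per item for zero false positives.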
- 4.54 · intermediate
Change Detection (Diff-Only Storage)
Scraping the same pages repeatedly produces 99% redundant data. Store only what changed, and you'll cut storage and downstream load by orders of magnitude.
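A sketch of the diff-only idea, under the assumption that you hash only the fields that matter (timestamps and view counters would make every snapshot look "changed"). The field names and sink are invented for illustration.

```python
import hashlib

def content_hash(fields):
    """Canonicalize the meaningful fields, then hash them."""
    canonical = "|".join(f"{k}={fields[k]}" for k in sorted(fields))
    return hashlib.sha256(canonical.encode()).hexdigest()

last_seen = {}  # url -> digest of the last stored version

def store_if_changed(url, fields, sink):
    digest = content_hash(fields)
    if last_seen.get(url) == digest:
        return False              # identical to last scrape: store nothing
    last_seen[url] = digest
    sink.append({"url": url, **fields})
    return True

sink = []
store_if_changed("/p/1", {"title": "Widget", "price": 9.99}, sink)
store_if_changed("/p/1", {"title": "Widget", "price": 9.99}, sink)  # no-op
store_if_changed("/p/1", {"title": "Widget", "price": 8.49}, sink)  # price changed
```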
- 4.55 · intermediate
Structured Logging (JSON Logs in Python and PHP)
Print statements don't scale. Structured logging, every event a JSON object, makes scrapers queryable, alertable, and debuggable in production.
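On the Python side the idea is a formatter that emits one JSON object per record. This is a hand-rolled sketch; in practice libraries like structlog or python-json-logger do this (the `fields` attribute convention below is an assumption of this example, not a stdlib feature).

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            # Structured context attached via extra={"fields": {...}}.
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)

logger = logging.getLogger("scraper")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("item_scraped", extra={"fields": {"url": "/p/1", "status": 200}})
```

Each line is now machine-parseable, which is what makes the centralized-logging and alerting lessons that follow practical.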
- 4.56 · intermediate
Centralized Logging (Loki, Elasticsearch)
Once scrapers run on multiple hosts, you need a central place to query logs. Loki and Elasticsearch are the two main options, their tradeoffs, pipelines, and costs.
- 4.57 · intermediate
Metrics with Prometheus + Grafana
Logs tell you what happened. Metrics tell you the shape of normal vs abnormal. Prometheus scraping plus Grafana dashboards is the de facto stack.
- 4.58 · intermediate
Key Scraper Metrics: Success Rate, Items/Sec, Ban Rate, Proxy Health
The specific metrics that matter for a scraper, what they tell you, how to compute them, and the alert thresholds that catch real problems.
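The arithmetic is simple; the value is in tracking it over a window. A minimal sketch with invented sample counters and assumed alert thresholds (ban rate above 5%, success rate below 90%), which are project-specific choices, not universal constants.

```python
# Invented counters for one measurement window.
counters = {"requests": 1000, "ok": 912, "banned": 43, "items": 880, "window_s": 60}

def scraper_metrics(c):
    """Headline scraper metrics derived from raw counters."""
    return {
        "success_rate": c["ok"] / c["requests"],
        "ban_rate": c["banned"] / c["requests"],
        "items_per_sec": c["items"] / c["window_s"],
    }

m = scraper_metrics(counters)
# Example thresholds; tune per project.
alert = m["ban_rate"] > 0.05 or m["success_rate"] < 0.90
```

In a Prometheus setup these would be counters exported by the scraper and the rates computed with `rate()` in the alert rules.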
Lab: /admin/stats
- 4.59 · intermediate
Alerting (Slack, Email, PagerDuty)
Alerts wake people up. Wrong alerts wake them up for nothing. The principles and concrete configs for scraper alerts that engineers thank you for.
- 4.60 · intermediate
Incident Runbooks for Scrapers
A runbook is the checklist you wish you had at 3am. The structure, the most common scraper incidents, and the runbook entries that cut mean-time-to-recover.
- 4.61 · intermediate
Dockerizing Python Scrapers
A reproducible Docker image is the unit of deployment for a modern scraper. Multi-stage builds, slim base images, and the runtime surface area you actually need.
- 4.62 · intermediate
Dockerizing PHP / Symfony Scrapers
PHP scrapers ship in Docker too. Symfony-specific patterns, FrankenPHP, opcache, and the differences from Python images.
- 4.63 · intermediate
docker-compose for Local Full-Stack Dev
Production scrapers depend on Postgres, Redis, Mongo, Loki. docker-compose runs the whole stack locally so your dev environment matches prod.
- 4.64 · advanced
Kubernetes for Scraper Workloads (Overview)
What Kubernetes actually gives a scraping team, and when it's worth the operational cost. The minimum vocabulary and a runnable scraper deployment.
- 4.65 · intermediate
CI/CD for Scrapers (GitHub Actions for Python and PHP)
Automated test, build, and deploy pipelines for scraping projects. The pipeline that catches selector-breakage before it hits production.
- 4.66 · intermediate
Scheduling: cron, Airflow, Prefect, Symfony Scheduler
From a cron line on a VPS to a workflow orchestrator with DAGs and retries, the scheduling tools you'll actually pick from.
- 4.67 · intermediate
Cloud Platforms: VPS, AWS, Apify, Zyte Cloud, Hostinger VPS
Where to actually run your scrapers, from a $5 VPS to managed scraping platforms. Cost, control, and the tradeoffs honestly compared.
- 4.68 · advanced
Serverless Scrapers on AWS Lambda
When scrape workloads are bursty and you don't want idle infra, serverless can be cheap and clean. Where Lambda shines for scraping, and the hard limits to know.
- 4.69 · beginner
Why Contributing to Scraping Libraries Matters for Your Career
Public contributions to the libraries you depend on are the highest-leverage career investment a scraping engineer can make. Why, and what to aim for.
- 4.70 · beginner
Finding Good First Issues in Open-Source Scraping Projects
A walkthrough for finding contribution opportunities in scraping libraries that actually fit a beginner's skill envelope.
- 4.71 · beginner
Writing PRs Maintainers Actually Merge
The difference between PRs that languish and PRs that merge is rarely the code, it's the framing, scope, tests, and tone. The concrete checklist.
- 4.72 · intermediate
Maintaining Your Own Composer / PyPI Package
Publishing and maintaining a small scraping utility on PyPI or Packagist is one of the highest-leverage career moves a scraping engineer can make. The how, and the responsibilities.
- 4.73 · beginner
Documentation as a Force Multiplier
The best engineers in scraping write docs. Why documentation pays back disproportionately, for your code, your team, and your career.
- 4.74 · intermediate
Freelance Scraping, Pricing, Clients, Contracts
How freelance scraping work actually flows in 2026, pricing models, where clients come from, and the contract clauses that save your weekends.
- 4.75 · intermediate
Productized Services and Data-as-a-Service
From building custom scrapers by the hour to selling scraped data as a productized offering. The transition that breaks the time-for-money trade.
- 4.76 · advanced
Building a Scraping SaaS (Real Examples and Margins)
The shape, economics, and risks of running a scraping SaaS, drawn from the public histories of companies in the space.
- 4.77 · beginner
Writing Technical Content for Developer Audiences
Why every scraping engineer should write publicly, and the formats, cadence, and distribution that actually work.
- 4.78 · beginner
YouTube and Live-Coding for Scraping Devs
Video and live-coding reach an audience text doesn't, and scraping is naturally suited to demonstration. The real cost-benefit, and how to start without overproducing.
- 4.79 · beginner
Speaking at Meetups, Conferences, Submitting CFPs
Speaking at events is high-leverage for career and network. The CFP process, talk structure, and how to start at meetups before tackling major conferences.
- 4.80 · beginner
Communities Where Scraping Devs Hang Out (Reddit, dev.to, HN, X, Discord)
Where the scraping community actually talks, learns, and trades work. A practical map of the online watering holes.
- 4.81 · intermediate
Building Your Personal Brand as a Scraping Expert
Personal brand isn't marketing, it's the accumulated signal that you specialize in scraping. The components, the compounding, and the honest tradeoffs.
- 4.82 · intermediate
The Legal Landscape: hiQ v. LinkedIn, CFAA, GDPR
The landmark cases and statutes that shape what scraping is and isn't legally OK in 2026. Not legal advice; a working compass.
- 4.83 · intermediate
Terms of Service, Enforceable or Not?
Most sites prohibit scraping in their Terms of Service. When that prohibition is legally enforceable, when it isn't, and how courts have ruled.
- 4.84 · intermediate
Copyright vs Facts, What You Can and Can't Redistribute
Scraping data is one thing. Publishing or commercializing it is another. The line between facts and expression, and how courts have drawn it.
- 4.85 · intermediate
Ethical Framework for Scraping Decisions
Beyond what's legal, what's right? A practical ethics framework for deciding which scraping projects to take, which to decline, and how to operate.
Every lesson has a hands-on lab target on Catalog108, our first-party practice scraping sandbox. Each lab page has a /grade endpoint that returns pass/fail on your scraper output.