Sub-path 5 of 6
Production, Scale & Career
Run everything at scale, reliably.
Scrapy and Symfony for production scrapers. Async, proxies, fingerprinting, CAPTCHAs, distributed crawling, monitoring, deployment, and the legal/career framing that turns this into a livelihood.
~10 weeks part-time · 85 lessons
Lessons
- 4.1 · intermediate
Why Scrapy Beats Hand-Rolled Scripts
When a scraper outgrows a single file, Scrapy gives you the architecture for free. The case for adopting a framework, and when not to.
Lab: /products
- 4.2 · intermediate
Scrapy Architecture: Engine, Scheduler, Spiders, Pipelines, Middlewares
The six pieces inside Scrapy and how a request flows through them. Once you can draw this diagram, every Scrapy mystery becomes debuggable.
- 4.3 · intermediate
Items, ItemLoaders, Selectors
The three Scrapy primitives that make scraped data clean and consistent: typed Items, ItemLoaders for normalization, and Selectors for extraction.
Lab: /products
- 4.4 · intermediate
Item Pipelines: Validation, Deduplication, Storage
Chain processors that transform every scraped item: validate, dedupe, enrich, store. The Scrapy abstraction that scales from a CSV to a Postgres cluster.
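The pipeline contract can be sketched without the framework itself. This is a minimal, framework-free illustration of the pattern Scrapy pipelines follow: `process_item` either returns the item (passing it down the chain) or raises `DropItem` to discard it. The `DropItem` class here is a stand-in for `scrapy.exceptions.DropItem`, and the item dicts are invented examples.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class DedupePipeline:
    """Sketch of a Scrapy-style pipeline: validate, then dedupe by URL."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider=None):
        # Validation: discard items missing a required field.
        if not item.get("price"):
            raise DropItem("missing price")
        # Deduplication: discard URLs we have already processed.
        if item["url"] in self.seen:
            raise DropItem(f"duplicate: {item['url']}")
        self.seen.add(item["url"])
        return item

pipeline = DedupePipeline()
kept = []
for item in [
    {"url": "/p/1", "price": 9.99},
    {"url": "/p/1", "price": 9.99},   # duplicate -> dropped
    {"url": "/p/2", "price": None},   # fails validation -> dropped
]:
    try:
        kept.append(pipeline.process_item(item))
    except DropItem:
        pass
```

In real Scrapy the chain order comes from the `ITEM_PIPELINES` setting; the same shape extends naturally to an enrichment or storage stage.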
Lab: /products
- 4.5 · intermediate
Custom Middlewares for Headers, Proxies, Cookies
Three middleware patterns every production Scrapy project ships: User-Agent rotation, proxy injection, cookie session management.
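The User-Agent rotation pattern looks roughly like this. It is a framework-free sketch of Scrapy's downloader-middleware hook: `process_request` mutates the outgoing request before download, and returning `None` lets processing continue. The request here is a plain dict stand-in (Scrapy uses `scrapy.Request`), and the UA strings are hypothetical pool entries; a production pool would rotate whole coherent header sets, not bare UA strings.

```python
import random

# Hypothetical pool entries, shortened for the sketch.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

class UserAgentRotationMiddleware:
    """Sketch of a Scrapy-style downloader middleware."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    def process_request(self, request, spider=None):
        # Pick a UA per request; returning None means "continue the chain".
        request["headers"]["User-Agent"] = random.choice(self.user_agents)
        return None

request = {"url": "https://example.com", "headers": {}}
UserAgentRotationMiddleware(USER_AGENTS).process_request(request)
```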
Lab: /challenges/antibot/header-fingerprint
- 4.6 · intermediate
CrawlSpider, SitemapSpider, and Other Specialized Spiders
Scrapy ships specialized spider classes for whole-site crawls, sitemap traversal, and CSV/XML feed parsing. Knowing which to pick saves dozens of lines.
Lab: /sitemap.xml
- 4.7 · intermediate
scrapy-playwright, Hybrid Scrapy + Browser
Add a real browser to Scrapy for the pages that need JavaScript, without throwing away the framework. The bridge between Sub-Path 3 and production-scale scraping.
Lab: /challenges/dynamic/spa-pure
- 4.8 · intermediate
Why Symfony for Scraping Infrastructure
PHP isn't the obvious scraping language, but Symfony's component ecosystem is an unusually good fit for production scraping infrastructure. Here's why.
- 4.9 · intermediate
Symfony Console, Building Scraper CLI Commands
Every Symfony scraper starts as a Console command. Arguments, options, progress bars, dependency injection, the right way to ship a CLI scraper.
- 4.10 · intermediate
Symfony HttpClient in Production Context
Beyond `HttpClient::create()`: retries, timeouts, concurrent batching, decorators, and the mocking story for tests.
Lab: /api/products
- 4.11 · intermediate
Symfony Messenger, Async Jobs and Queues
Push scraping work off the main process into queue workers. Messenger is the Symfony component that makes distributed scraping straightforward.
Lab: /products
- 4.12 · intermediate
Symfony Scheduler, Cron-Style Scraping Inside Your App
Replace external cron with Symfony Scheduler. Recurring messages, missed-run handling, and the right way to schedule scrapers inside a Symfony app.
- 4.13 · intermediate
Doctrine ORM for Scraped Data Persistence
How to model scraped entities, write efficient upserts, and avoid Doctrine's classic memory pitfalls at scrape-scale.
Lab: /products
- 4.14 · intermediate
Symfony Panther in Production
Real browser automation from PHP. When Panther is the right tool, how to run it reliably, and where it falls short of Playwright.
Lab: /deals/live
- 4.15 · intermediate
Building a Scraping API with API Platform
Expose scraped data as a queryable REST/GraphQL API in a few hours. API Platform turns Doctrine entities into production endpoints with filters, pagination, and OpenAPI docs.
- 4.16 · intermediate
Symfony Lock and Rate Limiter for Polite Scraping
Two Symfony components that turn 'be polite to the target' from intention into enforcement. Distributed locks for one-scraper-per-domain; rate limiters for request-per-second caps.
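Symfony's RateLimiter component implements several policies, including a token bucket. The lesson covers the PHP component; as a language-neutral illustration of the model, here is a minimal token bucket in Python. This is a concept sketch, not the Symfony API, and the rate/capacity values are arbitrary.

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: tokens refill at a fixed rate,
    each request spends one, and an empty bucket means 'throttle'."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Tiny refill rate so the burst behaviour is observable immediately.
bucket = TokenBucket(rate=0.001, capacity=2)
decisions = [bucket.allow() for _ in range(3)]  # burst of 2, then throttled
```

The distributed-lock half of the lesson (one scraper per domain) is the same idea at a coarser grain: acquire before crawling, release after.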
Lab: /robots.txt
- 4.17 · intermediate
Symfony Serializer for Multi-Format Output
One serializer, many formats. Turn scraped DTOs into JSON, CSV, XML, YAML, or custom formats, without writing per-format code.
- 4.18 · intermediate
Goutte, The Original PHP Scraping Wrapper
Goutte was the go-to PHP scraper for a decade. It still works, it's still in many codebases, and its abstractions live on in modern Symfony. Why it matters and when to use it.
- 4.19 · intermediate
Roach PHP, A Scrapy-Inspired PHP Scraping Framework
Roach brings Scrapy's spider/pipeline architecture to PHP. When the framework is worth its overhead and where it fits relative to Symfony and Laravel.
- 4.20 · intermediate
When to Use Which PHP Framework
Symfony, Roach, Goutte/HttpBrowser, Laravel, raw Guzzle: a decision matrix for which to reach for, with honest trade-offs.
- 4.21 · intermediate
Python: asyncio, httpx, aiohttp for High Throughput
The async toolkit for Python scraping at scale. When to reach for asyncio over Scrapy, and how to write a clean async scraper without footguns.
Lab: /api/products
- 4.22 · intermediate
Python Concurrency Control: Semaphores and Rate Limits
Bound concurrency, enforce request rates, honour 429 backoff. Three primitives that turn an async scraper into a polite, well-behaved one.
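The concurrency-bounding primitive can be shown in a few lines. This sketch uses `asyncio.Semaphore` to cap in-flight requests; `fetch` is a stand-in for a real HTTP call (e.g. `httpx.AsyncClient.get`), and the URL pattern is invented.

```python
import asyncio

async def fetch(url, semaphore):
    """Stand-in for a real HTTP call."""
    async with semaphore:          # at most N coroutines past this point
        await asyncio.sleep(0.01)  # simulate network latency
        return url, 200

async def crawl(urls, max_concurrency=3):
    semaphore = asyncio.Semaphore(max_concurrency)
    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(fetch(u, semaphore) for u in urls))

results = asyncio.run(crawl([f"/api/products?page={i}" for i in range(10)]))
```

Rate limiting and 429 backoff layer on the same structure: sleep between acquisitions, and re-enqueue with a delay when the stand-in returns 429.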
Lab: /api/products
- 4.23 · intermediate
PHP: ReactPHP for Async Scraping
The async PHP runtime. ReactPHP's event loop, promises, and HTTP client, the PHP analog to Node.js or Python's asyncio.
- 4.24 · intermediate
PHP: Amp for Concurrent HTTP
The fiber-based async runtime for PHP. Amp v3 lets you write sync-looking code that runs concurrently, the most ergonomic PHP async model.
- 4.25 · intermediate
PHP: Symfony HttpClient Async Streaming
Often the easiest way to get PHP concurrency: Symfony HttpClient's `stream()` API. No fibers, no promises, just sync-looking code that multiplexes underneath.
- 4.26 · intermediate
Proxy Types: Datacenter, Residential, Mobile, ISP
The four proxy categories every scraper needs to understand. Cost, detectability, and the right use case for each.
- 4.27 · intermediate
Provider Comparison: Bright Data, Oxylabs, Smartproxy, IPRoyal, Others
An honest survey of the major proxy providers in 2026. Pricing tiers, target fit, ergonomic quirks. Not sponsored.
- 4.28 · intermediate
Rotating Proxy Strategies (Per-Request, Per-Session, Sticky)
Three rotation strategies and when to use each. The mismatch between rotation and session state is the #1 source of bans.
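Two of the three strategies fit in a short sketch: per-request rotation (a fresh proxy every call) versus sticky/per-session assignment (the same session id always maps to the same proxy, keeping cookies and IP consistent). The proxy addresses and session ids below are hypothetical.

```python
import hashlib
import itertools

PROXIES = ["proxy-a:8080", "proxy-b:8080", "proxy-c:8080"]  # hypothetical pool

# Per-request rotation: a different proxy on every call.
_rotation = itertools.cycle(PROXIES)

def per_request_proxy():
    return next(_rotation)

# Sticky/per-session: hash the session id so the mapping is stable
# across restarts, which is what keeps login cookies and IP aligned.
def sticky_proxy(session_id):
    digest = hashlib.sha256(session_id.encode()).digest()
    return PROXIES[digest[0] % len(PROXIES)]
```

The ban-causing mismatch the lesson describes is exactly using `per_request_proxy` for a logged-in session: the cookie says one visitor, the IP says many.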
- 4.29 · intermediate
Geographic Targeting
Country-, region-, and city-level proxy targeting. When geography matters, how to specify it correctly, and the gotchas every scraper hits.
- 4.30 · intermediate
Proxy Health Checks, Failover, and Cost Optimization
Production proxy management: detecting dead proxies, failing over, and cutting waste. The operational side of proxy infrastructure.
- 4.31 · intermediate
Building Your Own Lightweight Proxy Pool
When buying isn't appropriate or you need fine-grained control, here's the architecture for a self-managed proxy pool with rotation, health, and failover.
- 4.32 · advanced
Browser Fingerprinting, A Complete Map
Every dimension along which anti-bot vendors fingerprint your scraper. A reference map for what you're up against, and what's hardest to spoof.
Lab: /challenges/antibot/canvas-fingerprint
- 4.33 · advanced
User-Agent and Header Rotation (the Right Way)
Header rotation done badly is worse than not rotating. The principles that produce headers an anti-bot vendor can't distinguish from a real browser.
Lab: /challenges/antibot/header-fingerprint
- 4.34 · advanced
TLS / HTTP/2 Fingerprint Rotation
Below HTTP, TLS and HTTP/2 reveal which library is calling. The tooling that makes Python and PHP look like real browsers at the wire level.
Lab: /challenges/antibot/tls-fingerprint
- 4.35 · advanced
Defeating Cloudflare in 2026, Current Strategies
What actually works against Cloudflare's bot management in 2026. Tools, patterns, and honest assessment of the ongoing cat-and-mouse.
Lab: /challenges/antibot/js-challenge
- 4.36 · advanced
DataDome, PerimeterX, Akamai, Kasada: A Survey
A practical tour of the bot-management vendors you'll encounter besides Cloudflare. What each is known for and how scrapers approach them.
- 4.37 · intermediate
When to Give Up and Use a SERP/Scraping API Instead
Honest economics. When the cost of a commercial scraping API beats the cost of DIY. The signals that say 'stop fighting; pay someone.'
- 4.38 · intermediate
CAPTCHA Types in 2026
A current map of CAPTCHAs scrapers encounter. What each is, what it's checking, and which are practically solvable.
- 4.39 · intermediate
Third-Party Solvers: 2Captcha, CapSolver, Anti-Captcha
The main CAPTCHA-solving services in 2026. Pricing, API patterns, reliability differences.
- 4.40 · intermediate
Integrating CAPTCHA Solving in Python and PHP Scrapers
Wire a CAPTCHA solver into a real scraper. The patterns that handle detection, solving, retry, and token injection in Scrapy and Symfony.
Lab: /challenges/antibot/captcha-mock
- 4.41 · intermediate
Avoiding CAPTCHAs in the First Place (Cheaper, Always)
Every CAPTCHA you don't trigger is one you don't pay for, wait for, or fail at. The hygiene that keeps CAPTCHA rates low.
- 4.42 · advanced
When You've Outgrown a Single Machine
Signals that your scraper needs to become distributed. The architectural patterns and the cost of crossing that line.
- 4.43 · intermediate
Redis as a Task Queue (rq, custom)
Redis is the simplest queue for distributed scraping. The patterns from raw LPUSH/BLPOP up to RQ, and where each fits.
- 4.44 · advanced
Celery for Python Workers
Celery is the heavyweight Python distributed-task system. When its complexity earns its keep, and when rq or raw Redis is enough.
- 4.45 · intermediate
Symfony Messenger Multi-Worker Setup for PHP
Scale Symfony Messenger from one worker to many. Worker management with systemd, supervisor, and Docker.
- 4.46 · advanced
RabbitMQ for Complex Routing
When Redis lists aren't enough: RabbitMQ's exchanges, routing keys, fanout, and dead-letter queues. The heavyweight broker for complex pipelines.
- 4.47 · advanced
Frontera and Scrapy Cluster
Frontiers, the queue-of-URLs abstraction for huge crawls. Frontera and Scrapy Cluster are the two major implementations for Python.
- 4.48 · advanced
Coordinator/Worker Patterns
The classical distributed-scraping architecture. Coordinator schedules; workers execute. Three concrete patterns and the right one for your scale.
- 4.49 · intermediate
PostgreSQL for Structured Scraped Data
Postgres is the default sink for production scrapers. Schema design, upserts, JSONB for semi-structured fields, and the indexes that keep ingestion fast.
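The upsert pattern at the heart of this lesson can be sketched concisely. SQLite stands in below so the example is self-contained; the `INSERT ... ON CONFLICT ... DO UPDATE` shape shown is the same one PostgreSQL uses (Postgres adds `RETURNING`, JSONB columns, and so on). The table and item are invented examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        url   TEXT PRIMARY KEY,
        title TEXT,
        price REAL
    )
""")

def upsert(item):
    # Re-scraping the same URL updates the row instead of duplicating it.
    conn.execute(
        """
        INSERT INTO products (url, title, price)
        VALUES (:url, :title, :price)
        ON CONFLICT(url) DO UPDATE SET
            title = excluded.title,
            price = excluded.price
        """,
        item,
    )

upsert({"url": "/p/1", "title": "Widget", "price": 9.99})
upsert({"url": "/p/1", "title": "Widget", "price": 8.49})  # price changed on re-scrape
price = conn.execute("SELECT price FROM products WHERE url = '/p/1'").fetchone()[0]
```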
- 4.50 · intermediate
MongoDB for Nested/Variable Data
When scraped documents have deep nesting, optional fields, or wildly varying shapes per source, MongoDB is often easier than coercing them into Postgres rows.
- 4.51 · advanced
ClickHouse for Analytics-Scale Storage
When you're storing billions of scraped events and need second-level analytical queries, ClickHouse is the right tool. Schema design, ingestion patterns, and the pitfalls.
- 4.52 · intermediate
S3 + Parquet for Cold Storage
Object storage with columnar Parquet files is the cheapest durable home for scraped data you don't query daily. The patterns that make it efficient.
- 4.53 · advanced
Deduplication at Scale (Bloom Filters, Content Hashing)
Once you're scraping millions of URLs, naive sets blow up your RAM. Bloom filters, content hashes, and the right approximate-or-exact tradeoff.
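The core trade is easy to demonstrate: a Bloom filter gives constant memory and no false negatives, at the price of a tunable false-positive rate. This is a minimal pure-Python sketch (production crawls would use an optimized library or Redis-backed filter); the sizes are illustrative.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash positions per key over a fixed bit array."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k independent positions by salting one strong hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # All k bits set -> "probably seen"; any bit clear -> "definitely not".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

seen = BloomFilter()
seen.add("https://example.com/p/1")
```

Content hashing is the complementary exact technique: hash the page body and compare digests, trading memory per item for zero false positives.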
- 4.54 · intermediate
Change Detection (Diff-Only Storage)
Scraping the same pages repeatedly produces 99% redundant data. Store only what changed, and you'll cut storage and downstream load by orders of magnitude.
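A sketch of the diff-only idea, under the assumption that you hash only the fields that matter (timestamps and view counters would make every snapshot look "changed"). The field names and sink are invented for illustration.

```python
import hashlib

def content_hash(fields):
    """Canonicalize the meaningful fields, then hash them."""
    canonical = "|".join(f"{k}={fields[k]}" for k in sorted(fields))
    return hashlib.sha256(canonical.encode()).hexdigest()

last_seen = {}  # url -> digest of the last stored version

def store_if_changed(url, fields, sink):
    digest = content_hash(fields)
    if last_seen.get(url) == digest:
        return False              # identical to last scrape: store nothing
    last_seen[url] = digest
    sink.append({"url": url, **fields})
    return True

sink = []
store_if_changed("/p/1", {"title": "Widget", "price": 9.99}, sink)
store_if_changed("/p/1", {"title": "Widget", "price": 9.99}, sink)  # no-op
store_if_changed("/p/1", {"title": "Widget", "price": 8.49}, sink)  # price changed
```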
- 4.55 · intermediate
Structured Logging (JSON Logs in Python and PHP)
Print statements don't scale. Structured logging, every event a JSON object, makes scrapers queryable, alertable, and debuggable in production.
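On the Python side the idea is a formatter that emits one JSON object per record. This is a hand-rolled sketch; in practice libraries like structlog or python-json-logger do this (the `fields` attribute convention below is an assumption of this example, not a stdlib feature).

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            # Structured context attached via extra={"fields": {...}}.
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)

logger = logging.getLogger("scraper")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("item_scraped", extra={"fields": {"url": "/p/1", "status": 200}})
```

Each line is now machine-parseable, which is what makes the centralized-logging and alerting lessons that follow practical.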
- 4.56 · intermediate
Centralized Logging (Loki, Elasticsearch)
Once scrapers run on multiple hosts, you need a central place to query logs. Loki and Elasticsearch are the two main options, their tradeoffs, pipelines, and costs.
- 4.57 · intermediate
Metrics with Prometheus + Grafana
Logs tell you what happened. Metrics tell you the shape of normal vs abnormal. Prometheus scraping plus Grafana dashboards is the de facto stack.
- 4.58 · intermediate
Key Scraper Metrics: Success Rate, Items/Sec, Ban Rate, Proxy Health
The specific metrics that matter for a scraper, what they tell you, how to compute them, and the alert thresholds that catch real problems.
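The arithmetic is simple; the value is in tracking it over a window. A minimal sketch with invented sample counters and assumed alert thresholds (ban rate above 5%, success rate below 90%), which are project-specific choices, not universal constants.

```python
# Invented counters for one measurement window.
counters = {"requests": 1000, "ok": 912, "banned": 43, "items": 880, "window_s": 60}

def scraper_metrics(c):
    """Headline scraper metrics derived from raw counters."""
    return {
        "success_rate": c["ok"] / c["requests"],
        "ban_rate": c["banned"] / c["requests"],
        "items_per_sec": c["items"] / c["window_s"],
    }

m = scraper_metrics(counters)
# Example thresholds; tune per project.
alert = m["ban_rate"] > 0.05 or m["success_rate"] < 0.90
```

In a Prometheus setup these would be counters exported by the scraper and the rates computed with `rate()` in the alert rules.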
Lab: /admin/stats
- 4.59 · intermediate
Alerting (Slack, Email, PagerDuty)
Alerts wake people up. Wrong alerts wake them up for nothing. The principles and concrete configs for scraper alerts that engineers thank you for.
- 4.60 · intermediate
Incident Runbooks for Scrapers
A runbook is the checklist you wish you had at 3am. The structure, the most common scraper incidents, and the runbook entries that cut mean-time-to-recover.
- 4.61 · intermediate
Dockerizing Python Scrapers
A reproducible Docker image is the unit of deployment for a modern scraper. Multi-stage builds, slim base images, and the runtime surface area you actually need.
- 4.62 · intermediate
Dockerizing PHP / Symfony Scrapers
PHP scrapers ship in Docker too. Symfony-specific patterns, FrankenPHP, opcache, and the differences from Python images.
- 4.63 · intermediate
docker-compose for Local Full-Stack Dev
Production scrapers depend on Postgres, Redis, Mongo, Loki. docker-compose runs the whole stack locally so your dev environment matches prod.
- 4.64 · advanced
Kubernetes for Scraper Workloads (Overview)
What Kubernetes actually gives a scraping team, and when it's worth the operational cost. The minimum vocabulary and a runnable scraper deployment.
- 4.65 · intermediate
CI/CD for Scrapers (GitHub Actions for Python and PHP)
Automated test, build, and deploy pipelines for scraping projects. The pipeline that catches selector-breakage before it hits production.
- 4.66 · intermediate
Scheduling: cron, Airflow, Prefect, Symfony Scheduler
From a cron line on a VPS to a workflow orchestrator with DAGs and retries, the scheduling tools you'll actually pick from.
- 4.67 · intermediate
Cloud Platforms: VPS, AWS, Apify, Zyte Cloud, Hostinger VPS
Where to actually run your scrapers, from a $5 VPS to managed scraping platforms. Cost, control, and the tradeoffs honestly compared.
- 4.68 · advanced
Serverless Scrapers on AWS Lambda
When scrape workloads are bursty and you don't want idle infra, serverless can be cheap and clean. Where Lambda shines for scraping, and the hard limits to know.
- 4.69 · beginner
Why Contributing to Scraping Libraries Matters for Your Career
Public contributions to the libraries you depend on are the highest-leverage career investment a scraping engineer can make. Why, and what to aim for.
- 4.70 · beginner
Finding Good First Issues in Open-Source Scraping Projects
A walkthrough for finding contribution opportunities in scraping libraries that actually fit a beginner's skill envelope.
- 4.71 · beginner
Writing PRs Maintainers Actually Merge
The difference between PRs that languish and PRs that merge is rarely the code, it's the framing, scope, tests, and tone. The concrete checklist.
- 4.72 · intermediate
Maintaining Your Own Composer / PyPI Package
Publishing and maintaining a small scraping utility on PyPI or Packagist is one of the highest-leverage career moves a scraping engineer can make. The how, and the responsibilities.
- 4.73 · beginner
Documentation as a Force Multiplier
The best engineers in scraping write docs. Why documentation pays back disproportionately, for your code, your team, and your career.
- 4.74 · intermediate
Freelance Scraping, Pricing, Clients, Contracts
How freelance scraping work actually flows in 2026, pricing models, where clients come from, and the contract clauses that save your weekends.
- 4.75 · intermediate
Productized Services and Data-as-a-Service
From building custom scrapers by the hour to selling scraped data as a productized offering. The transition that breaks the time-for-money trade.
- 4.76 · advanced
Building a Scraping SaaS (Real Examples and Margins)
The shape, economics, and risks of running a scraping SaaS, drawn from the public histories of companies in the space.
- 4.77 · beginner
Writing Technical Content for Developer Audiences
Why every scraping engineer should write publicly, and the formats, cadence, and distribution that actually work.
- 4.78 · beginner
YouTube and Live-Coding for Scraping Devs
Video and live-coding reach an audience text doesn't, and scraping is naturally suited to demonstration. The real cost-benefit, and how to start without overproducing.
- 4.79 · beginner
Speaking at Meetups, Conferences, Submitting CFPs
Speaking at events is high-leverage for career and network. The CFP process, talk structure, and how to start at meetups before tackling major conferences.
- 4.80 · beginner
Communities Where Scraping Devs Hang Out (Reddit, dev.to, HN, X, Discord)
Where the scraping community actually talks, learns, and trades work. A practical map of the online watering holes.
- 4.81 · intermediate
Building Your Personal Brand as a Scraping Expert
Personal brand isn't marketing, it's the accumulated signal that you specialize in scraping. The components, the compounding, and the honest tradeoffs.
- 4.82 · intermediate
The Legal Landscape: hiQ v. LinkedIn, CFAA, GDPR
The landmark cases and statutes that shape what scraping is and isn't legally OK in 2026. Not legal advice; a working compass.
- 4.83 · intermediate
Terms of Service, Enforceable or Not?
Most sites prohibit scraping in their Terms of Service. When that prohibition is legally enforceable, when it isn't, and how courts have ruled.
- 4.84 · intermediate
Copyright vs Facts, What You Can and Can't Redistribute
Scraping data is one thing. Publishing or commercializing it is another. The line between facts and expression, and how courts have drawn it.
- 4.85 · intermediate
Ethical Framework for Scraping Decisions
Beyond what's legal, what's right? A practical ethics framework for deciding which scraping projects to take, which to decline, and how to operate.
Every lesson has a hands-on lab target on Catalog108, our first-party practice scraping sandbox. Each lab page has a /grade endpoint that returns pass/fail on your scraper output.