Scraping Central is reader-supported. When you buy through links on our site, we may earn an affiliate commission.

Sub-Path 5 of 6

Production, Scale & Career

Run everything at scale, reliably.

Scrapy and Symfony for production scrapers. Async, proxies, fingerprinting, CAPTCHAs, distributed crawling, monitoring, deployment, and the legal/career framing that turns this into a livelihood.

~10 weeks part-time · 85 lessons

Lessons

  1. 4.1

    Why Scrapy Beats Hand-Rolled Scripts

    When a scraper outgrows a single file, Scrapy gives you the architecture for free. The case for adopting a framework, and when not to.

    Lab: /products

    intermediate
  2. 4.2

    Scrapy Architecture: Engine, Scheduler, Downloader, Spiders, Pipelines, Middlewares

    The six pieces inside Scrapy and how a request flows through them. Once you can draw this diagram, every Scrapy mystery becomes debuggable.

    intermediate
  3. 4.3

    Items, ItemLoaders, Selectors

    The three Scrapy primitives that make scraped data clean and consistent: typed Items, ItemLoaders for normalization, and Selectors for extraction.

    Lab: /products

    intermediate
  4. 4.4

    Item Pipelines: Validation, Deduplication, Storage

    Chain processors that transform every scraped item: validate, dedupe, enrich, store. The Scrapy abstraction that scales from a CSV to a Postgres cluster.

    Lab: /products

    intermediate
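The dedupe stage described above is, at heart, a seen-set keyed by a content hash. A minimal framework-free sketch; in Scrapy the same logic would live in a pipeline's `process_item` and raise `DropItem`, and the `url` key field here is illustrative:

```python
import hashlib

class DedupePipeline:
    """Drops items whose key field has been seen before (in-memory sketch)."""

    def __init__(self, key="url"):
        self.key = key
        self.seen = set()

    def process_item(self, item):
        # Hash the key field so the set stores fixed-size digests, not raw strings.
        digest = hashlib.sha256(str(item[self.key]).encode()).hexdigest()
        if digest in self.seen:
            return None  # a Scrapy pipeline would raise DropItem here
        self.seen.add(digest)
        return item

pipeline = DedupePipeline()
items = [
    {"url": "/p/1", "price": 9},
    {"url": "/p/1", "price": 9},  # duplicate: dropped
    {"url": "/p/2", "price": 5},
]
kept = [i for i in items if pipeline.process_item(i) is not None]
```

At scrape-scale the in-memory set is the part you replace first, typically with Redis or a Bloom filter, while the pipeline interface stays the same.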
  5. 4.5

    Custom Middlewares for Headers, Proxies, Cookies

    Three middleware patterns every production Scrapy project ships: User-Agent rotation, proxy injection, cookie session management.

    Lab: /challenges/antibot/header-fingerprint

    intermediate
  6. 4.6

    CrawlSpider, SitemapSpider, and Other Specialized Spiders

    Scrapy ships specialized spider classes for whole-site crawls, sitemap traversal, and CSV/XML feed parsing. Knowing which to pick saves dozens of lines.

    Lab: /sitemap.xml

    intermediate
  7. 4.7

    scrapy-playwright, Hybrid Scrapy + Browser

    Add a real browser to Scrapy for the pages that need JavaScript, without throwing away the framework. The bridge between Sub-Path 3 and production-scale scraping.

    Lab: /challenges/dynamic/spa-pure

    intermediate
  8. 4.8

    Why Symfony for Scraping Infrastructure

    PHP isn't the obvious scraping language, but Symfony's component ecosystem is an unusually good fit for production scraping infrastructure. Here's why.

    intermediate
  9. 4.9

    Symfony Console, Building Scraper CLI Commands

    Every Symfony scraper starts as a Console command. Arguments, options, progress bars, dependency injection: the right way to ship a CLI scraper.

    intermediate
  10. 4.10

    Symfony HttpClient in Production Context

    Beyond `HttpClient::create()`: retries, timeouts, concurrent batching, decorators, and the mocking story for tests.

    Lab: /api/products

    intermediate
  11. 4.11

    Symfony Messenger, Async Jobs and Queues

    Push scraping work off the main process into queue workers. Messenger is the Symfony component that makes distributed scraping straightforward.

    Lab: /products

    intermediate
  12. 4.12

    Symfony Scheduler, Cron-Style Scraping Inside Your App

    Replace external cron with Symfony Scheduler. Recurring messages, missed-run handling, and the right way to schedule scrapers inside a Symfony app.

    intermediate
  13. 4.13

    Doctrine ORM for Scraped Data Persistence

    How to model scraped entities, write efficient upserts, and avoid Doctrine's classic memory pitfalls at scrape-scale.

    Lab: /products

    intermediate
  14. 4.14

    Symfony Panther in Production

    Real browser automation from PHP. When Panther is the right tool, how to run it reliably, and where it falls short of Playwright.

    Lab: /deals/live

    intermediate
  15. 4.15

    Building a Scraping API with API Platform

    Expose scraped data as a queryable REST/GraphQL API in a few hours. API Platform turns Doctrine entities into production endpoints with filters, pagination, and OpenAPI docs.

    intermediate
  16. 4.16

    Symfony Lock and Rate Limiter for Polite Scraping

    Two Symfony components that turn 'be polite to the target' from intention into enforcement. Distributed locks for one-scraper-per-domain; rate limiters for request-per-second caps.

    Lab: /robots.txt

    intermediate
  17. 4.17

    Symfony Serializer for Multi-Format Output

    One serializer, many formats. Turn scraped DTOs into JSON, CSV, XML, YAML, or custom formats, without writing per-format code.

    intermediate
  18. 4.18

    Goutte, The Original PHP Scraping Wrapper

    Goutte was the go-to PHP scraper for a decade. It still works, it's still in many codebases, and its abstractions live on in modern Symfony. Why it matters and when to use it.

    intermediate
  19. 4.19

    Roach PHP, A Scrapy-Inspired PHP Scraping Framework

    Roach brings Scrapy's spider/pipeline architecture to PHP. When the framework is worth its overhead and where it fits relative to Symfony and Laravel.

    intermediate
  20. 4.20

    When to Use Which PHP Framework

    Symfony, Roach, Goutte/HttpBrowser, Laravel, raw Guzzle: a decision matrix for which to reach for, with honest trade-offs.

    intermediate
  21. 4.21

    Python: asyncio, httpx, aiohttp for High Throughput

    The async toolkit for Python scraping at scale. When to reach for asyncio over Scrapy, and how to write a clean async scraper without footguns.

    Lab: /api/products

    intermediate
  22. 4.22

    Python Concurrency Control: Semaphores and Rate Limits

    Bound concurrency, enforce request rates, honour 429 backoff. Three primitives that turn an async scraper into a polite, well-behaved one.

    Lab: /api/products

    intermediate
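The bounding primitive behind this lesson is `asyncio.Semaphore`. A runnable sketch with a simulated fetch; the URLs and the `sleep` stand in for real HTTP calls:

```python
import asyncio

MAX_CONCURRENCY = 3

async def fetch(url, sem, counter):
    async with sem:  # at most MAX_CONCURRENCY coroutines pass this point at once
        counter["now"] += 1
        counter["peak"] = max(counter["peak"], counter["now"])
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        counter["now"] -= 1
        return url

async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    counter = {"now": 0, "peak": 0}
    urls = [f"/api/products?page={i}" for i in range(10)]
    # gather launches all ten tasks; the semaphore throttles them.
    results = await asyncio.gather(*(fetch(u, sem, counter) for u in urls))
    return results, counter["peak"]

results, peak = asyncio.run(main())
```

The same shape works with httpx or aiohttp: acquire the semaphore around the request, not around the whole task, so parsing and storage still overlap freely.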
  23. 4.23

    PHP: ReactPHP for Async Scraping

    The async PHP runtime. ReactPHP's event loop, promises, and HTTP client: the PHP analog to Node.js or Python's asyncio.

    intermediate
  24. 4.24

    PHP: Amp for Concurrent HTTP

    The fiber-based async runtime for PHP. Amp v3 lets you write sync-looking code that runs concurrently; arguably the most ergonomic PHP async model.

    intermediate
  25. 4.25

    PHP: Symfony HttpClient Async Streaming

    Often the easiest way to get PHP concurrency: Symfony HttpClient's `stream()` API. No fibers, no promises, just sync-looking code that multiplexes underneath.

    intermediate
  26. 4.26

    Proxy Types: Datacenter, Residential, Mobile, ISP

    The four proxy categories every scraper needs to understand. Cost, detectability, and the right use case for each.

    intermediate
  27. 4.27

    Provider Comparison: Bright Data, Oxylabs, Smartproxy, IPRoyal, Others

    An honest survey of the major proxy providers in 2026. Pricing tiers, target fit, ergonomic quirks. Not sponsored.

    intermediate
  28. 4.28

    Rotating Proxy Strategies (Per-Request, Per-Session, Sticky)

    Three rotation strategies and when to use each. The mismatch between rotation and session state is the #1 source of bans.

    intermediate
  29. 4.29

    Geographic Targeting

    Country-, region-, and city-level proxy targeting. When geography matters, how to specify it correctly, and the gotchas every scraper hits.

    intermediate
  30. 4.30

    Proxy Health Checks, Failover, and Cost Optimization

    Production proxy management: detecting dead proxies, failing over, and cutting waste. The operational side of proxy infrastructure.

    intermediate
  31. 4.31

    Building Your Own Lightweight Proxy Pool

    When buying isn't appropriate or you need fine-grained control, here's the architecture for a self-managed proxy pool with rotation, health, and failover.

    intermediate
  32. 4.32

    Browser Fingerprinting, A Complete Map

    Every dimension along which anti-bot vendors fingerprint your scraper. A reference map for what you're up against, and what's hardest to spoof.

    Lab: /challenges/antibot/canvas-fingerprint

    advanced
  33. 4.33

    User-Agent and Header Rotation (the Right Way)

    Header rotation done badly is worse than not rotating. The principles that produce headers an anti-bot vendor can't distinguish from a real browser.

    Lab: /challenges/antibot/header-fingerprint

    advanced
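The core principle is to rotate whole, internally consistent header profiles rather than individual headers; mixing, say, a Firefox User-Agent with Chrome's Sec-Ch-Ua headers is itself a fingerprint. A minimal sketch with placeholder strings (real profiles would be captured from real browsers):

```python
import random

# Each profile is a complete, internally consistent header set.
# The string values here are placeholders, not real browser headers.
PROFILES = [
    {
        "User-Agent": "UA-chrome-example",
        "Accept-Language": "en-US,en;q=0.9",
        "Sec-Ch-Ua": '"Chromium";v="120"',  # Chromium-only header
    },
    {
        "User-Agent": "UA-firefox-example",
        "Accept-Language": "en-GB,en;q=0.8",
        # no Sec-Ch-Ua: Firefox does not send client hints
    },
]

def pick_headers():
    # Choose one profile atomically; never mix keys across profiles.
    return dict(random.choice(PROFILES))

headers = pick_headers()
```

In practice the profile should also be pinned per session, so one logical visitor doesn't change browsers between requests.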
  34. 4.34

    TLS / HTTP/2 Fingerprint Rotation

    Below HTTP, TLS and HTTP/2 reveal which library is calling. The tooling that makes Python and PHP look like real browsers at the wire level.

    Lab: /challenges/antibot/tls-fingerprint

    advanced
  35. 4.35

    Defeating Cloudflare in 2026, Current Strategies

    What actually works against Cloudflare's bot management in 2026. Tools, patterns, and honest assessment of the ongoing cat-and-mouse.

    Lab: /challenges/antibot/js-challenge

    advanced
  36. 4.36

    DataDome, PerimeterX, Akamai, Kasada, Survey

    A practical tour of the bot-management vendors you'll encounter besides Cloudflare. What each is known for and how scrapers approach them.

    advanced
  37. 4.37

    When to Give Up and Use a SERP/Scraping API Instead

    Honest economics. When the cost of a commercial scraping API beats the cost of DIY. The signals that say 'stop fighting; pay someone.'

    intermediate
  38. 4.38

    CAPTCHA Types in 2026

    A current map of CAPTCHAs scrapers encounter. What each is, what it's checking, and which are practically solvable.

    intermediate
  39. 4.39

    Third-Party Solvers: 2Captcha, CapSolver, Anti-Captcha

    The main CAPTCHA-solving services in 2026. Pricing, API patterns, reliability differences.

    intermediate
  40. 4.40

    Integrating CAPTCHA Solving in Python and PHP Scrapers

    Wire a CAPTCHA solver into a real scraper. The patterns that handle detection, solving, retry, and token injection in Scrapy and Symfony.

    Lab: /challenges/antibot/captcha-mock

    intermediate
  41. 4.41

    Avoiding CAPTCHAs in the First Place (Cheaper, Always)

    Every CAPTCHA you don't trigger is one you don't pay for, wait for, or fail at. The hygiene that keeps CAPTCHA rates low.

    intermediate
  42. 4.42

    When You've Outgrown a Single Machine

    Signals that your scraper needs to become distributed. The architectural patterns and the cost of crossing that line.

    advanced
  43. 4.43

    Redis as a Task Queue (rq, custom)

    Redis is the simplest queue for distributed scraping. The patterns from raw LPUSH/BLPOP up to RQ, and where each fits.

    intermediate
  44. 4.44

    Celery for Python Workers

    Celery is the heavyweight Python distributed-task system. When its complexity earns its keep, and when rq or raw Redis is enough.

    advanced
  45. 4.45

    Symfony Messenger Multi-Worker Setup for PHP

    Scale Symfony Messenger from one worker to many. Worker management with systemd, supervisor, and Docker.

    intermediate
  46. 4.46

    RabbitMQ for Complex Routing

    When Redis lists aren't enough: RabbitMQ's exchanges, routing keys, fanout, and dead-letter queues. The heavyweight broker for complex pipelines.

    advanced
  47. 4.47

    Frontera and Scrapy Cluster

    The frontier: the queue-of-URLs abstraction behind huge crawls. Frontera and Scrapy Cluster are the two major Python implementations.

    advanced
  48. 4.48

    Coordinator/Worker Patterns

    The classical distributed-scraping architecture. Coordinator schedules; workers execute. Three concrete patterns and the right one for your scale.

    advanced
  49. 4.49

    PostgreSQL for Structured Scraped Data

    Postgres is the default sink for production scrapers. Schema design, upserts, JSONB for semi-structured fields, and the indexes that keep ingestion fast.

    intermediate
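The core upsert pattern is `INSERT ... ON CONFLICT DO UPDATE`. Postgres and SQLite share this syntax, so here is a runnable stdlib sketch using `sqlite3`; the `products` table and its columns are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        url   TEXT PRIMARY KEY,
        price REAL,
        seen  INTEGER DEFAULT 1
    )
""")

# Re-scraping the same URL updates the row instead of failing on the key.
UPSERT = """
    INSERT INTO products (url, price) VALUES (?, ?)
    ON CONFLICT(url) DO UPDATE SET
        price = excluded.price,  -- excluded = the row we tried to insert
        seen  = seen + 1         -- unqualified column = the existing row
"""

conn.execute(UPSERT, ("/p/1", 9.99))
conn.execute(UPSERT, ("/p/1", 8.49))  # price changed on re-scrape
row = conn.execute(
    "SELECT price, seen FROM products WHERE url = ?", ("/p/1",)
).fetchone()
```

Against Postgres the statement is near-identical via psycopg; the win is one round-trip per item with no read-before-write race.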
  50. 4.50

    MongoDB for Nested/Variable Data

    When scraped documents have deep nesting, optional fields, or wildly varying shapes per source, MongoDB is often easier than coercing them into Postgres rows.

    intermediate
  51. 4.51

    ClickHouse for Analytics-Scale Storage

    When you're storing billions of scraped events and need second-level analytical queries, ClickHouse is the right tool. Schema design, ingestion patterns, and the pitfalls.

    advanced
  52. 4.52

    S3 + Parquet for Cold Storage

    Object storage with columnar Parquet files is the cheapest durable home for scraped data you don't query daily. The patterns that make it efficient.

    intermediate
  53. 4.53

    Deduplication at Scale (Bloom Filters, Content Hashing)

    Once you're scraping millions of URLs, naive sets blow up your RAM. Bloom filters, content hashes, and the right approximate-or-exact tradeoff.

    advanced
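A Bloom filter answers "have I seen this URL?" in constant memory, with no false negatives and a tunable false-positive rate. A stdlib-only sketch; the sizes are illustrative, and a real crawl would use a tuned library:

```python
import hashlib

class BloomFilter:
    """Approximate membership: no false negatives, tunable false positives."""

    def __init__(self, size_bits=8192, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)  # fixed memory, regardless of item count

    def _positions(self, item):
        # Derive k independent bit positions by salting one hash function.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # "Maybe present" only if every position is set; any zero bit means definitely new.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
for url in ("/p/1", "/p/2", "/p/3"):
    seen.add(url)
```

The trade: a "maybe seen" answer occasionally skips a genuinely new URL, which is usually acceptable for crawl frontiers and never loses data you already stored.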
  54. 4.54

    Change Detection (Diff-Only Storage)

    Scraping the same pages repeatedly produces 99% redundant data. Store only what changed, and you'll cut storage and downstream load by orders of magnitude.

    intermediate
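Diff-only storage reduces to remembering one content hash per URL and writing only when it changes. A minimal sketch; the detector's in-memory dict stands in for a real key-value store:

```python
import hashlib

class ChangeDetector:
    """Stores one content hash per URL; reports whether a fetch changed anything."""

    def __init__(self):
        self.hashes = {}  # url -> digest of the last stored version

    def changed(self, url, content):
        digest = hashlib.sha256(content.encode()).hexdigest()
        if self.hashes.get(url) == digest:
            return False           # identical to last scrape: skip storage
        self.hashes[url] = digest  # new or changed: store and remember
        return True

detector = ChangeDetector()
first   = detector.changed("/p/1", "<h1>Widget</h1><p>$9.99</p>")
repeat  = detector.changed("/p/1", "<h1>Widget</h1><p>$9.99</p>")
updated = detector.changed("/p/1", "<h1>Widget</h1><p>$8.49</p>")
```

One refinement matters in practice: hash a normalized extraction of the page, not the raw HTML, or rotating ads and CSRF tokens will make every fetch look changed.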
  55. 4.55

    Structured Logging (JSON Logs in Python and PHP)

    Print statements don't scale. Structured logging, every event a JSON object, makes scrapers queryable, alertable, and debuggable in production.

    intermediate
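In Python this can be done with the stdlib alone: subclass `logging.Formatter` and emit one JSON object per record. A minimal sketch; the extra field names (`url`, `status`, `items`) are illustrative:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Renders every log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` land on the record as attributes.
        for key in ("url", "status", "items"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

stream = io.StringIO()  # stand-in for stdout; lets us inspect the output
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("scraper")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("page scraped", extra={"url": "/p/1", "status": 200, "items": 24})
event = json.loads(stream.getvalue())
```

One JSON object per line is exactly the shape Loki and Elasticsearch ingest without extra parsing configuration.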
  56. 4.56

    Centralized Logging (Loki, Elasticsearch)

    Once scrapers run on multiple hosts, you need a central place to query logs. Loki and Elasticsearch are the two main options: their tradeoffs, pipelines, and costs.

    intermediate
  57. 4.57

    Metrics with Prometheus + Grafana

    Logs tell you what happened. Metrics tell you the shape of normal vs abnormal. Prometheus scraping plus Grafana dashboards is the de facto stack.

    intermediate
  58. 4.58

    Key Scraper Metrics: Success Rate, Items/Sec, Ban Rate, Proxy Health

    The specific metrics that matter for a scraper, what they tell you, how to compute them, and the alert thresholds that catch real problems.

    Lab: /admin/stats

    intermediate
  59. 4.59

    Alerting (Slack, Email, PagerDuty)

    Alerts wake people up. Wrong alerts wake them up for nothing. The principles and concrete configs for scraper alerts that engineers thank you for.

    intermediate
  60. 4.60

    Incident Runbooks for Scrapers

    A runbook is the checklist you wish you had at 3am. The structure, the most common scraper incidents, and the runbook entries that cut mean-time-to-recover.

    intermediate
  61. 4.61

    Dockerizing Python Scrapers

    A reproducible Docker image is the unit of deployment for a modern scraper. Multi-stage builds, slim base images, and the runtime surface area you actually need.

    intermediate
  62. 4.62

    Dockerizing PHP / Symfony Scrapers

    PHP scrapers ship in Docker too. Symfony-specific patterns, FrankenPHP, opcache, and the differences from Python images.

    intermediate
  63. 4.63

    docker-compose for Local Full-Stack Dev

    Production scrapers depend on Postgres, Redis, Mongo, Loki. docker-compose runs the whole stack locally so your dev environment matches prod.

    intermediate
  64. 4.64

    Kubernetes for Scraper Workloads (Overview)

    What Kubernetes actually gives a scraping team, and when it's worth the operational cost. The minimum vocabulary and a runnable scraper deployment.

    advanced
  65. 4.65

    CI/CD for Scrapers (GitHub Actions for Python and PHP)

    Automated test, build, and deploy pipelines for scraping projects. The pipeline that catches selector-breakage before it hits production.

    intermediate
  66. 4.66

    Scheduling: cron, Airflow, Prefect, Symfony Scheduler

    From a cron line on a VPS to a workflow orchestrator with DAGs and retries: the scheduling tools you'll actually pick from.

    intermediate
  67. 4.67

    Cloud Platforms: VPS, AWS, Apify, Zyte Cloud, Hostinger VPS

    Where to actually run your scrapers, from a $5 VPS to managed scraping platforms. Cost, control, and the tradeoffs honestly compared.

    intermediate
  68. 4.68

    Serverless Scrapers on AWS Lambda

    When scrape workloads are bursty and you don't want idle infra, serverless can be cheap and clean. Where Lambda shines for scraping, and the hard limits to know.

    advanced
  69. 4.69

    Why Contributing to Scraping Libraries Matters for Your Career

    Public contributions to the libraries you depend on are among the highest-leverage career investments a scraping engineer can make. Why, and what to aim for.

    beginner
  70. 4.70

    Finding Good First Issues in Open-Source Scraping Projects

    A walkthrough for finding contribution opportunities in scraping libraries that actually fit a beginner's skill envelope.

    beginner
  71. 4.71

    Writing PRs Maintainers Actually Merge

    The difference between PRs that languish and PRs that merge is rarely the code; it's the framing, scope, tests, and tone. The concrete checklist.

    beginner
  72. 4.72

    Maintaining Your Own Composer / PyPI Package

    Publishing and maintaining a small scraping utility on PyPI or Packagist is one of the highest-leverage career moves a scraping engineer can make. The how, and the responsibilities.

    intermediate
  73. 4.73

    Documentation as a Force Multiplier

    The best engineers in scraping write docs. Why documentation pays back disproportionately, for your code, your team, and your career.

    beginner
  74. 4.74

    Freelance Scraping, Pricing, Clients, Contracts

    How freelance scraping work actually flows in 2026: pricing models, where clients come from, and the contract clauses that save your weekends.

    intermediate
  75. 4.75

    Productized Services and Data-as-a-Service

    From billing hours for custom scrapers to selling scraped data as a productized offering. The transition that breaks the time-for-money trade.

    intermediate
  76. 4.76

    Building a Scraping SaaS (Real Examples and Margins)

    The shape, economics, and risks of running a scraping SaaS, drawn from the public histories of companies in the space.

    advanced
  77. 4.77

    Writing Technical Content for Developer Audiences

    Why every scraping engineer should write publicly, and the formats, cadence, and distribution that actually work.

    beginner
  78. 4.78

    YouTube and Live-Coding for Scraping Devs

    Video and live-coding reach an audience text doesn't, and scraping is naturally suited to demonstration. The real cost-benefit, and how to start without overproducing.

    beginner
  79. 4.79

    Speaking at Meetups, Conferences, Submitting CFPs

    Speaking at events is high-leverage for career and network. The CFP process, talk structure, and how to start at meetups before tackling major conferences.

    beginner
  80. 4.80

    Communities Where Scraping Devs Hang Out (Reddit, dev.to, HN, X, Discord)

    Where the scraping community actually talks, learns, and trades work. A practical map of the online watering holes.

    beginner
  81. 4.81

    Building Your Personal Brand as a Scraping Expert

    Personal brand isn't marketing; it's the accumulated signal that you specialize in scraping. The components, the compounding, and the honest tradeoffs.

    intermediate
  82. 4.82

    The Legal Landscape: hiQ v. LinkedIn, CFAA, GDPR

    The landmark cases and statutes that shape what is and isn't legally OK in scraping as of 2026. Not legal advice; a working compass.

    intermediate
  83. 4.83

    Terms of Service, Enforceable or Not?

    Most sites prohibit scraping in their Terms of Service. When that prohibition is legally enforceable, when it isn't, and how courts have ruled.

    intermediate
  84. 4.84

    Copyright vs Facts, What You Can and Can't Redistribute

    Scraping data is one thing. Publishing or commercializing it is another. The line between facts and expression, and how courts have drawn it.

    intermediate
  85. 4.85

    Ethical Framework for Scraping Decisions

    Beyond what's legal, what's right? A practical ethics framework for deciding which scraping projects to take, which to decline, and how to operate.

    intermediate

Lab targets throughout point to Catalog108, our first-party practice scraping sandbox. Each lab page has a /grade endpoint that returns pass/fail on your scraper output.