Sub-path 2 of 6

Static Scraping

HTTP + HTML. Fast, lightweight. Python and PHP.

Send requests, parse HTML, follow pagination, submit forms, store results. Taught in Python (requests + BeautifulSoup + lxml) and PHP (Guzzle + DomCrawler), equally first-class. Every lesson lands on a stable lab target at Catalog108.

~6 weeks part-time · 34 lessons

Lessons

  1. 1.1

    Your First Scraper: requests + BeautifulSoup

Build a working scraper in fifteen lines of Python. Fetch a page, parse it, and pull out structured data: the canonical static-scraping pipeline.

    Lab: /

    beginner
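A minimal sketch of the fetch-parse-extract pipeline this first lesson builds. The HTML snippet, class names, and field names below are invented stand-ins for a lab page; a real run would fetch the markup over HTTP instead.

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched page; in a real scraper you'd get this with:
#   html = requests.get(url, timeout=10).text
html = """
<div class="product"><h2>Widget</h2><span class="price">$9.99</span></div>
<div class="product"><h2>Gadget</h2><span class="price">$19.99</span></div>
"""

# Parse once, then extract one dict per repeating element.
soup = BeautifulSoup(html, "html.parser")
items = [
    {"name": d.h2.get_text(strip=True),
     "price": d.select_one(".price").get_text(strip=True)}
    for d in soup.select("div.product")
]
print(items)
```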
  2. 1.2

    GET Requests, Query Parameters, Headers

    Anatomy of an HTTP GET request: URLs, query strings, headers, and how to control them precisely in Python's requests library.

    Lab: /products

    beginner
  3. 1.3

    POST Requests: Form Data and JSON Payloads

When scraping requires sending data (search forms, login forms, JSON APIs), you need POST. Master the three body formats and when each applies.

    Lab: /challenges/static/forms/post

    beginner
  4. 1.4

    Sessions, Cookies, and Persistent State

Use `requests.Session` to persist cookies, default headers, and connection pools across many requests: the right way to scrape any site that tracks state.

    Lab: /challenges/static/cookies/required

    beginner
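A quick sketch of what `requests.Session` shares across requests. The User-Agent string and cookie value are made up; normally the server sets the cookie via a `Set-Cookie` response header rather than you seeding it by hand.

```python
import requests

# A Session persists cookies, default headers, and a connection pool
# across requests: set shared state once, reuse it everywhere.
session = requests.Session()
session.headers.update({"User-Agent": "my-scraper/1.0"})

# Seed a cookie manually to show the jar; normally a response sets it
# and the session sends it back automatically on later requests.
session.cookies.set("session_id", "abc123")

# Every session.get()/post() now carries the header and the cookie:
#   resp = session.get("https://example.com/account")  # network call, not run here
print(session.headers["User-Agent"], dict(session.cookies))
```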
  5. 1.5

User-Agents: Why They Matter and How to Set Them

    The User-Agent header is the single biggest tell that you're a scraper. Learn what it's for, what real browsers send, and how to use it strategically.

    Lab: /challenges/antibot/ua-blocklist

    beginner
  6. 1.6

    Timeouts, Retries, Exponential Backoff

    Real networks fail. Real scrapers handle it. Learn to set timeouts properly, retry transient failures, and back off exponentially without hammering a struggling server.

    Lab: /challenges/api/rest/flaky

    intermediate
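The retry-with-backoff pattern this lesson teaches can be sketched as a small helper. The `flaky` function below is a fake stand-in for an unreliable endpoint; the helper name and delay values are illustrative, not from the course.

```python
import time
import random

def fetch_with_retries(fetch, retries=4, base_delay=0.5):
    """Call fetch(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the failure
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# A fake "flaky endpoint" that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = fetch_with_retries(flaky, base_delay=0.05)
print(result)  # succeeds on the third call
```

The jitter term keeps many scrapers from retrying in lockstep against the same struggling server.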
  7. 1.7

    SSL Verification, Proxies, Authentication

    Three production concerns: TLS certificate handling, routing requests through proxies, and authenticating to protected endpoints with Basic, Bearer, and Digest schemes.

    Lab: /challenges/api/auth/basic

    intermediate
  8. 1.8

Raw cURL in PHP: Foundations Every PHP Dev Must Know

    The libcurl bindings ship with every PHP install. Master them, and every HTTP library you use later makes more sense.

    Lab: /products

    beginner
  9. 1.9

    Guzzle: The Industry-Standard PHP HTTP Client

    Guzzle wraps cURL with a clean, modern API. PSR-7 messages, sessions, async, middleware. The default choice for any serious PHP scraper.

    Lab: /products?page=2

    beginner
  10. 1.10

Symfony HttpClient: A Modern, Async-Ready Alternative

    Symfony's HTTP client is the modern PHP alternative to Guzzle: chunked streaming, native HTTP/2, async by default, and tight integration with the rest of Symfony.

    Lab: /challenges/static/pagination/cursor

    intermediate
  11. 1.11

PHP Sessions, Cookies, and Headers: Hands-On

    Concretely: how PHP scrapers persist cookies across requests with Guzzle, Symfony HttpClient, and raw cURL, and how to inspect and override request headers.

    Lab: /challenges/static/cookies/set-on-visit

    intermediate
  12. 1.12

Python requests vs PHP Guzzle: Side-by-Side

    The same scraping task, implemented in both Python and PHP, side by side. Honest tradeoffs so you can pick the right language for the right job.

    Lab: /products

    intermediate
  13. 1.13

    BeautifulSoup: find, find_all, select

The three workhorse selection methods of BeautifulSoup, when to use each, and the small idioms that separate a beginner from a comfortable practitioner.

    Lab: /challenges/static/lists/cards

    beginner
  14. 1.14

    BeautifulSoup Tree Navigation

Once you've found one element, you can walk to any other. Parents, children, siblings, next, previous: the navigation API that handles layouts with no clean selectors.

    Lab: /challenges/static/lists/nested

    intermediate
  15. 1.15

lxml and XPath in Python: 10x Faster

    When BeautifulSoup is too slow or the structure too irregular, drop down to lxml directly. XPath gives you axes and predicates BeautifulSoup can't match.

    Lab: /challenges/static/tables/nested

    intermediate
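A small taste of the XPath expressiveness this lesson covers: axes like `following-sibling` reach elements that plain CSS selectors can't. The table markup is an invented example, not a lab page.

```python
from lxml import html

# "The cell that follows the header cell whose text is 'Price'" --
# an axis-and-predicate query with no CSS-selector equivalent.
doc = html.fromstring("""
<table>
  <tr><th>Name</th><td>Widget</td></tr>
  <tr><th>Price</th><td>$9.99</td></tr>
</table>
""")
price = doc.xpath("//th[text()='Price']/following-sibling::td/text()")[0]
print(price)  # $9.99
```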
  16. 1.16

    Handling Encoding and Broken HTML

Real-world HTML is messy: mixed encodings, malformed tags, garbage characters. How to detect, decode, and parse it without losing data.

    Lab: /challenges/static/encoding/broken

    intermediate
  17. 1.17

    PHP DOMDocument and DOMXPath

    The native PHP HTML/XML parser. No Composer dependencies, ships with every PHP install, supports DOM traversal and XPath queries.

    Lab: /challenges/static/tables/simple

    beginner
  18. 1.18

Symfony DomCrawler: The Modern PHP Parser

    DomCrawler wraps DOMDocument with a fluent jQuery-like API, supports both CSS and XPath, and is the default HTML parser for any non-trivial PHP scraper.

    Lab: /challenges/static/lists/cards

    beginner
  19. 1.19

Symfony BrowserKit: Simulating a Browser in Pure PHP

BrowserKit gives you browser-like navigation in pure PHP (cookies, history, form submission, follow-redirect) without launching a real browser. The right tool for stateful scraping flows.

    Lab: /account/login

    intermediate
  20. 1.20

    PHP Parser Comparison: DOMDocument vs DomCrawler vs paquettg

    Three popular PHP HTML parsers compared on the same page: DOMDocument (stdlib), Symfony DomCrawler, and paquettg/php-html-parser. Honest tradeoffs.

    Lab: /blog

    intermediate
  21. 1.21

Scraping Tables: From HTML to Structured Data

    Tables are everywhere in scraping. Headers, rows, cells, rowspan/colspan, nested tables. Master the patterns and turn HTML grids into clean dicts and DataFrames.

    Lab: /challenges/static/tables/simple

    intermediate
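The core table-to-dicts move this lesson builds on can be sketched in a few lines: read the headers once, then zip each row's cells against them. The table content is invented; real lab tables add rowspan/colspan wrinkles this sketch ignores.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>$9.99</td></tr>
  <tr><td>Gadget</td><td>$19.99</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Headers become dict keys; each data row becomes one dict.
headers = [th.get_text(strip=True) for th in soup.find_all("th")]
rows = [
    dict(zip(headers, (td.get_text(strip=True) for td in tr.find_all("td"))))
    for tr in soup.find_all("tr")
    if tr.find("td")  # skip the header-only row
]
print(rows)
```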
  22. 1.22

Scraping Lists, Cards, and Repeating Patterns

Card grids, list views, search results: the second-most-common HTML data pattern after tables. The systematic 'find the container, iterate items, extract per-item' approach.

    Lab: /challenges/static/lists/cards

    intermediate
  23. 1.23

Pagination: The Five Common Patterns and How to Detect Them

Every paginated site uses one of five patterns: numbered, offset, cursor, load-more, or unknown-end. Identify which one you're facing, scrape it correctly, and stop at the right time.

    Lab: /challenges/static/pagination/numbered

    intermediate
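The simplest of the five patterns, numbered pagination, can be sketched without any network at all. Here `fake_fetch` stands in for an HTTP GET of `?page=N`; the stopping rule (an empty page means you've run past the end) is the part that generalizes.

```python
# Fake paginated data standing in for a site's ?page=N responses.
PAGES = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}

def fake_fetch(page):
    return PAGES.get(page, [])

def scrape_all_pages(fetch):
    """Request page 1, 2, 3, ... until a page comes back empty."""
    results, page = [], 1
    while True:
        items = fetch(page)
        if not items:          # empty page => past the last page, stop
            break
        results.extend(items)
        page += 1
    return results

all_items = scrape_all_pages(fake_fetch)
print(all_items)  # ['a', 'b', 'c', 'd', 'e']
```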
  24. 1.24

    Following Sitemaps for Discovery

    `sitemap.xml` is the structured index of a site's URLs that the site itself publishes. Use it to discover every page worth scraping without crawling blindly.

    Lab: /sitemap.xml

    intermediate
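Since a sitemap is plain XML, the stdlib is enough to pull every listed URL. The sitemap body below is a minimal invented example; the namespace URI is the one the sitemap protocol defines.

```python
import xml.etree.ElementTree as ET

# One <url><loc>...</loc></url> entry per page the site wants indexed.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products</loc></url>
</urlset>"""

# Sitemaps live in a namespace, so queries must be namespace-qualified.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```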
  25. 1.25

    Form Submission with CSRF Tokens

Most forms hide a CSRF token to block bots. Fetch the form, extract the token, and submit it back along with your real fields: the canonical scraper pattern.

    Lab: /challenges/static/forms/csrf

    intermediate
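The fetch-extract-resubmit pattern in miniature. `form_html` stands in for the fetched form page, and the field names (`csrf_token`, `q`) are invented; real forms vary, so inspect the hidden inputs on the page you're targeting.

```python
from bs4 import BeautifulSoup

# Stand-in for the GET response that contains the form.
form_html = """
<form method="post" action="/search">
  <input type="hidden" name="csrf_token" value="t0k3n-xyz">
  <input type="text" name="q">
</form>
"""

# Lift the hidden token, then send it back with your real fields.
soup = BeautifulSoup(form_html, "html.parser")
token = soup.find("input", {"name": "csrf_token"})["value"]
payload = {"csrf_token": token, "q": "widgets"}
# requests.post(url, data=payload)  # the actual submit, not run here
print(payload)
```

The same session that fetched the form must usually do the POST, since the token is tied to your cookie.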
  26. 1.26

    Multi-Step Login Flows

    Beyond a single login form: multi-step wizards, MFA prompts, captchas, and the patterns to handle each from a scraper.

    Lab: /challenges/static/forms/multi-step

    intermediate
  27. 1.27

    File Downloads: Images, PDFs, ZIPs

    Beyond HTML: how to download binary files efficiently, stream big files without exhausting memory, and verify the file you got is the file you wanted.

    Lab: /challenges/static/files/images

    intermediate
  28. 1.28

Polite Scraping: robots.txt, Delays, Rate Limits

    Stay welcome on the sites you scrape. Respect robots.txt, throttle yourself, identify cleanly, and recognise when you're being told to slow down.

    Lab: /robots.txt

    intermediate
  29. 1.29

    Output Formats: CSV, JSON, JSONL (Python and PHP)

    Saving scraped data right: when to choose CSV vs JSON vs JSONL, how to write them safely in Python and PHP, and how to avoid the common quoting and encoding bugs.

    Lab: /api/products

    beginner
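A sketch of the JSONL write/read round trip in Python: one JSON object per line, appendable and streamable, with each line parsing independently. `io.StringIO` stands in for a real file handle; the records are invented.

```python
import json
import io

records = [
    {"name": "Widget", "price": 9.99},
    {"name": "Gadget", "price": 19.99},
]

# Writing: one json.dumps per record, newline-terminated.
# ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX-escaped.
buf = io.StringIO()  # stands in for open("out.jsonl", "a", encoding="utf-8")
for rec in records:
    buf.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Reading back is symmetric: one json.loads per line.
parsed = [json.loads(line) for line in buf.getvalue().splitlines()]
print(parsed)
```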
  30. 1.30

    SQLite for Embedded Scraper Storage

    SQLite is the perfect scraper backend: zero-config, file-based, queryable. Skip CSV/JSON for scrapes you'll re-run or query.

    Lab: /products

    intermediate
  31. 1.31

    Data Cleaning with pandas (Python)

Scraped data is dirty. Use pandas to type-coerce, normalize, dedupe, and reshape into something usable downstream: the canonical post-scrape pipeline.

    Lab: /api/products/1/reviews

    intermediate
  32. 1.32

Data Cleaning with PHP: Filters, Validators, Pipelines

PHP scrapers also produce dirty data. Use filter_var, validators, and a small pipeline pattern to coerce, validate, and reshape: the PHP counterpart to pandas.

    Lab: /api/products/1/reviews

    intermediate
  33. 1.33

    Deduplication Strategies

    Scrapers produce duplicates: re-runs, paginated overlap, multiple URLs for the same item, near-identical rows with whitespace differences. Strategies from exact-match to fuzzy.

    Lab: /blog

    intermediate
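The simplest strategy on the exact-to-fuzzy spectrum, exact match after normalization, can be sketched as a seen-set over normalized rows. The normalization rules (collapse whitespace, lowercase) and sample rows are illustrative choices, not the course's definitive recipe.

```python
def normalize(row):
    """Collapse runs of whitespace and lowercase every value,
    so cosmetically different rows hash to the same key."""
    return tuple(" ".join(str(v).split()).lower() for v in row.values())

def dedupe(rows):
    seen, out = set(), []
    for row in rows:
        key = normalize(row)
        if key not in seen:
            seen.add(key)
            out.append(row)  # keep the first occurrence
    return out

rows = [
    {"name": "Widget  A", "price": "$9.99"},
    {"name": "widget a", "price": "$9.99"},   # same item, messier case/whitespace
    {"name": "Gadget B", "price": "$19.99"},
]
deduped = dedupe(rows)
print(len(deduped))  # 2
```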
  34. 1.34

    Resumable Scraping with Checkpoints

    Real scrapers crash, get killed, or are politely stopped mid-run. Resume from where you left off, without re-downloading or duplicating, using checkpoints and state files.

    Lab: /products

    intermediate
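The checkpoint idea in miniature: persist the set of finished URLs to a small state file after each item, and skip them on restart. The file name, state format, and URL list are all invented for illustration; the "crash" between the two runs is simulated by simply reloading the file.

```python
import json
import os
import tempfile

state_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

def load_done():
    """Read the set of already-scraped URLs, or start empty."""
    if os.path.exists(state_path):
        with open(state_path) as f:
            return set(json.load(f))
    return set()

def mark_done(done, url):
    """Record one finished URL and flush state to disk immediately."""
    done.add(url)
    with open(state_path, "w") as f:
        json.dump(sorted(done), f)

urls = ["/products?page=1", "/products?page=2", "/products?page=3"]

done = load_done()
mark_done(done, urls[0])          # first run scrapes page 1, then "crashes"

done = load_done()                # second run resumes from the state file
remaining = [u for u in urls if u not in done]
print(remaining)  # ['/products?page=2', '/products?page=3']
```

Writing the state file after every item trades a little I/O for never losing more than one item's work.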

Every lesson has a hands-on lab target on Catalog108, our first-party practice scraping sandbox. Each lab page has a /grade endpoint that returns pass/fail on your scraper output.