Sub-path 2 of 6

Static Scraping

HTTP + HTML. Fast, lightweight. Python and PHP.

Send requests, parse HTML, follow pagination, submit forms, store results. Taught in Python (requests + BeautifulSoup + lxml) and PHP (Guzzle + DomCrawler), equally first-class. Every lesson lands on a stable lab target at Catalog108.
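
As a rough illustration of the parse-and-store half of that pipeline, here is a minimal sketch. It deliberately swaps in Python's standard-library `html.parser` and `csv` modules for requests, BeautifulSoup, and lxml, so it runs without installing anything and without touching the network. The sample markup, the CSS class names, and the helper names (`ProductParser`, `parse_products`, `to_csv`) are assumptions made for illustration, not the actual structure of any Catalog108 page.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: real lab pages may be structured differently.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">19.99</span></li>
  <li class="product"><span class="name">Teapot</span><span class="price">24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect the text of <span class="name"> and <span class="price"> per product."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per <li class="product">
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})  # start a new record
        elif tag == "span" and css_class in ("name", "price") and self.rows:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a field we are tracking.
        if self._field:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_products(html):
    """Parse an HTML string and return a list of {"name", "price"} dicts."""
    parser = ProductParser()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_products(SAMPLE_HTML)
print(to_csv(rows))
```

In a real scraper the HTML string would come from an HTTP response rather than a constant, and a library parser would replace the hand-written `HTMLParser` subclass.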

~6 weeks part-time · 34 lessons

Lessons

1.1
Your First Scraper: requests + BeautifulSoup
Build a working scraper in fifteen lines of Python. Fetch a page, parse it, and pull out structured data: the canonical static-scraping pipeline.
Lab: /
beginner
1.2
GET Requests, Query Parameters, Headers
Anatomy of an HTTP GET request: URLs, query strings, headers, and how to control them precisely in Python's requests library.
Lab: /products
beginner
1.3
POST Requests: Form Data and JSON Payloads
When scraping requires sending data (search forms, login forms, JSON APIs), you need POST. Master the three body formats and when each applies.
Lab: /challenges/static/forms/post
beginner
1.4
Sessions, Cookies, and Persistent State
Use `requests.Session` to persist cookies, default headers, and connection pools across many requests. It is the right way to scrape any site that tracks state.
Lab: /challenges/static/cookies/required
beginner
1.5
User-Agents, Why They Matter, How to Set Them
The User-Agent header is the single biggest tell that you're a scraper. Learn what it's for, what real browsers send, and how to use it strategically.
Lab: /challenges/antibot/ua-blocklist
beginner
1.6
Timeouts, Retries, Exponential Backoff
Real networks fail. Real scrapers handle it. Learn to set timeouts properly, retry transient failures, and back off exponentially without hammering a struggling server.
Lab: /challenges/api/rest/flaky
intermediate
1.7
SSL Verification, Proxies, Authentication
Three production concerns: TLS certificate handling, routing requests through proxies, and authenticating to protected endpoints with Basic, Bearer, and Digest schemes.
Lab: /challenges/api/auth/basic
intermediate
1.8
Raw cURL in PHP, Foundations Every PHP Dev Must Know
The libcurl bindings ship with most PHP installs. Master them, and every HTTP library you use later makes more sense.
Lab: /products
beginner
1.9
Guzzle: The Industry-Standard PHP HTTP Client
Guzzle wraps cURL with a clean, modern API: PSR-7 messages, cookie-backed sessions, async requests, and middleware. The default choice for any serious PHP scraper.
Lab: /products?page=2
beginner
1.10
Symfony HttpClient, Modern, Async-Ready Alternative
Symfony's HTTP client is the modern PHP alternative to Guzzle: chunked streaming, native HTTP/2, async by default, and tight integration with the rest of Symfony.
Lab: /challenges/static/pagination/cursor
intermediate
1.11
PHP Sessions, Cookies, and Headers, Hands-On
Concretely: how PHP scrapers persist cookies across requests with Guzzle, Symfony HttpClient, and raw cURL, and how to inspect and override request headers.
Lab: /challenges/static/cookies/set-on-visit
intermediate
1.12
Python requests vs PHP Guzzle, Side-by-Side
The same scraping task, implemented in both Python and PHP, side by side. Honest tradeoffs so you can pick the right language for the right job.
Lab: /products
intermediate
1.13
BeautifulSoup: find, find_all, select
The three workhorse selection methods of BeautifulSoup, when to use each, and the small idioms that separate a beginner from a comfortable user.
Lab: /challenges/static/lists/cards
beginner
1.14
BeautifulSoup Tree Navigation
Once you've found one element, you can walk to any other. Parents, children, siblings, next, previous: the navigation API that handles layouts with no clean selectors.
Lab: /challenges/static/lists/nested
intermediate
1.15
lxml and XPath in Python, 10x Faster
When BeautifulSoup is too slow or the structure too irregular, drop down to lxml directly. XPath gives you axes and predicates BeautifulSoup can't match.
Lab: /challenges/static/tables/nested
intermediate
1.16
Handling Encoding and Broken HTML
Real-world HTML is messy: mixed encodings, malformed tags, garbage characters. How to detect, decode, and parse it without losing data.
Lab: /challenges/static/encoding/broken
intermediate
1.17
PHP DOMDocument and DOMXPath
The native PHP HTML/XML parser. No Composer dependencies, enabled by default in PHP, supports DOM traversal and XPath queries.
Lab: /challenges/static/tables/simple
beginner
1.18
Symfony DomCrawler, The Modern PHP Parser
DomCrawler wraps DOMDocument with a fluent jQuery-like API, supports both CSS and XPath, and is the default HTML parser for any non-trivial PHP scraper.
Lab: /challenges/static/lists/cards
beginner
1.19
Symfony BrowserKit, Simulating a Browser in Pure PHP
BrowserKit gives you browser-like navigation in pure PHP (cookies, history, form submission, redirect following) without launching a real browser. The right tool for stateful scraping flows.
Lab: /account/login
intermediate
1.20
PHP Parser Comparison: DOMDocument vs DomCrawler vs paquettg
Three popular PHP HTML parsers compared on the same page: DOMDocument (stdlib), Symfony DomCrawler, and paquettg/php-html-parser. Honest tradeoffs.
Lab: /blog
intermediate
1.21
Scraping Tables, From HTML to Structured Data
Tables are everywhere in scraping. Headers, rows, cells, rowspan/colspan, nested tables. Master the patterns and turn HTML grids into clean dicts and DataFrames.
Lab: /challenges/static/tables/simple
intermediate
1.22
Scraping Lists, Cards, Repeating Patterns
Card grids, list views, and search results are the second-most-common HTML data pattern after tables. Learn the systematic approach: find the container, iterate over the items, extract the fields from each one.
Lab: /challenges/static/lists/cards
intermediate
1.23
Pagination, The 5 Common Patterns and How to Detect Them
Most paginated sites use one of five patterns: numbered, offset, cursor, load-more, or unknown-end. Identify which one you face, scrape it correctly, and stop at the right time.
Lab: /challenges/static/pagination/numbered
intermediate
1.24
Following Sitemaps for Discovery
`sitemap.xml` is the structured index of a site's URLs that the site itself publishes. Use it to discover every page worth scraping without crawling blindly.
Lab: /sitemap.xml
intermediate
1.25
Form Submission with CSRF Tokens
Many forms embed a hidden CSRF token. Its purpose is to prevent cross-site request forgery, but it also trips up naive scrapers. Fetch the form, extract the token, and submit it back along with your real fields: the canonical scraper pattern.
Lab: /challenges/static/forms/csrf
intermediate
1.26
Multi-Step Login Flows
Beyond a single login form: multi-step wizards, MFA prompts, captchas, and the patterns to handle each from a scraper.
Lab: /challenges/static/forms/multi-step
intermediate
1.27
File Downloads: Images, PDFs, ZIPs
Beyond HTML: how to download binary files efficiently, stream big files without exhausting memory, and verify the file you got is the file you wanted.
Lab: /challenges/static/files/images
intermediate
1.28
Polite Scraping, robots.txt, Delays, Rate Limits
Stay welcome on the sites you scrape. Respect robots.txt, throttle yourself, identify cleanly, and recognise when you're being told to slow down.
Lab: /robots.txt
intermediate
1.29
Output Formats: CSV, JSON, JSONL (Python and PHP)
Saving scraped data right: when to choose CSV vs JSON vs JSONL, how to write them safely in Python and PHP, and how to avoid the common quoting and encoding bugs.
Lab: /api/products
beginner
1.30
SQLite for Embedded Scraper Storage
SQLite is the perfect scraper backend: zero-config, file-based, queryable. Skip CSV/JSON for scrapes you'll re-run or query.
Lab: /products
intermediate
1.31
Data Cleaning with pandas (Python)
Scraped data is dirty. Use pandas to type-coerce, normalize, dedupe, and reshape it into something usable downstream: the canonical post-scrape pipeline.
Lab: /api/products/1/reviews
intermediate
1.32
Data Cleaning with PHP, Filters, Validators, Pipelines
PHP scrapers also produce dirty data. Use filter_var, validators, and a small pipeline pattern to coerce, validate, and reshape it: the PHP counterpart to pandas.
Lab: /api/products/1/reviews
intermediate
1.33
Deduplication Strategies
Scrapers produce duplicates: re-runs, paginated overlap, multiple URLs for the same item, near-identical rows with whitespace differences. Strategies from exact-match to fuzzy.
Lab: /blog
intermediate
1.34
Resumable Scraping with Checkpoints
Real scrapers crash, get killed, or are politely stopped mid-run. Resume from where you left off, without re-downloading or duplicating, using checkpoints and state files.
Lab: /products
intermediate
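
The checkpoint idea behind the last lesson can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the course's own implementation: the state file is a JSON list of finished URLs, the helper names (`load_done`, `save_done`, `run`) are hypothetical, and `scrape_one` stands in for whatever function fetches and stores a single page.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of URLs recorded in the state file, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return set(json.load(fh))


def save_done(path, done):
    """Write the state file atomically: temp file first, then rename over it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    # os.replace swaps the file in one step, so a crash never leaves it half-written.
    os.replace(tmp_path, path)


def run(urls, state_path, scrape_one):
    """Scrape each URL not yet checkpointed; return the URLs processed this run."""
    done = load_done(state_path)
    processed = []
    for url in urls:
        if url in done:
            continue  # already finished in an earlier run
        scrape_one(url)  # may raise; the checkpoint then still holds earlier URLs
        done.add(url)
        save_done(state_path, done)  # checkpoint after every item
        processed.append(url)
    return processed
```

Calling `run(urls, "state.json", scrape_one)` a second time after an interruption skips every URL already recorded and continues with the rest. Checkpointing after every item is the simplest choice; a long-running scraper might batch the writes instead.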

Every lesson has a hands-on lab target on Catalog108, our first-party practice scraping sandbox. Each lab page has a /grade endpoint that returns pass/fail on your scraper output.
