Sub-path 2 of 6
Static Scraping
HTTP + HTML. Fast, lightweight. Python and PHP.
Send requests, parse HTML, follow pagination, submit forms, store results. Taught in Python (requests + BeautifulSoup + lxml) and PHP (Guzzle + DomCrawler), equally first-class. Every lesson lands on a stable lab target at Catalog108.
~6 weeks part-time · 34 lessons
Lessons
- 1.1 beginner
Your First Scraper: requests + BeautifulSoup
Build a working scraper in fifteen lines of Python. Fetch a page, parse it, and pull out structured data: the canonical static-scraping pipeline.
Lab: /
- 1.2 beginner
GET Requests, Query Parameters, Headers
Anatomy of an HTTP GET request: URLs, query strings, headers, and how to control them precisely in Python's requests library.
Lab: /products
- 1.3 beginner
POST Requests: Form Data and JSON Payloads
When scraping requires sending data (search forms, login forms, JSON APIs), you need POST. Master the three body formats and when each applies.
Lab: /challenges/static/forms/post
- 1.4 beginner
Sessions, Cookies, and Persistent State
Use `requests.Session` to persist cookies, default headers, and connection pools across many requests. It is the right way to scrape any site that tracks state.
Lab: /challenges/static/cookies/required
- 1.5 beginner
User-Agents: Why They Matter and How to Set Them
The User-Agent header is the single biggest tell that you're a scraper. Learn what it's for, what real browsers send, and how to use it strategically.
Lab: /challenges/antibot/ua-blocklist
- 1.6 intermediate
Timeouts, Retries, Exponential Backoff
Real networks fail. Real scrapers handle it. Learn to set timeouts properly, retry transient failures, and back off exponentially without hammering a struggling server.
Lab: /challenges/api/rest/flaky
- 1.7 intermediate
SSL Verification, Proxies, Authentication
Three production concerns: TLS certificate handling, routing requests through proxies, and authenticating to protected endpoints with Basic, Bearer, and Digest schemes.
Lab: /challenges/api/auth/basic
- 1.8 beginner
Raw cURL in PHP: Foundations Every PHP Dev Must Know
PHP's cURL extension, a binding to libcurl, is bundled with PHP and enabled on most installs. Master it, and every HTTP library you use later makes more sense.
Lab: /products
- 1.9 beginner
Guzzle: The Industry-Standard PHP HTTP Client
Guzzle wraps cURL with a clean, modern API. PSR-7 messages, sessions, async, middleware. The default choice for any serious PHP scraper.
Lab: /products?page=2
- 1.10 intermediate
Symfony HttpClient: A Modern, Async-Ready Alternative
Symfony's HTTP client is the modern PHP alternative to Guzzle: chunked streaming, native HTTP/2, async by default, and tight integration with the rest of Symfony.
Lab: /challenges/static/pagination/cursor
- 1.11 intermediate
PHP Sessions, Cookies, and Headers: Hands-On
Concretely: how PHP scrapers persist cookies across requests with Guzzle, Symfony HttpClient, and raw cURL, and how to inspect and override request headers.
Lab: /challenges/static/cookies/set-on-visit
- 1.12 intermediate
Python requests vs PHP Guzzle: Side by Side
The same scraping task, implemented in both Python and PHP, side by side. Honest tradeoffs so you can pick the right language for the right job.
Lab: /products
- 1.13 beginner
BeautifulSoup: find, find_all, select
The three workhorse selection methods of BeautifulSoup, when to use each, and the small idioms that separate a beginner from a comfortable user.
Lab: /challenges/static/lists/cards
- 1.14 intermediate
BeautifulSoup Tree Navigation
Once you've found one element, you can walk to any other. Parents, children, siblings, next, previous: the navigation API that handles layouts with no clean selectors.
Lab: /challenges/static/lists/nested
- 1.15 intermediate
lxml and XPath in Python: 10x Faster
When BeautifulSoup is too slow or the structure too irregular, drop down to lxml directly. XPath gives you axes and predicates BeautifulSoup can't match.
Lab: /challenges/static/tables/nested
- 1.16 intermediate
Handling Encoding and Broken HTML
Real-world HTML is messy: mixed encodings, malformed tags, garbage characters. How to detect, decode, and parse it without losing data.
Lab: /challenges/static/encoding/broken
- 1.17 beginner
PHP DOMDocument and DOMXPath
The native PHP HTML/XML parser. No Composer dependencies, ships with every PHP install, supports DOM traversal and XPath queries.
Lab: /challenges/static/tables/simple
- 1.18 beginner
Symfony DomCrawler: The Modern PHP Parser
DomCrawler wraps DOMDocument with a fluent jQuery-like API, supports both CSS and XPath, and is the default HTML parser for any non-trivial PHP scraper.
Lab: /challenges/static/lists/cards
- 1.19 intermediate
Symfony BrowserKit: Simulating a Browser in Pure PHP
BrowserKit gives you browser-like navigation in pure PHP (cookies, history, form submission, redirect following) without launching a real browser. The right tool for stateful scraping flows.
Lab: /account/login
- 1.20 intermediate
PHP Parser Comparison: DOMDocument vs DomCrawler vs paquettg
Three popular PHP HTML parsers compared on the same page: DOMDocument (stdlib), Symfony DomCrawler, and paquettg/php-html-parser. Honest tradeoffs.
Lab: /blog
- 1.21 intermediate
Scraping Tables: From HTML to Structured Data
Tables are everywhere in scraping. Headers, rows, cells, rowspan/colspan, nested tables. Master the patterns and turn HTML grids into clean dicts and DataFrames.
Lab: /challenges/static/tables/simple
- 1.22 intermediate
Scraping Lists, Cards, Repeating Patterns
Card grids, list views, search results: the second-most-common HTML data pattern after tables. The systematic 'find the container, iterate items, extract per-item' approach.
Lab: /challenges/static/lists/cards
- 1.23 intermediate
Pagination: The 5 Common Patterns and How to Detect Them
Almost every paginated site uses one of five patterns: numbered, offset, cursor, load-more, or unknown-end. Identify which one you face, scrape it correctly, and stop at the right time.
Lab: /challenges/static/pagination/numbered
- 1.24 intermediate
Following Sitemaps for Discovery
`sitemap.xml` is the structured index of a site's URLs that the site itself publishes. Use it to discover every page worth scraping without crawling blindly.
Lab: /sitemap.xml
- 1.25 intermediate
Form Submission with CSRF Tokens
Many forms embed a hidden CSRF token. Its purpose is to stop cross-site request forgery, but it also trips up naive scrapers. Fetch the form, extract the token, and submit it back along with your real fields: the canonical scraper pattern.
Lab: /challenges/static/forms/csrf
- 1.26 intermediate
Multi-Step Login Flows
Beyond a single login form: multi-step wizards, MFA prompts, captchas, and the patterns to handle each from a scraper.
Lab: /challenges/static/forms/multi-step
- 1.27 intermediate
File Downloads: Images, PDFs, ZIPs
Beyond HTML: how to download binary files efficiently, stream big files without exhausting memory, and verify the file you got is the file you wanted.
Lab: /challenges/static/files/images
- 1.28 intermediate
Polite Scraping: robots.txt, Delays, Rate Limits
Stay welcome on the sites you scrape. Respect robots.txt, throttle yourself, identify cleanly, and recognise when you're being told to slow down.
Lab: /robots.txt
- 1.29 beginner
Output Formats: CSV, JSON, JSONL (Python and PHP)
Saving scraped data right: when to choose CSV vs JSON vs JSONL, how to write them safely in Python and PHP, and how to avoid the common quoting and encoding bugs.
Lab: /api/products
- 1.30 intermediate
SQLite for Embedded Scraper Storage
SQLite is the perfect scraper backend: zero-config, file-based, queryable. Skip CSV/JSON for scrapes you'll re-run or query.
Lab: /products
- 1.31 intermediate
Data Cleaning with pandas (Python)
Scraped data is dirty. Use pandas to type-coerce, normalize, dedupe, and reshape it into something usable downstream: the canonical post-scrape pipeline.
Lab: /api/products/1/reviews
- 1.32 intermediate
Data Cleaning with PHP: Filters, Validators, Pipelines
PHP scrapers also produce dirty data. Use filter_var, validators, and a small pipeline pattern to coerce, validate, and reshape it: the PHP counterpart to pandas.
Lab: /api/products/1/reviews
- 1.33 intermediate
Deduplication Strategies
Scrapers produce duplicates: re-runs, paginated overlap, multiple URLs for the same item, near-identical rows with whitespace differences. Strategies from exact-match to fuzzy.
Lab: /blog
- 1.34 intermediate
Resumable Scraping with Checkpoints
Real scrapers crash, get killed, or are politely stopped mid-run. Resume from where you left off, without re-downloading or duplicating, using checkpoints and state files.
Lab: /products
Every lesson has a hands-on lab target on Catalog108, our first-party practice scraping sandbox. Each lab page has a /grade endpoint that returns pass/fail on your scraper output.
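As a taste of the style of code the lessons build toward, here is a minimal sketch of the retry-with-exponential-backoff idea named in lesson 1.6. It is an illustration only, not the course's own implementation: the function names `backoff_delays` and `fetch_with_retries`, the default base delay and cap, and the choice of which exceptions count as transient are all assumptions made for this example. It uses only the Python standard library, so `fetch` stands in for whatever HTTP call you would actually make.

```python
import random
import time


def backoff_delays(retries, base=0.5, cap=30.0):
    """Return the wait in seconds before each retry: base * 2**attempt, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def fetch_with_retries(fetch, retries=4, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fetch(); on a transient error, wait with exponential backoff and retry.

    `fetch` is any zero-argument callable (hypothetical here). Treating only
    ConnectionError and TimeoutError as transient is an assumption.
    """
    last_error = None
    # First attempt happens immediately; later attempts wait progressively longer.
    for delay in [0.0] + backoff_delays(retries, base, cap):
        if delay:
            # Add up to 10% random jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
        try:
            return fetch()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Passing `sleep` as a parameter keeps the sketch easy to test without real waiting; in a real scraper you would leave the default and wrap your request call in a small function handed in as `fetch`.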