Python Crash-Course for Scrapers
The 5% of Python you'll use 95% of the time when writing scrapers. Strings, lists, dicts, comprehensions, file I/O, error handling, and the f-string.
What you’ll learn
- Manipulate strings, lists, and dicts fluently.
- Use list/dict comprehensions to express transformations in one line.
- Read and write files and JSON cleanly.
- Catch exceptions correctly: narrow, not bare.
This isn't a Python tutorial. It's the specific subset you'll use writing scrapers, expressed densely.
Strings
url = "https://practice.scrapingcentral.com/products?page=2"
url.startswith("https://") # True
url.endswith(".pdf") # False
"products" in url # True
url.split("?") # ['https://practice...', 'page=2']
url.replace("page=2", "page=3") # rewrite a param
url.lower() # case-insensitive comparison
url.strip() # remove leading/trailing whitespace
The methods are obvious once you've seen them. The patterns s.startswith(...) and substring in big_string cover most of the string conditions you'll write.
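For instance, a typical link filter in a crawler chains a few of these. A minimal sketch (the URL list is made up):
urls = [
    "https://practice.scrapingcentral.com/products?page=2",
    "https://practice.scrapingcentral.com/about",
    "https://practice.scrapingcentral.com/catalog.pdf",
]
for u in urls:
    # Keep only HTTPS product pages, skipping PDFs
    if u.startswith("https://") and "products" in u and not u.endswith(".pdf"):
        print(u)  # only the first URL survives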
f-strings
The interpolation syntax you'll use 100x a day:
page = 2
url = f"https://practice.scrapingcentral.com/products?page={page}"
print(f"Fetching page {page}, URL = {url}")
# Formatting inside:
price = 14.99
print(f"${price:.2f}") # "$14.99", 2 decimal places
print(f"{page:03d}") # "002", zero-padded to 3 digits
f-strings are clearer and faster than %-formatting and .format(). Use them exclusively.
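For comparison, here is the same line in all three styles; the older two are shown only so you can recognize them in existing code:
page = 2
"Fetching page %d" % page          # %-formatting (legacy)
"Fetching page {}".format(page)    # str.format (older)
f"Fetching page {page}"            # f-string (use this)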
Lists
prices = [14.99, 24.95, 9.50, 49.00]
prices[0] # 14.99
prices[-1] # 49.00
prices[1:3] # [24.95, 9.50], slice
len(prices) # 4
prices.append(7.00) # add to end
prices.sort() # mutate in place
sorted(prices) # return new sorted list
sum(prices) / len(prices) # average
[p for p in prices if p < 20] # filter (comprehension)
The comprehension is the form you'll use most:
products = [...] # list of dicts
titles = [p["title"] for p in products] # extract one field
expensive = [p for p in products if p["price"] > 50] # filter
discounted = [{**p, "price": p["price"] * 0.9} for p in products] # transform
Three forms ([x for x in y], [x for x in y if cond], [f(x) for x in y if cond]) cover almost every transformation pattern.
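All three forms on one dataset, a minimal sketch with made-up products:
products = [
    {"title": " Mug ", "price": 14.99},
    {"title": "Teapot", "price": 54.00},
]
titles = [p["title"] for p in products]                             # [x for x in y]
pricey = [p for p in products if p["price"] > 50]                   # [x for x in y if cond]
clean = [p["title"].strip() for p in products if p["price"] > 50]   # [f(x) for x in y if cond]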
Dicts
The structure you'll move data through:
product = {
"id": 42,
"title": "Yellow ceramic mug",
"price": 14.99,
"tags": ["kitchen", "ceramic"],
}
product["title"] # access, KeyError if missing
product.get("title") # access, None if missing
product.get("description", "n/a") # access, default if missing
product["stock"] = 15 # assign
"id" in product # True
list(product.keys()) # ['id', 'title', 'price', 'tags']
list(product.values())
list(product.items()) # [('id', 42), ('title', '...')...]
{**product, "price": 12.99} # copy with overrides, does NOT mutate original
The .get(key, default) form is critical in scraping: missing fields are normal, and dict[key] raises KeyError when a key is missing.
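A defensive extraction step you'll write constantly, sketched here with a made-up raw record:
raw = {"id": 42, "title": "Yellow ceramic mug"}  # note: no "price" key
record = {
    "id": raw.get("id"),
    "title": raw.get("title", "").strip(),
    "price": raw.get("price", 0.0),  # default instead of KeyError
}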
Iterating
for k, v in product.items():
print(f"{k} = {v}")
# Build a new dict from another (dict comprehension)
slim = {k: v for k, v in product.items() if k in ("id", "title", "price")}
Functions
import requests

def fetch_page(url, timeout=10):
"""Fetch a single URL with a timeout. Returns response text."""
r = requests.get(url, timeout=timeout)
r.raise_for_status()
return r.text
# Default arguments remove the need for multiple variants of a function.
fetch_page("https://example.com")
fetch_page("https://example.com", timeout=30)
fetch_page("https://example.com", timeout=30)
Type hints (optional but recommended for shareable code):
def fetch_page(url: str, timeout: int = 10) -> str:
...
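As a fuller example, here is a small typed helper of the kind every scraper ends up with. A sketch: the name parse_price is made up, and the float | None union syntax needs Python 3.10+.
def parse_price(text: str) -> float | None:
    """Turn '$1,499.00' into 1499.0; return None if it isn't a price."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None

parse_price("$14.99")    # 14.99
parse_price("sold out")  # None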
File I/O
The context-manager form (with) handles closing the file automatically:
# Write text
with open("output.csv", "w", encoding="utf-8") as f:
f.write("id,title,price\n")
for p in products:
f.write(f'{p["id"]},"{p["title"]}",{p["price"]}\n')
# Read text
with open("input.txt", encoding="utf-8") as f:
for line in f: # iterates line by line, memory-friendly
line = line.strip()
process(line)
For CSV specifically, use the csv module; it handles quoting for you:
import csv
with open("output.csv", "w", encoding="utf-8", newline="") as f:
writer = csv.DictWriter(f, fieldnames=["id", "title", "price"])
writer.writeheader()
writer.writerows(products)
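Reading the file back is symmetric: csv.DictReader yields one dict per row. Note that CSV has no types, so every value comes back as a string and you convert numbers yourself:
import csv
with open("output.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        price = float(row["price"])  # convert: DictReader gives str values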
JSON
import json
# Parse a JSON string
data = json.loads(r.text)
# Read a JSON file
with open("products.json", encoding="utf-8") as f:
data = json.load(f)
# Write a JSON file
with open("products.json", "w", encoding="utf-8") as f:
json.dump(products, f, ensure_ascii=False, indent=2)
ensure_ascii=False is essential: otherwise non-ASCII characters get escaped to \uXXXX in the output. indent=2 makes the file human-readable.
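Parsed JSON is just dicts and lists, so the .get patterns above apply directly. A sketch with a made-up payload shape:
payload = json.loads('{"products": [{"title": "Mug", "price": 14.99}]}')
items = payload.get("products", [])  # [] if the key is missing
for item in items:
    print(item.get("title", "n/a"), item.get("price", 0.0))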
Error handling
Narrow, not bare:
# Assumes a configured logger (log = logging.getLogger(__name__)) and a
# retry_later helper of your own.
try:
r = requests.get(url, timeout=10)
r.raise_for_status()
except requests.exceptions.Timeout:
log.warning("Timeout for %s, retrying", url)
retry_later(url)
except requests.exceptions.HTTPError as e:
log.error("HTTP %d for %s", e.response.status_code, url)
What NOT to do:
# DON'T, swallows everything including bugs in your own code
try:
do_stuff()
except: # bare except, never
pass
# DON'T, except Exception is still too broad: it silently hides real bugs
# in your own code (KeyError, TypeError, ...)
try:
do_stuff()
except Exception:
pass
Specific exceptions for known failure modes; let unexpected errors crash so you find them.
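Putting it together, a minimal retry loop. This is a sketch, not a production policy: the three attempts and fixed pause are made-up numbers, and RequestException is the base class for Timeout, HTTPError, and the other requests errors.
import time
import requests

def fetch_with_retries(url, attempts=3, pause=2.0):
    for attempt in range(1, attempts + 1):
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()
            return r.text
        except requests.exceptions.RequestException:
            if attempt == attempts:
                raise             # out of attempts: surface the error
            time.sleep(pause)     # wait before the next try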
Iterating with enumerate and zip
# Get both index and value
for i, product in enumerate(products):
print(f"[{i}] {product['title']}")
# Pair two lists
for title, price in zip(titles, prices):
print(f"{title}: ${price}")
enumerate and zip cover almost every "I need a counter variable" or "I need to walk two lists at once" situation.
The standard library functions you'll use
import re # regex
import json # JSON
import csv # CSV
import time # time.sleep(), time.time()
import os # os.path.join, env vars
import pathlib # Path(...), newer, nicer
from datetime import datetime, timedelta
from urllib.parse import urlparse, urlencode, parse_qs
urllib.parse deserves a callout: it handles URL parsing safely, without string-manipulation bugs:
from urllib.parse import urlparse, parse_qs, urlencode
u = urlparse("https://example.com/products?page=2&category=mugs")
u.netloc # 'example.com'
u.path # '/products'
parse_qs(u.query) # {'page': ['2'], 'category': ['mugs']}
# Build a query string safely
urlencode({"page": 2, "category": "mugs"}) # 'page=2&category=mugs'
Hands-on lab
Write a 20-line script that:
- Fetches https://practice.scrapingcentral.com/ with requests.
- Counts how many <a> tags it contains using BeautifulSoup.
- Writes the count and the page title to a JSON file with json.dump.
- Wraps the network call in try/except requests.exceptions.RequestException.
If you can do that in 20 lines, you're already comfortable in the 5% of Python that scraping demands.
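One possible solution sketch, assuming BeautifulSoup is installed and the practice URL above is reachable; try it yourself before reading:
import json
import requests
from bs4 import BeautifulSoup

URL = "https://practice.scrapingcentral.com/"
try:
    r = requests.get(URL, timeout=10)
    r.raise_for_status()
except requests.exceptions.RequestException as e:
    raise SystemExit(f"Fetch failed: {e}")

soup = BeautifulSoup(r.text, "html.parser")
result = {
    "title": soup.title.string if soup.title else None,
    "link_count": len(soup.find_all("a")),
}
with open("lab_result.json", "w", encoding="utf-8") as f:
    json.dump(result, f, ensure_ascii=False, indent=2)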