Beginner · 5 min read

Python Crash-Course for Scrapers

The 5% of Python you'll use 95% of the time when writing scrapers. Strings, lists, dicts, comprehensions, file I/O, error handling, and the f-string.

What you’ll learn

  • Manipulate strings, lists, and dicts fluently.
  • Use list/dict comprehensions to express transformations in one line.
  • Read and write files and JSON cleanly.
  • Catch exceptions correctly: narrow, not bare.

This isn't a Python tutorial. It's the specific subset you'll use writing scrapers, expressed densely.

Strings

url = "https://practice.scrapingcentral.com/products?page=2"
url.startswith("https://")  # True
url.endswith(".pdf")  # False
"products" in url  # True
url.split("?")  # ['https://practice...', 'page=2']
url.replace("page=2", "page=3")  # rewrite a param
url.lower()  # case-insensitive comparison
url.strip()  # remove leading/trailing whitespace

The methods are obvious once you've seen them. The patterns s.startswith(...) and s in big_string cover most string conditions you'll write.
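
For instance, a scraper often gates a fetch on a couple of these checks; a minimal sketch with a single hypothetical link:

link = "https://practice.scrapingcentral.com/products?page=2"
if link.startswith("https://") and "products" in link and not link.endswith(".pdf"):
  print("fetch", link)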

f-strings

The interpolation syntax you'll use 100x a day:

page = 2
url = f"https://practice.scrapingcentral.com/products?page={page}"
print(f"Fetching page {page}, URL = {url}")

# Formatting inside:
price = 14.99
print(f"${price:.2f}")  # "$14.99", 2 decimal places
print(f"{page:03d}")  # "002", zero-padded to 3 digits

f-strings are clearer and faster than %-formatting and .format(). Use them exclusively.

Lists

prices = [14.99, 24.95, 9.50, 49.00]
prices[0]  # 14.99
prices[-1]  # 49.00
prices[1:3]  # [24.95, 9.50], slice
len(prices)  # 4
prices.append(7.00)  # add to end
prices.sort()  # mutate in place
sorted(prices)  # return new sorted list
sum(prices) / len(prices)  # average
[p for p in prices if p < 20]  # filter (comprehension)

The comprehension is the form you'll use most:

products = [...]  # list of dicts
titles = [p["title"] for p in products]  # extract one field
expensive = [p for p in products if p["price"] > 50]  # filter
discounted = [{**p, "price": p["price"] * 0.9} for p in products]  # transform

Three forms ([x for x in y], [x for x in y if cond], [f(x) for x in y if cond]) cover almost every transformation pattern.
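
The combined form, transform and filter in one pass, looks like this (a small sketch with made-up products):

products = [
  {"title": " Yellow ceramic mug ", "price": 14.99},
  {"title": "Walnut serving board", "price": 64.00},
]
cheap_titles = [p["title"].strip() for p in products if p["price"] < 20]
# ['Yellow ceramic mug']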

Dicts

The structure you'll move data through:

product = {
  "id": 42,
  "title": "Yellow ceramic mug",
  "price": 14.99,
  "tags": ["kitchen", "ceramic"],
}

product["title"]  # access, KeyError if missing
product.get("title")  # access, None if missing
product.get("description", "n/a")  # access, default if missing
product["stock"] = 15  # assign
"id" in product  # True
list(product.keys())  # ['id', 'title', 'price', 'tags']
list(product.values())
list(product.items())  # [('id', 42), ('title', '...')...]
{**product, "price": 12.99}  # copy with overrides, does NOT mutate original

The .get(key, default) form is critical in scraping: missing fields are normal, and dict[key] raises when a key is missing.
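
In practice that means normalizing a raw record where some fields may simply be absent. A sketch, with the defaults as illustrative choices:

raw = {"id": 42, "title": "Yellow ceramic mug"}  # no "price" in this record

product = {
  "id": raw["id"],                        # required field: let a missing id raise loudly
  "title": raw.get("title", "").strip(),  # optional: default to empty string
  "price": raw.get("price"),              # optional: None means "not listed"
}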

Iterating

for k, v in product.items():
  print(f"{k} = {v}")

# Build a new dict from another (dict comprehension)
slim = {k: v for k, v in product.items() if k in ("id", "title", "price")}

Functions

import requests

def fetch_page(url, timeout=10):
  """Fetch a single URL with a timeout. Returns response text."""
  r = requests.get(url, timeout=timeout)
  r.raise_for_status()
  return r.text

# Default arguments avoid most overloading.
fetch_page("https://example.com")
fetch_page("https://example.com", timeout=30)
fetch_page("https://example.com", timeout=30)

Type hints (optional but recommended for shareable code):

def fetch_page(url: str, timeout: int = 10) -> str:
  ...

File I/O

The context-manager form (with) handles closing the file automatically:

# Write text
with open("output.csv", "w", encoding="utf-8") as f:
  f.write("id,title,price\n")
  for p in products:
  f.write(f'{p["id"]},"{p["title"]}",{p["price"]}\n')

# Read text
with open("input.txt", encoding="utf-8") as f:
  for line in f:  # iterates line by line, memory-friendly
    line = line.strip()
    process(line)

For CSV specifically, use the csv module; it handles quoting for you:

import csv

with open("output.csv", "w", encoding="utf-8", newline="") as f:
  writer = csv.DictWriter(f, fieldnames=["id", "title", "price"])
  writer.writeheader()
  writer.writerows(products)

JSON

import json

# Parse a JSON string
data = json.loads(r.text)

# Read a JSON file
with open("products.json", encoding="utf-8") as f:
  data = json.load(f)

# Write a JSON file
with open("products.json", "w", encoding="utf-8") as f:
  json.dump(products, f, ensure_ascii=False, indent=2)

ensure_ascii=False is essential; otherwise non-ASCII characters get escaped to \uXXXX in the output. indent=2 makes the file human-readable.
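
You can see the effect directly with json.dumps:

import json

json.dumps({"title": "Café mug"})                      # escapes é to \u00e9
json.dumps({"title": "Café mug"}, ensure_ascii=False)  # keeps é as-is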

Error handling

Narrow, not bare:

try:
  r = requests.get(url, timeout=10)
  r.raise_for_status()
except requests.exceptions.Timeout:
  log.warning("Timeout for %s, retrying", url)
  retry_later(url)
except requests.exceptions.HTTPError as e:
  log.error("HTTP %d for %s", e.response.status_code, url)

What NOT to do:

# DON'T: a bare except swallows everything, including bugs in your own code
try:
  do_stuff()
except:  # bare except, never
  pass

# DON'T either: except Exception at least lets KeyboardInterrupt through, but pass still hides real bugs
try:
  do_stuff()
except Exception:
  pass

Specific exceptions for known failure modes; let unexpected errors crash so you find them.
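
Putting the two ideas together: a sketch of a fetch helper that retries timeouts, reports HTTP errors, and lets anything unexpected crash. The name, retry count, and backoff are illustrative choices, not a fixed recipe:

import time
import requests

def fetch_with_retry(url, retries=3, timeout=10):
  """Illustrative: retry on timeout, report HTTP errors, let other errors propagate."""
  for attempt in range(retries):
    try:
      r = requests.get(url, timeout=timeout)
      r.raise_for_status()
      return r.text
    except requests.exceptions.Timeout:
      time.sleep(2 ** attempt)  # simple backoff: 1s, 2s, 4s
    except requests.exceptions.HTTPError as e:
      print(f"HTTP {e.response.status_code} for {url}")
      return None
  return None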

Iterating with enumerate and zip

# Get both index and value
for i, product in enumerate(products):
  print(f"[{i}] {product['title']}")

# Pair two lists
for title, price in zip(titles, prices):
  print(f"{title}: ${price}")

enumerate and zip cover 90% of the "I need a counter variable" and "I need to walk two lists at once" situations.

The standard library functions you'll use

import re  # regex
import json  # JSON
import csv  # CSV
import time  # time.sleep(), time.time()
import os  # os.path.join, env vars
import pathlib  # Path(...), newer, nicer
from datetime import datetime, timedelta
from urllib.parse import urlparse, urlencode, parse_qs
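
pathlib in particular replaces most os.path calls. A small sketch with a hypothetical output folder and filename:

from pathlib import Path

html = "<html>...</html>"  # e.g. the text of a page you just fetched

out_dir = Path("scraped") / "products"      # join paths with /
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder tree if missing
(out_dir / "page_002.html").write_text(html, encoding="utf-8")

for path in out_dir.glob("*.html"):         # iterate over the saved pages
  print(path.name)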

urllib.parse deserves a callout: it handles URL parsing safely, without string-manipulation bugs:

from urllib.parse import urlparse, parse_qs, urlencode

u = urlparse("https://example.com/products?page=2&category=mugs")
u.netloc  # 'example.com'
u.path  # '/products'
parse_qs(u.query)  # {'page': ['2'], 'category': ['mugs']}

# Build a query string safely
urlencode({"page": 2, "category": "mugs"})  # 'page=2&category=mugs'

Hands-on lab

Write a 20-line script that:

  1. Fetches https://practice.scrapingcentral.com/ with requests.
  2. Counts how many <a> tags it contains using BeautifulSoup.
  3. Writes the count and the title to a JSON file with json.dump.
  4. Wraps the network call in a try / except requests.exceptions.RequestException.

If you can do that in 20 lines, you're already comfortable in the 5% of Python that scraping demands.
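
If you want to compare your script against one possible shape, here is a sketch (assuming requests and beautifulsoup4 are installed; the output filename and exact fields are choices, not requirements):

import json
import requests
from bs4 import BeautifulSoup

url = "https://practice.scrapingcentral.com/"

try:
  r = requests.get(url, timeout=10)
  r.raise_for_status()
except requests.exceptions.RequestException as e:
  print(f"Request failed: {e}")
else:
  soup = BeautifulSoup(r.text, "html.parser")
  result = {
    "title": soup.title.get_text(strip=True) if soup.title else None,
    "link_count": len(soup.find_all("a")),
  }
  with open("lab_result.json", "w", encoding="utf-8") as f:
    json.dump(result, f, ensure_ascii=False, indent=2)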

Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.

Which expression gives 'product not found' if `data['title']` doesn't exist, without raising an exception?
