JSON, CSV, and Regex Essentials in Python
The three data-handling skills you'll use in every scraper: parsing JSON responses, writing structured output, and reaching for regex without abusing it.
What you’ll learn
- Parse and produce JSON cleanly, including handling nested structures and arrays.
- Read and write CSV files with the csv module instead of hand-rolling comma quoting.
- Use regex for the cases it's actually good at: extracting bounded patterns from text.
- Know when NOT to use regex (HTML).
A scraper has three data-shaped jobs: parse the JSON it fetches, write structured output that's easy to load elsewhere, and occasionally pluck a value from unstructured text. This lesson is those three.
JSON
Python's json module is the entire toolkit. Two functions for strings, two for files.
Parsing what a server sent
import json
import requests
r = requests.get("https://practice.scrapingcentral.com/api/products")
data = r.json() # requests has a shortcut, same as json.loads(r.text)
# Or, manually:
data = json.loads(r.text)
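One wrinkle worth handling early: servers sometimes return an HTML error page where you expected JSON, and r.json() will raise. A small, hypothetical guard (the error-page string below is made up for illustration):

```python
import json

def parse_json_or_none(text):
    """Return the parsed object, or None if the payload isn't valid JSON
    (e.g. the server returned an HTML error page instead)."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

print(parse_json_or_none('{"ok": true}'))      # {'ok': True}
print(parse_json_or_none('<html>503</html>'))  # None
```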
Navigating nested JSON
Real API responses are nested. A typical structure:
{
"meta": { "total": 5000, "page": 2, "per_page": 12 },
"data": [
{ "id": 13, "title": "Yellow mug", "price": 14.99, "reviews": { "count": 47, "avg": 4.3 }},
{ "id": 14, "title": "Blue mug", "price": 13.50, "reviews": { "count": 12, "avg": 3.9 }}
]
}
Reach into it:
total = data["meta"]["total"] # 5000
first_title = data["data"][0]["title"] # 'Yellow mug'
review_avgs = [p["reviews"]["avg"] for p in data["data"]]
When fields might be missing, chain .get():
avg = data.get("data", [{}])[0].get("reviews", {}).get("avg")
For deeper safety, write a tiny helper:
def deep_get(obj, *keys, default=None):
for k in keys:
if isinstance(obj, dict):
obj = obj.get(k, default)
elif isinstance(obj, list) and isinstance(k, int) and 0 <= k < len(obj):
obj = obj[k]
else:
return default
return obj
deep_get(data, "data", 0, "reviews", "avg")
Or just let it crash for now and add safety when you actually hit a missing field.
Writing JSON
import json
products = [...] # list of dicts
with open("products.json", "w", encoding="utf-8") as f:
json.dump(products, f, ensure_ascii=False, indent=2)
Three flags worth knowing:
- ensure_ascii=False preserves Unicode characters instead of escaping them
- indent=2 pretty-prints with 2-space indentation
- sort_keys=True sorts dict keys alphabetically (helpful for reproducible diffs)
JSONL, one JSON object per line
For large datasets (>1M records), prefer JSONL (newline-delimited JSON):
with open("products.jsonl", "w", encoding="utf-8") as f:
for p in products:
f.write(json.dumps(p, ensure_ascii=False) + "\n")
JSONL streams record by record: you can cat, grep, and jq it line by line, and you never need to load the whole file to read one record. It's the de facto format for large scraping outputs.
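Reading JSONL back is the same loop in reverse. A minimal sketch, with io.StringIO standing in for the open file:

```python
import io
import json

jsonl = '{"id": 1, "title": "Yellow mug"}\n{"id": 2, "title": "Blue mug"}\n'

# In a real script this would be: open("products.jsonl", encoding="utf-8")
records = [json.loads(line) for line in io.StringIO(jsonl) if line.strip()]
print(records[1]["title"])  # Blue mug
```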
CSV
Never hand-roll CSV. Comma-quoting rules are subtle (what if the field contains a comma? a newline? a quote?) and you will get them wrong. Use the csv module.
Writing
import csv
products = [
{"id": 1, "title": "Yellow mug", "price": 14.99},
{"id": 2, "title": "Blue, ceramic mug", "price": 13.50}, # comma in title, csv handles it
]
with open("products.csv", "w", encoding="utf-8", newline="") as f:
writer = csv.DictWriter(f, fieldnames=["id", "title", "price"])
writer.writeheader()
writer.writerows(products)
Three things to remember:
newline=""when opening the file. Without it, Windows generates extra blank lines.encoding="utf-8". Always.DictWriterlets you write dicts directly; purewritertakes lists.
Reading
with open("products.csv", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
print(row["title"], row["price"])
CSV cells are always strings. If the original was a number, you need to cast:
for row in reader:
price = float(row["price"])
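Scraped data is messy, and a single empty cell will crash that cast mid-file. A small, hypothetical helper that tolerates bad cells:

```python
def to_float(value, default=None):
    """Cast a CSV cell to float, tolerating empty strings and stray text."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

print(to_float("14.99"))       # 14.99
print(to_float(""))            # None
print(to_float("n/a", 0.0))    # 0.0
```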
When CSV vs JSON vs JSONL
| Format | When to use |
|---|---|
| JSON | Small data, when consumers expect JSON, when nesting matters |
| JSONL | Large data, streaming pipelines, log-style writes |
| CSV | Tabular data, Excel/sheets compatibility, simple columnar exports |
| SQLite | When you want SQL queries against your scrape; covered in Sub-Path 1 |
For a typical scraper, CSV is the "send to a human" format and JSONL is the "send to a downstream pipeline" format.
Regex
Regex is for extracting bounded patterns from text. It's not for parsing HTML. It's not for parsing JSON. It's for things like "find the order number embedded in this freeform note."
Python's re module
import re
text = "Order #A82B-9991 was placed on 2026-04-12."
m = re.search(r"#([A-Z0-9-]+)", text)
if m:
order_id = m.group(1) # 'A82B-9991'
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text) # ['2026-04-12']
# Multiple captures
parts = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2026-04-12")
parts.groups() # ('2026', '04', '12')
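For patterns with several captures, named groups (a standard re feature) make the result self-documenting:

```python
import re

m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2026-04-12")
print(m.group("year"))  # 2026
print(m.groupdict())    # {'year': '2026', 'month': '04', 'day': '12'}
```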
Compile once, use many
If you'll run the same pattern thousands of times in a loop, compile it once:
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
for line in big_file:
if DATE_RE.search(line):
...
The pattern reference (what you'll actually use)
| Pattern | Matches |
|---|---|
| . | Any character (except newline) |
| \d | A digit |
| \D | A non-digit |
| \w | A word character (letter, digit, underscore) |
| \s | Whitespace |
| [abc] | Any of a, b, c |
| [^abc] | Any character NOT a, b, c |
| [a-z] | Range |
| * | Zero or more |
| + | One or more |
| ? | Zero or one |
| {n} | Exactly n |
| {n,m} | n to m |
| ^ | Start of string (or line, with re.MULTILINE) |
| $ | End of string |
| () | Capture group |
| (?:) | Non-capturing group |
| \| | Alternation: matches either side |
| \b | Word boundary |
Common scraper patterns
EMAIL = r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"
URL = r"https?://[^\s\"]+"
PRICE = r"\$\s*(\d+(?:\.\d{2})?)" # captures the number portion
PHONE = r"\+?\d[\d\s().-]{7,}" # crude, phone numbers are hard
DATE_ISO = r"\d{4}-\d{2}-\d{2}"
The price regex is worth dissecting:
- \$ matches a literal $
- \s* allows optional whitespace
- ( starts the capture
- \d+ matches one or more digits
- (?:\.\d{2})? is an optional non-capturing group: a . followed by exactly two digits
- ) ends the capture
Returns just the numeric part of $14.99 → 14.99 (use float() to cast).
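A quick check of that pattern against some messy price strings (the sample text is made up for illustration):

```python
import re

PRICE = re.compile(r"\$\s*(\d+(?:\.\d{2})?)")

text = "Was $ 19.99, now $14.99 -- or 3 for $40"
prices = [float(m) for m in PRICE.findall(text)]
print(prices)  # [19.99, 14.99, 40.0]
```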
Why regex on HTML is wrong
People reach for regex against HTML constantly. It works on simple cases and breaks on the slightly less simple. Real HTML isn't regular:
<a class="link" href="/products/1" data-id="1">Buy now</a>
<a class='link' href="/products/2" data-id="2" >Buy <strong>now</strong></a>
A regex that handles single quotes, optional whitespace, content with nested tags, varying attribute order, and escaped characters becomes pages long and still misses cases. A parser handles them all by design.
The rule: if you're matching against HTML, use BeautifulSoup or lxml. If you're matching against a string (a data-* value, a <title> text, an extracted blob of JSON-as-string), regex is fine.
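To see why the parser wins, here's a sketch using the stdlib's html.parser (standing in for BeautifulSoup/lxml to keep the example dependency-free) against the two messy anchors above:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags. The parser shrugs off quote styles,
    stray whitespace, and nested tags that trip up regex."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href")

p = LinkCollector()
p.feed("""<a class="link" href="/products/1" data-id="1">Buy now</a>
<a class='link' href="/products/2" data-id="2" >Buy <strong>now</strong></a>""")
print(p.hrefs)  # ['/products/1', '/products/2']
```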
When regex actually shines
- Extracting a UUID, ID, or date from freeform text
- Splitting a string on multiple delimiters (re.split(r"[,;|]", s))
- Validating that a string matches a format (price, ISO date, JWT shape)
- One-off data cleaning (re.sub(r"\s+", " ", text) to collapse whitespace)
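The split and cleanup cases in action:

```python
import re

# Split on any of comma, semicolon, or pipe, eating surrounding spaces too
print(re.split(r"\s*[,;|]\s*", "red, green;blue | teal"))
# ['red', 'green', 'blue', 'teal']

# Collapse runs of whitespace left over from scraped HTML
print(re.sub(r"\s+", " ", "  Yellow\n   mug ").strip())
# Yellow mug
```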
Hands-on lab
Fetch https://practice.scrapingcentral.com/api/products with requests.get(...).json(). From the response:
- Extract every product title into a list using a comprehension.
- Filter to titles containing "mug" (case-insensitive).
- Write the filtered list to a CSV with columns
id, title, price. - Bonus: write a JSONL version too.
If you can do that in 30 lines, you've internalized the data-handling layer that every scraper sits on top of.
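One possible shape for a solution, sketched with inline sample data in place of the live endpoint (the response structure and field names are assumed to match the lesson's earlier example, so check the real payload first):

```python
import csv
import json

# Offline stand-in for requests.get(".../api/products").json()
data = {"data": [
    {"id": 13, "title": "Yellow mug", "price": 14.99},
    {"id": 14, "title": "Blue Mug", "price": 13.50},
    {"id": 15, "title": "Teapot", "price": 24.00},
]}

# Filter titles containing "mug", case-insensitive
mugs = [p for p in data["data"] if "mug" in p["title"].lower()]

# CSV version
with open("mugs.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title", "price"])
    writer.writeheader()
    writer.writerows(mugs)

# Bonus: JSONL version
with open("mugs.jsonl", "w", encoding="utf-8") as f:
    for p in mugs:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")
```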