
F15 · Beginner · 6 min read

JSON, CSV, and Regex Essentials in Python

The three data-handling skills you'll use in every scraper: parsing JSON responses, writing structured output, and reaching for regex without abusing it.

What you’ll learn

  • Parse and produce JSON cleanly, including handling nested structures and arrays.
  • Read and write CSV files using the csv module; never hand-roll commas.
  • Use regex for the cases it's actually good at: extracting bounded patterns from text.
  • Know when NOT to use regex (HTML).

A scraper has three data-shaped jobs: parse the JSON it fetches, write structured output that's easy to load elsewhere, and occasionally pluck a value from unstructured text. This lesson is those three.

JSON

Python's json module is the entire toolkit. Two functions for strings (loads, dumps), two for files (load, dump).

Parsing what a server sent

import json
import requests

r = requests.get("https://practice.scrapingcentral.com/api/products")
data = r.json()  # requests has a shortcut, same as json.loads(r.text)

# Or, manually:
data = json.loads(r.text)
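Servers sometimes return an HTML error page where you expected JSON, and parsing it raises. A defensive sketch (parse_json_or_none is a hypothetical helper, not part of requests or the json module):

```python
import json

def parse_json_or_none(text):
    """Return the parsed body, or None if it isn't valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

print(parse_json_or_none('{"ok": true}'))      # {'ok': True}
print(parse_json_or_none("<html>502</html>"))  # None
```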

Navigating nested JSON

Real API responses are nested. A typical structure:

{
  "meta": { "total": 5000, "page": 2, "per_page": 12 },
  "data": [
  { "id": 13, "title": "Yellow mug", "price": 14.99, "reviews": { "count": 47, "avg": 4.3 }},
  { "id": 14, "title": "Blue mug", "price": 13.50, "reviews": { "count": 12, "avg": 3.9 }}
  ]
}

Reach into it:

total = data["meta"]["total"]  # 5000
first_title = data["data"][0]["title"]  # 'Yellow mug'
review_avgs = [p["reviews"]["avg"] for p in data["data"]]

When fields might be missing, chain .get():

avg = data.get("data", [{}])[0].get("reviews", {}).get("avg")  # note: still raises if "data" is an empty list

For deeper safety, write a tiny helper:

def deep_get(obj, *keys, default=None):
  for k in keys:
    if isinstance(obj, dict) and k in obj:
      obj = obj[k]
    elif isinstance(obj, list) and isinstance(k, int) and 0 <= k < len(obj):
      obj = obj[k]
    else:
      return default
  return obj

deep_get(data, "data", 0, "reviews", "avg")

Or just let it crash for now and add safety when you actually hit a missing field.

Writing JSON

import json

products = [...]  # list of dicts

with open("products.json", "w", encoding="utf-8") as f:
  json.dump(products, f, ensure_ascii=False, indent=2)

Three flags worth knowing:

  • ensure_ascii=False: preserves Unicode characters instead of escaping them
  • indent=2: pretty-prints with 2-space indentation
  • sort_keys=True: sorts dict keys alphabetically (helpful for reproducible diffs)
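json.dump has a file-reading counterpart, json.load. A round-trip sketch (the file name mirrors the example above):

```python
import json

products = [{"id": 1, "title": "Yellow mug", "price": 14.99}]

with open("products.json", "w", encoding="utf-8") as f:
    json.dump(products, f, ensure_ascii=False, indent=2)

with open("products.json", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == products)  # True
```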

JSONL, one JSON object per line

For large datasets (>1M records), prefer JSONL (newline-delimited JSON):

with open("products.jsonl", "w", encoding="utf-8") as f:
  for p in products:
    f.write(json.dumps(p, ensure_ascii=False) + "\n")

JSONL streams record by record: you can cat, grep, and jq it line by line, and you don't need to load the whole file to read one record. It's the de facto format for large scraping outputs.
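Reading JSONL back is the mirror image: parse each line independently. A round-trip sketch with a couple of invented records:

```python
import json

records = [{"id": 1, "title": "Yellow mug"}, {"id": 2, "title": "Blue mug"}]

with open("products.jsonl", "w", encoding="utf-8") as f:
    for p in records:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")

# Stream it back one record at a time -- no need to load the whole file
loaded = []
with open("products.jsonl", encoding="utf-8") as f:
    for line in f:
        loaded.append(json.loads(line))

print(loaded == records)  # True
```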

CSV

Never hand-roll CSV. Comma-quoting rules are subtle (what if the field contains a comma? a newline? a quote?) and you will get them wrong. Use the csv module.

Writing

import csv

products = [
  {"id": 1, "title": "Yellow mug", "price": 14.99},
  {"id": 2, "title": "Blue, ceramic mug", "price": 13.50},  # comma in title, csv handles it
]

with open("products.csv", "w", encoding="utf-8", newline="") as f:
  writer = csv.DictWriter(f, fieldnames=["id", "title", "price"])
  writer.writeheader()
  writer.writerows(products)

Three things to remember:

  • newline="" when opening the file. Without it, Windows generates extra blank lines.
  • encoding="utf-8". Always.
  • DictWriter lets you write dicts directly; pure writer takes lists.
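For comparison, the plain csv.writer form, where rows are lists and you write the header yourself (the file name is arbitrary):

```python
import csv

rows = [
    [1, "Yellow mug", 14.99],
    [2, 'Says "World\'s Best"', 13.50],  # embedded quotes -- csv escapes them
]

with open("products_plain.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "title", "price"])  # header row, written by hand
    writer.writerows(rows)
```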

Reading

with open("products.csv", encoding="utf-8") as f:
  reader = csv.DictReader(f)
  for row in reader:
  print(row["title"], row["price"])

CSV cells are always strings. If the original was a number, you need to cast:

for row in reader:
  price = float(row["price"])
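The casting can live in the read loop, so nothing downstream ever sees string-typed numbers. A sketch using an in-memory CSV (the casts mapping is an ad-hoc convention, not a csv-module feature):

```python
import csv
import io

raw = "id,title,price\n1,Yellow mug,14.99\n2,Blue mug,13.50\n"
casts = {"id": int, "price": float}  # column name -> converter

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    for col, cast in casts.items():
        row[col] = cast(row[col])
    rows.append(row)

print(rows[0])  # {'id': 1, 'title': 'Yellow mug', 'price': 14.99}
```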

When CSV vs JSON vs JSONL

  • JSON: small data, consumers that expect JSON, when nesting matters
  • JSONL: large data, streaming pipelines, log-style writes
  • CSV: tabular data, Excel/sheets compatibility, simple columnar exports
  • SQLite: when you want SQL queries against your scrape; covered in Sub-Path 1

For a typical scraper, CSV is the "send to a human" format and JSONL is the "send to a downstream pipeline" format.

Regex

Regex is for extracting bounded patterns from text. It's not for parsing HTML. It's not for parsing JSON. It's for things like "find the order number embedded in this freeform note."

Python's re module

import re

text = "Order #A82B-9991 was placed on 2026-04-12."
m = re.search(r"#([A-Z0-9-]+)", text)
if m:
  order_id = m.group(1)  # 'A82B-9991'

dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)  # ['2026-04-12']

# Multiple captures
parts = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2026-04-12")
parts.groups()  # ('2026', '04', '12')
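Past two or three positional groups, indices get hard to track. Named groups ((?P&lt;name&gt;...)) keep them readable:

```python
import re

m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2026-04-12")
if m:
    print(m.group("year"))  # '2026'
    print(m.groupdict())    # {'year': '2026', 'month': '04', 'day': '12'}
```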

Compile once, use many

If you'll run the same pattern thousands of times in a loop, compile it once:

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
for line in big_file:
  if DATE_RE.search(line):
    ...

The pattern reference (what you'll actually use)

Pattern Matches
. Any character (except newline)
\d A digit
\D A non-digit
\w A word character (letter, digit, underscore)
\s Whitespace
[abc] Any of a, b, c
[^abc] Any character NOT a, b, c
[a-z] Range
* Zero or more
+ One or more
? Zero or one
{n} Exactly n
{n,m} n to m
^ Start of string (or line, with re.MULTILINE)
$ End of string
() Capture group
(?:) Non-capturing group
| Alternation (a|b matches a or b)
\b Word boundary

Common scraper patterns

EMAIL = r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"
URL = r"https?://[^\s\"]+"
PRICE = r"\$\s*(\d+(?:\.\d{2})?)"  # captures the number portion
PHONE = r"\+?\d[\d\s().-]{7,}"  # crude, phone numbers are hard
DATE_ISO = r"\d{4}-\d{2}-\d{2}"
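Two of these in action on an invented line of text:

```python
import re

EMAIL = r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"
URL = r"https?://[^\s\"]+"

text = "Contact sales@example.com or see https://example.com/pricing for details."
print(re.findall(EMAIL, text))  # ['sales@example.com']
print(re.findall(URL, text))    # ['https://example.com/pricing']
```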

The price regex is worth dissecting:

  • \$, literal $
  • \s*, optional whitespace
  • (, start capture
  • \d+, one or more digits
  • (?:\.\d{2})?, optional non-capturing group with . and exactly two digits
  • ), end capture

Against "$14.99" it captures just "14.99" (use float() to cast).
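The dissected pattern at work on an invented snippet; findall returns the captured group, so the $ never makes it into the output:

```python
import re

PRICE = r"\$\s*(\d+(?:\.\d{2})?)"

text = "Was $19.99, now $ 14.99 -- save $5"
prices = [float(p) for p in re.findall(PRICE, text)]
print(prices)  # [19.99, 14.99, 5.0]
```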

Why regex on HTML is wrong

People reach for regex against HTML constantly. It works on simple cases and breaks on the slightly less simple. Real HTML isn't regular:

<a class="link" href="/products/1" data-id="1">Buy now</a>
<a class='link' href="/products/2" data-id="2" >Buy <strong>now</strong></a>

A regex that handles single quotes, optional whitespace, content with nested tags, attribute orders, and escaped characters becomes pages long and still misses cases. A parser handles them all by design.

The rule: if you're matching against HTML, use BeautifulSoup or lxml. If you're matching against a string (a data-* value, a <title> text, an extracted blob of JSON-as-string), regex is fine.

When regex actually shines

  • Extracting a UUID, ID, or date from freeform text
  • Splitting a string on multiple delimiters (re.split(r"[,;|]", s))
  • Validating that a string matches a format (price, ISO date, JWT shape)
  • One-off data cleaning (re.sub(r"\s+", " ", text) to collapse whitespace)
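The last two bullets as runnable one-liners:

```python
import re

# Split on any of three delimiters at once
print(re.split(r"[,;|]", "a,b;c|d"))  # ['a', 'b', 'c', 'd']

# Collapse tabs, newlines, and doubled spaces to single spaces
messy = "Yellow\t mug \n  14.99"
print(re.sub(r"\s+", " ", messy))  # 'Yellow mug 14.99'
```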

Hands-on lab

Fetch https://practice.scrapingcentral.com/api/products with requests.get(...).json(). From the response:

  1. Extract every product title into a list using a comprehension.
  2. Filter to titles containing "mug" (case-insensitive).
  3. Write the filtered list to a CSV with columns id, title, price.
  4. Bonus: write a JSONL version too.

If you can do that in 30 lines, you've internalized the data-handling layer that every scraper sits on top of.
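One possible shape for a solution, with the network call stubbed out by the sample records from earlier in the lesson (the field names and meta/data envelope are assumptions based on that sample; extrasaction="ignore" tells DictWriter to drop the nested reviews dict instead of erroring):

```python
import csv
import json

def filter_mugs(products):
    """Keep products whose title contains 'mug', case-insensitively."""
    return [p for p in products if "mug" in p["title"].lower()]

# With live data you'd start from:
#   products = requests.get("https://practice.scrapingcentral.com/api/products").json()["data"]
products = [
    {"id": 13, "title": "Yellow mug", "price": 14.99, "reviews": {"count": 47, "avg": 4.3}},
    {"id": 15, "title": "Desk lamp", "price": 29.00, "reviews": {"count": 3, "avg": 4.0}},
]

titles = [p["title"] for p in products]  # step 1: every title
mugs = filter_mugs(products)             # step 2: case-insensitive filter

# Step 3: CSV with just the three requested columns
with open("mugs.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title", "price"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(mugs)

# Step 4 (bonus): JSONL
with open("mugs.jsonl", "w", encoding="utf-8") as f:
    for p in mugs:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")
```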

