JSON, CSV, and Regex Essentials in Python
The three data-handling skills you'll use in every scraper: parsing JSON responses, writing structured output, and reaching for regex without abusing it.
What you’ll learn
- Parse and produce JSON cleanly, including handling nested structures and arrays.
- Read and write CSV files with the csv module instead of hand-rolling comma quoting.
- Use regex for the cases it's actually good at: extracting bounded patterns from text.
- Know when NOT to use regex (HTML).
A scraper has three data-shaped jobs: parse the JSON it fetches, write structured output that's easy to load elsewhere, and occasionally pluck a value from unstructured text. This lesson is those three.
JSON
Python's json module is the entire toolkit. Two functions for strings, two for files.
Parsing what a server sent
import json
import requests
r = requests.get("https://practice.scrapingcentral.com/api/products")
data = r.json() # requests has a shortcut, same as json.loads(r.text)
# Or, manually:
data = json.loads(r.text)
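One wrinkle worth handling early: servers sometimes return an HTML error page where you expected JSON, and r.json() will raise. A small, hypothetical guard (the error-page string below is made up for illustration):

```python
import json

def parse_json_or_none(text):
    """Return the parsed object, or None if the payload isn't valid JSON
    (e.g. the server returned an HTML error page instead)."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

print(parse_json_or_none('{"ok": true}'))      # {'ok': True}
print(parse_json_or_none('<html>503</html>'))  # None
```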
Navigating nested JSON
Real API responses are nested. A typical structure:
{
"meta": { "total": 5000, "page": 2, "per_page": 12 },
"data": [
{ "id": 13, "title": "Yellow mug", "price": 14.99, "reviews": { "count": 47, "avg": 4.3 }},
{ "id": 14, "title": "Blue mug", "price": 13.50, "reviews": { "count": 12, "avg": 3.9 }}
]
}
Reach into it:
total = data["meta"]["total"] # 5000
first_title = data["data"][0]["title"] # 'Yellow mug'
review_avgs = [p["reviews"]["avg"] for p in data["data"]]
When fields might be missing, chain .get():
avg = data.get("data", [{}])[0].get("reviews", {}).get("avg")
For deeper safety, write a tiny helper:
def deep_get(obj, *keys, default=None):
for k in keys:
if isinstance(obj, dict):
obj = obj.get(k, default)
elif isinstance(obj, list) and isinstance(k, int) and 0 <= k < len(obj):
obj = obj[k]
else:
return default
return obj
deep_get(data, "data", 0, "reviews", "avg")
Or just let it crash for now and add safety when you actually hit a missing field.
Writing JSON
import json
products = [...] # list of dicts
with open("products.json", "w", encoding="utf-8") as f:
json.dump(products, f, ensure_ascii=False, indent=2)
Three flags worth knowing:
- ensure_ascii=False preserves Unicode characters instead of escaping them
- indent=2 pretty-prints with 2-space indentation
- sort_keys=True sorts dict keys alphabetically (helpful for reproducible diffs)
JSONL, one JSON object per line
For large datasets (>1M records), prefer JSONL (newline-delimited JSON):
with open("products.jsonl", "w", encoding="utf-8") as f:
for p in products:
f.write(json.dumps(p, ensure_ascii=False) + "\n")
JSONL streams record by record: you can cat, grep, and jq it line by line, and you never need to load the whole file to read one record. It's the de facto format for large scraping outputs.
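Reading JSONL back is the same loop in reverse. A minimal sketch, with io.StringIO standing in for the open file:

```python
import io
import json

jsonl = '{"id": 1, "title": "Yellow mug"}\n{"id": 2, "title": "Blue mug"}\n'

# In a real script this would be: open("products.jsonl", encoding="utf-8")
records = [json.loads(line) for line in io.StringIO(jsonl) if line.strip()]
print(records[1]["title"])  # Blue mug
```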
CSV
Never hand-roll CSV. Comma-quoting rules are subtle (what if the field contains a comma? a newline? a quote?) and you will get them wrong. Use the csv module.
Writing
import csv
products = [
{"id": 1, "title": "Yellow mug", "price": 14.99},
{"id": 2, "title": "Blue, ceramic mug", "price": 13.50}, # comma in title, csv handles it
]
with open("products.csv", "w", encoding="utf-8", newline="") as f:
writer = csv.DictWriter(f, fieldnames=["id", "title", "price"])
writer.writeheader()
writer.writerows(products)
Three things to remember:
newline=""when opening the file. Without it, Windows generates extra blank lines.encoding="utf-8". Always.DictWriterlets you write dicts directly; purewritertakes lists.
Reading
with open("products.csv", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
print(row["title"], row["price"])
CSV cells are always strings. If the original was a number, you need to cast:
for row in reader:
price = float(row["price"])
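Scraped data is messy, and a single empty cell will crash that cast mid-file. A small, hypothetical helper that tolerates bad cells:

```python
def to_float(value, default=None):
    """Cast a CSV cell to float, tolerating empty strings and stray text."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

print(to_float("14.99"))       # 14.99
print(to_float(""))            # None
print(to_float("n/a", 0.0))    # 0.0
```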
When CSV vs JSON vs JSONL
| Format | When to use |
|---|---|
| JSON | Small data, when consumers expect JSON, when nesting matters |
| JSONL | Large data, streaming pipelines, log-style writes |
| CSV | Tabular data, Excel/sheets compatibility, simple columnar exports |
| SQLite | When you want SQL queries against your scrape; covered in Sub-Path 1 |
For a typical scraper, CSV is the "send to a human" format and JSONL is the "send to a downstream pipeline" format.
Regex
Regex is for extracting bounded patterns from text. It's not for parsing HTML. It's not for parsing JSON. It's for things like "find the order number embedded in this freeform note."
Python's re module
import re
text = "Order #A82B-9991 was placed on 2026-04-12."
m = re.search(r"#([A-Z0-9-]+)", text)
if m:
order_id = m.group(1) # 'A82B-9991'
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text) # ['2026-04-12']
# Multiple captures
parts = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2026-04-12")
parts.groups() # ('2026', '04', '12')
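For patterns with several captures, named groups (a standard re feature) make the result self-documenting:

```python
import re

m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", "2026-04-12")
print(m.group("year"))  # 2026
print(m.groupdict())    # {'year': '2026', 'month': '04', 'day': '12'}
```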
Compile once, use many
If you'll run the same pattern thousands of times in a loop, compile it once:
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
for line in big_file:
if DATE_RE.search(line):
...
The pattern reference (what you'll actually use)
| Pattern | Matches |
|---|---|
| . | Any character (except newline) |
| \d | A digit |
| \D | A non-digit |
| \w | A word character (letter, digit, underscore) |
| \s | Whitespace |
| [abc] | Any of a, b, c |
| [^abc] | Any character NOT a, b, c |
| [a-z] | Range |
| * | Zero or more |
| + | One or more |
| ? | Zero or one |
| {n} | Exactly n |
| {n,m} | n to m |
| ^ | Start of string (or line, with re.MULTILINE) |
| $ | End of string |
| () | Capture group |
| (?:) | Non-capturing group |
| \| | Alternation: matches either side |
| \b | Word boundary |
Common scraper patterns
EMAIL = r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"
URL = r"https?://[^\s\"]+"
PRICE = r"\$\s*(\d+(?:\.\d{2})?)" # captures the number portion
PHONE = r"\+?\d[\d\s().-]{7,}" # crude, phone numbers are hard
DATE_ISO = r"\d{4}-\d{2}-\d{2}"
The price regex is worth dissecting:
- \$ matches a literal $
- \s* allows optional whitespace
- ( starts the capture
- \d+ matches one or more digits
- (?:\.\d{2})? is an optional non-capturing group: a . followed by exactly two digits
- ) ends the capture
Returns just the numeric part of $14.99 → 14.99 (use float() to cast).
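A quick check of that pattern against some messy price strings (the sample text is made up for illustration):

```python
import re

PRICE = re.compile(r"\$\s*(\d+(?:\.\d{2})?)")

text = "Was $ 19.99, now $14.99 -- or 3 for $40"
prices = [float(m) for m in PRICE.findall(text)]
print(prices)  # [19.99, 14.99, 40.0]
```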
Why regex on HTML is wrong
People reach for regex against HTML constantly. It works on simple cases and breaks on the slightly less simple. Real HTML isn't regular:
<a class="link" href="/products/1" data-id="1">Buy now</a>
<a class='link' href="/products/2" data-id="2" >Buy <strong>now</strong></a>
A regex that handles single quotes, optional whitespace, content with nested tags, varying attribute order, and escaped characters becomes pages long and still misses cases. A parser handles them all by design.
The rule: if you're matching against HTML, use BeautifulSoup or lxml. If you're matching against a string (a data-* value, a <title> text, an extracted blob of JSON-as-string), regex is fine.
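To see why the parser wins, here's a sketch using the stdlib's html.parser (standing in for BeautifulSoup/lxml to keep the example dependency-free) against the two messy anchors above:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags. The parser shrugs off quote styles,
    stray whitespace, and nested tags that trip up regex."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href")

p = LinkCollector()
p.feed("""<a class="link" href="/products/1" data-id="1">Buy now</a>
<a class='link' href="/products/2" data-id="2" >Buy <strong>now</strong></a>""")
print(p.hrefs)  # ['/products/1', '/products/2']
```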
When regex actually shines
- Extracting a UUID, ID, or date from freeform text
- Splitting a string on multiple delimiters (re.split(r"[,;|]", s))
- Validating that a string matches a format (price, ISO date, JWT shape)
- One-off data cleaning (re.sub(r"\s+", " ", text) to collapse whitespace)
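The split and cleanup cases in action:

```python
import re

# Split on any of comma, semicolon, or pipe, eating surrounding spaces too
print(re.split(r"\s*[,;|]\s*", "red, green;blue | teal"))
# ['red', 'green', 'blue', 'teal']

# Collapse runs of whitespace left over from scraped HTML
print(re.sub(r"\s+", " ", "  Yellow\n   mug ").strip())
# Yellow mug
```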
Hands-on lab
Fetch https://practice.scrapingcentral.com/api/products with requests.get(...).json(). From the response:
- Extract every product title into a list using a comprehension.
- Filter to titles containing "mug" (case-insensitive).
- Write the filtered list to a CSV with columns
id, title, price. - Bonus: write a JSONL version too.
If you can do that in 30 lines, you've internalized the data-handling layer that every scraper sits on top of.
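One possible shape for a solution, sketched with inline sample data in place of the live endpoint (the response structure and field names are assumed to match the lesson's earlier example, so check the real payload first):

```python
import csv
import json

# Offline stand-in for requests.get(".../api/products").json()
data = {"data": [
    {"id": 13, "title": "Yellow mug", "price": 14.99},
    {"id": 14, "title": "Blue Mug", "price": 13.50},
    {"id": 15, "title": "Teapot", "price": 24.00},
]}

# Filter titles containing "mug", case-insensitive
mugs = [p for p in data["data"] if "mug" in p["title"].lower()]

# CSV version
with open("mugs.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title", "price"])
    writer.writeheader()
    writer.writerows(mugs)

# Bonus: JSONL version
with open("mugs.jsonl", "w", encoding="utf-8") as f:
    for p in mugs:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")
```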