Output Formats: CSV, JSON, JSONL (Python and PHP)
Saving scraped data right: when to choose CSV vs JSON vs JSONL, how to write them safely in Python and PHP, and how to avoid the common quoting and encoding bugs.
What you’ll learn
- Choose the right output format for your data shape and downstream consumer.
- Write CSV in Python and PHP without quoting bugs.
- Write JSON and JSONL, and know why JSONL is what production scrapers usually want.
- Handle non-ASCII text and embedded newlines correctly.
Scraping ends with a write. The format you pick determines how easy the next step is: spreadsheet import, database load, pandas analysis, downstream pipelines. Three formats cover ~95% of real-world output.
Quick chooser
| Format | Best for | Avoid when |
|---|---|---|
| CSV | Flat tabular data, spreadsheet consumers, classic ETL | Nested data, fields with commas/newlines, non-Latin text on some readers |
| JSON | Nested data, single-document outputs, API responses | Very large outputs (whole file must be parsed) |
| JSONL | Append-friendly streams, line-by-line processing, large datasets | Heavy nesting where you want a single doc tree |
Real talk: JSONL is the production scraper default. One record per line. Streamable. Resumable. Append-friendly. The right answer for any scrape big enough to matter.
CSV in Python
```python
import csv

rows = [
    {"name": "Yellow mug", "price": "$14.99", "url": "/p/1"},
    {"name": "Black, glossy mug", "price": "$12.99", "url": "/p/2"},
]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    writer.writerows(rows)
```
Three small things that matter:
- `newline=""`: the csv module handles line endings itself; passing `newline=""` prevents doubled newlines on Windows.
- `encoding="utf-8"`: non-Latin characters need this. Some tools (Excel) prefer `utf-8-sig` (UTF-8 with a BOM) so they can auto-detect the encoding; use `encoding="utf-8-sig"` if Excel is the consumer.
- `DictWriter` handles quoting for you: values containing commas or quotes are escaped per RFC 4180 (`"a, b"`, `"""quoted"""`). Never roll your own CSV escaping.
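As a concrete sketch of the Excel-friendly variant (the filename `products_excel.csv` is made up for illustration), `utf-8-sig` simply prepends a three-byte BOM that Excel uses to recognize UTF-8:

```python
import csv

rows = [{"name": "Año nuevo mug", "price": "$9.99"}]

# utf-8-sig writes a BOM (\ufeff) before the data; Excel sees it and
# decodes the rest of the file as UTF-8 instead of Latin-1.
with open("products_excel.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
```

The BOM is invisible when reading the file back with `encoding="utf-8-sig"`, which strips it again.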
CSV in PHP
```php
$rows = [
    ['name' => 'Yellow mug', 'price' => '$14.99', 'url' => '/p/1'],
    ['name' => 'Black, glossy mug', 'price' => '$12.99', 'url' => '/p/2'],
];

$fp = fopen('products.csv', 'w');
fputcsv($fp, array_keys($rows[0])); // header row
foreach ($rows as $row) {
    fputcsv($fp, $row);
}
fclose($fp);
```
`fputcsv` handles quoting and escaping per RFC 4180. Pass the separator and enclosure arguments to override the defaults when needed (e.g. `;` as the separator for European locales, or a different enclosure character for quoting).
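Python's csv module exposes the same knobs as `fputcsv`'s separator/enclosure arguments, via `delimiter` and `quotechar`. A small sketch of a semicolon-separated writer (writing to a string buffer just to show the result):

```python
import csv
import io

rows = [["name", "price"], ["Black, glossy mug", "12,99"]]

# delimiter / quotechar on csv.writer mirror fputcsv's separator / enclosure.
# With ';' as the delimiter, embedded commas no longer trigger quoting.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=";", quotechar='"')
writer.writerows(rows)
print(buf.getvalue())
```

This is why `;` works for European locales: the decimal comma in `12,99` stops colliding with the field separator.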
JSON in Python
```python
import json

with open("products.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2, ensure_ascii=False)
```
`ensure_ascii=False` keeps non-ASCII characters as themselves (`"Año"` rather than `"A\u00f1o"`), which is almost always what you want for human-readable output.
For streaming reads of large JSON, use ijson (incremental JSON parsing), but at that point, JSONL is usually a better output choice.
JSON in PHP
```php
file_put_contents(
    'products.json',
    json_encode($rows, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE)
);
```
`JSON_UNESCAPED_UNICODE` is the PHP equivalent of `ensure_ascii=False`. `JSON_PRETTY_PRINT` adds indentation. Pair them.
JSONL, the production format
JSON Lines (also called NDJSON): one JSON object per line, no surrounding array.
```jsonl
{"name": "Yellow mug", "price": "$14.99", "url": "/p/1"}
{"name": "Black mug", "price": "$12.99", "url": "/p/2"}
```
Why scrapers love it:
- Append-friendly. Scrape a record, append a line. No need to keep the whole file in memory.
- Streamable. Process line-by-line without parsing the entire file.
- Resumable. If the scraper crashes, you have all completed records on disk already.
- Easy to merge. Concatenate JSONL files; the result is still valid.
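The merge property is worth seeing once: concatenating two JSONL strings (or files) needs no bracket surgery, unlike merging two JSON arrays. A minimal sketch with inline strings standing in for two partial scrape outputs:

```python
import json

# Two partial outputs, e.g. from two scraper runs.
part1 = '{"name": "Yellow mug"}\n'
part2 = '{"name": "Black mug"}\n'

# Plain concatenation of valid JSONL is still valid JSONL.
merged = part1 + part2
merged_rows = [json.loads(line) for line in merged.splitlines() if line]
```

With a JSON array you would have to parse both files, combine the lists, and re-serialize; with JSONL, `cat a.jsonl b.jsonl > merged.jsonl` does the job.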
Writing JSONL in Python
```python
import json

with open("products.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```
For real scrapers, write incrementally as you scrape (and flush periodically):
```python
def scrape_to_jsonl(urls, out_path):
    with open(out_path, "a", encoding="utf-8") as f:  # 'a' for append
        for url in urls:
            row = scrape_one(url)
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            f.flush()  # push the row to the OS after each write
```
Now a crash at row 500 leaves rows 1-499 on disk. Restart, skip already-seen URLs, continue.
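That "skip already-seen URLs" step can be sketched as follows. This is a minimal version, assuming each record carries its `url` field and that a `scrape_one(url)` function exists (hypothetical, as above):

```python
import json
import os

def load_seen_urls(out_path):
    """Collect the URLs already written, so a restart can skip them."""
    seen = set()
    if os.path.exists(out_path):
        with open(out_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    seen.add(json.loads(line)["url"])
    return seen

def scrape_resumable(urls, out_path, scrape_one):
    seen = load_seen_urls(out_path)
    with open(out_path, "a", encoding="utf-8") as f:
        for url in urls:
            if url in seen:
                continue  # already on disk from a previous run
            row = scrape_one(url)
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            f.flush()
```

The output file itself is the checkpoint: no separate state file to keep in sync with the data.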
Writing JSONL in PHP
```php
$fp = fopen('products.jsonl', 'a');
foreach ($rows as $row) {
    fwrite($fp, json_encode($row, JSON_UNESCAPED_UNICODE) . "\n");
}
fclose($fp);
```
Reading JSONL back
Python:
```python
with open("products.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        # process row
```
PHP:
```php
$fp = fopen('products.jsonl', 'r');
while (($line = fgets($fp)) !== false) {
    $row = json_decode($line, true);
    // process row
}
fclose($fp);
```
Stream-processing 10GB of JSONL is the same code as 10MB. That's the killer feature.
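One wrinkle when reading real scraper output: a crash mid-write can leave a truncated final line, and blank lines sometimes sneak in. A defensive reader (a sketch; the function name `iter_jsonl` is made up here) skips both instead of dying:

```python
import json

def iter_jsonl(path):
    """Yield rows, skipping blank lines and a possibly truncated last line."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Most likely a crash mid-write truncated this line; skip it.
                print(f"skipping unparseable line {lineno}")
```

Because it is a generator, it still streams: memory use stays flat regardless of file size.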
Nested data: choose carefully
```python
row = {
    "name": "Yellow mug",
    "price": "$14.99",
    "tags": ["kitchen", "ceramic"],
    "reviews": [
        {"author": "Alice", "rating": 5},
        {"author": "Bob", "rating": 4},
    ],
}
```
This row is fine in JSON / JSONL; it is exactly the structure JSON was designed for. In CSV, you have two bad choices:

- Stringify nested fields: `tags = "kitchen,ceramic"`. Loses structure, and breaks if values contain commas.
- Flatten by exploding rows: one CSV row per review. Duplicates the product info, but every row is flat.
For nested data, prefer JSON/JSONL. For tabular data that's purely flat, CSV is fine.
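If you do need the exploded-rows flattening, it can be sketched like this (the helper name `explode_reviews` and the `|` join for tags are choices made for illustration, not a standard):

```python
def explode_reviews(row):
    """Yield one flat dict per review; product fields are repeated."""
    base = {
        "name": row["name"],
        "price": row["price"],
        "tags": "|".join(row["tags"]),  # '|' join is lossy but common
    }
    for review in row["reviews"]:
        yield {**base,
               "review_author": review["author"],
               "review_rating": review["rating"]}

row = {
    "name": "Yellow mug",
    "price": "$14.99",
    "tags": ["kitchen", "ceramic"],
    "reviews": [{"author": "Alice", "rating": 5},
                {"author": "Bob", "rating": 4}],
}
flat = list(explode_reviews(row))
```

Each dict in `flat` is safe to hand to `csv.DictWriter`; the cost is that product fields repeat once per review.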
Encoding pitfalls
The single biggest CSV bug: Excel opening a UTF-8 CSV as Latin-1 and showing mojibake. Two fixes:

- Write with `encoding="utf-8-sig"` in Python (UTF-8 with a BOM); Excel detects it.
- Tell users to open the file via Data → From Text/CSV and pick UTF-8. (They won't.)
The single biggest JSON bug: forgetting `ensure_ascii=False` / `JSON_UNESCAPED_UNICODE` and producing `\u00e9` escapes everywhere; technically valid, but unreadable. Always set the flag.
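The difference is easy to see side by side:

```python
import json

row = {"name": "Año"}
print(json.dumps(row))                      # {"name": "A\u00f1o"}
print(json.dumps(row, ensure_ascii=False))  # {"name": "Año"}
```

Both strings decode back to the same data; only the second is readable in a text editor or a `grep` pipeline.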
Compression
For archives of large scraped data, gzip or zstd reduces JSONL by 5-10x:
```python
import gzip
import json

with gzip.open("products.jsonl.gz", "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```
`gzip.open(..., "rt")` reads transparently. A `.jsonl.gz` file is the default "ship to a colleague" format.
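A full round trip shows how little changes: the read loop is identical to the uncompressed version, only the `open` call differs.

```python
import gzip
import json

records = [{"name": "Yellow mug"}, {"name": "Año mug"}]

# Write compressed JSONL ("wt" = text mode over gzip).
with gzip.open("products.jsonl.gz", "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# "rt" decompresses transparently; same line-by-line loop as plain JSONL.
with gzip.open("products.jsonl.gz", "rt", encoding="utf-8") as f:
    back = [json.loads(line) for line in f]
```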
Hands-on lab
Scrape /api/products (a JSON API, return the list). Save it three ways: as CSV (flatten the nested fields), as a single JSON document with json.dump, and as JSONL. Open all three in your text editor; compare. Try opening the CSV in Excel and check whether non-ASCII text renders correctly. Then write a script that reads back each format and verifies row counts match.
Practice this lesson on Catalog108, our first-party scraping sandbox.