
Lesson 1.29 · Beginner · 5 min read

Output Formats: CSV, JSON, JSONL (Python and PHP)

Saving scraped data right: when to choose CSV vs JSON vs JSONL, how to write them safely in Python and PHP, and how to avoid the common quoting and encoding bugs.

What you’ll learn

  • Choose the right output format for your data shape and downstream consumer.
  • Write CSV in Python and PHP without quoting bugs.
  • Write JSON and JSONL, and know why JSONL is what production scrapers usually want.
  • Handle non-ASCII text and embedded newlines correctly.

Scraping ends with a write. The format you pick determines how easy the next step is: spreadsheet import, database load, pandas analysis, downstream pipelines. Three formats cover ~95% of real-world output.

Quick chooser

  • CSV: best for flat tabular data, spreadsheet consumers, and classic ETL. Avoid when data is nested, fields contain commas or newlines, or some readers mishandle non-Latin text.
  • JSON: best for nested data, single-document outputs, and API responses. Avoid for very large outputs (the whole file must be parsed at once).
  • JSONL: best for append-friendly streams, line-by-line processing, and large datasets. Avoid for heavy nesting where you want a single document tree.

Real talk: JSONL is the production scraper default. One record per line. Streamable. Resumable. Append-friendly. The right answer for any scrape big enough to matter.

CSV in Python

import csv

rows = [
  {"name": "Yellow mug", "price": "$14.99", "url": "/p/1"},
  {"name": "Black, glossy mug", "price": "$12.99", "url": "/p/2"},
]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
  writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
  writer.writeheader()
  writer.writerows(rows)

Three small things that matter:

  1. newline="": the csv module handles line endings itself; passing newline="" prevents doubled blank lines on Windows.
  2. encoding="utf-8": non-Latin characters need it. Some tools (notably Excel) only auto-detect UTF-8 from a BOM, so use encoding="utf-8-sig" (UTF-8 with BOM) if Excel is the consumer.
  3. DictWriter handles quoting for you: values containing commas or quotes are wrapped in quotes, with embedded quotes doubled, per RFC 4180. Never roll your own CSV escaping.
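The quoting claim is easy to verify; a small in-memory round-trip sketch (via io.StringIO) with a value containing both a comma and embedded quotes:

```python
import csv
import io

# A value containing a comma and quotes; DictWriter must wrap it in quotes
# and double the embedded quotes per RFC 4180.
row = {"name": 'Black, "glossy" mug', "price": "$12.99"}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "price"])
writer.writeheader()
writer.writerow(row)

# Read it back and confirm the value survived intact.
buf.seek(0)
[parsed] = list(csv.DictReader(buf))
print(parsed["name"])  # Black, "glossy" mug
```

The raw file contains `"Black, ""glossy"" mug",$12.99`; the reader undoes the escaping transparently.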

CSV in PHP

$rows = [
  ['name' => 'Yellow mug', 'price' => '$14.99', 'url' => '/p/1'],
  ['name' => 'Black, glossy mug', 'price' => '$12.99', 'url' => '/p/2'],
];

$fp = fopen('products.csv', 'w');
fputcsv($fp, array_keys($rows[0]));  // header
foreach ($rows as $row) {
  fputcsv($fp, $row);
}
fclose($fp);

fputcsv handles quoting and escaping per RFC 4180. Pass the separator and enclosure arguments to override the defaults when needed (e.g. ';' as the separator for European locales, or a different enclosure character for quoting).
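Python's csv module takes the analogous override through its delimiter parameter; a minimal sketch using the semicolon separator common in European locales, where the decimal comma would otherwise collide with the field separator:

```python
import csv
import io

# Decimal comma in the price: with delimiter=";" it needs no quoting.
rows = [{"name": "Yellow mug", "price": "14,99"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "price"], delimiter=";")
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue())  # two lines: name;price and Yellow mug;14,99
```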

JSON in Python

import json

with open("products.json", "w", encoding="utf-8") as f:
  json.dump(rows, f, indent=2, ensure_ascii=False)

ensure_ascii=False keeps non-ASCII characters as themselves ('Año' rather than the escaped 'A\u00f1o'), which is almost always what you want for human-readable output.

For streaming reads of large JSON, use ijson (incremental JSON parsing), but at that point, JSONL is usually a better output choice.

JSON in PHP

file_put_contents(
  'products.json',
  json_encode($rows, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE)
);

JSON_UNESCAPED_UNICODE is the PHP equivalent of ensure_ascii=False. JSON_PRETTY_PRINT adds indentation. Pair them.

JSONL, the production format

JSON Lines (also called NDJSON): one JSON object per line, no surrounding array.

{"name": "Yellow mug", "price": "$14.99", "url": "/p/1"}
{"name": "Black mug", "price": "$12.99", "url": "/p/2"}

Why scrapers love it:

  1. Append-friendly. Scrape a record, append a line. No need to keep the whole file in memory.
  2. Streamable. Process line-by-line without parsing the entire file.
  3. Resumable. If the scraper crashes, you have all completed records on disk already.
  4. Easy to merge. Concatenate JSONL files; the result is still valid.
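Point 4 can be checked in a few lines; a sketch with two in-memory chunks standing in for separate output files:

```python
import json

chunk_a = '{"name": "Yellow mug"}\n{"name": "Black mug"}\n'
chunk_b = '{"name": "Red mug"}\n'

# Merging JSONL is plain concatenation; the result parses line by line,
# with no wrapper array to repair.
merged = chunk_a + chunk_b
records = [json.loads(line) for line in merged.splitlines()]
print(len(records))  # 3
```

Contrast with two JSON array files: concatenating `[...]` and `[...]` produces invalid JSON that must be re-parsed and re-wrapped.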

Writing JSONL in Python

import json

with open("products.jsonl", "w", encoding="utf-8") as f:
  for row in rows:
    f.write(json.dumps(row, ensure_ascii=False) + "\n")

For real scrapers, write incrementally as you scrape (and flush periodically):

def scrape_to_jsonl(urls, out_path):
  with open(out_path, "a", encoding="utf-8") as f:  # 'a' for append
    for url in urls:
      row = scrape_one(url)
      f.write(json.dumps(row, ensure_ascii=False) + "\n")
      f.flush()  # ensure on-disk after each row

Now a crash at row 500 leaves rows 1-499 on disk. Restart, skip already-seen URLs, continue.
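A minimal resume sketch along those lines, assuming each record carries its url; scrape_one is the same hypothetical fetch function as above, passed in explicitly here:

```python
import json
import os

def load_seen_urls(out_path):
    """Collect the url field of every record already on disk."""
    if not os.path.exists(out_path):
        return set()
    with open(out_path, encoding="utf-8") as f:
        return {json.loads(line)["url"] for line in f if line.strip()}

def resume_scrape(urls, out_path, scrape_one):
    seen = load_seen_urls(out_path)
    with open(out_path, "a", encoding="utf-8") as f:
        for url in urls:
            if url in seen:
                continue  # already scraped before the crash
            row = scrape_one(url)
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            f.flush()
```

Restarting simply re-runs resume_scrape with the same URL list; completed rows are skipped, new ones appended.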

Writing JSONL in PHP

$fp = fopen('products.jsonl', 'a');
foreach ($rows as $row) {
  fwrite($fp, json_encode($row, JSON_UNESCAPED_UNICODE) . "\n");
}
fclose($fp);

Reading JSONL back

Python:

with open("products.jsonl", encoding="utf-8") as f:
  for line in f:
    row = json.loads(line)
    # process row

PHP:

$fp = fopen('products.jsonl', 'r');
while (($line = fgets($fp)) !== false) {
  $row = json_decode($line, true);
  // process row
}
fclose($fp);

Stream-processing 10GB of JSONL is the same code as 10MB. That's the killer feature.

Nested data: choose carefully

row = {
  "name": "Yellow mug",
  "price": "$14.99",
  "tags": ["kitchen", "ceramic"],
  "reviews": [
    {"author": "Alice", "rating": 5},
    {"author": "Bob", "rating": 4},
  ],
}

This row is fine in JSON / JSONL, exactly the structure JSON was designed for. In CSV, you have two bad choices:

  1. Stringify nested fields: tags = "kitchen,ceramic". Loses structure, breaks if values contain commas.
  2. Flatten by exploding rows: one CSV row per review. Duplicates the product info, but every row is flat.

For nested data, prefer JSON/JSONL. For tabular data that's purely flat, CSV is fine.
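If a CSV consumer is non-negotiable, option 2 can be sketched like this (the review_author/review_rating column names are illustrative):

```python
import csv
import io

row = {
    "name": "Yellow mug",
    "price": "$14.99",
    "reviews": [
        {"author": "Alice", "rating": 5},
        {"author": "Bob", "rating": 4},
    ],
}

# One CSV row per review: the product columns repeat, but every row is flat.
flat = [
    {"name": row["name"], "price": row["price"],
     "review_author": r["author"], "review_rating": r["rating"]}
    for r in row["reviews"]
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["name", "price", "review_author", "review_rating"])
writer.writeheader()
writer.writerows(flat)
```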

Encoding pitfalls

The single biggest CSV bug: Excel opening a UTF-8 CSV as Latin-1 and showing mojibake. Two fixes:

  1. Write with encoding="utf-8-sig" in Python (UTF-8 BOM), Excel detects it.
  2. Tell users to open via Data → From Text/CSV and pick UTF-8, but they won't.

The single biggest JSON bug: forgetting ensure_ascii=False / JSON_UNESCAPED_UNICODE and producing '\u00e9' escape sequences everywhere, technically valid but unreadable. Always set the flag.

Compression

For archives of large scraped data, gzip or zstd reduces JSONL by 5-10x:

import gzip, json

with gzip.open("products.jsonl.gz", "wt", encoding="utf-8") as f:
  for row in rows:
    f.write(json.dumps(row, ensure_ascii=False) + "\n")

gzip.open(..., "rt") reads transparently. JSONL.gz is the default "ship to a colleague" format.
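A round-trip sketch confirming that "rt" reads back transparently what "wt" wrote:

```python
import gzip
import json
import os
import tempfile

rows = [{"name": "Yellow mug"}, {"name": "Black mug"}]
path = os.path.join(tempfile.mkdtemp(), "products.jsonl.gz")

# "wt" compresses on the fly; the writing loop is identical to plain JSONL.
with gzip.open(path, "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# "rt" decompresses on the fly; so is the reading loop.
with gzip.open(path, "rt", encoding="utf-8") as f:
    back = [json.loads(line) for line in f]

print(len(back))  # 2
```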

Hands-on lab

Scrape /api/products (a JSON API that returns a list of products). Save it three ways: as CSV (flattening the nested fields), as a single JSON document with json.dump, and as JSONL. Open all three in a text editor and compare. Open the CSV in Excel and check whether non-ASCII text renders correctly. Then write a script that reads back each format and verifies the row counts match.

Practice this lesson on Catalog108, our first-party scraping sandbox, at /api/products.

Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


Why is JSONL often preferred over a single JSON array file for scraper output?
