Guide
Best Ways to Store Scraped Data (CSV, JSON, Database)
Learn the best storage formats for scraped data. Compare CSV, JSON, SQLite, PostgreSQL, and MongoDB with practical Python examples.
Choosing the right storage format is critical for working with scraped data. Here is a comparison of the most common options.
Quick Comparison
| Format | Best For | Size Limit | Query Support | Setup |
|---|---|---|---|---|
| CSV | Tabular data, spreadsheets | ~1M rows | None | Zero |
| JSON | Nested/hierarchical data | ~100MB | None | Zero |
| SQLite | Medium datasets, queries | ~10GB | Full SQL | Minimal |
| PostgreSQL | Large-scale, production | Unlimited | Full SQL | Server |
| MongoDB | Flexible schemas | Unlimited | Rich queries | Server |
CSV Storage
Best for simple, flat data that fits in a spreadsheet.
import csv
data = [
{"title": "Product A", "price": 29.99, "url": "https://..."},
{"title": "Product B", "price": 49.99, "url": "https://..."},
]
with open("products.csv", "w", newline="") as f:
writer = csv.DictWriter(f, fieldnames=["title", "price", "url"])
writer.writeheader()
writer.writerows(data)
JSON Storage
Best for nested or irregular data structures.
import json
data = {
"scraped_at": "2026-04-21",
"products": [
{"title": "Product A", "variants": [{"size": "S", "price": 29.99}]},
]
}
with open("products.json", "w") as f:
json.dump(data, f, indent=2)
For large datasets, use JSON Lines (one JSON object per line):
with open("products.jsonl", "a") as f:
for item in data:
f.write(json.dumps(item) + "\n")
SQLite Database
Best for medium datasets that need querying. No server required.
import sqlite3
conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS products (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT,
price REAL,
url TEXT UNIQUE,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
cursor.execute(
"INSERT OR IGNORE INTO products (title, price, url) VALUES (?, ?, ?)",
("Product A", 29.99, "https://example.com/product-a")
)
conn.commit()
PostgreSQL
Best for production systems with concurrent access and large datasets.
import psycopg2
conn = psycopg2.connect("postgresql://user:pass@localhost/scraping_db")
cursor = conn.cursor()
cursor.execute(
"INSERT INTO products (title, price, url) VALUES (%s, %s, %s) ON CONFLICT (url) DO UPDATE SET price = %s",
("Product A", 29.99, "https://...", 29.99)
)
conn.commit()
Our Recommendations
- Starting out? Use CSV or JSON files
- Need queries? Use SQLite, zero setup, full SQL
- Production system? Use PostgreSQL
- Flexible schemas? Use MongoDB
- Pair with ScraperAPI for reliable data collection and any of these storage options for persistence