Best Ways to Store Scraped Data (CSV, JSON, Database)

Learn the best storage formats for scraped data. Compare CSV, JSON, SQLite, PostgreSQL, and MongoDB with practical Python examples.

Choosing the right storage format is critical for working with scraped data. Here is a comparison of the most common options.

Quick Comparison

Format	Best For	Size Limit	Query Support	Setup
CSV	Tabular data, spreadsheets	~1M rows	None	Zero
JSON	Nested/hierarchical data	~100MB	None	Zero
SQLite	Medium datasets, queries	~10GB	Full SQL	Minimal
PostgreSQL	Large-scale, production	Unlimited	Full SQL	Server
MongoDB	Flexible schemas	Unlimited	Rich queries	Server

CSV Storage

Best for simple, flat data that fits in a spreadsheet.

import csv

data = [
    {"title": "Product A", "price": 29.99, "url": "https://..."},
    {"title": "Product B", "price": 49.99, "url": "https://..."},
]

with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "url"])
    writer.writeheader()
    writer.writerows(data)

JSON Storage

Best for nested or irregular data structures.

import json

data = {
    "scraped_at": "2026-04-21",
    "products": [
        {"title": "Product A", "variants": [{"size": "S", "price": 29.99}]},
    ]
}

with open("products.json", "w") as f:
    json.dump(data, f, indent=2)

For large datasets, use JSON Lines (one JSON object per line):

with open("products.jsonl", "a") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")

SQLite Database

Best for medium datasets that need querying. No server required.

import sqlite3

conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        price REAL,
        url TEXT UNIQUE,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

cursor.execute(
    "INSERT OR IGNORE INTO products (title, price, url) VALUES (?, ?, ?)",
    ("Product A", 29.99, "https://example.com/product-a")
)
conn.commit()

PostgreSQL

Best for production systems with concurrent access and large datasets.

import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost/scraping_db")
cursor = conn.cursor()

cursor.execute(
    "INSERT INTO products (title, price, url) VALUES (%s, %s, %s) ON CONFLICT (url) DO UPDATE SET price = %s",
    ("Product A", 29.99, "https://...", 29.99)
)
conn.commit()

Our Recommendations

Starting out? Use CSV or JSON files
Need queries? Use SQLite, zero setup, full SQL
Production system? Use PostgreSQL
Flexible schemas? Use MongoDB
Pair with ScraperAPI for reliable data collection and any of these storage options for persistence