Storing Scraped Data in Databases (SQLite, PostgreSQL) - Python Scraping

Store scraped data in SQLite and PostgreSQL databases. Learn schema design, upserts, and best practices for persistent scraping data storage.

Databases are the best choice when you need to query, update, or deduplicate scraped data over time. SQLite requires zero setup, while PostgreSQL handles large-scale, multi-user workloads.

SQLite, Zero Setup

SQLite is built into Python. No server needed.

import sqlite3
import requests
from bs4 import BeautifulSoup

# Create database and table
conn = sqlite3.connect("quotes.db")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS quotes (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        text TEXT UNIQUE,
        author TEXT,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.commit()

# Scrape and store
url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

inserted = 0
for quote in soup.select("div.quote"):
    text = quote.select_one("span.text").get_text()
    author = quote.select_one("small.author").get_text()
    try:
        cursor.execute(
            "INSERT INTO quotes (text, author) VALUES (?, ?)",
            (text, author),
        )
        inserted += 1
    except sqlite3.IntegrityError:
        pass  # Duplicate, skip

conn.commit()
conn.close()
print(f"Inserted {inserted} new quotes.")

PostgreSQL with psycopg2

For production scrapers, PostgreSQL offers better concurrency and advanced features.

pip install psycopg2-binary

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(
    host="localhost",
    database="scraping_db",
    user="scraper",
    password="your_password",
)
cursor = conn.cursor()

# Create table with upsert support
cursor.execute("""
    CREATE TABLE IF NOT EXISTS products (
        url TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        price DECIMAL(10, 2),
        updated_at TIMESTAMP DEFAULT NOW()
    )
""")
conn.commit()

# Bulk insert with upsert (update on conflict)
products = [
    ("https://example.com/p/1", "Widget Pro", 29.99),
    ("https://example.com/p/2", "Gadget Plus", 49.99),
    ("https://example.com/p/3", "Tool Max", 19.99),
]

execute_values(
    cursor,
    """
    INSERT INTO products (url, name, price)
    VALUES %s
    ON CONFLICT (url) DO UPDATE SET
        price = EXCLUDED.price,
        updated_at = NOW()
    """,
    products,
)

conn.commit()
cursor.close()
conn.close()
print(f"Upserted {len(products)} products.")

A Reusable Storage Class

import sqlite3
from contextlib import contextmanager


class ScrapingDatabase:
    def __init__(self, db_path="scraping.db"):
        self.db_path = db_path
        self._init_db()

    def _init_db(self):
        with self._connect() as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS items (
                    url TEXT PRIMARY KEY,
                    data TEXT,
                    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                )
            """)

    @contextmanager
    def _connect(self):
        conn = sqlite3.connect(self.db_path)
        try:
            yield conn
            conn.commit()
        finally:
            conn.close()

    def save(self, url, data):
        with self._connect() as conn:
            conn.execute(
                "INSERT OR REPLACE INTO items (url, data) VALUES (?, ?)",
                (url, data),
            )

    def exists(self, url):
        with self._connect() as conn:
            row = conn.execute(
                "SELECT 1 FROM items WHERE url = ?", (url,)
            ).fetchone()
            return row is not None


# Usage
db = ScrapingDatabase()
db.save("https://example.com/page1", '{"title": "Example"}')
print(db.exists("https://example.com/page1"))  # True

SQLite vs PostgreSQL

Feature	SQLite	PostgreSQL
Setup	None (built in)	Requires server
Concurrency	Single writer	Multiple writers
Scale	Small to medium	Large
Upserts	INSERT OR REPLACE	ON CONFLICT DO UPDATE

Tips

Use UNIQUE constraints to prevent duplicate entries automatically.
Batch your inserts, inserting 1,000 rows in one transaction is much faster than 1,000 separate inserts.
Add a scraped_at timestamp to track data freshness.
When running production scrapers with ScraperAPI, database storage lets you track which URLs have been scraped and skip them on reruns.

Next Steps

Add error handling and retry logic to your scraper
Build pipelines that clean data before inserting into the database