Storing Scraped Data in Databases (SQLite, PostgreSQL)
Store scraped data in SQLite and PostgreSQL databases. Learn schema design, upserts, and best practices for persistent scraping data storage.
Python Scraping · #10intermediate3 min read
Databases are the best choice when you need to query, update, or deduplicate scraped data over time. SQLite requires zero setup, while PostgreSQL handles large-scale, multi-user workloads.
SQLite, Zero Setup
SQLite is built into Python. No server needed.
import sqlite3
import requests
from bs4 import BeautifulSoup
# Create database and table
conn = sqlite3.connect("quotes.db")
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS quotes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
text TEXT UNIQUE,
author TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.commit()
# Scrape and store
url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
inserted = 0
for quote in soup.select("div.quote"):
text = quote.select_one("span.text").get_text()
author = quote.select_one("small.author").get_text()
try:
cursor.execute(
"INSERT INTO quotes (text, author) VALUES (?, ?)",
(text, author),
)
inserted += 1
except sqlite3.IntegrityError:
pass # Duplicate, skip
conn.commit()
conn.close()
print(f"Inserted {inserted} new quotes.")
PostgreSQL with psycopg2
For production scrapers, PostgreSQL offers better concurrency and advanced features.
pip install psycopg2-binary
import psycopg2
from psycopg2.extras import execute_values
conn = psycopg2.connect(
host="localhost",
database="scraping_db",
user="scraper",
password="your_password",
)
cursor = conn.cursor()
# Create table with upsert support
cursor.execute("""
CREATE TABLE IF NOT EXISTS products (
url TEXT PRIMARY KEY,
name TEXT NOT NULL,
price DECIMAL(10, 2),
updated_at TIMESTAMP DEFAULT NOW()
)
""")
conn.commit()
# Bulk insert with upsert (update on conflict)
products = [
("https://example.com/p/1", "Widget Pro", 29.99),
("https://example.com/p/2", "Gadget Plus", 49.99),
("https://example.com/p/3", "Tool Max", 19.99),
]
execute_values(
cursor,
"""
INSERT INTO products (url, name, price)
VALUES %s
ON CONFLICT (url) DO UPDATE SET
price = EXCLUDED.price,
updated_at = NOW()
""",
products,
)
conn.commit()
cursor.close()
conn.close()
print(f"Upserted {len(products)} products.")
A Reusable Storage Class
import sqlite3
from contextlib import contextmanager
class ScrapingDatabase:
def __init__(self, db_path="scraping.db"):
self.db_path = db_path
self._init_db()
def _init_db(self):
with self._connect() as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS items (
url TEXT PRIMARY KEY,
data TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
@contextmanager
def _connect(self):
conn = sqlite3.connect(self.db_path)
try:
yield conn
conn.commit()
finally:
conn.close()
def save(self, url, data):
with self._connect() as conn:
conn.execute(
"INSERT OR REPLACE INTO items (url, data) VALUES (?, ?)",
(url, data),
)
def exists(self, url):
with self._connect() as conn:
row = conn.execute(
"SELECT 1 FROM items WHERE url = ?", (url,)
).fetchone()
return row is not None
# Usage
db = ScrapingDatabase()
db.save("https://example.com/page1", '{"title": "Example"}')
print(db.exists("https://example.com/page1")) # True
SQLite vs PostgreSQL
| Feature | SQLite | PostgreSQL |
|---|---|---|
| Setup | None (built in) | Requires server |
| Concurrency | Single writer | Multiple writers |
| Scale | Small to medium | Large |
| Upserts | INSERT OR REPLACE | ON CONFLICT DO UPDATE |
Tips
- Use
UNIQUEconstraints to prevent duplicate entries automatically. - Batch your inserts, inserting 1,000 rows in one transaction is much faster than 1,000 separate inserts.
- Add a
scraped_attimestamp to track data freshness. - When running production scrapers with ScraperAPI, database storage lets you track which URLs have been scraped and skip them on reruns.
Next Steps
- Add error handling and retry logic to your scraper
- Build pipelines that clean data before inserting into the database