Storing Scraped Data in CSV and JSON
Save your scraped data to CSV and JSON files using Python's built-in modules. Learn best practices for data export, encoding, and file organization.
Once you extract data from websites, you need to store it somewhere useful. CSV and JSON are the two most common formats for scraped data: CSV suits spreadsheets and quick analysis, while JSON suits nested or complex structures.
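To make the difference concrete, here is a minimal sketch that serializes one record both ways. The record and its values are made-up placeholders, not real scraped data. It illustrates why a list of tags has to be flattened into a single cell for CSV, while JSON can keep it as a list.

```python
import csv
import io
import json

# Hypothetical record, shaped like the ones built later in this article.
record = {
    "text": "An example quote.",
    "author": "Example Author",
    "tags": ["example", "demo"],
}

# CSV cells hold plain strings, so the tag list is joined into one cell.
flat = {**record, "tags": ", ".join(record["tags"])}
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["text", "author", "tags"])
writer.writeheader()
writer.writerow(flat)

# JSON keeps the list as a list.
json_text = json.dumps(record, indent=2)

print(csv_buffer.getvalue())
print(json_text)
```

In the CSV output the tags cell is wrapped in quotes because it contains a comma. In the JSON output the tags remain a separate array.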
Saving to CSV
import csv
import requests
from bs4 import BeautifulSoup
url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
quotes = []
for quote in soup.select("div.quote"):
    quotes.append({
        "text": quote.select_one("span.text").get_text(),
        "author": quote.select_one("small.author").get_text(),
        "tags": ", ".join(tag.get_text() for tag in quote.select("a.tag")),
    })
# Write to CSV
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    writer.writerows(quotes)
print(f"Saved {len(quotes)} quotes to quotes.csv")
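A quick way to check an export is to read the file back. The sketch below assumes placeholder rows and a temporary file path instead of a live scrape, so it runs without network access. It uses `csv.DictReader`, which takes the header row as the dictionary keys.

```python
import csv
import os
import tempfile

# Placeholder rows standing in for scraped quotes.
rows = [
    {"text": "First example quote.", "author": "Author A", "tags": "one, two"},
    {"text": "Second example quote.", "author": "Author B", "tags": "three"},
]

# Temporary location so the sketch does not touch your working directory.
path = os.path.join(tempfile.mkdtemp(), "quotes.csv")

with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back; every value comes back as a string.
with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(f"Read back {len(loaded)} rows")
print(loaded[0]["tags"])
```

Note that the tags come back as one comma-separated string, not a list. If you need the list again, you have to split it yourself.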
Saving to JSON
import json
import requests
from bs4 import BeautifulSoup
url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
quotes = []
for quote in soup.select("div.quote"):
    quotes.append({
        "text": quote.select_one("span.text").get_text(),
        "author": quote.select_one("small.author").get_text(),
        "tags": [tag.get_text() for tag in quote.select("a.tag")],
    })
# Write to JSON
with open("quotes.json", "w", encoding="utf-8") as f:
    json.dump(quotes, f, indent=2, ensure_ascii=False)
print(f"Saved {len(quotes)} quotes to quotes.json")
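The `ensure_ascii=False` argument in the `json.dump` call is easy to overlook. The following sketch uses a made-up record to show what it changes. By default the `json` module escapes non-ASCII characters; with `ensure_ascii=False` they are written as-is, which is usually easier to read in a UTF-8 file.

```python
import json

# Hypothetical record containing non-ASCII characters.
record = {"text": "Café culture", "tags": ["coffee", "naïve"]}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps é and ï as written

print(escaped)
print(readable)

# Both strings decode to the same Python object.
same = json.loads(escaped) == json.loads(readable) == record
print(same)
```

Either form loads back to identical data, so the choice affects only how the file looks on disk and how large it is.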
Appending Data Incrementally
For long-running scrapers, write data as you go instead of holding everything in memory.
import csv
import requests
from bs4 import BeautifulSoup
fieldnames = ["text", "author"]
with open("quotes_all.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for page in range(1, 11):
        url = f"https://quotes.toscrape.com/page/{page}/"
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        for quote in soup.select("div.quote"):
            writer.writerow({
                "text": quote.select_one("span.text").get_text(),
                "author": quote.select_one("small.author").get_text(),
            })
        f.flush()  # Push buffered rows to the OS after each page
print("Done, data saved incrementally.")
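The script above opens the file in `"w"` mode, so every run starts from an empty file. If you want a later run to add to an existing file, one possible approach is to open in append mode and write the header only when the file is new or empty. The helper name `append_rows` and the sample rows below are assumptions made for illustration.

```python
import csv
import os
import tempfile

FIELDNAMES = ["text", "author"]

def append_rows(filepath, rows):
    """Append rows to a CSV file, writing the header only for a new or empty file."""
    needs_header = not os.path.exists(filepath) or os.path.getsize(filepath) == 0
    with open(filepath, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)

# Two separate calls simulate two separate scraper runs.
path = os.path.join(tempfile.mkdtemp(), "quotes_all.csv")
append_rows(path, [{"text": "Quote from run one.", "author": "Author A"}])
append_rows(path, [{"text": "Quote from run two.", "author": "Author B"}])

with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(len(rows))  # 2 data rows, one header
```

This sketch does not deduplicate, so re-scraping the same page would add the same rows twice.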
JSON Lines (JSONL) for Large Datasets
JSON Lines stores one JSON object per line, making it easy to append and process large files.
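import json
def append_jsonl(filepath, record):
    """Append one record to a JSON Lines file."""
    with open(filepath, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
# Placeholder records; in a real scraper these come from your parsing code
scraped_items = [
    {"text": "Example quote", "author": "Example Author"},
]
# Usage during scraping
for item in scraped_items:
    append_jsonl("quotes.jsonl", item)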
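Reading a JSONL file can be done one line at a time as well. The sketch below pairs the `append_jsonl` helper with a hypothetical `read_jsonl` generator. The records and the temporary path are placeholders.

```python
import json
import os
import tempfile

def append_jsonl(filepath, record):
    """Append one record as a single JSON line."""
    with open(filepath, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(filepath):
    """Yield one record at a time so the whole file never sits in memory."""
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

path = os.path.join(tempfile.mkdtemp(), "quotes.jsonl")
for item in [{"author": "Author A"}, {"author": "Author B"}]:
    append_jsonl(path, item)

authors = [record["author"] for record in read_jsonl(path)]
print(authors)
```

Because `read_jsonl` is a generator, you can filter or count records in a loop without loading the full dataset first.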
CSV vs JSON
| Feature | CSV | JSON |
|---|---|---|
| Nested data | Not supported | Supported |
| Spreadsheet-friendly | Yes | Not directly |
| File size | Smaller | Larger |
| Streaming writes | Easy | Harder (JSONL is easy) |
| Human-readable | Yes | Yes |
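The file-size row can be checked with a rough in-memory comparison. This is only a sketch with generated placeholder records, so the exact numbers will differ for real data. The gap comes mainly from JSON repeating the key names in every record, while CSV states them once in the header.

```python
import csv
import io
import json

# Placeholder records; real scraped data will give different numbers.
records = [
    {"text": f"Example quote {i}", "author": f"Author {i}"} for i in range(100)
]

csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["text", "author"])
writer.writeheader()
writer.writerows(records)

csv_size = len(csv_buffer.getvalue().encode("utf-8"))
json_size = len(json.dumps(records).encode("utf-8"))
jsonl_size = len("".join(json.dumps(r) + "\n" for r in records).encode("utf-8"))

print(f"CSV: {csv_size} bytes, JSON: {json_size} bytes, JSONL: {jsonl_size} bytes")
```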
Tips
- Always use `encoding="utf-8"` to handle international characters.
- Use `newline=""` when opening CSV files to avoid blank rows on Windows.
- For very large datasets, prefer JSONL over JSON because it allows streaming reads and writes.
- When scraping at scale with ScraperAPI or ScrapingAnt, incremental writes limit data loss if the scraper crashes.
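The crash scenario from the last tip can be simulated without any network access. In this sketch, `fetch_page` is a hypothetical stand-in for a real request and is written to fail on page 3. The rows written before the failure are still in the file afterwards.

```python
import csv
import os
import tempfile

def fetch_page(page):
    """Hypothetical stand-in for a real request; fails on page 3 to simulate a crash."""
    if page == 3:
        raise RuntimeError("simulated network failure")
    return [{"text": f"Quote from page {page}", "author": "Example Author"}]

path = os.path.join(tempfile.mkdtemp(), "quotes_all.csv")

try:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "author"])
        writer.writeheader()
        for page in range(1, 6):
            writer.writerows(fetch_page(page))
            f.flush()  # push this page's rows out of Python's buffer
except RuntimeError as exc:
    print(f"Scraper stopped early: {exc}")

with open(path, newline="", encoding="utf-8") as f:
    saved = list(csv.DictReader(f))
print(f"{len(saved)} rows survived")
```

Here the `with` block also closes the file when the exception propagates. The explicit `flush()` matters more when the process is killed outright and no cleanup code gets to run.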
Next Steps
- Store data in databases for querying and long-term storage
- Learn to handle errors and retries for more reliable data collection