Cost Optimization for Scraping Infrastructure
Practical strategies to reduce the cost of your web scraping infrastructure including proxies, compute, storage, and API services.
Scraping costs can spiral quickly: proxy bandwidth, compute resources, CAPTCHA solving, and storage all add up. Here are practical strategies to cut costs without sacrificing data quality.
Where Your Money Goes
| Component | Typical Cost Share | Optimization Potential |
|---|---|---|
| Proxies | 40-60% | High |
| Compute (VPS/Cloud) | 15-25% | Medium |
| CAPTCHA solving | 10-20% | High |
| Storage | 5-10% | Medium |
| API services | 5-15% | Medium |
1. Reduce Proxy Costs
Proxies are usually the biggest expense. Use the cheapest proxy type that works:
from urllib.parse import urlparse

import requests

# APIs without anti-bot protection need no proxy at all
NO_PROXY_DOMAINS = {"api.github.com", "jsonplaceholder.typicode.com"}
# Datacenter proxies work for moderately protected sites
DATACENTER_DOMAINS = {"example.com", "simple-store.com"}


def smart_proxy_selector(url: str) -> dict:
    """Use the cheapest proxy type that works for each target."""
    # hostname drops any port and is lowercased, unlike netloc
    domain = urlparse(url).hostname or ""
    if domain in NO_PROXY_DOMAINS:
        return {}  # No proxy ($0)
    if domain in DATACENTER_DOMAINS:
        proxy = "http://user:pass@dc-proxy.example.com:8080"  # ~$1/GB
    else:
        # Residential is needed only for heavily protected sites;
        # everything not listed above falls through to it
        proxy = "http://user:pass@res-proxy.example.com:8080"  # ~$8/GB
    # Cover both schemes so plain-HTTP URLs also go through the proxy
    return {"http": proxy, "https": proxy}


url = "https://example.com"
response = requests.get(url, proxies=smart_proxy_selector(url), timeout=15)
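To get a feel for what tiering is worth, the approximate per-GB rates from the comments above (~$1/GB datacenter, ~$8/GB residential) can be plugged into a quick estimate. This is only a rough sketch: the function name `monthly_proxy_cost` and the 100 GB traffic split are hypothetical, and real provider pricing varies.

```python
# Approximate per-GB rates from the selector comments above.
# Assumption: real provider pricing will differ, so treat these as placeholders.
RATE_PER_GB = {"none": 0.0, "datacenter": 1.0, "residential": 8.0}


def monthly_proxy_cost(gb_by_tier: dict[str, float]) -> float:
    """Return the estimated monthly proxy bill in dollars."""
    return sum(RATE_PER_GB[tier] * gb for tier, gb in gb_by_tier.items())


# Hypothetical workload: 100 GB of traffic per month.
all_residential = monthly_proxy_cost({"residential": 100})
tiered = monthly_proxy_cost({"none": 20, "datacenter": 50, "residential": 30})

print(f"All residential: ${all_residential:.0f}")  # All residential: $800
print(f"Tiered:          ${tiered:.0f}")           # Tiered:          $290
```

Under these assumed numbers, routing only 30% of traffic through residential proxies cuts the bill by roughly two thirds.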
2. Cache Aggressively
Never fetch the same page twice while a cached copy is still fresh:
import hashlib
import time
from pathlib import Path

import requests


class ScrapeCache:
    def __init__(self, cache_dir: str = ".cache", ttl_hours: int = 24):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_hours * 3600

    def _key(self, url: str) -> str:
        # MD5 only derives a filename here; it is not used for security
        return hashlib.md5(url.encode()).hexdigest()

    def get(self, url: str) -> str | None:
        path = self.cache_dir / f"{self._key(url)}.html"
        if not path.exists():
            return None
        # Entries older than the TTL are stale: delete them and report a miss
        age = time.time() - path.stat().st_mtime
        if age > self.ttl_seconds:
            path.unlink()
            return None
        return path.read_text(encoding="utf-8")

    def set(self, url: str, content: str):
        path = self.cache_dir / f"{self._key(url)}.html"
        path.write_text(content, encoding="utf-8")


# Usage
cache = ScrapeCache(ttl_hours=12)


def scrape(url):
    cached = cache.get(url)
    if cached is not None:  # An empty page is still a valid cache hit
        print(f"Cache hit: {url}")
        return cached
    response = requests.get(url, timeout=15)
    response.raise_for_status()  # Do not cache error pages
    cache.set(url, response.text)
    return response.text
3. Avoid JavaScript Rendering When Not Needed
On scraping API services, JS rendering typically costs 5-10x more per request, so check whether a page actually needs it first:
import requests
from bs4 import BeautifulSoup


def needs_js_rendering(url: str) -> bool:
    """Check if a page needs JavaScript rendering."""
    # First, try without JS rendering
    resp = requests.get(url, timeout=15, headers={
        "User-Agent": "Mozilla/5.0 Chrome/124.0.0.0"
    })
    soup = BeautifulSoup(resp.text, "html.parser")
    # Check for SPA mount points (Next.js uses #__next, React apps often #root)
    is_spa_shell = bool(
        soup.find("div", id="__next") or soup.find("div", id="root")
    )
    # Drop script/style bodies so inline JS does not count as page content
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # If the page has meaningful content, JS is not needed
    text_content = soup.get_text(strip=True)
    if len(text_content) > 1000:
        return False
    # Little visible text: render with JS only if the page looks like an SPA
    return is_spa_shell
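The check itself costs one request, so it can pay to remember the verdict instead of repeating it for every URL. The sketch below assumes that pages on the same domain behave alike, which will not hold for every site. `RenderDecisionCache` is a hypothetical helper, and `fake_detector` stands in for a real detection function so the example runs without network access.

```python
from typing import Callable
from urllib.parse import urlparse


class RenderDecisionCache:
    """Remember the JS-rendering verdict per domain (hypothetical helper)."""

    def __init__(self, detector: Callable[[str], bool]):
        self._detector = detector
        self._by_domain: dict[str, bool] = {}

    def needs_js(self, url: str) -> bool:
        domain = urlparse(url).hostname or ""
        # Run the detector only the first time a domain is seen
        if domain not in self._by_domain:
            self._by_domain[domain] = self._detector(url)
        return self._by_domain[domain]


# Stand-in detector for the demo; it records each call it receives.
calls: list[str] = []


def fake_detector(url: str) -> bool:
    calls.append(url)
    return "app." in url


decisions = RenderDecisionCache(fake_detector)
decisions.needs_js("https://app.example.com/a")
decisions.needs_js("https://app.example.com/b")
print(len(calls))  # 1
```

In real use, the detector passed in would be a function like `needs_js_rendering`.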
4. Use Conditional Requests
Only download pages that have changed:
import requests


class ConditionalScraper:
    def __init__(self):
        self.etags = {}

    def fetch(self, url: str) -> str | None:
        headers = {"User-Agent": "Mozilla/5.0 Chrome/124.0.0.0"}
        if url in self.etags:
            headers["If-None-Match"] = self.etags[url]
        response = requests.get(url, headers=headers, timeout=15)
        if response.status_code == 304:
            print(f"Not modified: {url}")
            return None  # Page has not changed
        if "ETag" in response.headers:
            self.etags[url] = response.headers["ETag"]
        return response.text
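Not every server sends an `ETag`. Many send only `Last-Modified`, which pairs with the `If-Modified-Since` request header in the same way. The sketch below shows one way to handle both validators. The helper names `extract_validators` and `conditional_headers` are hypothetical, and plain dicts stand in for real response headers so the example runs offline.

```python
def extract_validators(response_headers) -> dict[str, str]:
    """Keep only the cache validators from a response's headers."""
    return {
        name: response_headers[name]
        for name in ("ETag", "Last-Modified")
        if name in response_headers
    }


def conditional_headers(validators: dict[str, str]) -> dict[str, str]:
    """Build request headers from validators saved on a previous fetch."""
    headers = {}
    if "ETag" in validators:
        headers["If-None-Match"] = validators["ETag"]
    if "Last-Modified" in validators:
        headers["If-Modified-Since"] = validators["Last-Modified"]
    return headers


# Example with a plain dict standing in for response.headers
saved = extract_validators({
    "Content-Type": "text/html",
    "Last-Modified": "Wed, 01 May 2024 10:00:00 GMT",
})
print(conditional_headers(saved))
# {'If-Modified-Since': 'Wed, 01 May 2024 10:00:00 GMT'}
```

As with the ETag version, a 304 response means the stored copy is still current and no body was transferred.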
5. Compress and Clean Data Before Storage
import gzip
import json


def save_compressed(data: list[dict], filepath: str):
    """Save as compressed JSON to reduce storage costs."""
    json_bytes = json.dumps(data).encode("utf-8")
    compressed = gzip.compress(json_bytes)
    with open(filepath, "wb") as f:
        f.write(compressed)
    ratio = len(compressed) / len(json_bytes) * 100
    print(f"Compressed to {ratio:.0f}% of original size")
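Reading the data back is the mirror image. A minimal sketch, assuming the file holds a single gzip-compressed JSON array as written above: `load_compressed` is a hypothetical counterpart, and the sample record and temporary path exist only for the demo.

```python
import gzip
import json
import tempfile
from pathlib import Path


def load_compressed(filepath: str) -> list[dict]:
    """Load records from a gzip-compressed JSON file."""
    # "rt" lets gzip decompress and decode to text in one step
    with gzip.open(filepath, "rt", encoding="utf-8") as f:
        return json.load(f)


# Round-trip demo with a made-up record and a temporary file
records = [{"url": "https://example.com", "price": 19.99}]
path = Path(tempfile.mkdtemp()) / "records.json.gz"
path.write_bytes(gzip.compress(json.dumps(records).encode("utf-8")))

restored = load_compressed(str(path))
print(restored == records)  # True
```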
Cost Comparison: DIY vs Managed Services
| Scale (pages/month) | DIY (VPS + proxies) | ScraperAPI | ScrapingAnt |
|---|---|---|---|
| 10,000 | ~$15 | $49 | Free tier |
| 50,000 | ~$50 | $49 | $19 |
| 500,000 | ~$200 | $149 | $49 |
The DIY figures cover infrastructure only. Once you factor in development and maintenance time, managed services such as ScrapingAnt and ScraperAPI can work out cheaper than running your own infrastructure, particularly at lower volumes.
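One way to factor in that time is to add an hourly cost to each option. The sketch below uses the 50,000 pages/month row from the table (DIY ~$50, ScrapingAnt $19). The maintenance hours and the $50 hourly rate are hypothetical placeholders, not measured figures, so substitute your own.

```python
def effective_monthly_cost(
    service_fee: float, maintenance_hours: float, hourly_rate: float
) -> float:
    """Infrastructure or subscription fee plus the cost of engineer time."""
    return service_fee + maintenance_hours * hourly_rate


HOURLY_RATE = 50.0  # Assumption: cost of one engineer hour, in dollars

# Fees from the 50,000 pages/month row; the hours are assumed for illustration
diy = effective_monthly_cost(50, maintenance_hours=4, hourly_rate=HOURLY_RATE)
managed = effective_monthly_cost(19, maintenance_hours=0.5, hourly_rate=HOURLY_RATE)

print(f"DIY:     ${diy:.0f}")      # DIY:     $250
print(f"Managed: ${managed:.0f}")  # Managed: $44
```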
Quick Wins Checklist
- Use datacenter proxies as default, residential only when blocked
- Cache every response for at least a few hours
- Skip JS rendering unless the page actually needs it
- Use conditional requests (ETags / If-Modified-Since)
- Compress stored data with gzip
- Schedule scrapes during off-peak hours (cheaper spot instances)
- Monitor and remove duplicate URLs from your scrape queue
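
Duplicate URLs often differ only in tracking parameters, parameter order, fragments, or a trailing slash. The sketch below normalizes URLs before comparing them. `normalize_url`, `dedupe`, and the `TRACKING_PARAMS` list are illustrative assumptions, and stripping trailing slashes is not safe for every site, so adapt the rules to your targets.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters that never change the page content
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Return a canonical form of the URL for duplicate detection."""
    parts = urlsplit(url)
    # Drop tracking parameters and sort the rest so order does not matter
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    # The empty last element discards the #fragment
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def dedupe(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each canonical form."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


queue = [
    "https://Example.com/products/?b=2&a=1",
    "https://example.com/products?a=1&b=2&utm_source=news",
    "https://example.com/products?a=1&b=2#reviews",
    "https://example.com/about",
]
print(len(dedupe(queue)))  # 2
```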