Scraping Central is reader-supported. When you buy through links on our site, we may earn an affiliate commission.

Cost Optimization for Scraping Infrastructure

Practical strategies to reduce the cost of your web scraping infrastructure including proxies, compute, storage, and API services.

Deployment · #14intermediate4 min read
Share:WhatsAppLinkedIn

Scraping costs can spiral quickly, proxy bandwidth, compute resources, CAPTCHA solving, and storage all add up. Here are practical strategies to cut costs without sacrificing data quality.

Where Your Money Goes

Component Typical Cost Share Optimization Potential
Proxies 40-60% High
Compute (VPS/Cloud) 15-25% Medium
CAPTCHA solving 10-20% High
Storage 5-10% Medium
API services 5-15% Medium

1. Reduce Proxy Costs

Proxies are usually the biggest expense. Use the cheapest proxy type that works:

import requests

def smart_proxy_selector(url: str) -> dict:
    """Use the cheapest proxy type that works for each target."""
    # No proxy needed for APIs without anti-bot
    no_proxy_domains = ["api.github.com", "jsonplaceholder.typicode.com"]
    # Datacenter proxies work for moderately protected sites
    datacenter_domains = ["example.com", "simple-store.com"]
    # Residential needed only for heavily protected sites
    # Everything else falls through to residential

    from urllib.parse import urlparse
    domain = urlparse(url).netloc

    if domain in no_proxy_domains:
        return {}  # No proxy ($0)
    elif domain in datacenter_domains:
        return {
            "https": "http://user:pass@dc-proxy.example.com:8080"  # ~$1/GB
        }
    else:
        return {
            "https": "http://user:pass@res-proxy.example.com:8080"  # ~$8/GB
        }

response = requests.get(
    "https://example.com",
    proxies=smart_proxy_selector("https://example.com"),
    timeout=15,
)

2. Cache Aggressively

Never scrape the same page twice:

import hashlib
import json
import os
from pathlib import Path

class ScrapeCache:
    def __init__(self, cache_dir: str = ".cache", ttl_hours: int = 24):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        self.ttl_seconds = ttl_hours * 3600

    def _key(self, url: str) -> str:
        return hashlib.md5(url.encode()).hexdigest()

    def get(self, url: str) -> str | None:
        path = self.cache_dir / f"{self._key(url)}.html"
        if not path.exists():
            return None
        age = os.time() - path.stat().st_mtime
        if age > self.ttl_seconds:
            path.unlink()
            return None
        return path.read_text()

    def set(self, url: str, content: str):
        path = self.cache_dir / f"{self._key(url)}.html"
        path.write_text(content)

# Usage
cache = ScrapeCache(ttl_hours=12)

def scrape(url):
    cached = cache.get(url)
    if cached:
        print(f"Cache hit: {url}")
        return cached

    response = requests.get(url, timeout=15)
    cache.set(url, response.text)
    return response.text

3. Avoid JavaScript Rendering When Not Needed

JS rendering costs 5-10x more on API services:

import requests
from bs4 import BeautifulSoup

def needs_js_rendering(url: str) -> bool:
    """Check if a page needs JavaScript rendering."""
    # First, try without JS rendering
    resp = requests.get(url, timeout=15, headers={
        "User-Agent": "Mozilla/5.0 Chrome/124.0.0.0"
    })
    soup = BeautifulSoup(resp.text, "html.parser")

    # If the page has meaningful content, JS is not needed
    text_content = soup.get_text(strip=True)
    if len(text_content) > 1000:
        return False

    # Check for SPA indicators
    if soup.find("div", id="__next") or soup.find("div", id="root"):
        return True

    return False

4. Use Conditional Requests

Only download pages that have changed:

import requests

class ConditionalScraper:
    def __init__(self):
        self.etags = {}

    def fetch(self, url: str) -> str | None:
        headers = {"User-Agent": "Mozilla/5.0 Chrome/124.0.0.0"}

        if url in self.etags:
            headers["If-None-Match"] = self.etags[url]

        response = requests.get(url, headers=headers, timeout=15)

        if response.status_code == 304:
            print(f"Not modified: {url}")
            return None  # Page has not changed

        if "ETag" in response.headers:
            self.etags[url] = response.headers["ETag"]

        return response.text

5. Compress and Clean Data Before Storage

import gzip
import json

def save_compressed(data: list[dict], filepath: str):
    """Save as compressed JSON to reduce storage costs."""
    json_bytes = json.dumps(data).encode("utf-8")
    compressed = gzip.compress(json_bytes)

    with open(filepath, "wb") as f:
        f.write(compressed)

    ratio = len(compressed) / len(json_bytes) * 100
    print(f"Compressed to {ratio:.0f}% of original size")

Cost Comparison: DIY vs Managed Services

Scale (pages/month) DIY (VPS + proxies) ScraperAPI ScrapingAnt
10,000 ~$15 $49 Free tier
50,000 ~$50 $49 $19
500,000 ~$200 $149 $49

At lower volumes, ScrapingAnt and ScraperAPI can be cheaper than managing your own infrastructure when you factor in development and maintenance time.

Quick Wins Checklist

  • Use datacenter proxies as default, residential only when blocked
  • Cache every response for at least a few hours
  • Skip JS rendering unless the page actually needs it
  • Use conditional requests (ETags / If-Modified-Since)
  • Compress stored data with gzip
  • Schedule scrapes during off-peak hours (cheaper spot instances)
  • Monitor and remove duplicate URLs from your scrape queue