
Deploying Scrapers to Google Cloud Run

Deploy containerized Python web scrapers to Google Cloud Run for serverless, auto-scaling scraping infrastructure.


Google Cloud Run runs your Docker containers in a fully managed serverless environment. You pay only when your scraper is actively processing requests, and it scales to zero when idle.

Why Cloud Run for Scraping?

  • Scale to zero: no cost when nothing is running
  • Auto-scaling: burst workloads get more instances automatically
  • Up to 60 minutes per request (vs. 15 minutes on AWS Lambda)
  • Up to 32 GiB of RAM per instance, enough for browser-based scraping
  • Container-based: bring any dependencies, including Chromium

Project Setup

# main.py
from datetime import datetime, timezone
import json

import requests
from bs4 import BeautifulSoup
from flask import Flask, request, jsonify
from google.cloud import storage

app = Flask(__name__)

@app.route("/scrape", methods=["POST"])
def scrape():
    # Tolerate a missing or non-JSON body instead of failing on None
    data = request.get_json(silent=True) or {}
    url = data.get("url", "https://news.ycombinator.com")

    try:
        response = requests.get(url, timeout=30, headers={
            "User-Agent": "Mozilla/5.0 (compatible; CloudRunScraper/1.0)"
        })
        # Treat 4xx/5xx responses as failures rather than parsing an error page
        response.raise_for_status()
    except requests.RequestException as exc:
        return jsonify({"error": str(exc)}), 502

    soup = BeautifulSoup(response.text, "html.parser")

    # ".titleline > a" matches story links on Hacker News; adjust per target site
    items = []
    for link in soup.select(".titleline > a"):
        items.append({"title": link.get_text(strip=True), "url": link.get("href", "")})

    # Save to Google Cloud Storage under a UTC-timestamped object name
    client = storage.Client()
    bucket = client.bucket("my-scraper-bucket")
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    blob_name = f"scrapes/{timestamp}.json"
    blob = bucket.blob(blob_name)
    blob.upload_from_string(json.dumps(items), content_type="application/json")

    return jsonify({"items_count": len(items), "stored_at": blob_name})

@app.route("/health")
def health():
    return "OK"

if __name__ == "__main__":
    import os
    port = int(os.environ.get("PORT", 8080))
    app.run(host="0.0.0.0", port=port)
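The /scrape handler fetches whatever URL arrives in the request body, which is risky on a publicly reachable service. A minimal sketch of an allowlist check follows, using only the standard library. The names ALLOWED_HOSTS and is_allowed_url are hypothetical and not part of the original main.py, and the host list is a placeholder.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; replace with the hosts you actually scrape.
ALLOWED_HOSTS = {"news.ycombinator.com", "example.com"}


def is_allowed_url(url, allowed_hosts=ALLOWED_HOSTS):
    """Return True only for http(s) URLs whose host is on the allowlist."""
    if not isinstance(url, str):
        return False
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse rejects some malformed inputs, such as broken IPv6 literals
        return False
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    return host in allowed_hosts
```

One way to use it is to call is_allowed_url(url) at the top of scrape() and return a 400 response when it is False, before any outbound request is made.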

Dockerfile

FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--timeout", "300", "main:app"]

# requirements.txt
flask==3.0.3
gunicorn==22.0.0
requests==2.32.3
beautifulsoup4==4.12.3
google-cloud-storage==2.16.0

Deploy

# Build and deploy in one command
gcloud run deploy scraper-service \
    --source . \
    --region us-central1 \
    --memory 512Mi \
    --timeout 300 \
    --allow-unauthenticated \
    --set-env-vars="PROXY_URL=http://user:pass@proxy.example.com:8080"
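The deploy command sets PROXY_URL, but the main.py listing never reads it. A minimal sketch of one way to wire it in: a helper that builds the keyword arguments for requests.get, with separate connect and read timeouts. The helper name request_kwargs and the timeout defaults are assumptions for illustration.

```python
import os


def request_kwargs(env=None, connect_timeout=5.0, read_timeout=30.0):
    """Build keyword arguments for requests.get from the environment."""
    env = os.environ if env is None else env
    kwargs = {
        # requests accepts a (connect, read) tuple for the timeout
        "timeout": (connect_timeout, read_timeout),
        "headers": {
            "User-Agent": "Mozilla/5.0 (compatible; CloudRunScraper/1.0)"
        },
    }
    proxy_url = env.get("PROXY_URL")
    if proxy_url:
        # Route both plain and TLS traffic through the same proxy
        kwargs["proxies"] = {"http": proxy_url, "https": proxy_url}
    return kwargs
```

In the handler this would be called as requests.get(url, **request_kwargs()). Keeping the read timeout well below the 300-second Cloud Run timeout lets the handler return an error response instead of being cut off mid-request.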

Schedule with Cloud Scheduler

Trigger your Cloud Run scraper on a schedule:

# Create a Cloud Scheduler job to run every hour
gcloud scheduler jobs create http scraper-hourly \
    --schedule="0 * * * *" \
    --uri="https://scraper-service-xxxxx.run.app/scrape" \
    --http-method=POST \
    --headers="Content-Type=application/json" \
    --body='{"url": "https://example.com"}' \
    --time-zone="UTC"

Cloud Run with Playwright

For browser-based scraping, use the Playwright Docker image:

FROM mcr.microsoft.com/playwright/python:v1.44.0-jammy

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--timeout", "300", "main:app"]

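Set memory to at least 1 GiB (--memory 1Gi) and the timeout to 300 seconds (--timeout 300) for browser-based scraping. Pin playwright in requirements.txt to the same version as the image tag (1.44.0 here) so the Python package matches the browsers bundled in the image.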
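For reference, a minimal sketch of what the scraping function might look like with Playwright's synchronous API inside this image. The function name fetch_rendered_html, the timeout default, and the --disable-dev-shm-usage flag are assumptions; the flag is commonly used in containers with limited shared memory and may not be needed in every setup.

```python
# Extra Chromium flag often used in containers (an assumption, not a requirement)
CHROMIUM_ARGS = ["--disable-dev-shm-usage"]


def fetch_rendered_html(url, timeout_ms=30_000):
    """Load a page in headless Chromium and return the rendered HTML."""
    # Imported lazily so this module still loads where Playwright is absent
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, args=CHROMIUM_ARGS)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            return page.content()
        finally:
            # Always release the browser so memory is freed between requests
            browser.close()
```

The returned HTML can be passed to BeautifulSoup exactly as response.text is in the requests-based version.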

Cost Estimate

Usage                             | Monthly Cost
1,000 scrapes/month, 5s each      | ~$0.01
10,000 scrapes/month, 10s each    | ~$0.50
100,000 scrapes/month, 10s each   | ~$5.00

Cloud Run's free tier includes 2 million requests, 180,000 vCPU-seconds, and 360,000 GiB-seconds of memory per month.
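To redo the estimate for your own workload, the arithmetic can be sketched as a small function. Every rate and free-tier default below is an assumption modeled on request-based pricing in a low-cost region, so check the current pricing page before relying on the result. The concurrency parameter is a hypothetical simplification for several requests sharing one instance, which reduces billed instance time.

```python
def estimate_monthly_cost(
    requests_per_month,
    seconds_per_request,
    vcpus=1.0,
    memory_gib=0.5,
    concurrency=1.0,
    vcpu_rate=0.000024,            # assumed USD per vCPU-second
    gib_rate=0.0000025,            # assumed USD per GiB-second
    request_rate=0.40 / 1_000_000, # assumed USD per request
    free_vcpu_seconds=180_000,
    free_gib_seconds=360_000,
    free_requests=2_000_000,
):
    """Rough monthly cost in USD after subtracting the free tier."""
    # Requests that overlap on one instance share its billed time
    billable_seconds = requests_per_month * seconds_per_request / concurrency
    cpu = max(0.0, billable_seconds * vcpus - free_vcpu_seconds) * vcpu_rate
    mem = max(0.0, billable_seconds * memory_gib - free_gib_seconds) * gib_rate
    req = max(0, requests_per_month - free_requests) * request_rate
    return round(cpu + mem + req, 2)
```

Calling estimate_monthly_cost(10_000, 10) models the middle row of the table with one request per instance at a time; raising concurrency or lowering memory_gib shows how much each setting moves the bill.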

Tips

  • Use ScraperAPI for proxy rotation to avoid getting Cloud Run's IPs blocked
  • Set appropriate timeouts in both Cloud Run config and your HTTP client
  • Use Cloud Run jobs (not services) for batch scraping tasks
  • Set --min-instances=1 if you need to avoid cold starts; this keeps one instance warm and billed while idle, so the service no longer scales to zero
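For the batch case mentioned in the tips, Cloud Run jobs expose CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT to each task, which makes it straightforward to split a URL list across parallel tasks. A minimal sketch, assuming a hypothetical helper named urls_for_this_task and a URL list you supply yourself:

```python
import os


def urls_for_this_task(urls, env=None):
    """Return the slice of urls that this job task should scrape."""
    env = os.environ if env is None else env
    # Defaults let the same code run locally as a single task
    index = int(env.get("CLOUD_RUN_TASK_INDEX", 0))
    count = int(env.get("CLOUD_RUN_TASK_COUNT", 1))
    # Task i takes items i, i + count, i + 2*count, ...
    return urls[index::count]
```

Each task then loops over its own slice and writes results to Cloud Storage, so no coordination between tasks is needed.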