Deploying Scrapers to Google Cloud Run

Deploy containerized Python web scrapers to Google Cloud Run for serverless, auto-scaling scraping infrastructure.

Google Cloud Run runs your Docker containers in a fully managed serverless environment. You pay only when your scraper is actively processing requests, and it scales to zero when idle.

Why Cloud Run for Scraping?

Scale to zero: no cost when not running
Auto-scaling: handles burst workloads automatically
Up to 60 minutes per request (vs 15 minutes on Lambda)
Up to 32 GiB of RAM: enough for browser-based scraping
Container-based: bring any dependencies, including Chromium

Project Setup

# main.py
import json
import os
from datetime import datetime, timezone

import requests
from bs4 import BeautifulSoup
from flask import Flask, jsonify, request
from google.cloud import storage

app = Flask(__name__)


@app.route("/scrape", methods=["POST"])
def scrape():
    # silent=True returns None for a missing or non-JSON body instead of raising
    data = request.get_json(silent=True) or {}
    url = data.get("url", "https://news.ycombinator.com")

    response = requests.get(url, timeout=30, headers={
        "User-Agent": "Mozilla/5.0 (compatible; CloudRunScraper/1.0)"
    })
    # Treat 4xx/5xx as failures rather than parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # This selector matches Hacker News story links; adjust it for other sites
    items = []
    for link in soup.select(".titleline > a"):
        items.append({"title": link.text, "url": link.get("href", "")})

    # Save to Google Cloud Storage
    client = storage.Client()
    bucket = client.bucket("my-scraper-bucket")
    # datetime.utcnow() is deprecated as of Python 3.12; use an aware UTC timestamp
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    blob_name = f"scrapes/{timestamp}.json"
    blob = bucket.blob(blob_name)
    blob.upload_from_string(json.dumps(items), content_type="application/json")

    return jsonify({"items_count": len(items), "stored_at": blob_name})


@app.route("/health")
def health():
    return "OK"


if __name__ == "__main__":
    # Local run only; in the container, gunicorn serves the app
    port = int(os.environ.get("PORT", 8080))
    app.run(host="0.0.0.0", port=port)
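
The handler above makes a single attempt with a 30-second client timeout. If retries are wanted, their total time still has to fit inside the request timeout configured on the service. A minimal sketch of one way to do that, assuming a hypothetical helper `fetch_with_deadline` and a `budget_s` value chosen to stay below that limit; `fetch` can be any callable that accepts a `timeout` keyword, such as `requests.get`:

```python
import time


def fetch_with_deadline(fetch, url, budget_s=270.0, per_try_s=30.0,
                        retries=3, clock=time.monotonic):
    """Call fetch(url, timeout=...) up to `retries` times within `budget_s` seconds.

    Hypothetical helper: each attempt's timeout is capped by the time left
    in the overall budget, so retries cannot outlive the service timeout.
    """
    deadline = clock() + budget_s
    last_error = None
    for _ in range(retries):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # budget exhausted; do not start another attempt
        try:
            return fetch(url, timeout=min(per_try_s, remaining))
        except Exception as exc:  # narrow to requests.RequestException in real code
            last_error = exc
    raise TimeoutError(f"gave up on {url}") from last_error
```

Inside the handler this would replace the direct call, for example `fetch_with_deadline(requests.get, url)`. The defaults are illustrative, not recommendations.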

Dockerfile

FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Cloud Run injects PORT (8080 by default); shell form lets it expand at runtime
CMD exec gunicorn --bind 0.0.0.0:${PORT:-8080} --timeout 300 main:app

# requirements.txt
flask==3.0.3
gunicorn==22.0.0
requests==2.32.3
beautifulsoup4==4.12.3
google-cloud-storage==2.16.0

Deploy

# Build and deploy in one command
gcloud run deploy scraper-service \
    --source . \
    --region us-central1 \
    --memory 512Mi \
    --timeout 300 \
    --allow-unauthenticated \
    --set-env-vars="PROXY_URL=http://user:pass@proxy.example.com:8080"
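
The deploy command sets PROXY_URL, but main.py as written never reads it. A minimal sketch of one way to pass it to `requests`, assuming a hypothetical helper named `proxies_from_env`; the scheme-keyed `proxies` mapping is the standard `requests` argument:

```python
import os


def proxies_from_env(environ=os.environ):
    """Build a requests-style proxies mapping from PROXY_URL, or None if unset."""
    proxy_url = environ.get("PROXY_URL", "").strip()
    if not proxy_url:
        return None  # requests treats proxies=None as "no explicit proxy"
    # Route both plain and TLS traffic through the same proxy endpoint
    return {"http": proxy_url, "https": proxy_url}


# Usage inside the handler (requests imported as in main.py):
# response = requests.get(url, timeout=30, proxies=proxies_from_env())
```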

Schedule with Cloud Scheduler

Trigger your Cloud Run scraper on a schedule:

# Create a Cloud Scheduler job to run every hour
gcloud scheduler jobs create http scraper-hourly \
    --schedule="0 * * * *" \
    --uri="https://scraper-service-xxxxx.run.app/scrape" \
    --http-method=POST \
    --headers="Content-Type=application/json" \
    --body='{"url": "https://example.com"}' \
    --time-zone="UTC"
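
Before scheduling, it can help to send the same request by hand and confirm the service responds. A minimal sketch using only the standard library; `build_scrape_request` is a hypothetical helper, and the service URL is the placeholder from the command above:

```python
import json
import urllib.request


def build_scrape_request(service_url, target_url):
    """Build the same POST that the Cloud Scheduler job above would send."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        service_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (performs a real network call, so it is left commented out):
# req = build_scrape_request(
#     "https://scraper-service-xxxxx.run.app/scrape",  # placeholder URL
#     "https://example.com",
# )
# with urllib.request.urlopen(req, timeout=310) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```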

Cloud Run with Playwright

For browser-based scraping, use the Playwright Docker image:

FROM mcr.microsoft.com/playwright/python:v1.44.0-jammy

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Cloud Run injects PORT (8080 by default); shell form lets it expand at runtime
CMD exec gunicorn --bind 0.0.0.0:${PORT:-8080} --timeout 300 main:app

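For browser-based scraping, set memory to at least 1 GiB (--memory 1Gi) and the request timeout to 300 seconds (--timeout 300); a headless browser needs considerably more memory than a plain HTTP client.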
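
A minimal sketch of what the scraping function could look like with Playwright's synchronous API, assuming the same Hacker News selector as main.py. The names `scrape_with_browser` and `to_items` are hypothetical, and Playwright is imported lazily so the module still loads where it is not installed:

```python
def to_items(pairs):
    """Turn (title, href) pairs into the item dicts that main.py produces."""
    return [
        {"title": title.strip(), "url": href or ""}
        for title, href in pairs
        if title and title.strip()
    ]


def scrape_with_browser(url, selector=".titleline > a", timeout_ms=30_000):
    """Load `url` in headless Chromium and collect links matching `selector`."""
    from playwright.sync_api import sync_playwright  # provided by the Playwright image

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait for the DOM only; waiting for full load is slower and often unnecessary
            page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
            pairs = [
                (link.inner_text(), link.get_attribute("href"))
                for link in page.locator(selector).all()
            ]
        finally:
            browser.close()  # release the browser's memory even if the page fails
    return to_items(pairs)
```

The Flask handler could call `scrape_with_browser(url)` in place of the `requests` and BeautifulSoup steps and store the result the same way.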

Cost Estimate

Usage	Monthly Cost
1,000 scrapes/month, 5s each	~$0.01
10,000 scrapes/month, 10s each	~$0.50
100,000 scrapes/month, 10s each	~$5.00

Cloud Run's free tier includes 2 million requests, 180,000 vCPU-seconds, and 360,000 GiB-seconds of memory per month.
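
Actual cost depends on the CPU and memory allocated and on current per-unit rates. A rough sketch of the arithmetic, with loud assumptions: every rate and free-tier allowance in the defaults below is an assumption to be checked against the current Cloud Run pricing page, and the function name `monthly_cost` is hypothetical. It also assumes one request per instance at a time, so it gives an upper bound; with concurrency, several requests share the same billed instance time.

```python
def monthly_cost(requests_per_month, seconds_each, vcpu=1.0, mem_gib=0.5,
                 vcpu_rate=0.000024, gib_rate=0.0000025,
                 request_rate_per_million=0.40,
                 free_vcpu_s=180_000, free_gib_s=360_000,
                 free_requests=2_000_000):
    """Estimate a monthly bill in USD. All default rates are assumptions."""
    busy_seconds = requests_per_month * seconds_each
    vcpu_seconds = busy_seconds * vcpu
    gib_seconds = busy_seconds * mem_gib

    # Only usage above the free allowance is billed
    billable_vcpu = max(0.0, vcpu_seconds - free_vcpu_s)
    billable_gib = max(0.0, gib_seconds - free_gib_s)
    billable_requests = max(0, requests_per_month - free_requests)

    total = (billable_vcpu * vcpu_rate
             + billable_gib * gib_rate
             + billable_requests / 1_000_000 * request_rate_per_million)
    return round(total, 2)


# Example: monthly_cost(1_000, 5) stays inside the assumed free tier
```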

Tips

Use ScraperAPI for proxy rotation to avoid getting Cloud Run's IPs blocked
Set appropriate timeouts in both Cloud Run config and your HTTP client
Use Cloud Run jobs (not services) for batch scraping tasks
Set --min-instances=1 if you need to avoid cold starts; this keeps one instance warm, so the service no longer scales to zero and idle time is billed
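
For the batch case, Cloud Run jobs run a container as one or more parallel tasks and expose each task's position through the CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT environment variables. A minimal sketch of splitting a URL list across tasks; the helper name `urls_for_this_task` and the round-robin split are illustrative choices:

```python
import os


def urls_for_this_task(urls, environ=os.environ):
    """Return the slice of `urls` that this Cloud Run job task should handle."""
    # Defaults make the function behave as a single task when run locally
    index = int(environ.get("CLOUD_RUN_TASK_INDEX", "0"))
    count = int(environ.get("CLOUD_RUN_TASK_COUNT", "1"))
    # Round-robin split: task i takes items i, i + count, i + 2 * count, ...
    return urls[index::count]
```

Each task would then loop over its own slice and write results to storage, so no coordination between tasks is needed.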