Deploying Scrapers to Google Cloud Run
Deploy containerized Python web scrapers to Google Cloud Run for serverless, auto-scaling scraping infrastructure.
Google Cloud Run runs your Docker containers in a fully managed serverless environment. You pay only when your scraper is actively processing requests, and it scales to zero when idle.
Why Cloud Run for Scraping?
- Scale to zero: no cost when not running
- Auto-scaling: handles burst workloads automatically
- Up to 60 minutes of execution time per request (vs. 15 minutes on AWS Lambda)
- Up to 32 GiB of RAM, enough for browser-based scraping
- Container-based: bring any dependencies, including Chromium
Project Setup
# main.py
import json
import os
from datetime import datetime, timezone

import requests
from bs4 import BeautifulSoup
from flask import Flask, jsonify, request
from google.cloud import storage

app = Flask(__name__)


@app.route("/scrape", methods=["POST"])
def scrape():
    # Tolerate a missing or non-JSON body and fall back to the default URL
    data = request.get_json(silent=True) or {}
    url = data.get("url", "https://news.ycombinator.com")

    response = requests.get(url, timeout=30, headers={
        "User-Agent": "Mozilla/5.0 (compatible; CloudRunScraper/1.0)"
    })
    # Fail on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    items = []
    for link in soup.select(".titleline > a"):
        items.append({"title": link.text, "url": link.get("href", "")})

    # Save to Google Cloud Storage
    client = storage.Client()
    bucket = client.bucket("my-scraper-bucket")
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    blob_name = f"scrapes/{timestamp}.json"
    blob = bucket.blob(blob_name)
    blob.upload_from_string(json.dumps(items), content_type="application/json")

    return jsonify({"items_count": len(items), "stored_at": blob_name})


@app.route("/health")
def health():
    return "OK"


if __name__ == "__main__":
    # Cloud Run injects PORT; default to 8080 for local runs
    port = int(os.environ.get("PORT", 8080))
    app.run(host="0.0.0.0", port=port)
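Because the /scrape endpoint fetches whatever URL the caller supplies, it may be worth checking that URL before making the outbound request. The following is a minimal sketch under stated assumptions, not part of the service above: ALLOWED_HOSTS and is_allowed_url are hypothetical names, and the host list is only an illustration to adapt.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; replace with the hosts you actually scrape.
ALLOWED_HOSTS = {"news.ycombinator.com", "example.com"}


def is_allowed_url(url: str) -> bool:
    """Return True only for http(s) URLs whose host is on the allowlist."""
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. broken IPv6 literals
        return False
    if parsed.scheme not in ("http", "https"):
        return False
    # hostname is lowercased by urlparse and is None when no host is present
    return parsed.hostname in ALLOWED_HOSTS
```

Inside scrape(), a request whose URL fails this check could be rejected with a 400 response before any fetch is attempted.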
Dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--timeout", "300", "main:app"]
# requirements.txt
flask==3.0.3
gunicorn==22.0.0
requests==2.32.3
beautifulsoup4==4.12.3
google-cloud-storage==2.16.0
Deploy
# Build and deploy in one command
gcloud run deploy scraper-service \
  --source . \
  --region us-central1 \
  --memory 512Mi \
  --timeout 300 \
  --allow-unauthenticated \
  --set-env-vars="PROXY_URL=http://user:pass@proxy.example.com:8080"
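The deploy command sets a PROXY_URL environment variable, but main.py as shown never reads it. One way the scraper might pick it up is sketched below; proxies_from_env is a hypothetical helper, and the only library behavior assumed is that requests accepts a proxies mapping keyed by scheme.

```python
import os


def proxies_from_env(env=None):
    """Build a requests-style proxies mapping from PROXY_URL, or None if unset."""
    env = os.environ if env is None else env
    proxy_url = env.get("PROXY_URL", "").strip()
    if not proxy_url:
        return None
    # Route both plain and TLS traffic through the same proxy
    return {"http": proxy_url, "https": proxy_url}
```

The result can be passed straight through, for example requests.get(url, timeout=30, proxies=proxies_from_env()); when the variable is unset the call behaves as it did before.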
Schedule with Cloud Scheduler
Trigger your Cloud Run scraper on a schedule:
# Create a Cloud Scheduler job to run every hour
gcloud scheduler jobs create http scraper-hourly \
  --location=us-central1 \
  --schedule="0 * * * *" \
  --uri="https://scraper-service-xxxxx.run.app/scrape" \
  --http-method=POST \
  --headers="Content-Type=application/json" \
  --message-body='{"url": "https://example.com"}' \
  --time-zone="UTC"
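Before relying on the schedule, it can help to send the same POST by hand and confirm the service responds. The sketch below uses only the standard library; build_scrape_request is a hypothetical helper, and the service URL you pass in is whatever your own deployment reports (the xxxxx in the command above is a placeholder).

```python
import json
import urllib.request


def build_scrape_request(service_url: str, target_url: str) -> urllib.request.Request:
    """Build the same POST that the Cloud Scheduler job sends to /scrape."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        service_url.rstrip("/") + "/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen with a timeout slightly above the Cloud Run timeout sends the request and returns the JSON summary produced by scrape().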
Cloud Run with Playwright
For browser-based scraping, use the Playwright Docker image:
FROM mcr.microsoft.com/playwright/python:v1.44.0-jammy
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--timeout", "300", "main:app"]
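Set the memory to at least 1 GiB (--memory 1Gi) and the request timeout to 300 seconds (--timeout 300) for browser-based scraping. Pin playwright==1.44.0 in requirements.txt so the Python package matches the browser builds bundled in the image.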
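With that image in place, the fetch step in main.py could render pages in a real browser instead of calling requests.get. The following is a minimal sketch using Playwright's synchronous API; render_html and its timeout_ms default are hypothetical, and the function assumes the playwright package from the image is installed.

```python
def render_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load a page in headless Chromium and return the rendered HTML."""
    # Imported lazily so the module still loads where Playwright is not installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Stop waiting once the DOM is parsed; timeout is in milliseconds
            page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
            return page.content()
        finally:
            # Always release the browser process, even if navigation fails
            browser.close()
```

The returned string can be handed to BeautifulSoup exactly as response.text is in the requests-based version. Launching a browser per request is simple but slow, so a long-lived browser shared across requests may be preferable under load.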
Cost Estimate
| Usage | Monthly Cost |
|---|---|
| 1,000 scrapes/month, 5s each | ~$0.01 |
| 10,000 scrapes/month, 10s each | ~$0.50 |
| 100,000 scrapes/month, 10s each | ~$5.00 |
Cloud Run's free tier includes 2 million requests, 180,000 vCPU-seconds, and 360,000 GiB-seconds of memory per month.
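The arithmetic behind estimates of this kind can be sketched as follows. This is a simplified model, not Google's billing formula: it assumes each instance handles one request at a time, ignores per-request fees and network egress, and takes every price and free allowance as an input. estimate_monthly_cost and all of its parameter names are hypothetical; look up current rates on the Cloud Run pricing page before using it.

```python
def estimate_monthly_cost(
    requests_per_month: int,
    seconds_per_request: float,
    vcpus: float,
    memory_gib: float,
    price_per_vcpu_second: float,
    price_per_gib_second: float,
    free_vcpu_seconds: float = 0.0,
    free_gib_seconds: float = 0.0,
) -> float:
    """Estimate compute cost assuming one request per instance at a time."""
    busy_seconds = requests_per_month * seconds_per_request
    # Only usage beyond the free allowance is billed
    billable_vcpu = max(0.0, busy_seconds * vcpus - free_vcpu_seconds)
    billable_gib = max(0.0, busy_seconds * memory_gib - free_gib_seconds)
    return billable_vcpu * price_per_vcpu_second + billable_gib * price_per_gib_second
```

Because Cloud Run can serve several requests concurrently on one instance, real bills for I/O-bound scrapers may come in lower than this model suggests.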
Tips
- Use ScraperAPI for proxy rotation to avoid getting Cloud Run's IPs blocked
- Set appropriate timeouts in both Cloud Run config and your HTTP client
- Use Cloud Run jobs (not services) for batch scraping tasks
- Set min-instances=1 if you need to avoid cold starts; the warm instance is billed while idle, so you give up scale-to-zero savings
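For the batch case, a Cloud Run job can run several tasks in parallel, and each task receives CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT environment variables. A minimal sketch of splitting a URL list across tasks is shown below; urls_for_this_task is a hypothetical helper, and the round-robin split is just one reasonable choice.

```python
import os


def urls_for_this_task(urls, env=None):
    """Return the slice of urls that this Cloud Run job task should scrape."""
    env = os.environ if env is None else env
    # Cloud Run jobs set these for every task; the defaults allow local runs
    index = int(env.get("CLOUD_RUN_TASK_INDEX", 0))
    count = int(env.get("CLOUD_RUN_TASK_COUNT", 1))
    # Round-robin split: task i takes items i, i + count, i + 2 * count, ...
    return list(urls)[index::count]
```

Each task then scrapes only its own share, so raising the task count spreads the same list over more containers without changing the code.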