Running Scrapers on AWS Lambda

Learn how to deploy Python web scrapers to AWS Lambda for serverless, pay-per-use scraping with automatic scaling.

AWS Lambda lets you run scrapers without managing servers. You pay only for execution time, and it scales automatically. It is ideal for scrapers that run on a schedule or process events.

When Lambda Works Well

  • Scraping tasks that complete in under 15 minutes
  • Scheduled scraping (daily, hourly)
  • Event-driven scraping (triggered by SQS, API Gateway)
  • Small to medium workloads (no heavy browser automation)

Project Setup

mkdir lambda-scraper && cd lambda-scraper
python3 -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4

The Scraper Function

# lambda_function.py
import json
from datetime import datetime, timezone

import boto3
import requests
from bs4 import BeautifulSoup

def lambda_handler(event, context):
    """Lambda entry point for scraping."""
    url = event.get("url", "https://news.ycombinator.com")

    # Scrape the page; keep the request timeout below the function timeout
    response = requests.get(url, timeout=25, headers={
        "User-Agent": "Mozilla/5.0 (compatible; LambdaScraper/1.0)"
    })
    # Fail the invocation on 4xx/5xx instead of storing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract data
    items = []
    for item in soup.select(".titleline > a"):
        items.append({
            "title": item.get_text(),
            "url": item.get("href", ""),
        })

    # Save to S3 (boto3 ships with the Lambda Python runtime)
    s3 = boto3.client("s3")
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    s3.put_object(
        Bucket="my-scraper-data",
        Key=f"hn/{timestamp}.json",
        Body=json.dumps(items),
        ContentType="application/json",
    )

    return {
        "statusCode": 200,
        "body": json.dumps({
            "message": f"Scraped {len(items)} items",
            "timestamp": timestamp,
        }),
    }
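
The handler above reads the URL from a top-level "url" key, which matches a direct invocation. Other triggers deliver differently shaped events, so here is a minimal sketch of normalizing them. The function name urls_from_event is hypothetical, and it assumes each SQS message body is a plain URL and that API Gateway requests carry a JSON body with a "url" key; adapt both to your own message format.

```python
import json

DEFAULT_URL = "https://news.ycombinator.com"  # same default as the handler above


def urls_from_event(event):
    """Return the URLs to scrape for one invocation (hypothetical helper)."""
    # SQS delivers a batch of records; assumption: each body is a plain URL.
    if "Records" in event:
        return [
            record["body"]
            for record in event["Records"]
            if record.get("eventSource") == "aws:sqs"
        ]
    # API Gateway proxy events carry the request body as a string (or None).
    # Assumption: the body is JSON with a "url" key.
    if "requestContext" in event:
        payload = json.loads(event.get("body") or "{}")
        return [payload.get("url", DEFAULT_URL)]
    # Direct invocations and scheduled events: use "url" if present.
    return [event.get("url", DEFAULT_URL)]
```

A scheduled EventBridge event has no "url" key, so it falls through to the default URL, just as it does in the handler.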

Packaging and Deploying

Package your dependencies into a zip file:

# Install dependencies into a package directory
pip install requests beautifulsoup4 -t package/
cd package && zip -r ../deployment.zip . && cd ..
zip deployment.zip lambda_function.py

# Deploy with AWS CLI
aws lambda create-function \
    --function-name web-scraper \
    --runtime python3.12 \
    --handler lambda_function.lambda_handler \
    --zip-file fileb://deployment.zip \
    --role arn:aws:iam::123456789012:role/lambda-scraper-role \
    --timeout 300 \
    --memory-size 256
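
Before uploading, it can help to check the archive against Lambda's size limits. The sketch below reports the zipped and unzipped size of a deployment package using only the standard library. The function name package_report is hypothetical, and the two constants reflect the commonly documented limits (50 MB zipped for direct upload, 250 MB unzipped); verify them against the current AWS quotas.

```python
import os
import zipfile

MAX_DIRECT_UPLOAD_BYTES = 50 * 1024 * 1024  # zipped, direct upload (assumed quota)
MAX_UNZIPPED_BYTES = 250 * 1024 * 1024      # unzipped, including layers (assumed quota)


def package_report(zip_path):
    """Summarize a deployment zip against the size limits above."""
    zipped = os.path.getsize(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        # file_size is the uncompressed size of each archive member.
        unzipped = sum(info.file_size for info in archive.infolist())
    return {
        "zipped_bytes": zipped,
        "unzipped_bytes": unzipped,
        "fits_direct_upload": zipped <= MAX_DIRECT_UPLOAD_BYTES,
        "fits_unzipped_limit": unzipped <= MAX_UNZIPPED_BYTES,
    }
```

Calling package_report("deployment.zip") after the zip commands above shows how much headroom is left before dependencies need to move into a layer.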

Scheduling with EventBridge

Run your scraper on a schedule using Amazon EventBridge (formerly CloudWatch Events):

# Create a rule that runs every hour
aws events put-rule \
    --name scraper-hourly \
    --schedule-expression "rate(1 hour)"

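# Add the Lambda function as a target
aws events put-targets \
    --rule scraper-hourly \
    --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:web-scraper"

# Allow EventBridge to invoke the function; without this the invocation is denied
aws lambda add-permission \
    --function-name web-scraper \
    --statement-id scraper-hourly-invoke \
    --action lambda:InvokeFunction \
    --principal events.amazonaws.com \
    --source-arn arn:aws:events:us-east-1:123456789012:rule/scraper-hourly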

Using Lambda Layers for Dependencies

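For larger dependency sets, or dependencies shared by several functions, use Lambda layers. Packages with compiled extensions, such as lxml, must be built for the Lambda runtime's Linux platform, so run the build on Linux or in a compatible container: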

# Build a layer
mkdir -p layer/python
pip install requests beautifulsoup4 lxml -t layer/python/
cd layer && zip -r ../scraper-layer.zip . && cd ..

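aws lambda publish-layer-version \
    --layer-name scraper-deps \
    --zip-file fileb://scraper-layer.zip \
    --compatible-runtimes python3.12

# Attach the layer to the function, using the LayerVersionArn returned above
aws lambda update-function-configuration \
    --function-name web-scraper \
    --layers arn:aws:lambda:us-east-1:123456789012:layer:scraper-deps:1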

Lambda Limitations for Scraping

Limit              | Value                      | Impact
Max execution time | 15 minutes                 | Cannot run long crawls
Package size       | 250 MB unzipped            | No room for Chromium
Temp storage       | 512 MB (10 GB with config) | Limited data storage
Outbound IPs       | AWS-owned                  | May be blocked
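
One way to live with the execution-time limit is to stop cleanly before the function is killed. The sketch below relies on the context object's get_remaining_time_in_millis() method, which Lambda provides to Python handlers. The names scrape_within_budget, scrape_one, and safety_ms are hypothetical, and the 10-second safety margin is an arbitrary assumption to tune for your own request timeouts.

```python
def scrape_within_budget(urls, context, scrape_one, safety_ms=10_000):
    """Scrape URLs until the remaining execution time drops below safety_ms.

    Returns (results, pending), where pending holds the URLs not yet processed.
    """
    results = []
    pending = list(urls)
    # Check the clock before each URL so a slow page cannot push us past the limit
    # by more than one request's worth of time.
    while pending and context.get_remaining_time_in_millis() > safety_ms:
        url = pending.pop(0)
        results.append(scrape_one(url))
    return results, pending
```

Any URLs left in pending could be handed to a later invocation, for example by sending them to an SQS queue that triggers the same function.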

For the IP limitation, route requests through ScraperAPI to get rotating residential proxies within your Lambda function.
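
As a rough sketch of how proxy routing could be wired in, the helper below builds the proxies mapping that requests accepts. The function name proxy_settings and the PROXY_URL environment variable are hypothetical, and the exact proxy URL format (host, port, credentials) depends on your provider, so take it from their documentation rather than from this example.

```python
import os


def proxy_settings(env=os.environ):
    """Build the `proxies` argument for requests from a PROXY_URL variable."""
    proxy_url = env.get("PROXY_URL")  # hypothetical variable name
    if not proxy_url:
        # requests accepts proxies=None, which means no explicit proxy.
        return None
    # Route both plain and TLS traffic through the same proxy endpoint.
    return {"http": proxy_url, "https": proxy_url}


# Usage inside the handler (sketch):
#   response = requests.get(url, proxies=proxy_settings(), timeout=25)
```

Keeping the proxy URL in a Lambda environment variable means credentials stay out of the deployment zip and can change without redeploying code.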

Tips

  • Set the function timeout to at least 60 seconds (the default is 3 seconds) and keep each HTTP request timeout below it
  • Use /tmp for temporary file storage during execution
  • Monitor with CloudWatch Logs and set alarms on errors
  • Use Powertools for AWS Lambda (Python) for structured logging
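
To illustrate the /tmp tip, here is a minimal sketch that spools scraped items to a JSON Lines file in the temp directory before upload. The function name spool_items and the default filename are hypothetical. On Lambda, tempfile.gettempdir() normally resolves to /tmp, but treat that as an assumption and check it in your own runtime.

```python
import json
import os
import tempfile


def spool_items(items, filename="items.jsonl"):
    """Write items as JSON Lines into the temp directory and return the path."""
    path = os.path.join(tempfile.gettempdir(), filename)
    # Mode "w" overwrites leftovers: /tmp can persist between warm invocations.
    with open(path, "w", encoding="utf-8") as handle:
        for item in items:
            handle.write(json.dumps(item) + "\n")
    return path


# Usage inside the handler (sketch), with the bucket and key from this article:
#   path = spool_items(items)
#   s3.upload_file(path, "my-scraper-data", f"hn/{timestamp}.jsonl")
```

Writing to disk first keeps memory use flat when a run collects more items than comfortably fit in a single json.dumps call.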