Running Scrapers on AWS Lambda - Deployment

Learn how to deploy Python web scrapers to AWS Lambda for serverless, pay-per-use scraping with automatic scaling.

AWS Lambda lets you run scrapers without managing servers. You pay only for execution time, and it scales automatically. It is ideal for scrapers that run on a schedule or process events.

When Lambda Works Well

Scraping tasks that complete in under 15 minutes
Scheduled scraping (daily, hourly)
Event-driven scraping (triggered by SQS, API Gateway)
Small to medium workloads (no heavy browser automation)
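The triggers listed above deliver differently shaped events to the handler. The following is a minimal sketch of normalizing them into a list of URLs. It assumes that SQS message bodies and API Gateway request bodies are JSON objects with a `url` field; that message format and the helper name `extract_urls` are illustrative assumptions, not part of any AWS API.

```python
import json

DEFAULT_URL = "https://news.ycombinator.com"


def extract_urls(event):
    """Return the list of URLs to scrape for a few common event shapes."""
    # SQS delivers a batch under "Records"; each record's "body" is a string.
    if "Records" in event:
        return [json.loads(record["body"])["url"] for record in event["Records"]]
    # API Gateway proxy integration passes the request body as a string (or None).
    if "body" in event:
        payload = json.loads(event["body"] or "{}")
        return [payload.get("url", DEFAULT_URL)]
    # Direct or scheduled invocation: read "url" from the event itself.
    return [event.get("url", DEFAULT_URL)]
```

A scheduled EventBridge event carries no `url` key, so it falls through to the default. Base64-encoded API Gateway bodies are not handled in this sketch.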

Project Setup

mkdir lambda-scraper && cd lambda-scraper
python3 -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4

The Scraper Function

# lambda_function.py
import json
from datetime import datetime, timezone

import boto3
import requests
from bs4 import BeautifulSoup

def lambda_handler(event, context):
    """Lambda entry point for scraping."""
    url = event.get("url", "https://news.ycombinator.com")

    # Scrape the page; the request timeout stays well below the function timeout
    response = requests.get(url, timeout=25, headers={
        "User-Agent": "Mozilla/5.0 (compatible; LambdaScraper/1.0)"
    })
    # Fail the invocation on 4xx/5xx responses instead of storing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract the title and link of each story
    items = []
    for item in soup.select(".titleline > a"):
        items.append({
            "title": item.get_text(),
            "url": item.get("href", ""),
        })

    # Save to S3 under a UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
    s3 = boto3.client("s3")
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    s3.put_object(
        Bucket="my-scraper-data",
        Key=f"hn/{timestamp}.json",
        Body=json.dumps(items),
        ContentType="application/json",
    )

    return {
        "statusCode": 200,
        "body": json.dumps({
            "message": f"Scraped {len(items)} items",
            "timestamp": timestamp,
        }),
    }
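The handler above scrapes a single URL and never uses its `context` argument. If one invocation works through several URLs, the context's `get_remaining_time_in_millis()` method can be used to stop before the function times out. The sketch below is one possible approach: `scrape_within_budget`, `scrape_one`, the 10-second safety margin, and `FakeContext` (a stand-in for local testing) are hypothetical names and values chosen for illustration.

```python
import time


def scrape_within_budget(urls, context, scrape_one, safety_margin_ms=10_000):
    """Scrape URLs in order until the invocation is close to its timeout.

    Returns (results, remaining_urls) so the caller can re-queue what is left.
    """
    results = []
    for index, url in enumerate(urls):
        # get_remaining_time_in_millis() is provided by the Lambda context object.
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            return results, urls[index:]
        results.append(scrape_one(url))
    return results, []


class FakeContext:
    """Local stand-in for the Lambda context, for testing outside AWS."""

    def __init__(self, timeout_seconds):
        self._deadline = time.monotonic() + timeout_seconds

    def get_remaining_time_in_millis(self):
        return max(0, int((self._deadline - time.monotonic()) * 1000))
```

The URLs returned as remaining could, for example, be sent back to a queue so that a later invocation picks them up.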

Packaging and Deploying

Package your dependencies into a zip file:

# Install dependencies into a package directory
pip install requests beautifulsoup4 -t package/
cd package && zip -r ../deployment.zip . && cd ..
zip deployment.zip lambda_function.py

# Deploy with AWS CLI
aws lambda create-function \
    --function-name web-scraper \
    --runtime python3.12 \
    --handler lambda_function.lambda_handler \
    --zip-file fileb://deployment.zip \
    --role arn:aws:iam::123456789012:role/lambda-scraper-role \
    --timeout 300 \
    --memory-size 256
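On systems without the `zip` command, the same archive layout can be produced with Python's standard `zipfile` module. This is a sketch under the assumption that dependencies were already installed into `package/` as shown above; the function name `build_deployment_zip` is hypothetical.

```python
import zipfile
from pathlib import Path


def build_deployment_zip(package_dir, handler_file, output_zip):
    """Zip the installed dependencies plus the handler at the archive root."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(output_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to package/ so modules sit at the zip root.
                archive.write(path, path.relative_to(package_dir).as_posix())
        # Lambda resolves "lambda_function.lambda_handler" from the zip root.
        archive.write(handler_file, Path(handler_file).name)
    return output_zip
```

For example, `build_deployment_zip("package", "lambda_function.py", "deployment.zip")` would mirror the two `zip` commands above. The output file should live outside `package/` so it is not swept into itself.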

Scheduling with EventBridge

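Run your scraper on a schedule using EventBridge (formerly CloudWatch Events):

# Create a rule that runs every hour
aws events put-rule \
    --name scraper-hourly \
    --schedule-expression "rate(1 hour)"

# Allow EventBridge to invoke the function; without this the rule fires but the invocation is denied
aws lambda add-permission \
    --function-name web-scraper \
    --statement-id scraper-hourly-invoke \
    --action lambda:InvokeFunction \
    --principal events.amazonaws.com \
    --source-arn arn:aws:events:us-east-1:123456789012:rule/scraper-hourly

# Add the Lambda function as a target
aws events put-targets \
    --rule scraper-hourly \
    --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:web-scraper"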
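Schedule expressions come in two forms, `rate(...)` and `cron(...)`, and both have small syntax rules that are easy to get wrong. The helpers below are a hypothetical sketch for generating them (for instance, when building the `--schedule-expression` value from a script); the function names are not part of any AWS SDK.

```python
def rate_expression(value, unit):
    """Build an EventBridge rate() expression, e.g. rate(1 hour)."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # EventBridge expects a singular unit for 1 and a plural unit otherwise.
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def daily_cron_expression(hour_utc, minute=0):
    """Build a cron() expression that fires once a day at a fixed UTC time."""
    # Six fields: minutes hours day-of-month month day-of-week year.
    # One of day-of-month / day-of-week must be "?".
    return f"cron({minute} {hour_utc} * * ? *)"
```

A rate schedule suits "every N hours" jobs, while a cron schedule is the better fit when the scrape must run at a specific time of day.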

Using Lambda Layers for Dependencies

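For larger dependency sets, or to share dependencies across several functions, use Lambda layers. A Python layer must place its packages under a top-level python/ directory:

# Build a layer (lxml ships compiled code, so build on Linux or in a Lambda-compatible container)
mkdir -p layer/python
pip install requests beautifulsoup4 lxml -t layer/python/
cd layer && zip -r ../scraper-layer.zip . && cd ..

aws lambda publish-layer-version \
    --layer-name scraper-deps \
    --zip-file fileb://scraper-layer.zip \
    --compatible-runtimes python3.12

# Attach the layer to the function, using the LayerVersionArn returned by the previous command
aws lambda update-function-configuration \
    --function-name web-scraper \
    --layers arn:aws:lambda:us-east-1:123456789012:layer:scraper-deps:1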
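A common layer mistake is zipping the packages without the `python/` prefix, in which case the Python runtime does not find them on its import path. A small check along these lines can catch that before publishing; `misplaced_layer_entries` is a hypothetical helper name.

```python
import zipfile


def misplaced_layer_entries(layer_zip):
    """Return archive entries that are not under the top-level python/ folder."""
    with zipfile.ZipFile(layer_zip) as archive:
        return [name for name in archive.namelist() if not name.startswith("python/")]
```

An empty list from `misplaced_layer_entries("scraper-layer.zip")` suggests the archive has the expected layout.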

Lambda Limitations for Scraping

Limit	Value	Impact
Max execution time	15 minutes	Cannot run long crawls in a single invocation
Package size	250 MB unzipped (function plus layers)	Little room for Chromium
Temp storage (/tmp)	512 MB by default (configurable up to 10 GB)	Limited scratch space
Outbound IPs	AWS-owned ranges	May be blocked by target sites

To work around the IP limitation, route requests from your Lambda function through a proxy service such as ScraperAPI, which provides rotating residential proxies.
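One way to wire a proxy into the function without hard-coding credentials is to read the proxy URL from a Lambda environment variable and pass it to `requests` through its `proxies` argument. The sketch below assumes a generic HTTP proxy URL; the variable name `SCRAPER_PROXY_URL` and the helper `proxies_from_env` are hypothetical, and the exact URL format depends on the provider's documentation.

```python
import os


def proxies_from_env(var_name="SCRAPER_PROXY_URL"):
    """Build a requests-style proxies mapping from an environment variable.

    Returns None when the variable is unset so requests connects directly.
    """
    proxy_url = os.environ.get(var_name)
    if not proxy_url:
        return None
    # requests expects one entry per URL scheme.
    return {"http": proxy_url, "https": proxy_url}
```

In the handler, this would be used as `requests.get(url, timeout=25, proxies=proxies_from_env())`.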

Tips

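Set the function timeout to at least 60 seconds; the default of 3 seconds is too short for most scraping
Use /tmp for temporary files during execution; it is the only writable path in the Lambda filesystem
Monitor with CloudWatch Logs and set CloudWatch alarms on the function's Errors metric
Use Powertools for AWS Lambda (Python) for structured logging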
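To illustrate what structured logging buys you, here is a minimal standard-library sketch that emits one JSON object per log line, which CloudWatch Logs can then filter by field. This is a stand-in written with the `logging` module, not the Powertools API; `JsonFormatter`, `get_logger`, and the `fields` convention are hypothetical names.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Extra key/value pairs passed as extra={"fields": {...}} (hypothetical convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def get_logger(name="scraper"):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    # Avoid duplicate lines from handlers attached to the root logger.
    logger.propagate = False
    return logger
```

A call such as `get_logger().info("Scraped %d items", 30, extra={"fields": {"url": url}})` would then produce a single JSON line containing the message and the URL.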