Running Scrapers on AWS Lambda
Learn how to deploy Python web scrapers to AWS Lambda for serverless, pay-per-use scraping with automatic scaling.
AWS Lambda lets you run scrapers without managing servers. You pay only for execution time, and it scales automatically. It is ideal for scrapers that run on a schedule or process events.
When Lambda Works Well
- Scraping tasks that complete in under 15 minutes
- Scheduled scraping (daily, hourly)
- Event-driven scraping (triggered by SQS, API Gateway)
- Small to medium workloads (no heavy browser automation)
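To make the event-driven case concrete, here is a minimal sketch of how one handler could accept both direct invocations and SQS batches. The helper name `urls_from_event` and the assumption that each SQS message body is JSON of the form `{"url": "..."}` are illustrative and not part of the original scraper.

```python
import json

DEFAULT_URL = "https://news.ycombinator.com"


def urls_from_event(event):
    """Return the URLs to scrape for a direct, scheduled, or SQS invocation."""
    records = event.get("Records")
    if records:
        # SQS delivers a batch; each record carries the message text in "body".
        # Assumption: the producer sends JSON bodies shaped like {"url": "..."}.
        return [json.loads(record["body"])["url"] for record in records]
    # Direct or scheduled invocation: fall back to the default URL.
    return [event.get("url", DEFAULT_URL)]
```

A handler would then loop over `urls_from_event(event)` instead of reading a single `url` key.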
Project Setup
mkdir lambda-scraper && cd lambda-scraper
python3 -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4
The Scraper Function
# lambda_function.py
import json
from datetime import datetime, timezone

import boto3  # preinstalled in the Lambda Python runtime
import requests
from bs4 import BeautifulSoup


def lambda_handler(event, context):
    """Lambda entry point for scraping."""
    url = event.get("url", "https://news.ycombinator.com")

    # Scrape the page; raise on 4xx/5xx so an error page is never parsed as data
    response = requests.get(url, timeout=25, headers={
        "User-Agent": "Mozilla/5.0 (compatible; LambdaScraper/1.0)"
    })
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract the title and link of each story
    items = []
    for item in soup.select(".titleline > a"):
        items.append({
            "title": item.get_text(),
            "url": item.get("href", ""),
        })

    # Save to S3 under a UTC-timestamped key
    s3 = boto3.client("s3")
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    s3.put_object(
        Bucket="my-scraper-data",
        Key=f"hn/{timestamp}.json",
        Body=json.dumps(items),
        ContentType="application/json",
    )

    return {
        "statusCode": 200,
        "body": json.dumps({
            "message": f"Scraped {len(items)} items",
            "timestamp": timestamp,
        }),
    }
Packaging and Deploying
Package your dependencies into a zip file:
# Install dependencies into a package directory
pip install requests beautifulsoup4 -t package/
cd package && zip -r ../deployment.zip . && cd ..
zip deployment.zip lambda_function.py
# Deploy with AWS CLI
aws lambda create-function \
--function-name web-scraper \
--runtime python3.12 \
--handler lambda_function.lambda_handler \
--zip-file fileb://deployment.zip \
--role arn:aws:iam::123456789012:role/lambda-scraper-role \
--timeout 300 \
--memory-size 256
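Once the function is deployed, a quick smoke test is to invoke it synchronously and read the result. The sketch below assumes a boto3 Lambda client is passed in (for example `boto3.client("lambda")`); the helper name `invoke_scraper` is illustrative.

```python
import json


def invoke_scraper(lambda_client, url, function_name="web-scraper"):
    """Invoke the deployed function synchronously and return its decoded result."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to finish
        Payload=json.dumps({"url": url}).encode("utf-8"),
    )
    # "Payload" is a streaming body, so read it fully before decoding.
    result = json.loads(response["Payload"].read())
    # Lambda sets "FunctionError" when the handler raised an exception.
    if "FunctionError" in response:
        raise RuntimeError(f"Scraper failed: {result}")
    return result
```

Calling `invoke_scraper(boto3.client("lambda"), "https://news.ycombinator.com")` should return the dictionary built at the end of `lambda_handler`, provided your credentials allow `lambda:InvokeFunction`.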
Scheduling with EventBridge
Run your scraper on a schedule using EventBridge (formerly CloudWatch Events):
# Create a rule that runs every hour
aws events put-rule \
--name scraper-hourly \
--schedule-expression "rate(1 hour)"
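# Add the Lambda function as a target
aws events put-targets \
--rule scraper-hourly \
--targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:web-scraper"
# Allow EventBridge to invoke the function (without this, the rule fires but the invocation is denied)
aws lambda add-permission \
--function-name web-scraper \
--statement-id scraper-hourly-invoke \
--action lambda:InvokeFunction \
--principal events.amazonaws.com \
--source-arn arn:aws:events:us-east-1:123456789012:rule/scraper-hourly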
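A scheduled event carries no `url` key, so the handler above falls back to its default URL. If a rule should scrape a specific page, the target can send constant JSON instead. The following is a sketch using boto3's `put_targets`; the helper names are illustrative, and the EventBridge client (for example `boto3.client("events")`) is passed in by the caller.

```python
import json


def build_target(function_arn, url, target_id="1"):
    """Build an EventBridge target that sends {"url": ...} as the Lambda event."""
    return {
        "Id": target_id,
        "Arn": function_arn,
        # "Input" replaces the default scheduled-event payload with constant JSON.
        "Input": json.dumps({"url": url}),
    }


def schedule_url(events_client, rule_name, function_arn, url):
    """Attach the function to an existing rule with a fixed URL payload."""
    return events_client.put_targets(
        Rule=rule_name,
        Targets=[build_target(function_arn, url)],
    )
```

With this in place, one function can serve several rules, each passing a different URL.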
Using Lambda Layers for Dependencies
For larger dependency sets, use Lambda layers:
# Build a layer
mkdir -p layer/python
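# lxml ships compiled code, so request Linux wheels that match the Lambda runtime
pip install requests beautifulsoup4 lxml -t layer/python/ \
--platform manylinux2014_x86_64 --only-binary=:all: --python-version 3.12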
cd layer && zip -r ../scraper-layer.zip . && cd ..
aws lambda publish-layer-version \
--layer-name scraper-deps \
--zip-file fileb://scraper-layer.zip \
--compatible-runtimes python3.12
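Publishing a layer does not attach it to any function; the function configuration has to reference the layer version ARN. Below is a minimal sketch of both steps with boto3, assuming a Lambda client is passed in and the zip has already been read into bytes. The helper name `publish_and_attach_layer` is illustrative.

```python
def publish_and_attach_layer(lambda_client, zip_bytes, function_name="web-scraper",
                             layer_name="scraper-deps", runtime="python3.12"):
    """Publish a new layer version and point the function at it."""
    layer = lambda_client.publish_layer_version(
        LayerName=layer_name,
        Content={"ZipFile": zip_bytes},
        CompatibleRuntimes=[runtime],
    )
    layer_arn = layer["LayerVersionArn"]
    # "Layers" replaces the function's whole layer list, so include every
    # layer the function should keep, not only the new one.
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Layers=[layer_arn],
    )
    return layer_arn
```

The bytes can come from `open("scraper-layer.zip", "rb").read()`. With the dependencies in a layer, the deployment zip only needs `lambda_function.py`.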
Lambda Limitations for Scraping
| Limit | Value | Impact |
|---|---|---|
| Max execution time | 15 minutes | Cannot run long crawls |
| Package size (zip, including layers) | 250 MB unzipped | No room for Chromium |
| Temp storage (/tmp) | 512 MB by default (configurable up to 10 GB) | Limited data storage |
| Outbound IPs | AWS-owned | May be blocked |
For the IP limitation, route requests through ScraperAPI to get rotating residential proxies within your Lambda function.
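As a rough illustration of the proxy approach, the sketch below wraps a target URL in a proxy API request. It assumes the service exposes a GET endpoint that takes `api_key` and `url` query parameters; the endpoint value and the `SCRAPERAPI_KEY` environment variable name are assumptions here, so check them against the provider's current documentation before use.

```python
import os
from urllib.parse import urlencode

# Assumed endpoint; confirm against the provider's documentation.
SCRAPERAPI_ENDPOINT = "https://api.scraperapi.com/"


def proxied_url(target_url, api_key=None):
    """Return a URL that asks the proxy API to fetch target_url on our behalf."""
    # Hypothetical variable name; set it in the function's environment settings.
    api_key = api_key or os.environ["SCRAPERAPI_KEY"]
    # urlencode percent-escapes the target URL so its own query string survives.
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"{SCRAPERAPI_ENDPOINT}?{query}"
```

In the handler, `requests.get(proxied_url(url), timeout=25)` would take the place of the direct request. Keeping the key in an environment variable avoids hard-coding it in the deployment package.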
Tips
- Set the function timeout to at least 60 seconds for scraping; the Lambda default is 3 seconds, shorter than a single slow request
- Use /tmp for temporary file storage during execution
- Monitor with CloudWatch Logs and set alarms on errors
- Use Lambda Powertools for structured logging
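Related to the timeout tip and the 15-minute limit, a handler that works through several URLs can check how much time is left before starting each one. The sketch below relies on `get_remaining_time_in_millis()`, which the Lambda context object provides. The helper name, the `scrape_one` callback, and the 30-second margin (chosen to exceed the 25-second request timeout) are illustrative assumptions.

```python
def scrape_within_budget(urls, scrape_one, context, safety_margin_ms=30_000):
    """Scrape URLs until the invocation is close to timing out.

    Returns (results, remaining_urls) so the caller can re-queue what is left.
    """
    results = []
    pending = list(urls)
    while pending:
        # Stop before starting a request that might not finish in time.
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            break
        results.append(scrape_one(pending.pop(0)))
    return results, pending
```

Any URLs left in the second return value could be sent to a queue or passed to a follow-up invocation rather than lost when the function times out.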