Serverless Scrapers on AWS Lambda
When scrape workloads are bursty and you don't want idle infrastructure, serverless can be cheap and clean. This lesson covers where Lambda shines for scraping, and the hard limits you need to know.
What you’ll learn
- Identify the scraping workloads where Lambda is the right fit.
- Package and deploy a Lambda scraper, including dependencies and browser binaries.
- Recognize the 15-minute timeout and 10GB memory limits.
Serverless can be magical for bursty scraping: you pay only when scraping; the platform auto-scales. It can also be a frustrating choice if your workload doesn't fit Lambda's shape. This lesson tells you which case you're in.
When Lambda fits
| Workload | Fits? | Why |
|---|---|---|
| 1000 short-lived scrapes triggered by SQS | Yes | Per-event invocation, parallel by default |
| Single long scrape over 12 hours | No | 15-min max execution |
| Tiny scrape, runs every minute | Maybe | Cold starts may dominate |
| Headless Chrome on every invocation | Awkward | Browser binary is big; cold start slow |
| Scraping ~100k items per hour, embarrassingly parallel | Yes | Horizontal scale, no infra |
Rule of thumb: lots of small, independent scrapes triggered by events → Lambda. One long scrape → a VM or container.
The hard limits
| Limit | Value |
|---|---|
| Max execution time | 15 minutes (900s) |
| Max memory | 10240 MB |
| Max ephemeral disk (/tmp) | 10240 MB |
| Max deployment package | 50 MB zipped / 250 MB unzipped |
| Max container image | 10 GB |
| Concurrency (default) | 1000 concurrent invocations per region |
The 15-minute timeout is the one that bites scrapers most often.
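The timeout also matters mid-run: a handler can check how much time it has left via the context object's `get_remaining_time_in_millis()` and stop cleanly, re-queueing unfinished work, instead of being killed partway through a batch. A minimal sketch, with `FakeContext` standing in for the real Lambda context (which exposes the same method):

```python
# Sketch: stop scraping before Lambda kills the function mid-batch.
# FakeContext stands in for the real Lambda context object, which
# exposes the same get_remaining_time_in_millis() method.

SAFETY_MARGIN_MS = 30_000  # stop 30s before the hard timeout


class FakeContext:
    def __init__(self, remaining_ms):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self._remaining_ms


def scrape_batch(urls, context, scrape_one):
    """Scrape until the deadline approaches; return leftovers to re-queue."""
    done = []
    for i, url in enumerate(urls):
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            return done, urls[i:]  # leftover URLs go back to SQS
        done.append(scrape_one(url))
    return done, []


# With only 10s left, nothing gets scraped; everything is returned.
done, leftover = scrape_batch(["a", "b", "c"], FakeContext(10_000), str.upper)
```

Pushing the leftovers back onto the queue turns the 15-minute wall into a soft checkpoint instead of a data-loss event.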
Basic Lambda scraper
handler.py:

```python
import json

import boto3
import httpx

s3 = boto3.client("s3")


def parse(html: str) -> list[dict]:
    """Extract items from the page. Replace with your real parser."""
    ...


def handler(event, context):
    """Triggered by an SQS batch; each message body contains a URL."""
    results = []
    for record in event["Records"]:
        body = json.loads(record["body"])
        url = body["url"]
        try:
            resp = httpx.get(url, timeout=10)
            resp.raise_for_status()
        except httpx.HTTPError as e:
            # Record the failure and keep going; one bad URL
            # shouldn't abort the rest of the batch.
            results.append({"url": url, "status": "error", "error": str(e)})
            continue
        # Process and persist. Key on the message ID, not
        # context.aws_request_id, which is shared by the whole batch.
        items = parse(resp.text)
        s3.put_object(
            Bucket="scraper-output",
            Key=f"items/{record['messageId']}.json",
            Body=json.dumps(items),
        )
        results.append({"url": url, "status": "ok", "items": len(items)})
    return {"results": results}
```
Trigger: an SQS queue. Lambda reads messages in batches (typically up to 10) and processes batches in parallel. For thousands of URLs, push them all to SQS and Lambda fans out automatically.
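The producer side of that fan-out is just a loop over `send_message_batch`, which accepts at most 10 messages per call. A sketch, assuming a placeholder queue URL (boto3 is imported lazily so the helpers can be used without AWS credentials):

```python
# Sketch: fan out thousands of URLs by pushing them to SQS in batches.
# The queue URL below is a placeholder; send_message_batch accepts at
# most 10 messages per call, so we chunk first.
import json


def chunked(seq, size=10):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


def to_entries(urls):
    """Build SQS batch entries; Id must be unique within one batch."""
    return [{"Id": str(i), "MessageBody": json.dumps({"url": u})}
            for i, u in enumerate(urls)]


def enqueue(urls, queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/scrape-queue"):
    import boto3  # lazy import: only needed when actually sending
    sqs = boto3.client("sqs")
    for batch in chunked(urls):
        sqs.send_message_batch(QueueUrl=queue_url, Entries=to_entries(batch))
```

From here, the handler above drains the queue; concurrency is governed by the event-source mapping and your account's concurrency limit.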
Packaging
Two options:
- Zip with dependencies (up to 50 MB zipped, 250 MB unzipped).

  ```bash
  pip install -r requirements.txt -t package/
  cp handler.py package/
  cd package && zip -r ../lambda.zip . && cd ..
  ```
- Container image (up to 10 GB). Recommended for anything non-trivial.

  ```dockerfile
  FROM public.ecr.aws/lambda/python:3.12
  COPY requirements.txt .
  RUN pip install -r requirements.txt
  COPY handler.py ${LAMBDA_TASK_ROOT}
  CMD ["handler.handler"]
  ```
Push to ECR, deploy.
For complex scrapers, especially with Playwright, container images are the only practical option.
Playwright on Lambda
Chromium is 200+ MB. With Playwright's Python bindings and Node dependencies, you blow past the 250 MB unzipped limit. A container image is required.
```dockerfile
FROM public.ecr.aws/lambda/python:3.12
RUN dnf install -y atk cups-libs gtk3 libxkbcommon mesa-libgbm ... # browser deps
COPY requirements.txt .
RUN pip install -r requirements.txt
# Install to a fixed path; the default lives under $HOME, which differs
# at Lambda runtime. Skip --with-deps: it expects apt, not AL2023's dnf,
# and the dnf line above already covers the system libraries.
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
RUN playwright install chromium
COPY handler.py ${LAMBDA_TASK_ROOT}
CMD ["handler.handler"]
```
Alternative: chrome-aws-lambda or @sparticuz/chromium (the maintained fork), both Node packages providing minimal Chromium builds made specifically for Lambda. Many cloud-scraping tutorials use these because they save 100+ MB and start faster.
Cold start with Playwright is 3–10 seconds; warm invocations are fast. Mind the cost on short scrapes, where cold starts can dominate.
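The warm-invocation speedup comes from containers being reused: anything initialized at module scope survives between invocations. The pattern below sketches that reuse with a stand-in object instead of a real Playwright browser launch, so the mechanics are visible without Chromium installed:

```python
# Sketch of the warm-container reuse pattern: initialize the expensive
# resource (e.g. a Playwright browser) once, lazily, at module scope,
# and reuse it across warm invocations. The object() stand-in replaces
# the real browser launch, which costs 3-10s with actual Chromium.

launch_count = 0
_browser = None


def get_browser():
    """Launch on first call (cold start); reuse afterwards (warm)."""
    global _browser, launch_count
    if _browser is None:
        launch_count += 1      # expensive step: real browser launch
        _browser = object()    # stand-in for the launched browser
    return _browser


def handler(event, context):
    browser = get_browser()    # cheap on warm invocations
    return {"warm": browser is _browser and launch_count == 1}


# Two invocations in the same container: only one "launch".
handler({}, None)
handler({}, None)
```

With a real browser you would also want to close and relaunch it if a scrape crashes the process, since a corrupted browser otherwise poisons every subsequent warm invocation.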
Cold-start mitigations
- Provisioned Concurrency: pre-warmed instances that are always ready. Costs money even when idle, but eliminates cold starts.
- Periodic warmer: a CloudWatch (EventBridge) rule that invokes the function every 5 minutes to keep it warm. Crude but cheap.
- Smaller deployment artifact: every MB cuts cold-start time. Strip unused libs; use slim base images.
- ARM64 (Graviton2): typically ~20% faster cold starts on Python.
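The periodic-warmer trick needs the handler to recognize warmer pings and return immediately, before doing any scraping. A sketch; the `{"warmer": true}` event shape is just a convention I'm assuming between the schedule rule and the function, not anything Lambda defines:

```python
def handler(event, context):
    # Short-circuit warmer pings. The {"warmer": true} shape is an
    # assumed convention shared with the EventBridge schedule rule,
    # not a Lambda standard.
    if isinstance(event, dict) and event.get("warmer"):
        return {"warmed": True}
    # ... real scraping work would go here ...
    return {"status": "ok"}
```

One caveat: a single ping keeps exactly one container warm; it does nothing for the second through Nth concurrent instances.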
Cost model
Lambda charges:
- Per millisecond of execution, scaled by memory size.
- Per request.
Concrete example: a 1024 MB function running 5 seconds, 1 million invocations/month:
- Compute: 1M × 5 s × 1 GB = 5M GB-seconds × $0.0000166667/GB-s ≈ $83
- Requests: 1M × $0.20 per million = $0.20
- Total: ~$83/month
Compared to a ~$30/month EC2 instance handling the same throughput, Lambda is more expensive for steady load. Lambda wins on bursty, spiky, per-event workloads.
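The arithmetic above generalizes into a small estimator. The constants are the us-east-1 x86 list prices at the time of writing and ignore the free tier, so treat the output as a ballpark:

```python
# Sketch: monthly Lambda cost estimate. Constants are us-east-1 x86
# list prices at time of writing ($0.0000166667 per GB-second, $0.20
# per million requests); the free tier is ignored.
GB_SECOND_PRICE = 0.0000166667
REQUEST_PRICE_PER_MILLION = 0.20


def monthly_cost(memory_mb, seconds_per_run, invocations):
    """Estimated monthly bill in USD for one Lambda function."""
    gb_seconds = invocations * seconds_per_run * (memory_mb / 1024)
    compute = gb_seconds * GB_SECOND_PRICE
    requests = (invocations / 1_000_000) * REQUEST_PRICE_PER_MILLION
    return compute + requests


# The lesson's example: 1024 MB, 5s per run, 1M invocations/month.
print(round(monthly_cost(1024, 5, 1_000_000), 2))  # → 83.53
```

Plugging in your own numbers makes the Lambda-vs-EC2 crossover point easy to find: halve the memory or the runtime and the compute line halves with it.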
Step Functions for orchestration
If a single scrape needs to fan out (discover URLs → scrape each → aggregate), Step Functions can chain Lambda invocations:
```json
{
  "Comment": "Scrape Catalog108",
  "StartAt": "Discover",
  "States": {
    "Discover": {"Type": "Task", "Resource": "arn:...:DiscoverFn", "Next": "ScrapeAll"},
    "ScrapeAll": {
      "Type": "Map",
      "ItemsPath": "$.urls",
      "MaxConcurrency": 50,
      "Iterator": {
        "StartAt": "ScrapeOne",
        "States": {"ScrapeOne": {"Type": "Task", "Resource": "arn:...:ScrapeFn", "End": true}}
      },
      "Next": "Aggregate"
    },
    "Aggregate": {"Type": "Task", "Resource": "arn:...:AggregateFn", "End": true}
  }
}
```
Map state fans out, runs parallel scrapes, then aggregates, without you writing the orchestration.
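Kicking off the state machine from code is one boto3 call; the input document must carry `urls` because that's the path the Map state's `ItemsPath` reads. A sketch, with a placeholder state machine ARN (boto3 imported lazily so the input builder works standalone):

```python
# Sketch: start the Step Functions execution. The state machine ARN
# is a placeholder; the input must contain "urls" because ItemsPath
# in the Map state reads $.urls.
import json


def build_input(urls):
    """Serialize the state-machine input document."""
    return json.dumps({"urls": urls})


def start_scrape(urls, state_machine_arn="arn:aws:states:us-east-1:123456789012:stateMachine:Scraper"):
    import boto3  # lazy import: only needed when actually starting a run
    sfn = boto3.client("stepfunctions")
    return sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=build_input(urls),
    )
```

`start_execution` returns immediately with an execution ARN; the fan-out then runs entirely inside Step Functions.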
When Lambda is the wrong call
- Long, sustained scrapes: 15-min hard limit.
- High steady throughput: per-invocation cost adds up.
- Workloads needing persistent state in memory: Lambda is stateless per invocation.
- Sub-100 ms latency budgets: cold starts will violate them.
For long-lived scrapers, prefer ECS Fargate or EC2.
PHP on Lambda (Bref)
PHP runs on Lambda via Bref. For Symfony console scrapers:
```yaml
# serverless.yml
service: scraper

provider:
  name: aws
  runtime: provided.al2023

functions:
  scrape:
    handler: bin/console
    layers:
      - ${bref:layer.php-83}
      - ${bref:layer.console}
    events:
      - sqs: arn:aws:sqs:...:scrape-queue
```
Bref translates Lambda events into Symfony Console invocations. Works but is less idiomatic than Python Lambda.
What to try
Build a Lambda that scrapes one Catalog108 product URL given an SQS message. Push 1000 messages. Watch Lambda fan out to ~100 concurrent invocations and process all 1000 in seconds. Then try the same with Playwright and observe the cold-start tax.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.