Serverless Scrapers on AWS Lambda
When scrape workloads are bursty and you don't want idle infrastructure, serverless can be cheap and clean. This lesson covers where Lambda shines for scraping, and the hard limits you need to know.
What you’ll learn
- Identify the scraping workloads where Lambda is the right fit.
- Package and deploy a Lambda scraper, including dependencies and browser binaries.
- Recognize the 15-minute timeout and 10GB memory limits.
Serverless can be magical for bursty scraping: you pay only when scraping; the platform auto-scales. It can also be a frustrating choice if your workload doesn't fit Lambda's shape. This lesson tells you which case you're in.
When Lambda fits
| Workload | Fits? | Why |
|---|---|---|
| 1000 short-lived scrapes triggered by SQS | Yes | Per-event invocation, parallel by default |
| Single long scrape over 12 hours | No | 15-min max execution |
| Tiny scrape, runs every minute | Maybe | Cold starts may dominate |
| Headless Chrome on every invocation | Awkward | Browser binary is big; cold start slow |
| Scraping ~100k items per hour, embarrassingly parallel | Yes | Horizontal scale, no infra |
Rule of thumb: lots of small, independent scrapes triggered by events → Lambda. One long scrape → a VM or container.
The hard limits
| Limit | Value |
|---|---|
| Max execution time | 15 minutes (900s) |
| Max memory | 10240 MB |
| Max ephemeral disk (/tmp) | 10240 MB |
| Max deployment package | 50 MB zipped / 250 MB unzipped |
| Max container image | 10 GB |
| Concurrency (default) | 1000 concurrent invocations per region |
The 15-minute timeout is the one that bites scrapers most often.
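The timeout also matters mid-run: a handler can check how much time it has left via the context object's `get_remaining_time_in_millis()` and stop cleanly, re-queueing unfinished work, instead of being killed partway through a batch. A minimal sketch, with `FakeContext` standing in for the real Lambda context (which exposes the same method):

```python
# Sketch: stop scraping before Lambda kills the function mid-batch.
# FakeContext stands in for the real Lambda context object, which
# exposes the same get_remaining_time_in_millis() method.

SAFETY_MARGIN_MS = 30_000  # stop 30s before the hard timeout


class FakeContext:
    def __init__(self, remaining_ms):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self._remaining_ms


def scrape_batch(urls, context, scrape_one):
    """Scrape until the deadline approaches; return leftovers to re-queue."""
    done = []
    for i, url in enumerate(urls):
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            return done, urls[i:]  # leftover URLs go back to SQS
        done.append(scrape_one(url))
    return done, []


# With only 10s left, nothing gets scraped; everything is returned.
done, leftover = scrape_batch(["a", "b", "c"], FakeContext(10_000), str.upper)
```

Pushing the leftovers back onto the queue turns the 15-minute wall into a soft checkpoint instead of a data-loss event.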
Basic Lambda scraper
handler.py:

```python
import json

import boto3
import httpx

s3 = boto3.client("s3")


def parse(html: str) -> list[dict]:
    """Extract items from the page. Replace with your real parser."""
    ...


def handler(event, context):
    """Triggered by an SQS batch; each message body contains a URL."""
    results = []
    for record in event["Records"]:
        body = json.loads(record["body"])
        url = body["url"]
        try:
            resp = httpx.get(url, timeout=10)
            resp.raise_for_status()
        except httpx.HTTPError as e:
            # Record the failure and keep going; one bad URL
            # shouldn't abort the rest of the batch.
            results.append({"url": url, "status": "error", "error": str(e)})
            continue
        # Process and persist. Key on the message ID, not
        # context.aws_request_id, which is shared by the whole batch.
        items = parse(resp.text)
        s3.put_object(
            Bucket="scraper-output",
            Key=f"items/{record['messageId']}.json",
            Body=json.dumps(items),
        )
        results.append({"url": url, "status": "ok", "items": len(items)})
    return {"results": results}
```
Trigger: an SQS queue. Lambda reads messages in batches (typically up to 10) and processes batches in parallel. For thousands of URLs, push them all to SQS and Lambda fans out automatically.
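The producer side of that fan-out is just a loop over `send_message_batch`, which accepts at most 10 messages per call. A sketch, assuming a placeholder queue URL (boto3 is imported lazily so the helpers can be used without AWS credentials):

```python
# Sketch: fan out thousands of URLs by pushing them to SQS in batches.
# The queue URL below is a placeholder; send_message_batch accepts at
# most 10 messages per call, so we chunk first.
import json


def chunked(seq, size=10):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


def to_entries(urls):
    """Build SQS batch entries; Id must be unique within one batch."""
    return [{"Id": str(i), "MessageBody": json.dumps({"url": u})}
            for i, u in enumerate(urls)]


def enqueue(urls, queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/scrape-queue"):
    import boto3  # lazy import: only needed when actually sending
    sqs = boto3.client("sqs")
    for batch in chunked(urls):
        sqs.send_message_batch(QueueUrl=queue_url, Entries=to_entries(batch))
```

From here, the handler above drains the queue; concurrency is governed by the event-source mapping and your account's concurrency limit.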
Packaging
Two options:
- Zip with dependencies (up to 50 MB zipped, 250 MB unzipped).

  ```bash
  pip install -r requirements.txt -t package/
  cp handler.py package/
  cd package && zip -r ../lambda.zip . && cd ..
  ```
- Container image (up to 10 GB). Recommended for anything non-trivial.

  ```dockerfile
  FROM public.ecr.aws/lambda/python:3.12
  COPY requirements.txt .
  RUN pip install -r requirements.txt
  COPY handler.py ${LAMBDA_TASK_ROOT}
  CMD ["handler.handler"]
  ```
Push to ECR, deploy.
For complex scrapers, especially with Playwright, container images are the only practical option.
Playwright on Lambda
Chromium is 200+ MB. With Playwright's Python bindings and Node dependencies, you blow past the 250 MB unzipped limit. A container image is required.
```dockerfile
FROM public.ecr.aws/lambda/python:3.12
RUN dnf install -y atk cups-libs gtk3 libxkbcommon mesa-libgbm ... # browser deps
COPY requirements.txt .
RUN pip install -r requirements.txt
# Install to a fixed path; the default lives under $HOME, which differs
# at Lambda runtime. Skip --with-deps: it expects apt, not AL2023's dnf,
# and the dnf line above already covers the system libraries.
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
RUN playwright install chromium
COPY handler.py ${LAMBDA_TASK_ROOT}
CMD ["handler.handler"]
```
Alternative: chrome-aws-lambda or @sparticuz/chromium (the maintained fork), both Node packages providing minimal Chromium builds made specifically for Lambda. Many cloud-scraping tutorials use these because they save 100+ MB and start faster.
Cold start with Playwright is 3–10 seconds; warm invocations are fast. Mind the cost on short scrapes, where cold starts can dominate.
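The warm-invocation speedup comes from containers being reused: anything initialized at module scope survives between invocations. The pattern below sketches that reuse with a stand-in object instead of a real Playwright browser launch, so the mechanics are visible without Chromium installed:

```python
# Sketch of the warm-container reuse pattern: initialize the expensive
# resource (e.g. a Playwright browser) once, lazily, at module scope,
# and reuse it across warm invocations. The object() stand-in replaces
# the real browser launch, which costs 3-10s with actual Chromium.

launch_count = 0
_browser = None


def get_browser():
    """Launch on first call (cold start); reuse afterwards (warm)."""
    global _browser, launch_count
    if _browser is None:
        launch_count += 1      # expensive step: real browser launch
        _browser = object()    # stand-in for the launched browser
    return _browser


def handler(event, context):
    browser = get_browser()    # cheap on warm invocations
    return {"warm": browser is _browser and launch_count == 1}


# Two invocations in the same container: only one "launch".
handler({}, None)
handler({}, None)
```

With a real browser you would also want to close and relaunch it if a scrape crashes the process, since a corrupted browser otherwise poisons every subsequent warm invocation.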
Cold-start mitigations
- Provisioned Concurrency: pre-warmed instances that are always ready. Costs money even when idle, but eliminates cold starts.
- Periodic warmer: a CloudWatch (EventBridge) rule that invokes the function every 5 minutes to keep it warm. Crude but cheap.
- Smaller deployment artifact: every MB cuts cold-start time. Strip unused libs; use slim base images.
- ARM64 (Graviton2): typically ~20% faster cold starts on Python.
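The periodic-warmer trick needs the handler to recognize warmer pings and return immediately, before doing any scraping. A sketch; the `{"warmer": true}` event shape is just a convention I'm assuming between the schedule rule and the function, not anything Lambda defines:

```python
def handler(event, context):
    # Short-circuit warmer pings. The {"warmer": true} shape is an
    # assumed convention shared with the EventBridge schedule rule,
    # not a Lambda standard.
    if isinstance(event, dict) and event.get("warmer"):
        return {"warmed": True}
    # ... real scraping work would go here ...
    return {"status": "ok"}
```

One caveat: a single ping keeps exactly one container warm; it does nothing for the second through Nth concurrent instances.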
Cost model
Lambda charges:
- Per millisecond of execution, scaled by memory size.
- Per request.
Concrete example: a 1024 MB function running 5 seconds, 1 million invocations/month:
- Compute: 1M × 5 s × 1 GB = 5M GB-seconds × $0.0000166667/GB-s ≈ $83
- Requests: 1M × $0.20 per million = $0.20
- Total: ~$83/month
Compared to a ~$30/month EC2 instance handling the same throughput, Lambda is more expensive for steady load. Lambda wins on bursty, spiky, per-event workloads.
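The arithmetic above generalizes into a small estimator. The constants are the us-east-1 x86 list prices at the time of writing and ignore the free tier, so treat the output as a ballpark:

```python
# Sketch: monthly Lambda cost estimate. Constants are us-east-1 x86
# list prices at time of writing ($0.0000166667 per GB-second, $0.20
# per million requests); the free tier is ignored.
GB_SECOND_PRICE = 0.0000166667
REQUEST_PRICE_PER_MILLION = 0.20


def monthly_cost(memory_mb, seconds_per_run, invocations):
    """Estimated monthly bill in USD for one Lambda function."""
    gb_seconds = invocations * seconds_per_run * (memory_mb / 1024)
    compute = gb_seconds * GB_SECOND_PRICE
    requests = (invocations / 1_000_000) * REQUEST_PRICE_PER_MILLION
    return compute + requests


# The lesson's example: 1024 MB, 5s per run, 1M invocations/month.
print(round(monthly_cost(1024, 5, 1_000_000), 2))  # → 83.53
```

Plugging in your own numbers makes the Lambda-vs-EC2 crossover point easy to find: halve the memory or the runtime and the compute line halves with it.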
Step Functions for orchestration
If a single scrape needs to fan out (discover URLs → scrape each → aggregate), Step Functions can chain Lambda invocations:
```json
{
  "Comment": "Scrape Catalog108",
  "StartAt": "Discover",
  "States": {
    "Discover": {"Type": "Task", "Resource": "arn:...:DiscoverFn", "Next": "ScrapeAll"},
    "ScrapeAll": {
      "Type": "Map",
      "ItemsPath": "$.urls",
      "MaxConcurrency": 50,
      "Iterator": {
        "StartAt": "ScrapeOne",
        "States": {"ScrapeOne": {"Type": "Task", "Resource": "arn:...:ScrapeFn", "End": true}}
      },
      "Next": "Aggregate"
    },
    "Aggregate": {"Type": "Task", "Resource": "arn:...:AggregateFn", "End": true}
  }
}
```
Map state fans out, runs parallel scrapes, then aggregates, without you writing the orchestration.
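Kicking off the state machine from code is one boto3 call; the input document must carry `urls` because that's the path the Map state's `ItemsPath` reads. A sketch, with a placeholder state machine ARN (boto3 imported lazily so the input builder works standalone):

```python
# Sketch: start the Step Functions execution. The state machine ARN
# is a placeholder; the input must contain "urls" because ItemsPath
# in the Map state reads $.urls.
import json


def build_input(urls):
    """Serialize the state-machine input document."""
    return json.dumps({"urls": urls})


def start_scrape(urls, state_machine_arn="arn:aws:states:us-east-1:123456789012:stateMachine:Scraper"):
    import boto3  # lazy import: only needed when actually starting a run
    sfn = boto3.client("stepfunctions")
    return sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=build_input(urls),
    )
```

`start_execution` returns immediately with an execution ARN; the fan-out then runs entirely inside Step Functions.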
When Lambda is the wrong call
- Long, sustained scrapes: 15-min hard limit.
- High steady throughput: per-invocation cost adds up.
- Workloads needing persistent state in memory: Lambda is stateless per invocation.
- Sub-100 ms latency budgets: cold starts will violate them.
For long-lived scrapers, prefer ECS Fargate or EC2.
PHP on Lambda (Bref)
PHP runs on Lambda via Bref. For Symfony console scrapers:
```yaml
# serverless.yml
service: scraper

provider:
  name: aws
  runtime: provided.al2023

functions:
  scrape:
    handler: bin/console
    layers:
      - ${bref:layer.php-83}
      - ${bref:layer.console}
    events:
      - sqs: arn:aws:sqs:...:scrape-queue
```
Bref translates Lambda events into Symfony Console invocations. Works but is less idiomatic than Python Lambda.
What to try
Build a Lambda that scrapes one Catalog108 product URL given an SQS message. Push 1000 messages. Watch Lambda fan out to ~100 concurrent invocations and process all 1000 in seconds. Then try the same with Playwright and observe the cold-start tax.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.