
Advanced · 5 min read

Serverless Scrapers on AWS Lambda

When scrape workloads are bursty and you don't want idle infra, serverless can be cheap and clean. Where Lambda shines for scraping, and the hard limits to know.

What you’ll learn

  • Identify the scraping workloads where Lambda is the right fit.
  • Package and deploy a Lambda scraper, including dependencies and browser binaries.
  • Recognize the 15-minute timeout and 10GB memory limits.

Serverless can be magical for bursty scraping: you pay only when scraping; the platform auto-scales. It can also be a frustrating choice if your workload doesn't fit Lambda's shape. This lesson tells you which case you're in.

When Lambda fits

Workload | Fits? | Why
1000 short-lived scrapes triggered by SQS | Yes | Per-event invocation, parallel by default
Single long scrape over 12 hours | No | 15-min max execution
Tiny scrape, runs every minute | Maybe | Cold starts may dominate
Headless Chrome on every invocation | Awkward | Browser binary is big; cold start slow
Scraping ~100k items per hour, embarrassingly parallel | Yes | Horizontal scale, no infra

Rule of thumb: for lots of small, independent scrapes triggered by events, use Lambda; for one long scrape, use a VM or container.

The hard limits

Limit | Value
Max execution time | 15 minutes (900 s)
Max memory | 10,240 MB (10 GB)
Max ephemeral disk (/tmp) | 10,240 MB (10 GB)
Max deployment package | 50 MB zipped (direct upload), 250 MB unzipped
Max container image | 10 GB
Concurrency (default) | 1,000 concurrent invocations per region

The 15-minute timeout is the one that bites scrapers most often.
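
You can't raise the limit, but a handler can watch the clock and stop cleanly before hitting it. A minimal sketch, assuming the work arrives as a list of URLs and unfinished ones get pushed back onto an SQS queue (the queue URL and the scrape() helper are illustrative, not part of the lesson's code):

import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/scrape-queue"  # illustrative


def handler(event, context):
    urls = list(event["urls"])
    scraped = 0
    while urls:
        # Bail out ~30 s before the 15-minute limit and re-enqueue the rest.
        if context.get_remaining_time_in_millis() < 30_000:
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"urls": urls}))
            break
        scrape(urls.pop())  # scrape() = your own fetch-and-persist function (assumed)
        scraped += 1
    return {"scraped": scraped, "requeued": len(urls)}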

Basic Lambda scraper

handler.py:

import json

import boto3
import httpx

s3 = boto3.client("s3")


def handler(event, context):
    """Triggered by SQS messages, each carrying one URL to scrape."""
    results = []
    for record in event["Records"]:
        body = json.loads(record["body"])
        url = body["url"]
        try:
            resp = httpx.get(url, timeout=10)
            resp.raise_for_status()
        except httpx.HTTPError as e:
            results.append({"url": url, "status": "error", "error": str(e)})
            continue

        # Process and persist; parse() is your site-specific extraction function
        items = parse(resp.text)
        s3.put_object(
            Bucket="scraper-output",
            Key=f"items/{context.aws_request_id}-{record['messageId']}.json",
            Body=json.dumps(items),
        )
        results.append({"url": url, "status": "ok", "items": len(items)})
    return {"results": results}

Trigger: an SQS queue. Lambda reads messages in batches (typically 10) and processes batches in parallel. For thousands of URLs, push them all to SQS and Lambda fans out automatically (see the sketch below).
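
Filling the queue is a few lines of boto3. A sketch assuming a queue named scrape-queue and illustrative product URLs:

import json

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="scrape-queue")["QueueUrl"]  # assumed queue name

urls = [f"https://catalog108.example/product/{i}" for i in range(1000)]  # illustrative

# SQS accepts at most 10 messages per batch request.
for start in range(0, len(urls), 10):
    sqs.send_message_batch(
        QueueUrl=queue_url,
        Entries=[
            {"Id": str(i), "MessageBody": json.dumps({"url": u})}
            for i, u in enumerate(urls[start:start + 10])
        ],
    )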

Packaging

Two options:

  1. Zip with dependencies (up to 250 MB unzipped).
pip install -r requirements.txt -t package/
cp handler.py package/
cd package && zip -r ../lambda.zip . && cd ..
  2. Container image (up to 10 GB). Recommended for anything non-trivial.
FROM public.ecr.aws/lambda/python:3.12
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY handler.py ${LAMBDA_TASK_ROOT}
CMD ["handler.handler"]

Push to ECR, deploy.
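
Once the image is in ECR, pointing the function at the new tag is one API call. A sketch with assumed account, region, repository, and function names:

import boto3

lambda_client = boto3.client("lambda")

# Assumed names; substitute your own account ID, region, repo, and function.
lambda_client.update_function_code(
    FunctionName="scraper",
    ImageUri="123456789012.dkr.ecr.us-east-1.amazonaws.com/scraper:latest",
)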

For complex scrapers, especially with Playwright, container images are the only practical option.

Playwright on Lambda

Chromium alone is 200+ MB. Add Playwright's Python bindings and its bundled Node driver and you blow past the 250 MB unzipped limit. A container image is required.

FROM public.ecr.aws/lambda/python:3.12
RUN dnf install -y atk cups-libs gtk3 libxkbcommon mesa-libgbm ... # browser deps
COPY requirements.txt .
RUN pip install -r requirements.txt
# Install only the browser; its system deps come from dnf above.
# Keep browsers outside $HOME so the Lambda runtime user can find them.
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
RUN playwright install chromium
COPY handler.py ${LAMBDA_TASK_ROOT}
CMD ["handler.handler"]

Alternative for Node-based scrapers: chrome-aws-lambda or @sparticuz/chromium (the maintained fork), minimal Chromium builds made specifically for Lambda. Many cloud-scraping tutorials still use them because they save 100+ MB and start faster.

Cold start with Playwright is 3–10 seconds; warm invocations are fast. Mind the cost on short scrapes, where cold starts can dominate.
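
Most of the warm-path speed comes from launching the browser once per execution environment and reusing it across invocations. A handler sketch for the container image above (URL handling and persistence are left out; the launch flags are common choices for Lambda's sandbox, not requirements from any docs):

from playwright.sync_api import sync_playwright

# Started once per execution environment, then reused on every warm invocation.
_pw = sync_playwright().start()
_browser = _pw.chromium.launch(
    headless=True,
    args=["--no-sandbox", "--disable-dev-shm-usage", "--single-process"],
)


def handler(event, context):
    page = _browser.new_page()
    try:
        page.goto(event["url"], timeout=30_000, wait_until="domcontentloaded")
        return {"url": event["url"], "title": page.title()}
    finally:
        page.close()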

Cold-start mitigations

  • Provisioned Concurrency: pre-warmed instances, always ready. Costs money even when idle, but eliminates cold starts.
  • Periodic warmer: an EventBridge (CloudWatch) rule that invokes the function every 5 minutes to keep it warm. Crude but cheap; sketched below.
  • Smaller deployment artifact: every MB trimmed shortens the cold start. Strip unused libs, slim the base image.
  • ARM64 (Graviton2): typically ~20% faster cold starts on Python.
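
A warmer only helps if the ping returns immediately, before any scraping work. A sketch, assuming the scheduled rule sends a constant payload like {"warmer": true}:

def handler(event, context):
    # Scheduled warmer pings (assumed payload: {"warmer": true}) short-circuit,
    # keeping the execution environment alive without doing any work.
    if isinstance(event, dict) and event.get("warmer"):
        return {"status": "warm"}

    return process_records(event)  # your normal SQS handling (assumed helper)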

Cost model

Lambda charges:

  • Per millisecond of execution, scaled by memory size.
  • Per request.

A concrete example: a 1024 MB (1 GB) function running for 5 seconds, 1 million invocations/month:

  • Compute: 1M × 5 s × 1 GB = 5M GB-seconds ≈ $66 at the arm64 rate (roughly $0.0000133 per GB-second; x86 is ~25% more)
  • Requests: 1M × $0.20/M = $0.20
  • Total: ~$66/month

Compared with a ~$30/month EC2 instance handling the same throughput, Lambda is more expensive for steady load. Lambda wins on bursty, spiky, per-event workloads.
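
For quick what-ifs, the whole model fits in a few lines. The rates below assume us-east-1 arm64 pricing at the time of writing, so check the current price list before relying on the numbers:

# Back-of-envelope Lambda cost model (assumed us-east-1 arm64 rates).
GB_SECOND_RATE = 0.0000133334     # $ per GB-second
REQUEST_RATE = 0.20 / 1_000_000   # $ per request


def monthly_cost(invocations: int, seconds: float, memory_mb: int) -> float:
    gb_seconds = invocations * seconds * (memory_mb / 1024)
    return gb_seconds * GB_SECOND_RATE + invocations * REQUEST_RATE


print(f"${monthly_cost(1_000_000, 5, 1024):.2f}")  # ≈ $66.87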

Step Functions for orchestration

If a single scrape needs to fan out (discover URLs → scrape each → aggregate), Step Functions can chain Lambda invocations:

{
  "Comment": "Scrape catalog108",
  "StartAt": "Discover",
  "States": {
    "Discover": {"Type": "Task", "Resource": "arn:...:DiscoverFn", "Next": "ScrapeAll"},
    "ScrapeAll": {
      "Type": "Map",
      "ItemsPath": "$.urls",
      "MaxConcurrency": 50,
      "Iterator": {
        "StartAt": "ScrapeOne",
        "States": {"ScrapeOne": {"Type": "Task", "Resource": "arn:...:ScrapeFn", "End": true}}
      },
      "Next": "Aggregate"
    },
    "Aggregate": {"Type": "Task", "Resource": "arn:...:AggregateFn", "End": true}
  }
}

The Map state fans out, runs the scrapes in parallel, then aggregates, all without you writing any orchestration code.
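
The only contract on the Lambda side is shape: the Discover function has to return its URL list under the key that ItemsPath points at. A sketch with an assumed discovery helper:

def handler(event, context):
    # Hypothetical discovery step: crawl listing pages, collect product URLs.
    urls = discover_urls(event["start_url"])  # site-specific helper (assumed)
    # The Map state reads this via ItemsPath "$.urls" and runs one ScrapeFn per URL.
    return {"urls": urls}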

When Lambda is the wrong call

  • Long, sustained scrapes: 15-min hard limit.
  • High steady throughput: per-invocation cost adds up.
  • Workloads needing persistent state in memory: Lambda is stateless per invocation.
  • Sub-100 ms latency budgets: cold starts will violate them.

For long-lived scrapers, prefer ECS Fargate or EC2.

PHP on Lambda (Bref)

PHP runs on Lambda via Bref. For Symfony console scrapers:

# serverless.yml
service: scraper

provider:
  name: aws
  runtime: provided.al2023

functions:
  scrape:
    handler: bin/console
    layers:
      - ${bref:layer.php-83}
      - ${bref:layer.console}
    events:
      - sqs: arn:aws:sqs:...:scrape-queue

Bref translates Lambda events into Symfony Console invocations. It works, but it's less idiomatic than Python on Lambda.

What to try

Build a Lambda that scrapes one Catalog108 product URL per SQS message. Push 1,000 messages and watch Lambda fan out to ~100 concurrent invocations, processing all 1,000 in seconds. Then try the same with Playwright and observe the cold-start tax.

Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


What's the AWS Lambda execution time limit that most often forces scraper-architecture changes?
