
Serverless Scraping Architecture

Design a complete serverless web scraping architecture using AWS Lambda, SQS, S3, and DynamoDB with zero servers to manage.


A serverless scraping system uses managed cloud services for every component, so there are no servers to provision, patch, or scale. You pay only for what you use.

Architecture Overview

CloudWatch/Scheduler
        │
        ▼
  Lambda (Orchestrator)
        │
        ▼
    SQS Queue
        │
   ┌────┴────┐
   ▼         ▼
Lambda     Lambda   (Scraper workers)
   │         │
   ▼         ▼
    S3 Bucket (Raw HTML + JSON)
        │
        ▼
  Lambda (Processor)
        │
        ▼
   DynamoDB (Parsed data)

Component 1: The Orchestrator

This Lambda function generates URLs and feeds them into the queue:

# orchestrator.py
import boto3
import json

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/scrape-urls"

def lambda_handler(event, context):
    """Generate URLs and push them to SQS."""
    urls = [f"https://example.com/products?page={i}" for i in range(1, 101)]

    # Send in batches of 10 (SQS limit)
    for i in range(0, len(urls), 10):
        batch = urls[i:i+10]
        entries = [
            {
                "Id": str(idx),
                "MessageBody": json.dumps({"url": url}),
            }
            for idx, url in enumerate(batch)
        ]
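        response = sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
        # send_message_batch can reject individual entries without raising
        if response.get("Failed"):
            raise RuntimeError(f"Failed to queue {len(response['Failed'])} messages")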

    return {"urls_queued": len(urls)}
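`send_message_batch` reports per-entry failures in the `Failed` list of its response even when the call itself succeeds. The following is a minimal sketch of retrying only the rejected entries; the helper names `build_entries` and `send_with_retry` and the `max_attempts` parameter are illustrative assumptions, not part of the original orchestrator.

```python
import json


def build_entries(urls):
    """Build SQS batch entries; Ids only need to be unique within one batch."""
    return [
        {"Id": str(idx), "MessageBody": json.dumps({"url": url})}
        for idx, url in enumerate(urls)
    ]


def send_with_retry(sqs_client, queue_url, urls, max_attempts=3):
    """Send up to 10 URLs, retrying only the entries SQS rejected.

    Returns the entries that still failed after max_attempts (hypothetical helper).
    """
    pending = build_entries(urls)
    for _ in range(max_attempts):
        if not pending:
            break
        response = sqs_client.send_message_batch(QueueUrl=queue_url, Entries=pending)
        # Keep only the entries listed under "Failed" for the next attempt
        failed_ids = {failure["Id"] for failure in response.get("Failed", [])}
        pending = [entry for entry in pending if entry["Id"] in failed_ids]
    return pending
```

Inside the orchestrator this could be called once per slice of 10 URLs, for example `send_with_retry(sqs, QUEUE_URL, batch)`, with any leftover entries logged or raised.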

Component 2: The Scraper Worker

Triggered by SQS messages, each invocation scrapes one URL:

# scraper_worker.py
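import boto3
import requests
import hashlib
import json
from datetime import datetime, timezone

s3 = boto3.client("s3")
BUCKET = "scraper-raw-data"

def lambda_handler(event, context):
    """Process SQS messages and scrape URLs."""
    results = []

    for record in event["Records"]:
        body = json.loads(record["body"])
        url = body["url"]

        try:
            response = requests.get(url, timeout=25, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0"
            })
            # Treat 4xx/5xx responses as failures instead of storing error pages
            response.raise_for_status()

            # Save raw HTML to S3. Python's built-in hash() is salted per process,
            # so use a SHA-256 digest to get a stable key component for each URL.
            timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
            url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
            key = f"raw/{timestamp}_{url_hash}.html"
            s3.put_object(Bucket=BUCKET, Key=key, Body=response.text.encode("utf-8"))

            results.append({"url": url, "status": "success", "s3_key": key})

        except Exception as e:
            # Re-raise so SQS redelivers the message (and eventually moves it to
            # a dead letter queue, if one is configured) instead of dropping it
            print(f"Failed to scrape {url}: {e}")
            raise

    return {"processed": len(results), "results": results}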
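If the batch size is ever raised above 1, one failing URL should not force the whole batch to be retried. Lambda's SQS integration supports partial batch responses: the handler returns the IDs of the failed messages under `batchItemFailures`, provided `ReportBatchItemFailures` is enabled in the event source's `FunctionResponseTypes`. A minimal sketch, assuming a hypothetical `scrape(url)` callable that raises on failure:

```python
import json


def handle_batch(event, scrape):
    """Run scrape(url) for each SQS record and report only the failed messages.

    `scrape` is a hypothetical callable standing in for the fetch-and-store logic.
    """
    failures = []
    for record in event["Records"]:
        try:
            url = json.loads(record["body"])["url"]
            scrape(url)
        except Exception as exc:
            print(f"Failed message {record['messageId']}: {exc}")
            # Only these message IDs are returned to the queue for redelivery
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list tells Lambda that every message in the batch succeeded and can be deleted from the queue.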

Component 3: The Processor

Triggered by S3 events when new HTML files are uploaded:

# processor.py
import boto3
from urllib.parse import unquote_plus
from bs4 import BeautifulSoup

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("scraped-products")

def lambda_handler(event, context):
    """Parse HTML from S3 and store structured data in DynamoDB."""
    total_parsed = 0

    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded
        key = unquote_plus(record["s3"]["object"]["key"])

        # Download the HTML
        obj = s3.get_object(Bucket=bucket, Key=key)
        html = obj["Body"].read().decode("utf-8")

        # Parse
        soup = BeautifulSoup(html, "html.parser")
        products = []

        for item in soup.select(".product"):
            name = item.select_one(".name")
            price = item.select_one(".price")
            if name and price:
                product = {
                    "name": name.text.strip(),
                    "price": price.text.strip(),
                    "source_key": key,
                }
                products.append(product)

        # Batch write to DynamoDB (each item must include the table's key attributes)
        with table.batch_writer() as batch:
            for product in products:
                batch.put_item(Item=product)

        total_parsed += len(products)

    # Count products across all records, not just the last one
    return {"parsed": total_parsed}
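The processor stores prices as raw strings. If numeric prices are wanted for sorting or filtering, note that the boto3 DynamoDB resource rejects Python floats and expects `Decimal` values. A minimal sketch of a converter, assuming prices use a dot as the decimal separator and commas only as thousands separators; the `parse_price` helper is hypothetical:

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Convert a price string such as "$1,299.99" to Decimal, or None if unparseable."""
    # Drop currency symbols, thousands separators, and surrounding text
    cleaned = re.sub(r"[^\d.]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

The result could replace `price.text.strip()` when building the item, with `None` values skipped or logged rather than written.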

Infrastructure as Code (SAM Template)

# template.yaml
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31

Resources:
  ScrapeQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: scrape-urls
      # AWS recommends at least 6x the consuming function's timeout (6 x 300 s)
      VisibilityTimeout: 1800

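  OrchestratorFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: orchestrator.lambda_handler
      Runtime: python3.12
      Timeout: 60
      # Allow the function to send messages to the scrape queue
      Policies:
        - SQSSendMessagePolicy:
            QueueName: !GetAtt ScrapeQueue.QueueName
      Events:
        Schedule:
          Type: Schedule
          Properties:
            Schedule: rate(6 hours)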

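  ScraperWorkerFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: scraper_worker.lambda_handler
      Runtime: python3.12
      Timeout: 300
      MemorySize: 256
      # Allow the function to write raw HTML to the bucket used in scraper_worker.py
      Policies:
        - S3WritePolicy:
            BucketName: scraper-raw-data
      Events:
        SQSEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt ScrapeQueue.Arn
            BatchSize: 1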

Cost Estimate

Component    10K URLs/month    100K URLs/month
Lambda       ~$0.10            ~$1.00
SQS          ~$0.01            ~$0.04
S3           ~$0.05            ~$0.50
DynamoDB     ~$0.25            ~$2.50
Total        ~$0.41            ~$4.04
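As a rough cross-check of the Lambda row, the sketch below estimates the scraper worker's monthly cost from invocation count, memory size, and average duration. The prices are assumed us-east-1 on-demand x86 rates and the 2-second average duration is a guess; both should be verified against the current AWS pricing page. The free tier and the other two functions are ignored.

```python
# Assumed us-east-1 on-demand x86 Lambda prices; verify before relying on them.
PRICE_PER_REQUEST = 0.20 / 1_000_000   # USD per invocation
PRICE_PER_GB_SECOND = 0.0000166667     # USD per GB-second of compute


def lambda_monthly_cost(invocations, memory_mb, avg_duration_s):
    """Estimate monthly Lambda cost as request charges plus compute charges."""
    gb_seconds = invocations * (memory_mb / 1024) * avg_duration_s
    return invocations * PRICE_PER_REQUEST + gb_seconds * PRICE_PER_GB_SECOND


for urls in (10_000, 100_000):
    # One invocation per URL (BatchSize: 1), 256 MB, assumed 2 s per scrape
    cost = lambda_monthly_cost(urls, memory_mb=256, avg_duration_s=2.0)
    print(f"{urls:>7} URLs/month: ${cost:.2f}")
```

Because both charges scale linearly with invocations, slow target sites raise the estimate in direct proportion to the average duration.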

Tips

  • Use ScraperAPI from Lambda to avoid AWS IP blocks on target sites
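  • Set the SQS visibility timeout higher than the Lambda timeout (AWS recommends at least six times the function timeout) so in-flight messages are not redelivered while a worker is still running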
  • Use dead letter queues (DLQ) for URLs that fail multiple times
  • Monitor with CloudWatch dashboards and alarms
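
The visibility timeout and dead letter queue tips above both come down to queue attributes. A minimal sketch of building them for `set_queue_attributes`, assuming the dead letter queue already exists; the `queue_attributes` helper, the default `max_receive_count` of 5, and the example ARN are illustrative assumptions:

```python
import json


def queue_attributes(lambda_timeout_s, dlq_arn, max_receive_count=5):
    """Build SQS attributes: visibility timeout at 6x the Lambda timeout, plus a DLQ."""
    return {
        # SQS attribute values are passed as strings
        "VisibilityTimeout": str(lambda_timeout_s * 6),
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            # Messages move to the DLQ after this many failed receives
            "maxReceiveCount": str(max_receive_count),
        }),
    }


# Hypothetical ARN for illustration only
attrs = queue_attributes(300, "arn:aws:sqs:us-east-1:123456789012:scrape-urls-dlq")
print(attrs)
```

The resulting dictionary could be applied with `sqs.set_queue_attributes(QueueUrl=QUEUE_URL, Attributes=attrs)`, or the same values expressed in the SAM template.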