Serverless Scraping Architecture - Deployment

Design a complete serverless web scraping architecture using AWS Lambda, SQS, S3, and DynamoDB with zero servers to manage.

A serverless scraping system uses managed cloud services for every component, so there are no servers to provision, patch, or scale. You pay only for what you use.

Architecture Overview

CloudWatch/Scheduler
        │
        ▼
  Lambda (Orchestrator)
        │
        ▼
    SQS Queue
        │
   ┌────┴────┐
   ▼         ▼
Lambda     Lambda   (Scraper workers)
   │         │
   ▼         ▼
    S3 Bucket (Raw HTML + JSON)
        │
        ▼
  Lambda (Processor)
        │
        ▼
   DynamoDB (Parsed data)

Component 1: The Orchestrator

This Lambda function generates URLs and feeds them into the queue:

# orchestrator.py
import boto3
import json

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/scrape-urls"

def lambda_handler(event, context):
    """Generate URLs and push them to SQS."""
    urls = [f"https://example.com/products?page={i}" for i in range(1, 101)]

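    # Send in batches of 10 (the SQS limit per send_message_batch call)
    failed = 0
    for i in range(0, len(urls), 10):
        batch = urls[i:i+10]
        entries = [
            {
                "Id": str(idx),  # only needs to be unique within one batch
                "MessageBody": json.dumps({"url": url}),
            }
            for idx, url in enumerate(batch)
        ]
        response = sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
        # The call can succeed overall while individual entries fail
        failed += len(response.get("Failed", []))

    return {"urls_queued": len(urls) - failed, "urls_failed": failed}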
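
The batching step can also be isolated into a pure helper, which makes it testable without AWS credentials. The following is a minimal sketch rather than part of the original handler: `build_batches` is a hypothetical name, and the default batch size of 10 reflects the SQS limit noted in the code above.

```python
import json


def build_batches(urls, batch_size=10):
    """Split URLs into lists of SQS batch entries (hypothetical helper)."""
    batches = []
    for start in range(0, len(urls), batch_size):
        chunk = urls[start:start + batch_size]
        batches.append([
            # Ids only need to be unique within a single batch request
            {"Id": str(idx), "MessageBody": json.dumps({"url": url})}
            for idx, url in enumerate(chunk)
        ])
    return batches


urls = [f"https://example.com/products?page={i}" for i in range(1, 26)]
batches = build_batches(urls)
print([len(b) for b in batches])  # [10, 10, 5]
```

Each inner list can be passed as the `Entries` argument of one `send_message_batch` call.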

Component 2: The Scraper Worker

Triggered by SQS messages, each invocation scrapes one URL:

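# scraper_worker.py
import boto3
import requests  # not included in the Lambda Python runtime; bundle it with the deployment package
import hashlib
import json
from datetime import datetime, timezone

s3 = boto3.client("s3")
BUCKET = "scraper-raw-data"

def lambda_handler(event, context):
    """Process SQS messages and scrape URLs."""
    results = []

    for record in event["Records"]:
        body = json.loads(record["body"])
        url = body["url"]

        try:
            response = requests.get(url, timeout=25, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0"
            })
            # Treat 4xx/5xx responses as failures instead of storing error pages
            response.raise_for_status()

            # Save raw HTML to S3. The built-in hash() is randomized per process,
            # so use a stable digest of the URL for the object key instead.
            timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
            url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
            key = f"raw/{timestamp}_{url_hash}.html"
            s3.put_object(Bucket=BUCKET, Key=key, Body=response.text.encode("utf-8"))

            results.append({"url": url, "status": "success", "s3_key": key})

        except Exception as e:
            # Log and re-raise: a failed invocation leaves the message on the
            # queue, so SQS redelivers it (and a DLQ can catch repeat failures).
            # Swallowing the error would delete the message and lose the URL.
            print(f"Failed to scrape {url}: {e}")
            raise

    return {"processed": len(results), "results": results}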
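
If the batch size is raised above 1, one failing URL would otherwise force the whole batch to be retried. Lambda's SQS integration supports partial batch responses for this case: the handler returns the IDs of the failed messages under `batchItemFailures`, and only those messages are redelivered. The sketch below assumes `ReportBatchItemFailures` is enabled on the event source mapping, which the template in this article does not set. `handle_batch` and the injected `scrape` callable are hypothetical names, used here so the example runs without network access.

```python
import json


def handle_batch(event, scrape):
    """Run scrape(url) per SQS record and report only the failed messages.

    `scrape` is a hypothetical callable that raises on failure.
    """
    failures = []
    for record in event["Records"]:
        url = json.loads(record["body"])["url"]
        try:
            scrape(url)
        except Exception as exc:
            print(f"Failed to scrape {url}: {exc}")
            # messageId tells Lambda which message to return to the queue
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def fake_scrape(url):
    """Stand-in for the real fetch; fails for one URL to show the behavior."""
    if "bad" in url:
        raise RuntimeError("simulated timeout")


event = {
    "Records": [
        {"messageId": "m-1", "body": json.dumps({"url": "https://example.com/ok"})},
        {"messageId": "m-2", "body": json.dumps({"url": "https://example.com/bad"})},
    ]
}
print(handle_batch(event, fake_scrape))
```

In a real handler, `scrape` would wrap the fetch-and-upload logic, and `lambda_handler` would return the result of `handle_batch` directly.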

Component 3: The Processor

Triggered by S3 events when new HTML files are uploaded:

# processor.py
import boto3
import json
from urllib.parse import unquote_plus
from bs4 import BeautifulSoup  # not in the Lambda runtime; bundle it with the deployment package

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("scraped-products")

def lambda_handler(event, context):
    """Parse HTML from S3 and store structured data in DynamoDB."""
    total_parsed = 0

    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record["s3"]["object"]["key"])

        # Download the HTML
        obj = s3.get_object(Bucket=bucket, Key=key)
        html = obj["Body"].read().decode("utf-8")

        # Parse
        soup = BeautifulSoup(html, "html.parser")
        products = []

        for item in soup.select(".product"):
            name = item.select_one(".name")
            price = item.select_one(".price")
            if name and price:
                product = {
                    "name": name.text.strip(),
                    "price": price.text.strip(),
                    "source_key": key,
                }
                products.append(product)

        # Batch write to DynamoDB (each item must include the table's key attributes)
        with table.batch_writer() as batch:
            for product in products:
                batch.put_item(Item=product)

        total_parsed += len(products)

    # Count across all records, not only the last one
    return {"parsed": total_parsed}
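
The processor stores `price` as the raw text from the page. If numeric queries are needed, the value has to be converted first, and boto3's DynamoDB resource requires `Decimal` rather than `float` for numbers. A minimal sketch, assuming prices look like `$1,299.99` (the real format depends on the target site); `parse_price` is a hypothetical helper:

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract a Decimal from text like '$1,299.99'; return None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators before converting
    return Decimal(match.group().replace(",", ""))


print(parse_price("$1,299.99"))       # 1299.99
print(parse_price("Call for price"))  # None
```

Returning `None` for unparseable text lets the caller decide whether to skip the item or store the raw string instead.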

Infrastructure as Code (SAM Template)

# template.yaml
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31

Resources:
  ScrapeQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: scrape-urls
      VisibilityTimeout: 1800  # at least 6x the worker's 300 s timeout, per AWS guidance

  OrchestratorFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: orchestrator.lambda_handler
      Runtime: python3.12
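      Timeout: 60
      Policies:
        # Allow the orchestrator to send messages to the queue
        - SQSSendMessagePolicy:
            QueueName: !GetAtt ScrapeQueue.QueueName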
      Events:
        Schedule:
          Type: Schedule
          Properties:
            Schedule: rate(6 hours)

  ScraperWorkerFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: scraper_worker.lambda_handler
      Runtime: python3.12
      Timeout: 300
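      MemorySize: 256
      Policies:
        # Allow the worker to write raw HTML to the bucket used in scraper_worker.py
        - S3WritePolicy:
            BucketName: scraper-raw-data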
      Events:
        SQSEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt ScrapeQueue.Arn
            BatchSize: 1

Cost Estimate

Component	10K URLs/month	100K URLs/month
Lambda	~$0.10	~$1.00
SQS	~$0.01	~$0.04
S3	~$0.05	~$0.50
DynamoDB	~$0.25	~$2.50
Total	~$0.41	~$4.04
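
The Lambda row can be approximated from two rates: a per-request charge and a charge per GB-second of compute. The sketch below assumes on-demand rates of $0.20 per million requests and $0.0000166667 per GB-second, ignores the free tier, and assumes an average run time of 2 seconds at 256 MB. The table does not state its inputs, so these figures are illustrative and should be checked against current AWS pricing.

```python
REQUEST_PRICE = 0.20 / 1_000_000   # USD per invocation (assumed rate)
GB_SECOND_PRICE = 0.0000166667     # USD per GB-second (assumed rate)


def lambda_cost(invocations, avg_seconds=2.0, memory_mb=256):
    """Estimate monthly Lambda cost in USD, ignoring the free tier."""
    gb_seconds = invocations * avg_seconds * (memory_mb / 1024)
    return invocations * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE


for n in (10_000, 100_000):
    print(f"{n} invocations: ${lambda_cost(n):.2f}")
```

Under these assumptions the estimate is of the same order as the table's Lambda figures, and the cost scales linearly with the number of URLs.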

Tips

Route requests through a proxy service such as ScraperAPI from Lambda, since many target sites block AWS IP ranges
Set the SQS visibility timeout higher than the Lambda timeout (AWS recommends at least six times the function timeout)
Use dead letter queues (DLQ) for URLs that fail multiple times
Monitor with CloudWatch dashboards and alarms