Serverless Scraping Architecture
Design a complete serverless web scraping architecture using AWS Lambda, SQS, S3, and DynamoDB with zero servers to manage.
A serverless scraping system uses managed cloud services for every component, so there are no servers to provision, patch, or scale. You pay only for what you use.
Architecture Overview
   CloudWatch/Scheduler
             │
             ▼
   Lambda (Orchestrator)
             │
             ▼
         SQS Queue
             │
        ┌────┴────┐
        ▼         ▼
     Lambda     Lambda   (Scraper workers)
        │         │
        ▼         ▼
S3 Bucket (Raw HTML + JSON)
             │
             ▼
    Lambda (Processor)
             │
             ▼
  DynamoDB (Parsed data)
Component 1: The Orchestrator
This Lambda function generates URLs and feeds them into the queue:
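# orchestrator.py
import json

import boto3

sqs = boto3.client("sqs")
# Placeholder queue URL; substitute your own account ID and region.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/scrape-urls"


def lambda_handler(event, context):
    """Generate URLs and push them to SQS."""
    urls = [f"https://example.com/products?page={i}" for i in range(1, 101)]
    failed = 0

    # send_message_batch accepts at most 10 entries per call (SQS limit)
    for i in range(0, len(urls), 10):
        batch = urls[i:i + 10]
        entries = [
            {
                "Id": str(idx),  # only needs to be unique within this batch
                "MessageBody": json.dumps({"url": url}),
            }
            for idx, url in enumerate(batch)
        ]
        response = sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
        # A batch call can partially fail without raising, so count the failures
        failed += len(response.get("Failed", []))

    return {"urls_queued": len(urls) - failed, "urls_failed": failed}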
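The batching step can also be pulled out into a small pure function, which makes it testable without AWS access. The sketch below is one way to do that; `build_batches` and its `batch_size` parameter are hypothetical names, not part of the original code. Each entry gets an `Id` that is unique within its batch, which is all `send_message_batch` requires.

```python
import json


def build_batches(urls, batch_size=10):
    """Split URLs into lists of SQS batch entries (hypothetical helper)."""
    if not 1 <= batch_size <= 10:
        raise ValueError("SQS accepts between 1 and 10 entries per batch")
    batches = []
    for start in range(0, len(urls), batch_size):
        chunk = urls[start:start + batch_size]
        batches.append([
            {"Id": str(idx), "MessageBody": json.dumps({"url": url})}
            for idx, url in enumerate(chunk)
        ])
    return batches


if __name__ == "__main__":
    pages = [f"https://example.com/products?page={i}" for i in range(1, 26)]
    for entries in build_batches(pages):
        # Each list could be passed as Entries= to sqs.send_message_batch
        print(len(entries), entries[0]["MessageBody"])
```

Keeping the entry-building logic separate from the boto3 call means the orchestrator's only side effect is the send itself.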
Component 2: The Scraper Worker
Triggered by SQS messages, each invocation scrapes one URL:
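# scraper_worker.py
import hashlib
import json
from datetime import datetime, timezone

import boto3
import requests  # not in the Lambda runtime; bundle it in the deployment package

s3 = boto3.client("s3")
BUCKET = "scraper-raw-data"


def lambda_handler(event, context):
    """Process SQS messages and scrape URLs."""
    results = []
    for record in event["Records"]:
        body = json.loads(record["body"])
        url = body["url"]
        try:
            response = requests.get(url, timeout=25, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0"
            })
            response.raise_for_status()  # treat 4xx/5xx responses as failures
            # Save raw HTML to S3. Use a stable digest for the key: the built-in
            # hash() is salted per process and can be negative.
            timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
            digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
            key = f"raw/{timestamp}_{digest}.html"
            s3.put_object(Bucket=BUCKET, Key=key, Body=response.text.encode("utf-8"))
            results.append({"url": url, "status": "success", "s3_key": key})
        except Exception as e:
            print(f"Failed to scrape {url}: {e}")
            # Re-raise so the message returns to the queue and can be retried
            # (and eventually moved to a dead letter queue if one is configured).
            raise
    return {"processed": len(results), "results": results}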
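With a batch size above 1, Lambda's partial batch response lets a worker report just the failed messages so that only those are retried. A hedged sketch of that response format follows: the handler returns the `messageId` of each failed record under `batchItemFailures`, which takes effect only when `ReportBatchItemFailures` is enabled on the SQS event source mapping. `process_records` and the injected `scrape` callable are hypothetical names, used here so the logic can run without network or AWS access.

```python
import json


def process_records(event, scrape):
    """Run scrape(url) per SQS record and report the ones that failed.

    `scrape` is a hypothetical callable standing in for the fetch-and-store
    logic; it should raise an exception on failure.
    """
    failures = []
    for record in event["Records"]:
        try:
            url = json.loads(record["body"])["url"]
            scrape(url)
        except Exception as exc:
            print(f"Failed record {record['messageId']}: {exc}")
            failures.append({"itemIdentifier": record["messageId"]})
    # Messages not listed here are treated as successfully processed
    return {"batchItemFailures": failures}


if __name__ == "__main__":
    def fake_scrape(url):
        if "bad" in url:
            raise RuntimeError("simulated fetch error")

    event = {"Records": [
        {"messageId": "m-1", "body": json.dumps({"url": "https://example.com/ok"})},
        {"messageId": "m-2", "body": json.dumps({"url": "https://example.com/bad"})},
    ]}
    print(process_records(event, fake_scrape))
```

A Lambda handler would return the result of `process_records` directly; an empty `batchItemFailures` list signals that the whole batch succeeded.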
Component 3: The Processor
Triggered by S3 events when new HTML files are uploaded:
# processor.py
from urllib.parse import unquote_plus

import boto3
from bs4 import BeautifulSoup

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("scraped-products")


def lambda_handler(event, context):
    """Parse HTML from S3 and store structured data in DynamoDB."""
    total_parsed = 0
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record["s3"]["object"]["key"])

        # Download the HTML
        obj = s3.get_object(Bucket=bucket, Key=key)
        html = obj["Body"].read().decode("utf-8")

        # Parse
        soup = BeautifulSoup(html, "html.parser")
        products = []
        for item in soup.select(".product"):
            name = item.select_one(".name")
            price = item.select_one(".price")
            if name and price:
                product = {
                    "name": name.text.strip(),
                    "price": price.text.strip(),
                    "source_key": key,
                }
                products.append(product)

        # Batch write to DynamoDB; each item must include the table's key attributes
        with table.batch_writer() as batch:
            for product in products:
                batch.put_item(Item=product)
        total_parsed += len(products)

    return {"parsed": total_parsed}
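The processor stores `price` as the raw text from the page. If numeric prices are wanted for queries, note that boto3's DynamoDB resource API accepts `Decimal` values but rejects Python `float`. A minimal sketch of a converter follows, assuming prices look like `$1,299.99`; the `parse_price` helper and that price format are assumptions, not something the original specifies.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Convert a price string such as "$1,299.99" to Decimal, or None.

    Hypothetical helper; assumes "," is a thousands separator and "." the
    decimal point, which does not hold for every locale.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))


if __name__ == "__main__":
    for raw in ["$1,299.99", "19", "Call for price"]:
        print(repr(raw), "->", parse_price(raw))
```

Returning `None` for unparseable text lets the caller decide whether to skip the item or store the raw string alongside it.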
Infrastructure as Code (SAM Template)
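# template.yaml
# Covers the queue, orchestrator, and scraper worker; the S3 bucket,
# processor function, and DynamoDB table are not shown.
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31

Resources:
  ScrapeQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: scrape-urls
      # At least 6x the worker timeout, as AWS recommends for SQS event sources
      VisibilityTimeout: 1800

  OrchestratorFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: orchestrator.lambda_handler
      Runtime: python3.12
      Timeout: 60
      Policies:
        - SQSSendMessagePolicy:
            QueueName: !GetAtt ScrapeQueue.QueueName
      Events:
        Schedule:
          Type: Schedule
          Properties:
            Schedule: rate(6 hours)

  ScraperWorkerFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: scraper_worker.lambda_handler
      Runtime: python3.12
      Timeout: 300
      MemorySize: 256
      Policies:
        - S3WritePolicy:
            BucketName: scraper-raw-data
      Events:
        SQSEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt ScrapeQueue.Arn
            BatchSize: 1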
Cost Estimate
| Component | 10K URLs/month | 100K URLs/month |
|---|---|---|
| Lambda | ~$0.10 | ~$1.00 |
| SQS | ~$0.01 | ~$0.04 |
| S3 | ~$0.05 | ~$0.50 |
| DynamoDB | ~$0.25 | ~$2.50 |
| Total | ~$0.41 | ~$4.04 |
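The Lambda row can be roughly reproduced from two charges: a per-request fee and a per-GB-second compute fee. The sketch below uses assumed values throughout: a 3-second average scrape, the worker's 256 MB memory setting, and list prices of $0.20 per million requests and $0.0000166667 per GB-second. Check current AWS pricing for your region before relying on it; the free tier and the orchestrator and processor invocations are ignored.

```python
def lambda_monthly_cost(invocations, avg_seconds, memory_mb,
                        price_per_million_requests=0.20,
                        price_per_gb_second=0.0000166667):
    """Estimate monthly Lambda cost in USD (all prices are assumed inputs)."""
    request_cost = invocations / 1_000_000 * price_per_million_requests
    gb_seconds = invocations * avg_seconds * (memory_mb / 1024)
    return request_cost + gb_seconds * price_per_gb_second


if __name__ == "__main__":
    for urls in (10_000, 100_000):
        cost = lambda_monthly_cost(urls, avg_seconds=3, memory_mb=256)
        print(f"{urls:>7} URLs/month: ~${cost:.2f}")
```

Under these assumptions the estimate comes to about $0.13 and $1.27, the same order of magnitude as the table. Compute time dominates the request fee, so average scrape duration is the number worth measuring.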
Tips
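- Route requests through a proxy service such as ScraperAPI, since many target sites block AWS IP ranges
- Set the SQS visibility timeout higher than the Lambda timeout (AWS recommends at least six times the function timeout) so a message is not redelivered while a worker is still processing it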
- Use dead letter queues (DLQ) for URLs that fail multiple times
- Monitor with CloudWatch dashboards and alarms
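The visibility timeout and dead letter queue tips both come down to queue attributes. As a sketch, the helper below builds an attributes dictionary in the shape that boto3's `sqs.set_queue_attributes` expects. `build_queue_attributes`, the default multiplier of six, the `max_receive_count` of 3, and the example ARN are illustrative assumptions.

```python
import json


def build_queue_attributes(function_timeout_s, dlq_arn, max_receive_count=3,
                           timeout_multiplier=6):
    """Build SQS attributes for a Lambda-consumed queue (hypothetical helper)."""
    visibility_timeout = function_timeout_s * timeout_multiplier
    if visibility_timeout > 43_200:
        raise ValueError("SQS visibility timeout cannot exceed 12 hours")
    return {
        # SQS attribute values are passed as strings
        "VisibilityTimeout": str(visibility_timeout),
        # Once a message has been received more than max_receive_count times,
        # SQS moves it to the dead letter queue
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        }),
    }


if __name__ == "__main__":
    attrs = build_queue_attributes(
        function_timeout_s=300,
        dlq_arn="arn:aws:sqs:us-east-1:123456789012:scrape-urls-dlq",  # made-up ARN
    )
    # Could be applied with: sqs.set_queue_attributes(QueueUrl=..., Attributes=attrs)
    print(attrs)
```

Deriving the visibility timeout from the function timeout keeps the two settings from drifting apart when one of them is changed later.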