Tutorial
Using MCP Servers for Web Scraping Automation
Learn how to use Model Context Protocol (MCP) servers to build AI-powered web scraping pipelines with Claude and other LLMs.
The Model Context Protocol (MCP) enables AI models to interact with external tools and data sources. For web scraping, MCP servers let you give an AI agent the ability to browse, scrape, and process web data autonomously.
What Is MCP?
MCP is an open protocol that connects AI models to external tools. An MCP server exposes "tools" that an AI can call. For scraping, you might expose tools like fetch_page, extract_data, and save_results.
Building a Scraping MCP Server
# scraping_server.py
from mcp.server.fastmcp import FastMCP
import requests
from bs4 import BeautifulSoup
mcp = FastMCP("Web Scraping Server")
SCRAPERAPI_KEY = "YOUR_SCRAPERAPI_KEY"
@mcp.tool()
def fetch_page(url: str, render: bool = False) -> str:
"""Fetch a web page, optionally with JavaScript rendering."""
params = {
"api_key": SCRAPERAPI_KEY,
"url": url
}
if render:
params["render"] = "true"
response = requests.get("http://api.scraperapi.com", params=params)
return response.text
@mcp.tool()
def extract_text(html: str) -> str:
"""Extract clean text from HTML, removing scripts, styles, and navigation."""
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style", "nav", "footer", "header"]):
tag.decompose()
return soup.get_text(separator="\n", strip=True)
@mcp.tool()
def extract_links(html: str, base_url: str) -> list[dict]:
"""Extract all links from a page with their text and URLs."""
from urllib.parse import urljoin
soup = BeautifulSoup(html, "html.parser")
links = []
for a in soup.find_all("a", href=True):
links.append({
"text": a.get_text(strip=True),
"url": urljoin(base_url, a["href"])
})
return links
@mcp.tool()
def extract_tables(html: str) -> list[list[list[str]]]:
"""Extract all tables from HTML as nested arrays."""
soup = BeautifulSoup(html, "html.parser")
tables = []
for table in soup.find_all("table"):
rows = []
for tr in table.find_all("tr"):
cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
rows.append(cells)
tables.append(rows)
return tables
if __name__ == "__main__":
mcp.run()
Configuring with Claude Desktop
Add the server to your Claude Desktop configuration:
{
"mcpServers": {
"scraping": {
"command": "python",
"args": ["scraping_server.py"]
}
}
}
Now Claude can scrape websites and extract data when you ask it to.
Using with Playwright MCP
The Playwright MCP server provides browser automation capabilities.
{
"mcpServers": {
"playwright": {
"command": "npx",
"args": ["@playwright/mcp@latest"]
}
}
}
This gives Claude the ability to navigate pages, click buttons, fill forms, and take screenshots, which is perfect for scraping SPAs and sites requiring interaction.
Advanced: Multi-Step Scraping Pipeline
@mcp.tool()
def scrape_search_results(query: str, num_pages: int = 3) -> list[dict]:
"""Scrape search results across multiple pages."""
all_results = []
for page in range(1, num_pages + 1):
html = fetch_page(
f"https://example.com/search?q={query}&page={page}",
render=True
)
soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all("div", class_="result-item"):
all_results.append({
"title": item.find("h2").get_text(strip=True),
"url": item.find("a")["href"],
"snippet": item.find("p").get_text(strip=True)
})
return all_results
Why MCP for Scraping?
- Natural language control, Ask Claude to "scrape the top 10 products from this category page" and it calls the right tools
- Adaptive, The AI can handle unexpected page structures by combining tools
- Composable, Combine scraping with data processing, storage, and analysis tools
- Low code, Non-developers can drive complex scraping workflows through conversation