Using AI and LLMs for Web Scraping in 2026
Learn how AI and large language models (LLMs) are transforming web scraping. This guide covers AI-powered data extraction, schema generation, and adaptive scrapers.
Large language models have fundamentally changed web scraping. Instead of writing brittle CSS selectors and XPath queries, you can now describe what data you want in plain English and let an LLM extract it. Here is the state of AI-powered scraping in 2026.
How LLMs Change Scraping
Traditional scraping requires manually identifying selectors that break when sites update their HTML. LLMs can:
- Extract structured data from unstructured HTML without selectors
- Adapt to layout changes automatically
- Understand context to extract the right data from ambiguous pages
- Generate schemas from example pages
Method 1: LLM-Based Data Extraction
import json

import openai
import requests
from bs4 import BeautifulSoup

# Fetch the page (use ScraperAPI for protected sites)
response = requests.get(
    "https://api.scraperapi.com",
    params={
        "api_key": "YOUR_SCRAPERAPI_KEY",
        "url": "https://example-store.com/product/123",
    },
    timeout=60,
)
response.raise_for_status()  # Fail early on HTTP errors

# Clean the HTML: drop tags that carry no product information
soup = BeautifulSoup(response.text, "html.parser")
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()
text = soup.get_text(separator="\n", strip=True)

# Extract structured data with an LLM (reads OPENAI_API_KEY from the environment)
client = openai.OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"""Extract product data from this page as JSON:
- name, price, currency, description, rating, review_count, availability
Page content:
{text[:4000]}""",  # Truncate to keep the prompt small
    }],
    # JSON mode requires the word "JSON" in the prompt, which it has
    response_format={"type": "json_object"},
)

product = json.loads(completion.choices[0].message.content)
print(json.dumps(product, indent=2))
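The model is not guaranteed to return every requested field, or to use the types you expect, so it helps to check the parsed result before storing it. A minimal sketch, assuming the field names from the prompt above; `normalize_product` and its rules (name required, price coerced to a float) are hypothetical choices, not part of any library:

```python
import json

# Fields requested in the extraction prompt
EXPECTED_FIELDS = ("name", "price", "currency", "description",
                   "rating", "review_count", "availability")


def normalize_product(raw_json: str) -> dict:
    """Parse LLM output and coerce it into a predictable shape.

    Missing fields become None. Raises ValueError if the output is not
    a JSON object or has no product name.
    """
    data = json.loads(raw_json)  # JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    product = {field: data.get(field) for field in EXPECTED_FIELDS}
    if not product["name"]:
        raise ValueError("missing product name")

    price = product["price"]
    if isinstance(price, str):
        # Models sometimes return "$19.99" or "1,299.00" instead of a number.
        # Assumption: a period is the decimal separator.
        cleaned = "".join(ch for ch in price if ch.isdigit() or ch == ".")
        try:
            product["price"] = float(cleaned)
        except ValueError:
            product["price"] = None
    return product
```

In the example above, `normalize_product(completion.choices[0].message.content)` would take the place of the bare `json.loads` call.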
Method 2: Using Claude for Extraction
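import anthropic
import requests

# Fetch the listing page; the URL is a placeholder
html_content = requests.get("https://example-store.com/products", timeout=30).text

# Reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"""Extract all product listings from this HTML as a JSON array.
Each product should have: name, price, url, image_url, rating.
HTML:
{html_content[:8000]}""",  # Truncate to limit input tokens
    }],
)
# The reply is plain text; the first content block should hold the JSON
print(message.content[0].text)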
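Unlike the JSON mode used in Method 1, this request only asks for JSON in the prompt, so the reply can arrive wrapped in a Markdown code fence or a sentence of explanation. A minimal sketch of pulling the array out of such a reply; `parse_json_array` is a hypothetical helper, and slicing between the first `[` and the last `]` is a heuristic that will fail if the reply contains stray brackets outside the array:

```python
import json


def parse_json_array(reply: str) -> list:
    """Return the JSON array contained in an LLM text reply.

    Tries the whole reply first, then falls back to the text between
    the first '[' and the last ']'. Raises ValueError if neither
    candidate parses to a list.
    """
    candidates = [reply.strip()]
    start, end = reply.find("["), reply.rfind("]")
    if start != -1 and end > start:
        candidates.append(reply[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, list):
            return parsed
    raise ValueError("no JSON array found in reply")
```

With the example above, this would be called as `products = parse_json_array(message.content[0].text)`.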
Method 3: Crawl4AI (Open Source)
Crawl4AI is an open-source library purpose-built for AI-powered web scraping.
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy


async def extract():
    # The provider and the plain-English instruction drive the extraction.
    # Argument names may vary between Crawl4AI releases; check the docs
    # for your installed version.
    strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",
        instruction="Extract all product names, prices, and ratings as JSON",
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example-store.com/products",
            extraction_strategy=strategy,
        )
        print(result.extracted_content)


asyncio.run(extract())
When to Use AI vs Traditional Scraping
Use AI extraction when:
- Page layouts change frequently
- Data is embedded in unstructured text
- You need to scrape many different sites with different layouts
- Accuracy on individual pages matters more than speed
Use traditional selectors when:
- You are scraping millions of pages with the same layout
- Speed and cost are priorities
- The site structure is stable
Cost Considerations
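LLM extraction costs roughly $0.001 to $0.01 per page with GPT-4o-mini, depending on how much page text you send. For high-volume scraping, this adds up: at that rate, one million pages cost roughly $1,000 to $10,000. A hybrid approach often works best: use an LLM to generate selectors once per layout, then use those selectors for bulk extraction.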
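One way to structure the hybrid approach is to call the LLM only when a site has no cached selectors, then run every page through cheap selector-based extraction. The sketch below is illustrative rather than a definitive implementation: `generate_selectors` stands in for an LLM call that returns a field-to-CSS-selector mapping, and `apply_selectors` stands in for something like BeautifulSoup's `select_one`. Both are hypothetical callables you would supply, as is the `HybridExtractor` class itself.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

Selectors = Dict[str, str]          # field name -> CSS selector
Record = Dict[str, Optional[str]]   # field name -> extracted value


class HybridExtractor:
    """Generate selectors once per site with an LLM, then reuse them."""

    def __init__(
        self,
        generate_selectors: Callable[[str], Selectors],
        apply_selectors: Callable[[str, Selectors], Record],
    ) -> None:
        self._generate = generate_selectors  # expensive: one LLM call
        self._apply = apply_selectors        # cheap: plain HTML parsing
        self._cache: Dict[str, Selectors] = {}
        self.llm_calls = 0

    def _refresh(self, host: str, html: str) -> None:
        self._cache[host] = self._generate(html)
        self.llm_calls += 1

    def extract(self, url: str, html: str) -> Record:
        # Assumption: every page on a host shares one layout
        host = urlparse(url).netloc
        fresh = host not in self._cache
        if fresh:
            self._refresh(host, html)
        record = self._apply(html, self._cache[host])

        # If cached selectors match nothing, the layout probably changed:
        # regenerate them once and retry
        if not fresh and not any(record.values()):
            self._refresh(host, html)
            record = self._apply(html, self._cache[host])
        return record
```

Keying the cache by host is the simplest choice; a site with several page templates would need a finer key, such as a URL path prefix. The `llm_calls` counter makes it easy to confirm that LLM usage grows with the number of layouts rather than the number of pages.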