AI-Powered Data Extraction - How LLMs Are Changing Scraping
How LLMs are changing data extraction from websites. Covers HTML-to-JSON extraction, vision models, structured output, and cost trade-offs.
Traditional scraping breaks when websites change their HTML. LLM-based extraction tolerates many of those changes, because the model reads the page content instead of relying on fixed selectors. Here is how modern LLMs are being used for extraction that needs far less maintenance.
The Shift from Selectors to Intelligence
Traditional approach:
# Breaks when HTML changes
price = soup.find("span", class_="price-value").text
AI approach:
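# Describes what to extract, not where it lives in the markup
# (illustrative pseudocode; `llm.extract` stands in for the methods below)
data = llm.extract("Get the product price", html_content)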
Method 1: HTML-to-JSON with LLMs
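import json

import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_structured(html, schema):
    """Ask the model to return JSON matching `schema` for the given HTML."""
    # html[:6000] keeps the prompt small; anything past 6,000 characters is ignored
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Extract data matching this schema from the HTML below.
Return ONLY valid JSON.
Schema: {json.dumps(schema)}
HTML:
{html[:6000]}"""
        }],
        # JSON mode ensures syntactically valid JSON, not conformance to the schema
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)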
schema = {
    "products": [{
        "name": "string",
        "price": "number",
        "currency": "string",
        "in_stock": "boolean",
        "rating": "number"
    }]
}
# Use ScraperAPI to get the HTML first
import requests

resp = requests.get("http://api.scraperapi.com", params={
    "api_key": "YOUR_SCRAPERAPI_KEY",
    "url": "https://store.example.com/products"
}, timeout=60)
resp.raise_for_status()  # fail early instead of sending an error page to the LLM
html = resp.text

products = extract_structured(html, schema)
print(json.dumps(products, indent=2))
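The `html[:6000]` slice above is a blunt way to limit prompt size: on many pages the first 6,000 characters are mostly `<head>`, scripts, and styles. One possible mitigation, sketched here with only the standard library, is to reduce the page to its visible text before truncating. The names `visible_text` and `_TextExtractor` are hypothetical helpers, not part of any library used above.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html, max_chars=6000):
    """Return the page's visible text, one chunk per line, cut to max_chars."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)[:max_chars]
```

Passing `visible_text(html)` to `extract_structured` instead of the raw HTML should fit more real content into the same budget. The trade-off is that attributes such as `href` or `data-*` values are discarded, so this only suits fields that appear as text on the page.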
Method 2: Vision Models for Complex Layouts
When HTML parsing is impractical (canvas-rendered content, complex layouts), use vision models on screenshots.
import base64
from playwright.sync_api import sync_playwright

# Capture screenshot
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/dashboard")
    page.screenshot(path="page.png")
    browser.close()

# Extract data from screenshot using GPT-4o
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Reuses the `client` created in Method 1
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all data from this page as structured JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        ]
    }]
)
print(response.choices[0].message.content)
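The vision call above prints the reply as plain text, and models often wrap JSON in a Markdown code fence or add a sentence around it. A small helper along these lines could turn that reply into a Python object. `parse_json_reply` is a hypothetical name, and the fence handling is an assumption about typical reply formatting rather than documented API behavior.

```python
import json
import re

# Three backticks, built programmatically so this example nests cleanly in docs.
FENCE = "`" * 3
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(text):
    """Parse JSON from a model reply, unwrapping a Markdown code fence if present."""
    match = _FENCED.search(text)
    candidate = match.group(1) if match else text
    # Raises json.JSONDecodeError (a ValueError) if the reply is not valid JSON
    return json.loads(candidate.strip())
```

With this in place, `data = parse_json_reply(response.choices[0].message.content)` replaces the final `print`, and a `ValueError` signals that the page should be retried or flagged for review.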
Method 3: Structured Output with Pydantic
from typing import List, Optional

import instructor
import openai
from pydantic import BaseModel

client = instructor.from_openai(openai.OpenAI())

class Product(BaseModel):
    name: str
    price: float
    currency: str
    # Default to None so a product without these fields still validates
    rating: Optional[float] = None
    review_count: Optional[int] = None

class ProductPage(BaseModel):
    products: List[Product]

# `html` is the page source fetched in Method 1
result = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=ProductPage,
    messages=[{
        "role": "user",
        "content": f"Extract all products from this HTML:\n{html[:6000]}"
    }]
)

for product in result.products:
    print(f"{product.name}: {product.price} {product.currency}")
Cost and Performance Trade-offs
| Method | Cost per page | Speed | Reliability |
|---|---|---|---|
| CSS Selectors | ~$0 | Fast | Brittle |
| GPT-4o-mini | ~$0.002 | 1-2s | High |
| GPT-4o (vision) | ~$0.01 | 3-5s | Very high |
| Claude Sonnet | ~$0.005 | 1-3s | High |
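To see what these per-page figures mean at volume, the arithmetic can be sketched as below. The costs are copied from the table and are rough estimates; treating "~$0" as exactly zero, the key names, and the 30-day month are assumptions made for illustration.

```python
# Per-page costs in USD, taken from the table above (approximate).
COST_PER_PAGE_USD = {
    "css_selectors": 0.0,   # "~$0" in the table, treated as zero here
    "gpt-4o-mini": 0.002,
    "gpt-4o-vision": 0.01,
    "claude-sonnet": 0.005,
}


def monthly_cost(method, pages_per_day, days=30):
    """Estimated spend in USD for one extraction method over `days` days."""
    return COST_PER_PAGE_USD[method] * pages_per_day * days


if __name__ == "__main__":
    for method in COST_PER_PAGE_USD:
        print(f"{method}: ${monthly_cost(method, 10_000):,.2f} per month")
```

At 10,000 pages per day, the table's $0.002 figure for GPT-4o-mini works out to roughly $600 per month, and the $0.01 vision figure to roughly $3,000.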
Best Practice: Hybrid Approach
Use LLMs to generate selectors for a new site, validate them, then use traditional scraping for volume. Fall back to LLM extraction when selectors fail. Most pages then run at the speed and cost of traditional scraping, and the LLM is only paid for when a site's markup changes.
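The fallback step can be sketched as a small wrapper, shown here under a few assumptions: both extractors are passed in as callables, a product is a plain dict with `name` and `price`, and `is_valid_product` and `extract_with_fallback` are hypothetical names. The validation rule is deliberately simple and would need tuning per site.

```python
from typing import Callable, Optional, Tuple


def is_valid_product(item: dict) -> bool:
    """Sanity check used to decide whether selector output can be trusted."""
    price = item.get("price")
    return (
        bool(item.get("name"))
        and isinstance(price, (int, float))
        and not isinstance(price, bool)
        and price > 0
    )


def extract_with_fallback(
    html: str,
    selector_extract: Callable[[str], Optional[dict]],
    llm_extract: Callable[[str], dict],
) -> Tuple[dict, str]:
    """Try the cheap selector path first; call the LLM only if it fails."""
    try:
        item = selector_extract(html)
    except (AttributeError, KeyError, ValueError):
        # Typical errors when a selector no longer matches anything
        item = None
    if item is not None and is_valid_product(item):
        return item, "selectors"
    return llm_extract(html), "llm"
```

Returning the source label alongside the data makes it easy to log how often the fallback fires; a rising fallback rate is a signal that the selectors for that site should be regenerated.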