AI-Powered Data Extraction - How LLMs Are Changing Scraping

Explore how AI and LLMs are revolutionizing data extraction from websites. Covers vision models, structured output, and practical implementations.

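Traditional scrapers depend on CSS selectors or XPath expressions, so they break when a site changes its HTML. LLM-based extraction reads the page content rather than its markup, so it tolerates most layout changes without code edits. Here is how modern LLMs are being used for more reliable, lower-maintenance data extraction.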

The Shift from Selectors to Intelligence

Traditional approach:

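# Breaks when the tag or class name changes: find() returns None and .text raises AttributeError
# (soup is a BeautifulSoup object built from the page HTML)
price = soup.find("span", class_="price-value").text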

AI approach:

# Pseudocode: llm.extract is a placeholder, not a real library call.
# The prompt describes the data, not where it sits in the HTML.
data = llm.extract("Get the product price", html_content)

Method 1: HTML-to-JSON with LLMs

import openai
import json

client = openai.OpenAI()

def extract_structured(html, schema):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Extract data matching this schema from the HTML below.
            Return ONLY valid JSON.

            Schema: {json.dumps(schema)}

            HTML:
            {html[:6000]}"""
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

schema = {
    "products": [{
        "name": "string",
        "price": "number",
        "currency": "string",
        "in_stock": "boolean",
        "rating": "number"
    }]
}

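# Fetch the HTML through ScraperAPI first
import requests

resp = requests.get("http://api.scraperapi.com", params={
    "api_key": "YOUR_SCRAPERAPI_KEY",
    "url": "https://store.example.com/products"
}, timeout=60)
resp.raise_for_status()  # stop on HTTP errors instead of sending an error page to the LLM
html = resp.text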

products = extract_structured(html, schema)
print(json.dumps(products, indent=2))
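
Truncating raw HTML to 6,000 characters can cut the page off before the product markup starts, because the top of most pages is scripts and styles. One possible mitigation, sketched below with only the standard library, is to reduce the page to its visible text before truncating. The names `reduce_html` and `_TextExtractor` are hypothetical helpers, not part of any library, and the approach assumes the data you need is in visible text rather than in tag attributes.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of non-content tags."""

    SKIP = {"script", "style", "noscript", "svg", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def reduce_html(html, max_chars=6000):
    """Return the page's visible text, one chunk per line, cut to max_chars."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)[:max_chars]
```

With this in place, `extract_structured(reduce_html(html), schema)` would send text only. The trade-off is that attribute values such as `href` or `data-price` are dropped, so this sketch is a poor fit for pages that keep data there.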

Method 2: Vision Models for Complex Layouts

When HTML parsing is impractical (canvas-rendered content, complex layouts), use vision models on screenshots.

import base64
from playwright.sync_api import sync_playwright

# Capture screenshot
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/dashboard")
    page.screenshot(path="page.png", full_page=True)  # capture the whole page, not just the viewport
    browser.close()

# Extract data from the screenshot using GPT-4o
# (client is the openai.OpenAI() instance created in Method 1)
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all data from this page as structured JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
        ]
    }]
)

print(response.choices[0].message.content)
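
The vision call above does not set `response_format`, so the reply is free text and may arrive wrapped in a Markdown code fence or preceded by a sentence of explanation. A minimal sketch of a tolerant parser follows, assuming the reply contains at most one fenced block; `parse_json_reply` is a hypothetical helper name, not part of the OpenAI SDK.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the snippet easy to embed

# Matches an optional "json" language tag, then captures everything up to the closing fence
_FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(text):
    """Parse JSON from a model reply, with or without a surrounding code fence.

    Raises json.JSONDecodeError if the candidate text is not valid JSON.
    """
    match = _FENCED_BLOCK.search(text)
    candidate = match.group(1) if match else text
    return json.loads(candidate.strip())
```

Calling `parse_json_reply(response.choices[0].message.content)` in place of the final `print` would return a Python object, or raise if the model did not produce valid JSON, which is a useful signal to retry or log the page.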

Method 3: Structured Output with Pydantic

from pydantic import BaseModel
from typing import List, Optional
import instructor

client = instructor.from_openai(openai.OpenAI())

class Product(BaseModel):
    name: str
    price: float
    currency: str
    rating: Optional[float] = None       # default lets the model omit fields the page lacks
    review_count: Optional[int] = None

class ProductPage(BaseModel):
    products: List[Product]

result = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=ProductPage,
    messages=[{
        "role": "user",
        "content": f"Extract all products from this HTML:\n{html[:6000]}"
    }]
)

for product in result.products:
    print(f"{product.name}: {product.currency}{product.price}")

Cost and Performance Trade-offs

Method	Approx. cost per page (USD)	Latency per page	Reliability
CSS Selectors	~$0	Fast	Brittle
GPT-4o-mini	~$0.002	1-2s	High
GPT-4o (vision)	~$0.01	3-5s	Very high
Claude Sonnet	~$0.005	1-3s	High
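
Per-page figures like these depend on page size and current model pricing, so it helps to recompute them for your own workload. The sketch below shows the arithmetic only. The function names are hypothetical, the prices are passed in as parameters rather than hard-coded (look them up on the provider's pricing page), and the four-characters-per-token ratio is a rough rule of thumb for English text, not a tokenizer.

```python
def approx_tokens(text_chars):
    """Rough token estimate: about 4 characters per token (heuristic only)."""
    return text_chars // 4


def cost_per_page(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost of one extraction call, given prices in USD per million tokens."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def monthly_cost(pages_per_day, usd_per_page, days=30):
    """Total spend over a period at a fixed per-page cost."""
    return pages_per_day * usd_per_page * days


# Example using the table's ~$0.002 per page at an assumed 10,000 pages per day
print(f"${monthly_cost(10_000, 0.002):.2f} per 30 days")
```

At that assumed volume the table's GPT-4o-mini figure works out to roughly $600 per 30 days, which is the kind of number that motivates the hybrid approach described next.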

Best Practice: Hybrid Approach

Use an LLM to generate selectors for a new site, validate them on sample pages, then run traditional selector-based scraping for volume. Fall back to LLM extraction when the selectors fail or return incomplete data. This combines the resilience of LLM extraction with the speed and near-zero per-page cost of traditional scraping.
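
The fallback half of this strategy can be sketched as a small dispatcher. This is an illustration under stated assumptions, not a complete pipeline: `hybrid_extract` and `is_valid` are hypothetical names, both extractors are passed in as plain callables that take HTML and return a dict, and "valid" is assumed to mean that every required field is present and non-empty.

```python
def is_valid(record, required_fields):
    """A record is usable when every required field is present and non-empty."""
    if not isinstance(record, dict):
        return False
    return all(record.get(field) not in (None, "") for field in required_fields)


def hybrid_extract(html, selector_extract, llm_extract, required_fields):
    """Try the cheap selector path first; fall back to the LLM if it fails.

    Returns a (record, source) tuple where source is "selectors" or "llm".
    """
    try:
        record = selector_extract(html)
    except Exception:
        # Stale selectors usually surface as AttributeError or IndexError;
        # any failure here is treated as "selectors no longer work".
        record = None

    if is_valid(record, required_fields):
        return record, "selectors"
    return llm_extract(html), "llm"
```

Tracking how often the second element of the result is "llm" gives a simple signal for when a site's selectors need to be regenerated.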