Using ChatGPT for Web Scraping - Prompts and Techniques

Learn how to use ChatGPT and GPT-4 to write web scrapers, generate selectors, parse data, and build scraping workflows faster.

ChatGPT can speed up web scraping development. It can write scrapers, debug code, generate selectors, and even extract data directly from HTML. The five techniques below cover the most common uses.

Technique 1: Generate Scrapers from HTML

Paste a sample of the target HTML and ask ChatGPT to write the scraper.

Prompt:

Write a Python scraper using BeautifulSoup that extracts all product 
data (name, price, rating, url) from this HTML structure:

[paste sample HTML here]

Use ScraperAPI for the requests. Output as a list of dictionaries.

Technique 2: CSS Selector Generation

Instead of writing a full scraper, ask for just the selectors.

Prompt:

Given this HTML, provide CSS selectors to extract:
1. Product title
2. Price (including sale price)
3. Star rating
4. Number of reviews
5. Product image URL

Return as a Python dictionary mapping field names to selectors.

[paste HTML]

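Example output (the exact selectors depend on your HTML). Note that ::attr(...) is Scrapy/parsel syntax, not standard CSS, so BeautifulSoup's select() will not accept it; with BeautifulSoup, select the element and read the attribute from it instead: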

selectors = {
    "title": "h1.product-title",
    "price": "span.price-current",
    "sale_price": "span.price-sale",
    "rating": "div.star-rating::attr(data-rating)",
    "reviews": "span.review-count",
    "image": "img.product-image::attr(src)"
}
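
A selector dictionary like this can be reused with BeautifulSoup once the ::attr(...) suffix is split off. The following is a minimal sketch under that assumption; split_selector and extract_fields are illustrative names, not part of any library, and fields that do not match are returned as None.

```python
import re

# Matches a trailing Scrapy-style "::attr(name)" suffix.
_ATTR_SUFFIX = re.compile(r"::attr\(([^)]+)\)$")


def split_selector(selector):
    """Split 'img::attr(src)' into ('img', 'src'); plain selectors give (selector, None)."""
    match = _ATTR_SUFFIX.search(selector)
    if match is None:
        return selector, None
    return selector[:match.start()], match.group(1)


def extract_fields(html, selectors):
    """Apply a {field: selector} mapping to one page and return {field: value}."""
    # Imported here so split_selector works without the third-party package
    # (pip install beautifulsoup4).
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, selector in selectors.items():
        css, attr = split_selector(selector)
        node = soup.select_one(css)
        if node is None:
            record[field] = None  # selector matched nothing
        elif attr:
            record[field] = node.get(attr)  # attribute value
        else:
            record[field] = node.get_text(strip=True)  # visible text
    return record
```

Calling extract_fields(html, selectors) with the dictionary above returns one record per page; a None value is a quick signal that a generated selector needs another look.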

Technique 3: Direct Data Extraction

For one-off extractions, paste the HTML directly and ask for structured data.

Prompt:

Extract all product listings from this HTML as a JSON array with 
fields: name, price, url, image_url.

[paste HTML content]

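This works best with GPT-4 on pages with a clear, repetitive structure. For long pages, paste only the relevant section so the input fits within the model's context limit.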
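
Because the model can drop fields or invent values, it may help to check the returned JSON before using it. The sketch below assumes the reply is a bare JSON array with the four fields from the prompt. The helper validate_listings is hypothetical; it keeps items that are complete and whose name appears verbatim in the pasted HTML, and sets everything else aside for review.

```python
import json

REQUIRED_FIELDS = ("name", "price", "url", "image_url")


def validate_listings(raw_json, source_html):
    """Return (valid, suspect) lists of items parsed from the model's reply."""
    items = json.loads(raw_json)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of listings")

    valid, suspect = [], []
    for item in items:
        # Complete: a dict with a non-empty value for every required field.
        complete = isinstance(item, dict) and all(
            item.get(field) for field in REQUIRED_FIELDS
        )
        # Grounded: the name occurs verbatim in the HTML that was pasted.
        grounded = complete and str(item["name"]) in source_html
        (valid if grounded else suspect).append(item)
    return valid, suspect
```

The verbatim check is deliberately strict: names containing HTML entities such as &amp; will land in the suspect list, which is usually the safer failure mode for a one-off extraction.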

Technique 4: Debug Failing Scrapers

When your scraper breaks, paste your code, any error message, and a sample of the current HTML from the target page.

Prompt:

My scraper returns empty results. Here is my code:

[paste code]

And here is a sample of the current HTML from the target page:

[paste HTML]

What is wrong and how do I fix it?
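
Empty results often mean the class names the code relies on are no longer in the HTML, either because the markup changed or because the content is rendered by JavaScript. A quick check before writing the prompt can narrow this down. The sketch below uses only the standard library; missing_classes and the class names in the usage note are hypothetical.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every class name that appears on any tag."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_classes(html, expected):
    """Return the expected class names that do not occur anywhere in html."""
    collector = ClassCollector()
    collector.feed(html)
    return sorted(set(expected) - collector.classes)
```

For example, missing_classes(page_html, ["listing-card", "price"]) returns the names that are absent from the fetched page. Including that list in the prompt gives ChatGPT a concrete starting point.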

Technique 5: Generate Pagination Logic

Prompt:

Write a Python function that scrapes all pages of search results from 
this site. The pagination uses this pattern:
- Page 1: /search?q=term
- Page 2: /search?q=term&page=2
- Last page indicator: a disabled "Next" button with class "btn-disabled"

Use ScraperAPI for requests and handle rate limiting with delays.
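
The kind of function this prompt might produce can be sketched as follows. It is an illustration under stated assumptions, not the definitive answer: fetch is a hypothetical callable that takes a URL and returns the page HTML (for instance a wrapper around a ScraperAPI request), and page_url, has_disabled_next, and fetch_all_pages are made-up names. The last-page check looks for an element with class btn-disabled whose text contains "Next", so a disabled "Previous" button on page 1 does not stop the loop early.

```python
import time
from urllib.parse import urlencode


def page_url(base_url, term, page):
    """Page 1 is /search?q=term; later pages add &page=N."""
    params = {"q": term}
    if page > 1:
        params["page"] = page
    return f"{base_url.rstrip('/')}/search?{urlencode(params)}"


def has_disabled_next(html):
    """True if a .btn-disabled element labelled 'Next' is present."""
    # Imported here so the URL helper works without the third-party package.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return any(
        "next" in element.get_text().lower()
        for element in soup.select(".btn-disabled")
    )


def fetch_all_pages(fetch, base_url, term, is_last_page=has_disabled_next,
                    max_pages=50, delay=1.0):
    """Fetch result pages in order until the last-page indicator appears."""
    pages = []
    for page in range(1, max_pages + 1):
        html = fetch(page_url(base_url, term, page))
        pages.append(html)
        if is_last_page(html):
            break
        time.sleep(delay)  # simple fixed delay between requests
    return pages
```

The max_pages cap is a safety net in case the indicator never appears, for example after a site redesign.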

Practical Example: Full Scraper Generation

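# This scraper was generated and refined with ChatGPT assistance
import time

import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_SCRAPERAPI_KEY"

def scrape_listings(base_url, max_pages=10):
    """Scrape up to max_pages of listing cards through ScraperAPI."""
    all_items = []

    for page in range(1, max_pages + 1):
        # Assumes base_url has no query string of its own
        response = requests.get(
            "https://api.scraperapi.com",
            params={
                "api_key": API_KEY,
                "url": f"{base_url}?page={page}"
            },
            timeout=70  # Proxied requests can be slow; never wait forever
        )
        response.raise_for_status()  # Fail on HTTP errors instead of parsing an error page
        soup = BeautifulSoup(response.text, "html.parser")
        cards = soup.find_all("div", class_="listing-card")

        # No cards means we have run past the last page
        if not cards:
            break

        for card in cards:
            title = card.find("h2")
            price = card.find("span", class_="price")
            link = card.find("a", href=True)
            # Skip incomplete cards rather than raising AttributeError
            if title is None or price is None or link is None:
                continue
            all_items.append({
                "title": title.get_text(strip=True),
                "price": price.get_text(strip=True),
                "link": link["href"]
            })

        time.sleep(1)  # Rate limiting

    return all_items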

Tips for Better Results

Provide sample HTML for accurate selector generation
Specify the libraries you want (BeautifulSoup, Scrapy, Playwright)
Ask for error handling explicitly
Iterate by sharing the output or errors and asking for fixes
Use ChatGPT to convert scrapers between frameworks (e.g., BeautifulSoup to Scrapy)
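
As an illustration of what asking for error handling explicitly might yield, here is a minimal retry sketch. The helper with_retries is hypothetical, and fetch stands for any callable that takes a URL and returns a response. The default retry_on=(OSError,) is an assumption that suits requests, whose RequestException derives from IOError (an alias of OSError in Python 3).

```python
import time


def with_retries(fetch, url, attempts=3, base_delay=1.0, retry_on=(OSError,)):
    """Call fetch(url), retrying with exponential backoff on the listed exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Naming the exceptions to retry on in the prompt tends to give a more useful result than a generic "add error handling" request.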

ChatGPT does not replace understanding how scraping works, but it dramatically speeds up the development cycle.