Using ChatGPT for Web Scraping - Prompts and Techniques

Learn how to use ChatGPT and GPT-4 to write web scrapers, generate selectors, parse data, and build scraping workflows faster.

ChatGPT is a powerful tool for accelerating web scraping development. It can write scrapers, debug code, generate selectors, and even extract data directly from HTML. Here are the most effective techniques.

Technique 1: Generate Scrapers from HTML

Paste a sample of the target HTML and ask ChatGPT to write the scraper.

Prompt:

Write a Python scraper using BeautifulSoup that extracts all product 
data (name, price, rating, url) from this HTML structure:

[paste sample HTML here]

Use ScraperAPI for the requests. Output as a list of dictionaries.
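Full pages are often too long to paste, and most of the bulk is scripts and styles that do not help with selectors. One way to prepare the sample is sketched below. The helper name trim_html_sample and the 6,000-character default are illustrative assumptions, not limits published for ChatGPT, and the regex cleanup is only rough prompt preparation, not a real HTML parser.

```python
import re


def trim_html_sample(html, max_chars=6000):
    """Return a compact HTML sample suitable for pasting into a prompt.

    Hypothetical helper: max_chars is an arbitrary default, adjust as needed.
    """
    # Drop script, style, and noscript blocks; they rarely matter for selectors.
    html = re.sub(r"(?is)<(script|style|noscript)\b.*?</\1\s*>", "", html)
    # Drop HTML comments.
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    # Collapse runs of whitespace so more markup fits in the same space.
    html = re.sub(r"\s+", " ", html).strip()
    # Hard cut at max_chars; the tail may end mid-tag, which is fine for a sample.
    return html[:max_chars]


page = """
<html><head><style>p {color: red}</style><script>var x = 1;</script></head>
<body><!-- tracking --><div class="product">  Widget  </div></body></html>
"""
print(trim_html_sample(page))
```

Cutting the sample to the container that holds one or two product cards usually gives the model enough structure to work from.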

Technique 2: CSS Selector Generation

Instead of writing a full scraper, ask for just the selectors.

Prompt:

Given this HTML, provide CSS selectors to extract:
1. Product title
2. Price (including sale price)
3. Star rating
4. Number of reviews
5. Product image URL

Return as a Python dictionary mapping field names to selectors.

[paste HTML]

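Typical output (the exact selectors depend on your HTML). Note that the ::attr(...) suffix is Scrapy/parsel syntax rather than standard CSS, so BeautifulSoup's select() will not accept it; with BeautifulSoup, drop the suffix and read the attribute from the matched element instead: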

selectors = {
    "title": "h1.product-title",
    "price": "span.price-current",
    "sale_price": "span.price-sale",
    "rating": "div.star-rating::attr(data-rating)",
    "reviews": "span.review-count",
    "image": "img.product-image::attr(src)"
}
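A dictionary like this can be applied with BeautifulSoup once the ::attr(...) suffix is separated from the CSS part. The sketch below shows one way to do that. The helper names split_selector and extract_fields are hypothetical, and extract_fields assumes it is given a BeautifulSoup object (or Tag), relying only on its select_one, get, and get_text methods.

```python
import re

# Matches "css::attr(name)" and captures the CSS part and the attribute name.
ATTR_SUFFIX = re.compile(r"^(?P<css>.+?)::attr\((?P<attr>[^)]+)\)$")


def split_selector(selector):
    """Split 'css::attr(name)' into (css, name); name is None for text fields."""
    selector = selector.strip()
    match = ATTR_SUFFIX.match(selector)
    if match:
        return match.group("css"), match.group("attr")
    return selector, None


def extract_fields(soup, selectors):
    """Apply a field -> selector mapping to a BeautifulSoup object.

    Fields whose selector matches nothing are returned as None.
    """
    record = {}
    for field, selector in selectors.items():
        css, attr = split_selector(selector)
        node = soup.select_one(css)
        if node is None:
            record[field] = None
        elif attr:
            record[field] = node.get(attr)  # attribute value, or None if absent
        else:
            record[field] = node.get_text(strip=True)
    return record


print(split_selector("img.product-image::attr(src)"))
print(split_selector("h1.product-title"))
```

With bs4 installed, extract_fields(BeautifulSoup(html, "html.parser"), selectors) returns one dictionary per page; a None value is a quick signal that a generated selector does not match.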

Technique 3: Direct Data Extraction

For one-off extractions, paste the HTML directly and ask for structured data.

Prompt:

Extract all product listings from this HTML as a JSON array with 
fields: name, price, url, image_url.

[paste HTML content]

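This works well with GPT-4 on pages with a clear, repetitive structure. Trim the HTML to the container that holds the listings first, since long pages can exceed the model's context window, and spot-check the returned values against the page, because the model can misread or invent fields.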
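Because the reply is plain text, it is worth validating before using it. The sketch below parses a pasted reply and separates complete records from incomplete ones. The function name parse_listings is hypothetical, and the handling of a surrounding Markdown code fence is an assumption about how replies are often formatted, not guaranteed behavior.

```python
import json

REQUIRED_FIELDS = ("name", "price", "url", "image_url")
FENCE = "`" * 3  # Markdown code fence marker that sometimes wraps the JSON


def parse_listings(reply):
    """Parse a model reply into (valid, rejected) lists of listings."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag)
        # and everything from the closing fence onward.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of listings")

    valid, rejected = [], []
    for item in data:
        # Keep only objects where every required field is present and non-empty.
        if isinstance(item, dict) and all(item.get(f) for f in REQUIRED_FIELDS):
            valid.append(item)
        else:
            rejected.append(item)
    return valid, rejected


reply = FENCE + "json\n" + json.dumps([
    {"name": "Widget", "price": "$9.99", "url": "/widget", "image_url": "/widget.png"},
    {"name": "Gadget", "price": "$4.50"},
]) + "\n" + FENCE

valid, rejected = parse_listings(reply)
print(len(valid), len(rejected))  # prints: 1 1
```

Rejected items are a good cue to re-prompt with a smaller HTML sample or to fall back to a regular scraper.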

Technique 4: Debug Failing Scrapers

When your scraper breaks, paste the code, any error message or traceback, and a sample of the HTML the scraper actually received.

Prompt:

My scraper returns empty results. Here is my code:

[paste code]

And here is a sample of the current HTML from the target page:

[paste HTML]

What is wrong and how do I fix it?
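Before pasting everything into the prompt, a quick local check can narrow the problem down. If the class names your scraper targets are missing from the fetched HTML, either the markup has changed or the content is rendered by JavaScript after page load. The sketch below is a rough, regex-based check; find_missing_classes is a hypothetical helper name.

```python
import re

# Captures the value of every quoted class attribute in the HTML.
CLASS_ATTR = re.compile(r"""class\s*=\s*["']([^"']*)["']""", re.IGNORECASE)


def find_missing_classes(html, class_names):
    """Return the class names from class_names that never appear in html."""
    present = set()
    for match in CLASS_ATTR.finditer(html):
        # A class attribute can hold several space-separated names.
        present.update(match.group(1).split())
    return [name for name in class_names if name not in present]


html = '<div class="listing-card featured"><span class="price">$5</span></div>'
print(find_missing_classes(html, ["listing-card", "price", "title"]))
# prints: ['title']
```

Including the list of missing classes in the prompt gives the model a concrete starting point instead of a bare "returns empty results".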

Technique 5: Generate Pagination Logic

Prompt:

Write a Python function that scrapes all pages of search results from 
this site. The pagination uses this pattern:
- Page 1: /search?q=term
- Page 2: /search?q=term&page=2
- Last page indicator: a disabled "Next" button with class "btn-disabled"

Use ScraperAPI for requests and handle rate limiting with delays.
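The kind of function this prompt asks for might look like the sketch below. It is written under several assumptions: fetch is any callable that takes a URL and returns HTML (for example a thin wrapper around a ScraperAPI request), the names page_url and scrape_all_pages are hypothetical, and any element carrying the btn-disabled class is treated as the disabled "Next" button, which is a simplification.

```python
import re
import time
from urllib.parse import urlencode

# Rough check for a class attribute that contains the "btn-disabled" token.
DISABLED_NEXT = re.compile(
    r"""class\s*=\s*["'][^"']*\bbtn-disabled\b[^"']*["']""", re.IGNORECASE
)


def page_url(base, term, page):
    """Build /search?q=term for page 1 and /search?q=term&page=N afterwards."""
    params = {"q": term}
    if page > 1:
        params["page"] = page
    return f"{base}/search?{urlencode(params)}"


def scrape_all_pages(fetch, base, term, max_pages=50, delay=1.0):
    """Fetch result pages until the disabled Next button appears."""
    pages = []
    for page in range(1, max_pages + 1):
        html = fetch(page_url(base, term, page))
        pages.append(html)
        if DISABLED_NEXT.search(html):
            break  # last page reached
        time.sleep(delay)  # simple fixed delay between requests
    return pages


# Offline demonstration with canned pages instead of real requests.
fake_site = {
    "https://example.com/search?q=shoes": '<a class="btn">Next</a>',
    "https://example.com/search?q=shoes&page=2": '<a class="btn btn-disabled">Next</a>',
}
pages = scrape_all_pages(fake_site.__getitem__, "https://example.com", "shoes", delay=0)
print(len(pages))  # prints: 2
```

Passing fetch in as a parameter keeps the pagination logic testable without network access; max_pages acts as a safety cap if the last-page indicator is never found.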

Practical Example: Full Scraper Generation

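# This scraper was generated and refined with ChatGPT assistance
import time

import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_SCRAPERAPI_KEY"


def scrape_listings(base_url, max_pages=10):
    """Collect title, price, and link from each listing card, page by page."""
    all_items = []

    for page in range(1, max_pages + 1):
        # Route the request through ScraperAPI; requests URL-encodes the target URL.
        # Assumes base_url has no query string of its own.
        response = requests.get(
            "https://api.scraperapi.com",
            params={
                "api_key": API_KEY,
                "url": f"{base_url}?page={page}"
            },
            timeout=60,  # Avoid hanging forever on a stalled request
        )
        response.raise_for_status()  # Stop on HTTP errors instead of parsing an error page

        soup = BeautifulSoup(response.text, "html.parser")
        cards = soup.find_all("div", class_="listing-card")

        # No cards means we have run past the last page of results
        if not cards:
            break

        for card in cards:
            title = card.find("h2")
            price = card.find("span", class_="price")
            link = card.find("a", href=True)
            # Skip incomplete cards rather than crashing on a missing element
            if title is None or price is None or link is None:
                continue
            all_items.append({
                "title": title.get_text(strip=True),
                "price": price.get_text(strip=True),
                "link": link["href"]
            })

        time.sleep(1)  # Rate limiting

    return all_items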

Tips for Better Results

  • Provide sample HTML for accurate selector generation
  • Specify the libraries you want (BeautifulSoup, Scrapy, Playwright)
  • Ask for error handling explicitly
  • Iterate by sharing the output or errors and asking for fixes
  • Use ChatGPT to convert scrapers between frameworks (e.g., BeautifulSoup to Scrapy)
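As an illustration of the error-handling tip, a prompt such as "add retries with exponential backoff" tends to produce something along the lines of the sketch below. The name fetch_with_retries and its defaults are hypothetical, and fetch stands for any callable that takes a URL and returns the page content. Catching Exception is deliberately broad for the sketch; in real code you would narrow it, for example to requests.RequestException.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on failure with exponentially growing delays."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose; narrow this in real code
            last_error = error
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, 4x, ... before the next attempt.
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {attempts} attempts") from last_error


# Offline demonstration: a fetch that fails twice, then succeeds.
state = {"calls": 0}


def flaky_fetch(url):
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


delays = []
print(fetch_with_retries(flaky_fetch, "https://example.com", sleep=delays.append))
print(delays)  # prints: [1.0, 2.0]
```

The sleep parameter exists only so the delays can be observed or skipped in tests; production code can leave it at the default.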

ChatGPT does not replace understanding how scraping works, but it dramatically speeds up the development cycle.