Using ChatGPT for Web Scraping - Prompts and Techniques
Learn how to use ChatGPT and GPT-4 to write web scrapers, generate selectors, parse data, and build scraping workflows faster.
ChatGPT can speed up web scraping development: it can write scrapers, debug code, generate selectors, and even extract data directly from HTML. The five techniques below cover the most common uses.
Technique 1: Generate Scrapers from HTML
Paste a sample of the target HTML and ask ChatGPT to write the scraper.
Prompt:
Write a Python scraper using BeautifulSoup that extracts all product
data (name, price, rating, url) from this HTML structure:
[paste sample HTML here]
Use ScraperAPI for the requests. Output as a list of dictionaries.
Technique 2: CSS Selector Generation
Instead of writing a full scraper, ask for just the selectors.
Prompt:
Given this HTML, provide CSS selectors to extract:
1. Product title
2. Price (including sale price)
3. Star rating
4. Number of reviews
5. Product image URL
Return as a Python dictionary mapping field names to selectors.
[paste HTML]
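Typical output (the exact selectors depend on your HTML):
selectors = {
    "title": "h1.product-title",
    "price": "span.price-current",
    "sale_price": "span.price-sale",
    "rating": "div.star-rating::attr(data-rating)",
    "reviews": "span.review-count",
    "image": "img.product-image::attr(src)"
}
Note that ::attr(...) is a Scrapy selector extension, not standard CSS. BeautifulSoup's select() does not support it, so name your library in the prompt if you need plain CSS selectors.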
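One way to use a dictionary like this from BeautifulSoup is to split off any Scrapy-style `::attr(name)` suffix and read that attribute separately. The sketch below is an illustration under assumptions: `apply_selectors` and the sample markup are hypothetical, and it only handles a single trailing `::attr(...)`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for the pasted HTML.
SAMPLE_HTML = """
<div class="product">
  <h1 class="product-title">Example Widget</h1>
  <span class="price-current">$19.99</span>
  <div class="star-rating" data-rating="4.5"></div>
  <img class="product-image" src="/img/widget.jpg">
</div>
"""


def apply_selectors(html, selectors):
    """Return {field: value or None} for a dict of CSS selectors.

    A trailing "::attr(name)" (Scrapy-style) is split off and read as an
    attribute; otherwise the element's text is returned.
    """
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for field, selector in selectors.items():
        css, sep, attr_part = selector.partition("::attr(")
        element = soup.select_one(css)
        if element is None:
            result[field] = None  # selector matched nothing
        elif sep:
            result[field] = element.get(attr_part.rstrip(")"))
        else:
            result[field] = element.get_text(strip=True)
    return result


selectors = {
    "title": "h1.product-title",
    "price": "span.price-current",
    "sale_price": "span.price-sale",
    "rating": "div.star-rating::attr(data-rating)",
    "image": "img.product-image::attr(src)",
}
product = apply_selectors(SAMPLE_HTML, selectors)
```

Fields whose selector matches nothing come back as `None`, which makes stale or guessed selectors easy to spot before they reach a full scraper.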
Technique 3: Direct Data Extraction
For one-off extractions, paste the HTML directly and ask for structured data.
Prompt:
Extract all product listings from this HTML as a JSON array with
fields: name, price, url, image_url.
[paste HTML content]
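This works well with GPT-4 on pages with a clear, regular structure. For long pages, paste only the relevant fragment, and spot-check the returned values against the source HTML.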
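Because the reply is generated text, it is worth validating before it goes into a pipeline. A minimal sketch, assuming the model returned a bare JSON array with the four fields named in the prompt; `parse_listings` and the reply text are made up for illustration.

```python
import json

REQUIRED_FIELDS = ("name", "price", "url", "image_url")


def parse_listings(raw_reply):
    """Parse a JSON array of listings and separate complete records from incomplete ones."""
    data = json.loads(raw_reply)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of listings")
    valid, rejected = [], []
    for item in data:
        # A record is kept only if every required field is present and non-empty.
        if isinstance(item, dict) and all(item.get(f) for f in REQUIRED_FIELDS):
            valid.append(item)
        else:
            rejected.append(item)
    return valid, rejected


# Made-up reply text, standing in for what the model returns.
raw_reply = """[
  {"name": "Widget", "price": "$19.99", "url": "/p/1", "image_url": "/img/1.jpg"},
  {"name": "Gadget", "price": "$5.00", "url": "/p/2"}
]"""
valid, rejected = parse_listings(raw_reply)
```

Keeping the rejected records, rather than silently dropping them, shows which listings to re-extract or check by hand.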
Technique 4: Debug Failing Scrapers
When your scraper breaks, paste the code, the error message or symptom, and a sample of the current HTML from the target page.
Prompt:
My scraper returns empty results. Here is my code:
[paste code]
And here is a sample of the current HTML from the target page:
[paste HTML]
What is wrong and how do I fix it?
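A common cause of empty results is that the class names the scraper looks for no longer appear in the HTML, either because the site changed its markup or because the content is rendered by JavaScript. A quick local check can narrow this down before you ask. The sketch below uses only the standard library; `ClassCollector`, `missing_classes`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_classes(html, expected):
    """Return the expected class names that never appear in the HTML."""
    collector = ClassCollector()
    collector.feed(html)
    return sorted(set(expected) - collector.classes)


# Hypothetical case: the site renamed "listing-card" to "result-card".
current_html = '<div class="result-card featured"><span class="price">$10</span></div>'
missing = missing_classes(current_html, ["listing-card", "price"])
```

Including the list of missing class names in the prompt gives ChatGPT a concrete starting point instead of a bare "returns empty results".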
Technique 5: Generate Pagination Logic
Prompt:
Write a Python function that scrapes all pages of search results from
this site. The pagination uses this pattern:
- Page 1: /search?q=term
- Page 2: /search?q=term&page=2
- Last page indicator: a disabled "Next" button with class "btn-disabled"
Use ScraperAPI for requests and handle rate limiting with delays.
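For comparison, the logic this prompt describes can be sketched as follows. This is an illustration, not what ChatGPT will necessarily produce: `fetch` is an assumed callable that takes a URL and returns HTML (for example a wrapper around a ScraperAPI request), and `page_url`, `is_last_page`, and `scrape_all_pages` are hypothetical names.

```python
import time
from urllib.parse import urlencode

from bs4 import BeautifulSoup


def page_url(base, term, page):
    """Build /search?q=term for page 1 and /search?q=term&page=N after that."""
    params = {"q": term}
    if page > 1:
        params["page"] = page
    return f"{base}/search?{urlencode(params)}"


def is_last_page(html):
    """True when an element with class "btn-disabled" (the disabled Next button) is present."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(".btn-disabled") is not None


def scrape_all_pages(fetch, base, term, max_pages=50, delay=1.0):
    """Fetch pages in order until the last-page marker or max_pages is reached."""
    pages = []
    for page in range(1, max_pages + 1):
        html = fetch(page_url(base, term, page))
        pages.append(html)
        if is_last_page(html):
            break
        time.sleep(delay)  # fixed pause between requests
    return pages
```

Passing `fetch` in as a parameter keeps the loop testable with canned HTML, and the `max_pages` cap guards against an endless loop if the last-page marker ever changes.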
Practical Example: Full Scraper Generation
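# This scraper was generated and refined with ChatGPT assistance
import time

import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_SCRAPERAPI_KEY"


def scrape_listings(base_url, max_pages=10):
    """Scrape title, price, and link from each listing card, page by page."""
    all_items = []
    for page in range(1, max_pages + 1):
        # Route the request through ScraperAPI.
        # Assumes base_url has no query string of its own.
        response = requests.get(
            "http://api.scraperapi.com",
            params={
                "api_key": API_KEY,
                "url": f"{base_url}?page={page}",
            },
            timeout=60,
        )
        response.raise_for_status()  # Fail on HTTP errors instead of parsing an error page
        soup = BeautifulSoup(response.text, "html.parser")
        cards = soup.find_all("div", class_="listing-card")
        if not cards:
            break  # No listings found: assume we are past the last page
        for card in cards:
            title = card.find("h2")
            price = card.find("span", class_="price")
            link = card.find("a", href=True)
            if title is None or price is None or link is None:
                continue  # Skip cards missing an expected element
            all_items.append({
                "title": title.get_text(strip=True),
                "price": price.get_text(strip=True),
                "link": link["href"],
            })
        time.sleep(1)  # Rate limiting
    return all_items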
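The function returns a plain list of dictionaries, so saving the results is a separate step. A minimal sketch, assuming a JSON file is the output you want; `save_items` and the sample item are hypothetical and not part of the scraper above.

```python
import json
from pathlib import Path


def save_items(items, path):
    """Write scraped items to a UTF-8 JSON file and return the item count."""
    Path(path).write_text(
        json.dumps(items, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return len(items)


# Made-up item in the same shape scrape_listings() returns.
sample_items = [{"title": "Example listing", "price": "€120", "link": "/listing/1"}]
```

In a real run this would be called as `save_items(scrape_listings(base_url), "listings.json")`. Setting `ensure_ascii=False` keeps currency symbols and accented characters readable in the file.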
Tips for Better Results
- Provide sample HTML for accurate selector generation
- Specify the libraries you want (BeautifulSoup, Scrapy, Playwright)
- Ask for error handling explicitly
- Iterate by sharing the output or errors and asking for fixes
- Use ChatGPT to convert scrapers between frameworks (e.g., BeautifulSoup to Scrapy)
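As an example of the kind of error handling worth asking for explicitly, here is a small retry helper with exponential backoff. It is a sketch under assumptions: `with_retries` is a hypothetical name, and the right exception types and delays depend on your HTTP library and target site.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Call func(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)
```

A request can then be wrapped as `with_retries(lambda: fetch_page(url))`, where `fetch_page` stands for whatever function performs the request and raises on failure.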
ChatGPT does not replace understanding how scraping works, but it dramatically speeds up the development cycle.