
Scraping with Python and Regex

Use Python regular expressions to extract emails, phone numbers, prices, URLs, and other patterns from scraped web pages.


Regular expressions (regex) are a powerful tool for extracting structured data from unstructured text. While you should use CSS selectors or XPath for navigating HTML, regex excels at extracting specific patterns like emails, phone numbers, prices, and URLs from the text content.

When to Use Regex in Scraping

  • Extracting patterns from raw text (not structured HTML)
  • Pulling data from JavaScript blocks within <script> tags
  • Cleaning extracted data (removing unwanted characters)
  • Parsing URLs and query parameters
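As an illustration of the last bullet, here is a minimal sketch that pulls a page number and one query parameter out of URLs with regex. The URLs and the helper names page_number and query_value are made up for this example. For full URL parsing, the standard library's urllib.parse is usually the safer choice.

```python
import re

# Hypothetical URLs, used only for illustration
urls = [
    "https://example.com/catalogue/page-3.html",
    "https://example.com/search?q=python+regex&page=12",
]


def page_number(url):
    """Return the page number from 'page-N' or 'page=N', or None."""
    match = re.search(r'page[-=](\d+)', url)
    return int(match.group(1)) if match else None


def query_value(url, key):
    """Return the raw (still URL-encoded) value of one query parameter, or None."""
    match = re.search(r'[?&]' + re.escape(key) + r'=([^&#]*)', url)
    return match.group(1) if match else None


print([page_number(u) for u in urls])   # [3, 12]
print(query_value(urls[1], "q"))        # python+regex
```

The value comes back still URL-encoded (the + stands for a space), so decode it before using it as text.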

Extracting Common Patterns

import re
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/", timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, "html.parser")
page_text = soup.get_text()

# Email addresses
emails = re.findall(
    r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
    page_text
)

# Phone numbers (US format)
phones = re.findall(
    r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}',
    page_text
)

# Prices ($XX.XX format)
prices = re.findall(
    r'\$\d+(?:,\d{3})*(?:\.\d{2})?',
    page_text
)

# URLs
urls = re.findall(
    r'https?://[^\s<>"\']+',
    str(soup)
)

print(f"Emails: {emails}")
print(f"Phones: {phones}")
print(f"Prices: {prices}")
print(f"URLs found: {len(urls)}")

Extracting JSON from Script Tags

One of the most useful applications of regex in web scraping is pulling structured data out of inline JavaScript.

import re
import json
import requests

response = requests.get("https://quotes.toscrape.com/js/", timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
html = response.text

# Extract a JavaScript variable containing JSON
pattern = r'var\s+data\s*=\s*(\[.*?\]);'
match = re.search(pattern, html, re.DOTALL)

if match:
    json_str = match.group(1)
    data = json.loads(json_str)
    for item in data[:3]:
        print(item)
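
The non-greedy pattern above stops at the first ]; it meets, so it can cut the JSON short if a string inside the data happens to contain those characters. One alternative, sketched here on a made-up HTML snippet, is to use regex only to find where the value starts and let json.JSONDecoder.raw_decode work out where it ends. This assumes the value is strict JSON, not a JavaScript object literal with unquoted keys or trailing commas.

```python
import json
import re

# Made-up inline script; note the "];" inside the first string value
html = '''
<script>
    var data = [{"text": "a[0];b[1];", "tags": ["x", "y"]}, {"text": "ok", "tags": []}];
    render(data);
</script>
'''

# Regex locates the start of the value only (the \s* skips spaces after "=")
start = re.search(r'var\s+data\s*=\s*', html)

if start:
    # raw_decode parses one JSON value beginning at the given index and
    # returns the parsed object plus the index where that value ended
    data, end = json.JSONDecoder().raw_decode(html, start.end())
    print(len(data))          # 2
    print(data[0]["text"])    # a[0];b[1];
```

raw_decode raises json.JSONDecodeError if the text at that position is not valid JSON, so wrap the call in try/except when the page structure is uncertain.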

Named Groups for Structured Extraction

Named groups make your regex results more readable and easier to work with.

import re

text = """
Product: Widget Pro - Price: $29.99 - SKU: WP-12345
Product: Gadget Plus - Price: $49.99 - SKU: GP-67890
Product: Tool Max - Price: $19.50 - SKU: TM-11111
"""

pattern = r'Product:\s*(?P<name>.+?)\s*-\s*Price:\s*\$(?P<price>\d+\.\d{2})\s*-\s*SKU:\s*(?P<sku>[A-Z]{2}-\d+)'

products = []
for match in re.finditer(pattern, text):
    products.append({
        "name": match.group("name"),
        "price": float(match.group("price")),
        "sku": match.group("sku"),
    })

for p in products:
    print(p)
# {'name': 'Widget Pro', 'price': 29.99, 'sku': 'WP-12345'}
# {'name': 'Gadget Plus', 'price': 49.99, 'sku': 'GP-67890'}
# {'name': 'Tool Max', 'price': 19.5, 'sku': 'TM-11111'}
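
When you want every named group at once, match.groupdict() returns them as a dictionary in a single call. A short sketch, reusing the same pattern on one sample line:

```python
import re

pattern = re.compile(
    r'Product:\s*(?P<name>.+?)\s*-\s*Price:\s*\$(?P<price>\d+\.\d{2})'
    r'\s*-\s*SKU:\s*(?P<sku>[A-Z]{2}-\d+)'
)

line = "Product: Widget Pro - Price: $29.99 - SKU: WP-12345"
match = pattern.search(line)

if match:
    product = match.groupdict()                  # all named groups, as strings
    product["price"] = float(product["price"])   # convert types after extraction
    print(product)
    # {'name': 'Widget Pro', 'price': 29.99, 'sku': 'WP-12345'}
```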

Cleaning Extracted Data with Regex

import re


def clean_price(text):
    """Extract numeric price from various formats."""
    # Commas are stripped first, so this assumes "," is a thousands separator
    match = re.search(r'\d+(?:\.\d+)?', text.replace(",", ""))
    return float(match.group()) if match else None


def clean_whitespace(text):
    """Normalize whitespace in extracted text."""
    return re.sub(r'\s+', ' ', text).strip()


def extract_numbers(text):
    """Pull all numbers from a string."""
    return [int(n) for n in re.findall(r'\d+', text)]


# Examples
print(clean_price("Price: $1,299.99"))    # 1299.99
print(clean_price("EUR 49.50"))            # 49.5
print(clean_whitespace("  too   many   spaces  "))  # "too many spaces"
print(extract_numbers("Page 3 of 47 (940 results)"))  # [3, 47, 940]

Regex Quick Reference

Pattern          Matches
\d+              One or more digits
\s+              One or more whitespace characters
[A-Za-z]+        One or more letters
.+?              Any characters (non-greedy)
(?:...)          Non-capturing group
(?P<name>...)    Named capture group
^ / $            Start / end of string (of each line with re.MULTILINE)
re.DOTALL        Make . match newlines
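
The anchors and the dot behave differently depending on flags. A small sketch on a made-up three-line string shows the effect of re.MULTILINE and re.DOTALL:

```python
import re

text = "first line\nsecond line\nthird line"

# Without flags, ^ matches only at the very start of the string
no_flag = re.findall(r'^\w+', text)
print(no_flag)        # ['first']

# re.MULTILINE lets ^ (and $) match at each line boundary
multiline = re.findall(r'^\w+', text, re.MULTILINE)
print(multiline)      # ['first', 'second', 'third']

# By default . does not match a newline, so this cannot span lines
plain_dot = re.search(r'first.*third', text)
print(plain_dot)      # None

# re.DOTALL lets . match newlines as well
dotall = re.search(r'first.*third', text, re.DOTALL)
print(dotall is not None)   # True
```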

Tips

  • Never parse HTML structure with regex; use BeautifulSoup or lxml for that. Apply regex to the text content or raw strings.
  • Prefer non-greedy quantifiers (.+? instead of .+) to avoid matching too much.
  • Use re.compile() for patterns you use repeatedly. It saves a pattern-cache lookup on each call and gives the pattern a reusable name.
  • Test your regex patterns at regex101.com before putting them in code.
  • When scraping pages that require proxy rotation, use ScraperAPI to fetch the HTML, then apply regex to the response text.
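
Two of these tips are easy to see in a few lines. The sketch below, using made-up sample strings and a hypothetical PRICE_RE name, compares greedy and non-greedy matching and then reuses one compiled pattern across several lines:

```python
import re

text = 'He said "first" and then "second".'

# Greedy: .+ runs to the last closing quote it can reach
greedy = re.findall(r'"(.+)"', text)
print(greedy)   # ['first" and then "second']

# Non-greedy: .+? stops at the nearest closing quote
lazy = re.findall(r'"(.+?)"', text)
print(lazy)     # ['first', 'second']

# Compile once, reuse many times (same price pattern as earlier in the article)
PRICE_RE = re.compile(r'\$\d+(?:,\d{3})*(?:\.\d{2})?')

lines = ["Widget $1,299.99", "no price here", "Gadget $5"]
found = []
for line in lines:
    match = PRICE_RE.search(line)
    if match:
        found.append(match.group())

print(found)    # ['$1,299.99', '$5']
```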

Next Steps

  • Learn to handle different text encodings in scraped content
  • Explore XML and RSS feed scraping for structured data sources