Using Regex for Data Extraction

Learn to use Python regular expressions to extract emails, URLs, prices, dates, and other patterns from scraped text.

Regular expressions (regex) are invaluable when you need to extract specific patterns from unstructured text: emails, phone numbers, prices, dates, or any other data embedded in prose.

When to Use Regex vs. BeautifulSoup

Use Case | Best Tool
Extracting HTML elements | BeautifulSoup / CSS selectors
Extracting patterns from text | Regex
Emails, phones, URLs in text | Regex
Cleaning extracted strings | Regex
Parsing structured HTML | BeautifulSoup

Essential Regex Patterns

import re

text = """
Contact us at support@scrapingcentral.com or sales@example.com.
Call 1-800-555-0199 or (415) 555-0100.
Prices: $29.99, $149.00, and EUR 199.50.
Visit https://scrapingcentral.com/pricing for details.
Order #12345 placed on 2025-03-15.
"""

# Emails
emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)  # each dot must be followed by a label, so a sentence-ending "." is not captured
print(f"Emails: {emails}")

# Phone numbers (US formats)
phones = re.findall(r'(?:1-)?\d{3}-\d{3}-\d{4}|\(\d{3}\)\s?\d{3}-\d{4}', text)  # explicit digit groups, so the date 2025-03-15 is not matched
print(f"Phones: {phones}")

# Prices (USD and EUR)
prices = re.findall(r'[\$\u20ac][\d,]+\.?\d*|EUR\s?[\d,]+\.?\d*', text)
print(f"Prices: {prices}")

# URLs
urls = re.findall(r'https?://[\w./\-?=&]+', text)
print(f"URLs: {urls}")

# Order numbers
orders = re.findall(r'#(\d{5,})', text)
print(f"Order IDs: {orders}")

# Dates (YYYY-MM-DD)
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(f"Dates: {dates}")
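
The patterns above return plain strings, and a string with the right shape is not always a valid value. As a rough sketch of one way to go from matches to typed values, the example below validates date matches with datetime.strptime and converts dollar amounts to Decimal. The sample sentence and the helper names parse_dates and parse_usd are made up for illustration.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical sample text; the second "date" has the right shape but is not a real date
sample = "Order #12345 placed on 2025-03-15 for $1,299.99; shipped 2025-13-40."

def parse_dates(text):
    """Return real calendar dates, dropping shape-only matches such as 2025-13-40."""
    found = []
    for raw in re.findall(r'\b\d{4}-\d{2}-\d{2}\b', text):
        try:
            found.append(datetime.strptime(raw, "%Y-%m-%d").date())
        except ValueError:
            continue  # matched the pattern but is not a valid date
    return found

def parse_usd(text):
    """Return dollar amounts as Decimal, with thousands separators removed."""
    amounts = re.findall(r'\$(\d[\d,]*(?:\.\d+)?)', text)
    return [Decimal(amount.replace(",", "")) for amount in amounts]

print(parse_dates(sample))
print(parse_usd(sample))
```

Decimal is used here instead of float only to avoid rounding surprises with money; float works too if exact cents do not matter.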

Named Groups for Structured Extraction

import re

product_text = """
ScraperAPI - $49.99/month - 100,000 requests
ScrapingAnt - $29.00/month - 50,000 requests
Bright Data - $199.00/month - 500,000 requests
"""

pattern = r'(?P<name>[\w\s]+?)\s*-\s*\$(?P<price>[\d.]+)/month\s*-\s*(?P<requests>[\d,]+)\s*requests'

for match in re.finditer(pattern, product_text):
    print(f"{match.group('name').strip()}: ${match.group('price')} ({match.group('requests')} requests)")

Output:

ScraperAPI: $49.99 (100,000 requests)
ScrapingAnt: $29.00 (50,000 requests)
Bright Data: $199.00 (500,000 requests)
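
Named groups also make it easy to turn each match into a record. The sketch below is one possible variation on the pattern above, not the only way to write it: it anchors the pattern to whole lines with re.MULTILINE, reads the groups with match.groupdict(), and converts the price and request count to numbers. The names plan_pattern and parse_plans are assumptions introduced for this example.

```python
import re

product_text = """
ScraperAPI - $49.99/month - 100,000 requests
ScrapingAnt - $29.00/month - 50,000 requests
Bright Data - $199.00/month - 500,000 requests
"""

# ^ and $ match at line boundaries because of re.MULTILINE
plan_pattern = re.compile(
    r'^(?P<name>[\w ]+?)\s*-\s*\$(?P<price>\d+(?:\.\d+)?)/month'
    r'\s*-\s*(?P<requests>[\d,]+)\s*requests$',
    re.MULTILINE,
)

def parse_plans(text):
    """Return one dict per matching line, with numeric fields converted."""
    plans = []
    for match in plan_pattern.finditer(text):
        row = match.groupdict()  # {'name': ..., 'price': ..., 'requests': ...}
        plans.append({
            "name": row["name"].strip(),
            "price": float(row["price"]),
            "requests": int(row["requests"].replace(",", "")),
        })
    return plans

plans = parse_plans(product_text)
for plan in plans:
    print(plan)
```

A list of dicts like this can be passed straight to csv.DictWriter or a DataFrame constructor.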

Regex with Scraped HTML

Combine BeautifulSoup for structure and regex for text patterns:

import re
import requests
from bs4 import BeautifulSoup

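response = requests.get("https://quotes.toscrape.com/", timeout=15)
response.raise_for_status()  # stop early on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(response.text, "lxml")  # "lxml" requires the lxml package; "html.parser" also works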

# Get all text from the page
full_text = soup.get_text()

# Extract all quoted text (between smart quotes)
quoted = re.findall(r'\u201c(.+?)\u201d', full_text)
for q in quoted[:3]:
    print(f"Quote: {q[:60]}...")

Cleaning Scraped Data with Regex

import re

def clean_price(text):
    """Extract numeric price from messy text."""
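    # Thousands separators are removed first, so the pattern only needs digits and an optional decimal part
    match = re.search(r'\d+(?:\.\d+)?', text.replace(",", ""))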
    return float(match.group()) if match else None

def clean_whitespace(text):
    """Collapse multiple spaces and strip."""
    return re.sub(r'\s+', ' ', text).strip()

def strip_html_tags(text):
    """Remove HTML tags from a string."""
    return re.sub(r'<[^>]+>', '', text)

# Examples
print(clean_price("Price: $1,299.99 USD"))      # 1299.99
print(clean_whitespace("  hello   world  \n"))   # "hello world"
print(strip_html_tags("<b>Bold</b> text"))       # "Bold text"
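
The same re.sub approach works for normalizing values that appear in several formats, such as the US phone numbers matched earlier. This is a minimal sketch under the assumption that only 10-digit US numbers (optionally prefixed with the country code 1) are of interest; normalize_us_phone is a hypothetical helper name.

```python
import re

def normalize_us_phone(raw):
    """Return a US number as XXX-XXX-XXXX, or None if it does not have 10 digits."""
    digits = re.sub(r'\D', '', raw)  # \D matches any non-digit character
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

print(normalize_us_phone("1-800-555-0199"))
print(normalize_us_phone("(415) 555-0100"))
print(normalize_us_phone("555-0100"))
```

Returning None for anything that is not 10 digits keeps malformed matches out of the cleaned dataset rather than guessing at a missing area code.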

Common Pitfalls

  • Greedy matching: Use .*? (non-greedy) instead of .* to avoid matching too much
  • Not escaping special chars: Dots, brackets, and dollar signs have special meaning in a pattern and need escaping (\., \[, \$) to match literally
  • Parsing HTML with regex: Use BeautifulSoup for HTML structure, regex only for text patterns
  • Performance: Compile patterns with re.compile() when reusing them
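
import re

# Compile once, then reuse the compiled pattern inside the loop
email_pattern = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
scraped_pages = ["Write to support@scrapingcentral.com."]  # placeholder for your scraped page texts

for page_text in scraped_pages:
    emails = email_pattern.findall(page_text)
    print(emails)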
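
The first two pitfalls are easiest to see side by side. The snippet below is an illustrative sketch with made-up sample strings: it compares greedy and non-greedy matching on quoted text, then an unescaped dot against an escaped one, and finally uses re.escape() to build a literal pattern from an arbitrary string.

```python
import re

sentence = 'She said "one" and then "two".'

# Greedy: .* runs to the LAST closing quote, swallowing everything in between
greedy = re.findall(r'"(.*)"', sentence)
# Non-greedy: .*? stops at the FIRST closing quote
lazy = re.findall(r'"(.*?)"', sentence)

print(greedy)  # ['one" and then "two']
print(lazy)    # ['one', 'two']

price_text = "Total: $5.00 (was 5x00)"

# An unescaped dot matches any character, so "5x00" is matched as well
unescaped = re.findall(r'5.00', price_text)
# An escaped dot matches only a literal "."
escaped = re.findall(r'5\.00', price_text)

print(unescaped)  # ['5.00', '5x00']
print(escaped)    # ['5.00']

# re.escape() escapes every special character in a string for you
needle = "$5.00"
literal = re.findall(re.escape(needle), price_text)
print(literal)    # ['$5.00']
```

re.escape() is most useful when the search string comes from data or user input and you cannot know in advance which special characters it contains.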

Next Steps

  • Clean scraped data with pandas
  • Extract structured data from unstructured HTML
  • Parse dates and prices from text