Using Regex for Data Extraction - Data Parsing

Learn to use Python regular expressions to extract emails, URLs, prices, dates, and other patterns from scraped text.

Regular expressions (regex) are invaluable when you need to extract specific patterns from unstructured text: emails, phone numbers, prices, dates, or any other data embedded in prose.

When to Use Regex vs. BeautifulSoup

Use Case	Best Tool
Extracting HTML elements	BeautifulSoup / CSS selectors
Extracting patterns from text	Regex
Emails, phones, URLs in text	Regex
Cleaning extracted strings	Regex
Parsing structured HTML	BeautifulSoup

Essential Regex Patterns

import re

text = """
Contact us at support@scrapingcentral.com or sales@example.com.
Call 1-800-555-0199 or (415) 555-0100.
Prices: $29.99, $149.00, and EUR 199.50.
Visit https://scrapingcentral.com/pricing for details.
Order #12345 placed on 2025-03-15.
"""

# Emails (each domain label must follow a dot, so a sentence-ending period is not captured)
emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)
print(f"Emails: {emails}")

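# Phone numbers (US formats such as 1-800-555-0199 or (415) 555-0100)
# Requiring the 3-3-4 digit grouping keeps dates like 2025-03-15 from matching
phones = re.findall(r'(?:\d-)?\d{3}-\d{3}-\d{4}|\(\d{3}\)\s?\d{3}-\d{4}', text)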
print(f"Phones: {phones}")

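# Prices (USD and EUR); commas are accepted only as thousands separators,
# so the comma after a price in a list is not captured
prices = re.findall(r'(?:[\$\u20ac]|EUR\s?)\d+(?:,\d{3})*(?:\.\d+)?', text)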
print(f"Prices: {prices}")

# URLs
urls = re.findall(r'https?://[\w./\-?=&]+', text)
print(f"URLs: {urls}")

# Order numbers
orders = re.findall(r'#(\d{5,})', text)
print(f"Order IDs: {orders}")

# Dates (YYYY-MM-DD)
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(f"Dates: {dates}")
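
A pattern such as \d{4}-\d{2}-\d{2} only checks the shape of a string, so it also accepts impossible dates like 2025-13-45. Below is a minimal sketch of one way to filter the candidates, assuming ISO-style dates and using the standard library's datetime.strptime. The helper name valid_dates and the sample string are illustrative and not part of the example above.

```python
import re
from datetime import datetime

def valid_dates(text):
    """Return YYYY-MM-DD strings from text that are real calendar dates."""
    results = []
    # The regex finds candidates by shape only
    for candidate in re.findall(r'\b\d{4}-\d{2}-\d{2}\b', text):
        try:
            # strptime raises ValueError for impossible months or days
            datetime.strptime(candidate, "%Y-%m-%d")
        except ValueError:
            continue
        results.append(candidate)
    return results

# Hypothetical sample text with one impossible date
sample = "Placed on 2025-03-15, shipped 2025-13-45, invoice 2024-02-29."
print(valid_dates(sample))  # ['2025-03-15', '2024-02-29']
```

The regex stays simple and the calendar rules (month lengths, leap years) are left to datetime, which is usually easier to maintain than encoding them in the pattern.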

Named Groups for Structured Extraction

import re

product_text = """
ScraperAPI - $49.99/month - 100,000 requests
ScrapingAnt - $29.00/month - 50,000 requests
Bright Data - $199.00/month - 500,000 requests
"""

pattern = r'(?P<name>[\w\s]+?)\s*-\s*\$(?P<price>[\d.]+)/month\s*-\s*(?P<requests>[\d,]+)\s*requests'

for match in re.finditer(pattern, product_text):
    print(f"{match.group('name').strip()}: ${match.group('price')} ({match.group('requests')} requests)")

Output:

ScraperAPI: $49.99 (100,000 requests)
ScrapingAnt: $29.00 (50,000 requests)
Bright Data: $199.00 (500,000 requests)
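
When the matches feed into further processing, it can help to turn each one into a dictionary with typed values. The sketch below uses match.groupdict() with the same named groups as the pattern above. The function name parse_plans and the float/int conversions are assumptions made for illustration.

```python
import re

# Same named groups as the pattern above, compiled once for reuse
PLAN_PATTERN = re.compile(
    r'(?P<name>[\w\s]+?)\s*-\s*\$(?P<price>[\d.]+)/month'
    r'\s*-\s*(?P<requests>[\d,]+)\s*requests'
)

def parse_plans(text):
    """Return one dict per matched plan, with numeric fields converted."""
    plans = []
    for match in PLAN_PATTERN.finditer(text):
        row = match.groupdict()  # {'name': ..., 'price': ..., 'requests': ...}
        plans.append({
            "name": row["name"].strip(),
            "price": float(row["price"]),
            "requests": int(row["requests"].replace(",", "")),
        })
    return plans

product_text = """
ScraperAPI - $49.99/month - 100,000 requests
Bright Data - $199.00/month - 500,000 requests
"""

for plan in parse_plans(product_text):
    print(plan)
```

A list of dictionaries like this can be written to CSV or JSON directly, or passed to a DataFrame constructor.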

Regex with Scraped HTML

Combine BeautifulSoup for structure and regex for text patterns:

import re
import requests
from bs4 import BeautifulSoup

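response = requests.get("https://quotes.toscrape.com/", timeout=15)
response.raise_for_status()  # Stop early on HTTP errors instead of parsing an error page
# The "lxml" parser needs the lxml package installed; "html.parser" is the built-in alternative
soup = BeautifulSoup(response.text, "lxml")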

# Get all text from the page
full_text = soup.get_text()

# Extract all quoted text (between smart quotes)
quoted = re.findall(r'\u201c(.+?)\u201d', full_text)
for q in quoted[:3]:
    print(f"Quote: {q[:60]}...")

Cleaning Scraped Data with Regex

import re

def clean_price(text):
    """Extract numeric price from messy text."""
    # Drop thousands separators first, then match digits with an optional decimal part
    match = re.search(r'\d+(?:\.\d+)?', text.replace(",", ""))
    return float(match.group()) if match else None

def clean_whitespace(text):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces and strip."""
    return re.sub(r'\s+', ' ', text).strip()

def strip_html_tags(text):
    """Remove HTML tags from a string."""
    return re.sub(r'<[^>]+>', '', text)

# Examples
print(clean_price("Price: $1,299.99 USD"))      # 1299.99
print(clean_whitespace("  hello   world  \n"))   # "hello world"
print(strip_html_tags("<b>Bold</b> text"))       # "Bold text"

Common Pitfalls

Greedy matching: Use .*? (non-greedy) instead of .* to avoid matching too much
Not escaping special chars: Dots, brackets, and dollar signs need escaping (\., \[, \$) when they should match literally
Parsing HTML with regex: Use BeautifulSoup for HTML structure, regex only for text patterns
Performance: Compile patterns with re.compile() when reusing them

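import re

# Compile once, reuse in loops
email_pattern = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

# Placeholder data standing in for the text of each scraped page
scraped_pages = ["Contact support@scrapingcentral.com.", "Write to sales@example.com"]
for page_text in scraped_pages:
    emails = email_pattern.findall(page_text)
    print(emails)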
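
The first two pitfalls in the list can be seen in a few lines. This is an illustrative sketch; the sample strings and variable names are made up for the demonstration.

```python
import re

text = 'She said "first" and then "second".'

# Greedy: .* runs to the LAST closing quote, swallowing everything between
greedy = re.findall(r'"(.*)"', text)
# Non-greedy: .*? stops at the FIRST closing quote
lazy = re.findall(r'"(.*?)"', text)
print(greedy)  # ['first" and then "second']
print(lazy)    # ['first', 'second']

# An unescaped dot matches any character, not just a literal "."
loose = re.findall(r'v1.0', 'v1.0 v1x0')
strict = re.findall(r'v1\.0', 'v1.0 v1x0')
print(loose)   # ['v1.0', 'v1x0']
print(strict)  # ['v1.0']

# re.escape() escapes a literal string so it can be embedded in a pattern
label = 'Total ($)'
total_pattern = re.compile(re.escape(label) + r':\s*(\d+)')
totals = total_pattern.findall('Total ($): 42')
print(totals)  # ['42']
```

re.escape() is useful when the literal part of a pattern comes from data, such as a label scraped from the page, rather than being typed by hand.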

Next Steps

Clean scraped data with pandas
Extract structured data from unstructured HTML
Parse dates and prices from text