Using Regex for Data Extraction
Learn to use Python regular expressions to extract emails, URLs, prices, dates, and other patterns from scraped text.
Regular expressions (regex) are invaluable when you need to extract specific patterns from unstructured text: emails, phone numbers, prices, dates, or any other data embedded in prose.
When to Use Regex vs. BeautifulSoup
| Use Case | Best Tool |
|---|---|
| Extracting HTML elements | BeautifulSoup / CSS selectors |
| Extracting patterns from text | Regex |
| Emails, phones, URLs in text | Regex |
| Cleaning extracted strings | Regex |
| Parsing structured HTML | BeautifulSoup |
Essential Regex Patterns
import re
text = """
Contact us at support@scrapingcentral.com or sales@example.com.
Call 1-800-555-0199 or (415) 555-0100.
Prices: $29.99, $149.00, and EUR 199.50.
Visit https://scrapingcentral.com/pricing for details.
Order #12345 placed on 2025-03-15.
"""
# Emails (each domain label must follow a dot, so a sentence-ending period is not captured)
emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)
print(f"Emails: {emails}")
# Phone numbers (US formats such as 1-800-555-0199 and (415) 555-0100)
phones = re.findall(r'(?:1-)?\d{3}-\d{3}-\d{4}|\(\d{3}\)\s?\d{3}-\d{4}', text)
print(f"Phones: {phones}")
# Prices (USD and EUR)
prices = re.findall(r'[\$\u20ac][\d,]+\.?\d*|EUR\s?[\d,]+\.?\d*', text)
print(f"Prices: {prices}")
# URLs
urls = re.findall(r'https?://[\w./\-?=&]+', text)
print(f"URLs: {urls}")
# Order numbers
orders = re.findall(r'#(\d{5,})', text)
print(f"Order IDs: {orders}")
# Dates (YYYY-MM-DD)
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(f"Dates: {dates}")
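A date pattern like the one above only checks the shape of the string, so it would also accept something like 2025-13-45. One possible way to filter such matches, sketched here under the assumption that the standard library's datetime parsing is an acceptable validity check, is shown below. The helper name extract_valid_dates and the sample string are illustrative and not part of the original tutorial.

```python
import re
from datetime import datetime


def extract_valid_dates(text):
    """Return date objects for YYYY-MM-DD strings that are real calendar dates."""
    valid = []
    for candidate in re.findall(r'\b\d{4}-\d{2}-\d{2}\b', text):
        try:
            valid.append(datetime.strptime(candidate, "%Y-%m-%d").date())
        except ValueError:
            # The shape matched, but the calendar date does not exist
            continue
    return valid


# Hypothetical sample text: one real date and one impossible date
sample = "Placed on 2025-03-15, shipped 2025-13-45 (typo in source)."
print(extract_valid_dates(sample))  # [datetime.date(2025, 3, 15)]
```

Here regex finds candidates and a stricter parser decides which ones to keep. This is usually simpler than encoding calendar rules in the pattern itself.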
Named Groups for Structured Extraction
import re
product_text = """
ScraperAPI - $49.99/month - 100,000 requests
ScrapingAnt - $29.00/month - 50,000 requests
Bright Data - $199.00/month - 500,000 requests
"""
pattern = r'(?P<name>[\w\s]+?)\s*-\s*\$(?P<price>[\d.]+)/month\s*-\s*(?P<requests>[\d,]+)\s*requests'
for match in re.finditer(pattern, product_text):
    print(f"{match.group('name').strip()}: ${match.group('price')} ({match.group('requests')} requests)")

Output:

ScraperAPI: $49.99 (100,000 requests)
ScrapingAnt: $29.00 (50,000 requests)
Bright Data: $199.00 (500,000 requests)
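Named groups also make it easy to turn each match into a record. The sketch below reuses the same pattern and calls match.groupdict() to get a dictionary per match, then converts the price and request count to numbers. The function name parse_plans and the shortened sample text are assumptions made for illustration.

```python
import re

# Same pattern as in the section above, compiled once for reuse
PLAN_PATTERN = re.compile(
    r'(?P<name>[\w\s]+?)\s*-\s*\$(?P<price>[\d.]+)/month'
    r'\s*-\s*(?P<requests>[\d,]+)\s*requests'
)


def parse_plans(text):
    """Turn each matching line into a dict with typed values."""
    plans = []
    for match in PLAN_PATTERN.finditer(text):
        row = match.groupdict()  # {'name': ..., 'price': ..., 'requests': ...}
        plans.append({
            "name": row["name"].strip(),
            "price": float(row["price"]),
            "requests": int(row["requests"].replace(",", "")),
        })
    return plans


# Illustrative sample in the same format as product_text
sample = (
    "ScraperAPI - $49.99/month - 100,000 requests\n"
    "Bright Data - $199.00/month - 500,000 requests\n"
)
print(parse_plans(sample))
# [{'name': 'ScraperAPI', 'price': 49.99, 'requests': 100000},
#  {'name': 'Bright Data', 'price': 199.0, 'requests': 500000}]
```

A list of dictionaries like this can be written to CSV or JSON, or loaded into a DataFrame, without further parsing.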
Regex with Scraped HTML
Combine BeautifulSoup for structure and regex for text patterns:
import re
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com/", timeout=15)
soup = BeautifulSoup(response.text, "lxml")
# Get all text from the page
full_text = soup.get_text()
# Extract all quoted text (between smart quotes)
quoted = re.findall(r'\u201c(.+?)\u201d', full_text)
for q in quoted[:3]:
    print(f"Quote: {q[:60]}...")
Cleaning Scraped Data with Regex
import re
def clean_price(text):
    """Extract numeric price from messy text."""
    # Drop thousands separators first, then grab the first number
    match = re.search(r'\d+(?:\.\d+)?', text.replace(",", ""))
    return float(match.group()) if match else None

def clean_whitespace(text):
    """Collapse multiple spaces and strip."""
    return re.sub(r'\s+', ' ', text).strip()

def strip_html_tags(text):
    """Remove HTML tags from a string."""
    return re.sub(r'<[^>]+>', '', text)

# Examples
print(clean_price("Price: $1,299.99 USD")) # 1299.99
print(clean_whitespace("  hello    world  \n")) # "hello world"
print(strip_html_tags("<b>Bold</b> text")) # "Bold text"
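The same re.sub approach can normalize values extracted earlier, such as phone numbers. The following is a minimal sketch that assumes US numbers only; the helper name normalize_us_phone is hypothetical. It strips every non-digit character and accepts the result only if ten digits remain after an optional leading country code.

```python
import re


def normalize_us_phone(raw):
    """Reduce a US phone string to its 10 digits, or return None."""
    digits = re.sub(r'\D', '', raw)  # drop everything except digits
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]          # drop the leading country code
    return digits if len(digits) == 10 else None


print(normalize_us_phone("1-800-555-0199"))  # 8005550199
print(normalize_us_phone("(415) 555-0100"))  # 4155550100
print(normalize_us_phone("555-0100"))        # None
```

Returning None for strings that do not fit mirrors clean_price above, so callers can filter out failed extractions in the same way.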
Common Pitfalls
- Greedy matching: Use .*? (non-greedy) instead of .* to avoid matching too much
- Not escaping special chars: Dots, brackets, and dollar signs need escaping (\., \[, \$)
- Parsing HTML with regex: Use BeautifulSoup for HTML structure, regex only for text patterns
- Performance: Compile patterns with re.compile() when reusing them
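import re
# Compile once for reuse in loops
email_pattern = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
# scraped_pages is a placeholder for the page texts you have already collected
scraped_pages = ["Write to info@example.com.", "No address on this page."]
for page_text in scraped_pages:
    emails = email_pattern.findall(page_text)
    print(emails)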
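The greedy-matching and escaping pitfalls listed above are easiest to see side by side. The short sketch below uses made-up sample strings. It compares a greedy and a non-greedy capture, then an unescaped and an escaped dot, and finally uses re.escape() to build a pattern from a literal string.

```python
import re

# Greedy vs. non-greedy: capture the text between double quotes
quoted_text = 'say "first" and "second"'
greedy = re.findall(r'"(.*)"', quoted_text)
lazy = re.findall(r'"(.*?)"', quoted_text)
print(greedy)  # ['first" and "second']
print(lazy)    # ['first', 'second']

# Escaping: an unescaped dot matches any character, not just a period
price_text = "Total: $5.00 (was $7x00)"
loose = re.findall(r'\$\d.\d\d', price_text)
strict = re.findall(r'\$\d\.\d\d', price_text)
print(loose)   # ['$5.00', '$7x00']
print(strict)  # ['$5.00']

# re.escape() escapes special characters when the pattern is a literal string
literal = re.findall(re.escape("$5.00"), price_text)
print(literal)  # ['$5.00']
```

When the text to search for comes from data rather than from a hand-written pattern, re.escape() is generally the safer choice.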
Next Steps
- Clean scraped data with pandas
- Extract structured data from unstructured HTML
- Parse dates and prices from text