Scraping with Python and Regex
Use Python regular expressions to extract emails, phone numbers, prices, URLs, and other patterns from scraped web pages.
Regular expressions (regex) are a powerful tool for extracting structured data from unstructured text. While you should use CSS selectors or XPath for navigating HTML, regex excels at extracting specific patterns like emails, phone numbers, prices, and URLs from the text content.
When to Use Regex in Scraping
- Extracting patterns from raw text (not structured HTML)
- Pulling data from JavaScript blocks within `<script>` tags
- Cleaning extracted data (removing unwanted characters)
- Parsing URLs and query parameters
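As a small illustration of the last item, the sketch below pulls a page number out of a URL path with regex and reads the query parameters with the standard library's `urllib.parse`. The URL is a made-up example, not taken from a real site.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical URL used only for illustration
url = "https://example.com/catalogue/page-3.html?sort=price&category=books"

# Regex handles the free-form part: the page number embedded in the path
page_match = re.search(r'page-(\d+)\.html', url)
page_number = int(page_match.group(1)) if page_match else None

# The query string has a defined grammar, so a real parser is the safer choice
params = parse_qs(urlparse(url).query)

print(page_number)  # 3
print(params)       # {'sort': ['price'], 'category': ['books']}
```

Splitting the work this way keeps the regex short: it only has to describe the one pattern that no standard parser knows about.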
Extracting Common Patterns
import re
import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.text, "html.parser")
page_text = soup.get_text()
# Email addresses
emails = re.findall(
r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
page_text
)
# Phone numbers (US format)
phones = re.findall(
r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}',
page_text
)
# Prices ($XX.XX format)
prices = re.findall(
r'\$\d+(?:,\d{3})*(?:\.\d{2})?',
page_text
)
# URLs
urls = re.findall(
r'https?://[^\s<>"\']+',
str(soup)
)
print(f"Emails: {emails}")
print(f"Phones: {phones}")
print(f"Prices: {prices}")
print(f"URLs found: {len(urls)}")
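A live page may not contain every kind of pattern, so it can help to check the expressions against a fixed string first. The sketch below runs the same four patterns on made-up sample text (none of the contacts or prices are real) and shows one caveat: the URL pattern keeps trailing punctuation, which usually needs stripping.

```python
import re

# Made-up sample text; none of these contacts or prices are real
sample = (
    "Contact sales@example.com or support@example.org. "
    "Call (555) 123-4567 or 555.987.6543. "
    "Widget: $1,299.99, Gadget: $5. "
    "See https://example.com/docs."
)

emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', sample)
phones = re.findall(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', sample)
prices = re.findall(r'\$\d+(?:,\d{3})*(?:\.\d{2})?', sample)
urls = re.findall(r'https?://[^\s<>"\']+', sample)

# The URL pattern stops only at whitespace, quotes, or angle brackets,
# so a sentence-ending period is captured and has to be trimmed
clean_urls = [u.rstrip('.,;:!?)') for u in urls]

print(emails)      # ['sales@example.com', 'support@example.org']
print(phones)      # ['(555) 123-4567', '555.987.6543']
print(prices)      # ['$1,299.99', '$5']
print(urls)        # ['https://example.com/docs.']
print(clean_urls)  # ['https://example.com/docs']
```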
Extracting JSON from Script Tags
This is one of the most useful regex techniques in web scraping: pulling structured data out of inline JavaScript.
import re
import json
import requests
response = requests.get("https://quotes.toscrape.com/js/")
html = response.text
# Extract a JavaScript variable containing JSON
pattern = r'var\s+data\s*=\s*(\[.*?\]);'
match = re.search(pattern, html, re.DOTALL)
if match:
    json_str = match.group(1)
    data = json.loads(json_str)
    for item in data[:3]:
        print(item)
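The same idea can be tested offline and wrapped in a helper that fails quietly. The sketch below is one possible shape, not the only one: the function name `extract_js_array` and the sample page source are assumptions made for illustration.

```python
import json
import re

# Hypothetical page source; the variable names and contents are assumptions
html = """
<script>
    var data = [
        {"text": "First quote", "tags": ["life", "books"]},
        {"text": "Second quote", "tags": []}
    ];
    var other = [1, 2, 3];
</script>
"""

def extract_js_array(source, var_name):
    """Return the JSON array assigned to var_name, or None if absent or invalid."""
    # Non-greedy .*? stops at the first "];" after the opening bracket
    pattern = rf'var\s+{re.escape(var_name)}\s*=\s*(\[.*?\]);'
    match = re.search(pattern, source, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The captured text was JavaScript but not valid JSON
        return None

data = extract_js_array(html, "data")
print(len(data))                          # 2
print(data[0]["tags"])                    # ['life', 'books']
print(extract_js_array(html, "missing"))  # None
```

Because the pattern ends at the first `];`, a string value that itself contains `];` would cut the capture short. In that case `json.loads` raises and the helper returns `None` rather than partial data.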
Named Groups for Structured Extraction
Named groups make your regex results more readable and easier to work with.
import re
text = """
Product: Widget Pro - Price: $29.99 - SKU: WP-12345
Product: Gadget Plus - Price: $49.99 - SKU: GP-67890
Product: Tool Max - Price: $19.50 - SKU: TM-11111
"""
pattern = r'Product:\s*(?P<name>.+?)\s*-\s*Price:\s*\$(?P<price>\d+\.\d{2})\s*-\s*SKU:\s*(?P<sku>[A-Z]{2}-\d+)'
products = []
for match in re.finditer(pattern, text):
    products.append({
        "name": match.group("name"),
        "price": float(match.group("price")),
        "sku": match.group("sku"),
    })
for p in products:
    print(p)
# {'name': 'Widget Pro', 'price': 29.99, 'sku': 'WP-12345'}
# {'name': 'Gadget Plus', 'price': 49.99, 'sku': 'GP-67890'}
# {'name': 'Tool Max', 'price': 19.5, 'sku': 'TM-11111'}
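When every named group should end up in the result, `match.groupdict()` returns them all as a dictionary in one call. A minimal sketch, using the same made-up line format as above:

```python
import re

# Same made-up line format as the example above
line = "Product: Widget Pro - Price: $29.99 - SKU: WP-12345"

pattern = re.compile(
    r'Product:\s*(?P<name>.+?)\s*-\s*Price:\s*\$(?P<price>\d+\.\d{2})'
    r'\s*-\s*SKU:\s*(?P<sku>[A-Z]{2}-\d+)'
)

match = pattern.search(line)
# groupdict() maps each group name to its captured string
record = match.groupdict() if match else {}
if record:
    record["price"] = float(record["price"])

print(record)  # {'name': 'Widget Pro', 'price': 29.99, 'sku': 'WP-12345'}
```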
Cleaning Extracted Data with Regex
import re
def clean_price(text):
    """Extract a numeric price from formats like "$1,299.99" or "EUR 49.50"."""
    # Drop thousands separators first, then match digits with an optional decimal part
    match = re.search(r'\d+(?:\.\d+)?', text.replace(",", ""))
    return float(match.group()) if match else None
def clean_whitespace(text):
    """Normalize whitespace in extracted text."""
    return re.sub(r'\s+', ' ', text).strip()
def extract_numbers(text):
    """Pull all unsigned integers from a string."""
    return [int(n) for n in re.findall(r'\d+', text)]
# Examples
print(clean_price("Price: $1,299.99")) # 1299.99
print(clean_price("EUR 49.50")) # 49.5
print(clean_whitespace("  too   many   spaces  ")) # "too many spaces"
print(extract_numbers("Page 3 of 47 (940 results)")) # [3, 47, 940]
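`extract_numbers` only sees unsigned integers, so a value like `3.5` comes back as `3` and `5`. The sketch below is one possible variant for signed decimals. The name `extract_decimals` is hypothetical, and the pattern treats any hyphen directly before a digit as a minus sign.

```python
import re

def extract_decimals(text):
    """Pull signed integers and decimals from a string as floats."""
    # -? optional minus sign, \d+ integer part, (?:\.\d+)? optional fraction
    return [float(n) for n in re.findall(r'-?\d+(?:\.\d+)?', text)]

print(extract_decimals("Low -3.5, high 12, humidity 40%"))  # [-3.5, 12.0, 40.0]

# Caveat: a hyphen used as a separator is read as a minus sign
print(extract_decimals("SKU WP-12345"))  # [-12345.0]
```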
Regex Quick Reference
| Pattern | Matches |
|---|---|
| `\d+` | One or more digits |
| `\s+` | One or more whitespace characters |
| `[A-Za-z]+` | One or more letters |
| `.+?` | Any characters (non-greedy) |
| `(?:...)` | Non-capturing group |
| `(?P<name>...)` | Named capture group |
| `^` / `$` | Start / end of string |
| `re.DOTALL` | Make `.` match newlines |
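The last two rows are easier to see on a multi-line string. The short sketch below contrasts the default behavior of `.` and `^` with `re.DOTALL` and with `re.MULTILINE`, a standard flag the table does not list.

```python
import re

text = "first line\nsecond line\nthird line"

# Without re.DOTALL, . stops at the newline, so nothing matches
no_flag = re.search(r'first.*second', text)
# With re.DOTALL, . crosses line breaks
with_dotall = re.search(r'first.*second', text, re.DOTALL)

# ^ anchors to the start of the whole string by default...
default_anchor = re.findall(r'^\w+', text)
# ...and to the start of every line when re.MULTILINE is set
per_line_anchor = re.findall(r'^\w+', text, re.MULTILINE)

print(no_flag)              # None
print(with_dotall.group())  # prints "first line", a line break, then "second"
print(default_anchor)       # ['first']
print(per_line_anchor)      # ['first', 'second', 'third']
```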
Tips
- Never parse HTML structure with regex; use BeautifulSoup or lxml for that. Apply regex to the text content or raw strings instead.
- Prefer non-greedy quantifiers (`.+?` instead of `.+`) to avoid matching too much.
- Use `re.compile()` for patterns you use repeatedly. It skips the per-call pattern lookup and keeps the pattern in one named place.
- Test your regex patterns at regex101.com before putting them in code.
- When scraping pages that require proxy rotation, use ScraperAPI to fetch the HTML, then apply regex to the response text.
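Two of these tips can be shown in a few lines. The sketch below uses made-up strings to compare greedy and non-greedy matching and to reuse one compiled pattern across several rows. The name `PRICE_RE` is just an illustrative choice.

```python
import re

sentence = 'He said "yes" and then "no".'

# Greedy .+ runs to the last quote, swallowing everything in between
greedy = re.findall(r'"(.+)"', sentence)
# Non-greedy .+? stops at the first closing quote
lazy = re.findall(r'"(.+?)"', sentence)

print(greedy)  # ['yes" and then "no']
print(lazy)    # ['yes', 'no']

# Compile once, reuse for every row
PRICE_RE = re.compile(r'\$\d+(?:,\d{3})*(?:\.\d{2})?')
rows = ["Widget $29.99", "Gadget $1,049.00", "No price listed"]

found = []
for row in rows:
    m = PRICE_RE.search(row)
    found.append(m.group() if m else None)

print(found)  # ['$29.99', '$1,049.00', None]
```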
Next Steps
- Learn to handle different text encodings in scraped content
- Explore XML and RSS feed scraping for structured data sources