Scraping XML and RSS Feeds
Parse XML documents and RSS/Atom feeds with Python. Extract structured data from feeds using feedparser, lxml, and the xml.etree module.
RSS and Atom feeds are among the easiest data sources to scrape. They provide structured, machine-readable content that is meant to be consumed programmatically. Many news sites, blogs, and content platforms offer RSS feeds.
Parsing RSS with feedparser
The feedparser library normalizes RSS, Atom, and RDF feeds into a single structure and tolerates many malformed feeds.
pip install feedparser
import feedparser
feed = feedparser.parse("https://news.ycombinator.com/rss")
print(f"Feed: {feed.feed.title}")
print(f"Entries: {len(feed.entries)}")
print()
for entry in feed.entries[:5]:
    print(f"Title: {entry.title}")
    print(f"Link: {entry.link}")
    print(f"Published: {entry.get('published', 'N/A')}")
    print()
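The published value printed above is a raw string. feedparser also exposes a parsed form, published_parsed, as a time.struct_time in UTC. The sketch below shows one way to turn that into a timezone-aware datetime. The helper name struct_time_to_datetime is hypothetical, and the sample value is built with time.strptime so the snippet runs without fetching a feed.

```python
import calendar
import time
from datetime import datetime, timezone


def struct_time_to_datetime(parsed):
    """Convert a UTC time.struct_time (or None) to an aware datetime."""
    if parsed is None:
        return None
    # calendar.timegm interprets the struct_time as UTC
    return datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)


# Stand-in for entry.get("published_parsed") from a real feed
sample = time.strptime("2024-01-15 08:30:00", "%Y-%m-%d %H:%M:%S")
print(struct_time_to_datetime(sample).isoformat())
```

With a real feed, the call would look like struct_time_to_datetime(entry.get("published_parsed")), which returns None when the feed omits a date.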
Parsing XML with xml.etree
Python's built-in xml.etree.ElementTree parses any well-formed XML document, so it works for feeds without installing a dedicated XML library.
import xml.etree.ElementTree as ET
import requests
response = requests.get("https://news.ycombinator.com/rss")
root = ET.fromstring(response.content)
# RSS structure: <rss><channel><item>...</item></channel></rss>
channel = root.find("channel")
print(f"Feed title: {channel.find('title').text}")
for item in channel.findall("item")[:5]:
    title = item.find("title").text
    link = item.find("link").text
    print(f"{title}\n {link}\n")
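Real feeds sometimes omit a title or date, in which case find() returns None and the .text access fails. The sketch below is one defensive variant, using findtext() with defaults and email.utils.parsedate_to_datetime for the RFC 822 dates that RSS uses. The sample feed and the helper name parse_rss_items are made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Inline sample so the example runs without network access
SAMPLE_RSS = b"""<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/1</link>
      <pubDate>Mon, 15 Jan 2024 08:30:00 +0000</pubDate>
    </item>
    <item>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_bytes):
    """Return a list of dicts, tolerating missing child elements."""
    root = ET.fromstring(xml_bytes)
    items = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        try:
            published = parsedate_to_datetime(raw_date) if raw_date else None
        except (TypeError, ValueError):
            # Unparseable date string: keep the item, drop the date
            published = None
        items.append({
            "title": item.findtext("title", default="(untitled)"),
            "link": item.findtext("link", default=""),
            "published": published,
        })
    return items


for row in parse_rss_items(SAMPLE_RSS):
    print(row["title"], row["link"], row["published"])
```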
Parsing XML with lxml
lxml is typically faster than xml.etree.ElementTree and supports full XPath 1.0 expressions, which makes it a good fit for large XML documents.
from lxml import etree
import requests
response = requests.get("https://news.ycombinator.com/rss")
tree = etree.fromstring(response.content)
# Use XPath to select each <item>, then read its children so that
# titles and links stay paired even if an item lacks one of them
for item in tree.xpath("//item")[:5]:
    title = item.findtext("title", default="")
    link = item.findtext("link", default="")
    print(f"{title}\n {link}\n")
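For very large documents, loading the whole tree into memory can be avoided by streaming. The sketch below uses the standard library's ET.iterparse so it runs without extra packages. lxml.etree.iterparse takes the same source and events arguments, so swapping the import should work, though that is an assumption to verify against your lxml version. The sample data and the helper name iter_item_titles are illustrative.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<rss version="2.0"><channel>
  <item><title>A</title><link>https://example.com/a</link></item>
  <item><title>B</title><link>https://example.com/b</link></item>
</channel></rss>"""


def iter_item_titles(source):
    """Yield item titles one at a time from a file-like XML source."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("title", default="")
            # Drop the element's children once it has been processed
            elem.clear()


for title in iter_item_titles(io.BytesIO(SAMPLE)):
    print(title)
```

In practice the source would be an open file or a streamed HTTP response body rather than an in-memory buffer.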
Handling Atom Feeds
Atom feeds put every element in the http://www.w3.org/2005/Atom namespace, so element lookups must use namespace-qualified names.
import xml.etree.ElementTree as ET
import requests
response = requests.get("https://example.com/atom.xml")
root = ET.fromstring(response.content)
# Atom uses a namespace
ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in root.findall("atom:entry", ns):
    title = entry.find("atom:title", ns).text
    # In Atom, a link with no rel attribute is treated as rel="alternate"
    links = entry.findall("atom:link", ns)
    link = next((l for l in links if l.get("rel", "alternate") == "alternate"), None)
    href = link.get("href") if link is not None else "N/A"
    updated = entry.find("atom:updated", ns).text
    print(f"{title}\n {href}\n Updated: {updated}\n")
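When you do not know in advance whether a URL serves RSS or Atom, the root element's tag tells you. The following is a minimal sketch of that check with ElementTree. The function name extract_entries and both sample documents are hypothetical, and the code only covers RSS 2.0 and Atom, not RDF-based RSS 1.0.

```python
import xml.etree.ElementTree as ET

ATOM_URI = "http://www.w3.org/2005/Atom"
ATOM_NS = {"atom": ATOM_URI}

SAMPLE_ATOM = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Atom</title>
  <entry>
    <title>Atom post</title>
    <link href="https://example.com/atom-post"/>
  </entry>
</feed>"""

SAMPLE_RSS = b"""<rss version="2.0"><channel>
  <item><title>RSS post</title><link>https://example.com/rss-post</link></item>
</channel></rss>"""


def extract_entries(xml_bytes):
    """Return (format, [(title, link), ...]) for an RSS 2.0 or Atom document."""
    root = ET.fromstring(xml_bytes)
    if root.tag == f"{{{ATOM_URI}}}feed":
        entries = []
        for entry in root.findall("atom:entry", ATOM_NS):
            links = entry.findall("atom:link", ATOM_NS)
            # A missing rel attribute means rel="alternate"
            link = next(
                (l for l in links if l.get("rel", "alternate") == "alternate"),
                None,
            )
            title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
            entries.append((title, link.get("href", "") if link is not None else ""))
        return "atom", entries
    if root.tag == "rss":
        entries = [
            (item.findtext("title", default=""), item.findtext("link", default=""))
            for item in root.iter("item")
        ]
        return "rss", entries
    raise ValueError(f"Unrecognized feed root element: {root.tag}")


print(extract_entries(SAMPLE_ATOM))
print(extract_entries(SAMPLE_RSS))
```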
Scraping Multiple RSS Feeds
import feedparser
from concurrent.futures import ThreadPoolExecutor
def parse_feed(url):
    """Parse a single RSS feed and return articles."""
    try:
        feed = feedparser.parse(url)
        articles = []
        for entry in feed.entries:
            articles.append({
                "source": feed.feed.get("title", url),
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),
                "published": entry.get("published", ""),
                "summary": entry.get("summary", "")[:200],
            })
        return articles
    except Exception as e:
        print(f"Error parsing {url}: {e}")
        return []
feeds = [
"https://news.ycombinator.com/rss",
"https://www.reddit.com/r/python/.rss",
"https://realpython.com/atom.xml",
]
all_articles = []
with ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(parse_feed, feeds)
    for articles in results:
        all_articles.extend(articles)
print(f"Collected {len(all_articles)} articles from {len(feeds)} feeds")
for article in all_articles[:3]:
print(f" [{article['source']}] {article['title']}")
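When several feeds cover the same stories, the combined list may contain the same link more than once. One simple way to handle that, sketched below, is to deduplicate on the link field. The helper name dedupe_articles is hypothetical, and treating a trailing slash as insignificant is an assumption that may not hold for every site.

```python
def dedupe_articles(articles):
    """Keep the first article seen for each link; keep link-less articles as-is."""
    seen = set()
    unique = []
    for article in articles:
        # Assumption: URLs differing only by a trailing slash are the same page
        key = article.get("link", "").strip().rstrip("/")
        if key and key in seen:
            continue
        if key:
            seen.add(key)
        unique.append(article)
    return unique


# Illustrative data shaped like the dicts returned by parse_feed
sample = [
    {"title": "A", "link": "https://example.com/a"},
    {"title": "A (again)", "link": "https://example.com/a/"},
    {"title": "B", "link": "https://example.com/b"},
    {"title": "No link", "link": ""},
]

for article in dedupe_articles(sample):
    print(article["title"])
```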
Parsing Sitemaps (XML)
Website sitemaps are XML files that list the URLs a site wants crawlers to find. They are useful for discovering pages to scrape.
import requests
from lxml import etree
response = requests.get("https://quotes.toscrape.com/sitemap.xml")
if response.status_code == 200:
    root = etree.fromstring(response.content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls = root.xpath("//sm:url/sm:loc/text()", namespaces=ns)
    print(f"Found {len(urls)} URLs in sitemap:")
    for url in urls[:10]:
        print(f" {url}")
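Larger sites often publish a sitemap index, a file whose root is sitemapindex and whose entries point to further sitemap files rather than to pages. The sketch below distinguishes the two cases using the standard library. The function name read_sitemap and the sample documents are illustrative only.

```python
import xml.etree.ElementTree as ET

SITEMAP_URI = "http://www.sitemaps.org/schemas/sitemap/0.9"
SITEMAP_NS = {"sm": SITEMAP_URI}

SAMPLE_INDEX = b"""<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

SAMPLE_URLSET = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
</urlset>"""


def read_sitemap(xml_bytes):
    """Return ("index", child sitemap URLs) or ("urlset", page URLs)."""
    root = ET.fromstring(xml_bytes)
    if root.tag == f"{{{SITEMAP_URI}}}sitemapindex":
        kind, path = "index", "sm:sitemap/sm:loc"
    else:
        kind, path = "urlset", "sm:url/sm:loc"
    locs = [loc.text.strip() for loc in root.findall(path, SITEMAP_NS) if loc.text]
    return kind, locs


print(read_sitemap(SAMPLE_INDEX))
print(read_sitemap(SAMPLE_URLSET))
```

For an index, each returned URL would be fetched and passed through read_sitemap again to reach the page URLs.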
Tips
- RSS feeds are polite to scrape because they are designed to be fetched by machines.
- Use feedparser for RSS/Atom feeds. It handles encoding, date parsing, and format differences automatically.
- Check for robots.txt and sitemaps before scraping a site. They often point you to RSS feeds and crawlable URLs.
- If an RSS feed is behind a firewall or geo-restricted, use ScraperAPI to fetch it from different locations.
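The robots.txt check mentioned above can be done with the standard library's urllib.robotparser. The sketch below parses an inline robots.txt so it runs offline. The file contents, the my-bot user agent, and the example.com URLs are all made up for illustration. Note that site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; a real one would be fetched from the site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""


def load_robots(robots_text):
    """Build a RobotFileParser from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


robots = load_robots(ROBOTS_TXT)
print(robots.can_fetch("my-bot", "https://example.com/feed.xml"))
print(robots.can_fetch("my-bot", "https://example.com/private/page"))
print(robots.site_maps())
```

To read a live file instead, RobotFileParser also offers set_url() followed by read().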
Next Steps
- Build a full news aggregator that collects articles from multiple RSS feeds
- Store feed data in a database for historical tracking