Scraping XML and RSS Feeds - Python Scraping

Parse XML documents and RSS/Atom feeds with Python. Extract structured data from feeds using feedparser, lxml, and the xml.etree module.

RSS and Atom feeds are among the easiest data sources to scrape. They provide structured, machine-readable content that is meant to be consumed programmatically. Many news sites, blogs, and content platforms offer RSS feeds.

Parsing RSS with feedparser

The feedparser library handles most of the quirks of RSS, Atom, and RDF feeds, including malformed markup and inconsistent date formats.

pip install feedparser

import feedparser

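feed = feedparser.parse("https://news.ycombinator.com/rss")

# feedparser sets bozo when the feed could not be fetched or was malformed
if feed.bozo:
    print(f"Warning: {feed.bozo_exception}")

print(f"Feed: {feed.feed.get('title', 'N/A')}")
print(f"Entries: {len(feed.entries)}")
print()

for entry in feed.entries[:5]:
    # .get() avoids an AttributeError when an entry lacks a field
    print(f"Title:   {entry.get('title', 'N/A')}")
    print(f"Link:    {entry.get('link', 'N/A')}")
    print(f"Published: {entry.get('published', 'N/A')}")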
    print()

Parsing XML with xml.etree

Python's built-in xml.etree.ElementTree parses any well-formed XML document without extra parsing dependencies.

import xml.etree.ElementTree as ET
import requests

response = requests.get("https://news.ycombinator.com/rss", timeout=10)
response.raise_for_status()  # stop early on HTTP errors
root = ET.fromstring(response.content)

# RSS structure: <rss><channel><item>...</item></channel></rss>
channel = root.find("channel")
print(f"Feed title: {channel.findtext('title', default='N/A')}")

for item in channel.findall("item")[:5]:
    # findtext returns the default instead of failing on a missing element
    title = item.findtext("title", default="N/A")
    link = item.findtext("link", default="N/A")
    print(f"{title}\n  {link}\n")
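
RSS items often omit optional elements, and their dates arrive as text. The sketch below shows one way to parse items defensively and convert pubDate into a datetime. The sample feed and the parse_items helper are made up for illustration so the example runs without a network call.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed, used so the example runs offline.
SAMPLE_RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 06 Jan 2025 10:30:00 GMT</pubDate>
    </item>
    <item>
      <link>https://example.com/untitled</link>
    </item>
  </channel>
</rss>
"""


def parse_items(xml_text):
    """Return a list of dicts, tolerating missing child elements."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    items = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title", default="(untitled)"),
            "link": item.findtext("link", default=""),
            # RSS 2.0 dates use the RFC 822 style, which email.utils can parse.
            "published": parsedate_to_datetime(raw_date) if raw_date else None,
        })
    return items


items = parse_items(SAMPLE_RSS)
for entry in items:
    print(entry["title"], entry["link"], entry["published"])
```

Keeping the date as a datetime rather than a string makes later sorting and filtering by time straightforward.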

Parsing XML with lxml

lxml is generally faster than xml.etree and supports full XPath 1.0 (xml.etree implements only a subset), which makes it a good fit for large XML documents.

pip install lxml

from lxml import etree
import requests

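response = requests.get("https://news.ycombinator.com/rss", timeout=10)
response.raise_for_status()
tree = etree.fromstring(response.content)

# Select each item first so that titles and links stay paired
# even when an item is missing one of them
for item in tree.xpath("//item")[:5]:
    # string() yields "" when the child element is absent
    title = item.xpath("string(title)") or "N/A"
    link = item.xpath("string(link)") or "N/A"
    print(f"{title}\n  {link}\n")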
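
For documents too large to hold in memory, streaming the parse is a common approach. The sketch below uses the standard library's ET.iterparse rather than lxml so that it runs without extra packages (lxml offers its own iterparse along the same lines). The in-memory sample and the stream_titles helper are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory document standing in for a large file on disk.
items_xml = "".join(
    f"<item><title>Post {i}</title><link>https://example.com/{i}</link></item>"
    for i in range(1000)
)
SAMPLE = f"<rss><channel>{items_xml}</channel></rss>".encode("utf-8")


def stream_titles(source):
    """Yield item titles one at a time instead of loading the full tree first."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("title", default="")
            # Release the item's children once they have been read.
            elem.clear()


titles = list(stream_titles(io.BytesIO(SAMPLE)))
print(len(titles), titles[:3])
```

Passing a file path or an open file object in place of the BytesIO wrapper works the same way.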

Handling Atom Feeds

Atom feeds place every element in the http://www.w3.org/2005/Atom namespace, so element lookups must be namespace-qualified.

import xml.etree.ElementTree as ET
import requests

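response = requests.get("https://example.com/atom.xml", timeout=10)
response.raise_for_status()
root = ET.fromstring(response.content)

# Atom uses a namespace
ns = {"atom": "http://www.w3.org/2005/Atom"}

for entry in root.findall("atom:entry", ns):
    title = entry.findtext("atom:title", default="N/A", namespaces=ns)
    link = entry.find('atom:link[@rel="alternate"]', ns)
    if link is None:
        # A link with no rel attribute also counts as "alternate"
        candidates = entry.findall("atom:link", ns)
        link = next((c for c in candidates if c.get("rel") is None), None)
    href = link.get("href") if link is not None else "N/A"
    updated = entry.findtext("atom:updated", default="N/A", namespaces=ns)
    print(f"{title}\n  {href}\n  Updated: {updated}\n")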
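
When the format of a feed is not known in advance, the root element tells RSS and Atom apart. The following is a minimal sketch of that check. The detect_feed_type and entry_links helpers and both sample documents are hypothetical names introduced here for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def detect_feed_type(root):
    """Return 'rss', 'atom', or 'unknown' based on the root element."""
    if root.tag == "rss":
        return "rss"
    # ElementTree writes namespaced tags as "{namespace-uri}localname".
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "atom"
    return "unknown"


def entry_links(root):
    """Collect one link per entry from an RSS 2.0 or Atom document."""
    kind = detect_feed_type(root)
    if kind == "rss":
        return [item.findtext("link", default="") for item in root.iter("item")]
    if kind == "atom":
        ns = {"atom": ATOM_NS}
        links = []
        for entry in root.findall("atom:entry", ns):
            for link in entry.findall("atom:link", ns):
                # A link with no rel attribute counts as rel="alternate".
                if link.get("rel", "alternate") == "alternate":
                    links.append(link.get("href", ""))
                    break
        return links
    return []


# Hypothetical sample documents.
rss_root = ET.fromstring(
    "<rss version='2.0'><channel><item>"
    "<link>https://example.com/a</link>"
    "</item></channel></rss>"
)
atom_root = ET.fromstring(
    f"<feed xmlns='{ATOM_NS}'><entry><title>B</title>"
    "<link rel='self' href='https://example.com/b.atom'/>"
    "<link href='https://example.com/b'/>"
    "</entry></feed>"
)

print(detect_feed_type(rss_root), entry_links(rss_root))
print(detect_feed_type(atom_root), entry_links(atom_root))
```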

Scraping Multiple RSS Feeds

Feed downloads are I/O-bound, so a thread pool can fetch several feeds concurrently.

import feedparser
from concurrent.futures import ThreadPoolExecutor


def parse_feed(url):
    """Parse a single RSS feed and return articles."""
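    try:
        feed = feedparser.parse(url)
        # feedparser usually reports fetch and parse problems through
        # feed.bozo instead of raising, so check it explicitly
        if feed.bozo and not feed.entries:
            print(f"Error parsing {url}: {feed.bozo_exception}")
            return []
        articles = []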
        for entry in feed.entries:
            articles.append({
                "source": feed.feed.get("title", url),
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),
                "published": entry.get("published", ""),
                "summary": entry.get("summary", "")[:200],
            })
        return articles
    except Exception as e:
        print(f"Error parsing {url}: {e}")
        return []


feeds = [
    "https://news.ycombinator.com/rss",
    "https://www.reddit.com/r/python/.rss",
    "https://realpython.com/atom.xml",
]

all_articles = []
with ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(parse_feed, feeds)
    for articles in results:
        all_articles.extend(articles)

print(f"Collected {len(all_articles)} articles from {len(feeds)} feeds")
for article in all_articles[:3]:
    print(f"  [{article['source']}] {article['title']}")
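
Different feeds frequently carry the same article, so a deduplication pass is a natural follow-up. This is a minimal sketch that assumes the link identifies an article. The dedupe_articles helper and the sample records are hypothetical, shaped like the dictionaries parse_feed returns.

```python
def dedupe_articles(articles):
    """Keep the first article seen for each link, preserving order."""
    seen = set()
    unique = []
    for article in articles:
        # Assumption: the link identifies an article; fall back to the title.
        key = article.get("link") or article.get("title", "")
        if key in seen:
            continue
        seen.add(key)
        unique.append(article)
    return unique


# Hypothetical records in the same shape that parse_feed returns.
sample_articles = [
    {"source": "Feed A", "title": "Hello", "link": "https://example.com/hello"},
    {"source": "Feed B", "title": "Hello again", "link": "https://example.com/hello"},
    {"source": "Feed B", "title": "Other", "link": "https://example.com/other"},
]

unique_articles = dedupe_articles(sample_articles)
print(f"{len(sample_articles)} collected, {len(unique_articles)} unique")
```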

Parsing Sitemaps (XML)

Website sitemaps are XML files that list the URLs a site wants crawlers to find. They are useful for discovering pages to scrape.

import requests
from lxml import etree

response = requests.get("https://quotes.toscrape.com/sitemap.xml", timeout=10)

if response.status_code == 200:
    root = etree.fromstring(response.content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    urls = root.xpath("//sm:url/sm:loc/text()", namespaces=ns)
    print(f"Found {len(urls)} URLs in sitemap:")
    for url in urls[:10]:
        print(f"  {url}")
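
Larger sites often publish a sitemap index, a file whose root is sitemapindex and which points to further sitemap files instead of pages. The sketch below tells the two apart using only the standard library. The read_sitemap helper and the sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap(xml_bytes):
    """Return (kind, locations); kind is 'urlset', 'sitemapindex', or 'unknown'."""
    root = ET.fromstring(xml_bytes)
    # Strip the "{namespace}" prefix to get the bare root element name.
    kind = root.tag.rsplit("}", 1)[-1]
    if kind == "urlset":
        path = "sm:url/sm:loc"
    elif kind == "sitemapindex":
        path = "sm:sitemap/sm:loc"
    else:
        return "unknown", []
    locations = [
        loc.text.strip() for loc in root.findall(path, SITEMAP_NS) if loc.text
    ]
    return kind, locations


# Hypothetical sample documents.
URLSET = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>"""

urlset_result = read_sitemap(URLSET)
index_result = read_sitemap(INDEX)
print(urlset_result)
print(index_result)
```

For an index, each returned location is another sitemap to download and pass through the same function.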

Tips

RSS feeds are polite to scrape because they are designed to be fetched by machines.
Use feedparser for RSS/Atom feeds. It handles encoding, date parsing, and format differences automatically.
Check robots.txt and sitemaps before scraping a site. They often point you to RSS feeds and crawlable URLs.
If an RSS feed is blocked by a firewall or geo-restricted, a proxy service such as ScraperAPI can fetch it from a different location.
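
The robots.txt check mentioned in the tips can be done with the standard library's urllib.robotparser. The sketch below parses an inline, made-up robots.txt so it runs offline. The "my-feed-bot" user agent and the example.com URLs are placeholders, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL.
can_fetch_feed = parser.can_fetch("my-feed-bot", "https://example.com/feed.xml")
can_fetch_private = parser.can_fetch(
    "my-feed-bot", "https://example.com/private/feed.xml"
)

# Sitemap lines in robots.txt are exposed as a list (or None if absent).
sitemaps = parser.site_maps()

print(can_fetch_feed, can_fetch_private, sitemaps)
```

Against a live site, calling set_url() with the robots.txt address followed by read() replaces the parse() call.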

Next Steps

Build a full news aggregator that collects articles from multiple RSS feeds
Store feed data in a database for historical tracking
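
As a starting point for the database step, here is a minimal sketch using the standard library's sqlite3 module. The table layout, the save_articles helper, and the sample rows are all assumptions. The link is used as the primary key so that repeated runs do not store the same article twice.

```python
import sqlite3


def save_articles(conn, articles):
    """Insert articles, skipping links already stored. Returns rows added."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               link TEXT PRIMARY KEY,
               source TEXT,
               title TEXT,
               published TEXT
           )"""
    )
    before = conn.total_changes
    # INSERT OR IGNORE leaves existing rows untouched on a duplicate link.
    conn.executemany(
        "INSERT OR IGNORE INTO articles (link, source, title, published) "
        "VALUES (:link, :source, :title, :published)",
        articles,
    )
    conn.commit()
    return conn.total_changes - before


# Hypothetical rows; an in-memory database keeps the example self-contained.
sample_rows = [
    {"link": "https://example.com/a", "source": "Feed A",
     "title": "Post A", "published": "2025-01-06T10:30:00+00:00"},
    {"link": "https://example.com/b", "source": "Feed B",
     "title": "Post B", "published": "2025-01-07T08:00:00+00:00"},
]

conn = sqlite3.connect(":memory:")
first_run = save_articles(conn, sample_rows)
second_run = save_articles(conn, sample_rows)
stored = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(first_run, second_run, stored)
```

Swapping ":memory:" for a file path keeps the data between runs, which is what historical tracking needs.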