Following Sitemaps for Discovery
`sitemap.xml` is the structured index of a site's URLs that the site itself publishes. Use it to discover every page worth scraping without crawling blindly.
What you’ll learn
- Locate a site's sitemap via convention and `robots.txt`.
- Parse `sitemap.xml` and sitemap index files.
- Filter URLs by date, location, or pattern.
- Combine sitemap discovery with category-page scraping.
Most public websites publish a `sitemap.xml` file declaring every URL they want indexed by search engines. For a scraper, this is gold: a free, structured, often timestamped list of every content page on the site. Before you write a crawler that follows links, check whether the site already gives you the list.
What a sitemap looks like
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://practice.scrapingcentral.com/products/yellow-mug</loc>
    <lastmod>2025-03-12</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://practice.scrapingcentral.com/products/black-mug</loc>
    <lastmod>2025-03-10</lastmod>
  </url>
  ...
</urlset>
Four fields per URL, and only `<loc>` is mandatory. The rest are hints to search engines and useful to scrapers for filtering.
Finding the sitemap
Three places to look, in order:
- `/sitemap.xml` directly, the most common convention.
- `/robots.txt`, which almost always declares the sitemap location:
User-agent: *
Disallow: /admin
Sitemap: https://example.com/sitemap.xml
- Linked from the HTML, a `<link rel="sitemap" ...>` in the `<head>`. Rare, but it exists.
Always check `robots.txt` first; sites with multiple sitemaps (news, products, blog) sometimes list all of them there.
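If you want to automate that first check, here's a minimal sketch that pulls every `Sitemap:` line out of `robots.txt` (the `sitemaps_from_robots` helper is ours for illustration, not a library function):

import requests

def sitemaps_from_robots(base_url):
    """Return every sitemap URL declared in a site's robots.txt."""
    r = requests.get(f"{base_url}/robots.txt", timeout=15)
    r.raise_for_status()
    return [
        line.split(":", 1)[1].strip()      # keep everything after "Sitemap:"
        for line in r.text.splitlines()
        if line.lower().startswith("sitemap:")
    ]

print(sitemaps_from_robots("https://practice.scrapingcentral.com"))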
Parsing a sitemap
import requests, lxml.etree

r = requests.get("https://practice.scrapingcentral.com/sitemap.xml")
r.raise_for_status()

# Use lxml.etree for strict XML (sitemaps are real XML, unlike most HTML)
tree = lxml.etree.fromstring(r.content)
ns = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

urls = []
for url in tree.xpath("//s:url", namespaces=ns):
    loc = url.find("s:loc", ns).text
    lastmod = url.find("s:lastmod", ns)
    urls.append({"url": loc, "lastmod": lastmod.text if lastmod is not None else None})

print(f"Found {len(urls)} URLs")
The XML namespace is required: every element is qualified with it, and forgetting to register the namespace in your XPath returns nothing.
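You can see the failure mode directly with the `tree` and `ns` from the snippet above:

# The bare tag matches nothing: every element lives in the sitemap namespace
print(len(tree.xpath("//url")))                   # 0
# The registered prefix qualifies the query, so the same search now matches
print(len(tree.xpath("//s:url", namespaces=ns)))  # the real count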
Sitemap index files
Large sites split their sitemaps. `sitemap.xml` may be an index pointing to child sitemaps:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2025-03-12</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
Detect this by looking at the root element name:
root_tag = tree.tag
if root_tag.endswith("sitemapindex"):
    # Recurse into each child sitemap
    for sitemap in tree.xpath("//s:sitemap/s:loc", namespaces=ns):
        process_sitemap(sitemap.text)
elif root_tag.endswith("urlset"):
    # Process URLs directly
    ...
A robust scraper handles both. Production sites can have 100+ child sitemaps; an index is normal.
Compressed sitemaps
The common file extension is `.xml.gz`. `requests` auto-decompresses for you if the server sends `Content-Encoding: gzip`, but for `.xml.gz` URLs the server typically sends the raw gzipped bytes without that header. Decompress manually:
import gzip
r = requests.get("https://example.com/sitemap.xml.gz")
xml_bytes = gzip.decompress(r.content)
tree = lxml.etree.fromstring(xml_bytes)
Filtering by date
If `<lastmod>` is populated, you can scrape only what's changed since your last run:
from datetime import datetime, timedelta

cutoff = datetime.now() - timedelta(days=7)
# Date-only values like "2025-03-12" parse fine; full timestamps with a
# trailing "Z" need Python 3.11+ (and an offset-aware cutoff) for fromisoformat()
recent = [
    u for u in urls
    if u["lastmod"] and datetime.fromisoformat(u["lastmod"]) > cutoff
]
This is the killer feature for incremental scrapes: instead of refetching everything, hit only the pages that changed.
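A minimal sketch of that incremental loop, assuming a hypothetical `last_run.json` state file and date-only `lastmod` values:

import json, os
from datetime import datetime

STATE_FILE = "last_run.json"  # hypothetical state file for this sketch

# Load the previous run time; default to "everything is new" on the first run
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        last_run = datetime.fromisoformat(json.load(f)["last_run"])
else:
    last_run = datetime.min

changed = [
    u for u in urls
    if u["lastmod"] and datetime.fromisoformat(u["lastmod"]) > last_run
]
# ... scrape `changed` here ...

# Record this run's timestamp for next time
with open(STATE_FILE, "w") as f:
    json.dump({"last_run": datetime.now().isoformat()}, f)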
Filtering by URL pattern
products = [u for u in urls if "/products/" in u["url"]]
blog = [u for u in urls if "/blog/" in u["url"]]
news = [u for u in urls if u["url"].startswith("https://example.com/news/")]
The sitemap is one big flat list, but URL prefixes usually correspond to site sections.
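A quick histogram of the inventory makes those sections visible; a sketch using the `urls` list built earlier:

from collections import Counter
from urllib.parse import urlparse

# Count URLs per top-level path segment: /products, /blog, /news, ...
sections = Counter(
    "/" + urlparse(u["url"]).path.lstrip("/").split("/", 1)[0]
    for u in urls
)
for prefix, count in sections.most_common():
    print(f"{prefix}: {count}")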
News sitemaps
A different XML namespace, specifically for news content:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/news/article-1</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2025-03-12T14:23:00Z</news:publication_date>
      <news:title>Breaking news headline</news:title>
    </news:news>
  </url>
</urlset>
For news scraping, this gives you title and publication date upfront, sometimes enough that you don't even need to fetch the article HTML for headline-level monitoring.
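Parsing it works exactly like the plain sitemap, with a second namespace registered; a sketch against a parsed news-sitemap `tree`:

ns = {
    "s": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "n": "http://www.google.com/schemas/sitemap-news/0.9",
}
for url in tree.xpath("//s:url", namespaces=ns):
    loc = url.find("s:loc", ns).text
    # Your prefix ("n") need not match the document's ("news"); only the
    # namespace URI has to agree.
    title = url.find("n:news/n:title", ns)
    pub_date = url.find("n:news/n:publication_date", ns)
    print(loc,
          pub_date.text if pub_date is not None else "?",
          title.text if title is not None else "?")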
Image sitemaps
Same pattern: `<image:image>` children carry image URLs and captions. Rarely used by scrapers, but if you're building an image archive, look for them.
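A sketch of extracting them, assuming the standard image-sitemap namespace `http://www.google.com/schemas/sitemap-image/1.1`:

ns = {
    "s": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "img": "http://www.google.com/schemas/sitemap-image/1.1",
}
for url in tree.xpath("//s:url", namespaces=ns):
    page = url.find("s:loc", ns).text
    for image_loc in url.findall("img:image/img:loc", ns):
        print(page, "->", image_loc.text)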
Using sitemaps + category pages together
Sitemap discovery is great for "every page that exists." But sometimes you want "every page in this category." For that, scrape the category index pages directly:
import requests
from bs4 import BeautifulSoup

BASE = "https://practice.scrapingcentral.com"

sitemap_urls = parse_sitemap(...)  # all URLs, via your sitemap parser
product_urls = [u for u in sitemap_urls if "/products/" in u]

# Or, more targeted: only the URLs listed on a specific category page
category_html = requests.get(f"{BASE}/products?category=kitchen").text
soup = BeautifulSoup(category_html, "lxml")
kitchen_urls = [a["href"] for a in soup.select("article.product-card a")]
The sitemap gives breadth; category pages give curation.
Sitemaps you can't trust
Some sitemaps are out of date, list draft URLs, or include staging entries. Verify against the live site by spot-checking. Also, `<priority>` and `<changefreq>` are advisory: many sites set them once and forget them. Don't rely on them for scheduling.
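A simple spot-check is to sample a handful of entries and confirm the live site still serves them (some servers reject HEAD requests; fall back to GET if so):

import random
import requests

# Sample up to 20 sitemap entries and flag anything that isn't a live 200
for u in random.sample(urls, min(20, len(urls))):
    status = requests.head(u["url"], allow_redirects=True, timeout=10).status_code
    if status != 200:
        print(f"{status} -> {u['url']}")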
A reusable sitemap walker
import gzip
import lxml.etree
import requests

NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_sitemap_xml(url):
    r = requests.get(url, timeout=15)
    r.raise_for_status()
    content = r.content
    # Check the gzip magic bytes rather than the extension: if the server sent
    # Content-Encoding: gzip, requests already decompressed the body, and
    # decompressing twice would fail.
    if content[:2] == b"\x1f\x8b":
        content = gzip.decompress(content)
    return lxml.etree.fromstring(content)

def walk_sitemap(url):
    """Yield {'url': ..., 'lastmod': ...} for every URL, recursing through indexes."""
    tree = fetch_sitemap_xml(url)
    if tree.tag.endswith("sitemapindex"):
        for loc in tree.xpath("//s:sitemap/s:loc", namespaces=NS):
            yield from walk_sitemap(loc.text)
    else:
        for url_el in tree.xpath("//s:url", namespaces=NS):
            loc = url_el.find("s:loc", NS).text
            lastmod_el = url_el.find("s:lastmod", NS)
            yield {"url": loc, "lastmod": lastmod_el.text if lastmod_el is not None else None}
for entry in walk_sitemap("https://practice.scrapingcentral.com/sitemap.xml"):
    print(entry["url"], entry.get("lastmod"))
This generator handles indexes, compression, and namespaces. Drop it into any project that needs to discover a site's URL inventory.
Hands-on lab
Fetch `/sitemap.xml`. If it's an index, recurse into the child sitemaps. Print a count of URLs per top-level path prefix (`/products`, `/blog`, etc.). Then filter to only URLs with `lastmod` in the last 30 days; those are the freshest pages worth re-scraping.
Practice this lesson on Catalog108, our first-party scraping sandbox; the lab target is `/sitemap.xml`.