Following Sitemaps for Discovery
`sitemap.xml` is the structured index of a site's URLs that the site itself publishes. Use it to discover every page worth scraping without crawling blindly.
What you’ll learn
- Locate a site's sitemap via convention and `robots.txt`.
- Parse `sitemap.xml` and sitemap index files.
- Filter URLs by date, location, or pattern.
- Combine sitemap discovery with category-page scraping.
Most public websites publish a `sitemap.xml` file declaring every URL they want indexed by search engines. For a scraper, this is gold: a free, structured, often timestamped list of every content page on the site. Before you write a crawler that follows links, check whether the site already gives you the list.
What a sitemap looks like
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://practice.scrapingcentral.com/products/yellow-mug</loc>
    <lastmod>2025-03-12</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://practice.scrapingcentral.com/products/black-mug</loc>
    <lastmod>2025-03-10</lastmod>
  </url>
  ...
</urlset>
Four fields per URL, and only `<loc>` is mandatory. The rest are hints to search engines and useful to scrapers for filtering.
Finding the sitemap
Three places to look, in order:
- `/sitemap.xml` directly, the most common convention.
- `/robots.txt`, which almost always declares the sitemap location:
User-agent: *
Disallow: /admin
Sitemap: https://example.com/sitemap.xml
- Linked from the HTML, a `<link rel="sitemap" ...>` in the `<head>`. Rare, but it exists.
Always check `robots.txt` first; sites with multiple sitemaps (news, products, blog) sometimes list all of them there.
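If you want to automate that first check, here's a minimal sketch that pulls every `Sitemap:` line out of `robots.txt` (the `sitemaps_from_robots` helper is ours for illustration, not a library function):

import requests

def sitemaps_from_robots(base_url):
    """Return every sitemap URL declared in a site's robots.txt."""
    r = requests.get(f"{base_url}/robots.txt", timeout=15)
    r.raise_for_status()
    return [
        line.split(":", 1)[1].strip()      # keep everything after "Sitemap:"
        for line in r.text.splitlines()
        if line.lower().startswith("sitemap:")
    ]

print(sitemaps_from_robots("https://practice.scrapingcentral.com"))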
Parsing a sitemap
import requests, lxml.etree

r = requests.get("https://practice.scrapingcentral.com/sitemap.xml")
r.raise_for_status()

# Use lxml.etree for strict XML (sitemaps are real XML, unlike most HTML)
tree = lxml.etree.fromstring(r.content)
ns = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

urls = []
for url in tree.xpath("//s:url", namespaces=ns):
    loc = url.find("s:loc", ns).text
    lastmod = url.find("s:lastmod", ns)
    urls.append({"url": loc, "lastmod": lastmod.text if lastmod is not None else None})

print(f"Found {len(urls)} URLs")
The XML namespace is required: every element is qualified with it, and forgetting to register the namespace in your XPath returns nothing.
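You can see the failure mode directly with the `tree` and `ns` from the snippet above:

# The bare tag matches nothing: every element lives in the sitemap namespace
print(len(tree.xpath("//url")))                   # 0
# The registered prefix qualifies the query, so the same search now matches
print(len(tree.xpath("//s:url", namespaces=ns)))  # the real count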
Sitemap index files
Large sites split their sitemaps. `sitemap.xml` may be an index pointing to child sitemaps:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2025-03-12</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
Detect this by looking at the root element name:
root_tag = tree.tag
if root_tag.endswith("sitemapindex"):
    # Recurse into each child sitemap
    for sitemap in tree.xpath("//s:sitemap/s:loc", namespaces=ns):
        process_sitemap(sitemap.text)
elif root_tag.endswith("urlset"):
    # Process URLs directly
    ...
A robust scraper handles both. Production sites can have 100+ child sitemaps; an index is normal.
Compressed sitemaps
The common file extension is `.xml.gz`. `requests` auto-decompresses for you if the server sends `Content-Encoding: gzip`, but for `.xml.gz` URLs the server typically sends the raw gzipped bytes without that header. Decompress manually:
import gzip
r = requests.get("https://example.com/sitemap.xml.gz")
xml_bytes = gzip.decompress(r.content)
tree = lxml.etree.fromstring(xml_bytes)
Filtering by date
If `<lastmod>` is populated, you can scrape only what's changed since your last run:
from datetime import datetime, timedelta

cutoff = datetime.now() - timedelta(days=7)
# Date-only values like "2025-03-12" parse fine; full timestamps with a
# trailing "Z" need Python 3.11+ (and an offset-aware cutoff) for fromisoformat()
recent = [
    u for u in urls
    if u["lastmod"] and datetime.fromisoformat(u["lastmod"]) > cutoff
]
This is the killer feature for incremental scrapes: instead of refetching everything, hit only the pages that changed.
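A minimal sketch of that incremental loop, assuming a hypothetical `last_run.json` state file and date-only `lastmod` values:

import json, os
from datetime import datetime

STATE_FILE = "last_run.json"  # hypothetical state file for this sketch

# Load the previous run time; default to "everything is new" on the first run
if os.path.exists(STATE_FILE):
    with open(STATE_FILE) as f:
        last_run = datetime.fromisoformat(json.load(f)["last_run"])
else:
    last_run = datetime.min

changed = [
    u for u in urls
    if u["lastmod"] and datetime.fromisoformat(u["lastmod"]) > last_run
]
# ... scrape `changed` here ...

# Record this run's timestamp for next time
with open(STATE_FILE, "w") as f:
    json.dump({"last_run": datetime.now().isoformat()}, f)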
Filtering by URL pattern
products = [u for u in urls if "/products/" in u["url"]]
blog = [u for u in urls if "/blog/" in u["url"]]
news = [u for u in urls if u["url"].startswith("https://example.com/news/")]
The sitemap is one big flat list, but URL prefixes usually correspond to site sections.
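A quick histogram of the inventory makes those sections visible; a sketch using the `urls` list built earlier:

from collections import Counter
from urllib.parse import urlparse

# Count URLs per top-level path segment: /products, /blog, /news, ...
sections = Counter(
    "/" + urlparse(u["url"]).path.lstrip("/").split("/", 1)[0]
    for u in urls
)
for prefix, count in sections.most_common():
    print(f"{prefix}: {count}")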
News sitemaps
A different XML namespace, specifically for news content:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/news/article-1</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2025-03-12T14:23:00Z</news:publication_date>
      <news:title>Breaking news headline</news:title>
    </news:news>
  </url>
</urlset>
For news scraping, this gives you title and publication date upfront, sometimes enough that you don't even need to fetch the article HTML for headline-level monitoring.
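Parsing it works exactly like the plain sitemap, with a second namespace registered; a sketch against a parsed news-sitemap `tree`:

ns = {
    "s": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "n": "http://www.google.com/schemas/sitemap-news/0.9",
}
for url in tree.xpath("//s:url", namespaces=ns):
    loc = url.find("s:loc", ns).text
    # Your prefix ("n") need not match the document's ("news"); only the
    # namespace URI has to agree.
    title = url.find("n:news/n:title", ns)
    pub_date = url.find("n:news/n:publication_date", ns)
    print(loc,
          pub_date.text if pub_date is not None else "?",
          title.text if title is not None else "?")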
Image sitemaps
Same pattern: `<image:image>` children carry image URLs and captions. Rarely used by scrapers, but if you're building an image archive, look for them.
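A sketch of extracting them, assuming the standard image-sitemap namespace `http://www.google.com/schemas/sitemap-image/1.1`:

ns = {
    "s": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "img": "http://www.google.com/schemas/sitemap-image/1.1",
}
for url in tree.xpath("//s:url", namespaces=ns):
    page = url.find("s:loc", ns).text
    for image_loc in url.findall("img:image/img:loc", ns):
        print(page, "->", image_loc.text)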
Using sitemaps + category pages together
Sitemap discovery is great for "every page that exists." But sometimes you want "every page in this category." For that, scrape the category index pages directly:
import requests
from bs4 import BeautifulSoup

BASE = "https://practice.scrapingcentral.com"

sitemap_urls = parse_sitemap(...)  # all URLs, via your sitemap parser
product_urls = [u for u in sitemap_urls if "/products/" in u]

# Or, more targeted: only the URLs listed on a specific category page
category_html = requests.get(f"{BASE}/products?category=kitchen").text
soup = BeautifulSoup(category_html, "lxml")
kitchen_urls = [a["href"] for a in soup.select("article.product-card a")]
The sitemap gives breadth; category pages give curation.
Sitemaps you can't trust
Some sitemaps are out of date, list draft URLs, or include staging entries. Verify against the live site by spot-checking. Also, `<priority>` and `<changefreq>` are advisory: many sites set them once and forget them. Don't rely on them for scheduling.
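A simple spot-check is to sample a handful of entries and confirm the live site still serves them (some servers reject HEAD requests; fall back to GET if so):

import random
import requests

# Sample up to 20 sitemap entries and flag anything that isn't a live 200
for u in random.sample(urls, min(20, len(urls))):
    status = requests.head(u["url"], allow_redirects=True, timeout=10).status_code
    if status != 200:
        print(f"{status} -> {u['url']}")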
A reusable sitemap walker
import gzip
import lxml.etree
import requests

NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_sitemap_xml(url):
    r = requests.get(url, timeout=15)
    r.raise_for_status()
    content = r.content
    # Check the gzip magic bytes rather than the extension: if the server sent
    # Content-Encoding: gzip, requests already decompressed the body, and
    # decompressing twice would fail.
    if content[:2] == b"\x1f\x8b":
        content = gzip.decompress(content)
    return lxml.etree.fromstring(content)

def walk_sitemap(url):
    """Yield {'url': ..., 'lastmod': ...} for every URL, recursing through indexes."""
    tree = fetch_sitemap_xml(url)
    if tree.tag.endswith("sitemapindex"):
        for loc in tree.xpath("//s:sitemap/s:loc", namespaces=NS):
            yield from walk_sitemap(loc.text)
    else:
        for url_el in tree.xpath("//s:url", namespaces=NS):
            loc = url_el.find("s:loc", NS).text
            lastmod_el = url_el.find("s:lastmod", NS)
            yield {"url": loc, "lastmod": lastmod_el.text if lastmod_el is not None else None}
for entry in walk_sitemap("https://practice.scrapingcentral.com/sitemap.xml"):
    print(entry["url"], entry.get("lastmod"))
This generator handles indexes, compression, and namespaces. Drop it into any project that needs to discover a site's URL inventory.
Hands-on lab
Fetch `/sitemap.xml`. If it's an index, recurse into the child sitemaps. Print a count of URLs per top-level path prefix (`/products`, `/blog`, etc.). Then filter to only URLs with `lastmod` in the last 30 days; those are the freshest pages worth re-scraping.
Practice this lesson on Catalog108, our first-party scraping sandbox; the lab target is `/sitemap.xml`.