How to Scrape News Websites

A practical guide to scraping news articles from major outlets. Learn techniques for extracting headlines, article text, and metadata.

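News scraping is widely used for media monitoring, sentiment analysis, and content aggregation. This guide covers three ways to collect articles (RSS feeds, direct HTML scraping, and a scraping API), then looks at extracting clean article text and handling common obstacles.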

Common News Scraping Targets

  • Headlines and article titles
  • Full article text
  • Publication dates and authors
  • Categories and tags
  • Images and media

Method 1: RSS Feeds

Many news sites still offer RSS feeds. Because publishers provide feeds specifically for machine consumption, they are the easiest and least intrusive way to get article data.

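import feedparser

# feedparser downloads and parses the feed in a single call.
feed = feedparser.parse("https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml")
for entry in feed.entries:
    # Not every feed supplies a publication date, so fall back to an empty string.
    print(entry.title, entry.link, entry.get("published", ""))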
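
When a site's feed URL is not obvious, it is often advertised in the page's head through a link tag with rel="alternate". The following is a minimal sketch of that lookup using only the standard library. The names FeedLinkFinder and find_feeds are illustrative and do not come from any library, and the sketch assumes the site follows the usual RSS/Atom autodiscovery convention.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types conventionally used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (bare attributes), so normalize to "".
        a = {k: (v or "") for k, v in attrs}
        rel = a.get("rel", "").lower().split()
        if "alternate" in rel and a.get("type", "").lower() in FEED_TYPES and a.get("href"):
            self.hrefs.append(a["href"])


def find_feeds(html, base_url):
    """Return absolute feed URLs advertised in the given HTML (hypothetical helper)."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return [urljoin(base_url, href) for href in finder.hrefs]


sample = '<head><link rel="alternate" type="application/rss+xml" href="/feed.xml"></head>'
print(find_feeds(sample, "https://example.com/news/"))
# prints ['https://example.com/feed.xml']
```

Any URL this returns can be passed straight to feedparser.parse as in the snippet above.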

Method 2: Python + BeautifulSoup

For sites without RSS feeds, scrape the HTML directly.

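import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com/"
# Set a timeout so a stalled connection cannot hang the script.
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(resp.text, "html.parser")

# Each story title on the front page is a link inside a .titleline element.
for item in soup.select(".titleline > a"):
    print(item.get_text(strip=True), item["href"])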

Method 3: ScraperAPI for Paywalled or Protected Sites

Many news sites use paywalls or bot detection, or require JavaScript rendering to show their content. ScraperAPI is designed to handle proxying, bot detection, and rendering on your behalf.

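import requests

API_KEY = "YOUR_SCRAPERAPI_KEY"
url = "https://www.reuters.com/technology/"

# Pass the target as a query parameter so requests URL-encodes it correctly.
resp = requests.get(
    "https://api.scraperapi.com/",
    params={"api_key": API_KEY, "url": url, "render": "true"},
    timeout=60,  # rendered requests can be slow
)
resp.raise_for_status()
html = resp.text  # rendered HTML, ready for parsing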

Extracting Article Content

Use the newspaper3k or trafilatura libraries to cleanly extract article text from HTML.

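from trafilatura import fetch_url, extract

downloaded = fetch_url("https://example.com/article")
# fetch_url returns None when the download fails, so check before extracting.
if downloaded is not None:
    text = extract(downloaded)  # None if no main text could be found
    print(text)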
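
Article text is only part of the record. The publication dates and authors listed among the targets above often sit in meta tags in the page head. The following is a rough sketch of pulling them out with the standard library. The keys in META_KEYS (Open Graph and article: properties) are an assumption about how the target site marks up its pages, and MetaExtractor and extract_metadata are hypothetical names; expect to adjust the mapping per outlet.

```python
from html.parser import HTMLParser

# Assumed meta keys; real sites vary, so adjust this mapping per outlet.
META_KEYS = {
    "og:title": "title",
    "article:published_time": "published",
    "author": "author",
    "article:section": "section",
}


class MetaExtractor(HTMLParser):
    """Collects selected <meta> values, keeping the first occurrence of each field."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        # Open Graph uses property="...", classic meta tags use name="...".
        key = a.get("property") or a.get("name")
        content = a.get("content")
        field = META_KEYS.get((key or "").lower())
        if field and content and field not in self.meta:
            self.meta[field] = content.strip()


def extract_metadata(html):
    """Return a dict of whichever known metadata fields are present in the HTML."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta
```

Fields that a page does not declare are simply absent from the returned dict, so callers should use .get() rather than indexing.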

Handling Common Challenges

Challenge        | Solution
Paywalls         | Use ScraperAPI with rendering, or check for cached versions
Infinite scroll  | Use Playwright or render via ScrapingAnt
Rate limiting    | Rotate proxies and add delays
Dynamic content  | Enable JavaScript rendering
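
For the rate-limiting row, "add delays" can be as simple as a fixed pause between requests plus a longer wait after a failure. The sketch below shows one way to do that. The function name fetch_politely and its parameters are illustrative, and the fetch argument is assumed to be any callable that returns the page body or raises on failure.

```python
import time


def fetch_politely(urls, fetch, delay=1.0, max_retries=3, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests and backing off on failure.

    `fetch` is any callable that takes a URL and returns the page body,
    raising an exception on failure. `sleep` is injectable for testing.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # fixed pause between consecutive URLs
        for attempt in range(max_retries):
            try:
                results[url] = fetch(url)
                break
            except Exception:
                if attempt == max_retries - 1:
                    results[url] = None  # give up on this URL
                else:
                    # Exponential backoff: 2x, 4x, ... the base delay.
                    sleep(delay * 2 ** (attempt + 1))
    return results
```

With requests, a suitable fetch could be a small function that calls requests.get(url, timeout=10), then raise_for_status(), and returns resp.text. Proxy rotation is a separate concern and is not covered by this sketch.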

Best Practices

  1. Check for RSS feeds first; they are the simplest approach.
  2. Use trafilatura for clean article text extraction.
  3. Store articles with timestamps for time-series analysis.
  4. Deduplicate content, since the same story often appears on multiple pages.
  5. Respect copyright: scraping for analysis is different from republishing.
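
Practices 3 and 4 can be combined in a small storage layer. The following is a minimal sketch using SQLite from the standard library: each article is stored with a UTC fetch timestamp, and a hash of the normalized body text serves as the primary key so that an identical story found at a second URL is skipped. The table layout and the names content_key, open_store, and save_article are assumptions made for illustration. Exact-hash matching only catches identical text, not lightly edited copies.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def content_key(text):
    """Hash of lowercased, whitespace-normalized text, used to spot duplicates."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def open_store(path=":memory:"):
    """Open (or create) the article database; defaults to an in-memory store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               content_hash TEXT PRIMARY KEY,
               url          TEXT NOT NULL,
               title        TEXT,
               body         TEXT NOT NULL,
               fetched_at   TEXT NOT NULL
           )"""
    )
    return conn


def save_article(conn, url, title, body):
    """Insert the article unless identical text is already stored.

    Returns True if a new row was inserted, False if it was a duplicate.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?)",
        (content_key(body), url, title, body, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.rowcount == 1
```

Storing fetched_at as an ISO 8601 UTC string keeps rows sortable by time, which is what time-series queries need.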