Guide
How to Scrape News Websites
A practical guide to scraping news articles from major outlets. Learn techniques for extracting headlines, article text, and metadata.
News scraping is widely used for media monitoring, sentiment analysis, and content aggregation. Here is how to scrape news sites effectively.
Common News Scraping Targets
- Headlines and article titles
- Full article text
- Publication dates and authors
- Categories and tags
- Images and media
Method 1: RSS Feeds
Many news sites still offer RSS feeds, which are the easiest and most ethical way to get article data.
import feedparser
feed = feedparser.parse("https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml")
for entry in feed.entries:
    # Not every feed sets a publication date, so fall back to an empty string
    print(entry.title, entry.link, entry.get("published", ""))
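RSS feeds usually carry publication dates as RFC 822 strings, which are awkward to sort or compare. The following is a minimal sketch, using only the standard library, of turning such a string into a timezone-aware UTC datetime. The helper name `parse_rss_date` is hypothetical, and treating a date without a usable offset as UTC is an assumption rather than a rule.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_rss_date(value: str) -> Optional[datetime]:
    """Convert an RFC 822 date string to a UTC datetime, or None if unparseable."""
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        # No usable offset in the string; assume UTC (an assumption, not a rule).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

With the feed loop above, this could be called as `parse_rss_date(entry.published)` for entries that include a date.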
Method 2: Python + BeautifulSoup
For sites without RSS feeds, scrape the HTML directly.
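import requests
from bs4 import BeautifulSoup
url = "https://news.ycombinator.com/"
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(resp.text, "html.parser")
# Select the link directly inside each .titleline element
for item in soup.select(".titleline > a"):
    print(item.text, item["href"])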
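Links scraped from HTML are often relative to the page they were found on, so they may need resolving before they can be fetched or stored. Here is a small sketch of doing that with the standard library's `urljoin`. The helper `absolutize` and the sample hrefs are illustrative assumptions, not real scraped output.

```python
from urllib.parse import urljoin


def absolutize(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Illustrative hrefs, not real scraped output.
links = absolutize(
    "https://news.ycombinator.com/",
    ["item?id=1", "https://example.com/story", "/newest"],
)
print(links)
```

Absolute URLs pass through unchanged, so the helper can be applied to every `href` without checking its form first.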
Method 3: ScraperAPI for Paywalled or Protected Sites
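Many news sites use paywalls or bot detection, or need JavaScript to run before the content appears. ScraperAPI is built for these cases: it routes the request through its own proxies and, with render=true, renders the page before returning the HTML.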
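import requests
API_KEY = "YOUR_SCRAPERAPI_KEY"
url = "https://www.reuters.com/technology/"
# Pass the target as a parameter so requests URL-encodes it correctly
resp = requests.get(
    "http://api.scraperapi.com",
    params={"api_key": API_KEY, "url": url, "render": "true"},
    timeout=60,  # rendered requests can be slow
)
html = resp.text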
Extracting Article Content
Use the newspaper3k or trafilatura libraries to cleanly extract article text from HTML.
from trafilatura import fetch_url, extract
downloaded = fetch_url("https://example.com/article")
# fetch_url returns None when the download fails
if downloaded is not None:
    text = extract(downloaded)
    print(text)
Handling Common Challenges
| Challenge | Solution |
|---|---|
| Paywalls | Use ScraperAPI with rendering, or check for cached versions |
| Infinite scroll | Use Playwright or render via ScrapingAnt |
| Rate limiting | Rotate proxies and add delays |
| Dynamic content | Enable JavaScript rendering |
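For the rate-limiting row, the "add delays" part can be sketched as a retry loop with exponential backoff. This is a minimal illustration under assumptions: `fetch` is any callable you supply that returns an HTTP status code, status 429 is treated as the rate-limit signal, and the names `backoff_delays` and `fetch_with_backoff` are hypothetical. Proxy rotation is not shown.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield delays that double each time (base, 2*base, 4*base, ...), capped at cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2


def fetch_with_backoff(
    fetch: Callable[[], int],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch() until it stops returning 429 or attempts run out.

    Returns the last status code seen.
    """
    delays = backoff_delays()
    status = fetch()
    for _ in range(max_attempts - 1):
        if status != 429:
            break
        sleep(next(delays))  # wait longer after each rate-limited response
        status = fetch()
    return status
```

Passing `sleep` as a parameter is a design choice that keeps the function easy to test without real waiting; in normal use the default `time.sleep` applies.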
Best Practices
- Check for RSS feeds first; they are the simplest approach
- Use trafilatura for clean article text extraction
- Store articles with timestamps for time-series analysis
- Deduplicate content, since the same story often appears on multiple pages
- Respect copyright: scraping for analysis is different from republishing
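The timestamp and deduplication points above can be combined in one small store. The following is a sketch under assumptions: SQLite as the backing store, a SHA-256 hash of whitespace-normalized text as the duplicate key, and hypothetical names (`articles` table, `open_store`, `save_article`). Exact-hash matching only catches identical text, so lightly edited copies of a story would need a fuzzier comparison.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the table if needed; content_hash is the key, so duplicates are rejected."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               content_hash TEXT PRIMARY KEY,
               url TEXT NOT NULL,
               title TEXT,
               body TEXT NOT NULL,
               fetched_at TEXT NOT NULL
           )"""
    )
    return conn


def save_article(conn: sqlite3.Connection, url: str, title: str, body: str) -> bool:
    """Insert an article; return False if the same text was already stored."""
    normalized = " ".join(body.split())  # collapse whitespace before hashing
    content_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    fetched_at = datetime.now(timezone.utc).isoformat()  # UTC timestamp for time-series use
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?)",
        (content_hash, url, title, body, fetched_at),
    )
    conn.commit()
    return cur.rowcount == 1
```

A caller might use the return value to count how many fetched articles were new, for example `new = sum(save_article(conn, u, t, b) for u, t, b in batch)`, where `batch` is a hypothetical list of scraped tuples.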