Scraping Social Media APIs

Learn techniques for extracting data from social media platforms using their official APIs and alternative approaches.

Social media platforms are rich data sources for market research, sentiment analysis, and trend monitoring. Each platform has different API access levels and restrictions.

Reddit API (Most Scraper-Friendly)

Reddit offers free API access for low-volume use. Append .json to a listing URL, or use the official OAuth API, and always send a descriptive User-Agent:

import requests

headers = {"User-Agent": "ScrapingCentral/1.0 (educational)"}

# Append .json to any Reddit URL
url = "https://www.reddit.com/r/python/hot.json"
params = {"limit": 10}

response = requests.get(url, headers=headers, params=params, timeout=15)
response.raise_for_status()

posts = response.json()["data"]["children"]
for post in posts:
    p = post["data"]
    print(f"[{p['score']:>5} pts] {p['title'][:60]}")
    print(f"          r/{p['subreddit']} | {p['num_comments']} comments")
    print()
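A single request returns only one page of a listing. Reddit listings include an `after` cursor in `data`, which can be passed back as the `after` query parameter to fetch the next page. The sketch below shows that loop; `iter_listing` and the `fetch_page` callable are hypothetical names introduced here for illustration, not part of any library.

```python
from typing import Callable, Iterator, Optional


def iter_listing(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 3,
) -> Iterator[dict]:
    """Yield post dicts from successive Reddit listing pages.

    fetch_page(after) is a hypothetical callable that returns the parsed
    JSON of one listing page; `after` is None for the first page.
    """
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)["data"]
        for child in listing["children"]:
            yield child["data"]
        after = listing.get("after")
        if not after:  # the cursor is null once the listing is exhausted
            break
```

With requests, `fetch_page` could wrap the call shown earlier and pass `params={"limit": 25, "after": after}`. Capping `max_pages` keeps the request volume predictable.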

Reddit with PRAW (Python Reddit API Wrapper)

import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="ScrapingCentral/1.0",
)

subreddit = reddit.subreddit("webdev")
for post in subreddit.hot(limit=5):
    print(f"[{post.score}] {post.title}")
    # Drop "load more comments" placeholders, then print three top-level comments
    post.comments.replace_more(limit=0)
    for comment in post.comments[:3]:
        print(f"  -> {comment.body[:80]}")

PRAW must be installed before running this example:

pip install praw
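Hardcoding the client ID and secret makes them easy to leak into version control. One common alternative is to read them from environment variables. The helper below is a minimal sketch: `reddit_credentials` and the `REDDIT_*` variable names are assumptions chosen for this example, not names that PRAW itself defines.

```python
import os
from typing import Mapping


def reddit_credentials(env: Mapping[str, str] = os.environ) -> dict:
    """Build keyword arguments for praw.Reddit from environment variables."""
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # Fall back to the user agent used in the example above
        "user_agent": env.get("REDDIT_USER_AGENT", "ScrapingCentral/1.0"),
    }
```

The result can be unpacked directly into the constructor: `reddit = praw.Reddit(**reddit_credentials())`.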

Twitter/X API v2

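X (formerly Twitter) requires a developer account and a bearer token for API v2. Free-tier access is very limited, and some endpoints, including recent search, may require a paid tier, so check your plan before running this: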

import requests

bearer_token = "YOUR_BEARER_TOKEN"
headers = {"Authorization": f"Bearer {bearer_token}"}

# Search recent tweets
url = "https://api.twitter.com/2/tweets/search/recent"
params = {
    "query": "web scraping python -is:retweet lang:en",
    "max_results": 10,
    "tweet.fields": "created_at,public_metrics",
}

response = requests.get(url, headers=headers, params=params, timeout=15)
response.raise_for_status()

for tweet in response.json().get("data", []):
    metrics = tweet["public_metrics"]
    print(f"[{metrics['like_count']} likes] {tweet['text'][:80]}...")
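With quotas this tight, a 429 response is likely. The API reports the end of the current rate-limit window in the `x-rate-limit-reset` response header as a Unix timestamp. The sketch below turns that header into a wait time; `seconds_until_reset` is a hypothetical helper name, and the `now` parameter exists only to make the function testable.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait before retrying, based on x-rate-limit-reset.

    The header holds a Unix timestamp in seconds. Returns 0.0 when the
    header is absent or the reset time has already passed.
    """
    now = time.time() if now is None else now
    reset = headers.get("x-rate-limit-reset")
    if reset is None:
        return 0.0
    return max(0.0, float(reset) - now)
```

A caller might check `response.status_code == 429` and then `time.sleep(seconds_until_reset(response.headers) + 1)` before retrying once, rather than retrying in a tight loop.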

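YouTube Data API

The YouTube Data API v3 accepts an API key for public data. The search endpoint can filter results by type and order them by view count: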

import requests

api_key = "YOUR_YOUTUBE_API_KEY"
url = "https://www.googleapis.com/youtube/v3/search"
params = {
    "part": "snippet",
    "q": "python web scraping tutorial",
    "type": "video",
    "maxResults": 5,
    "key": api_key,
    "order": "viewCount",
}

response = requests.get(url, params=params, timeout=15)
response.raise_for_status()  # fail loudly on quota or API-key errors
videos = response.json().get("items", [])

for video in videos:
    title = video["snippet"]["title"]
    video_id = video["id"]["videoId"]
    print(f"{title}")
    print(f"  https://youtube.com/watch?v={video_id}\n")
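Search results requested with `part=snippet` do not include view or like counts, even when ordered by `viewCount`. A follow-up call to the `videos` endpoint with `part=statistics` can fetch them for several IDs at once. The sketch below only builds that request from a search response; `build_stats_request` is a hypothetical helper name.

```python
def build_stats_request(search_payload: dict, api_key: str) -> tuple:
    """Return (url, params) for a videos.list call covering the search results."""
    video_ids = [
        item["id"]["videoId"]
        for item in search_payload.get("items", [])
        if "videoId" in item.get("id", {})  # skip channels and playlists
    ]
    url = "https://www.googleapis.com/youtube/v3/videos"
    params = {
        "part": "statistics",
        "id": ",".join(video_ids),  # the endpoint accepts a comma-separated list
        "key": api_key,
    }
    return url, params
```

Passing the result to `requests.get(url, params=params, timeout=15)` should return one item per video, each with a `statistics` object. Batching the IDs into one call uses less quota than one request per video.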

API Access Comparison

Platform	Free Tier	Rate Limits	Auth Required
Reddit	Generous	60 req/min	User-Agent only (basic)
Twitter/X	Very limited	100 tweets/month (free)	OAuth 2.0 Bearer
YouTube	10,000 units/day	Per-endpoint quotas	API Key
GitHub	Generous	60/hr unauth, 5000/hr auth	Optional (token)
LinkedIn	Restricted	Varies by product	OAuth 2.0

Ethical Considerations

Respect rate limits: social platforms actively ban clients that exceed them.
Check the Terms of Service: some platforms prohibit scraping outright.
Avoid personal data: handle user information with care, since laws such as GDPR and CCPA may apply.
Use official APIs when available instead of scraping the frontend.
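Respecting a rate limit can be as simple as spacing requests evenly. The sketch below converts a limit such as 60 requests per minute into a minimum interval and waits when calls arrive too quickly. `Throttle` is a hypothetical class written for this example; the `clock` and `sleep` parameters are injectable only so the behavior can be tested without real delays.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space calls so that at most max_requests happen per per_seconds."""

    def __init__(
        self,
        max_requests: int,
        per_seconds: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = per_seconds / max_requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay
```

Calling `throttle.wait()` before each `requests.get(...)` keeps a loop under the limit. For example, `Throttle(60, 60)` enforces one second between requests. Even spacing is more conservative than the limit strictly requires, which is usually the safer choice.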

For social media sites that block direct API access, ScrapingAnt provides headless browser rendering that can load JavaScript-heavy social feeds.

Next Steps

Build a data pipeline for continuous social media monitoring
Compare API scraping vs HTML scraping approaches
Process and clean social media data with pandas