Tutorial
How to Scrape IMDB Movie Data
Learn how to scrape IMDB movie data including ratings, cast, reviews, and box office information using Python and web scraping techniques.
IMDB is one of the most widely used sources of movie and TV show data. It offers a commercial API, but scraping the public website is a common approach for research and personal projects.
What You Can Extract
- Movie titles, release years, and genres
- Ratings and vote counts
- Cast and crew information
- Plot summaries and reviews
- Box office data
- Episode guides for TV series
Method 1: ScraperAPI (Recommended)
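IMDB applies anti-bot measures that can block or throttle automated requests. ScraperAPI handles them automatically by routing each request through its own proxy infrastructure.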
import json

import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_SCRAPERAPI_KEY"


def scrape_movie(imdb_id):
    """Fetch an IMDB title page through ScraperAPI and parse its JSON-LD block."""
    response = requests.get(
        "http://api.scraperapi.com",
        params={
            "api_key": API_KEY,
            "url": f"https://www.imdb.com/title/{imdb_id}/"
        },
        timeout=60
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract JSON-LD structured data (most reliable)
    ld_script = soup.find("script", type="application/ld+json")
    if not ld_script or not ld_script.string:
        return None  # No structured data found on the page

    data = json.loads(ld_script.string)
    rating = data.get("aggregateRating") or {}
    # "director" may be a single object or a list of objects
    directors = data.get("director") or []
    if isinstance(directors, dict):
        directors = [directors]
    return {
        "title": data.get("name"),
        "rating": rating.get("ratingValue"),
        "votes": rating.get("ratingCount"),
        "genre": data.get("genre"),
        "director": directors[0].get("name") if directors else None,
        "description": data.get("description")
    }


movie = scrape_movie("tt1375666")  # Inception
if movie:
    print(f"Title: {movie['title']}")
    print(f"Rating: {movie['rating']}/10 ({movie['votes']} votes)")
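The `director` field in JSON-LD may appear as a single object or as a list of objects, so a small normalizing helper can keep the extraction code simple. The sketch below is illustrative only: `person_names` is a hypothetical helper name, and the sample dictionaries are made-up stand-ins for parsed JSON-LD, not real IMDB output.

```python
def person_names(value):
    """Return a list of names from a JSON-LD person field.

    Accepts None, a single dict, or a list of dicts (hypothetical helper).
    """
    if value is None:
        return []
    if isinstance(value, dict):
        value = [value]
    # Skip entries that are not dicts or that lack a "name" key
    return [p["name"] for p in value if isinstance(p, dict) and p.get("name")]


# Made-up sample inputs covering the shapes the field might take
single = {"director": {"@type": "Person", "name": "Director A"}}
several = {"director": [{"name": "Director A"}, {"name": "Director B"}]}
missing = {}

print(person_names(single.get("director")))
print(person_names(several.get("director")))
print(person_names(missing.get("director")))
```

Returning a list in every case means callers can join the names or take the first one without checking the type each time.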
Method 2: IMDB's Top 250 List
import json

import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_SCRAPERAPI_KEY"

response = requests.get(
    "http://api.scraperapi.com",
    params={
        "api_key": API_KEY,
        "url": "https://www.imdb.com/chart/top/"
    },
    timeout=60
)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# IMDB embeds structured data in the page
ld_data = soup.find("script", type="application/ld+json")
if ld_data and ld_data.string:
    data = json.loads(ld_data.string)
    items = data.get("itemListElement", [])
    # Print the first ten entries; fall back to list order if "position" is absent
    for rank, item in enumerate(items[:10], start=1):
        movie = item.get("item", {})
        rating = (movie.get("aggregateRating") or {}).get("ratingValue")
        print(f"{item.get('position', rank)}. {movie.get('name')} - Rating: {rating}")
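For anything beyond printing, it can help to flatten the parsed list into plain rows first. The following is a minimal sketch under assumptions: `flatten_item_list` is a hypothetical helper, the key names mirror those used in the code above, and the sample dictionary is invented for illustration rather than copied from an IMDB page.

```python
def flatten_item_list(data, limit=None):
    """Turn a JSON-LD ItemList dict into a list of simple row dicts."""
    rows = []
    for rank, entry in enumerate(data.get("itemListElement", []), start=1):
        movie = entry.get("item") or {}
        rating = movie.get("aggregateRating") or {}
        rows.append({
            # Fall back to list order when no explicit position is given
            "position": entry.get("position", rank),
            "title": movie.get("name"),
            "rating": rating.get("ratingValue"),
        })
    return rows if limit is None else rows[:limit]


# Invented sample data shaped like the structure parsed above
sample = {
    "itemListElement": [
        {"position": 1, "item": {"name": "Example Movie A",
                                 "aggregateRating": {"ratingValue": 9.1}}},
        {"item": {"name": "Example Movie B"}},
    ]
}

for row in flatten_item_list(sample):
    print(row)
```

Rows in this shape can be passed directly to `csv.DictWriter` or `pandas.DataFrame` if the list needs to be saved or analyzed.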
Method 3: IMDB Datasets (Free Official Data)
IMDB provides free downloadable datasets for non-commercial use at https://datasets.imdbws.com/.
import csv

import pandas as pd

# Download and load IMDB's ratings dataset
ratings = pd.read_csv(
    "https://datasets.imdbws.com/title.ratings.tsv.gz",
    sep="\t",
    compression="gzip"
)

# Load the title basics dataset; "\N" marks missing values
basics = pd.read_csv(
    "https://datasets.imdbws.com/title.basics.tsv.gz",
    sep="\t",
    compression="gzip",
    na_values="\\N",
    quoting=csv.QUOTE_NONE,  # titles may contain unbalanced quote characters
    low_memory=False  # avoid mixed-dtype warnings on this large file
)

# Merge and filter for top-rated movies
movies = basics.merge(ratings, on="tconst")
top_movies = movies[
    (movies["titleType"] == "movie") &
    (movies["numVotes"] > 100000)
].sort_values("averageRating", ascending=False)

print(top_movies[["primaryTitle", "startYear", "averageRating", "numVotes"]].head(10))
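Loading whole files into pandas can take a fair amount of memory. When only a filtered subset is needed, one alternative is to stream the gzip file row by row with the standard library. This is a sketch of that different technique, not the pandas approach above: `iter_popular_ratings` is a hypothetical helper, and the in-memory sample uses placeholder IDs and values that only mirror the column names used above (`tconst`, `averageRating`, `numVotes`).

```python
import csv
import gzip
import io


def iter_popular_ratings(fileobj, min_votes=100000):
    """Yield (tconst, rating, votes) for rows with more than min_votes votes.

    fileobj is a binary file-like object containing gzip-compressed TSV data
    with the columns tconst, averageRating, and numVotes.
    """
    with gzip.open(fileobj, mode="rt", encoding="utf-8", newline="") as text:
        reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            votes = int(row["numVotes"])
            if votes > min_votes:
                yield row["tconst"], float(row["averageRating"]), votes


# Placeholder rows standing in for title.ratings.tsv.gz (values are made up)
sample_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t8.1\t250000\n"
    "tt0000002\t6.4\t900\n"
    "tt0000003\t9.0\t100001\n"
)
sample_file = io.BytesIO(gzip.compress(sample_tsv.encode("utf-8")))

# Sort the filtered rows by rating, highest first
popular = sorted(iter_popular_ratings(sample_file), key=lambda r: r[1], reverse=True)
for tconst, rating, votes in popular:
    print(tconst, rating, votes)
```

With a downloaded copy of the real file, the same function should work on `open("title.ratings.tsv.gz", "rb")`, assuming the file has been saved under that name.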
Best Strategy
Start with the free IMDB datasets for bulk data. Use web scraping only for data the datasets do not include, such as reviews, box office figures, or real-time updates. When you do scrape, prefer the JSON-LD structured data embedded in IMDB pages, as it is the most reliable extraction target.
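One way to connect the two approaches is to take `tconst` identifiers from the dataset results and turn them into the title-page URLs that the Method 1 scraper requests. The sketch below is a rough illustration: `title_urls` and `TCONST_PATTERN` are hypothetical names, and the pattern assumes IDs consist of "tt" followed by at least seven digits.

```python
import re

# Assumed ID shape: "tt" followed by at least seven digits
TCONST_PATTERN = re.compile(r"tt\d{7,}")


def title_urls(tconsts):
    """Return unique IMDB title-page URLs for valid-looking IDs, preserving order."""
    seen = set()
    urls = []
    for tconst in tconsts:
        # Skip duplicates and anything that does not look like a title ID
        if tconst in seen or not TCONST_PATTERN.fullmatch(tconst):
            continue
        seen.add(tconst)
        urls.append(f"https://www.imdb.com/title/{tconst}/")
    return urls


print(title_urls(["tt1375666", "tt1375666", "not-an-id"]))
```

Deduplicating and validating IDs before scraping avoids spending requests on pages that were already fetched or that cannot exist.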