robots.txt and Legal Considerations
Understand robots.txt, Terms of Service, and the legal landscape of web scraping to scrape responsibly.
Web scraping exists in a legal gray area. Understanding robots.txt, Terms of Service, and relevant laws helps you scrape responsibly and reduce legal risk.
Understanding robots.txt
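The robots.txt file tells crawlers which paths they may and may not access. It is advisory: nothing enforces it technically, but well-behaved crawlers follow it. When a site publishes one, it lives at the root of the host: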
https://example.com/robots.txt
Parsing robots.txt in Python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def can_scrape(url: str, user_agent: str = "*") -> bool:
    """Check if scraping a URL is allowed by robots.txt."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # Downloads robots.txt; raises URLError if the host is unreachable
    return rp.can_fetch(user_agent, url)


# Examples (results depend on each site's current robots.txt)
print(can_scrape("https://example.com/products"))
print(can_scrape("https://example.com/admin"))
print(can_scrape("https://twitter.com/search"))  # Check for yourself
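The can_scrape function above downloads robots.txt again on every call. A minimal sketch of one way to avoid that is to cache one parser per robots.txt URL. The names robots_url_for, get_parser, and can_scrape_cached are hypothetical helpers, and treating an unreachable host as "not allowed" is a conservative assumption rather than a rule:

```python
from functools import lru_cache
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def robots_url_for(url: str) -> str:
    """Build the robots.txt URL for the host that serves `url`."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}/robots.txt"


@lru_cache(maxsize=128)
def get_parser(robots_url: str) -> RobotFileParser:
    """Download and parse robots.txt once per distinct URL."""
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # network request; raises URLError (an OSError) on connection failure
    return rp


def can_scrape_cached(url: str, user_agent: str = "*") -> bool:
    """Like can_scrape, but reuses a cached parser for each host."""
    try:
        rp = get_parser(robots_url_for(url))
    except OSError:
        # Assumption: if robots.txt cannot be fetched, do not scrape.
        return False
    return rp.can_fetch(user_agent, url)
```

Because lru_cache does not store exceptions, a failed download is retried on the next call. A long-running crawler would probably also want to expire entries periodically, since site owners can change their rules.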
Common robots.txt Directives
# Allow all bots to access everything
User-agent: *
Allow: /
# Block all bots from /private/
User-agent: *
Disallow: /private/
# Ask bots to wait 10 seconds between requests (non-standard; some crawlers ignore it)
User-agent: *
Crawl-delay: 10
# Sitemap location
Sitemap: https://example.com/sitemap.xml
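To see how these directives are interpreted, the sketch below feeds a small robots.txt to Python's RobotFileParser from memory, with no network request. The file content is a hypothetical combination of the examples above, and site_maps() assumes Python 3.8 or newer:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content combining the directives shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse from memory instead of calling read()

allowed_public = rp.can_fetch("*", "https://example.com/products")
allowed_private = rp.can_fetch("*", "https://example.com/private/report")
delay = rp.crawl_delay("*")
sitemaps = rp.site_maps()  # Python 3.8+

print(allowed_public)   # True
print(allowed_private)  # False
print(delay)            # 10
print(sitemaps)         # ['https://example.com/sitemap.xml']
```

Parsing from a string like this is also a convenient way to unit test scraper logic against specific rules.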
Respecting Crawl-Delay
import time
from urllib.robotparser import RobotFileParser

import requests

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
crawl_delay = rp.crawl_delay("*") or 1  # Default to 1 second if no Crawl-delay is set
urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]
for url in urls:
    if rp.can_fetch("*", url):
        response = requests.get(url, timeout=15)
        print(f"{url}: {response.status_code}")
        time.sleep(crawl_delay)
    else:
        print(f"Blocked by robots.txt: {url}")
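Some sites express their limit as a Request-rate directive (for example, 2 requests per 10 seconds) instead of, or in addition to, Crawl-delay. RobotFileParser exposes this through request_rate(), available since Python 3.6. The sketch below uses a hypothetical helper, polite_delay, that picks the most conservative of the two values and a default; choosing the maximum is a design assumption, not something robots.txt requires:

```python
from urllib.robotparser import RobotFileParser


def polite_delay(rp: RobotFileParser, user_agent: str = "*", default: float = 1.0) -> float:
    """Return the longest wait (in seconds) implied by robots.txt or the default."""
    candidates = [default]
    delay = rp.crawl_delay(user_agent)    # None if no Crawl-delay applies
    if delay is not None:
        candidates.append(float(delay))
    rate = rp.request_rate(user_agent)    # None or RequestRate(requests, seconds)
    if rate is not None and rate.requests > 0:
        candidates.append(rate.seconds / rate.requests)
    return max(candidates)


# Usage with an in-memory robots.txt (hypothetical content, no network needed)
rp = RobotFileParser()
rp.parse(["User-agent: *", "Request-rate: 2/10"])
print(polite_delay(rp))  # 5.0
```

The returned value can be passed to time.sleep() in place of crawl_delay in the loop above.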
Legal Landscape
Key Legal Cases
| Case | Year | Outcome | Significance |
|---|---|---|---|
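| hiQ vs LinkedIn | 2022 | Mixed: the Ninth Circuit held that scraping public data likely does not violate the CFAA, but hiQ was later found to have breached LinkedIn's User Agreement and the case settled | Scraping public data is unlikely to be a CFAA violation, but breach-of-contract claims can still succeed |
| Clearview AI | 2022 | Fined by several European data protection authorities | Scraping personal photos for facial recognition violated privacy laws |
| Meta vs Bright Data | 2024 | Bright Data won on the contract claims | Meta's terms were held not to bar logged-out scraping of public pages |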
General Guidelines
Usually Lower Risk:
- Scraping publicly accessible data
- Respecting robots.txt
- Not circumventing login walls
- Not overloading the server
- Using data for research, comparison, or aggregation
Higher Risk:
- Scraping behind a login
- Ignoring robots.txt Disallow rules
- Scraping personal/private data
- Republishing copyrighted content verbatim
- Violating Terms of Service that you explicitly accepted when creating an account
Best Practices for Legal Safety
- Check robots.txt before scraping any site
- Read the Terms of Service for sites you scrape heavily
- Do not scrape personal data without a legitimate basis (especially under GDPR)
- Do not republish copyrighted content; extract facts and data
- Identify your scraper with a descriptive User-Agent, ideally one that includes a way to contact you
- Rate limit your requests to avoid harming the site
- Cache responses to minimize repeated requests
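Several of these practices can be combined in a small wrapper. The following is a minimal sketch using only the standard library, not a complete client: the class name PoliteFetcher, the bot name, and the contact URL in USER_AGENT are all hypothetical, and the cache is a plain in-memory dictionary with no expiry.

```python
import time
from typing import Callable, Dict, Optional
from urllib.request import Request, urlopen

# Hypothetical bot name and contact page; replace with your own.
USER_AGENT = "example-research-bot/1.0 (+https://example.com/bot-info)"


def http_get(url: str) -> bytes:
    """Fetch a URL while identifying the scraper via the User-Agent header."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=15) as resp:
        return resp.read()


class PoliteFetcher:
    """Rate-limits requests and caches response bodies in memory."""

    def __init__(self, fetch: Callable[[str], bytes] = http_get,
                 min_interval: float = 1.0) -> None:
        self.fetch = fetch
        self.min_interval = min_interval
        self._cache: Dict[str, bytes] = {}
        self._last_request: Optional[float] = None

    def get(self, url: str) -> bytes:
        if url in self._cache:
            return self._cache[url]  # repeat requests never hit the site
        if self._last_request is not None:
            wait = self.min_interval - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)  # keep at least min_interval between requests
        try:
            body = self.fetch(url)
        finally:
            self._last_request = time.monotonic()  # failed requests count too
        self._cache[url] = body
        return body
```

A caller would create one PoliteFetcher per site, for example PoliteFetcher(min_interval=2.0), and route every request through its get method. The fetch parameter is injectable so the rate limiting and caching can be tested without network access.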
Disclaimer
This article is for educational purposes only and is not legal advice. Consult a lawyer if you have concerns about the legality of your scraping project.