robots.txt and Legal Considerations

Understand robots.txt, Terms of Service, and the legal landscape of web scraping to scrape responsibly.

Web scraping exists in a legal gray area. Understanding robots.txt, Terms of Service, and relevant laws helps you scrape responsibly and reduce legal risk.

Understanding robots.txt

The robots.txt file tells crawlers which paths they may and may not access. It is advisory rather than enforced, and not every site publishes one. When it exists, it lives at the root of the host:

https://example.com/robots.txt

Parsing robots.txt in Python

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url: str, user_agent: str = "*") -> bool:
    """Check whether robots.txt allows user_agent to fetch url."""
    # Build the robots.txt URL from the scheme and host of the target URL
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # Downloads and parses robots.txt (one network request per call)

    return rp.can_fetch(user_agent, url)

# Examples
print(can_scrape("https://example.com/products"))      # Likely True
print(can_scrape("https://example.com/admin"))          # Likely False
print(can_scrape("https://twitter.com/search"))         # Check for yourself
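
The can_scrape function above downloads robots.txt again on every call. One way to avoid that is to keep a parsed copy per host. The following is a minimal sketch, not a definitive implementation: RobotsCache, fetch_parser, and the loader parameter are hypothetical names introduced here for illustration, and the loader hook exists only so the cache can be exercised without network access.

```python
from typing import Callable, Dict
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def fetch_parser(robots_url: str) -> RobotFileParser:
    """Download and parse one robots.txt file (network request)."""
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


class RobotsCache:
    """Hypothetical helper: keeps one parsed robots.txt per scheme and host."""

    def __init__(self, loader: Callable[[str], RobotFileParser] = fetch_parser):
        self._loader = loader
        self._parsers: Dict[str, RobotFileParser] = {}

    def can_fetch(self, url: str, user_agent: str = "*") -> bool:
        parsed = urlparse(url)
        robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
        if robots_url not in self._parsers:
            # First URL seen for this host: load its robots.txt once
            self._parsers[robots_url] = self._loader(robots_url)
        return self._parsers[robots_url].can_fetch(user_agent, url)
```

With the default loader, cache = RobotsCache() followed by repeated cache.can_fetch(...) calls for the same host triggers a single download. A long-running scraper would probably also want to expire entries after some time, which this sketch leaves out.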

Common robots.txt Directives

# Allow all bots to access everything
User-agent: *
Allow: /

# Block all bots from /private/
User-agent: *
Disallow: /private/

# Ask bots to wait 10 seconds between requests (non-standard; not every crawler honors it)
User-agent: *
Crawl-delay: 10

# Sitemap location
Sitemap: https://example.com/sitemap.xml
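
To see how these directives are interpreted, the standard library parser can be fed rules from a string instead of a URL. This is a small sketch under stated assumptions: the ROBOTS_TXT content is illustrative and not taken from a real site, and the rules are combined into a single User-agent group because parsers differ in how they treat repeated groups for the same agent.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules, not taken from a real site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # Parse from memory, no network request

allowed_public = rp.can_fetch("*", "https://example.com/products")
allowed_private = rp.can_fetch("*", "https://example.com/private/report")
delay = rp.crawl_delay("*")   # Value of Crawl-delay for this agent, or None
sitemaps = rp.site_maps()     # Sitemap URLs, or None (Python 3.8+)

print(allowed_public, allowed_private, delay, sitemaps)
```

With these rules, /products is allowed, /private/report is blocked, the crawl delay is 10 seconds, and one sitemap URL is reported.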

Respecting Crawl-Delay

from urllib.robotparser import RobotFileParser
import time
import requests

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

crawl_delay = rp.crawl_delay("*") or 1  # Fall back to 1 second when no Crawl-delay is set

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

for url in urls:
    if rp.can_fetch("*", url):
        response = requests.get(url, timeout=15)
        print(f"{url}: {response.status_code}")
        time.sleep(crawl_delay)
    else:
        print(f"Blocked by robots.txt: {url}")

Legal Landscape

Key Legal Cases

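Case | Year | Outcome | Significance
hiQ v. LinkedIn | 2022 | Ninth Circuit sided with hiQ on the CFAA question; hiQ later lost on LinkedIn's breach-of-contract claim | Scraping publicly accessible data is unlikely to violate the CFAA, but Terms of Service can still be enforced
Clearview AI | 2022 | Fined by several European data protection authorities | Scraping photos for facial recognition violated privacy laws
Meta v. Bright Data | 2024 | Bright Data won on summary judgment | Meta's terms did not bar logged-out scraping of public pages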

General Guidelines

Usually Safe:

  • Scraping publicly accessible data
  • Respecting robots.txt
  • Not circumventing login walls
  • Not overloading the server
  • Using data for research, comparison, or aggregation

Higher Risk:

  • Scraping behind a login
  • Ignoring robots.txt Disallow rules
  • Scraping personal/private data
  • Republishing copyrighted content verbatim
  • Violating Terms of Service you explicitly accepted when creating an account

Best Practices for Legal Safety

  1. Check robots.txt before scraping any site
  2. Read the Terms of Service for sites you scrape heavily
  3. Do not scrape personal data without a legitimate basis (especially under GDPR)
  4. Do not republish copyrighted content; extract facts and data
  5. Identify your scraper with a descriptive User-Agent, ideally one that includes a way to contact you
  6. Rate limit your requests to avoid harming the site
  7. Cache responses to minimize repeated requests
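
Items 5 to 7 above can be combined in one small wrapper. The following is a rough sketch rather than a definitive implementation: PoliteFetcher, http_get, the USER_AGENT string, and the bot-info URL are all hypothetical, and the fetch, clock, and sleep parameters are assumptions added so the logic can be checked without touching the network. It uses only the standard library; the same idea applies to requests.

```python
import time
from typing import Callable, Dict, Optional
from urllib.request import Request, urlopen

# Hypothetical identity string; replace with your own project and contact page
USER_AGENT = "example-research-bot/1.0 (+https://example.org/bot-info)"


def http_get(url: str) -> bytes:
    """Fetch a URL while sending the identifying User-Agent header."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=15) as resp:
        return resp.read()


class PoliteFetcher:
    """Hypothetical helper: rate-limits requests and caches responses in memory."""

    def __init__(
        self,
        min_interval: float = 1.0,
        fetch: Callable[[str], bytes] = http_get,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self.min_interval = min_interval
        self._fetch = fetch
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, bytes] = {}
        self._last_request: Optional[float] = None

    def get(self, url: str) -> bytes:
        # Serve repeated URLs from the cache without contacting the site
        if url in self._cache:
            return self._cache[url]
        # Wait until at least min_interval seconds have passed since the last request
        if self._last_request is not None:
            wait = self.min_interval - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body
```

A typical call would be fetcher = PoliteFetcher(min_interval=crawl_delay) followed by fetcher.get(url), after the URL has passed a robots.txt check. The in-memory cache is the simplest possible choice; a persistent cache with expiry would suit longer jobs better.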

Disclaimer

This article is for educational purposes only and is not legal advice. Consult a lawyer if you have concerns about the legality of your scraping project.