robots.txt and Legal Considerations - Anti-Detection

Understand robots.txt, Terms of Service, and the legal landscape of web scraping to scrape responsibly.

Web scraping exists in a legal gray area. Understanding robots.txt, Terms of Service, and relevant laws helps you scrape responsibly and reduce legal risk.

Understanding robots.txt

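The robots.txt file tells crawlers which paths the site owner does and does not want them to access. It is a voluntary convention (the Robots Exclusion Protocol, standardized as RFC 9309), not an access control mechanism. Sites that publish one serve it from the root of the host: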

https://example.com/robots.txt

Parsing robots.txt in Python

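from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url: str, user_agent: str = "*") -> bool:
    """Check if scraping a URL is allowed by robots.txt."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    try:
        # Downloads and parses robots.txt (one network request per call)
        rp.read()
    except OSError:
        # Network failure (DNS error, timeout, refused connection):
        # treat the URL as not allowed rather than guessing
        return False

    return rp.can_fetch(user_agent, url)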

# Examples
print(can_scrape("https://example.com/products"))      # Allowed unless robots.txt disallows it
print(can_scrape("https://example.com/admin"))          # Blocked only if robots.txt disallows /admin
print(can_scrape("https://twitter.com/search"))         # Check for yourself
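
The can_scrape helper above downloads robots.txt on every call, which adds a request per URL checked. One possible refinement, sketched below, keeps one parsed file per site. The names RobotsCache, load_from_network, and demo_loader are hypothetical and are not part of the standard library; the demo uses fixed in-memory rules so that it runs without network access.

```python
from typing import Callable, Dict
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def load_from_network(robots_url: str) -> RobotFileParser:
    """Download and parse one robots.txt file."""
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp


class RobotsCache:
    """Keeps one parsed robots.txt per site (hypothetical helper)."""

    def __init__(self, loader: Callable[[str], RobotFileParser] = load_from_network):
        self._loader = loader
        self._parsers: Dict[str, RobotFileParser] = {}

    def can_fetch(self, url: str, user_agent: str = "*") -> bool:
        parsed = urlparse(url)
        robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
        if robots_url not in self._parsers:
            # First URL seen for this site: load its robots.txt once
            self._parsers[robots_url] = self._loader(robots_url)
        return self._parsers[robots_url].can_fetch(user_agent, url)


# Offline demonstration: a stand-in loader that parses fixed rules
# instead of making a network request.
def demo_loader(robots_url: str) -> RobotFileParser:
    rp = RobotFileParser()
    rp.parse(["User-agent: *", "Disallow: /admin"])
    return rp


cache = RobotsCache(loader=demo_loader)
print(cache.can_fetch("https://example.com/products"))  # True
print(cache.can_fetch("https://example.com/admin"))     # False
```

For real use, construct RobotsCache() with the default loader. A long-running scraper would also want to refresh cached entries periodically, since site owners can change their rules at any time.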

Common robots.txt Directives

# Allow all bots to access everything
User-agent: *
Allow: /

# Block all bots from /private/
User-agent: *
Disallow: /private/

# Ask bots to wait 10 seconds between requests
# (non-standard directive; some crawlers, including Googlebot, ignore it)
User-agent: *
Crawl-delay: 10

# Sitemap location
Sitemap: https://example.com/sitemap.xml
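
To see how Python's standard-library parser interprets directives like these, the sketch below feeds it a small robots.txt held in memory, so no network request is made. The file content and the ExampleBot name are made up for illustration. Note that site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory (no network request)
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10

User-agent: ExampleBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Generic bots may fetch anything outside /private/
print(rp.can_fetch("*", "https://example.com/products"))          # True
print(rp.can_fetch("*", "https://example.com/private/report"))    # False

# ExampleBot has its own group, which blocks the whole site
print(rp.can_fetch("ExampleBot", "https://example.com/products"))  # False

# Crawl-delay and Sitemap values are exposed through separate methods
print(rp.crawl_delay("*"))  # 10
print(rp.site_maps())       # ['https://example.com/sitemap.xml']
```

A crawler that has its own User-agent group follows that group rather than the generic * group, which is why ExampleBot is blocked from a path that other bots may fetch.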

Respecting Crawl-Delay

from urllib.robotparser import RobotFileParser
import time
import requests

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

crawl_delay = rp.crawl_delay("*") or 1  # Default 1 second

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

for url in urls:
    if rp.can_fetch("*", url):
        response = requests.get(url, timeout=15)
        print(f"{url}: {response.status_code}")
        time.sleep(crawl_delay)
    else:
        print(f"Blocked by robots.txt: {url}")

Legal Landscape

Key Legal Cases

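Case	Year	Outcome	Significance
hiQ vs LinkedIn	2022	Mixed: the Ninth Circuit sided with hiQ on the CFAA question, but the district court later found that hiQ breached LinkedIn's User Agreement	Scraping publicly accessible data is unlikely to violate the CFAA, but breach-of-contract claims can still succeed
Clearview AI	2022	Fined by several European data protection authorities	Scraping photos for facial recognition violated privacy laws
Meta vs Bright Data	2024	Bright Data won summary judgment on Meta's contract claims	Meta's terms were held not to cover logged-out scraping of public pages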

General Guidelines

Usually Lower Risk:

Scraping publicly accessible data
Respecting robots.txt
Not circumventing login walls
Not overloading the server
Using data for research, comparison, or aggregation

Higher Risk:

Scraping behind a login
Ignoring robots.txt Disallow rules
Scraping personal/private data
Republishing copyrighted content verbatim
Violating Terms of Service that you explicitly accepted when creating an account

Best Practices for Legal Safety

Check robots.txt before scraping any site
Read the Terms of Service for sites you scrape heavily
Do not scrape personal data without a legitimate basis (especially under GDPR)
Do not republish copyrighted content; extract facts and data
Identify your scraper with a descriptive User-Agent when a site asks bots to identify themselves
Rate limit your requests to avoid harming the site
Cache responses to minimize repeated requests
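
The last three practices (a descriptive User-Agent, rate limiting, and caching) can be combined in a small wrapper. The following is a minimal sketch rather than a production client: PoliteClient, fetch_with_requests, the bot name, and the contact URL are all hypothetical, and the cache is a plain in-memory dictionary that never expires.

```python
import time
from typing import Callable, Dict, Optional


def fetch_with_requests(url: str, headers: Dict[str, str]) -> str:
    """Fetch a page body with the requests library."""
    import requests  # imported lazily so the class below has no hard dependency

    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    return response.text


class PoliteClient:
    """Rate-limits and caches GET requests (illustrative sketch)."""

    def __init__(
        self,
        fetch: Callable[[str, Dict[str, str]], str] = fetch_with_requests,
        min_interval: float = 1.0,
        # Hypothetical bot name and contact URL; replace with your own
        user_agent: str = "ExampleResearchBot/0.1 (+https://example.com/bot-info)",
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self._fetch = fetch
        self._min_interval = min_interval
        self._headers = {"User-Agent": user_agent}
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, str] = {}
        self._last_request: Optional[float] = None

    def get(self, url: str) -> str:
        # Serve repeated URLs from the cache without touching the network
        if url in self._cache:
            return self._cache[url]

        # Wait until at least min_interval seconds have passed since the
        # previous real request
        if self._last_request is not None:
            wait = self._min_interval - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)

        body = self._fetch(url, self._headers)
        self._last_request = self._clock()
        self._cache[url] = body
        return body
```

With the defaults, PoliteClient(min_interval=2.0).get(url) sends at most one real request every two seconds and returns the cached body for any URL it has already fetched. The fetch, clock, and sleep parameters are injectable so the behavior can be checked without network access or real delays.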

Disclaimer

This article is for educational purposes only and is not legal advice. Consult a lawyer if you have concerns about the legality of your scraping project.