
Beginner · 8 min read

Legal & Ethical Scraping, Your Compass

robots.txt, Terms of Service, GDPR, the CFAA, the hiQ v. LinkedIn ruling: the legal and ethical scaffolding every scraping project should consider before code is written.

What you’ll learn

  • Read robots.txt correctly, and know why it's not legally binding.
  • Understand the difference between Terms of Service (contract), copyright (creative works), and computer-misuse statutes (CFAA, equivalents elsewhere).
  • Apply a practical ethics framework: would I be embarrassed if the target's lawyer saw exactly what I did?
  • Avoid the three legal landmines: personal data, paywalled content, and circumventing technical access controls.

Web scraping is legal in most places, but "legal" is a flat answer to a layered question. Whether you can scrape something depends on what you scrape, how you scrape it, where you and the target are, and what you do with the results. This lesson is a compass, not a legal opinion.

Disclaimer: This isn't legal advice. For anything serious, consult a lawyer who specialises in tech / data protection in your jurisdiction.

The four legal frameworks to know

Framework | What it covers | Where it lives
Computer-misuse statutes (CFAA, US; CMA, UK; etc.) | Unauthorised access to computer systems | USA, UK, most of the EU, many others
Contract law / ToS | Terms of Service you agreed to | Anywhere (contract law is universal)
Copyright | Original creative works | Anywhere
Privacy / data protection (GDPR, CCPA, etc.) | Personal data of individuals | EU, California, India, etc.; extraterritorial reach

These overlap. A single scrape could potentially trigger all four.

robots.txt, what it is and isn't

https://practice.scrapingcentral.com/robots.txt:

User-agent: *
Allow: /
Disallow: /account/
Disallow: /admin/
Disallow: /api/internal/

Sitemap: https://practice.scrapingcentral.com/sitemap.xml

robots.txt is a convention, not a law. It tells well-behaved bots which paths a site prefers they avoid. It is:

  • Not legally binding: no statute says "thou shalt obey robots.txt."
  • Not access control: anything Disallowed is still served on request. The server just asked you not to fetch it.
  • A strong ethical signal: ignoring robots.txt is the simplest reason a court (or PR team) decides you were operating in bad faith.

Default position: respect robots.txt unless you have a clear, defensible reason not to. Tools like Python's urllib.robotparser or PHP's various robots.txt parsers make compliance one line of code.
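For instance, here is a minimal compliance check with urllib.robotparser. The robots.txt below is a made-up example, parsed from a string so the snippet runs offline; against a live site you would call set_url() and read() instead:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from a string for an offline example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MyScraper/1.0", "/account/dashboard"))  # False: disallowed
print(rp.can_fetch("MyScraper/1.0", "/products"))           # True: not disallowed
print(rp.crawl_delay("MyScraper/1.0"))                      # the site's preferred delay, if declared
```

The decision itself really is the one can_fetch() line; crawl_delay() additionally surfaces the site's stated pacing preference, when present.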

Terms of Service, a real bar

Most sites have Terms of Service ("ToS") that prohibit scraping. Two questions matter:

  1. Did you agree to them? ToS bind you only if you've assented, typically by creating an account or ticking a box, or, in some jurisdictions, merely by using the site with notice of the terms (so-called "browsewrap"). Anonymous public scraping of a public page where you never registered is a much weaker contract argument than scraping behind a login.

  2. Is the prohibition enforceable? A clause saying "no automated access" is broad. Courts in different jurisdictions have ranged from "ToS is enforceable like any contract" to "ToS can't unilaterally restrict access to public information."

The landmark case hiQ Labs v. LinkedIn (2019, 9th Circuit) held that scraping publicly accessible LinkedIn profiles did not violate the CFAA: public means public. The narrower implication: scraping behind authentication, in violation of an active ToS, is much riskier.

The CFAA (USA): "unauthorised access"

The US Computer Fraud and Abuse Act criminalises accessing a computer "without authorisation" or "exceeding authorised access." The key phrase is "authorisation": what counts?

After Van Buren v. United States (2021), the Supreme Court narrowed the CFAA: simply violating ToS on a system you're permitted to use is not "exceeding authorised access." You exceed authorised access by entering parts of a system you have no permission to enter (e.g. circumventing login, exploiting bugs).

The practical translation:

  • Scraping a public page that anyone can view: low CFAA risk.
  • Scraping behind a login you legitimately have: medium risk; ToS becomes the active concern, not CFAA.
  • Bypassing login / paywall / IP block / CAPTCHA: high CFAA risk. Technical access controls are precisely the line the CFAA is built to protect.

Other countries have analogues: the UK's Computer Misuse Act, Germany's § 202a StGB, etc. The "don't bypass access controls" principle is near-universal.

Copyright, facts vs. expression

Copyright protects original expression, not facts. Important distinction:

  • The list of products on a catalog site? Mostly facts (name, price, SKU), not copyrightable.
  • The product description prose? Original expression, copyright protected.
  • The product photos? Each is its own copyrighted work.

You can scrape facts freely (within other constraints). You cannot republish copyrighted prose or images without permission; at least, not without a fair-use / fair-dealing argument that holds up.

A common safe pattern: scrape data, store data, use data internally or to power your own expression. Don't mirror the source verbatim on a new site.
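A sketch of that pattern, with hypothetical field names: persist the factual fields and deliberately drop the expressive ones before storage.

```python
# Fields that are facts (not copyrightable) vs. expression (copyrighted).
# These field names are hypothetical, for illustration only.
FACT_FIELDS = {"name", "price", "sku", "in_stock"}

def keep_facts(item: dict) -> dict:
    """Return only the factual fields of a scraped product record."""
    return {k: v for k, v in item.items() if k in FACT_FIELDS}

scraped = {
    "name": "Widget Pro",
    "price": 19.99,
    "sku": "WP-001",
    "description": "Lovingly hand-tuned marketing prose...",  # expression: drop
    "image_url": "https://example.com/wp.jpg",                # copyrighted work: drop
}
print(keep_facts(scraped))  # {'name': 'Widget Pro', 'price': 19.99, 'sku': 'WP-001'}
```

An explicit allow-list (rather than a drop-list) is the safer default: new fields added by the source site stay out of your store until you've classified them.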

Privacy, GDPR, CCPA, et al.

GDPR (EU) and CCPA (California) apply to personal data: anything that can identify an individual. Names, emails, phone numbers, IP addresses, photos of people, behavioural tracking IDs.

If you scrape personal data:

  • You need a lawful basis (Article 6 of GDPR). "Legitimate interest" is invoked often but requires a balancing test.
  • You must minimise: collect only what you need.
  • You must enable rectification / deletion on request.
  • The territorial reach is broad: GDPR applies to non-EU companies processing EU residents' data.

Practical: scraping a directory of consumer phone numbers in the EU is a project-ending lawsuit risk. Scraping aggregate-only data, your own company's data, or anonymised public data carries far less risk.
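As a sketch of minimisation in practice (record shape and field names hypothetical): keep only the non-personal fields the project needs, and pseudonymise anything identifying before it touches storage. Note that hashing is pseudonymisation, not anonymisation; a hashed email is still personal data under GDPR, just lower-risk to hold.

```python
import hashlib

def minimise(record: dict) -> dict:
    """Reduce a scraped record to the minimum before storage."""
    out = {"company": record.get("company")}  # non-personal field: kept
    if "email" in record:
        # Pseudonymisation, not anonymisation: still personal data
        # under GDPR, but lower-risk than the raw address.
        out["email_hash"] = hashlib.sha256(record["email"].encode()).hexdigest()
    # Names, phone numbers, photos: simply not collected at all.
    return out

print(minimise({"company": "Acme GmbH", "email": "jane@example.com", "phone": "+49 ..."}))
```

The important design choice is that minimisation happens at ingestion, not as a later cleanup pass: data you never stored needs no deletion workflow.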

The hiQ v. LinkedIn case: what it tells you

hiQ Labs scraped public LinkedIn profiles to power an HR analytics product. LinkedIn sent a cease-and-desist citing the CFAA. The 9th Circuit's ruling (reaffirmed on remand in 2022): scraping publicly accessible data is not CFAA-actionable.

LinkedIn eventually won on a different basis: contract law, via the ToS hiQ had agreed to by creating accounts, and California Penal Code § 502. The case is the most-cited modern scraping precedent precisely because it laid out the slices:

  • Public data → CFAA doesn't apply.
  • ToS-bound activity → contract claims still live.
  • Personal data → privacy laws active.

It's a partial green light, not blanket permission.

A practical ethics framework

Beyond legal questions, ask:

  1. Would the site owner be embarrassed (or grateful) if they saw exactly what I did? A polite rate, identifying yourself in the User-Agent, scraping only what you need → defensible. Hammering the site, faking a browser, scraping personal data → not.

  2. What's the cost to the site? Bandwidth, server load, distortion of analytics. Smaller costs = easier to justify.

  3. What value am I creating? Aggregating prices for a comparison tool that helps consumers > training an LLM on scraped content > republishing as your own > spamming the personal data you collected.

  4. Have I considered the people in the data? A list of products is different from a list of people's contact details. Scraping facts about a company differs from scraping facts about individuals.

The blunter version: don't do anything that you couldn't justify to a non-technical friend in plain English.

A safer recipe

For most scraping projects:

  1. Check robots.txt. Avoid disallowed paths unless you have a defensible reason.
  2. Send a User-Agent that identifies you (or your project), e.g. MyScraper/1.0 (mailto:you@example.com). This lets the target contact you instead of just blocking you.
  3. Rate-limit politely. 1–2 req/sec for most public sites; slower for small sites.
  4. Don't scrape personal data unless you have a clear lawful basis and a real plan for handling deletion requests.
  5. Don't circumvent technical access controls (CAPTCHAs, IP blocks, paywalls). If you really need that data, ask the site owner, or buy it.
  6. Cache aggressively to avoid re-fetching.
  7. If you publish or commercialise the results, get legal advice specific to your situation.

The robots.txt lab

Look at practice.scrapingcentral.com/robots.txt. Identify:

  1. Which paths are Disallowed.
  2. Which User-Agent the rules apply to (* = everyone).
  3. Whether your scraper, by default, would respect those Disallows.

Then verify it with a few lines of Python (the decision itself is the one can_fetch line):

from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://practice.scrapingcentral.com/robots.txt")
rp.read()
print(rp.can_fetch("MyScraper/1.0", "https://practice.scrapingcentral.com/account/dashboard"))
# False: the Disallow on /account/ is respected

That's the entire bot-side mechanic. The hard part is the habit, not the code.

Hands-on lab

This is the last Foundations lesson. Reflect:

  1. Pick a real site you've been curious about scraping.
  2. Read its robots.txt and Terms of Service.
  3. Identify which of the four legal frameworks (CFAA / ToS / Copyright / Privacy) might apply.
  4. Write one paragraph: "If I scraped this, here's my legal/ethical posture and why I think it's defensible."

If you can write that paragraph honestly, you're ready for the rest of the curriculum. If the answer is "I couldn't defend it," that's also informative, pick a different target.


Foundations complete. Next: Sub-Path 1, Static Scraping.

