Copyright vs Facts, What You Can and Can't Redistribute

Scraping data is one thing. Publishing or commercializing it is another. This lesson covers the line between facts and expression, and how courts have drawn it.

Not legal advice. Engineer's summary.

Most legal anxiety around scraping focuses on whether you're allowed to fetch data. Often the bigger question is: now that you have it, what can you do with it?

Copyright law is the framework that decides what you can keep, what you can publish, and what crosses the line. The principles are clearer than people assume.

The fundamental distinction: facts vs expression

Facts (not copyrightable)	Expression (copyrightable)
Product name "Stainless Blender"	The product description prose
Price $49.99	The marketing copy explaining why $49.99 is a deal
Number of reviews: 142	A specific review's text
URL of a page	The article at that URL
GPS coordinates of a restaurant	The restaurant's logo
Phone number of a business	A photograph of the business
Number of attendees at an event	An eye-witness account of the event
The fact that "Apple released iPhone 15"	The Apple press release announcing it

The principle comes from Feist Publications v. Rural Telephone Service (1991) in the US: a phone book's white pages (names, numbers, addresses) are facts and not copyrightable, even though significant effort goes into compiling them. ("Sweat of the brow" doesn't create copyright.)

This is the doctrinal foundation of most scraping-based products that hold up legally: facts can be scraped and used freely (subject to other constraints, such as a site's terms of service and privacy law).

Compilation copyright

A particular selection or arrangement of facts can itself be copyrightable if it is sufficiently original. A telephone directory's alphabetical order isn't original, so it isn't copyrightable. A curated "Best Restaurants" list with subjective ratings, ordering, and commentary is: that's expression layered on top of facts.

For scraping:

The individual data points are facts.
The way you present them in your output is yours.
Copying someone else's curated arrangement verbatim may infringe the compilation copyright even if individual facts don't.

Fair use (US) / fair dealing (UK, Canada, etc.)

Even copyrighted material can sometimes be used without permission under exceptions. US fair use is a four-factor test (17 USC § 107):

Purpose and character of the use: transformative? Commercial vs nonprofit?
Nature of the copyrighted work: factual vs creative? Published vs unpublished?
Amount and substantiality of the portion used: a small excerpt or a substantial part?
Effect on the market: does your use compete with or substitute for the original?

Common fair use scenarios in scraping:

Search engine indexing: pulling snippets to help users find the original. Generally fair (Kelly v. Arriba Soft on image thumbnails, Field v. Google on cached pages).
Academic research: using scraped data for non-commercial scholarship.
News reporting: quoting from sources for commentary.

Common fair-use failures:

Republishing in full: "I scraped 10k articles and put them on my own site" is clearly not fair use.
Commercial reuse of expression: using scraped marketing copy for your own marketing.
Substantive market substitution: your scraped product effectively replaces the original.

UK fair dealing is narrower than US fair use. It applies only to specific permitted purposes (research, criticism, news reporting) rather than through an open-ended balancing test.

AI training and copyright, current frontier

A particularly active area in 2026:

Multiple lawsuits over training LLMs on scraped copyrighted content (NYT v. OpenAI, Getty Images v. Stability AI, etc.).
Outcomes still unsettled.
The argument that "training is fair use because it's transformative" is in active litigation.
For your scraping projects, treat AI training on copyrighted scraped content as legally unsettled. The safe default is to consult a lawyer before building a commercial AI training pipeline.

Database rights (EU)

The EU has a separate "sui generis" database right (Database Directive 96/9/EC) that protects substantial investment in compiling databases, independent of copyright in the contents.

Implication: an EU-based database compilation may be protected against extraction or reuse of substantial portions even if individual records are not copyrightable. Scraping an EU-based aggregated dataset (job board, real estate listings) can implicate this right regardless of the facts-vs-expression distinction.

The US has no equivalent sui generis right. Feist holds.

Patterns that work

Practical redistribution patterns that hold up:

Use the facts; rewrite the expression

Scrape product specifications. Display them in your own structure with your own description. The facts come from the source; the expression is yours.

Source page: "The all-new Stainless Blender features cutting-edge titanium blades..."
Your output: { "title": "Stainless Blender", "blade_material": "titanium", "price_cents": 4999 }

The data is facts. Your presentation is yours. Clean.
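A minimal Python sketch of this pattern, assuming the scraper has already parsed the page into a dictionary. The field names and the `ProductFacts` type are hypothetical. The idea is that the output schema whitelists factual fields, so the source's prose never reaches what you publish.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProductFacts:
    """Whitelisted factual fields only; no free-form prose from the source."""
    title: str
    blade_material: str
    price_cents: int


def to_facts(raw: dict) -> ProductFacts:
    """Keep the factual fields of a parsed page and ignore everything else."""
    return ProductFacts(
        title=raw["title"],
        blade_material=raw["blade_material"],
        price_cents=int(raw["price_cents"]),
    )


# Hypothetical parsed page: 'description' holds the source's marketing copy.
raw_page = {
    "title": "Stainless Blender",
    "blade_material": "titanium",
    "price_cents": "4999",
    "description": "The all-new Stainless Blender features cutting-edge titanium blades...",
}

record = asdict(to_facts(raw_page))
print(record)
```

Because the dataclass has no field for the description, the marketing copy cannot leak into the published record by accident. Any prose you do display would be written separately.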

Aggregation that creates new value

Compile facts from many sources into a comparison or analysis. Your aggregation is original even though individual facts aren't.
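As an illustration, here is a sketch that combines price facts from several sources into a comparison. The source names and prices are made-up sample data, and `summarize_prices` is a hypothetical helper, not part of any library.

```python
from statistics import median


def summarize_prices(offers: dict[str, int]) -> dict:
    """Build a comparison from per-source prices (in cents)."""
    cheapest = min(offers, key=offers.get)
    return {
        "sources": len(offers),
        "min_cents": offers[cheapest],
        "median_cents": median(offers.values()),
        "cheapest_source": cheapest,
    }


# Made-up sample data: source name -> observed price in cents.
offers = {
    "shop-a.example": 4999,
    "shop-b.example": 4599,
    "shop-c.example": 5250,
}

summary = summarize_prices(offers)
print(summary)
```

The inputs are individual facts; the comparison (minimum, median, which source is cheapest) is analysis you produced.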

Indexing with snippets and link-out

Show a brief snippet (just enough to identify the source) and link to the original. This is the search-engine pattern, which is generally treated as fair use in the US.
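One way to sketch the snippet-and-link pattern in Python. The `make_snippet` and `index_entry` helpers are hypothetical, and the length limits are arbitrary illustrative values, not legal thresholds.

```python
def make_snippet(text: str, max_chars: int = 160) -> str:
    """Return a short excerpt cut at a word boundary.

    Text longer than max_chars is truncated and marked with '...'.
    """
    collapsed = " ".join(text.split())  # normalize whitespace
    if len(collapsed) <= max_chars:
        return collapsed
    cut = collapsed[:max_chars].rsplit(" ", 1)[0]
    return cut + "..."


def index_entry(url: str, title: str, body: str, max_chars: int = 160) -> dict:
    """Store only a title, a brief snippet, and a link back to the original."""
    return {"title": title, "snippet": make_snippet(body, max_chars), "url": url}


# Hypothetical article body; only a short excerpt is kept.
body = "The quick brown fox jumps over the lazy dog. " * 10
entry = index_entry("https://example.com/article", "Fox news", body, max_chars=40)
print(entry)
```

The full body is never stored in the entry, so whatever renders the index can only show the excerpt and the link.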

Internal use only

Scrape into your own internal data store. Don't redistribute. Use to inform your own decisions or actions. The redistribution risk doesn't apply if you don't redistribute.

Patterns that break

Mirror sites: replicating someone's content verbatim on your domain.
Substantial verbatim quotes: long passages from copyrighted articles.
Republishing scraped photos: every photo is a copyrighted work.
Repackaging curated databases: copying not just the facts but the structure and selection.

How to be safe

For scrapers building real products:

Reduce scraped content to data structure, not free-form prose.
Generate your own descriptions if you display the data. Even short variations are your expression.
Don't redistribute scraped images or video unless you have a license.
Don't quote articles in full. Excerpts (a sentence or two with attribution) tend to be fair; whole reproductions aren't.
For curated source databases, be aware of compilation copyright and (in the EU) database rights.
For AI training data, get legal advice. The space is too live to navigate by gut.

The trademark side-quest

Trademark is separate from copyright. A product name like "iPhone" is trademarked; using "iPhone" as a factual identifier (e.g. in a price comparison) is generally fine ("nominative fair use"). Using it as if you're affiliated with Apple is not.

Same with logos: factual reference is usually OK; embedding a competitor's logo into your marketing materials is risky.

A scraper's redistribution decision tree

What am I planning to publish/sell?
│
├─ Just data fields (prices, names, counts) → Probably fine (facts)
│
├─ Their prose / descriptions / reviews → Likely problematic; rewrite or summarize
│
├─ Their images → Don't redistribute without a license; substitute your own or link out
│
├─ Their structured database verbatim → Compilation copyright (US) / database right (EU) risk
│
├─ Aggregated analysis across sources → Original work; usually fine
│
└─ AI training corpus → Get legal advice

What to try

Take one scraping project. List every output field you'd publish or commercialize. For each, classify:

Fact (clearly OK).
Expression rewritten by you (OK).
Source's verbatim expression (risky, usually not OK to republish).
Image / media (separate license needed).
Aggregated derivative (usually OK).

If anything is in the "source's verbatim expression" or "image" category, redesign. Either rewrite, link out, or drop.
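The exercise can be recorded as a small audit script. This is a sketch only: the classification of each field is still a judgment you make by hand, and the schema, the `Category` names, and the `audit` helper below are hypothetical.

```python
from enum import Enum


class Category(Enum):
    FACT = "fact"
    REWRITTEN = "expression rewritten by you"
    VERBATIM = "source's verbatim expression"
    MEDIA = "image / media"
    AGGREGATE = "aggregated derivative"


# Categories the exercise flags for redesign.
NEEDS_REDESIGN = {Category.VERBATIM, Category.MEDIA}


def audit(fields: dict[str, Category]) -> list[str]:
    """Return the output fields that need a rewrite, link-out, or removal."""
    return sorted(name for name, cat in fields.items() if cat in NEEDS_REDESIGN)


# Hypothetical output schema for one project, classified by hand.
schema = {
    "title": Category.FACT,
    "price_cents": Category.FACT,
    "summary": Category.REWRITTEN,
    "review_text": Category.VERBATIM,
    "product_photo": Category.MEDIA,
    "price_rank": Category.AGGREGATE,
}

flagged = audit(schema)
print(flagged)
```

Keeping the classification next to the schema makes it easy to re-run the check whenever a new output field is added.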

That's the discipline. Most scraping projects are fine if disciplined; problems arise from scraping prose and then redistributing it as if it were yours.
