Project D: Open Public-Data Aggregator
Pick an underserved public dataset and build the canonical free version. The most flexible capstone option, and the one most likely to outlive the curriculum.
What you’ll learn
- Identify a public dataset that's currently fragmented, badly documented, or paywalled despite being legally open.
- Design a long-term schema that survives source changes.
- Publish the dataset under a permissive license with a clear contribution path.
- Build a small but useful query / browse layer on top of it.
What you're building
You pick a public dataset that's legally open but practically hard to use: fragmented across 50 government sites, locked behind a 1990s-era portal, hidden in PDFs, or trapped in a paid aggregator that didn't add much value. You build the canonical free version: scraped, cleaned, normalised, hosted on GitHub Pages, refreshed on a schedule.
This is the most flexible capstone. It's also the one most likely to outlive the curriculum: open datasets accumulate users over time. A few well-chosen examples:
- All Indian Supreme Court judgments since 1950, with metadata (currently scattered across paywalled aggregators).
- Every US municipal water-quality test result, by zip code (EPA has it but the portal is unusable).
- Indian Parliament voting records, structured (currently only in PDFs).
- Every public-tender award above €100k in the EU (TED is the source; consolidation is the value).
- Drug pricing across all US Medicare formularies (CMS has the data; few have consolidated it).
You're not trying to scrape one site. You're trying to become the obvious starting point for anyone who needs this dataset.
Required features
- 10+ source endpoints (but they can all be on the same agency / portal).
- Public GitHub repo, the data + the scrapers are both open source.
- Open license on the data (CC-BY-4.0 or CC0).
- Clean schema with documented field definitions.
- Daily / weekly refresh running in CI.
- At least one Python source + one PHP source. (Yes, this one too.)
- Browser automation for at least one source if your target uses a JS-rendered portal (a sketch follows this list).
- A search / browse layer, even if it's just a static SQLite-backed Datasette deployment.
- Blog post explaining the dataset's significance.
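For the browser-automation requirement, one workable option is Playwright. A minimal sketch, assuming a hypothetical portal that renders its results table client-side; the URL and CSS selectors are placeholders:

from playwright.sync_api import sync_playwright

PORTAL_URL = "https://example.gov/judgments/search"  # placeholder

def fetch_rendered_rows():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(PORTAL_URL, wait_until="networkidle")
        page.wait_for_selector("table#results tr")  # wait for the JS-rendered table
        rows = page.locator("table#results tr").all_inner_texts()
        browser.close()
        return rows

if __name__ == "__main__":
    for row in fetch_rendered_rows():
        print(row)

Plain requests or a Scrapy spider covers the server-rendered sources; reserve the browser for portals that genuinely need it.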
What makes a good Project D target
Three filters. Apply ruthlessly.
- Legally open. Government public records, court judgments, regulatory filings, agency data. The robots.txt should be permissive, the data should be public-by-law, and there should be no paywall (a quick robots.txt check is sketched after this list).
- Practically inaccessible. If it's already on data.gov / data.gov.in / data.europa.eu in a clean form, you'd be reinventing the wheel. You want something that's "technically open but no one's done the work yet."
- Used by someone who'd care. Journalists. Activists. Small businesses. Researchers. Lawyers. If you can name two real people who'd cite your dataset, you've found something. If the dataset is just interesting to you, save it for a side project.
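To make the first filter concrete: a quick, standard-library check of whether a portal's robots.txt disallows the pages you plan to fetch. The URLs and user-agent string below are placeholders, and a permissive robots.txt is a courtesy signal, not the legal test.

from urllib.robotparser import RobotFileParser

# Placeholder URLs; substitute your target portal and a page you intend to scrape.
SITE = "https://example.gov"
TARGET = SITE + "/judgments/2024/case-12345"

rp = RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()
print(rp.can_fetch("my-aggregator-bot", TARGET))  # True = not disallowed for this user agent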
Suggested datasets by category
| Category | Examples |
|---|---|
| Legal | Lower court judgments, tribunal orders, public-sector lawsuit records |
| Regulatory | SEC filings (already covered well, skip), national drug regulators, environmental clearances |
| Procurement | Public tenders, contract awards, ministry purchase orders |
| Health | Disease surveillance reports, drug pricing, hospital licensing |
| Education | School inspection reports, university accreditation status, faculty publication records |
| Transport | Public transport schedules, road accident data, license registration counts |
| Environment | Air quality readings, water quality, deforestation alerts, mining permits |
| Politics | Voting records, lobbying registrations, campaign-finance filings |
Pick from a domain you know enough about to spot data-quality bugs. Scraping accident data for a country whose road-classification system you don't understand will produce a useless dataset.
Architecture: lighter than the other capstones
This project doesn't need a dashboard. Its core deliverable is the dataset itself:
your-repo/
├── scrapers/
│ ├── python/
│ │ ├── source_a_spider.py
│ │ └── source_b_spider.py
│ └── php/
│ ├── SourceCCommand.php
│ └── SourceDCommand.php
├── data/ # the published dataset
│ ├── parquet/
│ │ ├── facts_2024.parquet
│ │ └── facts_2025.parquet
│ ├── csv/ # human-readable mirror
│ └── sqlite/
│ └── all_data.db
├── schema/
│ ├── SCHEMA.md # field definitions
│ └── checksums.json
├── docs/
│ ├── methodology.md # how you scraped each source
│ ├── known_gaps.md # what's missing, why
│ └── changelog.md
├── .github/workflows/
│ └── refresh.yml # daily cron, commits new data
├── LICENSE # CC-BY-4.0 or CC0
└── README.md
The dashboard, if you build one, is Datasette pointing at the SQLite file. It's free, it's good, and it costs you about 30 minutes to wire up.
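A sketch of that wiring, assuming you load cleaned rows into the published SQLite file with the sqlite-utils library; the row shown is a placeholder shaped like the judgments example further down:

import sqlite_utils

# Rebuild the SQLite file that Datasette will serve.
db = sqlite_utils.Database("data/sqlite/all_data.db")

rows = [
    {
        "id": "SC-2024-001",                      # placeholder row
        "court_code": "SC",
        "case_number": "2024-001",
        "decision_date": "2024-03-15",
        "source_url": "https://example.gov/judgments/2024-001",
        "source_fetched_at": "2025-06-01T00:00:00Z",
    },
]

db["judgments"].insert_all(rows, pk="id", replace=True)  # idempotent on re-runs

# Then, locally: `datasette data/sqlite/all_data.db` to browse,
# or `datasette publish` to put it somewhere public.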
Schema
Your schema is the product. Spend disproportionate time on it.
Three principles:
- Long format, not wide. One row per fact, with dimension_a, dimension_b, ..., value. Don't pivot. Pivoting locks future users into your assumed analytical shape; long format lets them pivot however they want.
- Provenance fields on every row. source_url, source_fetched_at, source_version. When the dataset is questioned, you must be able to point at the exact source page.
- Append-only history. When a source corrects a value, append a new observation with superseded_at set on the old one. Don't UPDATE. Researchers cite versions; you must be able to reproduce historical snapshots. (A small sketch of this pattern follows the SQL example below.)
Example for a court-judgments dataset:
CREATE TABLE judgments (
id TEXT PRIMARY KEY, -- court_code + case_number
court_code TEXT NOT NULL,
case_number TEXT NOT NULL,
decision_date DATE,
bench TEXT[],
parties TEXT[],
subject_codes TEXT[],
pdf_url TEXT,
pdf_sha256 TEXT,
text_extracted TEXT, -- OCR / text-extracted body
source_url TEXT NOT NULL,
source_fetched_at TIMESTAMPTZ NOT NULL,
schema_version SMALLINT NOT NULL DEFAULT 1
);
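To make the append-only principle concrete, a minimal sqlite3 sketch using a simplified, hypothetical observations table (the judgments schema above uses Postgres-style array and timestamp types, so this sketch uses a smaller table instead). Only superseded_at is ever stamped on an old row; values are never overwritten.

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("data/sqlite/all_data.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS observations (
    obs_id INTEGER PRIMARY KEY AUTOINCREMENT,
    fact_id TEXT NOT NULL,           -- stable identifier for the underlying fact
    value TEXT NOT NULL,
    source_url TEXT NOT NULL,
    source_fetched_at TEXT NOT NULL,
    superseded_at TEXT               -- NULL while this is the current observation
)
""")

def record(fact_id, value, source_url):
    now = datetime.now(timezone.utc).isoformat()
    # Stamp superseded_at on the previous current row; never overwrite its value.
    conn.execute(
        "UPDATE observations SET superseded_at = ? "
        "WHERE fact_id = ? AND superseded_at IS NULL",
        (now, fact_id),
    )
    conn.execute(
        "INSERT INTO observations (fact_id, value, source_url, source_fetched_at) "
        "VALUES (?, ?, ?, ?)",
        (fact_id, value, source_url, now),
    )
    conn.commit()

Reproducing a historical snapshot then becomes a filter: rows where source_fetched_at <= T and (superseded_at IS NULL OR superseded_at > T).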
Licensing
This is where most aggregator projects die. Get it right on day 1.
- Your scrapers: MIT or Apache-2.0. Permissive so anyone can build on them.
- Your data: CC-BY-4.0 (requires attribution) or CC0 (public domain). Pick CC-BY-4.0 unless you want zero requirements; the attribution is what brings you traffic.
- Cite the original sources in your README. Always. The whole project depends on the underlying source being trustworthy; pointing at it both credits the source and lets users verify your data.
If the source is government data, double-check the country's open-government license. India has the Government Open Data License, India (GODL); EU has the EU Open Data Licence; US Federal works are public domain by default. Match your output license to the source's license; don't relicense more restrictively than the source allows.
Quality discipline
Open datasets fail when they're wrong silently. Three habits:
- Schema validation in CI. Every refresh runs pytest / phpunit checks on row counts, field formats, and ID uniqueness. If validation fails, the new data isn't committed.
- Diff alerts. When the daily refresh produces a 10%+ change in row count for any source, alert yourself. That's either a real change worth noting or a scraper bug; either way, you need to know. (A pytest sketch of both checks follows this list.)
- Known-gaps doc. Maintain docs/known_gaps.md: be explicit about what your dataset doesn't cover, which dates are missing, and which fields are unreliable. Honest gap documentation makes your dataset more, not less, citeable.
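A sketch of the first two habits as a pytest file the refresh workflow can run before committing; the baseline file, table name, and 10% threshold are assumptions to adapt:

# tests/test_dataset.py -- run in CI before new data is committed.
import json
import sqlite3

DB_PATH = "data/sqlite/all_data.db"
BASELINE_PATH = "schema/row_counts.json"   # hypothetical file tracking last-known counts

def _row_count(table):
    conn = sqlite3.connect(DB_PATH)
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    conn.close()
    return count

def test_ids_are_unique():
    conn = sqlite3.connect(DB_PATH)
    (dupes,) = conn.execute(
        "SELECT COUNT(*) FROM (SELECT id FROM judgments GROUP BY id HAVING COUNT(*) > 1)"
    ).fetchone()
    conn.close()
    assert dupes == 0

def test_provenance_fields_present():
    conn = sqlite3.connect(DB_PATH)
    (missing,) = conn.execute(
        "SELECT COUNT(*) FROM judgments WHERE source_url IS NULL OR source_fetched_at IS NULL"
    ).fetchone()
    conn.close()
    assert missing == 0

def test_row_count_did_not_swing_wildly():
    # The "diff alert": fail loudly on a 10%+ change so a human looks at it.
    baseline = json.load(open(BASELINE_PATH))["judgments"]
    current = _row_count("judgments")
    assert abs(current - baseline) <= 0.10 * baseline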
Promotion
Building a great dataset that no one knows about is a tree falling in a forest. After launching:
- Post on r/dataisbeautiful and r/datasets (rules differ by subreddit; check them first).
- Email three journalists / researchers who'd plausibly use it, with a 200-word pitch, the dataset link, and a CSV they can open in Excel right now.
- Add a contribution path: invite people to file issues for data errors or new sources.
- Submit to awesome lists: awesome-public-datasets, awesome-datasette, awesome-{your-domain}.
If even one journalist runs a story citing your dataset, the capstone has paid for itself a thousand times over.
Common pitfalls
- Scope creep. Don't start with "all government data." Start with one ministry, one type of filing.
- Bad schemas. Adding a column later is fine; removing one is a breaking change, so underspecify on day 1.
- Stale data without notice. If the source goes down or your scraper breaks, the dataset is silently stale. Loud failure modes save you.
- License confusion. Don't release data under an MIT-style license; MIT is for software. Use Creative Commons.
- Ego attachment. This project is most useful to others; let them shape it. Accept the PR that adds a new source even if it isn't perfect.
Deployment
Cheapest possible. GitHub Actions runs the scrapers and commits the new data files. GitHub Pages serves a static HTML browse layer; a full Datasette instance runs in Docker on a small host. Cost: $0 if your data fits in a public repo (<5GB), $5/mo for a tiny VPS if you outgrow it.
For larger datasets, use Cloudflare R2 ($0.015/GB/month, no egress fees) for the raw files, with the GitHub repo containing only metadata + Datasette.
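If you do outgrow the repo, pushing the raw files to R2 is a few lines with boto3, since R2 speaks the S3 API. The account ID, bucket name, and credentials below are placeholders:

import os
import boto3

# R2 is S3-compatible; point the client at your account's R2 endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com",  # placeholder
    aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
)

s3.upload_file(
    "data/parquet/facts_2025.parquet",   # local file
    "my-dataset-bucket",                 # placeholder bucket
    "parquet/facts_2025.parquet",        # object key
)

Keep only the pointers and checksums for those large files in the GitHub repo.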
What "done" looks like
- 30 consecutive days of refreshes committed to the repo (visible in git log).
- The dataset is browseable via Datasette or a static UI.
- README explains: what's in here, what's NOT, the license, how to cite.
- Three external citations / mentions (a tweet, a blog post linking to it, a GitHub fork, a journalist quote, any one of these counts).
- The data has at least 10,000 rows. Smaller datasets don't justify the "canonical aggregator" framing.
- Blog post documents the dataset's significance, sources, schema decisions, gaps, license.
Hands-on lab
Open the government open-data portal for your country. Browse for an hour. Identify three datasets that look useful but are actually a mess (multiple CSVs with inconsistent column names, broken Excel files, PDFs masquerading as data). Pick the one whose mess would most embarrass a senior official. That's probably your target. Start scraping today.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.