
6.5 · Expert · 8 min read

Project D: Open Public-Data Aggregator

Pick an underserved public dataset and build the canonical free version. The most flexible capstone option, and the one most likely to outlive the curriculum.

What you’ll learn

  • Identify a public dataset that's currently fragmented, badly documented, or paywalled despite being legally open.
  • Design a long-term schema that survives source changes.
  • Publish the dataset under a permissive license with a clear contribution path.
  • Build a small but useful query / browse layer on top of it.

What you're building

You pick a public dataset that's legally open but practically hard to use: fragmented across 50 government sites, locked behind a 1990s-era portal, hidden in PDFs, or trapped in a paid aggregator that didn't add much value. You build the canonical free version: scraped, cleaned, normalised, hosted on GitHub Pages, refreshed on a schedule.

This is the most flexible capstone. It's also the one that's most likely to outlive the curriculum: open datasets accumulate users over time. A few well-chosen examples:

  • All Indian Supreme Court judgments since 1950, with metadata (currently scattered across paywalled aggregators).
  • Every US municipal water-quality test result, by zip code (EPA has it but the portal is unusable).
  • Indian Parliament voting records, structured (currently only in PDFs).
  • Every public-tender award above €100k in the EU (TED is the source; consolidation is the value).
  • Drug pricing across all US Medicare formularies (CMS has the data; few have consolidated it).

You're not trying to scrape one site. You're trying to become the obvious starting point for anyone who needs this dataset.

Required features

  • 10+ source endpoints (but they can all be on the same agency / portal).
  • Public GitHub repo: both the data and the scrapers are open source.
  • Open license on the data (CC-BY-4.0 or CC0).
  • Clean schema with documented field definitions.
  • Daily / weekly refresh running in CI.
  • At least one Python source + one PHP source. (Yes, this one too.)
  • Browser automation for at least one source if your target uses a JS-rendered portal.
  • A search / browse layer, even if it's just a static SQLite-backed Datasette deployment.
  • Blog post explaining the dataset's significance.

What makes a good Project D target

Three filters. Apply ruthlessly.

  1. Legally open. Government public records, court judgments, regulatory filings, agency data. The robots.txt should be permissive, the data should be public-by-law, and there should be no paywall.

  2. Practically inaccessible. If it's already on data.gov / data.gov.in / data.europa.eu in a clean form, you'd be reinventing the wheel. You want something that's "technically open but no one's done the work yet."

  3. Used by someone who'd care. Journalists. Activists. Small businesses. Researchers. Lawyers. If you can name two real people who'd cite your dataset, you've found something. If the dataset is just interesting to you, save it for a side project.
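Part of the "legally open" filter is mechanically checkable. A minimal sketch using Python's standard-library robots.txt parser to confirm a target path is crawlable; the portal URL, user-agent string, and robots.txt body below are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse a robots.txt body and check whether `url` may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt from a government portal:
robots = """\
User-agent: *
Disallow: /admin/
"""

print(is_crawl_allowed(robots, "my-dataset-bot", "https://example.gov/judgments/2024"))  # True
print(is_crawl_allowed(robots, "my-dataset-bot", "https://example.gov/admin/login"))     # False
```

A permissive robots.txt is a necessary signal, not legal clearance; confirm the data is public-by-law separately.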

Suggested datasets by category

  • Legal: lower court judgments, tribunal orders, public-sector lawsuit records
  • Regulatory: SEC filings (already covered well, skip), national drug regulators, environmental clearances
  • Procurement: public tenders, contract awards, ministry purchase orders
  • Health: disease surveillance reports, drug pricing, hospital licensing
  • Education: school inspection reports, university accreditation status, faculty publication records
  • Transport: public transport schedules, road accident data, license registration counts
  • Environment: air quality readings, water quality, deforestation alerts, mining permits
  • Politics: voting records, lobbying registrations, campaign-finance filings

Pick from a domain you know enough about to spot data-quality bugs. Scraping accident data for a country whose road-classification system you don't understand will produce a useless dataset.

Architecture: lighter than the other capstones

This project doesn't need a dashboard. Its core deliverable is the dataset itself:

your-repo/
├── scrapers/
│  ├── python/
│  │  ├── source_a_spider.py
│  │  └── source_b_spider.py
│  └── php/
│     ├── SourceCCommand.php
│     └── SourceDCommand.php
├── data/                  # the published dataset
│  ├── parquet/
│  │  ├── facts_2024.parquet
│  │  └── facts_2025.parquet
│  ├── csv/                # human-readable mirror
│  └── sqlite/
│     └── all_data.db
├── schema/
│  ├── SCHEMA.md           # field definitions
│  └── checksums.json
├── docs/
│  ├── methodology.md      # how you scraped each source
│  ├── known_gaps.md       # what's missing, why
│  └── changelog.md
├── .github/workflows/
│  └── refresh.yml         # daily cron, commits new data
├── LICENSE                # CC-BY-4.0 or CC0
└── README.md
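The refresh.yml in the tree above can be sketched roughly as follows; the workflow name, cron schedule, script paths, and bot identity are all illustrative, not prescribed:

```yaml
name: refresh-data
on:
  schedule:
    - cron: "30 2 * * *"   # daily at 02:30 UTC
  workflow_dispatch:        # allow manual runs too

jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python scrapers/python/source_a_spider.py   # hypothetical entry point
      - run: python scripts/validate.py                  # fail the job before committing bad data
      - name: Commit refreshed data
        run: |
          git config user.name "data-refresh-bot"
          git config user.email "bot@users.noreply.github.com"
          git add data/
          git diff --cached --quiet || git commit -m "refresh: $(date -u +%F)"
          git push
```

Running validation before the commit step is what makes "if validation fails, the new data isn't committed" automatic.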

The dashboard, if you build one, is Datasette pointing at the SQLite file. It's free, it's good, and it costs you about 30 minutes to wire up.

Schema

Your schema is the product. Spend disproportionate time on it.

Three principles:

  1. Long format, not wide. One row per fact, with dimension_a, dimension_b..., value. Don't pivot. Pivoting locks future users into your assumed analytical shape; long format lets them pivot however they want.

  2. Provenance fields on every row. source_url, source_fetched_at, source_version. When the dataset is questioned, you must be able to point at the exact source page.

  3. Append-only history. When a source corrects a value, append a new observation with superseded_at set on the old one. Don't UPDATE. Researchers cite versions; you must be able to reproduce historical snapshots.
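To make principle 1 concrete, here is a sketch of exploding a wide source row into long-format observations; the field names and values are invented for illustration:

```python
def wide_to_long(row: dict, id_keys: list[str]) -> list[dict]:
    """Explode one wide row into one record per (dimension, value) fact."""
    ids = {k: row[k] for k in id_keys}
    return [
        {**ids, "metric": key, "value": value}
        for key, value in row.items()
        if key not in id_keys
    ]

# A wide row as a portal might publish it: one column per year.
wide = {"district": "Pune", "pollutant": "PM2.5", "2023": 41.2, "2024": 38.7}

long_rows = wide_to_long(wide, id_keys=["district", "pollutant"])
# Two rows, one fact per year; users can pivot them however they want.
```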

Example for a court-judgments dataset:

CREATE TABLE judgments (
  id TEXT PRIMARY KEY,  -- court_code + case_number
  court_code TEXT NOT NULL,
  case_number TEXT NOT NULL,
  decision_date DATE,
  bench TEXT[],
  parties TEXT[],
  subject_codes TEXT[],
  pdf_url TEXT,
  pdf_sha256 TEXT,
  text_extracted TEXT,  -- OCR / text-extracted body
  source_url TEXT NOT NULL,
  source_fetched_at TIMESTAMPTZ NOT NULL,
  schema_version SMALLINT NOT NULL DEFAULT 1
);
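The append-only rule (principle 3) can be sketched against the SQLite mirror. The table, column, and helper names below are hypothetical, and superseded_at is an extra column not shown in the judgments schema above; note that the old row's value is never rewritten, only its superseded_at is stamped:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observations (
        obs_id INTEGER PRIMARY KEY AUTOINCREMENT,
        fact_id TEXT NOT NULL,        -- stable identifier for the fact
        value TEXT NOT NULL,
        source_fetched_at TEXT NOT NULL,
        superseded_at TEXT            -- NULL = current observation
    )
""")

def record(conn, fact_id: str, value: str) -> None:
    """Append a new observation; stamp the previous one as superseded."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE observations SET superseded_at = ? "
        "WHERE fact_id = ? AND superseded_at IS NULL",
        (now, fact_id),
    )
    conn.execute(
        "INSERT INTO observations (fact_id, value, source_fetched_at) VALUES (?, ?, ?)",
        (fact_id, value, now),
    )

record(conn, "SC-1950-001", "dismissed")
record(conn, "SC-1950-001", "allowed")   # the source corrected itself

current = conn.execute(
    "SELECT value FROM observations WHERE fact_id = ? AND superseded_at IS NULL",
    ("SC-1950-001",),
).fetchone()
# The old value stays queryable, so historical snapshots remain reproducible.
```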

Licensing

This is where most aggregator projects die. Get it right on day 1.

  • Your scrapers: MIT or Apache-2.0. Permissive so anyone can build on them.
  • Your data: CC-BY-4.0 (requires attribution) or CC0 (public domain). Pick CC-BY-4.0 unless you want zero requirements; the attribution is what brings you traffic.
  • Cite the original sources in your README. Always. The whole project depends on the underlying source being trustable; pointing at it both attributes the source and helps users verify your data.

If the source is government data, double-check the country's open-government license. India has the Government Open Data License, India (GODL); EU has the EU Open Data Licence; US Federal works are public domain by default. Match your output license to the source's license; don't relicense more restrictively than the source allows.

Quality discipline

Open datasets fail when they're wrong silently. Three habits:

  1. Schema validation in CI. Every refresh runs pytest / phpunit checking row counts, field formats, ID uniqueness. If validation fails, the new data isn't committed.

  2. Diff alerts. When the daily refresh produces a 10%+ change in row count for any source, alert yourself. That's either a real change worth noting or a scraper bug; either way, you need to know.

  3. Known-gaps doc. Maintain docs/known_gaps.md: be explicit about what your dataset doesn't cover, which dates are missing, and which fields are unreliable. Honest gap documentation makes your dataset more, not less, citeable.
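Habits 1 and 2 fit in a few assertions that CI runs before committing; the table name matches the judgments example above, but the 10% threshold, fields checked, and demo data are illustrative:

```python
import sqlite3

def validate(conn: sqlite3.Connection, previous_row_count: int) -> None:
    """Refresh checks: run before committing new data; any raise aborts the commit."""
    rows = conn.execute("SELECT COUNT(*) FROM judgments").fetchone()[0]

    # ID uniqueness: the primary key must actually be unique.
    distinct = conn.execute("SELECT COUNT(DISTINCT id) FROM judgments").fetchone()[0]
    assert distinct == rows, "duplicate ids in judgments"

    # Required provenance fields must never be NULL.
    missing = conn.execute(
        "SELECT COUNT(*) FROM judgments WHERE source_url IS NULL"
    ).fetchone()[0]
    assert missing == 0, f"{missing} rows missing source_url"

    # Diff alert: a >10% swing in row count is a scraper bug until proven otherwise.
    if previous_row_count:
        change = abs(rows - previous_row_count) / previous_row_count
        assert change <= 0.10, f"row count changed {change:.0%} since last refresh"

# Demo with a toy in-memory mirror of the dataset:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE judgments (id TEXT PRIMARY KEY, source_url TEXT)")
conn.executemany(
    "INSERT INTO judgments VALUES (?, ?)",
    [(f"SC-{i}", "https://example.gov/x") for i in range(100)],
)
validate(conn, previous_row_count=95)   # ~5% growth: passes quietly
```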

Promotion

Building a great dataset that no one knows about is a tree falling in a forest. After launching:

  • Post on r/dataisbeautiful and r/datasets (rules vary by subreddit; check them first).
  • Email three journalists / researchers who'd plausibly use it, with a 200-word pitch, the dataset link, and a CSV they can open in Excel right now.
  • Add a contribution path: invite people to file issues for data errors or new sources.
  • Submit to awesome lists: awesome-public-datasets, awesome-datasette, awesome-{your-domain}.

If even one journalist runs a story citing your dataset, the capstone has paid for itself a thousand times over.

Common pitfalls

  • Scope creep. Don't start with "all government data." Start with one ministry, one type of filing.
  • Bad schemas. Adding a column later is fine; removing one is a breaking change. Underspecify on day 1.
  • Stale data without notice. If the source goes down or your scraper breaks, the dataset is silently stale. Loud failure modes save you.
  • License confusion. Don't release data under an MIT-style license; MIT is for software. Use Creative Commons.
  • Ego attachment. This project is most useful to others; let them shape it. Accept the PR that adds a new source even if it isn't perfect.

Deployment

Cheapest possible. GitHub Actions runs the scrapers and commits the new data files. GitHub Pages serves the static site; the browse layer can be Datasette-Lite querying the SQLite file in the browser, or Datasette running on a small host. Cost: $0 if your data fits in a public repo (<5GB), $5/mo for a tiny VPS if you outgrow that.

For larger datasets, use Cloudflare R2 ($0.015/GB/month, no egress fees) for the raw files, with the GitHub repo containing only metadata + Datasette.

What "done" looks like

  • 30 consecutive days of refreshes committed to the repo (visible in git log).
  • The dataset is browseable via Datasette or a static UI.
  • README explains: what's in here, what's NOT, the license, how to cite.
  • Three external citations / mentions (a tweet, a blog post linking to it, a GitHub fork, a journalist quote, any one of these counts).
  • The data has at least 10,000 rows. Smaller datasets don't justify the framing.
  • Blog post documents the dataset's significance, sources, schema decisions, gaps, license.

Hands-on lab

Open the government open-data portal for your country. Browse for an hour. Identify three datasets that look useful but are actually a mess (multiple CSVs with inconsistent column names, broken Excel files, PDFs masquerading as data). Pick the one whose mess would most embarrass a senior official. That's probably your target. Start scraping today.

Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


The three filters for picking a good Project D target are: legally open, practically inaccessible, AND:
