
6.4 · Expert · 7 min read

Project C: Real-Estate Intelligence Dashboard

Aggregate property listings from 10+ public sources, normalise the messy schema, and surface market insights for one city or region.

What you’ll learn

  • Scrape and normalise listings across 10 heterogeneous sources.
  • Build a canonical address resolver that handles abbreviations, misspellings, and missing fields.
  • Track price-per-square-meter trends over time.
  • Ship a public dashboard useful to first-time buyers in your chosen region.

What you're building

A public dashboard that aggregates property listings (rentals, sales, or both) for one city or region from 10+ sources. Daily refresh, address-resolved, price-per-square-meter normalised, with trend charts and a searchable listings table.

Sources (10+, your pick, examples):
  ┌─ City open-data portal (often free CSV / API)
  ├─ National housing agency feed
  ├─ 3–5 listings sites (regional, not the big international portals)
  ├─ Local newspaper classifieds (often light-protection HTML)
  ├─ Government property-tax assessor records
  └─ A historical-data source going back at least 6 months (for a trend baseline)

  ↓

  ┌─ Address resolver (Nominatim / open geocoder)
  ├─ Listings DB (Postgres + PostGIS)
  ├─ Price-per-m² normaliser
  └─ Public dashboard (map + filters + trend chart)

This is the lowest-budget capstone. Most real-estate data sources are free or quasi-public; the cost is mostly in your time normalising the chaos.

Required features

  • 10+ data sources. Diversity matters; not just 10 variants of the same listing portal.
  • Geocoded addresses via Nominatim or similar open geocoder.
  • Normalised price-per-area (per square meter, or per square foot, pick one for your region).
  • Trend chart: median price-per-area, weekly, over the last 6 months.
  • Map view: heatmap or pin density, not just a list.
  • At least one source via Python + Scrapy and at least one via PHP + Symfony.
  • Browser automation for at least one source.
  • SERP API for at least one source: search "houses for sale in [city]" on Google Real Estate / Bing, capture the SERP, and extract listings.
  • Public GitHub repo + deployed dashboard + blog post.

Stretch features

  • Affordability scoring. Combine median price with local salary data (sourced from Project B's job-board crawl, if you also did it, or from open salary surveys).
  • "What changed this week" widget, new listings, price drops, sold-status changes.
  • PostGIS spatial queries. Find listings within X km of a coordinate.
  • Image classifier. Flag listings whose photos look like stock images / fake.

Suggested external sources

Real-estate scraping varies enormously by country. Pick a region you know well; the local nuances (slug-vs-ID URLs, regional address quirks, currency and unit norms) matter a lot.

Reasonable starting sources by region:

  • India: MagicBricks, 99acres, NoBroker (light protection on most listing pages); municipal corporation property-tax records; state RERA registries.
  • UK: Rightmove, OnTheMarket (check robots.txt carefully); the Land Registry's free monthly sales data; council planning portals.
  • US: county assessor public records; Realtor.com listings (light protection); HUD open data; some state MLS portals offer public excerpts.
  • EU: each country's open-data portal usually has property datasets; notable listing portals include Idealista (ES) and Immobilienscout24 (DE), check the ToS.
  • Other: almost every country has a national property register with public records. Start there.

Heads-up on the legal/ethical layer (covered in /learn/production-scale/legal-landscape):

  • Property tax data is overwhelmingly public-record. Safe.
  • MLS data is often behind paywalls or member-only; don't bypass paywalls.
  • Listing portals: read robots.txt + ToS carefully. Several have explicit anti-scraping clauses they enforce.
  • If your project surfaces individuals' data (e.g. seller names attached to addresses), GDPR / DPDP / state privacy laws apply. Aggregate to the building or street level.

Schema

CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE listings (
  id BIGSERIAL PRIMARY KEY,
  source TEXT NOT NULL,
  source_id TEXT NOT NULL,
  listing_type TEXT,  -- 'sale', 'rent', 'auction'
  address_raw TEXT,
  address_canonical TEXT,
  coords GEOGRAPHY(Point, 4326),
  city TEXT,
  region TEXT,
  country_code CHAR(2),
  price_amount INTEGER,
  price_currency CHAR(3),
  area_sqm NUMERIC(10, 2),
  bedrooms SMALLINT,
  bathrooms SMALLINT,
  description_text TEXT,
  posted_at DATE,
  captured_at TIMESTAMPTZ DEFAULT now(),
  expired_at TIMESTAMPTZ,
  UNIQUE (source, source_id)
);

CREATE INDEX ON listings USING GIST (coords);
CREATE INDEX ON listings (city, listing_type, posted_at DESC);
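
Daily refreshes will re-scrape listings you already hold, so the UNIQUE (source, source_id) constraint pairs naturally with an upsert. A minimal sketch with placeholder values; it also treats captured_at as a "last seen" timestamp, an assumption the expiry heuristic later in this lesson relies on:

INSERT INTO listings (source, source_id, listing_type, address_raw,
                      price_amount, price_currency, area_sqm, posted_at)
VALUES ('example_portal', 'abc-123', 'sale', '12 Hill Rd, Springfield',
        250000, 'EUR', 85.5, '2025-01-15')
ON CONFLICT (source, source_id) DO UPDATE
SET price_amount = EXCLUDED.price_amount,
    captured_at  = now();  -- refresh "last seen" on every sighting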

Address resolution

The hardest piece of this project. Property records use a hundred address formats; you want one.

Recipe:

  1. Normalise the raw address: strip extra whitespace, expand common abbreviations ("St" → "Street", "Apt" → "Apartment"), and lowercase.
  2. Geocode the cleaned string via Nominatim's free API (1 req/sec rate limit; cache aggressively).
  3. Reverse-geocode the coords back to a canonical address; store that as address_canonical.
  4. Deduplicate listings within 30m of each other that share the same listing_type; they're likely the same property cross-posted (see the query sketch after this list).
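
For step 4, PostGIS can generate the candidate duplicate pairs directly; on the geography type, ST_DWithin takes its distance in metres. A minimal sketch, assuming coords have already been geocoded:

SELECT a.id AS listing_a, b.id AS listing_b,
       a.source AS source_a, b.source AS source_b,
       a.price_amount AS price_a, b.price_amount AS price_b
FROM listings a
JOIN listings b
  ON a.id < b.id                         -- each pair only once
 AND a.source <> b.source                -- cross-posted on different portals
 AND a.listing_type = b.listing_type
 AND ST_DWithin(a.coords, b.coords, 30)  -- within 30 metres
WHERE a.expired_at IS NULL AND b.expired_at IS NULL;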

Nominatim rate limits are tight. Build a persistent cache (address → coords + canonical address). Check it before falling back to the API.

# Minimal Nominatim wrapper with a persistent JSONL cache
import requests, json, time, hashlib
CACHE = "geocode_cache.jsonl"

# Load the cache once: one {md5(address): result} JSON object per line.
_cache = {}
try:
  for line in open(CACHE):
    _cache.update(json.loads(line))
except FileNotFoundError:
  pass

def geocode(raw_address):
  key = hashlib.md5(raw_address.lower().encode()).hexdigest()
  if key in _cache:  # cache hit: no API call, no sleep
    return _cache[key]
  time.sleep(1)  # respect Nominatim's 1 req/sec limit
  r = requests.get(
    "https://nominatim.openstreetmap.org/search",
    params={"q": raw_address, "format": "json", "limit": 1},
    headers={"User-Agent": "PriceIntelCapstone/1.0 (contact@example.invalid)"},
  )
  data = r.json()
  if not data:
    return None
  result = {"lat": float(data[0]["lat"]), "lon": float(data[0]["lon"]),
            "display": data[0]["display_name"]}
  _cache[key] = result
  with open(CACHE, "a") as fh:  # write-through so tomorrow's run skips the API
    fh.write(json.dumps({key: result}) + "\n")
  return result

For 10k listings you'll make roughly 10k geocoding calls (and write 10k cache entries) the first time, then almost none on subsequent days. The cache is the killer feature.

Price-per-area normalisation

def price_per_sqm(price, currency, area_sqm):
  if not area_sqm or area_sqm <= 0:
    return None
  return price / area_sqm  # keep currency separate

# For comparison across currencies, convert at a daily FX snapshot.
# Don't normalise to "USD per sqm"; it confuses local users.

Show the natural local unit on the dashboard. Trends are far more meaningful within one currency.
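
If you want the normalised figure queryable in SQL too, a view over the listings table is one low-ceremony option (a sketch; a generated column would also work):

CREATE VIEW listing_prices AS
SELECT id, city, listing_type, price_currency,
       price_amount / NULLIF(area_sqm, 0) AS price_per_sqm  -- NULL when area is missing or zero
FROM listings;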

Dashboard ideas

  • Map view with pin density / heatmap colouring by price/m².
  • Trend chart: median price/m² weekly for the past 26 weeks (a query sketch follows this list).
  • Filters: bedrooms, area range, price range, listing type.
  • "Hot postcodes": which postal codes / wards saw the biggest movement this month.
  • "Compare to last year": same week last year, same area.

Map libraries: Leaflet (open source, the default), Maplibre GL (vector tiles, prettier), or Mapbox (free tier).

Common pitfalls

  • Geocoding eats your budget. Cache from day 1. The first night, your scraper may stall on Nominatim's rate limit. Plan for it.
  • Area unit confusion. Indian listings often quote in sq ft; European in sqm. Decide your canonical unit and convert at the parsing layer, not the dashboard layer.
  • Bedroom counts include "studio" / "1BR + den" / "2.5BHK". Settle on a numeric scheme (e.g. studio = 0.5, ".5BR" = 0.5) and document it.
  • Duplicate listings across portals. A property at 12 Hill Rd shows up on 3 portals at slightly different prices. Cluster by geo distance plus matching listing type.
  • Sold / withdrawn signals are rare. Most portals just stop displaying the listing. Use the same 30-day-not-seen heuristic as Project B (sketched below).
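
A sketch of that heuristic, assuming your ingest upsert refreshes captured_at whenever a listing is seen again (as in the upsert example in the Schema section):

UPDATE listings
SET expired_at = now()
WHERE expired_at IS NULL
  AND captured_at < now() - interval '30 days';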

Deployment

A VPS with PostGIS is the simplest setup. The map tiles are cacheable. The dashboard rebuilds nightly.

Costs:

  • VPS: $5–10/mo
  • Nominatim: free, but consider self-hosting if you go past 100k addresses
  • Map tiles: Mapbox's free tier, or OpenStreetMap raster tiles via Leaflet / MapLibre, covers portfolio-sized traffic

What "done" looks like

  • 10+ sources verified working in CI.
  • 60 consecutive days of refreshed data.
  • Address geocoding hit rate ≥85% across your sample (a quick SQL check follows this list).
  • The map renders in <2s.
  • A local who lives in the region looks at it and says something specific ("interesting that prices in [neighbourhood] are flat").
  • Blog post documents data sources, geocoding strategy, three failures, hosting cost.
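
One quick way to check the geocoding criterion (count(coords) only counts rows where coords is not null):

SELECT round(100.0 * count(coords) / count(*), 1) AS geocode_hit_pct
FROM listings;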

Hands-on lab

Pick your city or region today. Open the city's open-data portal in a browser. Find one property dataset (typically tax assessor or sales history). Download a sample. Load it into a Postgres + PostGIS instance running locally via Docker. Visualise with Leaflet. That's day 1. Now go find sources 2–10.
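
A sketch of that day-1 load, with hypothetical column names; match them to whatever the portal's CSV actually contains (psql's \copy reads the file client-side, which is convenient when Postgres runs in Docker):

CREATE TABLE raw_assessor (
  parcel_id  TEXT,
  address    TEXT,
  sale_price INTEGER,
  sale_date  DATE
);

-- then, from psql on your machine:
-- \copy raw_assessor FROM 'sample.csv' WITH (FORMAT csv, HEADER true)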
