Project C: Real-Estate Intelligence Dashboard
Aggregate property listings from 10+ public sources, normalise the messy schema, and surface market insights for one city or region.
What you’ll learn
- Scrape and normalise listings across 10 heterogeneous sources.
- Build a canonical address resolver that handles abbreviations, misspellings, and missing fields.
- Track price-per-square-meter trends over time.
- Ship a public dashboard useful to first-time buyers in your chosen region.
What you're building
A public dashboard that aggregates property listings (rentals, sales, or both) for one city or region from 10+ sources. Daily refresh, address-resolved, price-per-square-meter normalised, with trend charts and a searchable listings table.
Sources (10+, your pick, examples):
┌─ City open-data portal (often free CSV / API)
├─ National housing agency feed
├─ 3–5 listings sites (regional, not the big international portals)
├─ Local newspaper classifieds (often light-protection HTML)
├─ Government property-tax assessor records
└─ A historical-data source from 6 months ago (for trend baseline)
↓
┌─ Address resolver (Nominatim / open geocoder)
├─ Listings DB (Postgres + PostGIS)
├─ Price-per-m² normaliser
└─ Public dashboard (map + filters + trend chart)
This is the lowest-budget capstone. Most real-estate data sources are free or quasi-public; the cost is mostly in your time normalising the chaos.
Required features
- 10+ data sources. Diversity matters: not just 10 variants of the same listing portal.
- Geocoded addresses via Nominatim or similar open geocoder.
- Normalised price-per-area (per square meter, or per square foot, pick one for your region).
- Trend chart: median price-per-area, weekly, over the last 6 months.
- Map view: heatmap or pin density, not just a list.
- At least one source via Python + Scrapy (see the spider sketch after this list) and at least one via PHP + Symfony.
- Browser automation for at least one source.
- SERP API for at least one source: search "houses for sale in [city]" on Google or Bing, capture the SERP, extract the listings that surface.
- Public GitHub repo + deployed dashboard + blog post.
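To make the Scrapy requirement concrete, here is a minimal spider sketch; the domain, URL pattern, and CSS selectors are hypothetical placeholders you would swap out per source:
# Sketch of one Scrapy source -- domain and selectors are hypothetical
import scrapy

class ListingsSpider(scrapy.Spider):
    name = "example_listings"
    start_urls = ["https://listings.example.invalid/city/page/1"]
    custom_settings = {"DOWNLOAD_DELAY": 2}  # be polite

    def parse(self, response):
        for card in response.css("div.listing-card"):
            yield {
                "source": self.name,
                "source_id": card.attrib.get("data-id"),
                "address_raw": card.css(".address::text").get(),
                "price_raw": card.css(".price::text").get(),
                "area_raw": card.css(".area::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)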
Stretch features
- Affordability scoring. Combine median price with local salary data (sourced from Project B's job-board crawl, if you also did it, or from open salary surveys).
- "What changed this week" widget, new listings, price drops, sold-status changes.
- PostGIS spatial queries. Find listings within X km of a coordinate.
- Image classifier. Flag listings whose photos look like stock images / fake.
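A sketch of that spatial-query stretch feature against the listings schema defined below, via psycopg2; the DSN and the example coordinate are placeholders:
# Listings within 5 km of a point -- assumes the listings schema below
import psycopg2

conn = psycopg2.connect("dbname=realestate")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute("""
        SELECT id, address_canonical, price_amount
        FROM listings
        WHERE ST_DWithin(coords,
                         ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                         %s)
    """, (77.5946, 12.9716, 5000))  # lon, lat, radius in metres
    rows = cur.fetchall()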
Suggested external sources
Real-estate scraping varies enormously by country. Pick a region you know well; the local nuances (slug-vs-id URLs, regional address quirks, currency / unit norms) matter a lot.
| Region | Reasonable starting sources |
|---|---|
| India | MagicBricks, 99acres, NoBroker (light protection on most listing pages); Municipal corporation property-tax records; State RERA registries |
| UK | Rightmove, OnTheMarket (check robots.txt carefully); Land Registry's free monthly sales data; council planning portals |
| US | County assessor public records; Realtor.com listings (light); HUD open data; some state MLS portals offer public excerpts |
| EU | Each country's open-data portal usually has property datasets. Idealista (ES) and Immobilienscout24 (DE) are candidates; check their ToS |
| Other | Almost every country has a national property register with public records. Start there. |
Heads-up on the legal/ethical layer (covered in /learn/production-scale/legal-landscape):
- Property tax data is overwhelmingly public-record. Safe.
- MLS data is often behind paywalls or member-only; don't bypass paywalls.
- Listing portals: read robots.txt + ToS carefully. Several have explicit anti-scraping clauses they enforce.
- If your project ties individuals to addresses (e.g. by surfacing seller names), GDPR / DPDP / state privacy laws apply. Aggregate to the building or street level.
Schema
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE TABLE listings (
id BIGSERIAL PRIMARY KEY,
source TEXT NOT NULL,
source_id TEXT NOT NULL,
listing_type TEXT, -- 'sale', 'rent', 'auction'
address_raw TEXT,
address_canonical TEXT,
coords GEOGRAPHY(Point, 4326),
city TEXT,
region TEXT,
country_code CHAR(2),
price_amount BIGINT, -- INTEGER overflows in high-denomination currencies (e.g. VND, IDR)
price_currency CHAR(3),
area_sqm NUMERIC(10, 2),
bedrooms SMALLINT,
bathrooms SMALLINT,
description_text TEXT,
posted_at DATE,
captured_at TIMESTAMPTZ DEFAULT now(),
expired_at TIMESTAMPTZ,
UNIQUE (source, source_id)
);
CREATE INDEX ON listings USING GIST (coords);
CREATE INDEX ON listings (city, listing_type, posted_at DESC);
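With this schema, the weekly median trend required above is one ordered-set aggregate away. A sketch via psycopg2, assuming a single currency and a placeholder DSN:
# Weekly median price per m² for the trend chart -- single-currency sketch
import psycopg2

conn = psycopg2.connect("dbname=realestate")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute("""
        SELECT date_trunc('week', posted_at) AS week,
               percentile_cont(0.5) WITHIN GROUP
                   (ORDER BY price_amount / area_sqm) AS median_price_sqm
        FROM listings
        WHERE listing_type = 'sale'
          AND area_sqm > 0
          AND posted_at >= now() - interval '26 weeks'
        GROUP BY 1
        ORDER BY 1;
    """)
    trend = cur.fetchall()  # feed straight into the chart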
Address resolution
The hardest piece of this project. Property records use a hundred address formats; you want one.
Recipe:
- Normalise raw address, strip extra whitespace, expand common abbreviations ("St" → "Street", "Apt" → "Apartment"), lowercase.
- Geocode the cleaned string via Nominatim's free API (1 req/sec rate limit; cache aggressively).
- Reverse-geocode the coords back to a canonical address; store that as address_canonical.
- Deduplicate listings within 30 m of each other that share a listing_type; they're likely the same property cross-posted.
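A minimal sketch of step 1; the abbreviation table here is a tiny illustrative sample you would grow for your region:
# Step 1: normalise the raw address -- abbreviation map is illustrative
import re

ABBREV = {"st": "street", "rd": "road", "ave": "avenue", "apt": "apartment"}

def normalise_address(raw):
    s = re.sub(r"\s+", " ", raw.strip().lower())  # collapse whitespace
    words = [w.strip(".,") for w in s.split(" ")]
    # Naive: "St." meaning "Saint" gets wrongly expanded; refine per region.
    return " ".join(ABBREV.get(w, w) for w in words)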
Nominatim's rate limits are tight. Build a persistent cache (raw address → coords + canonical address) and hit it before falling back to the API.
# Minimal Nominatim wrapper with a persistent JSONL cache
import requests, json, time, hashlib

CACHE = "geocode_cache.jsonl"

def _load_cache():
    try:
        with open(CACHE) as f:
            return {e["key"]: e["result"] for e in map(json.loads, f)}
    except FileNotFoundError:
        return {}

_cache = _load_cache()

def geocode(raw_address):
    key = hashlib.md5(raw_address.lower().encode()).hexdigest()
    if key in _cache:   # cache hit: no API call, no sleep
        return _cache[key]
    time.sleep(1)       # respect Nominatim's 1 req/sec policy
    r = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": raw_address, "format": "json", "limit": 1},
        headers={"User-Agent": "PriceIntelCapstone/1.0 (contact@example.invalid)"},
        timeout=30,
    )
    data = r.json()
    result = None
    if data:
        result = {"lat": float(data[0]["lat"]), "lon": float(data[0]["lon"]),
                  "display": data[0]["display_name"]}
    # Cache misses too, so failed addresses aren't re-queried every run.
    with open(CACHE, "a") as f:
        f.write(json.dumps({"key": key, "result": result}) + "\n")
    _cache[key] = result
    return result
For 10k listings, the first run makes ~10k geocoding calls (nearly three hours at 1 req/sec) and writes 10k cache entries; subsequent days need almost none. The cache is the killer feature.
Price-per-area normalisation
def price_per_sqm(price, currency, area_sqm):
if not area_sqm or area_sqm <= 0:
return None
return price / area_sqm # keep currency separate
# For comparison across currencies, convert at a daily FX snapshot.
# Don't normalise to "USD per sqm"; it confuses local users.
Show the natural local unit on the dashboard. Trends are far more meaningful within one currency.
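If you do want a cross-currency comparison layer, take one snapshot per day and cache it. A sketch; frankfurter.app is assumed here as one free, keyless rates API, so verify the endpoint and response shape before relying on it:
# Daily FX snapshot, cached one rate per day -- frankfurter.app assumed
import requests, json, datetime

def daily_rate(base, quote, cache_file="fx_cache.json"):
    today = datetime.date.today().isoformat()
    try:
        cache = json.load(open(cache_file))
    except FileNotFoundError:
        cache = {}
    key = f"{today}:{base}:{quote}"
    if key not in cache:  # at most one API call per pair per day
        r = requests.get("https://api.frankfurter.app/latest",
                         params={"from": base, "to": quote}, timeout=30)
        cache[key] = r.json()["rates"][quote]
        json.dump(cache, open(cache_file, "w"))
    return cache[key]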
Dashboard ideas
- Map view with pin density / heatmap colouring by price/m².
- Trend chart: median price/m², weekly, for the past 26 weeks.
- Filters: bedrooms, area range, price range, listing type.
- "Hot postcodes": which postal codes / wards saw the biggest movement this month.
- "Compare to last year": same week last year, same area.
Map libraries: Leaflet (open source, the default), MapLibre GL (vector tiles, prettier), or Mapbox (has a free tier).
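If you'd rather prototype the map from Python before writing any JavaScript, folium wraps Leaflet; a minimal heatmap sketch with placeholder coordinates:
# Quick heatmap prototype with folium (wraps Leaflet) -- data is placeholder
import folium
from folium.plugins import HeatMap

m = folium.Map(location=[12.9716, 77.5946], zoom_start=12)  # lat, lon
# Each point is [lat, lon, weight]; use scaled price-per-m² as the weight.
points = [[12.97, 77.59, 0.8], [12.95, 77.61, 0.4]]
HeatMap(points).add_to(m)
m.save("heatmap.html")  # open in a browser to inspect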
Common pitfalls
- Geocoding eats your time budget. Cache from day 1; the first night, your scraper may stall on Nominatim's rate limit, so plan for it.
- Area unit confusion. Indian listings often quote sq ft; European ones sqm. Decide your canonical unit and convert at the parsing layer (1 sq ft = 0.0929 m²), not the dashboard layer.
- Bedroom counts include "studio" / "1BR + den" / "2.5BHK". Settle on a numeric scheme, e.g. studio = 0.5, ".5BR" = 0.5, and document it.
- Duplicate listings across portals. A property at 12 Hill Rd shows up on three portals at slightly different prices. Cluster by geo distance + listing type (see the sketch after this list).
- Sold / withdrawn signals are rare. Most portals just stop displaying the listing. Use the same 30-days-unseen heuristic as Project B: mark a listing expired once it hasn't appeared for 30 days.
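A sketch of that dedup pass as a PostGIS self-join (schema as above, placeholder DSN); it emits candidate pairs for review rather than deleting anything:
# Candidate duplicate pairs: same listing_type, coords within 30 m
import psycopg2

conn = psycopg2.connect("dbname=realestate")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute("""
        SELECT a.id, b.id
        FROM listings a
        JOIN listings b
          ON a.id < b.id                       -- each pair once
         AND a.source <> b.source              -- cross-portal only
         AND a.listing_type = b.listing_type
         AND ST_DWithin(a.coords, b.coords, 30)  -- metres, geography type
    """)
    pairs = cur.fetchall()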
Deployment
A VPS with PostGIS is the simplest setup. The map tiles are cacheable. The dashboard rebuilds nightly.
Costs:
- VPS: $5–10/mo
- Nominatim: free, but consider self-hosting if you go past 100k addresses
- Map tiles: Mapbox's free tier (or an OpenStreetMap-based tile provider) covers portfolio-sized traffic
What "done" looks like
- 10+ sources verified working in CI.
- 60 consecutive days of refreshed data.
- Address geocoding hit rate ≥85% across your sample.
- The map renders in <2s.
- A local who lives in the region looks at it and says something specific ("interesting that prices in [neighbourhood] are flat").
- Blog post documents data sources, geocoding strategy, three failures, hosting cost.
Hands-on lab
Pick your city or region today. Open the city's open-data portal in a browser. Find one property dataset (typically tax assessor or sales history). Download a sample. Load it into a Postgres + PostGIS instance running locally via Docker. Visualise with Leaflet. That's day 1. Now go find sources 2–10.
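A sketch of that load step, assuming the standard postgis/postgis Docker image and a hypothetical assessor CSV; adjust the columns to whatever your portal actually publishes:
# Day-1 load sketch. First, a local PostGIS via Docker:
#   docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=dev postgis/postgis
import psycopg2

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres password=dev")
with conn, conn.cursor() as cur:  # commits on success
    cur.execute("""CREATE TABLE IF NOT EXISTS raw_assessor
                   (parcel_id TEXT, address TEXT, assessed_value NUMERIC)""")
    with open("assessor_sample.csv") as f:  # placeholder filename/columns
        cur.copy_expert("COPY raw_assessor FROM STDIN WITH CSV HEADER", f)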
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.