
6.4 · Expert · 7 min read

Project C: Real-Estate Intelligence Dashboard

Aggregate property listings from 10+ public sources, normalise the messy schema, and surface market insights for one city or region.

What you’ll learn

  • Scrape and normalise listings across 10 heterogeneous sources.
  • Build a canonical address resolver that handles abbreviations, misspellings, and missing fields.
  • Track price-per-square-meter trends over time.
  • Ship a public dashboard useful to first-time buyers in your chosen region.

What you're building

A public dashboard that aggregates property listings (rentals, sales, or both) for one city or region from 10+ sources. Daily refresh, address-resolved, price-per-square-meter normalised, with trend charts and a searchable listings table.

Sources (10+, your pick, examples):
  ┌─ City open-data portal (often free CSV / API)
  ├─ National housing agency feed
  ├─ 3–5 listings sites (regional, not the big international portals)
  ├─ Local newspaper classifieds (often light-protection HTML)
  ├─ Government property-tax assessor records
  └─ A historical-data source going back at least 6 months (for a trend baseline)

  ↓

  ┌─ Address resolver (Nominatim / open geocoder)
  ├─ Listings DB (Postgres + PostGIS)
  ├─ Price-per-m² normaliser
  └─ Public dashboard (map + filters + trend chart)

This is the lowest-budget capstone. Most real-estate data sources are free or quasi-public; the cost is mostly in your time normalising the chaos.

Required features

  • 10+ data sources. Diversity matters; not just 10 variants of the same listing portal.
  • Geocoded addresses via Nominatim or similar open geocoder.
  • Normalised price-per-area (per square meter, or per square foot, pick one for your region).
  • Trend chart: median price-per-area, weekly, over the last 6 months.
  • Map view: heatmap or pin density, not just a list.
  • At least one source via Python + Scrapy and at least one via PHP + Symfony.
  • Browser automation for at least one source.
  • SERP API for at least one source: search "houses for sale in [city]" on Google Real Estate / Bing, capture the SERP, and extract listings.
  • Public GitHub repo + deployed dashboard + blog post.

Stretch features

  • Affordability scoring. Combine median price with local salary data (sourced from Project B's job-board crawl, if you also did it, or from open salary surveys).
  • "What changed this week" widget, new listings, price drops, sold-status changes.
  • PostGIS spatial queries. Find listings within X km of a coordinate.
  • Image classifier. Flag listings whose photos look like stock images / fake.

Suggested external sources

Real-estate scraping varies enormously by country. Pick a region you know well; the local nuances (slug-vs-ID URLs, regional address quirks, currency and unit norms) matter a lot.

Reasonable starting sources by region:

  • India: MagicBricks, 99acres, NoBroker (light protection on most listing pages); municipal corporation property-tax records; state RERA registries.
  • UK: Rightmove, OnTheMarket (check robots.txt carefully); the Land Registry's free monthly sales data; council planning portals.
  • US: county assessor public records; Realtor.com listings (light protection); HUD open data; some state MLS portals offer public excerpts.
  • EU: each country's open-data portal usually has property datasets; notable listing portals include Idealista (ES) and Immobilienscout24 (DE), check the ToS.
  • Other: almost every country has a national property register with public records. Start there.

Heads-up on the legal/ethical layer (covered in /learn/production-scale/legal-landscape):

  • Property tax data is overwhelmingly public-record. Safe.
  • MLS data is often behind paywalls or member-only; don't bypass paywalls.
  • Listing portals: read robots.txt + ToS carefully. Several have explicit anti-scraping clauses they enforce.
  • If your project surfaces individuals' data (e.g. seller names attached to addresses), GDPR / DPDP / state privacy laws apply. Aggregate to the building or street level.

Schema

CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE listings (
  id BIGSERIAL PRIMARY KEY,
  source TEXT NOT NULL,
  source_id TEXT NOT NULL,
  listing_type TEXT,  -- 'sale', 'rent', 'auction'
  address_raw TEXT,
  address_canonical TEXT,
  coords GEOGRAPHY(Point, 4326),
  city TEXT,
  region TEXT,
  country_code CHAR(2),
  price_amount INTEGER,
  price_currency CHAR(3),
  area_sqm NUMERIC(10, 2),
  bedrooms SMALLINT,
  bathrooms SMALLINT,
  description_text TEXT,
  posted_at DATE,
  captured_at TIMESTAMPTZ DEFAULT now(),
  expired_at TIMESTAMPTZ,
  UNIQUE (source, source_id)
);

CREATE INDEX ON listings USING GIST (coords);
CREATE INDEX ON listings (city, listing_type, posted_at DESC);
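
Daily refreshes will re-scrape listings you already hold, so the UNIQUE (source, source_id) constraint pairs naturally with an upsert. A minimal sketch with placeholder values; it also treats captured_at as a "last seen" timestamp, an assumption the expiry heuristic later in this lesson relies on:

INSERT INTO listings (source, source_id, listing_type, address_raw,
                      price_amount, price_currency, area_sqm, posted_at)
VALUES ('example_portal', 'abc-123', 'sale', '12 Hill Rd, Springfield',
        250000, 'EUR', 85.5, '2025-01-15')
ON CONFLICT (source, source_id) DO UPDATE
SET price_amount = EXCLUDED.price_amount,
    captured_at  = now();  -- refresh "last seen" on every sighting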

Address resolution

The hardest piece of this project. Property records use a hundred address formats; you want one.

Recipe:

  1. Normalise the raw address: strip extra whitespace, expand common abbreviations ("St" → "Street", "Apt" → "Apartment"), and lowercase.
  2. Geocode the cleaned string via Nominatim's free API (1 req/sec rate limit; cache aggressively).
  3. Reverse-geocode the coords back to a canonical address; store that as address_canonical.
  4. Deduplicate listings within 30m of each other that share the same listing_type; they're likely the same property cross-posted (see the query sketch after this list).
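
For step 4, PostGIS can generate the candidate duplicate pairs directly; on the geography type, ST_DWithin takes its distance in metres. A minimal sketch, assuming coords have already been geocoded:

SELECT a.id AS listing_a, b.id AS listing_b,
       a.source AS source_a, b.source AS source_b,
       a.price_amount AS price_a, b.price_amount AS price_b
FROM listings a
JOIN listings b
  ON a.id < b.id                         -- each pair only once
 AND a.source <> b.source                -- cross-posted on different portals
 AND a.listing_type = b.listing_type
 AND ST_DWithin(a.coords, b.coords, 30)  -- within 30 metres
WHERE a.expired_at IS NULL AND b.expired_at IS NULL;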

Nominatim rate limits are tight. Build a persistent cache (address → coords + canonical address). Check it before falling back to the API.

# Minimal Nominatim wrapper with a persistent JSONL cache
import requests, json, time, hashlib
CACHE = "geocode_cache.jsonl"

# Load the cache once: one {md5(address): result} JSON object per line.
_cache = {}
try:
  for line in open(CACHE):
    _cache.update(json.loads(line))
except FileNotFoundError:
  pass

def geocode(raw_address):
  key = hashlib.md5(raw_address.lower().encode()).hexdigest()
  if key in _cache:  # cache hit: no API call, no sleep
    return _cache[key]
  time.sleep(1)  # respect Nominatim's 1 req/sec limit
  r = requests.get(
    "https://nominatim.openstreetmap.org/search",
    params={"q": raw_address, "format": "json", "limit": 1},
    headers={"User-Agent": "PriceIntelCapstone/1.0 (contact@example.invalid)"},
  )
  data = r.json()
  if not data:
    return None
  result = {"lat": float(data[0]["lat"]), "lon": float(data[0]["lon"]),
            "display": data[0]["display_name"]}
  _cache[key] = result
  with open(CACHE, "a") as fh:  # write-through so tomorrow's run skips the API
    fh.write(json.dumps({key: result}) + "\n")
  return result

For 10k listings you'll make roughly 10k geocoding calls (and write 10k cache entries) the first time, then almost none on subsequent days. The cache is the killer feature.

Price-per-area normalisation

def price_per_sqm(price, currency, area_sqm):
  if not area_sqm or area_sqm <= 0:
    return None
  return price / area_sqm  # keep currency separate

# For comparison across currencies, convert at a daily FX snapshot.
# Don't normalise to "USD per sqm"; it confuses local users.

Show the natural local unit on the dashboard. Trends are far more meaningful within one currency.
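
If you want the normalised figure queryable in SQL too, a view over the listings table is one low-ceremony option (a sketch; a generated column would also work):

CREATE VIEW listing_prices AS
SELECT id, city, listing_type, price_currency,
       price_amount / NULLIF(area_sqm, 0) AS price_per_sqm  -- NULL when area is missing or zero
FROM listings;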

Dashboard ideas

  • Map view with pin density / heatmap colouring by price/m².
  • Trend chart: median price/m² weekly for the past 26 weeks (a query sketch follows this list).
  • Filters: bedrooms, area range, price range, listing type.
  • "Hot postcodes": which postal codes / wards saw the biggest movement this month.
  • "Compare to last year": same week last year, same area.

Map libraries: Leaflet (open source, the default), Maplibre GL (vector tiles, prettier), or Mapbox (free tier).

Common pitfalls

  • Geocoding eats your budget. Cache from day 1. The first night, your scraper may stall on Nominatim's rate limit. Plan for it.
  • Area unit confusion. Indian listings often quote in sq ft; European in sqm. Decide your canonical unit and convert at the parsing layer, not the dashboard layer.
  • Bedroom counts include "studio" / "1BR + den" / "2.5BHK". Settle on a numeric scheme (e.g. studio = 0.5, ".5BR" = 0.5) and document it.
  • Duplicate listings across portals. A property at 12 Hill Rd shows up on 3 portals at slightly different prices. Cluster by geo distance plus matching listing type.
  • Sold / withdrawn signals are rare. Most portals just stop displaying the listing. Use the same 30-day-not-seen heuristic as Project B (sketched below).
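
A sketch of that heuristic, assuming your ingest upsert refreshes captured_at whenever a listing is seen again (as in the upsert example in the Schema section):

UPDATE listings
SET expired_at = now()
WHERE expired_at IS NULL
  AND captured_at < now() - interval '30 days';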

Deployment

A VPS with PostGIS is the simplest setup. The map tiles are cacheable. The dashboard rebuilds nightly.

Costs:

  • VPS: $5–10/mo
  • Nominatim: free, but consider self-hosting if you go past 100k addresses
  • Map tiles: Mapbox's free tier, or OpenStreetMap raster tiles via Leaflet / MapLibre, covers portfolio-sized traffic

What "done" looks like

  • 10+ sources verified working in CI.
  • 60 consecutive days of refreshed data.
  • Address geocoding hit rate ≥85% across your sample (a quick SQL check follows this list).
  • The map renders in <2s.
  • A local who lives in the region looks at it and says something specific ("interesting that prices in [neighbourhood] are flat").
  • Blog post documents data sources, geocoding strategy, three failures, hosting cost.
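
One quick way to check the geocoding criterion (count(coords) only counts rows where coords is not null):

SELECT round(100.0 * count(coords) / count(*), 1) AS geocode_hit_pct
FROM listings;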

Hands-on lab

Pick your city or region today. Open the city's open-data portal in a browser. Find one property dataset (typically tax assessor or sales history). Download a sample. Load it into a Postgres + PostGIS instance running locally via Docker. Visualise with Leaflet. That's day 1. Now go find sources 2–10.
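
A sketch of that day-1 load, with hypothetical column names; match them to whatever the portal's CSV actually contains (psql's \copy reads the file client-side, which is convenient when Postgres runs in Docker):

CREATE TABLE raw_assessor (
  parcel_id  TEXT,
  address    TEXT,
  sale_price INTEGER,
  sale_date  DATE
);

-- then, from psql on your machine:
-- \copy raw_assessor FROM 'sample.csv' WITH (FORMAT csv, HEADER true)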
