
6.2 · Expert · 6 min read

Project A: Multi-Source Price Intelligence Platform

Track prices across Catalog108, two external e-commerce sites, and Google Shopping daily. Output: a normalised price-history database plus a public dashboard.

What you’ll learn

  • Scrape and normalise pricing data from heterogeneous sources into one schema.
  • Use a SERP API for the Google Shopping source instead of scraping search.
  • Detect price changes over time and emit alerts.
  • Ship a public dashboard that updates daily without manual intervention.

What you're building

A platform that pulls product prices from five sources (spanning four source types) daily, normalises them into one schema, stores the history, detects significant changes, and publishes a dashboard.

Sources:
  ┌─ Catalog108 /products (static)
  ├─ Catalog108 /deals/live (polling shim)
  ├─ External site #1 (your pick, e.g. an open product directory)
  ├─ External site #2 (your pick, e.g. another open directory)
  └─ Google Shopping (via SERP API)

  ↓ (daily cron)

  ┌─ Normalisation layer (unified schema)
  ├─ Postgres / SQLite (price history)
  ├─ Change-detection (alert on >5% drop or >10% rise)
  └─ Public dashboard (static HTML rebuilt daily)

The "real" use case: a small reseller deciding when to restock. The pretend use case: a portfolio piece that demonstrates production discipline.

Required features (must-have to pass)

  • All four source types (static HTML, dynamic / live, external site, SERP API).
  • Unified product schema, every source maps to {source, source_id, name, price_cents, currency, captured_at, in_stock, url}.
  • Daily cron (or every 6 hours) running unattended.
  • Price history, not just latest. Don't overwrite; insert new rows.
  • At least one source via Python + Scrapy and at least one via PHP + Symfony / Guzzle — the spec requires two languages.
  • Proxy rotation for at least one source.
  • Browser automation (Playwright / Panther) for the live-ticker source (Catalog108 /deals/live).
  • Public dashboard (can be a static site built from a template; HTMX, Next.js static export, plain templates, your choice).
  • Public GitHub repo with architecture diagram in the README.
  • Deployed instance running daily.
  • Blog post documenting build, costs, failures.
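The unified schema in the required-features list can be sketched as a small normalising function each spider funnels into. The `raw` input shape below (`sku`, `title`, `price` keys) is a hypothetical example, not any source's actual payload; each spider adapts its own fields before calling it.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = {"source", "source_id", "name", "price_cents",
                   "currency", "captured_at", "in_stock", "url"}

def normalise(source: str, raw: dict) -> dict:
    """Map one raw scraped item into the unified schema."""
    record = {
        "source": source,
        "source_id": str(raw["sku"]),
        "name": raw["title"].strip(),
        # store money as integer cents to avoid float rounding drift
        "price_cents": round(float(raw["price"]) * 100),
        "currency": raw.get("currency", "USD").upper(),
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "in_stock": bool(raw.get("available", True)),
        "url": raw["url"],
    }
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record
```

Keeping the gate in one function means a malformed source fails loudly at scrape time instead of silently corrupting the history table.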

Stretch features (nice but optional)

  • Email digest of biggest price moves (sent via Resend / Postmark / SendGrid free tier).
  • Telegram bot for instant alerts.
  • CSV export endpoint.
  • A "compare to last month" view.
  • ML-based outlier detection (call out suspicious price drops that are actually listing errors).
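For the outlier-detection stretch goal, a full ML model is overkill to start with. A robust-statistics baseline — a modified z-score over median absolute deviation, which is my suggestion rather than anything the spec mandates — already flags listing-error-style drops:

```python
from statistics import median

def is_outlier(new_price_cents: int, history: list[int], threshold: float = 6.0) -> bool:
    """Flag a price that sits far outside the product's own history.

    Uses the modified z-score based on median absolute deviation (MAD),
    which is less distorted by a few past spikes than mean/stddev.
    """
    if len(history) < 5:           # not enough history to judge
        return False
    med = median(history)
    mad = median(abs(p - med) for p in history)
    if mad == 0:                   # perfectly flat history: any move is suspicious
        return new_price_cents != med
    modified_z = 0.6745 * (new_price_cents - med) / mad
    return abs(modified_z) > threshold
```

Surface flagged prices in the dashboard rather than alerting on them — a 90% "drop" is more often a listing error than a deal.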

Suggested architecture

scraping/
├── scrapy_project/  # Python, handles Catalog108 /products + external #1
│  ├── spiders/
│  │  ├── catalog108_products.py
│  │  └── external_one.py
│  └── pipelines.py  # normalises + writes to Postgres
├── symfony_app/  # PHP, handles external #2 + SERP API
│  ├── src/Command/
│  │  ├── ExternalTwoCommand.php
│  │  └── SerpShoppingCommand.php
│  └── src/Service/PriceNormaliser.php
├── browser_scraper/  # Playwright, Catalog108 /deals/live
│  └── deals_live.py
├── db/
│  ├── migrations/  # schema versioned
│  └── seed.sql
├── dashboard/  # static site built from templates
│  ├── build.py  # reads DB, renders HTML
│  └── templates/
├── alerts/
│  └── change_detector.py
├── docker-compose.yml  # postgres + app
├── .github/workflows/  # CI tests + daily deploy
└── README.md (with mermaid diagram)

You don't have to mirror this. But the split (Python + PHP + browser + dashboard) is what the spec demands.

Suggested external sources

Pick external sources where:

  1. The site is open to scraping (robots.txt allows the catalog path you want).
  2. Prices are in HTML, not canvas-rendered.
  3. The volume is reasonable (≤10k SKUs).
  4. You're not violating a stated ToS.

Some honest candidates:

  • Small independent shops with public catalogs (always check robots.txt).
  • Open product directories (e.g. brand-of-the-month aggregators with permissive policies).
  • Public-API-backed sites where the HTML is just a thin shell over a documented JSON endpoint.

Don't pick Amazon, Walmart, eBay, Etsy as primary sources for the capstone. They're worth scraping in real work, but the anti-bot effort dominates the project. Save them for after the capstone, when you've got the skeleton working.

Suggested SERP API choice

For Google Shopping, pick one of the SERP API providers covered in Sub-Path 3 (/learn/api-scraping/comparing-major-providers). All of them have free tiers sufficient for daily runs across ~20 queries.

Recommended cheap stack:

  • SerpApi free tier (100 searches/month) → covers 3 queries/day for 30 days.
  • ScraperAPI free trial → 1000 credits, more than enough for the capstone window.
  • ZenRows free tier.

Pick whichever your account is approved for fastest. The capstone tests integration competency, not provider selection.
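As a sketch of the integration, here is what a SerpApi Google Shopping call can look like, split so the parsing is testable offline. The endpoint and `engine=google_shopping` parameter follow SerpApi's documented API; the response field names (`shopping_results`, `extracted_price`, `product_id`, `link`) are my reading of their response shape — verify against whichever provider you actually pick.

```python
import urllib.parse

SERPAPI_ENDPOINT = "https://serpapi.com/search.json"

def shopping_query_url(query: str, api_key: str) -> str:
    """Build the request URL for SerpApi's google_shopping engine."""
    params = {"engine": "google_shopping", "q": query, "api_key": api_key}
    return SERPAPI_ENDPOINT + "?" + urllib.parse.urlencode(params)

def parse_shopping_results(payload: dict) -> list[dict]:
    """Map a SERP API response into the project's unified schema fields."""
    out = []
    for item in payload.get("shopping_results", []):
        price = item.get("extracted_price")
        if price is None:
            continue                      # skip unpriced listings
        out.append({
            "source": "google_shopping",
            "source_id": str(item.get("product_id") or item["link"]),
            "name": item["title"],
            "price_cents": round(price * 100),
            "currency": "USD",            # assumption: pin gl/hl params in real runs
            "in_stock": None,             # SERP results don't carry stock info
            "url": item["link"],
        })
    return out
```

Fetching and parsing as separate steps also lets you cache raw responses and re-parse them later without spending credits.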

Schema

A minimum-viable price-history schema:

CREATE TABLE products (
  id BIGSERIAL PRIMARY KEY,
  source TEXT NOT NULL,  -- 'catalog108', 'external_one', 'google_shopping'...
  source_id TEXT NOT NULL,  -- the upstream's stable id (SKU, slug, etc.)
  name TEXT NOT NULL,
  canonical_key TEXT,  -- normalised name for cross-source matching
  url TEXT,
  UNIQUE (source, source_id)
);

CREATE TABLE price_observations (
  id BIGSERIAL PRIMARY KEY,
  product_id BIGINT NOT NULL REFERENCES products(id),
  captured_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  price_cents INTEGER NOT NULL,
  currency CHAR(3) NOT NULL,
  in_stock BOOLEAN,
  raw_meta JSONB
);

CREATE INDEX ON price_observations (product_id, captured_at DESC);
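The DDL above is Postgres; since the spec also allows SQLite, here is a SQLite-flavoured sketch of the "insert, don't overwrite" rule (columns trimmed for brevity; `AUTOINCREMENT` and text timestamps stand in for `BIGSERIAL`/`TIMESTAMPTZ`):

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS products (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  source TEXT NOT NULL,
  source_id TEXT NOT NULL,
  name TEXT NOT NULL,
  url TEXT,
  UNIQUE (source, source_id)
);
CREATE TABLE IF NOT EXISTS price_observations (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  product_id INTEGER NOT NULL REFERENCES products(id),
  captured_at TEXT NOT NULL DEFAULT (datetime('now')),
  price_cents INTEGER NOT NULL,
  currency TEXT NOT NULL
);
"""

def record_price(conn: sqlite3.Connection, rec: dict) -> None:
    """Upsert the product row, then append (never overwrite) an observation."""
    conn.execute(
        "INSERT INTO products (source, source_id, name, url) VALUES (?, ?, ?, ?) "
        "ON CONFLICT (source, source_id) DO UPDATE SET name = excluded.name",
        (rec["source"], rec["source_id"], rec["name"], rec["url"]),
    )
    (pid,) = conn.execute(
        "SELECT id FROM products WHERE source = ? AND source_id = ?",
        (rec["source"], rec["source_id"]),
    ).fetchone()
    conn.execute(
        "INSERT INTO price_observations (product_id, price_cents, currency) VALUES (?, ?, ?)",
        (pid, rec["price_cents"], rec["currency"]),
    )
```

Two runs for the same product produce one `products` row and two `price_observations` rows — that's the whole history-not-latest discipline.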

canonical_key is the magic field: it's how you join "the same product" across sources when SKUs differ. Build it lossily (lowercase + strip punctuation + first 3 tokens) and accept that ~10% of joins will be wrong; surface that uncertainty in the dashboard.
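The lossy recipe is a few lines of Python:

```python
import string

def canonical_key(name: str, tokens: int = 3) -> str:
    """Lossy normalised name: lowercase, strip punctuation, keep first N tokens."""
    cleaned = name.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split()[:tokens])
```

Deliberately crude — two differently-worded listings of the same mug will still collide on the same key, which is the point.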

Change detection

The simplest possible signal:

# Daily comparison: today's price vs the latest observation before today
from datetime import timedelta

for product in unique_products:
    yesterday = latest_before(product, today - timedelta(days=1))  # replace(day=day-1) breaks on the 1st
    today_price = latest(product)
    if yesterday is None:
        continue
    pct = (today_price.price_cents - yesterday.price_cents) / yesterday.price_cents
    if abs(pct) > 0.05:  # 5% move either way; the spec's 5%-drop/10%-rise split is a two-line tweak
        emit_alert(product, yesterday, today_price, pct)

That's the entire alert loop. Don't over-engineer.
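The `latest_before` helper in that loop maps onto one query against the index defined in the schema section. A SQLite-flavoured sketch (the function name and ISO-string timestamps are illustrative):

```python
import sqlite3

def latest_price_before(conn: sqlite3.Connection, product_id: int, cutoff: str):
    """Most recent observation strictly before `cutoff` (ISO-8601 string),
    or None when the product has no observation before the cutoff."""
    return conn.execute(
        "SELECT price_cents, captured_at FROM price_observations "
        "WHERE product_id = ? AND captured_at < ? "
        "ORDER BY captured_at DESC LIMIT 1",
        (product_id, cutoff),
    ).fetchone()
```

With the `(product_id, captured_at DESC)` index, this is an index seek, so the daily loop stays fast even with months of history.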

Common pitfalls

  • Time zone bugs. captured_at should always be UTC. Display in local time on the dashboard only.
  • Currency mixing. Don't average a USD price and an INR price. Either convert (use a daily FX snapshot) or keep them in separate views.
  • Canonical-key collisions. A "Blue Mug" from two sources might be different products. Show the source URLs prominently so users can disambiguate.
  • Anti-bot on external sources. If you're picking sites with light protection, the proxy lesson (/learn/production-scale/rotating-proxy-strategies) is your friend. Don't fight a serious bot wall; switch sources.
  • SERP API rate limits. The free tiers throttle aggressively. Cache aggressively too.
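"Cache aggressively" can be as simple as an on-disk cache keyed by query and date, so re-runs within a day never spend API credits. A sketch (the cache-directory layout is an assumption, not part of the spec):

```python
import hashlib
import json
from datetime import date
from pathlib import Path

def cached_fetch(query: str, fetch, cache_dir: str = ".serp_cache") -> dict:
    """Return a cached SERP response for (query, today), calling `fetch`
    only on a miss. At most one API credit per query per day."""
    Path(cache_dir).mkdir(exist_ok=True)
    key = hashlib.sha256(f"{date.today()}:{query}".encode()).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    payload = fetch(query)          # e.g. the real HTTP call to your provider
    path.write_text(json.dumps(payload))
    return payload
```

Keeping raw responses around also means you can re-parse history after a schema change without re-spending your free tier.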

Deployment

Cheapest viable: a $5/mo VPS (Hetzner CX22, DigitalOcean droplet) running:

  • Docker (Postgres + your scraper containers)
  • Cron at 06:00 UTC
  • A static dashboard served via Caddy / nginx

Cheaper still: GitHub Actions cron + Postgres on Neon / Supabase free tier + dashboard on GitHub Pages. The 30-day price-history table fits easily in any free Postgres tier.

What "done" looks like

  • README has a working architecture diagram and "what's where" map.
  • git clone && docker-compose up runs the whole thing locally.
  • The deployed instance has 14 consecutive days of data captured.
  • The dashboard has more than one user (a friend counts).
  • The blog post explains: what you scraped, why those sources, what your unified schema is, three things that broke, the final cost.

Hands-on lab

Catalog108's /deals/live is the simulated WebSocket-ish live ticker. Build a Playwright scraper that polls it every 3 seconds for 60 seconds and captures the price-tick frames into a CSV. That's your first source. Once that works, expand to /products (static) and one external site. Everything else is the same pattern, three more times.
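A sketch of that first scraper, assuming Playwright's sync API. The selectors (`.deal-row .price`) and the idea that price text always shows two decimals are assumptions — inspect /deals/live in devtools and adjust:

```python
import csv
import time

def parse_tick(text: str) -> int:
    """'$12.99' / '12,99 kr' style price text -> integer cents.
    Assumes the page always renders exactly two decimal places."""
    return int("".join(ch for ch in text if ch.isdigit()))

def capture(url: str, out_path: str, interval: float = 3.0, duration: float = 60.0) -> None:
    # import here so parse_tick stays usable without a browser installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["captured_at", "price_cents"])
            deadline = time.time() + duration
            while time.time() < deadline:
                for text in page.locator(".deal-row .price").all_inner_texts():
                    writer.writerow([time.time(), parse_tick(text)])
                time.sleep(interval)
        browser.close()
```

Polling with `locator(...).all_inner_texts()` sidesteps the WebSocket question entirely, which is fine for a 3-second tick.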


Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.

Question 1 of 8

Why does the price-intelligence project specifically require BOTH a static source AND a live/dynamic source on Catalog108?
