Project A: Multi-Source Price Intelligence Platform
Track prices across Catalog108, two external e-commerce sites, and Google Shopping daily. Output: a normalised price-history database plus a public dashboard.
What you’ll learn
- Scrape and normalise pricing data from heterogeneous sources into one schema.
- Use a SERP API for the Google Shopping source instead of scraping search.
- Detect price changes over time and emit alerts.
- Ship a public dashboard that updates daily without manual intervention.
What you're building
A platform that pulls product prices from five sources (spanning four source types) daily, normalises them into one schema, stores the history, detects significant changes, and publishes a dashboard.
Sources:
┌─ Catalog108 /products (static)
├─ Catalog108 /deals/live (polling shim)
├─ External site #1 (your pick, e.g. an open product directory)
├─ External site #2 (your pick, e.g. another open directory)
└─ Google Shopping (via SERP API)
↓ (daily cron)
┌─ Normalisation layer (unified schema)
├─ Postgres / SQLite (price history)
├─ Change-detection (alert on >5% drop or >10% rise)
└─ Public dashboard (static HTML rebuilt daily)
The "real" use case: a small reseller deciding when to restock. The pretend use case: a portfolio piece that demonstrates production discipline.
Required features (must-have to pass)
- All four source types (static HTML, dynamic / live, external site, SERP API).
- Unified product schema: every source maps to {source, source_id, name, price_cents, currency, captured_at, in_stock, url} (see the dataclass sketch after this list).
- Daily cron (or every 6 hours) running unattended.
- Price history, not just latest. Don't overwrite; insert new rows.
- At least one source via Python + Scrapy, at least one source via PHP + Symfony / Guzzle. Per spec, two languages.
- Proxy rotation for at least one source.
- Browser automation (Playwright / Panther) for the live-ticker source (Catalog108 /deals/live).
- Public dashboard (can be a static site built from a template; HTMX, Next.js static export, plain templates, your choice).
- Public GitHub repo with architecture diagram in the README.
- Deployed instance running daily.
- Blog post documenting build, costs, failures.
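One way to pin the schema requirement down before writing any scraper is a small dataclass. The field names below mirror the required mapping; the class name and defaults are illustrative, not part of the spec.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class PriceRecord:
    """One observation in the unified schema that every source must map to."""
    source: str        # 'catalog108', 'external_one', 'external_two', 'google_shopping'
    source_id: str     # the upstream's stable id (SKU, slug, listing id)
    name: str
    price_cents: int   # integer cents; never store floats
    currency: str      # ISO 4217 code, e.g. 'USD'
    url: str
    captured_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))  # always UTC
    in_stock: Optional[bool] = None
```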
Stretch features (nice but optional)
- Email digest of biggest price moves (sent via Resend / Postmark / SendGrid free tier).
- Telegram bot for instant alerts.
- CSV export endpoint.
- A "compare to last month" view.
- ML-based outlier detection (call out suspicious price drops that are actually listing errors).
Suggested architecture
scraping/
├── scrapy_project/ # Python, handles Catalog108 /products + external #1
│ ├── spiders/
│ │ ├── catalog108_products.py
│ │ └── external_one.py
│ └── pipelines.py # normalises + writes to Postgres
├── symfony_app/ # PHP, handles external #2 + SERP API
│ ├── src/Command/
│ │ ├── ExternalTwoCommand.php
│ │ └── SerpShoppingCommand.php
│ └── src/Service/PriceNormaliser.php
├── browser_scraper/ # Playwright, Catalog108 /deals/live
│ └── deals_live.py
├── db/
│ ├── migrations/ # schema versioned
│ └── seed.sql
├── dashboard/ # static site built from templates
│ ├── build.py # reads DB, renders HTML
│ └── templates/
├── alerts/
│ └── change_detector.py
├── docker-compose.yml # postgres + app
├── .github/workflows/ # CI tests + daily deploy
└── README.md (with mermaid diagram)
You don't have to mirror this. But the split (Python + PHP + browser + dashboard) is what the spec demands.
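On the Python side, pipelines.py can be the single place where spider output becomes a unified record plus a new history row. A sketch, assuming psycopg2, a DATABASE_URL environment variable, and item fields named sku / name / price / url (those names are mine, not the spec's); it writes to the tables defined in the Schema section below:

```python
# scrapy_project/pipelines.py -- normalise items and append price history (sketch)
import os

import psycopg2
from itemadapter import ItemAdapter

class PriceHistoryPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(os.environ["DATABASE_URL"])

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        a = ItemAdapter(item)
        price_cents = int(round(float(a["price"]) * 100))  # normalise to integer cents
        with self.conn, self.conn.cursor() as cur:
            # Upsert the product row, then append a new observation (never overwrite history)
            cur.execute(
                """INSERT INTO products (source, source_id, name, url)
                   VALUES (%s, %s, %s, %s)
                   ON CONFLICT (source, source_id) DO UPDATE SET name = EXCLUDED.name
                   RETURNING id""",
                (spider.name, a["sku"], a["name"].strip(), a["url"]),
            )
            product_id = cur.fetchone()[0]
            cur.execute(
                """INSERT INTO price_observations (product_id, price_cents, currency, in_stock)
                   VALUES (%s, %s, %s, %s)""",
                (product_id, price_cents, a.get("currency", "USD"), a.get("in_stock")),
            )
        return item
```

Register the class under ITEM_PIPELINES in settings.py so both Python spiders share the same write path.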
Suggested external sources
Pick external sources where:
- The site is open to scraping (robots.txt allows the catalog path you want).
- Prices are in HTML, not canvas-rendered.
- The volume is reasonable (≤10k SKUs).
- You're not violating a stated ToS.
Some honest candidates:
- Small independent shops with public catalogs (always check robots.txt).
- Open product directories (e.g. brand-of-the-month aggregators with permissive policies).
- Public-API-backed sites where the HTML is just a thin shell over a documented JSON endpoint.
Don't pick Amazon, Walmart, eBay, Etsy as primary sources for the capstone. They're worth scraping in real work, but the anti-bot effort dominates the project. Save them for after the capstone, when you've got the skeleton working.
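The robots.txt check is worth automating so it runs against every candidate before you commit to one. Python's standard-library robotparser is enough; the site URL below is a placeholder:

```python
from urllib.robotparser import RobotFileParser

def catalog_path_allowed(base_url: str, path: str, user_agent: str = "*") -> bool:
    """True if the site's robots.txt permits fetching base_url + path for this user agent."""
    rp = RobotFileParser()
    rp.set_url(base_url.rstrip("/") + "/robots.txt")
    rp.read()  # fetches and parses robots.txt over HTTP
    return rp.can_fetch(user_agent, base_url.rstrip("/") + path)

# Example with a placeholder site:
# catalog_path_allowed("https://example-shop.test", "/catalog/")
```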
Suggested SERP API choice
For Google Shopping, pick one of the SERP API providers covered in Sub-Path 3 (/learn/api-scraping/comparing-major-providers). All of them have free tiers that cover a modest daily query set; budget yours against the numbers below.
Recommended cheap stack:
- SerpApi free tier (100 searches/month) → covers 3 queries/day for 30 days.
- ScraperAPI free trial → 1000 credits, more than enough for the capstone window.
- ZenRows free tier.
Pick whichever your account is approved for fastest. The capstone tests integration competency, not provider selection.
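For a rough sense of the integration work, here's a sketch against SerpApi's HTTP endpoint. The shopping_results / extracted_price field names come from their Google Shopping responses and will differ on other providers, so treat them as assumptions and check your provider's docs:

```python
import os

import requests

def fetch_google_shopping(query: str) -> list[dict]:
    """Fetch Google Shopping results for one query via SerpApi and map to the unified schema (sketch)."""
    resp = requests.get(
        "https://serpapi.com/search.json",
        params={
            "engine": "google_shopping",
            "q": query,
            "api_key": os.environ["SERPAPI_KEY"],
        },
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json().get("shopping_results", [])
    return [
        {
            "source": "google_shopping",
            "source_id": str(r.get("product_id", r.get("position"))),
            "name": r.get("title", ""),
            "price_cents": int(round(r["extracted_price"] * 100)),
            "currency": "USD",  # prices come back localised; pin the currency per query
            "in_stock": None,
            "url": r.get("link", ""),
        }
        for r in results
        if r.get("extracted_price") is not None
    ]
```

Cache the raw JSON to disk keyed by query and date so re-runs don't burn extra search credits.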
Schema
A minimum-viable price-history schema:
CREATE TABLE products (
id BIGSERIAL PRIMARY KEY,
source TEXT NOT NULL, -- 'catalog108', 'external_one', 'google_shopping'...
source_id TEXT NOT NULL, -- the upstream's stable id (SKU, slug, etc.)
name TEXT NOT NULL,
canonical_key TEXT, -- normalised name for cross-source matching
url TEXT,
UNIQUE (source, source_id)
);
CREATE TABLE price_observations (
id BIGSERIAL PRIMARY KEY,
product_id BIGINT NOT NULL REFERENCES products(id),
captured_at TIMESTAMPTZ NOT NULL DEFAULT now(),
price_cents INTEGER NOT NULL,
currency CHAR(3) NOT NULL,
in_stock BOOLEAN,
raw_meta JSONB
);
CREATE INDEX ON price_observations (product_id, captured_at DESC);
canonical_key is the magic field: it's how you join "the same product" across sources when SKUs differ. Build it lossily (lowercase + strip punctuation + first 3 tokens) and accept that ~10% of joins will be wrong; surface that uncertainty in the dashboard.
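A sketch of that lossy key builder (the exact regex and token count are tunable; this is one reasonable default, not the required recipe):

```python
import re

def canonical_key(name: str) -> str:
    """Lossy cross-source key: lowercase, strip punctuation, keep the first 3 tokens."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", name.lower())  # drop punctuation and symbols
    tokens = cleaned.split()[:3]
    return "-".join(tokens)

# canonical_key("Blue Mug, 350ml (Ceramic)")  -> 'blue-mug-350ml'
# canonical_key("BLUE mug 350 ml ceramic")    -> 'blue-mug-350'   (a near-miss: expect some wrong joins)
```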
Change detection
The simplest possible signal:
from datetime import date

# Daily comparison: today's latest price vs the most recent earlier observation
for product in unique_products:
    yesterday = latest_before(product, date.today())  # your DB helper: newest row before today
    today_price = latest(product)                     # your DB helper: newest row overall
    if not yesterday:
        continue
    pct = (today_price.price_cents - yesterday.price_cents) / yesterday.price_cents
    if pct < -0.05 or pct > 0.10:                     # >5% drop or >10% rise, per the spec
        emit_alert(product, yesterday, today_price, pct)
That's the entire alert loop. Don't over-engineer.
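latest_before and latest are your own DB helpers. Against the schema above, latest_before could be as simple as the query below (assuming psycopg2; adapt the signature to however you model products):

```python
def latest_before(conn, product_id: int, cutoff):
    """Newest price observation strictly before `cutoff` (a UTC date or datetime)."""
    with conn.cursor() as cur:
        cur.execute(
            """SELECT price_cents, currency, captured_at
               FROM price_observations
               WHERE product_id = %s AND captured_at < %s
               ORDER BY captured_at DESC
               LIMIT 1""",
            (product_id, cutoff),
        )
        return cur.fetchone()  # None if there is no earlier observation
```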
Common pitfalls
- Time zone bugs. captured_at should always be UTC. Display in local time on the dashboard only.
- Currency mixing. Don't average a USD price and an INR price. Either convert (use a daily FX snapshot) or keep them in separate views.
- Canonical-key collisions. A "Blue Mug" from two sources might be different products. Show the source URLs prominently so users can disambiguate.
- Anti-bot on external sources. If you're picking sites with light protection, the proxy lesson (/learn/production-scale/rotating-proxy-strategies) is your friend. Don't fight a serious bot wall; switch sources.
- SERP API rate limits. The free tiers throttle aggressively. Cache aggressively too.
Deployment
Cheapest viable: a $5/mo VPS (Hetzner CX22, DigitalOcean droplet) running:
- Docker (Postgres + your scraper containers)
- Cron at 06:00 UTC
- A static dashboard served via Caddy / nginx
Cheaper still: GitHub Actions cron + Postgres on Neon / Supabase free tier + dashboard on GitHub Pages. The 30-day price-history table fits easily in any free Postgres tier.
What "done" looks like
- README has a working architecture diagram and "what's where" map.
- `git clone && docker-compose up` runs the whole thing locally.
- The deployed instance has 14 consecutive days of data captured.
- The dashboard has more than one user (a friend counts).
- The blog post explains: what you scraped, why those sources, what your unified schema is, three things that broke, the final cost.
Hands-on lab
Catalog108's /deals/live is the simulated WebSocket-ish live ticker. Build a Playwright scraper that polls it every 3 seconds for 60 seconds and captures the price-tick frames into a CSV (a starting sketch follows). That's your first source. Once that works, expand to /products (static) and one external site; the remaining sources are the same fetch-normalise-insert loop.
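A starting point for that first scraper; the URL and the .deal / .name / .price selectors are placeholders, so inspect the live page and adjust:

```python
# browser_scraper/deals_live.py -- poll the live deals page and log price ticks to CSV (sketch)
import csv
import time
from datetime import datetime, timezone

from playwright.sync_api import sync_playwright

POLL_SECONDS = 3
RUN_SECONDS = 60

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://catalog108.example/deals/live")  # placeholder URL; use the real lab target

    with open("deal_ticks.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["captured_at", "name", "price"])
        deadline = time.monotonic() + RUN_SECONDS
        while time.monotonic() < deadline:
            # '.deal' / '.name' / '.price' are assumed selectors -- inspect the page first
            for row in page.query_selector_all(".deal"):
                name = row.query_selector(".name")
                price = row.query_selector(".price")
                if name and price:
                    writer.writerow([
                        datetime.now(timezone.utc).isoformat(),
                        name.inner_text().strip(),
                        price.inner_text().strip(),
                    ])
            time.sleep(POLL_SECONDS)

    browser.close()
```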