Project B: Job Market Analytics Service
Aggregate jobs across Catalog108 /jobs, four external boards, and Google Jobs (via SERP API). Dedupe, normalise, and ship a dashboard that surfaces market trends.
What you’ll learn
- Scrape jobs from six heterogeneous sources daily.
- Dedupe identical roles across boards using fuzzy matching.
- Compute and visualise market signals: median salary, in-demand skills, remote/in-office ratio.
- Ship a public dashboard that's actually useful to job seekers.
What you're building
A daily-refreshed analytics service over the global scraping-engineer job market (or any job market of your choice). It scrapes six sources, dedupes the same role posted to multiple boards, extracts skills + salary signals, and publishes a public dashboard.
Sources (6):
┌─ Catalog108 /jobs (static, Sub-Path 1)
├─ External board #1 (e.g. We Work Remotely / RemoteOK, public RSS available)
├─ External board #2 (e.g. AngelList / Wellfound public listings)
├─ External board #3 (e.g. Hacker News "Who is hiring?" monthly thread)
├─ External board #4 (e.g. a niche board for your domain)
└─ Google Jobs (via SERP API, google_jobs result type)
↓
┌─ Normalised schema (title, company, location, remote, salary_range, skills[], posted_at)
├─ Deduplication (fuzzy match across boards)
├─ Skill extraction (regex / spaCy NER lite)
└─ Public dashboard with weekly trends
This project produces something genuinely useful: a free job-market dashboard with cleaner data than any single board offers. That makes it a strong portfolio piece and a marketing asset for your scraping consultancy.
Required features
- Six daily sources. The diversity matters: at least one must be static HTML, at least one must require browser automation, and at least one must be a SERP API integration (see the sketch after this list).
- Cross-source deduplication. When the same role appears on 3 boards, your dashboard shows it once with three "seen on" badges.
- Skill extraction. Parse free-text descriptions into a structured skills[] array (Python, Scrapy, AWS, Docker, etc.). Keep a finite vocabulary (~200 skills); don't try to extract every noun.
- Salary normalisation. Parse "$100k–$150k", "£60,000", "10–15 LPA" into a unified currency-tagged range.
- Trend chart. At minimum: count of new postings per day for the past 30 days, by skill / region.
- Public GitHub + deployed dashboard + blog post.
Two-language requirement: Scrapy (Python) for ~half the sources, Symfony (PHP) for the other half. Browser automation (Playwright / Panther) for at least one source.
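The SERP API source can stay tiny. A minimal sketch using SerpApi's google_jobs engine, assuming the serpapi package (pip install google-search-results); the query and API key are placeholders:

# Fetch Google Jobs postings through SerpApi's google_jobs engine.
from serpapi import GoogleSearch

search = GoogleSearch({
    "engine": "google_jobs",
    "q": "scrapy engineer",          # your market of choice
    "api_key": "YOUR_SERPAPI_KEY",   # placeholder
})
results = search.get_dict()
for job in results.get("jobs_results", []):
    print(job.get("title"), "@", job.get("company_name"))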
Stretch features
- Email digest. Weekly "new postings matching your filter" email.
- Slack bot. Drop matching jobs into a channel.
- Salary lookup widget. "What's the median salary for a Scrapy engineer in Bangalore?" → answer from your data.
- ML salary prediction. Train a tiny gradient-boosted model on your scraped data; predict salary from title + skills + location.
- API endpoint. Expose your normalised data via a public REST/GraphQL API. Now it's not just a dashboard, it's a service.
Suggested external sources
Pick sources with explicit scraping policies you can honour. Several boards expose public RSS / Atom feeds that are friendlier than the HTML pages:
| Source | Why it's reasonable |
|---|---|
| RemoteOK / We Work Remotely | Have public RSS feeds (no scraping needed for the index, parse the feed) |
| Hacker News "Who is hiring?" monthly threads | All public via the HN API (sketch after this table) |
| Stack Overflow Jobs (shut down in 2022; archived on the Wayback Machine) | Historical only, but interesting for trend analysis |
| GitLab careers / company career pages of your favourite OSS projects (GitHub Jobs itself shut down in 2021) | Light protection, fits the spirit of scraping |
| Government / NGO job boards (USAJobs, EU Open Data) | Open data, often have APIs |
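The Hacker News thread needs no HTML scraping at all; the official Firebase API exposes every comment. A minimal sketch, assuming requests; the thread id is a placeholder:

# Fetch top-level comments of an HN "Who is hiring?" thread via the
# official Firebase API. Each top-level comment is one posting.
import requests

HN_ITEM = "https://hacker-news.firebaseio.com/v0/item/{}.json"
THREAD_ID = 12345678  # placeholder: grab the real id from the thread URL

thread = requests.get(HN_ITEM.format(THREAD_ID)).json()
for kid in thread.get("kids", [])[:10]:        # first 10 postings
    comment = requests.get(HN_ITEM.format(kid)).json()
    if comment and not comment.get("deleted"):
        print(comment.get("text", "")[:120])   # HTML-escaped posting text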
Avoid LinkedIn as a scraping target. The hiQ case (covered in /learn/foundations/legal-ethical-scraping) established that scraping public LinkedIn profiles likely doesn't violate the CFAA, but LinkedIn's ToS still prohibit it and LinkedIn enforces aggressively; breach-of-contract and state-law claims survive, and hiQ itself ultimately lost on the contract claim. Don't make the capstone about a legal fight.
Schema
CREATE TABLE companies (
id BIGSERIAL PRIMARY KEY,
canonical_name TEXT NOT NULL UNIQUE,
aliases TEXT[],
domain TEXT,
industry TEXT
);
CREATE TABLE jobs (
id BIGSERIAL PRIMARY KEY,
source TEXT NOT NULL,
source_id TEXT NOT NULL,
company_id BIGINT REFERENCES companies(id),
title TEXT NOT NULL,
location TEXT,
remote_kind TEXT, -- 'fully_remote', 'hybrid', 'on_site', 'unknown'
salary_currency CHAR(3),
salary_min_year INTEGER,
salary_max_year INTEGER,
skills TEXT[], -- ['python', 'scrapy', 'aws'...]
description_text TEXT,
posted_at DATE,
captured_at TIMESTAMPTZ DEFAULT now(),
UNIQUE (source, source_id)
);
CREATE TABLE job_clusters (
cluster_id BIGSERIAL PRIMARY KEY,
job_ids BIGINT[] NOT NULL,
canonical_title TEXT,
canonical_company_id BIGINT REFERENCES companies(id)
);
CREATE INDEX ON jobs USING GIN (skills);
CREATE INDEX ON jobs (posted_at);
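The UNIQUE (source, source_id) constraint makes daily re-scrapes idempotent. A loading sketch, assuming psycopg2: new rows are inserted, re-sightings just bump captured_at, which becomes the "last seen" signal the staleness rule under Common pitfalls relies on.

# Idempotent daily load: insert new jobs, refresh captured_at on re-sightings.
import psycopg2

UPSERT_SQL = """
INSERT INTO jobs (source, source_id, title, location, description_text, posted_at)
VALUES (%(source)s, %(source_id)s, %(title)s, %(location)s, %(description)s, %(posted_at)s)
ON CONFLICT (source, source_id)
DO UPDATE SET captured_at = now();
"""

def load_jobs(conn, items):
    with conn.cursor() as cur:
        for item in items:
            cur.execute(UPSERT_SQL, item)
    conn.commit()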
Deduplication
The classic problem in job aggregators. Two jobs are "the same" when:
- Same company (within alias-aware string matching).
- Same role family (titles like "Senior Python Engineer" and "Senior Python Developer" should cluster).
- Posted within ~14 days of each other.
Simple recipe (a code sketch follows this list):
- Normalise titles: lowercase, strip seniority words ("senior", "staff", "principal"), remove parenthetical tags.
- Normalise companies: case-fold, strip suffixes ("Inc.", "Ltd.", "LLC"), look up aliases.
- For each new job, query the last 14 days of jobs from the same canonical company with similar canonical title (Levenshtein distance ≤ 3 OR shared 2+ words).
- If a match: append to its cluster. If not: new cluster.
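A sketch of the normalisation and matching steps, assuming rapidfuzz (pip install rapidfuzz) for Levenshtein distance; the aliases dict maps normalised alias to canonical name and is fed from the companies table:

# Normalisation + matching for the first three steps of the recipe.
import re
from rapidfuzz.distance import Levenshtein

SENIORITY = re.compile(r"\b(senior|staff|principal|junior|lead)\b")
SUFFIXES = re.compile(r"\b(inc|ltd|llc|co|gmbh)\.?$")

def normalise_title(title: str) -> str:
    t = title.lower()
    t = re.sub(r"\(.*?\)", "", t)   # remove parenthetical tags
    t = SENIORITY.sub("", t)        # strip seniority words
    return " ".join(t.split())

def normalise_company(name: str, aliases: dict[str, str]) -> str:
    n = SUFFIXES.sub("", name.casefold().strip(" .,"))
    n = " ".join(n.split())
    return aliases.get(n, n)        # map known aliases to the canonical name

def titles_match(a: str, b: str) -> bool:
    # Levenshtein distance <= 3 OR at least 2 shared words.
    return (Levenshtein.distance(a, b) <= 3
            or len(set(a.split()) & set(b.split())) >= 2)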
This will give you ~80% precision. The dashboard shows clusters; clicking expands to all postings. Wrong groupings are a footgun in production, but for a capstone, 80% is acceptable if you surface "seen on N boards" prominently so users can verify.
Skill extraction
Keep it simple. A regex against a curated skill vocabulary covers 90% of the value:
import re

SKILLS = {
    "python": [r"\bpython\b"],
    "scrapy": [r"\bscrapy\b"],
    "playwright": [r"\bplaywright\b"],
    "aws": [r"\baws\b", r"\bamazon web services\b"],
    "docker": [r"\bdocker\b"],
    "php": [r"\bphp\b"],
    "symfony": [r"\bsymfony\b"],
    "postgres": [r"\bpostgres(ql)?\b"],
    # ... 100-200 entries
}

def extract_skills(text: str) -> list[str]:
    text_lower = text.lower()
    return [skill for skill, patterns in SKILLS.items()
            if any(re.search(p, text_lower) for p in patterns)]
Don't try NER. The recall benefit isn't worth the complexity for a capstone.
Salary normalisation
# Parse "$100k–$150k", "£60,000", "₹10–15 LPA", "€80,000 - €100,000"
def parse_salary(text: str) -> tuple[str | None, int | None, int | None]:
"""Returns (currency, min_year, max_year). Best-effort; nulls on uncertainty."""
# Detect currency symbol or code
# Detect numbers, multiplier (k, lakh, crore, M)
# Detect range vs single
# Convert to year (some boards quote monthly)
...
Document your assumptions in the README. Wrong salary numbers are worse than missing salary numbers: fail to null rather than guess.
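A minimal sketch of parse_salary covering only symbol-prefixed amounts and ranges; anything quoted in LPA or per month is deliberately nulled rather than converted, per the rule above:

import re

SYMBOLS = {"$": "USD", "£": "GBP", "€": "EUR", "₹": "INR"}
RANGE = re.compile(
    r"([$£€₹])\s*([\d,.]+)\s*(k)?"                  # currency + first amount
    r"(?:\s*[-–]\s*[$£€₹]?\s*([\d,.]+)\s*(k)?)?",   # optional second amount
    re.IGNORECASE,
)

def parse_salary(text: str) -> tuple[str | None, int | None, int | None]:
    # LPA and monthly quotes need their own branches; null rather than guess.
    if re.search(r"\b(lpa|lakh|month|monthly)\b", text, re.IGNORECASE):
        return (None, None, None)
    m = RANGE.search(text)
    if not m:
        return (None, None, None)
    def amount(num: str | None, k: str | None) -> int | None:
        if num is None:
            return None
        return int(float(num.replace(",", "")) * (1_000 if k else 1))
    lo = amount(m.group(2), m.group(3))
    hi = amount(m.group(4), m.group(5)) or lo
    return (SYMBOLS[m.group(1)], lo, hi)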
Dashboard ideas
Useful out of the box:
- Job board volume chart: daily new postings over the last 90 days, stacked by source.
- Skill demand chart: top 20 skills by posting count over the last 30 days.
- Salary distribution: by skill, by location, by remote kind.
- Recent postings table: filterable by skill / location / remote.
- Company spotlight: top 20 companies hiring this month.
Don't over-design. A static site with charts (Chart.js / ApexCharts) and a JSON-backed search is enough. Avoid heavy frontend frameworks; the data is small.
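Each chart can be a JSON file regenerated nightly. A sketch for the skill-demand chart, assuming psycopg2 and the schema above; the output path is arbitrary:

# Top 20 skills by posting count over the last 30 days, as chart-ready JSON.
import json
import psycopg2

TOP_SKILLS_SQL = """
SELECT unnest(skills) AS skill, count(*) AS postings
FROM jobs
WHERE posted_at >= current_date - 30
GROUP BY skill
ORDER BY postings DESC
LIMIT 20;
"""

def write_skill_demand(conn, path="dashboard/data/skill_demand.json"):
    with conn.cursor() as cur:
        cur.execute(TOP_SKILLS_SQL)
        rows = cur.fetchall()
    with open(path, "w") as f:
        json.dump([{"skill": s, "postings": n} for s, n in rows], f)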
Common pitfalls
- Duplicate explosion. Without dedup, your "top hiring company" chart will be dominated by whichever company cross-posts most aggressively. Dedup before aggregating.
- Salary in different units. A "60k EUR/year" and a "5k EUR/month" describe the same salary. Normalise to year.
- Posting age confusion. A job "posted 30 days ago" on one board might be the same role newly cross-posted today. Cluster on the company-title key, not the post date.
- Location ambiguity. "Remote" + "London" can mean "remote, but registered in London" or "must be physically in London but works from home." Capture both.
- Stale postings. Boards rarely mark a job closed. Treat any job not seen in a daily scrape for 30 days as expired (one WHERE clause, sketched below).
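If your loader bumps captured_at on every sighting (as in the upsert sketch under Schema), the staleness rule costs nothing:

# Postings still "active" are those seen by a scrape in the last 30 days.
ACTIVE_JOBS_SQL = """
SELECT * FROM jobs
WHERE captured_at >= now() - interval '30 days';
"""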
Deployment
Same shape as Project A: a $5/mo VPS, or GitHub Actions + free-tier Postgres + GitHub Pages.
The dashboard is small enough to be a static site rebuilt nightly. The DB is the only stateful piece.
What "done" looks like
- 30 consecutive days of captured data.
- Dedup quality verified by hand on a 50-job sample (≥85% precision on clusters).
- The dashboard answers "how many jobs require Scrapy this month?" in one click.
- A non-technical friend (who isn't job-hunting) can navigate the dashboard and tell you something interesting.
- Blog post explains: sources, dedup strategy, three things that broke, cost.
Hands-on lab
Start with Catalog108 /jobs. Build a Scrapy spider that pulls every listing into your local Postgres. Add company normalisation. Now add one external source. Notice the dedup problem the moment two sources mention "Pinegrove Co., Senior PHP Engineer." Solve it, then add the next four sources.
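A skeleton for that first spider; the domain and CSS selectors are hypothetical, since they depend on Catalog108's actual markup:

# Minimal Scrapy spider for Catalog108 /jobs. The catalog108.example
# domain and all selectors are assumptions; adjust to the real markup.
import scrapy

class JobsSpider(scrapy.Spider):
    name = "catalog108_jobs"
    start_urls = ["https://catalog108.example/jobs"]

    def parse(self, response):
        for card in response.css(".job-card"):          # hypothetical selector
            yield {
                "source": "catalog108",
                "source_id": card.attrib.get("data-id"),
                "title": card.css("h2::text").get(),
                "company": card.css(".company::text").get(),
                "location": card.css(".location::text").get(),
            }
        # Follow pagination if present (selector is an assumption).
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)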
Practice this lesson on Catalog108, our first-party scraping sandbox. Open lab target → /jobs
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.