Project B: Job Market Analytics Service
Aggregate jobs across Catalog108 /jobs, four external boards, and Google Jobs (via SERP API). Dedupe, normalise, and ship a dashboard that surfaces market trends.
What you’ll learn
- Scrape jobs from six heterogeneous sources daily.
- Dedupe identical roles across boards using fuzzy matching.
- Compute and visualise market signals: median salary, in-demand skills, remote/in-office ratio.
- Ship a public dashboard that's actually useful to job seekers.
What you're building
A daily-refreshed analytics service over the global scraping-engineer job market (or any job market of your choice). It scrapes six sources, dedupes the same role posted to multiple boards, extracts skills + salary signals, and publishes a public dashboard.
Sources (6):
┌─ Catalog108 /jobs (static, Sub-Path 1)
├─ External board #1 (e.g. We Work Remotely / RemoteOK, public RSS available)
├─ External board #2 (e.g. AngelList / Wellfound public listings)
├─ External board #3 (e.g. Hacker News "Who is hiring?" monthly thread)
├─ External board #4 (e.g. a niche board for your domain)
└─ Google Jobs (via SERP API, google_jobs result type)
↓
┌─ Normalised schema (title, company, location, remote, salary_range, skills[], posted_at)
├─ Deduplication (fuzzy match across boards)
├─ Skill extraction (regex / spaCy NER lite)
└─ Public dashboard with weekly trends
This project produces something genuinely useful: a free job-market dashboard with cleaner data than any single board offers. That makes it a strong portfolio piece and a marketing asset for your scraping consultancy.
Required features
- Six daily sources. The diversity matters: at least one must be static HTML, at least one must require browser automation, and at least one must be a SERP API integration (see the sketch after this list).
- Cross-source deduplication. When the same role appears on 3 boards, your dashboard shows it once with three "seen on" badges.
- Skill extraction. Parse free-text descriptions into a structured skills[] array (Python, Scrapy, AWS, Docker, etc.). Keep a finite vocabulary (~200 skills); don't try to extract every noun.
- Salary normalisation. Parse "$100k–$150k", "£60,000", "10–15 LPA" into a unified currency-tagged range.
- Trend chart. At minimum: count of new postings per day for the past 30 days, by skill / region.
- Public GitHub + deployed dashboard + blog post.
Two-language requirement: Scrapy (Python) for ~half the sources, Symfony (PHP) for the other half. Browser automation (Playwright / Panther) for at least one source.
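The SERP API source can stay tiny. A minimal sketch using SerpApi's google_jobs engine, assuming the serpapi package (pip install google-search-results); the query and API key are placeholders:

# Fetch Google Jobs postings through SerpApi's google_jobs engine.
from serpapi import GoogleSearch

search = GoogleSearch({
    "engine": "google_jobs",
    "q": "scrapy engineer",          # your market of choice
    "api_key": "YOUR_SERPAPI_KEY",   # placeholder
})
results = search.get_dict()
for job in results.get("jobs_results", []):
    print(job.get("title"), "@", job.get("company_name"))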
Stretch features
- Email digest. Weekly "new postings matching your filter" email.
- Slack bot. Drop matching jobs into a channel.
- Salary lookup widget. "What's the median salary for a Scrapy engineer in Bangalore?" → answer from your data.
- ML salary prediction. Train a tiny gradient-boosted model on your scraped data; predict salary from title + skills + location.
- API endpoint. Expose your normalised data via a public REST/GraphQL API. Now it's not just a dashboard, it's a service.
Suggested external sources
Pick sources with explicit scraping policies you can honour. Several boards expose public RSS / Atom feeds that are friendlier than the HTML pages:
| Source | Why it's reasonable |
|---|---|
| RemoteOK / We Work Remotely | Have public RSS feeds (no scraping needed for the index, parse the feed) |
| Hacker News "Who is hiring?" monthly threads | All public via the HN API (sketch after this table) |
| Stack Overflow Jobs (shut down in 2022; archived on the Wayback Machine) | Historical only, but interesting for trend analysis |
| GitLab careers / company career pages of your favourite OSS projects (GitHub Jobs itself shut down in 2021) | Light protection, fits the spirit of scraping |
| Government / NGO job boards (USAJobs, EU Open Data) | Open data, often have APIs |
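The Hacker News thread needs no HTML scraping at all; the official Firebase API exposes every comment. A minimal sketch, assuming requests; the thread id is a placeholder:

# Fetch top-level comments of an HN "Who is hiring?" thread via the
# official Firebase API. Each top-level comment is one posting.
import requests

HN_ITEM = "https://hacker-news.firebaseio.com/v0/item/{}.json"
THREAD_ID = 12345678  # placeholder: grab the real id from the thread URL

thread = requests.get(HN_ITEM.format(THREAD_ID)).json()
for kid in thread.get("kids", [])[:10]:        # first 10 postings
    comment = requests.get(HN_ITEM.format(kid)).json()
    if comment and not comment.get("deleted"):
        print(comment.get("text", "")[:120])   # HTML-escaped posting text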
Avoid LinkedIn as a scraping target. The hiQ case (covered in /learn/foundations/legal-ethical-scraping) established that scraping public LinkedIn profiles likely doesn't violate the CFAA, but LinkedIn's ToS still prohibit it and LinkedIn enforces aggressively; breach-of-contract and state-law claims survive, and hiQ itself ultimately lost on the contract claim. Don't make the capstone about a legal fight.
Schema
CREATE TABLE companies (
id BIGSERIAL PRIMARY KEY,
canonical_name TEXT NOT NULL UNIQUE,
aliases TEXT[],
domain TEXT,
industry TEXT
);
CREATE TABLE jobs (
id BIGSERIAL PRIMARY KEY,
source TEXT NOT NULL,
source_id TEXT NOT NULL,
company_id BIGINT REFERENCES companies(id),
title TEXT NOT NULL,
location TEXT,
remote_kind TEXT, -- 'fully_remote', 'hybrid', 'on_site', 'unknown'
salary_currency CHAR(3),
salary_min_year INTEGER,
salary_max_year INTEGER,
skills TEXT[], -- ['python', 'scrapy', 'aws'...]
description_text TEXT,
posted_at DATE,
captured_at TIMESTAMPTZ DEFAULT now(),
UNIQUE (source, source_id)
);
CREATE TABLE job_clusters (
cluster_id BIGSERIAL PRIMARY KEY,
job_ids BIGINT[] NOT NULL,
canonical_title TEXT,
canonical_company_id BIGINT REFERENCES companies(id)
);
CREATE INDEX ON jobs USING GIN (skills);
CREATE INDEX ON jobs (posted_at);
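The UNIQUE (source, source_id) constraint makes daily re-scrapes idempotent. A loading sketch, assuming psycopg2: new rows are inserted, re-sightings just bump captured_at, which becomes the "last seen" signal the staleness rule under Common pitfalls relies on.

# Idempotent daily load: insert new jobs, refresh captured_at on re-sightings.
import psycopg2

UPSERT_SQL = """
INSERT INTO jobs (source, source_id, title, location, description_text, posted_at)
VALUES (%(source)s, %(source_id)s, %(title)s, %(location)s, %(description)s, %(posted_at)s)
ON CONFLICT (source, source_id)
DO UPDATE SET captured_at = now();
"""

def load_jobs(conn, items):
    with conn.cursor() as cur:
        for item in items:
            cur.execute(UPSERT_SQL, item)
    conn.commit()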
Deduplication
The classic problem in job aggregators. Two jobs are "the same" when:
- Same company (within alias-aware string matching).
- Same role family (titles like "Senior Python Engineer" and "Senior Python Developer" should cluster).
- Posted within ~14 days of each other.
Simple recipe (a code sketch follows this list):
- Normalise titles: lowercase, strip seniority words ("senior", "staff", "principal"), remove parenthetical tags.
- Normalise companies: case-fold, strip suffixes ("Inc.", "Ltd.", "LLC"), look up aliases.
- For each new job, query the last 14 days of jobs from the same canonical company with similar canonical title (Levenshtein distance ≤ 3 OR shared 2+ words).
- If a match: append to its cluster. If not: new cluster.
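A sketch of the normalisation and matching steps, assuming rapidfuzz (pip install rapidfuzz) for Levenshtein distance; the aliases dict maps normalised alias to canonical name and is fed from the companies table:

# Normalisation + matching for the first three steps of the recipe.
import re
from rapidfuzz.distance import Levenshtein

SENIORITY = re.compile(r"\b(senior|staff|principal|junior|lead)\b")
SUFFIXES = re.compile(r"\b(inc|ltd|llc|co|gmbh)\.?$")

def normalise_title(title: str) -> str:
    t = title.lower()
    t = re.sub(r"\(.*?\)", "", t)   # remove parenthetical tags
    t = SENIORITY.sub("", t)        # strip seniority words
    return " ".join(t.split())

def normalise_company(name: str, aliases: dict[str, str]) -> str:
    n = SUFFIXES.sub("", name.casefold().strip(" .,"))
    n = " ".join(n.split())
    return aliases.get(n, n)        # map known aliases to the canonical name

def titles_match(a: str, b: str) -> bool:
    # Levenshtein distance <= 3 OR at least 2 shared words.
    return (Levenshtein.distance(a, b) <= 3
            or len(set(a.split()) & set(b.split())) >= 2)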
This will give you ~80% precision. The dashboard shows clusters; clicking expands to all postings. Wrong groupings are a footgun in production, but for a capstone, 80% is acceptable if you surface "seen on N boards" prominently so users can verify.
Skill extraction
Keep it simple. A regex against a curated skill vocabulary covers 90% of the value:
import re

SKILLS = {
    "python": [r"\bpython\b"],
    "scrapy": [r"\bscrapy\b"],
    "playwright": [r"\bplaywright\b"],
    "aws": [r"\baws\b", r"\bamazon web services\b"],
    "docker": [r"\bdocker\b"],
    "php": [r"\bphp\b"],
    "symfony": [r"\bsymfony\b"],
    "postgres": [r"\bpostgres(ql)?\b"],
    # ... 100-200 entries
}

def extract_skills(text: str) -> list[str]:
    text_lower = text.lower()
    return [skill for skill, patterns in SKILLS.items()
            if any(re.search(p, text_lower) for p in patterns)]
Don't try NER. The recall benefit isn't worth the complexity for a capstone.
Salary normalisation
# Parse "$100k–$150k", "£60,000", "₹10–15 LPA", "€80,000 - €100,000"
def parse_salary(text: str) -> tuple[str | None, int | None, int | None]:
"""Returns (currency, min_year, max_year). Best-effort; nulls on uncertainty."""
# Detect currency symbol or code
# Detect numbers, multiplier (k, lakh, crore, M)
# Detect range vs single
# Convert to year (some boards quote monthly)
...
Document your assumptions in the README. Wrong salary numbers are worse than missing salary numbers: fail to null rather than guess.
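A minimal sketch of parse_salary covering only symbol-prefixed amounts and ranges; anything quoted in LPA or per month is deliberately nulled rather than converted, per the rule above:

import re

SYMBOLS = {"$": "USD", "£": "GBP", "€": "EUR", "₹": "INR"}
RANGE = re.compile(
    r"([$£€₹])\s*([\d,.]+)\s*(k)?"                  # currency + first amount
    r"(?:\s*[-–]\s*[$£€₹]?\s*([\d,.]+)\s*(k)?)?",   # optional second amount
    re.IGNORECASE,
)

def parse_salary(text: str) -> tuple[str | None, int | None, int | None]:
    # LPA and monthly quotes need their own branches; null rather than guess.
    if re.search(r"\b(lpa|lakh|month|monthly)\b", text, re.IGNORECASE):
        return (None, None, None)
    m = RANGE.search(text)
    if not m:
        return (None, None, None)
    def amount(num: str | None, k: str | None) -> int | None:
        if num is None:
            return None
        return int(float(num.replace(",", "")) * (1_000 if k else 1))
    lo = amount(m.group(2), m.group(3))
    hi = amount(m.group(4), m.group(5)) or lo
    return (SYMBOLS[m.group(1)], lo, hi)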
Dashboard ideas
Useful out of the box:
- Job board volume chart: daily new postings over the last 90 days, stacked by source.
- Skill demand chart: top 20 skills by posting count over the last 30 days.
- Salary distribution: by skill, by location, by remote kind.
- Recent postings table: filterable by skill / location / remote.
- Company spotlight: top 20 companies hiring this month.
Don't over-design. A static site with charts (Chart.js / ApexCharts) and a JSON-backed search is enough. Avoid heavy frontend frameworks; the data is small.
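Each chart can be a JSON file regenerated nightly. A sketch for the skill-demand chart, assuming psycopg2 and the schema above; the output path is arbitrary:

# Top 20 skills by posting count over the last 30 days, as chart-ready JSON.
import json
import psycopg2

TOP_SKILLS_SQL = """
SELECT unnest(skills) AS skill, count(*) AS postings
FROM jobs
WHERE posted_at >= current_date - 30
GROUP BY skill
ORDER BY postings DESC
LIMIT 20;
"""

def write_skill_demand(conn, path="dashboard/data/skill_demand.json"):
    with conn.cursor() as cur:
        cur.execute(TOP_SKILLS_SQL)
        rows = cur.fetchall()
    with open(path, "w") as f:
        json.dump([{"skill": s, "postings": n} for s, n in rows], f)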
Common pitfalls
- Duplicate explosion. Without dedup, your "top hiring company" chart will be dominated by whichever company cross-posts most aggressively. Dedup before aggregating.
- Salary in different units. A "60k EUR/year" and a "5k EUR/month" describe the same salary. Normalise to year.
- Posting age confusion. A job "posted 30 days ago" on one board might be the same role newly cross-posted today. Cluster on the company-title key, not the post date.
- Location ambiguity. "Remote" + "London" can mean "remote, but registered in London" or "must be physically in London but works from home." Capture both.
- Stale postings. Boards rarely mark a job closed. Treat any job not seen in a daily scrape for 30 days as expired (one WHERE clause, sketched below).
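If your loader bumps captured_at on every sighting (as in the upsert sketch under Schema), the staleness rule costs nothing:

# Postings still "active" are those seen by a scrape in the last 30 days.
ACTIVE_JOBS_SQL = """
SELECT * FROM jobs
WHERE captured_at >= now() - interval '30 days';
"""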
Deployment
Same shape as Project A: a $5/mo VPS, or GitHub Actions + free-tier Postgres + GitHub Pages.
The dashboard is small enough to be a static site rebuilt nightly. The DB is the only stateful piece.
What "done" looks like
- 30 consecutive days of captured data.
- Dedup quality verified by hand on a 50-job sample (≥85% precision on clusters).
- The dashboard answers "how many jobs require Scrapy this month?" in one click.
- A non-technical friend (who isn't job-hunting) can navigate the dashboard and tell you something interesting.
- Blog post explains: sources, dedup strategy, three things that broke, cost.
Hands-on lab
Start with Catalog108 /jobs. Build a Scrapy spider that pulls every listing into your local Postgres. Add company normalisation. Now add one external source. Notice the dedup problem the moment two sources mention "Pinegrove Co., Senior PHP Engineer." Solve it, then add the next four sources.
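A skeleton for that first spider; the domain and CSS selectors are hypothetical, since they depend on Catalog108's actual markup:

# Minimal Scrapy spider for Catalog108 /jobs. The catalog108.example
# domain and all selectors are assumptions; adjust to the real markup.
import scrapy

class JobsSpider(scrapy.Spider):
    name = "catalog108_jobs"
    start_urls = ["https://catalog108.example/jobs"]

    def parse(self, response):
        for card in response.css(".job-card"):          # hypothetical selector
            yield {
                "source": "catalog108",
                "source_id": card.attrib.get("data-id"),
                "title": card.css("h2::text").get(),
                "company": card.css(".company::text").get(),
                "location": card.css(".location::text").get(),
            }
        # Follow pagination if present (selector is an assumption).
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)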
Practice this lesson on Catalog108, our first-party scraping sandbox. Open lab target → /jobs
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.