Git and GitHub for Scraper Projects
The minimum Git you need to keep scraper projects sane, collaborate, and ship to production. Plus the patterns specific to scraper repos.
What you’ll learn
- Track changes with the commit-branch-merge workflow.
- Use a `.gitignore` to keep secrets, venvs, vendor/, and scraped data out of git.
- Push to GitHub, open a PR, and work with branches cleanly.
- Avoid the scraper-specific landmines: committed credentials, giant scraped datasets, accidentally exposing scraping targets.
Git is non-negotiable for any project you'll touch more than once. Scrapers add a few specific concerns: credentials, large output files, and (if you publish) sensitive targets. Twenty minutes of discipline saves real pain later.
The 10 commands that cover 95% of daily use
# One-time setup
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
git config --global init.defaultBranch main
# Start a repo
git init # in an existing directory
git clone <url> # clone someone else's
# Daily flow
git status # what's changed?
git add file1.py file2.py # stage specific files
git add . # stage everything (careful)
git commit -m "Add product page scraper"
git log --oneline # recent commits
# Branching
git checkout -b feature/login-flow # new branch
git switch main # back to main
git merge feature/login-flow # merge a branch in
# Remote (GitHub)
git remote add origin git@github.com:you/repo.git
git push -u origin main
git pull
If you can do those, you can use Git. Everything else is variations.
Make commits small and meaningful
Bad:
git commit -m "stuff"
git commit -m "wip"
git commit -m "fix"
Good:
git commit -m "Add User-Agent rotation to HttpClient"
git commit -m "Fix CSRF token re-fetch on multi-step login"
git commit -m "Add reviews pagination to ProductDetailScraper"
Each commit should be:
- Self-contained, could be reverted without breaking anything else.
- Single-topic, one logical change. Easier to review, easier to bisect when bugs appear.
- Imperative present-tense subject, "Add X," "Fix Y," "Refactor Z." Reads like an instruction Git itself would follow.
When in doubt, smaller. A scraper repo with 50 small commits is easier to navigate than 5 mega-commits.
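When one file contains two unrelated changes, interactive staging keeps each commit single-topic. A quick sketch (the path and messages are placeholders):
# pick hunks interactively; stage only the ones for this change
git add -p scrapers/http_client.py
git commit -m "Add User-Agent rotation to HttpClient"
# then stage and commit the leftover hunks separately
git add -p scrapers/http_client.py
git commit -m "Fix retry backoff on 429 responses"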
.gitignore, non-negotiable
Every scraper project needs at least this:
# Environments
.venv/
venv/
__pycache__/
*.pyc
# PHP
vendor/
composer.lock # debatable; usually KEEP committed (see note below)
# Editor
.idea/
.vscode/
.DS_Store
# Secrets
.env
.env.local
config/config.php # if it contains real credentials
# Scraper output
output/
data/
*.csv
*.jsonl
*.sqlite
logs/
Two notes:
- composer.lock: usually committed (so deploys are reproducible). The line above is a placeholder for the rare case where you've decided otherwise.
- Scraped data files: gitignored. Repos shouldn't contain hundreds of MB of CSVs that change daily.
If you accidentally commit a file you shouldn't have, don't just delete it and re-commit; the data lives on in history. Use git filter-repo or, simpler, treat the secret as compromised and rotate it. This is why credentials should never be in code in the first place.
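A sketch of the history rewrite, assuming git-filter-repo is installed and the leaked file is config/config.php (substitute the real path); rotate the credential either way:
# run in a fresh clone; this rewrites every commit that touched the file
git filter-repo --invert-paths --path config/config.php
# filter-repo strips the origin remote as a safety measure; re-add it, then force-push
git remote add origin git@github.com:you/repo.git
git push --force origin main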
Secrets, never in git
Three layers of protection:
- Environment variables. Read credentials from `os.environ` (Python) or `getenv()` (PHP). Never hard-code them (a loading sketch follows this list).
- `.env` files, gitignored. Use `python-dotenv` or `vlucas/phpdotenv` to load them. Commit a `.env.example` with dummy values to document the schema:
# .env.example (commit this)
HOSTINGER_DB_HOST=
HOSTINGER_DB_USER=
HOSTINGER_DB_PASS=
# .env (gitignored, real values)
HOSTINGER_DB_HOST=localhost
HOSTINGER_DB_USER=catalog108
HOSTINGER_DB_PASS=actual-real-password
- Pre-commit hooks. Tools like `gitleaks` or `trufflehog` scan diffs for things that look like API keys before they get committed.
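A minimal loading sketch in Python, assuming python-dotenv is installed and the variable names from .env.example above:
# settings.py (sketch)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory into os.environ

DB_HOST = os.environ["HOSTINGER_DB_HOST"]  # KeyError if unset: fail fast
DB_USER = os.environ["HOSTINGER_DB_USER"]
DB_PASS = os.environ["HOSTINGER_DB_PASS"]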
Branching strategy
Two reasonable choices for solo / small-team scraper projects:
"Main + feature branches"
main ← always deployable
├── feature/x ← work in progress
└── fix/y ← work in progress
Work on a branch, push, open a PR, merge to main when ready. Simple, widely understood.
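In command form, one full cycle might look like this (branch name, file, and messages are placeholders; gh is the GitHub CLI):
git switch -c feature/login-flow
git add scrapers/login.py
git commit -m "Add multi-step login flow"
git push -u origin feature/login-flow
gh pr create --fill               # title and body taken from the commit
# review the diff on GitHub, wait for CI, then:
gh pr merge --squash --delete-branch
git switch main && git pull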
"Just commit to main"
For solo projects with low blast radius, committing directly to main is fine. Just keep commits small.
Pick one and stick with it. Inconsistent branching habits create more confusion than either pure strategy.
GitHub: the social layer
GitHub turns Git into collaboration. The four things you'll do most:
1. Create the repo
# After git init + first commit
gh repo create my-scraper --public --source=. --remote=origin --push
# or via the GitHub web UI: New repo → copy the URL → git remote add origin <url>
2. Pull Requests
A PR is "here are some commits, please review and merge." Even on solo projects, opening PRs (and then merging your own) gives you:
- A diff to self-review
- A discussion thread tied to the change
- A CI run that confirms tests pass before merge
- Clean history (squash merge bundles N commits into one)
3. Issues
Use them. "TODO" comments in code rot; GitHub issues persist, are searchable, and force you to write the problem down clearly.
4. Actions (CI)
For scraping projects, the typical CI is:
# .github/workflows/test.yml
name: test
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -r requirements.txt
      - run: pytest
GitHub Actions can also run scrapers on a schedule:
on:
  schedule:
    - cron: "0 0 * * *" # daily at 00:00 UTC
For light scrapers, this is the cheapest way to run a daily crawl: no server, no cost (within the free tier).
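A sketch of a scheduled run, assuming a hypothetical scrape.py entry point and any credentials stored as repository secrets (Settings → Secrets and variables → Actions):
# .github/workflows/scrape.yml (sketch)
name: daily-scrape
on:
  schedule:
    - cron: "0 0 * * *"
  workflow_dispatch:            # allow manual runs from the Actions tab
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -r requirements.txt
      - run: python scrape.py
        env:
          HOSTINGER_DB_PASS: ${{ secrets.HOSTINGER_DB_PASS }}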
Scraper-specific Git gotchas
1. Don't commit the target site's full HTML
Tempting for debugging: "I'll commit the saved HTML so I can iterate on parsers." Quickly bloats the repo. Use a samples/ folder, gitignored, with a samples/.gitkeep file so the structure exists.
2. Don't commit scraper output as "data"
output/products.jsonl gets larger every day. Don't track it. Push outputs to S3, a database, or a separate "data" repo that lives by itself.
3. Don't make the scraping target obvious in your repo name
If you've built a tracker for a specific competitor, naming it scrape-acme-corp on public GitHub puts a target on your back. Use a generic name for public repos when sensitivity matters.
4. .gitkeep / placeholder files
Git doesn't track empty directories. To keep output/ in the repo structure, add output/.gitkeep (a zero-byte placeholder).
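One way to combine this with the .gitignore above, assuming you want the directory tracked but its contents ignored:
# .gitignore: ignore everything in output/ except the placeholder
output/*
!output/.gitkeep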
SSH vs HTTPS for GitHub
# HTTPS, works everywhere, requires personal access token for write
git remote add origin https://github.com/you/repo.git
# SSH, recommended; one-time setup of an SSH key
ssh-keygen -t ed25519 -C "you@example.com"
# add the public key (~/.ssh/id_ed25519.pub) to GitHub Settings → SSH keys
git remote add origin git@github.com:you/repo.git
SSH means no password / token prompts ever. Set it up once.
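To confirm the key is picked up before your first push:
ssh -T git@github.com
# expected: "Hi <username>! You've successfully authenticated..."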
Hands-on lab
In an empty directory:
- `git init`
- Create `README.md` with a one-line description and a `.gitignore` with the basics above.
- Commit. Push to a fresh GitHub repo (private if you're nervous about the content).
- Create a branch `feature/setup`, add a `requirements.txt`, commit, push the branch, open a PR.
- Self-merge the PR. Pull main back to your local (the whole sequence is sketched below).
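If you want to check your steps, the lab condenses to roughly this (repo name, file contents, and messages are placeholders; gh is the GitHub CLI):
git init
echo "Catalog scraper" > README.md
# create .gitignore with the basics from the section above
git add README.md .gitignore
git commit -m "Add README and .gitignore"
gh repo create my-scraper --private --source=. --remote=origin --push
git switch -c feature/setup
touch requirements.txt
git add requirements.txt
git commit -m "Add requirements.txt"
git push -u origin feature/setup
gh pr create --fill
gh pr merge --squash --delete-branch
git switch main && git pull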
You've executed the entire Git workflow you'll repeat for every scraping project from here on.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.