Git and GitHub for Scraper Projects
The minimum Git you need to keep scraper projects sane, collaborate, and ship to production. Plus the patterns specific to scraper repos.
What you’ll learn
- Track changes with the commit-branch-merge workflow.
- Use a `.gitignore` to keep secrets, venvs, vendor/, and scraped data out of git.
- Push to GitHub, open a PR, and work with branches cleanly.
- Avoid the scraper-specific landmines: committed credentials, giant scraped datasets, accidentally exposing scraping targets.
Git is non-negotiable for any project you'll touch more than once. Scrapers add a few specific concerns: credentials, large output files, and (if you publish) sensitive targets. Twenty minutes of discipline saves real pain later.
The 10 commands that cover 95% of daily use
# One-time setup
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
git config --global init.defaultBranch main
# Start a repo
git init # in an existing directory
git clone <url> # clone someone else's
# Daily flow
git status # what's changed?
git add file1.py file2.py # stage specific files
git add . # stage everything (careful)
git commit -m "Add product page scraper"
git log --oneline # recent commits
# Branching
git checkout -b feature/login-flow # new branch
git switch main # back to main
git merge feature/login-flow # merge a branch in
# Remote (GitHub)
git remote add origin git@github.com:you/repo.git
git push -u origin main
git pull
If you can do those, you can use Git. Everything else is variations.
Make commits small and meaningful
Bad:
git commit -m "stuff"
git commit -m "wip"
git commit -m "fix"
Good:
git commit -m "Add User-Agent rotation to HttpClient"
git commit -m "Fix CSRF token re-fetch on multi-step login"
git commit -m "Add reviews pagination to ProductDetailScraper"
Each commit should be:
- Self-contained, could be reverted without breaking anything else.
- Single-topic, one logical change. Easier to review, easier to bisect when bugs appear.
- Imperative present-tense subject, "Add X," "Fix Y," "Refactor Z." Reads like an instruction Git itself would follow.
When in doubt, smaller. A scraper repo with 50 small commits is easier to navigate than 5 mega-commits.
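When one file contains two unrelated changes, interactive staging keeps each commit single-topic. A quick sketch (the path and messages are placeholders):
# pick hunks interactively; stage only the ones for this change
git add -p scrapers/http_client.py
git commit -m "Add User-Agent rotation to HttpClient"
# then stage and commit the leftover hunks separately
git add -p scrapers/http_client.py
git commit -m "Fix retry backoff on 429 responses"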
.gitignore, non-negotiable
Every scraper project needs at least this:
# Environments
.venv/
venv/
__pycache__/
*.pyc
# PHP
vendor/
composer.lock # debatable; usually KEEP committed (see note below)
# Editor
.idea/
.vscode/
.DS_Store
# Secrets
.env
.env.local
config/config.php # if it contains real credentials
# Scraper output
output/
data/
*.csv
*.jsonl
*.sqlite
logs/
Two notes:
- composer.lock: usually committed (so deploys are reproducible). The line above is a placeholder for the rare case where you've decided otherwise.
- Scraped data files: gitignored. Repos shouldn't contain hundreds of MB of CSVs that change daily.
If you accidentally commit a file you shouldn't have, don't just delete it and re-commit; the data lives on in history. Use git filter-repo or, simpler, treat the secret as compromised and rotate it. This is why credentials should never be in code in the first place.
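A sketch of the history rewrite, assuming git-filter-repo is installed and the leaked file is config/config.php (substitute the real path); rotate the credential either way:
# run in a fresh clone; this rewrites every commit that touched the file
git filter-repo --invert-paths --path config/config.php
# filter-repo strips the origin remote as a safety measure; re-add it, then force-push
git remote add origin git@github.com:you/repo.git
git push --force origin main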
Secrets, never in git
Three layers of protection:
- Environment variables. Read credentials from `os.environ` (Python) or `getenv()` (PHP). Never hard-code them (a loading sketch follows this list).
- `.env` files, gitignored. Use `python-dotenv` or `vlucas/phpdotenv` to load them. Commit a `.env.example` with dummy values to document the schema:
# .env.example (commit this)
HOSTINGER_DB_HOST=
HOSTINGER_DB_USER=
HOSTINGER_DB_PASS=
# .env (gitignored, real values)
HOSTINGER_DB_HOST=localhost
HOSTINGER_DB_USER=catalog108
HOSTINGER_DB_PASS=actual-real-password
- Pre-commit hooks. Tools like `gitleaks` or `trufflehog` scan diffs for things that look like API keys before they get committed.
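A minimal loading sketch in Python, assuming python-dotenv is installed and the variable names from .env.example above:
# settings.py (sketch)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory into os.environ

DB_HOST = os.environ["HOSTINGER_DB_HOST"]  # KeyError if unset: fail fast
DB_USER = os.environ["HOSTINGER_DB_USER"]
DB_PASS = os.environ["HOSTINGER_DB_PASS"]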
Branching strategy
Two reasonable choices for solo / small-team scraper projects:
"Main + feature branches"
main ← always deployable
├── feature/x ← work in progress
└── fix/y ← work in progress
Work on a branch, push, open a PR, merge to main when ready. Simple, widely understood.
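In command form, one full cycle might look like this (branch name, file, and messages are placeholders; gh is the GitHub CLI):
git switch -c feature/login-flow
git add scrapers/login.py
git commit -m "Add multi-step login flow"
git push -u origin feature/login-flow
gh pr create --fill               # title and body taken from the commit
# review the diff on GitHub, wait for CI, then:
gh pr merge --squash --delete-branch
git switch main && git pull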
"Just commit to main"
For solo projects with low blast radius, committing directly to main is fine. Just keep commits small.
Pick one and stick with it. Inconsistent branching habits create more confusion than either pure strategy.
GitHub: the social layer
GitHub turns Git into collaboration. The four things you'll do most:
1. Create the repo
# After git init + first commit
gh repo create my-scraper --public --source=. --remote=origin --push
# or via the GitHub web UI: New repo → copy the URL → git remote add origin <url>
2. Pull Requests
A PR is "here are some commits, please review and merge." Even on solo projects, opening PRs (and then merging your own) gives you:
- A diff to self-review
- A discussion thread tied to the change
- A CI run that confirms tests pass before merge
- Clean history (squash merge bundles N commits into one)
3. Issues
Use them. "TODO" comments in code rot; GitHub issues persist, are searchable, and force you to write the problem down clearly.
4. Actions (CI)
For scraping projects, the typical CI is:
# .github/workflows/test.yml
name: test
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -r requirements.txt
      - run: pytest
GitHub Actions can also run scrapers on a schedule:
on:
  schedule:
    - cron: "0 0 * * *" # daily at 00:00 UTC
For light scrapers, this is the cheapest way to run a daily crawl: no server, no cost (within the free tier).
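A sketch of a scheduled run, assuming a hypothetical scrape.py entry point and any credentials stored as repository secrets (Settings → Secrets and variables → Actions):
# .github/workflows/scrape.yml (sketch)
name: daily-scrape
on:
  schedule:
    - cron: "0 0 * * *"
  workflow_dispatch:            # allow manual runs from the Actions tab
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -r requirements.txt
      - run: python scrape.py
        env:
          HOSTINGER_DB_PASS: ${{ secrets.HOSTINGER_DB_PASS }}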
Scraper-specific Git gotchas
1. Don't commit the target site's full HTML
Tempting for debugging: "I'll commit the saved HTML so I can iterate on parsers." Quickly bloats the repo. Use a samples/ folder, gitignored, with a samples/.gitkeep file so the structure exists.
2. Don't commit scraper output as "data"
output/products.jsonl gets larger every day. Don't track it. Push outputs to S3, a database, or a separate "data" repo that lives by itself.
3. Don't make the scraping target obvious in your repo name
If you've built a tracker for a specific competitor, naming it scrape-acme-corp on public GitHub puts a target on your back. Use a generic name for public repos when sensitivity matters.
4. .gitkeep / placeholder files
Git doesn't track empty directories. To keep output/ in the repo structure, add output/.gitkeep (a zero-byte placeholder).
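One way to combine this with the .gitignore above, assuming you want the directory tracked but its contents ignored:
# .gitignore: ignore everything in output/ except the placeholder
output/*
!output/.gitkeep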
SSH vs HTTPS for GitHub
# HTTPS, works everywhere, requires personal access token for write
git remote add origin https://github.com/you/repo.git
# SSH, recommended; one-time setup of an SSH key
ssh-keygen -t ed25519 -C "you@example.com"
# add the public key (~/.ssh/id_ed25519.pub) to GitHub Settings → SSH keys
git remote add origin git@github.com:you/repo.git
SSH means no password / token prompts ever. Set it up once.
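To confirm the key is picked up before your first push:
ssh -T git@github.com
# expected: "Hi <username>! You've successfully authenticated..."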
Hands-on lab
In an empty directory:
- `git init`
- Create `README.md` with a one-line description and a `.gitignore` with the basics above.
- Commit. Push to a fresh GitHub repo (private if you're nervous about the content).
- Create a branch `feature/setup`, add a `requirements.txt`, commit, push the branch, open a PR.
- Self-merge the PR. Pull main back to your local (the whole sequence is sketched below).
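If you want to check your steps, the lab condenses to roughly this (repo name, file contents, and messages are placeholders; gh is the GitHub CLI):
git init
echo "Catalog scraper" > README.md
# create .gitignore with the basics from the section above
git add README.md .gitignore
git commit -m "Add README and .gitignore"
gh repo create my-scraper --private --source=. --remote=origin --push
git switch -c feature/setup
touch requirements.txt
git add requirements.txt
git commit -m "Add requirements.txt"
git push -u origin feature/setup
gh pr create --fill
gh pr merge --squash --delete-branch
git switch main && git pull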
You've executed the entire Git workflow you'll repeat for every scraping project from here on.
Quiz, check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.