Building Your Own Lightweight Proxy Pool
When buying isn't appropriate or you need fine-grained control, here's the architecture for a self-managed proxy pool with rotation, health checks, and failover.
What you’ll learn
- Architect a proxy pool service with health checks, rotation, and failover.
- Decide when self-managed beats commercial.
- Avoid the common pitfalls of DIY proxy infrastructure.
Most projects should buy from a provider. But sometimes you need fine-grained control: internal-only routing, custom regions, or IP sources commercial providers don't offer. This lesson lays out the architecture for a DIY pool.
When DIY makes sense
| Reason | Notes |
|---|---|
| You have datacenter IPs the provider doesn't | E.g. you own a small VPS fleet across regions |
| Compliance or security requires self-hosted | Sensitive scraping; audit logs of egress |
| You're aggregating multiple commercial providers | Build a "meta-pool" across vendors |
| Very high volume where per-GB pricing breaks down | Self-hosting can amortize fixed costs |
| You're building a proxy product | This is your business |
For most others: buy from a provider. Engineering time is more expensive than proxies.
Architecture
┌──────────────┐        ┌──────────────┐        ┌──────────────┐
│   Scrapers   │  ───►  │   Pool API   │  ───►  │  Proxy IPs   │
│  (workers)   │  HTTP  │  (gateway)   │        │ (squid, HA)  │
└──────────────┘        └──────────────┘        └──────────────┘
                               │
                               ▼
                        ┌──────────────┐
                        │ Redis store  │
                        │   (state)    │
                        └──────────────┘
Three components:
- Proxy IPs. Actual machines running Squid, HAProxy, or a simple SOCKS5 server.
- Pool API / gateway. A small service scrapers talk to. Decides which proxy to use per request.
- State store. Redis (or Postgres) tracking proxy health and assignment.
The proxy nodes
Each node runs Squid (the most battle-tested option), listening on its own IP:
# /etc/squid/squid.conf, minimal anonymous proxy
http_port 0.0.0.0:3128
http_access allow all # restrict in production!
forwarded_for delete
via off
request_header_access X-Forwarded-For deny all
forwarded_for delete and via off strip identifying headers, so the target sees only the proxy IP, not your origin.
For auth, add basic auth via auth_param:
auth_param basic program /usr/lib/squid/basic_ncsa_auth /etc/squid/htpasswd
auth_param basic realm Proxy
acl users proxy_auth REQUIRED
http_access allow users
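With auth enabled, workers put the credentials in the proxy URL. A quick smoke test from Python, where the address and credentials are placeholders, not values from this setup:

import httpx

# hypothetical user created with: htpasswd -c /etc/squid/htpasswd scraper
proxy = "http://scraper:s3cret@203.0.113.7:3128"
# httpx >= 0.26 takes proxy=; older versions use proxies=
with httpx.Client(proxy=proxy, timeout=10) as client:
    print(client.get("https://api.ipify.org").text)  # should print the proxy node's IP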
The pool API
A small Python service that scrapers query for "give me a proxy":
from flask import Flask, jsonify
import random, redis

app = Flask(__name__)
r = redis.Redis()

@app.route("/proxy")
def get_proxy():
    healthy = [p.decode() for p in r.smembers("proxies:healthy")]
    if not healthy:
        return jsonify({"error": "no proxies"}), 503
    return jsonify({"proxy": random.choice(healthy)})

@app.route("/report/<status>/<path:proxy>", methods=["POST"])
def report(status, proxy):
    if status == "fail":
        r.smove("proxies:healthy", "proxies:dead", proxy)
        r.set(f"proxies:dead:{proxy}", 1, ex=300)  # tombstone: skip re-checks until it expires
    return "", 204
Workers ask /proxy, get one, use it, report success/failure. The Redis sets manage state.
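On the worker side the loop looks like this, a minimal sketch assuming the gateway runs at a hypothetical internal address:

import httpx

POOL_API = "http://pool.internal:5000"  # hypothetical gateway address

def fetch(url):
    proxy = httpx.get(f"{POOL_API}/proxy").json()["proxy"]
    try:
        resp = httpx.get(url, proxy=proxy, timeout=10)
        resp.raise_for_status()
        return resp
    except httpx.HTTPError:
        # tell the gateway so the proxy gets quarantined
        # (Flask's <path:proxy> converter accepts the slashes in the proxy URL)
        httpx.post(f"{POOL_API}/report/fail/{proxy}")
        raise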
For higher throughput, build the gateway as a forwarding proxy itself: scrapers point at it, and it relays each request to a chosen backend. mitmproxy can be scripted to do this, or write a custom Go/Rust binary.
Health checks
A separate process probes each proxy every minute:
import time

import httpx
import redis

r = redis.Redis()

def get_all_proxies():
    # assumes every provisioned proxy is registered in this set
    return [p.decode() for p in r.smembers("proxies:all")]

def check(proxy):
    try:
        # httpx >= 0.26 takes proxy=; older versions use proxies=
        with httpx.Client(proxy=proxy, timeout=5) as client:
            resp = client.get("https://api.ipify.org?format=json")
        if resp.status_code == 200:
            actual_ip = resp.json()["ip"]  # confirm it matches the node's expected IP
            r.sadd("proxies:healthy", proxy)
            r.srem("proxies:dead", proxy)
            return True
    except Exception:
        pass
    r.smove("proxies:healthy", "proxies:dead", proxy)
    return False

while True:
    for p in get_all_proxies():
        check(p)
    time.sleep(60)
A persistently failing proxy stays quarantined; the background checker re-probes it and moves it back to the healthy set once it passes.
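To honor the tombstone TTL set by the report endpoint, the checker can skip recently failed proxies. A small sketch, reusing check() and the key names assumed above:

def recheck_dead():
    for raw in r.smembers("proxies:dead"):
        p = raw.decode()
        if r.exists(f"proxies:dead:{p}"):  # tombstone still live: too soon to retry
            continue
        check(p)  # moves the proxy back to proxies:healthy on success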
Rotation strategies
The simplest gateway picks randomly. More sophisticated options:
- Round-robin. Even distribution; see the sketch after this list.
- Least recently used. Spreads load toward idle proxies.
- Latency-weighted. Prefer fast proxies (covered in §4.30).
- Per-session sticky. Hash session ID → proxy.
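A minimal round-robin picker over the healthy set. The counter key is an assumption; sorting gives a stable order since Redis sets are unordered:

def get_round_robin():
    healthy = sorted(p.decode() for p in r.smembers("proxies:healthy"))
    if not healthy:
        return None
    idx = r.incr("proxies:rr_counter") % len(healthy)  # INCR is atomic across workers
    return healthy[idx]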
For sticky:
def get_for_session(session_id):
    proxy = r.get(f"session:{session_id}:proxy")
    if proxy:
        return proxy.decode()
    healthy = [p.decode() for p in r.smembers("proxies:healthy")]
    p = random.choice(healthy)
    r.setex(f"session:{session_id}:proxy", 1800, p)  # 30-min stickiness
    return p
Geographic awareness
Tag each proxy with region in Redis:
r.sadd("proxies:region:us", "http://1.2.3.4:3128")
r.sadd("proxies:region:de", "http://5.6.7.8:3128")
Workers request /proxy?region=de. The API picks from the right set.
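A region-aware drop-in replacement for the earlier /proxy handler; a sketch where SINTER keeps only proxies that are both healthy and in the requested region (it reuses app, r, jsonify, and random from the gateway above):

from flask import request

@app.route("/proxy")
def get_proxy():
    region = request.args.get("region")
    if region:
        members = r.sinter("proxies:healthy", f"proxies:region:{region}")
    else:
        members = r.smembers("proxies:healthy")
    pool = [p.decode() for p in members]
    if not pool:
        return jsonify({"error": "no proxies"}), 503
    return jsonify({"proxy": random.choice(pool)})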
Free proxy aggregation (be careful)
You can scrape free proxy lists and validate them:
import httpx
from bs4 import BeautifulSoup

sources = [
    "https://free-proxy-list.net/",
    "https://www.sslproxies.org/",
]

def scrape_lists():
    candidates = []
    for u in sources:
        resp = httpx.get(u)  # resp, not r, to avoid shadowing the Redis client
        for row in BeautifulSoup(resp.text, "lxml").select("table tbody tr"):
            ip = row.select_one("td:nth-child(1)").text
            port = row.select_one("td:nth-child(2)").text
            candidates.append(f"http://{ip}:{port}")
    return candidates
Reality:
- 95% of "free" proxies are dead within minutes.
- The 5% alive are often compromised devices.
- Only a tiny fraction survive validation.
- Legal/ethical concerns: you're routing traffic through machines you don't control, possibly without the owner's consent.
This approach is mostly for learning, not production. Don't run scraping that needs reliability on a pool of free proxies.
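If you do experiment, run every candidate through the same health check before it ever enters the pool. A sketch reusing scrape_lists() and check() from earlier; expect a very low survival rate:

candidates = scrape_lists()
alive = [p for p in candidates if check(p)]
print(f"{len(alive)}/{len(candidates)} candidates survived validation")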
When self-hosted breaks down
Self-hosted pools fail when:
- You need millions of IPs. Renting that many servers is more expensive than buying residential.
- You need residential or mobile. Those IPs come from real consumer devices and carrier networks; you can't self-host them.
- Maintenance burden. Proxies fail, get IP-blocked, need updates. Running 50 Squid boxes is real ops work.
- Geographic diversity costs. Renting boxes in 30 countries adds up; providers do this at scale.
For most projects, "self-host the easy datacenter, buy residential/mobile" is the practical hybrid.
Composing with commercial providers
A common pattern: the pool API abstracts "give me a proxy," then internally picks among:
- Your self-hosted datacenter pool (cheapest, easy targets).
- A commercial residential gateway (mid-tier).
- A commercial mobile gateway (hardest).
Pick by the target's known difficulty (a label per target): cheap proxies first, escalate on failure, as in the sketch below.
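A sketch of the tier selection, where the tier names, the difficulty labels, and the Redis schema are all assumptions for illustration:

TIERS = ["selfhosted", "residential", "mobile"]  # cheapest first

def proxy_for(target, failures=0):
    # difficulty: 0 = easy, 1 = mid, 2 = hard (hypothetical per-target label)
    difficulty = int(r.hget("target:difficulty", target) or 0)
    tier = TIERS[min(difficulty + failures, len(TIERS) - 1)]  # escalate one tier per failure
    pool = [p.decode() for p in r.smembers(f"proxies:tier:{tier}")]
    return random.choice(pool) if pool else None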
Hands-on lab
If you have a spare VPS or two:
- Install Squid on each. Configure basic auth.
- Write a tiny Flask gateway that gives out a random proxy.
- Run a scraper through it. Validate failover by killing a Squid and watching the gateway adjust.
You'll be surprised how much functionality fits in ~200 lines of code, and how quickly the operational burden grows past five proxies. That's the build-vs-buy line.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.