
4.31 · Intermediate · 5 min read

Building Your Own Lightweight Proxy Pool

When buying isn't appropriate or you need fine-grained control, here's the architecture for a self-managed proxy pool with health checks, rotation, and failover.

What you’ll learn

  • Architect a proxy pool service with health checks, rotation, and failover.
  • Decide when self-managed beats commercial.
  • Avoid the common pitfalls of DIY proxy infrastructure.

Most projects should buy from a provider. But sometimes you need control, internal-only routing, custom regions, or sources commercial providers don't offer. This lesson is the architecture for a DIY pool.

When DIY makes sense

Reason                                              Notes
You have datacenter IPs the provider doesn't        E.g. you own a small VPS fleet across regions
Compliance or security requires self-hosting        Sensitive scraping; audit logs of egress
You're aggregating multiple commercial providers    Build a "meta-pool" across vendors
Very high volume where per-GB pricing breaks down   Self-hosting can amortize the cost
You're building a proxy product                     This is your business

For most others: buy from a provider. Engineering time is more expensive than proxies.

Architecture

┌────────────┐       ┌────────────┐       ┌─────────────┐
│  Scrapers  │ ────► │  Pool API  │ ────► │  Proxy IPs  │
│ (workers)  │ HTTP  │ (gateway)  │       │ (Squid/HA)  │
└────────────┘       └─────┬──────┘       └─────────────┘
                           │
                           ▼
                     ┌────────────┐
                     │ Redis store│
                     │  (state)   │
                     └────────────┘

Three components:

  1. Proxy IPs. Actual machines running Squid, HAProxy, or a simple SOCKS5 server.
  2. Pool API / gateway. A small service scrapers talk to. Decides which proxy to use per request.
  3. State store. Redis (or Postgres) tracking proxy health and assignment.
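
Proxies enter the pool by being registered in the state store. A minimal seeding sketch, with placeholder IPs, using the proxies:healthy set that everything below relies on:

import redis

r = redis.Redis()

# Seed the pool: every node you provision starts out healthy;
# the health checker (below) demotes any that fail.
for node in ["http://203.0.113.10:3128", "http://203.0.113.11:3128"]:
    r.sadd("proxies:healthy", node)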

The proxy nodes

Each node runs Squid (the most battle-tested option), listening on its own IP:

# /etc/squid/squid.conf, minimal anonymous proxy
http_port 0.0.0.0:3128
http_access allow all  # restrict in production!
forwarded_for delete
via off
request_header_access X-Forwarded-For deny all

The forwarded_for delete and via off directives strip identifying headers: the target sees only the proxy IP, not your origin.
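
You can verify the stripping by fetching a header-echo endpoint through the node. A quick sketch with a placeholder node IP:

import httpx

# httpx >= 0.26 takes proxy=; older versions use proxies=
resp = httpx.get(
    "https://httpbin.org/headers",
    proxy="http://203.0.113.10:3128",  # placeholder: your node's IP
)
print(resp.json()["headers"])  # should show no Via or X-Forwarded-For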

For auth, add basic auth via auth_param:

auth_param basic program /usr/lib/squid/basic_ncsa_auth /etc/squid/htpasswd
auth_param basic realm Proxy
acl users proxy_auth REQUIRED
http_access allow users
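
Scrapers then embed the credentials in the proxy URL. A client-side sketch, with a hypothetical user created in /etc/squid/htpasswd:

import httpx

proxy = "http://scraper:s3cret@203.0.113.10:3128"  # hypothetical credentials

resp = httpx.get("https://api.ipify.org?format=json", proxy=proxy, timeout=10)
print(resp.json())  # reports the proxy node's IP, not yours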

The pool API

A small Python service that scrapers query for "give me a proxy":

from flask import Flask, jsonify, request
import random

import redis

app = Flask(__name__)
r = redis.Redis()

@app.route("/proxy")
def get_proxy():
    healthy = [p.decode() for p in r.smembers("proxies:healthy")]
    if not healthy:
        return jsonify({"error": "no proxies"}), 503
    return jsonify({"proxy": random.choice(healthy)})

@app.route("/report/<status>", methods=["POST"])
def report(status):
    # The proxy URL travels as a query parameter; path routing would
    # mangle the slashes in "http://".
    proxy = request.args["proxy"]
    if status == "fail":
        # Quarantine. The health checker keeps re-probing dead proxies
        # and moves them back to healthy once they recover.
        r.smove("proxies:healthy", "proxies:dead", proxy)
    return ""

Workers ask /proxy, get one, use it, report success/failure. The Redis sets manage state.
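
In a worker, that loop might look like this, assuming the gateway above runs on localhost:5000:

import httpx

POOL_API = "http://localhost:5000"  # assumed gateway address

def fetch(url):
    proxy = httpx.get(f"{POOL_API}/proxy").json()["proxy"]
    try:
        resp = httpx.get(url, proxy=proxy, timeout=10)
        httpx.post(f"{POOL_API}/report/ok", params={"proxy": proxy})
        return resp
    except httpx.TransportError:
        httpx.post(f"{POOL_API}/report/fail", params={"proxy": proxy})
        raise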

For higher throughput, build the gateway as a transparent TCP/HTTP proxy itself: scrapers point at it, and it forwards each connection to a chosen backend. mitmproxy can be scripted to do this, or you can write a custom Go/Rust binary.

Health checks

A separate process probes each proxy every minute:

import time

import httpx
import redis

r = redis.Redis()

def get_all_proxies():
    # Every proxy ever registered, healthy or quarantined.
    return [p.decode() for p in r.sunion("proxies:healthy", "proxies:dead")]

def check(proxy):
    try:
        with httpx.Client(proxy=proxy, timeout=5) as client:
            resp = client.get("https://api.ipify.org?format=json")
        if resp.status_code == 200:
            actual_ip = resp.json()["ip"]  # confirm it matches the expected exit IP
            r.sadd("proxies:healthy", proxy)
            r.srem("proxies:dead", proxy)
            return True
    except Exception:
        pass
    r.smove("proxies:healthy", "proxies:dead", proxy)
    return False

while True:
    for p in get_all_proxies():
        check(p)
    time.sleep(60)

A proxy that keeps failing stays quarantined in the dead set; the same loop re-probes it every cycle and promotes it back once it recovers.

Rotation strategies

The simplest gateway picks randomly. More sophisticated options:

  • Round-robin. Even distribution across the pool (see the sketch after this list).
  • Least recently used. Hand out the proxy that has been idle longest.
  • Latency-weighted. Prefer fast proxies (covered in §4.30).
  • Per-session sticky. Hash session ID → proxy.
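
Round-robin, for instance, can lean on a shared counter so every gateway replica rotates through the same sequence. A minimal sketch against the Redis sets above, where proxies:rr is a hypothetical counter key:

def get_round_robin():
    # sorted() because Redis set order is undefined
    healthy = sorted(p.decode() for p in r.smembers("proxies:healthy"))
    if not healthy:
        return None
    i = r.incr("proxies:rr")  # INCR is atomic across replicas
    return healthy[i % len(healthy)]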

For per-session stickiness:

def get_for_session(session_id):
    proxy = r.get(f"session:{session_id}:proxy")
    if proxy:
        return proxy.decode()
    healthy = [p.decode() for p in r.smembers("proxies:healthy")]
    p = random.choice(healthy)
    r.setex(f"session:{session_id}:proxy", 1800, p)  # 30-minute stickiness
    return p

Geographic awareness

Tag each proxy with region in Redis:

r.sadd("proxies:region:us", "http://1.2.3.4:3128")
r.sadd("proxies:region:de", "http://5.6.7.8:3128")

Workers request /proxy?region=de. The API picks from the right set.
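
A sketch of that lookup as a drop-in replacement for the gateway's /proxy route (same app, r, and imports as above):

@app.route("/proxy")
def get_proxy():
    region = request.args.get("region")
    key = f"proxies:region:{region}" if region else "proxies:healthy"
    # Intersect with the healthy set so a dead proxy in the right
    # region is never handed out.
    pool = [p.decode() for p in r.sinter(key, "proxies:healthy")]
    if not pool:
        return jsonify({"error": "no healthy proxies in region"}), 503
    return jsonify({"proxy": random.choice(pool)})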

Free proxy aggregation (be careful)

You can scrape free proxy lists and validate them:

import httpx
from bs4 import BeautifulSoup

sources = [
    "https://free-proxy-list.net/",
    "https://www.sslproxies.org/",
]

def scrape_lists():
    candidates = []
    for url in sources:
        resp = httpx.get(url)
        for row in BeautifulSoup(resp.text, "lxml").select("table tbody tr"):
            ip = row.select_one("td:nth-child(1)").text
            port = row.select_one("td:nth-child(2)").text
            candidates.append(f"http://{ip}:{port}")
    return candidates
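
Running the candidates through the health checker's check() from earlier shows how few survive:

candidates = scrape_lists()
alive = [c for c in candidates if check(c)]
print(f"{len(alive)} usable out of {len(candidates)} scraped")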

Reality:

  • 95% of "free" proxies are dead within minutes.
  • The 5% that respond are often compromised devices.
  • Validation throughput is awful: you burn time and bandwidth probing thousands of dead endpoints.
  • Legal and ethical concerns: you may be routing traffic through machines whose owners never consented to it.

This approach is mostly for learning, not production. Don't run scraping that needs reliability on a pool of free proxies.

When self-hosted breaks down

Self-hosted pools fail when:

  • You need millions of IPs. Renting that many servers is more expensive than buying residential.
  • You need residential or mobile IPs. You can't self-host these; they belong to consumer ISPs and mobile carriers.
  • Maintenance burden. Proxies fail, get IP-blocked, need updates. Running 50 Squid boxes is real ops work.
  • Geographic diversity costs. Renting boxes in 30 countries adds up; providers do this at scale.

For most projects, "self-host the easy datacenter, buy residential/mobile" is the practical hybrid.

Composing with commercial providers

A common pattern: the pool API abstracts "give me a proxy," then internally picks among:

  1. Your self-hosted datacenter pool (cheapest, easy targets).
  2. A commercial residential gateway (mid-tier).
  3. A commercial mobile gateway (hardest).

Pick the tier by each target's known difficulty (a label per target): cheap proxies first, escalate on failure, as sketched below.
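
A sketch of that escalation; every endpoint here is a hypothetical placeholder (the first would be your self-hosted gateway, the others commercial entry points):

import httpx

TIERS = [
    "http://dc-user:pass@pool.internal:3128",         # self-hosted datacenter
    "http://res-user:pass@residential.example:8080",  # commercial residential
    "http://mob-user:pass@mobile.example:8080",       # commercial mobile
]

def fetch_with_escalation(url):
    for proxy in TIERS:
        try:
            resp = httpx.get(url, proxy=proxy, timeout=15)
            # 403/429 usually means this tier was detected; escalate.
            if resp.status_code not in (403, 429):
                return resp
        except httpx.TransportError:
            pass  # tier unreachable; escalate
    raise RuntimeError(f"all proxy tiers failed for {url}")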

Hands-on lab

If you have a spare VPS or two:

  1. Install Squid on each. Configure basic auth.
  2. Write a tiny Flask gateway that gives out a random proxy.
  3. Run a scraper through it. Validate failover by killing a Squid and watching the gateway adjust.

You'll be surprised at how much functionality you get in ~200 lines of code. You'll also be surprised at how quickly the operational burden grows past 5 proxies. That's the build-vs-buy line.

Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


When is self-hosted proxy infrastructure a better choice than buying from a provider?
