Proxy Health Checks, Failover, and Cost Optimization
Production proxy management: detecting dead proxies, failing over, and cutting waste. The operational side of proxy infrastructure.
What you’ll learn
- Implement proxy health checks and failure tracking.
- Build a fail-fast failover system.
- Identify the biggest sources of proxy bill waste.
A proxy pool full of dead IPs costs money and adds latency. Production scrapers need automated health checks, fast failover, and cost-watching. This lesson is operational: these patterns show up in every successful scraping team.
What "unhealthy" means
A proxy can fail in many flavors:
| Symptom | Likely cause |
|---|---|
| Connection timeout | Proxy down or overloaded |
| 407 Proxy Authentication Required | Auth header dropped or revoked |
| 502 / 504 from proxy | Upstream gateway issue |
| 200 but wrong content (login page) | IP banned by target |
| 200 but wrong country | Geo-targeting misconfigured |
| High latency (>5s) | Proxy congested |
| TLS handshake failure | Proxy MITM gone wrong |
Each warrants different action: retry with a new IP, swap proxies, refresh credentials, alert.
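Those symptom-to-action rules can live in a small dispatcher. A minimal sketch; the action names and thresholds are illustrative, not from any particular library:

```python
# Illustrative symptom -> action dispatch; the action strings and
# the 5s latency threshold are assumptions, tune for your own pool.
def classify(status, latency, timed_out=False):
    if timed_out:
        return "retry-new-ip"         # proxy down or overloaded
    if status == 407:
        return "refresh-credentials"  # auth header dropped or revoked
    if status in (502, 504):
        return "swap-proxy"           # upstream gateway issue
    if latency > 5.0:
        return "swap-proxy"           # congested
    return "ok"
```

Content-level failures (the "200 but banned" case) need body inspection and are handled separately below.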
Health-check loop
A background task periodically probes each proxy:
```python
import asyncio
import time

import httpx


class ProxyHealth:
    def __init__(self):
        # proxy_url -> {"success": int, "fail": int, "last_check": float, "latency": float}
        self.stats = {}

    async def check(self, proxy_url):
        start = time.time()
        try:
            # httpx >= 0.26 uses proxy=; older versions use proxies=
            async with httpx.AsyncClient(proxy=proxy_url, timeout=10) as c:
                r = await c.get("https://httpbin.org/ip")
            latency = time.time() - start
            if r.status_code == 200:
                self._record(proxy_url, success=True, latency=latency)
            else:
                self._record(proxy_url, success=False, latency=latency)
        except Exception:
            self._record(proxy_url, success=False, latency=time.time() - start)

    def _record(self, proxy, success, latency):
        s = self.stats.setdefault(proxy, {"success": 0, "fail": 0, "latency": 0.0})
        s["last_check"] = time.time()
        s["latency"] = latency
        if success:
            s["success"] += 1
        else:
            s["fail"] += 1

    def is_healthy(self, proxy):
        s = self.stats.get(proxy)
        if not s:
            return True  # no data yet: assume healthy
        total = s["success"] + s["fail"]
        if total <= 5:
            return True  # too few samples to judge
        return s["success"] / total > 0.8
```
Run check() every few minutes for each proxy. Filter the pool by is_healthy() when picking.
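The "every few minutes" schedule can be one background task. A sketch; the `rounds` parameter is an addition of mine that makes the loop testable, in production you would leave it as `None` and run forever:

```python
import asyncio


async def health_loop(health, proxies, interval=300, rounds=None):
    # Probe every proxy concurrently, then sleep until the next round.
    # rounds=None means run forever; a finite value is useful in tests.
    done = 0
    while rounds is None or done < rounds:
        await asyncio.gather(*(health.check(p) for p in proxies))
        done += 1
        await asyncio.sleep(interval)

# Launch alongside the scraper, e.g.:
# asyncio.create_task(health_loop(health, PROXIES))
```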
For rotating-gateway providers (one URL, many IPs underneath), health checks are different: you can't easily probe a specific IP. Instead, track aggregate behavior: success rate over the last N requests, average latency, error rates.
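A rolling window over recent requests captures that aggregate view. A sketch; the window size and thresholds are assumptions:

```python
from collections import deque


class GatewayStats:
    """Aggregate health for a rotating gateway: you can't probe
    individual IPs, so track the last N requests instead."""

    def __init__(self, window=200):
        self.results = deque(maxlen=window)  # (ok, latency) pairs

    def record(self, ok, latency):
        self.results.append((ok, latency))

    def success_rate(self):
        if not self.results:
            return 1.0  # no data yet: assume healthy
        return sum(ok for ok, _ in self.results) / len(self.results)

    def avg_latency(self):
        if not self.results:
            return 0.0
        return sum(lat for _, lat in self.results) / len(self.results)

    def healthy(self, min_rate=0.8, max_latency=5.0):
        return self.success_rate() >= min_rate and self.avg_latency() <= max_latency
```

Call `record()` from the request path rather than from a probe loop; the gateway's health is just the shape of your own traffic through it.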
Failover patterns
When a request fails through proxy A, try proxy B before giving up.
```python
import random
import time

import httpx


class ProxyPool:
    def __init__(self, proxies):
        self.live = set(proxies)
        self.dead = {}  # proxy -> timestamp after which it may be retried

    def get(self):
        # Restore proxies whose quarantine window has expired
        now = time.time()
        for p, t in list(self.dead.items()):
            if now > t:
                self.live.add(p)
                del self.dead[p]
        if not self.live:
            raise RuntimeError("no live proxies")
        return random.choice(list(self.live))

    def mark_dead(self, proxy, retry_in=300):
        self.live.discard(proxy)
        self.dead[proxy] = time.time() + retry_in


async def fetch_with_failover(pool, url, max_tries=3):
    last_exc = None
    for _ in range(max_tries):
        proxy = pool.get()
        try:
            async with httpx.AsyncClient(proxy=proxy, timeout=10) as c:
                r = await c.get(url)
            if r.status_code in (407, 502, 503):
                pool.mark_dead(proxy, retry_in=60)
                continue
            return r
        except (httpx.TimeoutException, httpx.ConnectError) as e:
            pool.mark_dead(proxy, retry_in=300)
            last_exc = e
    raise last_exc or RuntimeError("all retries failed")
```
Three retries, each with a fresh proxy. Dead proxies are quarantined for a window sized to the failure type, then re-tested.
Detecting "200 but banned"
The hardest failure: HTTP 200 but the content is a login wall or CAPTCHA. Standard retry logic doesn't catch this.
```python
def looks_banned(html: str) -> bool:
    indicators = [
        "Verify you are human",
        "Sorry, we just need to make sure",
        "captcha-mock",
        '"error":"banned"',
    ]
    return any(s in html for s in indicators)


# In the request path:
r = await fetch(url, proxy)
if looks_banned(r.text):
    pool.mark_dead(proxy, retry_in=3600)  # this IP is burned for an hour
    return await fetch_with_failover(pool, url)
```
Maintain a per-target list of ban indicators. They change occasionally; re-validate them whenever ban rates spike.
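One way to keep those per-target lists is a registry keyed by host. The hosts and indicator strings below are made up for illustration:

```python
# Hypothetical per-target registry; hosts and strings are examples only.
BAN_INDICATORS = {
    "shop.example.com": ["Verify you are human", "captcha-mock"],
    "news.example.com": ['"error":"banned"'],
}


def looks_banned_for(host, html):
    # Unknown hosts fall back to an empty list: never flagged.
    return any(s in html for s in BAN_INDICATORS.get(host, []))
```

Keeping the registry as data rather than code means an indicator change is a config edit, not a deploy.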
Latency-aware selection
Not all proxies are equally fast. Weight selection by latency:
```python
import random


def weighted_pick(self):
    # Intended as a ProxyPool method; assumes self.stats holds the
    # latest latency per proxy (populated by ProxyHealth above).
    if not self.live:
        return None
    weights = [1 / (self.stats[p]["latency"] or 0.1) for p in self.live]
    return random.choices(list(self.live), weights=weights, k=1)[0]
```
Faster proxies get picked more often. Combine with health filtering for a self-balancing pool.
Cost optimization
Top sources of proxy bill waste, in order:
1. Image and CSS downloads
A scraper that triggers a full browser-style load on every page burns bandwidth on assets you don't need. Two fixes:
```python
# httpx: check the content type before downloading the body
r = await client.head(url)
if "html" not in r.headers.get("content-type", ""):
    return None
r = await client.get(url)
```

```python
# Scrapy: drop image responses in a downloader middleware
from scrapy.http import Response


class HtmlOnlyMiddleware:
    def process_response(self, request, response, spider):
        if response.headers.get("content-type", b"").startswith(b"image/"):
            return Response(response.url, body=b"", status=204)
        return response
```
For browser-mode scrapes, configure Playwright/Panther to block images:
```python
context = await browser.new_context(
    java_script_enabled=True,
    bypass_csp=True,
)
await context.route("**/*.{png,jpg,jpeg,gif,svg,webp}", lambda route: route.abort())
```
Blocking assets can cut bandwidth by 70%+ on media-heavy sites.
2. Failed retries
A 30% block rate means 30% bandwidth wasted on responses you can't use. Better fingerprinting (covered in §4.32–§4.34) cuts blocks. So does respecting Retry-After and not hammering.
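Respecting Retry-After is mostly header parsing. A sketch handling only the delta-seconds form (the HTTP-date form is omitted for brevity; the `default` and `cap` values are assumptions):

```python
def retry_after_seconds(headers, default=30, cap=300):
    # Parse the delta-seconds form of Retry-After. Fall back to a
    # default on a missing header or the HTTP-date form, and cap the
    # wait so a hostile server can't park the worker for hours.
    raw = headers.get("Retry-After", "")
    try:
        delay = int(raw)
    except (TypeError, ValueError):
        delay = default
    return min(max(delay, 0), cap)
```

Sleep for the returned value before retrying instead of hammering the endpoint.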
3. Wrong proxy tier
- Datacenter for hard sites: high block rate, bandwidth wasted on failed retries, and paradoxically more expensive than residential after accounting for those retries.
- Residential for easy sites: a 5-15x premium over datacenter for no benefit.
Match the tier to the actual target difficulty. Many shops over-pay for residential where datacenter would do.
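The "paradoxically more expensive" point falls out of a one-line cost model: bandwidth on blocked responses is still billed, so the real unit cost is the price inflated by the inverse success rate. Prices and rates below are illustrative:

```python
def effective_cost_per_success(price_per_gb, success_rate, mb_per_request=0.5):
    # Blocked responses still consume billed bandwidth, so the true
    # cost per *usable* response is inflated by 1 / success_rate.
    gb = mb_per_request / 1024
    return price_per_gb * gb / success_rate
```

With a 10% datacenter success rate on a hard target, `effective_cost_per_success(1, 0.10)` exceeds `effective_cost_per_success(8, 0.95)`: the cheap tier becomes the expensive one.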
4. Forgotten/idle workers
Scheduled scrapes that no longer have a downstream consumer keep burning proxy budget. Audit periodically: every active scraper, what's it for, who reads the output, is it still needed?
5. Overly broad geographic targeting
City-level is 2-5x country-level. Country-level is small premium over "any." Only pay for the precision your use case needs.
Per-request cost tracking
In production, log the cost per request:
```python
async def fetch(url):
    r = await client.get(url)  # proxy/client config elided
    # estimate_request_size() and metrics are your own helpers
    bytes_used = len(r.content) + estimate_request_size()
    metrics.add_proxy_bytes(bytes_used, tier="residential")
    return r
```
Roll up at end of day:
2026-05-12 catalog108-scrape: 12,432 reqs, 412 MB residential, $3.30 estimated
Catches budget runaway before it hits the monthly invoice.
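The rollup itself is a few lines once bytes are tagged by tier. The prices below are placeholders; check your provider's rate card:

```python
PRICE_PER_GB = {"residential": 8.00, "datacenter": 1.00}  # placeholder rates


def estimate_cost(bytes_by_tier):
    # bytes_by_tier: e.g. {"residential": 432_013_312} from the metrics sink
    return sum(
        PRICE_PER_GB[tier] * b / (1024 ** 3)
        for tier, b in bytes_by_tier.items()
    )
```

Run it per scrape job per day and alert when a job's estimate jumps versus its trailing average.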
Hands-on lab
In your scraper:
- Implement a simple ProxyPool with mark_dead/failover.
- Add a ban detector for one specific target ("Verify you are human").
- Run 1000 requests and log: success rate, average latency, bytes per request.
- Compute estimated monthly cost at current volume.
The exercise turns proxy management from "it works" into "I know what it costs." That awareness saves projects.
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.