Integrating CAPTCHA Solving in Python and PHP Scrapers
Wire a CAPTCHA solver into a real scraper: the patterns that handle detection, solving, retry, and token injection in Scrapy and Symfony.
What you’ll learn
- Detect CAPTCHA in HTML or page state.
- Submit, poll, and inject solves in Scrapy and Symfony pipelines.
- Handle solver failures gracefully.
The integration layer between scraper and solver is where most of the practical difficulty sits: detection logic, solver dispatch, token injection, and retry. This lesson walks through the end-to-end pattern in Python (Scrapy) and PHP (Symfony).
Detection
First step: know you've hit a CAPTCHA. The signal varies:
def is_captcha(response):
    text = response.text
    if "g-recaptcha" in text: return "recaptcha"
    if "h-captcha" in text: return "hcaptcha"
    if "data-sitekey" in text and "turnstile" in text: return "turnstile"
    if "/cdn-cgi/challenge-platform/" in text: return "cloudflare"
    if "captcha-delivery.com" in text: return "datadome"
    return None
For Playwright, detect via DOM:
async def page_has_captcha(page):
    if await page.locator("iframe[src*='recaptcha']").count() > 0:
        return "recaptcha"
    if await page.locator(".h-captcha").count() > 0:
        return "hcaptcha"
    ...
Detection lives on the response or page. Mistakes here mean either over-solving (cost) or under-solving (silent failures).
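Plain substring checks can over-detect, for example on a page that only mentions a widget name in its text. The following is a minimal sketch of a stricter check, assuming the target uses the vendors' default embed markup (a widget class plus a data-sitekey attribute). The names detect_captcha and Detection are illustrative and not part of any library.

```python
import re
from typing import NamedTuple, Optional


class Detection(NamedTuple):
    kind: str
    site_key: Optional[str]


# Widget class names as they commonly appear in markup (assumption: the target
# uses the vendors' default embed snippets).
_WIDGET_CLASSES = {
    "g-recaptcha": "recaptcha",
    "h-captcha": "hcaptcha",
    "cf-turnstile": "turnstile",
}
_SITEKEY_RE = re.compile(r'data-sitekey=["\']([^"\']+)["\']')


def detect_captcha(html: str) -> Optional[Detection]:
    """Return the captcha kind and site key, or None if no widget is found."""
    for css_class, kind in _WIDGET_CLASSES.items():
        # Match the name inside a class attribute, not anywhere in the page,
        # so body text that merely mentions "g-recaptcha" is not flagged.
        pattern = r'class=["\'][^"\']*\b' + re.escape(css_class) + r'\b'
        if re.search(pattern, html):
            key = _SITEKEY_RE.search(html)
            return Detection(kind, key.group(1) if key else None)
    return None
```

Returning the site key alongside the kind means the solving step does not have to parse the page a second time.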
Python: Scrapy middleware
A middleware that detects, solves, retries:
import json
import re

from scrapy.exceptions import IgnoreRequest
from twocaptcha import TwoCaptcha


class CaptchaMiddleware:
    def __init__(self, api_key):
        self.solver = TwoCaptcha(api_key)

    @classmethod
    def from_crawler(cls, crawler):
        api_key = crawler.settings.get("TWOCAPTCHA_KEY")
        return cls(api_key)

    def process_response(self, request, response, spider):
        if "data-sitekey" not in response.text:
            return response
        # Detect
        match = re.search(r'data-sitekey="([^"]+)"', response.text)
        if not match:
            return response
        site_key = match.group(1)
        # Solve (blocking; for production, dispatch asynchronously instead)
        try:
            result = self.solver.recaptcha(sitekey=site_key, url=response.url)
            token = result["code"]
        except Exception as e:
            spider.logger.warning(f"captcha solve failed: {e}")
            raise IgnoreRequest()
        # Build retry with token; returning a Request reschedules it
        new = request.replace(
            body=json.dumps({
                **request.meta.get("form_data", {}),
                "g-recaptcha-response": token,
            }),
            dont_filter=True,
            method="POST",
        )
        return new
Key points:
- Detect captcha in response.
- Block while solving (acceptable for low-volume flows; for high-throughput, queue solves).
- Build a new request with the token included.
- dont_filter=True bypasses the dupefilter.
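One risk in this pattern is a loop: if the target keeps serving the challenge after a token is submitted, the middleware solves and pays again on every pass. A minimal sketch of a per-request solve budget kept in the request meta follows. The captcha_solves key, MAX_SOLVES, and next_solve_meta are hypothetical names chosen for illustration.

```python
MAX_SOLVES = 2  # hypothetical per-request budget; tune to your cost tolerance


def next_solve_meta(meta: dict, max_solves: int = MAX_SOLVES):
    """Return updated meta for the retry, or None when the budget is spent."""
    used = meta.get("captcha_solves", 0)
    if used >= max_solves:
        return None
    # Copy rather than mutate, so the original request's meta stays untouched.
    return {**meta, "captcha_solves": used + 1}
```

In process_response, this could be called before solving: raise IgnoreRequest when it returns None, otherwise pass the result as meta= to request.replace so the count travels with the retried request.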
For Playwright (via scrapy-playwright):
async def parse_with_captcha(self, response):
    page = response.meta["playwright_page"]
    if await page.locator("iframe[src*='recaptcha']").count() > 0:
        site_key = await page.evaluate("() => document.querySelector('.g-recaptcha').dataset.sitekey")
        token = self.solver.recaptcha(sitekey=site_key, url=page.url)["code"]
        # Pass the token as an argument instead of interpolating it into the JS source
        await page.evaluate("(t) => { document.getElementById('g-recaptcha-response').value = t; }", token)
        # The callback path (.O.O here) is minified and differs between sites; inspect ___grecaptcha_cfg on the target
        await page.evaluate("(t) => window['___grecaptcha_cfg'].clients[0].O.O.callback(t)", token)
    yield {"data": await page.content()}
Inject the token and fire the callback, and the page advances.
PHP: Symfony pattern
use Symfony\Contracts\HttpClient\HttpClientInterface;

class CaptchaAwareScraper
{
    public function __construct(
        private HttpClientInterface $http,
        private CaptchaSolver $solver,
    ) {}

    public function fetch(string $url): ?string
    {
        $response = $this->http->request('GET', $url);
        $html = $response->getContent(false);
        if (!preg_match('/data-sitekey="([^"]+)"/', $html, $m)) {
            return $html;
        }
        $siteKey = $m[1];
        $token = $this->solver->solveRecaptchaV2($siteKey, $url);
        // Submit the form with the token (extractFormData is a helper you supply)
        $form = $this->extractFormData($html);
        $form['g-recaptcha-response'] = $token;
        // Replay session cookies: keep only the name=value part of each Set-Cookie
        $setCookies = $response->getHeaders(false)['set-cookie'] ?? [];
        $cookie = implode('; ', array_map(fn (string $c) => explode(';', $c, 2)[0], $setCookies));
        $response2 = $this->http->request('POST', $url, [
            'body' => $form,
            'headers' => $cookie !== '' ? ['Cookie' => $cookie] : [],
        ]);
        return $response2->getContent();
    }
}
For Panther-based PHP scrapes, the integration uses Selenium-style JS execution:
$driver = $client->getWebDriver();
$driver->executeScript("document.getElementById('g-recaptcha-response').value = arguments[0]", [$token]);
$driver->executeScript("___grecaptcha_cfg.clients[0].O.O.callback(arguments[0])", [$token]);
Async solving with worker queues
For high-throughput scrapes, blocking on a solve ties up a scraper worker for as long as the solver takes. Push captcha events to a separate queue instead:
# Scrapy pipeline / Messenger handler
def process_item_with_captcha(item):
    if item.get("needs_captcha"):
        # put_nowait assumes captcha_queue is an asyncio.Queue, where put() is a coroutine
        captcha_queue.put_nowait({
            "url": item["url"],
            "site_key": item["site_key"],
            "callback": handle_solved,
        })
        return  # original request paused

# Separate worker
async def captcha_worker():
    while True:
        task = await captcha_queue.get()
        token = await solver.solve_async(task)
        await task["callback"](task, token)

The scraper main loop doesn't block. Solvers process in parallel. Solved items re-enter the pipeline.
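To make the queue pattern concrete, here is a self-contained sketch using asyncio.Queue with a small worker pool. The fake_solve coroutine is a stand-in for a real solver call, and run_demo and the example.test URLs are illustrative assumptions rather than part of the pipeline above.

```python
import asyncio


async def fake_solve(task: dict) -> str:
    """Stand-in for a real solver call (hypothetical); returns a dummy token."""
    await asyncio.sleep(0.01)  # simulate solver latency
    return "token-for-" + task["site_key"]


async def captcha_worker(queue: asyncio.Queue, results: list) -> None:
    """Pull captcha tasks off the queue and record solved tokens."""
    while True:
        task = await queue.get()
        try:
            token = await fake_solve(task)
            results.append((task["url"], token))
        finally:
            queue.task_done()  # lets queue.join() know this task is finished


async def run_demo(tasks: list, workers: int = 3) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    pool = [asyncio.create_task(captcha_worker(queue, results)) for _ in range(workers)]
    for task in tasks:
        queue.put_nowait(task)  # non-blocking: the producer keeps going
    await queue.join()  # wait until every queued task has been processed
    for worker in pool:
        worker.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*pool, return_exceptions=True)
    return results
```

Running asyncio.run(run_demo(tasks)) with a list of {"url": ..., "site_key": ...} dicts returns one (url, token) pair per task. The number of workers caps how many solves are in flight at once, which also caps spend.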
Token injection patterns
How tokens are submitted varies:
reCAPTCHA v2
- Token goes in the g-recaptcha-response form field or POST body.
reCAPTCHA v3
- Token in the g-recaptcha-response form field, often alongside an action parameter.
hCaptcha
- Token in the h-captcha-response form field.
Cloudflare Turnstile
- Token in the cf-turnstile-response form field.
Always inspect the original form to see which field is expected; some sites use custom names.
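The field names above can be kept in one lookup table so the injection step is not hard-coded per captcha type. This is a small sketch: TOKEN_FIELDS and inject_token are illustrative names, and the optional field argument covers sites that use a custom name.

```python
from typing import Optional

# Default response field names per captcha type, as listed above.
TOKEN_FIELDS = {
    "recaptcha": "g-recaptcha-response",
    "hcaptcha": "h-captcha-response",
    "turnstile": "cf-turnstile-response",
}


def inject_token(form: dict, kind: str, token: str, field: Optional[str] = None) -> dict:
    """Return a copy of the form data with the solved token added.

    `field` overrides the default name for sites that use a custom one.
    """
    name = field or TOKEN_FIELDS[kind]  # KeyError on an unknown kind is intentional
    return {**form, name: token}
```

For example, inject_token({"email": "a@b.c"}, "hcaptcha", token) adds an h-captcha-response key and leaves the original dict unchanged.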
Retry logic
Solves can fail. Retry with backoff:
import time

def solve_with_retry(solver, site_key, url, max_tries=3):
    for attempt in range(max_tries):
        try:
            return solver.recaptcha(sitekey=site_key, url=url)["code"]
        except Exception as e:
            if "ERROR_CAPTCHA_UNSOLVABLE" in str(e):
                # solver gave up; try a different solver?
                raise
            if attempt < max_tries - 1:
                time.sleep(2 ** attempt)  # back off 1s, 2s, ...
    raise RuntimeError("all solve attempts failed")
Distinguish "transient error, retry" from "unsolvable, escalate." Most solvers refund failed-unsolvable solves; check provider terms.
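The "escalate" half can be sketched as a fallback across solvers: a permanent error moves on to the next provider, while a transient one is re-raised so a backoff loop can retry the same provider. The names is_permanent and solve_with_fallback are hypothetical, and each solver is assumed to be a zero-argument callable that returns a token or raises.

```python
# Error markers treated as permanent. ERROR_CAPTCHA_UNSOLVABLE comes from the
# text above; extend the tuple from your provider's documentation.
PERMANENT_MARKERS = ("ERROR_CAPTCHA_UNSOLVABLE",)


def is_permanent(exc: Exception) -> bool:
    """True when retrying the same solver is unlikely to help."""
    return any(marker in str(exc) for marker in PERMANENT_MARKERS)


def solve_with_fallback(solvers):
    """Try (name, solve) pairs in order; return (name, token) from the first success.

    A permanent error moves on to the next solver. Anything else is re-raised
    so the caller's backoff loop can retry the same solver.
    """
    gave_up = []
    for name, solve in solvers:
        try:
            return name, solve()
        except Exception as exc:
            if not is_permanent(exc):
                raise
            gave_up.append(name)
    raise RuntimeError("unsolvable by: " + ", ".join(gave_up))
```

Returning the solver name with the token makes it possible to track per-provider success rates later.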
Error reporting
Always log:
- Detection: was a captcha present?
- Submission: did the solve request succeed?
- Solving time: how long did the solver take?
- Final outcome: did the token get accepted by the target?
metrics.increment("captcha.detected")
metrics.increment("captcha.solved")
metrics.timing("captcha.solve_duration_ms", duration)
metrics.increment("captcha.token_rejected_by_target")
Token-rejected-by-target is critical. A solver may return a "valid" token that the target's strict scoring still rejects (v3, Enterprise). Track this rate; high rejection means the target is using stronger checking and your strategy needs to escalate.
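A rough sketch of how the solve duration and the rejection rate might be computed is shown here with in-memory counters. CaptchaStats is an illustrative stand-in; in practice the same calls would go to whatever metrics client the project already uses.

```python
import time
from collections import Counter


class CaptchaStats:
    """In-memory counters; swap for a real metrics client."""

    def __init__(self):
        self.counts = Counter()
        self.durations_ms = []

    def timed_solve(self, solve):
        """Run solve(), recording the outcome and the duration in milliseconds."""
        start = time.monotonic()
        try:
            token = solve()
        except Exception:
            self.counts["solve_failed"] += 1
            raise
        finally:
            # Recorded for failures too, since slow failures also cost time.
            self.durations_ms.append((time.monotonic() - start) * 1000)
        self.counts["solved"] += 1
        return token

    def record_target_result(self, accepted: bool):
        """Call after submitting the token to the target."""
        self.counts["token_accepted" if accepted else "token_rejected"] += 1

    def rejection_rate(self) -> float:
        """Share of submitted tokens that the target rejected."""
        total = self.counts["token_accepted"] + self.counts["token_rejected"]
        return self.counts["token_rejected"] / total if total else 0.0
```

Alerting on rejection_rate() rather than on raw counts keeps the signal stable as traffic volume changes.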
Hands-on lab
Against /challenges/antibot/captcha-mock on Catalog108 (a mock captcha endpoint):
- Hit the endpoint; detect the mock captcha.
- Simulate a solve (the lab endpoint accepts a known token).
- Submit the token and verify the response indicates success.
- Add retry logic for failed solves.
The exercise covers the full loop: detection, solve, injection, retry, success. After this, integrating a real solver against a real target is mostly a matter of swapping the API call.