
4.40 · Intermediate · 4 min read

Integrating CAPTCHA Solving in Python and PHP Scrapers

Wire a CAPTCHA solver into a real scraper, using patterns that handle detection, solving, retry, and token injection in Scrapy and Symfony.

What you’ll learn

  • Detect CAPTCHA in HTML or page state.
  • Submit, poll, and inject solves in Scrapy and Symfony pipelines.
  • Handle solver failures gracefully.

The integration layer between scraper and solver is where reality bites: detection logic, solver dispatch, token injection, and retry all have to work together. This lesson walks through the end-to-end pattern in Python (Scrapy) and PHP (Symfony).

Detection

First step: know you've hit a CAPTCHA. The signal varies by vendor, so check the response body for each vendor's marker:

def is_captcha(response):
    """Return the CAPTCHA vendor detected in the response body, or None."""
    text = response.text
    if "g-recaptcha" in text:
        return "recaptcha"
    if "h-captcha" in text:
        return "hcaptcha"
    if "data-sitekey" in text and "turnstile" in text:
        return "turnstile"
    if "/cdn-cgi/challenge-platform/" in text:
        return "cloudflare"
    if "captcha-delivery.com" in text:
        return "datadome"
    return None

For Playwright, detect via DOM:

async def page_has_captcha(page):
    # A non-zero locator count means the widget is present in the DOM.
    if await page.locator("iframe[src*='recaptcha']").count() > 0:
        return "recaptcha"
    if await page.locator(".h-captcha").count() > 0:
        return "hcaptcha"
    ...

Detection lives on the response or page. Mistakes here mean either over-solving (cost) or under-solving (silent failures).
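
One way to reduce both failure modes is to return the site key together with the vendor, so the solver is only dispatched when there is something concrete to solve. The sketch below works on a raw HTML string. The names `detect_captcha`, `Detection`, and `MARKERS` are illustrative assumptions, not part of any library, and the marker strings (including `cf-turnstile` for Turnstile) are a starting point that vendors can change at any time.

```python
import re
from typing import NamedTuple, Optional


class Detection(NamedTuple):
    vendor: str
    site_key: Optional[str]


# Marker substrings per vendor, checked in order (assumed, not exhaustive).
MARKERS = [
    ("recaptcha", "g-recaptcha"),
    ("hcaptcha", "h-captcha"),
    ("turnstile", "cf-turnstile"),
    ("cloudflare", "/cdn-cgi/challenge-platform/"),
    ("datadome", "captcha-delivery.com"),
]

# Accept either quote style around the attribute value.
SITE_KEY_RE = re.compile(r'data-sitekey=["\']([^"\']+)["\']')


def detect_captcha(html: str) -> Optional[Detection]:
    """Return the first matching vendor and its site key, or None."""
    for vendor, marker in MARKERS:
        if marker in html:
            match = SITE_KEY_RE.search(html)
            return Detection(vendor, match.group(1) if match else None)
    return None
```

Logging every `Detection` alongside the URL gives you the data to spot false positives before they turn into solver spend.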

Python: Scrapy middleware

A middleware that detects, solves, retries:

import json
import re

from scrapy.exceptions import IgnoreRequest
from twocaptcha import TwoCaptcha

class CaptchaMiddleware:
    def __init__(self, api_key):
        self.solver = TwoCaptcha(api_key)

    @classmethod
    def from_crawler(cls, crawler):
        api_key = crawler.settings.get("TWOCAPTCHA_KEY")
        return cls(api_key)

    def process_response(self, request, response, spider):
        # Skip requests already retried with a token, to avoid a solve loop
        if request.meta.get("captcha_solved"):
            return response
        if "data-sitekey" not in response.text:
            return response

        # Detect
        match = re.search(r'data-sitekey="([^"]+)"', response.text)
        if not match:
            return response
        site_key = match.group(1)

        # Solve (blocking; in production, dispatch asynchronously instead)
        try:
            result = self.solver.recaptcha(sitekey=site_key, url=response.url)
            token = result["code"]
        except Exception as e:
            spider.logger.warning(f"captcha solve failed: {e}")
            raise IgnoreRequest()

        # Build retry with token
        # Sent as JSON; form-encode instead if the target expects a regular form post
        new = request.replace(
            body=json.dumps({
                **request.meta.get("form_data", {}),
                "g-recaptcha-response": token,
            }),
            meta={**request.meta, "captcha_solved": True},
            dont_filter=True,
            method="POST",
        )
        return new

Key points:

  • Detect captcha in response.
  • Block while solving (acceptable for low-volume flows; for high-throughput, queue solves).
  • Build a new request with the token included.
  • dont_filter=True bypasses the dupefilter.
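
To take effect, a downloader middleware has to be registered in the project settings, and the `TWOCAPTCHA_KEY` setting read by `from_crawler` has to exist. A minimal `settings.py` sketch follows. The module path `myproject.middlewares` and the priority `560` are assumptions: the priority is picked to sit below Scrapy's default `HttpCompressionMiddleware` (590), so the body is already decompressed by the time `process_response` reads `response.text`.

```python
import os

# Hypothetical module path; point it at wherever CaptchaMiddleware lives.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CaptchaMiddleware": 560,
}

# Read the key from the environment rather than committing it to the repo.
TWOCAPTCHA_KEY = os.environ.get("TWOCAPTCHA_KEY", "")
```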

For Playwright (via scrapy-playwright):

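async def parse_with_captcha(self, response):
    page = response.meta["playwright_page"]
    if await page.locator("iframe[src*='recaptcha']").count() > 0:
        site_key = await page.evaluate("() => document.querySelector('.g-recaptcha').dataset.sitekey")
        # Blocking call; acceptable for low volume only
        token = self.solver.recaptcha(sitekey=site_key, url=page.url)["code"]
        # Pass the token as an argument instead of interpolating it into JS
        await page.evaluate("t => { document.getElementById('g-recaptcha-response').value = t; }", token)
        # The callback path (.O.O here) is minified and differs between sites and versions
        await page.evaluate("t => window['___grecaptcha_cfg'].clients[0].O.O.callback(t)", token)
    html = await page.content()
    await page.close()  # release the page once its content is captured
    yield {"data": html}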

Inject the token and fire the callback, and the page advances.

PHP: Symfony pattern

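use Symfony\Contracts\HttpClient\HttpClientInterface;

class CaptchaAwareScraper
{
    public function __construct(
        private HttpClientInterface $http,
        private CaptchaSolver $solver,
    ) {}

    public function fetch(string $url): ?string
    {
        $response = $this->http->request('GET', $url);
        // Pass false so 4xx challenge pages do not throw
        $html = $response->getContent(false);

        if (!preg_match('/data-sitekey="([^"]+)"/', $html, $m)) {
            return $html;
        }
        $siteKey = $m[1];

        $token = $this->solver->solveRecaptchaV2($siteKey, $url);

        // Submit the form with the token.
        // extractFormData() is a helper you implement: it collects the form's input names and values.
        $form = $this->extractFormData($html);
        $form['g-recaptcha-response'] = $token;

        // Turn the Set-Cookie response headers into a Cookie request header
        $cookies = array_map(
            fn (string $c): string => explode(';', $c, 2)[0],
            $response->getHeaders(false)['set-cookie'] ?? [],
        );

        $response2 = $this->http->request('POST', $url, [
            'body' => $form,
            'headers' => $cookies ? ['Cookie' => implode('; ', $cookies)] : [],
        ]);
        return $response2->getContent();
    }
}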

For Panther-based PHP scrapes, the integration uses Selenium-style JS execution:

$driver = $client->getWebDriver();
$driver->executeScript("document.getElementById('g-recaptcha-response').value = arguments[0]", [$token]);
$driver->executeScript("___grecaptcha_cfg.clients[0].O.O.callback(arguments[0])", [$token]);

Async solving with worker queues

For high-throughput scrapes, blocking on a solve ties up a worker that could be fetching other pages. Push captcha events to a separate queue:

# Scrapy pipeline / Messenger handler
# captcha_queue is assumed to be an asyncio.Queue shared with the worker
def process_item_with_captcha(item):
    if item.get("needs_captcha"):
        # put_nowait() enqueues without awaiting, so sync code can call it
        captcha_queue.put_nowait({
            "url": item["url"],
            "site_key": item["site_key"],
            "callback": handle_solved,
        })
        return  # original request paused

# Separate worker
async def captcha_worker():
    while True:
        task = await captcha_queue.get()
        token = await solver.solve_async(task)
        await task["callback"](task, token)

The scraper main loop doesn't block. Solvers process in parallel. Solved items re-enter the pipeline.
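
The queue-and-worker shape can be tried end to end without a real solver. The sketch below is a self-contained `asyncio` version under stated assumptions: `fake_solve` stands in for the provider call, and `run`, the task dictionary keys, and the worker count are all hypothetical names chosen for illustration.

```python
import asyncio


async def fake_solve(task: dict) -> str:
    """Stand-in for a real solver call; sleeps to simulate solve latency."""
    await asyncio.sleep(0.01)
    return f"token-for-{task['site_key']}"


async def captcha_worker(queue: asyncio.Queue, results: list) -> None:
    """Pull tasks forever; the caller cancels the worker once the queue drains."""
    while True:
        task = await queue.get()
        try:
            token = await fake_solve(task)
            results.append((task["url"], token))
        finally:
            queue.task_done()  # lets queue.join() know this task is finished


async def run(tasks: list, n_workers: int = 3) -> list:
    """Solve all tasks with a small pool of concurrent workers."""
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    workers = [
        asyncio.create_task(captcha_worker(queue, results))
        for _ in range(n_workers)
    ]
    for task in tasks:
        queue.put_nowait(task)
    await queue.join()  # wait until every queued task has been processed
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Calling `asyncio.run(run(tasks))` with a list of `{"url": ..., "site_key": ...}` dictionaries returns `(url, token)` pairs in completion order, which is not necessarily submission order.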

Token injection patterns

How tokens are submitted varies:

reCAPTCHA v2

  • Token goes in g-recaptcha-response form field or POST body.

reCAPTCHA v3

  • Token in g-recaptcha-response form field, often alongside an action parameter.

hCaptcha

  • Token in h-captcha-response form field.

Cloudflare Turnstile

  • Token in cf-turnstile-response form field.

Always inspect the original form to see which field is expected; some sites use custom names.
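
The field names above can be kept in one lookup table, with an override for sites that rename the field. This is a minimal sketch; `TOKEN_FIELDS`, `inject_token`, and `encode_form` are hypothetical helper names, and the type keys such as `recaptcha_v2` are labels chosen for this example.

```python
from urllib.parse import urlencode

# Default response-field names per CAPTCHA type, taken from the list above.
TOKEN_FIELDS = {
    "recaptcha_v2": "g-recaptcha-response",
    "recaptcha_v3": "g-recaptcha-response",
    "hcaptcha": "h-captcha-response",
    "turnstile": "cf-turnstile-response",
}


def inject_token(form, captcha_type, token, field_override=None):
    """Return a copy of the form with the token in the expected field."""
    field = field_override or TOKEN_FIELDS[captcha_type]
    return {**form, field: token}


def encode_form(form):
    """Encode as application/x-www-form-urlencoded for a POST body."""
    return urlencode(form)
```

Returning a copy rather than mutating the input keeps the original form data reusable if the token is rejected and the request has to be rebuilt.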

Retry logic

Solves can fail. Retry with backoff:

import time

def solve_with_retry(solver, site_key, url, max_tries=3):
    for attempt in range(max_tries):
        try:
            return solver.recaptcha(sitekey=site_key, url=url)["code"]
        except Exception as e:
            if "ERROR_CAPTCHA_UNSOLVABLE" in str(e):
                # solver gave up; try a different solver?
                raise
            if attempt < max_tries - 1:
                time.sleep(2 ** attempt)  # back off 1s, 2s, ...
    raise RuntimeError("all solve attempts failed")

Distinguish "transient error, retry" from "unsolvable, escalate." Many providers refund solves they report as unsolvable; check your provider's terms.
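
The transient-versus-permanent split can be made explicit in code. The sketch below assumes a particular classification of 2Captcha-style error codes in `PERMANENT`; that grouping is an assumption for illustration, so check it against your provider's documentation. `SolveFailed`, `is_permanent`, and `solve_with_backoff` are hypothetical names, and `solve` is any zero-argument callable that returns a token or raises.

```python
import random
import time

# Assumed classification of solver error codes; adjust to your provider's docs.
PERMANENT = ("ERROR_CAPTCHA_UNSOLVABLE", "ERROR_ZERO_BALANCE", "ERROR_WRONG_USER_KEY")


class SolveFailed(Exception):
    """Raised when a solve fails permanently or retries are exhausted."""


def is_permanent(error: Exception) -> bool:
    """True if the error message contains a code we never retry."""
    return any(code in str(error) for code in PERMANENT)


def solve_with_backoff(solve, max_tries=3, base_delay=1.0, sleep=time.sleep):
    """Call solve() until it returns a token, backing off on transient errors."""
    last_error = None
    for attempt in range(max_tries):
        try:
            return solve()
        except Exception as e:
            last_error = e
            if is_permanent(e):
                raise SolveFailed(f"permanent failure: {e}") from e
            if attempt < max_tries - 1:
                # Exponential backoff plus jitter to avoid synchronized retries.
                sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise SolveFailed(f"gave up after {max_tries} tries: {last_error}")
```

The `sleep` parameter is injectable so the function can be exercised in tests without real delays.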

Error reporting

Always log:

  • Detection: was a captcha present?
  • Submission: did the solve request succeed?
  • Solving time: how long did the solver take?
  • Final outcome: did the token get accepted by the target?

metrics.increment("captcha.detected")
metrics.increment("captcha.solved")
metrics.timing("captcha.solve_duration_ms", duration)
metrics.increment("captcha.token_rejected_by_target")

Token-rejected-by-target is critical. A solver may return a "valid" token that the target's strict scoring still rejects (v3, Enterprise). Track this rate; high rejection means the target is using stronger checking and your strategy needs to escalate.
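
To make the rejection rate concrete, here is a small in-memory sketch that reuses the counter names from the snippet above. `CaptchaMetrics` and its methods are hypothetical; in a real deployment you would likely derive the same ratio from your existing metrics backend.

```python
import time
from collections import Counter
from contextlib import contextmanager


class CaptchaMetrics:
    """In-memory counters and timings; swap for a real metrics client."""

    def __init__(self):
        self.counts = Counter()
        self.timings_ms = []

    def increment(self, name):
        self.counts[name] += 1

    @contextmanager
    def timed(self):
        """Record the wall-clock duration of the wrapped block in ms."""
        start = time.monotonic()
        try:
            yield
        finally:
            self.timings_ms.append((time.monotonic() - start) * 1000)

    def rejection_rate(self):
        """Share of solved tokens the target rejected (0.0 if none solved)."""
        solved = self.counts["captcha.solved"]
        if solved == 0:
            return 0.0
        return self.counts["captcha.token_rejected_by_target"] / solved
```

Wrapping the solver call in `with metrics.timed():` records the solve duration, and alerting on `rejection_rate()` crossing a threshold you choose is one way to notice when the target has tightened its checks.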

Hands-on lab

Against /challenges/antibot/captcha-mock on Catalog108 (a mock captcha endpoint):

  1. Hit the endpoint; detect the mock captcha.
  2. Simulate a solve (the lab endpoint accepts a known token).
  3. Submit the token and verify the response indicates success.
  4. Add retry logic for failed solves.

The exercise covers the full loop: detection, solve, injection, retry, and success. After this, integrating a real solver against a real target is mostly a matter of swapping the API call.
