Guzzle: The Industry-Standard PHP HTTP Client
Guzzle wraps cURL with a clean, modern API. PSR-7 messages, sessions, async, middleware. The default choice for any serious PHP scraper.
What you’ll learn
- Install Guzzle via Composer and configure a Client.
- Send GET, POST, and JSON requests with the modern Guzzle API.
- Reuse cookies and headers across requests via Client defaults.
- Handle HTTP and connection errors via Guzzle exceptions.
Guzzle is to PHP what requests is to Python: the de facto HTTP library. It's built on top of cURL (or PHP streams when cURL isn't available), exposes PSR-7 message objects, supports async via promises, and has a middleware system that makes adding retries, logging, and auth trivial. Every modern PHP scraper, Symfony-based or otherwise, ends up using Guzzle either directly or indirectly.
Install
composer require guzzlehttp/guzzle
The Client object
A Client is the equivalent of requests.Session: it holds defaults (base URI, headers, cookies, timeouts) and reuses connections:
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
$client = new Client([
'base_uri' => 'https://practice.scrapingcentral.com',
'timeout' => 10,
'connect_timeout' => 5,
'headers' => [
'User-Agent' => 'Mozilla/5.0 (compatible; my-scraper)',
'Accept-Language' => 'en-US,en;q=0.9',
],
'cookies' => true, // enable cookie jar
'http_errors' => true, // throw on 4xx/5xx (default)
]);
With base_uri set, every request URL is resolved relative to it: $client->get('/products') hits https://practice.scrapingcentral.com/products.
GET requests
$response = $client->get('/products', [
'query' => ['page' => 2, 'category' => 'kitchen'],
]);
echo $response->getStatusCode(); // 200
echo $response->getHeaderLine('Content-Type'); // text/html; charset=UTF-8
$body = (string) $response->getBody(); // cast PSR-7 stream to string
The query option is the equivalent of Python's params=; Guzzle URL-encodes the parameters for you.
$response is a PSR-7 ResponseInterface. The body is a stream (Psr\Http\Message\StreamInterface); cast it to string with (string) for the full content, or read it incrementally for big payloads.
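For large payloads, the body stream can be consumed in chunks instead of buffering everything in memory. A minimal sketch (the 8 KB chunk size and the output file path are illustrative choices, not Guzzle requirements):

```php
<?php
// Read the PSR-7 body stream in 8 KB chunks and write each chunk
// to disk, so the full response never has to fit in memory.
$body = $response->getBody();
$out  = fopen('page.html', 'w');

while (!$body->eof()) {
    $chunk = $body->read(8192); // read up to 8 KB per iteration
    fwrite($out, $chunk);
}
fclose($out);
```

read() and eof() are part of the PSR-7 StreamInterface, so this works with any response Guzzle returns.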
POST requests, three body formats
Form-encoded:
$response = $client->post('/account/login', [
'form_params' => [
'username' => 'student@practice.scrapingcentral.com',
'password' => 'practice123',
],
]);
Multipart (for file uploads):
$response = $client->post('/upload', [
'multipart' => [
['name' => 'file', 'contents' => fopen('/path/to/file.png', 'r')],
['name' => 'caption', 'contents' => 'My photo'],
],
]);
JSON:
$response = $client->post('/api/products', [
'json' => ['name' => 'New product', 'price' => 9.99],
]);
Like Python requests, the json option both serializes the body AND sets the Content-Type: application/json header. Don't pass body => json_encode(...) and forget the header; that's a common bug.
Cookies and sessions
The 'cookies' => true option on the Client enables an in-memory cookie jar that persists across requests:
$client = new Client([
'base_uri' => 'https://practice.scrapingcentral.com',
'cookies' => true,
]);
$client->get('/'); // server sets a session cookie
$client->post('/account/login', [ // login sends that cookie back
'form_params' => ['username' => '...', 'password' => '...'],
]);
$response = $client->get('/dashboard'); // authenticated request
For more control, pass a CookieJar instance instead of true:
use GuzzleHttp\Cookie\CookieJar;
$jar = new CookieJar();
$client = new Client(['cookies' => $jar]);
// Inspect, share, or persist:
print_r($jar->toArray());
FileCookieJar and SessionCookieJar are also available if you need on-disk or PHP-session-backed persistence.
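As a sketch of on-disk persistence with FileCookieJar (the cookies.json filename is an assumption; the second constructor argument controls whether session cookies without an explicit expiry are stored as well):

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Cookie\FileCookieJar;

// Loads cookies from cookies.json if the file exists;
// true = also persist session cookies (no explicit expiry).
$jar = new FileCookieJar('cookies.json', true);

$client = new Client([
    'base_uri' => 'https://practice.scrapingcentral.com',
    'cookies'  => $jar,
]);

$client->get('/'); // any Set-Cookie headers land in the jar
// The jar writes itself back to cookies.json when it is destroyed,
// so a later run of the script resumes the same session.
```

This is handy for scrapers that log in once and then run repeatedly from cron.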
Error handling
By default, 4xx and 5xx responses throw an exception:
use GuzzleHttp\Exception\ClientException; // 4xx
use GuzzleHttp\Exception\ServerException; // 5xx
use GuzzleHttp\Exception\ConnectException; // network failure
use GuzzleHttp\Exception\RequestException; // parent of all above
try {
$response = $client->get('/account/missing');
} catch (ClientException $e) {
echo "404 or 4xx: " . $e->getResponse()->getStatusCode();
} catch (ServerException $e) {
echo "5xx: " . $e->getResponse()->getStatusCode();
} catch (ConnectException $e) {
echo "Network failure: " . $e->getMessage();
}
To opt out of exceptions and inspect manually (helpful for scrapers that don't want exceptions for normal 404s):
$response = $client->get('/some/url', ['http_errors' => false]);
if ($response->getStatusCode() === 200) {
// process body
}
Async and concurrency
Where Guzzle really shines: send many requests in parallel without threads.
use GuzzleHttp\Promise\Utils;
$promises = [
'p1' => $client->getAsync('/products?page=1'),
'p2' => $client->getAsync('/products?page=2'),
'p3' => $client->getAsync('/products?page=3'),
];
$responses = Utils::unwrap($promises);
foreach ($responses as $key => $response) {
echo "$key: " . $response->getStatusCode() . "\n";
}
Three requests fire in parallel; unwrap waits for all to complete. For bulk scraping, this is dramatically faster than serial loops, often 5-10x.
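To see the speedup yourself, a rough timing sketch using microtime (the page range is illustrative, and actual ratios depend on server latency):

```php
<?php
use GuzzleHttp\Promise\Utils;

// Serial: one request at a time, total time is the sum of all latencies.
$start = microtime(true);
foreach (range(1, 5) as $page) {
    $client->get("/products?page=$page");
}
printf("serial:   %.2fs\n", microtime(true) - $start);

// Parallel: all five in flight at once, total time is roughly
// the slowest single request.
$start = microtime(true);
$promises = [];
foreach (range(1, 5) as $page) {
    $promises[$page] = $client->getAsync("/products?page=$page");
}
Utils::unwrap($promises);
printf("parallel: %.2fs\n", microtime(true) - $start);
```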
For controlled concurrency (e.g., max 5 in flight at any time), use GuzzleHttp\Pool:
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
$requests = function () {
for ($page = 1; $page <= 100; $page++) {
yield new Request('GET', "/products?page=$page");
}
};
$pool = new Pool($client, $requests(), [
'concurrency' => 5,
'fulfilled' => function ($response, $index) {
echo "page $index: " . $response->getStatusCode() . "\n";
},
'rejected' => function ($reason, $index) {
echo "page $index FAILED: " . $reason . "\n";
},
]);
$pool->promise()->wait();
100 pages, 5 in flight at any moment, with per-result and per-failure callbacks. This is production scraper territory.
Middleware
Guzzle uses a middleware stack you can extend: retries, logging, caching, and custom auth are all written as middleware that wraps the request/response cycle. We'll cover retries and middleware in the Production sub-path; for now, know that the extension point exists.
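As a small taste of that extension point, a minimal sketch that pushes a request-logging middleware onto the default handler stack (the error_log destination is an assumption; Guzzle also ships a ready-made Middleware::retry helper for retries):

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use Psr\Http\Message\RequestInterface;

// Start from the default stack (cURL/stream handler + core middleware).
$stack = HandlerStack::create();

// Log every outgoing request's method and URI before it is sent.
$stack->push(Middleware::mapRequest(function (RequestInterface $request) {
    error_log($request->getMethod() . ' ' . $request->getUri());
    return $request;
}));

$client = new Client(['handler' => $stack]);
```

Every request made through $client now passes through the logger before reaching the network, without touching any call sites.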
Proxies, TLS, auth
$client = new Client([
'proxy' => 'http://user:pass@proxy.example.com:8000',
'verify' => true,
'auth' => ['student', 'practice123'], // HTTP Basic
'headers' => ['Authorization' => 'Bearer ' . $token],
]);
All the same concepts as Python requests: same names, similar semantics. If you know requests, you know Guzzle's defaults.
Hands-on lab
Use Guzzle to fetch /products?page=2. Confirm getStatusCode() returns 200, print the first 500 bytes of (string) $response->getBody(), and verify getHeaderLine('Content-Type') says HTML. Then call getAsync for pages 1-5 and use Utils::unwrap to fetch them in parallel. Note the total elapsed time vs. a serial loop.
Practice this lesson on Catalog108, our first-party scraping sandbox.