PHP Crash-Course for Scrapers
The PHP you'll actually use to write scrapers. Arrays, strings, file I/O, error handling, JSON, and the modern PHP 8 features that make it pleasant.
What you’ll learn
- Manipulate strings and arrays in modern PHP 8.
- Use associative arrays as the PHP equivalent of Python dicts.
- Read and write files, JSON, and CSV without surprises.
- Handle errors with try/catch and named exception types, not the @ silencer.
PHP 8.x is genuinely good. If your last PHP was 5.x or 7.x, much has changed: strict types, named arguments, enums, readonly properties, match expressions, the null-safe operator. This crash course assumes none of that history; it's just the PHP you'll write in scrapers, in 2026 idiom.
Strings
$url = "https://practice.scrapingcentral.com/products?page=2";
str_starts_with($url, "https://"); // true (PHP 8.0+)
str_ends_with($url, ".pdf"); // false
str_contains($url, "products"); // true
strtolower($url); // lowercase
trim($url); // strip whitespace
explode("?", $url); // ["https://practice.scrapingcentral.com/products", "page=2"]
str_replace("page=2", "page=3", $url); // rewrite
sprintf("page=%d&limit=%d", 2, 20); // formatted string, "page=2&limit=20"
PHP 8.0 added str_starts_with, str_ends_with, and str_contains; before that you used strpos(...) === 0 for prefixes and strpos(...) !== false for containment. Use the new ones; they're clearer.
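As a small illustration of these string functions working together, here is a minimal sketch that normalizes a scraped price label into a float. The helper name parsePrice and the sample labels are hypothetical, and the sketch assumes "." is the decimal separator and "," a thousands separator.

```php
<?php
// Hypothetical helper: turn a scraped label like ' $1,299.00 ' into a float.
// Assumes "." is the decimal separator and "," is a thousands separator.
function parsePrice(string $label): ?float
{
    $clean = trim($label);                         // strip surrounding whitespace
    $clean = str_replace(['$', ','], '', $clean);  // drop currency sign and commas
    if ($clean === '' || !is_numeric($clean)) {
        return null;                               // not a recognizable price
    }
    return (float) $clean;
}

$a = parsePrice(' $14.99 ');    // 14.99
$b = parsePrice('$1,299.00');   // 1299.0
$c = parsePrice('Sold out');    // null
```

Returning null for unparseable labels keeps "no price" distinct from a price of 0.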
String interpolation
$page = 2;
$url = "https://practice.scrapingcentral.com/products?page=$page";
// More explicit with curly braces (preferred when expressions get complex):
$url = "https://practice.scrapingcentral.com/products?page={$page}";
// printf-style for formatting:
$formatted = sprintf('$%.2f', 14.99); // '$14.99'
Double-quoted strings interpolate; single-quoted don't. For literal-text strings (no variables), single quotes signal intent; any speed difference is negligible in modern PHP.
Arrays
PHP arrays are unusual: they're both list and dict in one structure. An array with integer keys behaves like a list; with string keys, like a dict.
List-style
$prices = [14.99, 24.95, 9.50, 49.00];
$prices[0]; // 14.99
$prices[count($prices) - 1]; // 49.00, no negative indexing
end($prices); // 49.00, alternative
$prices[] = 7.00; // append
sort($prices); // mutates in place
asort($prices); // sort, preserve keys
array_sum($prices) / count($prices); // average
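Because a list is just an array whose keys happen to be 0, 1, 2, ..., removing an element can quietly turn it into a dict. The sketch below shows the effect with array_is_list (PHP 8.1+) and how json_encode reacts; the sample values are made up.

```php
<?php
$list = [10, 20, 30];
$isListBefore = array_is_list($list);   // true: keys are 0, 1, 2

unset($list[1]);                        // leaves keys 0 and 2, a gap
$isListGapped = array_is_list($list);   // false
$gappedJson   = json_encode($list);     // {"0":10,"2":30}, an object

$list      = array_values($list);       // reindex to 0, 1
$fixedJson = json_encode($list);        // [10,30], an array again
```

Reindexing with array_values after filtering or unsetting keeps JSON output as a list.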
Dict-style (associative)
$product = [
    "id" => 42,
    "title" => "Yellow ceramic mug",
    "price" => 14.99,
    "tags" => ["kitchen", "ceramic"],
];
$product["title"]; // direct access, Warning if missing
$product["title"] ?? "n/a"; // null coalesce, "n/a" if missing
isset($product["title"]); // true
array_key_exists("title", $product); // true (unlike isset, also true when the value is null)
$product["stock"] = 15; // assign
array_keys($product); // ["id", "title", "price", "tags", "stock"]
The ?? null-coalescing operator is the PHP equivalent of Python's dict.get(k, default). It returns the right side when the left side is null or undefined.
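The difference between ??, isset, and array_key_exists only shows up when a key exists but holds null, which is common in scraped data. A minimal sketch, using a hypothetical record:

```php
<?php
// Hypothetical scraped record where "stock" was present but empty (null).
$product = ["title" => "Yellow ceramic mug", "stock" => null];

$stock    = $product["stock"] ?? 0;                // 0: ?? treats null like missing
$rating   = $product["rating"] ?? "n/a";           // "n/a": key absent, no warning raised
$hasStock = isset($product["stock"]);              // false: isset is false for null
$keyThere = array_key_exists("stock", $product);   // true: the key itself exists

// ?? chains, and it reaches into nested arrays without warnings.
$firstTag = $product["tags"][0] ?? $product["category"] ?? "untagged";  // "untagged"
```

Use array_key_exists when "present but null" must be distinguished from "absent"; otherwise ?? is usually enough.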
Iterating
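foreach ($product as $key => $value) {
    // "tags" holds an array, so encode each value instead of interpolating it directly
    echo "$key = " . json_encode($value) . "\n";
}
// Just values
foreach ($product as $value) {
    var_dump($value);
}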
Array transformations
$prices = [14.99, 24.95, 9.50, 49.00];
array_map(fn($p) => $p * 0.9, $prices); // 10% off all
array_filter($prices, fn($p) => $p < 20); // [0 => 14.99, 2 => 9.50], keys are preserved
array_reduce($prices, fn($c, $p) => $c + $p, 0); // sum
PHP 7.4+ has short arrow functions (fn($x) => ...): single-expression closures that automatically capture variables from the enclosing scope by value. Use them; they're terser than function ($x) use ($outer) { return ...; }.
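Here is how these transformations might combine on scraped rows, as a rough sketch. The field names and the $maxPrice threshold are illustrative only.

```php
<?php
// Hypothetical scraped rows; field names are illustrative only.
$products = [
    ["title" => "Mug",    "price" => 14.99],
    ["title" => "Teapot", "price" => 49.00],
    ["title" => "Spoon",  "price" => 9.50],
];
$maxPrice = 20.0;

// The arrow function captures $maxPrice from the enclosing scope by value.
$cheap = array_filter($products, fn($p) => $p["price"] < $maxPrice);

// array_filter keeps the original keys (0 and 2 here); reindex for a clean list.
$cheap = array_values($cheap);

$titles = array_map(fn($p) => $p["title"], $cheap);    // ["Mug", "Spoon"]
$total  = array_sum(array_column($cheap, "price"));    // approximately 24.49
```

array_column pulls one field out of every row, which is often shorter than an equivalent array_map.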
Functions
function fetchPage(string $url, int $timeout = 10): string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT => $timeout,
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    $body = curl_exec($ch);
    // Read the error before closing, so the handle is released on failure too
    $error = curl_error($ch);
    curl_close($ch);
    if ($body === false) {
        throw new RuntimeException($error);
    }
    return $body;
}
// Named arguments (PHP 8.0+)
$body = fetchPage(url: "https://example.com", timeout: 30);
Type declarations (string $url, : string) are optional in PHP but recommended. They turn type mismatches into a TypeError at call time instead of silent bugs.
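To see a type declaration catching a mistake, consider this minimal sketch. The helper pageUrl is hypothetical and not part of the lesson's code.

```php
<?php
// Hypothetical helper with declared parameter and return types.
function pageUrl(string $base, int $page): string
{
    return sprintf('%s?page=%d', $base, $page);
}

$ok = pageUrl('https://example.com/products', 2);
// "https://example.com/products?page=2"

try {
    // A non-numeric string cannot be coerced to int, so PHP throws a TypeError
    // at the call instead of silently building a broken URL.
    pageUrl('https://example.com/products', 'two');
    $error = null;
} catch (TypeError $e) {
    $error = $e->getMessage();
}
```

By default PHP still coerces compatible scalars, so the numeric string '2' would be accepted for $page. Putting declare(strict_types=1); as the first statement of a file turns off that coercion for calls made from that file.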
Modern PHP 8 features you'll use
// match expression
$status = match (true) {
    $code >= 500 => 'server-error',
    $code >= 400 => 'client-error',
    $code >= 300 => 'redirect',
    $code >= 200 => 'success',
    default => 'unknown',
};
// Null-safe operator
$avg = $response?->reviews?->average ?? 0;
// Constructor property promotion (PHP 8.0; readonly requires PHP 8.1)
class HttpClient {
    public function __construct(
        private readonly int $timeout = 10,
        private readonly string $userAgent = 'my-scraper/1.0',
    ) {}
}
// Enums (PHP 8.1)
enum HttpStatus: int {
    case OK = 200;
    case NotFound = 404;
    case ServerError = 500;
}
match replaces unwieldy if/elseif chains. The null-safe ?-> is for traversing nullable objects. Constructor promotion turns 15-line classes into 5-line ones.
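These features compose well. The sketch below combines a backed enum, match, constructor promotion with readonly properties, and the null-safe operator (PHP 8.1+). StatusClass and FetchResult are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical names throughout; requires PHP 8.1+ for enums and readonly.
enum StatusClass: string
{
    case Success     = 'success';
    case Redirect    = 'redirect';
    case ClientError = 'client-error';
    case ServerError = 'server-error';
    case Unknown     = 'unknown';

    public static function fromCode(int $code): self
    {
        // match (true) picks the first arm whose condition is true.
        return match (true) {
            $code >= 500 => self::ServerError,
            $code >= 400 => self::ClientError,
            $code >= 300 => self::Redirect,
            $code >= 200 => self::Success,
            default      => self::Unknown,
        };
    }
}

final class FetchResult
{
    // Promoted constructor parameters become readonly properties.
    public function __construct(
        public readonly int $code,
        public readonly ?string $body = null,
    ) {}

    public function statusClass(): StatusClass
    {
        return StatusClass::fromCode($this->code);
    }
}

$result = new FetchResult(code: 404);
$label  = $result->statusClass()->value;    // "client-error"

// The null-safe operator short-circuits the whole chain when the object is null.
$missing = null;
$none    = $missing?->statusClass()->value; // null, no error
```

Returning an enum case instead of a bare string means a misspelled status becomes an error rather than a value that silently matches nothing.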
File I/O
// Write
file_put_contents("output.txt", "Hello\n");
file_put_contents("output.txt", "World\n", FILE_APPEND);
// Read all at once (small files)
$contents = file_get_contents("input.txt");
// Read line by line (big files)
$f = fopen("input.txt", "r");
while (($line = fgets($f)) !== false) {
    $line = trim($line);
    process($line);
}
fclose($f);
For text files, file_put_contents and file_get_contents are the one-liners you'll use 80% of the time.
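For the line-by-line case, one option is to wrap the fopen/fgets loop in a generator so callers can simply foreach over lines. This is a sketch under assumptions: readLines is a hypothetical helper, and the temporary file exists only to make the demo self-contained.

```php
<?php
// Hypothetical helper: lazily yield trimmed, non-empty lines from a file.
function readLines(string $path): Generator
{
    $f = fopen($path, 'r');
    if ($f === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($line = fgets($f)) !== false) {
            $line = trim($line);
            if ($line !== '') {
                yield $line;
            }
        }
    } finally {
        fclose($f);   // also runs if the caller stops iterating early
    }
}

// Demo with a temporary file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'urls');
file_put_contents($path, "https://example.com/a\n\nhttps://example.com/b\n");

$urls = iterator_to_array(readLines($path), false);
unlink($path);
// $urls is ["https://example.com/a", "https://example.com/b"]
```

Only one line is held in memory at a time, so the same loop works for a URL list of any size.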
JSON
$data = json_decode($responseBody, associative: true); // PHP 8.0 named arg
// or: $data = json_decode($responseBody, true);
// JSON encode, pretty-printed, preserves UTF-8
$json = json_encode($products, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);
file_put_contents("products.json", $json);
Three flags worth knowing on json_encode:
- JSON_UNESCAPED_UNICODE: don't escape non-ASCII to \uXXXX
- JSON_UNESCAPED_SLASHES: don't escape / (cleaner URLs in output)
- JSON_PRETTY_PRINT: indented output
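The effect of the escaping flags is easiest to see side by side. A minimal sketch with a made-up record:

```php
<?php
$item = ["title" => "Café mug", "url" => "https://example.com/p/1"];

$default = json_encode($item);
// {"title":"Caf\u00e9 mug","url":"https:\/\/example.com\/p\/1"}

$readable = json_encode($item, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
// {"title":"Café mug","url":"https://example.com/p/1"}

// Both forms are valid JSON and decode back to the same array.
$same = json_decode($default, true) === json_decode($readable, true);  // true
```

The flags change only how the output reads, not what it means, so they are safe to use for files that humans will inspect.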
CSV
$f = fopen("products.csv", "w");
fputcsv($f, ["id", "title", "price"]); // header
foreach ($products as $p) {
    fputcsv($f, [$p["id"], $p["title"], $p["price"]]);
}
fclose($f);
// Read
$f = fopen("products.csv", "r");
$headers = fgetcsv($f);
while (($row = fgetcsv($f)) !== false) {
    $product = array_combine($headers, $row); // turn list into dict
    // ...
}
fclose($f);
fputcsv and fgetcsv handle quoting correctly; never construct CSV with string concatenation.
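To see the quoting at work, this sketch round-trips a title containing both a comma and double quotes through an in-memory stream. The row data is invented; passing escape: '' is a choice made here, not something the lesson's code does.

```php
<?php
// A title containing a comma and double quotes, the classic CSV breakers.
$rows = [
    ["id" => 1, "title" => 'Mug, "large"', "price" => 14.99],
];

$f = fopen('php://memory', 'w+');
// escape: '' disables PHP's non-standard escape character. PHP 8.4 deprecates
// relying on the default value, so it is passed explicitly here.
fputcsv($f, ["id", "title", "price"], escape: '');
foreach ($rows as $r) {
    fputcsv($f, [$r["id"], $r["title"], $r["price"]], escape: '');
}

rewind($f);
$raw = stream_get_contents($f);   // second line: 1,"Mug, ""large""",14.99

rewind($f);
$headers = fgetcsv($f, escape: '');
$first   = array_combine($headers, fgetcsv($f, escape: ''));
fclose($f);
// $first["title"] is 'Mug, "large"' again; every value comes back as a string.
```

Note that CSV carries no types: the id and price return as "1" and "14.99", so cast them after reading if you need numbers.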
Error handling
PHP has two error mechanisms: old warnings/notices (loose) and exceptions (clean). Modern code uses exceptions:
try {
    $body = fetchPage($url);
    $data = json_decode($body, true, flags: JSON_THROW_ON_ERROR);
} catch (RuntimeException $e) {
    // network / curl failure
    error_log("Fetch failed for $url: " . $e->getMessage());
} catch (JsonException $e) {
    // malformed JSON
    error_log("Bad JSON from $url");
}
Two practices to adopt:
- Pass JSON_THROW_ON_ERROR so json_decode raises on bad JSON instead of returning null and leaving you to figure out why.
- Never use the @ silencer. Code like @file_get_contents(...) swallows the error entirely, making bugs invisible.
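One way to apply the first practice is to wrap decoding in a small function, so every caller gets exceptions with distinct types. A sketch, where decodeJson is a hypothetical helper and the sample bodies are made up:

```php
<?php
// Hypothetical wrapper: decode a JSON object or array, raising on any problem.
function decodeJson(string $body): array
{
    $data = json_decode($body, true, 512, JSON_THROW_ON_ERROR);
    if (!is_array($data)) {
        // Valid JSON such as 42 or null is still not a usable record.
        throw new UnexpectedValueException('Expected a JSON object or array');
    }
    return $data;
}

$results = [];
foreach (['{"id": 42}', '{"id": 42', 'null'] as $body) {
    try {
        $results[] = decodeJson($body)['id'];
    } catch (JsonException $e) {
        $results[] = 'bad json: ' . $e->getMessage();
    } catch (UnexpectedValueException $e) {
        $results[] = 'wrong shape';
    }
}
// $results[0] is 42, $results[1] starts with "bad json", $results[2] is "wrong shape"
```

Separate catch blocks let the scraper retry or skip differently for truncated responses and for responses that parse but carry no record.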
Useful built-in functions
// URL handling
parse_url("https://example.com/products?page=2");
http_build_query(["page" => 2, "limit" => 20]); // "page=2&limit=20"
urlencode("hello world"); // "hello+world"
// Time
time(); // unix timestamp
date("Y-m-d", time()); // "2026-05-12"
strtotime("2026-04-12"); // parse to timestamp
$dt = new DateTimeImmutable("now"); // modern OO date
// Regex
preg_match('/\d{4}-\d{2}-\d{2}/', $text, $m);
preg_match_all('/\$(\d+\.\d{2})/', $text, $matches);
preg_replace('/\s+/', ' ', $text);
// Sort
sort($items); // values, reindex
asort($items); // values, preserve keys
ksort($items); // by keys
usort($items, fn($a, $b) => $a["price"] <=> $b["price"]); // custom comparator
The <=> "spaceship" operator (PHP 7.0+) returns -1/0/+1, ideal for custom sort comparators.
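The URL helpers above combine naturally for pagination. The sketch below rewrites the page parameter of a URL; withPage is a hypothetical helper, and it assumes an absolute URL with scheme and host while ignoring ports, credentials, and fragments.

```php
<?php
// Hypothetical helper: return $url with its "page" query parameter set to $page.
// Assumes an absolute URL; port, credentials, and fragment are not preserved.
function withPage(string $url, int $page): string
{
    $parts = parse_url($url);
    parse_str($parts['query'] ?? '', $query);   // "page=2&limit=20" becomes an array
    $query['page'] = $page;                     // overwrite or add the parameter

    return sprintf(
        '%s://%s%s?%s',
        $parts['scheme'],
        $parts['host'],
        $parts['path'] ?? '/',
        http_build_query($query)
    );
}

$next = withPage('https://practice.scrapingcentral.com/products?page=2&limit=20', 3);
// "https://practice.scrapingcentral.com/products?page=3&limit=20"
```

Parsing and rebuilding the query is more robust than str_replace("page=2", ...), which would also rewrite a parameter such as mypage=2.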
Hands-on lab
Write a 20-line PHP script that:
- Fetches https://practice.scrapingcentral.com/ via Guzzle.
- Decodes the response body, counts how many <a> substrings appear using substr_count.
- Writes the count to a JSON file using json_encode with JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE.
- Wraps the network call in try/catch (\GuzzleHttp\Exception\RequestException $e).
If you can do that in 20 lines, you're fluent enough to follow the rest of the curriculum's PHP track.