Data Cleaning with PHP: Filters, Validators, Pipelines
PHP scrapers also produce dirty data. Use filter_var, validators, and a small pipeline pattern to coerce, validate, and reshape scraped rows: the PHP counterpart to pandas-style cleaning.
What you’ll learn
- Normalize strings (trim, case, whitespace, non-breaking spaces).
- Coerce types with `filter_var`, casts, and `DateTimeImmutable`.
- Validate fields against business rules.
- Compose cleaning into a reusable pipeline pattern.
PHP doesn't have pandas. It has something different: native filter_var, mature validation libraries (Symfony Validator, Respect/Validation), and the standard-library tools to build clean, composable pipelines. This lesson is the equivalent of pandas cleaning, the PHP way.
Load scraped data
<?php
// From JSON
$rows = json_decode(file_get_contents('products.json'), true);

// From JSONL
$rows = [];
foreach (file('products.jsonl', FILE_IGNORE_NEW_LINES) as $line) {
    $rows[] = json_decode($line, true);
}

// From CSV
$rows = [];
$fp = fopen('products.csv', 'r');
$headers = fgetcsv($fp);
while (($cells = fgetcsv($fp)) !== false) {
    $rows[] = array_combine($headers, $cells);
}
fclose($fp);

// From SQLite
$db = new PDO('sqlite:scrape.db');
$rows = $db->query('SELECT * FROM products')->fetchAll(PDO::FETCH_ASSOC);
Each pattern gives you an array of associative arrays, the universal "rows" shape in PHP.
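For the rest of the lesson, assume rows shaped roughly like this (field names and values are illustrative, not part of any real dataset):
$rows = [
    [
        'name'       => '  Deluxe Widget ',
        'price'      => '$1,299.00',
        'category'   => 'widgets',
        'source_url' => 'https://example.com/products/42',
        'scraped_at' => '2024-05-01 12:30:00',
    ],
    // ...
];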
Normalize strings
function clean_string(?string $s): ?string {
    if ($s === null) return null;
    $s = trim($s);
    $s = str_replace("\xC2\xA0", ' ', $s); // non-breaking space (UTF-8 bytes)
    $s = preg_replace('/\s+/u', ' ', $s);  // collapse whitespace runs
    return $s === '' ? null : $s;
}

foreach ($rows as &$row) {
    foreach ($row as $key => $val) {
        if (is_string($val)) {
            $row[$key] = clean_string($val);
        }
    }
}
unset($row);
Note the `\xC2\xA0` for the non-breaking space: Python's `\xa0` refers to the Unicode codepoint, but in UTF-8 that codepoint is encoded as the two-byte sequence C2 A0. Match the bytes, not the abstract codepoint.
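A quick sanity check: PHP's \u{...} escape in double-quoted strings produces UTF-8 bytes, so the codepoint notation and the byte pair really are the same string:
var_dump("\u{00A0}" === "\xC2\xA0"); // bool(true): U+00A0 is the bytes C2 A0 in UTF-8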
Coerce types
function parse_price(?string $s): ?float {
    if ($s === null) return null;
    $s = preg_replace('/[^\d.,-]/', '', $s); // keep digits, comma, dot, minus
    $s = str_replace(',', '', $s);           // assume thousands separator
    return is_numeric($s) ? (float) $s : null;
}

function parse_int_safe(?string $s): ?int {
    return filter_var($s, FILTER_VALIDATE_INT) !== false ? (int) $s : null;
}

function parse_date(?string $s, string $format = 'Y-m-d'): ?DateTimeImmutable {
    if (!$s) return null;
    $dt = DateTimeImmutable::createFromFormat($format, $s);
    return $dt ?: null;
}
filter_var with FILTER_VALIDATE_INT / FILTER_VALIDATE_FLOAT / FILTER_VALIDATE_EMAIL / FILTER_VALIDATE_URL covers the common cases out of the box. Use it before reaching for regex.
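A few quick checks of how filter_var behaves on typical scraped strings (the inputs are illustrative):
var_dump(filter_var('42', FILTER_VALIDATE_INT));                   // int(42)
var_dump(filter_var('4.20', FILTER_VALIDATE_FLOAT));               // float(4.2)
var_dump(filter_var('not-an-email', FILTER_VALIDATE_EMAIL));       // bool(false)
var_dump(filter_var('https://example.com', FILTER_VALIDATE_URL));  // string "https://example.com"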
Validate
Decide if a row is good enough to keep:
function is_valid_product(array $row): bool {
    return !empty($row['source_url'])
        && !empty($row['name'])
        && filter_var($row['source_url'], FILTER_VALIDATE_URL) !== false
        && is_float($row['price'] ?? null) // ?? guards against a missing key
        && $row['price'] > 0;
}

$clean = array_filter($rows, 'is_valid_product');
array_filter removes rows where the predicate returns false. The result is what you save downstream.
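One gotcha: array_filter preserves the original keys, so the result can have gaps. Reindex it if you need a sequential list, for example so json_encode emits a JSON array rather than an object:
$clean = array_values(array_filter($rows, 'is_valid_product'));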
For richer validation rules, use Symfony Validator or Respect/Validation:
composer require respect/validation
use Respect\Validation\Validator as v;

$rule = v::key('source_url', v::url())
    ->key('name', v::stringType()->notEmpty())
    ->key('price', v::floatVal()->positive());

$valid = [];
foreach ($rows as $row) {
    try {
        $rule->assert($row);
        $valid[] = $row;
    } catch (Exception $e) {
        // Log or collect $e->getMessage(); the invalid row is skipped
    }
}
Deduplicate
function dedupe_by_key(array $rows, string $key): array {
    $seen = [];
    $out = [];
    foreach ($rows as $row) {
        $k = $row[$key] ?? null;
        if ($k !== null && !isset($seen[$k])) {
            $seen[$k] = true;
            $out[] = $row;
        }
    }
    return $out;
}

$rows = dedupe_by_key($rows, 'source_url');
To keep the most recent record per URL, sort by scraped_at first, then dedupe:
usort($rows, fn($a, $b) => strcmp($a['scraped_at'], $b['scraped_at'])); // oldest first
// reverse so the newest rows come first; dedupe keeps the first occurrence per key
$rows = array_reverse(dedupe_by_key(array_reverse($rows), 'source_url'));
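An alternative that avoids the double reverse, assuming the rows are already sorted ascending by scraped_at: index by the key so later (newer) rows overwrite earlier ones.
// Last write per URL wins, so the newest row survives
$byUrl = [];
foreach ($rows as $row) {
    $byUrl[$row['source_url']] = $row;
}
$rows = array_values($byUrl);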
Derive columns
foreach ($rows as &$row) {
    $row['price_with_tax'] = round($row['price'] * 1.10, 2);
    $row['domain'] = parse_url($row['source_url'], PHP_URL_HOST);
    $row['scraped_date'] = substr($row['scraped_at'], 0, 10); // YYYY-MM-DD
    $row['is_expensive'] = $row['price'] > 50;
}
unset($row);
A pipeline pattern
Compose cleaning steps as small functions and pipe rows through them:
class Pipeline {
    private array $steps = [];

    public function pipe(callable $fn): self {
        $this->steps[] = $fn;
        return $this;
    }

    public function run(array $rows): array {
        foreach ($this->steps as $fn) {
            $rows = $fn($rows);
        }
        return $rows;
    }
}
// Usage
$pipeline = (new Pipeline())
    ->pipe(fn($rows) => array_map(
        // only pass strings to clean_string; leave other types untouched
        fn($r) => array_map(fn($v) => is_string($v) ? clean_string($v) : $v, $r),
        $rows
    ))
    ->pipe(fn($rows) => array_map(function ($r) {
        $r['price'] = parse_price($r['price']);
        $r['scraped_at'] = parse_date($r['scraped_at'], 'Y-m-d H:i:s')?->format('c');
        return $r;
    }, $rows))
    ->pipe(fn($rows) => array_filter($rows, 'is_valid_product'))
    ->pipe(fn($rows) => dedupe_by_key($rows, 'source_url'));

$clean = $pipeline->run($rows);
Each step is one transform. Inserting or removing steps doesn't touch the others.
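For instance, a derive-columns step (reusing the snippet from the section above) can be spliced into the chain without touching the other steps; a sketch, added before run() is called:
// Hypothetical extra step: derive columns inside the pipeline
$pipeline->pipe(fn($rows) => array_map(function ($r) {
    $r['domain'] = parse_url($r['source_url'], PHP_URL_HOST);
    $r['is_expensive'] = $r['price'] > 50;
    return $r;
}, $rows));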
Aggregate
$by_category = [];
foreach ($rows as $row) {
    $cat = $row['category'];
    $by_category[$cat] ??= ['count' => 0, 'total_price' => 0];
    $by_category[$cat]['count']++;
    $by_category[$cat]['total_price'] += $row['price'];
}

foreach ($by_category as $cat => &$agg) {
    $agg['avg_price'] = $agg['total_price'] / $agg['count'];
}
unset($agg);

print_r($by_category);
Verbose compared to pandas's df.groupby('category').agg(...), but the logic is the same. For larger aggregations, you can offload to SQL via PDO and SQLite (Lesson 1.30):
$result = $db->query('
    SELECT category, COUNT(*) AS count, AVG(price) AS avg_price
    FROM products GROUP BY category
')->fetchAll(PDO::FETCH_ASSOC);
This is often clearer and faster than PHP-level loops for non-trivial aggregations.
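If the cleaned rows only exist in PHP memory, one option is to push them into a throwaway in-memory SQLite table and aggregate there; a sketch, assuming each row has category and price fields:
// Load cleaned rows into :memory: SQLite purely to run the GROUP BY
$mem = new PDO('sqlite::memory:');
$mem->exec('CREATE TABLE products (category TEXT, price REAL)');
$ins = $mem->prepare('INSERT INTO products (category, price) VALUES (?, ?)');
foreach ($rows as $row) {
    $ins->execute([$row['category'], $row['price']]);
}
$byCategory = $mem->query('
    SELECT category, COUNT(*) AS count, AVG(price) AS avg_price
    FROM products GROUP BY category
')->fetchAll(PDO::FETCH_ASSOC);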
Export
// JSON
file_put_contents(
    'clean.json',
    json_encode($rows, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE)
);

// JSONL
$fp = fopen('clean.jsonl', 'w');
foreach ($rows as $row) {
    fwrite($fp, json_encode($row, JSON_UNESCAPED_UNICODE) . "\n");
}
fclose($fp);

// CSV
$fp = fopen('clean.csv', 'w');
fputcsv($fp, array_keys(reset($rows))); // header row from the first row's keys
foreach ($rows as $row) {
    fputcsv($fp, $row);
}
fclose($fp);

// To SQLite (INSERT OR REPLACE as a simple upsert for incremental writes)
$db = new PDO('sqlite:clean.db');
$db->exec('CREATE TABLE IF NOT EXISTS products (source_url TEXT PRIMARY KEY, name TEXT, price REAL)');
$stmt = $db->prepare('INSERT OR REPLACE INTO products (source_url, name, price) VALUES (?, ?, ?)');

$db->beginTransaction();
foreach ($rows as $row) {
    $stmt->execute([$row['source_url'], $row['name'], $row['price']]);
}
$db->commit();
When to use a library instead
For more complex cleaning (fuzzy matching, address normalization, NER, ML-assisted disambiguation), PHP's options are thinner than Python's. If your cleaning grows past basic type coercion, whitespace fixes, and dedup, exporting to Python for the heavy work and re-importing the cleaned data is a normal architecture. PHP scrapers feeding a Python cleaning step (or vice versa) is common in real teams.
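A minimal sketch of that handoff, assuming a hypothetical clean_heavy.py that reads one JSONL file and writes another:
// Write rows for the Python step, run it, read the cleaned rows back
$fp = fopen('handoff.jsonl', 'w');
foreach ($rows as $row) {
    fwrite($fp, json_encode($row, JSON_UNESCAPED_UNICODE) . "\n");
}
fclose($fp);

passthru('python3 clean_heavy.py handoff.jsonl cleaned.jsonl'); // hypothetical script

$rows = [];
foreach (file('cleaned.jsonl', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $rows[] = json_decode($line, true);
}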
Hands-on lab
Pull the JSON from /api/products/1/reviews. Clean it with the pipeline pattern: trim strings, coerce ratings to int and dates to DateTimeImmutable, dedupe by review ID, and drop reviews with empty bodies. Export the result both as JSONL and into a SQLite table with an upsert. Then compute "average rating per author" using either PHP loops or a SQL query.
Practice this lesson on Catalog108, our first-party scraping sandbox.