Data Cleaning with PHP: Filters, Validators, Pipelines
PHP scrapers also produce dirty data. Use filter_var, validators, and a small pipeline pattern to coerce, validate, and reshape scraped rows: the PHP counterpart to pandas-style cleaning.
What you’ll learn
- Normalize strings (trim, case, whitespace, non-breaking spaces).
- Coerce types with `filter_var`, casts, and `DateTimeImmutable`.
- Validate fields against business rules.
- Compose cleaning into a reusable pipeline pattern.
PHP doesn't have pandas. It has something different: native filter_var, mature validation libraries (Symfony Validator, Respect/Validation), and the standard-library tools to build clean, composable pipelines. This lesson is the equivalent of pandas cleaning, the PHP way.
Load scraped data
<?php
// From JSON
$rows = json_decode(file_get_contents('products.json'), true);

// From JSONL
$rows = [];
foreach (file('products.jsonl', FILE_IGNORE_NEW_LINES) as $line) {
    $rows[] = json_decode($line, true);
}

// From CSV
$rows = [];
$fp = fopen('products.csv', 'r');
$headers = fgetcsv($fp);
while (($cells = fgetcsv($fp)) !== false) {
    $rows[] = array_combine($headers, $cells);
}
fclose($fp);

// From SQLite
$db = new PDO('sqlite:scrape.db');
$rows = $db->query('SELECT * FROM products')->fetchAll(PDO::FETCH_ASSOC);
Each pattern gives you an array of associative arrays, the universal "rows" shape in PHP.
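For the rest of the lesson, assume rows shaped roughly like this (field names and values are illustrative, not part of any real dataset):
$rows = [
    [
        'name'       => '  Deluxe Widget ',
        'price'      => '$1,299.00',
        'category'   => 'widgets',
        'source_url' => 'https://example.com/products/42',
        'scraped_at' => '2024-05-01 12:30:00',
    ],
    // ...
];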
Normalize strings
function clean_string(?string $s): ?string {
    if ($s === null) return null;
    $s = trim($s);
    $s = str_replace("\xC2\xA0", ' ', $s); // non-breaking space (UTF-8 bytes)
    $s = preg_replace('/\s+/u', ' ', $s);  // collapse whitespace runs
    return $s === '' ? null : $s;
}

foreach ($rows as &$row) {
    foreach ($row as $key => $val) {
        if (is_string($val)) {
            $row[$key] = clean_string($val);
        }
    }
}
unset($row);
Note the `\xC2\xA0` for the non-breaking space: Python's `\xa0` refers to the Unicode codepoint, but in UTF-8 that codepoint is encoded as the two-byte sequence C2 A0. Match the bytes, not the abstract codepoint.
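A quick sanity check: PHP's \u{...} escape in double-quoted strings produces UTF-8 bytes, so the codepoint notation and the byte pair really are the same string:
var_dump("\u{00A0}" === "\xC2\xA0"); // bool(true): U+00A0 is the bytes C2 A0 in UTF-8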
Coerce types
function parse_price(?string $s): ?float {
    if ($s === null) return null;
    $s = preg_replace('/[^\d.,-]/', '', $s); // keep digits, comma, dot, minus
    $s = str_replace(',', '', $s);           // assume thousands separator
    return is_numeric($s) ? (float) $s : null;
}

function parse_int_safe(?string $s): ?int {
    return filter_var($s, FILTER_VALIDATE_INT) !== false ? (int) $s : null;
}

function parse_date(?string $s, string $format = 'Y-m-d'): ?DateTimeImmutable {
    if (!$s) return null;
    $dt = DateTimeImmutable::createFromFormat($format, $s);
    return $dt ?: null;
}
filter_var with FILTER_VALIDATE_INT / FILTER_VALIDATE_FLOAT / FILTER_VALIDATE_EMAIL / FILTER_VALIDATE_URL covers the common cases out of the box. Use it before reaching for regex.
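A few quick checks of how filter_var behaves on typical scraped strings (the inputs are illustrative):
var_dump(filter_var('42', FILTER_VALIDATE_INT));                   // int(42)
var_dump(filter_var('4.20', FILTER_VALIDATE_FLOAT));               // float(4.2)
var_dump(filter_var('not-an-email', FILTER_VALIDATE_EMAIL));       // bool(false)
var_dump(filter_var('https://example.com', FILTER_VALIDATE_URL));  // string "https://example.com"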
Validate
Decide if a row is good enough to keep:
function is_valid_product(array $row): bool {
    return !empty($row['source_url'])
        && !empty($row['name'])
        && filter_var($row['source_url'], FILTER_VALIDATE_URL) !== false
        && is_float($row['price'] ?? null) // ?? guards against a missing key
        && $row['price'] > 0;
}

$clean = array_filter($rows, 'is_valid_product');
array_filter removes rows where the predicate returns false. The result is what you save downstream.
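One gotcha: array_filter preserves the original keys, so the result can have gaps. Reindex it if you need a sequential list, for example so json_encode emits a JSON array rather than an object:
$clean = array_values(array_filter($rows, 'is_valid_product'));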
For richer validation rules, use Symfony Validator or Respect/Validation:
composer require respect/validation
use Respect\Validation\Validator as v;

$rule = v::key('source_url', v::url())
    ->key('name', v::stringType()->notEmpty())
    ->key('price', v::floatVal()->positive());

$valid = [];
foreach ($rows as $row) {
    try {
        $rule->assert($row);
        $valid[] = $row;
    } catch (Exception $e) {
        // Log or collect $e->getMessage(); the invalid row is skipped
    }
}
Deduplicate
function dedupe_by_key(array $rows, string $key): array {
    $seen = [];
    $out = [];
    foreach ($rows as $row) {
        $k = $row[$key] ?? null;
        if ($k !== null && !isset($seen[$k])) {
            $seen[$k] = true;
            $out[] = $row;
        }
    }
    return $out;
}

$rows = dedupe_by_key($rows, 'source_url');
To keep the most recent record per URL, sort by scraped_at first, then dedupe:
usort($rows, fn($a, $b) => strcmp($a['scraped_at'], $b['scraped_at'])); // oldest first
// reverse so the newest rows come first; dedupe keeps the first occurrence per key
$rows = array_reverse(dedupe_by_key(array_reverse($rows), 'source_url'));
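An alternative that avoids the double reverse, assuming the rows are already sorted ascending by scraped_at: index by the key so later (newer) rows overwrite earlier ones.
// Last write per URL wins, so the newest row survives
$byUrl = [];
foreach ($rows as $row) {
    $byUrl[$row['source_url']] = $row;
}
$rows = array_values($byUrl);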
Derive columns
foreach ($rows as &$row) {
    $row['price_with_tax'] = round($row['price'] * 1.10, 2);
    $row['domain'] = parse_url($row['source_url'], PHP_URL_HOST);
    $row['scraped_date'] = substr($row['scraped_at'], 0, 10); // YYYY-MM-DD
    $row['is_expensive'] = $row['price'] > 50;
}
unset($row);
A pipeline pattern
Compose cleaning steps as small functions and pipe rows through them:
class Pipeline {
    private array $steps = [];

    public function pipe(callable $fn): self {
        $this->steps[] = $fn;
        return $this;
    }

    public function run(array $rows): array {
        foreach ($this->steps as $fn) {
            $rows = $fn($rows);
        }
        return $rows;
    }
}
// Usage
$pipeline = (new Pipeline())
    ->pipe(fn($rows) => array_map(
        // only pass strings to clean_string; leave other types untouched
        fn($r) => array_map(fn($v) => is_string($v) ? clean_string($v) : $v, $r),
        $rows
    ))
    ->pipe(fn($rows) => array_map(function ($r) {
        $r['price'] = parse_price($r['price']);
        $r['scraped_at'] = parse_date($r['scraped_at'], 'Y-m-d H:i:s')?->format('c');
        return $r;
    }, $rows))
    ->pipe(fn($rows) => array_filter($rows, 'is_valid_product'))
    ->pipe(fn($rows) => dedupe_by_key($rows, 'source_url'));

$clean = $pipeline->run($rows);
Each step is one transform. Inserting or removing steps doesn't touch the others.
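For instance, a derive-columns step (reusing the snippet from the section above) can be spliced into the chain without touching the other steps; a sketch, added before run() is called:
// Hypothetical extra step: derive columns inside the pipeline
$pipeline->pipe(fn($rows) => array_map(function ($r) {
    $r['domain'] = parse_url($r['source_url'], PHP_URL_HOST);
    $r['is_expensive'] = $r['price'] > 50;
    return $r;
}, $rows));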
Aggregate
$by_category = [];
foreach ($rows as $row) {
    $cat = $row['category'];
    $by_category[$cat] ??= ['count' => 0, 'total_price' => 0];
    $by_category[$cat]['count']++;
    $by_category[$cat]['total_price'] += $row['price'];
}

foreach ($by_category as $cat => &$agg) {
    $agg['avg_price'] = $agg['total_price'] / $agg['count'];
}
unset($agg);

print_r($by_category);
Verbose compared to pandas's df.groupby('category').agg(...), but the logic is the same. For larger aggregations, you can offload to SQL via PDO and SQLite (Lesson 1.30):
$result = $db->query('
    SELECT category, COUNT(*) AS count, AVG(price) AS avg_price
    FROM products GROUP BY category
')->fetchAll(PDO::FETCH_ASSOC);
This is often clearer and faster than PHP-level loops for non-trivial aggregations.
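If the cleaned rows only exist in PHP memory, one option is to push them into a throwaway in-memory SQLite table and aggregate there; a sketch, assuming each row has category and price fields:
// Load cleaned rows into :memory: SQLite purely to run the GROUP BY
$mem = new PDO('sqlite::memory:');
$mem->exec('CREATE TABLE products (category TEXT, price REAL)');
$ins = $mem->prepare('INSERT INTO products (category, price) VALUES (?, ?)');
foreach ($rows as $row) {
    $ins->execute([$row['category'], $row['price']]);
}
$byCategory = $mem->query('
    SELECT category, COUNT(*) AS count, AVG(price) AS avg_price
    FROM products GROUP BY category
')->fetchAll(PDO::FETCH_ASSOC);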
Export
// JSON
file_put_contents(
    'clean.json',
    json_encode($rows, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE)
);

// JSONL
$fp = fopen('clean.jsonl', 'w');
foreach ($rows as $row) {
    fwrite($fp, json_encode($row, JSON_UNESCAPED_UNICODE) . "\n");
}
fclose($fp);

// CSV
$fp = fopen('clean.csv', 'w');
fputcsv($fp, array_keys(reset($rows))); // header row from the first row's keys
foreach ($rows as $row) {
    fputcsv($fp, $row);
}
fclose($fp);

// To SQLite (INSERT OR REPLACE as a simple upsert for incremental writes)
$db = new PDO('sqlite:clean.db');
$db->exec('CREATE TABLE IF NOT EXISTS products (source_url TEXT PRIMARY KEY, name TEXT, price REAL)');
$stmt = $db->prepare('INSERT OR REPLACE INTO products (source_url, name, price) VALUES (?, ?, ?)');

$db->beginTransaction();
foreach ($rows as $row) {
    $stmt->execute([$row['source_url'], $row['name'], $row['price']]);
}
$db->commit();
When to use a library instead
For more complex cleaning (fuzzy matching, address normalization, NER, ML-assisted disambiguation), PHP's options are thinner than Python's. If your cleaning grows past basic type coercion, whitespace fixes, and dedup, exporting to Python for the heavy work and re-importing the cleaned data is a normal architecture. PHP scrapers feeding a Python cleaning step (or vice versa) is common in real teams.
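A minimal sketch of that handoff, assuming a hypothetical clean_heavy.py that reads one JSONL file and writes another:
// Write rows for the Python step, run it, read the cleaned rows back
$fp = fopen('handoff.jsonl', 'w');
foreach ($rows as $row) {
    fwrite($fp, json_encode($row, JSON_UNESCAPED_UNICODE) . "\n");
}
fclose($fp);

passthru('python3 clean_heavy.py handoff.jsonl cleaned.jsonl'); // hypothetical script

$rows = [];
foreach (file('cleaned.jsonl', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $rows[] = json_decode($line, true);
}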
Hands-on lab
Pull the JSON from /api/products/1/reviews. Clean it with the pipeline pattern: trim strings, coerce ratings to int and dates to DateTimeImmutable, dedupe by review ID, and drop reviews with empty bodies. Export the result both as JSONL and into a SQLite table with an upsert. Then compute "average rating per author" using either PHP loops or a SQL query.
Practice this lesson on Catalog108, our first-party scraping sandbox.