Scraping Central is reader-supported. When you buy through links on our site, we may earn an affiliate commission.

Intermediate · 4 min read

S3 + Parquet for Cold Storage

Object storage with columnar Parquet files is the cheapest durable home for scraped data you don't query daily. The patterns that make it efficient.

What you’ll learn

  • Explain when Parquet on S3 beats a database for cold data.
  • Partition Parquet datasets so queries skip files.
  • Write Parquet from Python and PHP scraping pipelines.

The cheapest place to keep scraped data is object storage (S3, R2, Backblaze B2, MinIO), at roughly $0.005–$0.023 per GB per month depending on tier. Compared to a hot database at $0.10+ per GB, that's a 10–20x cost difference for data you rarely query.

Parquet is the file format that makes this fast: columnar, compressed, with statistics that let query engines skip whole files.

When cold storage is right

  • Last 90 days of price observations: Postgres or ClickHouse (hot)
  • Years of price history you query monthly: S3 + Parquet (cold)
  • Raw HTML snapshots: S3 (gzip-compressed)
  • Daily aggregated summaries: hot DB
  • Per-row event log kept forever: S3 + Parquet

Tier your data. Cold tiers are for "I might need this" not "I query it hourly."

Why Parquet, not JSON/CSV

  • JSON lines: poor compression (~3x with gzip); must read the whole file; no schema; readable by every engine.
  • CSV: OK compression; must read the whole file; implicit schema; readable by every engine.
  • Parquet: excellent compression (5–10x); reads just the columns you need; embedded schema; supported by DuckDB, ClickHouse, Trino, Spark, Athena.

1 GB of JSON-lines scraped events typically becomes ~150 MB of Parquet. Querying one column reads ~30 MB instead of 1 GB.
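Writing Parquet itself needs pyarrow, but the gzip baseline quoted in the table above is easy to sanity-check with the standard library alone. The event schema here mirrors this article's examples; the exact ratio depends on how repetitive your data is:

```python
import gzip
import json
import random

# Fake price events matching this article's example schema
events = [
    {"source": "catalog108", "source_id": f"P-{i}",
     "observed_at": "2026-05-12T10:00:00", "price_cents": random.randint(100, 9999)}
    for i in range(10_000)
]

# Serialize as JSON lines, then compress the whole batch
raw = "\n".join(json.dumps(e) for e in events).encode()
gzipped = gzip.compress(raw)
ratio = len(raw) / len(gzipped)
print(f"raw JSONL: {len(raw):,} bytes, gzipped: {len(gzipped):,} bytes, {ratio:.1f}x smaller")
```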

Partitioning by date

Write each day's data into its own folder:

s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/part-0.parquet
s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/part-1.parquet
s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=13/part-0.parquet

The key=value/ convention is "Hive partitioning." Query engines parse this from the path and prune files. A query WHERE year=2026 AND month=05 AND day=12 reads only that folder.

Partition by how you query: usually source, year, month. Don't over-partition; millions of tiny files are worse than fewer large ones. Aim for 100 MB–1 GB per file.
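Building these prefixes is a one-liner worth getting right once. A minimal sketch (the function name is ours, not a library API); note the zero-padding, which keeps prefixes sorting lexicographically in listings:

```python
from datetime import date

def partition_prefix(bucket: str, dataset: str, source: str, day: date) -> str:
    """Build a Hive-style partition prefix (source/year/month/day),
    zero-padding month and day so paths sort lexicographically."""
    return (f"s3://{bucket}/{dataset}/source={source}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/")

print(partition_prefix("my-bucket", "price-events", "catalog108", date(2026, 5, 12)))
# s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/
```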

Writing Parquet from Python

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

events = [
    {"source": "catalog108", "source_id": "P-1042",
     "observed_at": "2026-05-12T10:00:00", "price_cents": 4999},
    # ...
]

table = pa.Table.from_pylist(events)
pq.write_to_dataset(
    table,
    root_path="my-bucket/price-events",  # bucket + prefix; the filesystem supplies the scheme
    partition_cols=["source"],
    filesystem=fs.S3FileSystem(region="us-east-1"),
    compression="zstd",  # better than snappy, smaller files
)

For batch ETL, accumulate a day's events and flush once:

# At end of day
table = pa.Table.from_pandas(pandas_df)
pq.write_table(
    table,
    "s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/part-0.parquet",
    compression="zstd",
)
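The accumulate-and-flush pattern can be sketched as a small buffer. The flush callback is injected: in a real pipeline it would build a pyarrow Table and call pq.write_table to the day's partition; here it is any callable, so the sketch (class and parameter names are ours) runs without S3 or pyarrow:

```python
from typing import Callable

class DailyBuffer:
    """Accumulate scraped events in memory and flush in fixed-size batches."""

    def __init__(self, flush_fn: Callable[[list], None], max_events: int = 50_000):
        self.flush_fn = flush_fn
        self.max_events = max_events
        self.events: list = []

    def add(self, event: dict) -> None:
        self.events.append(event)
        if len(self.events) >= self.max_events:
            self.flush()

    def flush(self) -> None:
        # Called automatically when a batch fills, and once at end of day
        if self.events:
            self.flush_fn(self.events)
            self.events = []
```

At end of day the pipeline calls flush() one last time to write the final partial batch.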

Writing Parquet from PHP

PHP's Parquet support is less mature. The pragmatic options:

  • Convert via the DuckDB CLI: write JSONL or CSV, then run DuckDB to convert:
    duckdb -c "COPY (SELECT * FROM read_json_auto('events.jsonl')) TO 'events.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)"
    
  • Use the flow-php/etl library; Flow PHP ships a Parquet writer:
    use function Flow\ETL\DSL\{data_frame, from_array, to_parquet};
    data_frame()
    ->read(from_array($events))
    ->write(to_parquet('events.parquet'))
    ->run();
    
  • Two-stage pipeline: PHP scrapers write JSONL to S3; a nightly Python or DuckDB job converts it to Parquet.

Many production setups use option three: PHP scrapers stay simple, Parquet conversion is a separate job.
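For that nightly job, the conversion step often reduces to generating a DuckDB COPY statement and shelling out. A sketch of the SQL builder (function name and paths are illustrative; S3 paths additionally require DuckDB's httpfs extension to be loaded):

```python
def jsonl_to_parquet_sql(src: str, dest: str) -> str:
    """Build the DuckDB COPY statement that converts JSONL to zstd Parquet."""
    return (f"COPY (SELECT * FROM read_json_auto('{src}')) "
            f"TO '{dest}' (FORMAT PARQUET, COMPRESSION ZSTD)")

sql = jsonl_to_parquet_sql(
    "s3://my-bucket/raw/2026-05-12/events.jsonl",
    "s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/part-0.parquet",
)
# hand off to the CLI: subprocess.run(["duckdb", "-c", sql], check=True)
```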

Querying Parquet without a database

DuckDB queries S3 Parquet directly, no warehouse needed:

SELECT source, AVG(price_cents)/100 AS avg_usd
FROM 's3://my-bucket/price-events/source=*/year=2026/month=05/**/*.parquet'
GROUP BY source;

For dashboards, point Grafana or Metabase at DuckDB and it queries Parquet as if it were a table. For very large data, ClickHouse can also read Parquet from S3 directly via the s3() table function.

Compaction

Many small files hurt query performance. Set up a daily compaction job:

import pyarrow.parquet as pq

# Read all of the day partition's parts, rewrite them as one file
table = pq.read_table("s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/")
pq.write_table(table, "s3://my-bucket/.../day=12/part-compact.parquet", compression="zstd")
# Then delete the small parts

For larger pipelines, tools like Apache Iceberg / Delta Lake handle this and metadata more rigorously, but plain Parquet with daily compaction works fine up to TBs.
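Before rewriting anything, the compaction job needs to find which partitions actually have multiple small parts. A stdlib sketch (the function name is ours) that groups listed object keys by their partition directory:

```python
from collections import defaultdict
import posixpath

def compaction_batches(keys: list[str]) -> dict[str, list[str]]:
    """Group part-file keys by partition directory; partitions with more
    than one part are candidates for read-rewrite-delete compaction."""
    groups: dict[str, list[str]] = defaultdict(list)
    for key in keys:
        groups[posixpath.dirname(key)].append(key)
    return {d: parts for d, parts in groups.items() if len(parts) > 1}
```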

Lifecycle policies

S3 lets you transition objects across tiers automatically. A typical policy:

  • Days 0–30: S3 Standard ($0.023/GB)
  • Days 31–90: S3 Standard-IA ($0.0125/GB)
  • Days 91+: S3 Glacier Instant Retrieval ($0.004/GB)
  • Days 365+: S3 Glacier Deep Archive ($0.00099/GB)

Set it once on the bucket; AWS migrates objects automatically. Retrieval cost matters: Glacier tiers charge per GB pulled back.
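The schedule above can be expressed as the rule dict that boto3's put_bucket_lifecycle_configuration expects. The rule ID and prefix are illustrative; the storage class names are the S3 API constants:

```python
lifecycle_rule = {
    "ID": "tier-price-events",                 # illustrative rule name
    "Filter": {"Prefix": "price-events/"},     # apply only to this dataset
    "Status": "Enabled",
    "Transitions": [
        {"Days": 31, "StorageClass": "STANDARD_IA"},
        {"Days": 91, "StorageClass": "GLACIER_IR"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
    ],
}
# then, with an s3 client:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration={"Rules": [lifecycle_rule]})
```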

What to try

Run MinIO locally (docker run -p 9000:9000 minio/minio server /data), write 100k fake price events as Parquet partitioned by source/year/month, then query with DuckDB. Compare file size vs the same data as JSONL. Compression should give roughly 5x reduction.
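Generating the 100k fake events is the only fiddly part of this exercise; a minimal stdlib sketch using the schema from this article (function name and value ranges are ours):

```python
import random
from datetime import datetime, timedelta

def fake_events(n: int, start: datetime) -> list[dict]:
    """Generate n synthetic price events, one per second from `start`."""
    return [
        {"source": "catalog108",
         "source_id": f"P-{random.randint(1000, 9999)}",
         "observed_at": (start + timedelta(seconds=i)).isoformat(),
         "price_cents": random.randint(99, 19_999)}
        for i in range(n)
    ]

events = fake_events(100_000, datetime(2026, 5, 12))
# next: write events as JSONL and as partitioned Parquet, then compare sizes
```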

Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


Why is Parquet preferred over JSON-lines for cold storage of scraped data?
