Scraping Central is reader-supported. When you buy through links on our site, we may earn an affiliate commission.

Intermediate · 4 min read

S3 + Parquet for Cold Storage

Object storage with columnar Parquet files is the cheapest durable home for scraped data you don't query daily. The patterns that make it efficient.

What you’ll learn

  • Explain when Parquet on S3 beats a database for cold data.
  • Partition Parquet datasets so queries skip files.
  • Write Parquet from Python and PHP scraping pipelines.

The cheapest place to keep scraped data is object storage (S3, R2, Backblaze B2, MinIO), at roughly $0.005–$0.023 per GB per month depending on tier. Compared to a hot database at $0.10+ per GB, that's a 10–20x cost difference for data you rarely query.

Parquet is the file format that makes this fast: columnar, compressed, with statistics that let query engines skip whole files.

When cold storage is right

  • Last 90 days of price observations: Postgres or ClickHouse (hot)
  • Years of price history you query monthly: S3 + Parquet (cold)
  • Raw HTML snapshots: S3 (gzip-compressed)
  • Daily aggregated summaries: hot DB
  • Per-row event log kept forever: S3 + Parquet

Tier your data. Cold tiers are for "I might need this" not "I query it hourly."

Why Parquet, not JSON/CSV

  • JSON lines: poor compression (~3x with gzip); must read the whole file; no schema; readable by every engine.
  • CSV: OK compression; must read the whole file; implicit schema; readable by every engine.
  • Parquet: excellent compression (5–10x); reads just the columns you need; embedded schema; supported by DuckDB, ClickHouse, Trino, Spark, Athena.

1 GB of JSON-lines scraped events typically becomes ~150 MB of Parquet. Querying one column reads ~30 MB instead of 1 GB.
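Writing Parquet itself needs pyarrow, but the gzip baseline quoted in the table above is easy to sanity-check with the standard library alone. The event schema here mirrors this article's examples; the exact ratio depends on how repetitive your data is:

```python
import gzip
import json
import random

# Fake price events matching this article's example schema
events = [
    {"source": "catalog108", "source_id": f"P-{i}",
     "observed_at": "2026-05-12T10:00:00", "price_cents": random.randint(100, 9999)}
    for i in range(10_000)
]

# Serialize as JSON lines, then compress the whole batch
raw = "\n".join(json.dumps(e) for e in events).encode()
gzipped = gzip.compress(raw)
ratio = len(raw) / len(gzipped)
print(f"raw JSONL: {len(raw):,} bytes, gzipped: {len(gzipped):,} bytes, {ratio:.1f}x smaller")
```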

Partitioning by date

Write each day's data into its own folder:

s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/part-0.parquet
s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/part-1.parquet
s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=13/part-0.parquet

The key=value/ convention is "Hive partitioning." Query engines parse this from the path and prune files. A query WHERE year=2026 AND month=05 AND day=12 reads only that folder.

Partition by how you query: usually source, year, month. Don't over-partition; millions of tiny files are worse than fewer large ones. Aim for 100 MB–1 GB per file.
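Building these prefixes is a one-liner worth getting right once. A minimal sketch (the function name is ours, not a library API); note the zero-padding, which keeps prefixes sorting lexicographically in listings:

```python
from datetime import date

def partition_prefix(bucket: str, dataset: str, source: str, day: date) -> str:
    """Build a Hive-style partition prefix (source/year/month/day),
    zero-padding month and day so paths sort lexicographically."""
    return (f"s3://{bucket}/{dataset}/source={source}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/")

print(partition_prefix("my-bucket", "price-events", "catalog108", date(2026, 5, 12)))
# s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/
```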

Writing Parquet from Python

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

events = [
    {"source": "catalog108", "source_id": "P-1042",
     "observed_at": "2026-05-12T10:00:00", "price_cents": 4999},
    # ...
]

table = pa.Table.from_pylist(events)
pq.write_to_dataset(
    table,
    root_path="my-bucket/price-events",  # bucket + prefix; the filesystem supplies the scheme
    partition_cols=["source"],
    filesystem=fs.S3FileSystem(region="us-east-1"),
    compression="zstd",  # better than snappy, smaller files
)

For batch ETL, accumulate a day's events and flush once:

# At end of day
table = pa.Table.from_pandas(pandas_df)
pq.write_table(
    table,
    "s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/part-0.parquet",
    compression="zstd",
)
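The accumulate-and-flush pattern can be sketched as a small buffer. The flush callback is injected: in a real pipeline it would build a pyarrow Table and call pq.write_table to the day's partition; here it is any callable, so the sketch (class and parameter names are ours) runs without S3 or pyarrow:

```python
from typing import Callable

class DailyBuffer:
    """Accumulate scraped events in memory and flush in fixed-size batches."""

    def __init__(self, flush_fn: Callable[[list], None], max_events: int = 50_000):
        self.flush_fn = flush_fn
        self.max_events = max_events
        self.events: list = []

    def add(self, event: dict) -> None:
        self.events.append(event)
        if len(self.events) >= self.max_events:
            self.flush()

    def flush(self) -> None:
        # Called automatically when a batch fills, and once at end of day
        if self.events:
            self.flush_fn(self.events)
            self.events = []
```

At end of day the pipeline calls flush() one last time to write the final partial batch.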

Writing Parquet from PHP

PHP's Parquet support is less mature. The pragmatic options:

  • Convert via the DuckDB CLI: write JSONL or CSV, then run DuckDB to convert:
    duckdb -c "COPY (SELECT * FROM read_json_auto('events.jsonl')) TO 'events.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)"
    
  • Use the flow-php/etl library; Flow PHP ships a Parquet writer:
    use function Flow\ETL\DSL\{data_frame, from_array, to_parquet};
    data_frame()
    ->read(from_array($events))
    ->write(to_parquet('events.parquet'))
    ->run();
    
  • Two-stage pipeline: PHP scrapers write JSONL to S3; a nightly Python or DuckDB job converts it to Parquet.

Many production setups use option three: PHP scrapers stay simple, Parquet conversion is a separate job.
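For that nightly job, the conversion step often reduces to generating a DuckDB COPY statement and shelling out. A sketch of the SQL builder (function name and paths are illustrative; S3 paths additionally require DuckDB's httpfs extension to be loaded):

```python
def jsonl_to_parquet_sql(src: str, dest: str) -> str:
    """Build the DuckDB COPY statement that converts JSONL to zstd Parquet."""
    return (f"COPY (SELECT * FROM read_json_auto('{src}')) "
            f"TO '{dest}' (FORMAT PARQUET, COMPRESSION ZSTD)")

sql = jsonl_to_parquet_sql(
    "s3://my-bucket/raw/2026-05-12/events.jsonl",
    "s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/part-0.parquet",
)
# hand off to the CLI: subprocess.run(["duckdb", "-c", sql], check=True)
```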

Querying Parquet without a database

DuckDB queries S3 Parquet directly, no warehouse needed:

SELECT source, AVG(price_cents)/100 AS avg_usd
FROM 's3://my-bucket/price-events/source=*/year=2026/month=05/**/*.parquet'
GROUP BY source;

For dashboards, point Grafana or Metabase at DuckDB and it queries Parquet as if it were a table. For very large data, ClickHouse can also read Parquet from S3 directly via the s3() table function.

Compaction

Many small files hurt query performance. Set up a daily compaction job:

import pyarrow.parquet as pq

# Read all of the day partition's parts, rewrite them as one file
table = pq.read_table("s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/")
pq.write_table(table, "s3://my-bucket/.../day=12/part-compact.parquet", compression="zstd")
# Then delete the small parts

For larger pipelines, tools like Apache Iceberg / Delta Lake handle this and metadata more rigorously, but plain Parquet with daily compaction works fine up to TBs.
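Before rewriting anything, the compaction job needs to find which partitions actually have multiple small parts. A stdlib sketch (the function name is ours) that groups listed object keys by their partition directory:

```python
from collections import defaultdict
import posixpath

def compaction_batches(keys: list[str]) -> dict[str, list[str]]:
    """Group part-file keys by partition directory; partitions with more
    than one part are candidates for read-rewrite-delete compaction."""
    groups: dict[str, list[str]] = defaultdict(list)
    for key in keys:
        groups[posixpath.dirname(key)].append(key)
    return {d: parts for d, parts in groups.items() if len(parts) > 1}
```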

Lifecycle policies

S3 lets you transition objects across tiers automatically. A typical policy:

  • Days 0–30: S3 Standard ($0.023/GB)
  • Days 31–90: S3 Standard-IA ($0.0125/GB)
  • Days 91+: S3 Glacier Instant Retrieval ($0.004/GB)
  • Days 365+: S3 Glacier Deep Archive ($0.00099/GB)

Set it once on the bucket; AWS migrates objects automatically. Retrieval cost matters: Glacier tiers charge per GB pulled back.
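The schedule above can be expressed as the rule dict that boto3's put_bucket_lifecycle_configuration expects. The rule ID and prefix are illustrative; the storage class names are the S3 API constants:

```python
lifecycle_rule = {
    "ID": "tier-price-events",                 # illustrative rule name
    "Filter": {"Prefix": "price-events/"},     # apply only to this dataset
    "Status": "Enabled",
    "Transitions": [
        {"Days": 31, "StorageClass": "STANDARD_IA"},
        {"Days": 91, "StorageClass": "GLACIER_IR"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
    ],
}
# then, with an s3 client:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration={"Rules": [lifecycle_rule]})
```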

What to try

Run MinIO locally (docker run -p 9000:9000 minio/minio server /data), write 100k fake price events as Parquet partitioned by source/year/month, then query with DuckDB. Compare file size vs the same data as JSONL. Compression should give roughly 5x reduction.
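Generating the 100k fake events is the only fiddly part of this exercise; a minimal stdlib sketch using the schema from this article (function name and value ranges are ours):

```python
import random
from datetime import datetime, timedelta

def fake_events(n: int, start: datetime) -> list[dict]:
    """Generate n synthetic price events, one per second from `start`."""
    return [
        {"source": "catalog108",
         "source_id": f"P-{random.randint(1000, 9999)}",
         "observed_at": (start + timedelta(seconds=i)).isoformat(),
         "price_cents": random.randint(99, 19_999)}
        for i in range(n)
    ]

events = fake_events(100_000, datetime(2026, 5, 12))
# next: write events as JSONL and as partitioned Parquet, then compare sizes
```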

Quiz: check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


Why is Parquet preferred over JSON-lines for cold storage of scraped data?
