S3 + Parquet for Cold Storage
Object storage with columnar Parquet files is the cheapest durable home for scraped data you don't query daily. The patterns that make it efficient.
What you’ll learn
- Explain when Parquet on S3 beats a database for cold data.
- Partition Parquet datasets so queries skip files.
- Write Parquet from Python and PHP scraping pipelines.
The cheapest place to keep scraped data is object storage (S3, R2, Backblaze B2, MinIO), at roughly $0.005–$0.023 per GB per month depending on tier. Compared to a hot database at $0.10+ per GB, that's a 10–20x cost difference for data you rarely query.
Parquet is the file format that makes this fast: columnar, compressed, with statistics that let query engines skip whole files.
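Those statistics live in each file's footer, and you can inspect them from Python. A quick sketch with pyarrow (the file name is hypothetical; it assumes the file was written with statistics enabled, which is the default):

import pyarrow.parquet as pq

meta = pq.ParquetFile("events.parquet").metadata
stats = meta.row_group(0).column(0).statistics
# Engines compare these min/max values against a WHERE clause
# and skip row groups (or whole files) that can't match.
print(stats.min, stats.max)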
When cold storage is right
| Data | Where |
|---|---|
| Last 90 days of price observations | Postgres or ClickHouse (hot) |
| Years of price history you query monthly | S3 + Parquet (cold) |
| Raw HTML snapshots | S3 (gzip-compressed) |
| Daily aggregated summaries | Hot DB |
| Per-row event log forever | S3 + Parquet |
Tier your data. Cold tiers are for "I might need this," not "I query it hourly."
Why Parquet, not JSON/CSV
| Format | Compression | Selective read | Schema | Query engines |
|---|---|---|---|---|
| JSON lines | Poor (3x with gzip) | Read whole file | None | All |
| CSV | OK | Read whole file | Implicit | All |
| Parquet | Excellent (5–10x) | Read just the columns you need | Embedded | DuckDB, ClickHouse, Trino, Spark, Athena |
1 GB of JSON-lines scraped events typically becomes ~150 MB of Parquet. Querying one column reads ~30 MB instead of 1 GB.
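You can see the selective-read effect from Python. A minimal sketch with pyarrow (file name hypothetical):

import pyarrow.parquet as pq

# Only the price_cents column chunks are read; the other columns are never touched
prices = pq.read_table("events.parquet", columns=["price_cents"])
print(prices.num_rows, "rows,", prices.nbytes, "bytes for one column")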
Partitioning by date
Write each day's data into its own folder:
s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/part-0.parquet
s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/part-1.parquet
s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=13/part-0.parquet
The key=value/ convention is "Hive partitioning." Query engines parse it from the path and prune files: a query with WHERE year=2026 AND month=5 AND day=12 reads only that folder.
Partition by how you query: usually source, year, month. Don't over-partition: millions of tiny files are worse than fewer large ones. Aim for 100MB–1GB per file.
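Engines aren't the only readers that prune. pyarrow itself skips non-matching partition directories when you pass filters; a sketch assuming the bucket layout above (pyarrow infers the partition columns from the hive-style paths):

import pyarrow.parquet as pq

# Directories whose year/month/day values don't match the filter
# are never listed or opened; pruning happens from the path alone.
table = pq.read_table(
    "s3://my-bucket/price-events/",
    filters=[("year", "=", 2026), ("month", "=", 5), ("day", "=", 12)],
)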
Writing Parquet from Python
import pyarrow as pa
import pyarrow.fs  # S3FileSystem lives in the pyarrow.fs submodule
import pyarrow.parquet as pq

events = [
    {"source": "catalog108", "source_id": "P-1042",
     "observed_at": "2026-05-12T10:00:00", "price_cents": 4999},
    # ...
]

table = pa.Table.from_pylist(events)

pq.write_to_dataset(
    table,
    root_path="my-bucket/price-events",  # bucket-relative path when filesystem= is given
    partition_cols=["source"],
    filesystem=pa.fs.S3FileSystem(region="us-east-1"),
    compression="zstd",  # better than snappy, smaller files
)
For batch ETL, accumulate a day's events and flush once:
# At end of day: convert the accumulated DataFrame and write one file
table = pa.Table.from_pandas(pandas_df)
pq.write_table(
    table,
    "s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/part-0.parquet",
    compression="zstd",
)
Writing Parquet from PHP
PHP's Parquet support is less mature. The pragmatic options:
- Convert via DuckDB CLI: write JSONL or CSV from PHP, then run DuckDB to convert:

  duckdb -c "COPY (SELECT * FROM read_json_auto('events.jsonl')) TO 'events.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)"

- Use the flow-php/etl library; Flow PHP has a Parquet writer:

  data_frame()
      ->from(from_array($events))
      ->saveMode(SaveMode::Overwrite)
      ->load(to_parquet('events.parquet'))
      ->run();

- Two-stage pipeline: PHP scrapers write JSONL to S3; a nightly Python or DuckDB job converts to Parquet.
Many production setups use option three: PHP scrapers stay simple, Parquet conversion is a separate job.
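A minimal sketch of that nightly conversion job, using DuckDB from Python (the bucket names, prefixes, and single-source layout are illustrative assumptions):

from datetime import date, timedelta

import duckdb

day = date.today() - timedelta(days=1)
src = f"s3://my-bucket/raw/source=catalog108/{day:%Y/%m/%d}/*.jsonl"
dst = (f"s3://my-bucket/price-events/source=catalog108/"
       f"year={day.year}/month={day:%m}/day={day:%d}/part-0.parquet")

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")  # S3 support; credentials must be configured separately
con.sql(f"""
    COPY (SELECT * FROM read_json_auto('{src}'))
    TO '{dst}' (FORMAT PARQUET, COMPRESSION ZSTD)
""")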
Querying Parquet without a database
DuckDB queries S3 Parquet directly; no warehouse needed:
-- requires the httpfs extension (INSTALL httpfs; LOAD httpfs;) and S3 credentials
SELECT source, AVG(price_cents) / 100 AS avg_usd
FROM 's3://my-bucket/price-events/source=*/year=2026/month=05/**/*.parquet'
GROUP BY source;
For dashboards, point Grafana or Metabase at DuckDB and it will query Parquet as if it were a table. For very large datasets, ClickHouse can also read Parquet from S3 directly via the s3() table function.
Compaction
Many small files hurt query performance. Set up a daily compaction job:
# Read all day-partition parts, rewrite as one
table = pq.read_table("s3://my-bucket/price-events/source=catalog108/year=2026/month=05/day=12/")
pq.write_table(table, "s3://my-bucket/.../day=12/part-compact.parquet", compression="zstd")
# Then delete the small parts
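The delete step can ride along in the same job. A sketch using pyarrow's S3 filesystem (bucket path and region are placeholders):

import pyarrow.fs
import pyarrow.parquet as pq

s3 = pyarrow.fs.S3FileSystem(region="us-east-1")
day_dir = "my-bucket/price-events/source=catalog108/year=2026/month=05/day=12"

# Remember the small part files before writing the compact one
old_parts = [f.path for f in s3.get_file_info(pyarrow.fs.FileSelector(day_dir))
             if f.path.endswith(".parquet")]

table = pq.read_table(day_dir, filesystem=s3)
pq.write_table(table, f"{day_dir}/part-compact.parquet",
               filesystem=s3, compression="zstd")

for path in old_parts:  # drop the originals only after the compact file exists
    s3.delete_file(path)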
For larger pipelines, table formats like Apache Iceberg or Delta Lake handle compaction and metadata more rigorously, but plain Parquet with daily compaction works fine up to terabytes.
Lifecycle policies
S3 lets you transition objects across tiers automatically. A typical policy:
- Days 0–30: S3 Standard ($0.023/GB)
- Days 31–90: S3 Standard-IA ($0.0125/GB)
- Days 91+: S3 Glacier Instant Retrieval ($0.004/GB)
- Days 365+: S3 Glacier Deep Archive ($0.00099/GB)
Set it once on the bucket; AWS migrates objects automatically. Retrieval cost matters: Glacier tiers charge per GB pulled back.
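The same policy can also be applied programmatically with boto3 (bucket name, rule ID, and prefix are placeholders):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-price-events",
            "Status": "Enabled",
            "Filter": {"Prefix": "price-events/"},
            "Transitions": [
                {"Days": 31, "StorageClass": "STANDARD_IA"},
                {"Days": 91, "StorageClass": "GLACIER_IR"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }]
    },
)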
What to try
Run MinIO locally (docker run -p 9000:9000 minio/minio server /data), write 100k fake price events as Parquet partitioned by source/year/month, then query with DuckDB. Compare the file size against the same data as JSONL; compression should give roughly a 5x reduction.
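A local-disk sketch of the exercise (skipping MinIO and partitioning only by source for brevity; exact ratios will vary with synthetic data):

import json
import os
import random
from datetime import datetime, timedelta

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Generate 100k fake price events
start = datetime(2026, 5, 1)
events = [{
    "source": f"catalog{random.randint(100, 110)}",
    "source_id": f"P-{random.randint(1000, 9999)}",
    "observed_at": (start + timedelta(minutes=i)).isoformat(),
    "price_cents": random.randint(500, 50_000),
} for i in range(100_000)]

# Write both formats and compare on-disk size
with open("events.jsonl", "w") as f:
    for e in events:
        f.write(json.dumps(e) + "\n")
pq.write_to_dataset(pa.Table.from_pylist(events),
                    root_path="price-events",
                    partition_cols=["source"],
                    compression="zstd")

parquet_bytes = sum(os.path.getsize(os.path.join(root, name))
                    for root, _, names in os.walk("price-events")
                    for name in names)
print(os.path.getsize("events.jsonl"), "JSONL bytes vs", parquet_bytes, "Parquet bytes")

# Query the partitioned dataset directly with DuckDB
print(duckdb.sql("""
    SELECT source, AVG(price_cents) / 100 AS avg_usd
    FROM read_parquet('price-events/**/*.parquet', hive_partitioning=true)
    GROUP BY source ORDER BY source
"""))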
Quiz: check your understanding
Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.