File Downloads: Images, PDFs, ZIPs
Beyond HTML: how to download binary files efficiently, stream big files without exhausting memory, and verify the file you got is the file you wanted.
What you’ll learn
- Save binary responses to disk in Python and PHP.
- Stream large files instead of buffering them in memory.
- Verify file integrity with size or hash checks.
- Use content-type and magic bytes to validate downloads.
Most scraping deals with HTML and JSON. Sometimes you also need the binary stuff: product images, PDF datasheets, ZIP archives. The basics are simple; the wrinkles around streaming, encoding, and validation are where production scrapers earn their stripes.
The simple case: small files
import requests
r = requests.get("https://practice.scrapingcentral.com/challenges/static/files/images/sample.png")
r.raise_for_status()
with open("sample.png", "wb") as f:
f.write(r.content)
r.content is the raw bytes. Open the destination file in "wb" (write-binary) mode and write them. That's it for files up to a few MB.
Streaming: large files
For anything big (a 50 MB PDF, a 1 GB archive), buffering the whole response in memory is wasteful and can OOM your scraper. Stream it instead:
r = requests.get(url, stream=True)
r.raise_for_status()
with open("big.zip", "wb") as f:
for chunk in r.iter_content(chunk_size=8192):
f.write(chunk)
stream=True tells requests not to read the body upfront. iter_content(chunk_size=...) yields chunks as they arrive. Memory usage stays at ~8KB regardless of file size.
Use stream=True for any download where you don't know the file size in advance. The cost is minimal: for small files you won't notice it, and for big files it saves you a memory bomb.
A progress indicator
import requests
from pathlib import Path
def download(url, dest):
    r = requests.get(url, stream=True, timeout=30)
    r.raise_for_status()
    total = int(r.headers.get("Content-Length", 0))
    downloaded = 0
    with open(dest, "wb") as f:
        for chunk in r.iter_content(8192):
            f.write(chunk)
            downloaded += len(chunk)
            if total:
                pct = 100 * downloaded / total
                print(f"\r{pct:.1f}%", end="", flush=True)
    print()
The Content-Length response header tells you the total. Not every server sends it (chunked transfer encoding omits it), and some send a wrong value. Handle both cases.
For real progress bars, the tqdm library is the canonical Python choice:
from tqdm import tqdm
with open(dest, "wb") as f, tqdm(total=total, unit="B", unit_scale=True) as bar:
for chunk in r.iter_content(8192):
f.write(chunk)
bar.update(len(chunk))
Verifying what you got
Three checks worth running:
1. Status and content type
if r.status_code != 200:
    raise RuntimeError(f"Got {r.status_code}")

ct = r.headers.get("Content-Type", "")
if not ct.startswith("image/"):
    raise RuntimeError(f"Expected image, got {ct}")
Servers sometimes return an HTML error page with status 200 (yes, really). Checking Content-Type catches this. Also catches "I expected a PDF but got an HTML 'access denied' page."
2. File size
expected = int(r.headers.get("Content-Length", 0))
actual = Path(dest).stat().st_size
if expected and actual != expected:
    raise RuntimeError(f"Truncated: got {actual}, expected {expected}")
Network interruptions during streaming can produce partial files. A size check before you treat the download as done is cheap insurance.
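One defensive pattern worth sketching (the helper name and the ".part" suffix are illustrative, not a standard): stream to a temporary path and only rename it into place once the size check passes, so a crashed download never leaves a plausible-looking but truncated file behind.

import requests
from pathlib import Path

def download_verified(url, dest):
    tmp = Path(str(dest) + ".part")  # temporary name; promoted only after verification
    r = requests.get(url, stream=True, timeout=30)
    r.raise_for_status()
    with open(tmp, "wb") as f:
        for chunk in r.iter_content(8192):
            f.write(chunk)
    expected = int(r.headers.get("Content-Length", 0))
    actual = tmp.stat().st_size
    if expected and actual != expected:
        tmp.unlink()  # discard the partial file
        raise RuntimeError(f"Truncated: got {actual}, expected {expected}")
    tmp.rename(dest)  # only a verified download gets the real name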
3. Magic bytes (file type by content)
The first few bytes of binary files identify them definitively:
| File type | Magic bytes (hex) | ASCII |
|---|---|---|
| PNG | 89 50 4E 47 0D 0A 1A 0A | \x89PNG\r\n\x1a\n |
| JPEG | FF D8 FF | \xff\xd8\xff |
| PDF | 25 50 44 46 | %PDF |
| ZIP | 50 4B 03 04 | PK\x03\x04 |
| GZIP | 1F 8B | \x1f\x8b |
def is_pdf(path):
    with open(path, "rb") as f:
        return f.read(4) == b"%PDF"
When a server says it's sending a PDF but actually sends an HTML "this PDF is gone" page, magic bytes catch the lie.
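A generalised version of is_pdf, sketched here with a hypothetical sniff() helper, checks the header against all the signatures from the table above and tells you what the file actually is:

# Signature table from above, keyed by the raw leading bytes.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"%PDF": "pdf",
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
}

def sniff(path):
    with open(path, "rb") as f:
        head = f.read(8)  # longest signature here (PNG) is 8 bytes
    for sig, kind in MAGIC.items():
        if head.startswith(sig):
            return kind
    return None  # unknown -- treat as suspect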
PHP equivalents
$ch = curl_init('https://practice.scrapingcentral.com/challenges/static/files/images/sample.png');
$fp = fopen('sample.png', 'wb');
curl_setopt($ch, CURLOPT_FILE, $fp); // write directly to file, no buffering
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);
fclose($fp);
CURLOPT_FILE is the cURL equivalent of stream-and-write: the body goes straight to the file handle, which is far more memory-efficient than capturing it to a string.
With Guzzle:
use GuzzleHttp\Client;
$client = new Client();
$client->get('https://practice.scrapingcentral.com/challenges/static/files/images/sample.png', [
    'sink' => '/path/to/sample.png',
]);
The sink option streams the response body directly to the given path or stream. No memory bloat regardless of file size.
Concurrency for many downloads
Downloading 1000 product images one at a time is slow. Parallelize:
Python (httpx + asyncio):
import asyncio, httpx
async def fetch(client, url, dest):
    async with client.stream("GET", url) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            async for chunk in r.aiter_bytes(8192):
                f.write(chunk)

async def main(jobs):
    async with httpx.AsyncClient(timeout=30) as client:
        sem = asyncio.Semaphore(10)

        async def bounded(url, dest):
            async with sem:
                await fetch(client, url, dest)

        await asyncio.gather(*[bounded(u, d) for u, d in jobs])

asyncio.run(main([(url1, path1), (url2, path2), ...]))
The Semaphore caps concurrency: 10 concurrent downloads is polite for most sites; pushing higher risks IP-level rate limits.
PHP (Guzzle Pool):
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
$client = new Client();
$pool = new Pool($client, (function () use ($urls) {
    foreach ($urls as $i => $url) {
        yield new Request('GET', $url);
    }
})(), [
    'concurrency' => 10,
    'fulfilled' => function ($response, $index) use ($urls) {
        $dest = '/path/' . basename($urls[$index]);
        file_put_contents($dest, $response->getBody());
    },
]);

$pool->promise()->wait();
Polite rate limiting
Mass downloading triggers anti-abuse measures faster than mass HTML scraping because bandwidth is expensive. Two things help:
- Throttle. Cap concurrent connections (Semaphore / Pool concurrency).
- Honour Retry-After. If a server answers 429 with Retry-After: 60, wait 60 seconds. Don't argue. A minimal sketch follows this list.
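Here is that sketch, assuming the numeric form of Retry-After (the header can also carry an HTTP date, which this skips) and a hypothetical get_with_retry_after() helper:

import time
import requests

def get_with_retry_after(url, max_attempts=5):
    for attempt in range(max_attempts):
        r = requests.get(url, stream=True, timeout=30)
        if r.status_code != 429:
            r.raise_for_status()
            return r
        try:
            wait = int(r.headers.get("Retry-After", 60))  # seconds form only
        except ValueError:
            wait = 60  # header was an HTTP date or malformed; fall back
        time.sleep(wait)
    raise RuntimeError(f"Still rate-limited after {max_attempts} attempts")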
Naming and deduplication
Decide upfront how you'll name files:
- URL-based: urlparse(url).path.split("/")[-1]. Fast, but collides if multiple URLs map to the same name.
- Hash-based: hashlib.sha256(content).hexdigest(). Unique, but loses readable names.
- ID-based: extract from your own data (f"{product_id}.png"). Readable AND unique.
The third option is usually the right one for scrapers that maintain a record of source rows.
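For illustration, here are the three strategies side by side (the URL, the bytes, and product_id are placeholders, not values from this lesson):

import hashlib
from urllib.parse import urlparse

url = "https://example.com/media/12345/photo.png"  # placeholder source URL
content = b"..."                                    # the downloaded bytes
product_id = "SKU-12345"                            # from your own scraped row

url_name = urlparse(url).path.split("/")[-1]               # "photo.png" -- may collide
hash_name = hashlib.sha256(content).hexdigest() + ".png"   # unique, but unreadable
id_name = f"{product_id}.png"                              # readable and unique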
Hands-on lab
Download every image from /challenges/static/files/images. Use streaming. Verify each is a valid PNG/JPEG with magic-byte checks. Print total bytes downloaded. Then try /challenges/static/files/pdfs and /challenges/static/files/large (large covers files big enough that you actually need streaming).