File Downloads: Images, PDFs, ZIPs
Beyond HTML: how to download binary files efficiently, stream big files without exhausting memory, and verify the file you got is the file you wanted.
What you’ll learn
- Save binary responses to disk in Python and PHP.
- Stream large files instead of buffering them in memory.
- Verify file integrity with size or hash checks.
- Use content-type and magic bytes to validate downloads.
Most scraping deals with HTML and JSON. Sometimes you also need the binary stuff: product images, PDF datasheets, ZIP archives. The basics are simple; the wrinkles around streaming, encoding, and validation are where production scrapers earn their stripes.
The simple case: small files
import requests
r = requests.get("https://practice.scrapingcentral.com/challenges/static/files/images/sample.png")
r.raise_for_status()
with open("sample.png", "wb") as f:
f.write(r.content)
r.content is the raw bytes. Open the destination file in "wb" (write-binary) mode and write them. That's it for files up to a few MB.
Streaming: large files
For anything big (a 50 MB PDF, a 1 GB archive), buffering the whole response in memory is wasteful and can OOM your scraper. Stream it instead:
r = requests.get(url, stream=True)
r.raise_for_status()
with open("big.zip", "wb") as f:
for chunk in r.iter_content(chunk_size=8192):
f.write(chunk)
stream=True tells requests not to read the body upfront. iter_content(chunk_size=...) yields chunks as they arrive. Memory usage stays at ~8KB regardless of file size.
Use stream=True for any download where you don't know the file size in advance. The cost is minimal: for small files you won't notice it, and for big files it saves you a memory bomb.
A progress indicator
import requests
from pathlib import Path
def download(url, dest):
    r = requests.get(url, stream=True, timeout=30)
    r.raise_for_status()
    total = int(r.headers.get("Content-Length", 0))
    downloaded = 0
    with open(dest, "wb") as f:
        for chunk in r.iter_content(8192):
            f.write(chunk)
            downloaded += len(chunk)
            if total:
                pct = 100 * downloaded / total
                print(f"\r{pct:.1f}%", end="", flush=True)
    print()
The Content-Length response header tells you the total. Not every server sends it (chunked transfer encoding omits it), and some send a wrong value. Handle both cases.
For real progress bars, the tqdm library is the canonical Python choice:
from tqdm import tqdm
with open(dest, "wb") as f, tqdm(total=total, unit="B", unit_scale=True) as bar:
for chunk in r.iter_content(8192):
f.write(chunk)
bar.update(len(chunk))
Verifying what you got
Three checks worth running:
1. Status and content type
if r.status_code != 200:
    raise RuntimeError(f"Got {r.status_code}")

ct = r.headers.get("Content-Type", "")
if not ct.startswith("image/"):
    raise RuntimeError(f"Expected image, got {ct}")
Servers sometimes return an HTML error page with status 200 (yes, really). Checking Content-Type catches this. Also catches "I expected a PDF but got an HTML 'access denied' page."
2. File size
expected = int(r.headers.get("Content-Length", 0))
actual = Path(dest).stat().st_size
if expected and actual != expected:
    raise RuntimeError(f"Truncated: got {actual}, expected {expected}")
Network interruptions during streaming can produce partial files. A size check before you treat the download as done is cheap insurance.
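One defensive pattern worth sketching (the helper name and the ".part" suffix are illustrative, not a standard): stream to a temporary path and only rename it into place once the size check passes, so a crashed download never leaves a plausible-looking but truncated file behind.

import requests
from pathlib import Path

def download_verified(url, dest):
    tmp = Path(str(dest) + ".part")  # temporary name; promoted only after verification
    r = requests.get(url, stream=True, timeout=30)
    r.raise_for_status()
    with open(tmp, "wb") as f:
        for chunk in r.iter_content(8192):
            f.write(chunk)
    expected = int(r.headers.get("Content-Length", 0))
    actual = tmp.stat().st_size
    if expected and actual != expected:
        tmp.unlink()  # discard the partial file
        raise RuntimeError(f"Truncated: got {actual}, expected {expected}")
    tmp.rename(dest)  # only a verified download gets the real name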
3. Magic bytes (file type by content)
The first few bytes of binary files identify them definitively:
| File type | Magic bytes (hex) | ASCII |
|---|---|---|
| PNG | 89 50 4E 47 0D 0A 1A 0A | \x89PNG\r\n\x1a\n |
| JPEG | FF D8 FF | \xff\xd8\xff |
| PDF | 25 50 44 46 | %PDF |
| ZIP | 50 4B 03 04 | PK\x03\x04 |
| GZIP | 1F 8B | \x1f\x8b |
def is_pdf(path):
    with open(path, "rb") as f:
        return f.read(4) == b"%PDF"
When a server says it's sending a PDF but actually sends an HTML "this PDF is gone" page, magic bytes catch the lie.
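A generalised version of is_pdf, sketched here with a hypothetical sniff() helper, checks the header against all the signatures from the table above and tells you what the file actually is:

# Signature table from above, keyed by the raw leading bytes.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"%PDF": "pdf",
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
}

def sniff(path):
    with open(path, "rb") as f:
        head = f.read(8)  # longest signature here (PNG) is 8 bytes
    for sig, kind in MAGIC.items():
        if head.startswith(sig):
            return kind
    return None  # unknown -- treat as suspect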
PHP equivalents
$ch = curl_init('https://practice.scrapingcentral.com/challenges/static/files/images/sample.png');
$fp = fopen('sample.png', 'wb');
curl_setopt($ch, CURLOPT_FILE, $fp); // write directly to file, no buffering
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);
fclose($fp);
CURLOPT_FILE is the cURL equivalent of stream-and-write: the body goes straight to the file handle, which is far more memory-efficient than capturing it to a string.
With Guzzle:
use GuzzleHttp\Client;
$client = new Client();
$client->get('https://practice.scrapingcentral.com/challenges/static/files/images/sample.png', [
    'sink' => '/path/to/sample.png',
]);
The sink option streams the response body directly to the given path or stream. No memory bloat regardless of file size.
Concurrency for many downloads
Downloading 1000 product images one at a time is slow. Parallelize:
Python (httpx + asyncio):
import asyncio, httpx
async def fetch(client, url, dest):
    async with client.stream("GET", url) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            async for chunk in r.aiter_bytes(8192):
                f.write(chunk)

async def main(jobs):
    async with httpx.AsyncClient(timeout=30) as client:
        sem = asyncio.Semaphore(10)

        async def bounded(url, dest):
            async with sem:
                await fetch(client, url, dest)

        await asyncio.gather(*[bounded(u, d) for u, d in jobs])

asyncio.run(main([(url1, path1), (url2, path2), ...]))
The Semaphore caps concurrency: 10 concurrent downloads is polite for most sites; pushing higher risks IP-level rate limits.
PHP (Guzzle Pool):
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
$client = new Client();
$pool = new Pool($client, (function () use ($urls) {
    foreach ($urls as $i => $url) {
        yield new Request('GET', $url);
    }
})(), [
    'concurrency' => 10,
    'fulfilled' => function ($response, $index) use ($urls) {
        $dest = '/path/' . basename($urls[$index]);
        file_put_contents($dest, $response->getBody());
    },
]);

$pool->promise()->wait();
Polite rate limiting
Mass downloading triggers anti-abuse measures faster than mass HTML scraping because bandwidth is expensive. Two things help:
- Throttle. Cap concurrent connections (Semaphore / Pool concurrency).
- Honour Retry-After. If a server answers 429 with Retry-After: 60, wait 60 seconds. Don't argue. A minimal sketch follows this list.
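Here is that sketch, assuming the numeric form of Retry-After (the header can also carry an HTTP date, which this skips) and a hypothetical get_with_retry_after() helper:

import time
import requests

def get_with_retry_after(url, max_attempts=5):
    for attempt in range(max_attempts):
        r = requests.get(url, stream=True, timeout=30)
        if r.status_code != 429:
            r.raise_for_status()
            return r
        try:
            wait = int(r.headers.get("Retry-After", 60))  # seconds form only
        except ValueError:
            wait = 60  # header was an HTTP date or malformed; fall back
        time.sleep(wait)
    raise RuntimeError(f"Still rate-limited after {max_attempts} attempts")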
Naming and deduplication
Decide upfront how you'll name files:
- URL-based: urlparse(url).path.split("/")[-1]. Fast, but collides if multiple URLs map to the same name.
- Hash-based: hashlib.sha256(content).hexdigest(). Unique, but loses readable names.
- ID-based: extract from your own data (f"{product_id}.png"). Readable AND unique.
The third option is usually the right one for scrapers that maintain a record of source rows.
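For illustration, here are the three strategies side by side (the URL, the bytes, and product_id are placeholders, not values from this lesson):

import hashlib
from urllib.parse import urlparse

url = "https://example.com/media/12345/photo.png"  # placeholder source URL
content = b"..."                                    # the downloaded bytes
product_id = "SKU-12345"                            # from your own scraped row

url_name = urlparse(url).path.split("/")[-1]               # "photo.png" -- may collide
hash_name = hashlib.sha256(content).hexdigest() + ".png"   # unique, but unreadable
id_name = f"{product_id}.png"                              # readable and unique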
Hands-on lab
Download every image from /challenges/static/files/images. Use streaming. Verify each is a valid PNG/JPEG with magic-byte checks. Print total bytes downloaded. Then try /challenges/static/files/pdfs and /challenges/static/files/large (large covers files big enough that you actually need streaming).