Web Scraping with Rust - Fast and Efficient

Learn web scraping with Rust using reqwest and scraper crates. Covers async scraping, HTML parsing, and building high-performance scrapers.

Rust offers the best performance for CPU-intensive scraping tasks. With zero-cost abstractions and no garbage collector, Rust scrapers process data faster and use less memory than any alternative.

Setup

Add dependencies to Cargo.toml:

[dependencies]
reqwest = { version = "0.12", features = ["json", "cookies"] }
scraper = "0.21"
tokio = { version = "1", features = ["full"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"

Basic Scraper

use reqwest;
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let response = reqwest::get("https://example.com")
        .await?
        .text()
        .await?;

    let document = Html::parse_document(&response);
    let title_selector = Selector::parse("h1").unwrap();

    for element in document.select(&title_selector) {
        println!("Title: {}", element.text().collect::<String>());
    }

    Ok(())
}

Scraping with Structured Output

use serde::Serialize;
use scraper::{Html, Selector};

#[derive(Serialize, Debug)]
struct Product {
    name: String,
    price: String,
    url: String,
}

async fn scrape_products(html: &str) -> Vec<Product> {
    let document = Html::parse_document(html);
    let card_selector = Selector::parse(".product-card").unwrap();
    let name_selector = Selector::parse(".product-name").unwrap();
    let price_selector = Selector::parse(".product-price").unwrap();
    let link_selector = Selector::parse("a").unwrap();

    let mut products = Vec::new();

    for card in document.select(&card_selector) {
        let name = card.select(&name_selector)
            .next()
            .map(|e| e.text().collect::<String>())
            .unwrap_or_default();

        let price = card.select(&price_selector)
            .next()
            .map(|e| e.text().collect::<String>())
            .unwrap_or_default();

        let url = card.select(&link_selector)
            .next()
            .and_then(|e| e.value().attr("href"))
            .unwrap_or_default()
            .to_string();

        products.push(Product { name, price, url });
    }

    products
}

Concurrent Scraping

use futures::stream::{self, StreamExt};
use reqwest::Client;

async fn scrape_urls(urls: Vec<String>, concurrency: usize) -> Vec<String> {
    let client = Client::new();

    let results: Vec<String> = stream::iter(urls)
        .map(|url| {
            let client = client.clone();
            async move {
                match client.get(&url).send().await {
                    Ok(resp) => resp.text().await.unwrap_or_default(),
                    Err(e) => {
                        eprintln!("Error fetching {}: {}", url, e);
                        String::new()
                    }
                }
            }
        })
        .buffer_unordered(concurrency)
        .collect()
        .await;

    results
}

#[tokio::main]
async fn main() {
    let urls: Vec<String> = (1..=100)
        .map(|i| format!("https://example.com/page/{}", i))
        .collect();

    let pages = scrape_urls(urls, 10).await;
    println!("Scraped {} pages", pages.len());
}

Using with ScraperAPI

use reqwest::Client;

async fn scrape_with_api(url: &str) -> Result<String, reqwest::Error> {
    let client = Client::new();
    let api_key = "YOUR_SCRAPERAPI_KEY";

    let response = client.get("http://api.scraperapi.com")
        .query(&[
            ("api_key", api_key),
            ("url", url)
        ])
        .send()
        .await?
        .text()
        .await?;

    Ok(response)
}

Performance Comparison

Language	10K pages parse time	Memory usage
Rust	~2 seconds	~50 MB
Go	~5 seconds	~100 MB
Python	~30 seconds	~300 MB
Node.js	~15 seconds	~200 MB

HTML parsing benchmarks on a standard VPS. Network time excluded.

When to Choose Rust for Scraping

You are processing millions of pages and need maximum throughput
Memory is constrained (edge computing, small VPS)
You are building a long-running scraping service
The scraping logic is stable and will not change frequently

For rapid prototyping or sites requiring browser rendering, Python with ScraperAPI is more practical. Use Rust when raw performance justifies the development time.