Web Scraping with Go (Colly Framework)

Learn web scraping with Go using the Colly framework. Covers installation, selectors, concurrency, and building production-ready scrapers.

Go is excellent for web scraping when you need speed and low resource usage. Colly is the most popular Go scraping framework, offering a clean API with built-in concurrency, rate limiting, and caching.

Installation

go mod init scraper
go get github.com/gocolly/colly/v2

Basic Scraper

package main

import (
    "fmt"
    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("example.com"),
    )

    c.OnHTML("h1", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        fmt.Println("Link:", link)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting:", r.URL.String())
    })

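    // Visit returns an error if the request cannot be made (bad URL, blocked domain, network failure)
    if err := c.Visit("https://example.com"); err != nil {
        fmt.Println("Visit failed:", err)
    }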
}
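The `a[href]` callback above prints `href` values exactly as they appear in the page, so relative links such as `/about` come out without a host. Inside a Colly callback, `e.Request.AbsoluteURL(link)` resolves them. The sketch below shows the same idea using only the standard library; the function name `resolveLink` and the sample URLs are illustrative assumptions, not part of Colly.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink turns an href found on a page into an absolute URL,
// using the URL of the page itself as the base.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/blog/index.html" // hypothetical page URL
	for _, href := range []string{"/about", "post-1.html", "https://other.example/x"} {
		abs, err := resolveLink(page, href)
		if err != nil {
			fmt.Println("skipping link:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

Resolving links before printing or queuing them avoids treating `/about` on two different sites as the same page.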

Scraping Product Data

package main

import (
    "encoding/json"
    "fmt"
    "os"

    "github.com/gocolly/colly/v2"
)

type Product struct {
    Name  string `json:"name"`
    Price string `json:"price"`
    URL   string `json:"url"`
}

func main() {
    var products []Product

    c := colly.NewCollector()

    c.OnHTML(".product-card", func(e *colly.HTMLElement) {
        product := Product{
            Name:  e.ChildText(".product-name"),
            Price: e.ChildText(".product-price"),
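            // Resolve relative hrefs against the page URL so saved links are usable on their own
            URL:   e.Request.AbsoluteURL(e.ChildAttr("a", "href")),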
        }
        products = append(products, product)
    })

    // Handle pagination
    c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
        nextURL := e.Attr("href")
        e.Request.Visit(nextURL)
    })

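    if err := c.Visit("https://example.com/products"); err != nil {
        fmt.Fprintln(os.Stderr, "visit failed:", err)
        os.Exit(1)
    }

    // Save results
    data, err := json.MarshalIndent(products, "", "  ")
    if err != nil {
        fmt.Fprintln(os.Stderr, "encoding failed:", err)
        os.Exit(1)
    }
    if err := os.WriteFile("products.json", data, 0644); err != nil {
        fmt.Fprintln(os.Stderr, "write failed:", err)
        os.Exit(1)
    }
    fmt.Printf("Scraped %d products\n", len(products))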
}
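The scraper above stores `Price` as the raw text from the page. If you need to sort or compare prices, a small parsing step helps. The sketch below is one possible approach and assumes prices look like `$1,299.99` (a leading currency symbol, `,` as the thousands separator, `.` as the decimal point); `parsePrice` is a hypothetical helper, and real sites may need different rules.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePrice converts a scraped price string such as "$1,299.99" into a float64.
// Assumed format: optional leading $, € or £, "," for thousands, "." for decimals.
func parsePrice(raw string) (float64, error) {
	cleaned := strings.TrimSpace(raw)
	cleaned = strings.TrimLeft(cleaned, "$€£")       // drop a leading currency symbol
	cleaned = strings.ReplaceAll(cleaned, ",", "")   // drop thousands separators
	return strconv.ParseFloat(strings.TrimSpace(cleaned), 64)
}

func main() {
	for _, raw := range []string{"$1,299.99", " €45.00 ", "N/A"} {
		price, err := parsePrice(raw)
		if err != nil {
			fmt.Printf("%q: could not parse (%v)\n", raw, err)
			continue
		}
		fmt.Printf("%q -> %.2f\n", raw, price)
	}
}
```

Keeping the original string in the struct and adding the parsed value as a second field lets you re-parse later if the assumed format turns out to be wrong.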

Concurrent Scraping with Rate Limiting

c := colly.NewCollector(
    colly.Async(true),
)

// Limit concurrency and add delays
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 5,
    Delay:       1 * time.Second,
    RandomDelay: 500 * time.Millisecond,
})

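// Queue all URLs (urls is a []string defined elsewhere; this snippet also needs the "time" import).
// In async mode Visit returns immediately and the requests run in the background.
for _, u := range urls {
    c.Visit(u)
}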

c.Wait() // Wait for all async requests to complete

Using with ScraperAPI

Route requests through ScraperAPI for anti-bot bypass.

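c := colly.NewCollector()

// Option 1: use ScraperAPI as a proxy
if err := c.SetProxy("http://scraperapi:YOUR_SCRAPERAPI_KEY@proxy-server.scraperapi.com:8001"); err != nil {
    fmt.Println("proxy setup failed:", err)
}

// Option 2 (use instead of the proxy, not together with it): wrap the target URL
// in an API call before visiting it. Colly builds the outgoing HTTP request before
// OnRequest runs, so reassigning r.URL in that callback does not redirect the
// request; build the API URL up front instead.
targetURL := "https://example.com/products"
apiURL := fmt.Sprintf(
    "http://api.scraperapi.com/?api_key=%s&url=%s",
    "YOUR_SCRAPERAPI_KEY",
    url.QueryEscape(targetURL),
)
c.Visit(apiURL)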
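When using the API endpoint form, the target URL has to be query-escaped so that its own `?` and `&` characters are not read as parameters of the API call. A small helper built on `net/url` keeps that in one place. This is a sketch: `scraperAPIURL` is a hypothetical name, and it uses only the endpoint and the `api_key` and `url` parameters shown above.

```go
package main

import (
	"fmt"
	"net/url"
)

// scraperAPIURL wraps target in an API endpoint URL, escaping it as a query value.
// The endpoint and parameter names are taken from the snippet above.
func scraperAPIURL(apiKey, target string) string {
	q := url.Values{}
	q.Set("api_key", apiKey)
	q.Set("url", target)
	return "http://api.scraperapi.com/?" + q.Encode() // Encode sorts keys and escapes values
}

func main() {
	// The placeholder key and target URL are illustrative.
	fmt.Println(scraperAPIURL("YOUR_SCRAPERAPI_KEY", "https://example.com/products?page=2"))
}
```

The returned string can be passed straight to `c.Visit`. Keep in mind that relative links on the fetched page will then resolve against the API host rather than the original site, so resolve them against the target URL yourself.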

Error Handling and Retries

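const maxRetries = 3 // stop retrying after this many attempts per request

c.OnError(func(r *colly.Response, err error) {
    fmt.Printf("Error on %s: %v (status: %d)\n",
        r.Request.URL, err, r.StatusCode)

    // Retry only on rate limiting (429) or server errors (5xx)
    if r.StatusCode == 429 || r.StatusCode >= 500 {
        // The request context survives Retry, so it can carry the attempt count
        retries, _ := r.Ctx.GetAny("retries").(int)
        if retries >= maxRetries {
            return
        }
        r.Ctx.Put("retries", retries+1)
        time.Sleep(5 * time.Second) // note: this blocks the goroutine handling the response
        r.Request.Retry()
    }
})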
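A fixed five-second pause works, but a common alternative is exponential backoff, where each retry waits twice as long as the previous one up to a cap. The sketch below shows only the delay calculation; `backoffDelay` and the one-second base and thirty-second cap are assumptions chosen for illustration, not Colly features.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns how long to wait before retry number `attempt` (0-based).
// The delay starts at base, doubles with each attempt, and never exceeds max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	if base >= max {
		return max
	}
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max // cap reached; stop doubling to avoid overflow
		}
	}
	return d
}

func main() {
	for attempt := 0; attempt <= 5; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, backoffDelay(attempt, time.Second, 30*time.Second))
	}
}
```

In an error handler, the value returned for the current attempt count would replace the fixed argument to `time.Sleep`. Adding a small random jitter on top helps avoid many workers retrying at the same instant.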

Why Choose Go for Scraping

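Performance: Go is compiled, and HTML parsing is often cited as 5-10x faster than in Python, though the actual gain depends on the workload and the libraries compared
Concurrency: Goroutines are lightweight, so a single process can handle thousands of concurrent requests
Low memory: A Go scraper typically uses significantly less memory than an equivalent Python or Node.js one
Single binary: Deploy your scraper as a single compiled binary with no runtime dependencies to install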

Go with Colly is a strong choice when you need high-throughput scraping with minimal resource usage. For projects where development speed matters more than runtime performance, Python is usually quicker to write and iterate on.