Tutorial
Web Scraping with Go (Colly Framework)
Learn web scraping with Go using the Colly framework. Covers installation, selectors, concurrency, and building production-ready scrapers.
Go is a strong fit for web scraping when you need speed and low resource usage. Colly is one of the most widely used Go scraping frameworks, offering a clean API with built-in concurrency, rate limiting, and caching.
Installation
go mod init scraper
go get github.com/gocolly/colly/v2
Basic Scraper
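package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Restrict the collector to one domain so it cannot wander off-site.
	c := colly.NewCollector(
		colly.AllowedDomains("example.com"),
	)

	// Print the text of every <h1> element.
	c.OnHTML("h1", func(e *colly.HTMLElement) {
		fmt.Println("Title:", e.Text)
	})

	// Print the href of every link, exactly as written in the page.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Println("Link:", link)
	})

	// Log each request before it is sent.
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting:", r.URL.String())
	})

	// Visit returns an error (for example, a disallowed domain), so check it.
	if err := c.Visit("https://example.com"); err != nil {
		log.Fatal(err)
	}
}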
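The scraper above prints each href exactly as it appears in the page, so relative links such as /about come out without a host. Colly offers e.Request.AbsoluteURL for this; the sketch below shows the same idea using only the standard library. The helper name resolveLink and the sample URLs are illustrative assumptions, not part of Colly.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink turns an href found on a page into an absolute URL,
// using the page's own URL as the base. (Hypothetical helper name.)
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference applies the standard relative-reference rules.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/docs/index.html"
	for _, href := range []string{"/about", "contact.html", "https://other.example/x"} {
		abs, err := resolveLink(page, href)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

Inside an OnHTML callback, the page URL would come from e.Request.URL rather than a hard-coded string.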
Scraping Product Data
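package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"

	"github.com/gocolly/colly/v2"
)

// Product holds the fields extracted from one product card.
type Product struct {
	Name  string `json:"name"`
	Price string `json:"price"`
	URL   string `json:"url"`
}

func main() {
	var products []Product
	c := colly.NewCollector()

	// Build one Product per card. AbsoluteURL resolves relative hrefs
	// against the page URL so the saved links are usable on their own.
	c.OnHTML(".product-card", func(e *colly.HTMLElement) {
		product := Product{
			Name:  e.ChildText(".product-name"),
			Price: e.ChildText(".product-price"),
			URL:   e.Request.AbsoluteURL(e.ChildAttr("a", "href")),
		}
		products = append(products, product)
	})

	// Handle pagination: follow the "next page" link while one exists.
	c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
		nextURL := e.Attr("href")
		if err := e.Request.Visit(nextURL); err != nil {
			log.Println("pagination:", err)
		}
	})

	if err := c.Visit("https://example.com/products"); err != nil {
		log.Fatal(err)
	}

	// Save results
	data, err := json.MarshalIndent(products, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("products.json", data, 0644); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Scraped %d products\n", len(products))
}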
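The Product struct above stores the price as the raw scraped string. If you need to sort or filter by price, one option is to normalize it after scraping. The sketch below is a minimal example that assumes prices use "." as the decimal separator and "," for thousands; parsePrice is a hypothetical helper, and other locales would need different handling.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePrice converts a scraped price string such as "$1,299.99" into a float64.
// Assumption: "." is the decimal separator and "," only groups thousands.
func parsePrice(raw string) (float64, error) {
	var b strings.Builder
	for _, r := range raw {
		// Keep digits and the decimal point; drop currency symbols, commas, and spaces.
		if (r >= '0' && r <= '9') || r == '.' {
			b.WriteRune(r)
		}
	}
	if b.Len() == 0 {
		return 0, fmt.Errorf("no numeric value in %q", raw)
	}
	return strconv.ParseFloat(b.String(), 64)
}

func main() {
	for _, raw := range []string{"$1,299.99", " 49 USD ", "Call for price"} {
		price, err := parsePrice(raw)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%.2f\n", price)
	}
}
```

A float is fine for sorting and display; for accounting-style arithmetic, storing integer cents is usually safer.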
Concurrent Scraping with Rate Limiting
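// Requires the "log" and "time" imports alongside colly.
// urls is assumed to be a []string of pages to fetch.
c := colly.NewCollector(
	colly.Async(true),
)

// Limit concurrency and add delays
err := c.Limit(&colly.LimitRule{
	DomainGlob:  "*",
	Parallelism: 5,
	Delay:       1 * time.Second,
	RandomDelay: 500 * time.Millisecond,
})
if err != nil {
	log.Fatal(err)
}

// Queue all URLs; in async mode Visit does not wait for the response
for _, u := range urls { // "u" avoids shadowing the net/url package
	if err := c.Visit(u); err != nil {
		log.Println("skipping", u, err)
	}
}
c.Wait() // Wait for all async requests to complete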
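To make the effect of a Parallelism limit concrete, here is a standard-library sketch of bounded concurrency using a buffered channel as a semaphore. It is a conceptual illustration only, not Colly's implementation; runLimited and the placeholder work function are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runLimited calls work once per item with at most `limit` calls running
// at the same time, and returns the highest concurrency it observed.
func runLimited(items []string, limit int, work func(string)) int {
	var (
		wg      sync.WaitGroup
		running int32
		peak    int32
	)
	sem := make(chan struct{}, limit) // counting semaphore with `limit` slots

	for _, item := range items {
		wg.Add(1)
		sem <- struct{}{} // blocks while all slots are taken
		go func(it string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this worker finishes

			n := atomic.AddInt32(&running, 1)
			// Record the peak number of simultaneous workers.
			for {
				p := atomic.LoadInt32(&peak)
				if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
					break
				}
			}
			work(it)
			atomic.AddInt32(&running, -1)
		}(item)
	}
	wg.Wait()
	return int(atomic.LoadInt32(&peak))
}

func main() {
	urls := []string{"a", "b", "c", "d", "e", "f", "g", "h"}
	peak := runLimited(urls, 3, func(u string) {
		time.Sleep(20 * time.Millisecond) // stand-in for an HTTP request
	})
	fmt.Println("peak concurrency:", peak)
}
```

With Colly you do not write this yourself; the LimitRule handles it, and the sketch only shows why the number of in-flight requests never exceeds the configured limit.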
Using with ScraperAPI
Route requests through ScraperAPI for anti-bot bypass.
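// Requires the "fmt", "log", and "net/url" imports alongside colly.
c := colly.NewCollector()

// Option 1: use ScraperAPI as a proxy
if err := c.SetProxy("http://scraperapi:YOUR_SCRAPERAPI_KEY@proxy-server.scraperapi.com:8001"); err != nil {
	log.Fatal(err)
}

// Option 2 (use instead of the proxy, not together with it):
// wrap the target URL in an API call before visiting it.
// Reassigning r.URL inside OnRequest may not change the request Colly
// actually sends, so the API URL is built up front here.
targetURL := "https://example.com/products"
apiURL := fmt.Sprintf(
	"http://api.scraperapi.com/?api_key=%s&url=%s",
	"YOUR_SCRAPERAPI_KEY",
	url.QueryEscape(targetURL),
)
if err := c.Visit(apiURL); err != nil {
	log.Fatal(err)
}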
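The API-endpoint variant above embeds the target URL as a query parameter, so it has to be escaped correctly. A minimal sketch of building that URL with url.Values, which handles the escaping, follows. The helper name scraperAPIURL and the SCRAPERAPI_KEY environment variable are assumptions for illustration; the endpoint and parameter names are taken from the snippet above.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// scraperAPIURL wraps target in a request to the API endpoint used above.
// (Hypothetical helper; only the endpoint and parameter names come from the tutorial.)
func scraperAPIURL(apiKey, target string) string {
	q := url.Values{}
	q.Set("api_key", apiKey)
	q.Set("url", target) // Encode() escapes the nested URL

	u := url.URL{
		Scheme:   "http",
		Host:     "api.scraperapi.com",
		Path:     "/",
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	// Assumption: the key is supplied through an environment variable
	// instead of being hard-coded in the source.
	key := os.Getenv("SCRAPERAPI_KEY")
	if key == "" {
		key = "YOUR_SCRAPERAPI_KEY"
	}
	fmt.Println(scraperAPIURL(key, "https://example.com/products?page=2"))
}
```

The returned string can be passed directly to c.Visit. Keeping the key out of the source also keeps it out of version control.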
Error Handling and Retries
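// Requires the "fmt", "log", and "time" imports alongside colly.
const maxRetries = 3 // cap retries so a failing URL cannot loop forever

c.OnError(func(r *colly.Response, err error) {
	fmt.Printf("Error on %s: %v (status: %d)\n",
		r.Request.URL, err, r.StatusCode)

	// Retry only on rate limiting (429) or server errors (5xx).
	// The attempt count lives in the request context, which Retry reuses.
	retries, _ := r.Request.Ctx.GetAny("retries").(int)
	if (r.StatusCode == 429 || r.StatusCode >= 500) && retries < maxRetries {
		r.Request.Ctx.Put("retries", retries+1)
		time.Sleep(5 * time.Second)
		if retryErr := r.Request.Retry(); retryErr != nil {
			log.Println("retry failed:", retryErr)
		}
	}
})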
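The handler above waits a fixed five seconds before each retry. A common alternative, not used in the tutorial's code, is exponential backoff, where each attempt waits twice as long as the previous one up to a cap. The sketch below shows the calculation only; backoffDelay, baseDelay, and maxDelay are illustrative names and values.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 1 * time.Second  // wait before the first retry (assumed value)
	maxDelay  = 30 * time.Second // upper bound on any single wait (assumed value)
)

// backoffDelay returns the wait before retry number `attempt` (0-based):
// baseDelay, 2*baseDelay, 4*baseDelay, ... capped at maxDelay.
func backoffDelay(attempt int) time.Duration {
	delay := baseDelay
	for i := 0; i < attempt; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay // stop early so the value cannot overflow
		}
	}
	return delay
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt))
	}
}
```

In an OnError callback, the fixed sleep could be replaced with time.Sleep(backoffDelay(retries)), where retries is the attempt count tracked for that request.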
Why Choose Go for Scraping
- Performance: Go can process responses roughly 5-10x faster than Python for HTML parsing, depending on the workload
- Concurrency: Goroutines handle thousands of concurrent requests efficiently
- Low memory: Go typically uses significantly less memory than Python or Node.js
- Single binary: Deploy your scraper as a single compiled binary with no runtime dependencies
Go with Colly is a strong choice when you need high-throughput scraping with minimal resource usage. For projects where development speed matters more than runtime performance, Python remains the easier option.