Goroutines & Channels: Concurrent Scraping the Go Way, Go for Scrapers

The reason Go is good for high-throughput crawling. Goroutines, channels, a worker-pool pattern, and a tiny concurrent crawler you can run.

This is the lesson Go was made for. Once you have goroutines and channels in your head, the "Go is good for crawlers" claim stops being marketing and starts being obvious.

What's a goroutine

A goroutine is a function that runs concurrently with the rest of your program. Goroutines are cheap: each one starts with a stack of a few KB that grows on demand, so a single process can run hundreds of thousands of them without breaking a sweat. The Go runtime schedules them onto OS threads for you.

package main

import (
    "fmt"
    "time"
)

func say(s string) {
    for i := 0; i < 3; i++ {
        fmt.Println(s, i)
        time.Sleep(100 * time.Millisecond)
    }
}

func main() {
    go say("world")    // runs concurrently
    say("hello")       // runs in main goroutine
}

go funcCall() launches the call in a new goroutine and returns immediately. The output interleaves "hello" and "world" because both goroutines run concurrently.

Compare to Python:

import threading, time
def say(s):
    for i in range(3):
        print(s, i)
        time.sleep(0.1)

threading.Thread(target=say, args=("world",)).start()
say("hello")

Same shape, but a Python thread is backed by an OS thread with an MB-class stack, while a goroutine starts at a few KB. That cost difference is what makes 100k concurrent fetches in one Go process feasible.
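To get a feel for that cost difference, here is a minimal sketch that starts 100,000 goroutines at once. The spawn helper is a hypothetical name, the sleep stands in for waiting on the network, and sync.WaitGroup is explained in the next section. Timing and memory use will vary by machine.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// spawn starts n goroutines that each pause briefly (so they overlap in
// time) and returns how many ran to completion.
func spawn(n int) int64 {
	var wg sync.WaitGroup
	var finished int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(10 * time.Millisecond) // stand-in for waiting on the network
			atomic.AddInt64(&finished, 1)
		}()
	}
	wg.Wait()
	return finished
}

func main() {
	start := time.Now()
	n := spawn(100_000)
	fmt.Printf("%d goroutines finished in %v\n", n, time.Since(start))
}
```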

Waiting for goroutines to finish

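The example above only works because the main goroutine happens to run about as long as the other one. When main returns, the program exits immediately, whether or not other goroutines have finished. In real code you need explicit waiting. The standard tool is sync.WaitGroup: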

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup

    for _, url := range []string{"a", "b", "c"} {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            fmt.Println("fetching", u)
        }(url)
    }

    wg.Wait()
    fmt.Println("all done")
}

Three rules:

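Call wg.Add(1) before launching the goroutine, not inside it. If Add runs inside the goroutine, wg.Wait() can return before the counter has been incremented (race condition).
Put defer wg.Done() on the first line inside the goroutine so it runs on every exit path, including early returns.
Pass the loop variable as an argument (go func(u string){}(url)) rather than capturing it in the closure. This is the most famous Go gotcha: before Go 1.22 the loop variable was shared across iterations, so every goroutine could end up seeing the same value. Go 1.22 and later create a fresh variable per iteration, but passing the argument explicitly is harmless and keeps the code correct on older toolchains.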

wg.Wait() blocks until the counter reaches zero. After it returns, all goroutines have finished.
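The three rules can be sketched together as follows. This is a minimal illustration, not a real fetcher: fetchAll is a hypothetical helper and the string manipulation stands in for an HTTP call. Each goroutine writes only to its own slice index, so no lock is needed.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// fetchAll is a hypothetical helper: it "fetches" every URL concurrently
// and returns one result per input, in input order.
func fetchAll(urls []string) []string {
	results := make([]string, len(urls))
	var wg sync.WaitGroup

	for i, url := range urls {
		wg.Add(1) // rule 1: Add before the go statement
		go func(i int, u string) { // rule 3: loop values passed as arguments
			defer wg.Done() // rule 2: Done deferred on the first line
			// Stand-in for a real HTTP fetch.
			results[i] = "fetched " + strings.ToUpper(u)
		}(i, url)
	}

	wg.Wait() // every write to results has completed once Wait returns
	return results
}

func main() {
	for _, r := range fetchAll([]string{"a", "b", "c"}) {
		fmt.Println(r)
	}
}
```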

Channels, sending data between goroutines

A channel is a typed pipe. One goroutine writes to it; another reads.

results := make(chan string)         // unbuffered channel of strings

go func() {
    results <- "first"                // send
    results <- "second"
    close(results)
}()

for r := range results {              // receive until closed
    fmt.Println(r)
}

Channel rules to internalise:

Send blocks until another goroutine receives (for unbuffered channels).
Receive blocks until another goroutine sends, or until the channel is closed.
Closing a channel signals "no more sends." Receivers using range stop automatically.
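Send on a closed channel panics. Receive on a closed channel returns any values still buffered, then the zero value without blocking; use v, ok := <-ch to tell a real value (ok is true) from a closed channel (ok is false).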
Only the sender should close the channel. Closing on the receive side is a bug.

The blocking is the feature, not the bug. It gives you natural backpressure: a fast producer can't outrun a slow consumer.
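A small sketch of these rules in action, assuming a hypothetical drain helper: the producer goroutine is the one that closes the channel, the range loop stops on close, and one extra receive afterwards shows the zero value with ok set to false.

```go
package main

import "fmt"

// drain sends the given values on an unbuffered channel from a producer
// goroutine, closes it, and returns what the consumer saw, plus the result
// of one extra receive after the close.
func drain(values []string) (received []string, afterClose string, ok bool) {
	ch := make(chan string) // unbuffered: each send waits for a receive

	go func() {
		defer close(ch) // the sender closes, never the receiver
		for _, v := range values {
			ch <- v
		}
	}()

	for v := range ch { // loop ends once ch is closed and empty
		received = append(received, v)
	}

	afterClose, ok = <-ch // closed channel: zero value, ok == false
	return received, afterClose, ok
}

func main() {
	received, v, ok := drain([]string{"first", "second"})
	fmt.Println(received, v == "", ok)
}
```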

Buffered vs unbuffered channels

unbuffered := make(chan int)      // unbuffered: send blocks until receive
buffered := make(chan int, 100)   // buffered: send blocks only when buffer full

Buffered channels are useful when you know a rough capacity (e.g. the number of URLs in a frontier) and don't want a fast producer blocked on every send. They're not "better"; they decouple producer and consumer slightly, at the cost of memory.

For scraping pipelines, a buffered channel with capacity equal to the number of workers is a common shape.
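To see where a send would block, the sketch below tries to push values into channels of different capacities with nobody receiving. It uses select with a default case (select is covered later in this lesson) so a full buffer is counted rather than blocking forever. fillWithoutBlocking is a hypothetical name used only for this illustration.

```go
package main

import "fmt"

// fillWithoutBlocking tries to send n values on a channel with the given
// capacity and no receiver. It returns how many sends were accepted.
func fillWithoutBlocking(capacity, n int) int {
	ch := make(chan int, capacity)
	accepted := 0
	for i := 0; i < n; i++ {
		select {
		case ch <- i:
			accepted++
		default:
			// Buffer full and nobody is receiving:
			// a plain send would block at this point.
		}
	}
	return accepted
}

func main() {
	fmt.Println(fillWithoutBlocking(0, 5))   // unbuffered, no receiver: nothing accepted
	fmt.Println(fillWithoutBlocking(3, 5))   // accepts until the buffer holds 3
	fmt.Println(fillWithoutBlocking(100, 5)) // all 5 fit in the buffer
}
```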

A worker pool: the scraping idiom

This is the pattern you'll write most often. Spawn N workers that all read from a chan string of URLs and push results to a chan Result.

package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "time"
)

type Result struct {
    URL      string
    Status   int
    BodySize int
}

func worker(id int, urls <-chan string, results chan<- Result, wg *sync.WaitGroup) {
    defer wg.Done()
    client := &http.Client{Timeout: 10 * time.Second}

    for url := range urls {
        resp, err := client.Get(url)
        if err != nil {
            results <- Result{URL: url, Status: -1}
            continue
        }
        body, err := io.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {               // body was cut off mid-read
            results <- Result{URL: url, Status: -1}
            continue
        }
        results <- Result{URL: url, Status: resp.StatusCode, BodySize: len(body)}
    }
    fmt.Printf("worker %d done\n", id)
}

func main() {
    urls := []string{
        "https://practice.scrapingcentral.com/",
        "https://example.com/",
        "https://example.org/",
        "https://example.net/",
    }

    jobs := make(chan string, len(urls))
    results := make(chan Result, len(urls))

    var wg sync.WaitGroup
    for i := 1; i <= 3; i++ {        // 3 workers
        wg.Add(1)
        go worker(i, jobs, results, &wg)
    }

    for _, u := range urls {
        jobs <- u
    }
    close(jobs)                       // signals workers: no more URLs

    go func() {                       // close results once all workers done
        wg.Wait()
        close(results)
    }()

    for r := range results {
        fmt.Printf("%-40s status=%d size=%d\n", r.URL, r.Status, r.BodySize)
    }
}

Read this twice. It contains every concurrent-Go idiom you'll see in real scraper code:

Typed channel directions in function signatures: <-chan string (read-only), chan<- Result (send-only). The compiler enforces it; the documentation is in the signature.
close(jobs) when the producer is done, so workers' for range loops exit.
go func(){ wg.Wait(); close(results) }() so the main loop's for range results exits cleanly.
&wg passed by pointer so all workers see the same WaitGroup.

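That last point is one of the most common bugs. A WaitGroup passed by value is copied, so each worker would call Done on its own copy and wg.Wait() in main would never return. Pass WaitGroups and mutexes by pointer; go vet flags accidental copies. Large structs are usually passed by pointer too, though there the reason is copy cost rather than correctness.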

`select`: waiting on multiple channels

for {
    select {
    case url := <-urls:
        fetch(url)
    case <-quit:
        return
    case <-time.After(5 * time.Second):
        fmt.Println("idle timeout")
        return
    }
}

select blocks until one of its cases can proceed, then runs that case. Useful when a goroutine needs to handle work, a stop signal, and a timeout simultaneously. This is the pattern for graceful shutdown in real scraper code.
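Here is a runnable version of that loop, as a sketch under a few assumptions: consume is a hypothetical function, appending to a slice stands in for fetch, and the idle timeout is passed in so it can be kept short. Note that time.After is called on every iteration, so the idle timer restarts each time a URL is handled.

```go
package main

import (
	"fmt"
	"time"
)

// consume handles URLs until it is told to quit or sees no work for the
// idle duration. It returns the URLs handled and the reason it stopped.
func consume(urls <-chan string, quit <-chan struct{}, idle time.Duration) ([]string, string) {
	var handled []string
	for {
		select {
		case u := <-urls:
			handled = append(handled, u) // stand-in for fetch(u)
		case <-quit:
			return handled, "quit"
		case <-time.After(idle): // fresh timer on every iteration
			return handled, "idle timeout"
		}
	}
}

func main() {
	urls := make(chan string, 2)
	quit := make(chan struct{})
	urls <- "https://example.com/"
	urls <- "https://example.org/"

	// Both URLs are handled, then nothing arrives and the idle timer fires.
	handled, reason := consume(urls, quit, 50*time.Millisecond)
	fmt.Println(len(handled), reason)
}
```

Closing the quit channel (rather than sending on it) is a common choice because a close is seen by every goroutine that is selecting on it.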

`context.Context`: the cancellation idiom

Every serious Go HTTP call should accept a context.Context. It carries deadlines and cancellation signals.

import "context"

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

req, _ := http.NewRequestWithContext(ctx, "GET", url, nil)
resp, err := http.DefaultClient.Do(req)

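If the 10 seconds elapse before the response arrives, Do returns an error wrapping context.DeadlineExceeded and the in-flight request is aborted. The deadline also covers reading the response body. This is how a Go scraper cancels in-flight requests on shutdown. Use WithCancel for manual cancellation, and WithTimeout or WithDeadline for time-based cancellation.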
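The timeout behaviour can be tried without touching a real site. The sketch below is an illustration under stated assumptions: fetchStatus and timesOut are hypothetical helpers, and the slow server is a local one from net/http/httptest, not a real target.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchStatus performs a GET that is abandoned once ctx is done.
func fetchStatus(ctx context.Context, url string) (int, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// timesOut reports whether a request with the given timeout fails with
// context.DeadlineExceeded against a local server that delays each response.
func timesOut(delay, timeout time.Duration) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(delay): // simulate a slow page
		case <-r.Context().Done(): // client gave up, stop early
		}
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	_, err := fetchStatus(ctx, srv.URL)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(timesOut(500*time.Millisecond, 50*time.Millisecond)) // slow server, short deadline
	fmt.Println(timesOut(0, 2*time.Second))                          // fast server, generous deadline
}
```

In a scraper, the same ctx would typically be created once at startup and passed to every worker, so that cancelling it stops all in-flight requests together.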

Race conditions, the one bug you must avoid

Two goroutines writing to the same map (or slice) without synchronisation will eventually crash with a fatal "concurrent map writes" error, or silently corrupt data.

Two ways to fix:

Don't share. Each goroutine has its own map; merge at the end through a channel.
Use a mutex. sync.Mutex or sync.RWMutex around the shared state.

Option 1 is the Go idiom: "Do not communicate by sharing memory; instead, share memory by communicating." When in doubt, push the data through a channel; don't reach into a shared struct.

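Run with go run -race . (or go test -race ./... for your tests) to enable the race detector. It reports data races on the code paths that actually execute during the run, which catches most concurrent-write bugs in development.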
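The two fixes can be sketched side by side. Both functions below count how often each host appears in a list; the function names and the host strings are hypothetical, and a real scraper would do its fetching where the comments indicate. Either version should pass the race detector.

```go
package main

import (
	"fmt"
	"sync"
)

// countWithMutex has every goroutine update one shared map under a mutex.
func countWithMutex(hosts []string) map[string]int {
	counts := make(map[string]int)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, h := range hosts {
		wg.Add(1)
		go func(h string) {
			defer wg.Done()
			// (fetching would happen here, outside the lock)
			mu.Lock()
			counts[h]++
			mu.Unlock()
		}(h)
	}
	wg.Wait()
	return counts
}

// countWithChannel lets only the calling goroutine touch the map;
// workers hand their findings over through a channel.
func countWithChannel(hosts []string) map[string]int {
	found := make(chan string)
	var wg sync.WaitGroup
	for _, h := range hosts {
		wg.Add(1)
		go func(h string) {
			defer wg.Done()
			found <- h // (fetching would happen before this send)
		}(h)
	}
	go func() { // close found once every worker has sent
		wg.Wait()
		close(found)
	}()

	counts := make(map[string]int)
	for h := range found {
		counts[h]++
	}
	return counts
}

func main() {
	hosts := []string{"example.com", "example.org", "example.com"}
	fmt.Println(countWithMutex(hosts))
	fmt.Println(countWithChannel(hosts))
}
```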

When goroutines are too many

A goroutine per URL works for a few thousand URLs. For 100M, you do need a worker pool to cap concurrency. Otherwise:

File descriptors run out (OS limit, usually 1024 to 10k).
The remote server bans you for hammering.
Memory used by in-flight responses grows unbounded.

The worker-pool pattern above is the answer. Pick a worker count (50, 100, 500) tuned to the target site's tolerance and your machine's resources. The frontier channel acts as backpressure.
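To check that a pool really caps concurrency, the sketch below counts how many jobs are in flight at any moment. peakConcurrency is a hypothetical helper written for this illustration, and the one-millisecond sleep stands in for a fetch.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakConcurrency pushes n jobs through a pool of the given size and
// returns the highest number of jobs that were in flight at once.
func peakConcurrency(workers, n int) int64 {
	jobs := make(chan int) // unbuffered: the producer waits for a free worker
	var inFlight, peak int64
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				cur := atomic.AddInt64(&inFlight, 1)
				// Record a new peak if this is the most concurrent work seen.
				for {
					old := atomic.LoadInt64(&peak)
					if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
						break
					}
				}
				time.Sleep(time.Millisecond) // stand-in for a fetch
				atomic.AddInt64(&inFlight, -1)
			}
		}()
	}

	for i := 0; i < n; i++ {
		jobs <- i // blocks while all workers are busy: backpressure
	}
	close(jobs)
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	// However many jobs are queued, at most 5 run at the same time.
	fmt.Println(peakConcurrency(5, 200) <= 5)
}
```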

Where to practice

Take the worker-pool snippet above. Add a consumer that reads from the chan Result and writes each result to a CSV file, using csv.NewWriter from encoding/csv.
Modify it to fetch 10,000 URLs (use a test target or practice.scrapingcentral.com). Tune the worker count and watch throughput.
Read Go by Example: Worker Pools. It's the canonical pattern, well-explained.
Read Effective Go: Concurrency. It's a 15-minute read and the most efficient way to internalise the idioms.

Next: GO4 covers what a real HTTP request looks like in Go, with net/http.
