Goroutines & Channels: Concurrent Scraping the Go Way
The reason Go is good for high-throughput crawling. Goroutines, channels, a worker-pool pattern, and a tiny concurrent crawler you can run.
What you’ll learn
- Spawn a goroutine and wait for it cleanly with `sync.WaitGroup`.
- Send and receive data across goroutines with channels.
- Build a worker-pool: N goroutines reading from one channel of URLs.
- Pick between unbuffered, buffered, and closed channels for backpressure.
This is the lesson Go was made for. Once you have goroutines and channels in your head, the "Go is good for crawlers" claim stops being marketing and starts being obvious.
What's a goroutine
A goroutine is a function that runs concurrently with the rest of your program. Goroutines are cheap: each starts with a few KB of stack that grows on demand, so a single process can run hundreds of thousands of them without breaking a sweat. The Go runtime schedules them onto OS threads for you.
package main

import (
	"fmt"
	"time"
)

func say(s string) {
	for i := 0; i < 3; i++ {
		fmt.Println(s, i)
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	go say("world") // runs concurrently
	say("hello")    // runs in main goroutine
}
go funcCall() launches the call in a new goroutine and returns immediately. The output interleaves "hello" and "world" because both goroutines run concurrently.
Compare to Python:
import threading, time

def say(s):
    for i in range(3):
        print(s, i)
        time.sleep(0.1)

threading.Thread(target=say, args=("world",)).start()
say("hello")
Same shape, different cost. A Python thread is an OS thread that typically reserves a stack in the megabyte range, while a goroutine starts with a few kilobytes. That cost difference is what makes 100k concurrent fetches in one Go process feasible.
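To get a feel for that cost difference, here is a minimal sketch that starts many goroutines and keeps all of them alive at the same moment. The helper name `spawnParked` and the count of 10,000 are illustrative choices, not part of the lesson, and the code leans on `sync.WaitGroup`, which the next section explains.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawnParked (hypothetical helper) starts n goroutines that all wait on
// the same channel, so every one of them exists before any may finish.
// It returns how many goroutines ran to completion.
func spawnParked(n int) int64 {
	var wg sync.WaitGroup
	var count int64
	release := make(chan struct{})

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release // blocks until release is closed
			atomic.AddInt64(&count, 1)
		}()
	}

	close(release) // lets every waiting goroutine proceed
	wg.Wait()
	return atomic.LoadInt64(&count)
}

func main() {
	fmt.Println(spawnParked(10_000)) // 10000
}
```

Raising `n` on your own machine is a quick way to see how far a single process can go before memory becomes the limit.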
Waiting for goroutines to finish
The example above relies on the main goroutine being slow enough to let the other one finish. In real code you need explicit waiting. The standard tool is sync.WaitGroup:
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for _, url := range []string{"a", "b", "c"} {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			fmt.Println("fetching", u)
		}(url)
	}
	wg.Wait()
	fmt.Println("all done")
}
Three rules:
- Call `wg.Add(1)` before launching the goroutine, not inside it. Otherwise `wg.Wait()` can run before the counter is incremented (race condition).
- Put `defer wg.Done()` as the first line inside the goroutine so it runs on every exit path, even if the goroutine panics.
- Pass the loop variable as an argument (`go func(u string){}(url)`), not capture it by closure. This is the most famous Go gotcha; Go versions before 1.22 reuse the variable across iterations.
wg.Wait() blocks until the counter reaches zero. After it returns, all goroutines have finished.
Channels: sending data between goroutines
A channel is a typed pipe. One goroutine writes to it; another reads.
results := make(chan string) // unbuffered channel of strings
go func() {
	results <- "first" // send
	results <- "second"
	close(results)
}()
for r := range results { // receive until closed
	fmt.Println(r)
}
Channel rules to internalise:
- Send blocks until another goroutine receives (for unbuffered channels).
- Receive blocks until another goroutine sends.
- Closing a channel signals "no more sends." Receivers using `range` stop automatically.
- Send on a closed channel panics. Receive on a closed channel never blocks: once any buffered values are drained, it returns the zero value.
- Only the sender should close the channel. Closing on the receive side is a bug.
The blocking is the feature, not the bug. It gives you natural backpressure: a fast producer can't outrun a slow consumer.
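The close-related rules above can be checked with a small sketch. The helper name `drain` is made up for illustration; the two-value receive form `v, ok := <-ch` is standard Go and is how a receiver tells a real zero value apart from "channel closed".

```go
package main

import "fmt"

// drain (hypothetical helper) receives from ch until it is closed and
// returns every value it saw, in order.
func drain(ch <-chan string) []string {
	var got []string
	for v := range ch { // the loop ends when ch is closed and empty
		got = append(got, v)
	}
	return got
}

func main() {
	ch := make(chan string)
	go func() {
		defer close(ch) // the sender is the one that closes
		ch <- "first"
		ch <- "second"
	}()
	fmt.Println(drain(ch)) // [first second]

	// A receive on a closed, empty channel returns at once:
	// v is the zero value and ok is false.
	v, ok := <-ch
	fmt.Printf("%q %v\n", v, ok) // "" false
}
```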
Buffered vs unbuffered channels
unbuffered := make(chan int)    // send blocks until a receiver is ready
buffered := make(chan int, 100) // send blocks only when the buffer is full
Buffered channels are useful when you know a rough capacity (e.g. the number of URLs in a frontier) and don't want a fast producer blocked on every send. They're not "better"; they decouple producer and consumer slightly, at the cost of memory.
For scraping pipelines, a buffered channel with capacity equal to the number of workers is a common shape.
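As a rough illustration of where each kind of channel would block, the sketch below probes a send without waiting. The helper `trySend` is a made-up name; it relies on `select` with a `default` branch, which runs when no other case is ready.

```go
package main

import "fmt"

// trySend (hypothetical helper) attempts a send without blocking and
// reports whether the value was accepted.
func trySend(ch chan<- int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false // a plain send would have blocked here
	}
}

func main() {
	unbuffered := make(chan int)
	buffered := make(chan int, 2)

	fmt.Println(trySend(unbuffered, 1)) // false: no receiver is waiting
	fmt.Println(trySend(buffered, 1))   // true
	fmt.Println(trySend(buffered, 2))   // true
	fmt.Println(trySend(buffered, 3))   // false: the buffer is full

	fmt.Println(len(buffered), cap(buffered)) // 2 2
}
```

In a real pipeline you would usually let the send block rather than drop the value; the non-blocking probe is only there to make the blocking points visible.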
A worker pool: the scraping idiom
This is the pattern you'll write most often. Spawn N workers that all read from a chan string of URLs and push results to a chan Result.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

type Result struct {
	URL      string
	Status   int
	BodySize int
}

func worker(id int, urls <-chan string, results chan<- Result, wg *sync.WaitGroup) {
	defer wg.Done()
	client := &http.Client{Timeout: 10 * time.Second}
	for url := range urls {
		resp, err := client.Get(url)
		if err != nil {
			results <- Result{URL: url, Status: -1}
			continue
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			results <- Result{URL: url, Status: -1} // body read failed mid-stream
			continue
		}
		results <- Result{URL: url, Status: resp.StatusCode, BodySize: len(body)}
	}
	fmt.Printf("worker %d done\n", id)
}

func main() {
	urls := []string{
		"https://practice.scrapingcentral.com/",
		"https://example.com/",
		"https://example.org/",
		"https://example.net/",
	}
	jobs := make(chan string, len(urls))
	results := make(chan Result, len(urls))

	var wg sync.WaitGroup
	for i := 1; i <= 3; i++ { // 3 workers
		wg.Add(1)
		go worker(i, jobs, results, &wg)
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // signals workers: no more URLs

	go func() { // close results once all workers done
		wg.Wait()
		close(results)
	}()

	for r := range results {
		fmt.Printf("%-40s status=%d size=%d\n", r.URL, r.Status, r.BodySize)
	}
}
Read this twice. It contains every concurrent-Go idiom you'll see in real scraper code:
- Typed channel directions in function signatures: `<-chan string` (receive-only), `chan<- Result` (send-only). The compiler enforces it; the documentation is in the signature.
- `close(jobs)` when the producer is done, so workers' `for range` loops exit.
- `go func(){ wg.Wait(); close(results) }()` so the main loop's `for range results` exits cleanly.
- `&wg` passed by pointer so all workers see the same WaitGroup.
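That last point guards against one of the most common bugs. A WaitGroup passed by value is copied, so each worker would call Done on its own copy and wg.Wait() in main would never return. Pass WaitGroups and mutexes by pointer (go vet flags copies of both), and do the same for large structs to avoid copying them.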
select: waiting on multiple channels
for {
	select {
	case url := <-urls:
		fetch(url)
	case <-quit:
		return
	case <-time.After(5 * time.Second):
		fmt.Println("idle timeout")
		return
	}
}
select blocks until one of its cases can proceed, then runs that case. Useful when a goroutine needs to handle work, a stop signal, and a timeout simultaneously. This is the pattern for graceful shutdown in real scraper code.
context.Context: the cancellation idiom
Every serious Go HTTP call should accept a context.Context. It carries deadlines and cancellation signals.
import "context"
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
req, _ := http.NewRequestWithContext(ctx, "GET", url, nil)
resp, err := http.DefaultClient.Do(req)
If the 10 seconds elapse before the response, Do returns an error and the connection is killed. This is how a Go scraper cancels in-flight requests on shutdown. Use WithCancel for manual cancellation, WithTimeout/WithDeadline for time-based.
Race conditions: the one bug you must avoid
Two goroutines writing to the same map without synchronisation will eventually crash the program with "fatal error: concurrent map writes". Unsynchronised writes to a shared slice or struct usually don't crash at all; they silently corrupt data.
Two ways to fix:
- Don't share. Each goroutine has its own map; merge at the end through a channel.
- Use a mutex. `sync.Mutex` or `sync.RWMutex` around the shared state.
Option 1 is the Go idiom. As Effective Go puts it: "Do not communicate by sharing memory; instead, share memory by communicating." When in doubt, push the data through a channel; don't reach into a shared struct.
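Run with go run -race . (or go test -race ./... for your tests) to enable the race detector. It only reports races on code paths that actually execute, so it will catch most concurrent-write bugs in development as long as your runs exercise the concurrent code.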
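The two fixes listed above can be sketched side by side. Both functions count how often each string appears; the function names and the sample input are made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// countWithMutex: every goroutine writes to one shared map, and a mutex
// makes sure only one of them touches it at a time.
func countWithMutex(hosts []string) map[string]int {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		counts = make(map[string]int)
	)
	for _, h := range hosts {
		wg.Add(1)
		go func(h string) {
			defer wg.Done()
			mu.Lock()
			counts[h]++
			mu.Unlock()
		}(h)
	}
	wg.Wait()
	return counts
}

// countWithChannel: goroutines only send on a channel. The calling
// goroutine is the sole owner of the map, so no lock is needed.
func countWithChannel(hosts []string) map[string]int {
	out := make(chan string)
	var wg sync.WaitGroup
	for _, h := range hosts {
		wg.Add(1)
		go func(h string) {
			defer wg.Done()
			out <- h
		}(h)
	}
	go func() { // close out once every sender has finished
		wg.Wait()
		close(out)
	}()

	counts := make(map[string]int)
	for h := range out {
		counts[h]++
	}
	return counts
}

func main() {
	hosts := []string{"a", "b", "a", "c", "a"}
	fmt.Println(countWithMutex(hosts))   // map[a:3 b:1 c:1]
	fmt.Println(countWithChannel(hosts)) // map[a:3 b:1 c:1]
}
```

Deleting the `mu.Lock()` and `mu.Unlock()` lines and running with the race detector enabled is a quick way to see what a reported race looks like.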
When goroutines are too many
A goroutine per URL works for a few thousand URLs. For 100M, you do need a worker pool to cap concurrency. Otherwise:
- File descriptors run out (OS limit, usually 1024 to 10k).
- The remote server bans you for hammering.
- Memory used by in-flight responses grows unbounded.
The worker-pool pattern above is the answer. Pick a worker count (50, 100, 500) tuned to the target site's tolerance and your machine's resources. The frontier channel acts as backpressure.
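One way to convince yourself that a pool really caps concurrency is to measure it. The sketch below runs dummy jobs through a small generic pool and records the highest number in flight at once; `runPool`, `peakConcurrency`, and the one-millisecond sleep standing in for a fetch are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runPool (hypothetical helper) calls fn for every job, with at most
// `workers` calls running at the same time.
func runPool(workers int, jobs []string, fn func(string)) {
	queue := make(chan string) // unbuffered: the producer waits for a free worker
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range queue {
				fn(j)
			}
		}()
	}
	for _, j := range jobs {
		queue <- j
	}
	close(queue)
	wg.Wait()
}

// peakConcurrency (hypothetical helper) pushes n dummy jobs through
// runPool and reports the largest number that ran simultaneously.
func peakConcurrency(workers, n int) int64 {
	var inFlight, peak int64
	jobs := make([]string, n)
	runPool(workers, jobs, func(string) {
		cur := atomic.AddInt64(&inFlight, 1)
		for { // raise peak to cur unless another goroutine already did
			old := atomic.LoadInt64(&peak)
			if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
				break
			}
		}
		time.Sleep(time.Millisecond) // stand-in for a network fetch
		atomic.AddInt64(&inFlight, -1)
	})
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println(peakConcurrency(5, 200) <= 5) // true
}
```

The cap holds no matter how many jobs are queued, because only the fixed set of worker goroutines ever calls `fn`.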
Where to practice
- Take the worker-pool snippet above. Add a `chan Result` consumer that writes results to a CSV. Reuse `csv.NewWriter` from `encoding/csv`.
- Modify it to fetch 10,000 URLs (use a test target or `practice.scrapingcentral.com`). Tune the worker count and watch throughput.
- Read Go by Example: Worker Pools. It's the canonical pattern, well-explained.
- Read Effective Go: Concurrency. It's a 15-minute read and the most efficient way to internalise the idioms.
Next: GO4 covers what a real HTTP request looks like in Go, with net/http.