
3.43 | advanced | 5 min read

WebSocket Scraping for Real-Time Data

When the data updates faster than HTTP polling can keep up. WebSockets are bidirectional, persistent, and surprisingly easy to scrape.

What you’ll learn

  • Recognise WebSocket usage in DevTools.
  • Connect to a WebSocket from Python and Node.
  • Subscribe to channels and consume streamed messages.
  • Use Catalog108's polling-shim lab to practice the patterns.

When a site updates faster than HTTP polling can keep up (live trading prices, sports scores, chat messages, collaborative document edits), the data probably flows over a WebSocket: persistent, bidirectional, low-overhead. And, once you know how to connect, often unprotected against scraping.

A note on Catalog108: real WebSocket servers require persistent connections that shared hosting environments often don't support. Catalog108 simulates the data shape of a WebSocket via polling shims (/api/ws/live-prices, /api/ws/echo) so you can practice the message-handling logic locally. The patterns transfer directly to real WebSocket targets.

Recognising WebSocket usage

In DevTools → Network → Filter → WS:

  • A request with status 101 Switching Protocols.
  • URL starts with wss://... (TLS) or ws://... (plain).
  • Clicking it shows a stream of "Messages" with text or binary content.
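
The same check can be run offline against a HAR file exported from the Network tab. The sketch below is a minimal illustration, assuming the export records the upgrade as an entry with status 101 and a ws:// or wss:// URL; the function name and the hand-written sample data are hypothetical.

```python
from urllib.parse import urlparse


def find_websocket_entries(har: dict) -> list[str]:
    """Return the URLs of HAR entries that look like WebSocket upgrades."""
    urls = []
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        status = entry.get("response", {}).get("status")
        scheme = urlparse(url).scheme
        # A WebSocket handshake is answered with 101 and uses ws:// or wss://.
        if status == 101 or scheme in ("ws", "wss"):
            urls.append(url)
    return urls


# Tiny hand-written sample standing in for a real DevTools export.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/app.js"},
             "response": {"status": 200}},
            {"request": {"url": "wss://example.com/ws/prices"},
             "response": {"status": 101}},
        ]
    }
}

print(find_websocket_entries(sample_har))
```

For a real export, load the file with json.load() and pass the resulting dict in.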

Real WebSockets: the connection

In Python with websockets:

import asyncio, json, websockets

async def main():
    async with websockets.connect("wss://example.com/ws/prices") as ws:
        # Some servers require a subscribe message
        await ws.send(json.dumps({"action": "subscribe", "channel": "prices"}))
        while True:
            raw = await ws.recv()
            msg = json.loads(raw)
            print(msg)

asyncio.run(main())

In Node with ws:

import WebSocket from 'ws';

const ws = new WebSocket('wss://example.com/ws/prices');
ws.on('open', () => {
  ws.send(JSON.stringify({ action: 'subscribe', channel: 'prices' }));
});
ws.on('message', (raw) => {
  console.log(JSON.parse(raw.toString()));
});

Catalog108's polling shim

The Catalog108 lab /api/ws/live-prices simulates a price-tick stream by responding to GET polls every 3 seconds with new tick data:

import requests, time

while True:
    r = requests.get("https://practice.scrapingcentral.com/api/ws/live-prices")
    for tick in r.json()["ticks"]:
        print(tick["symbol"], tick["price"])
    time.sleep(3)

This lets you practice the consume-and-process loop without infrastructure complexity. For a real WebSocket target, swap the polling for a websockets.connect() and the rest of your code stays the same.
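
One way to keep "the rest of your code" unchanged is to write the processing step against a plain iterable of ticks, so it does not care whether they came from a poll or a socket. This is a minimal sketch, assuming the symbol/price tick shape used in the loop above; the function names and sample values are made up for illustration.

```python
from typing import Iterable


def handle_tick(tick: dict, latest: dict) -> None:
    """Record the most recent price seen for each symbol."""
    latest[tick["symbol"]] = tick["price"]


def consume(ticks: Iterable[dict]) -> dict:
    """Process ticks from any source and return the latest price per symbol."""
    latest: dict = {}
    for tick in ticks:
        handle_tick(tick, latest)
    return latest


# Stand-in for one polled response body; the values are invented.
fake_response = {"ticks": [
    {"symbol": "AAA", "price": 10.0},
    {"symbol": "BBB", "price": 20.5},
    {"symbol": "AAA", "price": 10.2},
]}

print(consume(fake_response["ticks"]))
```

With the shim, pass r.json()["ticks"]; with a real socket, call handle_tick() on each decoded message inside the receive loop.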

Subscribe-publish patterns

Most WebSocket APIs use subscribe-publish:

import asyncio, json, websockets

def handle_tick(msg):
    print(msg)  # placeholder: replace with your own processing

async def main():
    async with websockets.connect("wss://example.com/ws") as ws:
        # Subscribe to multiple channels
        await ws.send(json.dumps({
            "type": "subscribe",
            "channels": ["prices.AAPL", "prices.GOOG", "prices.MSFT"],
        }))
        async for raw in ws:
            msg = json.loads(raw)
            channel = msg.get("channel")
            if channel and channel.startswith("prices."):
                handle_tick(msg)

asyncio.run(main())

Channels, streams, topics: same idea, different naming conventions.
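
Once a connection carries several channels, the if/startswith check tends to grow into a dispatcher. The sketch below shows one possible shape, assuming messages are JSON objects with a "channel" key as in the example above; the ChannelRouter class is hypothetical, not part of any library.

```python
import json
from typing import Callable


class ChannelRouter:
    """Route decoded messages to handlers by channel-name prefix."""

    def __init__(self) -> None:
        self._routes: list[tuple[str, Callable[[dict], None]]] = []

    def on(self, prefix: str, handler: Callable[[dict], None]) -> None:
        """Register a handler for every channel starting with `prefix`."""
        self._routes.append((prefix, handler))

    def dispatch(self, raw: str) -> bool:
        """Decode one frame; return True if a handler accepted it."""
        msg = json.loads(raw)
        if not isinstance(msg, dict):
            return False
        channel = msg.get("channel") or ""
        for prefix, handler in self._routes:
            if channel.startswith(prefix):
                handler(msg)
                return True
        # Acks, errors, and unknown channels fall through unhandled.
        return False
```

Inside the receive loop this becomes router.dispatch(raw), with one router.on("prices.", handle_tick) call per channel family during setup.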

Heartbeats and reconnection

Real WebSocket connections are dropped after an idle timeout. Two mitigations:

  • Heartbeat. Client sends {"type": "ping"} every 30s; server replies pong. Keeps the connection alive.
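  • Reconnect with backoff. On disconnect, wait, reconnect, then resubscribe.

import asyncio, json, websockets

URL = "wss://example.com/ws"  # placeholder target

def process(msg):
    print(msg)  # placeholder: replace with your own handling

async def run():
    backoff = 1
    while True:
        try:
            async with websockets.connect(URL, ping_interval=20, ping_timeout=10) as ws:
                # Resubscribe on every (re)connect
                await ws.send(json.dumps({"type": "subscribe", "channels": ["prices"]}))
                backoff = 1  # reset on successful connect
                async for raw in ws:
                    process(json.loads(raw))
            reason = "connection closed by server"
        except (websockets.ConnectionClosed, OSError) as e:
            reason = str(e)
        # Wait after clean closes too, so an instant close cannot cause a tight loop
        print(f"Disconnected: {reason}; reconnecting in {backoff}s")
        await asyncio.sleep(backoff)
        backoff = min(60, backoff * 2)

asyncio.run(run())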

The websockets library sends and answers protocol-level ping/pong frames automatically; ping_interval and ping_timeout control the schedule.
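
Some servers expect the JSON heartbeat from the first bullet instead of, or in addition to, protocol-level pings, and that message has to come from your own code. A minimal sketch, assuming the {"type": "ping"} shape shown above and a connection object that offers an async send() and async iteration (as websockets connections do); both function names are hypothetical.

```python
import asyncio
import contextlib
import json


async def heartbeat(ws, interval: float = 30.0) -> None:
    """Send an application-level ping every `interval` seconds until cancelled."""
    while True:
        await asyncio.sleep(interval)
        await ws.send(json.dumps({"type": "ping"}))


async def consume_with_heartbeat(ws, handle, interval: float = 30.0) -> None:
    """Read messages while a background task keeps the connection warm."""
    task = asyncio.create_task(heartbeat(ws, interval))
    try:
        async for raw in ws:
            handle(json.loads(raw))
    finally:
        # Stop the heartbeat as soon as the read loop ends for any reason.
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
```

The server's pong replies arrive through the same read loop, so handle() should be prepared to ignore them.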

Auth on WebSockets

Three common patterns:

  1. Token in URL, wss://example.com/ws?token=abc123. Easy.
  2. Header auth, an Authorization header sent with the upgrade request. websockets supports custom headers (additional_headers in current releases; the legacy client names the parameter extra_headers):
async with websockets.connect(URL, additional_headers={"Authorization": f"Bearer {token}"}) as ws:
    ...  # the connection is authenticated from the first frame
  3. First-message auth, after connecting, the first WS message is {"action": "auth", "token": "..."}. The server validates it before accepting subscriptions.

Capture a real connection in DevTools to see which pattern the target uses.
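
Patterns 1 and 3 need only a little glue code. The helpers below are a sketch: the query parameter name "token" and the auth message shape are taken from the examples above, and a real target may use different names.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_token(url: str, token: str) -> str:
    """Append a token query parameter, keeping any existing query string."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("token", token))
    # urlencode escapes characters that are not safe inside a query string.
    return urlunsplit(parts._replace(query=urlencode(query)))


def auth_message(token: str) -> str:
    """Build the first frame for first-message auth."""
    return json.dumps({"action": "auth", "token": token})


print(with_token("wss://example.com/ws", "abc123"))
```

For first-message auth, send auth_message(token) right after connecting and before any subscribe message.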

Message types

WebSocket messages are either text or binary. JSON over text is most common, but you'll see:

  • JSON text, most APIs.
  • Plain text events, 42[...] (Socket.IO encoding).
  • MessagePack binary, compact binary encoding of JSON-like structures.
  • Protobuf binary, Google's Protocol Buffers. Schema-dependent decoding.

For unfamiliar binary, inspect a sample in DevTools' message view first and identify the encoding before you try to decode it.
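
A first-pass classifier for incoming frames might look like the sketch below. It handles JSON text and the simple Socket.IO form (packet-type digits followed by a JSON payload) and leaves binary frames undecoded, since MessagePack and Protobuf need a library or schema. The function name and return shape are hypothetical, and namespaced Socket.IO packets are not handled.

```python
import json


def decode_frame(frame):
    """Classify one WebSocket frame and decode it when the format is recognisable."""
    if isinstance(frame, (bytes, bytearray)):
        # Binary: MessagePack, Protobuf, or something else entirely.
        return ("binary", bytes(frame))
    if frame[:1] in ("{", "["):
        return ("json", json.loads(frame))
    # Socket.IO text frames start with packet-type digits, e.g. 42["event", {...}]
    digits = ""
    for ch in frame:
        if ch not in "0123456789":
            break
        digits += ch
    payload = frame[len(digits):]
    if digits and payload[:1] in ("[", "{"):
        return ("socket.io", (digits, json.loads(payload)))
    return ("text", frame)


print(decode_frame('42["tick",{"price":1.5}]'))
```

Frames that come back as "text" or "binary" are the ones worth inspecting by hand in DevTools.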

Catalog108 echo lab

/api/ws/echo simulates a WebSocket echo server via POST:

import requests

r = requests.post("https://practice.scrapingcentral.com/api/ws/echo",
  json={"message": "hello"})
print(r.json())  # → {"echo": "hello", "timestamp": "..."}

Practice the send-and-receive logic against the shim; transfer to a real WS echo (e.g. wss://echo.websocket.org) when you want to test against real WebSockets.

A real example: Coinbase Exchange (formerly Coinbase Pro)

For practice on a real, public WebSocket:

import asyncio, json, websockets

async def coinbase():
    async with websockets.connect("wss://ws-feed.exchange.coinbase.com") as ws:
        await ws.send(json.dumps({
            "type": "subscribe",
            "product_ids": ["BTC-USD"],
            "channels": ["ticker"],
        }))
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "ticker":
                print(msg["product_id"], msg["price"])

asyncio.run(coinbase())

Real, public, low-friction. Connect, subscribe, consume.

When to use a WebSocket

Use a WebSocket when:

  • The data updates faster than ~1 Hz.
  • You want push notifications, not polling.
  • Bandwidth/latency matters (mobile, IoT).
  • The protocol is already WS (you're capturing browser traffic).

Stick with HTTP polling when:

  • Updates are slow (daily, hourly).
  • You only need occasional snapshots.
  • WebSockets aren't natively supported by your infrastructure.

Storage and processing

Streaming data needs streaming storage:

  • Append-only logs, Kafka, Redpanda, Pulsar.
  • Time-series databases, InfluxDB, TimescaleDB, ClickHouse.
  • Simple file logs, JSON-lines for prototyping.

A scraper that consumes a WebSocket and crashes loses everything since the last save. Build durable persistence into the consume loop.
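
For the JSON-lines option, durability mostly means flushing each message to disk before moving on to the next. A minimal sketch, assuming one JSON object per line; the JsonlSink class and its fsync_every parameter are hypothetical names for illustration.

```python
import json
import os


class JsonlSink:
    """Append messages to a JSON-lines file, syncing to disk as it goes."""

    def __init__(self, path: str, fsync_every: int = 1) -> None:
        self._f = open(path, "a", encoding="utf-8")
        self._fsync_every = fsync_every
        self._pending = 0

    def write(self, msg: dict) -> None:
        self._f.write(json.dumps(msg, separators=(",", ":")) + "\n")
        self._pending += 1
        if self._pending >= self._fsync_every:
            self._sync()

    def _sync(self) -> None:
        # flush() empties Python's buffer; fsync() asks the OS to hit the disk.
        self._f.flush()
        os.fsync(self._f.fileno())
        self._pending = 0

    def close(self) -> None:
        self._sync()
        self._f.close()

    def __enter__(self) -> "JsonlSink":
        return self

    def __exit__(self, *exc) -> None:
        self.close()
```

Opening the sink with a with-statement around the receive loop and calling sink.write(msg) per message means a crash loses at most the last fsync_every messages. Raising fsync_every trades some of that durability for throughput on fast feeds.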

Hands-on lab

On Catalog108: poll /api/ws/live-prices for 5 minutes. Capture every tick. Notice how the price field changes over time. Then graduate to a real WebSocket: wss://ws-feed.exchange.coinbase.com is public and well-behaved. Subscribe to the BTC-USD ticker and log it to a file. You've now scraped both shimmed and real WebSocket APIs.

Practice this lesson on Catalog108, our first-party scraping sandbox.

Open lab target → /challenges/api/websocket/live-prices
