WebSocket Scraping for Real-Time Data
For when the data updates faster than HTTP polling can keep up: WebSockets are bidirectional, persistent, and surprisingly easy to scrape.
What you’ll learn
- Recognise WebSocket usage in DevTools.
- Connect to a WebSocket from Python and Node.
- Subscribe to channels and consume streamed messages.
- Use Catalog108's polling-shim lab to practice the patterns.
When a site updates faster than HTTP polling can keep up (live trading prices, sports scores, chat messages, collaborative document edits), the data probably flows over a WebSocket: persistent, bidirectional, low-overhead. And, from a scraping standpoint, often unprotected once you know how to connect.
A note on Catalog108: real WebSocket servers require persistent connections that shared hosting environments often don't support. Catalog108 simulates the data shape of a WebSocket via polling shims (/api/ws/live-prices, /api/ws/echo) so you can practice the message-handling logic locally. The patterns transfer directly to real WebSocket targets.
Recognising WebSocket usage
In DevTools → Network → Filter → WS:
- A request with status 101 Switching Protocols.
- A URL starting with wss:// (TLS) or ws:// (plain).
- Clicking it shows a stream of "Messages" with text or binary content.
Real WebSockets, the connection
In Python with websockets:
```python
import asyncio, json, websockets

async def main():
    async with websockets.connect("wss://example.com/ws/prices") as ws:
        # Some servers require a subscribe message
        await ws.send(json.dumps({"action": "subscribe", "channel": "prices"}))
        while True:
            raw = await ws.recv()
            msg = json.loads(raw)
            print(msg)

asyncio.run(main())
```
In Node with ws:
```javascript
import WebSocket from 'ws';

const ws = new WebSocket('wss://example.com/ws/prices');

ws.on('open', () => {
  ws.send(JSON.stringify({ action: 'subscribe', channel: 'prices' }));
});

ws.on('message', (raw) => {
  console.log(JSON.parse(raw.toString()));
});
```
Catalog108's polling shim
The Catalog108 lab /api/ws/live-prices simulates a price-tick stream by responding to GET polls every 3 seconds with new tick data:
```python
import requests, time

while True:
    r = requests.get("https://practice.scrapingcentral.com/api/ws/live-prices")
    for tick in r.json()["ticks"]:
        print(tick["symbol"], tick["price"])
    time.sleep(3)
```
This lets you practice the consume-and-process loop without infrastructure complexity. For a real WebSocket target, swap the polling for a websockets.connect() and the rest of your code stays the same.
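To make that swap painless, keep the per-message logic in transport-agnostic functions. A minimal sketch (the function names are illustrative, not part of the Catalog108 API):

```python
import json

def handle_tick(tick: dict) -> str:
    # Shared per-tick logic, used by both consumers below.
    return f"{tick['symbol']} {tick['price']}"

def consume_polled(body: str) -> list[str]:
    # Polling shim: one GET response body carries a batch of ticks.
    return [handle_tick(t) for t in json.loads(body)["ticks"]]

def consume_frame(raw: str) -> str:
    # Real WebSocket: one frame carries one tick.
    return handle_tick(json.loads(raw))
```

Only the outer loop changes between `requests.get` polling and `async for raw in ws`.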
Subscribe-publish patterns
Most WebSocket APIs use subscribe-publish:
```python
import asyncio, json, websockets

async def main():
    async with websockets.connect("wss://example.com/ws") as ws:
        # Subscribe to multiple channels
        await ws.send(json.dumps({
            "type": "subscribe",
            "channels": ["prices.AAPL", "prices.GOOG", "prices.MSFT"],
        }))
        async for raw in ws:
            msg = json.loads(raw)
            channel = msg.get("channel")
            if channel and channel.startswith("prices."):
                handle_tick(msg)  # your per-tick processing
```
Channels, streams, topics: same idea, different naming conventions.
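With several channels subscribed, a small prefix-to-handler map keeps the consume loop flat. A sketch under the same message shape as above (the decorator and handler names are made up for illustration):

```python
import json

HANDLERS: dict = {}  # channel prefix -> handler function

def on(prefix: str):
    # Register a handler for any channel starting with `prefix`.
    def register(fn):
        HANDLERS[prefix] = fn
        return fn
    return register

@on("prices.")
def handle_tick(msg: dict):
    return ("tick", msg["channel"], msg["price"])

def dispatch(raw: str):
    # Route one raw frame to the matching handler, or drop it.
    msg = json.loads(raw)
    channel = msg.get("channel", "")
    for prefix, fn in HANDLERS.items():
        if channel.startswith(prefix):
            return fn(msg)
    return None
```

Inside the `async for raw in ws` loop, the body shrinks to a single `dispatch(raw)` call.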
Heartbeats and reconnection
Real WebSocket servers disconnect idle clients after a timeout. Two mitigations:
- Heartbeat. The client sends {"type": "ping"} every 30s; the server replies pong. Keeps the connection alive.
- Reconnect with backoff. On disconnect, wait and reconnect; resubscribe afterwards.
```python
import asyncio, json, websockets

URL = "wss://example.com/ws"

async def run():
    backoff = 1
    while True:
        try:
            async with websockets.connect(URL, ping_interval=20, ping_timeout=10) as ws:
                await ws.send(json.dumps({"type": "subscribe", "channels": ["prices"]}))
                backoff = 1  # reset on successful connect
                async for raw in ws:
                    process(json.loads(raw))  # your message handler
        except (websockets.ConnectionClosed, OSError) as e:
            print(f"Disconnected: {e}; reconnecting in {backoff}s")
            await asyncio.sleep(backoff)
            backoff = min(60, backoff * 2)

asyncio.run(run())
```
The websockets library handles protocol-level ping/pong automatically when you pass ping_interval.
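Some servers ignore protocol-level pings and expect an application-level heartbeat instead (the {"type": "ping"} message from the list above). One way to run it alongside the consume loop; the 30-second interval and message shape are assumptions to verify in DevTools:

```python
import asyncio, json

async def heartbeat(ws, interval: float = 30.0):
    # Send an application-level ping forever; cancelled on disconnect.
    while True:
        await asyncio.sleep(interval)
        await ws.send(json.dumps({"type": "ping"}))

async def consume(ws):
    # Run the heartbeat as a sibling task of the consume loop.
    pinger = asyncio.create_task(heartbeat(ws))
    try:
        async for raw in ws:
            print(json.loads(raw))
    finally:
        pinger.cancel()
```

Cancelling in `finally` ensures the heartbeat task dies with the connection rather than pinging a closed socket.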
Auth on WebSockets
Three common patterns:
- Token in URL, wss://example.com/ws?token=abc123. Easy.
- Header auth, an Authorization header sent with the HTTP upgrade request. websockets supports custom headers:

```python
async with websockets.connect(
    URL,
    additional_headers={"Authorization": f"Bearer {token}"},  # extra_headers in older websockets versions
) as ws:
    ...
```

- First-message auth, after connecting, the first WS message is {"action": "auth", "token": "..."}. The server validates it before allowing subscriptions.
Capture a real connection in DevTools to see which pattern the target uses.
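For the first-message pattern, a small coroutine keeps the handshake separate from the consume loop. The reply field names here ("status": "ok") are assumptions; mirror whatever the captured handshake actually shows:

```python
import json

async def authenticate(ws, token: str) -> bool:
    # Send credentials as the first frame, then wait for the verdict.
    await ws.send(json.dumps({"action": "auth", "token": token}))
    reply = json.loads(await ws.recv())
    return reply.get("status") == "ok"
```

Call it right after `websockets.connect()` and only subscribe if it returns True.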
Message types
WebSocket messages are either text or binary. JSON over text is most common, but you'll see:
- JSON text, most APIs.
- Plain-text events, 42[...] (Socket.IO encoding).
- MessagePack binary, compact binary encoding of JSON-like structures.
- Protobuf binary, Google's Protocol Buffers. Schema-dependent decoding.
For unfamiliar binary, inspect a sample in DevTools' message view and form a guess at the encoding before writing a decoder.
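A rough first-pass classifier helps with that guessing. It only distinguishes the families listed above and is a heuristic, not a real decoder:

```python
import json

def sniff_frame(raw) -> str:
    # Heuristic guess at a frame's encoding; inspect a hex dump for binary.
    if isinstance(raw, (bytes, bytearray)):
        return "binary"  # candidates: MessagePack, Protobuf
    text = raw.lstrip()
    if text[:1] in ("{", "["):
        try:
            json.loads(text)
            return "json"
        except ValueError:
            pass
    if text[:1].isdigit() and "[" in text:
        return "socket.io"  # e.g. 42["event", {...}] packet framing
    return "text"
```

Run it on the first few captured frames to decide which decoder to reach for.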
Catalog108 echo lab
/api/ws/echo simulates a WebSocket echo server via POST:
```python
import requests

r = requests.post("https://practice.scrapingcentral.com/api/ws/echo",
                  json={"message": "hello"})
print(r.json())  # → {"echo": "hello", "timestamp": "..."}
```
Practice the send-and-receive logic against the shim, then point the same code at a real WS echo server (e.g. wss://echo.websocket.org) when you want a live connection.
A real example, Coinbase Pro
For practice on a real, public WebSocket:
```python
import asyncio, json, websockets

async def coinbase():
    async with websockets.connect("wss://ws-feed.exchange.coinbase.com") as ws:
        await ws.send(json.dumps({
            "type": "subscribe",
            "product_ids": ["BTC-USD"],
            "channels": ["ticker"],
        }))
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "ticker":
                print(msg["product_id"], msg["price"])

asyncio.run(coinbase())
```
Real, public, low-friction. Connect, subscribe, consume.
When to use a WebSocket
Use a WebSocket when:
- The data updates faster than ~1 Hz.
- You want push notifications, not polling.
- Bandwidth/latency matters (mobile, IoT).
- The protocol is already WS (you're capturing browser traffic).
Stick with HTTP polling when:
- Updates are slow (daily, hourly).
- You only need occasional snapshots.
- WebSockets aren't natively supported by your infrastructure.
Storage and processing
Streaming data needs streaming storage:
- Append-only logs, Kafka, Redpanda, Pulsar.
- Time-series databases, InfluxDB, TimescaleDB, ClickHouse.
- Simple file logs, JSON-lines for prototyping.
A scraper that consumes a WebSocket and crashes loses everything since the last save. Build durable persistence into the consume loop.
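A JSON-lines append inside the consume loop is the simplest durable option. A sketch; the fsync means a crash loses at most the in-flight write, at the cost of throughput (drop it if volume is high):

```python
import json, os

def append_tick(path: str, msg: dict) -> None:
    # One message per line; flush + fsync before returning.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(msg) + "\n")
        f.flush()
        os.fsync(f.fileno())
```

Persist each message before any further processing, so a restart can replay from the file.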
Hands-on lab
On Catalog108: poll /api/ws/live-prices for 5 minutes. Capture every tick. Notice the price field updates over time. Then graduate to a real WebSocket, wss://ws-feed.exchange.coinbase.com is public and well-behaved. Subscribe to BTC-USD ticker; log to a file. You've now scraped both shimmed and real WebSocket APIs.
Practice this lesson on Catalog108, our first-party scraping sandbox: open the lab target at /challenges/api/websocket/live-prices.