Running Scrapy Spiders on Scrapy Cloud (Zyte) - Deployment

Deploy and manage Scrapy spiders on Zyte's Scrapy Cloud platform for effortless scheduling, monitoring, and scaling.

Scrapy Cloud is the managed platform from Zyte (formerly Scrapinghub) for running Scrapy spiders. It handles scheduling, monitoring, data storage, and scaling, so you do not have to provision or maintain servers yourself.

Getting Started

# Install shub, the Scrapy Cloud command-line client
pip install shub

# Log in to your Zyte account
shub login
# Enter your API key from https://app.zyte.com/account/apikey

Create a Scrapy Project

scrapy startproject myproject
cd myproject

# myproject/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Deploy to Scrapy Cloud

# Deploy the project; run this from the project root (the directory containing scrapy.cfg)
shub deploy

# On the first run, shub prompts for your project ID (found in the Zyte dashboard)
# and creates scrapinghub.yml if it does not exist

Your scrapinghub.yml should look like this:

# scrapinghub.yml
projects:
  default: 123456  # Your project ID

# Extra Python packages to install on deploy; requirements.txt must exist in the project root
requirements:
  file: requirements.txt

Run Spiders from the CLI

# Run a spider
shub schedule quotes

# View the items scraped by a job
shub items 123456/1/1  # project/spider/job

# View logs
shub log 123456/1/1  # project/spider/job
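
The job ID used in these commands has three numeric parts: project, spider, and job. A minimal sketch of splitting such an ID in Python, assuming that three-part form; `JobId` and `parse_job_id` are hypothetical names used for illustration and are not part of shub:

```python
from typing import NamedTuple


class JobId(NamedTuple):
    """The three numeric parts of a job ID (hypothetical helper type)."""
    project: int
    spider: int
    job: int


def parse_job_id(job_id: str) -> JobId:
    """Split 'project/spider/job' into integers, rejecting malformed input."""
    parts = job_id.strip().split("/")
    if len(parts) != 3 or not all(p.isascii() and p.isdecimal() for p in parts):
        raise ValueError(f"expected 'project/spider/job', got {job_id!r}")
    return JobId(*(int(p) for p in parts))


job = parse_job_id("123456/1/1")
print(job.project, job.spider, job.job)
```

Validating the ID before passing it to a script or API call gives a clear error early instead of a confusing failure later.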

Schedule Periodic Runs

In the Zyte dashboard, navigate to your project and set up periodic jobs:

Cron-based scheduling: run at specific times
Periodic scheduling: run every N minutes or hours
Priority levels: control which spiders run first

You can also schedule via the API:

import requests

API_KEY = "YOUR_ZYTE_API_KEY"
PROJECT_ID = "123456"

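response = requests.post(
    "https://app.zyte.com/api/schedule.json",
    auth=(API_KEY, ""),  # API key as the username, empty password
    data={
        "project": PROJECT_ID,
        "spider": "quotes",
        "add_tag": ["production"],
    },
)
response.raise_for_status()  # Stop here on HTTP errors such as 401 or 404
print(response.json())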
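
The request above sends only the project, the spider name, and a tag. A minimal sketch of building a fuller set of form fields follows. It rests on assumptions about the scheduling endpoint that are worth checking against the Zyte API reference: that `priority` is accepted, that per-job Scrapy settings go in a JSON-encoded `job_settings` field, and that any other field is passed to the spider as an argument. `build_schedule_payload` and the `category` argument are hypothetical names used for illustration.

```python
import json
from typing import Any, Dict, List, Optional

# Field names the endpoint is assumed to interpret itself.
RESERVED = {"project", "spider", "add_tag", "priority", "job_settings"}


def build_schedule_payload(
    project_id: str,
    spider: str,
    tags: Optional[List[str]] = None,
    priority: Optional[int] = None,
    job_settings: Optional[Dict[str, Any]] = None,
    spider_args: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build the form fields for a scheduling request (hypothetical helper)."""
    payload: Dict[str, Any] = {"project": project_id, "spider": spider}
    if tags:
        payload["add_tag"] = list(tags)
    if priority is not None:
        payload["priority"] = priority
    if job_settings:
        # Assumption: per-job settings are sent as a single JSON string.
        payload["job_settings"] = json.dumps(job_settings)
    for name, value in (spider_args or {}).items():
        if name in RESERVED:
            raise ValueError(f"spider argument {name!r} clashes with a reserved field")
        # Assumption: unrecognized fields reach the spider as arguments.
        payload[name] = value
    return payload


payload = build_schedule_payload(
    "123456",
    "quotes",
    tags=["production"],
    job_settings={"AUTOTHROTTLE_ENABLED": True},
    spider_args={"category": "humor"},
)
print(payload)
```

The resulting dictionary can be passed as `data=` to `requests.post`. Keeping payload construction in its own function makes it testable without touching the network.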

Accessing Scraped Data

import requests

API_KEY = "YOUR_ZYTE_API_KEY"

# List jobs
jobs = requests.get(
    "https://app.zyte.com/api/jobs/list.json",
    auth=(API_KEY, ""),
    params={"project": "123456", "state": "finished"},
).json()

# Download items from the first job in the list
job_id = jobs["jobs"][0]["id"]  # e.g. "123456/1/1"
items = requests.get(
    f"https://storage.scrapinghub.com/items/{job_id}",
    auth=(API_KEY, ""),
    params={"format": "json"},
).json()

for item in items[:5]:
    print(item)
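
Once downloaded, the items are plain dictionaries, so they can be saved with the standard library. A minimal sketch of writing them to CSV, assuming each item carries the three fields yielded by the quotes spider; `items_to_csv` is a hypothetical helper, and joining tags with "|" is an arbitrary choice:

```python
import csv
import io
from typing import Any, Dict, Iterable, TextIO

FIELDS = ["text", "author", "tags"]  # fields yielded by the quotes spider


def items_to_csv(items: Iterable[Dict[str, Any]], out: TextIO) -> int:
    """Write items as CSV rows and return the number of rows written."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for item in items:
        # Copy only the known fields; anything else in the item is dropped.
        row = {field: item.get(field, "") for field in FIELDS}
        # A CSV cell holds one string, so the tag list is joined with "|".
        row["tags"] = "|".join(row["tags"] or [])
        writer.writerow(row)
        count += 1
    return count


sample = [{"text": "A quote.", "author": "Someone", "tags": ["life", "humor"]}]
buffer = io.StringIO()
print(items_to_csv(sample, buffer), "row(s) written")
print(buffer.getvalue())
```

To write a file instead of an in-memory buffer, open it with `open("quotes.csv", "w", newline="", encoding="utf-8")` and pass the file object as `out`.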

Scrapy Cloud vs Self-Hosted

| Feature | Scrapy Cloud | Self-Hosted (VPS) |
|---|---|---|
| Setup time | Minutes | Hours |
| Scheduling | Built-in | Cron setup needed |
| Monitoring | Dashboard included | DIY |
| Scaling | Automatic | Manual |
| Cost | Free tier + paid plans | $5+/month |
| Customization | Limited | Full control |

Tips

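Configure concurrency and autothrottle through Scrapy settings (in settings.py or the project settings in the Zyte dashboard); scrapinghub.yml holds deployment options such as the project ID, stack, and requirements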
Store secrets as project-level settings in the Zyte dashboard instead of committing them to your repository
Use Collections or Datasets in Zyte for structured data storage
Enable AutoThrottle in your Scrapy settings for polite scraping
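
The AutoThrottle and concurrency tips map onto standard Scrapy settings. A sketch of a settings.py fragment follows; the setting names are Scrapy's own, but the values are illustrative assumptions rather than recommendations from Zyte, so adjust them for the site you are scraping:

```python
# myproject/settings.py (fragment); the values are illustrative assumptions

# Let Scrapy adapt the delay between requests to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # longest delay to use when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per remote site

# Upper limits on concurrency; AutoThrottle stays within them.
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Respect the robots.txt rules published by the target site.
ROBOTSTXT_OBEY = True
```

A lower `AUTOTHROTTLE_TARGET_CONCURRENCY` is gentler on the target site at the cost of a slower crawl.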