Running Scrapy Spiders on Scrapy Cloud (Zyte)

Deploy and manage Scrapy spiders on Zyte's Scrapy Cloud platform for effortless scheduling, monitoring, and scaling.

Deployment · #5 · Intermediate · 3 min read

Scrapy Cloud is the managed hosting platform from Zyte (formerly Scrapinghub), purpose-built for running Scrapy spiders. It handles scheduling, monitoring, data storage, and scaling, so you do not have to provision or maintain servers yourself.

Getting Started

# Install shub, the Scrapy Cloud command-line client
pip install shub

# Log in to your Zyte account
shub login
# Enter your API key from https://app.zyte.com/account/apikey

Create a Scrapy Project

scrapy startproject myproject
cd myproject

# myproject/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Deploy to Scrapy Cloud

# Deploy the project (run from the directory that contains scrapy.cfg)
shub deploy

# On the first run, shub asks for your project ID (found in the Zyte dashboard)
# and saves it to scrapinghub.yml if the file does not exist yet

Your scrapinghub.yml should look like:

# scrapinghub.yml
projects:
  default: 123456  # Your project ID

requirements:
  file: requirements.txt

Run Spiders from the CLI

# Run a spider
shub schedule quotes

# View the items scraped by a job (shub items takes a job ID, not a spider name)
shub items 123456/1/1  # project/spider/job

# View a job's logs
shub log 123456/1/1  # project/spider/job
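
Job IDs such as 123456/1/1 show up in both the CLI and the API, so scripts often need to pull them apart. The following is a minimal sketch of doing that; the helper name parse_job_id and the JobId type are hypothetical and are not part of shub.

```python
from typing import NamedTuple


class JobId(NamedTuple):
    """The three numeric parts of a 'project/spider/job' identifier."""
    project: int
    spider: int
    job: int


def parse_job_id(job_id: str) -> JobId:
    """Split a 'project/spider/job' string into its numeric parts."""
    parts = job_id.strip().split("/")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected 'project/spider/job', got {job_id!r}")
    return JobId(*(int(part) for part in parts))


print(parse_job_id("123456/1/1"))
```

Validating the shape up front gives a clear error when a spider name is passed where a job ID is expected.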

Schedule Periodic Runs

In the Zyte dashboard, navigate to your project and set up periodic jobs:

  • Cron-based scheduling: run at specific times
  • Periodic scheduling: run every N minutes/hours
  • Priority levels: control which spiders run first

You can also schedule via the API:

import requests

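API_KEY = "YOUR_ZYTE_API_KEY"
PROJECT_ID = "123456"

# run.json is the job-scheduling endpoint in the Scrapy Cloud API docs
response = requests.post(
    "https://app.zyte.com/api/run.json",
    auth=(API_KEY, ""),  # API key as the username, empty password
    data={
        "project": PROJECT_ID,
        "spider": "quotes",
        "add_tag": ["production"],
    },
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors such as a bad API key
print(response.json())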
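
When scheduling from several scripts, it can help to build the form fields in one place. The sketch below assumes that the scheduling endpoint accepts an optional priority field and forwards any unrecognized form field to the spider as an argument; both are worth confirming against the Scrapy Cloud API documentation. The function name build_run_payload and the category argument are hypothetical.

```python
from typing import Optional

# Field names the API itself interprets in this sketch.
RESERVED_FIELDS = {"project", "spider", "add_tag", "priority"}


def build_run_payload(
    project_id: str,
    spider: str,
    tags: Optional[list] = None,
    priority: Optional[int] = None,
    spider_args: Optional[dict] = None,
) -> dict:
    """Assemble the form fields for a job-scheduling POST request."""
    payload = {"project": project_id, "spider": spider}
    if tags:
        payload["add_tag"] = list(tags)
    if priority is not None:
        payload["priority"] = priority
    # Assumption: extra fields are passed through as spider arguments.
    for name, value in (spider_args or {}).items():
        if name in RESERVED_FIELDS:
            raise ValueError(f"spider argument {name!r} clashes with an API field")
        payload[name] = value
    return payload


payload = build_run_payload(
    "123456", "quotes", tags=["production"], spider_args={"category": "love"}
)
print(payload)
```

The resulting dictionary can be passed as data= to the requests.post call shown above.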

Accessing Scraped Data

import requests

API_KEY = "YOUR_ZYTE_API_KEY"

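# List finished jobs
jobs_response = requests.get(
    "https://app.zyte.com/api/jobs/list.json",
    auth=(API_KEY, ""),
    params={"project": "123456", "state": "finished"},
    timeout=30,
)
jobs_response.raise_for_status()
jobs = jobs_response.json()["jobs"]
if not jobs:
    raise SystemExit("No finished jobs found")

# Download items from the first listed job
job_id = jobs[0]["id"]  # job ID in project/spider/job form
items_response = requests.get(
    f"https://storage.scrapinghub.com/items/{job_id}",
    auth=(API_KEY, ""),
    params={"format": "json"},
    timeout=30,
)
items_response.raise_for_status()
items = items_response.json()

for item in items[:5]:
    print(item)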
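
For large jobs, loading every item into memory at once may not be practical. A minimal sketch of an alternative, assuming the items endpoint also accepts format=jl (JSON Lines, one item per line), is to decode the response line by line. The helper name iter_jsonlines is hypothetical, and the sample rows are made-up stand-ins for a real response.

```python
import json
from typing import Iterable, Iterator


def iter_jsonlines(lines: Iterable) -> Iterator[dict]:
    """Yield one decoded item per non-empty JSON Lines row."""
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if line:  # skip blank keep-alive lines
            yield json.loads(line)


# Made-up rows standing in for a streamed response body.
sample = [
    b'{"text": "A quote", "author": "Someone", "tags": ["a"]}',
    b"",
    b'{"text": "Another quote", "author": "Someone Else", "tags": []}',
]
for item in iter_jsonlines(sample):
    print(item["author"])
```

With requests, the same generator can consume response.iter_lines() from a request made with stream=True, so items are processed as they arrive.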

Scrapy Cloud vs Self-Hosted

Feature       | Scrapy Cloud           | Self-Hosted (VPS)
Setup time    | Minutes                | Hours
Scheduling    | Built-in               | Cron setup needed
Monitoring    | Dashboard included     | DIY
Scaling       | Automatic              | Manual
Cost          | Free tier + paid plans | $5+/month
Customization | Limited                | Full control

Tips

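  • Configure concurrency and AutoThrottle in your project's settings.py or in the project settings in the Zyte dashboard; scrapinghub.yml covers deployment options such as the project ID and requirements file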
  • Store secrets as project-level environment variables in the Zyte dashboard
  • Use Collections or Datasets in Zyte for structured data storage
  • Enable AutoThrottle in your Scrapy settings for polite scraping
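
As a rough illustration of the throttling tips above, a settings.py excerpt might look like the following. The setting names are standard Scrapy settings, but the values are illustrative assumptions and should be tuned for the target site.

```python
# myproject/settings.py (excerpt); values are illustrative, not recommendations

# Let Scrapy adapt the request rate to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0  # initial delay between requests, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0  # upper bound on the delay when the site is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per remote site

# Hard cap on simultaneous requests to a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Respect robots.txt rules.
ROBOTSTXT_OBEY = True
```

AutoThrottle adjusts the delay between these bounds at runtime, while CONCURRENT_REQUESTS_PER_DOMAIN remains a fixed ceiling.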