Running Scrapy Spiders on Scrapy Cloud (Zyte)
Deploy and manage Scrapy spiders on Zyte's Scrapy Cloud platform for effortless scheduling, monitoring, and scaling.
Scrapy Cloud (now part of Zyte) is a managed platform purpose-built for deploying Scrapy spiders. It handles scheduling, monitoring, data storage, and scaling without any infrastructure setup.
Getting Started
# Install shub, the Scrapy Cloud command-line client
pip install shub
# Log in to your Zyte account
shub login
# Enter your API key from https://app.zyte.com/account/apikey
Create a Scrapy Project
scrapy startproject myproject
cd myproject
# myproject/spiders/quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow the pagination link, if there is one
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
Deploy to Scrapy Cloud
# Initialize the project for deployment
shub deploy
# This creates a scrapinghub.yml if it does not exist
# Enter your project ID when prompted (found in the Zyte dashboard)
Your scrapinghub.yml should look like:
# scrapinghub.yml
projects:
  default: 123456 # Your project ID
requirements:
  file: requirements.txt
Run Spiders from the CLI
# Run a spider
shub schedule quotes
# View the items scraped by a job (takes a job ID, not a spider name)
shub items 123456/1/1
# View logs
shub log 123456/1/1 # job ID format: project/spider/job
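The job identifier passed to these commands has three slash-separated parts, as the comment above notes. A minimal sketch of splitting one into its components, assuming all three parts are integers; `JobKey` and `parse_job_key` are hypothetical helper names, not part of shub:

```python
from typing import NamedTuple


class JobKey(NamedTuple):
    """Hypothetical container for the three parts of a job identifier."""
    project_id: int
    spider_id: int
    job_id: int


def parse_job_key(key: str) -> JobKey:
    """Split a 'project/spider/job' string such as '123456/1/1'."""
    parts = key.strip().split("/")
    # Reject anything that is not exactly three numeric parts
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected 'project/spider/job', got {key!r}")
    return JobKey(*(int(p) for p in parts))


print(parse_job_key("123456/1/1"))
```

Validating the identifier up front can catch a mistyped ID before it is handed to the CLI or the API.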
Schedule Periodic Runs
In the Zyte dashboard, navigate to your project and set up periodic jobs:
- Cron-based scheduling, run at specific times
- Periodic scheduling, run every N minutes/hours
- Priority levels, control which spiders run first
You can also schedule via the API:
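import requests

API_KEY = "YOUR_ZYTE_API_KEY"
PROJECT_ID = "123456"

# Schedule a run of the "quotes" spider and tag the job
response = requests.post(
    "https://app.zyte.com/api/schedule.json",
    auth=(API_KEY, ""),  # API key as the username, empty password
    data={
        "project": PROJECT_ID,
        "spider": "quotes",
        "add_tag": ["production"],
    },
)
response.raise_for_status()  # Fail loudly on HTTP errors
print(response.json())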
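Scheduling requests often need more than a spider name, for example several tags or per-run spider arguments. The sketch below only builds the form payload and does not contact the API. It assumes, without confirmation from this article, that the endpoint accepts an integer `priority` field and forwards any extra form fields to the spider as arguments; `build_schedule_payload` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_schedule_payload(project_id, spider, tags=(), priority=None, **spider_args):
    """Return a list of (field, value) pairs for the scheduling request.

    A list of pairs is used so that add_tag can repeat once per tag.
    """
    payload = [("project", str(project_id)), ("spider", spider)]
    payload += [("add_tag", tag) for tag in tags]
    if priority is not None:
        # Assumption: the API accepts an integer priority field
        payload.append(("priority", str(priority)))
    # Assumption: remaining fields are forwarded to the spider as arguments
    payload += [(name, str(value)) for name, value in sorted(spider_args.items())]
    return payload


payload = build_schedule_payload("123456", "quotes", tags=["production"], category="humor")
print(urlencode(payload))
```

The resulting list can be passed as `data=payload` to `requests.post`, which accepts a list of tuples for form-encoded bodies.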
Accessing Scraped Data
import requests

API_KEY = "YOUR_ZYTE_API_KEY"

# List finished jobs
jobs = requests.get(
    "https://app.zyte.com/api/jobs/list.json",
    auth=(API_KEY, ""),
    params={"project": "123456", "state": "finished"},
).json()

# Download items from the first listed job
# (each job's identifier is reported under "id", in project/spider/job form)
job_id = jobs["jobs"][0]["id"]
items = requests.get(
    f"https://storage.scrapinghub.com/items/{job_id}",
    auth=(API_KEY, ""),
    params={"format": "json"},
).json()

# Print the first five items
for item in items[:5]:
    print(item)
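For large jobs, loading every item into memory with a single `.json()` call may be wasteful. A sketch of parsing items line by line, assuming the items endpoint can return JSON Lines (one JSON object per line) when `format` is set to `jl`; `iter_jsonlines` is a hypothetical helper and the sample data is made up for illustration:

```python
import json
from typing import Iterable, Iterator


def iter_jsonlines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:  # Skip blank lines between records
            yield json.loads(line)


# Made-up sample standing in for a streamed response body
sample = [
    '{"text": "A quote", "author": "Someone", "tags": ["life"]}',
    "",
    '{"text": "Another quote", "author": "Someone Else", "tags": []}',
]
for item in iter_jsonlines(sample):
    print(item["author"], len(item["tags"]))
```

With `requests`, a call made with `stream=True` followed by `response.iter_lines()` should yield lines suitable for this helper, so items can be processed as they arrive.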
Scrapy Cloud vs Self-Hosted
| Feature | Scrapy Cloud | Self-Hosted (VPS) |
|---|---|---|
| Setup time | Minutes | Hours |
| Scheduling | Built-in | Cron setup needed |
| Monitoring | Dashboard included | DIY |
| Scaling | Automatic | Manual |
| Cost | Free tier + paid plans | $5+/month |
| Customization | Limited | Full control |
Tips
- Use scrapinghub.yml settings to configure concurrency and autothrottle
- Store secrets as project-level environment variables in the Zyte dashboard
- Use Collections or Datasets in Zyte for structured data storage
- Enable AutoThrottle in your Scrapy settings for polite scraping
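
The AutoThrottle tip corresponds to a handful of Scrapy settings. A sketch of how they might look in the project's settings.py; the setting names are Scrapy's own, but the values shown are illustrative assumptions that should be tuned for each target site:

```python
# myproject/settings.py (excerpt)

# Adjust the download delay dynamically based on server response times
AUTOTHROTTLE_ENABLED = True
# Initial delay, in seconds, before any latency has been measured
AUTOTHROTTLE_START_DELAY = 5.0
# Upper bound on the delay, in seconds, when the server is slow
AUTOTHROTTLE_MAX_DELAY = 60.0
# Average number of parallel requests to aim for per remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Hard cap on simultaneous requests to a single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 8
```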