Scraping with Scrapy Framework - Getting Started
Get started with Scrapy, a widely used Python web scraping framework. Install Scrapy, create a project, and run your first spider.
Scrapy is a full-featured web scraping framework for Python. It provides request scheduling, response parsing, data export, and retry handling out of the box, which makes it a common choice for large-scale scraping projects.
Installation
pip install scrapy
Creating a Scrapy Project
scrapy startproject quotescraper
cd quotescraper
This generates the following structure:
quotescraper/
    scrapy.cfg
    quotescraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Your First Spider
Create quotescraper/spiders/quotes_spider.py:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
        # Follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
Running Your Spider
# Output to terminal
scrapy crawl quotes
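# Save to a JSON file (-o appends to an existing file, which produces invalid JSON on a second run; use -O to overwrite instead)
scrapy crawl quotes -o quotes.json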
# Save to CSV
scrapy crawl quotes -o quotes.csv
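One caveat with -o is that it appends to an existing file, and two JSON arrays written back to back are not a valid JSON document. Recent Scrapy versions offer -O to overwrite instead, and the JSON Lines format (for example `-o quotes.jsonl`) stays readable after appending because every line is an independent object. The sketch below illustrates the difference with the standard library only; the run_* strings are made-up stand-ins for two export runs, and load_json_lines is a hypothetical helper.

```python
import json


def load_json_lines(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Made-up output of two separate runs exporting to the same .json file.
run_1_json = '[\n{"author": "A"}\n]'
run_2_json = '[\n{"author": "B"}\n]'

# Two arrays in a row cannot be parsed as a single JSON document.
try:
    json.loads(run_1_json + run_2_json)
    appended_json_is_valid = True
except json.JSONDecodeError:
    appended_json_is_valid = False

# The same two runs in JSON Lines form can simply be concatenated.
run_1_jsonl = '{"author": "A"}\n'
run_2_jsonl = '{"author": "B"}\n'
items = load_json_lines(run_1_jsonl + run_2_jsonl)
```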
Key Advantages of Scrapy
| Feature | Benefit |
|---|---|
| Built-in concurrency | Scrapes multiple pages simultaneously |
| Automatic retries | Handles failed requests without extra code |
| Export feeds | JSON, CSV, XML output built in |
| Middleware system | Customize request/response processing |
| Respect for robots.txt | Enabled in the settings.py generated by scrapy startproject (ROBOTSTXT_OBEY = True) |
Scrapy vs Requests + BeautifulSoup
Use Requests + BeautifulSoup for quick, simple scraping tasks. Use Scrapy when you need to scrape many pages, handle complex crawling logic, or build production-grade scrapers.
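For comparison, here is a rough sketch of the same quotes crawl written with Requests and BeautifulSoup. The function names parse_page and crawl are hypothetical, and the one-second delay and ten-second timeout are arbitrary choices. Note what has to be written by hand: the pagination loop, the politeness delay, and URL joining. The sketch also fetches pages one at a time and does not retry failed requests.

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def parse_page(html, base_url):
    """Extract quotes and the absolute URL of the next page (or None)."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = [
        {
            "text": quote.select_one("span.text").get_text(),
            "author": quote.select_one("small.author").get_text(),
            "tags": [tag.get_text() for tag in quote.select("div.tags a.tag")],
        }
        for quote in soup.select("div.quote")
    ]
    next_link = soup.select_one("li.next a")
    next_url = urljoin(base_url, next_link["href"]) if next_link else None
    return quotes, next_url


def crawl(start_url, delay=1.0):
    """Yield quotes page by page, sequentially, until no next link is found."""
    url = start_url
    with requests.Session() as session:
        while url:
            response = session.get(url, timeout=10)
            response.raise_for_status()
            quotes, url = parse_page(response.text, url)
            yield from quotes
            if url:
                time.sleep(delay)  # manual politeness delay between pages
```

Calling `list(crawl("https://quotes.toscrape.com/"))` would collect every quote. For a handful of pages this is perfectly adequate; the trade-off shifts toward Scrapy once concurrency, retries, and export formats start to matter.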
Configuring Scrapy Settings
Edit quotescraper/settings.py to adjust behavior:
# Be polite - wait at least 1 second between requests to the same site
DOWNLOAD_DELAY = 1
# Limit concurrent requests (Scrapy's default is 16)
CONCURRENT_REQUESTS = 8
# Identify your scraper
USER_AGENT = "ScrapingCentral Tutorial Bot (+https://scrapingcentral.com)"
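A fixed delay is one option. Scrapy also ships an AutoThrottle extension that adjusts the delay according to how quickly the server responds. The fragment below is a sketch of further lines that could go into quotescraper/settings.py; the setting names are real Scrapy settings, but the specific values are illustrative assumptions rather than recommendations from this tutorial.

```python
# Possible additions to quotescraper/settings.py (values are illustrative).

# Let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1          # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10           # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site

# Keep honoring robots.txt (already set by the startproject template).
ROBOTSTXT_OBEY = True

# How many times a failed request is retried after the first attempt.
RETRY_TIMES = 2
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as the minimum delay, so the two settings can be used together.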
Next Steps
- Learn about Scrapy Items for structured data
- Explore Scrapy middleware and pipelines for data processing
- Integrate proxy services like ScraperAPI for large-scale Scrapy projects