Scraping with the Scrapy Framework - Getting Started

Get started with Scrapy, a widely used open-source Python web scraping framework. Install Scrapy, create a project, and run your first spider.

Scrapy is a full-featured web scraping framework for Python. It schedules and sends requests, parses responses, exports the scraped data, and retries failed requests out of the box, which makes it a common choice for large-scale scraping projects.

Installation

pip install scrapy
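
To confirm the install worked, one option is to ask Python's package metadata which version is present. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for illustration and is not part of Scrapy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    scrapy_version = installed_version("scrapy")
    if scrapy_version is None:
        print("Scrapy is not installed in this environment")
    else:
        print(f"Scrapy {scrapy_version} is installed")
```

Running this inside the same virtual environment you used for pip avoids checking a different Python installation by mistake.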

Creating a Scrapy Project

scrapy startproject quotescraper
cd quotescraper

This generates the following structure:

quotescraper/
    scrapy.cfg
    quotescraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

Your First Spider

Create quotescraper/spiders/quotes_spider.py:

import scrapy


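class QuotesSpider(scrapy.Spider):
    # Name used on the command line: scrapy crawl quotes
    name = "quotes"
    # The crawl starts here; each response is passed to parse()
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow pagination: the "Next" link is relative, and
        # response.follow() resolves it against the current URL
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)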

Running Your Spider

# Output to terminal
scrapy crawl quotes

# Save to a JSON file
# (-O overwrites the file; -o appends, which leaves a .json file invalid after a second run)
scrapy crawl quotes -O quotes.json

# Save to CSV
scrapy crawl quotes -O quotes.csv
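
Once the crawl has written a JSON file, the data can be loaded with the standard library. The sketch below counts quotes per author. It uses a small made-up sample in the same shape the spider yields (text, author, tags) so that it runs without a crawl; the helper name quotes_per_author is hypothetical.

```python
import json
from collections import Counter


def quotes_per_author(items):
    """Count how many exported items each author has."""
    return Counter(item["author"] for item in items)


# Made-up sample in the same shape the spider yields (text, author, tags).
sample_export = """
[
  {"text": "First sample quote.", "author": "Jane Doe", "tags": ["sample"]},
  {"text": "Second sample quote.", "author": "Jane Doe", "tags": []},
  {"text": "Third sample quote.", "author": "John Roe", "tags": ["sample", "demo"]}
]
"""

items = json.loads(sample_export)
# For a real crawl, load the exported file instead:
#     with open("quotes.json", encoding="utf-8") as f:
#         items = json.load(f)

counts = quotes_per_author(items)
for author, count in counts.most_common():
    print(f"{author}: {count}")
```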

Key Advantages of Scrapy

Feature	Benefit
Built-in concurrency	Scrapes multiple pages simultaneously
Automatic retries	Handles failed requests without extra code
Feed exports	JSON, JSON Lines, CSV, and XML output built in
Middleware system	Customize request/response processing
Respect for robots.txt	Enabled in new projects (the settings.py generated by scrapy startproject sets ROBOTSTXT_OBEY = True)
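
Most of the features in the table map to settings in the project's settings.py. The fragment below is a sketch of that mapping, with illustrative values rather than recommendations. The output filename quotes.json is an assumption carried over from the earlier commands.

```python
# settings.py fragment (illustrative values, not recommendations)

# Built-in concurrency: maximum number of requests Scrapy performs in parallel.
CONCURRENT_REQUESTS = 16

# Automatic retries: retry a failed request up to this many extra times.
RETRY_ENABLED = True
RETRY_TIMES = 2

# Feed export: write every scraped item to a JSON file, replacing old output.
FEEDS = {
    "quotes.json": {"format": "json", "overwrite": True},
}

# Respect for robots.txt: skip URLs that the site's robots.txt disallows.
ROBOTSTXT_OBEY = True
```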

Scrapy vs Requests + BeautifulSoup

Use Requests + BeautifulSoup for quick, simple scraping tasks. Use Scrapy when you need to scrape many pages, handle complex crawling logic, or build production-grade scrapers.

Configuring Scrapy Settings

Edit quotescraper/settings.py to adjust behavior:

# Be polite - wait this many seconds between requests to the same site
DOWNLOAD_DELAY = 1

# Limit concurrent requests (Scrapy's default is 16)
CONCURRENT_REQUESTS = 8

# Identify your scraper
USER_AGENT = "ScrapingCentral Tutorial Bot (+https://scrapingcentral.com)"
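
To get a feel for what DOWNLOAD_DELAY costs, note that Scrapy by default randomizes the wait to between 0.5 and 1.5 times the configured value (the RANDOMIZE_DOWNLOAD_DELAY setting). The sketch below estimates the total time spent waiting, assuming all pages come from a single site and ignoring the time the downloads themselves take. The function names and the 10-page figure are made up for illustration.

```python
def delay_range(download_delay, randomize=True):
    """Return the (min, max) wait in seconds between two requests to one site."""
    if randomize:
        # Mirrors Scrapy's default RANDOMIZE_DOWNLOAD_DELAY behavior.
        return (0.5 * download_delay, 1.5 * download_delay)
    return (float(download_delay), float(download_delay))


def total_wait_bounds(pages, download_delay, randomize=True):
    """Rough lower and upper bound on total waiting time, ignoring download time."""
    low, high = delay_range(download_delay, randomize)
    gaps = max(pages - 1, 0)  # there is no wait before the first request
    return (gaps * low, gaps * high)


# Hypothetical crawl: 10 pages from one site with DOWNLOAD_DELAY = 1
low, high = total_wait_bounds(pages=10, download_delay=1)
print(f"Waiting between {low:.1f} and {high:.1f} seconds in total")
```

Because the delay applies per site, lowering CONCURRENT_REQUESTS has little extra effect on a single-site crawl once a delay is set.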

Next Steps

Learn about Scrapy Items for structured data
Explore Scrapy middleware and pipelines for data processing
Integrate proxy services like ScraperAPI for large-scale Scrapy projects