
Running Scrapers on Apify Platform

Deploy and run web scrapers on the Apify platform with built-in proxy management, scheduling, storage, and monitoring.

Deployment · #12 · beginner · 3 min read

Apify is a cloud platform built specifically for web scraping. It provides managed infrastructure, built-in proxy rotation, data storage, scheduling, and a marketplace of pre-built scrapers called Actors.

Why Apify?

  • Zero infrastructure management
  • Built-in proxy pool (datacenter and residential)
  • Automatic scheduling and monitoring
  • Dataset storage with API access
  • Pre-built scrapers for popular websites
  • Free tier with $5/month of compute

Getting Started

# Install the Apify CLI
npm install -g apify-cli

# Login
apify login

# Create a new Actor from a Python template
apify create my-scraper --template python-beautifulsoup
cd my-scraper

Building an Apify Actor

# src/main.py
from apify import Actor
import requests
from bs4 import BeautifulSoup

async def main():
    async with Actor:
        # Get input from the Apify platform
        actor_input = await Actor.get_input() or {}
        urls = actor_input.get("urls", ["https://news.ycombinator.com"])

        # Open a dataset to store results
        dataset = await Actor.open_dataset()

        for url in urls:
            Actor.log.info(f"Scraping {url}")

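            # requests is blocking, which is acceptable for a short, sequential URL list
            response = requests.get(url, timeout=30, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0"
            })
            # Fail fast on 4xx/5xx responses instead of parsing an error page
            response.raise_for_status()
            soup = BeautifulSoup(response.text, "html.parser")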

            items = []
            for link in soup.select(".titleline > a"):
                item = {
                    "title": link.text,
                    "url": link.get("href", ""),
                    "source": url,
                }
                items.append(item)

            # Push results to the dataset
            await dataset.push_data(items)
            Actor.log.info(f"Scraped {len(items)} items from {url}")

        Actor.log.info("Scraping complete!")
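
Some pages use relative links (for example, "item?id=1"), so the "url" field stored by the Actor above may not always be a complete address. As a minimal sketch, assuming you want absolute URLs in the dataset, a small helper could resolve each href against the page it came from. The name normalize_item is hypothetical and not part of the Apify SDK.

```python
from urllib.parse import urljoin


def normalize_item(title: str, href: str, source: str) -> dict:
    """Build one dataset item, resolving a relative href against the page URL."""
    return {
        "title": title.strip(),
        # urljoin leaves absolute URLs untouched and resolves relative ones
        "url": urljoin(source, href) if href else "",
        "source": source,
    }


print(normalize_item("  Ask HN: Example  ", "item?id=1", "https://news.ycombinator.com"))
```

Inside the loop, the item dictionary could then be built with normalize_item(link.text, link.get("href", ""), url).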

Actor Input Schema

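Define the inputs your Actor accepts in .actor/input_schema.json. The example Actor above reads only urls; maxPages is declared here but still has to be handled in your code: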

{
    "title": "My Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "urls": {
            "title": "URLs to scrape",
            "type": "array",
            "description": "List of URLs to scrape",
            "editor": "stringList",
            "default": ["https://news.ycombinator.com"]
        },
        "maxPages": {
            "title": "Max pages",
            "type": "integer",
            "description": "Maximum number of pages to scrape",
            "default": 10
        }
    },
    "required": ["urls"]
}
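
When the Actor runs outside the platform UI (for example, in a local test), the input may be missing or malformed, so it can help to apply the schema's defaults and basic type checks in code as well. The following is a minimal sketch under that assumption; normalize_input, DEFAULT_URLS, and DEFAULT_MAX_PAGES are hypothetical names that mirror the schema above and are not part of the Apify SDK.

```python
from typing import Optional

# Defaults copied from the input schema above
DEFAULT_URLS = ["https://news.ycombinator.com"]
DEFAULT_MAX_PAGES = 10


def normalize_input(actor_input: Optional[dict]) -> dict:
    """Apply schema defaults and reject values of the wrong type."""
    data = actor_input or {}

    urls = data.get("urls") or DEFAULT_URLS
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError("'urls' must be a list of strings")

    max_pages = data.get("maxPages", DEFAULT_MAX_PAGES)
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(max_pages, bool) or not isinstance(max_pages, int) or max_pages < 1:
        raise ValueError("'maxPages' must be a positive integer")

    return {"urls": urls, "maxPages": max_pages}


print(normalize_input({"urls": ["https://example.com"]}))
```

In the Actor, this could wrap the result of Actor.get_input() before the scraping loop starts.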

Deploy and Run

# Test locally
apify run

# Deploy to the Apify platform
apify push

# Run from the CLI
apify call my-scraper -i '{"urls": ["https://example.com"]}'

Using Apify's Proxy

Apify provides built-in proxy management:

from apify import Actor
import requests

async def main():
    async with Actor:
        # Get proxy configuration
        proxy_config = await Actor.create_proxy_configuration(
            groups=["RESIDENTIAL"],
            country_code="US",
        )

        proxy_url = await proxy_config.new_url()

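        # Route both HTTP and HTTPS traffic through the Apify proxy
        response = requests.get(
            "https://example.com",
            proxies={"http": proxy_url, "https": proxy_url},
            timeout=30,
        )
        Actor.log.info(f"Fetched with status {response.status_code}")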
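
A proxy URL typically embeds credentials in the form scheme://username:password@host:port, so writing it to the run log verbatim can leak the password. As a minimal sketch, assuming the URL follows that shape, a helper can mask the password before logging. The function name redact_proxy_url and the sample URL are hypothetical.

```python
from urllib.parse import urlsplit


def redact_proxy_url(proxy_url: str) -> str:
    """Return the proxy URL with its password replaced by '***'."""
    parts = urlsplit(proxy_url)
    if parts.password is None:
        # No credentials embedded, nothing to hide
        return proxy_url
    host = parts.hostname or ""
    port = f":{parts.port}" if parts.port else ""
    return f"{parts.scheme}://{parts.username}:***@{host}{port}"


# Hypothetical URL for illustration only
print(redact_proxy_url("http://groups-RESIDENTIAL:secret@proxy.example:8000"))
```

The redacted string is safe to pass to Actor.log.info, while the original proxy_url is still the one handed to requests.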

Scheduling Runs

Set up automatic runs in the Apify console or via API:

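import json

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Create a schedule that runs the Actor every day at 08:00
schedule = client.schedules().create(
    name="daily-scrape",
    cron_expression="0 8 * * *",
    is_enabled=True,
    is_exclusive=True,  # do not start a run while the previous one is still running
    actions=[{
        "type": "RUN_ACTOR",
        "actorId": "your-actor-id",
        # The run input is passed as a serialized body plus its content type
        "runInput": {
            "body": json.dumps({"urls": ["https://example.com"]}),
            "contentType": "application/json; charset=utf-8",
        },
    }],
)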
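
The cron expression "0 8 * * *" uses the standard five-field order: minute, hour, day of month, month, day of week. It therefore fires every day at 08:00 in the schedule's time zone. As a small illustrative sketch (split_cron and CRON_FIELDS are hypothetical names, and this does not validate the field values themselves), the fields can be labeled like this:

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression: str) -> dict:
    """Label the five whitespace-separated fields of a cron expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


print(split_cron("0 8 * * *"))
# {'minute': '0', 'hour': '8', 'day_of_month': '*', 'month': '*', 'day_of_week': '*'}
```

Changing the second field to "*/6", for example, would describe a run every six hours instead of once a day.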

Accessing Results

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Get the last run's dataset; last_run() returns a client, .get() fetches the run details
run = client.actor("your-actor-id").last_run().get()
dataset = client.dataset(run["defaultDatasetId"])

# Download all items
for item in dataset.iterate_items():
    print(item)
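
Once the items are downloaded, they are plain dictionaries and can be saved in whatever format you need. The platform can also export datasets in several formats, but as a minimal local sketch, the standard csv module can serialize them. The helper name items_to_csv and the sample rows are hypothetical; the field names match the items produced by the Actor above.

```python
import csv
import io


def items_to_csv(items, fieldnames):
    """Serialize an iterable of item dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for item in items:
        # Keep only the requested columns; missing keys become empty cells
        writer.writerow({name: item.get(name, "") for name in fieldnames})
    return buffer.getvalue()


sample = [
    {"title": "First post", "url": "https://example.com/1", "source": "https://example.com"},
    {"title": "Second post", "url": "https://example.com/2"},
]
print(items_to_csv(sample, ["title", "url", "source"]))
```

With a real dataset, the first argument could be dataset.iterate_items() instead of the sample list.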

Apify vs Self-Hosted

Feature              Apify             Self-Hosted
Setup time           5 minutes         Hours
Proxies              Built-in          BYO
Storage              Included          BYO
Monitoring           Dashboard         DIY
Cost (light usage)   Free tier         $5+/month
Customization        Actor framework   Full control

Apify is an excellent choice when you want to get scraping quickly without managing infrastructure.