Running Scrapers on Apify Platform
Deploy and run web scrapers on the Apify platform with built-in proxy management, scheduling, storage, and monitoring.
Apify is a cloud platform built specifically for web scraping. It provides managed infrastructure, built-in proxy rotation, data storage, scheduling, and a marketplace of pre-built scrapers called Actors.
Why Apify?
- Zero infrastructure management
- Built-in proxy pool (datacenter and residential)
- Automatic scheduling and monitoring
- Dataset storage with API access
- Pre-built scrapers for popular websites
- Free tier with $5 of platform credits per month
Getting Started
# Install the Apify CLI
npm install -g apify-cli
# Log in
apify login
# Create a new Actor from a Python template
apify create my-scraper --template python-beautifulsoup
cd my-scraper
Building an Apify Actor
# src/main.py
from apify import Actor
import requests
from bs4 import BeautifulSoup


async def main():
    async with Actor:
        # Get input from the Apify platform
        actor_input = await Actor.get_input() or {}
        urls = actor_input.get("urls", ["https://news.ycombinator.com"])

        # Open a dataset to store results
        dataset = await Actor.open_dataset()

        for url in urls:
            Actor.log.info(f"Scraping {url}")
            # requests is blocking, which is acceptable for a short sequential loop
            response = requests.get(url, timeout=30, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0"
            })
            # Skip error responses instead of parsing an error page
            if not response.ok:
                Actor.log.warning(f"Skipping {url}: HTTP {response.status_code}")
                continue

            soup = BeautifulSoup(response.text, "html.parser")
            items = []
            # ".titleline > a" matches story links on Hacker News
            for link in soup.select(".titleline > a"):
                item = {
                    "title": link.get_text(strip=True),
                    "url": link.get("href", ""),
                    "source": url,
                }
                items.append(item)

            # Push results to the dataset
            await dataset.push_data(items)
            Actor.log.info(f"Scraped {len(items)} items from {url}")

        Actor.log.info("Scraping complete!")
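The Actor above stores each link's href exactly as it appears in the page, and on many sites (Hacker News included) some of those are relative, such as item?id=1. A minimal sketch of resolving them against the page URL before pushing to the dataset follows. The helper name absolutize is hypothetical and not part of the Apify SDK; it relies only on the standard library.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def absolutize(page_url: str, href: str) -> Optional[str]:
    """Resolve href against page_url; return None for empty or non-HTTP links.

    Hypothetical helper for illustration, not part of the Apify SDK.
    """
    if not href or not href.strip():
        return None
    absolute = urljoin(page_url, href.strip())
    # Drop schemes such as javascript: or mailto: that cannot be scraped
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


# Relative links are resolved against the page they were found on
print(absolutize("https://news.ycombinator.com", "item?id=1"))
```

In the Actor, the "url" field could then be set to absolutize(url, link.get("href", "")), skipping items for which it returns None.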
Actor Input Schema
Define what inputs your Actor accepts:
{
    "title": "My Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "urls": {
            "title": "URLs to scrape",
            "type": "array",
            "description": "List of URLs to scrape",
            "editor": "stringList",
            "default": ["https://news.ycombinator.com"]
        },
        "maxPages": {
            "title": "Max pages",
            "type": "integer",
            "description": "Maximum number of pages to scrape",
            "default": 10
        }
    },
    "required": ["urls"]
}
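The schema declares maxPages, but the Actor code shown earlier never reads it. It can also be worth applying the same defaults in code rather than relying on the schema alone. The sketch below shows one way to do that. The function name normalize_input is hypothetical, and treating maxPages as a cap on the number of URLs processed is an assumption made for this example, not something the schema itself defines.

```python
from typing import Any, Dict, Optional

# Defaults mirrored from the input schema
DEFAULT_URLS = ["https://news.ycombinator.com"]
DEFAULT_MAX_PAGES = 10


def normalize_input(actor_input: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply schema defaults and basic type checks to raw Actor input.

    Hypothetical helper; assumes maxPages caps how many URLs are scraped.
    """
    data = actor_input or {}

    urls = data.get("urls") or DEFAULT_URLS
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError("'urls' must be a list of strings")

    max_pages = data.get("maxPages", DEFAULT_MAX_PAGES)
    # bool is a subclass of int, so reject it explicitly
    if isinstance(max_pages, bool) or not isinstance(max_pages, int) or max_pages < 1:
        raise ValueError("'maxPages' must be a positive integer")

    # Strip whitespace, drop blanks and duplicates while preserving order
    cleaned = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    return {"urls": cleaned[:max_pages], "maxPages": max_pages}


print(normalize_input({"urls": [" https://example.com "], "maxPages": 5}))
```

Inside the Actor, this would be called on the result of Actor.get_input() before the scraping loop starts.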
Deploy and Run
# Test locally
apify run
# Deploy to the Apify platform
apify push
# Run from the CLI
apify call my-scraper -i '{"urls": ["https://example.com"]}'
Using Apify's Proxy
Apify provides built-in proxy management:
from apify import Actor
import requests


async def main():
    async with Actor:
        # Get proxy configuration
        proxy_config = await Actor.create_proxy_configuration(
            groups=["RESIDENTIAL"],
            country_code="US",
        )
        # Each call returns a proxy URL with credentials embedded
        proxy_url = await proxy_config.new_url()

        response = requests.get(
            "https://example.com",
            proxies={"http": proxy_url, "https": proxy_url},
            timeout=30,
        )
        Actor.log.info(f"Fetched via proxy, status {response.status_code}")
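The proxy URL carries credentials, so it is usually unwise to write it to logs verbatim. The sketch below shows two small helpers: one builds the proxies mapping that requests expects (falling back to a direct connection when no proxy URL is available), and one masks the password for logging. Both function names are hypothetical, and the example URL uses a placeholder host rather than a real Apify endpoint.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def build_proxies(proxy_url: Optional[str]) -> Optional[Dict[str, str]]:
    """Return a requests-style proxies mapping, or None to connect directly."""
    if not proxy_url:
        return None
    return {"http": proxy_url, "https": proxy_url}


def redact_proxy_url(proxy_url: str) -> str:
    """Mask the password part of a proxy URL so it is safe to log."""
    parts = urlsplit(proxy_url)
    if parts.password is None:
        return proxy_url
    host = parts.hostname or ""
    if parts.port is not None:
        host += f":{parts.port}"
    user = parts.username or ""
    return f"{parts.scheme}://{user}:***@{host}"


# Placeholder URL for illustration only
example = "http://groups-RESIDENTIAL:secret@proxy.example.com:8000"
print(redact_proxy_url(example))
print(build_proxies(example))
```

The mapping returned by build_proxies can be passed straight to the proxies argument of requests.get, while redact_proxy_url is meant for log messages only.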
Scheduling Runs
Set up automatic runs in the Apify console or via API:
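import json

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Create a schedule that runs the Actor every day at 08:00
schedule = client.schedules().create(
    name="daily-scrape",
    cron_expression="0 8 * * *",
    is_enabled=True,     # required by the client
    is_exclusive=False,  # required by the client
    actions=[{
        "type": "RUN_ACTOR",
        "actorId": "your-actor-id",
        # The run input is passed as a serialized body plus its content type
        "runInput": {
            "body": json.dumps({"urls": ["https://example.com"]}),
            "contentType": "application/json; charset=utf-8",
        },
    }],
)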
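The cron expression "0 8 * * *" has five fields: minute, hour, day of month, month, and day of week, so it reads as "at 08:00 every day". The sketch below splits an expression into named fields and computes the next run for the simple daily form only. It is a hand-rolled illustration, not how Apify evaluates schedules, and it assumes the schedule is interpreted in the timezone of the datetime passed in.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression: str) -> Dict[str, str]:
    """Split a five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def next_daily_run(expression: str, now: datetime) -> datetime:
    """Next run time for expressions of the form 'M H * * *' only."""
    fields = parse_cron(expression)
    if any(fields[name] != "*" for name in ("day_of_month", "month", "day_of_week")):
        raise ValueError("this sketch only supports daily 'M H * * *' expressions")
    minute, hour = int(fields["minute"]), int(fields["hour"])
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # If today's slot has already passed, move to tomorrow
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


print(parse_cron("0 8 * * *"))
print(next_daily_run("0 8 * * *", datetime.now(timezone.utc)))
```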
Accessing Results
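from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# last_run() returns a client object; .get() fetches the run record itself
run = client.actor("your-actor-id").last_run().get()
if run is None:
    raise SystemExit("No runs found for this Actor")

# Open the default dataset of that run
dataset = client.dataset(run["defaultDatasetId"])

# Iterate over all items
for item in dataset.iterate_items():
    print(item)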
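Once the items are available as dictionaries, they can be saved in whatever format downstream tools need. As a sketch, the helper below serializes any iterable of items to CSV using only the standard library. The function name items_to_csv is hypothetical, and the default column names assume the title, url, and source fields produced by the Actor shown earlier.

```python
import csv
import io
from typing import Any, Dict, Iterable, Sequence


def items_to_csv(
    items: Iterable[Dict[str, Any]],
    fieldnames: Sequence[str] = ("title", "url", "source"),
) -> str:
    """Serialize dataset items to CSV text, ignoring fields not listed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames), lineterminator="\n")
    writer.writeheader()
    for item in items:
        # Missing fields become empty cells; extra fields are dropped
        writer.writerow({name: item.get(name, "") for name in fieldnames})
    return buffer.getvalue()


sample = [{"title": "Example", "url": "https://example.com", "source": "demo"}]
print(items_to_csv(sample))
```

Because the function accepts any iterable, the result of dataset.iterate_items() could be passed to it directly.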
Apify vs Self-Hosted
| Feature | Apify | Self-Hosted |
|---|---|---|
| Setup time | 5 minutes | Hours |
| Proxies | Built-in | BYO |
| Storage | Included | BYO |
| Monitoring | Dashboard | DIY |
| Cost (light usage) | Free tier | $5+/month |
| Customization | Actor framework | Full control |
Apify is an excellent choice when you want to get scraping quickly without managing infrastructure.