Data Extraction
Techniques for extracting structured data from HTML, JSON, and APIs
#1
Getting Started with Web Scraping in Python
Learn the basics of web scraping with Python using the Requests library and BeautifulSoup. Your first scraper in 10 minutes.
#2
CSS Selectors for Web Scraping
Master CSS selectors to extract exactly the data you need. Classes, IDs, attributes, and advanced selector patterns.
#3
Handling Pagination in Web Scraping
Learn how to scrape paginated websites by following next-page links, handling page numbers, and collecting data across multiple pages.
#4
Scraping with Scrapy Framework - Getting Started
Get started with Scrapy, a full-featured Python web scraping framework. Install Scrapy, create a project, and run your first spider.
#5
Scrapy Spiders and Items
Define structured data with Scrapy Items and build advanced spiders with CrawlSpider, SitemapSpider, and custom parsing logic.
#6
Scrapy Middleware and Pipelines
Customize Scrapy's request/response flow with middleware and process scraped data using item pipelines for validation, cleaning, and storage.
#7
Async Scraping with HTTPX and asyncio
Speed up your scrapers with async Python. Use HTTPX and asyncio to make concurrent HTTP requests and scrape pages in parallel.
#8
Scraping with aiohttp
Use aiohttp for high-performance async web scraping in Python. Learn session management, connection pooling, and concurrent page fetching.
#9
Storing Scraped Data in CSV and JSON
Save your scraped data to CSV and JSON files using Python's built-in modules. Learn best practices for data export, encoding, and file organization.
#10
Storing Scraped Data in Databases (SQLite, PostgreSQL)
Store scraped data in SQLite and PostgreSQL databases. Learn schema design, upserts, and best practices for persistent scraping data storage.
#11
Error Handling and Retries in Scrapers
Build robust scrapers with proper error handling, automatic retries, exponential backoff, and graceful failure recovery.
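As a rough illustration of the retry-with-exponential-backoff idea named in this entry, here is a minimal standard-library sketch. The helper name `retry_with_backoff` and its parameters are hypothetical and not taken from the tutorial itself; the `sleep` argument exists only so the waiting can be swapped out in tests.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then retry.

    Hypothetical helper for illustration. A real scraper would usually
    catch narrower exception types than Exception.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles after every failed attempt: 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** attempt)
```

Wrapping a fetch call looks like `retry_with_backoff(lambda: fetch(url))`, where `fetch` stands for whatever request function the scraper already uses.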
#12
Scraping Behind Login/Authentication
Scrape websites that require login. Handle form-based authentication, session tokens, and authenticated API requests with Python.
#13
Handling Cookies and Sessions
Master cookie management and persistent sessions in Python web scraping. Handle session cookies, cookie jars, and cross-request state.
#14
Scraping Dynamic Content Without a Browser
Extract data from JavaScript-heavy websites without using a browser. Discover hidden APIs, intercept XHR requests, and parse JSON responses.
#15
Using ScraperAPI with Python
Integrate ScraperAPI into your Python scrapers for automatic proxy rotation, CAPTCHA solving, and JavaScript rendering. Complete guide with code examples.
#16
Using ScrapingAnt with Python
Integrate ScrapingAnt into your Python scrapers for headless browser rendering, proxy rotation, and anti-bot bypass. Complete tutorial with examples.
#17
Web Scraping with lxml and XPath
Use lxml and XPath expressions for fast, powerful HTML parsing. Learn XPath syntax, axes, and functions for precise data extraction.
#18
Extracting Data from HTML Tables
Scrape HTML tables from websites using BeautifulSoup and pandas. Handle complex tables with rowspan, colspan, and nested elements.
#19
Scraping Images and Files
Download images, PDFs, and other files while web scraping. Learn URL resolution, streaming downloads, and file organization best practices.
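The URL resolution step mentioned in this entry can be sketched with the standard library's `urllib.parse.urljoin`. The function name `resolve_links` and the example.com URLs are placeholders chosen for illustration only.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Turn relative or protocol-relative links into absolute URLs.

    page_url is the address of the page the links were found on.
    """
    return [urljoin(page_url, href) for href in hrefs]


links = resolve_links(
    "https://example.com/gallery/page.html",
    ["../img/a.png", "//cdn.example.com/x.pdf", "b.jpg"],
)
```

Resolving against the page URL, rather than the site root, is what makes `../` and bare filenames land in the right directory.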
#20
Building a Price Monitoring Scraper
Build a complete price monitoring scraper that tracks product prices over time, detects price drops, and sends alerts. A real-world scraping project.
#21
Scraping Multiple Pages Concurrently
Speed up scraping with concurrent requests using threading, multiprocessing, and asyncio. Learn to balance speed with politeness.
#22
Scraping with Python and Regex
Use Python regular expressions to extract emails, phone numbers, prices, URLs, and other patterns from scraped web pages.
#23
Handling Different Encodings (UTF-8, ISO-8859)
Handle character encoding issues in web scraping. Detect, convert, and fix UTF-8, ISO-8859, and other encodings to avoid garbled text.
#24
Scraping XML and RSS Feeds
Parse XML documents and RSS/Atom feeds with Python. Extract structured data from feeds using feedparser, lxml, and the xml.etree module.
#25
Building a News Aggregator Scraper
Build a complete news aggregator that collects articles from multiple sources using RSS feeds and web scraping. Deduplicate, categorize, and store results.
#26
Scraping with Zyte API
Use Zyte API, from Zyte (formerly Scrapinghub), for intelligent web scraping with automatic extraction, browser rendering, and anti-bot bypass.
#27
Web Scraping Best Practices and Patterns
Master web scraping best practices: respectful scraping, anti-detection, data quality, error recovery, project architecture, and legal considerations.
#1
Introduction to Playwright for Web Scraping
Learn to scrape JavaScript-heavy websites using Playwright. Handle SPAs, lazy loading, and dynamic content.
#2
Selenium WebDriver Basics for Web Scraping
Learn the fundamentals of Selenium WebDriver for web scraping. Set up Chrome WebDriver, navigate pages, and extract data from dynamic websites.
#3
Playwright Advanced: Handling Popups and Dialogs
Master handling JavaScript alerts, confirm dialogs, popups, and new browser windows in Playwright for reliable web scraping.
#4
Playwright Waiting Strategies and Selectors
Learn Playwright's waiting strategies and powerful selector engine to build reliable scrapers that handle dynamic content loading.
#5
Selenium: Handling JavaScript-Rendered Pages
Learn how to scrape JavaScript-rendered pages with Selenium. Handle dynamic content, AJAX calls, and single-page applications.
#6
Taking Screenshots and PDFs with Playwright
Learn to capture full-page screenshots, element screenshots, and generate PDFs from web pages using Playwright.
#7
Scraping Infinite Scroll Pages
Learn techniques to scrape infinite scroll pages using Playwright and Selenium. Handle lazy-loaded content and extract all data from endlessly scrolling websites.
#8
Handling Dropdowns, Forms, and Clicks
Learn how to interact with web forms, dropdowns, checkboxes, and buttons using Playwright and Selenium for effective web scraping.
#10
Using Playwright with Proxies
Learn to configure Playwright with HTTP, SOCKS5, and rotating proxies for anonymous web scraping and IP rotation.
#11
Using Selenium with Proxies
Configure Selenium WebDriver with HTTP, SOCKS, and authenticated proxies for anonymous and scalable web scraping.
#12
Puppeteer Basics for Web Scraping
Get started with Puppeteer for web scraping in Node.js. Learn to launch headless Chrome, navigate pages, and extract data from dynamic websites.
#14
Intercepting Network Requests with Playwright
Learn to intercept, modify, and block network requests in Playwright for faster scraping and direct API data extraction.
#15
Scraping SPAs: React, Vue, and Angular Sites
Learn strategies for scraping single-page applications built with React, Vue, and Angular using browser automation tools.
#18
Scraping with Playwright in Python
A comprehensive guide to web scraping with Playwright in Python, covering sync and async APIs, data extraction patterns, and exporting results.
#1
Introduction to API Scraping
Learn what API scraping is, why it's more reliable than HTML scraping, and how to get started extracting data from web APIs.
#4
Scraping Paginated APIs
Learn how to handle offset-based, page-based, and cursor-based pagination when scraping APIs with Python.
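Of the three pagination styles listed, cursor-based pagination is the least obvious, so here is a minimal sketch of it. It assumes a hypothetical `fetch_page(cursor)` callable that returns a dict with `items` and `next_cursor` keys; real APIs name these fields differently, so treat the keys as placeholders.

```python
def iter_cursor_pages(fetch_page):
    """Yield every item from a cursor-paginated API.

    fetch_page(cursor) is a hypothetical callable returning a dict like
    {"items": [...], "next_cursor": "abc"}; a missing or empty
    next_cursor marks the last page.
    """
    cursor = None  # the first request is sent without a cursor
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:
            break


# Stand-in for a real API: three pages keyed by the cursor that requests them.
FAKE_PAGES = {
    None: {"items": [1, 2], "next_cursor": "p2"},
    "p2": {"items": [3, 4], "next_cursor": "p3"},
    "p3": {"items": [5], "next_cursor": None},
}

all_items = list(iter_cursor_pages(FAKE_PAGES.__getitem__))
```

Writing the loop as a generator lets the caller stop early without fetching the remaining pages.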
#5
Working with GraphQL APIs
Learn how to discover and scrape GraphQL APIs, craft queries, handle variables, and paginate through GraphQL endpoints.
#9
Scraping JSON APIs and Processing Responses
Learn how to scrape JSON APIs, navigate nested response structures, and extract exactly the data you need using Python.
#13
Scraping Social Media APIs
Learn techniques for extracting data from social media platforms using their official APIs and alternative approaches.
#1
HTML Parsing with BeautifulSoup - Complete Guide
Master HTML parsing with BeautifulSoup4 in Python. Learn to navigate the DOM, find elements, extract text, and handle attributes.
#3
XPath Expressions for Web Scraping
Master XPath expressions for precise element selection in web scraping. Learn axes, predicates, functions, and advanced patterns.
#4
Parsing JSON Responses in Python
Learn to parse, navigate, and extract data from JSON API responses in Python using the json module, jmespath, and pandas.
#5
Using Regex for Data Extraction
Learn to use Python regular expressions to extract emails, URLs, prices, dates, and other patterns from scraped text.
#7
Extracting Structured Data from Unstructured HTML
Techniques for pulling structured records from messy, inconsistent HTML pages. Handle missing elements, variable layouts, and embedded metadata.
#8
Parsing HTML Tables into DataFrames
Extract HTML tables from web pages and convert them into pandas DataFrames for analysis. Handle merged cells, multi-row headers, and nested tables.
#10
Extracting Emails and Phone Numbers from Web Pages
Extract email addresses and phone numbers from scraped web pages using regex patterns, BeautifulSoup, and validation techniques.
#11
Parsing Dates and Prices from Scraped Text
Extract and normalize dates, prices, and currencies from messy scraped text using Python's dateutil, regex, and locale-aware parsing.
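A minimal sketch of the price-normalization idea in this entry, using only `re` and `decimal`. The pattern and the `parse_price` helper are illustrative assumptions: they cover a currency symbol followed by a US-style amount (comma thousands separator, dot decimal point) and would need extending for other locales.

```python
import re
from decimal import Decimal

# Assumption: symbol first, then digits with optional ",ddd" groups and
# an optional ".dd" fraction, as in "$1,299.00".
PRICE_RE = re.compile(r"(?P<symbol>[$€£])\s*(?P<amount>\d+(?:,\d{3})*(?:\.\d+)?)")


def parse_price(text):
    """Return (symbol, Decimal amount) for the first price in text, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting to an exact decimal.
    amount = Decimal(match.group("amount").replace(",", ""))
    return match.group("symbol"), amount
```

`Decimal` is used instead of `float` so that amounts such as 0.10 are stored exactly.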
#12
Using jq and JSONPath for JSON Parsing
Master jq for command-line JSON processing and JSONPath for querying JSON in Python. Filter, transform, and extract data from API responses.
#1
Proxy Basics for Web Scraping
Understand proxy types, when to use them, and how to integrate proxies into your Python scrapers.
#12
Handling Honeypot Traps
Learn how to identify and avoid honeypot traps that websites use to detect and block web scrapers.