
Lesson 1.1 · Beginner · 4 min read

Your First Scraper: requests + BeautifulSoup

Build a working scraper in fifteen lines of Python: fetch a page, parse it, and pull out structured data. It's the canonical static-scraping pipeline.

What you’ll learn

  • Install and import `requests` and `beautifulsoup4`.
  • Make a GET request and check the status code before parsing.
  • Parse HTML into a navigable tree with BeautifulSoup.
  • Extract text and attributes from selected elements.
  • Recognise the four-step shape every static scraper repeats: fetch, parse, select, extract.

Almost every static scraper, regardless of how big it gets, is the same four steps: fetch a URL, parse the response, select the elements you want, extract their text or attributes. This lesson walks you through all four in one file.

Install the two libraries

pip install requests beautifulsoup4 lxml

requests does the HTTP. beautifulsoup4 builds a DOM tree from the bytes that come back. lxml is the underlying parser BeautifulSoup will use; it's faster and more forgiving than the stdlib html.parser, so install it once and forget about it.
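If you can't guarantee lxml is installed everywhere your code runs, one optional hedge (a sketch, not something this lesson requires) is to detect it at import time and fall back to the stdlib parser:

from bs4 import BeautifulSoup

# A minimal sketch: prefer lxml, fall back to the stdlib parser if it's absent.
try:
    import lxml  # noqa: F401 -- imported only to check availability
    PARSER = "lxml"
except ImportError:
    PARSER = "html.parser"

soup = BeautifulSoup("<p>hello</p>", PARSER)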

The four-step pipeline

import requests
from bs4 import BeautifulSoup

# 1. Fetch
url = "https://practice.scrapingcentral.com/"
response = requests.get(url, timeout=10)
response.raise_for_status()

# 2. Parse
soup = BeautifulSoup(response.text, "lxml")

# 3. Select
title = soup.select_one("h1")
links = soup.select("a[href]")

# 4. Extract
print(title.get_text(strip=True))
for a in links[:5]:
    print(a.get_text(strip=True), "→", a["href"])

Run it. You should see the homepage <h1> text and the first five links. That is a complete, working scraper. Everything else in this sub-path is a variation on these four steps.

Step 1: fetch

requests.get(url) does a GET request and returns a Response object. Three things you check on every response:

  • Status code (response.status_code): 200 means success, and any other 2xx counts too; anything else means stop.
  • Final URL (response.url): if the server redirected, this is where you ended up.
  • Content type (response.headers["Content-Type"]): make sure you got HTML, not JSON or a PDF.

raise_for_status() is the one-line version of "blow up loudly if it's not a 2xx." Use it. Silent failures are the worst kind of scraper bug.

The timeout=10 argument is non-negotiable. Without it, a hung server can freeze your entire scraper indefinitely. We'll come back to timeouts and retries in Lesson 1.6.
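Putting those checks together, a cautious fetch looks roughly like this (a sketch; the text/html guard is one reasonable policy, not the only one):

import requests

url = "https://practice.scrapingcentral.com/"
response = requests.get(url, timeout=10)
response.raise_for_status()  # blow up loudly on anything that isn't 2xx

print(response.status_code)  # 200 on plain success
print(response.url)          # final URL after any redirects

content_type = response.headers.get("Content-Type", "")
print(content_type)          # e.g. "text/html; charset=utf-8"

# Guard: bail out early if the server sent something other than HTML.
if "text/html" not in content_type:
    raise ValueError(f"expected HTML, got {content_type!r}")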

Step 2: parse

soup = BeautifulSoup(response.text, "lxml")

This single line transforms a string of HTML into a tree of nested objects. Now you can navigate that tree by tag name, CSS selector, attribute, or position. BeautifulSoup is forgiving: give it broken HTML and it does its best.

A common beginner stumble is response.content (bytes) versus response.text (str). For most pages either works, but .text lets requests handle the decoding for you, while .content hands BeautifulSoup raw bytes and leaves the encoding question to it. Use .text unless you have a reason not to (Lesson 1.16 covers when you do).
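A quick way to see the difference, assuming the response object (and BeautifulSoup import) from the pipeline above:

html_str = response.text       # str, decoded with the encoding requests detected
html_bytes = response.content  # the raw, undecoded bytes

print(response.encoding)                 # what requests will use to decode .text
print(type(html_str), type(html_bytes))  # <class 'str'> <class 'bytes'>

soup = BeautifulSoup(html_str, "lxml")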

Step 3: select

Two methods you'll use 95% of the time:

  • soup.select(selector): returns a list of matches for a CSS selector.
  • soup.select_one(selector): returns the first match (or None).

You can also use BeautifulSoup's native API:

  • soup.find("h1"), first <h1>.
  • soup.find_all("a"), every <a>.
  • soup.find("div", class_="product-card"), first matching div.

CSS selectors are more concise for anything non-trivial; the find API is more pythonic for simple cases. Most working code uses both. Lesson 1.13 dives into the full BeautifulSoup API.
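As a concrete comparison, here are two ways to reach the same element; the product-card markup is a hypothetical example, not part of the practice homepage:

# CSS selector: one expression, returns None if nothing matches.
via_css = soup.select_one("div.product-card h2")

# find API: two chained calls. The bare chain would raise AttributeError
# if the outer div is missing, so guard it in real code.
card = soup.find("div", class_="product-card")
via_find = card.find("h2") if card else None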

Step 4: extract

Once you have an element, you usually want one of three things:

el.get_text(strip=True)  # the visible text
el["href"]  # an attribute
el.attrs  # all attributes as a dict

get_text(strip=True) removes surrounding whitespace and newlines, almost always what you want. Without strip=True you get raw whitespace from the HTML source, which looks ugly when you print it.
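To see what strip=True actually buys you, here's a tiny standalone demo on a throwaway fragment:

from bs4 import BeautifulSoup

# A hypothetical fragment with the kind of whitespace real HTML carries.
el = BeautifulSoup("<p>\n   Hello, world  \n</p>", "lxml").p
print(repr(el.get_text()))            # '\n   Hello, world  \n'
print(repr(el.get_text(strip=True)))  # 'Hello, world'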

The result is a Python object

Once extracted, your data is just strings, dicts, and lists. Save it to a file, push it to a database, print it, transform it: that's normal Python. The "scraping" is over the moment you've extracted the data.

products = []
for card in soup.select("article.product-card"):
    products.append({
        "name": card.select_one("h2").get_text(strip=True),
        "price": card.select_one(".price").get_text(strip=True),
        "url": card.select_one("a")["href"],
    })

A list of dicts. That's the shape 90% of scraping projects produce.
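From here it really is just Python. One common next step, sketched here assuming the products list above, is to write it out as JSON:

import json

# Persist the list of dicts built above; products.json is an arbitrary name.
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(products, f, ensure_ascii=False, indent=2)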

What this lesson didn't cover (yet)

  • What if the page needs cookies or a login? Lesson 1.4.
  • What if there are multiple pages? Lesson 1.23.
  • What if JavaScript builds the data after load? That's an entire sub-path, Dynamic Web.
  • What if the page is huge or you need to be polite about hammering it? Lesson 1.28.

For now you have a working scraper. The next lessons fill in the details.

Hands-on lab

Open https://practice.scrapingcentral.com/ in your browser, then run the code block above (you'll need to adjust the selectors to match the homepage; that's part of the exercise). Print the page title, the first three navigation links, and any visible call-to-action button text. Confirm your output before moving on.

Practice this lesson on Catalog108, our first-party scraping sandbox.
