Scraping Central is reader-supported. When you buy through links on our site, we may earn an affiliate commission.

Beginner · 4 min read

HTML Structure and the DOM

What HTML actually is, what a parser turns it into, and the tree-shaped mental model your scraper needs to navigate.

What you’ll learn

  • Distinguish HTML source text from the DOM tree it parses into.
  • Name the three node types you'll work with: elements, text, attributes.
  • Understand why 'view source' and 'inspect element' can show different markup.
  • Read a real-world HTML snippet and identify which parts are useful data vs. layout chrome.

You don't scrape HTML. You scrape the DOM that a parser builds from HTML. The distinction sounds pedantic for two minutes and then explains a hundred bugs.

HTML is text. DOM is a tree.

HTML is the source:

<article class="product" data-id="42">
  <h2>Yellow ceramic mug</h2>
  <p class="price">$14.99</p>
  <ul class="tags">
  <li>kitchen</li>
  <li>ceramic</li>
  </ul>
</article>

The DOM is what a parser builds:

article.product[data-id=42]
├── h2
│   └── "Yellow ceramic mug"  (text node)
├── p.price
│   └── "$14.99"  (text node)
└── ul.tags
    ├── li
    │   └── "kitchen"
    └── li
        └── "ceramic"

Every scraping library (BeautifulSoup, lxml, DomCrawler, Cheerio) gives you a programmatic interface to that tree. Your job is to navigate it.
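In BeautifulSoup, for example, that navigation looks like this (a sketch; the other libraries expose the same tree through their own APIs):

```python
from bs4 import BeautifulSoup

html = """
<article class="product" data-id="42">
  <h2>Yellow ceramic mug</h2>
  <p class="price">$14.99</p>
  <ul class="tags">
    <li>kitchen</li>
    <li>ceramic</li>
  </ul>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# Walk the tree the parser built, not the source text
article = soup.find("article", class_="product")
name = article.h2.get_text(strip=True)
price = article.select_one("p.price").get_text(strip=True)
tags = [li.get_text(strip=True) for li in article.select("ul.tags li")]

print(name, price, tags)
```

get_text(strip=True) reads the text nodes under an element; the tag objects themselves are the element nodes.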

Three node types you'll actually use

  • Element, e.g. <p class="price">. A tag with attributes; it can have children.
  • Text, e.g. "$14.99". The string between or inside tags.
  • Attribute, e.g. class="price" or data-id="42". Key/value pairs attached to an element.

Comments and processing instructions exist too. Ignore them.
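In BeautifulSoup terms, the three types look like this (a quick sketch):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup('<p class="price">$14.99</p>', "html.parser")

p = soup.p                 # element node: a Tag
assert isinstance(p, Tag)

text = p.string            # text node: a NavigableString that behaves like str
assert isinstance(text, NavigableString)

attrs = p.attrs            # attributes: a plain dict on the element
# note: class is multi-valued, so BeautifulSoup stores it as a list
assert attrs == {"class": ["price"]}
```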

The hierarchy: parent, child, descendant, sibling

Familiar terms, used constantly:

  • Parent, the element that immediately contains another (e.g. ul.tags is the parent of each li).
  • Child, the inverse. article has three children: h2, p.price, ul.tags.
  • Descendant, any node nested inside, at any depth. Each li is a descendant of article even though it's two levels deep.
  • Sibling, same parent. h2 and p.price are siblings.

These map directly to CSS combinators (> for child, a space for descendant, + for adjacent sibling, ~ for general sibling) and XPath axes (covered in the next two lessons).
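The same relationships, sketched with BeautifulSoup's navigation methods on the mug markup from above:

```python
from bs4 import BeautifulSoup

html = """<article class="product">
  <h2>Yellow ceramic mug</h2>
  <p class="price">$14.99</p>
  <ul class="tags"><li>kitchen</li><li>ceramic</li></ul>
</article>"""
soup = BeautifulSoup(html, "html.parser")

parent = soup.li.parent                            # ul.tags is each li's parent
children = soup.article.find_all(recursive=False)  # direct children only: h2, p, ul
sibling = soup.h2.find_next_sibling()              # p.price, the next element sibling
descendants = soup.article.find_all("li")          # any depth: both li elements
```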

Tag soup vs. clean HTML

The HTML spec demands clean nesting. Reality doesn't. Real-world pages are full of unclosed tags, badly nested elements, and weirdness:

<table>
  <tr>
  <td>cell 1
  <td>cell 2  <!-- missing </td> -->
  <tr>  <!-- missing </tr> -->
  <td>cell 3</td>
  </tr>
</table>

A strict XML parser refuses this. An HTML parser (BeautifulSoup with html.parser or lxml, PHP's DOMDocument::loadHTML()) silently repairs it: it closes missing tags at the next reasonable spot, re-nests elements as needed, and produces a consistent tree (the exact repairs vary slightly between parsers). This is the single most important reason to use a real HTML parser rather than regex on the source text.
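A sketch of the difference, feeding that broken table to Python's strict XML parser and to BeautifulSoup (the exact repaired tree varies between HTML parsers, but every cell stays reachable):

```python
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

tag_soup = """<table>
  <tr>
  <td>cell 1
  <td>cell 2
  <tr>
  <td>cell 3</td>
  </tr>
</table>"""

# A strict XML parser rejects the unclosed tags outright
try:
    ET.fromstring(tag_soup)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

# An HTML parser still builds a tree with all three cells in it
tree = BeautifulSoup(tag_soup, "html.parser")
cells = tree.find_all("td")
```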

Why 'view source' and 'inspect element' can disagree

Two different views of the page:

  • View Source (Ctrl+U), the raw HTML the server sent; what a curl request would receive.
  • Inspect Element (DevTools), the DOM as it currently exists, after JavaScript has modified it.

If a page is server-rendered, both views match. If JavaScript adds, removes, or modifies elements after load, only Inspect Element shows the latest state. This single distinction is the entire static-vs-dynamic decision the curriculum's Sub-Path 2 spends six lessons on.

A quick scraper-side proxy for the same diagnostic:

curl -s https://practice.scrapingcentral.com/ | head -50

If your data is in that output, you can use a static scraper. If it isn't, JavaScript is putting it there, and you'll need a browser-driving tool.
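The same diagnostic from Python, using only the standard library (a sketch; the marker string is whatever piece of data you expect the page to contain):

```python
import urllib.request

def fetch_raw(url: str) -> str:
    """Python equivalent of `curl -s`: the HTML before any JavaScript runs."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def is_static_scrapable(raw_html: str, marker: str) -> bool:
    """True if the data is already present in the server-sent HTML."""
    return marker in raw_html

# Usage (requires network):
#   raw = fetch_raw("https://practice.scrapingcentral.com/")
#   is_static_scrapable(raw, "Yellow ceramic mug")
```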

Attributes you'll grep constantly

Attributes carry most of the scraping signal:

  • id="...", unique identifier (in theory). Use it.
  • class="...", space-separated CSS classes. Used for grouping similar items.
  • href="...", link target on <a> elements.
  • src="...", resource target on <img>, <script>, <iframe>.
  • data-*="...", custom attributes (data-product-id, data-page). Often the cleanest hook for scrapers.
  • aria-label="...", accessibility label; often carries a clean text version of content that is otherwise rendered as an icon or image.

data-* attributes are the unsung hero of scraping: they exist because developers added them for their own JavaScript, but they're stable, semantic, and rarely change across redesigns.
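Pulling those hooks out with BeautifulSoup (a sketch; the markup mirrors the mug example):

```python
from bs4 import BeautifulSoup

html = '<article class="product" data-id="42" data-page="1"><a href="/p/42">Mug</a></article>'
card = BeautifulSoup(html, "html.parser").article

product_id = card["data-id"]     # single attribute lookup
link = card.a["href"]            # href on a child <a>
# gather every data-* attribute in one pass
data_attrs = {k: v for k, v in card.attrs.items() if k.startswith("data-")}
```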

The semantic vs. presentational split

Modern HTML pushes you toward semantic tags (<article>, <section>, <nav>, <aside>, <header>, <footer>). Older / sloppier markup uses generic <div> and <span> everywhere with class names doing the semantic work. Scrape both, but prefer semantic tags as anchors when they exist, because they're less likely to be renamed in a redesign.
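One way to express that preference in code, sketched with a hypothetical product-card class:

```python
from bs4 import BeautifulSoup

def find_card(soup: BeautifulSoup):
    # Anchor on the semantic tag when it exists; fall back to the generic one
    return soup.select_one("article.product-card") or soup.select_one("div.product-card")

semantic = BeautifulSoup('<article class="product-card"><h2>Mug</h2></article>', "html.parser")
legacy = BeautifulSoup('<div class="product-card"><span class="title">Mug</span></div>', "html.parser")

card_a = find_card(semantic)   # finds the <article>
card_b = find_card(legacy)     # falls back to the <div>
```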

Hands-on lab

Open practice.scrapingcentral.com/products in DevTools. Find one product card. Identify:

  1. The outer element (probably an <article> or <div> with a class like product-card).
  2. Where the product name lives, which tag, what class.
  3. Where the price is, same questions.
  4. Whether the product ID is in data-*, the href of a child link, or both.

Then look at the same page via curl -s | head -200. Confirm the same data exists in the raw HTML (it should, /products is server-rendered). This habit, comparing rendered DOM to raw HTML, is something you'll do on every new scraping target.


Practice this lesson on Catalog108, our first-party scraping sandbox.

Open lab target → /products

Quiz, check your understanding

Pass mark is 70%. Pick the best answer; you’ll see the explanation right after.


What is the relationship between HTML source text and the DOM?
