HTML Structure and the DOM
What HTML actually is, what a parser turns it into, and the tree-shaped mental model your scraper needs to navigate.
What you’ll learn
- Distinguish HTML source text from the DOM tree it parses into.
- Name the three node types you'll work with: elements, text, attributes.
- Understand why 'view source' and 'inspect element' can show different markup.
- Read a real-world HTML snippet and identify which parts are useful data vs. layout chrome.
You don't scrape HTML. You scrape the DOM that a parser builds from HTML. The distinction sounds pedantic for two minutes and then explains a hundred bugs.
HTML is text. DOM is a tree.
HTML is the source:
```html
<article class="product" data-id="42">
  <h2>Yellow ceramic mug</h2>
  <p class="price">$14.99</p>
  <ul class="tags">
    <li>kitchen</li>
    <li>ceramic</li>
  </ul>
</article>
```
The DOM is what a parser builds:
```
article.product[data-id=42]
├── h2
│   └── "Yellow ceramic mug" (text node)
├── p.price
│   └── "$14.99" (text node)
└── ul.tags
    ├── li
    │   └── "kitchen"
    └── li
        └── "ceramic"
```
Every scraping library (BeautifulSoup, lxml, DomCrawler, Cheerio) gives you a programmatic interface to that tree. Your job is to navigate it.
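Here's what that navigation looks like in practice, a minimal sketch using BeautifulSoup (any of the libraries above would do; the method names differ, the tree doesn't):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<article class="product" data-id="42">
  <h2>Yellow ceramic mug</h2>
  <p class="price">$14.99</p>
  <ul class="tags"><li>kitchen</li><li>ceramic</li></ul>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article", class_="product")

print(article["data-id"])                                 # '42' -- an attribute
print(article.h2.get_text())                              # 'Yellow ceramic mug' -- a text node
print(article.find("p", class_="price").get_text())       # '$14.99'
print([li.get_text() for li in article.find_all("li")])   # ['kitchen', 'ceramic']
```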
Three node types you'll actually use
| Node | Example | What it holds |
|---|---|---|
| Element | `<p class="price">` | A tag with attributes; can have children |
| Text | `"$14.99"` | A string between/inside tags |
| Attribute | `class="price"`, `data-id="42"` | Key/value pairs on an element |
Comments and processing instructions exist too. Ignore them.
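If you want to see those node types as concrete objects, BeautifulSoup exposes each one directly; a quick sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="price">$14.99</p>', "html.parser")
p = soup.p

print(type(p).__name__)               # Tag                  -- an element node
print(p.attrs)                        # {'class': ['price']} -- attributes as a dict
print(type(p.contents[0]).__name__)   # NavigableString      -- the text node inside
print(p.contents[0])                  # $14.99
```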
The hierarchy: parent, child, descendant, sibling
Familiar terms, used constantly:
- Parent, the element that immediately contains another (e.g. `ul.tags` is the parent of each `li`).
- Child, the inverse. `article` has three children: `h2`, `p.price`, `ul.tags`.
- Descendant, any node nested inside, at any depth. Each `li` is a descendant of `article` even though it's two levels deep.
- Sibling, same parent. `h2` and `p.price` are siblings.
These map directly to CSS combinators (`>` child, descendant by space, `+` next sibling, `~` general sibling) and XPath axes (covered in the next two lessons).
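As a quick preview of that mapping (selectors get their full treatment in those lessons), here is each combinator run against the mug snippet via BeautifulSoup's `select`:

```python
from bs4 import BeautifulSoup

html = ('<article class="product"><h2>Mug</h2><p class="price">$14.99</p>'
        '<ul class="tags"><li>kitchen</li><li>ceramic</li></ul></article>')
soup = BeautifulSoup(html, "html.parser")

soup.select("article > h2")    # child: h2 directly inside article
soup.select("article li")      # descendant: any li, at any depth
soup.select("h2 + p.price")    # next sibling: the p.price immediately after h2
soup.select("h2 ~ ul.tags")    # general sibling: any later ul.tags with the same parent
```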
Tag soup vs. clean HTML
The HTML spec demands clean nesting. Reality doesn't. Real-world pages are full of unclosed tags, badly nested elements, and weirdness:
```html
<table>
  <tr>
    <td>cell 1
    <td>cell 2 <!-- missing </td> -->
  <tr>          <!-- missing </tr> -->
    <td>cell 3</td>
  </tr>
</table>
```
A strict XML parser refuses this. An HTML parser (BeautifulSoup with html.parser or lxml, PHP's DOMDocument with the HTML5 input mode) silently fixes it: closes missing tags at the next reasonable spot, re-nests as needed, and produces a consistent tree. This is the single most important reason to use a real HTML parser and not regex on the source text.
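You can watch the repair happen. A minimal check, assuming BeautifulSoup with the lxml parser installed:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml

tag_soup = "<table><tr><td>cell 1<td>cell 2<tr><td>cell 3</td></tr></table>"
soup = BeautifulSoup(tag_soup, "lxml")

# The parser inserted the missing </td> and </tr> tags and built a sane tree:
print(len(soup.find_all("tr")))                        # 2
print([td.get_text() for td in soup.find_all("td")])   # ['cell 1', 'cell 2', 'cell 3']
```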
Why 'view source' and 'inspect element' can disagree
Two different views of the page:
| View | What you see |
|---|---|
| View Source (Ctrl+U) | The raw HTML the server sent, what a `curl` request would receive |
| Inspect Element (DevTools) | The DOM as it currently exists, after JavaScript has modified it |
If a page is server-rendered, both views match. If JavaScript adds, removes, or modifies elements after load, only Inspect Element shows the latest state. This single distinction is the entire static-vs-dynamic decision the curriculum's Sub-Path 2 spends six lessons on.
A quick scraper-side proxy for the same diagnostic:
```bash
curl -s https://practice.scrapingcentral.com/ | head -50
```
If your data is in that output, you can use a static scraper. If it isn't, JavaScript is putting it there, and you'll need a browser-driving tool.
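The same check in Python, if you'd rather script it: fetch the raw HTML and look for a string you spotted in DevTools. The mug title below is just a stand-in for whatever your target data is.

```python
import requests  # pip install requests

raw = requests.get("https://practice.scrapingcentral.com/").text

# "Yellow ceramic mug" stands in for any data you saw in Inspect Element.
if "Yellow ceramic mug" in raw:
    print("In the raw HTML: server-rendered, a static scraper will do.")
else:
    print("Missing from the raw HTML: JavaScript injects it, use a browser tool.")
```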
Attributes you'll grep constantly
Attributes carry most of the scraping signal:
- `id="..."`, unique identifier (in theory). Use it.
- `class="..."`, space-separated CSS classes. Used for grouping similar items.
- `href="..."`, link target on `<a>` elements.
- `src="..."`, resource target on `<img>`, `<script>`, `<iframe>`.
- `data-*="..."`, custom attributes (`data-product-id`, `data-page`). Often the cleanest hook for scrapers.
- `aria-label="..."`, accessibility label, often holds the same text as visible content.
`data-*` attributes are the unsung heroes of scraping: they exist because developers added them for their own JavaScript, but they're stable, semantic, and rarely change across redesigns.
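Pulling those attributes out is one-liner territory; a sketch against a made-up product link:

```python
from bs4 import BeautifulSoup

html = ('<a class="product-link" href="/products/42" '
        'data-product-id="42" aria-label="Yellow ceramic mug">View</a>')
link = BeautifulSoup(html, "html.parser").a

print(link["data-product-id"])  # '42' -- dict-style access; KeyError if missing
print(link.get("href"))         # '/products/42' -- .get() returns None if missing
print(link["class"])            # ['product-link'] -- class is split into a list
```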
The semantic vs. presentational split
Modern HTML pushes you toward semantic tags (`<article>`, `<section>`, `<nav>`, `<aside>`, `<header>`, `<footer>`). Older or sloppier markup uses generic `<div>` and `<span>` everywhere with class names doing the semantic work. Scrape both, but prefer semantic tags as anchors when they exist, because they're less likely to be renamed in a redesign.
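In code, that preference is a one-line fallback chain; the class names here are illustrative, not taken from any real page:

```python
from bs4 import BeautifulSoup

def product_cards(soup: BeautifulSoup) -> list:
    # Anchor on the semantic tag first; fall back to a class-based div hook.
    return soup.select("article.product") or soup.select("div.product-card")
```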
Hands-on lab
Open practice.scrapingcentral.com/products in DevTools. Find one product card. Identify:
- The outer element (probably an `<article>` or `<div>` with a class like `product-card`).
- Where the product name lives, which tag, what class.
- Where the price is, same questions.
- Whether the product ID is in `data-*`, the `href` of a child link, or both.
Then look at the same page via `curl -s https://practice.scrapingcentral.com/products | head -200`. Confirm the same data exists in the raw HTML (it should; /products is server-rendered). This habit, comparing rendered DOM to raw HTML, is something you'll do on every new scraping target.