HTML Structure and the DOM
What HTML actually is, what a parser turns it into, and the tree-shaped mental model your scraper needs to navigate.
What you’ll learn
- Distinguish HTML source text from the DOM tree it parses into.
- Name the three node types you'll work with: elements, text, attributes.
- Understand why 'view source' and 'inspect element' can show different markup.
- Read a real-world HTML snippet and identify which parts are useful data vs. layout chrome.
You don't scrape HTML. You scrape the DOM that a parser builds from HTML. The distinction sounds pedantic for two minutes and then explains a hundred bugs.
HTML is text. DOM is a tree.
HTML is the source:
```html
<article class="product" data-id="42">
  <h2>Yellow ceramic mug</h2>
  <p class="price">$14.99</p>
  <ul class="tags">
    <li>kitchen</li>
    <li>ceramic</li>
  </ul>
</article>
```
The DOM is what a parser builds:
```
article.product[data-id=42]
├── h2
│   └── "Yellow ceramic mug" (text node)
├── p.price
│   └── "$14.99" (text node)
└── ul.tags
    ├── li
    │   └── "kitchen"
    └── li
        └── "ceramic"
```
Every scraping library (BeautifulSoup, lxml, DomCrawler, Cheerio) gives you a programmatic interface to that tree. Your job is to navigate it.
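Here's what that navigation looks like in practice, a minimal sketch using BeautifulSoup (any of the libraries above would do; the method names differ, the tree doesn't):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<article class="product" data-id="42">
  <h2>Yellow ceramic mug</h2>
  <p class="price">$14.99</p>
  <ul class="tags"><li>kitchen</li><li>ceramic</li></ul>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article", class_="product")

print(article["data-id"])                                 # '42' -- an attribute
print(article.h2.get_text())                              # 'Yellow ceramic mug' -- a text node
print(article.find("p", class_="price").get_text())       # '$14.99'
print([li.get_text() for li in article.find_all("li")])   # ['kitchen', 'ceramic']
```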
Three node types you'll actually use
| Node | Example | What it holds |
|---|---|---|
| Element | `<p class="price">` | A tag with attributes; can have children |
| Text | `"$14.99"` | A string between/inside tags |
| Attribute | `class="price"`, `data-id="42"` | Key/value pairs on an element |
Comments and processing instructions exist too. Ignore them.
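If you want to see those node types as concrete objects, BeautifulSoup exposes each one directly; a quick sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="price">$14.99</p>', "html.parser")
p = soup.p

print(type(p).__name__)               # Tag                  -- an element node
print(p.attrs)                        # {'class': ['price']} -- attributes as a dict
print(type(p.contents[0]).__name__)   # NavigableString      -- the text node inside
print(p.contents[0])                  # $14.99
```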
The hierarchy: parent, child, descendant, sibling
Familiar terms, used constantly:
- Parent, the element that immediately contains another (e.g. `ul.tags` is the parent of each `li`).
- Child, the inverse. `article` has three children: `h2`, `p.price`, `ul.tags`.
- Descendant, any node nested inside, at any depth. Each `li` is a descendant of `article` even though it's two levels deep.
- Sibling, same parent. `h2` and `p.price` are siblings.
These map directly to CSS combinators (`>` child, descendant by space, `+` next sibling, `~` general sibling) and XPath axes (covered in the next two lessons).
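As a quick preview of that mapping (selectors get their full treatment in those lessons), here is each combinator run against the mug snippet via BeautifulSoup's `select`:

```python
from bs4 import BeautifulSoup

html = ('<article class="product"><h2>Mug</h2><p class="price">$14.99</p>'
        '<ul class="tags"><li>kitchen</li><li>ceramic</li></ul></article>')
soup = BeautifulSoup(html, "html.parser")

soup.select("article > h2")    # child: h2 directly inside article
soup.select("article li")      # descendant: any li, at any depth
soup.select("h2 + p.price")    # next sibling: the p.price immediately after h2
soup.select("h2 ~ ul.tags")    # general sibling: any later ul.tags with the same parent
```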
Tag soup vs. clean HTML
The HTML spec demands clean nesting. Reality doesn't. Real-world pages are full of unclosed tags, badly nested elements, and weirdness:
```html
<table>
  <tr>
    <td>cell 1
    <td>cell 2 <!-- missing </td> -->
  <tr>          <!-- missing </tr> -->
    <td>cell 3</td>
  </tr>
</table>
```
A strict XML parser refuses this. An HTML parser (BeautifulSoup with html.parser or lxml, PHP's DOMDocument with the HTML5 input mode) silently fixes it: closes missing tags at the next reasonable spot, re-nests as needed, and produces a consistent tree. This is the single most important reason to use a real HTML parser and not regex on the source text.
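You can watch the repair happen. A minimal check, assuming BeautifulSoup with the lxml parser installed:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml

tag_soup = "<table><tr><td>cell 1<td>cell 2<tr><td>cell 3</td></tr></table>"
soup = BeautifulSoup(tag_soup, "lxml")

# The parser inserted the missing </td> and </tr> tags and built a sane tree:
print(len(soup.find_all("tr")))                        # 2
print([td.get_text() for td in soup.find_all("td")])   # ['cell 1', 'cell 2', 'cell 3']
```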
Why 'view source' and 'inspect element' can disagree
Two different views of the page:
| View | What you see |
|---|---|
| View Source (Ctrl+U) | The raw HTML the server sent, what a `curl` request would receive |
| Inspect Element (DevTools) | The DOM as it currently exists, after JavaScript has modified it |
If a page is server-rendered, both views match. If JavaScript adds, removes, or modifies elements after load, only Inspect Element shows the latest state. This single distinction is the entire static-vs-dynamic decision the curriculum's Sub-Path 2 spends six lessons on.
A quick scraper-side proxy for the same diagnostic:
```bash
curl -s https://practice.scrapingcentral.com/ | head -50
```
If your data is in that output, you can use a static scraper. If it isn't, JavaScript is putting it there, and you'll need a browser-driving tool.
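The same check in Python, if you'd rather script it: fetch the raw HTML and look for a string you spotted in DevTools. The mug title below is just a stand-in for whatever your target data is.

```python
import requests  # pip install requests

raw = requests.get("https://practice.scrapingcentral.com/").text

# "Yellow ceramic mug" stands in for any data you saw in Inspect Element.
if "Yellow ceramic mug" in raw:
    print("In the raw HTML: server-rendered, a static scraper will do.")
else:
    print("Missing from the raw HTML: JavaScript injects it, use a browser tool.")
```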
Attributes you'll grep constantly
Attributes carry most of the scraping signal:
- `id="..."`, unique identifier (in theory). Use it.
- `class="..."`, space-separated CSS classes. Used for grouping similar items.
- `href="..."`, link target on `<a>` elements.
- `src="..."`, resource target on `<img>`, `<script>`, `<iframe>`.
- `data-*="..."`, custom attributes (`data-product-id`, `data-page`). Often the cleanest hook for scrapers.
- `aria-label="..."`, accessibility label, often holds the same text as visible content.
`data-*` attributes are the unsung heroes of scraping: they exist because developers added them for their own JavaScript, but they're stable, semantic, and rarely change across redesigns.
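Pulling those attributes out is one-liner territory; a sketch against a made-up product link:

```python
from bs4 import BeautifulSoup

html = ('<a class="product-link" href="/products/42" '
        'data-product-id="42" aria-label="Yellow ceramic mug">View</a>')
link = BeautifulSoup(html, "html.parser").a

print(link["data-product-id"])  # '42' -- dict-style access; KeyError if missing
print(link.get("href"))         # '/products/42' -- .get() returns None if missing
print(link["class"])            # ['product-link'] -- class is split into a list
```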
The semantic vs. presentational split
Modern HTML pushes you toward semantic tags (`<article>`, `<section>`, `<nav>`, `<aside>`, `<header>`, `<footer>`). Older or sloppier markup uses generic `<div>` and `<span>` everywhere with class names doing the semantic work. Scrape both, but prefer semantic tags as anchors when they exist, because they're less likely to be renamed in a redesign.
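In code, that preference is a one-line fallback chain; the class names here are illustrative, not taken from any real page:

```python
from bs4 import BeautifulSoup

def product_cards(soup: BeautifulSoup) -> list:
    # Anchor on the semantic tag first; fall back to a class-based div hook.
    return soup.select("article.product") or soup.select("div.product-card")
```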
Hands-on lab
Open practice.scrapingcentral.com/products in DevTools. Find one product card. Identify:
- The outer element (probably an `<article>` or `<div>` with a class like `product-card`).
- Where the product name lives, which tag, what class.
- Where the price is, same questions.
- Whether the product ID is in `data-*`, the `href` of a child link, or both.
Then look at the same page via `curl -s https://practice.scrapingcentral.com/products | head -200`. Confirm the same data exists in the raw HTML (it should; /products is server-rendered). This habit, comparing rendered DOM to raw HTML, is something you'll do on every new scraping target.