Data Parsing

HTML parsing, JSON processing, regex patterns, and data cleaning techniques

15 articles

#1

HTML Parsing with BeautifulSoup - Complete Guide

Master HTML parsing with BeautifulSoup4 in Python. Learn to navigate the DOM, find elements, extract text, and handle attributes.

beginner
beautifulsoup, html-parsing, data-extraction

#2

CSS Selectors vs XPath - When to Use Which

Compare CSS selectors and XPath for web scraping. Learn the syntax, strengths, and best use cases for each approach.

beginner
css-selectors, xpath, html-parsing, beautifulsoup

#3

XPath Expressions for Web Scraping

Master XPath expressions for precise element selection in web scraping. Learn axes, predicates, functions, and advanced patterns.

intermediate
xpath, html-parsing, lxml, data-extraction

#4

Parsing JSON Responses in Python

Learn to parse, navigate, and extract data from JSON API responses in Python using the json module, jmespath, and pandas.

beginner
json, data-extraction, python, apis

#5

Using Regex for Data Extraction

Learn to use Python regular expressions to extract emails, URLs, prices, dates, and other patterns from scraped text.

intermediate
regex, data-extraction, text-parsing

#6

Data Cleaning After Scraping (pandas)

Clean, transform, and prepare scraped data for analysis using pandas. Handle missing values, duplicates, type conversions, and text normalization.

intermediate
pandas, data-cleaning, data-processing

#7

Extracting Structured Data from Unstructured HTML

Techniques for pulling structured records from messy, inconsistent HTML pages. Handle missing elements, variable layouts, and embedded metadata.

intermediate
html-parsing, data-extraction, beautifulsoup, schema

#8

Parsing HTML Tables into DataFrames

Extract HTML tables from web pages and convert them into pandas DataFrames for analysis. Handle merged cells, multi-row headers, and nested tables.

beginner
html-parsing, pandas, tables, data-extraction

#9

Handling Malformed HTML

Learn techniques for parsing the broken, incomplete, and malformed HTML commonly encountered when web scraping.

intermediate
html-parsing, beautifulsoup, lxml, error-handling

#10

Extracting Emails and Phone Numbers from Web Pages

Extract email addresses and phone numbers from scraped web pages using regex patterns, BeautifulSoup, and validation techniques.

beginner
regex, data-extraction, contact-info, text-parsing

#11

Parsing Dates and Prices from Scraped Text

Extract and normalize dates, prices, and currencies from messy scraped text using the python-dateutil library, regex, and locale-aware parsing.

intermediate
text-parsing, dates, prices, data-extraction

#12

Using jq and JSONPath for JSON Parsing

Master jq for command-line JSON processing and JSONPath for querying JSON in Python. Filter, transform, and extract data from API responses.

intermediate
json, jq, jsonpath, data-extraction, cli

#13

Deduplication of Scraped Data

Remove duplicate records from scraped datasets using exact matching, fuzzy matching, and content hashing techniques in Python.

intermediate
deduplication, data-cleaning, pandas, data-quality

#14

Normalizing and Validating Scraped Data

Ensure scraped data quality through normalization and validation using Pydantic models, custom validators, and pandas techniques.

intermediate
data-validation, data-normalization, pydantic, data-quality

#15

Converting Scraped Data to Different Formats (CSV, JSON, Excel, SQL)

Export scraped data to CSV, JSON, Excel, SQLite, and other formats using Python. Learn best practices for each format and when to use them.

beginner
data-export, csv, json, excel, sql, pandas