Data Parsing

HTML parsing, JSON processing, regex patterns, and data cleaning techniques

15 articles

HTML Parsing with BeautifulSoup - Complete Guide

Master HTML parsing with BeautifulSoup4 in Python. Learn to navigate the DOM, find elements, extract text, and handle attributes.

beginner

beautifulsoup, html-parsing, data-extraction

CSS Selectors vs XPath - When to Use Which

Compare CSS selectors and XPath for web scraping. Learn the syntax, strengths, and best use cases for each approach.

beginner

css-selectors, xpath, html-parsing, beautifulsoup

XPath Expressions for Web Scraping

Master XPath expressions for precise element selection in web scraping. Learn axes, predicates, functions, and advanced patterns.

intermediate

xpath, html-parsing, lxml, data-extraction

Parsing JSON Responses in Python

Learn to parse, navigate, and extract data from JSON API responses in Python using the json module, jmespath, and pandas.

beginner

json, data-extraction, python, apis

Using Regex for Data Extraction

Learn to use Python regular expressions to extract emails, URLs, prices, dates, and other patterns from scraped text.

intermediate

regex, data-extraction, text-parsing

Data Cleaning After Scraping (pandas)

Clean, transform, and prepare scraped data for analysis using pandas. Handle missing values, duplicates, type conversions, and text normalization.

intermediate

pandas, data-cleaning, data-processing

Extracting Structured Data from Unstructured HTML

Techniques for pulling structured records from messy, inconsistent HTML pages. Handle missing elements, variable layouts, and embedded metadata.

intermediate

html-parsing, data-extraction, beautifulsoup, schema

Parsing HTML Tables into DataFrames

Extract HTML tables from web pages and convert them into pandas DataFrames for analysis. Handle merged cells, multi-row headers, and nested tables.

beginner

html-parsing, pandas, tables, data-extraction

Handling Malformed HTML

Learn techniques for parsing the broken, incomplete, and malformed HTML you commonly encounter when scraping the web.

intermediate

html-parsing, beautifulsoup, lxml, error-handling

Extracting Emails and Phone Numbers from Web Pages

Extract email addresses and phone numbers from scraped web pages using regex patterns, BeautifulSoup, and validation techniques.

beginner

regex, data-extraction, contact-info, text-parsing

Parsing Dates and Prices from Scraped Text

Extract and normalize dates, prices, and currencies from messy scraped text using Python's dateutil, regex, and locale-aware parsing.

intermediate

text-parsing, dates, prices, data-extraction

Using jq and JSONPath for JSON Parsing

Master jq for command-line JSON processing and JSONPath for querying JSON in Python. Filter, transform, and extract data from API responses.

intermediate

json, jq, jsonpath, data-extraction, cli

Deduplication of Scraped Data

Remove duplicate records from scraped datasets using exact matching, fuzzy matching, and content hashing techniques in Python.

intermediate

deduplication, data-cleaning, pandas, data-quality

Normalizing and Validating Scraped Data

Ensure scraped data quality through normalization and validation using Pydantic models, custom validators, and pandas techniques.

intermediate

data-validation, data-normalization, pydantic, data-quality

Converting Scraped Data to Different Formats (CSV, JSON, Excel, SQL)

Export scraped data to CSV, JSON, Excel, SQLite, and other formats using Python. Learn best practices for each format and when to use them.

beginner

data-export, csv, json, excel, sql, pandas