Working with exported data is often a case study in unexpected HTML contamination. After one too many encounters with `<p>` tags lurking in seemingly innocent CSVs, I built this utility as a self-defense mechanism. It's a straightforward Python tool that strips HTML from CSV files while religiously preserving your data structure.
If you've ever exported data from virtually any system, you've likely encountered what I call "HTML leakage" - those moments when:
- Your CRM exports customer notes complete with formatting tags
- Your CMS gives you content tables with embedded HTML
- Your analytics platform decides `<strong>` tags belong in numerical data
- Your survey platform translates rich text responses into markup soup
The result? Data that's perfectly readable to a browser but completely frustrating for analysis, transformation, or migration.
This tool approaches the problem with two guiding principles:
- Respect the structure - Never alter column counts, row arrangements, or field relationships
- Remove the markup - Use robust HTML parsing to extract just the meaningful content
In practice, this means a CSV with 17 columns and 5,000 rows before sanitization will have exactly 17 columns and 5,000 rows after, but without the HTML cruft.
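To make that guarantee concrete, here's a minimal sketch of the structure-preserving loop; `sanitize_csv` and the `clean` callback are illustrative names I'm using here, not the tool's actual API:

```python
import csv
from typing import Callable

def sanitize_csv(in_path: str, out_path: str, clean: Callable[[str], str]) -> None:
    # Read and write row by row: field counts, row order, and column
    # relationships never change; only cell contents pass through clean().
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([clean(cell) for cell in row])
```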
While regex is tempting for quick HTML removal, it's notoriously unreliable with complex markup. This tool:
- Uses BeautifulSoup as the primary HTML parser (when available)
- Falls back to regex patterns if BeautifulSoup isn't installed
- Carefully maintains CSV structure throughout processing
- Handles CSV dialects, encoding, and escaping properly
Having built too many brittle regex-based parsers in the past, I've learned that HTML requires a proper parser. The BeautifulSoup approach handles nested tags, malformed HTML, and complex attribute structures far more reliably.
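That parser-first, regex-fallback design looks roughly like this; a sketch, not the tool's exact code, with `strip_tags` and `TAG_RE` as assumed names:

```python
import re

try:
    from bs4 import BeautifulSoup  # optional: pip install beautifulsoup4
except ImportError:
    BeautifulSoup = None

TAG_RE = re.compile(r"<[^>]+>")  # crude fallback, acceptable for simple markup

def strip_tags(text: str) -> str:
    # A real parser copes with nesting, malformed tags, and attribute noise;
    # the regex only kicks in when BeautifulSoup isn't installed.
    if BeautifulSoup is not None:
        return BeautifulSoup(text, "html.parser").get_text(" ", strip=True)
    return TAG_RE.sub(" ", text).strip()
```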
```bash
# Clone it
git clone https://github.com/poacosta/csv-html-sanitizer.git
cd csv-html-sanitizer

# Optional but recommended
pip install beautifulsoup4

# Run it
python csv_sanitizer.py your_html_riddled_file.csv
```
For most cases, just point it at your CSV:

```bash
python csv_sanitizer.py messy_export.csv
```

This creates `sanitized_messy_export.csv` with all HTML removed.
Based on my encounters with different types of HTML contamination, I've added three sanitization modes: the default full cleanup, plus two lighter alternatives:

```bash
# Remove only structural elements (p, div, strong, etc.)
python csv_sanitizer.py input.csv --mode structural

# Just decode HTML entities without removing tags
python csv_sanitizer.py input.csv --mode basic
```
When you know exactly which tags are causing trouble:
```bash
python csv_sanitizer.py input.csv --tags p,div,span,strong
```
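One way targeted removal can work under the hood is by unwrapping only the named tags; a hypothetical helper, assuming BeautifulSoup is installed:

```python
from typing import List

from bs4 import BeautifulSoup

def strip_selected_tags(markup: str, tags: List[str]) -> str:
    # unwrap() deletes the tag itself but keeps its inner text and children,
    # so only the named tags disappear; all other markup is left untouched.
    soup = BeautifulSoup(markup, "html.parser")
    for tag in soup.find_all(tags):
        tag.unwrap()
    return str(soup)

# strip_selected_tags("<p>Hi <em>there</em></p>", ["p"]) -> "Hi <em>there</em>"
```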
Because UTF-8 is more of an aspiration than a reality in many systems:
```bash
python csv_sanitizer.py input.csv --encoding latin-1
```
The sanitizer employs a two-stage approach to HTML handling:
- First pass: Entity decoding (`&amp;` → `&`, etc.)
- Second pass: HTML tag removal via BeautifulSoup or regex
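In code, the two passes might look like this (a sketch that assumes BeautifulSoup is available; the regex fallback was sketched earlier):

```python
import html

from bs4 import BeautifulSoup

def sanitize_cell(value: str) -> str:
    # First pass: decode entities, e.g. "&amp;" -> "&", "&nbsp;" -> no-break space
    decoded = html.unescape(value)
    # Second pass: parse and keep only the human-readable text
    return BeautifulSoup(decoded, "html.parser").get_text(" ", strip=True)

# sanitize_cell("R&amp;D <strong>Q3</strong>") -> "R&D Q3"
```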
CSV processing uses Python's built-in csv module with careful handling of:
- Dialect detection (delimiter, quote character; see the sniffing sketch after this list)
- Proper escaping of special characters
- Structure preservation with explicit field mapping
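For the dialect-detection step, the standard library's `csv.Sniffer` is the natural fit; a sketch with an Excel-style default when sniffing fails:

```python
import csv

def detect_dialect(path: str, encoding: str = "utf-8"):
    # Guess delimiter and quoting conventions from a leading sample.
    with open(path, newline="", encoding=encoding) as f:
        sample = f.read(8192)
    try:
        return csv.Sniffer().sniff(sample)
    except csv.Error:
        return csv.excel  # comma-delimited, double-quoted defaults

# Usage: csv.reader(open(path, newline=""), dialect=detect_dialect(path))
```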
The tool handles edge cases like:
- Inconsistent quoting in source files
- Missing escape characters in dialects (see the sketch after this list)
- HTML fragments vs. complete documents
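For the missing-escape-character case, one defensive option is patching the sniffed dialect before writing; a hypothetical `harden_dialect` helper, not necessarily what the tool does internally:

```python
def harden_dialect(dialect):
    # A dialect that neither doubles quotes nor defines an escape character
    # can't safely write fields that contain the quote character itself.
    if not getattr(dialect, "doublequote", True) and \
            getattr(dialect, "escapechar", None) is None:
        dialect.escapechar = "\\"
    return dialect
```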
Having battle-tested this on exports from various systems, I've found a few limitations worth noting:
- While it preserves CSV structure perfectly, whitespace formatting from HTML is normalized
- Very large files (100MB+) will work but consume proportional memory
- Some extremely malformed HTML might lose content in rare edge cases
Core (no external dependencies):
- Python 3.6+
- Standard library modules only
Enhanced functionality:
- BeautifulSoup4 (recommended but optional)
MIT License - Take it, improve it, share what you learn.