Converts HTML to plain text. Preserves structure, handles tables, and can track document elements.
Installation:

```bash
pip install git+https://github.com/accessibleapps/html_to_text.git
```

Or with uv:

```bash
uv pip install git+https://github.com/accessibleapps/html_to_text.git
```
Basic usage:

```python
from html_to_text import html_to_text

html = """
<h1>Title</h1>
<p>First paragraph with <strong>bold</strong> text.</p>
<p>Second paragraph.</p>
"""

text = html_to_text(html)
print(text)
```
Output:

```
Title

First paragraph with bold text.

Second paragraph.
```
Features:

- Structure preservation: Block elements (`<p>`, `<div>`, `<h1>`-`<h6>`) get appropriate spacing
- Table handling: Tables are parsed and tracked if callbacks are provided
- Link extraction: Internal and external links can be captured (see the sketch after this list)
- Pre-formatted text: `<pre>` and `<code>` blocks preserve whitespace
- Heading hierarchy: Tracks document structure through heading levels
- Element tracking: Optional callbacks for building document indexes
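For example, link extraction happens through the callback interface described below; a minimal sketch (the callback protocol and its arguments are documented in the following sections):

```python
from html_to_text import html_to_text

links = []

def collect_links(parent, tag_type, content, **kwargs):
    # Link callbacks carry the link text in `content` and the target in kwargs['href'].
    if tag_type == 'link':
        links.append((content, kwargs['href']))
    return {'id': None}

html_to_text('<p>See the <a href="https://example.com">docs</a>.</p>',
             node_parsed_callback=collect_links)
print(links)  # expected: [('docs', 'https://example.com')]
```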
Parameters:

- `item` (str | lxml.etree.Element): HTML string or parsed lxml element
- `node_parsed_callback` (callable, optional): Function called when elements are parsed. Receives `(parent, tag_type, content, **kwargs)`
- `startpos` (int, optional): Starting position offset for tracking
- `file` (str, optional): File path for resolving relative links

Returns: Plain text string
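A call that exercises the optional parameters might look like the sketch below (`chapter.html` and the no-op callback are placeholders, not part of the library):

```python
from html_to_text import html_to_text

def noop_callback(parent, tag_type, content, **kwargs):
    # Minimal callback satisfying the documented protocol: return a dict with an 'id' key.
    return {'id': None}

with open('chapter.html') as f:  # placeholder path
    html = f.read()

# `file` is used to resolve relative links; `startpos` offsets the positions
# reported to the callback (see the parameter list above).
text = html_to_text(html,
                    node_parsed_callback=noop_callback,
                    startpos=0,
                    file='chapter.html')
```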
Track document structure by providing a callback function:
```python
def track_elements(parent, tag_type, content, **kwargs):
    """
    Called for each tracked element.

    Args:
        parent: Parent element ID (for hierarchical structures)
        tag_type: 'heading', 'link', 'table', 'tr', 'td', 'th', 'id', 'page'
        content: Element content (text for links, None for structural elements)
        **kwargs: Element-specific data (start, end, level, href, etc.)

    Returns:
        Dictionary with 'id' key (used as parent for child elements)
    """
    print(f"{tag_type}: {content}")
    return {'id': f"{tag_type}_{kwargs.get('start', 0)}"}

html = """
<h1>Main Title</h1>
<h2>Subsection</h2>
<p>Text with <a href="/page">link</a>.</p>
<table>
<tr><td>Cell</td></tr>
</table>
"""

text = html_to_text(html, node_parsed_callback=track_elements)
```
Callback parameters by element type:
| Type | `parent` | `content` | kwargs |
|---|---|---|---|
| `heading` | Parent heading ID | None | `start`, `end`, `tag`, `level` |
| `link` | None | Link text | `start`, `end`, `href` |
| `table` | Parent table ID | None | `start`, `attrs` |
| `tr`, `td`, `th` | Parent table element ID | None | `start`, `attrs` |
| `id` | None | Element ID | `start` |
| `page` | None | Page identifier | `start`, `pagenum` |
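A callback can branch on `tag_type` and read only the keyword arguments listed above; a sketch:

```python
def describe(parent, tag_type, content, **kwargs):
    if tag_type == 'heading':
        print(f"h{kwargs['level']} at {kwargs['start']}-{kwargs['end']} (parent={parent})")
    elif tag_type == 'link':
        print(f"link '{content}' -> {kwargs['href']}")
    elif tag_type == 'page':
        print(f"page {kwargs['pagenum']} at {kwargs['start']}")
    # The returned 'id' is passed back as `parent` for this element's children.
    return {'id': (tag_type, kwargs.get('start'))}
```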
Ignored elements (their content does not appear in the output text):

- `<script>`
- `<style>`
- `<title>`
- Elements with `class="pagenum"`
Element handling:

- Headings (`<h1>`-`<h6>`): Surrounded by double newlines. Hierarchy tracked in callbacks.
- Blocks (`<p>`, `<div>`, `<blockquote>`, `<center>`): Double newlines before and after.
- Line breaks (`<br>`): Converted to a single newline.
- Horizontal rule (`<hr>`): Becomes an 80-character dash line.
- Pre-formatted (`<pre>`, `<code>`): Whitespace preserved exactly.
- Links: Text extracted; URLs captured via callbacks if provided.
- Tables: Structure tracked via callbacks; text extracted in reading order.
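A small illustration of these rules; the expected shape of the result is described in comments rather than asserted exactly, since surrounding whitespace may vary:

```python
from html_to_text import html_to_text

html = "<p>Line one<br>Line two</p><hr><pre>  keep   this   spacing</pre>"
print(html_to_text(html))
# Per the rules above: <br> becomes a single newline, <hr> becomes a line of
# 80 dashes, and the <pre> block keeps its internal whitespace as-is.
```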
Requirements:

- Python ≥ 3.8
- lxml
A fuller example that builds a document index with positions:

```python
from html_to_text import html_to_text

doc_structure = []

def index_callback(parent, tag_type, content, **kwargs):
    entry = {
        'id': len(doc_structure),
        'type': tag_type,
        'parent': parent,
        'start': kwargs.get('start'),
        'end': kwargs.get('end')
    }
    if tag_type == 'heading':
        entry['level'] = kwargs['level']
    elif tag_type == 'link':
        entry['text'] = content
        entry['href'] = kwargs['href']
    doc_structure.append(entry)
    return entry

html = open('document.html').read()
text = html_to_text(html, node_parsed_callback=index_callback, file='document.html')

# Now doc_structure contains a full document index with positions
for item in doc_structure:
    if item['type'] == 'heading':
        indent = ' ' * int(item['level'])
        heading_text = text[item['start']:item['end']]
        print(f"{indent}{heading_text}")
```
License: See the LICENSE file.