html_to_text

Converts HTML to plain text. Preserves structure, handles tables, and can track document elements.

Installation

pip install git+https://github.com/accessibleapps/html_to_text.git

Or with uv:

uv pip install git+https://github.com/accessibleapps/html_to_text.git

Basic Usage

from html_to_text import html_to_text

html = """
<h1>Title</h1>
<p>First paragraph with <strong>bold</strong> text.</p>
<p>Second paragraph.</p>
"""

text = html_to_text(html)
print(text)

Output:

Title

First paragraph with bold text.

Second paragraph.

Features

Structure preservation: Block elements (<p>, <div>, <h1-h6>) get appropriate spacing
Table handling: Tables are parsed and tracked if callbacks are provided
Link extraction: Internal and external links can be captured
Pre-formatted text: <pre> and <code> blocks preserve whitespace
Heading hierarchy: Tracks document structure through heading levels
Element tracking: Optional callbacks for building document indexes

API

`html_to_text(item, node_parsed_callback=None, startpos=0, file="")`

Parameters:

item (str|lxml.etree.Element): HTML string or parsed lxml element
node_parsed_callback (callable, optional): Function called when elements are parsed. Receives (parent, tag_type, content, **kwargs)
startpos (int, optional): Starting position offset for tracking
file (str, optional): File path for resolving relative links

Returns: Plain text string

Advanced: Callbacks

Track document structure by providing a callback function:

def track_elements(parent, tag_type, content, **kwargs):
    """
    Called for each tracked element.

    Args:
        parent: Parent element ID (for hierarchical structures)
        tag_type: 'heading', 'link', 'table', 'tr', 'td', 'th', 'id', 'page'
        content: Element content (text for links, None for structural elements)
        **kwargs: Element-specific data (start, end, level, href, etc.)

    Returns:
        Dictionary with 'id' key (used as parent for child elements)
    """
    print(f"{tag_type}: {content}")
    return {'id': f"{tag_type}_{kwargs.get('start', 0)}"}

html = """
<h1>Main Title</h1>
<h2>Subsection</h2>
<p>Text with <a href="/page">link</a>.</p>
<table>
    <tr><td>Cell</td></tr>
</table>
"""

text = html_to_text(html, node_parsed_callback=track_elements)

Callback parameters by element type:

Type	parent	content	kwargs
`heading`	Parent heading ID	None	`start`, `end`, `tag`, `level`
`link`	None	Link text	`start`, `end`, `href`
`table`	Parent table ID	None	`start`, `attrs`
`tr`, `td`, `th`	Parent table element ID	None	`start`, `attrs`
`id`	None	Element ID	`start`
`page`	None	Page identifier	`start`, `pagenum`

Ignored Elements

<script>
<style>
<title>
Elements with class="pagenum"

Special Handling

Headings (<h1> - <h6>): Surrounded by double newlines. Hierarchy tracked in callbacks.

Blocks (<p>, <div>, <blockquote>, <center>): Double newlines before and after.

Line breaks (<br>): Converted to single newline.

Horizontal rule (<hr>): Becomes 80-character dash line.

Pre-formatted (<pre>, <code>): Whitespace preserved exactly.

Links: Text extracted; URLs captured via callbacks if provided.

Tables: Structure tracked via callbacks; text extracted in reading order.

Requirements

Python ≥ 3.8
lxml

Example: Building a Document Index

from html_to_text import html_to_text

doc_structure = []

def index_callback(parent, tag_type, content, **kwargs):
    entry = {
        'id': len(doc_structure),
        'type': tag_type,
        'parent': parent,
        'start': kwargs.get('start'),
        'end': kwargs.get('end')
    }

    if tag_type == 'heading':
        entry['level'] = kwargs['level']
    elif tag_type == 'link':
        entry['text'] = content
        entry['href'] = kwargs['href']

    doc_structure.append(entry)
    return entry

html = open('document.html').read()
text = html_to_text(html, node_parsed_callback=index_callback, file='document.html')

# Now doc_structure contains full document index with positions
for item in doc_structure:
    if item['type'] == 'heading':
        indent = '  ' * int(item['level'])
        heading_text = text[item['start']:item['end']]
        print(f"{indent}{heading_text}")

License

See LICENSE file.

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
.github/workflows		.github/workflows
tests		tests
.gitignore		.gitignore
README.md		README.md
html_to_text.py		html_to_text.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

html_to_text

Installation

Basic Usage

Features

API

`html_to_text(item, node_parsed_callback=None, startpos=0, file="")`

Advanced: Callbacks

Ignored Elements

Special Handling

Requirements

Example: Building a Document Index

License

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

accessibleapps/html_to_text

Folders and files

Latest commit

History

Repository files navigation

html_to_text

Installation

Basic Usage

Features

API

html_to_text(item, node_parsed_callback=None, startpos=0, file="")

Advanced: Callbacks

Ignored Elements

Special Handling

Requirements

Example: Building a Document Index

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

`html_to_text(item, node_parsed_callback=None, startpos=0, file="")`

Packages