Open
Labels
enhancement: New feature or request
Description
Overview
Add memory-efficient inspection tools for HTML documents to help agents extract specific data without parsing entire pages into context.
Motivation
HTML documents can be large, especially modern web pages with embedded scripts and styles. Agents often need to extract specific elements, count elements, or inspect structure without loading the entire DOM into context.
Proposed Functions
High Priority - Selective Extraction
- get_html_text_at_selector - Extract text from specific element(s) by CSS selector
- get_html_element_at_selector - Extract element HTML by CSS selector
- extract_html_attributes - Get all attributes from elements matching selector
- extract_html_links - List all links (href) without full parse
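As a rough illustration of "without full parse", extract_html_links could be sketched on top of the standard library's streaming tokenizer. The issue does not name a parser, so using html.parser here is an assumption, and the helper class _LinkCollector is hypothetical. No DOM tree is built; only the href strings are retained.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Hypothetical helper: keeps href values of <a> tags, discards everything else."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_html_links(html: str) -> list[str]:
    """Return href values of all <a> elements, in document order."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Links are returned exactly as written in the markup; resolving relative URLs against a base URL is left out of this sketch.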
Medium Priority - Inspection
- count_html_elements - Count elements by tag name or selector
- get_html_structure - Get DOM tree overview (tag hierarchy) without content
- get_html_metadata - Extract meta tags, title, description only
- search_html_text - Find elements containing text pattern
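A minimal sketch of count_html_elements, assuming the tag-name case only (selector matching is not covered here). The parameter name tag_name and the helper class _TagCounter are hypothetical. The result is a single integer, so nothing from the page itself enters the agent's context.

```python
from collections import Counter
from html.parser import HTMLParser


class _TagCounter(HTMLParser):
    """Hypothetical helper: counts start tags by name while streaming the input."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <br/> via handle_startendtag.
        self.counts[tag] += 1


def count_html_elements(html: str, tag_name: str) -> int:
    """Return how many elements with the given tag name appear in the document."""
    counter = _TagCounter()
    counter.feed(html)
    counter.close()
    # HTMLParser reports tag names in lowercase, so normalize the query too.
    return counter.counts[tag_name.lower()]
```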
Medium Priority - Data Extraction
- extract_html_tables_simple - Extract tables as structured data (complement to existing extract_table)
- extract_html_lists - Extract ul/ol lists as arrays
- extract_html_forms - Extract form structure (fields, actions)
- preview_html_elements - Get first N elements matching selector
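One possible shape for extract_html_lists is sketched below, again assuming the standard library's html.parser rather than any parser the project actually uses. The return format (one list of item strings per ul/ol, in document order) and the helper class _ListCollector are assumptions. Text inside a nested list is attributed to the nested list only.

```python
from html.parser import HTMLParser


class _ListCollector(HTMLParser):
    """Hypothetical helper: gathers <li> text for each <ul>/<ol> in the document."""

    def __init__(self) -> None:
        super().__init__()
        self.lists: list[list[str]] = []  # one entry per list, in document order
        self._open: list[int] = []        # indexes into self.lists for lists still open
        self._buffers: list = []          # per open list: text chunks of its current <li>, or None

    def _flush(self) -> None:
        # Finish the current <li> of the innermost open list, if there is one.
        buf = self._buffers[-1]
        if buf is not None:
            text = " ".join("".join(buf).split())
            if text:  # items with no text are dropped in this sketch
                self.lists[self._open[-1]].append(text)
            self._buffers[-1] = None

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.lists.append([])
            self._open.append(len(self.lists) - 1)
            self._buffers.append(None)
        elif tag == "li" and self._open:
            self._flush()  # tolerate a missing </li> before the next <li>
            self._buffers[-1] = []

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._open:
            self._flush()
            self._open.pop()
            self._buffers.pop()
        elif tag == "li" and self._open:
            self._flush()

    def handle_data(self, data):
        if self._open and self._buffers[-1] is not None:
            self._buffers[-1].append(data)


def extract_html_lists(html: str) -> list[list[str]]:
    """Return the text items of every ul/ol list as a list of string lists."""
    collector = _ListCollector()
    collector.feed(html)
    collector.close()
    return collector.lists
```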
Lower Priority - Analysis
- get_html_element_stats - Statistics for element types (count, attributes, depth)
- validate_html_structure_simple - Quick validation without full parse
- get_html_selector_path - Get CSS selector path for element
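A sketch of what get_html_element_stats might return, assuming a dict with tag counts, attribute-name counts, and maximum nesting depth. These keys, and the helper class _StatsCollector, are hypothetical. Depth is tracked with a simple counter, so it is only approximate for markup that omits optional end tags (for example an unclosed p or li).

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not increase the open depth.
_VOID = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                   "input", "link", "meta", "source", "track", "wbr"})


class _StatsCollector(HTMLParser):
    """Hypothetical helper: accumulates counts and depth while streaming."""

    def __init__(self) -> None:
        super().__init__()
        self.tag_counts: Counter = Counter()
        self.attribute_counts: Counter = Counter()
        self.max_depth = 0
        self._depth = 0  # number of currently open non-void elements

    def _record(self, tag, attrs, depth):
        self.tag_counts[tag] += 1
        for name, _value in attrs:
            self.attribute_counts[name] += 1
        self.max_depth = max(self.max_depth, depth)

    def handle_starttag(self, tag, attrs):
        if tag in _VOID:
            self._record(tag, attrs, self._depth + 1)
        else:
            self._depth += 1
            self._record(tag, attrs, self._depth)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax: the element sits one level down but opens nothing.
        self._record(tag, attrs, self._depth + 1)

    def handle_endtag(self, tag):
        if tag not in _VOID and self._depth > 0:
            self._depth -= 1


def get_html_element_stats(html: str) -> dict:
    """Return JSON-serializable statistics about the elements in the document."""
    collector = _StatsCollector()
    collector.feed(html)
    collector.close()
    return {
        "tag_counts": dict(collector.tag_counts),
        "attribute_counts": dict(collector.attribute_counts),
        "max_depth": collector.max_depth,
    }
```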
Design Principles
- Google ADK compliant (JSON-serializable types, no default parameter values)
- @strands_tool decorator
- CSS selector support for element selection
- Memory-efficient (selective parsing where possible)
- Consistent with JSON/XML token-saving patterns
- Return structured data (strings, lists, dicts)
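To make the selector-based, structured-return principles concrete, here is a minimal sketch of get_html_text_at_selector. It assumes only simple compound selectors (tag, #id, .class and combinations such as p.note) and uses the standard library's html.parser; a real implementation would likely delegate to a full CSS selector engine, which the issue does not specify. The helper names are hypothetical, and the @strands_tool decorator is omitted because its import path is not given here. The function takes no default arguments and returns a plain list of strings.

```python
import re
from html.parser import HTMLParser

_VOID = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                   "input", "link", "meta", "source", "track", "wbr"})
_SELECTOR = re.compile(r"^(?P<tag>[A-Za-z][\w-]*)?(?P<rest>(?:[.#][\w-]+)*)$")


def _parse_simple_selector(selector: str):
    """Split 'tag#id.class' into (tag, ids, classes); reject anything more complex."""
    m = _SELECTOR.match(selector.strip())
    if m is None or not (m.group("tag") or m.group("rest")):
        raise ValueError(f"unsupported selector: {selector!r}")
    parts = re.findall(r"([.#])([\w-]+)", m.group("rest"))
    ids = [v for k, v in parts if k == "#"]
    classes = [v for k, v in parts if k == "."]
    return (m.group("tag") or "").lower(), ids, classes


def _clean(chunks) -> str:
    return " ".join("".join(chunks).split())


class _TextAtSelector(HTMLParser):
    def __init__(self, tag, ids, classes) -> None:
        super().__init__()
        self._tag, self._ids, self._classes = tag, ids, classes
        self._stack = []     # currently open non-void tags
        self._captures = []  # (stack depth when opened, result index, text chunks)
        self.results = []

    def _matches(self, tag, attrs) -> bool:
        attr = dict(attrs)
        if self._tag and tag != self._tag:
            return False
        if any(attr.get("id") != wanted for wanted in self._ids):
            return False
        present = (attr.get("class") or "").split()
        return all(c in present for c in self._classes)

    def handle_starttag(self, tag, attrs):
        if tag in _VOID:
            return
        self._stack.append(tag)
        if self._matches(tag, attrs):
            self.results.append("")  # reserve a slot to keep document order
            self._captures.append((len(self._stack), len(self.results) - 1, []))

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing elements carry no text

    def handle_endtag(self, tag):
        if tag not in self._stack:
            return  # stray end tag
        while self._stack:
            closed = self._stack.pop()
            # Finish every capture whose element is no longer open.
            while self._captures and self._captures[-1][0] > len(self._stack):
                _, index, chunks = self._captures.pop()
                self.results[index] = _clean(chunks)
            if closed == tag:
                break

    def handle_data(self, data):
        if self._stack and self._stack[-1] in ("script", "style"):
            return  # skip non-visible text
        for _, _, chunks in self._captures:
            chunks.append(data)

    def finish(self):
        # Flush matches whose end tag never appeared.
        while self._captures:
            _, index, chunks = self._captures.pop()
            self.results[index] = _clean(chunks)


def get_html_text_at_selector(html: str, selector: str) -> list[str]:
    """Return the text of each element matching a simple selector, in document order."""
    parser = _TextAtSelector(*_parse_simple_selector(selector))
    parser.feed(html)
    parser.close()
    parser.finish()
    return parser.results
```

Only the matched elements' text is buffered, so the size of the result depends on how much text the matches contain rather than on the size of the page.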
Related
- Extends existing html/parsing.py functions
- Related to issue #42 (HTML: Processing Enhancements)
- Similar to XML: get_xml_element_at_path, count_xml_elements
- Complements existing extract_text, extract_table, extract_images
Module
html/parsing.py