HTML: Token-Saving Inspection Tools (Medium Priority) #66

@jwesleye

Overview

Add memory-efficient inspection tools for HTML documents so that agents can extract specific data without loading entire pages into context.

Motivation

HTML documents can be large, especially modern web pages with embedded scripts and styles. Agents often need to extract specific elements, count elements, or inspect structure without loading the entire DOM into context.

Proposed Functions

High Priority - Selective Extraction

  • get_html_text_at_selector - Extract text from specific element(s) by CSS selector
  • get_html_element_at_selector - Extract element HTML by CSS selector
  • extract_html_attributes - Get all attributes from elements matching selector
  • extract_html_links - List all links (href) without full parse
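
As a rough illustration of selective extraction, here is a minimal sketch of what get_html_text_at_selector might look like using only the standard library's html.parser. The signature, the helper names (_matches, _TextAtSelector), and the tiny selector subset (tag, #id, .class) are assumptions made for this example, not part of the proposal; full CSS selector support would need a real selector engine.

```python
from __future__ import annotations

from html.parser import HTMLParser

# Elements that never have an end tag, so they must not affect nesting.
VOID_TAGS = frozenset(
    {"area", "base", "br", "col", "embed", "hr", "img", "input",
     "link", "meta", "source", "track", "wbr"}
)


def _matches(selector: str, tag: str, attrs: dict[str, str | None]) -> bool:
    """Match a deliberately tiny selector subset: 'tag', '#id', or '.class'."""
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    if selector.startswith("."):
        return selector[1:] in (attrs.get("class") or "").split()
    return tag == selector.lower()


class _TextAtSelector(HTMLParser):
    """Buffers text only while inside an element that matches the selector."""

    def __init__(self, selector: str) -> None:
        super().__init__()
        self.selector = selector
        self.results: list[str] = []
        self._open: list[str] = []    # tags opened since the match began
        self._buffer: list[str] = []  # text chunks of the current match

    def handle_starttag(self, tag, attrs):
        if self._open:
            # Already inside a match: only track nesting.
            if tag not in VOID_TAGS:
                self._open.append(tag)
            return
        if tag not in VOID_TAGS and _matches(self.selector, tag, dict(attrs)):
            self._open = [tag]
            self._buffer = []

    def handle_endtag(self, tag):
        if not self._open or tag in VOID_TAGS:
            return
        # Assumes well-nested markup; unclosed tags will skew the result.
        self._open.pop()
        if not self._open:
            text = " ".join("".join(self._buffer).split())
            self.results.append(text)

    def handle_data(self, data):
        if self._open:
            self._buffer.append(data)


def get_html_text_at_selector(html_content: str, selector: str) -> list[str]:
    """Return whitespace-normalized text for each matching element."""
    parser = _TextAtSelector(selector)
    parser.feed(html_content)
    parser.close()
    return parser.results
```

Only text inside matching elements is buffered; everything else is tokenized and discarded, which is where the token saving would come from. Nested matches are folded into the text of the outermost match in this sketch.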

Medium Priority - Inspection

  • count_html_elements - Count elements by tag name or selector
  • get_html_structure - Get DOM tree overview (tag hierarchy) without content
  • get_html_metadata - Extract meta tags, title, description only
  • search_html_text - Find elements containing text pattern
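
The inspection functions could share a single streaming pass. The sketch below shows one possible shape for count_html_elements and get_html_structure; the parameter names, the max_depth argument, and the indented-lines return format are assumptions for illustration, and counting is by tag name only (not by selector).

```python
from __future__ import annotations

from html.parser import HTMLParser

# Elements that never have an end tag, so they must not change the depth.
VOID_TAGS = frozenset(
    {"area", "base", "br", "col", "embed", "hr", "img", "input",
     "link", "meta", "source", "track", "wbr"}
)


class _StructureBuilder(HTMLParser):
    """Records tag counts and an indented tag outline, ignoring all content."""

    def __init__(self, max_depth: int) -> None:
        super().__init__()
        self.max_depth = max_depth
        self.depth = 0
        self.lines: list[str] = []
        self.counts: dict[str, int] = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1
        if self.depth < self.max_depth:
            self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        # Assumes explicit end tags; omitted </p> or </li> will skew depth.
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def count_html_elements(html_content: str, tag_name: str) -> int:
    """Count start tags with the given name (case-insensitive)."""
    builder = _StructureBuilder(0)
    builder.feed(html_content)
    builder.close()
    return builder.counts.get(tag_name.lower(), 0)


def get_html_structure(html_content: str, max_depth: int) -> list[str]:
    """Return one indented line per element above max_depth, without text."""
    builder = _StructureBuilder(max_depth)
    builder.feed(html_content)
    builder.close()
    return builder.lines
```

For example, calling get_html_structure with max_depth set to 3 on a small page yields an outline such as html, head, title, meta, body, div (indented by level), while all text and attributes are dropped.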

Medium Priority - Data Extraction

  • extract_html_tables_simple - Extract tables as structured data (complement to existing extract_table)
  • extract_html_lists - Extract ul/ol lists as arrays
  • extract_html_forms - Extract form structure (fields, actions)
  • preview_html_elements - Get first N elements matching selector
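
To make the data-extraction idea concrete, here is a hedged sketch of extract_html_forms. The returned dictionary layout (action, method, fields) and the set of tags treated as fields are assumptions for this example; the issue does not specify them.

```python
from __future__ import annotations

from html.parser import HTMLParser

# Tags treated as form fields in this sketch (an assumption, not a spec).
FIELD_TAGS = frozenset({"input", "select", "textarea", "button"})


class _FormCollector(HTMLParser):
    """Collects form attributes and field descriptors from start tags only."""

    def __init__(self) -> None:
        super().__init__()
        self.forms: list[dict] = []
        self._current: dict | None = None

    def handle_starttag(self, tag, attrs):
        attr_map = {name: (value or "") for name, value in attrs}
        if tag == "form":
            self._current = {
                "action": attr_map.get("action", ""),
                # A missing method attribute means GET in HTML.
                "method": attr_map.get("method", "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in FIELD_TAGS and self._current is not None:
            self._current["fields"].append(
                {
                    "tag": tag,
                    "name": attr_map.get("name", ""),
                    "type": attr_map.get("type", ""),
                }
            )

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def extract_html_forms(html_content: str) -> list[dict]:
    """Return one dict per <form>, listing its action, method, and fields."""
    collector = _FormCollector()
    collector.feed(html_content)
    collector.close()
    return collector.forms
```

Fields outside any form element are ignored here, and fields associated with a form through the form attribute are not resolved; a fuller version would need to decide how to report those.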

Lower Priority - Analysis

  • get_html_element_stats - Statistics for element types (count, attributes, depth)
  • validate_html_structure_simple - Quick validation without full parse
  • get_html_selector_path - Get CSS selector path for element
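
One possible reading of validate_html_structure_simple is a tag-balance check driven by a stack. The sketch below takes that reading; the return shape and the message wording are assumptions. Note that it is stricter than HTML itself, which permits omitted end tags for elements such as p and li, so it would flag some valid documents.

```python
from __future__ import annotations

from html.parser import HTMLParser

# Elements that never have an end tag, so they are not pushed on the stack.
VOID_TAGS = frozenset(
    {"area", "base", "br", "col", "embed", "hr", "img", "input",
     "link", "meta", "source", "track", "wbr"}
)


class _BalanceChecker(HTMLParser):
    """Tracks open tags on a stack and records mismatched end tags."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: list[str] = []
        self.problems: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.stack:
            self.problems.append(f"unexpected </{tag}>")
        elif self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append(
                f"expected </{self.stack[-1]}> but found </{tag}>"
            )
            # Recover by unwinding to the matching open tag, if there is one.
            if tag in self.stack:
                while self.stack.pop() != tag:
                    pass


def validate_html_structure_simple(html_content: str) -> dict:
    """Report tag-balance problems; does not check content models."""
    checker = _BalanceChecker()
    checker.feed(html_content)
    checker.close()
    problems = list(checker.problems)
    # Anything still open at the end of input was never closed.
    for tag in reversed(checker.stack):
        problems.append(f"unclosed <{tag}>")
    return {"valid": not problems, "problems": problems}
```

This still tokenizes the whole input, so "without full parse" here means only that no DOM tree is built.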

Design Principles

  • Google ADK compliant (JSON-serializable types, no default parameter values)
  • @strands_tool decorator
  • CSS selector support for element selection
  • Memory-efficient (selective parsing where possible)
  • Consistent with JSON/XML token-saving patterns
  • Return structured data (strings, lists, dicts)
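
The principles above might translate into a tool shaped like the following sketch of get_html_metadata. Every parameter is explicit with no default value, and the return value is a plain dict of strings that survives a JSON round trip. The include_open_graph parameter is hypothetical, added only to show a required boolean argument, and the @strands_tool decorator is left out because its import path is not given in this issue.

```python
from __future__ import annotations

import json
from html.parser import HTMLParser


class _MetadataCollector(HTMLParser):
    """Collects the <title> text and <meta> name/property content pairs."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata: dict[str, str] = {}
        self._in_title = False
        self._title_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr_map = {name: (value or "") for name, value in attrs}
            key = attr_map.get("name") or attr_map.get("property")
            if key and "content" in attr_map:
                self.metadata[key] = attr_map["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            self.metadata["title"] = "".join(self._title_parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def get_html_metadata(html_content: str, include_open_graph: bool) -> dict[str, str]:
    """Return title and meta tags; both arguments are required (no defaults)."""
    collector = _MetadataCollector()
    collector.feed(html_content)
    collector.close()
    if include_open_graph:
        return collector.metadata
    # Drop Open Graph entries such as "og:title" when not requested.
    return {k: v for k, v in collector.metadata.items() if not k.startswith("og:")}


if __name__ == "__main__":
    page = '<head><title>Demo</title><meta name="description" content="A page"></head>'
    # The result is JSON-serializable as-is.
    print(json.dumps(get_html_metadata(page, False)))
```

Requiring the caller to pass include_open_graph explicitly is slightly more verbose, but it keeps the function signature free of defaults, as the first principle asks.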

Related

  • Extends existing html/parsing.py functions
  • Related to issue #42 (HTML: Processing Enhancements)
  • Similar to the XML tools get_xml_element_at_path and count_xml_elements
  • Complements existing extract_text, extract_table, extract_images

Module

html/parsing.py

Metadata

Labels: enhancement (New feature or request)