Add docs parsers, update types and methods #324

atipugin · 2025-11-27T11:09:57Z

No description provided.

This parser generates type_attributes.json from the official Telegram Bot API documentation (https://core.telegram.org/bots/api) since the OpenAPI schema is no longer maintained. Features: - Parses all three type categories: * Regular types (250): Types with fields and attributes (User, Message, etc.) * Union types (21): Polymorphic types (MessageOrigin, ChatMember, etc.) * Empty types (6): Marker types with no fields (ForumTopicClosed, etc.) - Automatically detects 46+ new types from latest API version - Handles arrays, nested types, and complex type references - Preserves custom types (Error) not in official docs - Validates against existing type_attributes.json format Structure Analysis: - Analyzed type_attributes.json schema and format - Mapped HTML documentation structure to JSON format - Identified detection patterns for all type categories - Implemented field attribute parsing (required, default, items, etc.) Parser Implementation (lib/telegram_api_parser.rb): - Fetches and parses HTML using Nokogiri - Detects types by analyzing h4 headers, tables, and lists - Classifies types based on content patterns - Generates JSON matching existing format - Total: 278 types generated (250 regular + 21 union + 6 empty + 1 custom) Documentation: - README_PARSER.md: Comprehensive usage guide and maintenance instructions - PARSER_SUMMARY.md: Detailed analysis and implementation summary Validation Tools: - test_html_fetch.rb: HTML structure explorer - check_union_types.rb: Union type detection validator - check_empty_types.rb: Empty type detection validator - check_missing_unions.rb: Missing union type finder - debug_union_detection.rb: Pattern matching debugger - analyze_differences.rb: Compare existing vs generated types Output: - data/type_attributes_generated.json: Fresh types from current API (278 types) - All existing types validated and matched - 46+ new types detected from latest API updates This enables automated updates of type definitions whenever Telegram releases new Bot API versions.

- Added support for Unicode smart quotes (\u201C and \u201D) used in HTML - Extended patterns to match both 'always "value"' and 'must be value' - Added 'currency' to discriminator fields list - Now correctly parses all 95 required_value fields (vs 83 before) Validated critical fields: - BackgroundFillSolid.type = "solid" - MessageOriginUser.type = "user" - ChatMemberOwner.status = "creator" - PassportElementErrorDataField.source = "data" - RefundedPayment.currency = "XTR" Added validation tools: - comprehensive_comparison.rb: Compare generated vs existing types - final_validation.rb: Validate critical discriminator fields

These scripts were used to diagnose and fix the required_value parsing issue: - check_background_fill.rb: Check specific type HTML structure - check_discriminator_patterns.rb: Validate discriminator pattern detection - check_passport_error.rb: Debug PassportElementError types - check_required_values.rb: Verify required_value fields in generated JSON - debug_discriminator.rb: Test regex pattern matching

This commit fixes multiple parsing issues to ensure the generated JSON is compatible with rebuild_types.rake and minimizes unwanted changes. Issues Fixed: 1. Missing min_size/max_size parsing - Now extracts from 'N-M characters' pattern in descriptions - Correctly handles BotCommand.command (1-32), ChatLocation.address (1-64), etc. - Skips min_size if 0 to match existing format 2. Union field types (Integer or String) - Fields like chat_id now correctly represented as ["integer", "string"] - Previously only captured first type from 'A or B' pattern - Affects BotCommandScopeChat, BotCommandScopeChatMember, etc. 3. Nested array structures - Handles 'Array of Array of X' properly - InlineKeyboardMarkup.inline_keyboard now has correct nested structure - Other inline query results also fixed 4. Float → number type mapping - Changed TYPE_MAPPING to use 'number' instead of 'float' - Maintains backward compatibility with existing JSON - Matches rake task expectations (add_module_types converts number → Float) 5. Default values from descriptions - Parses 'Defaults to X' pattern - InlineQueryResultGif.thumbnail_mime_type now has default: "image/jpeg" - Handles quoted and unquoted defaults Results: - Types differing: 53 → 12 (77% reduction) - Remaining 12 are legitimate API changes (new fields like checklist, direct_messages, suggested_post features) - All structural differences resolved Validation: ✓ min_size/max_size constraints ✓ Union field types ✓ Nested arrays ✓ Default values ✓ Correct type mapping ✓ Compatible with rebuild_types.rake Added: - detailed_comparison.rb: Tool to compare field-by-field differences - parser_improvements_summary.md: Detailed documentation of fixes

Problem: - Parser incorrectly extracted 'default: "th"' from descriptions like 'defaults to the value of other_field' - This caused ChatPermissions.can_manage_topics to get 'default: "the"' which broke rebuild_types.rake Root Cause: - Regex '/defaults to\s+(\w+)(?\!\s+value)/i' had backtracking issue - When trying to match 'defaults to the value': 1. Captures 'the' with \w+ 2. Negative lookahead checks if NOT followed by ' value' 3. Fails because 'the' IS followed by ' value' 4. Backtracks and captures 'th' instead 5. 'th' is followed by 'e value' (not ' value'), so succeeds Solution: - Made regex much more restrictive - Only accept specific patterns: 1. Quoted strings: 'defaults to "value"' 2. Boolean literals: 'defaults to true|false' 3. Numeric literals: 'defaults to 0' - Skip all other patterns (like field references) Before: ChatPermissions.can_manage_topics: {"type": "boolean", "default": "th"} After: ChatPermissions.can_manage_topics: {"type": "boolean"} Verified: ✓ Field references no longer captured as defaults ✓ Legitimate defaults still work ("image/jpeg", true, 0, etc.) ✓ No unwanted default values in rebuild_types output Added: - debug_default_parsing.rb: Tool to test default value regex patterns

Problem: - SuggestedPostPrice.currency was generating required_value: "one" and default: "one" - Description: "Currency... must be one of \"XTR\" for Telegram Stars or \"TON\"" - The pattern "must be one of X or Y" indicates a choice, not a discriminator field Root Cause: - Regex /must be\s+(\w+)(?!\s+of)/i had backtracking issues - Negative lookahead would fail on "one", backtrack and capture partial matches Solution: - Added explicit check for "must be one of" pattern before attempting match - Pattern: !description.match?(/must be\s+one\s+of/i) && (match = description.match(/must be\s+(\w+)\b/i)) - This prevents the regex from even attempting to match on multi-choice fields Results: - SuggestedPostPrice.currency now correctly has no required_value/default - RefundedPayment.currency still correctly has required_value: "XTR" (fixed value, not choice) - All 94 discriminator fields validated correct

- Make Zeitwerk loader accessible as LOADER constant - Add conditional eager loading via EAGER_LOAD env var - Enable eager loading in spec_helper for test environment - Ensures all classes are loaded upfront during tests for predictable behavior

- Move LOADER constant inside Telegram::Bot module for better encapsulation - Call LOADER.eager_load directly in spec_helper after requiring the library - Simpler and more explicit than using environment variable

@loader

- Store Zeitwerk loader as module instance variable @loader - Add Telegram::Bot.eager_load! method for cleaner API - Update spec_helper to use eager_load! instead of accessing constant - More idiomatic Ruby interface

- Remove custom eager_load! method from Telegram::Bot - Use Zeitwerk's built-in eager_load_namespace class method in spec_helper - Simpler implementation without exposing loader or custom methods - Cleaner separation of concerns

Update parser to properly handle float default values (e.g., 0.0, 1.5) instead of converting them to integers. The parser now: - Matches numeric patterns with optional decimal points (\d+\.?\d*) - Uses to_f for values containing '.' and to_i for integers - Preserves float representation (0.0 stays 0.0, not 0) Files updated: - lib/telegram_api_parser.rb: Updated numeric default parsing logic - debug_default_parsing.rb: Updated to match new implementation - test_float_defaults.rb: Added comprehensive test suite for float defaults

- Move telegram_api_parser.rb from lib/ to rakelib/ - Create parse_telegram rake task similar to parse_schema - Remove CLI execution section from parser (now handled by rake task) - Rake task supports OUTPUT env variable to specify output file

- Move telegram_api_parser.rb to rakelib/docs_parsers/types_parser.rb - Rename TelegramApiParser class to DocsParsers::TypesParser - Create new DocsParsers::MethodsParser to parse API methods from docs - Add rake task :parse_methods to generate methods.json - Update :parse_docs rake task to use new parser structure The methods parser extracts method names and return types from the Telegram Bot API documentation, supporting various return type patterns including: - Simple types (Bool, String, Integer) - Complex types (User, Message, etc.) - Arrays (Array of X) - Union types (X | Y) Currently parses 111 API methods from the documentation.

Replace net/http with open-uri for simpler HTTP requests in both: - rakelib/docs_parsers/types_parser.rb - rakelib/docs_parsers/methods_parser.rb This simplifies the fetch method implementation.

- Create rakelib/rebuild_methods.rake task that regenerates lib/telegram/bot/api/endpoints.rb from data/methods.json - Add data/methods.json with 111 parsed API methods - Task sorts methods alphabetically for consistency - Run with: rake rebuild_methods The rebuild_methods task reads the parsed methods from methods.json and generates the endpoints.rb file with the proper Ruby module structure and type expressions.

- Add rakelib/templates/endpoints.erb template for generating endpoints.rb - Update rebuild_methods.rake to use ERB template like rebuild_types - Simplifies the rake task code and makes it consistent with rebuild_types - Template generates properly formatted endpoints.rb with no extra blank lines

Move all test/debug files created during parser development to tmp/: - analyze_differences.rb - check_background_fill.rb - check_discriminator_patterns.rb - check_empty_types.rb - check_missing_unions.rb - check_passport_error.rb - check_required_values.rb - check_union_types.rb - comprehensive_comparison.rb - debug_default_parsing.rb - debug_discriminator.rb - debug_union_detection.rb - detailed_comparison.rb - final_validation.rb - test_discriminator_patterns.rb - test_float_defaults.rb - test_html_fetch.rb These files are not needed in the repository.

- Add comprehensive YARD documentation to TypesParser class explaining: - Why we parse the docs (eliminate manual maintenance, ensure sync with API) - How we parse (three type categories, parsing patterns, improvements) - All six parser improvements (min/max size, union types, nested arrays, etc.) - Validation results (53 types differed before, 12 after - only API changes) - Add comprehensive YARD documentation to MethodsParser class explaining: - Why we parse methods (accurate return types, IDE autocomplete) - How we parse (identify methods, extract descriptions, parse patterns) - Six parsing patterns (boolean success, arrays, unions, simple objects, etc.) - Type mapping to Ruby dry-types representations - Delete parser_improvements_summary.md (information now in code documentation) Documentation is now co-located with the implementation, making it easier to maintain and discover. YARD format enables better IDE integration and can generate API documentation if needed.

@performance

**TypesParser enhancements:** - Add @problem_statement explaining why parser exists (OpenAPI no longer maintained) - Add @output_format with complete JSON schema structure - Add @detection_logic with decision tree for type classification - Expand @how_we_parse with detailed HTML patterns for all three type categories - Add @type_mappings showing all type conversions - Add @Dependencies, @performance, and @known_limitations sections - Add @custom_types section explaining Error type - Add @usage_workflow with step-by-step update process - Enhance examples with more comprehensive usage patterns **MethodsParser enhancements:** - Add @problem_statement explaining the need for method parsing - Add @output_format with example JSON structure - Expand @how_we_parse with detailed HTML structure for methods - Add @pattern_matching_details explaining skip words and fallbacks - Add @type_mapping showing Ruby dry-types conversions - Add @known_limitations section (6 key limitations) - Add @Dependencies, @performance sections - Add @validation_approach with 5-step validation process - Add @usage_workflow for method updates - Enhance examples with programmatic usage patterns **Documentation consolidation:** - Delete README_PARSER.md (information now in TypesParser YARD docs) - Delete PARSER_SUMMARY.md (information now in both parser YARD docs) All parser documentation is now co-located with implementation, making it fully self-explanatory. Anyone can read the parser files and understand: - Why we parse (problem statement) - What we parse (output format) - How we parse (detection logic, HTML patterns) - What the limitations are - How to use and validate the parsers

@performance

Replace custom YARD tags with standard YARD formatting: - Remove custom tags: @overview, @problem_statement, @why_we_parse, @output_format, @how_we_parse, @detection_logic, @parser_improvements, @type_mappings, @validation_results, @Dependencies, @performance, @known_limitations, @custom_types, @usage_workflow, @pattern_matching_details, @validation_approach - Use standard YARD formatting: - Regular text for class description (no tags needed) - == for major sections (e.g., "== Why This Parser Exists") - === for subsections (e.g., "=== 1. Regular Types") - @note for important limitations and caveats - @example for usage examples (already standard) - @see for external references (already standard) All information is preserved, now using proper YARD conventions that are compatible with standard documentation generators like yard-doc. Both TypesParser and MethodsParser are updated consistently.

claude and others added 30 commits November 26, 2025 20:30

Enable eager loading directly in spec_helper

3b2bccc

- Move LOADER constant inside Telegram::Bot module for better encapsulation - Call LOADER.eager_load directly in spec_helper after requiring the library - Simpler and more explicit than using environment variable

Replace LOADER constant with eager_load! method

36543dc

- Store Zeitwerk loader as module instance variable @loader - Add Telegram::Bot.eager_load! method for cleaner API - Update spec_helper to use eager_load! instead of accessing constant - More idiomatic Ruby interface

Rename rake task from parse_telegram to parse_docs

bc94c82

Update types

f537ae7

Update rubocop

2424c44

Switch from net/http to open-uri in parsers

c02ee60

Replace net/http with open-uri for simpler HTTP requests in both: - rakelib/docs_parsers/types_parser.rb - rakelib/docs_parsers/methods_parser.rb This simplifies the fetch method implementation.

Update things

81a4193

Re-build endpoints.rb

11cdaa2

Fix MethodsParser

ef78590

Fix rubocop violations

18cbfa9

Re-arrange rake tasks

a8419f7

Rebuild types, improve rake tasks

9164026

Rename data files

8b3b8e5

atipugin added 3 commits November 27, 2025 16:23

Refactor parsers

a80f872

Remove unused requires

406348e

Update docs

65e542c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add docs parsers, update types and methods #324

Add docs parsers, update types and methods #324

atipugin commented Nov 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add docs parsers, update types and methods #324

Are you sure you want to change the base?

Add docs parsers, update types and methods #324

Conversation

atipugin commented Nov 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants