Skip to content

Analyze and improve formula parsing robustness #96

@Jython1415

Description

@Jython1415

Formula Parsing Strategy Analysis and Recommendations

Summary

The formula expansion system uses a hybrid approach combining pyparsing for structured AST parsing with regex as a fallback for argument extraction. The recent enhancement (commit 7b7e1cb) added sophisticated regex-based argument extraction using balanced parenthesis matching to handle cases where pyparsing fails.

Current Architecture

1. pyparsing (Primary Parser)

Location: FormulaParser.__init__() (lines 58-93)

What it handles:

  • Identifiers (function names, variables)
  • String literals (single and double quotes with escaping)
  • Numbers (via pyparsing_common.number())
  • Function calls with arguments
  • Zero-argument function calls (via Optional)

What it does NOT handle:

  • LET syntax and variable bindings
  • LAMBDA function definitions
  • Spreadsheet range operators (A1:B10, :)
  • Array literals ({1,2,3})
  • Binary operators (+, -, *, /, &, =, etc.)
  • Comparison operators (<, >, <=, >=, <>)
  • Unary operators (-, NOT)

Parsing strategy:

result = self.grammar.parse_string(normalized, parse_all=False)
  • Uses parse_all=False - only parses what matches the grammar
  • Silently skips unparsable content
  • No error reporting for partial failures

2. Regex Fallback (Recently Enhanced)

When activated: When pyparsing AST walk finds no function calls (line 155)

Components:

  1. Function detection (line 159):

    pattern = r"\b" + re.escape(func_name) + r"\s*\("
  2. Argument extraction (new in 7b7e1cb):

    • _extract_args_from_formula(): Balanced parenthesis matching
    • _split_arguments(): Comma splitting with nesting awareness
    • Handles nested parentheses, braces, and quoted strings

What it handles:

  • Finds function calls even in complex LET/LAMBDA structures
  • Extracts arguments as raw strings
  • Respects string boundaries (single/double quotes)
  • Respects nesting (parentheses and braces)

What it does NOT handle well:

  • Escaped quotes in edge cases (checks formula[i-1] != '\\' but doesn't handle \\\" )
  • Triple-quoted strings (not valid in Sheets but could confuse parser)
  • Comments within function calls (should be stripped first, which it is)
  • Character-level syntax errors (returns empty list on failure)

When Do We Use Each Strategy?

pyparsing is used:

  1. Always attempted first for all formulas
  2. Succeeds on: Simple function calls, properly nested expressions
  3. Fails on: LET statements, complex multi-line formulas, operators outside function calls

regex fallback is used:

  1. When pyparsing finds no calls in the AST
  2. Typical cases:
    • Function calls nested within LET bindings
    • Multi-line formulas with complex structure
    • Formulas with operators pyparsing doesn't understand

Current flow:

Formula → pyparsing parse → AST walk → Found calls?
                                       ↓ No
                                    regex fallback → Extract calls

Is Our Use of Regex Appropriate?

✅ Appropriate aspects:

  1. Fallback strategy is sound: Try structured parsing first, fall back to pattern matching
  2. Balanced parenthesis matching is correct approach: Can't use simple regex for nested structures
  3. Handles the actual problem: LET-nested calls that pyparsing can't parse
  4. Practical solution: Works for all 33 current formulas

⚠️ Concerns:

  1. Inconsistent output types:

    • pyparsing returns: ParseResults objects with rich structure
    • regex returns: plain strings
    • Later code must handle both (see expand_argument, substitute_arguments)
  2. Silent failures:

    • Regex returns empty list if parentheses don't match
    • No way to distinguish "no calls found" from "parsing failed"
  3. Maintenance burden:

    • Two parallel code paths for same functionality
    • Testing requires covering both paths
    • Bugs could exist in either path
  4. Semantic understanding:

    • Regex has no understanding of formula semantics
    • Could match function names inside strings (mitigated by string tracking)
    • Could fail on valid syntax it doesn't expect

Anticipated Failure Cases

1. Escaped Characters

=MYFUNCTION("Say \"Hello\"")

Current code checks formula[i-1] != '\\' but doesn't handle:

  • Double backslashes: "\\\" (backslash followed by escaped quote)
  • Backslash at position 0

2. Edge Case: Empty Function Calls with Whitespace

=DENSIFY(  )  // Zero args with spaces

Should work but not explicitly tested.

3. Complex Nested Structures

=LET(
  helper, LAMBDA(x, OTHERFUNC(x, {1,2,3})),
  result, BYROW(data, helper),
  FINALFUNC(result, "mode")
)

Multiple levels of LET + LAMBDA + function calls. Current regex should handle but not tested.

4. String Concatenation with Function Calls

="Prefix: " & MYFUNCTION(x) & " Suffix"

pyparsing doesn't handle & operator. Regex should find MYFUNCTION but not tested.

5. Error Functions

=IFERROR(MYFUNCTION(x), ERROR("Custom message"))

Multiple function calls, one is ERROR. Should work but edge case.

6. Array Formulas with Braces

=MYFUNCTION({1,2,3;4,5,6}, "mode")

Arrays with row separator ;. Current regex tracks braces but not tested with semicolons.

7. Range Operators

=MYFUNCTION(A1:B10, C:C)

Colon operator not in pyparsing grammar. Should work via regex but not tested.

8. Multiple Calls Same Function

=MYFUNCTION(x) + MYFUNCTION(y)

Regex uses finditer so should find both, but not tested.

Can We Define Test Cases?

Yes! Recommended test cases:

Category 1: Basic Parsing

def test_simple_function_call():
    """FUNC(arg1, arg2)"""

def test_zero_arg_function():
    """FUNC()"""
    
def test_nested_function_calls():
    """OUTER(INNER(x))"""

Category 2: LET and LAMBDA

def test_let_nested_call():
    """LET(x, FUNC(y), x + 1)"""
    
def test_lambda_with_call():
    """LAMBDA(x, FUNC(x, "mode"))"""
    
def test_complex_let_lambda():
    """LET(helper, LAMBDA(x, FUNC1(x)), FUNC2(helper))"""

Category 3: String Handling

def test_string_with_quotes():
    """FUNC("value with 'single' quotes")"""
    
def test_escaped_quotes():
    """FUNC("value with \\"escaped\\" quotes")"""
    
def test_mixed_quote_types():
    """FUNC('single', "double")"""

Category 4: Arrays and Ranges

def test_array_literal():
    """FUNC({1,2,3}, "mode")"""
    
def test_2d_array():
    """FUNC({1,2;3,4})"""
    
def test_range_reference():
    """FUNC(A1:B10, C:C)"""

Category 5: Operators

def test_arithmetic_operators():
    """FUNC(x + y, z * 2)"""
    
def test_string_concatenation():
    """FUNC("prefix" & value & "suffix")"""
    
def test_comparison_operators():
    """FUNC(x > 0, y <= 10)"""

Category 6: Edge Cases

def test_multiple_same_function():
    """FUNC(x) + FUNC(y)"""
    
def test_whitespace_variations():
    """FUNC( arg1 , arg2 )"""
    
def test_multiline_formula():
    # Complex multi-line LET statement
    
def test_deeply_nested():
    """FUNC1(FUNC2(FUNC3(FUNC4(x))))"""

Category 7: Real-World Patterns

def test_densify_expansion():
    """Test actual DENSIFY(range, 'mode') expansion"""
    
def test_unpivot_with_blanktoempty():
    """Test UNPIVOT formula that calls BLANKTOEMPTY"""
    
def test_vstackblank_expansion():
    """Test VSTACKFILL(a, b, BLANK()) expansion"""

What Would It Take to Make It More Robust?

Option 1: Improve pyparsing Grammar (Recommended)

Effort: Medium | Risk: Low | Benefit: High

Approach:

  1. Add support for LET syntax:

    let_binding = identifier + Literal(",").suppress() + expression
    let_expr = (Literal("LET") + lparen + 
                DelimitedList(Group(let_binding)) + rparen)
  2. Add support for LAMBDA:

    lambda_expr = (Literal("LAMBDA") + lparen + 
                   DelimitedList(identifier) + Literal(",").suppress() +
                   expression + rparen)
  3. Add operator support:

    operator = MatchFirst([Literal(op) for op in ['+', '-', '*', '/', '&', ':', ...]])
    binary_expr = expression + operator + expression
  4. Add range references:

    cell_ref = Word(alphas, max=3) + Word(nums)
    range_ref = cell_ref + Literal(":") + cell_ref

Pros:

  • Maintains structured parsing
  • Better error reporting
  • Can validate syntax
  • Single code path reduces complexity

Cons:

  • Requires significant pyparsing expertise
  • Google Sheets formula syntax is complex
  • May never handle 100% of cases

Option 2: Pure Regex Approach

Effort: Low | Risk: Medium | Benefit: Medium

Approach:

  1. Remove pyparsing dependency
  2. Use only regex with balanced matching
  3. Simplify code to single parsing path

Pros:

  • Simpler codebase
  • Consistent output format
  • Easier to understand and maintain
  • Already works for all formulas

Cons:

  • No structural understanding
  • Harder to validate formulas
  • Error messages less helpful
  • Regex harder to extend

Option 3: Use Specialized Parser Library

Effort: High | Risk: High | Benefit: High

Approach:

  1. Research existing Google Sheets formula parsers
  2. Options might include:
    • Lark parser (more powerful than pyparsing)
    • ANTLR (industrial-strength parser generator)
    • Tree-sitter (incremental parsing)
    • Custom recursive descent parser

Pros:

  • Potentially complete solution
  • Better error messages
  • Could enable formula validation
  • Could enable formula analysis/transformation

Cons:

  • Significant development time
  • May not exist for Google Sheets
  • Adds external dependency
  • Overkill for current needs

Option 4: Hybrid Refinement (Minimum Viable Improvement)

Effort: Low | Risk: Low | Benefit: Medium

Approach:

  1. Keep current hybrid approach

  2. Improve error handling:

    try:
        args = self._extract_args_from_formula(formula, match.end() - 1)
        if args is None:
            raise ParseException(f"Failed to extract args for {func_name}")
    except Exception as e:
        log_warning(f"Regex fallback failed: {e}")
        continue
  3. Add validation:

    # Verify extracted args make sense
    if any(arg.count('(') != arg.count(')') for arg in args):
        raise ParseException("Unbalanced parentheses in arguments")
  4. Expand test coverage (see test cases above)

  5. Add debug mode:

    if DEBUG:
        print(f"Using {'pyparsing' if calls else 'regex'} for {name}")

Pros:

  • Low risk, immediate benefit
  • Builds on working solution
  • Better error messages
  • Easier debugging

Cons:

  • Doesn't address fundamental limitations
  • Still two code paths to maintain

Recommendations

Immediate (Do Now):

  1. Add comprehensive test suite covering all identified failure cases
  2. Add error handling to regex fallback with clear error messages
  3. Add debug logging to track which parser is used for each formula
  4. Document limitations in code comments

Short Term (Next Sprint):

  1. Improve pyparsing grammar to handle LET and LAMBDA
  2. Add validation for extracted arguments
  3. Create parser benchmarks to track performance

Long Term (Future):

  1. Evaluate specialized parsers (Lark, ANTLR) if complexity grows
  2. Consider adding formula validation beyond just expansion
  3. Build formula analyzer to detect potential issues

Conclusion

The current hybrid approach is pragmatic and effective for the current use case. The regex fallback is appropriately used as a safety net when pyparsing fails. However, the system would benefit from:

  1. Better test coverage (highest priority)
  2. Improved error handling (high priority)
  3. Enhanced pyparsing grammar (medium priority)
  4. Better documentation of when each parser is used (medium priority)

The regex approach is a good short-term solution but improving pyparsing would provide a better long-term foundation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions