Analyze and improve formula parsing robustness

# Formula Parsing Strategy Analysis and Recommendations

## Summary

The formula expansion system uses a hybrid approach combining **pyparsing** for structured AST parsing with **regex** as a fallback for argument extraction. The recent enhancement (commit 7b7e1cb) added sophisticated regex-based argument extraction using balanced parenthesis matching to handle cases where pyparsing fails.

## Current Architecture

### 1. pyparsing (Primary Parser)

**Location**: `FormulaParser.__init__()` (lines 58-93)

**What it handles**:
- Identifiers (function names, variables)
- String literals (single and double quotes with escaping)
- Numbers (via pyparsing_common.number())
- Function calls with arguments
- Zero-argument function calls (via Optional)

**What it does NOT handle**:
- LET syntax and variable bindings
- LAMBDA function definitions
- Spreadsheet range operators (A1:B10, :)
- Array literals ({1,2,3})
- Binary operators (+, -, *, /, &, =, etc.)
- Comparison operators (<, >, <=, >=, <>)
- Unary operators (-, NOT)

**Parsing strategy**:
```python
result = self.grammar.parse_string(normalized, parse_all=False)
```
- Uses `parse_all=False` - only parses what matches the grammar
- Silently skips unparsable content
- No error reporting for partial failures

### 2. Regex Fallback (Recently Enhanced)

**When activated**: When pyparsing AST walk finds no function calls (line 155)

**Components**:

1. **Function detection** (line 159):
   ```python
   pattern = r"\b" + re.escape(func_name) + r"\s*\("
   ```

2. **Argument extraction** (new in 7b7e1cb):
   - `_extract_args_from_formula()`: Balanced parenthesis matching
   - `_split_arguments()`: Comma splitting with nesting awareness
   - Handles nested parentheses, braces, and quoted strings

**What it handles**:
- Finds function calls even in complex LET/LAMBDA structures
- Extracts arguments as raw strings
- Respects string boundaries (single/double quotes)
- Respects nesting (parentheses and braces)

**What it does NOT handle well**:
- Escaped quotes in edge cases (checks `formula[i-1] != '\\'` but doesn't handle `\\\"` )
- Triple-quoted strings (not valid in Sheets but could confuse parser)
- Comments within function calls (should be stripped first, which it is)
- Character-level syntax errors (returns empty list on failure)

## When Do We Use Each Strategy?

### pyparsing is used:
1. **Always attempted first** for all formulas
2. **Succeeds on**: Simple function calls, properly nested expressions
3. **Fails on**: LET statements, complex multi-line formulas, operators outside function calls

### regex fallback is used:
1. **When pyparsing finds no calls** in the AST
2. **Typical cases**:
   - Function calls nested within LET bindings
   - Multi-line formulas with complex structure
   - Formulas with operators pyparsing doesn't understand

### Current flow:
```
Formula → pyparsing parse → AST walk → Found calls?
                                       ↓ No
                                    regex fallback → Extract calls
```

## Is Our Use of Regex Appropriate?

### ✅ Appropriate aspects:

1. **Fallback strategy is sound**: Try structured parsing first, fall back to pattern matching
2. **Balanced parenthesis matching is correct approach**: Can't use simple regex for nested structures
3. **Handles the actual problem**: LET-nested calls that pyparsing can't parse
4. **Practical solution**: Works for all 33 current formulas

### ⚠️ Concerns:

1. **Inconsistent output types**:
   - pyparsing returns: ParseResults objects with rich structure
   - regex returns: plain strings
   - Later code must handle both (see `expand_argument`, `substitute_arguments`)

2. **Silent failures**:
   - Regex returns empty list if parentheses don't match
   - No way to distinguish "no calls found" from "parsing failed"

3. **Maintenance burden**:
   - Two parallel code paths for same functionality
   - Testing requires covering both paths
   - Bugs could exist in either path

4. **Semantic understanding**:
   - Regex has no understanding of formula semantics
   - Could match function names inside strings (mitigated by string tracking)
   - Could fail on valid syntax it doesn't expect

## Anticipated Failure Cases

### 1. Escaped Characters
```excel
=MYFUNCTION("Say \"Hello\"")
```
Current code checks `formula[i-1] != '\\'` but doesn't handle:
- Double backslashes: `"\\\"` (backslash followed by escaped quote)
- Backslash at position 0

### 2. Edge Case: Empty Function Calls with Whitespace
```excel
=DENSIFY(  )  // Zero args with spaces
```
Should work but not explicitly tested.

### 3. Complex Nested Structures
```excel
=LET(
  helper, LAMBDA(x, OTHERFUNC(x, {1,2,3})),
  result, BYROW(data, helper),
  FINALFUNC(result, "mode")
)
```
Multiple levels of LET + LAMBDA + function calls. Current regex should handle but not tested.

### 4. String Concatenation with Function Calls
```excel
="Prefix: " & MYFUNCTION(x) & " Suffix"
```
pyparsing doesn't handle `&` operator. Regex should find MYFUNCTION but not tested.

### 5. Error Functions
```excel
=IFERROR(MYFUNCTION(x), ERROR("Custom message"))
```
Multiple function calls, one is ERROR. Should work but edge case.

### 6. Array Formulas with Braces
```excel
=MYFUNCTION({1,2,3;4,5,6}, "mode")
```
Arrays with row separator `;`. Current regex tracks braces but not tested with semicolons.

### 7. Range Operators
```excel
=MYFUNCTION(A1:B10, C:C)
```
Colon operator not in pyparsing grammar. Should work via regex but not tested.

### 8. Multiple Calls Same Function
```excel
=MYFUNCTION(x) + MYFUNCTION(y)
```
Regex uses `finditer` so should find both, but not tested.

## Can We Define Test Cases?

### Yes! Recommended test cases:

#### **Category 1: Basic Parsing**
```python
def test_simple_function_call():
    """FUNC(arg1, arg2)"""

def test_zero_arg_function():
    """FUNC()"""
    
def test_nested_function_calls():
    """OUTER(INNER(x))"""
```

#### **Category 2: LET and LAMBDA**
```python
def test_let_nested_call():
    """LET(x, FUNC(y), x + 1)"""
    
def test_lambda_with_call():
    """LAMBDA(x, FUNC(x, "mode"))"""
    
def test_complex_let_lambda():
    """LET(helper, LAMBDA(x, FUNC1(x)), FUNC2(helper))"""
```

#### **Category 3: String Handling**
```python
def test_string_with_quotes():
    """FUNC("value with 'single' quotes")"""
    
def test_escaped_quotes():
    """FUNC("value with \\"escaped\\" quotes")"""
    
def test_mixed_quote_types():
    """FUNC('single', "double")"""
```

#### **Category 4: Arrays and Ranges**
```python
def test_array_literal():
    """FUNC({1,2,3}, "mode")"""
    
def test_2d_array():
    """FUNC({1,2;3,4})"""
    
def test_range_reference():
    """FUNC(A1:B10, C:C)"""
```

#### **Category 5: Operators**
```python
def test_arithmetic_operators():
    """FUNC(x + y, z * 2)"""
    
def test_string_concatenation():
    """FUNC("prefix" & value & "suffix")"""
    
def test_comparison_operators():
    """FUNC(x > 0, y <= 10)"""
```

#### **Category 6: Edge Cases**
```python
def test_multiple_same_function():
    """FUNC(x) + FUNC(y)"""
    
def test_whitespace_variations():
    """FUNC( arg1 , arg2 )"""
    
def test_multiline_formula():
    # Complex multi-line LET statement
    
def test_deeply_nested():
    """FUNC1(FUNC2(FUNC3(FUNC4(x))))"""
```

#### **Category 7: Real-World Patterns**
```python
def test_densify_expansion():
    """Test actual DENSIFY(range, 'mode') expansion"""
    
def test_unpivot_with_blanktoempty():
    """Test UNPIVOT formula that calls BLANKTOEMPTY"""
    
def test_vstackblank_expansion():
    """Test VSTACKFILL(a, b, BLANK()) expansion"""
```

## What Would It Take to Make It More Robust?

### Option 1: Improve pyparsing Grammar (Recommended)

**Effort**: Medium | **Risk**: Low | **Benefit**: High

**Approach**:
1. Add support for LET syntax:
   ```python
   let_binding = identifier + Literal(",").suppress() + expression
   let_expr = (Literal("LET") + lparen + 
               DelimitedList(Group(let_binding)) + rparen)
   ```

2. Add support for LAMBDA:
   ```python
   lambda_expr = (Literal("LAMBDA") + lparen + 
                  DelimitedList(identifier) + Literal(",").suppress() +
                  expression + rparen)
   ```

3. Add operator support:
   ```python
   operator = MatchFirst([Literal(op) for op in ['+', '-', '*', '/', '&', ':', ...]])
   binary_expr = expression + operator + expression
   ```

4. Add range references:
   ```python
   cell_ref = Word(alphas, max=3) + Word(nums)
   range_ref = cell_ref + Literal(":") + cell_ref
   ```

**Pros**:
- Maintains structured parsing
- Better error reporting
- Can validate syntax
- Single code path reduces complexity

**Cons**:
- Requires significant pyparsing expertise
- Google Sheets formula syntax is complex
- May never handle 100% of cases

### Option 2: Pure Regex Approach

**Effort**: Low | **Risk**: Medium | **Benefit**: Medium

**Approach**:
1. Remove pyparsing dependency
2. Use only regex with balanced matching
3. Simplify code to single parsing path

**Pros**:
- Simpler codebase
- Consistent output format
- Easier to understand and maintain
- Already works for all formulas

**Cons**:
- No structural understanding
- Harder to validate formulas
- Error messages less helpful
- Regex harder to extend

### Option 3: Use Specialized Parser Library

**Effort**: High | **Risk**: High | **Benefit**: High

**Approach**:
1. Research existing Google Sheets formula parsers
2. Options might include:
   - Lark parser (more powerful than pyparsing)
   - ANTLR (industrial-strength parser generator)
   - Tree-sitter (incremental parsing)
   - Custom recursive descent parser

**Pros**:
- Potentially complete solution
- Better error messages
- Could enable formula validation
- Could enable formula analysis/transformation

**Cons**:
- Significant development time
- May not exist for Google Sheets
- Adds external dependency
- Overkill for current needs

### Option 4: Hybrid Refinement (Minimum Viable Improvement)

**Effort**: Low | **Risk**: Low | **Benefit**: Medium

**Approach**:
1. Keep current hybrid approach
2. Improve error handling:
   ```python
   try:
       args = self._extract_args_from_formula(formula, match.end() - 1)
       if args is None:
           raise ParseException(f"Failed to extract args for {func_name}")
   except Exception as e:
       log_warning(f"Regex fallback failed: {e}")
       continue
   ```

3. Add validation:
   ```python
   # Verify extracted args make sense
   if any(arg.count('(') != arg.count(')') for arg in args):
       raise ParseException("Unbalanced parentheses in arguments")
   ```

4. Expand test coverage (see test cases above)

5. Add debug mode:
   ```python
   if DEBUG:
       print(f"Using {'pyparsing' if calls else 'regex'} for {name}")
   ```

**Pros**:
- Low risk, immediate benefit
- Builds on working solution
- Better error messages
- Easier debugging

**Cons**:
- Doesn't address fundamental limitations
- Still two code paths to maintain

## Recommendations

### Immediate (Do Now):
1. ✅ **Add comprehensive test suite** covering all identified failure cases
2. ✅ **Add error handling** to regex fallback with clear error messages
3. ✅ **Add debug logging** to track which parser is used for each formula
4. ✅ **Document limitations** in code comments

### Short Term (Next Sprint):
1. **Improve pyparsing grammar** to handle LET and LAMBDA
2. **Add validation** for extracted arguments
3. **Create parser benchmarks** to track performance

### Long Term (Future):
1. **Evaluate specialized parsers** (Lark, ANTLR) if complexity grows
2. **Consider adding formula validation** beyond just expansion
3. **Build formula analyzer** to detect potential issues

## Conclusion

The current hybrid approach is **pragmatic and effective** for the current use case. The regex fallback is **appropriately used** as a safety net when pyparsing fails. However, the system would benefit from:

1. **Better test coverage** (highest priority)
2. **Improved error handling** (high priority)
3. **Enhanced pyparsing grammar** (medium priority)
4. **Better documentation** of when each parser is used (medium priority)

The regex approach is a good **short-term solution** but improving pyparsing would provide a better **long-term foundation**.

Analyze and improve formula parsing robustness #96

Description

Formula Parsing Strategy Analysis and Recommendations

Summary

Current Architecture

1. pyparsing (Primary Parser)

2. Regex Fallback (Recently Enhanced)

When Do We Use Each Strategy?

pyparsing is used:

regex fallback is used:

Current flow:

Is Our Use of Regex Appropriate?

✅ Appropriate aspects:

⚠️ Concerns:

Anticipated Failure Cases

1. Escaped Characters

2. Edge Case: Empty Function Calls with Whitespace

3. Complex Nested Structures

4. String Concatenation with Function Calls

5. Error Functions

6. Array Formulas with Braces

7. Range Operators

8. Multiple Calls Same Function

Can We Define Test Cases?

Yes! Recommended test cases:

Category 1: Basic Parsing

Category 2: LET and LAMBDA

Category 3: String Handling

Category 4: Arrays and Ranges

Category 5: Operators

Category 6: Edge Cases

Category 7: Real-World Patterns

What Would It Take to Make It More Robust?

Option 1: Improve pyparsing Grammar (Recommended)

Option 2: Pure Regex Approach

Option 3: Use Specialized Parser Library

Option 4: Hybrid Refinement (Minimum Viable Improvement)

Recommendations

Immediate (Do Now):

Short Term (Next Sprint):

Long Term (Future):

Conclusion

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions