-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Formula Parsing Strategy Analysis and Recommendations
Summary
The formula expansion system uses a hybrid approach combining pyparsing for structured AST parsing with regex as a fallback for argument extraction. The recent enhancement (commit 7b7e1cb) added sophisticated regex-based argument extraction using balanced parenthesis matching to handle cases where pyparsing fails.
Current Architecture
1. pyparsing (Primary Parser)
Location: FormulaParser.__init__() (lines 58-93)
What it handles:
- Identifiers (function names, variables)
- String literals (single and double quotes with escaping)
- Numbers (via pyparsing_common.number())
- Function calls with arguments
- Zero-argument function calls (via Optional)
What it does NOT handle:
- LET syntax and variable bindings
- LAMBDA function definitions
- Spreadsheet range operators (A1:B10, :)
- Array literals ({1,2,3})
- Binary operators (+, -, *, /, &, =, etc.)
- Comparison operators (<, >, <=, >=, <>)
- Unary operators (-, NOT)
Parsing strategy:
result = self.grammar.parse_string(normalized, parse_all=False)- Uses
parse_all=False- only parses what matches the grammar - Silently skips unparsable content
- No error reporting for partial failures
2. Regex Fallback (Recently Enhanced)
When activated: When pyparsing AST walk finds no function calls (line 155)
Components:
-
Function detection (line 159):
pattern = r"\b" + re.escape(func_name) + r"\s*\("
-
Argument extraction (new in 7b7e1cb):
_extract_args_from_formula(): Balanced parenthesis matching_split_arguments(): Comma splitting with nesting awareness- Handles nested parentheses, braces, and quoted strings
What it handles:
- Finds function calls even in complex LET/LAMBDA structures
- Extracts arguments as raw strings
- Respects string boundaries (single/double quotes)
- Respects nesting (parentheses and braces)
What it does NOT handle well:
- Escaped quotes in edge cases (checks
formula[i-1] != '\\'but doesn't handle\\\") - Triple-quoted strings (not valid in Sheets but could confuse parser)
- Comments within function calls (should be stripped first, which it is)
- Character-level syntax errors (returns empty list on failure)
When Do We Use Each Strategy?
pyparsing is used:
- Always attempted first for all formulas
- Succeeds on: Simple function calls, properly nested expressions
- Fails on: LET statements, complex multi-line formulas, operators outside function calls
regex fallback is used:
- When pyparsing finds no calls in the AST
- Typical cases:
- Function calls nested within LET bindings
- Multi-line formulas with complex structure
- Formulas with operators pyparsing doesn't understand
Current flow:
Formula → pyparsing parse → AST walk → Found calls?
↓ No
regex fallback → Extract calls
Is Our Use of Regex Appropriate?
✅ Appropriate aspects:
- Fallback strategy is sound: Try structured parsing first, fall back to pattern matching
- Balanced parenthesis matching is correct approach: Can't use simple regex for nested structures
- Handles the actual problem: LET-nested calls that pyparsing can't parse
- Practical solution: Works for all 33 current formulas
⚠️ Concerns:
-
Inconsistent output types:
- pyparsing returns: ParseResults objects with rich structure
- regex returns: plain strings
- Later code must handle both (see
expand_argument,substitute_arguments)
-
Silent failures:
- Regex returns empty list if parentheses don't match
- No way to distinguish "no calls found" from "parsing failed"
-
Maintenance burden:
- Two parallel code paths for same functionality
- Testing requires covering both paths
- Bugs could exist in either path
-
Semantic understanding:
- Regex has no understanding of formula semantics
- Could match function names inside strings (mitigated by string tracking)
- Could fail on valid syntax it doesn't expect
Anticipated Failure Cases
1. Escaped Characters
=MYFUNCTION("Say \"Hello\"")
Current code checks formula[i-1] != '\\' but doesn't handle:
- Double backslashes:
"\\\"(backslash followed by escaped quote) - Backslash at position 0
2. Edge Case: Empty Function Calls with Whitespace
=DENSIFY( ) // Zero args with spaces
Should work but not explicitly tested.
3. Complex Nested Structures
=LET(
helper, LAMBDA(x, OTHERFUNC(x, {1,2,3})),
result, BYROW(data, helper),
FINALFUNC(result, "mode")
)
Multiple levels of LET + LAMBDA + function calls. Current regex should handle but not tested.
4. String Concatenation with Function Calls
="Prefix: " & MYFUNCTION(x) & " Suffix"
pyparsing doesn't handle & operator. Regex should find MYFUNCTION but not tested.
5. Error Functions
=IFERROR(MYFUNCTION(x), ERROR("Custom message"))
Multiple function calls, one is ERROR. Should work but edge case.
6. Array Formulas with Braces
=MYFUNCTION({1,2,3;4,5,6}, "mode")
Arrays with row separator ;. Current regex tracks braces but not tested with semicolons.
7. Range Operators
=MYFUNCTION(A1:B10, C:C)
Colon operator not in pyparsing grammar. Should work via regex but not tested.
8. Multiple Calls Same Function
=MYFUNCTION(x) + MYFUNCTION(y)
Regex uses finditer so should find both, but not tested.
Can We Define Test Cases?
Yes! Recommended test cases:
Category 1: Basic Parsing
def test_simple_function_call():
"""FUNC(arg1, arg2)"""
def test_zero_arg_function():
"""FUNC()"""
def test_nested_function_calls():
"""OUTER(INNER(x))"""Category 2: LET and LAMBDA
def test_let_nested_call():
"""LET(x, FUNC(y), x + 1)"""
def test_lambda_with_call():
"""LAMBDA(x, FUNC(x, "mode"))"""
def test_complex_let_lambda():
"""LET(helper, LAMBDA(x, FUNC1(x)), FUNC2(helper))"""Category 3: String Handling
def test_string_with_quotes():
"""FUNC("value with 'single' quotes")"""
def test_escaped_quotes():
"""FUNC("value with \\"escaped\\" quotes")"""
def test_mixed_quote_types():
"""FUNC('single', "double")"""Category 4: Arrays and Ranges
def test_array_literal():
"""FUNC({1,2,3}, "mode")"""
def test_2d_array():
"""FUNC({1,2;3,4})"""
def test_range_reference():
"""FUNC(A1:B10, C:C)"""Category 5: Operators
def test_arithmetic_operators():
"""FUNC(x + y, z * 2)"""
def test_string_concatenation():
"""FUNC("prefix" & value & "suffix")"""
def test_comparison_operators():
"""FUNC(x > 0, y <= 10)"""Category 6: Edge Cases
def test_multiple_same_function():
"""FUNC(x) + FUNC(y)"""
def test_whitespace_variations():
"""FUNC( arg1 , arg2 )"""
def test_multiline_formula():
# Complex multi-line LET statement
def test_deeply_nested():
"""FUNC1(FUNC2(FUNC3(FUNC4(x))))"""Category 7: Real-World Patterns
def test_densify_expansion():
"""Test actual DENSIFY(range, 'mode') expansion"""
def test_unpivot_with_blanktoempty():
"""Test UNPIVOT formula that calls BLANKTOEMPTY"""
def test_vstackblank_expansion():
"""Test VSTACKFILL(a, b, BLANK()) expansion"""What Would It Take to Make It More Robust?
Option 1: Improve pyparsing Grammar (Recommended)
Effort: Medium | Risk: Low | Benefit: High
Approach:
-
Add support for LET syntax:
let_binding = identifier + Literal(",").suppress() + expression let_expr = (Literal("LET") + lparen + DelimitedList(Group(let_binding)) + rparen)
-
Add support for LAMBDA:
lambda_expr = (Literal("LAMBDA") + lparen + DelimitedList(identifier) + Literal(",").suppress() + expression + rparen)
-
Add operator support:
operator = MatchFirst([Literal(op) for op in ['+', '-', '*', '/', '&', ':', ...]]) binary_expr = expression + operator + expression
-
Add range references:
cell_ref = Word(alphas, max=3) + Word(nums) range_ref = cell_ref + Literal(":") + cell_ref
Pros:
- Maintains structured parsing
- Better error reporting
- Can validate syntax
- Single code path reduces complexity
Cons:
- Requires significant pyparsing expertise
- Google Sheets formula syntax is complex
- May never handle 100% of cases
Option 2: Pure Regex Approach
Effort: Low | Risk: Medium | Benefit: Medium
Approach:
- Remove pyparsing dependency
- Use only regex with balanced matching
- Simplify code to single parsing path
Pros:
- Simpler codebase
- Consistent output format
- Easier to understand and maintain
- Already works for all formulas
Cons:
- No structural understanding
- Harder to validate formulas
- Error messages less helpful
- Regex harder to extend
Option 3: Use Specialized Parser Library
Effort: High | Risk: High | Benefit: High
Approach:
- Research existing Google Sheets formula parsers
- Options might include:
- Lark parser (more powerful than pyparsing)
- ANTLR (industrial-strength parser generator)
- Tree-sitter (incremental parsing)
- Custom recursive descent parser
Pros:
- Potentially complete solution
- Better error messages
- Could enable formula validation
- Could enable formula analysis/transformation
Cons:
- Significant development time
- May not exist for Google Sheets
- Adds external dependency
- Overkill for current needs
Option 4: Hybrid Refinement (Minimum Viable Improvement)
Effort: Low | Risk: Low | Benefit: Medium
Approach:
-
Keep current hybrid approach
-
Improve error handling:
try: args = self._extract_args_from_formula(formula, match.end() - 1) if args is None: raise ParseException(f"Failed to extract args for {func_name}") except Exception as e: log_warning(f"Regex fallback failed: {e}") continue
-
Add validation:
# Verify extracted args make sense if any(arg.count('(') != arg.count(')') for arg in args): raise ParseException("Unbalanced parentheses in arguments")
-
Expand test coverage (see test cases above)
-
Add debug mode:
if DEBUG: print(f"Using {'pyparsing' if calls else 'regex'} for {name}")
Pros:
- Low risk, immediate benefit
- Builds on working solution
- Better error messages
- Easier debugging
Cons:
- Doesn't address fundamental limitations
- Still two code paths to maintain
Recommendations
Immediate (Do Now):
- ✅ Add comprehensive test suite covering all identified failure cases
- ✅ Add error handling to regex fallback with clear error messages
- ✅ Add debug logging to track which parser is used for each formula
- ✅ Document limitations in code comments
Short Term (Next Sprint):
- Improve pyparsing grammar to handle LET and LAMBDA
- Add validation for extracted arguments
- Create parser benchmarks to track performance
Long Term (Future):
- Evaluate specialized parsers (Lark, ANTLR) if complexity grows
- Consider adding formula validation beyond just expansion
- Build formula analyzer to detect potential issues
Conclusion
The current hybrid approach is pragmatic and effective for the current use case. The regex fallback is appropriately used as a safety net when pyparsing fails. However, the system would benefit from:
- Better test coverage (highest priority)
- Improved error handling (high priority)
- Enhanced pyparsing grammar (medium priority)
- Better documentation of when each parser is used (medium priority)
The regex approach is a good short-term solution but improving pyparsing would provide a better long-term foundation.