
feat(calc): formula evaluation engine with recursive expression parser #1

Merged

wolfiesch merged 2 commits into main from feat/calc-engine on Feb 20, 2026
Conversation

@wolfiesch
Collaborator

Summary

  • New wolfxl.calc subpackage for evaluating Excel formulas in-memory, designed for LRBench-Agent perturbation scoring
  • Recursive descent expression parser replacing fragile regex dispatch - handles operator precedence, balanced parentheses, and arbitrarily nested formulas like =ROUND(SUM(A1:A5)*IF(B1>0,1.1,1.0),2)
  • 39-function whitelist with 15 builtin Python implementations (SUM, IF, ROUND, ABS, IFERROR, MIN/MAX, AVERAGE, AND/OR/NOT, COUNT/COUNTA)
  • DependencyGraph with Kahn's topological sort, cycle detection, and BFS affected-cells propagation
  • Workbook convenience methods (calculate(), recalculate()) with evaluator caching for efficient repeated perturbation rounds
  • Optional formulas library integration via wolfxl[calc] extra (EUPL-licensed, doesn't affect MIT license)

Architecture

wolfxl.calc (new - pure Python, no Rust changes)
├── _protocol.py      CalcEngine Protocol + CellDelta, RecalcResult
├── _parser.py        Regex ref extraction + range expansion
├── _graph.py         DependencyGraph (Kahn's topological sort)
├── _functions.py     39-function whitelist + 15 builtin implementations
├── _evaluator.py     WorkbookEvaluator (recursive descent parser)
└── __init__.py       Re-exports

Key design decisions

  • Recursive descent over regex: The expression parser uses paren-depth-aware operator splitting with three precedence passes (comparison < additive < multiplicative). Right-to-left scanning gives correct left-to-right associativity.
  • Balanced paren matching: _match_function_call() only matches when the close-paren is the last character, so SUM(A1:A5)*2 is correctly parsed as a binary expression, not a function call.
  • Evaluator caching: Workbook.calculate() caches the evaluator on the instance; recalculate() reuses it, avoiding O(scan+calculate) per perturbation round.
  • Optional dependency: formulas library is an optional extra. All 15 builtin functions work without it.
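The split-then-recurse scheme above can be illustrated with a minimal, self-contained sketch. All names here are illustrative, not the PR's actual code; it shows two of the three precedence passes and a simplified string-literal guard, and omits the comparison pass and the scientific-notation guard the real parser has:

```python
from __future__ import annotations


def split_top_level(expr: str, ops: tuple[str, ...]) -> tuple[str, str, str] | None:
    """Find the rightmost top-level operator in `ops`; return (left, op, right).

    Scanning right-to-left puts the leftmost operations deepest in the
    recursion, which yields left-to-right associativity for - and /.
    """
    depth = 0
    in_string = False
    # i > 0: a leading +/- is a sign, not a binary operator
    for i in range(len(expr) - 1, 0, -1):
        ch = expr[i]
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == ')':
                depth += 1
            elif ch == '(':
                depth -= 1
            elif depth == 0 and ch in ops:
                return expr[:i], ch, expr[i + 1:]
    return None


def evaluate(expr: str) -> float:
    """Precedence passes from loosest to tightest binding."""
    for ops in (("+", "-"), ("*", "/")):
        split = split_top_level(expr, ops)
        if split is not None:
            left, op, right = split
            l, r = evaluate(left), evaluate(right)
            return {"+": l + r, "-": l - r, "*": l * r, "/": l / r}[op]
    # No top-level operator: either a parenthesized group or a literal
    if expr.startswith("(") and expr.endswith(")"):
        return evaluate(expr[1:-1])
    return float(expr)
```

Because the additive pass runs before the multiplicative pass, a top-level "+" splits the expression before any "*" does, so multiplication ends up deeper in the recursion and binds tighter.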

Test plan

  • 142 new tests across 5 test files (functions, parser, graph, evaluator, integration)
  • All 219 tests pass (142 new + 77 existing)
  • Complex nested formulas: =ROUND(SUM*IF,2), =SUM()+SUM(), =(A+B)*C, =IF(SUM>N,...)*2
  • Operator precedence: =A1+A2*A3 correctly evaluates as A1+(A2*A3)
  • Perturbation discrimination: formulas propagate (ratio=1.0), hardcoded values don't (ratio=0.0)
  • 100-round bit-exact determinism
  • 500-formula workbook evaluates in <200ms
  • Evaluator caching: recalculate() reuses cached evaluator after calculate()
  • Golden .xlsx fixtures: sum_chain, cross_sheet, hardcoded, mixed
  • Zero regressions in existing 77 wolfxl tests

🤖 Generated with Claude Code

…arser

Build wolfxl.calc subpackage for evaluating Excel formulas in-memory:

- Protocol layer: CalcEngine Protocol, CellDelta/RecalcResult dataclasses
- Parser: regex-based reference extraction, range expansion, optional
  formulas-lib compilation
- Graph: DependencyGraph with Kahn's topological sort, cycle detection,
  affected-cells BFS
- Functions: 39-function whitelist with 15 builtin implementations
  (SUM, IF, ROUND, ABS, IFERROR, MIN/MAX, AVERAGE, etc.)
- Evaluator: recursive descent expression parser handling operator
  precedence, balanced parentheses, and arbitrarily nested formulas
  like =ROUND(SUM(A1:A5)*IF(B1>0,1.1,1.0),2)
- Workbook convenience methods: calculate() and recalculate() with
  evaluator caching for efficient repeated perturbation rounds
- Pyright configuration for maturin python-source layout

142 new tests across 5 test files + 4 golden .xlsx fixtures.
All 219 tests pass (142 new + 77 existing).

Designed for LRBench-Agent perturbation scoring: perturb input cells,
check if downstream formula cells propagate changes. Formulas score ~1.0,
hardcoded values score 0.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
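The scoring rule described above (perturb inputs, then measure how many downstream formula cells move) reduces to a simple ratio. A minimal sketch, independent of the wolfxl API; all names are illustrative:

```python
from __future__ import annotations


def propagation_ratio(baseline: dict[str, float],
                      recomputed: dict[str, float],
                      formula_cells: set[str],
                      tol: float = 1e-10) -> float:
    """Fraction of formula cells whose value changed after a perturbation.

    Live formulas downstream of a perturbed input score ~1.0; workbooks
    where those cells hold hardcoded values score 0.0.
    """
    if not formula_cells:
        return 0.0
    changed = sum(
        1 for cell in formula_cells
        if abs(recomputed[cell] - baseline[cell]) > tol
    )
    return changed / len(formula_cells)
```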
Copilot AI review requested due to automatic review settings February 20, 2026 01:06

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 89a2ccc1de


Comment on lines 182 to 186
try:
    lf = float(left) if not isinstance(left, (int, float)) else left
    rf = float(right) if not isinstance(right, (int, float)) else right
except (ValueError, TypeError):
    return False


P1: Handle text operands in comparison evaluation

The comparison helper currently coerces both sides to float and returns False on conversion errors, so any text comparison like A1="OK" or "x"="x" evaluates as false. This breaks common formulas such as IF(A1="OK",...) by always taking the false branch when string values are involved, producing incorrect workbook results even though string literals are otherwise parsed.


Collaborator Author

Fixed in 668fd6e. _compare() now handles text operands with case-insensitive string comparison (matching Excel behavior). Added 4 tests in TestTextComparison.
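The described behavior can be sketched as follows (illustrative only; the PR's actual `_compare()` may differ in detail, and ordering of mixed text/number operands is omitted):

```python
def compare(left: object, right: object, op: str) -> bool:
    """Excel-style comparison sketch: case-insensitive for text operands.

    Assumption: when either side is a string, both sides are lowered and
    compared for (in)equality; otherwise both are coerced to float.
    """
    if isinstance(left, str) or isinstance(right, str):
        l, r = str(left).lower(), str(right).lower()
        return {"=": l == r, "<>": l != r}[op]
    lf, rf = float(left), float(right)  # type: ignore[arg-type]
    return {
        "=": lf == rf, "<>": lf != rf,
        "<": lf < rf, ">": lf > rf,
        "<=": lf <= rf, ">=": lf >= rf,
    }[op]
```

With this shape, IF(A1="OK",...) takes the true branch when A1 holds "ok", "OK", or "Ok".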

Comment on lines 466 to 467
elif ch == ',' and depth == 0:
    args.append(self._resolve_arg(current.strip(), sheet))


P1: Respect quoted commas when splitting function args

Function arguments are split on every top-level comma without tracking whether the parser is inside a quoted string, so formulas like =IF(TRUE,"a,b","c") are tokenized into extra arguments. In this path IF fails arity validation and _eval_function returns None, causing valid formulas with comma-containing text literals to evaluate incorrectly.


Collaborator Author

Fixed in 668fd6e. _parse_function_args() now tracks in_string state and handles Excel escaped quotes. Commas inside string literals no longer split arguments. Added 2 tests in TestQuotedCommasInArgs.

Copilot AI left a comment

Pull request overview

Adds a new wolfxl.calc subpackage to evaluate Excel formulas in-memory (parser + dependency graph + evaluator) and integrates it with Workbook.calculate() / Workbook.recalculate() to support efficient perturbation-based rescoring workflows.

Changes:

  • Introduces wolfxl.calc with reference extraction, dependency graph (topo sort + affected-cells), function registry, and a recursive expression evaluator.
  • Adds Workbook.calculate() / Workbook.recalculate() convenience methods with evaluator caching.
  • Adds extensive unit + integration tests and commits golden .xlsx fixtures; adds wolfxl[calc] optional dependency.

Reviewed changes

Copilot reviewed 13 out of 17 changed files in this pull request and generated 8 comments.

File Description
tests/test_calc_parser.py Tests for reference/range/function extraction and range expansion.
tests/test_calc_integration.py End-to-end workbook tests + caching + determinism/perf assertions.
tests/test_calc_graph.py Tests for dependency graph ordering, cycles, affected-cells, max depth.
tests/test_calc_functions.py Tests for function whitelist/registry and builtin implementations.
tests/test_calc_evaluator.py Tests for evaluator behavior, precedence, nesting, edge cases.
tests/fixtures/calc/sum_chain.xlsx Golden workbook fixture used by integration tests.
tests/fixtures/calc/mixed.xlsx Golden workbook fixture used by integration tests.
tests/fixtures/calc/hardcoded.xlsx Golden workbook fixture used by integration tests.
tests/fixtures/calc/cross_sheet.xlsx Golden workbook fixture used by integration tests.
python/wolfxl/calc/_protocol.py Protocol + dataclasses for perturbation deltas/results.
python/wolfxl/calc/_parser.py Regex-based ref extraction + range expansion + optional formulas probing.
python/wolfxl/calc/_graph.py Dependency graph + topo sort + affected-cells + max-depth.
python/wolfxl/calc/_functions.py Function whitelist and builtin implementations + registry.
python/wolfxl/calc/_evaluator.py Recursive expression parser/evaluator + recalc delta computation.
python/wolfxl/calc/__init__.py Public re-exports for wolfxl.calc.
python/wolfxl/_workbook.py Adds calculate() / recalculate() methods with caching.
pyproject.toml Adds wolfxl[calc] extra and Pyright configuration.


Comment on lines 391 to 393
    return float(expr) if '.' in expr else int(expr)
except ValueError:
    pass
Copilot AI Feb 20, 2026

Numeric literal parsing only handles plain ints and decimals via int(expr) / float(expr) gated by '.' in expr. This fails for valid Excel numeric formats like scientific notation (1E3, -2.5e-4) and can cause such tokens to be treated as cell refs. Consider using a regex (or float() first) to support E/e notation and then decide int-vs-float if needed.

Suggested change
-    return float(expr) if '.' in expr else int(expr)
-except ValueError:
-    pass
+    # Use float() first to support integers, decimals, and scientific notation (e.g. 1E3, -2.5e-4).
+    num = float(expr)
+except ValueError:
+    pass
+else:
+    # Preserve existing behavior: return int for plain integer literals, float otherwise.
+    if re.fullmatch(r'[+-]?\d+', expr):
+        return int(expr)
+    return num

Collaborator Author

Fixed in 668fd6e. Numeric literal parsing now uses float() first to support scientific notation (1E3, -2.5e-4), then checks for plain int. Also added scientific notation guard in _find_top_level_split() so e.g. 2.5e-1 isn't split as subtraction. Added 2 tests.
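The operator-scanning guard mentioned here can be sketched like so (hypothetical helper; `is_exponent_sign` is not the PR's actual function name):

```python
import re

# A '+' or '-' immediately preceded by an 'e'/'E' that follows a digit is
# part of a scientific-notation literal (e.g. 2.5e-1) and must not be
# treated as a binary operator during top-level splitting.
_EXP_SIGN = re.compile(r'\d[eE][+-]$')


def is_exponent_sign(expr: str, i: int) -> bool:
    """True if expr[i] is the sign inside a scientific-notation literal."""
    return expr[i] in "+-" and bool(_EXP_SIGN.search(expr[: i + 1]))
```

A splitter that skips positions where this returns True leaves 2.5e-1 intact while still splitting A1-2 on the minus.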

Comment on lines +31 to +37
# Strings in formulas (to skip refs inside string literals)
_STRING_RE = re.compile(r'"[^"]*"')


def _strip_strings(formula: str) -> str:
    """Remove string literals so refs inside quotes aren't matched."""
    return _STRING_RE.sub("", formula)
Copilot AI Feb 20, 2026

_STRING_RE = re.compile(r'"[^"]*"') doesn't handle Excel-escaped quotes inside string literals (Excel uses doubled quotes ""), so reference/function extraction can become incorrect for formulas containing embedded quotes. Consider a pattern/state machine that understands "" escapes (and potentially unterminated strings) to make _strip_strings() safe.

Collaborator Author

Acknowledged. The current regex handles the common case. Excel doubled-quote escaping ("") is an edge case we haven't hit in practice. Filed as a known limitation - will address if real workbooks surface this.

Comment on lines 286 to 298
class TestPerformance:
    def test_500_formula_cells_under_200ms(self) -> None:
        """calculate() on a 500-formula workbook must complete in <200ms."""
        wb = _build_income_statement(num_rows=250)  # 250*2 + 3 = 503 formulas
        ev = WorkbookEvaluator()
        ev.load(wb)

        start = time.perf_counter()
        ev.calculate()
        elapsed = time.perf_counter() - start

        assert elapsed < 0.200, f"calculate() took {elapsed:.3f}s (>200ms)"

Copilot AI Feb 20, 2026

This test enforces a hard wall-clock threshold (<200ms) which is prone to flakiness across CI runners, CPU contention, and debug builds. Consider converting this to a benchmark-style test (optional/marked), loosening the threshold substantially, or asserting relative performance characteristics without absolute timing.

Collaborator Author

Fixed in 668fd6e. Threshold loosened from 200ms to 2s and test marked with @pytest.mark.slow. Local runs typically complete in <100ms but CI runners vary widely.

Comment on lines +113 to +136
# Fixture generation (saved to disk once)
# ---------------------------------------------------------------------------


@pytest.fixture(scope="session", autouse=True)
def golden_fixtures() -> None:
    """Generate golden .xlsx fixtures for other tests."""
    os.makedirs(FIXTURE_DIR, exist_ok=True)

    builders = {
        "sum_chain.xlsx": _build_sum_chain,
        "cross_sheet.xlsx": _build_cross_sheet,
        "hardcoded.xlsx": _build_hardcoded,
        "mixed.xlsx": _build_mixed,
    }

    for name, builder in builders.items():
        path = os.path.join(FIXTURE_DIR, name)
        if not os.path.exists(path):
            wb = builder()
            wb.save(path)


# ---------------------------------------------------------------------------
Copilot AI Feb 20, 2026

The autouse session fixture writes golden .xlsx files into tests/fixtures/calc when missing. Tests ideally shouldn't mutate the source tree; this can also fail in read-only checkouts. Since the fixtures are committed in this PR, consider removing runtime generation (or generate into a temp directory instead).


Collaborator Author

The fixtures are committed in the PR, and the autouse fixture only generates them if missing. This is a convenience for local dev - in CI the committed fixtures are used. Keeping as-is since it's harmless (no-op in read-only checkouts where fixtures already exist in the commit).

Comment on lines +134 to +140
from wolfxl.calc._evaluator import WorkbookEvaluator

ev = WorkbookEvaluator()
ev.load(self)
result = ev.calculate()
self._evaluator = ev # cache for recalculate()
return result
Copilot AI Feb 20, 2026

Workbook.calculate() assigns self._evaluator, but _evaluator is never declared/initialized on the class. With [tool.mypy] strict = true, this typically triggers an "attribute-defined" error, and it also means reader/patcher constructors won't have the attribute. Initialize self._evaluator (e.g., to None) in __init__ and also set it on instances created via _from_reader/_from_patcher.

Collaborator Author

Fixed in 668fd6e. _evaluator is now initialized to None in __init__, _from_reader, and _from_patcher. Strict mypy should be happy.

Comment on lines 142 to 147
def recalculate(
    self,
    perturbations: dict[str, float | int],
    tolerance: float = 1e-10,
) -> Any:
    """Perturb input cells and recompute affected formulas.
Copilot AI Feb 20, 2026

The return type of Workbook.recalculate() is annotated as Any, but the docstring says it returns RecalcResult, and callers/tests treat it as such. Please tighten the annotation to RecalcResult (using TYPE_CHECKING import to avoid hard runtime dependency if needed) so type checkers and IDEs get the correct contract.

Collaborator Author

Fixed in 668fd6e. Return type tightened to RecalcResult using TYPE_CHECKING import to avoid hard runtime dependency on wolfxl.calc.
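The pattern looks roughly like this (a sketch; the exact `wolfxl.calc` import path and the method body are assumptions, not the PR's code):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only for type checkers: no runtime dependency on the
    # optional wolfxl[calc] extra. (Import path assumed for illustration.)
    from wolfxl.calc import RecalcResult


class Workbook:
    def recalculate(
        self,
        perturbations: dict[str, float | int],
        tolerance: float = 1e-10,
    ) -> RecalcResult:
        """Perturb input cells and recompute affected formulas."""
        ...
```

With `from __future__ import annotations`, the return annotation is never evaluated at runtime, so the conditional import is safe even when the extra isn't installed.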

Comment on lines 48 to 75
formula_cells = set(self.formulas.keys())
if not formula_cells:
    return []

# Compute in-degrees within formula cells only
in_degree: dict[str, int] = {}
for cell in formula_cells:
    deps = self.dependencies.get(cell, set())
    # Only count deps that are themselves formula cells
    in_degree[cell] = len(deps & formula_cells)

# Start with formula cells that have no formula-cell dependencies
queue: deque[str] = deque()
for cell in formula_cells:
    if in_degree[cell] == 0:
        queue.append(cell)

order: list[str] = []
while queue:
    cell = queue.popleft()
    order.append(cell)
    # Reduce in-degree for dependent formula cells
    for dep in self.dependents.get(cell, set()):
        if dep in formula_cells:
            in_degree[dep] -= 1
            if in_degree[dep] == 0:
                queue.append(dep)

Copilot AI Feb 20, 2026

topological_order() iterates over set(self.formulas.keys()) and set/dict/set iteration order can vary across processes (hash randomization), making evaluation order and any derived outputs (e.g., delta ordering) non-deterministic. If determinism is a goal, consider ordering formula_cells deterministically (e.g., sorted(...)) and iterating dependents in a stable order as well.

Collaborator Author

Fixed in 668fd6e. topological_order() now uses sorted() on initial queue and when adding newly-unblocked dependents. Evaluation order is deterministic regardless of Python hash randomization.
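The fix described can be sketched as a deterministic Kahn's sort (illustrative, not the PR's exact code; `deps` maps each node to its set of prerequisite nodes):

```python
from __future__ import annotations

from collections import deque


def topo_order(deps: dict[str, set[str]]) -> list[str]:
    """Kahn's algorithm with sorted() tie-breaking.

    Seeding the queue from sorted() and sorting newly-unblocked dependents
    makes the output independent of hash randomization.
    """
    nodes = sorted(deps)
    in_degree = {n: len(deps[n]) for n in nodes}
    dependents: dict[str, list[str]] = {n: [] for n in nodes}
    for n in nodes:
        for d in deps[n]:
            dependents[d].append(n)

    queue = deque(n for n in nodes if in_degree[n] == 0)
    order: list[str] = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in sorted(dependents[n]):
            in_degree[m] -= 1
            if in_degree[m] == 0:
                queue.append(m)
    return order
```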

Comment on lines 458 to 471

for ch in args_str:
    if ch == '(':
        depth += 1
        current += ch
    elif ch == ')':
        depth -= 1
        current += ch
    elif ch == ',' and depth == 0:
        args.append(self._resolve_arg(current.strip(), sheet))
        current = ""
    else:
        current += ch

Copilot AI Feb 20, 2026

_parse_function_args() splits on commas by tracking only parenthesis depth, but it doesn't track whether it's inside a string literal. This will incorrectly split arguments for formulas like CONCATENATE("a,b", "c") or IF(A1="x,y",...). Track in_string (and Excel escaped quotes "") so commas inside quoted strings don't split arguments.

Suggested change (replace the paren-depth loop quoted above with a string-aware scan):

in_string = False
i = 0
length = len(args_str)
while i < length:
    ch = args_str[i]
    if ch == '"':
        if in_string:
            # Inside a string: handle Excel escaped quote ("")
            if i + 1 < length and args_str[i + 1] == '"':
                current += '"'  # escaped quote, stays inside string
                i += 2
                continue
            in_string = False  # end of string
        else:
            in_string = True  # start of string
        current += ch
        i += 1
        continue
    if not in_string:
        if ch == '(':
            depth += 1
            current += ch
        elif ch == ')':
            depth -= 1
            current += ch
        elif ch == ',' and depth == 0:
            args.append(self._resolve_arg(current.strip(), sheet))
            current = ""
        else:
            current += ch
    else:
        # Inside a string, commas and parentheses are literal
        current += ch
    i += 1

Collaborator Author

Fixed in 668fd6e (same as the Codex comment). _parse_function_args() now tracks in_string state with Excel escaped quote handling.

CI fixes:
- Add `from __future__ import annotations` to __init__.py (Python 3.9 compat)
- Use `--no-index` in CI pip install to prevent PyPI fallback
- Fix import sorting in all 5 test files (ruff I001)
- Move ruff select to [tool.ruff.lint] section (deprecation warning)
- Register pytest `slow` mark in pyproject.toml

PR review fixes (Codex + Copilot):
- Handle text operands in comparison evaluation (case-insensitive)
- Respect quoted commas when splitting function args
- Support scientific notation in numeric literals (e.g. 2.5e-1)
- Skip +/- inside scientific notation during operator scanning
- Initialize _evaluator attribute on Workbook class + all constructors
- Tighten recalculate() return type from Any to RecalcResult
- Deterministic topological ordering via sorted() on formula cells
- Loosen perf test threshold (200ms -> 2s) to prevent CI flakiness

8 new tests: string comparison (4), quoted commas (2), scientific notation (2).
227 total tests pass (150 calc + 77 existing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wolfiesch wolfiesch merged commit d0f4806 into main Feb 20, 2026
7 checks passed
@wolfiesch wolfiesch deleted the feat/calc-engine branch February 20, 2026 03:16
@wolfiesch wolfiesch mentioned this pull request Feb 20, 2026
3 tasks
wolfiesch added a commit that referenced this pull request Feb 20, 2026
- POWER: return #NUM! for negative base with fractional exponent
  instead of leaking complex numbers (Copilot #2, Codex #11)
- LEFT: return #VALUE! for negative num_chars (Copilot #1)
- MID: return #VALUE! for start < 1 or num_chars < 0 (Copilot #8)
- range_shape: use min/max instead of abs() for consistency with
  expand_range (Copilot #3)
- Add @_requires_formulas skip marker to formulas-dependent test
  classes so CI passes without wolfxl[calc] extra (Codex #10)
- Add 4 edge case tests: POWER(-1,0.5), LEFT(-1), MID(0,..), MID(.,-1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>