Add comparative benchmark harness and stable-recalc perf groundwork#28

Merged
PSU3D0 merged 13 commits into main from feat/benchmark-harness-meta-tranche
Mar 6, 2026
Conversation

PSU3D0 commented Mar 6, 2026

Summary

This PR lands the comparative benchmark harness, expands the benchmark suite substantially, adds CI/nightly benchmark execution plans, and includes two narrowly-scoped stable-topology performance improvements for Formualizer.

At a high level it adds:

  • benchmark corpus/scenario contracts and reporting governance
  • native/comparator harness adapters and raw-result/report generation
  • a materially broader benchmark suite (incremental, lookup, aggregate, structural, and real-world anchor scenarios)
  • default batch plans for CI and nightly benchmark execution
  • parity/fairness fixes for generated XLSX corpus used in cross-engine comparison
  • stable-topology recalc plan reuse support in the benchmark runner
  • an internal static schedule cache for stable recalcs in the engine

What landed

Benchmark harness + contracts

  • benchmarks/scenarios.yaml
  • benchmarks/function_matrix.yaml
  • benchmarks/reporting.md
  • benchmarks/README.md
  • benchmarks/harness/...
  • crates/formualizer-bench-core
  • crates/formualizer-testkit

Highlights:

  • scenario metadata for profile/family/tier/comparison-profile/runtime-mode/regression-gate
  • support-policy / claim-class / caveat-label matrix
  • raw JSON result schema and markdown reporting
  • adapters for:
    • formualizer_rust_native
    • ironcalc_rust_native
    • hyperformula_node
    • scaffolded future adapters for wasm/python variants
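As a rough illustration of how the metadata dimensions listed above might be carried per scenario, a `benchmarks/scenarios.yaml` entry could look like the following. This is a hypothetical sketch: only the dimension names (profile, family, tier, comparison profile, runtime mode, regression gate) come from this PR; the key spellings and values are assumptions, not the actual schema.

```yaml
# Hypothetical scenarios.yaml entry; key names and values are illustrative.
- id: headline_100k_single_edit
  profile: incremental
  family: headline
  tier: core
  comparison_profile: native_best
  runtime_mode: incremental
  regression_gate: true
```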

New benchmark scenarios

Added scenarios across the benchmark meta tranche:

  • incremental / locality
    • inc_sparse_dirty_region_1m
    • inc_cross_sheet_mesh_3x25k
  • lookup / joins
    • lookup_index_match_dense_50k
    • lookup_cross_sheet_dim_fact
  • aggregates / analytics
    • agg_countifs_multi_criteria_100k
    • agg_mixed_rollup_grid_2k_reports
  • structural / edit churn
    • struct_row_insert_middle_50k_refs
    • struct_sheet_rename_rebind
  • real-world anchors
    • real_finance_model_v1
    • real_ops_model_v1

Plus the earlier core scenarios now governed under the same suite metadata:

  • headline_100k_single_edit
  • chain_100k
  • fanout_100k
  • cross_sheet_mesh
  • sparse_whole_column_refs
  • sumifs_fact_table_100k
  • structural_sheet_recovery

CI / nightly execution plans

Added YAML-defined execution plans in:

  • benchmarks/harness/plans.yaml

Plans:

  • ci_formualizer_gate
    • fast formualizer-native-only gate
    • covers core_smoke plus structural_sheet_recovery
  • nightly_native_compares
    • scheduled native-best compare plan for:
      • core comparative scenarios
      • native-strength scenarios
      • nightly-scale watchlist scenarios

The harness runner now supports:

  • list-plans
  • validate-plans
  • run-plan --plan <name>
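A minimal sketch of what `benchmarks/harness/plans.yaml` might contain, given the two plan names and scenario groupings described above. The exact keys are assumptions; only the plan names, engine, and scenario names appear in this PR.

```yaml
# Illustrative plans.yaml shape; key names are assumptions.
plans:
  ci_formualizer_gate:
    engines: [formualizer_rust_native]
    scenarios: [core_smoke, structural_sheet_recovery]
  nightly_native_compares:
    schedule: nightly
    compare_profile: native_best
    scenario_groups: [core_comparative, native_strength, nightly_scale_watchlist]
```

A plan would then be exercised through the runner entry point used elsewhere in this PR, e.g. `uv run --project benchmarks/harness python benchmarks/harness/runner/main.py run-plan --plan ci_formualizer_gate`.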

Fairness / parity fixes

The generated comparison corpus was hardened so comparator engines can ingest the same workbook fairly:

  • style normalization in generated XLSX
  • worksheet formula normalization to strip leading = inside OOXML <f> nodes

This addressed a concrete comparator fairness issue with IronCalc workbook ingest.
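The `<f>` normalization amounts to dropping a leading `=` before serializing formula text into the OOXML node, since the XLSX format stores formulas without it. A minimal sketch (the helper name is hypothetical; the PR's actual normalization code may differ):

```rust
/// Strip a leading '=' from formula text destined for an OOXML <f> node.
/// Hypothetical helper illustrating the normalization described above.
fn normalize_f_node_text(formula: &str) -> &str {
    formula.strip_prefix('=').unwrap_or(formula)
}

fn main() {
    // Comparator engines ingesting the generated XLSX expect no leading '='.
    assert_eq!(normalize_f_node_text("=SUM(A1:A10)"), "SUM(A1:A10)");
    // Already-normalized text passes through unchanged.
    assert_eq!(normalize_f_node_text("SUM(A1:A10)"), "SUM(A1:A10)");
    println!("normalized");
}
```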

Stable-topology perf groundwork

1. Benchmark-runner recalc plan reuse mode

  • perf(bench): add recalc plan reuse mode

Adds a controlled native_best_cached_plan / --reuse-recalc-plan mode to the Formualizer native benchmark runner.

Properties:

  • correctness-safe
  • deterministic
  • invalidates on topology-changing benchmark ops
  • falls back when dynamic refs are present or when no reusable plan exists
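The reuse gate above can be sketched as a generation-counter check: the cached plan is handed back only when no topology-changing op has occurred since it was built and no dynamic refs are in play, otherwise the runner falls back to a fresh rebuild. All names here (`RecalcPlan`, `PlanReuse`) are illustrative, not Formualizer's real API.

```rust
// Illustrative sketch of the recalc-plan reuse mode's safety gate.
#[derive(Clone)]
struct RecalcPlan {
    order: Vec<u32>, // evaluation order over cell ids
}

struct PlanReuse {
    cached: Option<RecalcPlan>,
    topology_gen: u64, // bumped by topology-changing benchmark ops
    cached_gen: u64,   // generation the cached plan was built at
}

impl PlanReuse {
    fn new() -> Self {
        PlanReuse { cached: None, topology_gen: 0, cached_gen: 0 }
    }

    fn note_topology_change(&mut self) {
        self.topology_gen += 1; // any structural edit invalidates the plan
    }

    /// Reuse the cached plan only when it is provably safe; otherwise
    /// fall back to rebuilding, mirroring the fallback behavior above.
    fn plan_for_recalc(
        &mut self,
        has_dynamic_refs: bool,
        rebuild: impl FnOnce() -> RecalcPlan,
    ) -> RecalcPlan {
        let reusable = !has_dynamic_refs
            && self.cached_gen == self.topology_gen
            && self.cached.is_some();
        if reusable {
            return self.cached.as_ref().unwrap().clone();
        }
        let plan = rebuild();
        self.cached = Some(plan.clone());
        self.cached_gen = self.topology_gen;
        plan
    }
}

fn main() {
    let mut reuse = PlanReuse::new();
    let p1 = reuse.plan_for_recalc(false, || RecalcPlan { order: vec![1, 2, 3] });
    // A second stable recalc reuses the plan; the rebuild closure is not called.
    let p2 = reuse.plan_for_recalc(false, || unreachable!("should reuse"));
    assert_eq!(p1.order, p2.order);
    // A topology change forces a rebuild.
    reuse.note_topology_change();
    let p3 = reuse.plan_for_recalc(false, || RecalcPlan { order: vec![3, 2, 1] });
    assert_eq!(p3.order, vec![3, 2, 1]);
}
```

The generation counter is what makes this deterministic: reuse depends only on the sequence of edits, never on timing or adaptive heuristics.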

2. Engine static schedule cache for stable recalcs

  • perf(engine): cache static schedules for stable recalcs

Adds a conservative internal schedule cache for stable-topology, non-dynamic, non-range-dependency recalcs.

Properties:

  • deterministic
  • invalidated on topology-changing edits
  • covered by engine tests
  • intentionally narrow in scope (no adaptive runtime policy)
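The "intentionally narrow" scope can be read as a conjunction of eligibility conditions that must all hold before a recalc is allowed to hit the cached schedule. The struct and field names below are assumptions about the engine's internals, not its real types:

```rust
// Illustrative eligibility gate for the static schedule cache.
#[derive(Clone, Copy)]
struct RecalcContext {
    topology_changed: bool, // any structural edit since last recalc
    has_dynamic_deps: bool, // e.g. volatile or indirect references
    has_range_deps: bool,   // range-dependency tracking in play
}

/// Conservative by design: only stable-topology, non-dynamic,
/// non-range-dependency recalcs may use the cached schedule;
/// everything else recomputes from scratch.
fn schedule_cache_eligible(ctx: &RecalcContext) -> bool {
    !ctx.topology_changed && !ctx.has_dynamic_deps && !ctx.has_range_deps
}

fn main() {
    let stable = RecalcContext {
        topology_changed: false,
        has_dynamic_deps: false,
        has_range_deps: false,
    };
    assert!(schedule_cache_eligible(&stable));
    let edited = RecalcContext { topology_changed: true, ..stable };
    assert!(!schedule_cache_eligible(&edited));
    println!("ok");
}
```

Keeping the predicate this strict is what lets the cache stay deterministic with no adaptive runtime policy: a miss always degrades to the existing full-schedule path.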

Performance notes

We did one extra A/B measurement round on the same machine comparing:

  • pre-perf benchmark branch point: 0ea742a
  • current head: 320d484

Measured as 3-run medians for formualizer_rust_native native_best:

| Scenario | Pre-perf incremental | Current incremental | Delta |
| --- | --- | --- | --- |
| headline_100k_single_edit | 19.467 µs | 24.887 µs | +27.8% |
| chain_100k | 105,092.923 µs | 70,290.640 µs | -33.1% |
| fanout_100k | 63,018.455 µs | 45,666.231 µs | -27.5% |
| sumifs_fact_table_100k | 19,060.541 µs | 21,267.939 µs | +11.6% |
| lookup_cross_sheet_dim_fact | 16,538.927 µs | 14,976.657 µs | -9.5% |
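The Delta column follows from the 3-run medians directly; a quick sketch of the arithmetic (illustrative only, not the harness's reporting code):

```rust
// Reproducing the table's Delta column: 3-run median, then percent change.
fn median3(mut runs: [f64; 3]) -> f64 {
    runs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    runs[1] // middle value of the three runs
}

fn delta_pct(pre: f64, cur: f64) -> f64 {
    (cur - pre) / pre * 100.0
}

fn main() {
    assert_eq!(median3([3.0, 1.0, 2.0]), 2.0);
    // chain_100k medians from the table (microseconds): about -33.1%.
    let d = delta_pct(105_092.923, 70_290.640);
    assert!((d - (-33.1)).abs() < 0.1);
    println!("{d:.1}%");
}
```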

Interpretation:

  • the stable-topology work clearly improves chain/fanout and also helps one lookup-heavy comparative scenario
  • the regressions observed are modest in absolute terms (headline regresses by a few microseconds; sumifs by a couple of milliseconds of incremental time)
  • the current implementation intentionally avoids adaptive/non-deterministic runtime heuristics

Validation performed

Representative validation run on this branch included:

  • cargo fmt --all
  • cargo clippy -p formualizer-eval --lib --tests -- -D warnings
  • cargo clippy -p formualizer-bench-core --all-targets --features xlsx,formualizer_runner,ironcalc_runner -- -D warnings
  • cargo clippy -p formualizer-testkit --all-targets -- -D warnings
  • cargo test -p formualizer-eval recalc_plan -- --nocapture
  • cargo test -p formualizer-eval schedule_cache -- --nocapture
  • cargo test -p formualizer-bench-core --features formualizer_runner --bin run-formualizer-native -- --nocapture
  • uv run --project benchmarks/harness python benchmarks/harness/runner/main.py validate-suite
  • uv run --project benchmarks/harness python benchmarks/harness/runner/main.py validate-plans
  • full fresh formualizer_rust_native native-best suite sweep

Fresh suite sweep result:

  • 17/17 scenarios passed for formualizer_rust_native

Merge-readiness / artifact hygiene

Checked before opening this PR:

  • branch is clean
  • benchmark-generated corpus XLSX files are ignored
  • harness results/ is ignored
  • local reference libraries under benchmarks/harness/ref-libs/ are ignored
  • local tool caches / notes / scratch artifacts remain ignored

Notable ignored paths include:

  • benchmarks/corpus/synthetic/
  • benchmarks/corpus/real/*.xlsx
  • benchmarks/harness/results/
  • benchmarks/harness/ref-libs/
  • target/

So the PR contains harness/tooling/contracts/docs/tests/code, but not local benchmark artifacts or vendored comparator checkouts.

Follow-up after merge

I recommend follow-up perf work on top of this merged benchmark baseline rather than extending this PR further.

Most likely next tranche:

  • deeper chain/scheduler hot-path work only if still needed after compare reruns
  • SUMIFS/range-criteria focused work (criteria-mask caching / invalidation, fast paths)
  • loader / ingest ablations and runtime parity reporting refinement
  • richer nightly compare/report automation from the new plan framework

PSU3D0 force-pushed the feat/benchmark-harness-meta-tranche branch from 517be91 to 009b50b on March 6, 2026 at 20:10
PSU3D0 force-pushed the feat/benchmark-harness-meta-tranche branch from 009b50b to b3b9f64 on March 6, 2026 at 20:17
PSU3D0 merged commit 7ddb247 into main Mar 6, 2026
5 checks passed