Add comparative benchmark harness and stable-recalc perf groundwork#28

Merged
PSU3D0 merged 13 commits into main from feat/benchmark-harness-meta-tranche
Mar 6, 2026
Conversation

PSU3D0 commented Mar 6, 2026

Summary

This PR lands the comparative benchmark harness, expands the benchmark suite substantially, adds CI/nightly benchmark execution plans, and includes two narrowly-scoped stable-topology performance improvements for Formualizer.

At a high level it adds:

  • benchmark corpus/scenario contracts and reporting governance
  • native/comparator harness adapters and raw-result/report generation
  • a materially broader benchmark suite (incremental, lookup, aggregate, structural, and real-world anchor scenarios)
  • default batch plans for CI and nightly benchmark execution
  • parity/fairness fixes for generated XLSX corpus used in cross-engine comparison
  • stable-topology recalc plan reuse support in the benchmark runner
  • an internal static schedule cache for stable recalcs in the engine

What landed

Benchmark harness + contracts

  • benchmarks/scenarios.yaml
  • benchmarks/function_matrix.yaml
  • benchmarks/reporting.md
  • benchmarks/README.md
  • benchmarks/harness/...
  • crates/formualizer-bench-core
  • crates/formualizer-testkit

Highlights:

  • scenario metadata for profile/family/tier/comparison-profile/runtime-mode/regression-gate
  • support-policy / claim-class / caveat-label matrix
  • raw JSON result schema and markdown reporting
  • adapters for:
    • formualizer_rust_native
    • ironcalc_rust_native
    • hyperformula_node
    • scaffolded future adapters for wasm/python variants
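As a rough illustration of how the metadata dimensions listed above might be carried per scenario, a `benchmarks/scenarios.yaml` entry could look like the following. This is a hypothetical sketch: only the dimension names (profile, family, tier, comparison profile, runtime mode, regression gate) come from this PR; the key spellings and values are assumptions, not the actual schema.

```yaml
# Hypothetical scenarios.yaml entry; key names and values are illustrative.
- id: headline_100k_single_edit
  profile: incremental
  family: headline
  tier: core
  comparison_profile: native_best
  runtime_mode: incremental
  regression_gate: true
```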

New benchmark scenarios

Added scenarios across the benchmark meta tranche:

  • incremental / locality
    • inc_sparse_dirty_region_1m
    • inc_cross_sheet_mesh_3x25k
  • lookup / joins
    • lookup_index_match_dense_50k
    • lookup_cross_sheet_dim_fact
  • aggregates / analytics
    • agg_countifs_multi_criteria_100k
    • agg_mixed_rollup_grid_2k_reports
  • structural / edit churn
    • struct_row_insert_middle_50k_refs
    • struct_sheet_rename_rebind
  • real-world anchors
    • real_finance_model_v1
    • real_ops_model_v1

Plus the earlier core scenarios now governed under the same suite metadata:

  • headline_100k_single_edit
  • chain_100k
  • fanout_100k
  • cross_sheet_mesh
  • sparse_whole_column_refs
  • sumifs_fact_table_100k
  • structural_sheet_recovery

CI / nightly execution plans

Added YAML-defined execution plans in:

  • benchmarks/harness/plans.yaml

Plans:

  • ci_formualizer_gate
    • fast formualizer-native-only gate
    • covers core_smoke plus structural_sheet_recovery
  • nightly_native_compares
    • scheduled native-best compare plan for:
      • core comparative scenarios
      • native-strength scenarios
      • nightly-scale watchlist scenarios

The harness runner now supports:

  • list-plans
  • validate-plans
  • run-plan --plan <name>
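A minimal sketch of what `benchmarks/harness/plans.yaml` might contain, given the two plan names and scenario groupings described above. The exact keys are assumptions; only the plan names, engine, and scenario names appear in this PR.

```yaml
# Illustrative plans.yaml shape; key names are assumptions.
plans:
  ci_formualizer_gate:
    engines: [formualizer_rust_native]
    scenarios: [core_smoke, structural_sheet_recovery]
  nightly_native_compares:
    schedule: nightly
    compare_profile: native_best
    scenario_groups: [core_comparative, native_strength, nightly_scale_watchlist]
```

A plan would then be exercised through the runner entry point used elsewhere in this PR, e.g. `uv run --project benchmarks/harness python benchmarks/harness/runner/main.py run-plan --plan ci_formualizer_gate`.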

Fairness / parity fixes

The generated comparison corpus was hardened so comparator engines can ingest the same workbook fairly:

  • style normalization in generated XLSX
  • worksheet formula normalization to strip leading = inside OOXML <f> nodes

This addressed a concrete comparator fairness issue with IronCalc workbook ingest.
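The `<f>` normalization amounts to dropping a leading `=` before serializing formula text into the OOXML node, since the XLSX format stores formulas without it. A minimal sketch (the helper name is hypothetical; the PR's actual normalization code may differ):

```rust
/// Strip a leading '=' from formula text destined for an OOXML <f> node.
/// Hypothetical helper illustrating the normalization described above.
fn normalize_f_node_text(formula: &str) -> &str {
    formula.strip_prefix('=').unwrap_or(formula)
}

fn main() {
    // Comparator engines ingesting the generated XLSX expect no leading '='.
    assert_eq!(normalize_f_node_text("=SUM(A1:A10)"), "SUM(A1:A10)");
    // Already-normalized text passes through unchanged.
    assert_eq!(normalize_f_node_text("SUM(A1:A10)"), "SUM(A1:A10)");
    println!("normalized");
}
```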

Stable-topology perf groundwork

1. Benchmark-runner recalc plan reuse mode

  • perf(bench): add recalc plan reuse mode

Adds a controlled native_best_cached_plan / --reuse-recalc-plan mode to the Formualizer native benchmark runner.

Properties:

  • correctness-safe
  • deterministic
  • invalidates on topology-changing benchmark ops
  • falls back when dynamic refs are present or when no reusable plan exists
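The reuse gate above can be sketched as a generation-counter check: the cached plan is handed back only when no topology-changing op has occurred since it was built and no dynamic refs are in play, otherwise the runner falls back to a fresh rebuild. All names here (`RecalcPlan`, `PlanReuse`) are illustrative, not Formualizer's real API.

```rust
// Illustrative sketch of the recalc-plan reuse mode's safety gate.
#[derive(Clone)]
struct RecalcPlan {
    order: Vec<u32>, // evaluation order over cell ids
}

struct PlanReuse {
    cached: Option<RecalcPlan>,
    topology_gen: u64, // bumped by topology-changing benchmark ops
    cached_gen: u64,   // generation the cached plan was built at
}

impl PlanReuse {
    fn new() -> Self {
        PlanReuse { cached: None, topology_gen: 0, cached_gen: 0 }
    }

    fn note_topology_change(&mut self) {
        self.topology_gen += 1; // any structural edit invalidates the plan
    }

    /// Reuse the cached plan only when it is provably safe; otherwise
    /// fall back to rebuilding, mirroring the fallback behavior above.
    fn plan_for_recalc(
        &mut self,
        has_dynamic_refs: bool,
        rebuild: impl FnOnce() -> RecalcPlan,
    ) -> RecalcPlan {
        let reusable = !has_dynamic_refs
            && self.cached_gen == self.topology_gen
            && self.cached.is_some();
        if reusable {
            return self.cached.as_ref().unwrap().clone();
        }
        let plan = rebuild();
        self.cached = Some(plan.clone());
        self.cached_gen = self.topology_gen;
        plan
    }
}

fn main() {
    let mut reuse = PlanReuse::new();
    let p1 = reuse.plan_for_recalc(false, || RecalcPlan { order: vec![1, 2, 3] });
    // A second stable recalc reuses the plan; the rebuild closure is not called.
    let p2 = reuse.plan_for_recalc(false, || unreachable!("should reuse"));
    assert_eq!(p1.order, p2.order);
    // A topology change forces a rebuild.
    reuse.note_topology_change();
    let p3 = reuse.plan_for_recalc(false, || RecalcPlan { order: vec![3, 2, 1] });
    assert_eq!(p3.order, vec![3, 2, 1]);
}
```

The generation counter is what makes this deterministic: reuse depends only on the sequence of edits, never on timing or adaptive heuristics.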

2. Engine static schedule cache for stable recalcs

  • perf(engine): cache static schedules for stable recalcs

Adds a conservative internal schedule cache for stable-topology, non-dynamic, non-range-dependency recalcs.

Properties:

  • deterministic
  • invalidated on topology-changing edits
  • covered by engine tests
  • intentionally narrow in scope (no adaptive runtime policy)
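The "intentionally narrow" scope can be read as a conjunction of eligibility conditions that must all hold before a recalc is allowed to hit the cached schedule. The struct and field names below are assumptions about the engine's internals, not its real types:

```rust
// Illustrative eligibility gate for the static schedule cache.
#[derive(Clone, Copy)]
struct RecalcContext {
    topology_changed: bool, // any structural edit since last recalc
    has_dynamic_deps: bool, // e.g. volatile or indirect references
    has_range_deps: bool,   // range-dependency tracking in play
}

/// Conservative by design: only stable-topology, non-dynamic,
/// non-range-dependency recalcs may use the cached schedule;
/// everything else recomputes from scratch.
fn schedule_cache_eligible(ctx: &RecalcContext) -> bool {
    !ctx.topology_changed && !ctx.has_dynamic_deps && !ctx.has_range_deps
}

fn main() {
    let stable = RecalcContext {
        topology_changed: false,
        has_dynamic_deps: false,
        has_range_deps: false,
    };
    assert!(schedule_cache_eligible(&stable));
    let edited = RecalcContext { topology_changed: true, ..stable };
    assert!(!schedule_cache_eligible(&edited));
    println!("ok");
}
```

Keeping the predicate this strict is what lets the cache stay deterministic with no adaptive runtime policy: a miss always degrades to the existing full-schedule path.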

Performance notes

We did one extra A/B measurement round on the same machine comparing:

  • pre-perf benchmark branch point: 0ea742a
  • current head: 320d484

Measured as 3-run medians for formualizer_rust_native native_best:

| Scenario | Pre-perf incremental | Current incremental | Delta |
| --- | --- | --- | --- |
| headline_100k_single_edit | 19.467 µs | 24.887 µs | +27.8% |
| chain_100k | 105,092.923 µs | 70,290.640 µs | -33.1% |
| fanout_100k | 63,018.455 µs | 45,666.231 µs | -27.5% |
| sumifs_fact_table_100k | 19,060.541 µs | 21,267.939 µs | +11.6% |
| lookup_cross_sheet_dim_fact | 16,538.927 µs | 14,976.657 µs | -9.5% |
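The Delta column follows from the 3-run medians directly; a quick sketch of the arithmetic (illustrative only, not the harness's reporting code):

```rust
// Reproducing the table's Delta column: 3-run median, then percent change.
fn median3(mut runs: [f64; 3]) -> f64 {
    runs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    runs[1] // middle value of the three runs
}

fn delta_pct(pre: f64, cur: f64) -> f64 {
    (cur - pre) / pre * 100.0
}

fn main() {
    assert_eq!(median3([3.0, 1.0, 2.0]), 2.0);
    // chain_100k medians from the table (microseconds): about -33.1%.
    let d = delta_pct(105_092.923, 70_290.640);
    assert!((d - (-33.1)).abs() < 0.1);
    println!("{d:.1}%");
}
```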

Interpretation:

  • the stable-topology work clearly improves chain/fanout and also helps one lookup-heavy comparative scenario
  • the regressions observed are modest in absolute terms (headline regresses by a few microseconds; sumifs by a couple of milliseconds of incremental time)
  • the current implementation intentionally avoids adaptive/non-deterministic runtime heuristics

Validation performed

Representative validation run on this branch included:

  • cargo fmt --all
  • cargo clippy -p formualizer-eval --lib --tests -- -D warnings
  • cargo clippy -p formualizer-bench-core --all-targets --features xlsx,formualizer_runner,ironcalc_runner -- -D warnings
  • cargo clippy -p formualizer-testkit --all-targets -- -D warnings
  • cargo test -p formualizer-eval recalc_plan -- --nocapture
  • cargo test -p formualizer-eval schedule_cache -- --nocapture
  • cargo test -p formualizer-bench-core --features formualizer_runner --bin run-formualizer-native -- --nocapture
  • uv run --project benchmarks/harness python benchmarks/harness/runner/main.py validate-suite
  • uv run --project benchmarks/harness python benchmarks/harness/runner/main.py validate-plans
  • full fresh formualizer_rust_native native-best suite sweep

Fresh suite sweep result:

  • 17/17 scenarios passed for formualizer_rust_native

Merge-readiness / artifact hygiene

Checked before opening this PR:

  • branch is clean
  • benchmark-generated corpus XLSX files are ignored
  • harness results/ is ignored
  • local reference libraries under benchmarks/harness/ref-libs/ are ignored
  • local tool caches / notes / scratch artifacts remain ignored

Notable ignored paths include:

  • benchmarks/corpus/synthetic/
  • benchmarks/corpus/real/*.xlsx
  • benchmarks/harness/results/
  • benchmarks/harness/ref-libs/
  • target/

So the PR contains harness/tooling/contracts/docs/tests/code, but not local benchmark artifacts or vendored comparator checkouts.

Follow-up after merge

I recommend follow-up perf work on top of this merged benchmark baseline rather than extending this PR further.

Most likely next tranche:

  • deeper chain/scheduler hot-path work only if still needed after compare reruns
  • SUMIFS/range-criteria focused work (criteria-mask caching / invalidation, fast paths)
  • loader / ingest ablations and runtime parity reporting refinement
  • richer nightly compare/report automation from the new plan framework

PSU3D0 force-pushed the feat/benchmark-harness-meta-tranche branch from 517be91 to 009b50b on March 6, 2026 at 20:10
PSU3D0 force-pushed the feat/benchmark-harness-meta-tranche branch from 009b50b to b3b9f64 on March 6, 2026 at 20:17
PSU3D0 merged commit 7ddb247 into main Mar 6, 2026
5 checks passed