# Slick OrderBook Benchmarks

Performance benchmarking suite for the Slick OrderBook library.
This directory contains comprehensive benchmarks measuring the performance characteristics of the orderbook library:
- Latency Measurements: p50, p99, p99.9, p99.99 percentiles
- Throughput Tests: Operations per second under sustained load
- Memory Profiling: Footprint and allocation patterns
- Cache Performance: Alignment, locality, and prefetching effects
- Observer Overhead: Notification system impact
- Realistic Workloads: Market data replay simulations
## Building

```bash
# Configure with benchmarks enabled
cmake -B build -DCMAKE_BUILD_TYPE=Release -DSLICK_ORDERBOOK_BUILD_BENCHMARKS=ON

# Build
cmake --build build -j

# Benchmarks will be in: build/benchmarks/
```

## Running the Benchmarks

```bash
cd build/benchmarks

# L2 orderbook benchmarks
./bench_orderbook_l2
# L3 orderbook benchmarks
./bench_orderbook_l3
# Multi-symbol manager benchmarks
./bench_orderbook_manager
# Observer notification overhead
./bench_observer_overhead
# Memory usage profiling
./bench_memory_usage
# Realistic market replay
./bench_market_replay
# Cache alignment and locality
./bench_cache_alignment
```

To run every benchmark through the CMake target:

```bash
cd build
cmake --build . --target run_benchmarks
# Results will be in build/benchmarks/*.json
```

## Google Benchmark Options

Google Benchmark supports many command-line options:

```bash
# Run specific benchmark by name
./bench_orderbook_l2 --benchmark_filter=BM_L2_AddNewLevel
# Run for longer to get more stable results
./bench_orderbook_l2 --benchmark_min_time=10
# Output to console and JSON
./bench_orderbook_l2 --benchmark_out=results.json --benchmark_out_format=json
# Display user counters as tabular columns
./bench_orderbook_l2 --benchmark_counters_tabular=true
# Run only the multi-threaded benchmarks
./bench_orderbook_manager --benchmark_filter=Concurrent
```

## Benchmark Descriptions

### bench_orderbook_l2

Purpose: Measure L2 orderbook operation latencies
Benchmarks:
- BM_L2_AddNewLevel: Adding new price levels
- BM_L2_ModifyExistingLevel: Modifying quantities at existing levels
- BM_L2_DeleteLevel: Removing price levels
- BM_L2_GetBestBid/Ask: Top-of-book queries (hot path)
- BM_L2_GetTopOfBook: Full ToB snapshot
- BM_L2_GetLevels: Iterating through all levels
- BM_L2_MixedWorkload: Realistic mix of operations
Target Performance: < 100ns p99 for all operations
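For orientation, these tests are written against Google Benchmark; a self-contained sketch in the same style is shown below. The `OrderBookL2` struct here is a hypothetical stand-in for illustration, not the library's actual type or API.

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>
#include <map>

// Hypothetical stand-in for the library's L2 book; the real API may differ.
struct OrderBookL2 {
    std::map<int64_t, uint64_t> bids;  // price -> aggregate quantity
    void addOrModifyLevel(int64_t px, uint64_t qty) { bids[px] = qty; }
    int64_t bestBid() const { return bids.empty() ? 0 : bids.rbegin()->first; }
};

static void BM_L2_AddNewLevel(benchmark::State& state) {
    const int64_t depth = state.range(0);
    for (auto _ : state) {
        OrderBookL2 book;
        for (int64_t px = 1; px <= depth; ++px)
            book.addOrModifyLevel(px, 100);
        benchmark::DoNotOptimize(book.bestBid());
    }
    state.SetItemsProcessed(state.iterations() * depth);
}
// Level counts match the sample output shown under "Interpreting Results".
BENCHMARK(BM_L2_AddNewLevel)->Arg(10)->Arg(50)->Arg(100);

BENCHMARK_MAIN();
```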
### bench_orderbook_l3

Purpose: Measure L3 orderbook operation latencies
Benchmarks:
- BM_L3_AddNewOrder: Adding new orders
- BM_L3_ModifyExistingOrder: Modifying order quantities
- BM_L3_DeleteOrder: Removing orders
- BM_L3_ExecuteOrderPartial: Partial order fills
- BM_L3_ExecuteOrderComplete: Complete order fills
- BM_L3_GetLevelsL2: L2 aggregation from L3 data
- BM_L3_GetLevelsL3: Iterating through all orders
- BM_L3_MixedWorkload: Realistic mix of operations
Target Performance: < 200ns p99 for all operations
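An L3-style benchmark differs mainly in keying on individual order IDs; a minimal sketch of a partial-fill case, again using a hypothetical book rather than the library's real API:

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>
#include <unordered_map>

// Hypothetical L3 book keyed by order ID; the real API may differ.
struct OrderBookL3 {
    struct Order { int64_t price; uint64_t qty; };
    std::unordered_map<uint64_t, Order> orders;
    void addOrder(uint64_t id, int64_t px, uint64_t qty) { orders[id] = {px, qty}; }
    // Partial fill: reduce the resting quantity, erase when fully filled.
    void executeOrder(uint64_t id, uint64_t fill) {
        auto it = orders.find(id);
        if (it == orders.end()) return;
        if (it->second.qty > fill) it->second.qty -= fill;
        else orders.erase(it);
    }
};

static void BM_L3_ExecuteOrderPartial(benchmark::State& state) {
    OrderBookL3 book;
    for (uint64_t id = 0; id < 1000; ++id)
        book.addOrder(id, 100 + static_cast<int64_t>(id % 10), 1000000);
    uint64_t id = 0;
    for (auto _ : state) {
        book.executeOrder(id++ % 1000, 1);  // small partial fills
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_L3_ExecuteOrderPartial);

BENCHMARK_MAIN();
```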
### bench_orderbook_manager

Purpose: Measure multi-symbol overhead and scalability
Benchmarks:
- BM_Manager_GetOrCreateL2/L3: Symbol lookup and creation
- BM_Manager_MultiSymbolL2/L3Updates: Distributed updates across symbols
- BM_Manager_SymbolLookup: Raw lookup performance
- BM_Manager_SymbolChurn: Add/remove symbol cycles
- BM_Manager_IterateAllSymbols: Iteration overhead
- BM_Manager_ConcurrentReadHeavy: Multi-threaded read scaling
- BM_SingleSymbol_* vs BM_Manager_*: Overhead comparison
Target Performance: Minimal overhead compared to single-symbol operations
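Multi-threaded cases such as BM_Manager_ConcurrentReadHeavy can be driven with Google Benchmark's built-in thread support; a minimal sketch, with a trivial placeholder body rather than the suite's real lookup code:

```cpp
#include <benchmark/benchmark.h>

#include <atomic>
#include <cstdint>

// Placeholder shared state standing in for the manager's symbol table.
static std::atomic<uint64_t> g_lookups{0};

static void BM_Manager_ConcurrentReadHeavy(benchmark::State& state) {
    for (auto _ : state) {
        // A real benchmark would perform read-mostly symbol lookups here.
        benchmark::DoNotOptimize(g_lookups.fetch_add(1, std::memory_order_relaxed));
    }
}
// Run the same body on 1, 2, 4, and 8 threads to observe read scaling.
BENCHMARK(BM_Manager_ConcurrentReadHeavy)->Threads(1)->Threads(2)->Threads(4)->Threads(8);

BENCHMARK_MAIN();
```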
### bench_observer_overhead

Purpose: Measure notification system impact
Benchmarks:
- BM_L2/L3_NoObservers: Baseline (no observers attached)
- BM_L2/L3_WithCountingObservers: Minimal observer overhead
- BM_L2/L3_WithComputingObservers: Observer with computation
- BM_L2/L3_EmitSnapshot: Snapshot emission cost
- BM_AddRemoveObserver: Observer management overhead
Tests with 1, 5, 10, 50, and 100 observers to measure scaling.
Target Performance: < 50ns per observer notification
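The observer-count sweep maps naturally onto benchmark arguments; a sketch assuming a hypothetical minimal observer (the library's observer interface may differ):

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>
#include <vector>

// Hypothetical minimal observer used only for this sketch.
struct CountingObserver {
    uint64_t updates = 0;
    void onUpdate() { ++updates; }
};

static void BM_L2_WithCountingObservers(benchmark::State& state) {
    std::vector<CountingObserver> observers(state.range(0));
    for (auto _ : state) {
        // Stand-in for the book fanning one update out to every observer.
        for (auto& obs : observers) obs.onUpdate();
        benchmark::ClobberMemory();
    }
}
// Observer counts exercised by the suite.
BENCHMARK(BM_L2_WithCountingObservers)->Arg(1)->Arg(5)->Arg(10)->Arg(50)->Arg(100);

BENCHMARK_MAIN();
```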
### bench_memory_usage

Purpose: Profile memory footprint and allocation patterns
Benchmarks:
- BM_L2/L3_MemoryFootprint: Static memory usage at different sizes
- BM_L2/L3_MemoryGrowth: Incremental growth patterns
- BM_L2/L3_MemoryChurn: Add/delete cycle efficiency
- BM_Manager_MemoryFootprint_*: Multi-symbol memory usage
- BM_L2_CacheLocality_*: Sequential vs random access patterns
Target Memory:
- L2: < 1KB per symbol
- L3: < 10KB per symbol (with 1000 orders)
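Relative memory figures can be surfaced next to the timing columns via Google Benchmark's user counters; a sketch, where the 16-bytes-per-level estimate is an invented number for illustration only:

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>

static void BM_L2_MemoryFootprint(benchmark::State& state) {
    int64_t levels = state.range(0);
    for (auto _ : state) {
        // A real benchmark would build a book with `levels` price levels here.
        benchmark::DoNotOptimize(levels);
    }
    // Report an estimated footprint alongside timing; 16 bytes per level
    // is an assumed figure, not a measured one.
    state.counters["est_bytes"] = static_cast<double>(levels) * 16;
}
BENCHMARK(BM_L2_MemoryFootprint)->Arg(10)->Arg(100)->Arg(1000);

BENCHMARK_MAIN();
```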
Note: For accurate absolute memory measurements, use external tools:

```bash
# Valgrind massif
valgrind --tool=massif ./bench_memory_usage

# Heaptrack (Linux)
heaptrack ./bench_memory_usage

# Instruments (macOS)
instruments -t Allocations ./bench_memory_usage
```

### bench_market_replay

Purpose: Simulate realistic market data patterns
Benchmarks:
- BM_L2/L3_MarketReplay_LowFreq: 100 events/sec (typical)
- BM_L2/L3_MarketReplay_HighFreq: 10,000 events/sec (HFT)
- BM_MultiSymbol_MarketReplay: Multi-symbol scenarios
- BM_BurstPattern_MarketOpen: Snapshot + incremental updates
- BM_Throughput_SustainedUpdates_*: Maximum sustained throughput
Event Distribution:
- L2: 50% modify, 30% add, 20% delete
- L3: 40% new order, 20% modify, 20% delete, 10% execute, 10% query
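A weighted event stream like this is straightforward to generate with `std::discrete_distribution`; a self-contained sketch of the L2 mix:

```cpp
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    // Weights mirror the L2 mix above: 50% modify, 30% add, 20% delete.
    std::discrete_distribution<int> dist({50.0, 30.0, 20.0});
    int counts[3] = {0, 0, 0};
    for (int i = 0; i < 100000; ++i)
        ++counts[dist(rng)];
    std::printf("modify=%d add=%d delete=%d\n", counts[0], counts[1], counts[2]);
}
```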
### bench_cache_alignment

Purpose: Diagnostic tool for cache optimization
Features:
- Prints alignment information for all critical structures
- Compares aligned vs unaligned access patterns
- Measures false sharing impact
- Array-of-Structures (AoS) vs Structure-of-Arrays (SoA)
- Sequential vs random access (cache locality)
- Prefetch effectiveness
- Hot vs cold cache performance
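Alignment claims of this kind can also be verified at compile time with `alignas`/`alignof`; a minimal sketch using a hypothetical `PriceLevel` struct (not the library's actual type):

```cpp
#include <cstdio>

// Hypothetical hot-path struct padded out to one cache line.
struct alignas(64) PriceLevel {
    long long price;
    unsigned long long quantity;
};

static_assert(alignof(PriceLevel) == 64, "PriceLevel must be cache-aligned");

int main() {
    std::printf("alignof(PriceLevel) = %zu, sizeof(PriceLevel) = %zu\n",
                alignof(PriceLevel), sizeof(PriceLevel));
}
```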
Running:

```bash
./bench_cache_alignment

# Look for alignment info in output:
# ✓ Cache-aligned (64 bytes)
# ⚠ SIMD-aligned (16 bytes) but not cache-aligned
# ✗ Not aligned for optimal performance
```

## Performance Targets

| Component | Operation | Target (p99) |
|---|---|---|
| L2 OrderBook | Add/Modify/Delete Level | < 100 nanoseconds |
| L2 OrderBook | Get Best Bid/Ask | < 10 nanoseconds |
| L2 OrderBook | Get Top of Book | < 10 nanoseconds |
| L3 OrderBook | Add/Modify/Delete Order | < 200 nanoseconds |
| L3 OrderBook | Execute Order | < 200 nanoseconds |
| L3 OrderBook | L2 Aggregation | < 500 nanoseconds |
| Observer | Notification per Observer | < 50 nanoseconds |
| Manager | Symbol Lookup | < 50 nanoseconds |
## Interpreting Results

Sample output:

```
---------------------------------------------------------------------------
Benchmark                        Time             CPU      Iterations
---------------------------------------------------------------------------
BM_L2_AddNewLevel/10          68.2 ns         68.1 ns        10240000
BM_L2_AddNewLevel/50          89.5 ns         89.4 ns         7823104
BM_L2_AddNewLevel/100          112 ns          112 ns         6241280
```
- Time: Wall-clock time per iteration
- CPU: CPU time per iteration (may differ from wall time with I/O or threading)
- Iterations: Number of times benchmark ran to get stable results
Key metrics to watch:
- Latency: Time per operation (lower is better)
- Throughput: Operations per second (higher is better); e.g., a 100 ns operation corresponds to roughly 10 million ops/sec per core
- Scalability: Performance vs problem size (flatter is better)
- Overhead: Difference between baseline and feature (smaller is better)
Good Performance:
- L2 operations consistently < 100ns
- L3 operations consistently < 200ns
- Linear or better scaling with size
- Minimal observer overhead (< 10% per observer)
Performance Issues:
- Operations > 1μs (investigate why)
- Exponential scaling with size (O(n²) behavior)
- Large variance between runs (inconsistent)
- High observer overhead (> 50ns per observer)
## Profiling

### Linux perf

```bash
# Cache misses
perf stat -e cache-references,cache-misses ./bench_orderbook_l2

# Branch prediction
perf stat -e branches,branch-misses ./bench_orderbook_l2

# Detailed profiling
perf record -g ./bench_orderbook_l2
perf report
```

### Intel VTune

```bash
vtune -collect hotspots -result-dir vtune_results ./bench_orderbook_l2
```

### Tracy

Integrate Tracy for frame-based profiling:
```cpp
#include <tracy/Tracy.hpp>

void addOrModifyLevel(...) {
    ZoneScoped;  // Scoped zone: profiles this function automatically
    // ... implementation
}
```

## Continuous Benchmarking

Set up CI to track performance over time:
- Run benchmarks on every commit
- Store results in time-series database
- Alert on regressions (> 10% slowdown)
- Visualize trends with Grafana or similar
Example: GitHub Actions + benchmark-action
## Optimization Workflow

1. Profile: Run benchmarks to identify bottlenecks
2. Hypothesize: Form a theory about what is slow and why
3. Optimize: Make targeted code changes
4. Verify: Re-run benchmarks to confirm the improvement
5. Iterate: Repeat until targets are met
Always measure before and after optimizations!
## Contributing

When adding new features:
- Add corresponding benchmarks
- Document expected performance
- Verify no regressions in existing benchmarks
- Update this README with new benchmark descriptions