Designed for High-Frequency Trading (HFT) environments where every nanosecond counts. This engine implements a subset of NASDAQ OUCH 5.0 and ITCH 5.0 protocols with a focus on mechanical sympathy.
-
Median Latency (
$P50$ ): 45.51 ns -
Tail Latency (
$P99.9$ ): 837.35 ns - Throughput: ~14.37 Million ops/sec
- Execution Model: Single-threaded deterministic hot-path with asynchronous I/O.
The system utilizes a Thread-per-Core architecture. By pinning the Matching Engine to a dedicated physical core and isolating I/O, we eliminate OS context switches and maximize L1/L2 cache residency.
- Zero-Allocation Hot Path: Utilizing a custom
OrderPoolbacked by 2MB Huge Pages (MAP_HUGETLB). This avoidsmallocnon-determinism and minimizes TLB misses during high volatility. - Lock-Free Communication: Components communicate via Single-Producer Single-Consumer (SPSC) ring buffers. This removes the overhead of mutex contention and cache coherency traffic.
- Mechanical Sympathy: Data structures are cache-line aligned (64 bytes) to prevent False Sharing and optimized for spatial locality.
Standard std::unordered_map uses linked-list chaining, causing expensive pointer chasing.
- Our Solution: An open-addressed hash map with Backward Shift Deletion.
- Impact: By shifting elements back upon deletion, we maintain short probe sequences without the performance degradation of "tombstones," resulting in a 3.47x speedup over the STL.
Persistence is usually the bottleneck.
- The Optimization: We use
_mm256_stream_si256(Non-temporal stores) to stream Write-Ahead Logs (WAL) directly to memory-mapped files. - The Why: This bypasses the L3 cache entirely. Since log data is rarely read back immediately, we prevent it from "polluting" the cache, leaving more room for hot Order Book data.
We replaced std::lower_bound with a custom branchless binary search using assembly conditional moves (cmov). This stabilizes performance during price discovery by preventing CPU pipeline stalls caused by branch mispredictions.
Environment: Ryzen 9 6900HS (Isolated Cores), Ubuntu 24.04, Isolated Cores
| Metric | Cycles (@ 4.4 GHz) | Latency |
|---|---|---|
| Minimum | 42 | 9.46 ns |
| Median (P50) | 202 | 45.51 ns |
| 99th Percentile | 1,566 | 352.78 ns |
| 99.9th Percentile | 3,717 | 837.35 ns |
Measured with hardware performance counter (RDPMC) |
| Implementation | Throughput | vs. STL |
|---|---|---|
std::unordered_map |
26.13 M/s | Baseline |
absl::flat_hash_map |
37.16 M/s | 1.42x |
| OrderMap (Fibonacci) | 46.42 M/s | 1.78x |
| OrderMap (Identity) | 90.68 M/s | 3.47x |
In financial systems, speed is worthless without correctness.
- Oracle Validation:
OrderMapis continuously cross-referenced against a shadow STL implementation. - Fuzz Testing: Integrated clang's
libFuzzerto ensure engine stability against malformed OUCH/ITCH packets. - Static Analysis: Built with
-Wall -Wextra -Wpedanticand analyzed via Clang-Tidy.
Note: A kernel with Huge Pages support is suggested for OrderPool and OrderMap.
# Enable Huge Pages
sudo sysctl -w vm.nr_hugepages=1024
# Build with Release optimizations
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# 3. Run
./OrderBookThis project was created by Dhruvan Gnanadhandayuthapani <dhruvan2006@gmail.com>.

