Conversation


@mkroening mkroening commented Nov 4, 2025

This replaces the vec-based MemPool with a bitmap-based IndexAlloc. To track 256 indexes, we now need 32 bytes instead of 512 bytes. I am not sure if this is worth it, though.

These are measurements from my machine of creating the allocator, allocating all indices, and then deallocating them again:

| len | size old (bytes) | size new (bytes) |
|-----:|-----------------:|-----------------:|
| 256 | 512 | 32 |
| 1024 | 2048 | 64 |
| 2048 | 4096 | 128 |

@mkroening mkroening self-assigned this Nov 4, 2025
@mkroening mkroening changed the title perf(virtqueue): remove unused MemPool::limit field perf(virtqueue): replace vec-based MemPools with bitmap-based IndexAlloc Nov 4, 2025
@mkroening mkroening force-pushed the mempool-bitvec branch 2 times, most recently from 7320231 to fc37370 Compare November 4, 2025 08:53
@mkroening mkroening marked this pull request as ready for review November 4, 2025 08:54
@mkroening mkroening requested review from Gelbpunkt and cagatay-y and removed request for Gelbpunkt November 4, 2025 08:54

@github-actions github-actions bot left a comment

Benchmark Results

| Benchmark | Current: 78d301c | Previous: 9d1e4dd | Performance Ratio |
|---|---|---|---|
| startup_benchmark Build Time | 112.64 s | 111.98 s | 1.01 |
| startup_benchmark File Size | 0.91 MB | 0.91 MB | 1.00 |
| Startup Time - 1 core | 0.92 s (±0.03 s) | 0.94 s (±0.02 s) | 0.98 |
| Startup Time - 2 cores | 0.93 s (±0.02 s) | 0.94 s (±0.03 s) | 0.99 |
| Startup Time - 4 cores | 0.93 s (±0.02 s) | 0.94 s (±0.02 s) | 0.99 |
| multithreaded_benchmark Build Time | 113.19 s | 112.12 s | 1.01 |
| multithreaded_benchmark File Size | 1.02 MB | 1.02 MB | 1.00 |
| Multithreaded Pi Efficiency - 2 Threads | 87.29 % (±6.22 %) | 88.00 % (±7.30 %) | 0.99 |
| Multithreaded Pi Efficiency - 4 Threads | 44.24 % (±3.39 %) | 43.95 % (±3.32 %) | 1.01 |
| Multithreaded Pi Efficiency - 8 Threads | 25.40 % (±2.24 %) | 25.25 % (±2.26 %) | 1.01 |
| micro_benchmarks Build Time | 293.42 s | 315.31 s | 0.93 |
| micro_benchmarks File Size | 1.02 MB | 1.02 MB | 1.00 |
| Scheduling time - 1 thread | 166.87 ticks (±27.41 ticks) | 181.72 ticks (±30.16 ticks) | 0.92 |
| Scheduling time - 2 threads | 101.62 ticks (±22.94 ticks) | 107.77 ticks (±18.94 ticks) | 0.94 |
| Micro - Time for syscall (getpid) | 10.25 ticks (±4.71 ticks) | 13.22 ticks (±5.19 ticks) | 0.77 |
| Memcpy speed - (built_in) block size 4096 | 60354.71 MByte/s (±43217.33 MByte/s) | 55204.30 MByte/s (±40404.17 MByte/s) | 1.09 |
| Memcpy speed - (built_in) block size 1048576 | 13715.40 MByte/s (±11182.02 MByte/s) | 14269.94 MByte/s (±12083.95 MByte/s) | 0.96 |
| Memcpy speed - (built_in) block size 16777216 | 10001.37 MByte/s (±8088.85 MByte/s) | 7485.41 MByte/s (±6068.00 MByte/s) | 1.34 |
| Memset speed - (built_in) block size 4096 | 60529.57 MByte/s (±43362.49 MByte/s) | 55494.03 MByte/s (±40595.71 MByte/s) | 1.09 |
| Memset speed - (built_in) block size 1048576 | 14069.07 MByte/s (±11388.11 MByte/s) | 14670.46 MByte/s (±12334.51 MByte/s) | 0.96 |
| Memset speed - (built_in) block size 16777216 | 10232.46 MByte/s (±8219.09 MByte/s) | 7599.70 MByte/s (±6128.76 MByte/s) | 1.35 |
| Memcpy speed - (rust) block size 4096 | 54089.57 MByte/s (±40501.99 MByte/s) | 51056.01 MByte/s (±38411.57 MByte/s) | 1.06 |
| Memcpy speed - (rust) block size 1048576 | 13805.14 MByte/s (±11305.04 MByte/s) | 13857.34 MByte/s (±11384.86 MByte/s) | 1.00 |
| Memcpy speed - (rust) block size 16777216 | 10006.77 MByte/s (±8110.63 MByte/s) | 7529.44 MByte/s (±6166.85 MByte/s) | 1.33 |
| Memset speed - (rust) block size 4096 | 54844.52 MByte/s (±41047.82 MByte/s) | 51991.84 MByte/s (±39176.17 MByte/s) | 1.05 |
| Memset speed - (rust) block size 1048576 | 14068.24 MByte/s (±11428.28 MByte/s) | 14087.35 MByte/s (±11505.86 MByte/s) | 1.00 |
| Memset speed - (rust) block size 16777216 | 10263.74 MByte/s (±8266.58 MByte/s) | 7599.50 MByte/s (±6197.41 MByte/s) | 1.35 |
| alloc_benchmarks Build Time | 293.71 s | 312.10 s | 0.94 |
| alloc_benchmarks File Size | 0.98 MB | 0.98 MB | 1.00 |
| Allocations - Allocation success | 100.00 % | 100.00 % | 1 |
| Allocations - Deallocation success | 100.00 % | 100.00 % | 1 |
| Allocations - Pre-fail Allocations | 100.00 % | 100.00 % | 1 |
| Allocations - Average Allocation time | 20171.66 Ticks (±975.73 Ticks) | 20044.74 Ticks (±1076.26 Ticks) | 1.01 |
| Allocations - Average Allocation time (no fail) | 20171.66 Ticks (±975.73 Ticks) | 20044.74 Ticks (±1076.26 Ticks) | 1.01 |
| Allocations - Average Deallocation time | 2914.05 Ticks (±1268.86 Ticks) | 2980.59 Ticks (±1256.07 Ticks) | 0.98 |
| mutex_benchmark Build Time | 292.81 s | 296.51 s | 0.99 |
| mutex_benchmark File Size | 1.02 MB | 1.02 MB | 1.00 |
| Mutex Stress Test Average Time per Iteration - 1 Threads | 36.14 ns (±3.33 ns) | 36.10 ns (±4.90 ns) | 1.00 |
| Mutex Stress Test Average Time per Iteration - 2 Threads | 30.00 ns (±3.13 ns) | 29.58 ns (±2.65 ns) | 1.01 |

This comment was automatically generated by workflow using github-action-benchmark.

@mkroening mkroening marked this pull request as draft November 4, 2025 13:05
@mkroening mkroening force-pushed the mempool-bitvec branch 2 times, most recently from e3958f3 to 34f9060 Compare November 4, 2025 18:06
Comment on lines +546 to +556:

```rust
for (word_index, word) in self.bits.iter_mut().enumerate() {
    let trailing_ones = word.trailing_ones();
    if trailing_ones < usize::BITS {
        let mask = 1 << trailing_ones;
        *word |= mask;
        let index = word_index * USIZE_BITS + usize::try_from(trailing_ones).unwrap();
        return Some(index);
    }
}

None
```
Contributor
Suggested change:

```rust
let (word_index, trailing_ones) = self
    .bits
    .iter()
    .copied()
    .map(usize::trailing_ones)
    .enumerate()
    .find(|(_, trailing_ones)| *trailing_ones < usize::BITS)?;
let mask = 1 << trailing_ones;
self.bits[word_index] |= mask;
let index = word_index * USIZE_BITS + usize::try_from(trailing_ones).unwrap();
Some(index)
```

I am not sure whether it would be an improvement, but I wanted to offer it as an option. It would save us some nesting.

Member Author
Interesting! I have looked into this, and the compiler fails to optimize away the bounds check when setting the bit. Also, perhaps because the trailing-ones calculation is now farther away, the compiler no longer optimizes the masking from `shl` and `or` into `bts`.

For details, see Compiler Explorer.

So I'd keep it as is, even though the performance difference is small, of course (about 5%). :D

@mkroening mkroening marked this pull request as ready for review November 6, 2025 17:05