A fast and scalable lock-free bitmap implementation based on the Linux kernel's sbitmap.
sbitmap provides a high-performance, cache-line optimized bitmap for concurrent bit allocation across multiple threads. It's designed for scenarios where many threads need to allocate and free bits from a shared pool efficiently.
- Lock-free: All operations use atomic instructions without locks
- Cache-line aligned: Each bitmap word is on its own cache line to prevent false sharing
- Lightweight hints: Callers pass allocation hints by reference - no thread-local overhead
- Scalable: Tested with high concurrency workloads
- Memory efficient: Bit-level granularity with minimal overhead
This implementation is based on the Linux kernel's sbitmap (from lib/sbitmap.c), specifically designed for:
- High-concurrency scenarios (multiple queues, multiple threads)
- Efficient resource allocation (journal entries, tags, etc.)
- Low-latency allocation and deallocation
- Cache-line separation: Each SbitmapWord is aligned to 64 bytes
- Per-task allocation hints: Caller-provided hints reduce contention without thread-local overhead
- Atomic operations: Acquire/Release semantics for correctness
- No deferred clearing: Direct atomic bit clearing for simplicity
Add to your Cargo.toml:
```toml
[dependencies]
sbitmap = "0.1"
```

```rust
use sbitmap::Sbitmap;

// Create a bitmap with 1024 bits (non-round-robin mode)
let sb = Sbitmap::new(1024, None, false);

// Each caller maintains its own allocation hint
let mut hint = 0;

// Allocate a bit
if let Some(bit) = sb.get(&mut hint) {
    // Use the allocated bit
    println!("Allocated bit: {}", bit);
    // Free it when done
    sb.put(bit, &mut hint);
}
```

```rust
use sbitmap::Sbitmap;
use std::sync::Arc;
use std::thread;

let sb = Arc::new(Sbitmap::new(1024, None, false));
let mut handles = vec![];

for _ in 0..8 {
    let sb = Arc::clone(&sb);
    handles.push(thread::spawn(move || {
        // Each thread maintains its own hint in local context
        let mut hint = 0;
        // Each thread can safely allocate/free bits
        if let Some(bit) = sb.get(&mut hint) {
            // Do work...
            sb.put(bit, &mut hint);
        }
    }));
}

for h in handles {
    h.join().unwrap();
}
```

```rust
use sbitmap::Sbitmap;

// Create a bitmap with 1024 bits
let sb = Sbitmap::new(1024, None, false);
let mut hint = 0;

// Allocate 4 consecutive bits atomically
if let Some(start_bit) = sb.get_batch(4, &mut hint) {
    // Use bits: start_bit, start_bit+1, start_bit+2, start_bit+3
    println!("Allocated bits {}-{}", start_bit, start_bit + 3);
    // Process consecutive resources...
    for i in 0..4 {
        println!("Using bit {}", start_bit + i);
    }
    // Free all 4 bits atomically when done
    sb.put_batch(start_bit, 4, &mut hint);
}
```

Note: Batch operations require nr_bits <= bits_per_word(). All consecutive bits are guaranteed to fall within the same word (no spanning across word boundaries).
Create a new sbitmap with depth bits. The shift parameter controls how many bits per word (2^shift bits per word) and is critical for performance - it determines how bits are spread across multiple cache-line aligned words. When None, the shift is auto-calculated for optimal cache usage. The round_robin parameter enables strict round-robin allocation order (usually false for better performance).
Understanding the shift parameter:
- The shift value spreads bits among multiple words, which is key to sbitmap performance
- Each word is on a separate cache line (64 bytes), reducing contention between CPUs
- Smaller shift = more words = better spreading = less contention (but more memory overhead)
- Larger shift = fewer words = more contention (but better memory efficiency)
Allocate a free bit. The hint parameter is a mutable reference to the caller's allocation hint, which helps reduce contention by spreading allocations across different parts of the bitmap. Returns Some(bit_number) on success or None if no free bits are available.
Free a previously allocated bit. The hint parameter is updated to improve cache locality for subsequent allocations.
Allocate nr_bits consecutive free bits from the bitmap atomically. This operation provides acquire barrier semantics on success. Only supports nr_bits <= bits_per_word() to ensure all bits are within the same word (no spanning across word boundaries).
Returns Some(start_bit) where start_bit is the first bit of the allocated consecutive range, or None if no consecutive nr_bits are available or nr_bits > bits_per_word().
Use cases:
- Allocating contiguous resource ranges (e.g., multiple consecutive I/O tags)
- Batch resource allocation for improved efficiency
- DMA buffer allocation requiring consecutive indices
Free nr_bits consecutive previously allocated bits starting from bitnr. This operation provides release barrier semantics, ensuring that all writes to data associated with these bits are visible before the bits are freed. Only supports nr_bits <= bits_per_word() to ensure all bits are within the same word.
The hint parameter is updated for better cache locality in subsequent allocations.
Check if a bit is currently allocated.
Count the number of currently allocated bits.
Get the total number of bits in the bitmap.
- Tag allocation: I/O tag allocation for block devices
- Resource pools: Any scenario requiring efficient concurrent resource allocation
- Lock-free data structures: Building block for concurrent algorithms
- Batch resource allocation: Allocating multiple consecutive I/O tags, DMA buffers, or contiguous resource ranges
- NUMA systems: the reduced cross-node cache traffic yields especially large gains on NUMA machines
- Allocation: O(n) worst case, O(1) average with hints
- Deallocation: O(1)
- Batch allocation: O(n * nr_bits) worst case; searches for consecutive bits within a single word
- Batch deallocation: O(1), atomic clear of consecutive bits
- Memory overhead: ~56 bytes per word (64 bits) due to cache-line alignment
- Thread safety: Lock-free with atomic operations
- Scalability: Linear scaling with number of CPUs up to bitmap depth
The shift parameter is crucial for tuning sbitmap performance based on your workload:
When to use a smaller shift:
- High contention: When many threads are competing heavily for bit allocation and release, use a smaller shift to spread bits across more words and reduce contention on individual cache lines
- NUMA systems: Machines with multiple NUMA nodes benefit significantly from smaller shift values, as this distributes memory accesses across more cache lines and reduces cross-node traffic
- Many concurrent allocators: Systems with a high CPU count see better scalability with smaller shift values
Examples:
```rust
// High contention scenario (32-core NUMA system)
let sb = Sbitmap::new(1024, Some(4), false); // 2^4 = 16 bits per word, 64 words

// Low contention scenario (4-core system)
let sb = Sbitmap::new(1024, Some(6), false); // 2^6 = 64 bits per word, 16 words

// Let sbitmap decide (recommended starting point)
let sb = Sbitmap::new(1024, None, false); // Auto-calculated based on depth
```

Trade-offs:
- Smaller shift improves performance under contention but uses more memory (each word needs 64 bytes for cache-line alignment)
- Larger shift reduces memory overhead but increases contention when many threads compete
- The auto-calculated shift (when None) provides a balanced default suitable for most workloads
- get(): Acquire semantics - ensures the allocated bit is visible before use
- put(): Release semantics - ensures all writes complete before the bit is freed
- get_batch(): Acquire semantics - ensures all allocated bits are visible before use
- put_batch(): Release semantics - ensures all writes complete before the bits are freed
| Feature | sbitmap | Mutex + BitVec | AtomicBitSet |
|---|---|---|---|
| Lock-free | ✅ | ❌ | ✅ |
| Cache-optimized | ✅ | ❌ | ❌ |
| Per-thread hints | ✅ | ❌ | ❌ |
| Kernel-proven design | ✅ | ❌ | ❌ |
To compare sbitmap performance against a simple lockless bitmap:
```sh
# Run with defaults (32 bits, auto shift, 10 seconds, N-1 tasks)
cargo run --bin bench_compare --release

# Specify bitmap depth and duration
cargo run --bin bench_compare --release -- --depth 1024 --time 5

# Specify bitmap depth, shift, and duration
cargo run --bin bench_compare --release -- --depth 512 --shift 5 --time 10

# Benchmark batch operations (allocating 4 consecutive bits)
cargo run --bin bench_compare --release -- --depth 128 --batch 4 --time 5

# Show help
cargo run --bin bench_compare --release -- --help
```

This benchmark:
- Auto-detects available CPUs and spawns N-1 concurrent tasks
- Measures operations per second (get + put pairs for single-bit mode, get_batch + put_batch pairs for batch mode)
- Compares sbitmap vs a baseline lockless implementation (single-bit mode only)
- Defaults: 32 bits, auto-calculated shift, 10 seconds, N-1 tasks (where N is total CPU count)
Options:
- --depth DEPTH - Bitmap depth in bits (default: 32)
- --shift SHIFT - log2(bits per word), auto-calculated if not specified
- --time TIME - Benchmark duration in seconds (default: 10)
- --tasks TASKS - Number of concurrent tasks (default: NUM_CPUS - 1)
- --batch NR_BITS - Use get_batch/put_batch with NR_BITS (default: 1, single bit mode)
- --round-robin - Enable round-robin allocation mode (default: disabled)
See benches/README.md for more details.
Example output on a 32-CPU system:
```text
System: 32 CPUs detected, 2 NUMA nodes, using 31 tasks for benchmark
Bitmap depth: 32 bits
Shift: auto-calculated (bits per word: 8)
Duration: 10 seconds

=== Sbitmap (Optimized) Benchmark ===
Configuration:
- Duration: 10s
- Tasks: 31
- Bitmap depth: 32 bits

Results:
Task 0: 3101117 ops, 310111 ops/sec (0.3101 Mops/sec)
...
Task 30: 3169582 ops, 316958 ops/sec (0.3170 Mops/sec)
Total: 93604448 ops, 9360444 ops/sec (9.3604 Mops/sec)

=== SimpleBitmap (Baseline) Benchmark ===
Configuration:
- Duration: 10s
- Tasks: 31
- Bitmap depth: 32 bits

Results:
Task 0: 1998241 ops, 199824 ops/sec (0.1998 Mops/sec)
...
Task 30: 1835360 ops, 183536 ops/sec (0.1835 Mops/sec)
Total: 62530560 ops, 6253056 ops/sec (6.2531 Mops/sec)
```
Licensed under either of:
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
at your option.
Contributions are welcome! Please feel free to submit a Pull Request.
Based on the Linux kernel's sbitmap implementation by Jens Axboe and other contributors.