sbitmap - Scalable Bitmap Allocator

A fast and scalable lock-free bitmap implementation based on the Linux kernel's sbitmap.

Overview

sbitmap provides a high-performance, cache-line optimized bitmap for concurrent bit allocation across multiple threads. It's designed for scenarios where many threads need to allocate and free bits from a shared pool efficiently.

Features

  • Lock-free: All operations use atomic instructions without locks
  • Cache-line aligned: Each bitmap word is on its own cache line to prevent false sharing
  • Lightweight hints: Callers pass allocation hints by reference - no thread-local overhead
  • Scalable: Tested with high concurrency workloads
  • Memory efficient: Bit-level granularity with minimal overhead

Design

This implementation is based on the Linux kernel's sbitmap (from lib/sbitmap.c), specifically designed for:

  • High-concurrency scenarios (multiple queues, multiple threads)
  • Efficient resource allocation (journal entries, tags, etc.)
  • Low-latency allocation and deallocation

Key Optimizations

  1. Cache-line separation: Each SbitmapWord is aligned to 64 bytes
  2. Per-task allocation hints: Caller-provided hints reduce contention without thread-local overhead
  3. Atomic operations: Acquire/Release semantics for correctness
  4. No deferred clearing: Direct atomic bit clearing for simplicity

Usage

Add to your Cargo.toml:

[dependencies]
sbitmap = "0.1"

Basic Example

use sbitmap::Sbitmap;

// Create a bitmap with 1024 bits (non-round-robin mode)
let sb = Sbitmap::new(1024, None, false);

// Each caller maintains its own allocation hint
let mut hint = 0;

// Allocate a bit
if let Some(bit) = sb.get(&mut hint) {
    // Use the allocated bit
    println!("Allocated bit: {}", bit);

    // Free it when done
    sb.put(bit, &mut hint);
}

Concurrent Usage

use sbitmap::Sbitmap;
use std::sync::Arc;
use std::thread;

let sb = Arc::new(Sbitmap::new(1024, None, false));
let mut handles = vec![];

for _ in 0..8 {
    let sb = Arc::clone(&sb);
    handles.push(thread::spawn(move || {
        // Each thread maintains its own hint in local context
        let mut hint = 0;

        // Each thread can safely allocate/free bits
        if let Some(bit) = sb.get(&mut hint) {
            // Do work...
            sb.put(bit, &mut hint);
        }
    }));
}

for h in handles {
    h.join().unwrap();
}

Batch Allocation

use sbitmap::Sbitmap;

// Create a bitmap with 1024 bits
let sb = Sbitmap::new(1024, None, false);
let mut hint = 0;

// Allocate 4 consecutive bits atomically
if let Some(start_bit) = sb.get_batch(4, &mut hint) {
    // Use bits: start_bit, start_bit+1, start_bit+2, start_bit+3
    println!("Allocated bits {}-{}", start_bit, start_bit + 3);

    // Process consecutive resources...
    for i in 0..4 {
        println!("Using bit {}", start_bit + i);
    }

    // Free all 4 bits atomically when done
    sb.put_batch(start_bit, 4, &mut hint);
}

Note: Batch operations require nr_bits <= bits_per_word(). All consecutive bits are guaranteed to be within the same word (no spanning across word boundaries).

API

Sbitmap::new(depth: usize, shift: Option<u32>, round_robin: bool) -> Self

Create a new sbitmap with depth bits. The shift parameter controls how many bits per word (2^shift bits per word) and is critical for performance - it determines how bits are spread across multiple cache-line aligned words. When None, the shift is auto-calculated for optimal cache usage. The round_robin parameter enables strict round-robin allocation order (usually false for better performance).

Understanding the shift parameter:

  • The shift value spreads bits among multiple words, which is key to sbitmap performance
  • Each word is on a separate cache line (64 bytes), reducing contention between CPUs
  • Smaller shift = more words = better spreading = less contention (but more memory overhead)
  • Larger shift = fewer words = more contention (but better memory efficiency)

get(&self, hint: &mut usize) -> Option<usize>

Allocate a free bit. The hint parameter is a mutable reference to the caller's allocation hint, which helps reduce contention by spreading allocations across different parts of the bitmap. Returns Some(bit_number) on success or None if no free bits are available.

put(&self, bitnr: usize, hint: &mut usize)

Free a previously allocated bit. The hint parameter is updated to improve cache locality for subsequent allocations.
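
A minimal sketch of combining the two calls above when the bitmap may be full. The acquire_blocking helper and the yield-based backoff are illustrative assumptions, not part of the crate:

use sbitmap::Sbitmap;

// Hypothetical helper: spin until a bit becomes available.
// The backoff policy (a plain yield here) is an assumption; pick one
// that fits your workload.
fn acquire_blocking(sb: &Sbitmap, hint: &mut usize) -> usize {
    loop {
        if let Some(bit) = sb.get(hint) {
            return bit;
        }
        std::thread::yield_now();
    }
}

let sb = Sbitmap::new(64, None, false);
let mut hint = 0;
let bit = acquire_blocking(&sb, &mut hint);
// ... use the resource associated with `bit` ...
sb.put(bit, &mut hint);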

get_batch(&self, nr_bits: usize, hint: &mut usize) -> Option<usize>

Allocate nr_bits consecutive free bits from the bitmap atomically. This operation provides acquire barrier semantics on success. Only supports nr_bits <= bits_per_word() to ensure all bits are within the same word (no spanning across word boundaries).

Returns Some(start_bit) where start_bit is the first bit of the allocated consecutive range, or None if no consecutive nr_bits are available or nr_bits > bits_per_word().

Use cases:

  • Allocating contiguous resource ranges (e.g., multiple consecutive I/O tags)
  • Batch resource allocation for improved efficiency
  • DMA buffer allocation requiring consecutive indices
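
In addition to the Batch Allocation example above, the sketch below shows how a caller might react to a None result, which can mean either that no run of the requested length is currently free or that nr_bits exceeds bits_per_word():

use sbitmap::Sbitmap;

let sb = Sbitmap::new(256, None, false);
let mut hint = 0;
let want = 4;

match sb.get_batch(want, &mut hint) {
    Some(start) => {
        // Bits start..start + want are ours and sit within a single word.
        // ... use the consecutive range ...
        sb.put_batch(start, want, &mut hint);
    }
    None => {
        // Either no run of `want` consecutive free bits exists right now,
        // or `want` is larger than bits_per_word(); retry later or fall
        // back to single-bit get() calls if contiguity is not required.
    }
}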

put_batch(&self, bitnr: usize, nr_bits: usize, hint: &mut usize)

Free nr_bits consecutive previously allocated bits starting from bitnr. This operation provides release barrier semantics, ensuring that all writes to data associated with these bits are visible before the bits are freed. Only supports nr_bits <= bits_per_word() to ensure all bits are within the same word.

The hint parameter is updated for better cache locality in subsequent allocations.

test_bit(&self, bitnr: usize) -> bool

Check if a bit is currently allocated.

weight(&self) -> usize

Count the number of currently allocated bits.

depth(&self) -> usize

Get the total number of bits in the bitmap.
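
A short, single-threaded sketch tying the three inspection helpers above together (under concurrent use, weight() is naturally only a point-in-time snapshot):

use sbitmap::Sbitmap;

let sb = Sbitmap::new(128, None, false);
let mut hint = 0;

let bit = sb.get(&mut hint).expect("fresh bitmap, allocation should succeed");
assert!(sb.test_bit(bit));     // the bit we just allocated is set
assert_eq!(sb.weight(), 1);    // exactly one bit is currently allocated
assert_eq!(sb.depth(), 128);   // total capacity is unchanged

sb.put(bit, &mut hint);
assert!(!sb.test_bit(bit));    // freed again
assert_eq!(sb.weight(), 0);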

Use Cases

  • Tag allocation: I/O tag allocation for block devices
  • Resource pools: Any scenario requiring efficient concurrent resource allocation
  • Lock-free data structures: Building block for concurrent algorithms
  • Batch resource allocation: Allocating multiple consecutive I/O tags, DMA buffers, or contiguous resource ranges
  • NUMA systems: the per-word cache-line separation yields especially large gains on multi-node NUMA machines

Performance Characteristics

  • Allocation: O(n) worst case, O(1) average with hints
  • Deallocation: O(1)
  • Batch allocation: O(n * nr_bits) worst case, finds consecutive bits within single word
  • Batch deallocation: O(1), atomic clear of consecutive bits
  • Memory overhead: ~56 bytes of padding per 64-bit word due to cache-line alignment (see the worked example below)
  • Thread safety: Lock-free with atomic operations
  • Scalability: Linear scaling with number of CPUs up to bitmap depth
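
The overhead figure above follows directly from the 64-byte alignment: each word stores its 64 bits of payload in 8 bytes and pads the rest of the cache line. A back-of-the-envelope sketch (the exact internal layout is the crate's own):

// depth = 1024 bits, shift = 6  =>  2^6 = 64 bits per word
let depth = 1024usize;
let bits_per_word = 1usize << 6;
let words = (depth + bits_per_word - 1) / bits_per_word; // 16 words
let total_bytes = words * 64;                            // 16 * 64 = 1024 bytes
let payload_bytes = words * 8;                           // 16 * 8  = 128 bytes of actual bits
let padding_bytes = total_bytes - payload_bytes;         // 896 bytes, ~56 per word
println!("{words} words, {total_bytes} bytes total, {padding_bytes} bytes padding");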

Performance Tuning

The shift parameter is crucial for tuning sbitmap performance based on your workload:

When to use a smaller shift:

  • High contention: When many threads are competing heavily for bit allocation and release, use a smaller shift to spread bits across more words and reduce contention on individual cache lines
  • NUMA systems: Machines with multiple NUMA nodes benefit significantly from smaller shift values, as this distributes memory accesses across more cache lines and reduces cross-node traffic
  • Many concurrent allocators: Systems with a high CPU count see better scalability with smaller shift values

Examples:

// High contention scenario (32-core NUMA system)
let sb = Sbitmap::new(1024, Some(4), false);  // 2^4 = 16 bits per word, 64 words

// Low contention scenario (4-core system)
let sb = Sbitmap::new(1024, Some(6), false);  // 2^6 = 64 bits per word, 16 words

// Let sbitmap decide (recommended starting point)
let sb = Sbitmap::new(1024, None, false);     // Auto-calculated based on depth

Trade-offs:

  • Smaller shift improves performance under contention but uses more memory (each word needs 64 bytes for cache-line alignment)
  • Larger shift reduces memory overhead but increases contention when many threads compete
  • The auto-calculated shift (when None) provides a balanced default suitable for most workloads

Memory Ordering

  • get(): Acquire semantics - ensures allocated bit is visible before use
  • put(): Release semantics - ensures all writes complete before bit is freed
  • get_batch(): Acquire semantics - ensures all allocated bits are visible before use
  • put_batch(): Release semantics - ensures all writes complete before bits are freed
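
As an illustration of what these guarantees mean in practice, the sketch below pairs each bit with a payload slot owned by whoever currently holds the bit. Only the Sbitmap calls come from this crate; the slot array and its Relaxed accesses are assumptions about how a caller might rely on the ordering: because put() releases and a later get() of the same bit acquires, the next owner observes the previous owner's write.

use sbitmap::Sbitmap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

let sb = Arc::new(Sbitmap::new(64, None, false));
// One payload slot per bit; a slot is only touched by the current owner
// of the corresponding bit.
let slots: Arc<Vec<AtomicU64>> =
    Arc::new((0..64).map(|_| AtomicU64::new(0)).collect());

let mut handles = vec![];
for id in 1..=4u64 {
    let sb = Arc::clone(&sb);
    let slots = Arc::clone(&slots);
    handles.push(thread::spawn(move || {
        let mut hint = 0;
        if let Some(bit) = sb.get(&mut hint) {
            // Relaxed is sufficient for the slot itself: put() below is a
            // release, and the next get() of this bit is an acquire, so
            // this store is visible to the bit's next owner.
            slots[bit].store(id, Ordering::Relaxed);
            sb.put(bit, &mut hint);
        }
    }));
}
for h in handles {
    h.join().unwrap();
}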

Comparison with Alternatives

Feature                 sbitmap   Mutex + BitVec   AtomicBitSet
Lock-free               ✓         ✗                ✓
Cache-optimized         ✓         ✗                ✗
Per-thread hints        ✓         ✗                ✗
Kernel-proven design    ✓         ✗                ✗

Benchmarks

To compare sbitmap performance against a simple lockless bitmap:

# Run with defaults (32 bits, auto shift, 10 seconds, N-1 tasks)
cargo run --bin bench_compare --release

# Specify bitmap depth and duration
cargo run --bin bench_compare --release -- --depth 1024 --time 5

# Specify bitmap depth, shift, and duration
cargo run --bin bench_compare --release -- --depth 512 --shift 5 --time 10

# Benchmark batch operations (allocating 4 consecutive bits)
cargo run --bin bench_compare --release -- --depth 128 --batch 4 --time 5

# Show help
cargo run --bin bench_compare --release -- --help

This benchmark:

  • Auto-detects available CPUs and spawns N-1 concurrent tasks
  • Measures operations per second (get + put pairs for single-bit mode, get_batch + put_batch pairs for batch mode)
  • Compares sbitmap vs a baseline lockless implementation (single-bit mode only)
  • Defaults: 32 bits, auto-calculated shift, 10 seconds, N-1 tasks (where N is total CPU count)

Options:

  • --depth DEPTH - Bitmap depth in bits (default: 32)
  • --shift SHIFT - log2(bits per word), auto-calculated if not specified
  • --time TIME - Benchmark duration in seconds (default: 10)
  • --tasks TASKS - Number of concurrent tasks (default: NUM_CPUS - 1)
  • --batch NR_BITS - Use get_batch/put_batch with NR_BITS (default: 1, single bit mode)
  • --round-robin - Enable round-robin allocation mode (default: disabled)

See benches/README.md for more details.

Example output on a 32-CPU system:

System: 32 CPUs detected, 2 NUMA nodes, using 31 tasks for benchmark
Bitmap depth: 32 bits
Shift: auto-calculated (bits per word: 8)
Duration: 10 seconds


=== Sbitmap (Optimized) Benchmark ===
Configuration:
  - Duration: 10s
  - Tasks: 31
  - Bitmap depth: 32 bits

Results:
  Task 0: 3101117 ops, 310111 ops/sec (0.3101 Mops/sec)
  ...
  Task 30: 3169582 ops, 316958 ops/sec (0.3170 Mops/sec)
  Total: 93604448 ops, 9360444 ops/sec (9.3604 Mops/sec)

=== SimpleBitmap (Baseline) Benchmark ===
Configuration:
  - Duration: 10s
  - Tasks: 31
  - Bitmap depth: 32 bits

Results:
  Task 0: 1998241 ops, 199824 ops/sec (0.1998 Mops/sec)
  ...
  Task 30: 1835360 ops, 183536 ops/sec (0.1835 Mops/sec)
  Total: 62530560 ops, 6253056 ops/sec (6.2531 Mops/sec)

License

Licensed under either of:

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Credits

Based on the Linux kernel's sbitmap implementation by Jens Axboe and other contributors.
