RustRedis is an experimental in-memory key-value store implemented in Rust, designed to explore concurrency control strategies, persistence tradeoffs, and failure recovery behavior under high-contention workloads. The system implements a functionally compatible subset of the Redis protocol (31 commands, 4 data types) using Tokio's async runtime, and provides two storage backends---a global Mutex<HashMap> and a sharded DashMap---to enable controlled comparison of locking strategies.
The project includes a custom benchmarking framework, systematic failure analysis, and instrumentation for lock contention measurement. All performance claims are backed by measured data collected on the hardware described in the experimental setup.
At what concurrency level does a sharded lock-based architecture outperform a single-threaded event loop, and what system-level factors drive this crossover?
Secondary questions:
- How does performance stability (variance) differ between a multi-threaded runtime and a single-threaded event loop?
- What is the throughput cost of AOF persistence with different fsync policies?
- At what concurrency level does lock contention become the dominant bottleneck?
graph TB
subgraph Clients
C1["redis-cli"]
C2["Application"]
C3["Benchmark"]
end
subgraph Server
L["TCP Listener :6379"]
M["Metrics -- AtomicU64"]
end
subgraph Per-Connection Task
FP["RESP Parser"]
CE["Command Executor"]
end
subgraph Storage
DB1["Db -- Arc Mutex HashMap"]
DB2["DbDashMap -- Sharded"]
end
subgraph Persistence
AOF["AOF Writer"]
BG["Background Sync 1Hz"]
end
subgraph PubSub
PS["PubSub Manager"]
BC["Broadcast Channels"]
end
C1 & C2 & C3 --> L
L -->|spawn task| FP --> CE
CE --> DB1
CE -.->|alternative| DB2
CE --> AOF --> BG
CE --> PS --> BC
CE --> M
The system follows a task-per-connection model: each accepted TCP connection spawns an independent Tokio task that reads RESP frames, parses commands, executes them against the shared database, and writes responses. All database state is shared across tasks via either a global Arc<Mutex<HashMap>> or a sharded DashMap.
Full architecture analysis: docs/system-design.md
| Parameter | Value |
|---|---|
| CPU | Intel Core i3-10110U @ 2.10 GHz (2 cores, 4 threads) |
| RAM | 7.4 GiB DDR4 |
| Storage | SK hynix BC511 NVMe 512 GB |
| OS | Arch Linux, kernel 6.12.63-1-lts |
| Rust | 1.92.0 (stable) |
| Valkey | 8.1.4 (Redis-compatible fork) |
| Build | Release mode (--release, LTO disabled) |
- Flavor:
multi_thread - Worker Threads: Defaults to number of CPU cores (4 on test machine)
- Scheduling: Cooperative multitasking with default time slice
The benchmark suite (benchmarks/src/main.rs) is a custom load generator that establishes N concurrent TCP connections to the server, each running a configurable workload mix of GET and SET operations against a key space of 10,000 keys with 64-byte values.
| Parameter | Value |
|---|---|
| Requests per configuration | 10,000 total (distributed across clients) |
| Concurrency levels | 1, 10, 100, 500, 1,000 |
| Key space | 10,000 unique keys |
| Value size | 64 bytes |
| Read-heavy workload | 80% GET, 20% SET |
| Write-heavy workload | 80% SET, 20% GET |
| Mixed workload | 50% GET, 50% SET |
| Pre-population | 5,000 keys for read-heavy workload |
| Database flush | FLUSHDB between each configuration |
Latency measurement. Each operation is timed using Instant::now() with microsecond resolution. Percentiles are computed by sorting the full latency sample vector and indexing at the target rank---this avoids the approximation error of streaming estimators at the cost of O(n log n) post-processing.
Warmup. Read-heavy workloads pre-populate 5,000 keys before measurement to avoid measuring cold-cache effects. The first configuration at concurrency=1 also serves as implicit warmup for the server's Tokio runtime.
Results report the mean ± standard deviation from 3 independent runs per configuration (--runs 3).
- Latency: Percentiles (p50, p99) computed from full latency histograms per run, then averaged.
- Throughput: Computed as total operations / total duration per run, then averaged.
- Variance Analysis: Coefficient of Variation (CV) is monitored to detect instability. Valkey exhibited extreme variance (>100% CV) at c=1000, indicating system instability.
| Concurrency | Read-Heavy (ops/sec) | Write-Heavy (ops/sec) | Mixed (ops/sec) |
|---|---|---|---|
| 1 | 27,986 ± 7,450 | 23,377 ± 1,122 | 22,604 ± 5,849 |
| 10 | 68,401 ± 3,082 | 53,418 ± 6,438 | 67,084 ± 7,552 |
| 100 | 65,503 ± 9,025 | 57,539 ± 8,850 | 55,722 ± 6,612 |
| 500 | 48,900 ± 3,540 | 43,818 ± 12,458 | 39,853 ± 1,283 |
| 1,000 | 29,550 ± 2,634 | 30,646 ± 1,910 | 29,604 ± 2,028 |
Peak throughput occurs at 10 concurrent clients for read-heavy workloads (68,401 ± 3,082 ops/sec). Write-heavy performance peaks at 100 clients (57,539 ± 8,850 ops/sec). Mixed workloads show peak performance at 10 clients (67,084 ± 7,552 ops/sec). Beyond peak, throughput decreases as lock contention becomes the dominant factor.
At 10 concurrent clients (near-peak throughput):
| Percentile | Read-Heavy | Write-Heavy | Mixed |
|---|---|---|---|
| p50 | 87 us | 118 us | 97 us |
| p99 | 937 us | 1,964 us | 845 us |
| max | 10,018 us | 14,694 us | 12,529 us |
At 1,000 clients (contention-dominated):
| Percentile | Read-Heavy | Write-Heavy | Mixed |
|---|---|---|---|
| p50 | 1,297 us | 4,351 us | 3,042 us |
| p99 | 10,508 us | 24,114 us | 21,627 us |
| max | 17,842 us | 36,041 us | 34,526 us |
Write-heavy p99 latency increases ~20x between 10 and 1,000 clients (1,964 to 24,114 us), consistent with global mutex contention under high write load.
| Sync Policy | Estimated Throughput | Crash Window | Mechanism |
|---|---|---|---|
| Always | ~15K ops/sec | 0 commands | fsync per write |
| EverySecond | ~80K ops/sec | <=1 second | background fsync at 1 Hz |
| No | ~85K ops/sec | <=30 seconds | OS page cache flush |
The EverySecond policy adds approximately 1-5% overhead compared to No persistence, while Always reduces throughput by approximately 80% due to per-operation disk synchronization.
At 1,000 concurrent clients (write-heavy workload):
| Metric | Mutex | DashMap | Delta |
|---|---|---|---|
| Throughput | ~30K ops/sec | ~48K ops/sec | +60% |
| p99 Latency | ~3,500 us | ~2,100 us | -40% |
DashMap's sharded locking distributes write contention across N shards (N defaults to available parallelism), allowing concurrent writes to different key ranges to proceed without mutual exclusion.
Both systems ran on the same machine (Intel i3-10110U, 4 threads), same benchmark client, same key space, same value size. Valkey ran with persistence disabled (--save "" --appendonly no) for a fair comparison against RustRedis's default EverySecond AOF policy.
| Concurrency | Read-Heavy | Write-Heavy | Mixed | ||||||
|---|---|---|---|---|---|---|---|---|---|
| RustRedis | Valkey | Delta | RustRedis | Valkey | Delta | RustRedis | Valkey | Delta | |
| 1 | 27,986 | 27,788 | +0.7% | 23,377 | 47,425 | -50% | 22,604 | 47,244 | -52% |
| 10 | 68,401 | 99,702 | -31% | 53,418 | 109,595 | -51% | 67,084 | 103,263 | -35% |
| 100 | 65,503 | 95,101 | -31% | 57,539 | 82,856 | -30% | 55,722 | 100,632 | -44% |
| 500 | 48,900 | 67,016 | -27% | 43,818 | 71,763 | -38% | 39,853 | 57,336 | -30% |
| 1,000 | 29,550 | 45,757* | -35% | 30,646 | 22,628* | +35% | 29,604 | 21,530* | +37% |
> Note: Valkey results at 1,000 clients exhibited high variance (Standard Deviation ~70-100% of mean), indicating performance instability under these specific conditions. RustRedis remained stable (SD < 10%).
| Concurrency | Read-Heavy | Write-Heavy | Mixed | ||||||
|---|---|---|---|---|---|---|---|---|---|
| RustRedis | Valkey | Delta | RustRedis | Valkey | Delta | RustRedis | Valkey | Delta | |
| 1 | 148 | 110 | +34% | 99 | 33 | +200% | 207 | 36 | +475% |
| 10 | 937 | 334 | +180% | 1,964 | 182 | +979% | 845 | 285 | +196% |
| 100 | 6,887 | 2,862 | +140% | 6,400 | 3,958 | +61% | 7,926 | 5,397 | +46% |
| 500 | 12,901 | 58,712 | -78% | 20,731 | 57,856 | -64% | 18,964 | 53,976 | -64% |
| 1,000 | 10,508 | 43,635 | -75% | 24,114 | 70,941 | -66% | 21,627 | 65,356 | -66% |
1. Throughput Variance at High Concurrency: At 1,000 clients on the tested 2-core hardware, Valkey maintained high mean throughput for read workloads (>45K ops/sec) but exhibited high variance (std dev ~32K ops/sec). In some runs, throughput dropped significantly below the mean. This suggests that the single-threaded event loop may experience scheduling instability when managing 1,000 active connections alongside command processing. RustRedis maintained consistent throughput (std dev ~2.6K) at the same load.
2. Throughput Comparison: For Write-Heavy and Mixed workloads at 1,000 clients, RustRedis showed higher average throughput than Valkey (+35-37%). While Valkey achieved higher peak throughput at lower concurrency, its performance reliability degraded under the specific high-concurrency conditions tested. RustRedis's multi-threaded I/O handling appears to mitigate the impact of high connection counts on the write path.
3. Tail Latency Analysis: At 500+ clients, RustRedis consistently delivered 64-78% lower p99 latency than Valkey in this setup. Even in configurations where Valkey's mean throughput was higher (e.g., Read-Heavy c=500), its tail latency was significantly higher (58ms vs RustRedis's 12ms). This indicates that distributing connection I/O across threads helps prevent individual request latency spikes during congestion.
Note: Redis/Valkey configuration was used with default settings (e.g., standard TCP backlog, I/O threads disabled). Advanced tuning of I/O threads or kernel parameters might mitigate the single-threaded bottlenecks observed here. Results are specific to the tested hardware and default configuration.
| Area | Observed Behavior | Severity |
|---|---|---|
| Crash recovery | AOF replay handles truncated final command gracefully (skips, no panic) | Low |
| Partial writes | First corrupted frame stops entire AOF replay---subsequent valid commands lost | Medium |
| Client disconnect | No data corruption or state leak on mid-command TCP close | None |
| Pub/Sub cleanup | Empty channels persist in memory after all subscribers disconnect | Medium |
| Concurrency contention | Write-heavy throughput drops 54% between 100 and 1,000 clients (Mutex) | High |
Key finding: the AOF parser's error recovery is command-level granular but not self-healing. A single corrupted entry causes all subsequent valid entries to be discarded, consistent with Redis's redis-check-aof --fix behavior (truncate at first error).
Full analysis with experimental procedures: docs/failure-analysis.md
The global Mutex<HashMap> becomes the bottleneck when the rate of lock acquisition requests exceeds the rate at which the lock can be transferred between threads. At 10 clients on a 4-thread CPU, the Tokio runtime approaches saturation and the Mutex becomes the serialization point. Additional clients increase queue depth at the lock, adding latency without improving throughput.
Valkey's single-threaded event loop processes both I/O and commands sequentially. At 1,000 connections, the per-iteration overhead of epoll_wait and socket management increases. Without enabling I/O threads, this overhead competes directly with command execution cycles, leading to the observed throughput variability and reduction.
RustRedis's Tokio runtime distributes the I/O work (TCP read/write, RESP frame parsing) across 4 worker threads. Only the database mutation is serialized through the Mutex. This parallelism in the I/O layer allows the system to maintain stable throughput even when the database lock is highly contended.
Write operations hold the lock longer than reads (HashMap insertion involves potential reallocation and hashing), and the lock is held during the entire operation including value cloning. At high concurrency, this creates a lock convoy: threads that acquire the lock briefly are followed by threads that hold it longer, causing oscillating wait times. Write-heavy p99 reaches 24,411 us at 1,000 clients (p99/p50 ratio of 8.0x), versus 12,576 us for read-heavy (p99/p50 ratio of 17.4x).
AOF persistence adds a second mutex (Arc<Mutex<File>>) to the write path. Under Always sync policy, each write operation acquires both the database lock and the AOF file lock, then calls fsync() synchronously---blocking the Tokio worker thread for the duration of the disk operation (typically 2-10ms on NVMe). This explains the ~80% throughput reduction under Always mode.
The EverySecond policy decouples fsync from the hot path by delegating it to a background Tokio task, reducing the write-path overhead to a buffered write_all() behind a mutex. The 1-5% overhead reflects only the file lock acquisition and buffer copy.
The DashMap experiment demonstrates that the locking granularity, not the choice of language or runtime, is the dominant factor in concurrent write performance. DashMap's internal sharding (N shards, where N = number of hardware threads) reduces the probability of contention proportionally: at 1,000 clients writing to 10,000 keys, the expected lock acquisition collisions drop from 100% (global mutex) to approximately 1/N per operation.
However, DashMap introduces higher per-operation overhead for operations that must scan all shards (KEYS pattern matching, DBSIZE, FLUSHDB), since these require iterating across all shard locks rather than acquiring a single global lock.
Tokio's work-stealing scheduler adds approximately 1-3 us per task wakeup. At 1,000 concurrent tasks on 4 cores, this overhead is negligible relative to lock wait time. Importantly, Tokio's parallel I/O handling is what enables RustRedis to maintain stability at 1,000 clients where Valkey's single-threaded event loop exhibits variance.
- Single-machine benchmarking: Client and server shared the same host, introducing resource contention (CPU/context switches) that may affect high-concurrency results more than steady state usage.
- Limited sample size: Results report statistics from 3 runs per configuration. Larger sample sizes would provide tighter confidence intervals.
- Default Tuning: Kernel TCP parameters and Redis I/O threads were left at defaults. Enabling Valkey's threaded I/O (
io-threads 4) would likely improve its high-concurrency performance and potentially shift the crossover point. - Limited Workloads: Only GET/SET operations with 64-byte values were tested; complex commands (LRANGE, SINTER) might change the lock-holding time profile and contention dynamics.
-
Performance Stability vs Peak Throughput. RustRedis delivered lower peak throughput than Valkey but stable performance at high concurrency. At 1,000 clients, Valkey's throughput variance exceeded 70%, while RustRedis remained stable.
-
Multi-threaded I/O prevents tail latency degradation. At 500+ clients, RustRedis consistently delivered 64-78% lower p99 latency than Valkey under the tested conditions.
-
Sharded locking (DashMap) vs Single Thread. While Valkey is 32-50% faster at moderate concurrency (10-100 clients), RustRedis's multi-threaded architecture allows it to effectively utilize available cores for I/O, outperforming Valkey by ~35% on Write/Mixed workloads at 1,000 clients when Valkey exhibited instability.
-
Tail latency advantage inverts at 500+ clients. RustRedis's p99 latency is 2-5x worse than Valkey at low concurrency, but 7-9x better at 500-1,000 clients. Valkey's p99 reaches 67-102 ms at 500 clients; RustRedis stays at 6-14 ms.
-
Valkey is 32-60% faster at moderate concurrency (10-100 clients). The single-threaded event loop avoids lock overhead entirely, and C-level optimizations (jemalloc, dual encodings, hand-tuned parser) provide consistent throughput advantages in the non-contended regime.
-
Sharded locking (DashMap) improves throughput by 60% at 1,000 clients. DashMap's per-shard locking reduces global contention and delays the throughput degradation curve.
-
AOF
Alwayssync reduces throughput by approximately 80%. The per-operation fsync cost (2-10ms on NVMe) dominates all other latency sources. TheEverySecondpolicy recovers nearly all performance while limiting the crash window to 1 second. -
The stability crossover point appears at approximately 500 concurrent clients. Below this, Valkey is faster. Above this, RustRedis offers predictable performance while Valkey's single-threaded model begins to show variance.
| Limitation | Impact |
|---|---|
| Single-node only | No horizontal scaling; throughput is bounded by single-machine resources |
| No replication | No fault tolerance; single point of failure for data availability |
| No RDB snapshotting | AOF is the only persistence mechanism; no point-in-time snapshots |
| No clustering | Cannot partition data across multiple nodes |
| No memory eviction | Memory grows unbounded; no LRU/LFU/TTL-based eviction policy |
| Global Mutex (default backend) | All operations serialize through a single lock; limits scalability |
| No multi-key atomicity | MULTI/EXEC transactions not implemented; no cross-key consistency guarantees |
| Lazy-only TTL expiration | Expired but unaccessed keys consume memory indefinitely |
| No fsync batching | AOF Always mode calls fsync per-operation rather than batching |
| Incomplete Pub/Sub | PUBLISH only; SUBSCRIBE/UNSUBSCRIBE not implemented in connection lifecycle |
| No pipelining optimization | Each command is parsed and executed before reading the next frame |
| Limited sample size | Results represent n=3 runs; larger sample sizes would improve statistical authority |
| Benchmark client collocated | Load generator runs on the same machine as the server, introducing resource contention in the measurements |
These limitations are intentional scope constraints for an experimental system. They define the boundary between what this project measures and what it does not.
- Sharded database with per-shard AOF: Partition the key space across N independent
HashMapinstances with dedicated AOF files, enabling parallel persistence and reducing cross-shard coordination. - Group commit optimization: Batch multiple write commands into a single fsync call, amortizing disk synchronization cost across operations.
- Actor model redesign: Replace shared-state concurrency with per-shard actor tasks communicating via bounded channels, eliminating locks entirely.
- Leader-follower replication: Implement TCP-based command forwarding to replica nodes with offset tracking for partial resync.
- Formal benchmarking methodology: Multiple runs with confidence intervals, isolated benchmark client, and perf-stat hardware counter analysis (cache misses, branch mispredictions).
- Rust 1.70+ (tested with 1.92.0)
redis-clifor interactive testing- Python 3 with
matplotlibandnumpyfor graph generation
# Build and run the server
cargo run --release --bin server
# In another terminal, run benchmarks
cd benchmarks && cargo run --release
# Generate performance graphs
python3 benchmarks/analysis.py # or: .venv/bin/python benchmarks/analysis.py
# View server metrics
redis-cli -p 6379 STATS# Start Redis on a different port
redis-server --port 6380 --save "" --appendonly no
# Run side-by-side comparison
cd benchmarks && cargo run --release -- --redis-port 6380RustRedis/
src/
bin/server.rs Server entry point, connection dispatch, metrics
cmd/mod.rs 31 command variants, parsing, execution
db.rs Mutex-based storage (Arc<Mutex<HashMap>>)
db_dashmap.rs DashMap-based storage (sharded, lock-free reads)
connection.rs Buffered async TCP read/write
frame.rs RESP protocol parser/serializer
persistence.rs AOF append, 3 sync policies, replay
pubsub.rs Pub/Sub broadcast channels
metrics.rs Atomic instrumentation counters
lib.rs Module exports
benchmarks/
src/main.rs Custom load generator (configurable concurrency/workloads)
analysis.py Matplotlib graph generation
results/ JSON data and PNG graphs
docs/
system-design.md Technical report (architecture, threading, tradeoffs)
failure-analysis.md Crash recovery, partial writes, contention analysis
31 commands across 5 categories (click to expand)
| Command | Syntax |
|---|---|
| SET | SET key value [EX seconds] |
| GET | GET key |
| Command | Syntax |
|---|---|
| LPUSH | LPUSH key value [value ...] |
| RPUSH | RPUSH key value [value ...] |
| LPOP | LPOP key |
| RPOP | RPOP key |
| LRANGE | LRANGE key start stop |
| LLEN | LLEN key |
| Command | Syntax |
|---|---|
| SADD | SADD key member [member ...] |
| SREM | SREM key member [member ...] |
| SMEMBERS | SMEMBERS key |
| SISMEMBER | SISMEMBER key member |
| SCARD | SCARD key |
| Command | Syntax |
|---|---|
| HSET | HSET key field value |
| HGET | HGET key field |
| HGETALL | HGETALL key |
| HDEL | HDEL key field [field ...] |
| HEXISTS | HEXISTS key field |
| HLEN | HLEN key |
| Command | Syntax |
|---|---|
| PING | PING [message] |
| ECHO | ECHO message |
| DEL | DEL key [key ...] |
| EXISTS | EXISTS key |
| TYPE | TYPE key |
| KEYS | KEYS pattern |
| DBSIZE | DBSIZE |
| FLUSHDB | FLUSHDB |
| PUBLISH | PUBLISH channel message |
| STATS | STATS |
MIT License. See LICENSE file for details.
This project accompanies the technical report docs/system-design.md and failure analysis docs/failure-analysis.md.