Labels: duroxide-pg (Issues reported by duroxide-pg-opt provider)
Problem
The test_multi_threaded_lock_expiration_recovery validation test in src/provider_validation/instance_locking.rs has a timing race condition that causes spurious failures when connection pool establishment has latency.
Current Test Logic
```rust
// Thread 1: Fetch and hold lock (don't ack)
let handle1 = tokio::spawn({
    async move {
        let (item, lock_token, _attempt_count) = provider1
            .fetch_orchestration_item(lock_timeout, Duration::ZERO) // <-- Connection latency here
            .await
            .unwrap()
            .unwrap();
        // Hold lock but don't ack
        tokio::time::sleep(lock_timeout + Duration::from_millis(200)).await;
        lock_token
    }
});

// Thread 2: Try to fetch immediately (should fail)
let handle2 = tokio::spawn({
    async move {
        tokio::time::sleep(Duration::from_millis(200)).await; // <-- Timer starts at spawn, not after thread 1 acquires lock
        // ...
    }
});

// Thread 3: Wait for expiration and then fetch (should succeed)
let handle3 = tokio::spawn({
    async move {
        tokio::time::sleep(lock_timeout + Duration::from_millis(100)).await; // <-- Timer starts at spawn
        // ...
    }
});
```

The Race Condition
All three threads are spawned and start their timers simultaneously at t=0. However:
- Thread 1 may have connection establishment latency (e.g., 50-200ms) before acquiring the lock
- Thread 2 wakes at `t=200ms` and may wake BEFORE thread 1 acquires the lock if connection latency > 200ms
- Thread 3 wakes at `t=lock_timeout+100ms`; if thread 1's lock wasn't acquired until `t=100ms` due to connection latency, it doesn't expire until `t=lock_timeout+100ms`, creating a race
This causes the test to be flaky on:
- Cold connection pools (first test run)
- Remote databases with higher latency
- Resource-constrained CI environments
Proposed Fix
Use a barrier or signal to ensure thread 2 and thread 3 timers start AFTER thread 1 has confirmed lock acquisition:
```rust
// Thread 1: Fetch and hold lock (don't ack)
// A watch channel is used instead of oneshot, since both thread 2 and
// thread 3 need to observe the signal (a oneshot has a single receiver).
let (lock_acquired_tx, lock_acquired_rx) = tokio::sync::watch::channel(false);
let handle1 = tokio::spawn({
    async move {
        let (item, lock_token, _attempt_count) = provider1
            .fetch_orchestration_item(lock_timeout, Duration::ZERO)
            .await
            .unwrap()
            .unwrap();
        // Signal that the lock is now held
        let _ = lock_acquired_tx.send(true);
        // Hold lock but don't ack
        tokio::time::sleep(lock_timeout + Duration::from_millis(200)).await;
        lock_token
    }
});

// Thread 2: Try to fetch shortly after thread 1 acquires the lock (should fail)
let mut lock_acquired_rx2 = lock_acquired_rx.clone();
let handle2 = tokio::spawn({
    async move {
        let _ = lock_acquired_rx2.wait_for(|held| *held).await; // Wait for thread 1 to acquire lock
        tokio::time::sleep(Duration::from_millis(100)).await; // Then wait a bit
        // ...
    }
});

// Thread 3: Wait for expiration after thread 1 acquired the lock (should succeed)
let mut lock_acquired_rx3 = lock_acquired_rx.clone();
let handle3 = tokio::spawn({
    async move {
        let _ = lock_acquired_rx3.wait_for(|held| *held).await; // Wait for thread 1 to acquire lock
        tokio::time::sleep(lock_timeout + Duration::from_millis(100)).await; // Then wait for expiration
        // ...
    }
});
```

Alternatively, add more margin (e.g., 500ms instead of 100ms for thread 3) to account for typical connection latencies.
Impact
Affects any provider implementation that:
- Uses a connection pool with cold-start latency
- Is tested on remote databases
- Runs in resource-constrained environments (CI)