
fix: test_multi_threaded_lock_expiration_recovery timing race with connection latency #32

@affandar

Description


Problem

The test_multi_threaded_lock_expiration_recovery validation test in src/provider_validation/instance_locking.rs has a timing race that causes spurious failures whenever establishing a connection from the pool adds latency.

Current Test Logic

// Thread 1: Fetch and hold lock (don't ack)
let handle1 = tokio::spawn({
    async move {
        let (item, lock_token, _attempt_count) = provider1
            .fetch_orchestration_item(lock_timeout, Duration::ZERO)  // <-- Connection latency here
            .await
            .unwrap()
            .unwrap();
        // Hold lock but don't ack
        tokio::time::sleep(lock_timeout + Duration::from_millis(200)).await;
        lock_token
    }
});

// Thread 2: Try to fetch immediately (should fail)
let handle2 = tokio::spawn({
    async move {
        tokio::time::sleep(Duration::from_millis(200)).await;  // <-- Timer starts at spawn, not after thread 1 acquires lock
        // ...
    }
});

// Thread 3: Wait for expiration and then fetch (should succeed)
let handle3 = tokio::spawn({
    async move {
        tokio::time::sleep(lock_timeout + Duration::from_millis(100)).await;  // <-- Timer starts at spawn
        // ...
    }
});

The Race Condition

All three threads are spawned and start their timers simultaneously at t=0. However:

  1. Thread 1 may have connection establishment latency (e.g., 50-200ms) before acquiring the lock
  2. Thread 2 wakes at t=200ms - may wake BEFORE thread 1 acquires the lock if connection latency > 200ms
  3. Thread 3 wakes at t=lock_timeout+100ms - if connection latency delayed thread 1's lock acquisition until t=100ms, the lock does not expire until t=lock_timeout+100ms, exactly when thread 3 wakes, so its fetch races the expiration

This causes the test to be flaky on:

  • Cold connection pools (first test run)
  • Remote databases with higher latency
  • Resource-constrained CI environments

Proposed Fix

Use a barrier or signal to ensure thread 2 and thread 3 timers start AFTER thread 1 has confirmed lock acquisition:

// Thread 1: Fetch and hold lock (don't ack)
let (lock_acquired_tx, lock_acquired_rx) = tokio::sync::watch::channel(false);
let handle1 = tokio::spawn({
    async move {
        let (item, lock_token, _attempt_count) = provider1
            .fetch_orchestration_item(lock_timeout, Duration::ZERO)
            .await
            .unwrap()
            .unwrap();
        // Signal that the lock is now held
        let _ = lock_acquired_tx.send(true);
        // Hold lock but don't ack
        tokio::time::sleep(lock_timeout + Duration::from_millis(200)).await;
        lock_token
    }
});

// Thread 2: Try to fetch shortly after thread 1 acquires lock (should fail)
// A watch channel supports multiple receivers, unlike oneshot.
let lock_acquired_rx2 = lock_acquired_rx.clone();
let handle2 = tokio::spawn({
    async move {
        let mut rx = lock_acquired_rx2;
        rx.changed().await.ok();  // Wait for thread 1 to acquire lock
        tokio::time::sleep(Duration::from_millis(100)).await;  // Then wait a bit
        // ...
    }
});

// Thread 3: Wait for expiration after thread 1 acquired lock (should succeed)
let lock_acquired_rx3 = lock_acquired_rx;
let handle3 = tokio::spawn({
    async move {
        let mut rx = lock_acquired_rx3;
        rx.changed().await.ok();  // Wait for thread 1 to acquire lock
        tokio::time::sleep(lock_timeout + Duration::from_millis(100)).await;  // Then wait for expiration
        // ...
    }
});

Alternatively, add more margin (e.g., 500ms instead of 100ms for thread 3) to account for typical connection latencies.

Impact

Affects any provider implementation that:

  • Uses a connection pool with cold-start latency
  • Is tested on remote databases
  • Runs in resource-constrained environments (CI)

Metadata

Labels: duroxide-pg (Issues reported by duroxide-pg-opt provider)
