Extend workspace lock to cover check-then-start race in get_or_start_workspace (BT-970)#1003
Conversation
Extend workspace lock to cover check-then-start race in get_or_start_workspace (BT-970)

`create_workspace` acquired a filesystem lock only for the workspace creation step, leaving the check-is-running + start-if-not sequence unguarded. Two concurrent CLI invocations could both observe "not running" and both attempt `start_detached_node`, causing the second call to fail with a duplicate node name at EPMD.

Extract `create_workspace_impl` (inner logic, no lock) and refactor `get_or_start_workspace` to acquire the lock once, covering the full sequence: create-if-absent → check-is-running → start-if-not. The second caller now blocks on the lock, then discovers the node already running and returns it without starting a second instance.

Add `test_concurrent_get_or_start_workspace_integration` verifying that two concurrent calls produce exactly one `started=true` and one `started=false`, both returning the same PID/port.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rphan

If an assertion panicked, the guard wouldn't have been created yet, leaving the BEAM node running. Move it immediately after extracting results so it runs on panic unwind.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 Walkthrough

The changes implement thread-safe workspace lifecycle management by introducing exclusive locking around creation and startup operations. A private helper function encapsulates core creation logic, while public entry points acquire locks before delegating, to prevent concurrent access issues. Integration tests verify that concurrent invocations coordinate correctly.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 3 passed
🧹 Nitpick comments (1)
crates/beamtalk-cli/src/commands/workspace/mod.rs (1)
Lines 1234-1250: Create cleanup guards before success assertions.

If a thread returns `Err` after the peer already started a node, the assertion at line 1237 can fail before `NodeGuard` is created at line 1247, leaving a live node behind.

Proposed adjustment:
```diff
     let results: Vec<_> = handles.into_iter().map(|h| h.join().unwrap()).collect();

+    // Safety net: register cleanup for any node that may have started,
+    // even if assertions below fail.
+    let _guards: Vec<NodeGuard> = results
+        .iter()
+        .filter_map(|r| r.as_ref().ok())
+        .map(|(info, _, _)| NodeGuard { pid: info.pid })
+        .collect();
+
     // Both calls must succeed
     for result in &results {
         assert!(
             result.is_ok(),
             "get_or_start_workspace should succeed, got: {:?}",
             result.as_ref().err()
         );
     }

     let infos: Vec<_> = results.into_iter().map(|r| r.unwrap()).collect();
-
-    // Safety net: ensure the node is killed even if an assertion below panics.
-    let _guard = NodeGuard {
-        pid: infos[0].0.pid,
-    };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/beamtalk-cli/src/commands/workspace/mod.rs` around lines 1234 - 1250, The test currently asserts all thread results are Ok before creating the NodeGuard, which can leak a started node if a later assertion panics; change the flow in the block that handles handles->results so that after joining threads (handles.into_iter().map(|h| h.join().unwrap()).collect()) you first scan results and create cleanup guards (e.g., collect NodeGuard instances or PIDs for every Ok(result) returned by get_or_start_workspace) so any started node will be killed even if later assertions fail, and only then perform the assert! loop and unwrap into infos for further use.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- crates/beamtalk-cli/src/commands/workspace/lifecycle.rs
- crates/beamtalk-cli/src/commands/workspace/mod.rs
Summary
Fixes a race condition where two concurrent `beamtalk repl` invocations could both observe "node not running" and both attempt `start_detached_node`, causing the second call to fail when EPMD rejects the duplicate node name.

Linear issue: https://linear.app/beamtalk/issue/BT-970
Changes

- Extracted `create_workspace_impl` (inner logic without lock acquisition) from `create_workspace`
- `get_or_start_workspace` now acquires the workspace lock before the full check-is-running + start-if-not sequence, not just during workspace creation
- The lock is held via a `_lock` drop guard, ensuring release on all paths including errors
- Added `test_concurrent_get_or_start_workspace_integration`, verifying two concurrent calls produce exactly one `started=true` and one `started=false`, both returning the same PID/port

Test plan

- `just ci` passes

Summary by CodeRabbit
Release Notes

Bug Fixes

- Resolved a race where concurrent CLI invocations could both attempt to start the same workspace node, causing the second to fail.

Tests

- Added an integration test covering concurrent workspace startup.