
Extend workspace lock to cover check-then-start race in get_or_start_workspace (BT-970)#1003

Merged
jamesc merged 3 commits into main from worktree-BT-970 on Feb 28, 2026

Conversation

@jamesc (Owner) commented Feb 28, 2026

Summary

Fixes a race condition where two concurrent `beamtalk repl` invocations could both observe "node not running" and both attempt `start_detached_node`, causing the second call to fail when EPMD rejects the duplicate node name.

Linear issue: https://linear.app/beamtalk/issue/BT-970

Changes

  • Extract create_workspace_impl (inner logic without lock acquisition) from create_workspace
  • get_or_start_workspace now acquires the workspace lock before the full check-is-running + start-if-not sequence, not just during workspace creation
  • Second concurrent caller blocks on the lock, then discovers the already-running node and returns it
  • Lock uses RAII (_lock drop guard) ensuring release on all paths including errors
  • Add test_concurrent_get_or_start_workspace_integration verifying two concurrent calls produce exactly one started=true and one started=false, both returning the same PID/port
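The locking order can be sketched as follows. This is a minimal in-process model, not the actual implementation: the real code takes a filesystem lock and consults EPMD, so the `NODE` static, the stub PID/port values, and the function body here are illustrative stand-ins:

```rust
use std::sync::Mutex;

// Stand-in for "is a node registered?": the real code checks on-disk
// state and EPMD; a process-wide Mutex<Option<...>> models the same thing.
static NODE: Mutex<Option<(u32, u16)>> = Mutex::new(None);

/// Returns (pid, port, started). `started` is true only for the one
/// caller that actually launched the node.
fn get_or_start_workspace() -> (u32, u16, bool) {
    // The lock now covers the whole create-if-absent -> check-is-running
    // -> start-if-not sequence. The guard is RAII: dropped, and the lock
    // released, on every return path, including errors.
    let mut node = NODE.lock().unwrap();
    if let Some((pid, port)) = *node {
        // A second concurrent caller blocks on the lock above, then
        // lands here and returns the already-running node.
        return (pid, port, false);
    }
    let (pid, port) = (4242, 9000); // stand-in for start_detached_node()
    *node = Some((pid, port));
    (pid, port, true)
}

fn main() {
    let first = get_or_start_workspace();
    let second = get_or_start_workspace();
    assert!(first.2 && !second.2); // only the first call starts the node
    assert_eq!((first.0, first.1), (second.0, second.1)); // same pid/port
}
```

The key point is that the check and the start sit under one guard, so no caller can observe "not running" after another caller has passed the check but before it has started the node.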

Test plan

  • All existing workspace integration tests pass (11/11)
  • New concurrent integration test validates the race fix
  • Full CI passes (just ci)

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Fixed a race condition in workspace lifecycle management that could cause duplicate node initialization when multiple operations targeted the same workspace concurrently.
  • Tests
    • Added regression tests to validate proper concurrent workspace operation handling.

jamesc and others added 3 commits February 28, 2026 13:50
…tart_workspace (BT-970)

`create_workspace` acquired a filesystem lock only for the workspace
creation step, leaving the check-is-running + start-if-not sequence
unguarded. Two concurrent CLI invocations could both observe "not
running" and both attempt `start_detached_node`, causing the second
call to fail with a duplicate node name at EPMD.

Extract `create_workspace_impl` (inner logic, no lock) and refactor
`get_or_start_workspace` to acquire the lock once, covering the full
sequence: create-if-absent → check-is-running → start-if-not. The
second caller now blocks on the lock, then discovers the node already
running and returns it without starting a second instance.

Add `test_concurrent_get_or_start_workspace_integration` verifying that
two concurrent calls produce exactly one started=true and one
started=false, both returning the same PID/port.
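The shape of that test can be sketched with standard-library stand-ins (a `Barrier` for synchronization and a `Mutex<Option<...>>` in place of the real workspace state; the `get_or_start_workspace` stub and its values are hypothetical, not the actual test code):

```rust
use std::sync::{Barrier, Mutex};
use std::thread;

static NODE: Mutex<Option<(u32, u16)>> = Mutex::new(None);

// Stand-in for the real get_or_start_workspace: (pid, port, started).
fn get_or_start_workspace() -> (u32, u16, bool) {
    let mut node = NODE.lock().unwrap();
    if let Some((pid, port)) = *node {
        return (pid, port, false);
    }
    *node = Some((4242, 9000));
    (4242, 9000, true)
}

fn main() {
    let barrier = Barrier::new(2);
    let results: Vec<(u32, u16, bool)> = thread::scope(|s| {
        let handles: Vec<_> = (0..2)
            .map(|_| {
                s.spawn(|| {
                    barrier.wait(); // both threads race from the same point
                    get_or_start_workspace()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Exactly one caller started the node; both see the same pid/port.
    let started = results.iter().filter(|r| r.2).count();
    assert_eq!(started, 1);
    assert_eq!((results[0].0, results[0].1), (results[1].0, results[1].1));
}
```

Scoped threads let both closures borrow the barrier directly; the barrier guarantees the two calls genuinely overlap rather than running one after the other by accident.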

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rphan

If an assertion panicked, the guard wouldn't have been created yet,
leaving the BEAM node running. Move it immediately after extracting
results so it runs on panic unwind.
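The fix is the standard Drop-guard pattern. A self-contained sketch of why it works (the real `NodeGuard` kills a detached BEAM process by PID; here an `AtomicBool` stands in for the kill so the unwind behaviour is observable):

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for "the node was killed"; the real Drop impl signals the
// detached BEAM process by PID.
static KILLED: AtomicBool = AtomicBool::new(false);

struct NodeGuard {
    pid: u32,
}

impl Drop for NodeGuard {
    fn drop(&mut self) {
        let _ = self.pid; // real code: kill the process with this PID
        KILLED.store(true, Ordering::SeqCst);
    }
}

fn run_test() {
    // Create the guard immediately after extracting results, *before*
    // any assertion that might panic...
    let _guard = NodeGuard { pid: 4242 };
    // ...so when a later assertion fails, unwinding still runs Drop.
    assert_eq!(1 + 1, 3, "deliberate assertion failure");
}

fn main() {
    panic::set_hook(Box::new(|_| {})); // silence the expected panic output
    let outcome = panic::catch_unwind(run_test);
    assert!(outcome.is_err()); // the assertion did panic...
    assert!(KILLED.load(Ordering::SeqCst)); // ...and cleanup still ran
}
```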

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai bot commented Feb 28, 2026

📝 Walkthrough

The changes implement thread-safe workspace lifecycle management by introducing exclusive locking around creation and startup operations. A private helper function encapsulates core creation logic, while public entry points acquire locks before delegation to prevent concurrent access issues. Integration tests verify concurrent invocations coordinate correctly.

Changes

  • Workspace Lifecycle Locking (crates/beamtalk-cli/src/commands/workspace/lifecycle.rs): Refactors workspace creation into a private create_workspace_impl() helper that encapsulates the core logic without locking. create_workspace() and get_or_start_workspace() now acquire the exclusive lock before delegating to the helper, preventing TOCTOU races in both the creation and startup flows. Also stores workspace_id as String and adds documentation on the locking strategy.
  • Concurrent Workspace Tests (crates/beamtalk-cli/src/commands/workspace/mod.rs): Adds a BT-970 regression test verifying that concurrent get_or_start_workspace calls on the same workspace ID do not spawn duplicate nodes. The test uses a thread barrier for synchronization and validates that both calls succeed, exactly one node starts, and both callers receive identical workspace details.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 3 passed
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed. The title directly and accurately describes the main change: extending the workspace lock to cover the check-then-start race in get_or_start_workspace, with the issue reference (BT-970).
  • Docstring Coverage: ✅ Passed. Coverage is 100.00%, above the required 80.00% threshold.



coderabbitai bot left a comment

🧹 Nitpick comments (1)
crates/beamtalk-cli/src/commands/workspace/mod.rs (1)

1234-1250: Create cleanup guards before success assertions.

If a thread returns Err after the peer already started a node, the assertion at Line 1237 can fail before NodeGuard is created at Line 1247, leaving a live node behind.

Proposed adjustment
```diff
         let results: Vec<_> = handles.into_iter().map(|h| h.join().unwrap()).collect();

+        // Safety net: register cleanup for any node that may have started,
+        // even if assertions below fail.
+        let _guards: Vec<NodeGuard> = results
+            .iter()
+            .filter_map(|r| r.as_ref().ok())
+            .map(|(info, _, _)| NodeGuard { pid: info.pid })
+            .collect();
+
         // Both calls must succeed
         for result in &results {
             assert!(
                 result.is_ok(),
                 "get_or_start_workspace should succeed, got: {:?}",
                 result.as_ref().err()
             );
         }

         let infos: Vec<_> = results.into_iter().map(|r| r.unwrap()).collect();
-
-        // Safety net: ensure the node is killed even if an assertion below panics.
-        let _guard = NodeGuard {
-            pid: infos[0].0.pid,
-        };
```


📥 Commits

Reviewing files that changed from the base of the PR and between 6a56e73 and a798f8d.

📒 Files selected for processing (2)
  • crates/beamtalk-cli/src/commands/workspace/lifecycle.rs
  • crates/beamtalk-cli/src/commands/workspace/mod.rs

jamesc merged commit 70f0c0c into main Feb 28, 2026
5 checks passed
jamesc deleted the worktree-BT-970 branch February 28, 2026 14:21