
Extend workspace lock to cover check-then-start race in get_or_start_workspace (BT-970)#1003

Merged
jamesc merged 3 commits into main from worktree-BT-970 on Feb 28, 2026

Conversation

@jamesc (Owner) commented Feb 28, 2026

Summary

Fixes a race condition where two concurrent `beamtalk repl` invocations could both observe "node not running" and both attempt `start_detached_node`, causing the second call to fail when EPMD rejects the duplicate node name.

Linear issue: https://linear.app/beamtalk/issue/BT-970

Changes

  • Extract create_workspace_impl (inner logic without lock acquisition) from create_workspace
  • get_or_start_workspace now acquires the workspace lock before the full check-is-running + start-if-not sequence, not just during workspace creation
  • Second concurrent caller blocks on the lock, then discovers the already-running node and returns it
  • Lock uses RAII (_lock drop guard) ensuring release on all paths including errors
  • Add test_concurrent_get_or_start_workspace_integration verifying two concurrent calls produce exactly one started=true and one started=false, both returning the same PID/port
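The locking order can be sketched as follows. This is a minimal in-process model, not the actual implementation: the real code takes a filesystem lock and consults EPMD, so the `NODE` static, the stub PID/port values, and the function body here are illustrative stand-ins:

```rust
use std::sync::Mutex;

// Stand-in for "is a node registered?": the real code checks on-disk
// state and EPMD; a process-wide Mutex<Option<...>> models the same thing.
static NODE: Mutex<Option<(u32, u16)>> = Mutex::new(None);

/// Returns (pid, port, started). `started` is true only for the one
/// caller that actually launched the node.
fn get_or_start_workspace() -> (u32, u16, bool) {
    // The lock now covers the whole create-if-absent -> check-is-running
    // -> start-if-not sequence. The guard is RAII: dropped, and the lock
    // released, on every return path, including errors.
    let mut node = NODE.lock().unwrap();
    if let Some((pid, port)) = *node {
        // A second concurrent caller blocks on the lock above, then
        // lands here and returns the already-running node.
        return (pid, port, false);
    }
    let (pid, port) = (4242, 9000); // stand-in for start_detached_node()
    *node = Some((pid, port));
    (pid, port, true)
}

fn main() {
    let first = get_or_start_workspace();
    let second = get_or_start_workspace();
    assert!(first.2 && !second.2); // only the first call starts the node
    assert_eq!((first.0, first.1), (second.0, second.1)); // same pid/port
}
```

The key point is that the check and the start sit under one guard, so no caller can observe "not running" after another caller has passed the check but before it has started the node.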

Test plan

  • All existing workspace integration tests pass (11/11)
  • New concurrent integration test validates the race fix
  • Full CI passes (just ci)

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Fixed a race condition in workspace lifecycle management that could cause duplicate node initialization when multiple operations targeted the same workspace concurrently.
  • Tests
    • Added regression tests to validate proper concurrent workspace operation handling.

jamesc and others added 3 commits February 28, 2026 13:50
…tart_workspace (BT-970)

`create_workspace` acquired a filesystem lock only for the workspace
creation step, leaving the check-is-running + start-if-not sequence
unguarded. Two concurrent CLI invocations could both observe "not
running" and both attempt `start_detached_node`, causing the second
call to fail with a duplicate node name at EPMD.

Extract `create_workspace_impl` (inner logic, no lock) and refactor
`get_or_start_workspace` to acquire the lock once, covering the full
sequence: create-if-absent → check-is-running → start-if-not. The
second caller now blocks on the lock, then discovers the node already
running and returns it without starting a second instance.

Add `test_concurrent_get_or_start_workspace_integration` verifying that
two concurrent calls produce exactly one started=true and one
started=false, both returning the same PID/port.
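The shape of that test can be sketched with standard-library stand-ins (a `Barrier` for synchronization and a `Mutex<Option<...>>` in place of the real workspace state; the `get_or_start_workspace` stub and its values are hypothetical, not the actual test code):

```rust
use std::sync::{Barrier, Mutex};
use std::thread;

static NODE: Mutex<Option<(u32, u16)>> = Mutex::new(None);

// Stand-in for the real get_or_start_workspace: (pid, port, started).
fn get_or_start_workspace() -> (u32, u16, bool) {
    let mut node = NODE.lock().unwrap();
    if let Some((pid, port)) = *node {
        return (pid, port, false);
    }
    *node = Some((4242, 9000));
    (4242, 9000, true)
}

fn main() {
    let barrier = Barrier::new(2);
    let results: Vec<(u32, u16, bool)> = thread::scope(|s| {
        let handles: Vec<_> = (0..2)
            .map(|_| {
                s.spawn(|| {
                    barrier.wait(); // both threads race from the same point
                    get_or_start_workspace()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Exactly one caller started the node; both see the same pid/port.
    let started = results.iter().filter(|r| r.2).count();
    assert_eq!(started, 1);
    assert_eq!((results[0].0, results[0].1), (results[1].0, results[1].1));
}
```

Scoped threads let both closures borrow the barrier directly; the barrier guarantees the two calls genuinely overlap rather than running one after the other by accident.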

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rphan

If an assertion panicked, the guard wouldn't have been created yet,
leaving the BEAM node running. Move it immediately after extracting
results so it runs on panic unwind.
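The fix is the standard Drop-guard pattern. A self-contained sketch of why it works (the real `NodeGuard` kills a detached BEAM process by PID; here an `AtomicBool` stands in for the kill so the unwind behaviour is observable):

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for "the node was killed"; the real Drop impl signals the
// detached BEAM process by PID.
static KILLED: AtomicBool = AtomicBool::new(false);

struct NodeGuard {
    pid: u32,
}

impl Drop for NodeGuard {
    fn drop(&mut self) {
        let _ = self.pid; // real code: kill the process with this PID
        KILLED.store(true, Ordering::SeqCst);
    }
}

fn run_test() {
    // Create the guard immediately after extracting results, *before*
    // any assertion that might panic...
    let _guard = NodeGuard { pid: 4242 };
    // ...so when a later assertion fails, unwinding still runs Drop.
    assert_eq!(1 + 1, 3, "deliberate assertion failure");
}

fn main() {
    panic::set_hook(Box::new(|_| {})); // silence the expected panic output
    let outcome = panic::catch_unwind(run_test);
    assert!(outcome.is_err()); // the assertion did panic...
    assert!(KILLED.load(Ordering::SeqCst)); // ...and cleanup still ran
}
```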

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai bot commented Feb 28, 2026

📝 Walkthrough

The changes implement thread-safe workspace lifecycle management by introducing exclusive locking around creation and startup operations. A private helper function encapsulates core creation logic, while public entry points acquire locks before delegation to prevent concurrent access issues. Integration tests verify concurrent invocations coordinate correctly.

Changes

  • Workspace Lifecycle Locking (crates/beamtalk-cli/src/commands/workspace/lifecycle.rs): Refactors workspace creation into a private create_workspace_impl() helper that encapsulates the core logic without locking. create_workspace() and get_or_start_workspace() now acquire the exclusive lock before delegating to the helper, preventing TOCTOU races in both the creation and startup flows. Also stores workspace_id as String and adds documentation on the locking strategy.
  • Concurrent Workspace Tests (crates/beamtalk-cli/src/commands/workspace/mod.rs): Adds a BT-970 regression test verifying that concurrent get_or_start_workspace calls on the same workspace ID do not spawn duplicate nodes. The test uses a thread barrier for synchronization and validates that both calls succeed, exactly one node starts, and both callers receive identical workspace details.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 3 passed
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed. The title directly and accurately describes the main change: extending the workspace lock to cover the check-then-start race in get_or_start_workspace, with the issue reference (BT-970).
  • Docstring Coverage: ✅ Passed. Coverage is 100.00%, above the required 80.00% threshold.



coderabbitai bot left a comment

🧹 Nitpick comments (1)
crates/beamtalk-cli/src/commands/workspace/mod.rs (1)

1234-1250: Create cleanup guards before success assertions.

If a thread returns Err after the peer already started a node, the assertion at Line 1237 can fail before NodeGuard is created at Line 1247, leaving a live node behind.

Proposed adjustment
```diff
         let results: Vec<_> = handles.into_iter().map(|h| h.join().unwrap()).collect();

+        // Safety net: register cleanup for any node that may have started,
+        // even if assertions below fail.
+        let _guards: Vec<NodeGuard> = results
+            .iter()
+            .filter_map(|r| r.as_ref().ok())
+            .map(|(info, _, _)| NodeGuard { pid: info.pid })
+            .collect();
+
         // Both calls must succeed
         for result in &results {
             assert!(
                 result.is_ok(),
                 "get_or_start_workspace should succeed, got: {:?}",
                 result.as_ref().err()
             );
         }

         let infos: Vec<_> = results.into_iter().map(|r| r.unwrap()).collect();
-
-        // Safety net: ensure the node is killed even if an assertion below panics.
-        let _guard = NodeGuard {
-            pid: infos[0].0.pid,
-        };
```


📥 Commits

Reviewing files that changed from the base of the PR and between 6a56e73 and a798f8d.

📒 Files selected for processing (2)
  • crates/beamtalk-cli/src/commands/workspace/lifecycle.rs
  • crates/beamtalk-cli/src/commands/workspace/mod.rs

jamesc merged commit 70f0c0c into main Feb 28, 2026
5 checks passed
jamesc deleted the worktree-BT-970 branch February 28, 2026 14:21