termio: add zmx backend for persistent terminal sessions #2

Open
mdsakalu wants to merge 4 commits into manaflow-ai:main from mdsakalu:zmx-termio-backend

Conversation

@mdsakalu mdsakalu commented Feb 26, 2026

Summary

This PR adds a new zmx termio backend to Ghostty and wires it through config + embedded APIs so surfaces can attach to persistent zmx sessions instead of always spawning a direct PTY subprocess.

Why

cmux persistence work (manaflow-ai/cmux#143, manaflow-ai/cmux#480) needs terminal processes to outlive the app. The existing exec backend ties process lifetime to the Ghostty surface, so restart always loses shell state. zmx provides a daemon-owned PTY/session model that can survive app restarts and reconnect cleanly.

What Changed

1) New zmx backend in termio

  • Added src/termio/Zmx.zig.
  • Added zmx variant to src/termio/backend.zig unions/dispatch.
  • Exported backend in src/termio.zig.

Backend behavior:

  • Connects to zmx over Unix domain socket.
  • Supports session creation (zmx run <session>) when configured.
  • Uses framed IPC for input/output/resize.
  • Sends Init + immediate Resize on attach.
  • On unexpected socket disconnect, emits .child_exited metadata so surface UX remains coherent.
  • On planned shutdown, sends Detach and exits cleanly without false error reporting.
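
As a rough illustration of the framed IPC mentioned above, the sketch below builds one contiguous frame from a tag and a payload. It is not the layout used in Zmx.zig: the 1-byte tag, the 4-byte little-endian length, and the names `Tag`, `header_size`, and `encodeFrame` are assumptions made for this example.

```zig
const std = @import("std");

// Hypothetical message tags; the real zmx protocol values may differ.
const Tag = enum(u8) { input = 0, output = 1, resize = 2, detach = 3 };

// Assumed header layout: 1 tag byte followed by a 4-byte little-endian length.
const header_size: usize = 5;

/// Builds one contiguous frame (header + payload). Caller owns the result.
fn encodeFrame(alloc: std.mem.Allocator, tag: Tag, payload: []const u8) ![]u8 {
    // Reject payloads whose length does not fit the 32-bit length field.
    const len = std.math.cast(u32, payload.len) orelse return error.PayloadTooLarge;
    const frame = try alloc.alloc(u8, header_size + payload.len);
    frame[0] = @intFromEnum(tag);
    std.mem.writeInt(u32, frame[1..5], len, .little);
    @memcpy(frame[header_size..], payload);
    return frame;
}
```

Emitting header and payload as a single buffer means a frame can be handed to a writer in one piece, which keeps the header from being separated from its payload.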

2) Surface/backend selection and fallback

  • Updated src/Surface.zig to choose zmx backend when zmx-session is set.
  • If the zmx binary is unavailable, falls back to exec with a warning (no crash, no hard failure).

3) Config + embedded API surface

  • Added config keys in src/config/Config.zig:
    • zmx-session
    • zmx-create
  • Added per-surface embedded options in src/apprt/embedded.zig and synced C header:
    • zmx_session
    • zmx_create
    • zmx_mode
  • Updated include/ghostty.h to keep C API struct in sync.

zmx_mode is used for split/tab inheritance in embedded mode: new surfaces created from a zmx-backed surface stay in zmx mode, each with a fresh session.
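
The inheritance rule can be sketched roughly as below. This is an illustration under assumptions, not the code in embedded.zig: `resolveSession` is a hypothetical helper, and the `cmux-{x}` name format is taken from the review discussion later in this thread. An empty explicit name is treated as unset so that an inherited surface still gets a fresh session.

```zig
const std = @import("std");

/// Picks the session name for a new surface (hypothetical helper).
/// - A non-empty explicit name wins.
/// - Otherwise, if the parent surface was in zmx mode, a fresh name is generated.
/// - Otherwise null is returned and the caller would use a non-zmx backend.
/// The returned slice is always allocated with `alloc`.
fn resolveSession(
    alloc: std.mem.Allocator,
    explicit: ?[]const u8,
    zmx_mode: bool,
) !?[]const u8 {
    if (explicit) |name| {
        if (name.len > 0) return try alloc.dupe(u8, name);
    }
    if (!zmx_mode) return null;
    // Random suffix so each inherited split/tab gets its own session.
    const id = std.crypto.random.int(u64);
    return try std.fmt.allocPrint(alloc, "cmux-{x}", .{id});
}
```

Using an arena allocator for the result, as the review comments suggest the real code does, avoids having to track which branch produced the name.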

4) Build compatibility in this branch

  • Added a local uucode fallback vendoring path and compatibility wiring (deps-uucode, unicode build updates) to keep fork builds stable in environments where fetching unicode dependency sources is constrained.

Design Rationale

  • Backend extension point: implemented as a first-class termio backend variant, not a special-case branch inside exec, so behavior stays isolated and maintainable.
  • Per-surface control: zmx session is configured at surface creation time (embedded options), which maps naturally to split/tab semantics.
  • Graceful degradation: missing zmx does not break terminal creation; it downgrades to exec.
  • Socket-readiness gating: session creation is treated as successful when the socket is connectable, avoiding false negatives from launcher exit-code quirks.
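
A minimal sketch of socket-readiness gating, assuming readiness is defined as "a connect() to the Unix socket succeeds". The helper names `socketReady` and `waitForSocket`, the attempt count, and the delay are illustrative choices for this example and are not taken from the PR.

```zig
const std = @import("std");

/// Returns true if a connect() to the Unix socket at `path` succeeds right now.
fn socketReady(path: []const u8) bool {
    const stream = std.net.connectUnixSocket(path) catch return false;
    stream.close();
    return true;
}

/// Polls until the socket accepts a connection or the attempts run out.
fn waitForSocket(path: []const u8, attempts: usize, delay_ns: u64) bool {
    var i: usize = 0;
    while (i < attempts) : (i += 1) {
        if (socketReady(path)) return true;
        // Short pause between probes; split into seconds and nanoseconds.
        std.posix.nanosleep(delay_ns / std.time.ns_per_s, delay_ns % std.time.ns_per_s);
    }
    return false;
}
```

A caller might use `waitForSocket(socket_path, 50, 20 * std.time.ns_per_ms)` after launching the session, treating a `false` result as a creation failure regardless of the launcher's exit code.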

Validation

  • Zmx.zig includes targeted tests for IPC framing/buffer behavior and disconnect metadata.
  • End-to-end integration validated through cmux Debug builds and persistence workflows using this Ghostty branch.

Related

Summary by CodeRabbit

  • New Features

    • Added a zmx terminal backend and options to specify/create/use zmx sessions (zmx-session, zmx-create, zmx-mode).
    • Surfaces propagate zmx settings to newly opened surfaces.
  • Behavior

    • zmx is selected when configured; abnormal-exit messages include session info.
    • Surfaces emit a persistent “child disconnected” notification for backend disconnects.
  • Tests

    • Added unit tests for zmx IPC, config parsing, and session handling.

mdsakalu added 2 commits March 9, 2026 13:24
Implement a new termio backend that connects to a zmx daemon session
over a Unix domain socket instead of spawning a direct PTY subprocess.
This enables terminal session persistence across app restarts.

New files:
- src/termio/Zmx.zig: Full zmx backend with IPC protocol, read thread,
  socket buffer, session creation, and all required termio methods

Modified files:
- src/termio/backend.zig: Add zmx variant to Kind, Config, Backend,
  ThreadData unions with all 8 dispatch methods
- src/termio.zig: Register Zmx import
- src/config/Config.zig: Add zmx-session and zmx-create config options
- src/Surface.zig: Backend selection with exec fallback when zmx binary
  missing; fix childExitedAbnormally for zmx sessions
- src/apprt/embedded.zig: Add zmx_session/zmx_create/zmx_mode to
  Surface.Options; auto-generate session names for inherited zmx mode
- include/ghostty.h: Sync C API struct with new zmx fields

Design: zmx mode is opt-in, missing zmx binary falls back to exec with
warning, zmx mode inherits across splits/tabs with auto-generated session
names, threadExit sends Detach (not Kill) to preserve sessions.
@mdsakalu mdsakalu force-pushed the zmx-termio-backend branch from a1ce49a to 4eb974f Compare March 9, 2026 17:25
coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

Adds a new zmx termio backend and wires it through config, public API, surface selection, threading/I/O, and message handling; implements IPC framing, socket/session management, read thread, write queuing, and disconnect reporting.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Public API & Header: include/ghostty.h | Added three fields to ghostty_surface_config_s: zmx_session: const char*, zmx_create: bool, zmx_mode: bool. |
| Configuration: src/config/Config.zig | Added zmx-session: ?[]const u8 = null and zmx-create: bool = true config options with comments and tests for parsing/defaults. |
| Surface & App Runtime: src/Surface.zig, src/apprt/embedded.zig, src/apprt/surface.zig | Surface selection now prefers .zmx when configured; Surface.Options gains zmx_session, zmx_create, zmx_mode and propagates them to child surfaces; added Message.child_disconnected to represent persistent backend disconnects. |
| Termio Exports & Backend Abstraction: src/termio.zig, src/termio/backend.zig | Exported termio.Zmx; extended public Kind enum, Config union, Backend union, and ThreadData union with .zmx variant; dispatch switches updated to delegate zmx lifecycle methods (init/deinit, threadEnter/Exit, resize, queueWrite, childExitedAbnormally, etc.). |
| Zmx Backend Implementation & Tests: src/termio/Zmx.zig | New comprehensive Zmx backend: IPC tags/headers, packed header/resize structs, SocketBuffer framing, non-blocking socket I/O, read thread, write queueing and chunking, session create/connect logic, helpers (socket dir, find binary, createSession, disconnect metadata), lifecycle functions (init/deinit/threadEnter/threadExit/resize/queueWrite/childExitedAbnormally), ThreadData/ReadThread structs, and unit tests. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Surface as Surface/Term
    participant Backend as termio.Zmx
    participant Daemon as zmx Daemon
    participant RThread as ReadThread

    Surface->>Backend: threadEnter(td, session)
    Backend->>Backend: resolve socket dir, find zmx binary
    Backend->>Backend: createSession (spawn `zmx run`) / connect socket
    Backend->>Daemon: connect Unix socket
    Backend->>Daemon: send Init, send Resize
    Backend->>RThread: spawn ReadThread
    Surface->>Backend: queueWrite(data)
    Backend->>Daemon: send Input frames
    Daemon->>RThread: socket Output data
    RThread->>Surface: dispatch termio.Output
    Surface->>Backend: threadExit
    Backend->>Daemon: send Detach (close)
    RThread->>Backend: shutdown and join

sequenceDiagram
    autonumber
    participant IO as Surface IO
    participant Backend as termio.Zmx
    participant Daemon as zmx Daemon

    IO->>Backend: resize(grid, screen)
    Backend->>Backend: update rows/cols
    Backend->>Daemon: send Resize IPC
    IO->>Backend: childExitedAbnormally(exit_code, runtime_ms)
    Backend->>IO: render disconnection message (includes session info)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I tunneled through sockets, nibbling bytes with glee,

Sessions sprouted, threads hummed — a zmx jubilee.
IPC crumbs, resize hops, a daemon's cozy lair,
Terminals blink, messages sink — I danced through every care. 🎉

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a zmx backend for persistent terminal sessions, which is the core objective of this PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/Surface.zig (1)

632-698: ⚠️ Potential issue | 🟠 Major

Restore cleanup before Termio.init takes ownership.

This block now only errdefers the mailbox. If anything fails after creating termio.Zmx/termio.Manual/termio.Exec or the derived termio config, those resources leak on the init error path.

♻️ Possible fix
-        const io_backend: termio.Backend = if (zmx_backend) |zmx|
+        var io_backend: termio.Backend = if (zmx_backend) |zmx|
             .{ .zmx = zmx }
         else if (use_manual_io) manual_backend: {
             var io_manual = try termio.Manual.init(alloc, .{
                 .write_cb = manual_write_cb,
                 .write_userdata = manual_write_userdata,
@@
             _ = &io_exec;
             break :exec_backend .{ .exec = io_exec };
         };
+        errdefer io_backend.deinit();

         var io_mailbox = try termio.Mailbox.initSPSC(alloc);
         errdefer io_mailbox.deinit(alloc);
+
+        var io_config = try termio.Termio.DerivedConfig.init(alloc, config);
+        errdefer io_config.deinit();

         try termio.Termio.init(&self.io, alloc, .{
             .size = size,
             .full_config = config,
-            .config = try termio.Termio.DerivedConfig.init(alloc, config),
+            .config = io_config,
             .backend = io_backend,
             .mailbox = io_mailbox,
             .renderer_state = &self.renderer_state,
             .renderer_wakeup = render_thread.wakeup,
             .renderer_mailbox = render_thread.mailbox,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Surface.zig` around lines 632 - 698, The code creates termio backends
(termio.Zmx.init, termio.Manual.init, termio.Exec.init) and DerivedConfig via
termio.Termio.DerivedConfig.init but only errdefer's the mailbox, so if
termio.Termio.init fails those backend objects and the derived config will leak;
add cleanup errdefer guards for any created backend (zmx_backend, io_manual,
io_exec) and for the DerivedConfig result, then clear/reset those guards (or set
the owned values to neutral/transfer ownership) immediately after successful
Termio.init so Termio takes responsibility; locate the backend construction
paths around io_backend and the DerivedConfig init used in Termio.init and
ensure corresponding deinit/cleanup calls are invoked on error paths.
🧹 Nitpick comments (1)
src/config/Config.zig (1)

1541-1545: Add a smoke test for the new public config keys.

These are new user-facing config entries, but there isn't a focused Config.loadIter test here for the key names plus the default/override path. A tiny parser test would catch typo or wiring regressions early.

Example test
test "parse zmx config" {
    const testing = std.testing;

    {
        var cfg = try Config.default(testing.allocator);
        defer cfg.deinit();

        var it: TestIterator = .{ .data = &.{
            "--zmx-session=session-1",
        } };
        try cfg.loadIter(testing.allocator, &it);
        try cfg.finalize();

        try testing.expectEqualStrings("session-1", cfg.@"zmx-session".?);
        try testing.expect(cfg.@"zmx-create");
    }

    {
        var cfg = try Config.default(testing.allocator);
        defer cfg.deinit();

        var it: TestIterator = .{ .data = &.{
            "--zmx-session=session-1",
            "--zmx-create=false",
        } };
        try cfg.loadIter(testing.allocator, &it);
        try cfg.finalize();

        try testing.expectEqualStrings("session-1", cfg.@"zmx-session".?);
        try testing.expect(!cfg.@"zmx-create");
    }
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/Config.zig` around lines 1541 - 1545, Add a focused unit test that
exercises Config.loadIter and Config.finalize to verify the new public keys
@"zmx-session" and @"zmx-create" parse and default correctly: create
Config.default(allocator) instances, feed a TestIterator with
"--zmx-session=session-1" to ensure cfg.@"zmx-session" == "session-1" and
cfg.@"zmx-create" defaults to true, then a second case with
"--zmx-session=session-1" plus "--zmx-create=false" to ensure cfg.@"zmx-create"
becomes false; deinit the configs and use
testing.expect/testing.expectEqualStrings to assert values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/apprt/embedded.zig`:
- Around line 486-487: The default value for zmx_create in the surface config
causes ghostty_surface_config_new() to opt callers into creating a zmx session
even when they only pass zmx_session and diverges from C zero-init; change the
default for the field zmx_create to false in the config definition so
zero-initialized instances behave correctly and callers that explicitly want
creation must set config.zmx_create = true (see the zmx_create field and
ghostty_surface_config_new and zmx_session usage to locate the change).
- Around line 601-615: When opts.zmx_session is non-null but empty we must treat
it as unset so zmx_mode can auto-generate a session; change the branching around
opts.zmx_session/opts.zmx_mode so that after calling std.mem.sliceTo on
opts.zmx_session you only assign config.@"zmx-session" and config.@"zmx-create"
when session.len > 0, and if session.len == 0 and opts.zmx_mode is true fall
through to the existing auto-generate logic (reuse std.crypto.random.int,
config.arenaAlloc, and std.fmt.allocPrint to create the "cmux-{x}" session and
set config.@"zmx-session" and config.@"zmx-create" = true). Ensure no early
return or else-block prevents the zmx_mode branch when the provided session
string is empty.

In `@src/config/Config.zig`:
- Around line 1537-1541: Update the doc comment for the configuration field
@"zmx-session" (the doc block immediately above the declaration @"zmx-session":
?[]const u8 = null) to clarify that this is not an unconditional switch away
from spawning a shell: retain the existing description of connecting to a zmx
daemon and session persistence, and add a sentence stating that if the named zmx
session or daemon is unavailable the code will fall back to using exec (spawning
a local shell) and emit a warning to the user; ensure the wording is concise and
suitable for generated manual output.

In `@src/termio/Zmx.zig`:
- Around line 623-642: handleDisconnect currently synthesizes a .child_exited
with exit_code = 0 which causes Surface.childExited to treat unexpected zmx
disconnects as clean exits; instead add a dedicated mailbox variant (e.g.,
.child_disconnected or a .child_exited variant with an explicit .abnormal flag)
to the surface_mailbox message union, update disconnectMetadata/handleDisconnect
to push that new variant including runtime_ms and a reason sentinel, and
implement a new Surface.childDisconnected (or branch in Surface.childExited that
checks the new flag) to surface the abnormal disconnect path; update the other
similar call sites (around the other block noted at 727-736) to push the new
variant as well.
- Around line 229-250: The code currently probes the filesystem with
std.fs.accessAbsolute(socket_path) when self.create_if_missing is true, which
can succeed for a stale socket file; instead, after creating or seeing the path,
attempt a real TCP connect to verify readiness and only treat the socket as
ready on successful posix.connect; update the logic in the block around
create_if_missing (referencing create_if_missing, accessAbsolute, createSession,
posix.socket, posix.connect) to: 1) if accessAbsolute fails with FileNotFound
call createSession(socket_path) and then loop with a short backoff trying
posix.connect on the socket until it succeeds or a timeout/attempt limit is
reached, 2) if accessAbsolute succeeds, try posix.connect immediately and only
proceed when connect succeeds (otherwise retry similarly), and 3) remove the
blocking waitpid(pid, 0) readiness dependency and replace it with the
connect-based readiness check; apply the same connect-based readiness change to
the other similar block around lines 698-715.
- Around line 756-763: The test named "socket path resolution with ZMX_DIR"
never sets ZMX_DIR so it doesn't validate the precedence branch; update the code
so resolveSocketDir can be exercised deterministically: either refactor
resolveSocketDir to accept an environment-provider parameter (e.g., a function
or an env map) or add an optional zmxDir override argument, update all call
sites to pass the real env-provider, and then in the test call the new
resolveSocketDir variant with a synthetic environment that contains ZMX_DIR (or
set/unset the process env around the test and restore it). Reference
resolveSocketDir and the test "socket path resolution with ZMX_DIR" when making
the changes.

---

Outside diff comments:
In `@src/Surface.zig`:
- Around line 632-698: The code creates termio backends (termio.Zmx.init,
termio.Manual.init, termio.Exec.init) and DerivedConfig via
termio.Termio.DerivedConfig.init but only errdefer's the mailbox, so if
termio.Termio.init fails those backend objects and the derived config will leak;
add cleanup errdefer guards for any created backend (zmx_backend, io_manual,
io_exec) and for the DerivedConfig result, then clear/reset those guards (or set
the owned values to neutral/transfer ownership) immediately after successful
Termio.init so Termio takes responsibility; locate the backend construction
paths around io_backend and the DerivedConfig init used in Termio.init and
ensure corresponding deinit/cleanup calls are invoked on error paths.

---

Nitpick comments:
In `@src/config/Config.zig`:
- Around line 1541-1545: Add a focused unit test that exercises Config.loadIter
and Config.finalize to verify the new public keys @"zmx-session" and
@"zmx-create" parse and default correctly: create Config.default(allocator)
instances, feed a TestIterator with "--zmx-session=session-1" to ensure
cfg.@"zmx-session" == "session-1" and cfg.@"zmx-create" defaults to true, then a
second case with "--zmx-session=session-1" plus "--zmx-create=false" to ensure
cfg.@"zmx-create" becomes false; deinit the configs and use
testing.expect/testing.expectEqualStrings to assert values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6f5c2d42-be8e-441f-ac18-00f33ebfd76a

📥 Commits

Reviewing files that changed from the base of the PR and between 1b008f5 and 4eb974f.

📒 Files selected for processing (7)
  • include/ghostty.h
  • src/Surface.zig
  • src/apprt/embedded.zig
  • src/config/Config.zig
  • src/termio.zig
  • src/termio/Zmx.zig
  • src/termio/backend.zig

Comment on lines +1537 to +1541
/// Connect to a zmx daemon session instead of spawning a new shell process.
/// When set, the surface connects to the named zmx session over a Unix
/// domain socket. The zmx daemon owns the PTY and persists independently
/// of the surface, enabling session persistence across restarts.
@"zmx-session": ?[]const u8 = null,

⚠️ Potential issue | 🟡 Minor

Document the exec fallback here.

zmx-session is not an unconditional switch away from spawning a shell: the PR behavior falls back to exec with a warning when zmx is unavailable. Since these comments feed the generated manual, the current wording overpromises behavior.

Suggested wording
-/// Connect to a zmx daemon session instead of spawning a new shell process.
-/// When set, the surface connects to the named zmx session over a Unix
+/// Prefer connecting to a zmx daemon session instead of spawning a new shell
+/// process.
+/// When set, the surface connects to the named zmx session over a Unix
 /// domain socket. The zmx daemon owns the PTY and persists independently
 /// of the surface, enabling session persistence across restarts.
+/// If zmx is unavailable, Ghostty falls back to the exec backend and logs
+/// a warning.
 @"zmx-session": ?[]const u8 = null,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/Config.zig` around lines 1537 - 1541, Update the doc comment for
the configuration field @"zmx-session" (the doc block immediately above the
declaration @"zmx-session": ?[]const u8 = null) to clarify that this is not an
unconditional switch away from spawning a shell: retain the existing description
of connecting to a zmx daemon and session persistence, and add a sentence
stating that if the named zmx session or daemon is unavailable the code will
fall back to using exec (spawning a local shell) and emit a warning to the user;
ensure the wording is concise and suitable for generated manual output.

Comment on lines +229 to +250
// Session creation if needed
if (self.create_if_missing) {
    // Check if socket already exists
    std.fs.accessAbsolute(socket_path, .{}) catch |err| switch (err) {
        // Socket doesn't exist: create the session
        error.FileNotFound => try self.createSession(socket_path),

        // Other errors are real problems (permissions, bad path, etc.)
        else => return err,
    };
}

// Connect to Unix domain socket
const sock = try posix.socket(
    posix.AF.UNIX,
    posix.SOCK.STREAM | posix.SOCK.CLOEXEC,
    0,
);
errdefer posix.close(sock);

const addr = try std.net.Address.initUnix(socket_path);
try posix.connect(sock, &addr.any, addr.getOsSockLen());

⚠️ Potential issue | 🟠 Major

Probe the socket, not just the path, when zmx-create is enabled.

Both the pre-create fast path and the readiness loop use accessAbsolute(socket_path) as the success signal. A stale socket file, or one that exists before the daemon starts accepting connections, will pass those checks and then fail at the later connect(). The blocking waitpid(pid, 0) after the path appears also makes readiness depend on launcher lifetime again. Retry/return based on a successful connect() instead.

Also applies to: 698-715

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/termio/Zmx.zig` around lines 229 - 250, The code currently probes the
filesystem with std.fs.accessAbsolute(socket_path) when self.create_if_missing
is true, which can succeed for a stale socket file; instead, after creating or
seeing the path, attempt a real connect() on the Unix domain socket to verify readiness and only treat
the socket as ready on successful posix.connect; update the logic in the block
around create_if_missing (referencing create_if_missing, accessAbsolute,
createSession, posix.socket, posix.connect) to: 1) if accessAbsolute fails with
FileNotFound call createSession(socket_path) and then loop with a short backoff
trying posix.connect on the socket until it succeeds or a timeout/attempt limit
is reached, 2) if accessAbsolute succeeds, try posix.connect immediately and
only proceed when connect succeeds (otherwise retry similarly), and 3) remove
the blocking waitpid(pid, 0) readiness dependency and replace it with the
connect-based readiness check; apply the same connect-based readiness change to
the other similar block around lines 698-715.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (2)
src/apprt/embedded.zig (1)

486-487: ⚠️ Potential issue | 🟡 Minor

Default zmx_create to false for safer opt-in semantics.

With zmx_create: bool = true, callers who only set zmx_session are implicitly opted into session creation. This diverges from C zero-initialization behavior and could cause unexpected session creation. Consider defaulting to false so callers must explicitly enable creation.

Suggested fix
-        zmx_create: bool = true,
+        zmx_create: bool = false,
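
The divergence described above can be shown with a small sketch. `Options` here is a hypothetical, trimmed-down mirror of the embedded surface options (only the three zmx fields), and `std.mem.zeroes` stands in for what a C caller gets from zero-initializing the struct.

```zig
const std = @import("std");

// Hypothetical mirror of the embedded options; only the zmx fields are shown.
const Options = extern struct {
    zmx_session: ?[*:0]const u8 = null,
    zmx_create: bool = true, // the default under review
    zmx_mode: bool = false,
};

/// What a Zig caller gets from the field defaults.
fn zigDefault() Options {
    return .{};
}

/// What a C caller gets from an all-zero struct, e.g. `= {0}` or memset.
fn cZeroInit() Options {
    return std.mem.zeroes(Options);
}
```

With `zmx_create` defaulting to `true`, `zigDefault().zmx_create` and `cZeroInit().zmx_create` disagree; changing the default to `false` would make the two construction paths agree.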
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/apprt/embedded.zig` around lines 486 - 487, Change the default for the
zmx_create field from true to false so callers are not implicitly opted into
session creation; locate the zmx_create: bool = true initializer in the struct
or options block in embedded.zig and set it to false (i.e., zmx_create: bool =
false) so callers must explicitly enable session creation when using
zmx_session.
src/termio/Zmx.zig (1)

724-740: ⚠️ Potential issue | 🟠 Major

Return as soon as the socket is connectable.

Lines 735-739 still block on waitpid(pid, 0) after socketReady(socket_path) succeeds. That couples attach readiness back to the lifetime of zmx run <session>, so a launcher that stays alive a bit longer can still hang surface init even though the socket is already usable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/termio/Zmx.zig` around lines 724 - 740, The code currently does a
blocking posix.waitpid(pid, 0) after socketReady(socket_path) succeeds, which
can hang startup; replace that blocking wait with a non-blocking reap
(posix.waitpid(pid, std.c.W.NOHANG)) or simply skip the blocking reap and return
immediately when socketReady succeeds, ensuring reaped is set only if the
non-blocking wait reports the child exited. Update the logic around reaped,
waitpid, socketReady, and pid so the function returns as soon as
socketReady(socket_path) is true and only uses WNOHANG to detect/reap the child
without blocking.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Surface.zig`:
- Around line 1325-1341: childDisconnected currently falls back to the generic
self.childExitedAbnormally(info) which triggers the terminal fallback UI;
instead invoke the zmx-specific disconnect renderer implemented in
src/termio/Zmx.zig. Replace the call to childExitedAbnormally(info) with a call
into the Zmx disconnect handler (for example, call the Zmx module/method exposed
on your termio or zmx object such as self.termio.zmx.childDisconnected(info) or
Zmx.childDisconnected(self.termio.zmx, info)), preserving the existing error
logging pattern (catch |err| log.err(...)). Ensure you reference the Zmx
disconnect symbol from Zmx.zig and remove the generic childExitedAbnormally
branch for zmx backends.

In `@src/termio/Zmx.zig`:
- Around line 121-133: SocketBuffer.next currently treats any hdr.len as
trustable and returns null for both "need more bytes" and "invalid/oversized
frame", causing indefinite buffering; add a sane maximum frame size constant
(e.g., MAX_IPC_FRAME) and, inside fn next (SocketBuffer.next) after computing
total = hdr_size + hdr.len, check if total > MAX_IPC_FRAME (or hdr.len >
MAX_IPC_FRAME - hdr_size) and if so treat the frame as invalid: perform a hard
disconnect/cleanup on the offending socket buffer (call the existing
disconnect/close path on the SocketBuffer/owner) and return null or an explicit
error if the surrounding API supports it; ensure you reference IpcHeader/hdr.len
and update any tests to expect a disconnect on oversized frames.
- Around line 54-73: ipcSend and writeAll perform direct posix.write calls which
can partial-write or hit EWOULDBLOCK and bypass queued frames; replace all
direct uses (calls from resize(), threadExit(), and the functions themselves) so
outbound IPC is enqueued via the serialized writer path (use queueWrite) instead
of calling posix.write directly. Update ipcSend to build the IpcHeader and then
enqueue the header bytes and payload together (or enqueue a single contiguous
buffer) through queueWrite, and remove/stop using writeAll for socket writes;
ensure error propagation matches the existing API and that queueWrite handles
WouldBlock/retry semantics so framing and ordering are preserved.
- Around line 177-180: The current early check always returns error.ZmxNotFound
when findZmxBinary() fails, preventing socket probing and breaking attach-only
use; change the logic to only require the zmx binary when create_if_missing is
true (i.e., if (create_if_missing && !findZmxBinary()) return
error.ZmxNotFound), so that when create_if_missing is false the code still
proceeds to probe the Unix socket and allow attaching to an existing session
(ensure this change is applied where findZmxBinary() and create_if_missing are
used in this module).
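
For the oversized-frame comment on SocketBuffer.next above, a decoder that distinguishes "need more bytes" from "invalid frame" might look like the sketch below. The header layout (1 tag byte plus a 4-byte little-endian length), the 4 MiB cap, and the names `Frame`, `max_frame_payload`, and `nextFrame` are assumptions for illustration and are not taken from Zmx.zig.

```zig
const std = @import("std");

const header_size: usize = 5; // assumed: 1 tag byte + 4-byte little-endian length
const max_frame_payload: u32 = 4 * 1024 * 1024; // illustrative cap, not from the PR

const Frame = struct {
    tag: u8,
    payload: []const u8,
    consumed: usize, // bytes the caller should drop from its buffer
};

/// Returns null when more bytes are needed, a Frame when one is complete,
/// and error.FrameTooLarge when the declared length exceeds the cap.
fn nextFrame(buf: []const u8) error{FrameTooLarge}!?Frame {
    if (buf.len < header_size) return null;
    const len = std.mem.readInt(u32, buf[1..5], .little);
    // Validate before waiting for the payload, so a bogus length
    // cannot make the caller buffer indefinitely.
    if (len > max_frame_payload) return error.FrameTooLarge;
    const total = header_size + @as(usize, len);
    if (buf.len < total) return null;
    return Frame{ .tag = buf[0], .payload = buf[header_size..total], .consumed = total };
}
```

A caller would treat `error.FrameTooLarge` as a reason to disconnect, while `null` simply means "read more and try again".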

---

Duplicate comments:
In `@src/apprt/embedded.zig`:
- Around line 486-487: Change the default for the zmx_create field from true to
false so callers are not implicitly opted into session creation; locate the
zmx_create: bool = true initializer in the struct or options block in
embedded.zig and set it to false (i.e., zmx_create: bool = false) so callers
must explicitly enable session creation when using zmx_session.

In `@src/termio/Zmx.zig`:
- Around line 724-740: The code currently does a blocking posix.waitpid(pid, 0)
after socketReady(socket_path) succeeds, which can hang startup; replace that
blocking wait with a non-blocking reap (posix.waitpid(pid, std.c.W.NOHANG)) or
simply skip the blocking reap and return immediately when socketReady succeeds,
ensuring reaped is set only if the non-blocking wait reports the child exited.
Update the logic around reaped, waitpid, socketReady, and pid so the function
returns as soon as socketReady(socket_path) is true and only uses WNOHANG to
detect/reap the child without blocking.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 04c952e0-05da-4bb1-a2ab-4eb0a39bb402

📥 Commits

Reviewing files that changed from the base of the PR and between 4eb974f and 8b34709.

📒 Files selected for processing (5)
  • src/Surface.zig
  • src/apprt/embedded.zig
  • src/apprt/surface.zig
  • src/config/Config.zig
  • src/termio/Zmx.zig
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/config/Config.zig

Comment on lines +54 to +73
fn ipcSend(fd: posix.fd_t, tag: IpcTag, data: []const u8) !void {
    const header = IpcHeader{
        .tag = tag,
        .len = @intCast(data.len),
    };
    const header_bytes = std.mem.asBytes(&header);
    try writeAll(fd, header_bytes);
    if (data.len > 0) {
        try writeAll(fd, data);
    }
}

fn writeAll(fd: posix.fd_t, data: []const u8) !void {
    var index: usize = 0;
    while (index < data.len) {
        const n = try posix.write(fd, data[index..]);
        if (n == 0) return error.DiskQuota;
        index += n;
    }
}

⚠️ Potential issue | 🔴 Critical

Serialize all outbound IPC through one writer path.

After Line 532 makes the shared socket fd non-blocking, resize() and threadExit() still send frames via ipcSend()/writeAll() instead of the queued writer used by queueWrite(). That opens two failure modes: a direct posix.write can stop on WouldBlock after only part of a frame, and it can also bypass earlier queued Input frames on the same stream. Either case breaks zmx framing.

Also applies to: 299-307, 331-338, 532-543

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/termio/Zmx.zig` around lines 54 - 73, ipcSend and writeAll perform direct
posix.write calls which can partial-write or hit EWOULDBLOCK and bypass queued
frames; replace all direct uses (calls from resize(), threadExit(), and the
functions themselves) so outbound IPC is enqueued via the serialized writer path
(use queueWrite) instead of calling posix.write directly. Update ipcSend to
build the IpcHeader and then enqueue the header bytes and payload together (or
enqueue a single contiguous buffer) through queueWrite, and remove/stop using
writeAll for socket writes; ensure error propagation matches the existing API
and that queueWrite handles WouldBlock/retry semantics so framing and ordering
are preserved.
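
One way to keep header and payload together is to encode them into a single contiguous buffer before handing that buffer to the queued writer. The sketch below shows only the encoding step. The `IpcHeader` layout (two `u32` fields) and the `encodeFrame` name are assumptions for illustration, and the hand-off to `queueWrite` is left out because its signature is not shown in this review:

```zig
const std = @import("std");

/// Assumed header layout; the real IpcHeader in Zmx.zig may differ.
const IpcHeader = extern struct {
    tag: u32,
    len: u32,
};

/// Hypothetical helper: copies header and payload into `out` as one frame
/// and returns the filled prefix, so a single enqueue carries the whole frame.
fn encodeFrame(
    out: []u8,
    tag: u32,
    data: []const u8,
) error{ NoSpaceLeft, PayloadTooLarge }![]const u8 {
    // Guard the narrowing cast instead of trusting data.len to fit in u32.
    if (data.len > std.math.maxInt(u32)) return error.PayloadTooLarge;

    const header = IpcHeader{ .tag = tag, .len = @intCast(data.len) };
    const header_bytes = std.mem.asBytes(&header);
    const total = header_bytes.len + data.len;
    if (out.len < total) return error.NoSpaceLeft;

    @memcpy(out[0..header_bytes.len], header_bytes);
    @memcpy(out[header_bytes.len..total], data);
    return out[0..total];
}
```

Because the frame is one slice, a writer that retries on `WouldBlock` can resume mid-frame without another frame being interleaved.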

Comment on lines +121 to +133
/// Returns next complete IPC message or null.
fn next(self: *SocketBuffer) ?IpcSocketMsg {
    const available = self.buf.items[self.head..];
    const hdr_size = @sizeOf(IpcHeader);
    if (available.len < hdr_size) return null;

    const hdr = std.mem.bytesToValue(IpcHeader, available[0..hdr_size]);
    const total = hdr_size + hdr.len;
    if (available.len < total) return null;

    const pay = available[hdr_size..total];
    self.head += total;
    return .{ .header = hdr, .payload = pay };

⚠️ Potential issue | 🟠 Major

Reject oversized IPC frames instead of buffering forever.

SocketBuffer.next() trusts hdr.len completely. A stale or misbound socket can advertise a huge payload, and the read loop will keep appending until OOM because “need more bytes” and “invalid frame size” are both represented as null. This needs a sane maximum frame size and a hard disconnect on violation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/termio/Zmx.zig` around lines 121 - 133, SocketBuffer.next currently
treats any hdr.len as trustable and returns null for both "need more bytes" and
"invalid/oversized frame", causing indefinite buffering; add a sane maximum
frame size constant (e.g., MAX_IPC_FRAME) and, inside fn next
(SocketBuffer.next) after computing total = hdr_size + hdr.len, check if total >
MAX_IPC_FRAME (or hdr.len > MAX_IPC_FRAME - hdr_size) and if so treat the frame
as invalid: perform a hard disconnect/cleanup on the offending socket buffer
(call the existing disconnect/close path on the SocketBuffer/owner) and return
null or an explicit error if the surrounding API supports it; ensure you
reference IpcHeader/hdr.len and update any tests to expect a disconnect on
oversized frames.
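
A sketch of how the parser could separate "need more bytes" from "invalid frame size", assuming the caller is able to handle an error return by disconnecting. The header layout, the `max_ipc_payload` value, and the `nextFrame` and `Frame` names are illustrative assumptions, not taken from the PR:

```zig
const std = @import("std");

/// Assumed header layout; the real IpcHeader in Zmx.zig may differ.
const IpcHeader = extern struct {
    tag: u32,
    len: u32,
};

/// Hypothetical cap; the right value depends on the zmx protocol.
const max_ipc_payload: usize = 4 * 1024 * 1024;

const Frame = struct {
    header: IpcHeader,
    payload: []const u8,
    consumed: usize, // bytes the caller should advance past
};

/// Returns a complete frame, null when more bytes are needed, or
/// error.FrameTooLarge when the header advertises an oversized payload.
fn nextFrame(available: []const u8) error{FrameTooLarge}!?Frame {
    const hdr_size = @sizeOf(IpcHeader);
    if (available.len < hdr_size) return null;

    const hdr = std.mem.bytesToValue(IpcHeader, available[0..hdr_size]);
    // Validate before waiting for the payload, so a bogus length cannot
    // make the read loop buffer indefinitely.
    if (hdr.len > max_ipc_payload) return error.FrameTooLarge;

    const total = hdr_size + @as(usize, hdr.len);
    if (available.len < total) return null;

    return Frame{
        .header = hdr,
        .payload = available[hdr_size..total],
        .consumed = total,
    };
}
```

The read loop would then treat `error.FrameTooLarge` as a reason to close the connection, while `null` still means "read more and try again".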

@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 8 files

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/termio/Zmx.zig (2)

54-73: ⚠️ Potential issue | 🔴 Critical

Serialize all outbound IPC through the queued writer path.

After Line 533 puts the shared socket into non-blocking mode, ipcSend()/writeAll() can fail mid-frame with WouldBlock, and they also bypass earlier queued Input writes. That can corrupt framing or reorder Resize/Detach relative to user input.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/termio/Zmx.zig` around lines 54 - 73, ipcSend and writeAll perform direct
blocking writes and can fail with WouldBlock after the socket is set
non-blocking, bypassing the existing queued writer and breaking framing/order;
change ipcSend to build the full frame (header_bytes + data) and enqueue that
frame into the existing queued writer path instead of calling
writeAll/posix.write directly so the writer code (which already handles partial
writes and WouldBlock) serializes IPC frames with other Output/Input writes;
remove or stop using writeAll for IPC sends and reference ipcSend and writeAll
in your change so reviewers can verify the enqueueing.

121-133: ⚠️ Potential issue | 🟠 Major

Reject oversized IPC frames instead of buffering forever.

SocketBuffer.next() trusts hdr.len completely. A stale or misbound socket can advertise an arbitrarily large payload, and this loop will keep reallocating until OOM because "need more bytes" and "invalid frame size" are both treated as null.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/termio/Zmx.zig` around lines 121 - 133, SocketBuffer.next currently
trusts hdr.len and keeps returning null for both "need more bytes" and "invalid
huge length", allowing a malicious/stale socket to cause unbounded buffering;
add an explicit validation after reading IpcHeader to reject absurd payload
sizes: check hdr.len against a fixed sensible MAX_IPC_PAYLOAD constant and, if
hdr.len > MAX_IPC_PAYLOAD (or hdr_size + hdr.len would overflow/self.buf
bounds), treat it as a fatal/invalid frame (return an error variant or
drop/reset the buffer and advance head past the header to resynchronize) instead
of returning null; update SocketBuffer.next, referencing IpcHeader and hdr.len,
to distinguish "need more bytes" (null) from "invalid frame size" (error/clear)
so the loop cannot allocate forever.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d6dcf67a-ae18-4071-b181-43faaffa7bf4

📥 Commits

Reviewing files that changed from the base of the PR and between 8b34709 and 9b5d9d0.

📒 Files selected for processing (3)
  • src/Surface.zig
  • src/apprt/embedded.zig
  • src/termio/Zmx.zig

Comment on lines +293 to +311
pub fn threadExit(self: *Zmx, td: *termio.Termio.ThreadData) void {
    const zmx_td = &td.backend.zmx;

    // Signal read thread that upcoming EOF from Detach is expected
    zmx_td.shutting_down.store(true, .release);

    // Send Detach — keeps session alive for reconnection
    ipcSend(zmx_td.socket_fd, .Detach, &.{}) catch |err| {
        log.warn("error sending detach err={}", .{err});
    };

    // Signal and join read thread
    _ = posix.write(zmx_td.read_thread_pipe, "x") catch |err| switch (err) {
        error.BrokenPipe => {},
        else => log.warn("error writing to read thread quit pipe err={}", .{err}),
    };
    zmx_td.read_thread.join();
    self.socket_fd = null;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Zmx shutdown path ==="
rg -n -C2 --type zig 'socket_fd|write_stream\.deinit|posix\.close\(' src/termio/Zmx.zig

echo
echo "=== xev Stream ownership (if vendored in repo) ==="
fd 'xev*.zig' . -x rg -n -C3 --type zig '\binitFd\b|\bdeinit\b|\bclose\b' {}

Repository: manaflow-ai/ghostty

Length of output: 3338


Close the socket fd explicitly before clearing it in threadExit().

threadExit() sets self.socket_fd = null without calling posix.close() first. This causes the file descriptor to leak because Zmx.deinit() will not close a null fd. The socket must be explicitly closed after the read thread is joined and before setting it to null.

This also applies to ThreadData.deinit() (lines 506–512), which has similar cleanup responsibilities and should ensure all resources held by the thread data are properly released.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/termio/Zmx.zig` around lines 293 - 311, threadExit currently nulls
self.socket_fd without closing it, leaking the socket; after joining
zmx_td.read_thread and before setting self.socket_fd = null, call posix.close on
zmx_td.socket_fd (handle errors similarly to other posix calls) and then set the
field to null. Apply the same pattern in ThreadData.deinit (and any cleanup in
Zmx.deinit): explicitly close zmx_td.socket_fd (or the ThreadData's socket_fd)
via posix.close with appropriate error handling before clearing the file
descriptor field.
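
The close-then-clear pattern can be sketched with an optional fd field, so that cleanup is safe to call from more than one place. `Conn` and `closeSocket` are hypothetical names. The sketch assumes no other owner (for example a stream wrapper) also closes the same descriptor, since a double close would be a bug of its own:

```zig
const std = @import("std");
const posix = std.posix;

/// Hypothetical stand-in for the struct that owns the socket.
const Conn = struct {
    socket_fd: ?posix.fd_t = null,

    /// Closes the fd if one is held, then clears the field so a second
    /// call (for example from deinit) does nothing.
    fn closeSocket(self: *Conn) void {
        if (self.socket_fd) |fd| {
            posix.close(fd);
            self.socket_fd = null;
        }
    }
};
```

In `threadExit()`, a call of this kind would go after `read_thread.join()`, so that the read thread is no longer using the descriptor when it is closed.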
