termio: add zmx backend for persistent terminal sessions#2
termio: add zmx backend for persistent terminal sessions#2mdsakalu wants to merge 4 commits intomanaflow-ai:mainfrom
Conversation
Implement a new termio backend that connects to a zmx daemon session over a Unix domain socket instead of spawning a direct PTY subprocess. This enables terminal session persistence across app restarts. New files: - src/termio/Zmx.zig: Full zmx backend with IPC protocol, read thread, socket buffer, session creation, and all required termio methods Modified files: - src/termio/backend.zig: Add zmx variant to Kind, Config, Backend, ThreadData unions with all 8 dispatch methods - src/termio.zig: Register Zmx import - src/config/Config.zig: Add zmx-session and zmx-create config options - src/Surface.zig: Backend selection with exec fallback when zmx binary missing; fix childExitedAbnormally for zmx sessions - src/apprt/embedded.zig: Add zmx_session/zmx_create/zmx_mode to Surface.Options; auto-generate session names for inherited zmx mode - include/ghostty.h: Sync C API struct with new zmx fields Design: zmx mode is opt-in, missing zmx binary falls back to exec with warning, zmx mode inherits across splits/tabs with auto-generated session names, threadExit sends Detach (not Kill) to preserve sessions.
a1ce49a to
4eb974f
Compare
📝 WalkthroughWalkthroughAdds a new zmx termio backend and wires it through config, public API, surface selection, threading/I/O, and message handling; implements IPC framing, socket/session management, read thread, write queuing, and disconnect reporting. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Surface as Surface/Term
participant Backend as termio.Zmx
participant Daemon as zmx Daemon
participant RThread as ReadThread
Surface->>Backend: threadEnter(td, session)
Backend->>Backend: resolve socket dir, find zmx binary
Backend->>Backend: createSession (spawn `zmx run`) / connect socket
Backend->>Daemon: connect Unix socket
Backend->>Daemon: send Init, send Resize
Backend->>RThread: spawn ReadThread
Surface->>Backend: queueWrite(data)
Backend->>Daemon: send Input frames
Daemon->>RThread: socket Output data
RThread->>Surface: dispatch termio.Output
Surface->>Backend: threadExit
Backend->>Daemon: send Detach (close)
RThread->>Backend: shutdown and join
sequenceDiagram
autonumber
participant IO as Surface IO
participant Backend as termio.Zmx
participant Daemon as zmx Daemon
IO->>Backend: resize(grid, screen)
Backend->>Backend: update rows/cols
Backend->>Daemon: send Resize IPC
IO->>Backend: childExitedAbnormally(exit_code, runtime_ms)
Backend->>IO: render disconnection message (includes session info)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/Surface.zig (1)
632-698:⚠️ Potential issue | 🟠 MajorRestore cleanup before
Termio.inittakes ownership.This block now only
errdefers the mailbox. If anything fails after creatingtermio.Zmx/termio.Manual/termio.Execor the derived termio config, those resources leak on the init error path.♻️ Possible fix
- const io_backend: termio.Backend = if (zmx_backend) |zmx| + var io_backend: termio.Backend = if (zmx_backend) |zmx| .{ .zmx = zmx } else if (use_manual_io) manual_backend: { var io_manual = try termio.Manual.init(alloc, .{ .write_cb = manual_write_cb, .write_userdata = manual_write_userdata, @@ _ = &io_exec; break :exec_backend .{ .exec = io_exec }; }; + errdefer io_backend.deinit(); var io_mailbox = try termio.Mailbox.initSPSC(alloc); errdefer io_mailbox.deinit(alloc); + + var io_config = try termio.Termio.DerivedConfig.init(alloc, config); + errdefer io_config.deinit(); try termio.Termio.init(&self.io, alloc, .{ .size = size, .full_config = config, - .config = try termio.Termio.DerivedConfig.init(alloc, config), + .config = io_config, .backend = io_backend, .mailbox = io_mailbox, .renderer_state = &self.renderer_state, .renderer_wakeup = render_thread.wakeup, .renderer_mailbox = render_thread.mailbox,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/Surface.zig` around lines 632 - 698, The code creates termio backends (termio.Zmx.init, termio.Manual.init, termio.Exec.init) and DerivedConfig via termio.Termio.DerivedConfig.init but only errdefer's the mailbox, so if termio.Termio.init fails those backend objects and the derived config will leak; add cleanup errdefer guards for any created backend (zmx_backend, io_manual, io_exec) and for the DerivedConfig result, then clear/reset those guards (or set the owned values to neutral/transfer ownership) immediately after successful Termio.init so Termio takes responsibility; locate the backend construction paths around io_backend and the DerivedConfig init used in Termio.init and ensure corresponding deinit/cleanup calls are invoked on error paths.
🧹 Nitpick comments (1)
src/config/Config.zig (1)
1541-1545: Add a smoke test for the new public config keys.These are new user-facing config entries, but there isn't a focused
Config.loadItertest here for the key names plus the default/override path. A tiny parser test would catch typo or wiring regressions early.Example test
test "parse zmx config" { const testing = std.testing; { var cfg = try Config.default(testing.allocator); defer cfg.deinit(); var it: TestIterator = .{ .data = &.{ "--zmx-session=session-1", } }; try cfg.loadIter(testing.allocator, &it); try cfg.finalize(); try testing.expectEqualStrings("session-1", cfg.@"zmx-session".?); try testing.expect(cfg.@"zmx-create"); } { var cfg = try Config.default(testing.allocator); defer cfg.deinit(); var it: TestIterator = .{ .data = &.{ "--zmx-session=session-1", "--zmx-create=false", } }; try cfg.loadIter(testing.allocator, &it); try cfg.finalize(); try testing.expectEqualStrings("session-1", cfg.@"zmx-session".?); try testing.expect(!cfg.@"zmx-create"); } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/config/Config.zig` around lines 1541 - 1545, Add a focused unit test that exercises Config.loadIter and Config.finalize to verify the new public keys @"zmx-session" and @"zmx-create" parse and default correctly: create Config.default(allocator) instances, feed a TestIterator with "--zmx-session=session-1" to ensure cfg.@"zmx-session" == "session-1" and cfg.@"zmx-create" defaults to true, then a second case with "--zmx-session=session-1" plus "--zmx-create=false" to ensure cfg.@"zmx-create" becomes false; deinit the configs and use testing.expect/testing.expectEqualStrings to assert values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/apprt/embedded.zig`:
- Around line 486-487: The default value for zmx_create in the surface config
causes ghostty_surface_config_new() to opt callers into creating a zmx session
even when they only pass zmx_session and diverges from C zero-init; change the
default for the field zmx_create to false in the config definition so
zero-initialized instances behave correctly and callers that explicitly want
creation must set config.zmx_create = true (see the zmx_create field and
ghostty_surface_config_new and zmx_session usage to locate the change).
- Around line 601-615: When opts.zmx_session is non-null but empty we must treat
it as unset so zmx_mode can auto-generate a session; change the branching around
opts.zmx_session/opts.zmx_mode so that after calling std.mem.sliceTo on
opts.zmx_session you only assign config.@"zmx-session" and config.@"zmx-create"
when session.len > 0, and if session.len == 0 and opts.zmx_mode is true fall
through to the existing auto-generate logic (reuse std.crypto.random.int,
config.arenaAlloc, and std.fmt.allocPrint to create the "cmux-{x}" session and
set config.@"zmx-session" and config.@"zmx-create" = true). Ensure no early
return or else-block prevents the zmx_mode branch when the provided session
string is empty.
In `@src/config/Config.zig`:
- Around line 1537-1541: Update the doc comment for the configuration field
@"zmx-session" (the doc block immediately above the declaration @"zmx-session":
?[]const u8 = null) to clarify that this is not an unconditional switch away
from spawning a shell: retain the existing description of connecting to a zmx
daemon and session persistence, and add a sentence stating that if the named zmx
session or daemon is unavailable the code will fall back to using exec (spawning
a local shell) and emit a warning to the user; ensure the wording is concise and
suitable for generated manual output.
In `@src/termio/Zmx.zig`:
- Around line 623-642: handleDisconnect currently synthesizes a .child_exited
with exit_code = 0 which causes Surface.childExited to treat unexpected zmx
disconnects as clean exits; instead add a dedicated mailbox variant (e.g.,
.child_disconnected or a .child_exited variant with an explicit .abnormal flag)
to the surface_mailbox message union, update disconnectMetadata/handleDisconnect
to push that new variant including runtime_ms and a reason sentinel, and
implement a new Surface.childDisconnected (or branch in Surface.childExited that
checks the new flag) to surface the abnormal disconnect path; update the other
similar call sites (around the other block noted at 727-736) to push the new
variant as well.
- Around line 229-250: The code currently probes the filesystem with
std.fs.accessAbsolute(socket_path) when self.create_if_missing is true, which
can succeed for a stale socket file; instead, after creating or seeing the path,
attempt a real TCP connect to verify readiness and only treat the socket as
ready on successful posix.connect; update the logic in the block around
create_if_missing (referencing create_if_missing, accessAbsolute, createSession,
posix.socket, posix.connect) to: 1) if accessAbsolute fails with FileNotFound
call createSession(socket_path) and then loop with a short backoff trying
posix.connect on the socket until it succeeds or a timeout/attempt limit is
reached, 2) if accessAbsolute succeeds, try posix.connect immediately and only
proceed when connect succeeds (otherwise retry similarly), and 3) remove the
blocking waitpid(pid, 0) readiness dependency and replace it with the
connect-based readiness check; apply the same connect-based readiness change to
the other similar block around lines 698-715.
- Around line 756-763: The test named "socket path resolution with ZMX_DIR"
never sets ZMX_DIR so it doesn't validate the precedence branch; update the code
so resolveSocketDir can be exercised deterministically: either refactor
resolveSocketDir to accept an environment-provider parameter (e.g., a function
or an env map) or add an optional zmxDir override argument, update all call
sites to pass the real env-provider, and then in the test call the new
resolveSocketDir variant with a synthetic environment that contains ZMX_DIR (or
set/unset the process env around the test and restore it). Reference
resolveSocketDir and the test "socket path resolution with ZMX_DIR" when making
the changes.
---
Outside diff comments:
In `@src/Surface.zig`:
- Around line 632-698: The code creates termio backends (termio.Zmx.init,
termio.Manual.init, termio.Exec.init) and DerivedConfig via
termio.Termio.DerivedConfig.init but only errdefer's the mailbox, so if
termio.Termio.init fails those backend objects and the derived config will leak;
add cleanup errdefer guards for any created backend (zmx_backend, io_manual,
io_exec) and for the DerivedConfig result, then clear/reset those guards (or set
the owned values to neutral/transfer ownership) immediately after successful
Termio.init so Termio takes responsibility; locate the backend construction
paths around io_backend and the DerivedConfig init used in Termio.init and
ensure corresponding deinit/cleanup calls are invoked on error paths.
---
Nitpick comments:
In `@src/config/Config.zig`:
- Around line 1541-1545: Add a focused unit test that exercises Config.loadIter
and Config.finalize to verify the new public keys @"zmx-session" and
@"zmx-create" parse and default correctly: create Config.default(allocator)
instances, feed a TestIterator with "--zmx-session=session-1" to ensure
cfg.@"zmx-session" == "session-1" and cfg.@"zmx-create" defaults to true, then a
second case with "--zmx-session=session-1" plus "--zmx-create=false" to ensure
cfg.@"zmx-create" becomes false; deinit the configs and use
testing.expect/testing.expectEqualStrings to assert values.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6f5c2d42-be8e-441f-ac18-00f33ebfd76a
📒 Files selected for processing (7)
include/ghostty.hsrc/Surface.zigsrc/apprt/embedded.zigsrc/config/Config.zigsrc/termio.zigsrc/termio/Zmx.zigsrc/termio/backend.zig
| /// Connect to a zmx daemon session instead of spawning a new shell process. | ||
| /// When set, the surface connects to the named zmx session over a Unix | ||
| /// domain socket. The zmx daemon owns the PTY and persists independently | ||
| /// of the surface, enabling session persistence across restarts. | ||
| @"zmx-session": ?[]const u8 = null, |
There was a problem hiding this comment.
Document the exec fallback here.
zmx-session is not an unconditional switch away from spawning a shell: the PR behavior falls back to exec with a warning when zmx is unavailable. Since these comments feed the generated manual, the current wording overpromises behavior.
Suggested wording
-/// Connect to a zmx daemon session instead of spawning a new shell process.
-/// When set, the surface connects to the named zmx session over a Unix
+/// Prefer connecting to a zmx daemon session instead of spawning a new shell
+/// process.
+/// When set, the surface connects to the named zmx session over a Unix
/// domain socket. The zmx daemon owns the PTY and persists independently
/// of the surface, enabling session persistence across restarts.
+/// If zmx is unavailable, Ghostty falls back to the exec backend and logs
+/// a warning.
@"zmx-session": ?[]const u8 = null,📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /// Connect to a zmx daemon session instead of spawning a new shell process. | |
| /// When set, the surface connects to the named zmx session over a Unix | |
| /// domain socket. The zmx daemon owns the PTY and persists independently | |
| /// of the surface, enabling session persistence across restarts. | |
| @"zmx-session": ?[]const u8 = null, | |
| /// Prefer connecting to a zmx daemon session instead of spawning a new shell | |
| /// process. | |
| /// When set, the surface connects to the named zmx session over a Unix | |
| /// domain socket. The zmx daemon owns the PTY and persists independently | |
| /// of the surface, enabling session persistence across restarts. | |
| /// If zmx is unavailable, Ghostty falls back to the exec backend and logs | |
| /// a warning. | |
| @"zmx-session": ?[]const u8 = null, |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/config/Config.zig` around lines 1537 - 1541, Update the doc comment for
the configuration field @"zmx-session" (the doc block immediately above the
declaration @"zmx-session": ?[]const u8 = null) to clarify that this is not an
unconditional switch away from spawning a shell: retain the existing description
of connecting to a zmx daemon and session persistence, and add a sentence
stating that if the named zmx session or daemon is unavailable the code will
fall back to using exec (spawning a local shell) and emit a warning to the user;
ensure the wording is concise and suitable for generated manual output.
src/termio/Zmx.zig
Outdated
| // Session creation if needed | ||
| if (self.create_if_missing) { | ||
| // Check if socket already exists | ||
| std.fs.accessAbsolute(socket_path, .{}) catch |err| switch (err) { | ||
| // Socket doesn't exist — create the session | ||
| error.FileNotFound => try self.createSession(socket_path), | ||
|
|
||
| // Other errors are real problems (permissions, bad path, etc.) | ||
| else => return err, | ||
| }; | ||
| } | ||
|
|
||
| // Connect to Unix domain socket | ||
| const sock = try posix.socket( | ||
| posix.AF.UNIX, | ||
| posix.SOCK.STREAM | posix.SOCK.CLOEXEC, | ||
| 0, | ||
| ); | ||
| errdefer posix.close(sock); | ||
|
|
||
| const addr = try std.net.Address.initUnix(socket_path); | ||
| try posix.connect(sock, &addr.any, addr.getOsSockLen()); |
There was a problem hiding this comment.
Probe the socket, not just the path, when zmx-create is enabled.
Both the pre-create fast path and the readiness loop use accessAbsolute(socket_path) as the success signal. A stale socket file, or one that exists before the daemon starts accepting connections, will pass those checks and then fail at the later connect(). The blocking waitpid(pid, 0) after the path appears also makes readiness depend on launcher lifetime again. Retry/return based on a successful connect() instead.
Also applies to: 698-715
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/termio/Zmx.zig` around lines 229 - 250, The code currently probes the
filesystem with std.fs.accessAbsolute(socket_path) when self.create_if_missing
is true, which can succeed for a stale socket file; instead, after creating or
seeing the path, attempt a real TCP connect to verify readiness and only treat
the socket as ready on successful posix.connect; update the logic in the block
around create_if_missing (referencing create_if_missing, accessAbsolute,
createSession, posix.socket, posix.connect) to: 1) if accessAbsolute fails with
FileNotFound call createSession(socket_path) and then loop with a short backoff
trying posix.connect on the socket until it succeeds or a timeout/attempt limit
is reached, 2) if accessAbsolute succeeds, try posix.connect immediately and
only proceed when connect succeeds (otherwise retry similarly), and 3) remove
the blocking waitpid(pid, 0) readiness dependency and replace it with the
connect-based readiness check; apply the same connect-based readiness change to
the other similar block around lines 698-715.
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (2)
src/apprt/embedded.zig (1)
486-487:⚠️ Potential issue | 🟡 MinorDefault
zmx_createtofalsefor safer opt-in semantics.With
zmx_create: bool = true, callers who only setzmx_sessionare implicitly opted into session creation. This diverges from C zero-initialization behavior and could cause unexpected session creation. Consider defaulting tofalseso callers must explicitly enable creation.Suggested fix
- zmx_create: bool = true, + zmx_create: bool = false,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/apprt/embedded.zig` around lines 486 - 487, Change the default for the zmx_create field from true to false so callers are not implicitly opted into session creation; locate the zmx_create: bool = true initializer in the struct or options block in embedded.zig and set it to false (i.e., zmx_create: bool = false) so callers must explicitly enable session creation when using zmx_session.src/termio/Zmx.zig (1)
724-740:⚠️ Potential issue | 🟠 MajorReturn as soon as the socket is connectable.
Lines 735-739 still block on
waitpid(pid, 0)aftersocketReady(socket_path)succeeds. That couples attach readiness back to the lifetime ofzmx run <session>, so a launcher that stays alive a bit longer can still hang surface init even though the socket is already usable.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/termio/Zmx.zig` around lines 724 - 740, The code currently does a blocking posix.waitpid(pid, 0) after socketReady(socket_path) succeeds, which can hang startup; replace that blocking wait with a non-blocking reap (posix.waitpid(pid, std.c.W.NOHANG)) or simply skip the blocking reap and return immediately when socketReady succeeds, ensuring reaped is set only if the non-blocking wait reports the child exited. Update the logic around reaped, waitpid, socketReady, and pid so the function returns as soon as socketReady(socket_path) is true and only uses WNOHANG to detect/reap the child without blocking.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/Surface.zig`:
- Around line 1325-1341: childDisconnected currently falls back to the generic
self.childExitedAbnormally(info) which triggers the terminal fallback UI;
instead invoke the zmx-specific disconnect renderer implemented in
src/termio/Zmx.zig. Replace the call to childExitedAbnormally(info) with a call
into the Zmx disconnect handler (for example, call the Zmx module/method exposed
on your termio or zmx object such as self.termio.zmx.childDisconnected(info) or
Zmx.childDisconnected(self.termio.zmx, info)), preserving the existing error
logging pattern (catch |err| log.err(...)). Ensure you reference the Zmx
disconnect symbol from Zmx.zig and remove the generic childExitedAbnormally
branch for zmx backends.
In `@src/termio/Zmx.zig`:
- Around line 121-133: SocketBuffer.next currently treats any hdr.len as
trustable and returns null for both "need more bytes" and "invalid/oversized
frame", causing indefinite buffering; add a sane maximum frame size constant
(e.g., MAX_IPC_FRAME) and, inside fn next (SocketBuffer.next) after computing
total = hdr_size + hdr.len, check if total > MAX_IPC_FRAME (or hdr.len >
MAX_IPC_FRAME - hdr_size) and if so treat the frame as invalid: perform a hard
disconnect/cleanup on the offending socket buffer (call the existing
disconnect/close path on the SocketBuffer/owner) and return null or an explicit
error if the surrounding API supports it; ensure you reference IpcHeader/hdr.len
and update any tests to expect a disconnect on oversized frames.
- Around line 54-73: ipcSend and writeAll perform direct posix.write calls which
can partial-write or hit EWOULDBLOCK and bypass queued frames; replace all
direct uses (calls from resize(), threadExit(), and the functions themselves) so
outbound IPC is enqueued via the serialized writer path (use queueWrite) instead
of calling posix.write directly. Update ipcSend to build the IpcHeader and then
enqueue the header bytes and payload together (or enqueue a single contiguous
buffer) through queueWrite, and remove/stop using writeAll for socket writes;
ensure error propagation matches the existing API and that queueWrite handles
WouldBlock/retry semantics so framing and ordering are preserved.
- Around line 177-180: The current early check always returns error.ZmxNotFound
when findZmxBinary() fails, preventing socket probing and breaking attach-only
use; change the logic to only require the zmx binary when create_if_missing is
true (i.e., if (create_if_missing && !findZmxBinary()) return
error.ZmxNotFound), so that when create_if_missing is false the code still
proceeds to probe the Unix socket and allow attaching to an existing session
(ensure this change is applied where findZmxBinary() and create_if_missing are
used in this module).
---
Duplicate comments:
In `@src/apprt/embedded.zig`:
- Around line 486-487: Change the default for the zmx_create field from true to
false so callers are not implicitly opted into session creation; locate the
zmx_create: bool = true initializer in the struct or options block in
embedded.zig and set it to false (i.e., zmx_create: bool = false) so callers
must explicitly enable session creation when using zmx_session.
In `@src/termio/Zmx.zig`:
- Around line 724-740: The code currently does a blocking posix.waitpid(pid, 0)
after socketReady(socket_path) succeeds, which can hang startup; replace that
blocking wait with a non-blocking reap (posix.waitpid(pid, std.c.W.NOHANG)) or
simply skip the blocking reap and return immediately when socketReady succeeds,
ensuring reaped is set only if the non-blocking wait reports the child exited.
Update the logic around reaped, waitpid, socketReady, and pid so the function
returns as soon as socketReady(socket_path) is true and only uses WNOHANG to
detect/reap the child without blocking.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 04c952e0-05da-4bb1-a2ab-4eb0a39bb402
📒 Files selected for processing (5)
src/Surface.zigsrc/apprt/embedded.zigsrc/apprt/surface.zigsrc/config/Config.zigsrc/termio/Zmx.zig
🚧 Files skipped from review as they are similar to previous changes (1)
- src/config/Config.zig
| fn ipcSend(fd: posix.fd_t, tag: IpcTag, data: []const u8) !void { | ||
| const header = IpcHeader{ | ||
| .tag = tag, | ||
| .len = @intCast(data.len), | ||
| }; | ||
| const header_bytes = std.mem.asBytes(&header); | ||
| try writeAll(fd, header_bytes); | ||
| if (data.len > 0) { | ||
| try writeAll(fd, data); | ||
| } | ||
| } | ||
|
|
||
| fn writeAll(fd: posix.fd_t, data: []const u8) !void { | ||
| var index: usize = 0; | ||
| while (index < data.len) { | ||
| const n = try posix.write(fd, data[index..]); | ||
| if (n == 0) return error.DiskQuota; | ||
| index += n; | ||
| } | ||
| } |
There was a problem hiding this comment.
Serialize all outbound IPC through one writer path.
After Line 532 makes the shared socket fd non-blocking, resize() and threadExit() still send frames via ipcSend()/writeAll() instead of the queued writer used by queueWrite(). That opens two failure modes: a direct posix.write can stop on WouldBlock after only part of a frame, and it can also bypass earlier queued Input frames on the same stream. Either case breaks zmx framing.
Also applies to: 299-307, 331-338, 532-543
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/termio/Zmx.zig` around lines 54 - 73, ipcSend and writeAll perform direct
posix.write calls which can partial-write or hit EWOULDBLOCK and bypass queued
frames; replace all direct uses (calls from resize(), threadExit(), and the
functions themselves) so outbound IPC is enqueued via the serialized writer path
(use queueWrite) instead of calling posix.write directly. Update ipcSend to
build the IpcHeader and then enqueue the header bytes and payload together (or
enqueue a single contiguous buffer) through queueWrite, and remove/stop using
writeAll for socket writes; ensure error propagation matches the existing API
and that queueWrite handles WouldBlock/retry semantics so framing and ordering
are preserved.
| /// Returns next complete IPC message or null. | ||
| fn next(self: *SocketBuffer) ?IpcSocketMsg { | ||
| const available = self.buf.items[self.head..]; | ||
| const hdr_size = @sizeOf(IpcHeader); | ||
| if (available.len < hdr_size) return null; | ||
|
|
||
| const hdr = std.mem.bytesToValue(IpcHeader, available[0..hdr_size]); | ||
| const total = hdr_size + hdr.len; | ||
| if (available.len < total) return null; | ||
|
|
||
| const pay = available[hdr_size..total]; | ||
| self.head += total; | ||
| return .{ .header = hdr, .payload = pay }; |
There was a problem hiding this comment.
Reject oversized IPC frames instead of buffering forever.
SocketBuffer.next() trusts hdr.len completely. A stale or misbound socket can advertise a huge payload, and the read loop will keep appending until OOM because “need more bytes” and “invalid frame size” are both represented as null. This needs a sane maximum frame size and a hard disconnect on violation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/termio/Zmx.zig` around lines 121 - 133, SocketBuffer.next currently
treats any hdr.len as trustable and returns null for both "need more bytes" and
"invalid/oversized frame", causing indefinite buffering; add a sane maximum
frame size constant (e.g., MAX_IPC_FRAME) and, inside fn next
(SocketBuffer.next) after computing total = hdr_size + hdr.len, check if total >
MAX_IPC_FRAME (or hdr.len > MAX_IPC_FRAME - hdr_size) and if so treat the frame
as invalid: perform a hard disconnect/cleanup on the offending socket buffer
(call the existing disconnect/close path on the SocketBuffer/owner) and return
null or an explicit error if the surrounding API supports it; ensure you
reference IpcHeader/hdr.len and update any tests to expect a disconnect on
oversized frames.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/termio/Zmx.zig (2)
54-73:⚠️ Potential issue | 🔴 CriticalSerialize all outbound IPC through the queued writer path.
After Line 533 puts the shared socket into non-blocking mode,
ipcSend()/writeAll()can fail mid-frame withWouldBlock, and they also bypass earlier queuedInputwrites. That can corrupt framing or reorderResize/Detachrelative to user input.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/termio/Zmx.zig` around lines 54 - 73, ipcSend and writeAll perform direct blocking writes and can fail with WouldBlock after the socket is set non-blocking, bypassing the existing queued writer and breaking framing/order; change ipcSend to build the full frame (header_bytes + data) and enqueue that frame into the existing queued writer path instead of calling writeAll/posix.write directly so the writer code (which already handles partial writes and WouldBlock) serializes IPC frames with other Output/Input writes; remove or stop using writeAll for IPC sends and reference ipcSend and writeAll in your change so reviewers can verify the enqueueing.
121-133:⚠️ Potential issue | 🟠 MajorReject oversized IPC frames instead of buffering forever.
SocketBuffer.next()trustshdr.lencompletely. A stale or misbound socket can advertise an arbitrarily large payload, and this loop will keep reallocating until OOM because "need more bytes" and "invalid frame size" are both treated asnull.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/termio/Zmx.zig` around lines 121 - 133, SocketBuffer.next currently trusts hdr.len and keeps returning null for both "need more bytes" and "invalid huge length", allowing a malicious/stale socket to cause unbounded buffering; add an explicit validation after reading IpcHeader to reject absurd payload sizes: check hdr.len against a fixed sensible MAX_IPC_PAYLOAD constant and, if hdr.len > MAX_IPC_PAYLOAD (or hdr_size + hdr.len would overflow/self.buf bounds), treat it as a fatal/invalid frame (return an error variant or drop/reset the buffer and advance head past the header to resynchronize) instead of returning null; update SocketBuffer.next, referencing IpcHeader and hdr.len, to distinguish "need more bytes" (null) from "invalid frame size" (error/clear) so the loop cannot allocate forever.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/termio/Zmx.zig`:
- Around line 293-311: threadExit currently nulls self.socket_fd without closing
it, leaking the socket; after joining zmx_td.read_thread and before setting
self.socket_fd = null, call posix.close on zmx_td.socket_fd (handle errors
similarly to other posix calls) and then set the field to null. Apply the same
pattern in ThreadData.deinit (and any cleanup in Zmx.deinit): explicitly close
zmx_td.socket_fd (or the ThreadData's socket_fd) via posix.close with
appropriate error handling before clearing the file descriptor field.
---
Duplicate comments:
In `@src/termio/Zmx.zig`:
- Around line 54-73: ipcSend and writeAll perform direct blocking writes and can
fail with WouldBlock after the socket is set non-blocking, bypassing the
existing queued writer and breaking framing/order; change ipcSend to build the
full frame (header_bytes + data) and enqueue that frame into the existing queued
writer path instead of calling writeAll/posix.write directly so the writer code
(which already handles partial writes and WouldBlock) serializes IPC frames with
other Output/Input writes; remove or stop using writeAll for IPC sends and
reference ipcSend and writeAll in your change so reviewers can verify the
enqueueing.
- Around line 121-133: SocketBuffer.next currently trusts hdr.len and keeps
returning null for both "need more bytes" and "invalid huge length", allowing a
malicious/stale socket to cause unbounded buffering; add an explicit validation
after reading IpcHeader to reject absurd payload sizes: check hdr.len against a
fixed sensible MAX_IPC_PAYLOAD constant and, if hdr.len > MAX_IPC_PAYLOAD (or
hdr_size + hdr.len would overflow/self.buf bounds), treat it as a fatal/invalid
frame (return an error variant or drop/reset the buffer and advance head past
the header to resynchronize) instead of returning null; update
SocketBuffer.next, referencing IpcHeader and hdr.len, to distinguish "need more
bytes" (null) from "invalid frame size" (error/clear) so the loop cannot
allocate forever.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d6dcf67a-ae18-4071-b181-43faaffa7bf4
📒 Files selected for processing (3)
src/Surface.zigsrc/apprt/embedded.zigsrc/termio/Zmx.zig
| pub fn threadExit(self: *Zmx, td: *termio.Termio.ThreadData) void { | ||
| const zmx_td = &td.backend.zmx; | ||
|
|
||
| // Signal read thread that upcoming EOF from Detach is expected | ||
| zmx_td.shutting_down.store(true, .release); | ||
|
|
||
| // Send Detach — keeps session alive for reconnection | ||
| ipcSend(zmx_td.socket_fd, .Detach, &.{}) catch |err| { | ||
| log.warn("error sending detach err={}", .{err}); | ||
| }; | ||
|
|
||
| // Signal and join read thread | ||
| _ = posix.write(zmx_td.read_thread_pipe, "x") catch |err| switch (err) { | ||
| error.BrokenPipe => {}, | ||
| else => log.warn("error writing to read thread quit pipe err={}", .{err}), | ||
| }; | ||
| zmx_td.read_thread.join(); | ||
| self.socket_fd = null; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Zmx shutdown path ==="
rg -n -C2 --type zig 'socket_fd|write_stream\.deinit|posix\.close\(' src/termio/Zmx.zig
echo
echo "=== xev Stream ownership (if vendored in repo) ==="
fd 'xev*.zig' . -x rg -n -C3 --type zig '\binitFd\b|\bdeinit\b|\bclose\b' {}Repository: manaflow-ai/ghostty
Length of output: 3338
Close the socket fd explicitly before clearing it in threadExit().
threadExit() sets self.socket_fd = null without calling posix.close() first. This causes the file descriptor to leak because Zmx.deinit() will not close a null fd. The socket must be explicitly closed after the read thread is joined and before setting it to null.
This also applies to ThreadData.deinit() (lines 506–512), which has similar cleanup responsibilities and should ensure all resources held by the thread data are properly released.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/termio/Zmx.zig` around lines 293 - 311, threadExit currently nulls
self.socket_fd without closing it, leaking the socket; after joining
zmx_td.read_thread and before setting self.socket_fd = null, call posix.close on
zmx_td.socket_fd (handle errors similarly to other posix calls) and then set the
field to null. Apply the same pattern in ThreadData.deinit (and any cleanup in
Zmx.deinit): explicitly close zmx_td.socket_fd (or the ThreadData's socket_fd)
via posix.close with appropriate error handling before clearing the file
descriptor field.
Summary
This PR adds a new
zmxtermio backend to Ghostty and wires it through config + embedded APIs so surfaces can attach to persistent zmx sessions instead of always spawning a direct PTY subprocess.Why
cmux persistence work (
manaflow-ai/cmux#143,manaflow-ai/cmux#480) needs terminal processes to outlive the app. The existingexecbackend ties process lifetime to the Ghostty surface, so restart always loses shell state. zmx provides a daemon-owned PTY/session model that can survive app restarts and reconnect cleanly.What Changed
1) New zmx backend in termio
src/termio/Zmx.zig.zmxvariant tosrc/termio/backend.zigunions/dispatch.src/termio.zig.Backend behavior:
zmx run <session>) when configured..child_exitedmetadata so surface UX remains coherent.2) Surface/backend selection and fallback
src/Surface.zigto choose zmx backend whenzmx-sessionis set.execwith warning (no crash/no hard failure).3) Config + embedded API surface
src/config/Config.zig:zmx-sessionzmx-createsrc/apprt/embedded.zigand synced C header:zmx_sessionzmx_createzmx_modeinclude/ghostty.hto keep C API struct in sync.zmx_modeis used for split/tab inheritance in embedded mode: new surfaces created from a zmx-backed surface stay in zmx mode, each with a fresh session.4) Build compatibility in this branch
uucodefallback vendoring path and compatibility wiring (deps-uucode, unicode build updates) to keep fork builds stable in environments where fetching unicode dependency sources is constrained.Design Rationale
exec, so behavior stays isolated and maintainable.exec.Validation
Zmx.zigincludes targeted tests for IPC framing/buffer behavior and disconnect metadata.Related
manaflow-ai/cmux#480manaflow-ai/cmux#143Summary by CodeRabbit
New Features
Behavior
Tests