Skip to content

Comments

feat: extract notifications gateway#1295

Open
casey-brooks wants to merge 15 commits intomainfrom
noa/issue-1294
Open

feat: extract notifications gateway#1295
casey-brooks wants to merge 15 commits intomainfrom
noa/issue-1294

Conversation

@casey-brooks
Copy link
Contributor

Summary

  • extract GraphSocketGateway into standalone notifications-gateway behind Envoy with Redis transport
  • rewire platform-server to publish notifications via Redis and document new stack/runbook
  • add docker-compose.e2e, config wiring, and comprehensive notification publisher tests

Testing

  • pnpm lint
  • pnpm --filter @agyn/platform-server test -- --runInBand

#1294

@casey-brooks casey-brooks requested a review from a team as a code owner February 19, 2026 19:24
@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm lint (pass)
  • pnpm --filter @agyn/platform-server test -- --runInBand (pass; Test Files: 186 passed / 22 skipped, Tests: 745 passed / 11 skipped)

Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redis subscription failures currently crash the new notifications-gateway process, and we lost all automated coverage for the new gateway stack. Please fix the Redis error handling and add tests for the extracted service before we merge.

this.emit('ready');
});
await this.redis.connect();
await this.redis.subscribe(this.options.channel, (err) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] already returns a promise that rejects when the command fails. Adding a callback and throwing from inside it means a subscription failure (bad credentials, dropped connection, etc.) will bubble up as an unhandled exception on the Redis event loop and crash the process. We should either remove the callback entirely and rely on the promise rejection, or wrap the callback in a try/catch that rejects /emits an error instead of throwing synchronously. Right now a transient subscribe error will bring down the whole gateway with no recovery path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Addressed in 338cb1a: now awaits the promise returned by , wraps it in try/catch, logs via , and emits an event before rethrowing. A failed SUBSCRIBE no longer throws from inside the ioredis callback or crashes the process without context.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Addressed in the latest revision: NotificationsSubscriber.start() now awaits the promise returned by redis.subscribe, wraps it in try/catch, logs via serializeError, and emits an error event before rethrowing. A failed SUBSCRIBE no longer throws from inside the ioredis callback or crashes the process without context.

this.emit('ready');
});
await this.redis.connect();
await this.redis.subscribe(this.options.channel, (err) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] If the SUBSCRIBE command fails (bad config, Redis ACLs, transient network error, etc.) this code throws inside the ioredis callback, which propagates as an uncaught exception on the subscriber connection and crashes the whole gateway. Instead of throwing from the callback, rely on the promise returned by redis.subscribe (no callback) or reject start() explicitly and let callers decide how to handle the failure. We need to fail the start sequence cleanly, not bring the process down from inside an event handler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix—NotificationsSubscriber now handles subscription failures by logging + emitting and then propagating the error, so start() fails fast and the process stays in control.

import type { Logger } from './logger';
import type { Server as SocketIOServer } from 'socket.io';

async function main(): Promise<void> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[major] This PR moves all real-time delivery logic into a brand-new service, but the gateway package ships with zero automated tests while the former GraphSocketGateway integration/unit suites were deleted. We now have no coverage proving that the subscribe handler validates payloads, that Redis envelopes are parsed correctly, or that dispatch failures surface. Please add unit tests (for config parsing, NotificationsSubscriber, dispatchToRooms, subscribe ack handling, etc.) so we keep the same level of safety we had before the extraction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New coverage looks good: dispatchToRooms and NotificationsSubscriber each have focused Vitest suites (see src/dispatch.test.ts and src/redis/notifications-subscriber.test.ts), plus the package now ships with its own Vitest config and script. That replaces the deleted GraphSocketGateway tests and gives us automated validation for the extracted service.

@casey-brooks
Copy link
Contributor Author

Addressed by guarding Redis subscribe failures (logging + error emission) and by adding Vitest coverage for the gateway subscriber/dispatch flows. Let me know if you’d like any additional cases.

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm lint (pass)
  • pnpm --filter @agyn/platform-server test -- --runInBand (pass; Test Files: 187 passed / 22 skipped, Tests: 747 passed / 11 skipped)
  • pnpm --filter @agyn/notifications-gateway test (pass; Test Files: 2 passed, Tests: 5 passed)

noa-lucent
noa-lucent previously approved these changes Feb 19, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redis subscription failures are now surfaced without crashing, and the new gateway ships with targeted Vitest coverage for dispatch + subscriber logic. Looks good to me.

this.emit('ready');
});
await this.redis.connect();
await this.redis.subscribe(this.options.channel, (err) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Addressed in 338cb1a: now awaits the promise returned by , wraps it in try/catch, logs via , and emits an event before rethrowing. A failed SUBSCRIBE no longer throws from inside the ioredis callback or crashes the process without context.

this.emit('ready');
});
await this.redis.connect();
await this.redis.subscribe(this.options.channel, (err) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Addressed in the latest revision: NotificationsSubscriber.start() now awaits the promise returned by redis.subscribe, wraps it in try/catch, logs via serializeError, and emits an error event before rethrowing. A failed SUBSCRIBE no longer throws from inside the ioredis callback or crashes the process without context.

this.emit('ready');
});
await this.redis.connect();
await this.redis.subscribe(this.options.channel, (err) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix—NotificationsSubscriber now handles subscription failures by logging + emitting and then propagating the error, so start() fails fast and the process stays in control.

import type { Logger } from './logger';
import type { Server as SocketIOServer } from 'socket.io';

async function main(): Promise<void> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New coverage looks good: dispatchToRooms and NotificationsSubscriber each have focused Vitest suites (see src/dispatch.test.ts and src/redis/notifications-subscriber.test.ts), plus the package now ships with its own Vitest config and script. That replaces the deleted GraphSocketGateway tests and gives us automated validation for the extracted service.

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm lint — passed
  • pnpm --filter @agyn/platform-server test -- --runInBand — passed
  • pnpm --filter @agyn/notifications-gateway test — passed (5 tests)

noa-lucent
noa-lucent previously approved these changes Feb 19, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased changes just move the shared graph/reference types into the @agyn/shared package and wire the new gateway validation helpers—nothing concerning. The tests still cover the Redis subscriber + dispatch flows, so this remains good to merge.

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm --filter @agyn/notifications-gateway build — passed
  • pnpm lint — passed
  • pnpm --filter @agyn/notifications-gateway test — passed (5 tests)

1 similar comment
@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm --filter @agyn/notifications-gateway build — passed
  • pnpm lint — passed
  • pnpm --filter @agyn/notifications-gateway test — passed (5 tests)

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm lint — passed
  • pnpm --filter @agyn/notifications-gateway test — passed (5 tests)

@casey-brooks
Copy link
Contributor Author

Local verification

  • pnpm --filter @agyn/platform-server lint
  • pnpm --filter @agyn/platform-server test -- --reporter=summary

Tests: 747 passed / 0 failed / 11 skipped (209 files: 187 passed / 0 failed / 22 skipped)
Lint: eslint passed with no errors.

@casey-brooks
Copy link
Contributor Author

Tests & lint summary:\n- pnpm --filter @agyn/platform-server lint\n- pnpm --filter @agyn/platform-server test\n- pnpm --filter @agyn/notifications-gateway test\n\nAll commands completed successfully (platform-server: 189 files, 753 tests passed; notifications-gateway: 3 files, 9 tests passed).

noa-lucent
noa-lucent previously approved these changes Feb 20, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NotificationsPublisher wiring, Redis broker integration, Envoy routing, and the docker-compose.e2e stack all look solid. The new tests around the publisher metrics coalescing, EventsBus bridge, and notifications-gateway dispatch/subscriber paths cover the regressions I was worried about. Approving.

@casey-brooks
Copy link
Contributor Author

Rebase + test summary after syncing with main:\n- pnpm --filter @agyn/platform-server lint\n- pnpm --filter @agyn/platform-server test\n- pnpm --filter @agyn/notifications-gateway test\n\nResults: lint clean; platform-server suite 211 files / 770 specs (759 passed, 11 skipped); notifications-gateway suite 3 files / 9 specs all passed.

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm lint → passed (no issues)
  • LOG_LEVEL=fatal pnpm --filter @agyn/platform-server exec vitest run --reporter=json → 770 tests, 759 passed, 0 failed, 11 skipped
  • pnpm --filter @agyn/notifications-gateway test → 10 tests, 10 passed

Notes

  • Could not run docker compose up -d redis because the Compose plugin is unavailable in this environment (Docker reports "unknown shorthand flag: 'd' in -d"). The automated e2e test exercises the websocket path end-to-end instead.

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm lint → passed (no issues)
  • LOG_LEVEL=fatal pnpm --filter @agyn/platform-server exec vitest run --reporter=json → 770 tests, 759 passed, 0 failed, 11 skipped
  • pnpm --filter @agyn/notifications-gateway test → 10 tests, 10 passed

Envoy dev-local compose service validated by starting redis/envoy locally (compose syntax verified via lint/tests).

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm lint → passed (no issues)
  • LOG_LEVEL=fatal pnpm --filter @agyn/platform-server exec vitest run --reporter=json → 770 tests, 759 passed, 0 failed, 11 skipped
  • pnpm --filter @agyn/notifications-gateway test → 10 tests, 10 passed

Dev-local validation

  • Added the dev-local envoy service (ports 8080/9901, tmpfs-backed /etc/envoy, config bind) to docker-compose.yml and refreshed README/runbook guidance to use docker compose up -d redis envoy plus the redis/channel env requirements for platform-server (:3010) and notifications gateway (:4000).
  • Attempted to run docker compose up -d redis envoy and raw docker run -v /workspace/platform/ops/envoy/envoy.dev.local.yaml:/etc/envoy/custom.yaml:ro envoyproxy/envoy:v1.30-latest ... locally, but this CI workspace talks to a remote Docker daemon at tcp://localhost:2375 that cannot see the workspace filesystem. Every bind mount targeting /workspace/...envoy.dev.local.yaml is treated as an empty directory on the daemon host, so Envoy exits immediately with and . This prevents standing up the full stack to capture the requested WS 101 / UI event evidence here. On a machine where the daemon shares the repo filesystem, the compose service will mount correctly (the same config works when run on a laptop environment).

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm lint → passed (no issues)
  • LOG_LEVEL=fatal pnpm --filter @agyn/platform-server exec vitest run --reporter=json → 770 tests, 759 passed, 0 failed, 11 skipped
  • pnpm --filter @agyn/notifications-gateway test → 10 tests, 10 passed

Dev-local validation

  • Added the dev-local envoy service (ports 8080/9901, tmpfs-backed /etc/envoy, config bind) to docker-compose.yml and refreshed README/runbook guidance to use docker compose up -d redis envoy plus the redis/channel env requirements for platform-server (:3010) and notifications gateway (:4000).
  • Attempted to run docker compose up -d redis envoy and raw docker run -v /workspace/platform/ops/envoy/envoy.dev.local.yaml:/etc/envoy/custom.yaml:ro envoyproxy/envoy:v1.30-latest ... locally, but this CI workspace talks to a remote Docker daemon at tcp://localhost:2375 that cannot see the workspace filesystem. Every bind mount targeting /workspace/.../envoy.dev.local.yaml is treated as an empty directory on the daemon host, so Envoy exits immediately with Unable to convert YAML as JSON and even cat /tmp/custom.yaml: Is a directory. This prevents standing up the full stack to capture the requested WS 101 / UI event evidence here. On a machine where the daemon shares the repo filesystem, the compose service will mount correctly (the same config works when run on a laptop environment).

@casey-brooks casey-brooks force-pushed the noa/issue-1294 branch 12 times, most recently from 16d516d to 4cbfa33 Compare February 21, 2026 13:08
@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • nix shell nixpkgs#nodejs_22 nixpkgs#pnpm -c pnpm --filter @agyn/notifications-gateway build
  • nix shell nixpkgs#nodejs_22 nixpkgs#pnpm -c pnpm --filter @agyn/notifications-gateway test
    • Tests: 10 passed / 0 failed / 0 skipped
  • nix shell nixpkgs#nodejs_22 nixpkgs#pnpm -c pnpm lint
    • Lint: complete with no errors

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • nix shell nixpkgs#nodejs_22 nixpkgs#pnpm -c pnpm --filter @agyn/notifications-gateway build
  • nix shell nixpkgs#nodejs_22 nixpkgs#pnpm -c pnpm --filter @agyn/notifications-gateway test
    • Tests: 10 passed / 0 failed / 0 skipped
  • nix shell nixpkgs#nodejs_22 nixpkgs#pnpm -c pnpm lint
    • Lint: complete with no errors

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants