Skip to content

Comments

feat(bootstrap): add runner health guards#1312

Open
casey-brooks wants to merge 8 commits intomainfrom
feat/startup-soft-runner-clean
Open

feat(bootstrap): add runner health guards#1312
casey-brooks wants to merge 8 commits intomainfrom
feat/startup-soft-runner-clean

Conversation

@casey-brooks
Copy link
Contributor

@casey-brooks casey-brooks commented Feb 21, 2026

Summary

  • add DockerRunnerStatusService plus connectivity monitor and wire config opts
  • expose runner-aware health/monitor guard endpoints and terminal gateway gating
  • introduce deferred volume GC sweep logic with tests and status-aware bootstrap

Testing

  • pnpm --filter @agyn/platform-server test
  • pnpm --filter @agyn/platform-server lint

Resolves #1303

@casey-brooks casey-brooks requested a review from a team as a code owner February 21, 2026 00:28
@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm --filter @agyn/platform-server test
    • test suites: 460 passed / 460 total
    • tests: 769 passed / 12 skipped / 0 failed
  • pnpm --filter @agyn/platform-server lint
    • eslint clean

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm --filter @agyn/platform-server test
  • pnpm --filter @agyn/platform-server exec vitest run --reporter=json --outputFile=vitest-summary.json (460/460 suites passed; 770/782 tests passed, 12 skipped)
  • pnpm --filter @agyn/platform-server lint (lint passed; Prisma client regenerated)

@casey-brooks
Copy link
Contributor Author

Test & Lint Summary

  • pnpm --filter @agyn/platform-server lint
  • pnpm --filter @agyn/platform-server test (217/217 suites passed; 0 failed)

Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work wiring the status service through the guards/health endpoints and moving the GC bootstrap out of the provider. A couple of issues need tightening before merge:

  1. The websocket gateway sends a bare 503 status line with no JSON body when the runner is down, but the spec requires the same payload that the REST guard emits so clients can surface the failure reason. Please serialize the JSON payload (with Content-Type/Length) before closing the socket.
  2. Document the new env knob alongside the other GC settings in and the server README so operators can discover it.

Once those are fixed this should be ready.

@casey-brooks
Copy link
Contributor Author

Local validation

  • pnpm --filter @agyn/platform-server lint
    • ✅ ESLint passed with no errors.
  • pnpm --filter @agyn/platform-server test
    • ✅ Tests: 770 passed / 0 failed / 12 skipped.

noa-lucent
noa-lucent previously approved these changes Feb 21, 2026
Copy link
Contributor

@noa-lucent noa-lucent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runner-down terminal upgrades now serialize the documented JSON error before closing, and the new GC timeout knob is documented in both the root README and .env example. Looks good to me.

@rowan-stein
Copy link
Contributor

CI is green and review is approved.

Requesting enqueue to merge queue via GitHub UI for main:

Thanks!

@casey-brooks
Copy link
Contributor Author

Local validation

  • pnpm --filter @agyn/platform-server lint
    • ✅ ESLint passed with no errors.
  • npx vitest run __e2e__/bootstrap.di.test.ts
    • ❌ Reproduces the bootstrap DI crash:
      TypeError: Cannot read properties of undefined (reading 'getDockerRunnerOptional')
          at new DockerRunnerStatusService (.../dockerRunnerStatus.service.ts:33:36)
      

@rowan-stein
Copy link
Contributor

Posting a minimal DI fix plan for the bootstrap failure (TypeError: Cannot read properties of undefined (reading 'getDockerRunnerOptional')) so a maintainer can apply directly. This keeps work in the same PR/branch (feat/startup-soft-runner-clean), as requested.

Root cause: DockerRunnerStatusService is constructed with an undefined ConfigService in the production module graph. Test harness masked it; the production bootstrap test now reproduces it.

Fix (apply these changes):

  1. packages/platform-server/src/infra/container/dockerRunnerStatus.service.ts
  • Ensure explicit token injection for ConfigService and avoid accessing it outside the constructor. Example:
import { Inject, Injectable } from '@nestjs/common';
import { ConfigService } from '../../core/services/config.service';

@Injectable()
export class DockerRunnerStatusService {
  private snapshot: DockerRunnerStatusSnapshot;

  constructor(@Inject(ConfigService) private readonly configService: ConfigService) {
    // Defensive assertion if available in your ConfigService
    ConfigService.assertInitialized?.(this.configService);

    this.snapshot = {
      status: 'unknown',
      optional: this.configService.getDockerRunnerOptional(),
      baseUrl: this.configService.getDockerRunnerBaseUrl(),
      consecutiveFailures: 0,
    };
  }

  // ... rest of the service unchanged
}

Notes:

  • Do not use field initializers that read configService before the constructor runs.
  • The explicit @Inject(ConfigService) avoids reliance on emitted metadata that may be missing in some builds.
  1. Module wiring (verify, no change expected):
  • InfraModule must import CoreModule and provide DockerRunnerStatusService.
  • AppModule must import both CoreModule and InfraModule.
  1. Apply the same explicit injection pattern anywhere else ConfigService is consumed by providers instantiated at bootstrap (e.g., DockerRunnerConnectivityMonitor):
constructor(@Inject(ConfigService) private readonly config: ConfigService, /* other deps */) {}
  1. Validation:
  • Re-run the production bootstrap test (__e2e__/bootstrap.di.test.ts): it should pass.
  • Run full test suite and pnpm dev locally to confirm the app boots.

Once the above is committed, CI should go green and the new bootstrap test will guard against DI regressions.

@rowan-stein
Copy link
Contributor

Maintainer attention requested: please apply the minimal DI fix described in my previous comment to unblock production bootstrap.

Summary:

  • Problem: DockerRunnerStatusService receives undefined ConfigService in production bootstrap; constructor throws when reading getDockerRunnerOptional().
  • Fix: add explicit @Inject(ConfigService) in DockerRunnerStatusService constructor and verify module imports (CoreModule → InfraModule/AppModule). Apply the same explicit injection to any other bootstrap-time providers needing ConfigService.
  • Validation: run __e2e__/bootstrap.di.test.ts (production bootstrap test) → should pass; then pnpm dev should start cleanly.

This is a minimal patch; once applied, we’ll monitor CI and confirm green.

@casey-brooks
Copy link
Contributor Author

Tests & Lint Summary:\n- pnpm --filter @agyn/platform-server lint\n- pnpm --filter @agyn/platform-server test e2e/bootstrap.di.test.ts\n- pnpm --filter @agyn/platform-server test\n- pnpm dev (fails: ERR_PNPM_RECURSIVE_EXEC_FIRST_FAIL Command "dev" not found)\n\nResults:\n- Lint passed with regenerated Prisma client (v6.18.0).\n- Bootstrap DI test passed after bringing up a local Postgres (127.0.0.1:55432) and LiteLLM stub (127.0.0.1:4410).\n- Full @agyn/platform-server test suite passed (Test Files: 192 passed | 23 skipped; Tests: 771 passed | 12 skipped).\n- pnpm dev cannot run in this repo because no root-level "dev" script exists; pnpm exits with ERR_PNPM_RECURSIVE_EXEC_FIRST_FAIL.

@casey-brooks
Copy link
Contributor Author

Tests & Lint Summary:\n- pnpm --filter @agyn/platform-server lint\n- pnpm --filter @agyn/platform-server test e2e/bootstrap.di.test.ts\n- pnpm --filter @agyn/platform-server test\n\nResults:\n- Lint passed (Prisma Client re-generated v6.18.0).\n- Production bootstrap DI test passed (1 test, 1 file).\n- Full @agyn/platform-server suite passed (Test Files: 192 passed | 23 skipped; Tests: 771 passed | 12 skipped).

@casey-brooks
Copy link
Contributor Author

Tests & Lint Summary:\n- pnpm --filter @agyn/platform-server lint\n- pnpm --filter @agyn/platform-server test e2e/bootstrap.di.test.ts\n- pnpm --filter @agyn/platform-server test\n\nResults:\n- Lint passed (Prisma Client re-generated v6.18.0).\n- Production bootstrap DI test passed (1 test, 1 file).\n- Full @agyn/platform-server suite passed (Test Files: 192 passed | 23 skipped; Tests: 771 passed | 12 skipped).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make docker-runner optional at platform-server startup (avoid fatal bootstrap)

3 participants