Make docker-runner optional at platform-server startup (avoid fatal bootstrap) #1303

@rowan-stein

User Request

Platform-server crashes during bootstrap if docker-runner is not started first. We need platform-server to keep running even when docker-runner is unavailable at startup. Do not silence the error: log it clearly, but do not crash.

Researcher Specification (by Emerson Gray)

Root Cause

  • DockerRunnerConnectivityProbe.onModuleInit() throws when HttpDockerRunnerClient.checkConnectivity() fails. That causes app.init() to reject, and src/index.ts then calls process.exit(1). This makes docker-runner a hard startup dependency.
    • Probe file: packages/platform-server/src/infra/container/dockerRunnerConnectivity.probe.ts
    • Client: packages/platform-server/src/infra/container/httpDockerRunner.client.ts (checks GET /v1/ready)
    • Bootstrap: packages/platform-server/src/index.ts (exits on init failure)

Proposed Changes (Soft dependency at startup)

  1. Replace fatal OnModuleInit probe with a non-fatal background monitor.
    • Add DockerRunnerStatusService to track status (unknown|up|down, last success/failure, error, next retry, consecutiveFailures).
    • Add DockerRunnerConnectivityMonitor (OnModuleInit) that runs a background loop to check connectivity with exponential backoff + jitter, logs failures with retry info, and never throws.
    • Ensure a single loop instance and stop it on module destroy.
  2. Graceful degradation for docker-runner dependent endpoints.
    • Introduce RequireDockerRunnerGuard (or interceptor): if status !== up, return HTTP 503 with JSON { error: { code: 'docker_runner_not_ready', message: 'docker-runner not ready' } }.
    • Apply guard to controllers/routes that require docker-runner (e.g., Containers operations, terminal/exec, image ensure).
  3. Health/readiness behavior.
    • Server readiness should be true once core is initialized even if docker-runner is down.
    • Extend /health to include dependencies.dockerRunner snapshot with status, baseUrl, lastError, times, and counters.
  4. Configuration knobs (defaults favor non-fatal startup):
    • DOCKER_RUNNER_OPTIONAL (default: true). If false, retain the current fail-fast behavior.
    • DOCKER_RUNNER_CONNECT_RETRY_BASE_DELAY_MS (default: 500)
    • DOCKER_RUNNER_CONNECT_RETRY_MAX_DELAY_MS (default: 30000)
    • DOCKER_RUNNER_CONNECT_RETRY_JITTER_MS (default: 250)
    • DOCKER_RUNNER_CONNECT_PROBE_INTERVAL_MS (default: 30000; the interval between probes while status is up)
    • DOCKER_RUNNER_CONNECT_MAX_RETRIES (default: 0, meaning unlimited). If set to a positive value and exceeded, keep status down, stop retrying, and log the exhaustion (non-fatal).
  5. Backwards compatibility: When docker-runner is up at startup, initial connectivity should quickly be up and behavior remains as before.
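
The retry timing described in item 4 could be computed along these lines. This is only a sketch under assumptions, not the actual implementation: RetryConfig, DEFAULT_RETRY_CONFIG, computeRetryDelayMs, and the injectable random parameter are hypothetical names, and adding the jitter on top of the capped delay is one possible reading of "exponential backoff + jitter".

```typescript
// Hypothetical shape mirroring the DOCKER_RUNNER_CONNECT_RETRY_* settings.
interface RetryConfig {
  baseDelayMs: number;
  maxDelayMs: number;
  jitterMs: number;
}

// Defaults taken from the proposed configuration knobs.
const DEFAULT_RETRY_CONFIG: RetryConfig = {
  baseDelayMs: 500,
  maxDelayMs: 30000,
  jitterMs: 250,
};

// Delay before the next probe after `consecutiveFailures` failed attempts.
// `random` is injectable so the jitter can be made deterministic in tests.
function computeRetryDelayMs(
  consecutiveFailures: number,
  config: RetryConfig = DEFAULT_RETRY_CONFIG,
  random: () => number = Math.random,
): number {
  // First failure waits baseDelayMs, then the delay doubles per failure.
  const exponent = Math.max(0, consecutiveFailures - 1);
  const exponential = config.baseDelayMs * Math.pow(2, exponent);
  // Cap the exponential part before adding jitter (assumed ordering).
  const capped = Math.min(config.maxDelayMs, exponential);
  const jitter = Math.floor(random() * config.jitterMs);
  return capped + jitter;
}
```

With the defaults and jitter disabled, the delays run 500, 1000, 2000, ... and settle at 30000 from the seventh consecutive failure onward.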

Implementation Mapping

  • Remove the DockerRunnerConnectivityProbe provider from packages/platform-server/src/infra/infra.module.ts and the entire fatal path in .../dockerRunnerConnectivity.probe.ts.
  • Add new providers:
    • DockerRunnerStatusService in packages/platform-server/src/infra/container/dockerRunnerStatus.service.ts.
    • DockerRunnerConnectivityMonitor in packages/platform-server/src/infra/container/dockerRunnerConnectivity.monitor.ts.
  • Update ConfigService (packages/platform-server/src/core/services/config.service.ts) to include new env schema and getters for optionality and retry configuration.
  • Add RequireDockerRunnerGuard in packages/platform-server/src/infra/container/requireDockerRunner.guard.ts and apply to docker-runner dependent controllers/routes (e.g., containers.controller.ts, terminal/exec routes).
  • Extend /health (existing controller or new) with docker-runner dependency snapshot.
  • Ensure top-level bootstrap no longer receives fatal errors from docker-runner unavailability; errors are logged by the monitor.
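
The decision RequireDockerRunnerGuard has to make can be sketched without any framework wiring. The function name checkDockerRunnerReady and the NotReadyResponse type are hypothetical. Only the 503 status and the JSON body come from the spec above.

```typescript
type DockerRunnerStatus = 'unknown' | 'up' | 'down';

// Payload the spec requires while docker-runner is not ready.
interface NotReadyResponse {
  statusCode: 503;
  body: { error: { code: 'docker_runner_not_ready'; message: string } };
}

// Returns null when the request may proceed, or the 503 payload otherwise.
// Both 'unknown' and 'down' are rejected, matching "if status !== up".
function checkDockerRunnerReady(status: DockerRunnerStatus): NotReadyResponse | null {
  if (status === 'up') {
    return null;
  }
  return {
    statusCode: 503,
    body: {
      error: { code: 'docker_runner_not_ready', message: 'docker-runner not ready' },
    },
  };
}
```

In the real guard this check would presumably read the status from DockerRunnerStatusService and turn a non-null result into the HTTP response. That wiring is left out here.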

Logging Requirements

  • Structured failure logs include:
    • dependency: "docker-runner", baseUrl, errorCode, message, retryInMs, nextRetryAt (ISO), consecutiveFailures.
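
A possible shape for one such log entry, as a sketch only. The field names follow the list above, while the DockerRunnerFailureLog type, the buildFailureLog helper, and its parameter object are hypothetical.

```typescript
// Fields taken from the logging requirements above.
interface DockerRunnerFailureLog {
  dependency: 'docker-runner';
  baseUrl: string;
  errorCode: string;
  message: string;
  retryInMs: number;
  nextRetryAt: string; // ISO 8601 timestamp
  consecutiveFailures: number;
}

// Builds the structured entry; nextRetryAt is derived from now + retryInMs.
function buildFailureLog(params: {
  baseUrl: string;
  errorCode: string;
  message: string;
  retryInMs: number;
  consecutiveFailures: number;
  now: Date;
}): DockerRunnerFailureLog {
  return {
    dependency: 'docker-runner',
    baseUrl: params.baseUrl,
    errorCode: params.errorCode,
    message: params.message,
    retryInMs: params.retryInMs,
    nextRetryAt: new Date(params.now.getTime() + params.retryInMs).toISOString(),
    consecutiveFailures: params.consecutiveFailures,
  };
}
```

Passing now in explicitly keeps the entry reproducible, which helps when the test plan asserts on retry metadata.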

Acceptance Criteria

  • Platform-server starts and remains running when docker-runner is down.
  • Docker-runner dependent endpoints return HTTP 503 with { error: { code: 'docker_runner_not_ready', message: 'docker-runner not ready' } } until docker-runner becomes available.
  • /health reports docker-runner status and details while overall server health is OK.
  • Clear error logs are present on failure; no crashes.
  • Behavior unchanged when docker-runner is up at startup.
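
For the /health criterion, the response could look roughly like the sketch below. Only dependencies.dockerRunner with status, baseUrl, and lastError is named in the spec. The top-level status: 'ok' field, the exact timestamp and counter field names, and buildHealthResponse itself are assumptions.

```typescript
// Snapshot fields: status, baseUrl, lastError are from the spec;
// the timestamp and counter names are assumed.
interface DockerRunnerSnapshot {
  status: 'unknown' | 'up' | 'down';
  baseUrl: string;
  lastError: string | null;
  lastSuccessAt: string | null;
  lastFailureAt: string | null;
  nextRetryAt: string | null;
  consecutiveFailures: number;
}

interface HealthResponse {
  status: 'ok'; // overall server health does not depend on docker-runner
  dependencies: { dockerRunner: DockerRunnerSnapshot };
}

// Overall health stays 'ok' regardless of the dependency's status.
function buildHealthResponse(dockerRunner: DockerRunnerSnapshot): HealthResponse {
  // Copy the snapshot so callers cannot mutate the service's internal state.
  return { status: 'ok', dependencies: { dockerRunner: { ...dockerRunner } } };
}
```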

Test Plan

  1. Startup with docker-runner down: server boots; /health shows dependencies.dockerRunner.status = "down" (or transitions from unknown to down).
  2. Docker-runner dependent endpoints return HTTP 503 with docker_runner_not_ready while docker-runner is down.
  3. Recovery: once docker-runner becomes available, status turns up and endpoints succeed without restart.
  4. Logging: capture and assert structured error entries with retry metadata during down state.
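
The status transitions that steps 1 and 3 exercise can be modeled as a small pure function, which makes them easy to assert on without timers. This is a sketch under assumptions: MonitorState, ProbeResult, INITIAL_STATE, and applyProbeResult are hypothetical names, and clearing lastError on recovery is an assumed choice rather than something the spec states.

```typescript
type Status = 'unknown' | 'up' | 'down';

interface MonitorState {
  status: Status;
  consecutiveFailures: number;
  lastError: string | null;
}

type ProbeResult = { ok: true } | { ok: false; error: string };

// Status before the first probe has completed.
const INITIAL_STATE: MonitorState = {
  status: 'unknown',
  consecutiveFailures: 0,
  lastError: null,
};

// Applies one probe outcome and returns the next state without mutating the input.
function applyProbeResult(state: MonitorState, result: ProbeResult): MonitorState {
  if (result.ok) {
    // Recovery resets the failure counter (and, by assumption, the last error).
    return { status: 'up', consecutiveFailures: 0, lastError: null };
  }
  return {
    status: 'down',
    consecutiveFailures: state.consecutiveFailures + 1,
    lastError: result.error,
  };
}
```

A test for the recovery case would feed one or more failed results followed by a successful one and check that status goes from unknown to down and then to up.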
