User Request
Platform-server crashes during bootstrap if docker-runner is not started first. We need platform-server to keep running even when docker-runner is unavailable at startup. Do not silence the error: log it clearly, but do not crash.
Researcher Specification (by Emerson Gray)
Root Cause
`DockerRunnerConnectivityProbe.onModuleInit()` throws when `HttpDockerRunnerClient.checkConnectivity()` fails, causing `app.init()` to reject and `src/index.ts` to `process.exit(1)`. This makes docker-runner a hard startup dependency.
- Probe file: `packages/platform-server/src/infra/container/dockerRunnerConnectivity.probe.ts`
- Client: `packages/platform-server/src/infra/container/httpDockerRunner.client.ts` (checks `GET /v1/ready`)
- Bootstrap: `packages/platform-server/src/index.ts` (exits on init failure)
Proposed Changes (Soft dependency at startup)
- Replace the fatal `OnModuleInit` probe with a non-fatal background monitor.
  - Add `DockerRunnerStatusService` to track status (`unknown|up|down`, last success/failure, error, next retry, `consecutiveFailures`).
  - Add `DockerRunnerConnectivityMonitor` (`OnModuleInit`) that runs a background loop to check connectivity with exponential backoff + jitter, logs failures with retry info, and never throws.
  - Ensure a single loop instance and stop it on module destroy.
- Graceful degradation for docker-runner dependent endpoints.
  - Introduce `RequireDockerRunnerGuard` (or interceptor): if status !== `up`, return HTTP 503 with JSON `{ error: { code: 'docker_runner_not_ready', message: 'docker-runner not ready' } }`.
  - Apply guard to controllers/routes that require docker-runner (e.g., Containers operations, terminal/exec, image ensure).
- Health/readiness behavior.
  - Server readiness should be true once core is initialized even if docker-runner is down.
  - Extend `/health` to include `dependencies.dockerRunner` snapshot with `status`, `baseUrl`, `lastError`, times, and counters.
- Configuration knobs (defaults favor non-fatal startup):
  - `DOCKER_RUNNER_OPTIONAL` (default: `true`). If `false`, retain fail-fast behavior.
  - `DOCKER_RUNNER_CONNECT_RETRY_BASE_DELAY_MS` (default: `500`)
  - `DOCKER_RUNNER_CONNECT_RETRY_MAX_DELAY_MS` (default: `30000`)
  - `DOCKER_RUNNER_CONNECT_RETRY_JITTER_MS` (default: `250`)
  - `DOCKER_RUNNER_CONNECT_PROBE_INTERVAL_MS` (default: `30000`, applied when status is `up`)
  - `DOCKER_RUNNER_CONNECT_MAX_RETRIES` (default: `0`, meaning unlimited). If set and exceeded, keep status `down`, stop retries, and log exhaustion (non-fatal).
- Backwards compatibility: When docker-runner is up at startup, initial connectivity should quickly be `up` and behavior remains as before.
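
The backoff and status tracking described above could be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: `computeRetryDelayMs`, `recordFailure`, `recordSuccess`, and the `RunnerStatus` shape are hypothetical names, and the choice to add jitter on top of the capped exponential delay is an assumption. Only the default values (500, 30000, 250) come from the configuration knobs listed in the proposal.

```typescript
// Hypothetical names throughout; only the default values come from the proposal.
type RunnerState = "unknown" | "up" | "down";

interface RetryConfig {
  baseDelayMs: number; // DOCKER_RUNNER_CONNECT_RETRY_BASE_DELAY_MS
  maxDelayMs: number; // DOCKER_RUNNER_CONNECT_RETRY_MAX_DELAY_MS
  jitterMs: number; // DOCKER_RUNNER_CONNECT_RETRY_JITTER_MS
}

interface RunnerStatus {
  state: RunnerState;
  consecutiveFailures: number;
  lastError?: string;
  nextRetryAt?: Date;
}

const defaultRetryConfig: RetryConfig = { baseDelayMs: 500, maxDelayMs: 30000, jitterMs: 250 };

// Delay before the next attempt after `consecutiveFailures` failures in a row.
// `random` is injectable so the jitter can be made deterministic in tests.
function computeRetryDelayMs(
  consecutiveFailures: number,
  config: RetryConfig = defaultRetryConfig,
  random: () => number = Math.random,
): number {
  // Clamp the exponent so the power stays finite for very long outages.
  const exponent = Math.min(Math.max(0, consecutiveFailures - 1), 30);
  const exponential = Math.min(config.maxDelayMs, config.baseDelayMs * Math.pow(2, exponent));
  // Assumption: jitter is a whole number in [0, jitterMs] added after the cap.
  const jitter = Math.floor(random() * (config.jitterMs + 1));
  return exponential + jitter;
}

// Status after a successful probe: counters reset, no pending retry.
function recordSuccess(): RunnerStatus {
  return { state: "up", consecutiveFailures: 0 };
}

// Status after a failed probe: the error is captured in the status, never rethrown.
function recordFailure(
  previous: RunnerStatus,
  error: unknown,
  nowMs: number,
  config: RetryConfig = defaultRetryConfig,
  random: () => number = Math.random,
): RunnerStatus {
  const consecutiveFailures = previous.consecutiveFailures + 1;
  const retryInMs = computeRetryDelayMs(consecutiveFailures, config, random);
  return {
    state: "down",
    consecutiveFailures,
    lastError: error instanceof Error ? error.message : String(error),
    nextRetryAt: new Date(nowMs + retryInMs),
  };
}
```

A monitor loop would call `recordFailure` or `recordSuccess` after each connectivity check and schedule the next attempt from `nextRetryAt`. Keeping these functions pure makes the retry schedule testable without timers.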
Implementation Mapping
- Remove `DockerRunnerConnectivityProbe` provider from `packages/platform-server/src/infra/infra.module.ts` and entire fatal path in `.../dockerRunnerConnectivity.probe.ts`.
- Add new providers:
  - `DockerRunnerStatusService` in `packages/platform-server/src/infra/container/dockerRunnerStatus.service.ts`.
  - `DockerRunnerConnectivityMonitor` in `packages/platform-server/src/infra/container/dockerRunnerConnectivity.monitor.ts`.
- Update `ConfigService` (`packages/platform-server/src/core/services/config.service.ts`) to include new env schema and getters for optionality and retry configuration.
- Add `RequireDockerRunnerGuard` in `packages/platform-server/src/infra/container/requireDockerRunner.guard.ts` and apply to docker-runner dependent controllers/routes (e.g., `containers.controller.ts`, terminal/exec routes).
- Extend `/health` (existing controller or new) with docker-runner dependency snapshot.
- Ensure top-level bootstrap no longer receives fatal errors from docker-runner unavailability; errors are logged by the monitor.
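
To make the `ConfigService` update concrete, the new environment variables might be read along these lines. This is a sketch in plain TypeScript, assuming no particular schema library; `loadDockerRunnerConfig`, `readInt`, `readBool`, and the `DockerRunnerConfig` field names are hypothetical, and the existing `ConfigService` may validate its env schema differently. The variable names and defaults are the ones listed under the configuration knobs.

```typescript
// Hypothetical config shape; env variable names and defaults follow the proposal.
interface DockerRunnerConfig {
  optional: boolean;
  retryBaseDelayMs: number;
  retryMaxDelayMs: number;
  retryJitterMs: number;
  probeIntervalMs: number;
  maxRetries: number; // 0 means unlimited
}

type Env = Record<string, string | undefined>;

// Reads a non-negative integer, falling back to the default when unset or blank.
function readInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${key} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// Assumption: only the literal strings "true" and "false" are accepted.
function readBool(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const normalized = raw.trim().toLowerCase();
  if (normalized === "true") return true;
  if (normalized === "false") return false;
  throw new Error(`${key} must be "true" or "false", got "${raw}"`);
}

function loadDockerRunnerConfig(env: Env): DockerRunnerConfig {
  return {
    optional: readBool(env, "DOCKER_RUNNER_OPTIONAL", true),
    retryBaseDelayMs: readInt(env, "DOCKER_RUNNER_CONNECT_RETRY_BASE_DELAY_MS", 500),
    retryMaxDelayMs: readInt(env, "DOCKER_RUNNER_CONNECT_RETRY_MAX_DELAY_MS", 30000),
    retryJitterMs: readInt(env, "DOCKER_RUNNER_CONNECT_RETRY_JITTER_MS", 250),
    probeIntervalMs: readInt(env, "DOCKER_RUNNER_CONNECT_PROBE_INTERVAL_MS", 30000),
    maxRetries: readInt(env, "DOCKER_RUNNER_CONNECT_MAX_RETRIES", 0),
  };
}
```

In a Node process this could be called as `loadDockerRunnerConfig(process.env)`. Rejecting malformed values at load time keeps a typo in a retry setting from silently changing the backoff behavior.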
Logging Requirements
- Structured failure logs include: `dependency: "docker-runner"`, `baseUrl`, `errorCode`, `message`, `retryInMs`, `nextRetryAt` (ISO), `consecutiveFailures`.
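
One possible shape for such a log entry is sketched below. The field names are the ones listed above; the `buildFailureLog` helper, its parameter object, and the `"UNKNOWN"` fallback for errors without a `code` property are assumptions made for illustration.

```typescript
// Field names follow the logging requirements; the helper itself is hypothetical.
interface DockerRunnerFailureLog {
  dependency: "docker-runner";
  baseUrl: string;
  errorCode: string;
  message: string;
  retryInMs: number;
  nextRetryAt: string; // ISO 8601 timestamp
  consecutiveFailures: number;
}

interface FailureLogParams {
  baseUrl: string;
  error: unknown;
  retryInMs: number;
  consecutiveFailures: number;
  nowMs: number;
}

function buildFailureLog(params: FailureLogParams): DockerRunnerFailureLog {
  const { baseUrl, error, retryInMs, consecutiveFailures, nowMs } = params;
  // Node-style errors often carry a string `code` (e.g. ECONNREFUSED).
  // Assumption: fall back to "UNKNOWN" when no code is present.
  const errorCode =
    typeof error === "object" && error !== null && "code" in error
      ? String((error as { code: unknown }).code)
      : "UNKNOWN";
  return {
    dependency: "docker-runner",
    baseUrl,
    errorCode,
    message: error instanceof Error ? error.message : String(error),
    retryInMs,
    nextRetryAt: new Date(nowMs + retryInMs).toISOString(),
    consecutiveFailures,
  };
}
```

The returned object can be passed to whatever structured logger the server already uses, which keeps the entry easy to assert on in tests.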
Acceptance Criteria
- Platform-server starts and remains running when docker-runner is down.
- Docker-runner dependent endpoints return HTTP 503 with `{ error: { code: 'docker_runner_not_ready', message: 'docker-runner not ready' } }` until docker-runner becomes available.
- `/health` reports docker-runner status and details while overall server health is OK.
- Clear error logs are present on failure; no crashes.
- Behavior unchanged when docker-runner is up at startup.
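
The 503 criterion could be captured by a small decision function like the one below. It is a framework-agnostic sketch: `checkDockerRunnerReady` and `GuardResult` are hypothetical names, and an actual `RequireDockerRunnerGuard` would need to translate the rejected case into an HTTP response, for instance by throwing NestJS's `ServiceUnavailableException` with this body. The status code and the JSON body are taken from the criteria above.

```typescript
// Hypothetical, framework-agnostic decision logic for the guard.
type DockerRunnerState = "unknown" | "up" | "down";

interface NotReadyBody {
  error: { code: string; message: string };
}

type GuardResult =
  | { allowed: true }
  | { allowed: false; httpStatus: number; body: NotReadyBody };

function checkDockerRunnerReady(state: DockerRunnerState): GuardResult {
  if (state === "up") {
    return { allowed: true };
  }
  // Both "unknown" and "down" are rejected, since the criterion is status !== up.
  return {
    allowed: false,
    httpStatus: 503,
    body: {
      error: { code: "docker_runner_not_ready", message: "docker-runner not ready" },
    },
  };
}
```

Treating `unknown` the same as `down` means requests that arrive before the first probe completes are also rejected with 503 rather than failing deeper in the call path.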
Test Plan
- Startup with docker-runner down: server boots; `/health` shows `dependencies.dockerRunner.status = "down"` (or transitions from `unknown` to `down`).
- Docker required endpoints return HTTP 503 with `docker_runner_not_ready` when down.
- Recovery: once docker-runner becomes available, status turns `up` and endpoints succeed without restart.
- Logging: capture and assert structured error entries with retry metadata during down state.