## Description

### Request
Display health indicators for CLP package services in the WebUI to help users identify when critical backend components are unavailable.
### Problem

Users have experienced situations where the WebUI is accessible and appears functional, but queries are not being processed because the query job orchestration components (query-scheduler, query-worker, reducer, etc.) are down. While administrators can check service health via orchestrator tools (`docker compose ps`, `kubectl get pods`), there is no visibility into service health from user-facing interfaces like the WebUI, making it difficult for users to diagnose such issues.
### Why this matters
- User Experience: Users cannot easily determine why their queries are not processing
- Debugging: Without health indicators, users have no visibility into which components might be failing
- Operational Awareness: Administrators need to quickly identify service outages without manually checking container/pod status
### Affected services

Based on the current architecture in `tools/deployment/package/docker-compose-all.yaml` and `tools/deployment/package-helm/templates/`, the services that need health monitoring include:
Note: Current health checks are defined per orchestrator:
- Docker Compose: `healthcheck` blocks in `docker-compose-all.yaml`
- Kubernetes (Helm): `readinessProbe`/`livenessProbe` in deployment templates
#### Core job orchestration services
| Service | Description | Port | Current Health Endpoint |
|---|---|---|---|
| `query-scheduler` | Schedules query jobs | 7000 | None (the port is only a TCP listener for reducers) |
| `compression-scheduler` | Schedules compression jobs | - | None |
| `query-worker` | Celery worker for executing queries | - | None (Celery process) |
| `compression-worker` | Celery worker for executing compression jobs | - | None (Celery process) |
| `reducer` | Aggregates query results | - | None |
#### Supporting services
| Service | Description | Port | Current Health Endpoint |
|---|---|---|---|
| `api-server` | REST API server | 3001 | `GET /health` |
| `webui` | Web interface | 4000 | TCP socket check |
| `garbage-collector` | Cleans up old archives and results | - | None |
| `mcp-server` | MCP server (optional) | 8000 | `GET /health` |
| `log-ingestor` | Ingestion service | 3002 | `GET /health` |
#### Third-party services
| Service | Description | Port | Health Check Method |
|---|---|---|---|
| `database` | MariaDB | 3306 | `mysqladmin ping` |
| `queue` | RabbitMQ | 5672 | `rabbitmq-diagnostics check_running` |
| `redis` | Redis | 6379 | `redis-cli PING` |
| `results-cache` | MongoDB | 27017 | `mongosh` ping |
#### Initialization jobs
| Service | Description | Health Check Method |
|---|---|---|
| `db-table-creator` | Creates database tables in MariaDB | Job completion status (one-time) |
| `results-cache-indices-creator` | Initializes MongoDB indices | Job completion status (one-time) |
Note: Most services depend on these initialization jobs completing successfully before starting.
### Possible implementation
Two decisions need to be made:
- How services report health — the mechanism for collecting health status from services
- How health statuses are cached/exposed — how the API server stores and exposes aggregated health data
#### 1. Alternative approaches for health reporting
How should services report their health status to a central aggregator?
| Option | Description | Advantages | Disadvantages |
|---|---|---|---|
| 1A: Orchestrator-based | Leverage Docker/Kubernetes APIs to get container/pod health status | • Uses existing health checks defined in compose/helm files | • Not orchestrator agnostic • Docker: requires exposing socket (security concern) • Kubernetes: requires additional RBAC permissions |
| 1B: Services send heartbeats to API server (recommended) | Services periodically POST health reports to the API server | • Orchestrator agnostic • Services only need to make HTTP requests (simpler than serving) • Single aggregation point as source of truth | • Requires adding an HTTP client to each service |
| 1C: API server scrapes services | API server periodically polls each service's health endpoint | • Similar to the Prometheus model • Bypasses the orchestrator | • Requires all services to expose HTTP endpoints (not all are HTTP servers) • Requires service discovery (the API server needs to know hostnames assigned by Docker Compose / Kubernetes) |
Option 1B implementation details:
- Add a `POST /health` endpoint to the API server that accepts health reports
- Each service periodically (e.g., every 10 seconds) sends a report with:
  - Service name
  - Service instance ID
  - Optional timestamp (for debugging clock skew / network delays; the API server's receive time is authoritative for health calculations)
  - Optional error message to explicitly mark the service as unhealthy (e.g., "failed to connect to database")
  - Optional status details (e.g., queue depth, active jobs)
- The API server also marks services as unhealthy if no heartbeat is received within a threshold (e.g., 30 seconds)
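The heartbeat loop described above could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: the endpoint URL, payload field names, and function names are all assumptions, since the report format is not yet defined.

```python
import json
import threading
import time
import urllib.request

# Hypothetical endpoint and reporting interval; the real values would come
# from per-service configuration (see "Configuration options" below).
API_SERVER_URL = "http://api-server:3001/health"
REPORT_INTERVAL_S = 10


def build_health_report(service, instance_id, error=None, details=None):
    """Assemble one heartbeat payload; the API server's receive time is
    authoritative, so the timestamp is only for debugging clock skew."""
    report = {
        "service": service,
        "instance": instance_id,
        "timestamp": time.time(),
    }
    if error is not None:
        report["error"] = error  # explicitly marks this instance as unhealthy
    if details is not None:
        report["details"] = details  # e.g., {"queue_depth": 3}
    return report


def send_report(report, url=API_SERVER_URL):
    """POST one report; failures are swallowed so the service keeps running
    even if the API server is unavailable."""
    data = json.dumps(report).encode()
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    try:
        urllib.request.urlopen(req, timeout=2)
        return True
    except OSError:
        return False  # API server unreachable; not fatal for the service


def start_heartbeat_thread(service, instance_id, interval_s=REPORT_INTERVAL_S):
    """Run the reporting loop in a daemon thread so it never blocks shutdown."""
    def loop():
        while True:
            send_report(build_health_report(service, instance_id))
            time.sleep(interval_s)

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

Swallowing send failures is deliberate: health reporting must stay best-effort so services never gain a hard runtime dependency on the API server.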
#### 2. Alternatives for health status storage/caching

Some entity (here we assume the API server) aggregates health statuses. Options for how it stores and exposes them:
| Option | Description | Advantages | Disadvantages |
|---|---|---|---|
| 2A: API server in-memory cache (recommended) | Cache in memory, expose via `GET /health` | • Simplest; no external storage • Health data is ephemeral by nature • Orchestrators can query for health checks | • WebUI must poll (no push updates) • Data lost on restart (acceptable) |
| 2B: MongoDB (results-cache) | Store in a dedicated MongoDB collection | • WebUI can use CDC via Socket.IO for real-time updates (existing pattern) | • Additional complexity for ephemeral data |
| 2C: Redis | Store with TTL-based expiry | • Fast reads/writes • TTL auto-expires stale entries | • WebUI doesn't connect to Redis • Requires new infrastructure |
| 2D: MariaDB / MySQL (clp-db) | Store in a heartbeat table | • WebUI already connects to `clp-db` • Transactional consistency | • WebUI must poll (no CDC) • Additional load on the primary database |
Option 2A endpoints:
- `GET /health`: returns the health status of all services (for the WebUI)
- `GET /health?service=<name>&instance=<id>`: returns the health status of a specific service instance (for container orchestrator health checks on services without their own endpoints)
Caveat for orchestrator health checks: this creates a chicken-and-egg problem. Currently, the API server has hard startup dependencies (`depends_on` the database in Docker Compose / initContainers waiting for `db-table-creator` in Kubernetes), so it can't start before other services without relaxing these dependencies.
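To make the Option 2A cache behavior concrete, here is a minimal sketch of the in-memory store and staleness logic. The actual API server is implemented in Rust (`components/api-server/src/routes.rs`), so this Python sketch only illustrates the logic; the class name, status strings, and field names are assumptions.

```python
import time

UNHEALTHY_THRESHOLD_S = 30  # assumption: matches the heartbeat threshold above


class HealthCache:
    """In-memory health cache (Option 2A): last report per (service, instance)."""

    def __init__(self, threshold_s=UNHEALTHY_THRESHOLD_S, clock=time.monotonic):
        self._threshold_s = threshold_s
        self._clock = clock  # injectable for testing
        self._reports = {}  # (service, instance) -> (received_at, report)

    def record(self, service, instance, report):
        """Called by the POST /health handler; receive time is authoritative."""
        self._reports[(service, instance)] = (self._clock(), report)

    def status(self, service, instance):
        """Health of one instance, as GET /health?service=...&instance=...
        would return it."""
        entry = self._reports.get((service, instance))
        if entry is None:
            return "unknown"  # never reported
        received_at, report = entry
        if report.get("error") is not None:
            return "unhealthy"  # service explicitly reported a failure
        if self._clock() - received_at > self._threshold_s:
            return "unhealthy"  # heartbeat missed the threshold
        return "healthy"

    def snapshot(self):
        """All statuses, as GET /health would return them for the WebUI."""
        return {
            f"{svc}/{inst}": self.status(svc, inst)
            for (svc, inst) in self._reports
        }
```

Computing staleness at read time (rather than in a background task) keeps the cache lock-free and simple; the background task mentioned in the implementation steps would only be needed if push notifications are added later.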
#### Recommended architecture (Option 1B + 2A)
```mermaid
flowchart LR
    subgraph Services
        QS[query-scheduler]
        CS[compression-scheduler]
        QW[query-worker]
        CW[compression-worker]
        R[reducer]
        GC[garbage-collector]
    end
    subgraph Aggregator
        API[API Server<br/>in-memory cache]
    end
    subgraph Frontend
        WebUI[WebUI]
    end
    subgraph Orchestrator
        DC[Docker Compose /<br/>Kubernetes]
    end
    QS -->|POST /health| API
    CS -->|POST /health| API
    QW -->|POST /health| API
    CW -->|POST /health| API
    R -->|POST /health| API
    GC -->|POST /health| API
    API -->|GET /health| WebUI
    API -->|GET /health?service=X| DC
```
#### Implementation steps (Option 1B + 2A)
- API server changes:
  - Add a `POST /health` endpoint to accept service health reports
  - Add a `GET /health` endpoint to return the aggregated health status of all services (for the WebUI)
  - Add `GET /health?service=<name>&instance=<id>` for querying a specific service's health (for container orchestrator health checks)
  - Add a background task to mark services as unhealthy if no report is received within the threshold
  - Cache health statuses in memory
- Service changes (first-party):
  - Add a health report HTTP client to each long-running Python service (query-scheduler, compression-scheduler, reducer, garbage-collector)
  - Add a health report mechanism to the Celery workers (query-worker, compression-worker)
  - Configure the report interval via environment variable or config
  - For initialization jobs (db-table-creator, results-cache-indices-creator): report completion status once upon success/failure
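For the one-time initialization jobs, the completion report could look roughly like this. The payload shape and the injectable `post_fn` parameter are assumptions for illustration and testing; a real implementation would POST the report to the API server's `/health` endpoint.

```python
import json


def report_job_completion(service, instance, succeeded, error=None, post_fn=None):
    """One-shot completion report for init jobs (hypothetical payload shape).

    `post_fn` is injectable so the job can run, and be tested, without the
    API server being reachable; health reporting stays best-effort.
    """
    report = {
        "service": service,
        "instance": instance,
        "details": {"completed": True, "succeeded": succeeded},
    }
    if not succeeded:
        # An error message explicitly marks the job run as unhealthy.
        report["error"] = error or "initialization job failed"
    if post_fn is not None:
        post_fn(json.dumps(report))
    return report
```

Because these jobs exit after running once, the API server would need to treat a "completed" report as terminal rather than applying the missing-heartbeat threshold to it.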
- Third-party service health reporting (optional, future enhancement):
  - Third-party services (database, queue, redis, results-cache) don't run our code, so they can't directly report health. One possible approach:
  - Extend existing healthchecks: append a curl command to the existing healthcheck scripts, e.g., `mysqladmin ping && curl -X POST http://api-server:3001/health -d '{"service":"database"}'`
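As a sketch of how the curl-extended check could be wired into `docker-compose-all.yaml` — the service definition below is an assumption, and it also assumes `curl` is available in the image:

```yaml
# Sketch only; not the actual service definition.
database:
  image: mariadb
  healthcheck:
    # String form runs under the shell, so `&&` chains the existing check
    # with the best-effort report to the API server.
    test: >-
      mysqladmin ping
      && curl -fsS -X POST http://api-server:3001/health
      -H 'Content-Type: application/json'
      -d '{"service":"database","instance":"database-1"}'
    interval: 10s
    timeout: 5s
    retries: 3
```

A caveat: with `&&`, an unreachable API server would make the orchestrator consider the database unhealthy, so a real version would likely terminate the chain with `|| true` on the curl step.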
- WebUI changes:
  - Poll the `GET /health` endpoint periodically (e.g., every 5 seconds)
  - Add a health status display component (e.g., status bar or dedicated page)
  - Visual indicators: green (healthy), red (unhealthy/missing); optionally yellow (degraded) in the future
- Container orchestrator changes (optional, requires more thought):
  - For services without their own health endpoints, configure health checks to use `GET /health?service=<name>&instance=<id>` on the API server (see caveat in Option 2A above)
### Configuration options
```yaml
# Example values.yaml additions
clpConfig:
  # Per-service config to optionally enable reporting (so services don't have
  # a hard dependency on the API server)
  query_scheduler:
    health_reporting:
      enabled: true  # optional; the service continues to function if the API server is unavailable
      interval: 10  # seconds between reports
      unhealthy_threshold: 30  # seconds without a report before the API server marks the service unhealthy
  compression_scheduler:
    health_reporting:
      enabled: true
      interval: 10
      unhealthy_threshold: 30
  # ... similar for other services
```

### References
- Docker Compose health checks: `tools/deployment/package/docker-compose-all.yaml`
- Helm deployment templates: `tools/deployment/package-helm/templates/`
- API server routes: `components/api-server/src/routes.rs`
- WebUI MongoDB integration: `components/webui/server/src/plugins/app/socket/MongoSocketIoServer/`
- Query scheduler: `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`