Service health indicators in WebUI #1794

@junhaoliao

Request

Display health indicators for CLP package services in the WebUI to help users identify when critical backend components are unavailable.

Problem

Users have experienced situations where the WebUI is accessible and appears functional, but queries are not being processed because the query job orchestration components (query-scheduler, query-worker, reducer, etc.) are down. While administrators can check service health via orchestrator tools (`docker compose ps`, `kubectl get pods`), there is no visibility into service health from user-facing interfaces like the WebUI, making it difficult for users to diagnose such issues.

Why this matters

  • User Experience: Users cannot easily determine why their queries are not processing
  • Debugging: Without health indicators, users have no visibility into which components might be failing
  • Operational Awareness: Administrators need to quickly identify service outages without manually checking container/pod status

Affected services

Based on the current architecture in tools/deployment/package/docker-compose-all.yaml and tools/deployment/package-helm/templates/, the services that need health monitoring include:

Note: Current health checks are defined per orchestrator:

  • Docker Compose: healthcheck blocks in docker-compose-all.yaml
  • Kubernetes (Helm): readinessProbe / livenessProbe in deployment templates
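
For reference, the two mechanisms look roughly like this (illustrative fragments only; the service names, commands, and port values are placeholders, not copied from the actual deployment files):

```yaml
# Docker Compose: a healthcheck block on a service
services:
  database:
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 10s
      timeout: 5s
      retries: 3

# Kubernetes (Helm template): a readinessProbe on a container
# containers:
#   - name: api-server
#     readinessProbe:
#       httpGet:
#         path: /health
#         port: 3001
#       periodSeconds: 10
```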

Core job orchestration services

| Service | Description | Port | Current Health Endpoint |
| --- | --- | --- | --- |
| query-scheduler | Schedules query jobs | 7000 | None (the port is only a TCP listener for reducers) |
| compression-scheduler | Schedules compression jobs | - | None |
| query-worker | Celery worker for executing queries | - | None (Celery process) |
| compression-worker | Celery worker for executing compression jobs | - | None (Celery process) |
| reducer | Aggregates query results | - | None |

Supporting services

| Service | Description | Port | Current Health Endpoint |
| --- | --- | --- | --- |
| api-server | REST API server | 3001 | `GET /health` |
| webui | Web interface | 4000 | TCP socket check |
| garbage-collector | Cleans up old archives and results | - | None |
| mcp-server | MCP server (optional) | 8000 | `GET /health` |
| log-ingestor | Ingestion service | 3002 | `GET /health` |

Third-party services

| Service | Description | Port | Health Check Method |
| --- | --- | --- | --- |
| database | MariaDB | 3306 | `mysqladmin ping` |
| queue | RabbitMQ | 5672 | `rabbitmq-diagnostics check_running` |
| redis | Redis | 6379 | `redis-cli PING` |
| results-cache | MongoDB | 27017 | `mongosh` ping |

Initialization jobs

| Service | Description | Health Check Method |
| --- | --- | --- |
| db-table-creator | Creates database tables in MariaDB | Job completion status (one-time) |
| results-cache-indices-creator | Initializes MongoDB indices | Job completion status (one-time) |

Note: Most services depend on these initialization jobs completing successfully before starting.

Possible implementation

Two decisions need to be made:

  1. How services report health — the mechanism for collecting health status from services
  2. How health statuses are cached/exposed — how the API server stores and exposes aggregated health data

1. Alternative approaches for health reporting

How should services report their health status to a central aggregator?

| Option | Description | Advantages | Disadvantages |
| --- | --- | --- | --- |
| 1A: Orchestrator-based | Leverage Docker/Kubernetes APIs to get container/pod health status | • Uses existing health checks defined in compose/helm files | • Not orchestrator agnostic<br>• Docker: requires exposing the socket (security concern)<br>• Kubernetes: requires additional RBAC permissions |
| 1B: Services send heartbeats to API server (recommended) | Services periodically POST health reports to the API server | • Orchestrator agnostic<br>• Services only need to make HTTP requests (simpler than serving)<br>• Single aggregation point as source of truth | • Requires adding an HTTP client to each service |
| 1C: API server scrapes services | API server periodically polls each service's health endpoint | • Similar to the Prometheus model<br>• Bypasses the orchestrator | • Requires all services to expose HTTP endpoints (not all are HTTP servers)<br>• Requires service discovery (API server needs to know hostnames assigned by Docker Compose / Kubernetes) |

Option 1B implementation details:

  1. Add a POST /health endpoint to the API server that accepts health reports
  2. Each service periodically (e.g., every 10 seconds) sends a report with:
    • Service name
    • Service instance ID
    • Optional timestamp (for debugging clock skew / network delays; API server's receive time is authoritative for health calculations)
    • Optional error message to explicitly mark as unhealthy (e.g., "failed to connect to database")
    • Optional status details (e.g., queue depth, active jobs)
  3. API server also marks services as unhealthy if no heartbeat received within a threshold (e.g., 30 seconds)

2. Alternatives for health status storage/caching

Some entity (here assumed to be the API server) aggregates health statuses. Options for how it stores/exposes them:

| Option | Description | Advantages | Disadvantages |
| --- | --- | --- | --- |
| 2A: API server in-memory cache (recommended) | Cache in memory, expose via `GET /health` | • Simplest; no external storage<br>• Health data is ephemeral by nature<br>• Orchestrators can query for health checks | • WebUI must poll (no push updates)<br>• Data lost on restart (acceptable) |
| 2B: MongoDB (results-cache) | Store in a dedicated MongoDB collection | • WebUI can use CDC via Socket.IO for real-time updates (existing pattern) | • Additional complexity for ephemeral data |
| 2C: Redis | Store with TTL-based expiry | • Fast reads/writes<br>• TTL auto-expires stale entries | • WebUI doesn't connect to Redis<br>• Requires new infrastructure |
| 2D: MariaDB / MySQL (clp-db) | Store in a heartbeat table | • WebUI already connects to clp-db<br>• Transactional consistency | • WebUI must poll (no CDC)<br>• Additional load on the primary database |

Option 2A endpoints:

  • GET /health — returns health status of all services (for WebUI)
  • GET /health?service=<name>&instance=<id> — returns health status of a specific service instance (for container orchestrator health checks on services without their own endpoints)

Caveat for orchestrator health checks: This creates a chicken-and-egg problem. Currently, API server has hard dependencies (depends_on database / initContainers waiting for db-table-creator), so it can't start before other services without relaxing these dependencies.

Recommended architecture (Option 1B + 2A)

```mermaid
flowchart LR
    subgraph Services
        QS[query-scheduler]
        CS[compression-scheduler]
        QW[query-worker]
        CW[compression-worker]
        R[reducer]
        GC[garbage-collector]
    end

    subgraph Aggregator
        API[API Server<br/>in-memory cache]
    end

    subgraph Frontend
        WebUI[WebUI]
    end

    subgraph Orchestrator
        DC[Docker Compose /<br/>Kubernetes]
    end

    QS -->|POST /health| API
    CS -->|POST /health| API
    QW -->|POST /health| API
    CW -->|POST /health| API
    R -->|POST /health| API
    GC -->|POST /health| API

    API -->|GET /health| WebUI
    API -->|GET /health?service=X| DC
```

Implementation steps (Option 1B + 2A)

  1. API server changes:

    • Add POST /health endpoint to accept service health reports
    • Add GET /health endpoint to return aggregated health status of all services (for WebUI)
    • Add GET /health?service=<name>&instance=<id> for querying specific service health (for container orchestrator health checks)
    • Add background task to mark services as unhealthy if no report received within threshold
    • Cache health statuses in memory
  2. Service changes (first-party):

    • Add health report HTTP client to each long-running Python service (query-scheduler, compression-scheduler, reducer, garbage-collector)
    • Add health report mechanism to Celery workers (query-worker, compression-worker)
    • Configure report interval via environment variable or config
    • For initialization jobs (db-table-creator, results-cache-indices-creator): report completion status once upon success/failure
  3. Third-party service health reporting (optional, future enhancement):

    Third-party services (database, queue, redis, results-cache) don't run our code, so they can't directly report health. Possible approaches:

    • Extend existing healthchecks: Append a curl command to existing healthcheck scripts, e.g.,

      ```shell
      mysqladmin ping && curl -fsS -X POST http://api-server:3001/health \
        -H 'Content-Type: application/json' -d '{"service":"database"}'
      ```
  4. WebUI changes:

    • Poll GET /health endpoint periodically (e.g., every 5 seconds)
    • Add health status display component (e.g., status bar or dedicated page)
    • Visual indicators: green (healthy), red (unhealthy/missing); optionally yellow (degraded) in future
  5. Container orchestrator changes (optional, requires more thought):

    • For services without their own health endpoints, configure health checks to use GET /health?service=<name>&instance=<id> on the API server (see caveat in Option 2A above)
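
For step 5, a Docker Compose healthcheck could delegate to the API server roughly as follows (illustrative only; the query string format and the availability of `curl` inside the container are assumptions, and the chicken-and-egg caveat from Option 2A applies):

```yaml
services:
  query-scheduler:
    healthcheck:
      # curl -f exits non-zero on an HTTP error, i.e. when the API server
      # has no recent heartbeat for this instance.
      test: ["CMD-SHELL", "curl -fsS 'http://api-server:3001/health?service=query-scheduler&instance=1'"]
      interval: 15s
      timeout: 5s
      retries: 3
```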

Configuration options

```yaml
# Example values.yaml additions
clpConfig:
  # Per-service config to optionally enable reporting (so services don't have
  # a hard dependency on the API server)
  query_scheduler:
    health_reporting:
      enabled: true           # optional; service continues to function if API server is unavailable
      interval: 10            # seconds between reports
      unhealthy_threshold: 30 # seconds without report before API server marks as unhealthy

  compression_scheduler:
    health_reporting:
      enabled: true
      interval: 10
      unhealthy_threshold: 30

  # ... similar for other services
```

References

  • Docker Compose health checks: tools/deployment/package/docker-compose-all.yaml
  • Helm deployment templates: tools/deployment/package-helm/templates/
  • API server routes: components/api-server/src/routes.rs
  • WebUI MongoDB integration: components/webui/server/src/plugins/app/socket/MongoSocketIoServer/
  • Query scheduler: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
