Service health indicators in WebUI #1794

@junhaoliao

Request

Display health indicators for CLP package services in the WebUI to help users identify when critical backend components are unavailable.

Problem

Users have experienced situations where the WebUI is accessible and appears functional, but queries are not being processed because the query job orchestration components (query-scheduler, query-worker, reducer, etc.) are down. While administrators can check service health via orchestrator tools (`docker compose ps`, `kubectl get pods`), there is no visibility into service health from user-facing interfaces like the WebUI, making it difficult for users to diagnose such issues.

Why this matters

  • User Experience: Users cannot easily determine why their queries are not processing
  • Debugging: Without health indicators, users have no visibility into which components might be failing
  • Operational Awareness: Administrators need to quickly identify service outages without manually checking container/pod status

Affected services

Based on the current architecture in tools/deployment/package/docker-compose-all.yaml and tools/deployment/package-helm/templates/, the services that need health monitoring include:

Note: Current health checks are defined per orchestrator:

  • Docker Compose: healthcheck blocks in docker-compose-all.yaml
  • Kubernetes (Helm): readinessProbe / livenessProbe in deployment templates
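
For reference, the two mechanisms look roughly like this (illustrative fragments only; the service names, commands, and port values are placeholders, not copied from the actual deployment files):

```yaml
# Docker Compose: a healthcheck block on a service
services:
  database:
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 10s
      timeout: 5s
      retries: 3

# Kubernetes (Helm template): a readinessProbe on a container
# containers:
#   - name: api-server
#     readinessProbe:
#       httpGet:
#         path: /health
#         port: 3001
#       periodSeconds: 10
```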

Core job orchestration services

| Service | Description | Port | Current Health Endpoint |
| --- | --- | --- | --- |
| query-scheduler | Schedules query jobs | 7000 | None (the port is only a TCP listener for reducers) |
| compression-scheduler | Schedules compression jobs | - | None |
| query-worker | Celery worker for executing queries | - | None (Celery process) |
| compression-worker | Celery worker for executing compression jobs | - | None (Celery process) |
| reducer | Aggregates query results | - | None |

Supporting services

| Service | Description | Port | Current Health Endpoint |
| --- | --- | --- | --- |
| api-server | REST API server | 3001 | `GET /health` |
| webui | Web interface | 4000 | TCP socket check |
| garbage-collector | Cleans up old archives and results | - | None |
| mcp-server | MCP server (optional) | 8000 | `GET /health` |
| log-ingestor | Ingestion service | 3002 | `GET /health` |

Third-party services

| Service | Description | Port | Health Check Method |
| --- | --- | --- | --- |
| database | MariaDB | 3306 | `mysqladmin ping` |
| queue | RabbitMQ | 5672 | `rabbitmq-diagnostics check_running` |
| redis | Redis | 6379 | `redis-cli PING` |
| results-cache | MongoDB | 27017 | `mongosh` ping |

Initialization jobs

| Service | Description | Health Check Method |
| --- | --- | --- |
| db-table-creator | Creates database tables in MariaDB | Job completion status (one-time) |
| results-cache-indices-creator | Initializes MongoDB indices | Job completion status (one-time) |

Note: Most services depend on these initialization jobs completing successfully before starting.

Possible implementation

Two decisions need to be made:

  1. How services report health — the mechanism for collecting health status from services
  2. How health statuses are cached/exposed — how the API server stores and exposes aggregated health data

1. Alternative approaches for health reporting

How should services report their health status to a central aggregator?

| Option | Description | Advantages | Disadvantages |
| --- | --- | --- | --- |
| 1A: Orchestrator-based | Leverage Docker/Kubernetes APIs to get container/pod health status | • Uses existing health checks defined in compose/helm files | • Not orchestrator agnostic<br>• Docker: requires exposing the socket (security concern)<br>• Kubernetes: requires additional RBAC permissions |
| 1B: Services send heartbeats to API server (recommended) | Services periodically POST health reports to the API server | • Orchestrator agnostic<br>• Services only need to make HTTP requests (simpler than serving)<br>• Single aggregation point as source of truth | • Requires adding an HTTP client to each service |
| 1C: API server scrapes services | API server periodically polls each service's health endpoint | • Similar to the Prometheus model<br>• Bypasses the orchestrator | • Requires all services to expose HTTP endpoints (not all are HTTP servers)<br>• Requires service discovery (API server needs to know hostnames assigned by Docker Compose / Kubernetes) |

Option 1B implementation details:

  1. Add a POST /health endpoint to the API server that accepts health reports
  2. Each service periodically (e.g., every 10 seconds) sends a report with:
    • Service name
    • Service instance ID
    • Optional timestamp (for debugging clock skew / network delays; API server's receive time is authoritative for health calculations)
    • Optional error message to explicitly mark as unhealthy (e.g., "failed to connect to database")
    • Optional status details (e.g., queue depth, active jobs)
  3. API server also marks services as unhealthy if no heartbeat received within a threshold (e.g., 30 seconds)

2. Alternatives for health status storage/caching

Some entity (here assumed to be the API server) aggregates health statuses. Options for how it stores/exposes them:

| Option | Description | Advantages | Disadvantages |
| --- | --- | --- | --- |
| 2A: API server in-memory cache (recommended) | Cache in memory, expose via `GET /health` | • Simplest; no external storage<br>• Health data is ephemeral by nature<br>• Orchestrators can query for health checks | • WebUI must poll (no push updates)<br>• Data lost on restart (acceptable) |
| 2B: MongoDB (results-cache) | Store in a dedicated MongoDB collection | • WebUI can use CDC via Socket.IO for real-time updates (existing pattern) | • Additional complexity for ephemeral data |
| 2C: Redis | Store with TTL-based expiry | • Fast reads/writes<br>• TTL auto-expires stale entries | • WebUI doesn't connect to Redis<br>• Requires new infrastructure |
| 2D: MariaDB / MySQL (clp-db) | Store in a heartbeat table | • WebUI already connects to clp-db<br>• Transactional consistency | • WebUI must poll (no CDC)<br>• Additional load on the primary database |

Option 2A endpoints:

  • GET /health — returns health status of all services (for WebUI)
  • GET /health?service=<name>&instance=<id> — returns health status of a specific service instance (for container orchestrator health checks on services without their own endpoints)

Caveat for orchestrator health checks: This creates a chicken-and-egg problem. Currently, API server has hard dependencies (depends_on database / initContainers waiting for db-table-creator), so it can't start before other services without relaxing these dependencies.

Recommended architecture (Option 1B + 2A)

```mermaid
flowchart LR
    subgraph Services
        QS[query-scheduler]
        CS[compression-scheduler]
        QW[query-worker]
        CW[compression-worker]
        R[reducer]
        GC[garbage-collector]
    end

    subgraph Aggregator
        API[API Server<br/>in-memory cache]
    end

    subgraph Frontend
        WebUI[WebUI]
    end

    subgraph Orchestrator
        DC[Docker Compose /<br/>Kubernetes]
    end

    QS -->|POST /health| API
    CS -->|POST /health| API
    QW -->|POST /health| API
    CW -->|POST /health| API
    R -->|POST /health| API
    GC -->|POST /health| API

    API -->|GET /health| WebUI
    API -->|GET /health?service=X| DC
```

Implementation steps (Option 1B + 2A)

  1. API server changes:

    • Add POST /health endpoint to accept service health reports
    • Add GET /health endpoint to return aggregated health status of all services (for WebUI)
    • Add GET /health?service=<name>&instance=<id> for querying specific service health (for container orchestrator health checks)
    • Add background task to mark services as unhealthy if no report received within threshold
    • Cache health statuses in memory
  2. Service changes (first-party):

    • Add health report HTTP client to each long-running Python service (query-scheduler, compression-scheduler, reducer, garbage-collector)
    • Add health report mechanism to Celery workers (query-worker, compression-worker)
    • Configure report interval via environment variable or config
    • For initialization jobs (db-table-creator, results-cache-indices-creator): report completion status once upon success/failure
  3. Third-party service health reporting (optional, future enhancement):

    Third-party services (database, queue, redis, results-cache) don't run our code, so they can't directly report health. Possible approaches:

    • Extend existing healthchecks: Append a curl command to existing healthcheck scripts, e.g.,

      ```shell
      mysqladmin ping && curl -fsS -X POST http://api-server:3001/health \
        -H 'Content-Type: application/json' -d '{"service":"database"}'
      ```
  4. WebUI changes:

    • Poll GET /health endpoint periodically (e.g., every 5 seconds)
    • Add health status display component (e.g., status bar or dedicated page)
    • Visual indicators: green (healthy), red (unhealthy/missing); optionally yellow (degraded) in future
  5. Container orchestrator changes (optional, requires more thought):

    • For services without their own health endpoints, configure health checks to use GET /health?service=<name>&instance=<id> on the API server (see caveat in Option 2A above)
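
For step 5, a Docker Compose healthcheck could delegate to the API server roughly as follows (illustrative only; the query string format and the availability of `curl` inside the container are assumptions, and the chicken-and-egg caveat from Option 2A applies):

```yaml
services:
  query-scheduler:
    healthcheck:
      # curl -f exits non-zero on an HTTP error, i.e. when the API server
      # has no recent heartbeat for this instance.
      test: ["CMD-SHELL", "curl -fsS 'http://api-server:3001/health?service=query-scheduler&instance=1'"]
      interval: 15s
      timeout: 5s
      retries: 3
```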

Configuration options

```yaml
# Example values.yaml additions
clpConfig:
  # Per-service config to optionally enable reporting (so services don't have
  # a hard dependency on the API server)
  query_scheduler:
    health_reporting:
      enabled: true           # optional; service continues to function if API server is unavailable
      interval: 10            # seconds between reports
      unhealthy_threshold: 30 # seconds without report before API server marks as unhealthy

  compression_scheduler:
    health_reporting:
      enabled: true
      interval: 10
      unhealthy_threshold: 30

  # ... similar for other services
```

References

  • Docker Compose health checks: tools/deployment/package/docker-compose-all.yaml
  • Helm deployment templates: tools/deployment/package-helm/templates/
  • API server routes: components/api-server/src/routes.rs
  • WebUI MongoDB integration: components/webui/server/src/plugins/app/socket/MongoSocketIoServer/
  • Query scheduler: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
