Production Monitoring and Alerting #164

@rucka

Description

Story Statement

As a platform operations engineer
I want comprehensive monitoring and alerting for the knowledge service
So that I can ensure SLA compliance, detect issues proactively, and respond to incidents rapidly

Where: Knowledge service infrastructure — monitoring stack + alerting pipeline

Epic Context

Parent Epic: Platform Hardening & Enterprise Readiness #68
Status: Refined
Priority: P0 (Must-Have)

Status Workflow

  • Refined: Story is detailed, estimated, and ready for development
  • In Progress: Story is actively being developed
  • Done: Story delivered and accepted

Acceptance Criteria

Functional Requirements

  1. Given the knowledge service is running
    When an ops engineer accesses the metrics endpoint GET /metrics
    Then it returns Prometheus-format metrics: request_duration_seconds (p50/p95/p99), request_total (by method/status), active_connections, db_pool_utilization, s3_operation_duration

  2. Given request latency p95 exceeds 500ms for 5 minutes
    When the alert rule evaluates
    Then a "High Latency" alert fires with severity "warning" to configured channels (email, Slack webhook)

  3. Given error rate exceeds 5% for 2 minutes
    When the alert rule evaluates
    Then a "High Error Rate" alert fires with severity "critical" and escalation to PagerDuty/webhook

  4. Given the service is running with monitoring enabled
    When an ops engineer accesses the health dashboard
    Then they see: request rate, latency percentiles, error rate, active connections, DB connection pool status, S3 operation status, uptime counter

  5. Given the health check endpoints
    When the liveness probe calls GET /api/v1/health/live
    Then it returns 200 if the process is alive (no dependency check)
    When the readiness probe calls GET /api/v1/health/ready
    Then it returns 200 only if DB and S3 are reachable, 503 otherwise

  6. Given the service emits structured logs
    When any request is processed
    Then the log includes: correlation_id, method, path, status, duration_ms, user_id (if authenticated), timestamp in JSON format
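Criterion 6 pins down the log fields; as a sketch (the type and builder names are illustrative, not taken from the service code), the entry can be modeled as:

```typescript
// Illustrative shape of the structured log entry from criterion 6.
// Field names follow the criterion; the builder itself is an assumption.
interface RequestLog {
  correlation_id: string;
  method: string;
  path: string;
  status: number;
  duration_ms: number;
  user_id?: string;   // present only for authenticated requests
  timestamp: string;  // ISO-8601
}

function formatRequestLog(entry: Omit<RequestLog, "timestamp">): string {
  // One JSON object per line, so a log aggregator can parse each line independently.
  return JSON.stringify({ ...entry, timestamp: new Date().toISOString() });
}
```

In practice pino produces this shape directly; the sketch only fixes the field contract.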

Business Rules

  • Metrics format: Prometheus exposition format (compatible with Prometheus, Grafana, Datadog)
  • Alert severity levels: info, warning, critical
  • Alert channels: email (all), Slack webhook (warning+), PagerDuty/webhook (critical)
  • Escalation: warning unacknowledged for 15min → escalate to critical
  • Health probes: liveness (process alive), readiness (dependencies OK), startup (initialization complete)
  • Structured logging: JSON format with correlation ID for request tracing
  • Uptime tracking: service uptime counter in metrics for SLA calculation
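The channel-routing and escalation rules above can be sketched as small pure functions; the names and the severity rank table are assumptions for illustration, not service code:

```typescript
// Sketch of the business rules: email gets everything, Slack from
// "warning" up, PagerDuty only for "critical"; a warning left
// unacknowledged for 15 minutes escalates to critical.
type Severity = "info" | "warning" | "critical";
type Channel = "email" | "slack" | "pagerduty";

const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

function channelsFor(severity: Severity): Channel[] {
  const channels: Channel[] = ["email"];
  if (SEVERITY_RANK[severity] >= SEVERITY_RANK.warning) channels.push("slack");
  if (severity === "critical") channels.push("pagerduty");
  return channels;
}

const ESCALATION_AFTER_MS = 15 * 60 * 1000;

function effectiveSeverity(severity: Severity, unacknowledgedMs: number): Severity {
  if (severity === "warning" && unacknowledgedMs >= ESCALATION_AFTER_MS) return "critical";
  return severity;
}
```

Keeping routing and escalation as pure functions makes the rules unit-testable independent of the delivery transport.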

Edge Cases and Error Handling

  • Metrics endpoint under load: Metrics collection must not degrade service performance (<5ms overhead)
  • Alert channel unreachable: Retry 3x with backoff; log alert locally if all channels fail
  • DB connection pool exhausted: Readiness probe returns 503; alert fires immediately
  • Log volume spike: Structured logging with configurable log level (info default, debug for troubleshooting)
  • Monitoring bootstrap: Graceful startup — metrics available only after startup probe passes
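The "alert channel unreachable" rule (retry 3x with backoff, then log locally) might look roughly like this, with the sender and local logger injected so the policy itself stays testable; the function name and delays are assumptions:

```typescript
// Sketch of the retry policy: up to `attempts` tries with exponential
// backoff, falling back to a local log if every attempt fails.
type Send = (payload: string) => Promise<void>;
type LogLocal = (msg: string) => void;

async function deliverAlert(
  payload: string,
  send: Send,
  logLocal: LogLocal,
  attempts = 3,
  baseDelayMs = 200,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      await send(payload);
      return true;
    } catch {
      // exponential backoff: 200ms, 400ms, ... (no wait after the final try)
      if (i < attempts - 1) {
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  logLocal(`alert delivery failed after ${attempts} attempts: ${payload}`);
  return false;
}
```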

Definition of Done Checklist

Development Completion

  • All 6 acceptance criteria implemented and verified
  • Prometheus metrics endpoint
  • Structured JSON logging with correlation IDs
  • Liveness, readiness, startup health probes
  • Alert rules configuration (latency, error rate, connection pool)
  • Alert channel integration (email, Slack, webhook)
  • Monitoring dashboard configuration (Grafana or equivalent)
  • Unit tests for metrics collection and health probes
  • Integration tests for alerting flow
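The alert-rules item above could take shape as Prometheus alerting rules along these lines; metric names follow acceptance criterion 1, but the group name, rate windows, and label matchers are assumptions to be tuned against the real instrumentation:

```yaml
# Sketch of the latency and error-rate alerts from criteria 2 and 3.
groups:
  - name: knowledge-service
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
      - alert: HighErrorRate
        expr: sum(rate(request_total{status=~"5.."}[2m])) / sum(rate(request_total[2m])) > 0.05
        for: 2m
        labels:
          severity: critical
```

Alertmanager would then route by the `severity` label to the email, Slack, and PagerDuty receivers.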

Quality Assurance

  • Metrics collection adds <5ms latency
  • Alert fires within 30s of threshold breach
  • All health probes return correct status under various conditions
  • Structured logs parseable by log aggregator

Deployment and Release

  • Monitoring stack deployment documented (Prometheus + Grafana or cloud equivalent)
  • Alert channel credentials configured
  • Dashboard templates included in deployment

Story Sizing and Sprint Readiness

Refined Story Points

Final Story Points: XL(10)
Confidence Level: Low
Sizing Justification: Full observability stack — metrics instrumentation, health probes, structured logging, alert rules, dashboard, channel integration. Broadest story in the epic. Infrastructure choice significantly impacts effort.

Sprint Capacity Validation

Sprint Fit Assessment: May not fit in single sprint
Total Effort Assessment: Borderline

Story Splitting Recommendations

  1. Production Monitoring and Alerting #164-A: Metrics endpoint + health probes + structured logging (L(5))
  2. Production Monitoring and Alerting #164-B: Alert rules + channel integration + dashboard (L(5))

Dependencies and Coordination

Story Dependencies

Prerequisite Stories: Epic #66 #149 (Org Setup — service must be running)
Dependent Stories: #168 (Performance Analytics), #169 (SLA Reporting) — consume monitoring data

External Dependencies

Infrastructure Requirements: Prometheus (or compatible), Grafana, alert channel endpoints (Slack, PagerDuty)

Validation and Testing Strategy

Acceptance Testing Approach

Testing Methods: integration tests that trigger high latency and elevated error rates and verify the corresponding alerts fire; unit tests for metrics collection and health probes; a load test to measure metrics overhead
Test Data Requirements: Simulated load for alert threshold testing
Environment Requirements: Prometheus test instance, mock alert channels

Notes

Refinement Insights: ADR needed for monitoring stack choice (Prometheus+Grafana vs cloud-native). This decision affects deployment complexity for all enterprises.

Technical Analysis

Implementation Approach

Technical Strategy: Instrument service with prom-client (Prometheus Node.js client). Expose /metrics endpoint. Health probes as lightweight endpoints. Structured logging via pino (JSON format, correlation ID via cls-hooked or AsyncLocalStorage). Alert rules as Prometheus recording/alerting rules or Alertmanager config.
Key Components: Metrics middleware (prom-client), health probe endpoints, structured logger (pino), alert rule configs, Grafana dashboard JSON
Data Flow: Request → metrics middleware (record latency/status) → handler → structured log → response. Prometheus scrapes /metrics → evaluates alert rules → Alertmanager → channels
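To make the "record latency/status, expose via /metrics" step concrete: prom-client's Histogram handles this in practice, but a dependency-free sketch of the bucket bookkeeping and exposition format (bucket boundaries here are assumed) looks like:

```typescript
// Minimal sketch of a Prometheus-style latency histogram; the real
// service would use prom-client's Histogram instead.
const BUCKETS = [0.05, 0.1, 0.25, 0.5, 1, 2.5]; // seconds, assumed boundaries

class LatencyHistogram {
  private counts = new Array(BUCKETS.length + 1).fill(0); // last slot = +Inf
  private sum = 0;
  private total = 0;

  observe(seconds: number): void {
    this.sum += seconds;
    this.total++;
    const i = BUCKETS.findIndex((b) => seconds <= b);
    this.counts[i === -1 ? BUCKETS.length : i]++;
  }

  // Prometheus exposition format: cumulative buckets, then _sum and _count.
  expose(name = "request_duration_seconds"): string {
    const lines = [`# TYPE ${name} histogram`];
    let cumulative = 0;
    BUCKETS.forEach((b, i) => {
      cumulative += this.counts[i];
      lines.push(`${name}_bucket{le="${b}"} ${cumulative}`);
    });
    lines.push(`${name}_bucket{le="+Inf"} ${this.total}`);
    lines.push(`${name}_sum ${this.sum}`);
    lines.push(`${name}_count ${this.total}`);
    return lines.join("\n");
  }
}
```

Cumulative buckets are what allow `histogram_quantile` to estimate p50/p95/p99 on the Prometheus side.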

Technical Requirements

  • prom-client for Prometheus metrics (histogram for latency, counter for requests, gauge for connections)
  • pino for structured JSON logging (fast, low overhead)
  • AsyncLocalStorage for correlation ID propagation
  • Health probes: /api/v1/health/live (200 always), /api/v1/health/ready (200 if DB+S3 OK), /api/v1/health/startup (200 after init)
  • Grafana dashboard: JSON template with panels for request rate, latency percentiles, error rate, DB pool, uptime
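The AsyncLocalStorage requirement can be sketched with Node's built-in module; the wrapper names are illustrative, and the real service would wire this into its HTTP framework's middleware:

```typescript
// Correlation-ID propagation via Node's AsyncLocalStorage: everything
// awaited inside the wrapped handler sees the same ID, with no need to
// thread it through function arguments.
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

const requestContext = new AsyncLocalStorage<{ correlationId: string }>();

// Wrap a request handler; reuse an incoming ID (e.g. from a header) or mint one.
function withCorrelationId<T>(fn: () => Promise<T>, incomingId?: string): Promise<T> {
  const correlationId = incomingId ?? randomUUID();
  return requestContext.run({ correlationId }, fn);
}

// Any log call, however deep in the call stack, can pick the ID up.
function currentCorrelationId(): string | undefined {
  return requestContext.getStore()?.correlationId;
}
```

A pino mixin could then call `currentCorrelationId()` to stamp every log line.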

Technical Risks and Mitigation

  • Monitoring stack complexity for self-hosted enterprises: Impact High, Probability Medium. Mitigation: provide both self-hosted (Prometheus+Grafana) and cloud-native (Datadog/CloudWatch) guides.
  • High-cardinality metrics (per-endpoint labels): Impact Medium, Probability Medium. Mitigation: limit label cardinality; use route patterns, not full paths.

Spike Requirements

Required Spikes: Evaluate monitoring stack (Prometheus+Grafana self-hosted vs cloud-native) — record as ADR

Metadata

Assignees: No one assigned
Labels: user story (work item representing a user story)
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
