Production Monitoring and Alerting #164
Description
Story Statement
As a platform operations engineer
I want complete monitoring and alerting for the knowledge service
So that I can ensure SLA compliance, detect issues proactively, and respond to incidents rapidly
Where: Knowledge service infrastructure — monitoring stack + alerting pipeline
Epic Context
Parent Epic: Platform Hardening & Enterprise Readiness #68
Status: Refined
Priority: P0 (Must-Have)
Status Workflow
- Refined: Story is detailed, estimated, and ready for development
- In Progress: Story is actively being developed
- Done: Story delivered and accepted
Acceptance Criteria
Functional Requirements
- Given the knowledge service is running
  When an ops engineer accesses the metrics endpoint GET /metrics
  Then it returns Prometheus-format metrics: request_duration_seconds (p50/p95/p99), request_total (by method/status), active_connections, db_pool_utilization, s3_operation_duration
- Given request latency p95 exceeds 500ms for 5 minutes
  When the alert rule evaluates
  Then a "High Latency" alert fires with severity "warning" to configured channels (email, Slack webhook)
- Given error rate exceeds 5% for 2 minutes
  When the alert rule evaluates
  Then a "High Error Rate" alert fires with severity "critical" and escalation to PagerDuty/webhook
- Given the service is running with monitoring enabled
  When an ops engineer accesses the health dashboard
  Then they see: request rate, latency percentiles, error rate, active connections, DB connection pool status, S3 operation status, uptime counter
- Given a health check endpoint
  When the liveness probe calls GET /api/v1/health/live
  Then it returns 200 if the process is alive (no dependency check)
  When the readiness probe calls GET /api/v1/health/ready
  Then it returns 200 only if DB and S3 are reachable, 503 otherwise
- Given the service emits structured logs
  When any request is processed
  Then the log includes: correlation_id, method, path, status, duration_ms, user_id (if authenticated), timestamp in JSON format
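The structured-log criterion above can be sketched as a small formatter. The field names come straight from the acceptance criterion; `buildLogEntry` itself is a hypothetical helper (a real service would emit through pino rather than `JSON.stringify`):

```typescript
// Minimal sketch of the structured request log described above.
// buildLogEntry is a hypothetical helper, not the service's actual API.
interface RequestLog {
  correlation_id: string;
  method: string;
  path: string;
  status: number;
  duration_ms: number;
  user_id?: string;
  timestamp: string;
}

function buildLogEntry(
  correlationId: string,
  method: string,
  path: string,
  status: number,
  durationMs: number,
  userId?: string
): string {
  const entry: RequestLog = {
    correlation_id: correlationId,
    method,
    path,
    status,
    duration_ms: durationMs,
    // user_id is included only when the request was authenticated
    ...(userId !== undefined ? { user_id: userId } : {}),
    timestamp: new Date().toISOString(),
  };
  return JSON.stringify(entry);
}
```

One JSON object per request keeps the log trivially parseable by any aggregator, satisfying the "parseable by log aggregator" QA item.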
Business Rules
- Metrics format: Prometheus exposition format (compatible with Prometheus, Grafana, Datadog)
- Alert severity levels: info, warning, critical
- Alert channels: email (all), Slack webhook (warning+), PagerDuty/webhook (critical)
- Escalation: warning unacknowledged for 15min → escalate to critical
- Health probes: liveness (process alive), readiness (dependencies OK), startup (initialization complete)
- Structured logging: JSON format with correlation ID for request tracing
- Uptime tracking: service uptime counter in metrics for SLA calculation
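The latency and error-rate rules above could look roughly like the following Prometheus alerting rules. This is a sketch under assumptions: the metric names follow the acceptance criteria, the PromQL expressions and a `status=~"5.."` error convention are illustrative, and channel routing/escalation would live in Alertmanager configuration, which is not shown.

```yaml
# Hypothetical Prometheus alerting rules matching the thresholds above.
groups:
  - name: knowledge-service
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 request latency above 500ms for 5 minutes"
      - alert: HighErrorRate
        expr: sum(rate(request_total{status=~"5.."}[2m])) / sum(rate(request_total[2m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 2 minutes"
```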
Edge Cases and Error Handling
- Metrics endpoint under load: Metrics collection must not degrade service performance (<5ms overhead)
- Alert channel unreachable: Retry 3x with backoff; log alert locally if all channels fail
- DB connection pool exhausted: Readiness probe returns 503; alert fires immediately
- Log volume spike: Structured logging with configurable log level (info default, debug for troubleshooting)
- Monitoring bootstrap: Graceful startup — metrics available only after startup probe passes
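The "alert channel unreachable" rule above can be sketched as a retry helper. The exponential delay curve (1s, 2s, 4s) is an assumption, since the story only specifies "retry 3x with backoff"; `sendFn` and `logLocally` are hypothetical injection points:

```typescript
// Sketch of the retry-3x-with-backoff rule for alert delivery.
// Delay values are an assumption; sendFn/logLocally are hypothetical hooks.
async function deliverAlert(
  sendFn: () => Promise<void>,
  logLocally: (reason: string) => void,
  delayMs: (attempt: number) => number = (attempt) => 1000 * 2 ** attempt,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms))
): Promise<boolean> {
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      await sendFn();
      return true; // delivered on this attempt
    } catch {
      if (attempt < 2) await sleep(delayMs(attempt)); // back off before retrying
    }
  }
  // All attempts failed: record the alert locally so it is not lost.
  logLocally("alert delivery failed after 3 attempts");
  return false;
}
```

Injecting `sleep` keeps the backoff testable without real waiting.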
Definition of Done Checklist
Development Completion
- All 6 acceptance criteria implemented and verified
- Prometheus metrics endpoint
- Structured JSON logging with correlation IDs
- Liveness, readiness, startup health probes
- Alert rules configuration (latency, error rate, connection pool)
- Alert channel integration (email, Slack, webhook)
- Monitoring dashboard configuration (Grafana or equivalent)
- Unit tests for metrics collection and health probes
- Integration tests for alerting flow
Quality Assurance
- Metrics collection adds <5ms latency
- Alert fires within 30s of threshold breach
- All health probes return correct status under various conditions
- Structured logs parseable by log aggregator
Deployment and Release
- Monitoring stack deployment documented (Prometheus + Grafana or cloud equivalent)
- Alert channel credentials configured
- Dashboard templates included in deployment
Story Sizing and Sprint Readiness
Refined Story Points
Final Story Points: XL (10)
Confidence Level: Low
Sizing Justification: Full observability stack — metrics instrumentation, health probes, structured logging, alert rules, dashboard, channel integration. Broadest story in the epic. Infrastructure choice significantly impacts effort.
Sprint Capacity Validation
Sprint Fit Assessment: May not fit in single sprint
Total Effort Assessment: Borderline
Story Splitting Recommendations
- Production Monitoring and Alerting #164-A: Metrics endpoint + health probes + structured logging (L (5))
- Production Monitoring and Alerting #164-B: Alert rules + channel integration + dashboard (L (5))
Dependencies and Coordination
Story Dependencies
Prerequisite Stories: Epic #66 #149 (Org Setup — service must be running)
Dependent Stories: #168 (Performance Analytics), #169 (SLA Reporting) — consume monitoring data
External Dependencies
Infrastructure Requirements: Prometheus (or compatible), Grafana, alert channel endpoints (Slack, PagerDuty)
Validation and Testing Strategy
Acceptance Testing Approach
Testing Methods: integration tests that trigger high latency/error rate and verify the alert fires; unit tests for metrics collection and health probes; a load test to measure metrics overhead
Test Data Requirements: Simulated load for alert threshold testing
Environment Requirements: Prometheus test instance, mock alert channels
Notes
Refinement Insights: ADR needed for monitoring stack choice (Prometheus+Grafana vs cloud-native). This decision affects deployment complexity for all enterprises.
Technical Analysis
Implementation Approach
Technical Strategy: Instrument service with prom-client (Prometheus Node.js client). Expose /metrics endpoint. Health probes as lightweight endpoints. Structured logging via pino (JSON format, correlation ID via cls-hooked or AsyncLocalStorage). Alert rules as Prometheus recording/alerting rules or Alertmanager config.
Key Components: Metrics middleware (prom-client), health probe endpoints, structured logger (pino), alert rule configs, Grafana dashboard JSON
Data Flow: Request → metrics middleware (record latency/status) → handler → structured log → response. Prometheus scrapes /metrics → evaluates alert rules → Alertmanager → channels
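The correlation-ID propagation named in the strategy can be sketched with Node's built-in AsyncLocalStorage. The wrapper shape below is an assumption (framework-agnostic), not the service's actual middleware:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// Each request runs inside its own store, so any code on the async
// path (handlers, the structured logger) can read the same ID.
const requestContext = new AsyncLocalStorage<{ correlationId: string }>();

// Run a request handler with an incoming or freshly generated ID.
function withCorrelationId<T>(fn: () => T, incomingId?: string): T {
  const correlationId = incomingId ?? randomUUID();
  return requestContext.run({ correlationId }, fn);
}

// Read the current request's ID; undefined outside a request context.
function currentCorrelationId(): string | undefined {
  return requestContext.getStore()?.correlationId;
}
```

Because the store is scoped per `run()` call, concurrent requests never see each other's IDs, which is what makes the correlation_id field in the structured logs trustworthy.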
Technical Requirements
- prom-client for Prometheus metrics (histogram for latency, counter for requests, gauge for connections)
- pino for structured JSON logging (fast, low overhead)
- AsyncLocalStorage for correlation ID propagation
- Health probes: /api/v1/health/live (200 always), /api/v1/health/ready (200 if DB+S3 OK), /api/v1/health/startup (200 after init)
- Grafana dashboard: JSON template with panels for request rate, latency percentiles, error rate, DB pool, uptime
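The probe semantics in the list above reduce to a small status function. `checkDb` and `checkS3` are hypothetical dependency checks injected by the caller; only the 200/503 contract comes from the story:

```typescript
// Sketch of the probe semantics: liveness is unconditional,
// readiness requires every dependency. checkDb/checkS3 are
// hypothetical names for the injected dependency checks.
type Check = () => Promise<boolean>;

async function readinessStatus(checkDb: Check, checkS3: Check): Promise<number> {
  // A failed or throwing check counts as unreachable.
  const [dbOk, s3Ok] = await Promise.all([
    checkDb().catch(() => false),
    checkS3().catch(() => false),
  ]);
  return dbOk && s3Ok ? 200 : 503;
}

function livenessStatus(): number {
  return 200; // alive as long as the process can answer at all
}
```

Keeping the checks injected makes the "health probes return correct status under various conditions" QA item a plain unit test rather than an integration exercise.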
Technical Risks and Mitigation
| Risk | Impact | Probability | Mitigation Strategy |
|---|---|---|---|
| Monitoring stack complexity for self-hosted enterprises | High | Medium | Provide both self-hosted (Prometheus+Grafana) and cloud-native (Datadog/CloudWatch) guides |
| High cardinality metrics (per-endpoint labels) | Medium | Medium | Limit label cardinality; use route patterns, not full paths |
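The cardinality mitigation in the table above can be illustrated by normalizing paths to route patterns before using them as label values. The segment patterns below (numeric IDs, UUIDs) are assumptions about what the service's paths contain:

```typescript
// Sketch of label-cardinality control: collapse variable path
// segments to :id so each route yields one label value, not one
// per resource. The ID patterns here are assumptions.
function routePattern(path: string): string {
  return path
    .split("/")
    .map((seg) => {
      if (/^\d+$/.test(seg)) return ":id"; // numeric IDs
      if (/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(seg)) {
        return ":id"; // UUIDs
      }
      return seg;
    })
    .join("/");
}
```

With this in the metrics middleware, `request_total{path="/api/v1/items/:id"}` stays a single series no matter how many items exist.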
Spike Requirements
Required Spikes: Evaluate monitoring stack (Prometheus+Grafana self-hosted vs cloud-native) — record as ADR