diff --git a/docs/my-website/docs/proxy/prometheus.md b/docs/my-website/docs/proxy/prometheus.md
index f3c2f2e37d62..8bb861613816 100644
--- a/docs/my-website/docs/proxy/prometheus.md
+++ b/docs/my-website/docs/proxy/prometheus.md
@@ -158,11 +158,42 @@ Use this for LLM API Error monitoring and tracking remaining rate limits and tok
 | `litellm_remaining_tokens_metric` | Track `x-ratelimit-remaining-tokens` return from LLM API Deployment. Labels: `"model_group", "api_provider", "api_base", "litellm_model_name", "hashed_api_key", "api_key_alias"` |

 ### Deployment State
+
 | Metric Name | Description |
 |----------------------|--------------------------------------|
 | `litellm_deployment_state` | The state of the deployment: 0 = healthy, 1 = partial outage, 2 = complete outage. Labels: `"litellm_model_name", "model_id", "api_base", "api_provider"` |
 | `litellm_deployment_latency_per_output_token` | Latency per output token for deployment. Labels: `"litellm_model_name", "model_id", "api_base", "api_provider", "hashed_api_key", "api_key_alias", "team", "team_alias"` |

+#### State Transitions
+
+| From State | To State | Trigger Conditions |
+|------------|----------|-------------------|
+| **Healthy (0)** | **Partial Outage (1)** | • Any single API call fails<br/>• Network timeout<br/>• Authentication error (401)<br/>• Rate limit hit (429)<br/>• Server error (5xx)<br/>• Any other exception during API call |
+| **Partial Outage (1)** | **Complete Outage (2)** | • Cooldown logic triggers (multiple failures)<br/>• Rate limiting detected<br/>• High failure rate (>50%)<br/>• Non-retryable errors accumulate |
+| **Partial Outage (1)** | **Healthy (0)** | • Next successful API call<br/>• Deployment recovers from cooldown<br/>• Manual intervention |
+| **Complete Outage (2)** | **Healthy (0)** | • Cooldown TTL expires (default: 5 seconds)<br/>• Successful request after cooldown period<br/>• Manual intervention |
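+
+As a rough, non-normative sketch of how a gauge like this could be set with `prometheus_client` (the metric construction and label values below are illustrative, not LiteLLM's internal code):
+
+```python
+from prometheus_client import Gauge
+
+# Illustrative re-creation of the gauge, with the label set from the table above
+deployment_state = Gauge(
+    "litellm_deployment_state",
+    "0 = healthy, 1 = partial outage, 2 = complete outage",
+    ["litellm_model_name", "model_id", "api_base", "api_provider"],
+)
+
+# Example: a single failed call moves a healthy deployment to partial outage (1)
+deployment_state.labels(
+    litellm_model_name="gpt-4o",
+    model_id="deployment-1",
+    api_base="https://api.openai.com",
+    api_provider="openai",
+).set(1)
+```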
+
 #### Fallback (Failover) Metrics

 | Metric Name | Description |
diff --git a/docs/my-website/docs/routing.md b/docs/my-website/docs/routing.md
index 971427806ed0..9e0e2abc928e 100644
--- a/docs/my-website/docs/routing.md
+++ b/docs/my-website/docs/routing.md
@@ -1051,6 +1051,60 @@ The router automatically cools down deployments based on the following condition

 During cooldown, the specific deployment is temporarily removed from the available pool, while other healthy deployments continue serving requests.

+#### Deployment State Lifecycle
+
+```
+🟢 Healthy (0) → 🟡 Partial Outage (1) → 🔴 Complete Outage (2) → 🟢 Healthy (0)
+```
+
+| From State | To State | Concrete Triggers |
+|------------|----------|-------------------|
+| **Healthy (0)** | **Partial Outage (1)** | • Any single API call fails<br/>• Network timeout<br/>• Authentication error (401)<br/>• Rate limit hit (429)<br/>• Server error (5xx) |
+| **Partial Outage (1)** | **Complete Outage (2)** | • >50% failure rate in current minute<br/>• 429 rate limit errors<br/>• Non-retryable errors (401, 404, 408)<br/>• Exceeds allowed fails limit (default: 3) |
+| **Partial Outage (1)** | **Healthy (0)** | • Next successful API call |
+| **Complete Outage (2)** | **Healthy (0)** | • Cooldown TTL expires (default: 5 seconds)<br/>• Successful request after cooldown period |
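+
+A minimal sketch of this lifecycle as a state machine, assuming the defaults above (`allowed_fails = 3`, 5-second cooldown TTL); the class and method names are illustrative, not the router's actual implementation:
+
+```python
+import time
+
+HEALTHY, PARTIAL_OUTAGE, COMPLETE_OUTAGE = 0, 1, 2
+
+class DeploymentHealth:
+    """Hypothetical tracker mirroring the lifecycle table above."""
+
+    def __init__(self, allowed_fails: int = 3, cooldown_ttl: float = 5.0):
+        self.state = HEALTHY
+        self.fails = 0
+        self.calls = 0
+        self.cooled_down_at: float | None = None
+        self.allowed_fails = allowed_fails  # default: 3
+        self.cooldown_ttl = cooldown_ttl    # default: 5 seconds
+
+    def record_call(self, succeeded: bool, non_retryable: bool = False) -> None:
+        self.calls += 1
+        if succeeded:
+            if self.state == PARTIAL_OUTAGE:
+                self.state, self.fails = HEALTHY, 0  # next success recovers
+            return
+        self.fails += 1
+        if self.state == HEALTHY:
+            self.state = PARTIAL_OUTAGE  # any single failure degrades
+        if (
+            non_retryable                       # e.g. 401, 404, 408, or a 429
+            or self.fails / self.calls > 0.5    # >50% failure rate in window
+            or self.fails > self.allowed_fails  # exceeds allowed fails limit
+        ):
+            self.state = COMPLETE_OUTAGE        # enter cooldown
+            self.cooled_down_at = time.monotonic()
+
+    def maybe_recover(self) -> None:
+        if self.state == COMPLETE_OUTAGE and self.cooled_down_at is not None:
+            if time.monotonic() - self.cooled_down_at >= self.cooldown_ttl:
+                self.state, self.fails = HEALTHY, 0  # cooldown TTL expired
+```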
+
 #### Cooldown Recovery

 Deployments automatically recover from cooldown after the cooldown period expires. The router will: