10 changes: 10 additions & 0 deletions docs/my-website/docs/proxy/prometheus.md
@@ -158,11 +158,21 @@ Use this for LLM API Error monitoring and tracking remaining rate limits and tokens
| `litellm_remaining_tokens_metric` | Track `x-ratelimit-remaining-tokens` return from LLM API Deployment. Labels: `"model_group", "api_provider", "api_base", "litellm_model_name", "hashed_api_key", "api_key_alias"` |

### Deployment State

| Metric Name | Description |
|----------------------|--------------------------------------|
| `litellm_deployment_state` | The state of the deployment: 0 = healthy, 1 = partial outage, 2 = complete outage. Labels: `"litellm_model_name", "model_id", "api_base", "api_provider"` |
| `litellm_deployment_latency_per_output_token` | Latency per output token for deployment. Labels: `"litellm_model_name", "model_id", "api_base", "api_provider", "hashed_api_key", "api_key_alias", "team", "team_alias"` |
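
The `litellm_deployment_state` gauge is exposed in the standard Prometheus text exposition format on the proxy's `/metrics` endpoint. As a minimal sketch (the sample scrape text and the `deployment_states` helper below are illustrative, not part of LiteLLM), the numeric values map back to state names like so:

```python
import re

# Hypothetical sample of a /metrics scrape; the label set matches the
# documented litellm_deployment_state labels.
SAMPLE = """\
litellm_deployment_state{litellm_model_name="gpt-4",model_id="abc123",api_base="https://api.openai.com",api_provider="openai"} 0.0
litellm_deployment_state{litellm_model_name="claude-3",model_id="def456",api_base="https://api.anthropic.com",api_provider="anthropic"} 2.0
"""

STATE_NAMES = {0: "healthy", 1: "partial outage", 2: "complete outage"}

def deployment_states(metrics_text: str) -> dict:
    """Map model_id -> human-readable state from a metrics scrape."""
    states = {}
    pattern = re.compile(r'litellm_deployment_state\{([^}]*)\}\s+([0-9.]+)')
    for labels_raw, value in pattern.findall(metrics_text):
        labels = dict(re.findall(r'(\w+)="([^"]*)"', labels_raw))
        states[labels["model_id"]] = STATE_NAMES[int(float(value))]
    return states

print(deployment_states(SAMPLE))
# {'abc123': 'healthy', 'def456': 'complete outage'}
```

In production you would typically query this through Prometheus itself (e.g. alerting on `litellm_deployment_state == 2`) rather than parsing the raw scrape.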

#### State Transitions

| From State | To State | Trigger Conditions |
|------------|----------|-------------------|
| **Healthy (0)** | **Partial Outage (1)** | • Any single API call fails<br/>• Network timeout<br/>• Authentication error (401)<br/>• Rate limit hit (429)<br/>• Server error (5xx)<br/>• Any other exception during API call |
| **Partial Outage (1)** | **Complete Outage (2)** | • Cooldown logic triggers (multiple failures)<br/>• Rate limiting detected<br/>• High failure rate (>50%)<br/>• Non-retryable errors accumulate |
| **Partial Outage (1)** | **Healthy (0)** | • Next successful API call<br/>• Deployment recovers from cooldown<br/>• Manual intervention |
| **Complete Outage (2)** | **Healthy (0)** | • Cooldown TTL expires (default: 5 seconds)<br/>• Successful request after cooldown period<br/>• Manual intervention |
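
The transition table above can be sketched as a small state machine. This is an illustrative model of the documented transitions only, not LiteLLM's actual implementation:

```python
HEALTHY, PARTIAL_OUTAGE, COMPLETE_OUTAGE = 0, 1, 2

class DeploymentState:
    """Toy model of the documented state transitions."""

    def __init__(self):
        self.state = HEALTHY

    def record_failure(self):
        # Any single failed call (timeout, 401, 429, 5xx, other exception)
        # degrades a healthy deployment to partial outage.
        if self.state == HEALTHY:
            self.state = PARTIAL_OUTAGE

    def record_cooldown(self):
        # Cooldown logic (repeated failures, rate limiting, >50% failure
        # rate) escalates a partial outage to a complete outage.
        if self.state == PARTIAL_OUTAGE:
            self.state = COMPLETE_OUTAGE

    def record_success(self):
        # A successful call (including one after the cooldown TTL expires)
        # returns the deployment to healthy from either outage state.
        self.state = HEALTHY

d = DeploymentState()
d.record_failure()    # 0 -> 1
d.record_cooldown()   # 1 -> 2
d.record_success()    # 2 -> 0
print(d.state)  # 0
```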

#### Fallback (Failover) Metrics

| Metric Name | Description |
13 changes: 13 additions & 0 deletions docs/my-website/docs/routing.md
@@ -1051,6 +1051,19 @@ The router automatically cools down deployments based on the following conditions

During cooldown, the specific deployment is temporarily removed from the available pool, while other healthy deployments continue serving requests.

#### Deployment State Lifecycle

```
🟢 Healthy (0) → 🟡 Partial Outage (1) → 🔴 Complete Outage (2) → 🟢 Healthy (0)
```

| From State | To State | Concrete Triggers |
|------------|----------|-------------------|
| **Healthy (0)** | **Partial Outage (1)** | • Any single API call fails<br/>• Network timeout<br/>• Authentication error (401)<br/>• Rate limit hit (429)<br/>• Server error (5xx) |
| **Partial Outage (1)** | **Complete Outage (2)** | • >50% failure rate in current minute<br/>• 429 rate limit errors<br/>• Non-retryable errors (401, 404, 408)<br/>• Exceeds allowed fails limit (default: 3) |
| **Partial Outage (1)** | **Healthy (0)** | • Next successful API call |
| **Complete Outage (2)** | **Healthy (0)** | • Cooldown TTL expires (default: 5 seconds)<br/>• Successful request after cooldown period |

#### Cooldown Recovery

Deployments automatically recover from cooldown after the cooldown period expires. The router will: