This document helps contributors understand where to make changes in LiteLLM.
The LiteLLM AI Gateway (Proxy) uses the LiteLLM SDK internally for all LLM calls:
```
OpenAI SDK (client)    ──▶ LiteLLM AI Gateway (proxy/) ──▶ LiteLLM SDK (litellm/) ──▶ LLM API
Anthropic SDK (client) ──▶ LiteLLM AI Gateway (proxy/) ──▶ LiteLLM SDK (litellm/) ──▶ LLM API
Any HTTP client        ──▶ LiteLLM AI Gateway (proxy/) ──▶ LiteLLM SDK (litellm/) ──▶ LLM API
```
The AI Gateway adds authentication, rate limiting, budgets, and routing on top of the SDK. The SDK handles the actual LLM provider calls, request/response transformations, and streaming.
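Because the gateway exposes OpenAI-compatible routes, a client only needs to point its base URL at the proxy to send traffic down this path. A minimal sketch, assuming a gateway running locally on the default port 4000 and a proxy-issued virtual key (both values are illustrative):

```python
# Sketch: OpenAI SDK client pointed at a locally running LiteLLM AI Gateway.
# The base_url, API key, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",  # the LiteLLM AI Gateway
    api_key="sk-litellm-...",          # virtual key issued by the proxy
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any model name configured on the gateway
    messages=[{"role": "user", "content": "hello"}],
)
print(response.choices[0].message.content)
```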
The AI Gateway (`litellm/proxy/`) wraps the SDK with authentication, rate limiting, and management features:
```mermaid
sequenceDiagram
    participant Client
    participant ProxyServer as proxy/proxy_server.py
    participant Auth as proxy/auth/user_api_key_auth.py
    participant Redis as Redis Cache
    participant Hooks as proxy/hooks/
    participant Router as router.py
    participant Main as main.py + utils.py
    participant Handler as llms/custom_httpx/llm_http_handler.py
    participant Transform as llms/{provider}/chat/transformation.py
    participant Provider as LLM Provider API
    participant CostCalc as cost_calculator.py
    participant LoggingObj as litellm_logging.py
    participant DBWriter as db/db_spend_update_writer.py
    participant Postgres as PostgreSQL

    %% Request Flow
    Client->>ProxyServer: POST /v1/chat/completions
    ProxyServer->>Auth: user_api_key_auth()
    Auth->>Redis: Check API key cache
    Redis-->>Auth: Key info + spend limits
    ProxyServer->>Hooks: max_budget_limiter, parallel_request_limiter
    Hooks->>Redis: Check/increment rate limit counters
    ProxyServer->>Router: route_request()
    Router->>Main: litellm.acompletion()
    Main->>Handler: BaseLLMHTTPHandler.completion()
    Handler->>Transform: ProviderConfig.transform_request()
    Handler->>Provider: HTTP Request
    Provider-->>Handler: Response
    Handler->>Transform: ProviderConfig.transform_response()
    Transform-->>Handler: ModelResponse
    Handler-->>Main: ModelResponse

    %% Cost Attribution (in utils.py wrapper)
    Main->>LoggingObj: update_response_metadata()
    LoggingObj->>CostCalc: _response_cost_calculator()
    CostCalc->>CostCalc: completion_cost(tokens × price)
    CostCalc-->>LoggingObj: response_cost
    LoggingObj-->>Main: Set response._hidden_params["response_cost"]
    Main-->>ProxyServer: ModelResponse (with cost in _hidden_params)

    %% Response Headers + Async Logging
    ProxyServer->>ProxyServer: Extract cost from hidden_params
    ProxyServer->>LoggingObj: async_success_handler()
    LoggingObj->>Hooks: async_log_success_event()
    Hooks->>DBWriter: update_database(response_cost)
    DBWriter->>Redis: Queue spend increment
    DBWriter->>Postgres: Batch write spend logs (async)
    ProxyServer-->>Client: ModelResponse + x-litellm-response-cost header
```
```mermaid
graph TD
    subgraph "Incoming Request"
        Client["POST /v1/chat/completions"]
    end

    subgraph "proxy/proxy_server.py"
        Endpoint["chat_completion()"]
    end

    subgraph "proxy/auth/"
        Auth["user_api_key_auth()"]
    end

    subgraph "proxy/"
        PreCall["litellm_pre_call_utils.py"]
        RouteRequest["route_llm_request.py"]
    end

    subgraph "litellm/"
        Router["router.py"]
        Main["main.py"]
    end

    subgraph "Infrastructure"
        DualCache["DualCache<br/>(in-memory + Redis)"]
        Postgres["PostgreSQL<br/>(keys, teams, spend logs)"]
    end

    Client --> Endpoint
    Endpoint --> Auth
    Auth --> DualCache
    DualCache -.->|cache miss| Postgres
    Auth --> PreCall
    PreCall --> RouteRequest
    RouteRequest --> Router
    Router --> DualCache
    Router --> Main
    Main --> Client
```
Key proxy files:
- `proxy/proxy_server.py` - Main API endpoints
- `proxy/auth/` - Authentication (API keys, JWT, OAuth2)
- `proxy/hooks/` - Proxy-level callbacks
- `router.py` - Load balancing, fallbacks
- `router_strategy/` - Routing algorithms (`lowest_latency.py`, `simple_shuffle.py`, etc.)
LLM-specific proxy endpoints:
| Endpoint | Directory | Purpose |
|---|---|---|
| `/v1/messages` | `proxy/anthropic_endpoints/` | Anthropic Messages API |
| `/vertex-ai/*` | `proxy/vertex_ai_endpoints/` | Vertex AI passthrough |
| `/gemini/*` | `proxy/google_endpoints/` | Google AI Studio passthrough |
| `/v1/images/*` | `proxy/image_endpoints/` | Image generation |
| `/v1/batches` | `proxy/batches_endpoints/` | Batch processing |
| `/v1/files` | `proxy/openai_files_endpoints/` | File uploads |
| `/v1/fine_tuning` | `proxy/fine_tuning_endpoints/` | Fine-tuning jobs |
| `/v1/rerank` | `proxy/rerank_endpoints/` | Reranking |
| `/v1/responses` | `proxy/response_api_endpoints/` | OpenAI Responses API |
| `/v1/vector_stores` | `proxy/vector_store_endpoints/` | Vector stores |
| `/*` (passthrough) | `proxy/pass_through_endpoints/` | Direct provider passthrough |
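For example, the `/v1/messages` route lets the Anthropic SDK talk to the gateway directly. A minimal sketch, assuming a locally running gateway and a proxy-issued virtual key (base URL, key, and model name are illustrative):

```python
# Sketch: Anthropic SDK pointed at the gateway's /v1/messages route.
# base_url, api_key, and model are illustrative assumptions.
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:4000",
    api_key="sk-litellm-...",
)

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=256,
    messages=[{"role": "user", "content": "hello"}],
)
print(message.content[0].text)
```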
Proxy Hooks (`proxy/hooks/__init__.py`):

| Hook | File | Purpose |
|---|---|---|
| `max_budget_limiter` | `proxy/hooks/max_budget_limiter.py` | Enforce budget limits |
| `parallel_request_limiter` | `proxy/hooks/parallel_request_limiter_v3.py` | Rate limiting per key/user |
| `cache_control_check` | `proxy/hooks/cache_control_check.py` | Cache validation |
| `responses_id_security` | `proxy/hooks/responses_id_security.py` | Response ID validation |
| `litellm_skills` | `proxy/hooks/skills_injection.py` | Skills injection |
To add a new proxy hook, implement `CustomLogger` and register it in `PROXY_HOOKS`.
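A skeleton hook, as a hedged sketch against the `CustomLogger` interface in `litellm/integrations/custom_logger.py` (the class name and hook body are illustrative; `async_pre_call_hook` is one of several overridable methods):

```python
# Illustrative proxy hook; only CustomLogger and PROXY_HOOKS come from the
# docs above, everything else here is a placeholder.
from litellm.integrations.custom_logger import CustomLogger


class MyRequestGuard(CustomLogger):
    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        # Inspect or mutate the request before it reaches the SDK;
        # raise an exception here to reject it.
        return data
```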
The AI Gateway uses external infrastructure for persistence and caching:
```mermaid
graph LR
    subgraph "AI Gateway (proxy/)"
        Proxy["proxy_server.py"]
        Auth["auth/user_api_key_auth.py"]
        DBWriter["db/db_spend_update_writer.py<br/>DBSpendUpdateWriter"]
        InternalCache["utils.py<br/>InternalUsageCache"]
        CostCallback["hooks/proxy_track_cost_callback.py<br/>_ProxyDBLogger"]
        Scheduler["APScheduler<br/>ProxyStartupEvent"]
    end

    subgraph "SDK (litellm/)"
        Router["router.py<br/>Router.cache (DualCache)"]
        LLMCache["caching/caching_handler.py<br/>LLMCachingHandler"]
        CacheClass["caching/caching.py<br/>Cache"]
    end

    subgraph "Redis (caching/redis_cache.py)"
        RateLimit["Rate Limit Counters"]
        SpendQueue["Spend Increment Queue"]
        KeyCache["API Key Cache"]
        TPM_RPM["TPM/RPM Tracking"]
        Cooldowns["Deployment Cooldowns"]
        LLMResponseCache["LLM Response Cache"]
    end

    subgraph "PostgreSQL (proxy/schema.prisma)"
        Keys["LiteLLM_VerificationToken"]
        Teams["LiteLLM_TeamTable"]
        SpendLogs["LiteLLM_SpendLogs"]
        Users["LiteLLM_UserTable"]
    end

    Auth --> InternalCache
    InternalCache --> KeyCache
    InternalCache -.->|cache miss| Keys
    InternalCache --> RateLimit
    Router --> TPM_RPM
    Router --> Cooldowns
    LLMCache --> CacheClass
    CacheClass --> LLMResponseCache
    CostCallback --> DBWriter
    DBWriter --> SpendQueue
    DBWriter --> SpendLogs
    Scheduler --> SpendLogs
    Scheduler --> Keys
```
| Component | Purpose | Key Files/Classes |
|---|---|---|
| Redis | Rate limiting, API key caching, TPM/RPM tracking, cooldowns, LLM response caching, spend queuing | `caching/redis_cache.py` (`RedisCache`), `caching/dual_cache.py` (`DualCache`) |
| PostgreSQL | API keys, teams, users, spend logs | `proxy/utils.py` (`PrismaClient`), `proxy/schema.prisma` |
| InternalUsageCache | Proxy-level cache for rate limits + API keys (in-memory + Redis) | `proxy/utils.py` (`InternalUsageCache`) |
| Router.cache | TPM/RPM tracking, deployment cooldowns, client caching (in-memory + Redis) | `router.py` (`Router.cache: DualCache`) |
| LLMCachingHandler | SDK-level LLM response/embedding caching | `caching/caching_handler.py` (`LLMCachingHandler`), `caching/caching.py` (`Cache`) |
| DBSpendUpdateWriter | Batches spend updates to reduce DB writes | `proxy/db/db_spend_update_writer.py` (`DBSpendUpdateWriter`) |
| Cost Tracking | Calculates and logs response costs | `proxy/hooks/proxy_track_cost_callback.py` (`_ProxyDBLogger`) |
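The two-tier pattern behind `InternalUsageCache` and `Router.cache` reads the in-memory layer first and falls back to Redis on a miss. A hedged sketch, assuming the `DualCache` API from `caching/dual_cache.py` (the `in_memory_cache.py` module path and constructor arguments are assumptions):

```python
# Sketch of the in-memory + Redis two-tier cache described above. Import
# paths mirror the file names in the table; in_memory_cache.py is assumed.
import asyncio

from litellm.caching.dual_cache import DualCache
from litellm.caching.in_memory_cache import InMemoryCache
from litellm.caching.redis_cache import RedisCache


async def main():
    cache = DualCache(
        in_memory_cache=InMemoryCache(),
        redis_cache=RedisCache(host="localhost", port=6379),
    )
    # Writes land in both layers; reads check memory first, then Redis.
    await cache.async_set_cache("api_key:sk-test", {"spend": 0.0})
    print(await cache.async_get_cache("api_key:sk-test"))


asyncio.run(main())
```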
Background Jobs (APScheduler, initialized in `proxy/proxy_server.py` → `ProxyStartupEvent.initialize_scheduled_background_jobs()`):

| Job | Interval | Purpose | Key Files |
|---|---|---|---|
| `update_spend` | 60s | Batch write spend logs to PostgreSQL | `proxy/db/db_spend_update_writer.py` |
| `reset_budget` | 10-12min | Reset budgets for keys/users/teams | `proxy/management_helpers/budget_reset_job.py` |
| `add_deployment` | 10s | Sync new model deployments from DB | `proxy/proxy_server.py` (`ProxyConfig`) |
| `cleanup_old_spend_logs` | cron/interval | Delete old spend logs | `proxy/management_helpers/spend_log_cleanup.py` |
| `check_batch_cost` | 30min | Calculate costs for batch jobs | `proxy/management_helpers/check_batch_cost_job.py` |
| `check_responses_cost` | 30min | Calculate costs for responses API | `proxy/management_helpers/check_responses_cost_job.py` |
| `process_rotations` | 1hr | Auto-rotate API keys | `proxy/management_helpers/key_rotation_manager.py` |
| `_run_background_health_check` | continuous | Health check model deployments | `proxy/proxy_server.py` |
| `send_weekly_spend_report` | weekly | Slack spend alerts | `proxy/utils.py` (`SlackAlerting`) |
| `send_monthly_spend_report` | monthly | Slack spend alerts | `proxy/utils.py` (`SlackAlerting`) |
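For orientation, job registration looks roughly like the following. This is a hedged sketch of the APScheduler pattern, not the proxy's actual wiring; `ProxyStartupEvent.initialize_scheduled_background_jobs()` passes in real dependencies such as the Prisma client:

```python
# Hedged sketch of how a background job like update_spend is scheduled.
# The job body is a placeholder; see proxy/db/db_spend_update_writer.py.
from apscheduler.schedulers.asyncio import AsyncIOScheduler


async def update_spend():
    # Flush queued spend increments from Redis into LiteLLM_SpendLogs.
    ...


scheduler = AsyncIOScheduler()
scheduler.add_job(update_spend, "interval", seconds=60)
scheduler.start()  # requires a running asyncio event loop
```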
Cost Attribution Flow:

1. LLM response returns to the `utils.py` wrapper after `litellm.acompletion()` completes
2. `update_response_metadata()` (`llm_response_utils/response_metadata.py`) is called
3. `logging_obj._response_cost_calculator()` (`litellm_logging.py`) calculates cost via `litellm.completion_cost()` (`cost_calculator.py`)
4. Cost is stored in `response._hidden_params["response_cost"]`
5. `proxy/common_request_processing.py` extracts the cost from `hidden_params` and adds it to the response headers (`x-litellm-response-cost`)
6. `logging_obj.async_success_handler()` triggers callbacks, including `_ProxyDBLogger.async_log_success_event()`
7. `DBSpendUpdateWriter.update_database()` queues spend increments to Redis
8. The background job `update_spend` flushes queued spend to PostgreSQL every 60s
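The result of this flow is visible on both surfaces. A hedged sketch; the SDK side reads `_hidden_params` as described above, and the gateway side reads the `x-litellm-response-cost` header (URL, key, and model name are illustrative):

```python
import httpx
import litellm

# SDK side: cost is attached by the utils.py wrapper.
resp = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "hi"}],
)
print(resp._hidden_params.get("response_cost"))

# Gateway side: cost is surfaced as a response header.
r = httpx.post(
    "http://localhost:4000/v1/chat/completions",  # illustrative gateway URL
    headers={"Authorization": "Bearer sk-litellm-..."},
    json={"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]},
)
print(r.headers.get("x-litellm-response-cost"))
```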
The SDK (litellm/) provides the core LLM calling functionality used by both direct SDK users and the AI Gateway.
```mermaid
graph TD
    subgraph "SDK Entry Points"
        Completion["litellm.completion()"]
        Messages["litellm.messages()"]
    end

    subgraph "main.py"
        Main["completion()<br/>acompletion()"]
    end

    subgraph "utils.py"
        GetProvider["get_llm_provider()"]
    end

    subgraph "llms/custom_httpx/"
        Handler["llm_http_handler.py<br/>BaseLLMHTTPHandler"]
        HTTP["http_handler.py<br/>HTTPHandler / AsyncHTTPHandler"]
    end

    subgraph "llms/{provider}/chat/"
        TransformReq["transform_request()"]
        TransformResp["transform_response()"]
    end

    subgraph "litellm_core_utils/"
        Streaming["streaming_handler.py"]
    end

    subgraph "integrations/ (async, off main thread)"
        Callbacks["custom_logger.py<br/>Langfuse, Datadog, etc."]
    end

    Completion --> Main
    Messages --> Main
    Main --> GetProvider
    GetProvider --> Handler
    Handler --> TransformReq
    TransformReq --> HTTP
    HTTP --> Provider["LLM Provider API"]
    Provider --> HTTP
    HTTP --> TransformResp
    TransformResp --> Streaming
    Streaming --> Response["ModelResponse"]
    Response -.->|async| Callbacks
```
Key SDK files:
- `main.py` - Entry points: `completion()`, `acompletion()`, `embedding()`
- `utils.py` - `get_llm_provider()` resolves model → provider
- `llms/custom_httpx/llm_http_handler.py` - Central HTTP orchestrator
- `llms/custom_httpx/http_handler.py` - Low-level HTTP client
- `llms/{provider}/chat/transformation.py` - Provider-specific transformations
- `litellm_core_utils/streaming_handler.py` - Streaming response handling
- `integrations/` - Async callbacks (Langfuse, Datadog, etc.)
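A minimal direct-SDK call that exercises this path; the provider prefix in the model string is what `get_llm_provider()` resolves (the model name is illustrative):

```python
import litellm

# The "anthropic/" prefix is resolved by get_llm_provider() in utils.py;
# the request is then transformed and sent via llm_http_handler.py.
response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[{"role": "user", "content": "hello"}],
)
print(response.choices[0].message.content)
```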
When a request comes in, it goes through a translation layer that converts between API formats. Each translation is isolated in its own file, making it easy to test and modify independently.
| Incoming API | Provider | Translation File |
|---|---|---|
| `/v1/chat/completions` | Anthropic | `llms/anthropic/chat/transformation.py` |
| `/v1/chat/completions` | Bedrock Converse | `llms/bedrock/chat/converse_transformation.py` |
| `/v1/chat/completions` | Bedrock Invoke | `llms/bedrock/chat/invoke_transformations/anthropic_claude3_transformation.py` |
| `/v1/chat/completions` | Gemini | `llms/gemini/chat/transformation.py` |
| `/v1/chat/completions` | Vertex AI | `llms/vertex_ai/gemini/transformation.py` |
| `/v1/chat/completions` | OpenAI | `llms/openai/chat/gpt_transformation.py` |
| `/v1/messages` (passthrough) | Anthropic | `llms/anthropic/experimental_pass_through/messages/transformation.py` |
| `/v1/messages` (passthrough) | Bedrock | `llms/bedrock/messages/invoke_transformations/anthropic_claude3_transformation.py` |
| `/v1/messages` (passthrough) | Vertex AI | `llms/vertex_ai/vertex_ai_partner_models/anthropic/experimental_pass_through/transformation.py` |
| Passthrough endpoints | All | `proxy/pass_through_endpoints/llm_provider_handlers/` |
If `/v1/messages` → Bedrock Converse prompt caching isn't working but Bedrock Invoke works:

- Bedrock Converse translation: `llms/bedrock/chat/converse_transformation.py`
- Bedrock Invoke translation: `llms/bedrock/chat/invoke_transformations/anthropic_claude3_transformation.py`
- Compare how each handles `cache_control` in `transform_request()` (see the sketch below)
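One way to run that comparison without live API calls, in the spirit of the unit-test example at the end of this section (the model name and message contents are illustrative):

```python
from litellm.llms.bedrock.chat.converse_transformation import BedrockConverseConfig

# Transform the same OpenAI-format request with the Converse config and
# inspect how cache_control is carried through. Repeat with the Invoke
# config from llms/bedrock/chat/invoke_transformations/ and diff the output.
messages = [
    {"role": "user", "content": "test", "cache_control": {"type": "ephemeral"}}
]
converse_request = BedrockConverseConfig().transform_request(
    model="anthropic.claude-3-opus",
    messages=messages,
    optional_params={},
    litellm_params={},
    headers={},
)
print(converse_request)  # look for Bedrock cachePoint blocks here
```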
Each provider has a Config class that inherits from `BaseConfig` (`llms/base_llm/chat/transformation.py`):

```python
class ProviderConfig(BaseConfig):
    def transform_request(self, model, messages, optional_params, litellm_params, headers):
        # Convert OpenAI format → Provider format
        transformed_messages = [...]  # provider-specific message shapes
        return {"messages": transformed_messages}

    def transform_response(self, model, raw_response, model_response, logging_obj, **kwargs):
        # Convert Provider format → OpenAI format
        return ModelResponse(choices=[...], usage=Usage(...))
```

The `BaseLLMHTTPHandler` (`llms/custom_httpx/llm_http_handler.py`) calls these methods - you never need to modify the handler itself.
To add a new provider:

1. Create `llms/{provider}/chat/transformation.py`
2. Implement a Config class with `transform_request()` and `transform_response()`
3. Add tests in `tests/llm_translation/test_{provider}.py`
To add a new parameter to an existing provider:

1. Find the translation file from the table above
2. Modify `transform_request()` to handle the new parameter
3. Add unit tests that verify the transformation
When adding a feature, verify it works across all paths:
| Test | File Pattern |
|---|---|
| OpenAI passthrough | `tests/llm_translation/test_openai*.py` |
| Anthropic direct | `tests/llm_translation/test_anthropic*.py` |
| Bedrock Invoke | `tests/llm_translation/test_bedrock*.py` |
| Bedrock Converse | `tests/llm_translation/test_bedrock*converse*.py` |
| Vertex AI | `tests/llm_translation/test_vertex*.py` |
| Gemini | `tests/llm_translation/test_gemini*.py` |
Translations are designed to be unit testable without making API calls:
```python
from litellm.llms.bedrock.chat.converse_transformation import BedrockConverseConfig


def test_prompt_caching_transform():
    config = BedrockConverseConfig()
    result = config.transform_request(
        model="anthropic.claude-3-opus",
        messages=[{"role": "user", "content": "test", "cache_control": {"type": "ephemeral"}}],
        optional_params={},
        litellm_params={},
        headers={},
    )
    assert "cachePoint" in str(result)  # Verify cache_control was translated
```