diff --git a/docs.json b/docs.json
index 006bd76..b19bf8b 100644
--- a/docs.json
+++ b/docs.json
@@ -2,7 +2,7 @@
"$schema": "https://mintlify.com/docs.json",
"theme": "almond",
"name": "Edgee documentation",
- "description": "Edgee is a unified AI Gateway that gives you control over your LLM infrastructure.",
+ "description": "Edgee is an edge-native AI Gateway that reduces LLM costs by up to 50% through token compression and intelligent routing.",
"colors": {
"primary": "#8924A6",
"light": "#C876FA",
@@ -79,7 +79,10 @@
{
"group": "Features",
"pages": [
- "features/overview"
+ "features/overview",
+ "features/token-compression",
+ "features/observability",
+ "features/automatic-model-selection"
]
},
{
diff --git a/features/automatic-model-selection.mdx b/features/automatic-model-selection.mdx
index e6520cc..b58bd66 100644
--- a/features/automatic-model-selection.mdx
+++ b/features/automatic-model-selection.mdx
@@ -1,9 +1,206 @@
---
title: Automatic Model Selection
-description: Discover the Automatic Model Selection feature.
+description: Intelligent routing that optimizes for cost, performance, or both.
icon: circuit-board
---
+Edgee's automatic model selection routes requests to the optimal model based on your priorities. Combined with token compression, it can reduce total AI costs by 60-70%.
+
+## Cost-Aware Routing
+
+Let Edgee automatically select the cheapest model that meets your quality requirements:
+
+```typescript
+const response = await edgee.send({
+ model: 'auto', // Enable automatic selection
+ strategy: 'cost', // Optimize for lowest cost
+ input: 'What is the capital of France?',
+ quality_threshold: 0.95, // Only use models with 95%+ quality score
+});
+
+console.log(`Model used: ${response.model}`); // e.g., "gpt-4o-mini"
+console.log(`Cost: $${response.cost.toFixed(4)}`);
+console.log(`Tokens saved (compression): ${response.usage.saved_tokens}`);
+```
+
+**How it works:**
+1. Edgee analyzes the request's complexity and requirements
+2. Filters out models that don't meet your quality threshold
+3. Routes to the cheapest remaining model after token compression
+4. Tracks savings from both compression and routing
+
+**Typical savings:**
+- Simple queries: Route to GPT-4o-mini or Claude Haiku (60-80% cheaper)
+- Complex tasks: Route to mid-tier models like GPT-4o or Claude 3.5 Sonnet
+- Specialized needs: Route to task-specific models (coding, vision, etc.)
+
+Combined with compression, you can save 60-70% on total AI costs.
+
+<Note>
+  Quality thresholds are based on benchmark performance across standard tasks. You can customize thresholds per request or set defaults per project.
+</Note>
+
+## Performance-Optimized Routing
+
+Route to the fastest model when latency matters more than cost:
+
+```typescript
+const response = await edgee.send({
+ model: 'auto',
+ strategy: 'performance', // Optimize for speed
+ input: 'Generate a summary of this document...',
+ max_latency_ms: 2000, // Must respond in under 2s
+});
+
+console.log(`Model used: ${response.model}`); // e.g., "gpt-4o"
+console.log(`Latency: ${response.latency_ms}ms`);
+```
+
+**Performance routing considers:**
+- Model inference speed (tokens/second)
+- Provider API latency
+- Time to first token (TTFT)
+- Geographic proximity to provider
+
+## Balanced Strategy
+
+Find the optimal trade-off between cost and performance:
+
+```typescript
+const response = await edgee.send({
+ model: 'auto',
+ strategy: 'balanced',
+ input: 'Analyze this customer feedback...',
+ cost_budget: 0.01, // Max $0.01 per request
+ quality_threshold: 0.9, // 90% quality minimum
+});
+```
+
+**Balanced routing:**
+- Stays within your cost budget
+- Meets quality requirements
+- Optimizes for best performance within constraints
+- Automatically adjusts based on token compression
+
+## Automatic Failover
+
+When a provider fails, Edgee automatically retries with backup models:
+
+```typescript
+const response = await edgee.send({
+ model: 'gpt-4o',
+ fallback_models: ['claude-3.5-sonnet', 'gemini-pro'], // Backup chain
+ input: 'Your prompt here',
+});
+
+// If GPT-4o is unavailable, Edgee tries Claude 3.5, then Gemini
+console.log(`Model used: ${response.model}`);
+console.log(`Fallback used: ${response.fallback_used}`); // true/false
+```
+
+**Failover triggers:**
+- Rate limits (429 errors)
+- Provider outages (5xx errors)
+- Timeout errors
+- Model unavailability
+
+**Failover behavior:**
+- Instant retry with next model in chain
+- No additional latency (parallel health checks)
+- Preserves request context and compression
+- Logs failover events for monitoring
+
+## Cost + Compression Savings
+
+Automatic model selection works seamlessly with token compression for maximum savings:
+
+| Scenario | Without Edgee | With Compression Only | With Compression + Routing | **Total Savings** |
+|----------|---------------|----------------------|----------------------------|-------------------|
+| Simple Q&A | $0.10 (GPT-4o) | $0.05 (50% compression) | $0.02 (GPT-4o-mini + compression) | **80%** |
+| RAG Pipeline | $0.50 (GPT-4o) | $0.25 (50% compression) | $0.15 (GPT-4o + compression + routing) | **70%** |
+| Document Analysis | $1.00 (Claude Opus) | $0.50 (50% compression) | $0.30 (Claude Sonnet + compression) | **70%** |
+
+<Note>
+  Savings vary by use case. Track your actual savings using the [observability dashboard](/features/observability).
+</Note>
+
+## Route by Use Case
+
+Configure default routing strategies per use case:
+
+```typescript
+// RAG Q&A: Optimize for cost
+await edgee.routing.configure({
+ name: 'rag-qa',
+ strategy: 'cost',
+ allowed_models: ['gpt-5.2', 'gpt-5.1', 'claude-3.5-sonnet'],
+ quality_threshold: 0.9,
+});
+
+// Code generation: Optimize for performance
+await edgee.routing.configure({
+ name: 'code-gen',
+ strategy: 'performance',
+ allowed_models: ['gpt-4o', 'claude-3.5-sonnet'],
+ quality_threshold: 0.95,
+});
+
+// Then use per request
+const response = await edgee.send({
+ model: 'auto',
+ routing_profile: 'rag-qa', // Use pre-configured strategy
+ input: 'Answer based on these documents...',
+});
+```
+
+## Custom Routing Rules
+
+Define custom routing logic based on request properties:
+
+```typescript
+await edgee.routing.addRule({
+ name: 'route-by-length',
+ condition: {
+ token_count: { gt: 10000 }, // Requests over 10k tokens
+ },
+ action: {
+ models: ['claude-3.5-sonnet'], // Use Claude for long contexts
+ strategy: 'cost',
+ },
+});
+
+await edgee.routing.addRule({
+ name: 'route-critical-requests',
+ condition: {
+ metadata: { priority: 'high' }, // High-priority requests
+ },
+ action: {
+ models: ['gpt-4o', 'claude-opus'], // Use premium models
+ strategy: 'performance',
+ },
+});
+```
+
+## What's Next
+
+<CardGroup cols={2}>
+  <Card title="Token Compression" href="/features/token-compression">
+    Learn how compression reduces costs by up to 50% before routing.
+  </Card>
+  <Card title="Observability" href="/features/observability">
+    Track routing decisions, costs, and compression savings.
+  </Card>
+  <Card title="Quickstart" href="/quickstart/api-key">
+    Get started with automatic model selection in 5 minutes.
+  </Card>
+  <Card title="API Reference">
+    Explore the full API for routing configuration.
+  </Card>
+</CardGroup>
+
-This feature page is still under construction. We're working on it and will be published soon.
+This feature is under active development. Some routing strategies and configuration options may be added in future releases.
diff --git a/features/observability.mdx b/features/observability.mdx
index 97e11f0..7cdad18 100644
--- a/features/observability.mdx
+++ b/features/observability.mdx
@@ -1,9 +1,371 @@
---
title: Observability
-description: Discover the observability features of Edgee.
+description: Track every request, measure every token, optimize every dollar.
icon: eye
---
-
-This feature page is still under construction. We're working on it and will be published soon.
-
\ No newline at end of file
+Edgee provides complete visibility into your AI infrastructure with real-time metrics on costs, token usage, compression savings, performance, and errors. Every request is tracked and exportable for analysis, budgeting, and optimization.
+
+## Cost Tracking
+
+Every Edgee response includes detailed cost information so you can track spending in real-time:
+
+```typescript
+const response = await edgee.send({
+ model: 'gpt-4o',
+ input: 'Your prompt here',
+});
+
+console.log(response.cost); // Total cost in USD (e.g., 0.0234)
+console.log(response.usage.prompt_tokens); // Compressed input tokens
+console.log(response.usage.completion_tokens); // Output tokens
+console.log(response.usage.total_tokens); // Total for billing
+```
+
+**Track spending by:**
+- Model (GPT-4o vs Claude vs Gemini)
+- Project or application
+- Environment (production vs staging)
+- User or tenant (for multi-tenant apps)
+- Time period (daily, weekly, monthly)
+
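+As a quick sketch, you can aggregate these fields yourself; the `costByModel` map below is illustrative rather than a built-in SDK feature, and an initialized `edgee` client is assumed:
+
+```typescript
+// Illustrative: accumulate spend per model across a batch of requests
+const costByModel: Record<string, number> = {};
+
+for (const prompt of ['Summarize...', 'Classify...', 'Extract...']) {
+  const response = await edgee.send({ model: 'gpt-4o', input: prompt });
+  // response.model and response.cost are present on every Edgee response
+  costByModel[response.model] = (costByModel[response.model] ?? 0) + response.cost;
+}
+
+console.log(costByModel); // e.g., { "gpt-4o": 0.0412 }
+```
+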
+<Note>
+  Costs are calculated using real-time provider pricing. Edgee automatically handles rate changes and updates your historical data accordingly.
+</Note>
+
+## Request Tags for Analytics
+
+Tags allow you to categorize and label requests for filtering and grouping in your analytics dashboard. Add tags to track requests by environment, feature, user, team, or any custom dimension.
+
+**Using tags in native SDKs:**
+
+<Tabs>
+  <Tab title="TypeScript">
+ ```typescript
+ import Edgee from 'edgee';
+
+ const edgee = new Edgee("your-api-key");
+
+ const response = await edgee.send({
+ model: 'gpt-4o',
+ input: {
+ messages: [{ role: 'user', content: 'Hello!' }],
+ tags: ['production', 'chat-feature', 'user-123', 'team-backend']
+ }
+ });
+ ```
+  </Tab>
+
+  <Tab title="Python">
+ ```python
+ from edgee import Edgee, InputObject, Message
+
+ edgee = Edgee("your-api-key")
+
+ response = edgee.send(
+ model="gpt-4o",
+ input=InputObject(
+ messages=[Message(role="user", content="Hello!")],
+ tags=["production", "chat-feature", "user-123", "team-backend"]
+ )
+ )
+ ```
+  </Tab>
+
+  <Tab title="Go">
+ ```go
+ import "github.com/edgee-cloud/go-sdk/edgee"
+
+ client, _ := edgee.NewClient("your-api-key")
+
+ response, err := client.Send("gpt-4o", edgee.InputObject{
+ Messages: []edgee.Message{
+ {Role: "user", Content: "Hello!"},
+ },
+ Tags: []string{"production", "chat-feature", "user-123", "team-backend"},
+ })
+ ```
+  </Tab>
+
+  <Tab title="Rust">
+ ```rust
+ use edgee::{Edgee, InputObject, Message};
+
+ let client = Edgee::from_env()?;
+
+ let input = InputObject::new(vec![Message::user("Hello!")])
+ .with_tags(vec![
+ "production".to_string(),
+ "chat-feature".to_string(),
+ "user-123".to_string(),
+ "team-backend".to_string(),
+ ]);
+
+ let response = client.send("gpt-4o", input).await?;
+ ```
+  </Tab>
+</Tabs>
+
+**Using tags with OpenAI/Anthropic SDKs via headers:**
+
+If you're using the OpenAI or Anthropic SDKs with Edgee, add tags via the `x-edgee-tags` header (comma-separated):
+
+<Tabs>
+  <Tab title="TypeScript">
+ ```typescript
+ import OpenAI from "openai";
+
+ const openai = new OpenAI({
+ baseURL: "https://api.edgee.ai/v1",
+ apiKey: process.env.EDGEE_API_KEY,
+ defaultHeaders: {
+ "x-edgee-tags": "production,chat-feature,user-123,team-backend"
+ }
+ });
+ ```
+  </Tab>
+
+  <Tab title="Python">
+ ```python
+ from anthropic import Anthropic
+
+ client = Anthropic(
+ base_url="https://api.edgee.ai/v1",
+ api_key=os.environ.get("EDGEE_API_KEY"),
+ default_headers={
+ "x-edgee-tags": "production,chat-feature,user-123,team-backend"
+ }
+ )
+ ```
+  </Tab>
+</Tabs>
+
+**Common tagging strategies:**
+
+<CardGroup cols={2}>
+  <Card title="Environment tagging">
+    Tag by environment: `production`, `staging`, `development`
+  </Card>
+  <Card title="Feature tagging">
+    Tag by feature: `chat`, `summarization`, `code-generation`, `rag-qa`
+  </Card>
+  <Card title="User/tenant tagging">
+    Track per-user or per-tenant usage: `user-123`, `tenant-acme`, `customer-xyz`
+  </Card>
+  <Card title="Team tagging">
+    Organize by team: `team-backend`, `team-frontend`, `team-data`
+  </Card>
+</CardGroup>
+
+<Tip>
+  Use tags consistently across your application to enable powerful filtering and cost attribution in your analytics dashboard. You can filter by multiple tags to drill down into specific segments (e.g., "production + chat-feature + team-backend").
+</Tip>
+
+## Compression Metrics
+
+See exactly how much token compression is saving you on every request:
+
+```typescript
+const response = await edgee.send({
+ model: 'gpt-4o',
+ input: 'Long prompt with lots of context...',
+});
+
+// Compression details
+console.log(response.usage.prompt_tokens_original); // Original token count
+console.log(response.usage.prompt_tokens); // After compression
+console.log(response.usage.saved_tokens); // Tokens saved
+console.log(response.usage.compression_ratio); // Percentage reduction (e.g., 45%)
+```
+
+**Analyze compression effectiveness:**
+- **By use case**: Compare RAG vs agents vs document analysis
+- **Over time**: Track cumulative savings weekly or monthly
+- **Per model**: See which models compress best for your workload
+- **By prompt length**: Identify high-value optimization opportunities
+
+<CardGroup cols={2}>
+  <Card title="Cumulative savings">
+    View total tokens and dollars saved since you started using Edgee.
+  </Card>
+  <Card title="Compression trends">
+    Track compression ratios over time to identify optimization opportunities.
+  </Card>
+  <Card title="By use case">
+    Compare compression effectiveness across different prompt types.
+  </Card>
+  <Card title="Top savers">
+    Identify which requests generate the highest savings.
+  </Card>
+</CardGroup>
+
+## Performance Monitoring
+
+Track latency and throughput across all your AI requests:
+
+**Latency metrics:**
+- Total request time (end-to-end)
+- Time to first token (TTFT)
+- Tokens per second (streaming)
+- Edge processing overhead
+
+**By dimension:**
+- Model and provider
+- Geographic region
+- Request size (token count)
+- Time of day or week
+
+**Error tracking:**
+- Provider errors (rate limits, timeouts, 5xx)
+- Automatic failover events
+- Retry attempts and success rates
+- Error codes and messages
+
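+A rough sketch of sampling latency from the client side, assuming an initialized `edgee` client; the percentile math is illustrative, and `latency_ms` is the response field shown in the routing examples:
+
+```typescript
+// Illustrative: collect per-request latency and estimate p95
+const latencies: number[] = [];
+
+for (let i = 0; i < 20; i++) {
+  const response = await edgee.send({ model: 'gpt-4o', input: 'ping' });
+  latencies.push(response.latency_ms); // end-to-end latency for this request
+}
+
+latencies.sort((a, b) => a - b);
+console.log(`p95 latency: ${latencies[Math.floor(latencies.length * 0.95)]}ms`);
+```
+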
+## Usage Analytics
+
+Understand how your AI infrastructure is being used:
+
+**Request volume:**
+- Total requests per day/week/month
+- Requests by model and provider
+- Peak usage times
+- Growth trends
+
+**Token consumption:**
+- Input tokens (original vs compressed)
+- Output tokens
+- Total tokens by model
+- Average tokens per request
+
+**Model distribution:**
+- Which models are used most
+- Provider mix (OpenAI vs Anthropic vs Google)
+- Cost per model over time
+- Model switching patterns
+
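+These metrics can also be pulled programmatically; a minimal sketch using the `edgee.analytics.export` call documented later on this page:
+
+```typescript
+// Illustrative: request volume and token consumption per model for one month
+const usage = await edgee.analytics.export({
+  startDate: '2024-01-01',
+  endDate: '2024-01-31',
+  format: 'json',
+  metrics: ['tokens', 'cost'],
+  groupBy: ['model'],
+});
+
+console.log(usage); // e.g., rows of { model, tokens, cost }
+```
+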
+## Alerts & Budgets
+
+Stay in control with proactive alerts:
+
+**Budget alerts:**
+- Set monthly spending limits per project
+- Get notified at 80%, 90%, 100% of budget
+- Automatic rate limiting at threshold
+- Email and webhook notifications
+
+**Usage alerts:**
+- Unusual spike in requests
+- High error rates for specific models
+- Compression ratio drops below threshold
+- Latency exceeds acceptable levels
+
+**Example alert configuration:**
+
+```typescript
+await edgee.alerts.create({
+ name: 'Monthly budget alert',
+ type: 'budget',
+ threshold: 1000, // $1,000 USD
+ actions: [
+ { type: 'email', to: 'team@company.com' },
+ { type: 'webhook', url: 'https://api.company.com/alerts' },
+ ],
+});
+```
+
+## Export & Integration
+
+Get your data where you need it:
+
+**Export formats:**
+- JSON for custom analysis
+- CSV for spreadsheets
+- Parquet for data warehouses
+- Streaming webhooks for real-time ingestion
+
+**Integration targets:**
+- Datadog, New Relic, Grafana for dashboards
+- Snowflake, BigQuery for analytics
+- S3, GCS for long-term storage
+- Custom webhooks for internal systems
+
+**Example export:**
+
+```typescript
+// Export last 30 days of usage data
+const data = await edgee.analytics.export({
+ startDate: '2024-01-01',
+ endDate: '2024-01-31',
+ format: 'json',
+ metrics: ['cost', 'tokens', 'latency', 'compression'],
+ groupBy: ['model', 'date'],
+});
+```
+
+## Dashboard Views
+
+The Edgee dashboard provides pre-built views for common use cases:
+
+<CardGroup cols={2}>
+  <Card title="Cost analytics">
+    Track spending trends, compare models, and identify cost optimization opportunities.
+  </Card>
+  <Card title="Compression savings">
+    Monitor token savings, compression ratios, and cumulative cost reductions.
+  </Card>
+  <Card title="Performance">
+    Analyze latency, throughput, error rates, and provider health across regions.
+  </Card>
+  <Card title="Usage">
+    Understand request volume, model distribution, and usage trends over time.
+  </Card>
+</CardGroup>
+
+<Note>
+  Dashboard access is included with all Edgee plans. Enterprise customers can customize dashboards and create team-specific views.
+</Note>
+
+## What's Next
+
+<CardGroup cols={2}>
+  <Card title="Token Compression" href="/features/token-compression">
+    Learn how token compression reduces costs by up to 50%.
+  </Card>
+  <Card title="Automatic Model Selection" href="/features/automatic-model-selection">
+    Optimize for cost or performance with automatic model selection.
+  </Card>
+  <Card title="Quickstart" href="/quickstart/api-key">
+    Get started with Edgee in 5 minutes.
+  </Card>
+  <Card title="API Reference">
+    Explore the full API for analytics and observability.
+  </Card>
+</CardGroup>
\ No newline at end of file
diff --git a/features/overview.mdx b/features/overview.mdx
index 8d339d2..8253ad2 100644
--- a/features/overview.mdx
+++ b/features/overview.mdx
@@ -4,32 +4,35 @@ description: An overview of what we’re building in Edgee AI Gateway.
icon: sparkles
---
-Edgee AI Gateway is a **unified, OpenAI-compatible API** that sits between your application and LLM providers.
-It’s designed to help teams ship faster while keeping **routing, reliability, observability, and privacy controls** in one place.
+Edgee AI Gateway is an **edge-native, cost-optimized AI Gateway** that reduces LLM spending by up to 50% through token compression. Behind a single OpenAI-compatible API, you get access to 200+ models with intelligent routing, reliability, observability, and privacy controls.
Edgee AI Gateway is still under active construction. We’re building fast, shipping incrementally, and writing docs in parallel.
-If you need something more mature today, jump to the [Edgee Proxy documentation](/proxy/overview).
+Don't hesitate to [contact us](https://www.edgee.cloud/contact) if you have any questions or feedback.
## What the AI Gateway focuses on
-These are the core capabilities we’re designing the gateway around:
+These are the core capabilities we're designing the gateway around:
+  <Card title="Token Compression" href="/features/token-compression">
+    Reduce input tokens by up to 50% with edge-native compression. Ideal for RAG, long contexts, and agent workflows.
+  </Card>
+
- One integration that can route across providers without rewriting your application logic.
+ One integration that routes across 200+ models from OpenAI, Anthropic, Google, Mistral, and more.
-
- Policy-based routing, fallbacks, and safer failure modes when providers rate-limit or degrade.
+
+ Track token savings, costs, latency, and errors in real-time. Export data for analysis and budgeting.
-
- The signals you need to run production AI: latency, errors, usage, and cost — exportable and actionable.
+
+ Policy-based routing, automatic failover, and cost-aware model selection for optimal spend.
-
+
Configurable logging and retention, plus provider-side ZDR where available.
@@ -40,9 +43,3 @@ These are the core capabilities we’re designing the gateway around:
- **Clear defaults, configurable controls**: the goal is to reduce “LLM glue code” while keeping you in charge.
- **Docs expanding quickly**: each feature page will get deeper guides, examples, and best practices as we ship.
-## Looking for the mature platform?
-
-Edgee Proxy has a large set of production-ready capabilities and much deeper documentation today:
-
-- **Start here**: [Edgee Proxy overview](/proxy/overview)
-- **Implementation guides**: [Proxy getting started](/proxy/getting-started/index)
diff --git a/features/token-compression.mdx b/features/token-compression.mdx
new file mode 100644
index 0000000..2080833
--- /dev/null
+++ b/features/token-compression.mdx
@@ -0,0 +1,199 @@
+---
+title: Token Compression
+description: Reduce LLM costs by up to 50% with edge-native prompt compression.
+icon: dollar-sign
+---
+
+Edgee's token compression runs at the edge before every request reaches LLM providers, automatically reducing prompt size by up to 50% while preserving semantic meaning and output quality.
+
+This is particularly effective for:
+- RAG pipelines with large document contexts
+- Long conversation histories in multi-turn agents
+- Verbose system instructions and formatting
+- Document analysis and summarization tasks
+
+## How It Works
+
+Token compression happens automatically on every request through a four-step process:
+
+<Steps>
+  <Step title="Semantic Analysis">
+    Analyze the prompt structure to identify redundant context, verbose formatting, and compressible sections without losing critical information.
+  </Step>
+  <Step title="Context Optimization">
+    Compress repeated context (common in RAG), condense verbose formatting, and remove unnecessary whitespace while maintaining semantic relationships.
+  </Step>
+  <Step title="Instruction Preservation">
+    Preserve critical instructions, few-shot examples, and task-specific requirements. System prompts and user intent remain intact.
+  </Step>
+  <Step title="Quality Verification">
+    Verify the compressed prompt maintains semantic equivalence to the original. If quality checks fail, the original prompt is used.
+  </Step>
+</Steps>
+
+<Note>
+  Compression is most effective for prompts with repeated context (RAG), long system instructions, or verbose multi-turn histories. Simple queries may see minimal compression.
+</Note>
+
+## When It Works Best
+
+Token compression delivers the highest savings for these common use cases:
+
+<CardGroup cols={2}>
+  <Card title="RAG Pipelines">
+    **40-50% reduction**
+
+    Large document contexts with redundant information compress effectively. Ideal for Q&A systems, knowledge bases, and semantic search.
+  </Card>
+  <Card title="Long Contexts">
+    **30-45% reduction**
+
+    Lengthy conversation histories, documentation, or background information. Common in chatbots and assistant applications.
+  </Card>
+  <Card title="Document Analysis">
+    **35-50% reduction**
+
+    Summarization, extraction, and analysis of long documents. Verbose source material compresses well.
+  </Card>
+  <Card title="Multi-turn Agents">
+    **25-40% reduction**
+
+    Conversational agents with growing context windows. Savings increase with conversation length.
+  </Card>
+</CardGroup>
+
+## Code Example
+
+Every response includes compression metrics so you can track your savings:
+
+```typescript
+import Edgee from 'edgee';
+
+const edgee = new Edgee("your-api-key");
+
+// Example: RAG Q&A with large context
+const documents = [
+ "Long document content here...",
+ "Another document with context...",
+ "More relevant information..."
+];
+
+const response = await edgee.send({
+ model: 'gpt-4o',
+ input: `Answer the question based on these documents:\n\n${documents.join('\n\n')}\n\nQuestion: What is the main topic?`,
+});
+
+console.log(response.text);
+
+// Compression metrics
+console.log(`Original tokens: ${response.usage.prompt_tokens_original}`);
+console.log(`Compressed tokens: ${response.usage.prompt_tokens}`);
+console.log(`Tokens saved: ${response.usage.saved_tokens}`);
+console.log(`Compression ratio: ${response.usage.compression_ratio}%`);
+console.log(`Request cost: $${response.cost.toFixed(4)}`);
+```
+
+**Example output:**
+```
+Original tokens: 2,450
+Compressed tokens: 1,225
+Tokens saved: 1,225
+Compression ratio: 50%
+Request cost: $0.0184
+```
+
+## Real-World Savings
+
+Here's what token compression means for your monthly AI bill:
+
+| Use Case | Monthly Requests | Without Edgee | With Edgee (50% compression) | **Monthly Savings** |
+|----------|-----------------|---------------|------------------------------|---------------------|
+| RAG Q&A (GPT-4o) | 100,000 @ 2,000 tokens | $1,000 | $500 | **$500** |
+| Document Analysis (Claude 3.5) | 50,000 @ 4,000 tokens | $600 | $300 | **$300** |
+| Chatbot (GPT-4o-mini) | 500,000 @ 500 tokens | $37.50 | $18.75 | **$18.75** |
+| Multi-turn Agent (GPT-4o) | 200,000 @ 1,000 tokens | $1,000 | $500 | **$500** |
+
+<Note>
+  Savings calculations use list pricing for GPT-4o ($5/1M input tokens), Claude 3.5 Sonnet ($3/1M input tokens), and GPT-4o-mini ($0.15/1M input tokens). Actual compression ratios vary by use case.
+</Note>
+
+## Best Practices
+
+<AccordionGroup>
+  <Accordion title="Optimize prompt structure">
+    - Structure RAG contexts with clear sections
+    - Use consistent formatting in document chunks
+    - Avoid excessive whitespace in system prompts
+    - Group similar information together
+  </Accordion>
+  <Accordion title="Track your savings">
+    - Monitor `usage.saved_tokens` across requests
+    - Calculate cumulative savings weekly or monthly
+    - Use observability tools to identify high-compression opportunities
+    - Compare costs across different use cases
+  </Accordion>
+  <Accordion title="Enable and monitor compression">
+    - Enable compression by default for all requests
+    - Compression happens automatically without configuration
+    - Track `compression_ratio` to understand effectiveness
+    - Use response metrics to optimize prompt design
+  </Accordion>
+  <Accordion title="Combine with routing">
+    - Use [automatic model selection](/features/automatic-model-selection) for additional savings
+    - Route to cheaper models when appropriate
+    - Compression + routing can reduce costs by 60-70% total
+    - Monitor both compression and routing savings
+  </Accordion>
+</AccordionGroup>
+
+## Response Fields
+
+Every Edgee response includes detailed compression metrics:
+
+```typescript
+response.usage.prompt_tokens // Compressed token count (billed)
+response.usage.prompt_tokens_original // Original token count (before compression)
+response.usage.saved_tokens // Tokens saved by compression
+response.usage.compression_ratio // Percentage reduction
+response.usage.completion_tokens // Output tokens (unchanged)
+response.usage.total_tokens // Total for billing calculation
+
+response.cost // Total request cost in USD
+```
+
+Use these fields to:
+- Track savings in real-time
+- Build cost dashboards and budgeting tools
+- Identify high-value compression opportunities
+- Optimize prompt design for maximum compression
+
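+A small sketch of a savings tracker built on these fields; the running totals are illustrative, and an initialized `edgee` client is assumed:
+
+```typescript
+// Illustrative: accumulate compression savings across requests
+let totalSavedTokens = 0;
+let totalCost = 0;
+
+const response = await edgee.send({ model: 'gpt-4o', input: 'Long prompt...' });
+totalSavedTokens += response.usage.saved_tokens; // tokens removed by compression
+totalCost += response.cost; // billed cost after compression
+
+console.log(`Saved ${totalSavedTokens} tokens; spent $${totalCost.toFixed(4)}`);
+```
+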
+## What's Next
+
+<CardGroup cols={2}>
+  <Card title="Observability" href="/features/observability">
+    Monitor token savings, costs, and compression ratios across all requests.
+  </Card>
+  <Card title="Automatic Model Selection" href="/features/automatic-model-selection">
+    Combine compression with cost-aware model routing for even greater savings.
+  </Card>
+  <Card title="Quickstart" href="/quickstart/api-key">
+    Get started in 5 minutes and start saving on your next request.
+  </Card>
+  <Card title="SDKs">
+    Explore SDKs in TypeScript, Python, Go, and Rust with built-in compression support.
+  </Card>
+</CardGroup>
diff --git a/integrations/anthropic-sdk.mdx b/integrations/anthropic-sdk.mdx
index 7500ab3..e19169c 100644
--- a/integrations/anthropic-sdk.mdx
+++ b/integrations/anthropic-sdk.mdx
@@ -4,7 +4,7 @@ description: Use Edgee with the Anthropic SDK for building AI applications with
icon: /images/icons/anthropic.svg
---
-The Anthropic SDK provides official Python and TypeScript clients for interacting with Claude models. Edgee's OpenAI-compatible API works seamlessly with the Anthropic SDK, allowing you to leverage the SDK's features while gaining access to Edgee's unified gateway, cost tracking, automatic failover, and observability.
+The Anthropic SDK provides official Python and TypeScript clients for interacting with Claude models. Edgee's OpenAI-compatible API works seamlessly with the Anthropic SDK, letting you keep the SDK's features while gaining Edgee's token compression (**up to 50% cost reduction**), unified gateway, automatic failover, and full observability.
## Installation
@@ -45,7 +45,10 @@ The Anthropic SDK provides official Python and TypeScript clients for interactin
)
print(message.content[0].text)
- # "The capital of France is Paris."
+
+ # Access token usage and cost metrics
+ print(f"Tokens saved: {message.usage.input_tokens_original - message.usage.input_tokens}")
+ print(f"Total tokens: {message.usage.input_tokens + message.usage.output_tokens}")
```
@@ -69,7 +72,10 @@ The Anthropic SDK provides official Python and TypeScript clients for interactin
});
console.log(message.content[0].text);
- // "The capital of France is Paris."
+
+ // Access token usage and cost metrics
+ console.log(`Tokens saved: ${message.usage.input_tokens_original - message.usage.input_tokens}`);
+ console.log(`Total tokens: ${message.usage.input_tokens + message.usage.output_tokens}`);
```
@@ -130,9 +136,77 @@ Stream responses for real-time token delivery:
+## Cost Tracking & Compression
+
+Every Edgee response includes token compression metrics through the Anthropic API's `usage` field:
+
+<Tabs>
+  <Tab title="Python">
+ ```python
+ from anthropic import Anthropic
+
+ client = Anthropic(
+ base_url="https://api.edgee.ai/v1",
+ api_key=os.environ.get("EDGEE_API_KEY"),
+ )
+
+ message = client.messages.create(
+ model="claude-sonnet-4.5",
+ max_tokens=1024,
+ messages=[{"role": "user", "content": "Analyze this long document..."}]
+ )
+
+ print(message.content[0].text)
+
+ # Compression metrics
+ usage = message.usage
+ tokens_saved = usage.input_tokens_original - usage.input_tokens
+ compression_ratio = (tokens_saved / usage.input_tokens_original) * 100
+
+ print(f"Original input tokens: {usage.input_tokens_original}")
+ print(f"Compressed input tokens: {usage.input_tokens}")
+ print(f"Tokens saved: {tokens_saved}")
+ print(f"Compression ratio: {compression_ratio:.1f}%")
+ ```
+  </Tab>
+
+  <Tab title="TypeScript">
+ ```typescript
+ import Anthropic from '@anthropic-ai/sdk';
+
+ const client = new Anthropic({
+ baseURL: 'https://api.edgee.ai/v1',
+ apiKey: process.env.EDGEE_API_KEY,
+ });
+
+ const message = await client.messages.create({
+ model: 'claude-sonnet-4.5',
+ max_tokens: 1024,
+ messages: [{ role: 'user', content: 'Analyze this long document...' }]
+ });
+
+ console.log(message.content[0].text);
+
+ // Compression metrics
+ const usage = message.usage;
+ const tokensSaved = usage.input_tokens_original - usage.input_tokens;
+ const compressionRatio = (tokensSaved / usage.input_tokens_original) * 100;
+
+ console.log(`Original input tokens: ${usage.input_tokens_original}`);
+ console.log(`Compressed input tokens: ${usage.input_tokens}`);
+ console.log(`Tokens saved: ${tokensSaved}`);
+ console.log(`Compression ratio: ${compressionRatio.toFixed(1)}%`);
+ ```
+  </Tab>
+</Tabs>
+
+<Note>
+  Edgee extends the Anthropic API response with `input_tokens_original` to show the token count before compression. All other fields remain standard Anthropic format.
+</Note>
+
## Multi-Provider Access
-With Edgee, you can access models from multiple providers using the same Anthropic SDK client:
+With Edgee, you can access models from multiple providers using the same Anthropic SDK client and compare costs across providers:
@@ -413,12 +487,12 @@ Authorization: Bearer {api_key}
## Benefits of Using Anthropic SDK with Edgee
-
- Use the familiar Anthropic SDK to access Claude, GPT-4, Mistral, and 200+ other models through one interface.
+
+ Automatic token compression on every request reduces input tokens by up to 50% while preserving output quality.
-
- Every response includes detailed cost information. Track spending across all providers in one dashboard.
+
+ Compare costs across Claude, GPT-4, Mistral, and 200+ models. Track compression savings per provider.
@@ -426,7 +500,7 @@ Authorization: Bearer {api_key}
- Monitor latency, token usage, error rates, and costs for all requests in Edgee's unified dashboard.
+ Monitor latency, token usage, compression ratios, error rates, and costs for all requests in one dashboard.
diff --git a/integrations/claude-code.mdx b/integrations/claude-code.mdx
index d50dca1..7855539 100644
--- a/integrations/claude-code.mdx
+++ b/integrations/claude-code.mdx
@@ -120,23 +120,27 @@ To get your API key:
## Benefits of Using Claude Code with Edgee
-
- Access multiple LLM providers through Edgee while using Claude Code's powerful CLI interface.
+
+ Token compression works automatically on every Claude Code request, reducing costs by up to 50% without any code changes.
-
- Track costs in real-time and set budget alerts for your Claude Code usage through Edgee's dashboard.
+
+ View exactly how much you're saving on each coding session in the Edgee dashboard with detailed compression metrics.
-
- If Claude API is down or rate-limited, Edgee can automatically route to backup models without interrupting your workflow.
+
+ Access multiple LLM providers through Edgee while using Claude Code's powerful CLI interface.
-
- Monitor all Claude Code sessions with detailed metrics: latency, token usage, costs, and error rates.
+
+ If Claude API is down or rate-limited, Edgee can automatically route to backup models without interrupting your workflow.
+<Note>
+  Compression happens automatically at the edge before requests reach Claude. You'll see token savings in the Edgee dashboard without any changes to your Claude Code workflow.
+</Note>
+
## Troubleshooting
### Connection Issues
diff --git a/integrations/openai-sdk.mdx b/integrations/openai-sdk.mdx
index a7c855b..68ef2b7 100644
--- a/integrations/openai-sdk.mdx
+++ b/integrations/openai-sdk.mdx
@@ -9,6 +9,8 @@ Edgee provides an **OpenAI-compatible API**, which means you can use the officia
## Why Use OpenAI SDK with Edgee?
+- **Up to 50% Cost Reduction**: Automatic token compression on every request
+- **Real-Time Savings**: See exactly how many tokens and dollars you've saved
- **No Code Changes**: Use your existing OpenAI SDK code as-is
- **Multi-Provider Access**: Route to OpenAI, Anthropic, Google, and more through one API
- **Automatic Failover**: Built-in reliability with fallback providers
@@ -85,6 +87,70 @@ print(completion.choices[0].message.content)
+## Cost Tracking & Compression
+
+Every response includes token compression and cost metrics through the standard OpenAI `usage` field:
+
+
+<CodeGroup>
+```typescript title="TypeScript"
+import OpenAI from "openai";
+
+const openai = new OpenAI({
+ baseURL: "https://api.edgee.ai/v1",
+ apiKey: process.env.EDGEE_API_KEY,
+});
+
+const completion = await openai.chat.completions.create({
+ model: "gpt-4o",
+ messages: [
+ { role: "user", content: "Summarize this long document..." }
+ ],
+});
+
+console.log(completion.choices[0].message.content);
+
+// Access compression metrics
+const usage = completion.usage;
+console.log(`Tokens saved: ${usage.prompt_tokens_original - usage.prompt_tokens}`);
+console.log(`Compression ratio: ${((usage.prompt_tokens_original - usage.prompt_tokens) / usage.prompt_tokens_original * 100).toFixed(1)}%`);
+console.log(`Total tokens: ${usage.total_tokens}`);
+```
+
+```python title="Python"
+from openai import OpenAI
+from os import getenv
+
+client = OpenAI(
+ base_url="https://api.edgee.ai/v1",
+ api_key=getenv("EDGEE_API_KEY"),
+)
+
+completion = client.chat.completions.create(
+ model="gpt-4o",
+ messages=[
+ {"role": "user", "content": "Summarize this long document..."}
+ ],
+)
+
+print(completion.choices[0].message.content)
+
+# Access compression metrics
+usage = completion.usage
+tokens_saved = usage.prompt_tokens_original - usage.prompt_tokens
+compression_ratio = (tokens_saved / usage.prompt_tokens_original) * 100
+
+print(f"Tokens saved: {tokens_saved}")
+print(f"Compression ratio: {compression_ratio:.1f}%")
+print(f"Total tokens: {usage.total_tokens}")
+```
+</CodeGroup>
+
+
+<Note>
+  Edgee extends the OpenAI API response with `prompt_tokens_original` to show the token count before compression. All other fields remain standard OpenAI format.
+</Note>
+
## Advanced Usage
### Function Calling (Tools)
diff --git a/introduction.mdx b/introduction.mdx
index c8615e2..6fdc24d 100644
--- a/introduction.mdx
+++ b/introduction.mdx
@@ -1,13 +1,11 @@
---
title: Welcome to Edgee
-description: The AI Gateway that gives you control over your LLM infrastructure.
+description: The AI Gateway that TL;DRs your tokens.
icon: house
mode: "center"
---
-Edgee is a **unified AI Gateway** that sits between your application and LLM providers, giving you complete control over your AI infrastructure.
-
-**One API. Every model. Total visibility.**
+Edgee is an **AI Gateway** that reduces LLM costs by up to 50% through intelligent token compression. Behind a single OpenAI-compatible API, you get access to 200+ models with automatic cost optimization, intelligent routing, and full observability.
## Get Started in 6 Lines
@@ -22,9 +20,10 @@ Edgee is a **unified AI Gateway** that sits between your application and LLM pro
model: 'gpt-4o',
input: 'What is the capital of France?',
});
-
+
console.log(response.text);
- // "The capital of France is Paris."
+ console.log(`Tokens saved: ${response.usage.saved_tokens}`);
+ console.log(`Cost: $${response.cost.toFixed(4)}`);
```
@@ -40,7 +39,8 @@ Edgee is a **unified AI Gateway** that sits between your application and LLM pro
)
print(response.text)
- # "The capital of France is Paris."
+ print(f"Tokens saved: {response.usage.saved_tokens}")
+ print(f"Cost: ${response.cost:.4f}")
```
@@ -63,7 +63,8 @@ Edgee is a **unified AI Gateway** that sits between your application and LLM pro
}
fmt.Println(response.Text())
- // "The capital of France is Paris."
+ fmt.Printf("Tokens saved: %d\n", response.Usage.SavedTokens)
+ fmt.Printf("Cost: $%.4f\n", response.Cost)
}
```
@@ -76,7 +77,8 @@ Edgee is a **unified AI Gateway** that sits between your application and LLM pro
let response = client.send("gpt-4o", "What is the capital of France?").await.unwrap();
println!("{}", response.text().unwrap_or(""));
- // "The capital of France is Paris."
+ println!("Tokens saved: {}", response.usage.saved_tokens);
+ println!("Cost: ${:.4}", response.cost);
```
@@ -88,13 +90,19 @@ That's it. You now have access to every major LLM provider, automatic failovers,
+
+
+
+
+
## Why Choose Edgee?
Building with LLMs is powerful, but comes with challenges:
-- **Vendor lock-in**: Your code is tightly coupled to a single provider's API
+- **Exploding AI costs**: Token usage adds up fast with RAG, long contexts, and multi-turn conversations
- **Cost opacity**: Bills spike with no visibility into what's driving costs
+- **Vendor lock-in**: Your code is tightly coupled to a single provider's API
- **No fallbacks**: When OpenAI goes down, your app goes down
- **Security concerns**: Sensitive data flows directly to third-party providers
- **Fragmented observability**: Logs scattered across multiple dashboards
@@ -105,18 +113,23 @@ Building with LLMs is powerful, but comes with challenges:
## Core Capabilities
+  <Card title="Token Compression" href="/features/token-compression">
+    Reduce prompt size by up to 50% without losing intent.
+    Ideal for RAG, long contexts, and multi-turn agents.
+  </Card>
+
- One SDK, access to 200+ models from OpenAI, Anthropic, Google, Mistral, and more.
+ One SDK, access to 200+ models from OpenAI, Anthropic, Google, Mistral, and more.
Switch providers with a single line change.
- Automatic failover, load balancing, and smart model selection.
+ Automatic failover, load balancing, and smart model selection.
Optimize for cost, performance, or both.
- Real-time cost tracking, latency metrics, and request logs.
+ Real-time cost tracking, latency metrics, and request logs.
Know exactly what your AI is doing and costing.
diff --git a/introduction/faq.mdx b/introduction/faq.mdx
index ebb044c..278fb09 100644
--- a/introduction/faq.mdx
+++ b/introduction/faq.mdx
@@ -5,14 +5,49 @@ icon: message-circle-question-mark
---
+<Accordion title="How does Edgee reduce my AI costs?">
+ Edgee reduces LLM costs through two mechanisms:
+
+ **Token Compression (up to 50% input token reduction):**
+ - **RAG pipelines**: 40-50% reduction on document-heavy contexts
+ - **Long contexts**: 30-45% reduction on conversation histories
+ - **Document analysis**: 35-50% reduction on summarization tasks
+ - **Multi-turn agents**: 25-40% reduction as conversations grow
+
+ **Cost-Aware Routing (20-60% additional savings):**
+ - Automatically routes to cheaper models when quality thresholds are met
+ - Combines with compression for 60-70% total cost reduction
+
+ Example: A RAG Q&A system using GPT-4o ($5/1M input tokens) with 100,000 monthly requests at 2,000 tokens each would save about $500/month with compression alone.
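+
+ A minimal sketch of enabling both mechanisms (compression needs no configuration; `model: 'auto'` with `strategy: 'cost'` mirrors the [automatic model selection](/features/automatic-model-selection) examples):
+
+ ```typescript
+ const response = await edgee.send({
+   model: 'auto',
+   strategy: 'cost',
+   input: 'Answer based on these documents...',
+ });
+
+ console.log(response.usage.saved_tokens); // compression savings
+ console.log(response.cost); // cost after compression and routing
+ ```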
+</Accordion>
+
+<Accordion title="How does token compression work?">
+ Token compression happens automatically at the edge on every request through a four-step process:
+
+ 1. **Semantic Analysis**: Identify redundant context and compressible sections
+ 2. **Context Optimization**: Compress repeated context (common in RAG) and remove unnecessary formatting
+ 3. **Instruction Preservation**: Keep critical instructions, few-shot examples, and task requirements intact
+ 4. **Quality Verification**: Ensure compressed prompts maintain semantic equivalence
+
+ Compression is most effective for:
+ - Prompts with repeated context (RAG document chunks)
+ - Long system instructions with verbose formatting
+ - Multi-turn conversations with growing history
+ - Document analysis with redundant information
+
+ Every response includes compression metrics (`saved_tokens`, `compression_ratio`) so you can track your savings in real-time.
+</Accordion>
+
- Edgee is a unified AI Gateway that sits between your application and LLM providers like OpenAI, Anthropic, Google, and Mistral. It provides a single API to access 200+ models, with built-in intelligent routing, cost tracking, automatic failovers, and full observability.
+ Edgee is an **edge-native AI Gateway that reduces LLM costs by up to 50%** through token compression. It sits between your application and LLM providers like OpenAI, Anthropic, Google, and Mistral, providing a single API to access 200+ models with built-in intelligent routing, cost tracking, automatic failovers, and full observability.
When you use LLM APIs directly, you're locked into a single provider's API format, have no visibility into costs until your bill arrives, no automatic failovers when providers go down, and scattered logs across multiple dashboards.
-
+
Edgee gives you:
+ - **Up to 50% cost reduction** — automatic token compression at the edge
+ - **Real-time savings tracking** — see exactly how many tokens and dollars you've saved
- **One API** for all providers — switch models with a single line change
- **Real-time cost tracking** — know exactly what each request costs
- **Automatic failovers** — when OpenAI is down, Claude takes over seamlessly
@@ -22,15 +57,17 @@ icon: message-circle-question-mark
Edgee supports all major LLM providers:
- - **OpenAI** (GPT-4, GPT-4o, GPT-3.5, o1, etc.)
- - **Anthropic** (Claude 3.5, Claude 3 Opus/Sonnet/Haiku)
- - **Google** (Gemini Pro, Gemini Ultra)
- - **Mistral** (Mistral Large, Medium, Small)
- - **Meta** (Llama 3.1, Llama 3)
- - **Cohere** (Command R+, Command R)
- - **AWS Bedrock** (all supported models)
- - **Azure OpenAI** (all GPT models)
- - **And 200+ more models**
+ - **OpenAI**
+ - **Anthropic**
+ - **Google**
+ - **Mistral**
+ - **Meta**
+ - **Cohere**
+ - **AWS Bedrock**
+ - **Azure OpenAI**
+ - **And more**
+
+ For the full list of supported models, [see our dedicated models page](https://www.edgee.cloud/models).
We regularly add new providers and models. If there's a model you need that we don't support, [let us know](https://www.edgee.cloud/contact).
diff --git a/introduction/why-edgee.mdx b/introduction/why-edgee.mdx
index 0f7f580..24559ae 100644
--- a/introduction/why-edgee.mdx
+++ b/introduction/why-edgee.mdx
@@ -4,7 +4,35 @@ description: The technology behind the fastest, most secure AI Gateway
icon: sparkles
---
-Edgee isn't just another proxy. It's purpose-built infrastructure for production AI workloads, combining edge computing, intelligent routing, and zero-trust security in one platform.
+Edgee isn't just another proxy. It's an **edge-native AI Gateway that cuts LLM costs by up to 50%** through intelligent token compression. Combined with edge computing, intelligent routing, and zero-trust security, it's purpose-built for production AI workloads at scale.
+
+## Token Compression
+
+When enabled, token compression runs at the edge before your request reaches LLM providers. This can reduce input tokens by up to 50% for common workloads like RAG pipelines, long document analysis, and multi-turn conversations.
+
+- **Up to 50%** token reduction
+- **Lower latency** with smaller payloads
+- **Real-time** savings tracking
+
+### How It Works
+
+Token compression analyzes your prompt structure to:
+- Remove redundant context without losing semantic meaning
+- Optimize RAG document formatting for better compression ratios
+- Preserve critical instructions and few-shot examples
+- Maintain output quality while reducing input costs
+
+<Note>
+  Compression is most effective for prompts with repeated context (RAG), long system instructions, or verbose multi-turn histories. Simple queries may see minimal compression.
+</Note>
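+
+A minimal sketch of what this looks like from the SDK, assuming an initialized `edgee` client (the fields match the [token compression](/features/token-compression) reference):
+
+```typescript
+const response = await edgee.send({
+  model: 'gpt-4o',
+  input: 'Long RAG context...',
+});
+
+// Compression runs at the edge before the provider sees the prompt
+console.log(response.usage.prompt_tokens_original); // tokens before compression
+console.log(response.usage.prompt_tokens); // tokens actually sent and billed
+console.log(response.usage.saved_tokens); // the difference
+```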
## Edge-First Architecture
@@ -28,6 +56,9 @@ Traditional AI gateways route all traffic through centralized servers. Edgee pro
Your request arrives at one of 100+ global PoPs within milliseconds.
+  <Step title="Token Compression">
+    Prompts are compressed by up to 50% while preserving semantic meaning.
+  </Step>
Our engine selects the optimal model based on cost, performance, or your custom rules.
@@ -35,7 +66,7 @@ Traditional AI gateways route all traffic through centralized servers. Edgee pro
If a provider fails, we instantly retry with your backup models.
- Results stream directly to your app with full observability logged.
+ Results stream directly to your app with full observability and cost tracking logged.
diff --git a/quickstart/api-key.mdx b/quickstart/api-key.mdx
index 8f37f3b..3e725a6 100644
--- a/quickstart/api-key.mdx
+++ b/quickstart/api-key.mdx
@@ -43,7 +43,7 @@ curl https://api.edgee.ai/v1/chat/completions \
-H "Authorization: Bearer $EDGEE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
- "model": "gpt-4o-mini",
+ "model": "gpt-5.2",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```