Drop-in OpenAI replacement • High-performance Go architecture • Enterprise-grade reliability
Handle thousands of concurrent requests on a single instance | Significantly reduced infrastructure costs vs interpreted alternatives | Minimal overhead with native Go performance
Performance Benchmarks
Metric | pLLM (Go) | Typical Interpreted Gateway | Advantage |
---|---|---|---|
Concurrent Connections | High (thousands) | Limited | Superior concurrency |
Memory Usage | 50-80MB | 150-300MB+ | Lower footprint |
Startup Time | <100ms | 2-5s | Instant startup |
CPU Efficiency | All cores utilized | GIL limitations | True parallelism |
Gateway Latency Overhead | Sub-millisecond | Variable | Consistent performance |
Infrastructure | Single instance capable | Often requires scaling | Higher efficiency |
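The latency row refers to the gateway's own overhead, not end-to-end LLM latency. To sanity-check concurrency on your own hardware, you can point a load generator at a locally running instance (see the Quick Start below). A minimal sketch, assuming the hey load generator is installed and the gateway is listening on localhost:8080 with a gpt-3.5-turbo model configured; every request reaches the upstream provider unless caching absorbs it, so keep the request count modest or use a test key:
# Rough concurrency smoke test against a local pLLM instance
hey -n 1000 -c 200 -m POST \
  -T "application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "ping"}]}' \
  http://localhost:8080/v1/chat/completions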
Cost Analysis (High-Concurrency Scenario)
pLLM:                 1x instance required
Interpreted gateway:  multiple instances

Result: significant infrastructure savings and lower operational complexity
Technical Architecture Advantages
- Python's Global Interpreter Lock → single-threaded execution
- Go's goroutines → true parallel processing on all cores
- No interpreter overhead
- Direct machine code execution
- Optimized memory management
- Battle-tested Chi router
- 6 load balancing strategies
- Hot configuration reloading
- Zero-downtime deployments
- ✅ 100% OpenAI Compatible - Drop-in replacement, no code changes needed
- ✅ Multi-Provider Support - OpenAI, Anthropic, Azure, Bedrock, Vertex AI, Grok, Cohere
- ✅ Streaming Support - Real-time streaming responses for all providers (see the streaming sketch after this list)
- ✅ Adaptive Routing - Zero failed requests with automatic failover
- ✅ Multi-Key Load Balancing - Distribute load across multiple API keys
- ✅ Advanced Rate Limiting - Per-user, per-model, per-endpoint controls
- ✅ Intelligent Caching - Redis-backed response caching
- ✅ Budget Management - User and group-based spending controls
- ✅ JWT Authentication - Enterprise-grade auth with role-based access
- ✅ Comprehensive Metrics - Prometheus, Grafana, distributed tracing
- ✅ Health Monitoring - Circuit breakers, health scores, auto-recovery
- ✅ Audit Logging - Complete request/response audit trail
- ✅ Swagger UI - Interactive API documentation at /swagger
- ✅ Admin Dashboard - Web UI for monitoring and configuration
- ✅ Hot Reload - Change configs without restarts
- ✅ Docker Ready - One-command deployment
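Because pLLM speaks the OpenAI wire format, streaming works through the standard OpenAI SDKs with no gateway-specific code. A minimal sketch, assuming a local instance from the Quick Start below and a configured gpt-3.5-turbo model:
from openai import OpenAI

client = OpenAI(api_key="your-api-key", base_url="http://localhost:8080/v1")

# Ask for a streamed completion; tokens arrive incrementally
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a haiku about gateways."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a delta; print content as it arrives
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)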
Deploy pLLM on Kubernetes with high availability and auto-scaling:
# 1. Add the Helm repository
helm repo add pllm https://andreimerfu.github.io/pllm
helm repo update
# 2. Create your configuration
cat > pllm-values.yaml <<EOF
pllm:
  secrets:
    jwtSecret: "your-super-secret-jwt-key"
    masterKey: "sk-master-production-key"
    openaiApiKey: "sk-your-openai-key"

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: pllm.yourdomain.com
      paths:
        - path: /
          pathType: Prefix

replicaCount: 3

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
EOF
# 3. Install pLLM
helm install pllm pllm/pllm -f pllm-values.yaml
# 4. Check status
kubectl get pods -l app.kubernetes.io/name=pllm
For local development and testing:
# 1. Clone and setup
git clone https://github.com/andreimerfu/pllm.git && cd pllm
cp .env.example .env
# 2. Add your API key to .env
echo "OPENAI_API_KEY=sk-your-key-here" >> .env
# 3. Launch PLLM
docker compose up -d
# 4. Test it works
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello!"}]}'
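If the gateway and your upstream key are wired up correctly, the reply follows the OpenAI chat completion schema; the exact IDs, token counts, and content will differ, but it looks roughly like this:
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help you today?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 9, "total_tokens": 18}
}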
Service | URL | Description |
---|---|---|
API | http://localhost:8080/v1 | Main gateway endpoint |
Swagger | http://localhost:8080/swagger | Interactive API docs |
Admin UI | http://localhost:8080/ui | Web admin dashboard |
Documentation | http://localhost:8080/docs | Project documentation |
Metrics | http://localhost:8080/metrics | Prometheus metrics |
Option 1: Using Swagger UI
- Open http://localhost:8080/swagger
- Navigate to /v1/chat/completions
- Click "Try it out" and paste:
{
  "model": "gpt-3.5-turbo",
  "messages": [{"role": "user", "content": "Hello!"}],
  "temperature": 0.7
}
Option 2: Using Python
from openai import OpenAI
client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:8080/v1"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Option 3: Using cURL
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-API-Key: your-api-key" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# .env file
OPENAI_API_KEY=sk-your-key-here
# Optional: Multi-key load balancing
OPENAI_API_KEY_2=sk-second-key
OPENAI_API_KEY_3=sk-third-key
# Optional: Other providers
ANTHROPIC_API_KEY=your-anthropic-key
AZURE_API_KEY=your-azure-key
Model Configuration (config.yaml)
model_list:
  - model_name: my-gpt-4
    params:
      model: gpt-4
      api_key: ${OPENAI_API_KEY}
Routing Configuration
router:
  routing_strategy: "latency-based"
  circuit_breaker_enabled: true
  fallbacks:
    my-gpt-4: ["my-gpt-35-turbo"]  # Automatic fallback chains
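Fallbacks are resolved inside the gateway, so client code only references the model_name alias; which deployment actually serves the request is decided by the router. A sketch of that assumption with the OpenAI Python SDK (it presumes my-gpt-35-turbo is also defined in model_list):
from openai import OpenAI

client = OpenAI(api_key="your-api-key", base_url="http://localhost:8080/v1")

# "my-gpt-4" is the alias from model_list, not the provider's model ID.
# If it is unhealthy, the fallback chain above routes to "my-gpt-35-turbo"
# without any change on the client side.
response = client.chat.completions.create(
    model="my-gpt-4",
    messages=[{"role": "user", "content": "Explain fallback chains in one sentence."}],
)
print(response.choices[0].message.content)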
pLLM provides a comprehensive Helm chart for production Kubernetes deployments with built-in high availability, auto-scaling, and monitoring.
# Add the official Helm repository
helm repo add pllm https://andreimerfu.github.io/pllm
helm repo update
# Install with default configuration
helm install pllm pllm/pllm \
--set pllm.secrets.jwtSecret="your-jwt-secret" \
--set pllm.secrets.masterKey="sk-master-your-key" \
--set pllm.secrets.openaiApiKey="sk-your-openai-key"
High Availability Configuration
# production-values.yaml
pllm:
  secrets:
    jwtSecret: "your-super-secret-jwt-key-min-32-chars"
    masterKey: "sk-master-production-key"
    openaiApiKey: "sk-your-openai-key"
    anthropicApiKey: "sk-ant-your-anthropic-key"

# High availability setup
replicaCount: 3

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

# Resource limits
resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 200m
    memory: 256Mi

# Ingress with TLS
ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "1000"
    nginx.ingress.kubernetes.io/rate-limit-window: "1m"
  hosts:
    - host: api.pllm.yourdomain.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: pllm-api-tls
      hosts:
        - api.pllm.yourdomain.com

# Monitoring
serviceMonitor:
  enabled: true
  labels:
    prometheus: kube-prometheus

# Database and Redis (production ready)
postgresql:
  enabled: true
  auth:
    database: pllm
    username: pllm
    password: "your-secure-db-password"
  primary:
    persistence:
      size: 20Gi
    resources:
      requests:
        memory: 256Mi
        cpu: 250m

redis:
  enabled: true
  auth:
    enabled: true
    password: "your-secure-redis-password"
  master:
    persistence:
      size: 8Gi
    resources:
      requests:
        memory: 256Mi
        cpu: 100m
Deploy with:
helm install pllm pllm/pllm -f production-values.yaml
External Dependencies (Cloud)
For cloud deployments using managed services:
# cloud-values.yaml
# Disable internal dependencies
postgresql:
  enabled: false
redis:
  enabled: false
dex:
  enabled: false

pllm:
  config:
    database:
      host: "your-rds-instance.amazonaws.com"
      port: 5432
      name: pllm
      user: pllm
      sslMode: require
    redis:
      host: "your-redis-cluster.cache.amazonaws.com"
      port: 6379
      tls: true
    auth:
      dex:
        issuer: "https://your-auth-provider.com"
  secrets:
    databasePassword: "your-db-password"
    redisPassword: "your-redis-password"
    jwtSecret: "your-jwt-secret"
    masterKey: "sk-master-key"
    openaiApiKey: "sk-openai-key"
    dexClientSecret: "your-auth-client-secret"

# Multi-region setup
replicaCount: 5

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 50

# Pod topology spread for availability zones
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
The pLLM Helm chart is available through multiple registries:
Registry | Command |
---|---|
GitHub Pages | helm repo add pllm https://andreimerfu.github.io/pllm |
Docker Hub (OCI) | helm install pllm oci://registry-1.docker.io/amerfu/pllm |
ArtifactHub | View on ArtifactHub |
The Helm chart includes comprehensive monitoring out of the box:
- Prometheus Metrics - ServiceMonitor for automatic discovery
- Grafana Dashboards - Pre-built dashboards for key metrics
- Health Checks - Kubernetes health and readiness probes
- Distributed Tracing - OpenTelemetry integration ready
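A quick way to confirm the monitoring wiring after an install with serviceMonitor.enabled=true; the label selector and service name below assume a release called pllm with the chart's standard labels, so adjust them to your deployment:
# ServiceMonitor created by the chart (requires the Prometheus Operator CRDs)
kubectl get servicemonitor -l app.kubernetes.io/name=pllm

# Spot-check the metrics endpoint through a port-forward
kubectl port-forward svc/pllm 8080:8080 &
curl -s http://localhost:8080/metrics | head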
# List available versions
helm search repo pllm/pllm --versions
# Upgrade to latest
helm repo update
helm upgrade pllm pllm/pllm -f your-values.yaml
# Rollback if needed
helm rollback pllm 1
For simpler deployments or development environments:
Docker Compose
# Clone and deploy
git clone https://github.com/andreimerfu/pllm.git
cd pllm
# Configure environment
cp .env.example .env
# Edit .env with your API keys
# Deploy
docker compose up -d
# Scale if needed
docker compose up -d --scale pllm=3
Standalone Docker
# Run pLLM container
docker run -d \
--name pllm \
-p 8080:8080 \
-e OPENAI_API_KEY=sk-your-key \
-e JWT_SECRET=your-jwt-secret \
-e MASTER_KEY=sk-master-key \
amerfu/pllm:latest
# With external database
docker run -d \
--name pllm \
-p 8080:8080 \
-e DATABASE_URL=postgres://user:pass@host:5432/pllm \
-e REDIS_URL=redis://host:6379 \
-e OPENAI_API_KEY=sk-your-key \
amerfu/pllm:latest
Local Development
# Prerequisites: Go 1.23+, PostgreSQL, Redis
git clone https://github.com/andreimerfu/pllm.git
cd pllm
# Start dependencies
docker compose up postgres redis -d
# Install dependencies
go mod download
cd web && npm ci && cd ..
# Run with hot reload
make dev
# Or run directly
go run cmd/server/main.go
from openai import OpenAI
# Just change the base_url - that's it!
client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:8080/v1"  # ← Point to pLLM
)

# Use exactly like OpenAI
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)
import OpenAI from 'openai';
const openai = new OpenAI({
  apiKey: 'your-api-key',
  baseURL: 'http://localhost:8080/v1' // ← Point to pLLM
});

const completion = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  messages: [{role: "user", content: "Hello!"}]
});
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(
    openai_api_base="http://localhost:8080/v1",
    openai_api_key="your-api-key",
    model="gpt-3.5-turbo"
)
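A quick sanity check, assuming a LangChain version where chat models expose invoke (older releases use predict instead):
# Send a prompt through the gateway and print the reply
print(llm.invoke("Say hello in five words.").content)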
PLLM automatically handles failures and load spikes:
graph LR
A[Request] --> B{Health Check}
B -->|Healthy| C[Primary Model]
B -->|Degraded| D[Fallback Model]
B -->|Failed| E[Circuit Breaker]
C --> F[Response]
D --> F
E --> D
- Automatic Failover - Instant fallback to healthy providers
- Performance Routing - Routes to fastest responding models
- Health Scoring - Real-time 0-100 health scores
- Circuit Breaking - Prevents cascade failures
- Load Protection - Graceful degradation under load
Load Balancer
      │
      ▼
pLLM Gateway
  ├─ Chi Router
  ├─ Auth Layer
  └─ Cache Layer
      │
      ▼
Provider Abstraction Layer
  ├─ OpenAI
  ├─ Claude
  ├─ Azure
  ├─ Vertex AI
  └─ Bedrock
Tech Stack:
- Chi Router - Lightning-fast HTTP routing
- PostgreSQL + GORM - Reliable data persistence
- Redis - High-speed caching & rate limiting
- Prometheus - Enterprise monitoring
- Swagger - Auto-generated API docs
Strategy | Description | Best For |
---|---|---|
Round Robin | Even distribution | Balanced load |
Least Busy | Routes to least loaded | Variable workloads |
Weighted | Custom weight distribution | Tiered providers |
Priority | Prefers high-priority | Cost optimization |
Latency-Based | Fastest response wins | Performance critical |
Usage-Based | Respects rate limits | Token management |
Access real-time metrics at http://localhost:8080/metrics
Request Rate:    1,234 req/s
P99 Latency:     0.8ms
Cache Hit Rate:  92%
Active Models:   12/15
Token Usage:     45,678/100,000
Error Rate:      0.01%
Endpoint | Description | Response |
---|---|---|
/health | Basic health | {"status": "ok"} |
/ready | Full readiness check | Includes all dependencies |
/metrics | Prometheus metrics | Full metrics export |
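These endpoints drop straight into Kubernetes probes or a shell smoke test. The /health payload is documented above; the exact /ready payload isn't shown here, so treat its output as illustrative:
# Liveness: returns {"status": "ok"}
curl -s http://localhost:8080/health

# Readiness: reports the gateway plus its dependencies (database, Redis, providers)
curl -s http://localhost:8080/ready

# Prometheus metrics in text exposition format
curl -s http://localhost:8080/metrics | head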
- Handle thousands of concurrent requests on a single instance
- Consistent low latency across percentiles
- True multi-core utilization without interpreter limitations
- Reduced infrastructure costs vs interpreted alternatives
- Fewer instances required for equivalent load
- Simplified operational complexity and maintenance
- Built on Go's battle-tested concurrency model
- Zero-downtime deployments with hot reload
- 99.99% uptime capability with proper configuration
- <100ms startup time enables aggressive scaling
- Minimal memory footprint (50-80MB)
- Kubernetes-ready with health checks and metrics
⚠️ Critical for High-Volume Deployments

For massive performance and ultra-low latency, the bottleneck is often the LLM providers themselves, not the gateway. To achieve true enterprise scale:
- Multiple LLM Deployments: Deploy several instances of the same model (e.g., 5-10 GPT-4 Azure OpenAI deployments)
- Multi-Provider Redundancy: Use multiple AWS Bedrock accounts, Azure regions, or provider accounts
- Geographic Distribution: Deploy models across regions for latency optimization
Example Enterprise Setup:
# High-Performance Configuration
model_list:
  - model_name: gpt-4
    deployments:
      - azure_deployment_1_east
      - azure_deployment_2_east
      - azure_deployment_3_west
      - bedrock_account_1
      - bedrock_account_2

Why This Matters: A single LLM deployment typically handles 60-100 RPM. For 10,000+ concurrent users, you need multiple deployments of the same model to prevent provider-side bottlenecks. pLLM's adaptive routing automatically distributes load across all deployments.
Most companies ignore this critical scaling requirement and hit provider limits rather than gateway limits.
- Documentation - Comprehensive guides
- GitHub Issues - Bug reports & features
We welcome contributions! Please see our GitHub Issues for:
- Bug reports
- Feature requests
- Pull requests
- Documentation improvements
- OpenAI compatibility
- Multi-provider support
- Adaptive routing
- Prometheus metrics
- Web admin UI
- Semantic caching
- Custom model fine-tuning
- GraphQL API
Licensed under the MIT License
Built with ❤️ by the pLLM team