⚡ pLLM - performant LLM Gateway

Enterprise-Grade LLM Gateway Built in Go


Drop-in OpenAI replacement • High-performance Go architecture • Enterprise-grade reliability

🚀 Quick Start • 📊 Benchmarks • 📖 Documentation


🎯 Why pLLM?

🚀 High Performance

Handle thousands of concurrent requests on a single instance

💰 Cost Efficient

Significantly reduced infrastructure costs vs interpreted alternatives

⚡ Low Latency

Minimal overhead with native Go performance

📊 Performance Benchmarks

Metric | PLLM (Go) | Typical Interpreted Gateway | Advantage
Concurrent Connections | High (thousands) | Limited | Superior concurrency 🚀
Memory Usage | 50-80MB | 150-300MB+ | Lower footprint 💾
Startup Time | <100ms | 2-5s | Instant startup ⚡
CPU Efficiency | All cores utilized | GIL limitations | True parallelism 🔥
Response Latency | Sub-millisecond | Variable | Consistent performance 📈
Infrastructure | Single instance capable | Often requires scaling | Higher efficiency 💪

💰 Cost Analysis (High Concurrency Scenario)
┌─────────────────────────────────────────────┐
│ PLLM:                1x instance required   │
│ Interpreted Gateway: Multiple instances     │
│                                             │
│ Result: Significant infrastructure savings  │
│ Lower operational complexity                │
└─────────────────────────────────────────────┘
🔧 Technical Architecture Advantages

✅ No GIL Bottleneck

  • Python's Global Interpreter Lock → Single-threaded execution
  • Go's goroutines → True parallel processing on all cores

✅ Native Compilation

  • No interpreter overhead
  • Direct machine code execution
  • Optimized memory management

✅ Enterprise-Ready

  • Battle-tested Chi router
  • 6 load balancing strategies
  • Hot configuration reloading
  • Zero-downtime deployments

✨ Features

🔌 Compatibility

  • ✅ 100% OpenAI Compatible - Drop-in replacement, no code changes needed
  • ✅ Multi-Provider Support - OpenAI, Anthropic, Azure, Bedrock, Vertex AI, Grok, Cohere
  • ✅ Streaming Support - Real-time streaming responses for all providers
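
Because the gateway speaks the OpenAI protocol, streaming should work through the stock OpenAI SDK with only the base URL changed. A minimal sketch (the key and local URL are placeholders):

from openai import OpenAI

client = OpenAI(api_key="your-api-key", base_url="http://localhost:8080/v1")

# stream=True yields incremental chunks instead of a single full response
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a haiku about gateways."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)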

🎯 Enterprise Features

  • ✅ Adaptive Routing - Zero failed requests with automatic failover
  • ✅ Multi-Key Load Balancing - Distribute load across multiple API keys
  • ✅ Advanced Rate Limiting - Per-user, per-model, per-endpoint controls
  • ✅ Intelligent Caching - Redis-backed response caching
  • ✅ Budget Management - User and group-based spending controls
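
If a request trips the rate limits or budget caps above, the client sees a rejection. Assuming the gateway surfaces this as a standard HTTP 429 (typical for OpenAI-compatible proxies, not verified here), a simple backoff loop handles it:

import time
from openai import OpenAI, RateLimitError

client = OpenAI(api_key="your-api-key", base_url="http://localhost:8080/v1")

def chat_with_backoff(messages, retries=5):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
        except RateLimitError:
            # Assumption: rate-limit / budget rejections arrive as HTTP 429,
            # which the OpenAI SDK raises as RateLimitError.
            time.sleep(2 ** attempt)
    raise RuntimeError("still rate limited after retries")

print(chat_with_backoff([{"role": "user", "content": "Hello!"}]).choices[0].message.content)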

πŸ›‘οΈ Security & Monitoring

  • ✅ JWT Authentication - Enterprise-grade auth with role-based access
  • ✅ Comprehensive Metrics - Prometheus, Grafana, distributed tracing
  • ✅ Health Monitoring - Circuit breakers, health scores, auto-recovery
  • ✅ Audit Logging - Complete request/response audit trail

🎨 Developer Experience

  • ✅ Swagger UI - Interactive API documentation at /swagger
  • ✅ Admin Dashboard - Web UI for monitoring and configuration
  • ✅ Hot Reload - Change configs without restarts
  • ✅ Docker Ready - One-command deployment

🚀 Quick Start

⚓ Kubernetes with Helm (Production Ready)

Deploy pLLM on Kubernetes with high availability and auto-scaling:

# 1. Add the Helm repository
helm repo add pllm https://andreimerfu.github.io/pllm
helm repo update

# 2. Create your configuration
cat > pllm-values.yaml <<EOF
pllm:
  secrets:
    jwtSecret: "your-super-secret-jwt-key"
    masterKey: "sk-master-production-key"
    openaiApiKey: "sk-your-openai-key"

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: pllm.yourdomain.com
      paths:
        - path: /
          pathType: Prefix

replicaCount: 3
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
EOF

# 3. Install pLLM
helm install pllm pllm/pllm -f pllm-values.yaml

# 4. Check status
kubectl get pods -l app.kubernetes.io/name=pllm

🐳 Docker Compose (Development)

For local development and testing:

# 1. Clone and setup
git clone https://github.com/andreimerfu/pllm.git && cd pllm
cp .env.example .env

# 2. Add your API key to .env
echo "OPENAI_API_KEY=sk-your-key-here" >> .env

# 3. Launch PLLM
docker compose up -d

# 4. Test it works
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello!"}]}'

πŸ“ Service Endpoints

Service | URL | Description
🌐 API | http://localhost:8080/v1 | Main gateway endpoint
📚 Swagger | http://localhost:8080/swagger | Interactive API docs
🎛️ Admin UI | http://localhost:8080/ui | Web admin dashboard
📖 Documentation | http://localhost:8080/docs | Project documentation
📊 Metrics | http://localhost:8080/metrics | Prometheus metrics

🧪 Quick Test

Option 1: Using Swagger UI
  1. Open http://localhost:8080/swagger
  2. Navigate to /v1/chat/completions
  3. Click "Try it out" and paste:
{
  "model": "gpt-3.5-turbo",
  "messages": [{"role": "user", "content": "Hello!"}],
  "temperature": 0.7
}
Option 2: Using Python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:8080/v1"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Option 3: Using cURL
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

βš™οΈ Configuration

🔑 Basic Setup

# .env file
OPENAI_API_KEY=sk-your-key-here

# Optional: Multi-key load balancing
OPENAI_API_KEY_2=sk-second-key
OPENAI_API_KEY_3=sk-third-key

# Optional: Other providers
ANTHROPIC_API_KEY=your-anthropic-key
AZURE_API_KEY=your-azure-key

πŸŽ›οΈ Advanced Configuration

Model Configuration (config.yaml)
model_list:
  - model_name: my-gpt-4
    params:
      model: gpt-4
      api_key: ${OPENAI_API_KEY}
Routing Configuration
router:
  routing_strategy: "latency-based"
  circuit_breaker_enabled: true
  fallbacks:
    my-gpt-4: ["my-gpt-35-turbo"]  # Automatic fallback chains
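
From the client's point of view, fallbacks are invisible: you request the alias declared in model_list (my-gpt-4 above) and the router decides which deployment actually serves it, falling back per the chain if needed. A minimal sketch (key and URL are placeholders):

from openai import OpenAI

client = OpenAI(api_key="your-api-key", base_url="http://localhost:8080/v1")

# Target the alias from model_list; if my-gpt-4 is unhealthy, the router can
# fall back to my-gpt-35-turbo per the fallbacks config above, without any
# change on the client side.
response = client.chat.completions.create(
    model="my-gpt-4",
    messages=[{"role": "user", "content": "Summarize why an LLM gateway is useful."}],
)
print(response.choices[0].message.content)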

📦 Deployment Options

⚓ Production Deployment with Helm

pLLM provides a comprehensive Helm chart for production Kubernetes deployments with built-in high availability, auto-scaling, and monitoring.

Quick Deployment

# Add the official Helm repository
helm repo add pllm https://andreimerfu.github.io/pllm
helm repo update

# Install with default configuration
helm install pllm pllm/pllm \
  --set pllm.secrets.jwtSecret="your-jwt-secret" \
  --set pllm.secrets.masterKey="sk-master-your-key" \
  --set pllm.secrets.openaiApiKey="sk-your-openai-key"

Advanced Production Setup

High Availability Configuration
# production-values.yaml
pllm:
  secrets:
    jwtSecret: "your-super-secret-jwt-key-min-32-chars"
    masterKey: "sk-master-production-key"
    openaiApiKey: "sk-your-openai-key"
    anthropicApiKey: "sk-ant-your-anthropic-key"

# High availability setup
replicaCount: 3
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

# Resource limits
resources:
  limits:
    cpu: 1000m
    memory: 1Gi
  requests:
    cpu: 200m
    memory: 256Mi

# Ingress with TLS
ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/rate-limit: "1000"
    nginx.ingress.kubernetes.io/rate-limit-window: "1m"
  hosts:
    - host: api.pllm.yourdomain.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: pllm-api-tls
      hosts:
        - api.pllm.yourdomain.com

# Monitoring
serviceMonitor:
  enabled: true
  labels:
    prometheus: kube-prometheus

# Database and Redis (production ready)
postgresql:
  enabled: true
  auth:
    database: pllm
    username: pllm
    password: "your-secure-db-password"
  primary:
    persistence:
      size: 20Gi
    resources:
      requests:
        memory: 256Mi
        cpu: 250m

redis:
  enabled: true
  auth:
    enabled: true
    password: "your-secure-redis-password"
  master:
    persistence:
      size: 8Gi
    resources:
      requests:
        memory: 256Mi
        cpu: 100m

Deploy with:

helm install pllm pllm/pllm -f production-values.yaml
External Dependencies (Cloud)

For cloud deployments using managed services:

# cloud-values.yaml
# Disable internal dependencies
postgresql:
  enabled: false
redis:
  enabled: false
dex:
  enabled: false

pllm:
  config:
    database:
      host: "your-rds-instance.amazonaws.com"
      port: 5432
      name: pllm
      user: pllm
      sslMode: require
    redis:
      host: "your-redis-cluster.cache.amazonaws.com"
      port: 6379
      tls: true
    auth:
      dex:
        issuer: "https://your-auth-provider.com"

  secrets:
    databasePassword: "your-db-password"
    redisPassword: "your-redis-password"
    jwtSecret: "your-jwt-secret"
    masterKey: "sk-master-key"
    openaiApiKey: "sk-openai-key"
    dexClientSecret: "your-auth-client-secret"

# Multi-region setup
replicaCount: 5
autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 50

# Pod topology spread for availability zones
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule

Helm Chart Registry

The pLLM Helm chart is available through multiple registries:

Registry | Command
GitHub Pages | helm repo add pllm https://andreimerfu.github.io/pllm
Docker Hub (OCI) | helm install pllm oci://registry-1.docker.io/amerfu/pllm
ArtifactHub | View on ArtifactHub

Monitoring & Observability

The Helm chart includes comprehensive monitoring out of the box:

  • Prometheus Metrics - ServiceMonitor for automatic discovery
  • Grafana Dashboards - Pre-built dashboards for key metrics
  • Health Checks - Kubernetes health and readiness probes
  • Distributed Tracing - OpenTelemetry integration ready

Chart Versioning & Updates

# List available versions
helm search repo pllm/pllm --versions

# Upgrade to latest
helm repo update
helm upgrade pllm pllm/pllm -f your-values.yaml

# Rollback if needed
helm rollback pllm 1

🐳 Docker Deployment

For simpler deployments or development environments:

Docker Compose
# Clone and deploy
git clone https://github.com/andreimerfu/pllm.git
cd pllm

# Configure environment
cp .env.example .env
# Edit .env with your API keys

# Deploy
docker compose up -d

# Scale if needed
docker compose up -d --scale pllm=3
Standalone Docker
# Run pLLM container
docker run -d \
  --name pllm \
  -p 8080:8080 \
  -e OPENAI_API_KEY=sk-your-key \
  -e JWT_SECRET=your-jwt-secret \
  -e MASTER_KEY=sk-master-key \
  amerfu/pllm:latest

# With external database
docker run -d \
  --name pllm \
  -p 8080:8080 \
  -e DATABASE_URL=postgres://user:pass@host:5432/pllm \
  -e REDIS_URL=redis://host:6379 \
  -e OPENAI_API_KEY=sk-your-key \
  amerfu/pllm:latest

πŸ—οΈ Development Setup

Local Development
# Prerequisites: Go 1.23+, PostgreSQL, Redis
git clone https://github.com/andreimerfu/pllm.git
cd pllm

# Start dependencies
docker compose up postgres redis -d

# Install dependencies
go mod download
cd web && npm ci && cd ..

# Run with hot reload
make dev

# Or run directly
go run cmd/server/main.go

🔌 Integration Examples

Python

from openai import OpenAI

# Just change the base_url - that's it!
client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:8080/v1"  # ← Point to PLLM
)

# Use exactly like OpenAI
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)

Node.js

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: 'your-api-key',
  baseURL: 'http://localhost:8080/v1'  // ← Point to PLLM
});

const completion = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  messages: [{role: "user", content: "Hello!"}]
});

LangChain

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(
    openai_api_base="http://localhost:8080/v1",
    openai_api_key="your-api-key",
    model="gpt-3.5-turbo"
)

🎯 Advanced Features

🔄 Adaptive Routing

PLLM automatically handles failures and load spikes:

graph LR
    A[Request] --> B{Health Check}
    B -->|Healthy| C[Primary Model]
    B -->|Degraded| D[Fallback Model]
    B -->|Failed| E[Circuit Breaker]
    C --> F[Response]
    D --> F
    E --> D
  • 🚨 Automatic Failover - Instant fallback to healthy providers
  • 📊 Performance Routing - Routes to fastest responding models
  • 💯 Health Scoring - Real-time 0-100 health scores
  • 🔌 Circuit Breaking - Prevents cascade failures
  • 🛡️ Load Protection - Graceful degradation under load

→ See Implementation

πŸ—οΈ Architecture

┌────────────────────────────────────────────────────────┐
│                     Load Balancer                      │
├────────────────────────────────────────────────────────┤
│                      PLLM Gateway                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐              │
│  │   Chi    │  │   Auth   │  │  Cache   │              │
│  │  Router  │  │  Layer   │  │  Layer   │              │
│  └──────────┘  └──────────┘  └──────────┘              │
├────────────────────────────────────────────────────────┤
│              Provider Abstraction Layer                │
│  ┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐ │
│  │OpenAI │  │Claude │  │Azure  │  │Vertex │  │Bedrock│ │
│  └───────┘  └───────┘  └───────┘  └───────┘  └───────┘ │
└────────────────────────────────────────────────────────┘

Tech Stack:

  • 🚀 Chi Router - Lightning-fast HTTP routing
  • 🗄️ PostgreSQL + GORM - Reliable data persistence
  • ⚡ Redis - High-speed caching & rate limiting
  • 📊 Prometheus - Enterprise monitoring
  • 📚 Swagger - Auto-generated API docs

βš–οΈ Load Balancing Strategies

Strategy Description Best For
πŸ”„ Round Robin Even distribution Balanced load
πŸ“Š Least Busy Routes to least loaded Variable workloads
βš–οΈ Weighted Custom weight distribution Tiered providers
⭐ Priority Prefers high-priority Cost optimization
⚑ Latency-Based Fastest response wins Performance critical
πŸ“ˆ Usage-Based Respects rate limits Token management

📊 Monitoring & Observability

Metrics Dashboard

Access real-time metrics at http://localhost:8080/metrics

┌──────────────────────────────────────┐
│  Request Rate:     1,234 req/s       │
│  P99 Latency:      0.8ms             │
│  Cache Hit Rate:   92%               │
│  Active Models:    12/15             │
│  Token Usage:      45,678/100,000    │
│  Error Rate:       0.01%             │
└──────────────────────────────────────┘

Health Endpoints

Endpoint | Description | Response
/health | Basic health | {"status": "ok"}
/ready | Full readiness check | Includes all dependencies
/metrics | Prometheus metrics | Full metrics export
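
These endpoints are easy to fold into smoke tests or external monitors; a minimal sketch using requests (the exact /ready payload isn't documented here, so it is just printed):

import requests

BASE = "http://localhost:8080"

# Liveness: expected to return {"status": "ok"}
print(requests.get(f"{BASE}/health", timeout=5).json())

# Readiness: also covers dependencies (database, Redis, providers)
ready = requests.get(f"{BASE}/ready", timeout=5)
print("ready:", ready.status_code, ready.text)

# Prometheus exposition format: one sample per line
metrics = requests.get(f"{BASE}/metrics", timeout=5).text
print("\n".join(metrics.splitlines()[:10]))  # preview the first few lines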

🏢 Enterprise Benefits

🚀 Performance at Scale

  • Handle thousands of concurrent requests on a single instance (see the sketch after this list)
  • Consistent low latency across percentiles
  • True multi-core utilization without interpreter limitations
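
One way to exercise this from the client side is to fan out requests concurrently through the gateway; a rough sketch with the async OpenAI client (key, URL, and batch size are placeholders):

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="your-api-key", base_url="http://localhost:8080/v1")

async def ask(i: int) -> str:
    resp = await client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Request {i}: say hello"}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    # Fire a batch of concurrent requests; the gateway multiplexes them
    # across its upstream keys and deployments.
    results = await asyncio.gather(*(ask(i) for i in range(100)))
    print(f"Completed {len(results)} requests")

asyncio.run(main())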

💰 Infrastructure Efficiency

  • Reduced infrastructure costs vs interpreted alternatives
  • Fewer instances required for equivalent load
  • Simplified operational complexity and maintenance

πŸ›‘οΈ Production Reliability

  • Built on Go's battle-tested concurrency model
  • Zero-downtime deployments with hot reload
  • 99.99% uptime capability with proper configuration

⚡ Instant Auto-scaling

  • <100ms startup time enables aggressive scaling
  • Minimal memory footprint (50-80MB)
  • Kubernetes-ready with health checks and metrics

🏭 Enterprise Performance Scaling

⚠️ Critical for High-Volume Deployments

At high request volumes and for ultra-low latency, the bottleneck is often the LLM providers themselves, not the gateway. To achieve true enterprise scale:

  • Multiple LLM Deployments: Deploy several instances of the same model (e.g., 5-10 GPT-4 Azure OpenAI deployments)
  • Multi-Provider Redundancy: Use multiple AWS Bedrock accounts, Azure regions, or provider accounts
  • Geographic Distribution: Deploy models across regions for latency optimization

Example Enterprise Setup:

# High-Performance Configuration
model_list:
  - model_name: gpt-4
    deployments:
      - azure_deployment_1_east
      - azure_deployment_2_east
      - azure_deployment_3_west
      - bedrock_account_1
      - bedrock_account_2

Why This Matters: A single LLM deployment typically handles 60-100 RPM. For 10,000+ concurrent users, you need multiple deployments of the same model to prevent provider-side bottlenecks. PLLM's adaptive routing automatically distributes load across all deployments.

Most companies ignore this critical scaling requirement and hit provider limits rather than gateway limits.
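
To make the arithmetic concrete, here is a rough capacity sketch using the 60-100 RPM figure above; the per-user traffic profile is a made-up assumption, not a measured number:

import math

per_deployment_rpm = 80           # midpoint of the 60-100 RPM range quoted above
concurrent_users = 10_000
requests_per_user_per_min = 0.1   # hypothetical: one request every ~10 minutes per user

required_rpm = concurrent_users * requests_per_user_per_min
deployments_needed = math.ceil(required_rpm / per_deployment_rpm)
print(f"~{required_rpm:.0f} RPM total -> {deployments_needed} deployments of the same model")
# ~1000 RPM total -> 13 deployments of the same model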

🤝 Community & Support

Get Help

Contributing

We welcome contributions! Please see our GitHub Issues for:

  • πŸ› Bug reports
  • ✨ Feature requests
  • πŸ”§ Pull requests
  • πŸ“– Documentation improvements

📈 Roadmap

  • OpenAI compatibility
  • Multi-provider support
  • Adaptive routing
  • Prometheus metrics
  • Web admin UI
  • Semantic caching
  • Custom model fine-tuning
  • GraphQL API

📄 License

Licensed under the MIT License


Built with ❤️ by the PLLM Team

⭐ Star us on GitHub
