GPU-aware autoscaler for LLM inference workloads with optimal resource allocation
The Workload-Variant-Autoscaler (WVA) is a Kubernetes controller that performs intelligent autoscaling for inference model servers. It assigns GPU types to models, determines optimal replica counts for given request traffic loads and service classes, and configures batch sizes—all while optimizing for cost and performance.
- Intelligent Autoscaling: Optimizes replica count and GPU allocation based on workload, performance models, and SLO requirements
- Cost Optimization: Minimizes infrastructure costs while meeting SLO requirements
- Performance Modeling: Uses queueing theory (M/M/1/k, M/G/1 models) for accurate latency and throughput prediction
- Multi-Model Support: Manages multiple models with different service classes and priorities
- Kubernetes v1.31.0+ (or OpenShift 4.18+)
- Helm 3.x
- kubectl
```bash
# Install the WVA Helm chart from the local charts directory (a published Helm repository is planned)
helm upgrade -i workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  --namespace workload-variant-autoscaler-system \
  --set-file prometheus.caCert=/tmp/prometheus-ca.crt \
  --set variantAutoscaling.accelerator=L40S \
  --set variantAutoscaling.modelID=unsloth/Meta-Llama-3.1-8B \
  --set vllmService.enabled=true \
  --set vllmService.nodePort=30000 \
  --create-namespace
```
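If you prefer a values file over individual flags, the equivalent overrides look roughly like this (a sketch derived from the flags above; check the chart's `values.yaml` for the authoritative schema). Install with `-f wva-values.yaml` in place of the `--set`/`--set-file` flags.

```yaml
# wva-values.yaml -- sketch equivalent to the --set/--set-file flags above
prometheus:
  # --set-file injects the contents of /tmp/prometheus-ca.crt here
  caCert: |
    <PEM-encoded Prometheus CA certificate>
variantAutoscaling:
  accelerator: L40S
  modelID: unsloth/Meta-Llama-3.1-8B
vllmService:
  enabled: true
  nodePort: 30000
```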
```bash
# Deploy WVA with llm-d infrastructure on a local Kind cluster
make deploy-llm-d-wva-emulated-on-kind

# This creates a Kind cluster with emulated GPUs and deploys:
# - WVA controller
# - llm-d infrastructure (simulation mode)
# - Prometheus and monitoring stack
# - vLLM emulator for testing
```
Works on Mac (Apple Silicon/Intel) and Windows - no physical GPUs needed!
Perfect for development and testing with GPU emulation.
See the Installation Guide for detailed instructions.
- Architecture Overview
- Architecture Diagrams - Visual architecture and workflow diagrams
WVA consists of several key components:
- Reconciler: Kubernetes controller that manages VariantAutoscaling resources
- Collector: Gathers cluster state and vLLM server metrics
- Model Analyzer: Performs per-model analysis using queueing theory (see the sketch below)
- Optimizer: Makes global scaling decisions across models
- Actuator: Emits metrics to Prometheus and updates deployment replicas
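As a rough illustration of what such an analyzer computes (a textbook M/M/1/K sketch, not necessarily the exact formulas WVA implements): for arrival rate $\lambda$, service rate $\mu$, utilization $\rho = \lambda/\mu \neq 1$, and queue capacity $K$,

```math
P_K = \frac{(1-\rho)\,\rho^{K}}{1-\rho^{K+1}}, \qquad
X = \lambda\,(1-P_K), \qquad
L = \frac{\rho}{1-\rho} - \frac{(K+1)\,\rho^{K+1}}{1-\rho^{K+1}}, \qquad
W = \frac{L}{X}
```

where $P_K$ is the probability that a request is rejected because the queue is full, $X$ is the effective throughput, $L$ is the mean number of requests in the system, and $W$ is the mean latency via Little's law. The M/G/1 variant instead uses the Pollaczek-Khinchine formula, which accounts for non-exponential service times.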
For detailed architecture information, see the design documentation.
- Platform admin deploys llm-d infrastructure (including model servers) and waits for servers to warm up and start serving requests
- Platform admin creates a VariantAutoscaling CR for the running deployment
- WVA continuously monitors request rates and server performance via Prometheus metrics
- Model Analyzer estimates latency and throughput using queueing models
- Optimizer solves for minimal cost allocation meeting all SLOs
- Actuator emits optimization metrics to Prometheus and updates VariantAutoscaling status
- External autoscaler (HPA/KEDA) reads the metrics and scales the deployment accordingly
Important Notes:
- Create the VariantAutoscaling CR only after your deployment is warmed up to avoid immediate scale-down
- Configure an HPA stabilization window of at least 120 seconds for gradual scaling behavior (see the example below)
- WVA updates the VA status with current and desired allocations every reconciliation cycle
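For example, an HPA that consumes a WVA-emitted metric and applies the recommended stabilization window might look like the following. This is an illustrative sketch: the Deployment name and metric name are placeholders, and an external-metrics adapter (e.g., prometheus-adapter or KEDA) must expose the Prometheus metric to the HPA.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-8b-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-8b                      # placeholder: your model server Deployment
  minReplicas: 1
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120   # 120s+ as recommended above, for gradual scale-down
  metrics:
    - type: External
      external:
        metric:
          name: wva_desired_replicas    # placeholder: the WVA optimization metric exposed by your adapter
        target:
          type: AverageValue
          averageValue: "1"             # HPA scales the Deployment to roughly the metric value
```

A KEDA ScaledObject with a Prometheus trigger can serve the same purpose if you prefer KEDA over a plain HPA.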
```yaml
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-autoscaler
  namespace: llm-inference
spec:
  modelName: "meta/llama-3.1-8b"
  serviceClass: "Premium"
  acceleratorType: "A100"
  minReplicas: 1
  maxBatchSize: 256
```
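Multiple models are managed by creating one VariantAutoscaling per model. For instance, a second variant on a different accelerator and service class might look like this (an illustrative sketch reusing the fields above; the service class name and values are placeholders for your own configuration):

```yaml
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-l40s-autoscaler
  namespace: llm-inference
spec:
  modelName: "unsloth/Meta-Llama-3.1-8B"
  serviceClass: "Standard"        # placeholder service class name
  acceleratorType: "L40S"
  minReplicas: 1
  maxBatchSize: 128
```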
More examples in config/samples/.
We welcome contributions! See the llm-d Contributing Guide for guidelines.
Join the llm-d autoscaling community meetings to get involved.
Apache 2.0 - see LICENSE for details.
For detailed documentation, visit the docs directory.