
Commit 04e7542

author
Noa Limoy
committed
feat(llm-katan): Add Kubernetes deployment support

- Add comprehensive Kustomize manifests (base + overlays for gpt35/claude)
- Implement initContainer for efficient model caching using PVC
- Fix config.py to read YLLM_SERVED_MODEL_NAME from environment variables
- Add deployment documentation with examples for Kind cluster / Minikube

This enables running multiple llm-katan instances in Kubernetes, each serving a different model alias while sharing the same underlying model. The overlays (gpt35, claude) demonstrate multi-instance deployments in which each instance exposes a different served model name (e.g., gpt-3.5-turbo, claude-3-haiku-20240307) via the API. The served model name is now read from environment variables, enabling Kubernetes deployments to expose different model names via the API.

Signed-off-by: Noa Limoy <nlimoy@nlimoy-thinkpadp1gen7.raanaii.csb>

1 parent a149800 commit 04e7542

File tree

10 files changed

+583
-3
lines changed

Lines changed: 288 additions & 0 deletions
# LLM Katan - Kubernetes Deployment

Comprehensive Kubernetes support for deploying LLM Katan in cloud-native environments.

## Overview

This directory provides production-ready Kubernetes manifests using Kustomize for deploying LLM Katan - a lightweight LLM server designed for testing and development workflows.
## Architecture

### Pod Structure

Each deployment consists of two containers:

- **initContainer (model-downloader)**: Downloads models from HuggingFace to the PVC
  - Image: `python:3.11-slim` (~45MB)
  - Checks whether the model already exists before downloading
  - Runs once before the main container starts
- **main container (llm-katan)**: Serves the LLM API
  - Image: `llm-katan:latest` (~1.35GB)
  - Loads the model from the PVC cache
  - Exposes an OpenAI-compatible API on port 8000
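As a rough sketch, the two-container Pod described above might look like the following in `deployment.yaml`. This is illustrative only: the container commands, cache paths, and volume names are assumptions, not the exact manifest.

```yaml
# Illustrative sketch - see base/deployment.yaml for the real manifest
spec:
  initContainers:
    - name: model-downloader
      image: python:3.11-slim
      command: ["sh", "-c"]
      args:
        - |
          # Skip the download if the model is already cached on the PVC
          if [ ! -d /cache/models/Qwen/Qwen3-0.6B ]; then
            pip install --no-cache-dir huggingface_hub
            python -c "from huggingface_hub import snapshot_download; \
              snapshot_download('Qwen/Qwen3-0.6B', local_dir='/cache/models/Qwen/Qwen3-0.6B')"
          fi
      volumeMounts:
        - name: model-cache
          mountPath: /cache/models
  containers:
    - name: llm-katan
      image: llm-katan:latest
      ports:
        - containerPort: 8000   # OpenAI-compatible API
      volumeMounts:
        - name: model-cache
          mountPath: /cache/models
  volumes:
    - name: model-cache
      persistentVolumeClaim:
        claimName: llm-katan-models   # hypothetical PVC name
```

Because the initContainer checks the cache first, repeated Pod restarts reuse the downloaded model instead of fetching it again.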
### Storage

- **PersistentVolumeClaim**: 5Gi for model caching
- **Mount Path**: `/cache/models/`
- **Access Mode**: ReadWriteOnce (single Pod write)
- Models persist across Pod restarts
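The storage settings above correspond to a PVC along these lines (a sketch; the resource name is hypothetical):

```yaml
# Illustrative sketch of base/pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-katan-models      # hypothetical name
  namespace: llm-katan-system
spec:
  accessModes:
    - ReadWriteOnce           # single Pod write
  resources:
    requests:
      storage: 5Gi            # model cache size
```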
### Namespace

All resources deploy to the `llm-katan-system` namespace. Each overlay creates isolated instances within this namespace:

- **gpt35**: Simulates GPT-3.5-turbo
- **claude**: Simulates Claude-3-Haiku
### Resource Naming

Kustomize applies `nameSuffix` to avoid conflicts:

- Base: `llm-katan`
- gpt35 overlay: `llm-katan-gpt35` (via `nameSuffix: -gpt35`)
- claude overlay: `llm-katan-claude` (via `nameSuffix: -claude`)

**How it works:**

```yaml
# overlays/gpt35/kustomization.yaml
nameSuffix: -gpt35  # Automatically appends to all resource names
```

This creates unique resource names for each overlay without manual patches, allowing multiple instances to coexist in the same namespace.
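Besides `nameSuffix`, an overlay typically also overrides the served model name. A sketch of what `overlays/gpt35/kustomization.yaml` might contain, assuming the env var is patched onto the Deployment (the patch structure here is illustrative, not the exact file):

```yaml
# Illustrative sketch - the real overlay may structure this differently
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
nameSuffix: -gpt35
patches:
  - patch: |-
      # Expose the instance under a different model alias via the API
      - op: add
        path: /spec/template/spec/containers/0/env/-
        value:
          name: YLLM_SERVED_MODEL_NAME
          value: gpt-3.5-turbo
    target:
      kind: Deployment
      name: llm-katan
```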
### Networking

- **Service Type**: ClusterIP (internal only)
- **Port**: 8000 (HTTP)
- **Endpoints**: `/health`, `/v1/models`, `/v1/chat/completions`
### Health Checks

- **Startup Probe**: 30s initial delay, up to 60 failures (15 min max startup)
- **Liveness Probe**: 15s delay, checks every 20s
- **Readiness Probe**: 5s delay, checks every 10s
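Those numbers translate into probe settings roughly as follows (a sketch assuming HTTP checks against `/health` and a 15s startup period, so 60 failures ≈ 15 minutes; the actual manifest may differ):

```yaml
# Illustrative probe configuration matching the values above
startupProbe:
  httpGet: { path: /health, port: 8000 }
  initialDelaySeconds: 30
  periodSeconds: 15        # assumed: 60 x 15s = 15 min max startup
  failureThreshold: 60
livenessProbe:
  httpGet: { path: /health, port: 8000 }
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet: { path: /health, port: 8000 }
  initialDelaySeconds: 5
  periodSeconds: 10
```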
## Directory Structure

```text
kubernetes/
├── base/                       # Base Kubernetes manifests
│   ├── namespace.yaml          # llm-katan-system namespace
│   ├── deployment.yaml         # Main deployment with health checks
│   ├── service.yaml            # ClusterIP service (port 8000)
│   ├── pvc.yaml                # Model cache storage (5Gi)
│   └── kustomization.yaml      # Base kustomization
├── components/                 # Reusable Kustomize components
│   └── common/                 # Common labels for all resources
│       └── kustomization.yaml  # Shared label definitions
└── overlays/                   # Environment-specific configurations
    ├── gpt35/                  # GPT-3.5-turbo simulation
    │   └── kustomization.yaml  # Overlay with patches for gpt35
    └── claude/                 # Claude-3-Haiku simulation
        └── kustomization.yaml  # Overlay with patches for claude
```
## Prerequisites

Before starting, ensure you have the following tools installed:

- [Docker](https://docs.docker.com/get-docker/) - Container runtime
- [minikube](https://minikube.sigs.k8s.io/docs/start/) - Local Kubernetes
- [kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation) - Kubernetes in Docker
- [kubectl](https://kubernetes.io/docs/tasks/tools/) - Kubernetes CLI
- `kustomize` (built into kubectl 1.14+)
## Configuration

### Environment Variables

Configure via `config.env` or overlay ConfigMaps:

| Variable | Default | Description |
|----------|---------|-------------|
| `YLLM_MODEL` | `Qwen/Qwen3-0.6B` | HuggingFace model to load |
| `YLLM_SERVED_MODEL_NAME` | (empty) | Model name for API (defaults to `YLLM_MODEL`) |
| `YLLM_BACKEND` | `transformers` | Backend: `transformers` or `vllm` |
| `YLLM_HOST` | `0.0.0.0` | Server bind address |
| `YLLM_PORT` | `8000` | Server port |
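The served-model-name fallback in the table can be sketched in Python. This is illustrative, not the actual `config.py`; the function name is hypothetical:

```python
import os

def resolve_model_names() -> tuple[str, str]:
    """Resolve the HuggingFace model to load and the name exposed via the API.

    Mirrors the defaults in the table above: YLLM_SERVED_MODEL_NAME
    falls back to YLLM_MODEL when unset or empty.
    """
    model = os.environ.get("YLLM_MODEL", "Qwen/Qwen3-0.6B")
    served = os.environ.get("YLLM_SERVED_MODEL_NAME") or model
    return model, served
```

With `YLLM_SERVED_MODEL_NAME=gpt-3.5-turbo`, the instance still loads `Qwen/Qwen3-0.6B` but advertises `gpt-3.5-turbo` through `/v1/models`, which is how the overlays simulate different providers.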
### Resource Limits

Default per instance:

```yaml
resources:
  requests:
    cpu: "1"
    memory: "3Gi"
  limits:
    cpu: "2"
    memory: "6Gi"
```
### Storage

- **PVC Size**: 5Gi (adjust in overlays if needed)
- **Access Mode**: ReadWriteOnce
- **Mount Path**: `/cache/models/`
- **Purpose**: Cache downloaded models between restarts
## Deployment

### Deploy Single Instance (Base)

```bash
# From repository root
cd e2e-tests/llm-katan/deploy/kubernetes

# Deploy with default settings
kubectl apply -k base/

# Check status
kubectl get pods -n llm-katan-system
kubectl logs -n llm-katan-system -l app=llm-katan -f

# Test the deployment
kubectl port-forward -n llm-katan-system svc/llm-katan 8000:8000
curl http://localhost:8000/health
```
### Deploy Multi-Instance (Overlays)

```bash
# Deploy GPT-3.5-turbo simulation
kubectl apply -k overlays/gpt35/

# Deploy Claude-3-Haiku simulation
kubectl apply -k overlays/claude/

# Or deploy both simultaneously
kubectl apply -k overlays/gpt35/ && kubectl apply -k overlays/claude/

# Verify both are running
kubectl get pods -n llm-katan-system
kubectl get svc -n llm-katan-system
```
## Testing & Verification

### Health Check

```bash
kubectl port-forward -n llm-katan-system svc/llm-katan 8000:8000
curl http://localhost:8000/health

# Expected response:
# {"status":"ok","model":"Qwen/Qwen3-0.6B","backend":"transformers"}
```
### Chat Completion

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
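The same request can be issued from Python using only the standard library (a sketch; assumes the port-forward above is active, and the helper name is hypothetical):

```python
import json
import urllib.request

# The OpenAI-style chat payload from the curl example above
payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
}

def chat(url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the chat payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```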
### Models Endpoint

```bash
curl http://localhost:8000/v1/models
```
### Metrics (Prometheus)

```bash
# Port-forward first, if not already active
kubectl port-forward -n llm-katan-system svc/llm-katan 8000:8000
curl http://localhost:8000/metrics

# Metrics exposed:
# - llm_katan_requests_total
# - llm_katan_tokens_generated_total
# - llm_katan_response_time_seconds
# - llm_katan_uptime_seconds
```
## Troubleshooting

### Common Issues

**Common pod errors:**

- `OOMKilled`: Increase memory limits (current: 6Gi)
- `ImagePullBackOff`: Load the image into Kind with `kind load docker-image llm-katan:latest`
- `Init:CrashLoopBackOff`: Check initContainer logs for download issues
**Pod not starting:**

```bash
# Check pod status
kubectl get pods -n llm-katan-system

# Describe pod for events
kubectl describe pod -n llm-katan-system -l app.kubernetes.io/name=llm-katan

# Check initContainer logs (model download)
kubectl logs -n llm-katan-system -l app.kubernetes.io/name=llm-katan -c model-downloader

# Check main container logs
kubectl logs -n llm-katan-system -l app.kubernetes.io/name=llm-katan -c llm-katan -f
```
**LLM Katan not responding:**

```bash
# Check deployment status
kubectl get deployment -n llm-katan-system

# Check service
kubectl get svc -n llm-katan-system

# Check if port-forward is active
ps aux | grep "port-forward" | grep llm-katan

# Test health endpoint
kubectl port-forward -n llm-katan-system svc/llm-katan-gpt35 8000:8000 &
curl http://localhost:8000/health
```
**PVC issues:**

```bash
# Check PVC status
kubectl get pvc -n llm-katan-system

# Check PVC details
kubectl describe pvc -n llm-katan-system

# Check volume contents (if pod is running)
kubectl exec -n llm-katan-system <pod-name> -- ls -lah /cache/models/
```
## Cleanup

**Remove Specific Overlay:**

```bash
# Remove gpt35 instance
kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/

# Remove claude instance
kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/
```

**Remove All llm-katan Resources:**

```bash
# Delete entire namespace (removes everything)
kubectl delete namespace llm-katan-system

# Or delete just the base deployment
kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/base/
```

**Cleanup Kind Cluster:**

```bash
# Stop Kind cluster
kind delete cluster --name llm-katan-test

# Or if using the default cluster name
kind delete cluster
```
