This project demonstrates the deployment and management of DeepSeek AI models on Amazon Elastic Kubernetes Service (EKS). It showcases several distilled DeepSeek-R1 models, including DeepSeek-R1-Distill-Qwen-14B, DeepSeek-R1-Distill-Qwen-32B, and DeepSeek-R1-Distill-Llama-8B, served with vLLM for efficient inference. The project also includes node pool management with Karpenter, performance-testing tooling, and a user-friendly web interface for interacting with the models.
## Features

- **Qwen-14B Deployment** (`qwen-14b-deployment.yaml`)
  - Deploys the `DeepSeek-R1-Distill-Qwen-14B` model
  - Uses vLLM for serving
  - Configured with GPU support
- **Qwen-32B Deployment** (`qwen-32b-deployment.yaml`)
  - Deploys the `DeepSeek-R1-Distill-Qwen-32B` model
  - Similar configuration to Qwen-14B, but with adjusted resource requirements
- **DeepSeek-R1-Distill-Llama-8B Deployment** (`deployment.yaml`)
  - Deploys the `DeepSeek-R1-Distill-Llama-8B` model
  - Uses vLLM for serving
  - Configured with GPU support
- **GPU Node Pool** (`nodepool.yaml`)
  - Configures Karpenter to manage GPU-enabled nodes
  - Supports various GPU instance types (g5, g6, g6e, p5, p4)
  - Includes both Spot and On-Demand instances
- **ML Accelerator Node Pool** (`nodepool.yaml`)
  - Configures Karpenter to manage nodes with AWS Inferentia and Trainium accelerators
  - Supports the inf1, inf2, trn1, and trn1n instance families
- **GenAI Performance Tool** (`genai-perf.yaml`, `prompts.sh`)
  - Deploys a Triton Inference Server for performance testing
  - Includes scripts for running performance profiles against the deployed models
- **Open-WebUI** (`open-webui.yaml`)
  - Deploys a web-based user interface for interacting with the DeepSeek models
  - Connects to the deployed vLLM services
## Prerequisites

- An Amazon EKS cluster
- `kubectl` configured to access your cluster
- Karpenter installed and configured
## Node Pool Configuration

Apply the node pool configurations:

```bash
kubectl apply -f nodepool.yaml
```
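For orientation, a Karpenter `NodePool` for GPU capacity typically looks something like the sketch below. This is a minimal illustration rather than this project's exact manifest: the `EC2NodeClass` name, taint, and limits are assumptions, so treat `nodepool.yaml` as authoritative.

```yaml
# Minimal sketch of a Karpenter GPU NodePool (illustrative values only;
# see nodepool.yaml for the configuration actually used here).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6", "g6e", "p5", "p4"]  # families from the feature list; verify exact label values
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]            # mix Spot and On-Demand capacity
      nodeClassRef:                                # assumes an EC2NodeClass named "default"
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: nvidia.com/gpu                      # keep non-GPU workloads off these nodes
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 8                              # illustrative cap on provisioned GPUs
```

The ML accelerator pool follows the same pattern, with the `karpenter.k8s.aws/instance-family` requirement restricted to inf1, inf2, trn1, and trn1n.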
## Deployment

Apply the deployment YAML files:

```bash
kubectl apply -f qwen-14b-deployment.yaml
kubectl apply -f qwen-32b-deployment.yaml
kubectl apply -f deployment.yaml
```
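Each of these manifests follows the same basic shape: a Kubernetes `Deployment` that runs vLLM's OpenAI-compatible server with the Hugging Face model ID as an argument and requests a GPU. The sketch below is a hedged illustration, not the project's exact manifest; the image tag, labels, and resource values are assumptions.

```yaml
# Minimal sketch of a vLLM model deployment (illustrative; see
# qwen-14b-deployment.yaml for the actual manifest).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-14b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-14b
  template:
    metadata:
      labels:
        app: qwen-14b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumed image; pin a real tag in practice
          args:
            - "--model"
            - "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
          ports:
            - containerPort: 8000          # vLLM's default API port
          resources:
            limits:
              nvidia.com/gpu: 1            # larger models may need more GPUs
```

Once the pod is ready, the model is reachable through vLLM's OpenAI-compatible endpoints (for example `/v1/chat/completions`), which is what both the performance tool and the web UI consume.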
## Performance Testing

1. Deploy the Triton Inference Server:

   ```bash
   kubectl apply -f genai-perf.yaml
   ```

2. Use the `prompts.sh` script to run performance tests.
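For context, `genai-perf` ships in NVIDIA's Triton Inference Server SDK container, so `genai-perf.yaml` plausibly runs a pod from that image which `prompts.sh` then drives. The sketch below is an assumption about the shape of that manifest; the image tag and names are illustrative, and the actual file is authoritative.

```yaml
# Illustrative pod for running genai-perf profiles in-cluster
# (assumed image tag; see genai-perf.yaml for the real manifest).
apiVersion: v1
kind: Pod
metadata:
  name: genai-perf
spec:
  containers:
    - name: genai-perf
      image: nvcr.io/nvidia/tritonserver:24.12-py3-sdk  # SDK image bundles genai-perf
      command: ["sleep", "infinity"]                    # keep the pod alive for exec'd runs
  restartPolicy: Never
```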
## User Interface

1. Deploy the Open-WebUI:

   ```bash
   kubectl apply -f open-webui.yaml
   ```

2. Access the UI through the exposed service.
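Open-WebUI connects to the models through vLLM's OpenAI-compatible API, so the manifest essentially points the UI at an in-cluster service URL. A minimal sketch, assuming a vLLM service named `qwen-14b` on port 8000 (both hypothetical here):

```yaml
# Illustrative excerpt: wiring Open-WebUI to a vLLM service
# (service name and port are assumptions; see open-webui.yaml).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:main
          env:
            - name: OPENAI_API_BASE_URL          # Open-WebUI's OpenAI backend setting
              value: "http://qwen-14b:8000/v1"   # hypothetical vLLM service URL
          ports:
            - containerPort: 8080                # Open-WebUI listens on 8080 by default
```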
## Project Structure

```
.
├── deepseek-using-vllm-on-eks
│   ├── chatbot-ui
│   │   ├── application
│   │   │   ├── app.py
│   │   │   ├── Dockerfile
│   │   │   └── requirements.txt
│   │   └── manifests
│   │       ├── deployment.yaml
│   │       └── ingress-class.yaml
│   ├── CODE_OF_CONDUCT.md
│   ├── CONTRIBUTING.md
│   ├── main.tf
│   ├── manifests
│   │   ├── deepseek-deployment-gpu.yaml
│   │   └── gpu-nodepool.yaml
│   └── README.md
├── ec2nodepool.yaml
├── genai-perf.yaml
├── k8s-manifest
│   ├── genai-perf
│   │   ├── genai-perf-2409.yaml
│   │   └── genai-perf-2412.yaml
│   ├── priority-class.yaml
│   ├── sglang
│   │   ├── llama-8b-sglang.yaml
│   │   └── qwen-32b-sglang.yaml
│   └── vllm
│       ├── llama-8b-vllm.yaml
│       ├── qwen-14b-deployment.yaml
│       └── qwen-32b-deployment.yaml
├── nodepool.yaml
├── open-webui.yaml
├── prompts.sh
└── README.md
```
## Contributing

Contributions to this project are welcome. Please refer to the CONTRIBUTING.md file for guidelines.
## License

This project is licensed under [LICENSE_NAME]. Please see the LICENSE file for details.