A practical example of deploying fine-tuned LLMs locally, demonstrating how to load and serve custom fine-tuned models from Hugging Face and compare different model configurations (base, LoRA, and merged).
This project showcases how to build a local LLM inference system using fine-tuned models from Hugging Face:
- Models: Fine-tuned LLaMA-3.2-1B models from Hugging Face Hub
- Fine-tuning Process: See the complete fine-tuning workflow at bioinstruct-finetuning-experiment
- Framework: Hugging Face Transformers + PyTorch
- API: FastAPI with Swagger documentation
- UI: Vanilla HTML/CSS/JavaScript chat interface
- Deployment: Docker with GPU support
This repository serves as a practical example of:
- Deploying Fine-tuned Models: How to load and serve custom fine-tuned models from Hugging Face Hub
- Model Comparison: Comparing three approaches to fine-tuning (base model, LoRA adapter, and merged weights)
- Real-world Application: Applying biomedical fine-tuning to create a domain-specific assistant
- Performance Differences: Understanding the trade-offs between different model configurations
The fine-tuned models were created using the BioInstruct dataset to specialize LLaMA-3.2-1B for biomedical questions. This project demonstrates how to integrate these models into a production-ready inference system.
- NVIDIA GPU with CUDA support (4GB+ VRAM recommended)
- Docker with NVIDIA Container Toolkit
# Build the image
docker build -t llm-inference-demo .
# Run with GPU support
docker run --gpus all -p 8000:8000 llm-inference-demo

# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# or: venv\Scripts\activate # Windows
# Install PyTorch with CUDA 12.1 support
# For other CUDA versions, check: https://pytorch.org/get-started/locally/
pip install torch --index-url https://download.pytorch.org/whl/cu121
# Install dependencies
pip install -r requirements.txt
# Run the server
cd app
python -m uvicorn main:app --host 0.0.0.0 --port 8000

Note: Models will be automatically downloaded from Hugging Face on the first request. The first startup may take a few minutes depending on your internet connection.
- Chat UI: http://localhost:8000
- API Docs: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
On first request (or when switching models), the application will automatically download models from Hugging Face:
| Model | Repository | Size | Purpose |
|---|---|---|---|
| Base | unsloth/Llama-3.2-1B-Instruct | ~2.5GB | General-purpose baseline |
| LoRA Adapter | daffakautsar/bioinstruct-llama3.2-1b-lora | ~100MB | Biomedical fine-tuning adapter |
| Merged | daffakautsar/bioinstruct-llama3.2-1b-merged | ~2.5GB | Pre-merged biomedical model |
Total model size: ~5GB
The Docker image is approximately 17-20GB, which includes:
| Component | Approximate Size |
|---|---|
| NVIDIA CUDA base image | ~5-6GB |
| PyTorch + ML dependencies | ~3-4GB |
| LLM models (3 models) | ~5GB |
| System packages | ~1-2GB |
| Total Docker image | ~17-20GB |
- Docker:
  - Models are pre-downloaded during image build
  - Total image size: ~17-20GB
  - No additional downloads needed after build
  - Instant startup (models already cached in image)
- Local:
  - Models download on demand when first requested
  - Download size: ~5GB (models only)
  - First request takes 2-5 minutes depending on internet speed
  - Models cached in ~/.cache/huggingface/hub/
  - Subsequent requests use cached models (instant loading)
- First request: Longer response time due to model download and loading
- Switching models: Brief delay (~10-30s) to unload old model and load new one
- Subsequent requests: Fast inference using the loaded model
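The switching behavior above can be sketched as a manager that keeps one model resident at a time (names are hypothetical; the actual logic lives in `model_loader.py`'s `ModelManager`, which additionally handles quantization and CUDA memory cleanup):

```python
# Minimal sketch of single-resident model switching (hypothetical names;
# the real implementation also frees VRAM and re-quantizes on switch).
class ModelManager:
    def __init__(self, loader):
        self.loader = loader          # callable: model_type -> model object
        self.current_type = None
        self.current_model = None

    def get(self, model_type: str):
        # Reuse the resident model when the requested type matches.
        if model_type == self.current_type:
            return self.current_model
        # Otherwise drop the old model (frees VRAM in the real system)
        # and load the requested one.
        self.current_model = None
        self.current_model = self.loader(model_type)
        self.current_type = model_type
        return self.current_model


loads = []
manager = ModelManager(lambda t: loads.append(t) or f"<{t} model>")
manager.get("base")
manager.get("base")    # cached: no reload
manager.get("lora")    # switch: triggers a load
print(loads)           # -> ['base', 'lora']
```

This is why the first request for each model is slow and subsequent requests are fast: only a mismatch between the requested and resident model triggers a reload.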
This project demonstrates three different approaches to using fine-tuned models, allowing you to compare their behavior and performance:
| Mode | Model | Description |
|---|---|---|
| Base | unsloth/Llama-3.2-1B-Instruct | General-purpose instruction-following model |
| LoRA | Base + daffakautsar/bioinstruct-llama3.2-1b-lora | Base model with biomedical adapter applied at runtime |
| Merged | daffakautsar/bioinstruct-llama3.2-1b-merged | Pre-merged model with biomedical fine-tuning |
All fine-tuned models are available on Hugging Face and were trained using the process documented in the bioinstruct-finetuning-experiment repository.
Base Model (1B params)
│
├──── Direct use ────────────────► Base Mode
│ (General purpose)
│
├──── + LoRA Adapter ────────────► LoRA Mode
│ (Applied at runtime) (Domain-specific, modular)
│ (NOT merged)
│
└──── Pre-merged weights ────────► Merged Mode
(LoRA baked into weights) (Domain-specific, faster)
Why LoRA (Low-Rank Adaptation)?
- Enables domain-specific fine-tuning with minimal parameters
- Keeps base weights frozen, adds small trainable matrices
- Can swap adapters without retraining the base model
- Memory efficient: the adapter is ~100MB vs ~2.5GB for a fully fine-tuned model
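The parameter savings come from simple arithmetic: for a d×k weight matrix, LoRA at rank r trains two low-rank factors with r·(d+k) values instead of all d·k weights. An illustrative calculation (the hidden size and rank here are hypothetical, not read from the actual adapter config):

```python
# Rough LoRA parameter count for one weight matrix: instead of updating
# all d*k weights, LoRA trains two low-rank factors A (r x k) and B (d x r).
def lora_params(d: int, k: int, r: int) -> int:
    return r * (d + k)

d = k = 2048   # hidden size of a typical ~1B-parameter model (illustrative)
r = 16         # LoRA rank (hypothetical; depends on the training config)

full = d * k
lora = lora_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# For these numbers the adapter holds ~1.6% of the layer's parameters.
```

Applied across a model's attention and MLP layers, this same ratio is what shrinks the full ~2.5GB checkpoint down to a ~100MB adapter.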
Use Cases for Model Comparison:
- Base: General questions - serves as baseline to compare against fine-tuned versions
- LoRA: Biomedical questions - demonstrates the modular adapter approach with minimal storage
- Merged: Biomedical questions - shows the performance of pre-merged weights vs runtime adapter loading
This setup allows you to directly compare how fine-tuning impacts model responses for domain-specific tasks versus general queries.
Generate text from a prompt.
Request:
{
"prompt": "What is the function of mitochondria?",
"model_type": "lora",
"max_new_tokens": 256,
"temperature": 0.7
}

Response:
{
"response": "Mitochondria are often called the powerhouse of the cell...",
"model_type": "lora"
}

Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | required | Input text (1-4096 chars) |
| model_type | string | "base" | One of: "base", "lora", "merged" |
| max_new_tokens | int | 256 | Max tokens to generate (1-2048) |
| temperature | float | 0.7 | Sampling temperature (0.0-2.0) |
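As a sketch, the endpoint can be exercised with a small stdlib-only client; the URL and field names match the request schema above, and the host/port assume the default local setup:

```python
import json
import urllib.request

def build_request(prompt, model_type="base", max_new_tokens=256,
                  temperature=0.7, url="http://localhost:8000/generate"):
    """Build a POST request matching the /generate schema documented above."""
    payload = {
        "prompt": prompt,
        "model_type": model_type,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("What is the function of mitochondria?", model_type="lora")
# With the server running:
#   resp = json.load(urllib.request.urlopen(req))
#   print(resp["response"])
```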
Check server status.
Response:
{
"status": "healthy",
"current_model": "lora",
"gpu_available": true
}

┌─────────────────────────────────────────────────────────────┐
│ Client Browser │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Chat UI (HTML/CSS/JS) ││
│ │ ┌─────────┐ ┌──────────────┐ ┌─────────────────────┐││
│ │ │ Model │ │ Chat Input │ │ Message Display │││
│ │ │ Selector│ │ │ │ (User/Assistant) │││
│ │ └─────────┘ └──────────────┘ └─────────────────────┘││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
│
│ HTTP POST /generate
▼
┌─────────────────────────────────────────────────────────────┐
│ FastAPI Backend │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ main.py ││
│ │ • POST /generate endpoint ││
│ │ • Static file serving ││
│ │ • Request validation (Pydantic) ││
│ └─────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ model_loader.py ││
│ │ • ModelManager class ││
│ │ • Model loading/switching ││
│ │ • 8-bit quantization ││
│ │ • VRAM management ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ GPU (CUDA) │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ LLaMA-3.2-1B Model ││
│ │ (8-bit quantized) ││
│ │ ~1.3GB VRAM ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
Quantization enables running LLMs on consumer-grade GPUs by reducing memory requirements:
| Precision | Model Size | Total VRAM* | Quality | Suitable For |
|---|---|---|---|---|
| FP32 | ~5GB | ~5.5GB | Highest | Training only |
| FP16 | ~2.5GB | ~3.0GB | High | High-end GPUs (6GB+) |
| 8-bit | ~1.3GB | ~1.8GB | Good | Consumer GPUs (4GB+, default) |
| 4-bit | ~0.7GB | ~1.2GB | Acceptable | Extreme VRAM constraints (2GB+) |
*Total VRAM = Model + KV Cache (~0.3GB) + Generation overhead (~0.2GB)
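The model-size column follows directly from parameter count times bytes per weight. A back-of-envelope check, assuming ~1.24B parameters for LLaMA-3.2-1B (real quantized checkpoints carry extra scale/outlier metadata, so actual footprints run slightly higher):

```python
# Back-of-envelope model memory: parameter count x bytes per weight.
PARAMS = 1.24e9  # approximate parameter count of LLaMA-3.2-1B

bytes_per_weight = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for name, nbytes in bytes_per_weight.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{name}: ~{gb:.2f} GB")
```

These estimates line up with the table: ~5GB for FP32, ~2.5GB for FP16, and roughly half again for each halving of bit width.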
While FP16 (~3GB total) can run on 4GB+ GPUs, 8-bit (~1.8GB) is the better choice:
Advantages of 8-bit:
- More headroom: 2.2GB free vs 1GB free on a 4GB GPU
- Longer sequences: KV cache grows with conversation length
- Better stability: Safer margin for exactly 4GB GPUs
- Multi-tasking: Can run alongside other GPU applications
- Minimal quality loss: <1% degradation for inference tasks
- Sometimes faster: Less memory bandwidth required
When to use each:
- FP32/FP16: Training and fine-tuning only
- 8-bit: Production inference (best quality-to-VRAM ratio)
- 4-bit: Extreme VRAM constraints or running multiple models
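The "dynamic scaling" idea can be illustrated with a toy absmax-style round trip. This is a simplified sketch: bitsandbytes applies per-block scaling and keeps outlier channels in higher precision on top of this basic scheme.

```python
# Toy absmax int8 quantization: scale a float vector into [-127, 127],
# round, then dequantize. bitsandbytes does this per block and also
# keeps outlier channels in higher precision.
def quantize_int8(xs):
    scale = max(abs(x) for x in xs) / 127.0
    q = [round(x / scale) for x in xs]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.95]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

assert all(-127 <= v <= 127 for v in q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max error: {max_err:.4f}")
```

The round-trip error stays below half a quantization step, which is why the quality loss for inference is so small.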
# 8-bit quantization (bitsandbytes)
BitsAndBytesConfig(load_in_8bit=True)
# Compresses weights from FP16 (16 bits) to INT8 (8 bits)
# Uses dynamic scaling to preserve precision
# Computation still done in FP16 for accuracy

1. User sends prompt via UI
│
▼
2. JavaScript POSTs to /generate
{
"prompt": "What is DNA?",
"model_type": "lora",
"max_new_tokens": 256,
"temperature": 0.7
}
│
▼
3. FastAPI validates request (Pydantic)
│
▼
4. ModelManager checks if correct model is loaded
│
┌────┴────┐
│ Same? │
└────┬────┘
│
No ──┼── Yes
│ │
▼ │
5. Clear VRAM, │
Load new │
model │
│ │
◄────┘
│
▼
6. Tokenize prompt with chat template
│
▼
7. Generate tokens (autoregressive)
│
▼
8. Decode tokens to text
│
▼
9. Return response to client
│
▼
10. JavaScript displays in chat UI
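Steps 3-9 above can be condensed into a stub handler. The tokenizer, model, and manager here are stand-ins; the real endpoint uses Pydantic validation and a Transformers tokenizer/model with the chat template:

```python
# Condensed sketch of steps 3-9 with stand-in components; the real endpoint
# uses Pydantic request validation and Transformers tokenize/generate/decode.
def handle_generate(request: dict, manager) -> dict:
    # 3. Validate request fields (Pydantic does this in the real app).
    model_type = request.get("model_type", "base")
    if model_type not in {"base", "lora", "merged"}:
        raise ValueError(f"unknown model_type: {model_type}")

    # 4-5. Ensure the right model is resident (may trigger a switch).
    model = manager(model_type)

    # 6-8. Tokenize, generate, decode: collapsed into one stub call here.
    text = model(request["prompt"])

    # 9. Shape the response like the /generate schema above.
    return {"response": text, "model_type": model_type}

# Stand-in "manager" returning a stub model for any type.
stub_manager = lambda t: (lambda prompt: f"[{t}] answer to: {prompt}")
out = handle_generate({"prompt": "What is DNA?", "model_type": "lora"},
                      stub_manager)
print(out["response"])   # -> [lora] answer to: What is DNA?
```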
With 8-bit quantization, the typical VRAM usage breakdown is:
| Component | Approximate VRAM |
|---|---|
| Model (8-bit) | ~1.3GB |
| KV Cache | ~0.3GB |
| Generation overhead | ~0.2GB |
| Total | ~1.8GB |
This allows the system to run on consumer GPUs with 4GB+ VRAM.
- Use 8-bit quantization (default configuration)
- Enable 4-bit quantization if needed (set use_4bit=True in model_loader.py)
- Reduce max_new_tokens to limit memory used during generation
- Close other GPU applications to free up memory
llm-inference/
├── app/
│ ├── main.py # FastAPI application
│ ├── model_loader.py # Model management
│ └── static/
│ ├── index.html # Chat UI
│ ├── style.css # Styling
│ └── script.js # Client logic
├── Dockerfile # GPU container
├── requirements.txt # Python dependencies
└── README.md # This file
| Component | Technology |
|---|---|
| LLM | LLaMA-3.2-1B-Instruct |
| Inference | Hugging Face Transformers |
| LoRA | PEFT library |
| Quantization | bitsandbytes |
| Backend | FastAPI + Uvicorn |
| Frontend | Vanilla HTML/CSS/JavaScript |
| Container | Docker + NVIDIA Container Toolkit |
This project is for educational and demonstration purposes.
- Base model: LLaMA-3.2 (Llama 3.2 Community License)
- Dataset used for fine-tuning: BioInstruct
- Fine-tuned models: daffakautsar on Hugging Face
- Fine-tuning Process: bioinstruct-finetuning-experiment - Complete workflow for creating the biomedical fine-tuned models
- Fine-tuned Models: daffakautsar on Hugging Face - Pre-trained models ready for inference
- Hugging Face Transformers - Model loading and inference
- PEFT: Parameter-Efficient Fine-Tuning - LoRA adapter implementation
- bitsandbytes - Quantization library
- FastAPI - API framework
- NVIDIA Container Toolkit - GPU support in Docker