ZeeetOne/llm-inference-deployment


LLM Inference

A practical example of deploying fine-tuned LLMs locally, demonstrating how to serve custom fine-tuned models from Hugging Face and compare three model configurations (base, LoRA adapter, and merged weights).

Overview

This project showcases how to build a local LLM inference system using fine-tuned models from Hugging Face:

  • Models: Fine-tuned LLaMA-3.2-1B models from Hugging Face Hub
  • Fine-tuning Process: See the complete fine-tuning workflow at bioinstruct-finetuning-experiment
  • Framework: Hugging Face Transformers + PyTorch
  • API: FastAPI with Swagger documentation
  • UI: Vanilla HTML/CSS/JavaScript chat interface
  • Deployment: Docker with GPU support

Purpose

This repository serves as a practical example of:

  1. Deploying Fine-tuned Models: How to load and serve custom fine-tuned models from Hugging Face Hub
  2. Model Comparison: Comparing three approaches to fine-tuning (base model, LoRA adapter, and merged weights)
  3. Real-world Application: Applying biomedical fine-tuning to create a domain-specific assistant
  4. Performance Differences: Understanding the trade-offs between different model configurations

The fine-tuned models were created using the BioInstruct dataset to specialize LLaMA-3.2-1B for biomedical questions. This project demonstrates how to integrate these models into a production-ready inference system.

Quick Start

Prerequisites

  1. NVIDIA GPU with CUDA support (4GB+ VRAM recommended)
  2. Docker with NVIDIA Container Toolkit

Running with Docker

# Build the image
docker build -t llm-inference-demo .

# Run with GPU support
docker run --gpus all -p 8000:8000 llm-inference-demo

Running Locally (without Docker)

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install PyTorch with CUDA 12.1 support
# For other CUDA versions, check: https://pytorch.org/get-started/locally/
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install -r requirements.txt

# Run the server
cd app
python -m uvicorn main:app --host 0.0.0.0 --port 8000

Note: Models will be automatically downloaded from Hugging Face on first request. First startup may take a few minutes depending on your internet connection.

Access

Once the server is running, open http://localhost:8000 for the chat UI and http://localhost:8000/docs for the interactive Swagger API documentation.

First-Time Setup

Model Downloads

On first request (or when switching models), the application will automatically download models from Hugging Face:

| Model | Repository | Size | Purpose |
|---|---|---|---|
| Base | unsloth/Llama-3.2-1B-Instruct | ~2.5GB | General-purpose baseline |
| LoRA Adapter | daffakautsar/bioinstruct-llama3.2-1b-lora | ~100MB | Biomedical fine-tuning adapter |
| Merged | daffakautsar/bioinstruct-llama3.2-1b-merged | ~2.5GB | Pre-merged biomedical model |

Models total size: ~5GB

Docker Image Size

The Docker image is approximately 17-20GB, which includes:

| Component | Approximate Size |
|---|---|
| NVIDIA CUDA base image | ~5-6GB |
| PyTorch + ML dependencies | ~3-4GB |
| LLM models (3 models) | ~5GB |
| System packages | ~1-2GB |
| Total Docker image | ~17-20GB |

Download Behavior

  • Docker:

    • Models are pre-downloaded during image build
    • Total image size: ~17-20GB
    • No additional downloads needed after build
    • Instant startup (models already cached in image)
  • Local:

    • Models download on-demand when first requested
    • Download size: ~5GB (models only)
    • First request takes 2-5 minutes depending on internet speed
    • Models cached in ~/.cache/huggingface/hub/
    • Subsequent requests use cached models (instant loading)
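
The cache location above follows the Hugging Face Hub layout, where each repository is stored under a `models--{org}--{name}` directory. A minimal sketch for checking whether a model is already cached (the helper names here are ours, not part of this repo):

```python
import os

# Hugging Face Hub caches each repo under ~/.cache/huggingface/hub/
# in a directory named models--{org}--{name}.
HF_CACHE = os.path.expanduser("~/.cache/huggingface/hub")

def local_model_path(repo_id: str, cache_dir: str = HF_CACHE) -> str:
    """Return the cache directory the Hub would use for repo_id."""
    return os.path.join(cache_dir, "models--" + repo_id.replace("/", "--"))

def is_cached(repo_id: str, cache_dir: str = HF_CACHE) -> bool:
    """True if the model has already been downloaded."""
    return os.path.isdir(local_model_path(repo_id, cache_dir))

path = local_model_path("unsloth/Llama-3.2-1B-Instruct")
```

Deleting one of these directories forces a fresh download on the next request.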

What to Expect

  1. First request: Longer response time due to model download and loading
  2. Switching models: Brief delay (~10-30s) to unload old model and load new one
  3. Subsequent requests: Fast inference using the loaded model

Model Comparison

This project demonstrates three different approaches to using fine-tuned models, allowing you to compare their behavior and performance:

| Mode | Model | Description |
|---|---|---|
| Base | unsloth/Llama-3.2-1B-Instruct | General-purpose instruction-following model |
| LoRA | Base + daffakautsar/bioinstruct-llama3.2-1b-lora | Base model with biomedical adapter applied at runtime |
| Merged | daffakautsar/bioinstruct-llama3.2-1b-merged | Pre-merged model with biomedical fine-tuning |

All fine-tuned models are available on Hugging Face and were trained using the process documented in the bioinstruct-finetuning-experiment repository.

Base vs LoRA vs Merged

Base Model (1B params)
       │
       ├──── Direct use ────────────────► Base Mode
       │                                  (General purpose)
       │
       ├──── + LoRA Adapter ────────────► LoRA Mode
       │     (Applied at runtime)         (Domain-specific, modular)
       │     (NOT merged)
       │
       └──── Pre-merged weights ────────► Merged Mode
             (LoRA baked into weights)    (Domain-specific, faster)
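
The three modes differ only in which repositories get loaded. The repo's actual `model_loader.py` is not shown here; a sketch of the mode-to-repo mapping, using the repository IDs from the tables above (the selection function is illustrative):

```python
# Repo IDs from the model tables above; the selection logic is a sketch,
# not the repo's actual model_loader.py.
BASE_REPO = "unsloth/Llama-3.2-1B-Instruct"

MODEL_CONFIGS = {
    # mode -> (checkpoint to load, LoRA adapter to apply, if any)
    "base":   {"checkpoint": BASE_REPO, "adapter": None},
    "lora":   {"checkpoint": BASE_REPO,
               "adapter": "daffakautsar/bioinstruct-llama3.2-1b-lora"},
    "merged": {"checkpoint": "daffakautsar/bioinstruct-llama3.2-1b-merged",
               "adapter": None},
}

def resolve(mode: str) -> dict:
    """Map an API model_type to the repos that must be loaded."""
    if mode not in MODEL_CONFIGS:
        raise ValueError(f"unknown model_type: {mode}")
    return MODEL_CONFIGS[mode]
```

In LoRA mode the adapter would be applied on top of the base checkpoint at load time (e.g. via PEFT's `PeftModel.from_pretrained`), while merged mode loads a single checkpoint directly.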

Why LoRA (Low-Rank Adaptation)?

  • Enables domain-specific fine-tuning with minimal parameters
  • Keeps base weights frozen, adds small trainable matrices
  • Can swap adapters without retraining the base model
  • Memory efficient: the adapter is ~100MB vs ~2.5GB for the full model weights
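
The parameter savings follow directly from the low-rank factorization: a d_out×d_in weight update is replaced by the product of B (d_out×r) and A (r×d_in), so each adapted matrix trains only r·(d_in + d_out) parameters. Illustrative arithmetic (hidden size 2048 matches LLaMA-3.2-1B; the rank is an assumption):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable params for one adapted weight: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Full update for one 2048x2048 projection:
full = 2048 * 2048                         # 4,194,304 params
# LoRA update for the same projection at rank 16 (rank is illustrative):
lora = lora_param_count(2048, 2048, 16)    # 65,536 params

ratio = full // lora                       # 64x fewer trainable params
```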

Use Cases for Model Comparison:

  • Base: General questions - serves as baseline to compare against fine-tuned versions
  • LoRA: Biomedical questions - demonstrates the modular adapter approach with minimal storage
  • Merged: Biomedical questions - shows the performance of pre-merged weights vs runtime adapter loading

This setup allows you to directly compare how fine-tuning impacts model responses for domain-specific tasks versus general queries.

API Reference

POST /generate

Generate text from a prompt.

Request:

{
  "prompt": "What is the function of mitochondria?",
  "model_type": "lora",
  "max_new_tokens": 256,
  "temperature": 0.7
}

Response:

{
  "response": "Mitochondria are often called the powerhouse of the cell...",
  "model_type": "lora"
}

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | required | Input text (1-4096 chars) |
| model_type | string | "base" | One of: "base", "lora", "merged" |
| max_new_tokens | int | 256 | Max tokens to generate (1-2048) |
| temperature | float | 0.7 | Sampling temperature (0.0-2.0) |
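
The bounds above can also be checked client-side before posting. A plain-Python sketch of the same constraints the server enforces with Pydantic (the function name is ours, not part of the repo's API):

```python
VALID_MODEL_TYPES = {"base", "lora", "merged"}

def validate_generate_request(payload: dict) -> dict:
    """Apply the /generate parameter bounds documented above."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not 1 <= len(prompt) <= 4096:
        raise ValueError("prompt must be a string of 1-4096 characters")

    model_type = payload.get("model_type", "base")
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(f"model_type must be one of {sorted(VALID_MODEL_TYPES)}")

    max_new_tokens = payload.get("max_new_tokens", 256)
    if not 1 <= max_new_tokens <= 2048:
        raise ValueError("max_new_tokens must be in [1, 2048]")

    temperature = payload.get("temperature", 0.7)
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in [0.0, 2.0]")

    return {"prompt": prompt, "model_type": model_type,
            "max_new_tokens": max_new_tokens, "temperature": temperature}
```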

GET /health

Check server status.

Response:

{
  "status": "healthy",
  "current_model": "lora",
  "gpu_available": true
}

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Client Browser                        │
│  ┌─────────────────────────────────────────────────────────┐│
│  │              Chat UI (HTML/CSS/JS)                      ││
│  │  ┌─────────┐  ┌──────────────┐  ┌─────────────────────┐││
│  │  │ Model   │  │ Chat Input   │  │ Message Display     │││
│  │  │ Selector│  │              │  │ (User/Assistant)    │││
│  │  └─────────┘  └──────────────┘  └─────────────────────┘││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
                              │
                              │ HTTP POST /generate
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    FastAPI Backend                           │
│  ┌─────────────────────────────────────────────────────────┐│
│  │                    main.py                              ││
│  │  • POST /generate endpoint                              ││
│  │  • Static file serving                                  ││
│  │  • Request validation (Pydantic)                        ││
│  └─────────────────────────────────────────────────────────┘│
│                              │                               │
│                              ▼                               │
│  ┌─────────────────────────────────────────────────────────┐│
│  │                  model_loader.py                        ││
│  │  • ModelManager class                                   ││
│  │  • Model loading/switching                              ││
│  │  • 8-bit quantization                                   ││
│  │  • VRAM management                                      ││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                       GPU (CUDA)                             │
│  ┌─────────────────────────────────────────────────────────┐│
│  │              LLaMA-3.2-1B Model                         ││
│  │              (8-bit quantized)                          ││
│  │                  ~1.3GB VRAM                            ││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
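
The ModelManager in model_loader.py keeps at most one model resident in VRAM at a time. Its switching behavior can be sketched with stub loaders (the real class loads Transformers/PEFT checkpoints and frees CUDA memory; the names and structure below are illustrative, not the repo's actual code):

```python
class ModelManager:
    """Keeps one model loaded; switching unloads the old one first.

    Sketch only: the loaders here are placeholders for the real
    Transformers/PEFT loading and VRAM cleanup in model_loader.py.
    """
    def __init__(self, loaders):
        self.loaders = loaders          # model_type -> callable returning a model
        self.current_type = None
        self.current_model = None

    def get_model(self, model_type: str):
        if model_type == self.current_type:
            return self.current_model   # already loaded: no reload
        # Drop the old model first so two models never occupy VRAM at once
        self.current_model = None       # real code: del model; torch.cuda.empty_cache()
        self.current_model = self.loaders[model_type]()
        self.current_type = model_type
        return self.current_model

mgr = ModelManager({"base": lambda: "base-model", "lora": lambda: "lora-model"})
mgr.get_model("base")   # loads base
mgr.get_model("base")   # cached, no reload
mgr.get_model("lora")   # unloads base, loads lora
```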

Quantization

Why Quantization is Necessary

Quantization enables running LLMs on consumer-grade GPUs by reducing memory requirements:

| Precision | Model Size | Total VRAM* | Quality | Suitable For |
|---|---|---|---|---|
| FP32 | ~5GB | ~5.5GB | Highest | Training only |
| FP16 | ~2.5GB | ~3.0GB | High | High-end GPUs (6GB+) |
| 8-bit | ~1.3GB | ~1.8GB | Good | Consumer GPUs (4GB+, default) |
| 4-bit | ~0.7GB | ~1.2GB | Acceptable | Extreme VRAM constraints (2GB+) |

*Total VRAM = Model + KV Cache (~0.3GB) + Generation overhead (~0.2GB)
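
These sizes follow from parameter count times bytes per weight. LLaMA-3.2-1B has roughly 1.24B parameters (approximate), so as a quick sanity check:

```python
# Approximate model size = parameter count x bytes per weight.
PARAMS = 1.24e9   # ~1.24B parameters for LLaMA-3.2-1B (approximate)
GB = 1e9

def model_size_gb(bytes_per_weight: float) -> float:
    return PARAMS * bytes_per_weight / GB

sizes = {
    "FP32":  model_size_gb(4),    # ~5.0 GB
    "FP16":  model_size_gb(2),    # ~2.5 GB
    "8-bit": model_size_gb(1),    # ~1.2 GB
    "4-bit": model_size_gb(0.5),  # ~0.6 GB
}

# Total VRAM adds the KV cache (~0.3GB) and generation overhead (~0.2GB):
total_8bit = sizes["8-bit"] + 0.3 + 0.2   # ~1.7-1.8 GB
```

The table's slightly larger 8-bit and 4-bit figures are plausible because some tensors typically stay in higher precision under quantization, but treat the exact breakdown as an estimate.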

Why 8-bit Instead of FP16?

While FP16 (~3GB total) can run on 4GB+ GPUs, 8-bit (~1.8GB) is the better choice:

Advantages of 8-bit:

  • More headroom: 2.2GB free vs 1GB free on a 4GB GPU
  • Longer sequences: KV cache grows with conversation length
  • Better stability: Safer margin for exactly 4GB GPUs
  • Multi-tasking: Can run alongside other GPU applications
  • Minimal quality loss: <1% degradation for inference tasks
  • Sometimes faster: Less memory bandwidth required

When to use each:

  • FP32/FP16: Training and fine-tuning only
  • 8-bit: Production inference (best quality-to-VRAM ratio)
  • 4-bit: Extreme VRAM constraints or running multiple models

How Quantization Works

# 8-bit quantization (bitsandbytes, configured via transformers)
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)

# Compresses weights from FP16 (16 bits) to INT8 (8 bits)
# Uses dynamic scaling to preserve precision
# Computation is still done in FP16 for accuracy

Inference Flow

1. User sends prompt via UI
        │
        ▼
2. JavaScript POSTs to /generate
   {
     "prompt": "What is DNA?",
     "model_type": "lora",
     "max_new_tokens": 256,
     "temperature": 0.7
   }
        │
        ▼
3. FastAPI validates request (Pydantic)
        │
        ▼
4. ModelManager checks if the correct model is loaded
        │
        ├── Already loaded ──► skip to step 6
        │
        ▼
5. Not loaded: clear VRAM and
   load the requested model
        │
        ▼
6. Tokenize prompt with chat template
        │
        ▼
7. Generate tokens (autoregressive)
        │
        ▼
8. Decode tokens to text
        │
        ▼
9. Return response to client
        │
        ▼
10. JavaScript displays in chat UI

GPU and VRAM Considerations

Memory Budget

With 8-bit quantization, the typical VRAM usage breakdown is:

| Component | Approximate VRAM |
|---|---|
| Model (8-bit) | ~1.3GB |
| KV Cache | ~0.3GB |
| Generation overhead | ~0.2GB |
| Total | ~1.8GB |

This allows the system to run on consumer GPUs with 4GB+ VRAM.

Tips for Optimizing VRAM Usage

  1. Use 8-bit quantization (default configuration)
  2. Enable 4-bit quantization if needed (set use_4bit=True in model_loader.py)
  3. Reduce max_new_tokens to cap the memory used during generation
  4. Close other GPU applications to free up memory

Project Structure

llm-inference/
├── app/
│   ├── main.py              # FastAPI application
│   ├── model_loader.py      # Model management
│   └── static/
│       ├── index.html       # Chat UI
│       ├── style.css        # Styling
│       └── script.js        # Client logic
├── Dockerfile               # GPU container
├── requirements.txt         # Python dependencies
└── README.md                # This file

Technologies Used

| Component | Technology |
|---|---|
| LLM | LLaMA-3.2-1B-Instruct |
| Inference | Hugging Face Transformers |
| LoRA | PEFT library |
| Quantization | bitsandbytes |
| Backend | FastAPI + Uvicorn |
| Frontend | Vanilla HTML/CSS/JavaScript |
| Container | Docker + NVIDIA Container Toolkit |

License

This project is for educational and demonstration purposes.

Related Projects

References
