ZeeetOne/llm-inference-deployment


LLM Inference

A practical example of deploying fine-tuned LLMs locally, demonstrating how to serve custom fine-tuned models from Hugging Face and compare three model configurations (base, LoRA adapter, and merged weights).

Overview

This project showcases how to build a local LLM inference system using fine-tuned models from Hugging Face:

  • Models: Fine-tuned LLaMA-3.2-1B models from Hugging Face Hub
  • Fine-tuning Process: See the complete fine-tuning workflow at bioinstruct-finetuning-experiment
  • Framework: Hugging Face Transformers + PyTorch
  • API: FastAPI with Swagger documentation
  • UI: Vanilla HTML/CSS/JavaScript chat interface
  • Deployment: Docker with GPU support

Purpose

This repository serves as a practical example of:

  1. Deploying Fine-tuned Models: How to load and serve custom fine-tuned models from Hugging Face Hub
  2. Model Comparison: Comparing three approaches to fine-tuning (base model, LoRA adapter, and merged weights)
  3. Real-world Application: Applying biomedical fine-tuning to create a domain-specific assistant
  4. Performance Differences: Understanding the trade-offs between different model configurations

The fine-tuned models were created using the BioInstruct dataset to specialize LLaMA-3.2-1B for biomedical questions. This project demonstrates how to integrate these models into a production-ready inference system.

Quick Start

Prerequisites

  1. NVIDIA GPU with CUDA support (4GB+ VRAM recommended)
  2. Docker with NVIDIA Container Toolkit

Running with Docker

# Build the image
docker build -t llm-inference-demo .

# Run with GPU support
docker run --gpus all -p 8000:8000 llm-inference-demo

Running Locally (without Docker)

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install PyTorch with CUDA 12.1 support
# For other CUDA versions, check: https://pytorch.org/get-started/locally/
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install -r requirements.txt

# Run the server
cd app
python -m uvicorn main:app --host 0.0.0.0 --port 8000

Note: Models will be automatically downloaded from Hugging Face on first request. First startup may take a few minutes depending on your internet connection.

Access

Once the server is running, open http://localhost:8000 for the chat UI and http://localhost:8000/docs for the interactive Swagger API documentation.

First-Time Setup

Model Downloads

On first request (or when switching models), the application will automatically download models from Hugging Face:

| Model | Repository | Size | Purpose |
|---|---|---|---|
| Base | unsloth/Llama-3.2-1B-Instruct | ~2.5GB | General-purpose baseline |
| LoRA Adapter | daffakautsar/bioinstruct-llama3.2-1b-lora | ~100MB | Biomedical fine-tuning adapter |
| Merged | daffakautsar/bioinstruct-llama3.2-1b-merged | ~2.5GB | Pre-merged biomedical model |

Models total size: ~5GB

Docker Image Size

The Docker image is approximately 17-20GB, which includes:

| Component | Approximate Size |
|---|---|
| NVIDIA CUDA base image | ~5-6GB |
| PyTorch + ML dependencies | ~3-4GB |
| LLM models (3 models) | ~5GB |
| System packages | ~1-2GB |
| Total Docker image | ~17-20GB |

Download Behavior

  • Docker:

    • Models are pre-downloaded during image build
    • Total image size: ~17-20GB
    • No additional downloads needed after build
    • Instant startup (models already cached in image)
  • Local:

    • Models download on-demand when first requested
    • Download size: ~5GB (models only)
    • First request takes 2-5 minutes depending on internet speed
    • Models cached in ~/.cache/huggingface/hub/
    • Subsequent requests use cached models (instant loading)
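
The cache location above follows the Hugging Face Hub layout, where each repository is stored under a `models--{org}--{name}` directory. A minimal sketch for checking whether a model is already cached (the helper names here are ours, not part of this repo):

```python
import os

# Hugging Face Hub caches each repo under ~/.cache/huggingface/hub/
# in a directory named models--{org}--{name}.
HF_CACHE = os.path.expanduser("~/.cache/huggingface/hub")

def local_model_path(repo_id: str, cache_dir: str = HF_CACHE) -> str:
    """Return the cache directory the Hub would use for repo_id."""
    return os.path.join(cache_dir, "models--" + repo_id.replace("/", "--"))

def is_cached(repo_id: str, cache_dir: str = HF_CACHE) -> bool:
    """True if the model has already been downloaded."""
    return os.path.isdir(local_model_path(repo_id, cache_dir))

path = local_model_path("unsloth/Llama-3.2-1B-Instruct")
```

Deleting one of these directories forces a fresh download on the next request.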

What to Expect

  1. First request: Longer response time due to model download and loading
  2. Switching models: Brief delay (~10-30s) to unload old model and load new one
  3. Subsequent requests: Fast inference using the loaded model

Model Comparison

This project demonstrates three different approaches to using fine-tuned models, allowing you to compare their behavior and performance:

| Mode | Model | Description |
|---|---|---|
| Base | unsloth/Llama-3.2-1B-Instruct | General-purpose instruction-following model |
| LoRA | Base + daffakautsar/bioinstruct-llama3.2-1b-lora | Base model with biomedical adapter applied at runtime |
| Merged | daffakautsar/bioinstruct-llama3.2-1b-merged | Pre-merged model with biomedical fine-tuning |

All fine-tuned models are available on Hugging Face and were trained using the process documented in the bioinstruct-finetuning-experiment repository.

Base vs LoRA vs Merged

Base Model (1B params)
       │
       ├──── Direct use ────────────────► Base Mode
       │                                  (General purpose)
       │
       ├──── + LoRA Adapter ────────────► LoRA Mode
       │     (Applied at runtime)         (Domain-specific, modular)
       │     (NOT merged)
       │
       └──── Pre-merged weights ────────► Merged Mode
             (LoRA baked into weights)    (Domain-specific, faster)
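
The three modes differ only in which repositories get loaded. The repo's actual `model_loader.py` is not shown here; a sketch of the mode-to-repo mapping, using the repository IDs from the tables above (the selection function is illustrative):

```python
# Repo IDs from the model tables above; the selection logic is a sketch,
# not the repo's actual model_loader.py.
BASE_REPO = "unsloth/Llama-3.2-1B-Instruct"

MODEL_CONFIGS = {
    # mode -> (checkpoint to load, LoRA adapter to apply, if any)
    "base":   {"checkpoint": BASE_REPO, "adapter": None},
    "lora":   {"checkpoint": BASE_REPO,
               "adapter": "daffakautsar/bioinstruct-llama3.2-1b-lora"},
    "merged": {"checkpoint": "daffakautsar/bioinstruct-llama3.2-1b-merged",
               "adapter": None},
}

def resolve(mode: str) -> dict:
    """Map an API model_type to the repos that must be loaded."""
    if mode not in MODEL_CONFIGS:
        raise ValueError(f"unknown model_type: {mode}")
    return MODEL_CONFIGS[mode]
```

In LoRA mode the adapter would be applied on top of the base checkpoint at load time (e.g. via PEFT's `PeftModel.from_pretrained`), while merged mode loads a single checkpoint directly.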

Why LoRA (Low-Rank Adaptation)?

  • Enables domain-specific fine-tuning with minimal parameters
  • Keeps base weights frozen, adds small trainable matrices
  • Can swap adapters without retraining the base model
  • Memory efficient: the adapter is ~100MB vs ~2.5GB for the full model weights
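
The parameter savings follow directly from the low-rank factorization: a d_out×d_in weight update is replaced by the product of B (d_out×r) and A (r×d_in), so each adapted matrix trains only r·(d_in + d_out) parameters. Illustrative arithmetic (hidden size 2048 matches LLaMA-3.2-1B; the rank is an assumption):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable params for one adapted weight: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Full update for one 2048x2048 projection:
full = 2048 * 2048                         # 4,194,304 params
# LoRA update for the same projection at rank 16 (rank is illustrative):
lora = lora_param_count(2048, 2048, 16)    # 65,536 params

ratio = full // lora                       # 64x fewer trainable params
```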

Use Cases for Model Comparison:

  • Base: General questions - serves as baseline to compare against fine-tuned versions
  • LoRA: Biomedical questions - demonstrates the modular adapter approach with minimal storage
  • Merged: Biomedical questions - shows the performance of pre-merged weights vs runtime adapter loading

This setup allows you to directly compare how fine-tuning impacts model responses for domain-specific tasks versus general queries.

API Reference

POST /generate

Generate text from a prompt.

Request:

{
  "prompt": "What is the function of mitochondria?",
  "model_type": "lora",
  "max_new_tokens": 256,
  "temperature": 0.7
}

Response:

{
  "response": "Mitochondria are often called the powerhouse of the cell...",
  "model_type": "lora"
}

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | required | Input text (1-4096 chars) |
| model_type | string | "base" | One of: "base", "lora", "merged" |
| max_new_tokens | int | 256 | Max tokens to generate (1-2048) |
| temperature | float | 0.7 | Sampling temperature (0.0-2.0) |
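
The bounds above can also be checked client-side before posting. A plain-Python sketch of the same constraints the server enforces with Pydantic (the function name is ours, not part of the repo's API):

```python
VALID_MODEL_TYPES = {"base", "lora", "merged"}

def validate_generate_request(payload: dict) -> dict:
    """Apply the /generate parameter bounds documented above."""
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not 1 <= len(prompt) <= 4096:
        raise ValueError("prompt must be a string of 1-4096 characters")

    model_type = payload.get("model_type", "base")
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(f"model_type must be one of {sorted(VALID_MODEL_TYPES)}")

    max_new_tokens = payload.get("max_new_tokens", 256)
    if not 1 <= max_new_tokens <= 2048:
        raise ValueError("max_new_tokens must be in [1, 2048]")

    temperature = payload.get("temperature", 0.7)
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in [0.0, 2.0]")

    return {"prompt": prompt, "model_type": model_type,
            "max_new_tokens": max_new_tokens, "temperature": temperature}
```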

GET /health

Check server status.

Response:

{
  "status": "healthy",
  "current_model": "lora",
  "gpu_available": true
}

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Client Browser                        │
│  ┌─────────────────────────────────────────────────────────┐│
│  │              Chat UI (HTML/CSS/JS)                      ││
│  │  ┌─────────┐  ┌──────────────┐  ┌─────────────────────┐││
│  │  │ Model   │  │ Chat Input   │  │ Message Display     │││
│  │  │ Selector│  │              │  │ (User/Assistant)    │││
│  │  └─────────┘  └──────────────┘  └─────────────────────┘││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
                              │
                              │ HTTP POST /generate
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    FastAPI Backend                           │
│  ┌─────────────────────────────────────────────────────────┐│
│  │                    main.py                              ││
│  │  • POST /generate endpoint                              ││
│  │  • Static file serving                                  ││
│  │  • Request validation (Pydantic)                        ││
│  └─────────────────────────────────────────────────────────┘│
│                              │                               │
│                              ▼                               │
│  ┌─────────────────────────────────────────────────────────┐│
│  │                  model_loader.py                        ││
│  │  • ModelManager class                                   ││
│  │  • Model loading/switching                              ││
│  │  • 8-bit quantization                                   ││
│  │  • VRAM management                                      ││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                       GPU (CUDA)                             │
│  ┌─────────────────────────────────────────────────────────┐│
│  │              LLaMA-3.2-1B Model                         ││
│  │              (8-bit quantized)                          ││
│  │                  ~1.3GB VRAM                            ││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
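
The ModelManager in model_loader.py keeps at most one model resident in VRAM at a time. Its switching behavior can be sketched with stub loaders (the real class loads Transformers/PEFT checkpoints and frees CUDA memory; the names and structure below are illustrative, not the repo's actual code):

```python
class ModelManager:
    """Keeps one model loaded; switching unloads the old one first.

    Sketch only: the loaders here are placeholders for the real
    Transformers/PEFT loading and VRAM cleanup in model_loader.py.
    """
    def __init__(self, loaders):
        self.loaders = loaders          # model_type -> callable returning a model
        self.current_type = None
        self.current_model = None

    def get_model(self, model_type: str):
        if model_type == self.current_type:
            return self.current_model   # already loaded: no reload
        # Drop the old model first so two models never occupy VRAM at once
        self.current_model = None       # real code: del model; torch.cuda.empty_cache()
        self.current_model = self.loaders[model_type]()
        self.current_type = model_type
        return self.current_model

mgr = ModelManager({"base": lambda: "base-model", "lora": lambda: "lora-model"})
mgr.get_model("base")   # loads base
mgr.get_model("base")   # cached, no reload
mgr.get_model("lora")   # unloads base, loads lora
```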

Quantization

Why Quantization is Necessary

Quantization enables running LLMs on consumer-grade GPUs by reducing memory requirements:

| Precision | Model Size | Total VRAM* | Quality | Suitable For |
|---|---|---|---|---|
| FP32 | ~5GB | ~5.5GB | Highest | Training only |
| FP16 | ~2.5GB | ~3.0GB | High | High-end GPUs (6GB+) |
| 8-bit | ~1.3GB | ~1.8GB | Good | Consumer GPUs (4GB+, default) |
| 4-bit | ~0.7GB | ~1.2GB | Acceptable | Extreme VRAM constraints (2GB+) |

*Total VRAM = Model + KV Cache (~0.3GB) + Generation overhead (~0.2GB)
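
These sizes follow from parameter count times bytes per weight. LLaMA-3.2-1B has roughly 1.24B parameters (approximate), so as a quick sanity check:

```python
# Approximate model size = parameter count x bytes per weight.
PARAMS = 1.24e9   # ~1.24B parameters for LLaMA-3.2-1B (approximate)
GB = 1e9

def model_size_gb(bytes_per_weight: float) -> float:
    return PARAMS * bytes_per_weight / GB

sizes = {
    "FP32":  model_size_gb(4),    # ~5.0 GB
    "FP16":  model_size_gb(2),    # ~2.5 GB
    "8-bit": model_size_gb(1),    # ~1.2 GB
    "4-bit": model_size_gb(0.5),  # ~0.6 GB
}

# Total VRAM adds the KV cache (~0.3GB) and generation overhead (~0.2GB):
total_8bit = sizes["8-bit"] + 0.3 + 0.2   # ~1.7-1.8 GB
```

The table's slightly larger 8-bit and 4-bit figures are plausible because some tensors typically stay in higher precision under quantization, but treat the exact breakdown as an estimate.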

Why 8-bit Instead of FP16?

While FP16 (~3GB total) can run on 4GB+ GPUs, 8-bit (~1.8GB) is the better choice:

Advantages of 8-bit:

  • More headroom: 2.2GB free vs 1GB free on a 4GB GPU
  • Longer sequences: KV cache grows with conversation length
  • Better stability: Safer margin for exactly 4GB GPUs
  • Multi-tasking: Can run alongside other GPU applications
  • Minimal quality loss: <1% degradation for inference tasks
  • Sometimes faster: Less memory bandwidth required

When to use each:

  • FP32/FP16: Training and fine-tuning only
  • 8-bit: Production inference (best quality-to-VRAM ratio)
  • 4-bit: Extreme VRAM constraints or running multiple models

How Quantization Works

# 8-bit quantization (bitsandbytes, configured via transformers)
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)

# Compresses weights from FP16 (16 bits) to INT8 (8 bits)
# Uses dynamic scaling to preserve precision
# Computation is still done in FP16 for accuracy

Inference Flow

1. User sends prompt via UI
        │
        ▼
2. JavaScript POSTs to /generate
   {
     "prompt": "What is DNA?",
     "model_type": "lora",
     "max_new_tokens": 256,
     "temperature": 0.7
   }
        │
        ▼
3. FastAPI validates request (Pydantic)
        │
        ▼
4. ModelManager checks if the correct model is loaded
        │
        ├── Already loaded ──► skip to step 6
        │
        ▼
5. Not loaded: clear VRAM and
   load the requested model
        │
        ▼
6. Tokenize prompt with chat template
        │
        ▼
7. Generate tokens (autoregressive)
        │
        ▼
8. Decode tokens to text
        │
        ▼
9. Return response to client
        │
        ▼
10. JavaScript displays in chat UI

GPU and VRAM Considerations

Memory Budget

With 8-bit quantization, the typical VRAM usage breakdown is:

| Component | Approximate VRAM |
|---|---|
| Model (8-bit) | ~1.3GB |
| KV Cache | ~0.3GB |
| Generation overhead | ~0.2GB |
| Total | ~1.8GB |

This allows the system to run on consumer GPUs with 4GB+ VRAM.

Tips for Optimizing VRAM Usage

  1. Use 8-bit quantization (default configuration)
  2. Enable 4-bit quantization if needed (set use_4bit=True in model_loader.py)
  3. Reduce max_new_tokens to cap the memory used during generation
  4. Close other GPU applications to free up memory

Project Structure

llm-inference/
├── app/
│   ├── main.py              # FastAPI application
│   ├── model_loader.py      # Model management
│   └── static/
│       ├── index.html       # Chat UI
│       ├── style.css        # Styling
│       └── script.js        # Client logic
├── Dockerfile               # GPU container
├── requirements.txt         # Python dependencies
└── README.md                # This file

Technologies Used

| Component | Technology |
|---|---|
| LLM | LLaMA-3.2-1B-Instruct |
| Inference | Hugging Face Transformers |
| LoRA | PEFT library |
| Quantization | bitsandbytes |
| Backend | FastAPI + Uvicorn |
| Frontend | Vanilla HTML/CSS/JavaScript |
| Container | Docker + NVIDIA Container Toolkit |

License

This project is for educational and demonstration purposes.

Related Projects

References
