
Medical Information Extraction - Backend Server

Production backend serving a fine-tuned Llama 3.1 8B model for cancer-specific information extraction (IE) via vLLM on AWS EC2



🎯 Overview

This repository contains the backend infrastructure for a medical information extraction system that uses a fine-tuned Llama 3.1 8B model (QLoRA, 4-bit quantization) to extract structured cancer-related entities from clinical text.

The system performs cancer-specific named entity recognition (NER) and information extraction (IE) on unstructured clinical notes, returning structured data across 7 medical fields: cancer type, stage, gene mutations, biomarkers, treatments, treatment responses, and metastasis sites.

🌐 Live Application: https://medical-extraction.vercel.app

📱 Frontend Repository: slm-ft-serving-frontend


🏗️ Architecture

flowchart TB
    subgraph Browser["🌐 User Browser"]
        UI["medical-extraction.vercel.app"]
    end

    subgraph Vercel["☁️ Vercel"]
        Frontend["Next.js Frontend<br/>• React UI components<br/>• Server-side API routes<br/>• Input validation"]
    end

    subgraph EC2["🖥️ EC2 g6.2xlarge (us-east-1)"]
        subgraph Gateway["FastAPI Gateway :8080"]
            API["REST API endpoints<br/>• Request validation (Pydantic)<br/>• CORS configuration"]
        end
        subgraph vLLM["vLLM Server :8000"]
            Model["Llama 3.1 8B + LoRA<br/>• GPU inference (L4)<br/>• Model caching on EBS"]
        end
        Gateway -->|HTTP| vLLM
    end

    Browser -->|HTTPS| Vercel
    Vercel -->|HTTP proxied| EC2

⚠️ Note: This is an experimental project. The frontend is always live on Vercel, but the backend EC2 server is not kept running at all times in order to save costs (the GPU instance is ~$1/hour). If extraction requests fail, the backend is most likely stopped.

Component Details

| Component | Technology | Port | Description |
|---|---|---|---|
| Frontend | Next.js 16 + React | N/A | User interface on Vercel (separate repo) |
| Gateway | FastAPI + Pydantic | 8080 | REST API layer with validation |
| Inference | vLLM + Llama 3.1 8B | 8000 | LLM serving with LoRA adapter |
| Infrastructure | EC2 + Docker Compose | N/A | Container orchestration |
| CI/CD | GitHub Actions + ECR | N/A | Automated builds and deployments |

🧬 The Model

Fine-tuning Details

The model was fine-tuned on synthetic cancer clinical data using the QLoRA technique (4-bit quantization) for parameter-efficient training.

Base Model: meta-llama/Llama-3.1-8B
Fine-tuned Adapter: loghoag/llama-3.1-8b-medical-ie

Training Data Format

{
  "instruction": "Extract all cancer-related entities from the text.",
  "input": "70-year-old man with widely metastatic cutaneous melanoma...",
  "output": {
    "cancer_type": "melanoma (cutaneous)",
    "stage": "IV",
    "gene_mutation": null,
    "biomarker": "PD-L1 5%; TMB-high",
    "treatment": "nivolumab and ipilimumab; stereotactic radiosurgery",
    "response": "mixed response",
    "metastasis_site": "brain"
  }
}
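
For illustration, here is a minimal sketch of how a record in this format could be rendered into a single prompt string. The field layout below is an assumption; the actual formatting is governed by the repository's custom chat template.

```python
def build_prompt(record: dict) -> str:
    """Render an instruction/input pair into a prompt string.

    Hypothetical layout; the repository's custom chat template defines
    the real format used during fine-tuning and inference.
    """
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        "### Response:\n"
    )

record = {
    "instruction": "Extract all cancer-related entities from the text.",
    "input": "70-year-old man with widely metastatic cutaneous melanoma...",
}
print(build_prompt(record))
# The fine-tuned model is trained to complete this prompt with the JSON
# object shown in the "output" field above.
```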

Extraction Fields

The model extracts 7 structured fields:

| Field | Description | Example |
|---|---|---|
| cancer_type | Type of cancer | melanoma, breast cancer, NSCLC |
| stage | Cancer stage | III, IV, metastatic |
| gene_mutation | Genetic mutations | EGFR exon 19, KRAS G12D, BRCA1 |
| biomarker | Biomarker status | HER2+, PD-L1 5%, TMB-high |
| treatment | Treatments given | nivolumab, chemotherapy, surgery |
| response | Treatment response | complete response, stable disease |
| metastasis_site | Metastasis locations | brain, liver, bone |
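
The gateway validates requests and responses with Pydantic. As a rough sketch, a response schema covering these 7 fields could look like the following; the class name and optional typing are assumptions, not the repository's actual models (those live in gateway/routers/extraction.py):

```python
from pydantic import BaseModel

class ExtractionResult(BaseModel):
    """Assumed response schema covering the 7 extraction fields."""
    cancer_type: str | None = None
    stage: str | None = None
    gene_mutation: str | None = None
    biomarker: str | None = None
    treatment: str | None = None
    response: str | None = None
    metastasis_site: str | None = None

result = ExtractionResult(cancer_type="melanoma (cutaneous)", stage="IV")
print(result.model_dump_json())  # missing fields serialize as null
```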

🚀 Getting Started

Prerequisites

  • AWS account with EC2, ECR, SSM, Secrets Manager access
  • GitHub account with Actions enabled
  • Poetry installed locally (brew install poetry)
  • AWS CLI configured with credentials
  • HuggingFace account with Llama 3.1 access

Required Secrets

| Secret | Purpose | Location |
|---|---|---|
| HF_TOKEN | HuggingFace access token | AWS Secrets Manager |
| AWS_ACCESS_KEY_ID | AWS credentials | GitHub Secrets |
| AWS_SECRET_ACCESS_KEY | AWS credentials | GitHub Secrets |
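
HF_TOKEN is read from Secrets Manager on the instance rather than from a .env file. A minimal sketch of that lookup with boto3, assuming the secret is stored under the name HF_TOKEN in us-east-1 (the deployed stack resolves the real name from its configuration):

```python
import boto3

def get_hf_token(secret_id: str = "HF_TOKEN", region: str = "us-east-1") -> str:
    """Fetch the HuggingFace token from AWS Secrets Manager.

    secret_id and region are illustrative assumptions.
    """
    client = boto3.client("secretsmanager", region_name=region)
    return client.get_secret_value(SecretId=secret_id)["SecretString"]
```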

Quick Start

# Clone and install
git clone https://github.com/longhoag/slm-ft-serving.git
cd slm-ft-serving && poetry install

# Deploy to EC2
poetry run python scripts/deploy.py

# Verify deployment
curl http://<ec2-ip>:8080/health
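
Beyond the health check, the extraction endpoint can be exercised from Python. A hedged sketch follows; the request body field name (`text`) is an assumption, and the authoritative schema is the gateway's Pydantic models (see /docs):

```python
import requests

BASE_URL = "http://<ec2-ip>:8080"  # same placeholder as the curl example above

payload = {"text": "70-year-old man with widely metastatic cutaneous melanoma..."}

resp = requests.post(f"{BASE_URL}/api/v1/extract", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # structured JSON with the 7 extraction fields
```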

CI/CD Workflow

  1. Push to main → GitHub Actions triggers
  2. Parallel builds → vLLM + Gateway Docker images
  3. Push to ECR → Cache-optimized registry
  4. Manual deploy → poetry run python scripts/deploy.py

🛠️ Tech Stack

Backend (This Repository)

| Category | Technology |
|---|---|
| Inference Engine | vLLM (optimized LLM serving) |
| API Framework | FastAPI + Pydantic |
| Language | Python 3.11+ |
| Dependency Management | Poetry |
| Containerization | Docker + Docker Compose |
| Cloud Infrastructure | AWS EC2 (g6.2xlarge, L4 GPU) |
| Container Registry | AWS ECR |
| Remote Execution | AWS Systems Manager (SSM) |
| Secrets Management | AWS Secrets Manager + SSM Parameter Store |
| CI/CD | GitHub Actions |
| Logging | Loguru + CloudWatch Logs |

Frontend (Separate Repository)

| Category | Technology |
|---|---|
| Framework | Next.js 16 (App Router) |
| Language | TypeScript (strict mode) |
| Styling | TailwindCSS v4 |
| UI Components | ShadcnUI (Radix primitives) |
| Deployment | Vercel (serverless) |

🔧 Backend Deep Dive

vLLM Inference Server

flowchart TB
    subgraph vLLM["vLLM Server (Port 8000)"]
        subgraph Models["Model Loading"]
            Base["Base Model<br/>Llama 3.1 8B<br/>(32.1 GB)"]
            LoRA["LoRA Adapter<br/>medical-ie<br/>(71.8 MB)"]
            Base --> LoRA
        end
        
        subgraph GPU["NVIDIA L4 GPU (24GB VRAM)"]
            Batch["Continuous batching"]
            Paged["PagedAttention"]
            API["OpenAI-compatible API"]
        end
        
        Models --> GPU
        
        Cache[("Docker Volume<br/>huggingface-cache<br/>(EBS persistent)")]
    end

Key Features:

  • LoRA hot-loading
  • Model persistence on EBS
  • Health endpoint for orchestration
  • Custom chat template
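
Because vLLM exposes an OpenAI-compatible API, the gateway (or any client on the Docker network) addresses the LoRA adapter by name through the standard chat-completions route. A rough sketch, assuming the adapter is registered under the name medical-ie (the actual adapter name and sampling parameters come from the repository's configuration):

```python
import requests

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # OpenAI-compatible route

payload = {
    # A registered LoRA adapter is addressed via the "model" field;
    # "medical-ie" is an assumed adapter name.
    "model": "medical-ie",
    "messages": [
        {"role": "user", "content": "Extract all cancer-related entities from: ..."}
    ],
    "temperature": 0.0,
    "max_tokens": 512,
}

resp = requests.post(VLLM_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```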

FastAPI Gateway

flowchart TB
    subgraph Gateway["FastAPI Gateway (Port 8080)"]
        subgraph Endpoints["API Endpoints"]
            Health["/health<br/>Health check"]
            Docs["/docs<br/>Swagger UI"]
            Extract["/api/v1/extract<br/>Main endpoint"]
        end
        
        subgraph Pipeline["Request Processing Pipeline"]
            direction TB
            P1["1. Pydantic validation"]
            P2["2. Prompt construction"]
            P3["3. vLLM API call"]
            P4["4. JSON parsing"]
            P5["5. Structured response"]
            P1 --> P2 --> P3 --> P4 --> P5
        end
        
        Extract --> Pipeline
        
        CORS["CORS: Vercel domains only"]
    end

Endpoints:

  • GET /health - Health check
  • GET /docs - Swagger UI
  • POST /api/v1/extract - Main extraction endpoint
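
Putting the pipeline above together, a stripped-down sketch of the extract endpoint might look like the following. The route matches the list above, but the request model, prompt template, adapter name, and internal vLLM URL are assumptions, not the repository's actual code:

```python
import json

import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
VLLM_URL = "http://vllm:8000/v1/chat/completions"  # assumed service name on the Docker network

class ExtractRequest(BaseModel):
    text: str  # assumed field name

@app.post("/api/v1/extract")
async def extract(req: ExtractRequest) -> dict:
    # 1. Pydantic has already validated the request body.
    # 2. Construct the extraction prompt.
    prompt = f"Extract all cancer-related entities from the text.\n\n{req.text}"
    # 3. Call the vLLM OpenAI-compatible API (adapter name is an assumption).
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(
            VLLM_URL,
            json={
                "model": "medical-ie",
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.0,
            },
        )
    resp.raise_for_status()
    content = resp.json()["choices"][0]["message"]["content"]
    # 4. Parse the model's JSON output and 5. return the structured response.
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        raise HTTPException(status_code=502, detail="Model returned non-JSON output")
```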

Container Orchestration

flowchart TB
    subgraph Compose["Docker Compose Architecture"]
        subgraph Network["Docker Network"]
            GW["Gateway<br/>:8080<br/>depends_on: vllm:healthy"]
            VLLM["vLLM<br/>:8000<br/>GPU: L4"]
            GW -->|HTTP| VLLM
        end
        
        Volume[("Named Volume<br/>huggingface-cache<br/>Persistent on EBS")]
        VLLM --> Volume
    end
    
    Client["External Client"] -->|Port 8080| GW

Orchestration:

  • Health check dependencies
  • 6-min startup for model loading
  • GPU reservation
  • Persistent volumes

CI/CD Pipeline

flowchart TB
    Push["Push to main"] --> Actions["GitHub Actions"]
    
    Actions --> Build1["Build vLLM Image"]
    Actions --> Build2["Build Gateway Image"]
    
    Build1 --> ECR["AWS ECR<br/>• slm-ft-serving-vllm<br/>• slm-ft-serving-gateway<br/>• :buildcache layer"]
    Build2 --> ECR
    
    ECR --> Deploy["Manual: deploy.py<br/>(SSM → EC2)"]

Optimizations:

  • Parallel builds
  • ECR layer caching
  • Disk cleanup before builds

Remote Deployment (SSM)

flowchart LR
    subgraph Local["Local Mac"]
        Script["deploy.py<br/>• Start EC2<br/>• Wait OK<br/>• Send commands"]
    end
    
    subgraph AWS["AWS"]
        SSM["SSM<br/>Run Command"]
        subgraph EC2["EC2 g6.2xlarge"]
            Agent["SSM Agent<br/>• Fetch HF token<br/>• ECR login<br/>• Pull images<br/>• docker compose up"]
        end
        SSM --> EC2
    end
    
    Local -->|"AWS SSM API"| SSM

Security:

  • No SSH/.pem keys
  • Secrets from AWS Secrets Manager
  • SSM Parameter Store for config
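
deploy.py drives this sequence over the SSM API. A condensed sketch of the core boto3 calls, with a placeholder instance ID and registry (the actual script reads these from config/deployment.yml and adds error handling and retries):

```python
import boto3

REGION = "us-east-1"
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder; real ID comes from config/deployment.yml

ec2 = boto3.client("ec2", region_name=REGION)
ssm = boto3.client("ssm", region_name=REGION)

# Start the GPU instance and wait until it passes status checks.
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_status_ok").wait(InstanceIds=[INSTANCE_ID])

# Send the deployment commands through SSM Run Command (no SSH involved).
result = ssm.send_command(
    InstanceIds=[INSTANCE_ID],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": [
        "aws ecr get-login-password | docker login --username AWS "
        "--password-stdin <account>.dkr.ecr.us-east-1.amazonaws.com",
        "docker compose pull",
        "docker compose up -d",
    ]},
)
print(result["Command"]["CommandId"])
```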

📊 Features & Capabilities

Backend Features

  • High-Performance Inference - vLLM optimizations for fast LLM serving
  • GPU Acceleration - NVIDIA L4 GPU for efficient inference
  • LoRA Adapter Support - Load fine-tuned adapters without full model retraining
  • Model Caching - Persistent storage on EBS (survives container restarts)
  • Health Checks - Automated container health monitoring
  • Input Validation - Pydantic models for request/response validation
  • CORS Security - Restricted to Vercel domains
  • API Documentation - Interactive Swagger UI at /docs
  • Structured Output - 7 medical fields in JSON format
  • Error Handling - Proper HTTP status codes and error messages

Frontend Features (View Frontend Repo)

  • Real-time Extraction - Extract medical entities in 2-3 seconds
  • 🔒 Secure Architecture - EC2 backend IP hidden via server-side proxy
  • 📱 Responsive Design - Works seamlessly on mobile and desktop
  • 🎯 Type-safe - Full TypeScript coverage with strict mode
  • Fast & Modern - Built with Next.js 16 and TailwindCSS v4
  • 🔄 Auto-deploy - Push to main → live on Vercel instantly

🔧 Development Notes

Design Principles

  • SSM-only access: No SSH, no .pem keys for EC2 access
  • Secrets Manager: All secrets stored securely, never in .env files
  • Poetry for Python: No raw pip install commands
  • Loguru for logging: No print() statements in production code
  • Staged development: Complete each stage before moving forward
  • Fail-safe execution: Commands execute with error handling and retries
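
As a small illustration of the logging principle above, a Loguru sink can replace print() throughout the gateway; the file path and rotation policy below are assumptions:

```python
from loguru import logger

# Keep stderr output and add a rotating file sink instead of print() calls.
logger.add("logs/gateway.log", rotation="10 MB", retention="7 days", level="INFO")

logger.info("vLLM responded in {elapsed:.2f}s", elapsed=1.87)
```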

Project Structure

slm-ft-serving/
├── .github/
│   ├── workflows/deploy.yml        # CI/CD pipeline
│   └── copilot-instructions.md     # AI assistant context
├── config/deployment.yml           # Deployment configuration
├── docs/STAGE-3.md                 # Stage 3 documentation
├── gateway/
│   ├── routers/extraction.py       # Extraction endpoint
│   ├── main.py                     # FastAPI app
│   └── Dockerfile                  # Gateway Docker image
├── scripts/deploy.py               # SSM deployment script
├── Dockerfile                      # vLLM Docker image
├── docker-compose.yml              # Container orchestration
├── pyproject.toml                  # Poetry dependencies
└── README.md

📋 Project Stages

This project follows a staged development approach:

| Stage | Status | Description |
|---|---|---|
| 1 | ✅ Complete | vLLM server with LoRA adapter on EC2 g6.2xlarge |
| 2 | ✅ Complete | FastAPI gateway with Docker Compose orchestration |
| 3 | ✅ Complete | Next.js frontend on Vercel |
| 4 | 🔮 Planned | CloudWatch monitoring & observability |

🌐 Related Links

  • Live Application: https://medical-extraction.vercel.app
  • Frontend Repository: slm-ft-serving-frontend
  • Fine-tuned Adapter: loghoag/llama-3.1-8b-medical-ie

📝 License

MIT License - see LICENSE file for details.


🙏 Acknowledgments

  • vLLM Team - High-performance LLM inference engine
  • Meta AI - Llama 3.1 base model
  • HuggingFace - Model hosting and fine-tuning infrastructure
  • FastAPI - Modern Python web framework
  • Vercel - Frontend hosting platform
