A comprehensive system for building a custom academic Q&A assistant using RAG (Retrieval-Augmented Generation) and QLoRA fine-tuning.
- Data Collection: Automated arXiv paper scraping and PDF processing
- RAG Pipeline: Hybrid retrieval combining vector search (FAISS) and keyword search (SQLite FTS5)
- Synthetic Data Generation: GPT-4 powered Q&A pair generation for fine-tuning
- QLoRA Fine-Tuning: Efficient fine-tuning of the Llama 3.1 8B model with 4-bit quantization
- Gradio UI: Interactive web interface for testing and comparing models
- FastAPI Backend: RESTful API for integration with external applications
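The hybrid retrieval step merges the ranked results from FAISS vector search and SQLite FTS5 keyword search. One common way to combine them is reciprocal rank fusion; the sketch below illustrates the idea (the repo's actual fusion logic may differ):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs into a single ranking.

    Each document scores 1 / (k + rank + 1) per list it appears in;
    documents ranked highly by both retrievers float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy ranked lists standing in for the two retrievers' outputs
vector_hits = ["paper_3", "paper_1", "paper_7"]   # e.g. from FAISS
keyword_hits = ["paper_1", "paper_9", "paper_3"]  # e.g. from SQLite FTS5

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Here `paper_1` and `paper_3` rank first because both retrievers returned them, which is the behavior hybrid search is after.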
- Python 3.10+
- CUDA-capable GPU (recommended)
- OpenAI API key (for synthetic data generation)
```bash
export GRADIO_SERVER_PORT=7861
python gradio-ui.py
```

Then access the UI at http://localhost:7861
Run the full pipeline:

```bash
python pipeline-runner.py
```

Start the API server:

```bash
uvicorn module8-api:app --host 0.0.0.0 --port 8000
```

```
├── config/                  # Configuration files
├── modules/                 # Core modules
│   ├── m1_langchain_llama/  # LLM loading and chain building
│   ├── m2_data_collection/  # arXiv scraping
│   ├── m3_rag_pipeline/     # RAG indexing
│   ├── m4_hybrid_retrieval/ # Hybrid search (FAISS + SQLite)
│   ├── m5_synthetic_data/   # Q&A generation
│   ├── m6_fine_tuning/      # QLoRA training
│   └── m8_api_service/      # FastAPI endpoints
├── storage/                 # Data, indexes, and models
├── gradio-ui.py             # Main Gradio interface
├── module8-api.py           # FastAPI application
└── pipeline-runner.py       # Full pipeline execution
```
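With the API server running, external applications can query it over HTTP. A minimal client sketch using only the standard library (the `/ask` route and JSON payload shape here are assumptions; check module8-api.py for the actual endpoint and schema):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/ask"  # assumed route -- verify in module8-api.py

def build_request(question: str, top_k: int = 5) -> urllib.request.Request:
    """Construct a POST request carrying the question as a JSON body."""
    payload = json.dumps({"question": question, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = build_request("What is QLoRA?")
# Send with: urllib.request.urlopen(req) once the server is up
```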
Edit config/settings.py to customize:
- Model selection (base model, embedding model)
- Training parameters (LoRA rank, learning rate, etc.)
- RAG settings (top-k retrieval, similarity thresholds)
- Data collection (arXiv categories, number of papers)
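For orientation, a config/settings.py might group those options like this; the option names and values below are illustrative assumptions, so check the actual file for the real ones:

```python
# config/settings.py (illustrative sketch -- option names are assumptions)

# Model selection
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# QLoRA training parameters
LORA_RANK = 16
LORA_ALPHA = 32
LEARNING_RATE = 2e-4

# RAG retrieval settings
TOP_K = 5                   # passages returned per query
SIMILARITY_THRESHOLD = 0.7  # minimum cosine similarity to keep a hit

# Data collection
ARXIV_CATEGORIES = ["cs.CL", "cs.LG"]
MAX_PAPERS = 200
```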
- Setup Guide - Setting up the OpenAI API
- Deployment Guide - Production deployment
- Restart Guide - Troubleshooting Gradio UI
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open source and available under the MIT License.