Secure offline research assistant leveraging a 20B LLM and retrieval-augmented generation (RAG) for local document search and question answering.
- Model: 20B LLM running in LM Studio
- Role: Handles text generation, summarization, and Q&A
- Access: Exposes a local API endpoint for frontend communication
- Storage: PDFs, Word documents, and plain-text files are stored locally
- Preprocessing:
  - Split documents into chunks (500–1000 tokens each)
  - Remove irrelevant formatting for cleaner embeddings
- Embeddings: Open-source embedding model (e.g., sentence-transformers) to vectorize document chunks
- Vector DB Options:
  - FAISS (lightweight, local)
  - Milvus (advanced, heavier)
- Role: Fast retrieval of relevant chunks to feed the model contextually (see the sketch after this list)
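A minimal indexing-and-retrieval sketch, assuming sentence-transformers with all-MiniLM-L6-v2 and a flat FAISS index; the chunk size, helper names, and whitespace-token approximation are illustrative, not part of the project code:

```python
# Sketch only: chunking, embedding, and FAISS retrieval under the assumptions above.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into ~chunk_size-token pieces (whitespace tokens as a rough proxy)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> faiss.IndexFlatIP:
    # Normalized embeddings + inner product == cosine similarity.
    vecs = embedder.encode(chunks, convert_to_numpy=True, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs.astype(np.float32))
    return index

def retrieve_top_k(index: faiss.IndexFlatIP, chunks: list[str], question: str, k: int = 5) -> list[str]:
    q = embedder.encode([question], convert_to_numpy=True, normalize_embeddings=True)
    _, ids = index.search(q.astype(np.float32), k)
    return [chunks[i] for i in ids[0]]
```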
Workflow:
- User asks a question
- Query vector DB → retrieve top-K relevant chunks
- Construct a prompt with context + user question
- Send to LM Studio for answer generation
Benefit: Allows a 20B model to answer long-document questions without exceeding the context window
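A sketch of the prompt-assembly and generation step, assuming LM Studio's OpenAI-compatible chat-completions endpoint on its default port; the system prompt, temperature, and function name are illustrative:

```python
# Sketch only: send retrieved context + the user question to LM Studio's local server.
import requests

LMSTUDIO_API_URL = "http://127.0.0.1:1234/v1/chat/completions"  # matches the .env example further down

def answer_question(question: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    payload = {
        "model": "openai/gpt-oss-20b",
        "messages": [
            {"role": "system", "content": "Answer using only the provided context. Say so if the context is insufficient."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.2,
    }
    resp = requests.post(LMSTUDIO_API_URL, json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```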
- Framework: Streamlit (offline, Python-based)
- Features (see the UI sketch after this list):
  - Upload documents (PDF, DOCX, TXT)
  - Search & query interface
  - Display answers with reference snippets
  - Optional toggle to fall back to the LLM's own knowledge when the retrieved context is insufficient
  - Show total response time for each query
  - Notebook/history for past queries
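A minimal Streamlit sketch of this interface, assuming the backend runs at http://127.0.0.1:8000 and returns JSON with answer and sources fields (the port and response shape are assumptions, not the project's actual API):

```python
# Sketch only: Streamlit front end under the assumptions above.
import time
import requests
import streamlit as st

BACKEND = "http://127.0.0.1:8000"  # assumed backend address

st.title("Secure Research Assistant")

uploads = st.file_uploader("Upload documents", type=["pdf", "docx", "txt"], accept_multiple_files=True)
for f in uploads or []:
    requests.post(f"{BACKEND}/docs", files={"file": (f.name, f.getvalue())}, timeout=60)

question = st.text_input("Ask a question about your documents")
if st.button("Ask") and question:
    start = time.time()
    data = requests.post(f"{BACKEND}/ask", json={"question": question}, timeout=300).json()
    st.write(data.get("answer", ""))
    with st.expander("Reference snippets"):
        for snippet in data.get("sources", []):
            st.markdown(f"> {snippet}")
    st.caption(f"Response time: {time.time() - start:.1f} s")
```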
┌───────────────┐
│  User (You)   │
└───────┬───────┘
        │
        ▼
┌─────────────────┐
│  Streamlit UI   │
│ - Upload docs   │
│ - Ask questions │
│ - View answers  │
│ - Response time │
└────────┬────────┘
         │
         ▼
┌──────────────────────┐
│  Local Backend API   │
│ - /docs              │
│ - /ask               │
│ - Manage FAISS index │
└──────────┬───────────┘
           │
           ▼
┌─────────────────────────┐
│  FAISS Vector DB        │
│ - Embeddings via        │
│   all-MiniLM-L6-v2      │
│ - Retrieve top-K chunks │
└───────────┬─────────────┘
            │
            ▼
┌────────────────────────┐
│  LM Studio 20B LLM     │
│ - Receives prompt:     │
│   "Question + Context" │
│ - Generates answer     │
│ - Optionally fall back │
│   to own knowledge     │
└───────────┬────────────┘
            │
            ▼
┌───────────────────┐
│ Answer + Sources  │
│ + Response Time   │
└─────────┬─────────┘
          │
          ▼
  Streamlit UI displays
- Run your 20B model locally and expose a local API:
  - Download LM Studio and load your 20B model
  - Enable the API server (LM Studio provides a local, OpenAI-compatible REST endpoint)
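A quick way to confirm the server is reachable before wiring up the backend (assumes LM Studio's default port 1234 and its OpenAI-compatible /v1/models route):

```python
# Sketch only: verify LM Studio's local server is up and the model is loaded.
import requests

resp = requests.get("http://127.0.0.1:1234/v1/models", timeout=5)
print(resp.json())  # should list the loaded 20B model
```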
git clone <your-repo-url>
cd secure-research-assistant

secure-research-assistant/
│
├── backend/
│ ├── api.py
│ ├── ingest.py
│ ├── embeddings.py
│ ├── retrieval.py
│ ├── config.py
│ └── utils.py
│
├── frontend/
│ └── chat.py # Streamlit UI
│
├── models/
│ └── 20B_model/ # LM Studio model directory
│
├── data/
│ ├── documents/ # Uploaded documents
│ └── embeddings/ # FAISS index files
│
├── scripts/
│ ├── start_backend.sh
│ └── preprocess_docs.sh
│
├── README.md
└── requirements.txt
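A hypothetical sketch of what backend/api.py could look like with the /docs and /ask routes from the diagram, assuming Flask (the Gunicorn command below targets the WSGI app backend.api:app); the retrieval and generation steps are placeholders for the modules listed above:

```python
# Sketch only: minimal Flask backend under the assumptions above.
import os
import time
from flask import Flask, jsonify, request

app = Flask(__name__)
DATA_DIR = os.environ.get("DATA_DIR", "data/documents")

@app.post("/docs")
def upload_doc():
    f = request.files["file"]
    os.makedirs(DATA_DIR, exist_ok=True)
    f.save(os.path.join(DATA_DIR, f.filename))
    # Chunking, embedding, and FAISS index updates would be triggered here (ingest.py / embeddings.py).
    return jsonify({"status": "stored", "filename": f.filename})

@app.post("/ask")
def ask():
    question = request.get_json(force=True)["question"]
    start = time.time()
    chunks = []   # placeholder: retrieval.py would return the top-K chunks
    answer = ""   # placeholder: prompt the LM Studio model with question + chunks
    return jsonify({
        "answer": answer,
        "sources": chunks,
        "response_time_s": round(time.time() - start, 2),
    })

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=8000)
```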
uv venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
uv pip install -r requirements.txt
cp config.example.env .env

Update .env with your local settings:
LMSTUDIO_API_URL=http://127.0.0.1:1234/v1/chat/completions
LLM_MODEL=openai/gpt-oss-20b
CHUNK_SIZE=500
TOP_K=5
EMBEDDING_MODEL=all-MiniLM-L6-v2
DATA_DIR=data/documents
EMBEDDING_DIR=data/embeddings

Start the backend:

uv run backend/api.py

Optional: run with Gunicorn for production:

gunicorn -w 4 backend.api:app

Launch the Streamlit frontend:

streamlit run frontend/chat.py
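For reference, a hypothetical backend/config.py that loads these settings, assuming python-dotenv is available; the variable names mirror the .env example above:

```python
# Sketch only: load .env settings for the backend, assuming python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

LMSTUDIO_API_URL = os.getenv("LMSTUDIO_API_URL", "http://127.0.0.1:1234/v1/chat/completions")
LLM_MODEL = os.getenv("LLM_MODEL", "openai/gpt-oss-20b")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "500"))
TOP_K = int(os.getenv("TOP_K", "5"))
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")
DATA_DIR = os.getenv("DATA_DIR", "data/documents")
EMBEDDING_DIR = os.getenv("EMBEDDING_DIR", "data/embeddings")
```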