Fork of leestott/FLPerformance, adapted to benchmark GGUF models served by a remote llama-swap / llama.cpp server instead of Microsoft Foundry Local.
A full-stack web application for measuring and comparing the inference performance of multiple LLMs running on a GPU server.
The original FLPerformance was built around Microsoft Foundry Local (Windows, ONNX models). This fork replaces that backend with:
| Original | This fork |
|---|---|
| Microsoft Foundry Local SDK | Direct HTTP to llama-swap / llama.cpp OpenAI-compatible API |
| ONNX models | GGUF models |
| Windows-centric | Linux (tested on Ubuntu + AMD ROCm) |
| Hardcoded local service | Remote GPU server via configurable URL |
| No SSH integration | SSH scan to discover .gguf files on remote server |
| Configuration via .env only | Settings UI persisted to settings.json |
| Single-prompt scenarios | Dual-slot concurrent benchmarking (system + user prompts) |
| No RAG support | Full RAG pipeline: PDF ingest → Qdrant → retrieval-augmented benchmarking |
- Model management: Add and track GGUF models; llama-swap loads them implicitly on first inference request
- Benchmark engine: Runs standardised prompt suites and collects:
  - TTFT (time to first token) — per slot
  - TPOT (time per output token) — combined across both slots
  - TPS / GenTPS (throughput) — combined token count over wall-clock time
  - P50 / P95 / P99 latency percentiles — wall-clock of the concurrent pair
  - CPU, RAM, GPU utilisation
  - VRAM peak usage (MB and %) and free margin
  - Error rate and performance score
- Dual-slot concurrent benchmarking: Suites can define `prompt_system` (JSON, low temperature) and `prompt_user` (natural language, higher temperature) pairs — both slots are sent concurrently via `Promise.all`, mirroring a `parallel=2` llama-swap setup. All metrics (tokens, delays, latency) aggregate both slots
- RAG mode: Suites with a `rag` block activate retrieval-augmented generation. Before each scenario the engine embeds the `question` via a llama.cpp embeddings server (`/v1/embeddings`), retrieves the top-k matching chunks from Qdrant, and passes them as context in the system message. The Benchmarks tab provides Ingest PDF and Skip buttons; the Results tab shows the retrieved chunks with their similarity scores and retrieval latency
- Multi-model sequential benchmarking: Select any number of models — including stopped ones — and the engine tests each in sequence, waiting for VRAM to clear between models, then shows a side-by-side comparison
- Multi-model comparison: Side-by-side charts and a radar overview
- VRAM usage section: Bar chart and table of peak VRAM consumption per model, with a 100% reference line and colour-coded margin alerts
- Vision model support: Automatic mmproj pairing for VL models
- SSH model discovery: Scan a remote directory for `.gguf` files and sync to `models.json` without leaving the UI; the llama-swap model list is also queried as a fallback
- Settings UI: All runtime configuration (API URL, SSH credentials, log level) stored in `settings.json` — no need to restart for most changes
- Export: JSON and CSV download of any benchmark run
- Storage: SQLite (preferred) with automatic JSON fallback
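The dual-slot behaviour described above can be sketched as follows. Here `ask` stands in for one POST to the OpenAI-compatible `/v1/chat/completions` endpoint; the helper is illustrative, not the engine's actual code:

```javascript
// Sketch of one dual-slot iteration: both prompts go out at the same time
// (mirroring a parallel=2 llama-swap setup) and the pair is measured as one.
async function runDualSlot(ask, scenario) {
  const t0 = Date.now();
  const [sys, usr] = await Promise.all([
    ask(scenario.prompt_system, 0.1), // JSON prompt, low temperature
    ask(scenario.prompt_user, 0.7),   // natural-language prompt, higher temperature
  ]);
  return {
    wallMs: Date.now() - t0, // wall-clock latency of the concurrent pair
    tokens: sys.completion_tokens + usr.completion_tokens, // combined token count
  };
}
```

Because both requests share the wall clock, a slow system slot and a fast user slot still count as one timed pair, which is exactly how the suite metrics aggregate them.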
| Component | Notes |
|---|---|
| Node.js ≥ 18 | Backend and Vite dev server |
| llama-swap or llama.cpp | Running on a GPU machine, accessible by HTTP |
| SSH access to GPU machine | Only needed for the model discovery scan |
| GPU stats endpoint (optional) | http://<gpu-host>:9999/gpu — used for real-time VRAM telemetry during benchmarks |
| llama.cpp embeddings server (optional) | Only needed for RAG suites; must serve an embedding model via the /v1/embeddings endpoint (OpenAI format). A separate llama.cpp instance loaded with an embedding model (e.g. nomic-embed-text-v1.5) works. |
| Qdrant vector database (optional) | Only needed for RAG suites; accessible by HTTP with an API key |
The application does not need to run on the GPU machine itself. The backend can run on any Linux host that has network access to the llama.cpp server.
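Before a run it can be handy to check that the optional GPU stats endpoint responds. A minimal probe (the URL is the placeholder from the table above; any successful HTTP response means telemetry will be available):

```javascript
// Quick reachability probe for the optional GPU stats endpoint.
// Requires Node >= 18 (global fetch and AbortSignal.timeout).
async function probeGpuStats(url) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(2000) });
    return res.ok;
  } catch {
    return false; // unreachable: VRAM telemetry sections will simply be absent
  }
}

probeGpuStats('http://gpu-host:9999/gpu').then((ok) =>
  console.log(ok ? 'telemetry available' : 'telemetry unavailable')
);
```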
```bash
# 1. Clone the repo
git clone https://github.com/jimmiflowers/LlamaPerformance.git
cd LlamaPerformance

# 2. Install all dependencies (backend + frontend)
npm run setup
```

```bash
./START_APP.sh
```

Builds the React frontend and serves everything from the Express server on port 3001. Open http://localhost:3001 once the server is ready.
```bash
./START_APP.sh --dev
# or: npm run dev
```

Skips the build and starts the Vite dev server alongside Express:

- Frontend (Vite, hot-reload): http://localhost:3000
- Backend API: http://localhost:3001

Open http://localhost:3000 for the hot-reloading dev UI.
```bash
npm run build   # builds the React app into src/client/dist/
npm run server  # serves the built UI from the Express server on port 3001
```

- Open http://localhost:3000 (dev mode) or http://localhost:3001 (production) in your browser.
- A blue "Setup required" banner will appear on every page until settings are saved.
- Go to Settings → Connection Settings:
Option A — Remote server (default)

- Set Llama API URL to the address of your llama-swap / llama.cpp server (e.g. `http://gpu-host.lan:8000`)
- Adjust Server Port if needed (restart required to take effect)
- Choose Log Level
- Click Save Connection Settings

Option B — Local llama.cpp on the same machine

- Tick Use local llama.cpp instance
- Set the port where llama.cpp is listening (default `8080`)
- Click Save Connection Settings (the SSH section is hidden in local mode)
- Go to Settings → Local/SSH Model Discovery:

Option A — Remote server (SSH)

- Set Remote Models Directory (full path on the GPU server where `.gguf` files are stored)
- Set SSH Username
- Either tick SSH Trust Relationship (uses `~/.ssh/id_rsa`) or enter the SSH password
- Click Save SSH Settings
- Click Scan Remote Models — a table of `.gguf` files found on the remote server appears

Option B — Local llama.cpp

- Set Local Models Directory (full path on this machine where `.gguf` files are stored)
- Click Save Directory
- Click Scan Local Models — a table of `.gguf` files found in that directory appears

In both cases: click Sync all N to models.json to replace the model inventory.
- Lists all models from `models.json`
- Add Model: add a single model manually, or tick Add all models to batch-import all models discovered in the last SSH/local scan. A progress bar is shown during batch import; models already in the list are skipped automatically.
- Load: sends a load request to llama.cpp for that model. If another model is already loaded, a dialog offers three choices: unload first, keep both (advanced — may hang on single-GPU hardware), or cancel.
- Unload: unloads the model and frees VRAM
- Test: runs a single inference request to verify the model responds
- Info: shows live slot status from llama.cpp (`/slots`), server configuration (`/props`), and recent benchmark results for that model
- Params: opens a per-model dialog to configure the llama.cpp load parameters that will be sent every time this model is loaded (manually or during a benchmark run):

| Parameter | Control | Description |
|---|---|---|
| `n_ctx` | Dropdown + custom input | Context window size: 4k, 8k, 16k, 24k, 32k, or any custom value in tokens |
| `n_batch` | Dropdown + custom input | Prompt batch size: 128, 256, 512, 1024, or custom |
| `flash_attn` | Checkbox | Enable Flash Attention |
| `cache_type_k` | Dropdown | KV-cache quantisation for K: `q4_0`, `q8_0`, `fp16` |
| `cache_type_v` | Dropdown | KV-cache quantisation for V: `q4_0`, `q8_0`, `fp16` |

Leave any field at Server default to let llama.cpp use its own startup value. Models with custom params show a Params * button (highlighted in blue) as a reminder.
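As a sketch of the Server default semantics: only explicitly set fields are included in the load payload, so llama.cpp keeps its startup value for everything else (hypothetical helper, not the app's actual code):

```javascript
// Build the load-parameter payload for one model: drop every field left at
// "Server default" (represented here as undefined/null) so llama.cpp falls
// back to its own startup value for that parameter.
function buildLoadParams(modelParams) {
  const allowed = ['n_ctx', 'n_batch', 'flash_attn', 'cache_type_k', 'cache_type_v'];
  return Object.fromEntries(
    allowed
      .filter((k) => modelParams[k] !== undefined && modelParams[k] !== null)
      .map((k) => [k, modelParams[k]])
  );
}

// Only n_ctx and flash_attn survive; n_batch stays at the server default.
console.log(buildLoadParams({ n_ctx: 16384, flash_attn: true, n_batch: null }));
```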
- Select a benchmark suite:
  - `default` — 9 single-prompt scenarios
  - `mayordomo_spanish` — 8 dual-slot scenarios in Spanish (concurrent JSON system + natural-language user prompts)
  - `profesor_spanish` — 8 RAG scenarios in Spanish (retrieval-augmented; requires a llama.cpp embeddings server + Qdrant). Marked with a RAG badge
  - `ingeniero_spanish` — 7 dual-slot scenarios in Spanish for engineering assistant role evaluation
- RAG suites only — an ingest panel appears below the suite card showing the collection name, top_k, chunk size, and source PDF path. Before running benchmarks for the first time click Ingest PDF to parse the PDF, embed all chunks (one at a time via llama.cpp `/v1/embeddings`), and upload them to Qdrant. On subsequent runs you can click Skip to reuse the existing collection. The PDF must be placed in the `RAG/` folder at the project root (gitignored).
- Tick one or more models — any model can be selected, regardless of whether it is currently loaded — or use the Select all models checkbox
- Configure iterations, timeout, and temperature. When a suite is selected its `default_config` values are loaded automatically. Dual-prompt suites expose two separate temperature fields: System temperature (for JSON prompts, default 0.1) and User temperature (for natural-language prompts, default 0.7)
- If the GPU stats endpoint (`aion.home.lan:9999`) is not reachable, a warning banner is shown and Run Benchmark is blocked until the endpoint is available.
- Click Run Benchmark — the engine will:
  - For each model in order: send the first inference request (triggering implicit load in llama-swap), then retry `GET /upstream/{model}/health` every 3 s for up to 60 s until the model reports `status: ok`
  - For RAG suites: embed each scenario's `question` via the llama.cpp embeddings server (`/v1/embeddings`), retrieve top-k chunks from Qdrant, prepend them as context in the system message, then run as a single-slot inference
  - For dual-slot suites: send system and user prompts concurrently (`Promise.all`) — wall-clock latency and total tokens aggregate both slots
  - Apply a 3-second settling pause after model confirmation to avoid inflated TTFT on the first timed inference
  - Between models: poll `GET /running` every 500 ms until the previous model disappears (up to 60 s), confirming VRAM is free before triggering the next load
  - Unload the last model when its tests are done
- The progress card shows the model currently under test and its position in the queue (e.g. "Testing Gemma-3-12B (2 of 3)")
- Pause and Abort buttons are available during a run. Aborting terminates the active inference immediately and auto-deletes the run; a modal overlay confirms the abort is in progress.
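The wait-for-health and wait-for-VRAM steps in the run sequence can be sketched as follows. These are illustrative helpers, not the engine's code; in particular, the `/running` check is simplified to a substring match on the response body:

```javascript
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

// Poll llama-swap until the model reports status: ok (every 3 s, up to 60 s).
async function waitForModelReady(apiUrl, model, fetchImpl = fetch, timeoutMs = 60_000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const body = await (await fetchImpl(`${apiUrl}/upstream/${model}/health`)).json();
      if (body.status === 'ok') return true;
    } catch { /* model still loading, keep polling */ }
    await sleep(3000);
  }
  return false;
}

// Poll GET /running (every 500 ms, up to 60 s) until the previous model is
// gone, confirming VRAM is free before triggering the next load.
async function waitForVramClear(apiUrl, prevModel, fetchImpl = fetch, timeoutMs = 60_000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const body = await (await fetchImpl(`${apiUrl}/running`)).json();
    if (!JSON.stringify(body).includes(prevModel)) return true;
    await sleep(500);
  }
  return false;
}
```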
- Select any past benchmark run from the dropdown
- View performance cards, comparison charts, and a detailed metrics table
- Export JSON / CSV — filename includes model name and datetime
- Export PDF — triggers the browser's print dialog; the UI (sidebar, controls) is hidden and a report header is injected automatically, so the printed output contains only charts and data tables. Use "Save as PDF" in the browser print dialog to get a PDF file.
- Delete Run — permanently removes a run from storage
- VRAM Usage section — table (Model | VRAM peak MB | VRAM peak % | GPU avg % | Free margin MB) and bar chart with a red 100% reference line. Sorted by consumption descending. Colour-coded: green < 75%, orange 75–90%, red > 90%; margin red if < 500 MB. Only shown when VRAM telemetry data is present
- Model Responses table — below the Detailed Results table, a cross-table shows the actual text returned by each model for every scenario (rows = scenarios, columns = models). For dual-slot suites, each cell shows both the User and System responses, labelled. Cells are truncated to ~5 lines with a Show more / Show less toggle. The full text is always shown when printing
- RAG — Contexto Recuperado — for RAG benchmark runs, a card below Model Responses shows the document chunks retrieved from Qdrant for each scenario. Each chunk displays its similarity score and the retrieval latency (ms). Full chunk text is preserved (no truncation). This allows side-by-side evaluation of how well each model used the provided context
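For reference, the headline metrics shown in these results relate to per-request timestamps roughly as follows (hypothetical helpers, not the engine's actual implementation):

```javascript
// How the headline metrics relate to timestamps (ms) and token counts.
function computeMetrics({ tStart, tFirstToken, tEnd, outputTokens }) {
  return {
    ttftMs: tFirstToken - tStart,                                  // TTFT
    tpotMs: (tEnd - tFirstToken) / Math.max(1, outputTokens - 1),  // TPOT
    tps: outputTokens / ((tEnd - tStart) / 1000), // throughput over wall clock
  };
}

// Nearest-rank percentile over sorted wall-clock latencies (P50/P95/P99).
function percentile(sortedMs, p) {
  const idx = Math.max(0, Math.ceil((p / 100) * sortedMs.length) - 1);
  return sortedMs[Math.min(idx, sortedMs.length - 1)];
}
```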
Three sections, each saved independently:
- Connection Settings — Local/remote toggle, Llama API URL (or local port), server port, log level
- Local/SSH Model Discovery — Models directory path; SSH credentials and remote scan in remote mode, local filesystem scan in local mode
- System Information — Live display of active configuration
LlamaPerformance/
├── src/
│ ├── server/
│ │ ├── index.js # Express server + all API routes
│ │ ├── orchestrator.js # llama-swap connection manager (health, idle detection)
│ │ ├── benchmark.js # Benchmark engine (dual-slot, RAG, VRAM metrics)
│ │ ├── storage.js # SQLite + JSON persistence
│ │ ├── cacheManager.js # Model inventory (models.json)
│ │ ├── settingsManager.js # Persistent settings (settings.json)
│ │ ├── logger.js # Winston structured logging
│ │ └── rag/
│ │ ├── ragEngine.js # RAG query engine (embed + Qdrant search + assemble messages)
│ │ └── ingest.js # PDF ingest pipeline (chunk → embed → Qdrant upload)
│ └── client/
│ └── src/
│ ├── pages/ # Dashboard, Models, Benchmarks, Results, Settings, Cache
│ └── utils/api.js # Axios client (includes ragAPI)
├── benchmarks/
│ └── suites/
│ ├── default.json # 9 single-prompt benchmark scenarios
│ ├── mayordomo_spanish.json # 8 dual-slot scenarios (ES) for role-model selection
│ ├── ingeniero_spanish.json # 7 dual-slot scenarios (ES) for engineering assistant
│ └── profesor_spanish.json # 8 RAG scenarios (ES) for university professor assistant
├── RAG/ # PDF files for RAG ingestion — gitignored
├── docs/
│ ├── api.md # Full API reference
│ └── changelog.md # Detailed changelog per session
├── models.json # GGUF model inventory — gitignored, created via Settings → SSH scan
├── settings.json # Runtime configuration — gitignored, created on first Settings save
├── results/ # Benchmark results (SQLite or JSON)
└── logs/ # Winston log files
All settings are stored in settings.json at the project root and are editable from the Settings UI. On first run, if settings.json does not exist, values fall back to environment variables from .env.
| Setting | Env fallback | Description |
|---|---|---|
| `llamaApiUrl` | `LLAMA_API_URL` | URL of the llama-swap / llama.cpp server |
| `port` | `PORT` | Backend Express port (restart required) |
| `logLevel` | `LOG_LEVEL` | Winston log level — applied immediately |
| `modelsDir` | `MODELS_DIR` | Remote directory scanned for `.gguf` files |
| `ssh.username` | — | SSH user on the GPU server |
| `ssh.sshPort` | — | SSH port (default 22) |
| `ssh.trustRelationship` | — | Use `~/.ssh/id_rsa` instead of password |
VITE_API_URL must remain in .env — it is a Vite build-time variable and cannot be stored in settings.json.
```json
[
  {
    "id": "Llama-3.1-8B-Instruct-Q6_K.gguf",
    "alias": "Llama 3.1 8B Instruct Q6 K"
  },
  {
    "id": "Qwen2.5-VL-7B-Instruct-Q5_K_M.gguf",
    "alias": "Qwen2.5 VL 7B Instruct Q5 K M",
    "mmproj": "mmproj-Qwen2.5-VL-7B-F16.gguf"
  }
]
```

- `id` — filename of the `.gguf` file as known to llama.cpp (path excluded, extension included)
- `alias` — display name shown in the UI
- `mmproj` — (optional) mmproj filename for vision-language models

The Settings → SSH scan automatically detects mmproj files and assigns them to models whose filename contains `vl`.
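A minimal sketch of that pairing heuristic (the "filename contains vl" rule comes from above; the helper itself, including pairing every VL model with the first mmproj found, is an illustrative simplification):

```javascript
// Pair mmproj files found in a scan with the VL models whose filename
// contains "vl" (case-insensitive); all other models get no mmproj entry.
function pairMmproj(files) {
  const mmprojs = files.filter((f) => f.toLowerCase().includes('mmproj'));
  return files
    .filter((f) => f.endsWith('.gguf') && !f.toLowerCase().includes('mmproj'))
    .map((f) => ({
      id: f,
      mmproj: f.toLowerCase().includes('vl') ? mmprojs[0] : undefined,
    }));
}

const models = pairMmproj([
  'Llama-3.1-8B-Instruct-Q6_K.gguf',
  'Qwen2.5-VL-7B-Instruct-Q5_K_M.gguf',
  'mmproj-Qwen2.5-VL-7B-F16.gguf',
]);
```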
A suite activates RAG mode when it contains a top-level `rag` block. Each scenario uses a `question` field instead of `prompt_system`/`prompt_user`:
```json
{
  "name": "my_rag_suite",
  "description": "RAG benchmark over a document",
  "rag": {
    "embeddings_endpoint": "http://llama-embed-host:7998",
    "embeddings_model": "model.gguf",
    "qdrant_endpoint": "http://qdrant-host:6333",
    "qdrant_api_key": "your-api-key",
    "collection": "my_collection",
    "top_k": 5,
    "chunk_size": 512,
    "chunk_overlap": 64,
    "source_pdf": "./RAG/document.pdf"
  },
  "system_prompt": "You are an expert assistant. Answer using only the provided material.",
  "scenarios": [
    {
      "name": "Key concept",
      "description": "Ask about a key concept from the document",
      "question": "What is the main concept explained in the material?",
      "max_tokens": 400,
      "expected_output_length": "medium"
    }
  ],
  "default_config": {
    "iterations": 3,
    "concurrency": 1,
    "timeout": 90000,
    "temperature": 0.7,
    "streaming": true
  }
}
```

| RAG field | Description |
|---|---|
| `embeddings_endpoint` | Base URL of the llama.cpp embeddings server (e.g. `http://host:7998`). The server must expose POST `/v1/embeddings` in OpenAI format |
| `embeddings_model` | Model name sent in the request body (e.g. `model.gguf`) |
| `qdrant_endpoint` | Base URL of the Qdrant instance |
| `qdrant_api_key` | API key for Qdrant authentication |
| `collection` | Qdrant collection name (created/replaced on ingest) |
| `top_k` | Number of chunks to retrieve per query |
| `chunk_size` | Target chunk size in tokens (1 token ≈ 4 characters) |
| `chunk_overlap` | Overlap between consecutive chunks in tokens |
| `source_pdf` | Path to the PDF file on the machine running the server. Relative paths are resolved from the project root. Recommended: place the PDF in `RAG/` (gitignored) and use `./RAG/file.pdf` |
The `system_prompt` at the suite level is used as the base system message; retrieved chunks are appended to it automatically.
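Putting the fields together, one retrieval step looks roughly like this. This is a sketch assuming the standard OpenAI embeddings format and the Qdrant REST search API, not the app's ragEngine.js:

```javascript
// One RAG retrieval step: embed the question, then similarity-search Qdrant.
// `rag` is the suite's "rag" block shown above.
async function retrieveContext(rag, question, fetchImpl = fetch) {
  // 1. Embed the question via the llama.cpp embeddings server (OpenAI format).
  const embRes = await fetchImpl(`${rag.embeddings_endpoint}/v1/embeddings`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: rag.embeddings_model, input: question }),
  });
  const vector = (await embRes.json()).data[0].embedding;

  // 2. Retrieve the top-k closest chunks from Qdrant.
  const searchRes = await fetchImpl(
    `${rag.qdrant_endpoint}/collections/${rag.collection}/points/search`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'api-key': rag.qdrant_api_key },
      body: JSON.stringify({ vector, limit: rag.top_k, with_payload: true }),
    }
  );
  // Each hit carries its similarity score, shown later in the Results tab.
  return (await searchRes.json()).result;
}
```

The returned hits carry their score and payload; the engine then appends the chunk texts to the suite-level system prompt before running the timed inference.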
- Verify the URL in Settings → Connection Settings
- Confirm the llama-swap / llama.cpp process is running: `curl http://<host>:8000/health`
- Verify SSH username and port are correct in Settings
- If using trust relationship, confirm `~/.ssh/id_rsa` exists on the machine running the Node.js server and the public key is authorised on the GPU server
- Test manually: `ssh -p <port> <user>@<host> ls /path/to/models`
- The `id` in `models.json` must match exactly what llama.cpp expects (no path prefix)
- For VL models, verify the `mmproj` filename is correct and that `modelsDir` in Settings points to the directory containing both files
- Increase the timeout in the Benchmarks tab configuration
- Reduce concurrency to 1
- Large models may require more time for TTFT on first load
This is expected: the engine intentionally waits for VRAM to be fully freed between models (polling `GET /running` every 500 ms, up to 60 s) and adds a 3-second settling pause after model confirmation. If you run benchmarks individually you skip these safety delays. The results should be comparable — if they are still significantly worse, check that no other process is using GPU memory during the run.
The GPU stats endpoint (http://aion.home.lan:9999/gpu) must be reachable during the benchmark run. If it is not available, getGpuMetrics() returns nulls silently and the section is hidden. Check connectivity to the stats service on the GPU host.
MIT — see LICENSE
Original project: leestott/FLPerformance (MIT)