
LlamaPerformance

Fork of leestott/FLPerformance, adapted to benchmark GGUF models served by a remote llama-swap / llama.cpp server instead of Microsoft Foundry Local.

A full-stack web application for measuring and comparing the inference performance of multiple LLMs running on a GPU server.


What this fork changes

The original FLPerformance was built around Microsoft Foundry Local (Windows, ONNX models). This fork replaces that backend with:

Original → This fork:

  • Microsoft Foundry Local SDK → Direct HTTP to llama-swap / llama.cpp OpenAI-compatible API
  • ONNX models → GGUF models
  • Windows-centric → Linux (tested on Ubuntu + AMD ROCm)
  • Hardcoded local service → Remote GPU server via configurable URL
  • No SSH integration → SSH scan to discover .gguf files on the remote server
  • Configuration via .env only → Settings UI persisted to settings.json
  • Single-prompt scenarios → Dual-slot concurrent benchmarking (system + user prompts)
  • No RAG support → Full RAG pipeline: PDF ingest → Qdrant → retrieval-augmented benchmarking

Features

  • Model management: Add and track GGUF models; llama-swap loads them implicitly on first inference request
  • Benchmark engine: Runs standardised prompt suites and collects:
    • TTFT (time to first token) — per slot
    • TPOT (time per output token) — combined across both slots
    • TPS / GenTPS (throughput) — combined token count over wall-clock time
    • P50 / P95 / P99 latency percentiles — wall-clock of the concurrent pair
    • CPU, RAM, GPU utilisation
    • VRAM peak usage (MB and %) and free margin
    • Error rate and performance score
  • Dual-slot concurrent benchmarking: Suites can define prompt_system (JSON, low temperature) and prompt_user (natural language, higher temperature) pairs — both slots are sent concurrently via Promise.all, mirroring a parallel=2 llama-swap setup. All metrics (tokens, delays, latency) aggregate both slots (see the sketch after this list)
  • RAG mode: Suites with a rag block activate retrieval-augmented generation. Before each scenario the engine embeds the question via a llama.cpp embeddings server (/v1/embeddings), retrieves the top-k matching chunks from Qdrant, and passes them as context in the system message. The Benchmarks tab provides Ingest PDF and Skip buttons; the Results tab shows the retrieved chunks with their similarity scores and retrieval latency
  • Multi-model sequential benchmarking: Select any number of models — including stopped ones — and the engine tests each in sequence, waiting for VRAM to clear between models, then shows a side-by-side comparison
  • Multi-model comparison: Side-by-side charts and a radar overview
  • VRAM usage section: Bar chart and table of peak VRAM consumption per model, with a 100% reference line and colour-coded margin alerts
  • Vision model support: Automatic mmproj pairing for VL models
  • SSH model discovery: Scan a remote directory for .gguf files and sync them to models.json without leaving the UI; the llama-swap model list is also queried as a fallback
  • Settings UI: All runtime configuration (API URL, SSH credentials, log level) stored in settings.json — no need to restart for most changes
  • Export: JSON and CSV download of any benchmark run
  • Storage: SQLite (preferred) with automatic JSON fallback
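
To make the dual-slot flow concrete, here is a minimal sketch of one concurrent iteration. It is an illustration, not the engine's actual code in src/server/benchmark.js: the non-streaming request omits per-slot TTFT (which the real engine measures via streaming), and the endpoint path, field names, and helper names are assumptions.

// Minimal sketch of a dual-slot iteration (illustrative, not the engine's exact code).
// Assumes an OpenAI-compatible /v1/chat/completions endpoint and non-streaming responses.
const API = process.env.LLAMA_API_URL || 'http://gpu-host.lan:8000';

async function timedCompletion(model, messages, temperature) {
  const start = Date.now();
  const res = await fetch(`${API}/v1/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, messages, temperature, max_tokens: 400 }),
  });
  const json = await res.json();
  return {
    latencyMs: Date.now() - start,                  // wall-clock for this slot
    tokens: json.usage?.completion_tokens ?? 0,     // output tokens for this slot
    text: json.choices?.[0]?.message?.content ?? '',
  };
}

async function runDualSlot(model, scenario) {
  const t0 = Date.now();
  // Both slots are sent concurrently, mirroring a parallel=2 llama-swap setup.
  const [systemSlot, userSlot] = await Promise.all([
    timedCompletion(model, [{ role: 'user', content: scenario.prompt_system }], 0.1),
    timedCompletion(model, [{ role: 'user', content: scenario.prompt_user }], 0.7),
  ]);
  const wallClockMs = Date.now() - t0;              // latency of the concurrent pair
  const totalTokens = systemSlot.tokens + userSlot.tokens;
  return {
    wallClockMs,
    totalTokens,
    tps: totalTokens / (wallClockMs / 1000),        // combined throughput (GenTPS)
  };
}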

Requirements

  • Node.js ≥ 18: backend and Vite dev server
  • llama-swap or llama.cpp: running on a GPU machine, accessible by HTTP
  • SSH access to GPU machine: only needed for the model discovery scan
  • GPU stats endpoint (optional): http://<gpu-host>:9999/gpu — used for real-time VRAM telemetry during benchmarks
  • llama.cpp embeddings server (optional): only needed for RAG suites; must serve an embedding model via the /v1/embeddings endpoint (OpenAI format). A separate llama.cpp instance loaded with an embedding model (e.g. nomic-embed-text-v1.5) works.
  • Qdrant vector database (optional): only needed for RAG suites; accessible by HTTP with an API key

The application does not need to run on the GPU machine itself. The backend can run on any Linux host that has network access to the llama.cpp server.


Installation

# 1. Clone the repo
git clone https://github.com/jimmiflowers/LlamaPerformance.git
cd LlamaPerformance

# 2. Install all dependencies (backend + frontend)
npm run setup

Starting the application

Normal use (single port)

./START_APP.sh

Builds the React frontend and serves everything from the Express server on port 3001. Open http://localhost:3001 once the server is ready.

Development (hot-reload on both ends)

./START_APP.sh --dev
# or: npm run dev

Skips the build and starts Vite dev server alongside Express:

  • Frontend (Vite, hot-reload): http://localhost:3000
  • Backend API: http://localhost:3001

Open http://localhost:3000 for the hot-reloading dev UI.

Manual production build

npm run build    # builds the React app into src/client/dist/
npm run server   # serves the built UI from the Express server on port 3001

First-time configuration

  1. Open http://localhost:3000 (dev mode) or http://localhost:3001 (production) in your browser.

  2. A blue "Setup required" banner will appear on every page until settings are saved.

  3. Go to Settings → Connection Settings:

    Option A — Remote server (default)

    • Set Llama API URL to the address of your llama-swap / llama.cpp server (e.g. http://gpu-host.lan:8000)
    • Adjust Server Port if needed (restart required to take effect)
    • Choose Log Level
    • Click Save Connection Settings

    Option B — Local llama.cpp on the same machine

    • Tick Use local llama.cpp instance
    • Set the port where llama.cpp is listening (default 8080)
    • Click Save Connection Settings (SSH section is hidden in local mode)
  4. Go to Settings → Local/SSH Model Discovery:

    Option A — Remote server (SSH)

    • Set Remote Models Directory (full path on the GPU server where .gguf files are stored)
    • Set SSH Username
    • Either tick SSH Trust Relationship (uses ~/.ssh/id_rsa) or enter the SSH password
    • Click Save SSH Settings
    • Click Scan Remote Models — a table of .gguf files found on the remote server appears

    Option B — Local llama.cpp

    • Set Local Models Directory (full path on this machine where .gguf files are stored)
    • Click Save Directory
    • Click Scan Local Models — a table of .gguf files found in that directory appears

    In both cases: click Sync all N to models.json to replace the model inventory.


Using the application

Models tab

  • Lists all models from models.json

  • Add Model: add a single model manually, or tick Add all models to batch-import all models discovered in the last SSH/local scan. A progress bar is shown during batch import; models already in the list are skipped automatically.

  • Load: sends a load request to llama.cpp for that model. If another model is already loaded, a dialog offers three choices: unload first, keep both (advanced — may hang on single-GPU hardware), or cancel.

  • Unload: unloads the model and frees VRAM

  • Test: runs a single inference request to verify the model responds

  • Info: shows live slot status from llama.cpp (/slots), server configuration (/props), and recent benchmark results for that model

  • Params: opens a per-model dialog to configure the llama.cpp load parameters that will be sent every time this model is loaded (manually or during a benchmark run):

    • n_ctx (dropdown + custom input): context window size of 4k, 8k, 16k, 24k, 32k, or any custom value in tokens
    • n_batch (dropdown + custom input): prompt batch size of 128, 256, 512, 1024, or a custom value
    • flash_attn (checkbox): enable Flash Attention
    • cache_type_k (dropdown): KV-cache quantisation for K (q4_0, q8_0, fp16)
    • cache_type_v (dropdown): KV-cache quantisation for V (q4_0, q8_0, fp16)

    Leave any field at Server default to let llama.cpp use its own startup value. Models with custom params show a Params * button (highlighted in blue) as a reminder.
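
As an illustration of how such per-model overrides can be merged over server defaults, consider the following sketch. The object shape and helper are hypothetical; they only show the "omit the key to keep the server default" behaviour described above.

// Hypothetical shape of per-model load parameters (field names assumed for illustration).
// A value left unset means "Server default": the key is simply omitted from the request.
const modelParams = {
  'Llama-3.1-8B-Instruct-Q6_K.gguf': {
    n_ctx: 16384,
    n_batch: 512,
    flash_attn: true,
    cache_type_k: 'q8_0',
    cache_type_v: 'q8_0',
  },
};

// Merge only the keys the user actually set, so server defaults stay untouched.
function applyParams(requestBody, modelId) {
  const params = modelParams[modelId] || {};
  const overrides = Object.fromEntries(
    Object.entries(params).filter(([, value]) => value !== undefined)
  );
  return { ...requestBody, ...overrides };
}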

Benchmarks tab

  1. Select a benchmark suite:
    • default — 9 single-prompt scenarios
    • mayordomo_spanish — 8 dual-slot scenarios in Spanish (concurrent JSON system + natural-language user prompts)
    • profesor_spanish — 8 RAG scenarios in Spanish (retrieval-augmented; requires a llama.cpp embeddings server + Qdrant). Marked with a RAG badge
    • ingeniero_spanish — 7 dual-slot scenarios in Spanish for engineering assistant role evaluation
  2. RAG suites only — an ingest panel appears below the suite card showing the collection name, top_k, chunk size, and source PDF path. Before running benchmarks for the first time click Ingest PDF to parse the PDF, embed all chunks (one at a time via llama.cpp /v1/embeddings), and upload them to Qdrant. On subsequent runs you can click Skip to reuse the existing collection. The PDF must be placed in the RAG/ folder at the project root (gitignored).
  3. Tick one or more models — any model can be selected, regardless of whether it is currently loaded — or use the Select all models checkbox
  4. Configure iterations, timeout, and temperature. When a suite is selected its default_config values are loaded automatically. Dual-prompt suites expose two separate temperature fields: System temperature (for JSON prompts, default 0.1) and User temperature (for natural-language prompts, default 0.7)
  5. If the GPU stats endpoint (aion.home.lan:9999) is not reachable, a warning banner is shown and Run Benchmark is blocked until the endpoint is available.
  6. Click Run Benchmark — the engine will:
    • For each model in order: send the first inference request (triggering implicit load in llama-swap), then retry GET /upstream/{model}/health every 3 s for up to 60 s until the model reports status: ok
    • For RAG suites: embed each scenario's question via the llama.cpp embeddings server (/v1/embeddings), retrieve top-k chunks from Qdrant, prepend them as context in the system message, then run as a single-slot inference
    • For dual-slot suites: send system and user prompts concurrently (Promise.all) — wall-clock latency and total tokens aggregate both slots
    • Apply a 3-second settling pause after model confirmation to avoid inflated TTFT on the first timed inference
    • Between models: poll GET /running every 500 ms until the previous model disappears (up to 60 s), confirming VRAM is free before triggering the next load (see the polling sketch after this list)
    • Unload the last model when its tests are done
  7. The progress card shows the model currently under test and its position in the queue (e.g. "Testing Gemma-3-12B (2 of 3)")
  8. Pause and Abort buttons are available during a run. Aborting terminates the active inference immediately and auto-deletes the run; a modal overlay confirms the abort is in progress.
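
The two polling loops from step 6 can be sketched as follows. The intervals and timeouts match the values above; the helper names and the assumed shape of the /running response are illustrative, not the engine's actual implementation.

// Illustrative polling helpers (names and response shapes are assumptions).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry GET /upstream/{model}/health every 3 s for up to 60 s until status: ok.
async function waitForModelReady(apiUrl, model, timeoutMs = 60000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(`${apiUrl}/upstream/${model}/health`);
      if (res.ok) {
        const body = await res.json();
        if (body.status === 'ok') return true;
      }
    } catch { /* server may still be loading the model */ }
    await sleep(3000);
  }
  throw new Error(`Model ${model} did not become healthy within ${timeoutMs} ms`);
}

// Poll GET /running every 500 ms until the previous model is gone, i.e. VRAM is free.
async function waitForVramFree(apiUrl, previousModel, timeoutMs = 60000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await fetch(`${apiUrl}/running`);
    const body = await res.json();               // assumed shape: { running: [{ model: ... }] }
    const stillLoaded = (body.running || []).some((m) => m.model === previousModel);
    if (!stillLoaded) return true;
    await sleep(500);
  }
  return false;                                  // proceed anyway after the timeout
}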

Results tab

  • Select any past benchmark run from the dropdown
  • View performance cards, comparison charts, and a detailed metrics table
  • Export JSON / CSV — filename includes model name and datetime
  • Export PDF — triggers the browser's print dialog; the UI (sidebar, controls) is hidden and a report header is injected automatically, so the printed output contains only charts and data tables. Use "Save as PDF" in the browser print dialog to get a PDF file.
  • Delete Run — permanently removes a run from storage
  • VRAM Usage section — table (Model | VRAM peak MB | VRAM peak % | GPU avg % | Free margin MB) and bar chart with a red 100% reference line. Sorted by consumption descending. Colour-coded: green < 75%, orange 75–90%, red > 90%; margin red if < 500 MB. Only shown when VRAM telemetry data is present
  • Model Responses table — below the Detailed Results table, a cross-table shows the actual text returned by each model for every scenario (rows = scenarios, columns = models). For dual-slot suites, each cell shows both the User and System responses, labelled. Cells are truncated to ~5 lines with a Show more / Show less toggle. The full text is always shown when printing
  • RAG — Contexto Recuperado (retrieved context) — for RAG benchmark runs, a card below Model Responses shows the document chunks retrieved from Qdrant for each scenario. Each chunk displays its similarity score and the retrieval latency (ms). Full chunk text is preserved (no truncation). This allows side-by-side evaluation of how well each model used the provided context

Settings tab

Three sections, each saved independently:

  • Connection Settings — Local/remote toggle, Llama API URL (or local port), server port, log level
  • Local/SSH Model Discovery — Models directory path; SSH credentials and remote scan in remote mode, local filesystem scan in local mode
  • System Information — Live display of active configuration

Project structure

LlamaPerformance/
├── src/
│   ├── server/
│   │   ├── index.js           # Express server + all API routes
│   │   ├── orchestrator.js    # llama-swap connection manager (health, idle detection)
│   │   ├── benchmark.js       # Benchmark engine (dual-slot, RAG, VRAM metrics)
│   │   ├── storage.js         # SQLite + JSON persistence
│   │   ├── cacheManager.js    # Model inventory (models.json)
│   │   ├── settingsManager.js # Persistent settings (settings.json)
│   │   ├── logger.js          # Winston structured logging
│   │   └── rag/
│   │       ├── ragEngine.js   # RAG query engine (embed + Qdrant search + assemble messages)
│   │       └── ingest.js      # PDF ingest pipeline (chunk → embed → Qdrant upload)
│   └── client/
│       └── src/
│           ├── pages/         # Dashboard, Models, Benchmarks, Results, Settings, Cache
│           └── utils/api.js   # Axios client (includes ragAPI)
├── benchmarks/
│   └── suites/
│       ├── default.json           # 9 single-prompt benchmark scenarios
│       ├── mayordomo_spanish.json # 8 dual-slot scenarios (ES) for role-model selection
│       ├── ingeniero_spanish.json # 7 dual-slot scenarios (ES) for engineering assistant
│       └── profesor_spanish.json  # 8 RAG scenarios (ES) for university professor assistant
├── RAG/                           # PDF files for RAG ingestion — gitignored
├── docs/
│   ├── api.md                 # Full API reference
│   └── changelog.md           # Detailed changelog per session
├── models.json                # GGUF model inventory — gitignored, created via Settings → SSH scan
├── settings.json              # Runtime configuration — gitignored, created on first Settings save
├── results/                   # Benchmark results (SQLite or JSON)
└── logs/                      # Winston log files

Configuration reference

All settings are stored in settings.json at the project root and are editable from the Settings UI. On first run, if settings.json does not exist, values fall back to environment variables from .env.

Setting reference (environment variable fallbacks in parentheses):

  • llamaApiUrl (LLAMA_API_URL): URL of the llama-swap / llama.cpp server
  • port (PORT): backend Express port (restart required)
  • logLevel (LOG_LEVEL): Winston log level — applied immediately
  • modelsDir (MODELS_DIR): remote directory scanned for .gguf files
  • ssh.username: SSH user on the GPU server
  • ssh.sshPort: SSH port (default 22)
  • ssh.trustRelationship: use ~/.ssh/id_rsa instead of a password

VITE_API_URL must remain in .env — it is a Vite build-time variable and cannot be stored in settings.json.
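
A settings.json written by the Settings UI might look roughly like the following. The exact key nesting is an assumption based on the setting names above, so treat it as an illustration rather than the canonical schema.

{
  "llamaApiUrl": "http://gpu-host.lan:8000",
  "port": 3001,
  "logLevel": "info",
  "modelsDir": "/srv/models/gguf",
  "ssh": {
    "username": "benchuser",
    "sshPort": 22,
    "trustRelationship": true
  }
}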


models.json format

[
  {
    "id": "Llama-3.1-8B-Instruct-Q6_K.gguf",
    "alias": "Llama 3.1 8B Instruct Q6 K"
  },
  {
    "id": "Qwen2.5-VL-7B-Instruct-Q5_K_M.gguf",
    "alias": "Qwen2.5 VL 7B Instruct Q5 K M",
    "mmproj": "mmproj-Qwen2.5-VL-7B-F16.gguf"
  }
]
  • id — filename of the .gguf file as known to llama.cpp (path excluded, extension included)
  • alias — display name shown in the UI
  • mmproj — (optional) mmproj filename for vision-language models

The Settings → SSH scan automatically detects mmproj files and assigns them to models whose filename contains vl.
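
A naive sketch of that pairing heuristic is shown below; the real scan logic may match mmproj files to models differently, so treat this purely as an illustration.

// Simplified illustration of mmproj auto-pairing during a scan (not the app's exact logic).
function pairMmproj(ggufFiles) {
  const mmprojFiles = ggufFiles.filter((f) => f.toLowerCase().startsWith('mmproj'));
  const models = ggufFiles.filter((f) => !f.toLowerCase().startsWith('mmproj'));
  return models.map((id) => {
    const entry = { id, alias: id.replace(/\.gguf$/, '').replace(/[-_]/g, ' ') };
    // Vision-language models are recognised by "vl" in the filename.
    if (id.toLowerCase().includes('vl') && mmprojFiles.length > 0) {
      entry.mmproj = mmprojFiles[0];   // naive: a real match would compare model names
    }
    return entry;
  });
}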


RAG suite format

A suite activates RAG mode when it contains a top-level rag block. Each scenario uses a question field instead of prompt_system/prompt_user:

{
  "name": "my_rag_suite",
  "description": "RAG benchmark over a document",
  "rag": {
    "embeddings_endpoint": "http://llama-embed-host:7998",
    "embeddings_model": "model.gguf",
    "qdrant_endpoint": "http://qdrant-host:6333",
    "qdrant_api_key": "your-api-key",
    "collection": "my_collection",
    "top_k": 5,
    "chunk_size": 512,
    "chunk_overlap": 64,
    "source_pdf": "./RAG/document.pdf"
  },
  "system_prompt": "You are an expert assistant. Answer using only the provided material.",
  "scenarios": [
    {
      "name": "Key concept",
      "description": "Ask about a key concept from the document",
      "question": "What is the main concept explained in the material?",
      "max_tokens": 400,
      "expected_output_length": "medium"
    }
  ],
  "default_config": {
    "iterations": 3,
    "concurrency": 1,
    "timeout": 90000,
    "temperature": 0.7,
    "streaming": true
  }
}
RAG block fields:

  • embeddings_endpoint: base URL of the llama.cpp embeddings server (e.g. http://host:7998). The server must expose POST /v1/embeddings in OpenAI format
  • embeddings_model: model name sent in the request body (e.g. model.gguf)
  • qdrant_endpoint: base URL of the Qdrant instance
  • qdrant_api_key: API key for Qdrant authentication
  • collection: Qdrant collection name (created/replaced on ingest)
  • top_k: number of chunks to retrieve per query
  • chunk_size: target chunk size in tokens (1 token ≈ 4 characters)
  • chunk_overlap: overlap between consecutive chunks in tokens
  • source_pdf: path to the PDF file on the machine running the server. Relative paths are resolved from the project root. Recommended: place the PDF in RAG/ (gitignored) and use ./RAG/file.pdf

The system_prompt at the suite level is used as the base system message; retrieved chunks are appended to it automatically.
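
The per-scenario retrieval step can be sketched as follows, combining the OpenAI-style /v1/embeddings call with Qdrant's REST search endpoint. The suite fields come from the rag block above; the payload key holding the chunk text and the function shape are assumptions, not the exact code in src/server/rag/ragEngine.js.

// Illustrative RAG retrieval for one scenario (not the app's exact implementation).
async function retrieveContext(rag, systemPrompt, question) {
  // 1. Embed the question via the llama.cpp embeddings server (OpenAI format).
  const embRes = await fetch(`${rag.embeddings_endpoint}/v1/embeddings`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: rag.embeddings_model, input: question }),
  });
  const vector = (await embRes.json()).data[0].embedding;

  // 2. Retrieve the top_k most similar chunks from Qdrant.
  const searchRes = await fetch(
    `${rag.qdrant_endpoint}/collections/${rag.collection}/points/search`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'api-key': rag.qdrant_api_key },
      body: JSON.stringify({ vector, limit: rag.top_k, with_payload: true }),
    }
  );
  const hits = (await searchRes.json()).result;

  // 3. Append the retrieved chunks to the suite-level system prompt.
  const context = hits.map((h) => h.payload.text).join('\n\n');   // payload key assumed
  return { chunks: hits, systemMessage: `${systemPrompt}\n\nContext:\n${context}` };
}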


Troubleshooting

Cannot connect to the llama.cpp server

  • Verify the URL in Settings → Connection Settings
  • Confirm the llama-swap / llama.cpp process is running: curl http://<host>:8000/health

SSH scan fails

  • Verify SSH username and port are correct in Settings
  • If using trust relationship, confirm ~/.ssh/id_rsa exists on the machine running the Node.js server and the public key is authorised on the GPU server
  • Test manually: ssh -p <port> <user>@<host> ls /path/to/models

Model fails to load

  • The id in models.json must match exactly what llama.cpp expects (no path prefix)
  • For VL models, verify the mmproj filename is correct and that modelsDir in Settings points to the directory containing both files

Benchmark timeouts

  • Increase the timeout in the Benchmarks tab configuration
  • Reduce concurrency to 1
  • Large models may require more time for TTFT on first load

Multi-model benchmark slower than individual runs

This is expected: the engine intentionally waits for VRAM to be fully freed between models (polls GET /running every 500 ms, up to 60 s) and adds a 3-second settling pause after model confirmation. If you run benchmarks individually you skip these safety delays. The results should be comparable — if they are still significantly worse, check that no other process is using GPU memory during the run.

VRAM section does not appear in Results

The GPU stats endpoint (http://aion.home.lan:9999/gpu) must be reachable during the benchmark run. If it is not available, getGpuMetrics() returns nulls silently and the section is hidden. Check connectivity to the stats service on the GPU host.


License

MIT — see LICENSE

Original project: leestott/FLPerformance (MIT)
