A full-stack job queue dashboard for running LLM inference through Ollama with real-time GPU monitoring. Submit prompts, track job status, and observe GPU utilization, memory, temperature, and token throughput from a single UI.
```
┌──────────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│   Frontend   │────▶│  Backend │────▶│  SQLite  │◀────│  Worker  │
│ React/Nginx  │     │  FastAPI │     │  (WAL)   │     │  Python  │
└──────────────┘     └──────────┘     └────┬─────┘     └────┬─────┘
                                           ▲                │
                                           │                ▼
                                    ┌──────┴──────┐    ┌──────────┐
                                    │ GPU Monitor │    │  Ollama  │
                                    │   pynvml    │    │   LLMs   │
                                    └─────────────┘    └──────────┘
```
Five Docker services:
| Service | Role | Port |
|---|---|---|
| frontend | React 19 + Vite + Tailwind, served via Nginx | 3001 |
| backend | FastAPI REST API | 8001 |
| worker | Polls for pending jobs, calls Ollama, writes results | - |
| gpu-monitor | Polls NVIDIA GPU metrics via pynvml, writes to DB | - |
| ollama | LLM inference server | 11435 |
All Python services (backend, worker, gpu-monitor) import a common `shared/` package containing the SQLAlchemy models, the async database engine, and configuration.
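A minimal sketch of what that package provides, assuming illustrative names (the real modules are `shared/config.py`, `shared/database.py`, and `shared/models.py`, and may differ in detail):

```python
# Illustrative sketch of the shared async engine and session factory; the real
# code in shared/database.py and shared/config.py may use different names.
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine
from sqlalchemy.orm import DeclarativeBase

from shared.config import settings  # Pydantic settings exposing DATABASE_URL, OLLAMA_HOST, etc.


class Base(DeclarativeBase):
    """Declarative base shared by the Job, Comparison, GpuMetric, and Prompt models."""


engine = create_async_engine(settings.database_url)
SessionLocal = async_sessionmaker(engine, expire_on_commit=False)


async def get_session():
    """Yield a session; usable as a FastAPI dependency and by the worker services."""
    async with SessionLocal() as session:
        yield session
```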
- Docker & Docker Compose
- NVIDIA GPU with drivers installed
- NVIDIA Container Toolkit
```bash
# 1. Clone and configure
git clone git@github.com:chrisfauerbach/queuedagents.git
cd queuedagents
cp .env.example .env

# 2. Launch everything
docker compose up --build -d

# 3. Pull a model into Ollama
docker compose exec ollama ollama pull gemma3:12b

# 4. Open the dashboard
open http://localhost:3001
```

Environment variables (set in `.env`):
| Variable | Default | Description |
|---|---|---|
| `DATABASE_URL` | `sqlite+aiosqlite:///./data/queue.db` | SQLAlchemy async database URL |
| `OLLAMA_HOST` | `http://ollama:11434` | Ollama API base URL |
| `POLL_INTERVAL` | `1.0` | Worker job polling interval (seconds) |
| `GPU_POLL_INTERVAL` | `2.0` | GPU metrics polling interval (seconds) |
All endpoints are prefixed with /api.
| Method | Path | Description |
|---|---|---|
| POST | `/api/jobs` | Submit a new job |
| GET | `/api/jobs` | List jobs (query: `status`, `limit`, `offset`) |
| GET | `/api/jobs/:id` | Get a single job |
| GET | `/api/stats` | Aggregate job status counts |
| GET | `/api/token-usage?hours=24` | Cumulative token usage per model (1-168 hour window) |
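For example, submitting a job from a script and polling it to completion might look like the sketch below. The request and response field names and status values are assumptions based on the job parameters described later in this README; check `backend/app/schemas.py` for the actual schemas.

```python
# Illustrative only: submit a job, then poll until it leaves the queue.
# Field names and status values ("pending", "processing", "completed", "failed")
# are assumptions; see backend/app/schemas.py for the real request/response shapes.
import time

import httpx

BASE = "http://localhost:8001"

job = httpx.post(f"{BASE}/api/jobs", json={
    "model": "gemma3:12b",
    "prompt": "Explain SQLite WAL mode in two sentences.",
    "temperature": 0.7,
}).json()

while job["status"] in ("pending", "processing"):
    time.sleep(1)
    job = httpx.get(f"{BASE}/api/jobs/{job['id']}").json()

print(job["status"], job.get("output_tokens"))
```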
| Method | Path | Description |
|---|---|---|
| GET | `/api/gpu/metrics?minutes=10` | GPU time-series data (1-60 min window) |
| Method | Path | Description |
|---|---|---|
| POST | `/api/comparisons` | Create a comparison (runs the same prompt across N models) |
| GET | `/api/comparisons` | List all comparisons with their jobs |
| GET | `/api/comparisons/:id` | Get a single comparison with jobs |
| Method | Path | Description |
|---|---|---|
| GET | `/api/models` | List available Ollama models |
| GET | `/api/models/catalog` | Curated model catalog with installed status |
| POST | `/api/models/pull` | Pull/download a model (streams NDJSON progress) |
| POST | `/api/models/show` | Get detailed model info (license, family, quantization) |
| DELETE | `/api/models` | Delete a local model |
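Because the pull endpoint streams newline-delimited JSON, a client reads progress line by line rather than waiting for a single response body. A sketch, assuming the request takes a `name` field and that the progress keys mirror Ollama's own pull API (`status`, `completed`, `total`), which this endpoint proxies:

```python
# Illustrative NDJSON consumer for the pull endpoint; request and progress field
# names are assumptions modelled on Ollama's pull API.
import json

import httpx

with httpx.stream("POST", "http://localhost:8001/api/models/pull",
                  json={"name": "llama3.1:8b"}, timeout=None) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        event = json.loads(line)
        done, total = event.get("completed"), event.get("total")
        if done and total:
            print(f"{event.get('status', '')}: {100 * done / total:.1f}%")
        else:
            print(event.get("status", ""))
```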
| Method | Path | Description |
|---|---|---|
| POST | `/api/prompts` | Save a reusable prompt |
| GET | `/api/prompts` | List all saved prompts |
| GET | `/api/prompts/:id` | Get a single prompt |
| PUT | `/api/prompts/:id` | Update a prompt |
| DELETE | `/api/prompts/:id` | Delete a prompt |
| Method | Path | Description |
|---|---|---|
| GET | `/api/leaderboard` | Model performance leaderboard with win rates |
| Method | Path | Description |
|---|---|---|
| GET | `/api/health` | Returns `{"status": "ok"}` |
- Submit a job via the dashboard or API with a model name, prompt, and optional parameters (system prompt, temperature, max tokens).
- The worker picks up the oldest pending job, marks it as `processing`, and sends it to Ollama (see the sketch after this list).
- On completion, the worker records the result along with input tokens, output tokens, and generation time from the Ollama response.
- The dashboard polls for updates and displays status, results, and token throughput.
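A condensed sketch of that worker loop, with illustrative column and helper names (the actual implementation lives in `worker/app/main.py` and `worker/app/ollama_client.py`):

```python
# Simplified worker loop: claim the oldest pending job, run it through Ollama,
# and persist the result. Column and helper names are illustrative, not copied
# from worker/app/main.py.
import asyncio

from sqlalchemy import select

from app.ollama_client import generate     # the worker's Ollama HTTP client
from shared.config import settings
from shared.database import SessionLocal   # assumed session factory name
from shared.models import Job


async def process_one() -> bool:
    async with SessionLocal() as session:
        job = (await session.execute(
            select(Job)
            .where(Job.status == "pending")
            .order_by(Job.created_at)
            .limit(1)
        )).scalar_one_or_none()
        if job is None:
            return False

        job.status = "processing"
        await session.commit()

        try:
            # Ollama's generate response reports token counts and timings.
            result = await generate(model=job.model, prompt=job.prompt)
            job.result = result["response"]
            job.input_tokens = result.get("prompt_eval_count")
            job.output_tokens = result.get("eval_count")
            job.generation_time = result.get("eval_duration", 0) / 1e9  # ns -> s (assumed column)
            job.status = "completed"
        except Exception as exc:
            job.status = "failed"
            job.error = str(exc)
        await session.commit()
        return True


async def main() -> None:
    while True:
        worked = await process_one()
        if not worked:
            await asyncio.sleep(settings.poll_interval)  # POLL_INTERVAL from .env


if __name__ == "__main__":
    asyncio.run(main())
```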
The Models page (/models) provides a full model management interface:
- Installed models table — Shows name, size, family, parameters, and quantization for all local models. Two-click delete confirmation.
- Curated catalog — Browse ~15 model families across 6 categories (General Purpose, Code, Reasoning, Chat/Instruct, Small/Fast, Multilingual). Click a variant chip to pull it.
- Real-time download progress — Pull streams NDJSON from Ollama with animated progress bars.
- Custom pull — Text input for pulling any model by name (e.g. `llama3.1:8b`).
The Compare feature (/compare) lets you run the same prompt against multiple models side-by-side. A comparison creates one job per selected model, all sharing the same prompt and parameters. Results are displayed in a side-by-side grid with per-model status, output, token counts, and generation speed. The detail page auto-refreshes until all jobs complete.
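Creating a comparison from a script is a single POST. In the sketch below the `prompt` and `models` field names are assumptions rather than the verified schema; see `backend/app/schemas.py` for the real request body.

```python
# Illustrative only: run one prompt across several models as a comparison.
# The "prompt" and "models" request fields are assumptions, not the verified schema.
import httpx

BASE = "http://localhost:8001"

comparison = httpx.post(f"{BASE}/api/comparisons", json={
    "prompt": "Write a haiku about GPU fans.",
    "models": ["gemma3:12b", "llama3.1:8b"],
}).json()

# One job is created per model; the detail endpoint returns the comparison with its jobs.
detail = httpx.get(f"{BASE}/api/comparisons/{comparison['id']}").json()
print(detail)
```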
The Prompts page (/prompts) lets you save reusable prompts with parameters (system prompt, temperature, max tokens). Saved prompts can be edited, deleted, run directly as a comparison against selected models, or sent to the Compare page pre-filled.
The Leaderboard page (/leaderboard) ranks models by performance metrics including average tokens per second, generation time, total token usage, and comparison win rate.
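Both headline numbers can be derived from fields the jobs and comparisons already store. A rough sketch of the arithmetic (the win-rate definition shown here is an assumption):

```python
# Rough shape of the leaderboard arithmetic; not the actual query in
# backend/app/routes/leaderboard.py.
def tokens_per_second(output_tokens: int, generation_seconds: float) -> float:
    return output_tokens / generation_seconds if generation_seconds else 0.0


def win_rate(wins: int, comparisons_entered: int) -> float:
    # Assumed definition: share of comparisons in which this model was marked the winner.
    return wins / comparisons_entered if comparisons_entered else 0.0
```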
The dashboard includes a cumulative token usage chart that tracks input and output tokens consumed per model over time. The chart:
- Shows one line per model, each in a distinct color (16-color palette)
- Displays cumulative total tokens on the Y-axis with auto-scaled labels (K/M suffixes)
- Updates every 10 seconds via polling
- Queries the last 24 hours of completed jobs by default
The data is derived from input_tokens and output_tokens already recorded on each completed job — no additional database tables are required.
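A sketch of that derivation (illustrative; the actual aggregation lives in `backend/app/routes/jobs.py`): take the completed jobs in the window ordered by completion time, then emit a running total per model.

```python
# Illustrative: build cumulative token series per model from completed jobs.
# Attribute names (model, completed_at, input_tokens, output_tokens) are assumptions.
from collections import defaultdict


def cumulative_token_series(jobs):
    """jobs: completed jobs from the selected window, ordered by completion time."""
    totals = defaultdict(int)
    series = defaultdict(list)
    for job in jobs:
        totals[job.model] += (job.input_tokens or 0) + (job.output_tokens or 0)
        series[job.model].append((job.completed_at, totals[job.model]))
    return series  # one (timestamp, cumulative total) series per model
```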
The gpu-monitor service reads metrics from NVIDIA GPUs every 2 seconds via pynvml:
- GPU utilization %
- Memory used / total (MB)
- Temperature (Celsius)
- Power draw (Watts)
Metrics older than 1 hour are automatically pruned. The dashboard renders a live SVG line chart showing utilization, memory %, and temperature over the selected time window.
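A condensed sketch of one polling pass (the real loop lives in `gpu-monitor/app/main.py`; the pynvml calls are the standard NVML bindings, while the dictionary keys here are illustrative):

```python
# One GPU polling pass; a condensed sketch, not the actual gpu-monitor code.
import pynvml

pynvml.nvmlInit()
samples = []
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)              # .gpu is a percentage
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)                     # .used / .total in bytes
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)                # milliwatts
    samples.append({
        "gpu_index": i,
        "utilization_pct": util.gpu,
        "memory_used_mb": mem.used / (1024 * 1024),   # bytes -> MB
        "memory_total_mb": mem.total / (1024 * 1024),
        "temperature_c": temp,
        "power_w": power_mw / 1000.0,                 # milliwatts -> W
    })
pynvml.nvmlShutdown()
```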
```
queuedagents/
├── backend/                    # FastAPI application
│   ├── app/
│   │   ├── main.py             # App entrypoint, CORS, router mounting
│   │   ├── routes/
│   │   │   ├── jobs.py         # Job CRUD + stats endpoints
│   │   │   ├── gpu.py          # GPU metrics endpoint
│   │   │   ├── comparisons.py  # Model comparison endpoints
│   │   │   ├── models.py       # Model listing, catalog, pull, show, delete
│   │   │   ├── prompts.py      # Prompt CRUD endpoints
│   │   │   └── leaderboard.py  # Model leaderboard endpoint
│   │   ├── model_catalog.py    # Curated model catalog data
│   │   └── schemas.py          # Pydantic request/response models
│   ├── alembic/                # Database migrations
│   ├── Dockerfile
│   └── requirements.txt
├── worker/                     # Job processing worker
│   ├── app/
│   │   ├── main.py             # Polling loop, job claim/complete/fail
│   │   └── ollama_client.py    # Ollama HTTP client
│   ├── Dockerfile
│   └── requirements.txt
├── gpu-monitor/                # GPU metrics collector
│   ├── app/
│   │   └── main.py             # pynvml polling loop
│   ├── Dockerfile
│   └── requirements.txt
├── frontend/                   # React SPA
│   ├── src/
│   │   ├── api/client.ts
│   │   ├── components/         # GpuChart, TokenChart, JobList, JobDetail, StatsCards, Layout, etc.
│   │   ├── hooks/              # usePolling
│   │   ├── pages/              # Dashboard, JobDetail, Prompts, Compare, ComparisonDetail, Leaderboard, Models
│   │   └── types/
│   ├── Dockerfile
│   └── nginx.conf
├── shared/                     # Shared Python package
│   ├── config.py               # Pydantic settings
│   ├── database.py             # SQLAlchemy async engine + session
│   └── models.py               # Job, Comparison, GpuMetric, Prompt ORM models
├── tests/                      # 136 pytest tests (97% coverage)
│   ├── conftest.py             # In-memory DB, session, and ASGI client fixtures
│   └── test_*.py               # 16 test modules
├── requirements-test.txt
├── pytest.ini
├── docker-compose.yml
└── .env.example
```
The backend has a comprehensive test suite covering all Python services: backend (FastAPI routes), worker (job processing), gpu-monitor (metrics collection), and shared (models, config, database).
```bash
# One-time setup
ln -sf gpu-monitor gpu_monitor
pip install -r requirements-test.txt

# Run all 136 tests with coverage
pytest --cov --cov-report=term-missing -v

# Run a single test file
pytest tests/test_routes_jobs.py -v
```

The test suite runs entirely offline — no Docker, no GPU, no Ollama needed. Three external systems are mocked out:
| System | Mock Strategy | Why |
|---|---|---|
| SQLite database | Replaced with an in-memory SQLite engine (`sqlite+aiosqlite://`). Each test gets a fresh database via function-scoped fixtures — tables are created before the test and dropped after. | Eliminates filesystem I/O, prevents test pollution, runs in milliseconds. |
| Ollama HTTP API | Intercepted at the httpx transport layer using respx. Routes that call Ollama (`/api/models`, `/api/models/catalog`, `/api/models/show`, `/api/models/pull`, `DELETE /api/models`) and the worker's `generate()` function all get deterministic fake responses. | Ollama requires a running server with downloaded models. Mocking lets us test every code path — success, HTTP errors, missing fields, timeouts — without a live inference server. |
| NVIDIA GPU driver (pynvml) | The entire pynvml module is replaced via `unittest.mock.patch` with a MagicMock that returns SimpleNamespace objects mimicking real GPU handles, utilization rates, memory info, temperature, and power readings. | The gpu-monitor service calls `pynvml.nvmlDeviceGetHandleByIndex()` and related C library bindings that require an NVIDIA GPU. Mocking lets us verify unit conversions (milliwatts → watts, bytes → megabytes), multi-GPU iteration, metric pruning, and error handling. |
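As an illustration of the Ollama strategy (a sketch, not a test copied from the suite; the fixture and marker names are assumptions), respx can pin Ollama's tags endpoint to a canned payload so `/api/models` can be exercised without a running server:

```python
# Sketch of the respx approach used for Ollama-backed routes; not a verbatim test.
# OLLAMA_HOST is assumed to be http://ollama:11434, matching the env table above.
import httpx
import pytest
import respx


@pytest.mark.anyio
@respx.mock
async def test_list_models_returns_ollama_tags(client):  # `client`: ASGI fixture from conftest.py
    respx.get("http://ollama:11434/api/tags").mock(
        return_value=httpx.Response(200, json={"models": [{"name": "gemma3:12b"}]})
    )
    resp = await client.get("/api/models")
    assert resp.status_code == 200
```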
```
tests/
├── conftest.py                  # Shared fixtures (engine, session, FastAPI client)
├── test_shared_config.py        # Settings defaults and env overrides
├── test_shared_models.py        # ORM defaults, relationships, enums
├── test_shared_database.py      # Engine, Base metadata, get_session
├── test_backend_main.py         # Health endpoint, CORS, router registration
├── test_backend_schemas.py      # Pydantic validation on all request schemas
├── test_backend_seed.py         # Prompt seeding logic and idempotency
├── test_model_catalog.py        # Catalog structure and data integrity
├── test_routes_jobs.py          # Job CRUD, token-usage cumulative logic, stats
├── test_routes_comparisons.py   # Comparison CRUD, set/clear winner validation
├── test_routes_gpu.py           # GPU metrics query with time filtering
├── test_routes_models.py        # Ollama proxy endpoints (respx mocks)
├── test_routes_leaderboard.py   # TPS calculation, win rate, sorting
├── test_routes_prompts.py       # Prompt CRUD with partial updates
├── test_worker_main.py          # Job claim/complete/fail lifecycle, main loop
├── test_worker_ollama.py        # generate() request/response handling
└── test_gpu_monitor.py          # Record/prune metrics, main loop resilience
```
For local frontend development with hot reload:
```bash
cd frontend
npm install
npm run dev
```

This starts Vite on port 3000 with API requests proxied to localhost:8000.
MIT