A comprehensive benchmarking and optimization tool for Ollama models with both Terminal User Interface (TUI) and Command-Line Interface (CLI) modes.
Jeremiah Pegues jeremiah@pegues.io
- Dual Interface: Full-featured TUI with real-time graphs or simple CLI mode
- Model Benchmarking: Compare performance across multiple Ollama models
- System Optimization: Automatically tune models for your hardware
- Resource Monitoring: Real-time CPU, GPU, and RAM usage tracking
- Batch Processing: Optimize multiple models in parallel
- Export Results: Save benchmark data in CSV format
- Modelfile Optimization: Generate optimized configurations based on system specs
- Batch Model Optimization: Optimize all models with a single command
- Hardware Detection: Automatic detection of CPU, RAM, and GPU capabilities
- Platform-Specific Tuning: Special optimizations for Apple Silicon
- Performance Profiling: Detailed metrics including tokens/sec and memory usage
```bash
git clone https://github.com/peguesj/ollama-bench.git
cd ollama-bench
pip install -e .
```

Or install the dependencies directly:

```bash
pip install psutil pynvml py-cpuinfo
```

Launch the TUI:

```bash
ollama-bench
# or
python -m ollama_bench
```

Run in CLI mode:

```bash
ollama-bench --cli
```

Optimize models from the command line:

```bash
# Optimize a single model
python optimize_model.py llama2

# Optimize all models
python optimize_all.py --parallel

# Clean up optimized models
python optimize_all.py --cleanup
```

The TUI provides a rich interactive experience:

```
┌─────────────────────────────────────────────────────────┐
│ Ollama Bench v2.0.0 │
├─────────────────┬───────────────────────────────────────┤
│ === Models === │ Benchmark Results │
│ qwen2.5-coder │ Model: qwen2.5-coder │
│ llama2:7b │ Tokens/sec: 42.3 │
│ codellama:34b │ Time: 1.2s │
│ │ Peak RAM: 7.2 GB │
│ === Actions ===│ │
│> Run Benchmark │ ┌─Performance Graph──────┐ │
│ Configuration │ │ ████████████████ │ │
│ Optimize Model │ │ CPU: 45% GPU: 80% │ │
│ Export Results │ └───────────────────────┘ │
├─────────────────┴───────────────────────────────────────┤
│ [Up/Down] Navigate [Enter] Select [O] Optimize [Q] Quit │
├─────────────────────────────────────────────────────────┤
│ Ready CPU: 12% RAM: 8GB │
└─────────────────────────────────────────────────────────┘
```

Keyboard shortcuts:

- Arrow Keys: Navigate menu
- Enter: Select menu item
- Space: Start/stop benchmark
- O: Optimize selected model (or all if none selected)
- E: Edit configuration
- M: Edit Modelfile
- X: Export results
- Q: Quit
The CLI provides a simple menu-driven interface:
```
$ ollama-bench --cli

============================================================
          Ollama Bench CLI - Benchmarking Tool
============================================================

=== Main Menu ===
1. Run Benchmark
2. List Models
3. Show Configuration
4. Export Results
5. Optimize Single Model
6. Optimize All Models
7. Show System Info
8. Clean Optimized Models
Q. Quit

Enter choice:
```

Option 7 (Show System Info) reports the detected hardware and the parameters the optimizer will apply:

```
$ ollama-bench --cli
# Select option 7

System Specifications
============================================================
Platform: Darwin (Apple Silicon)
CPU: 12 cores @ 3.2 GHz
RAM: 48.0 GB total, 32.0 GB available
GPU: Apple Silicon GPU (36.0 GB)
Optimal Parameters
============================================================
Context Size: 4096 tokens
Batch Size: 512
Threads: 11
GPU Layers: 999
```
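These values come from simple hardware heuristics: roughly one thread per physical core minus one, a context and batch size scaled to available RAM, and full GPU offload whenever a GPU is present. As a rough illustration only, not the exact logic in `system_optimizer.py`, the selection could look like this sketch built on psutil:

```python
# Illustrative sketch: derive Ollama runtime parameters from detected hardware.
# The real optimizer may use different thresholds and additional checks.
import platform
import psutil

def pick_parameters() -> dict:
    cores = psutil.cpu_count(logical=False) or psutil.cpu_count() or 1
    ram_gb = psutil.virtual_memory().total / 1024**3
    apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"

    return {
        # Leave one core free for the OS and the resource monitor.
        "num_thread": max(1, cores - 1),
        # Larger contexts need more RAM, so scale with total memory.
        "num_ctx": 8192 if ram_gb >= 64 else 4096 if ram_gb >= 16 else 2048,
        # Bigger batches raise throughput when memory allows it.
        "num_batch": 512 if ram_gb >= 16 else 256,
        # 999 asks Ollama to offload every layer it can; NVIDIA detection
        # (via pynvml) would set this the same way.
        "num_gpu": 999 if apple_silicon else 0,
    }

if __name__ == "__main__":
    print(pick_parameters())
```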
```bash
# Optimize all models with parallel processing
$ python optimize_all.py --parallel --workers 4
# Optimize specific models
$ python optimize_all.py llama2:7b codellama:13b
# Generate benchmark comparison script
$ python optimize_all.py --benchmark
# Clean up when done
$ python optimize_all.py --cleanup
```

The optimizer automatically configures:
| Parameter | Description | Impact |
|---|---|---|
| `num_ctx` | Context window size | Larger = better comprehension |
| `num_batch` | Batch processing size | Larger = higher throughput |
| `num_gpu` | GPU layers to offload | 999 = full GPU acceleration |
| `num_thread` | CPU threads | Optimized for core count |
| `use_mlock` | Memory locking | Prevents swapping |
| `use_mmap` | Memory mapping | Efficient for large models |
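Under the hood this amounts to writing an adjusted Modelfile and registering an optimized copy of the model with Ollama. The sketch below shows the idea; the `-optimized` naming, the helper name, and the exact parameter values are illustrative assumptions, not necessarily what `optimize_model.py` produces:

```python
# Illustrative sketch: bake tuned parameters into a Modelfile and register
# the result with `ollama create`. Names and values are only examples.
import subprocess
from pathlib import Path

def create_optimized_model(base: str, params: dict, workdir: str = ".") -> str:
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]

    modelfile = Path(workdir) / f"Modelfile.{base.replace(':', '-')}"
    modelfile.write_text("\n".join(lines) + "\n")

    optimized = f"{base}-optimized"
    subprocess.run(["ollama", "create", optimized, "-f", str(modelfile)], check=True)
    return optimized

# Example: apply the parameters shown above to llama2:7b.
create_optimized_model("llama2:7b", {
    "num_ctx": 4096,
    "num_batch": 512,
    "num_gpu": 999,
    "num_thread": 11,
    "use_mlock": "true",
    "use_mmap": "true",
})
```

Recommended model sizes by available RAM: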
| Available RAM | Model Size | Example Models |
|---|---|---|
| < 8 GB | 3B-7B | qwen2.5:3b, tinyllama |
| 8-16 GB | 7B | llama2:7b, mistral:7b |
| 16-32 GB | 13B | llama2:13b, codellama:13b |
| 32-64 GB | 34B | codellama:34b |
| > 64 GB | 70B+ | llama2:70b, mixtral:8x7b |
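If you want the same recommendation in a script, a tiny sketch with thresholds copied from the table (the function name is illustrative):

```python
# Map available RAM to the model-size tiers from the table above.
import psutil

def recommend_model_size(available_gb: float) -> str:
    if available_gb < 8:
        return "3B-7B (e.g. qwen2.5:3b, tinyllama)"
    if available_gb < 16:
        return "7B (e.g. llama2:7b, mistral:7b)"
    if available_gb < 32:
        return "13B (e.g. llama2:13b, codellama:13b)"
    if available_gb < 64:
        return "34B (e.g. codellama:34b)"
    return "70B+ (e.g. llama2:70b, mixtral:8x7b)"

print(recommend_model_size(psutil.virtual_memory().available / 1024**3))
```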
Results are saved in CSV format with detailed metrics:
```csv
model,iteration,elapsed_s,tokens_per_sec,peak_rss_bytes,cpu_percent,gpu_percent
qwen2.5-coder,1,1.234,42.3,7516192768,45.2,78.9
llama2:7b,1,2.456,38.1,8589934592,52.1,82.3
```
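A few lines of Python are enough to compare models from an export; the filename below is an assumption, so substitute whatever path you exported to:

```python
# Summarize an exported benchmark CSV: average tokens/sec and peak RAM per model.
import csv
from collections import defaultdict

runs_by_model = defaultdict(list)
with open("benchmark_results.csv", newline="") as f:  # assumed export filename
    for row in csv.DictReader(f):
        runs_by_model[row["model"]].append(row)

for model, runs in runs_by_model.items():
    avg_tps = sum(float(r["tokens_per_sec"]) for r in runs) / len(runs)
    peak_gb = max(int(r["peak_rss_bytes"]) for r in runs) / 1024**3
    print(f"{model}: {avg_tps:.1f} tok/s avg, {peak_gb:.1f} GB peak RSS")
```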
Configuration is stored in ~/.config/ollama_bench/config.yaml:
```yaml
benchmark:
  iterations: 3
  timeout: 120
  num_predict: 100
  temperature: 0.7
  seed: 42
  workdir: ~/.ollama_bench
resources:
  max_cpu_percent: 80
  max_gpu_percent: 90
  max_ram_gb: null
  throttle_enabled: false
ui:
  theme: default
  refresh_rate: 0.5
  show_graph: true
```
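Because it is plain YAML, the file can also be read or adjusted from a script; a minimal sketch assuming PyYAML is available:

```python
# Read and tweak the ollama-bench configuration file (plain YAML).
from pathlib import Path
import yaml

config_path = Path.home() / ".config" / "ollama_bench" / "config.yaml"
config = yaml.safe_load(config_path.read_text())

config["benchmark"]["iterations"] = 5           # run each model five times
config["resources"]["throttle_enabled"] = True  # back off when the system is busy

config_path.write_text(yaml.safe_dump(config, sort_keys=False))
```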
Typical optimization results:

- Speed: 20-70% faster token generation
- Memory: 10-20% lower RAM usage
- Stability: Reduced out-of-memory errors
- Efficiency: Better CPU/GPU utilization
```
ollama-bench/
├── ollama_bench/                # Main package
│   ├── core/                    # Core functionality
│   │   ├── benchmark.py         # Benchmarking engine
│   │   ├── models.py            # Model management
│   │   ├── monitor.py           # Resource monitoring
│   │   ├── config.py            # Configuration
│   │   ├── system_optimizer.py  # Hardware optimization
│   │   └── batch_optimizer.py   # Batch processing
│   ├── tui/                     # Terminal UI
│   │   ├── app.py               # Main TUI application
│   │   ├── components/          # UI components
│   │   └── widgets/             # Interactive widgets
│   ├── cli.py                   # CLI interface
│   └── utils/                   # Utilities
├── optimize_model.py            # Single model optimizer
├── optimize_all.py              # Batch optimizer
└── setup.py                     # Package setup
```
```bash
# Run tests
python test_optimization.py

# Test TUI import
python -c "from ollama_bench.tui import OllamaBenchTUI"

# Test CLI
python -m ollama_bench.cli
```

- If you see Unicode errors, the tool automatically falls back to ASCII
- For best results, use a terminal that supports UTF-8
- NVIDIA: Requires nvidia-ml-py (see the detection sketch after this list)
- Apple Silicon: Automatic Metal acceleration
- No GPU: Falls back to CPU-only optimization
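GPU detection with these libraries is straightforward; here is a minimal sketch that mirrors the fallback order above (NVIDIA via pynvml, then Apple Silicon, then CPU-only), not necessarily the tool's exact code:

```python
# Detect which GPU path applies: NVIDIA (via pynvml), Apple Silicon, or CPU-only.
import platform

def detect_gpu() -> str:
    try:
        import pynvml  # provided by the nvidia-ml-py / pynvml package
        pynvml.nvmlInit()
        if pynvml.nvmlDeviceGetCount() > 0:
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)
            vram_gb = pynvml.nvmlDeviceGetMemoryInfo(handle).total / 1024**3
            return f"NVIDIA GPU ({vram_gb:.0f} GB VRAM)"
    except Exception:
        pass  # pynvml missing or no NVIDIA driver: fall through

    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "Apple Silicon GPU (Metal)"
    return "No GPU: CPU-only optimization"

print(detect_gpu())
```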
- Reduce `num_ctx` for lower memory usage (see the sketch after this list)
- Enable `low_vram` mode for limited GPU memory
- Use quantized models (q4_0, q4_K_M)
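One way to try a smaller context without editing a Modelfile is to pass it per request through Ollama's local HTTP API (standard library only; assumes Ollama is running on its default port):

```python
# Request a reduced context window at generation time via the Ollama HTTP API.
import json
import urllib.request

payload = {
    "model": "llama2:7b",                  # any locally available model
    "prompt": "Say hello in one sentence.",
    "stream": False,
    "options": {"num_ctx": 2048},          # smaller context to lower memory use
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```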
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Run tests and benchmarks
- Submit a pull request
MIT License - see LICENSE file
- Ollama team for the excellent local LLM platform
- Python curses library for terminal UI capabilities
- psutil for cross-platform system monitoring
- Added Modelfile optimization based on system specs
- Implemented batch model optimization
- Added hardware detection and profiling
- Improved TUI with optimization features
- Added parallel processing support
- Fixed terminal compatibility issues
- Initial release with TUI and CLI interfaces
- Basic benchmarking functionality
- Resource monitoring
- Model management
Author: Jeremiah Pegues
Email: jeremiah@pegues.io
GitHub: github.com/peguesj
Built with ❤️ for the Ollama community