This repository contains AI/LLM benchmarks for single node configurations and benchmarking data compiled by Jeff Geerling, using llama.cpp and Ollama.
For automated AI cluster benchmarking, see Beowulf AI Cluster. Results from that testing are also listed in this README file below.
Benchmarking AI models is daunting, because you have to deal with hardware issues, OS issues, driver issues, stability issues... and that's all before deciding on:
- What models to benchmark (which quantization, what particular gguf, etc.?)
- How to benchmark the models (what context size, with or without features like flash attention, etc.?)
- What results to worry about (prompt processing speed, generated tokens per second, etc.?)
Most Linux distributions should Just Work™. However, this project is most frequently tested against systems running:
- Debian Linux
- Ubuntu Linux
- Fedora Linux
- macOS
For macOS: This project assumes you already have the build dependencies `cmake`, `ninja`, `wget`, and `curl` installed. These can be installed with Homebrew using `brew install cmake ninja wget curl`.
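On the Debian-based distributions above, the equivalent build tooling can usually be installed with apt; the package list below is an assumption about what a typical llama.cpp build needs, so adjust it for your system:

```
# Assumed llama.cpp build dependencies on Debian/Ubuntu (adjust as needed)
sudo apt update
sudo apt install -y build-essential cmake ninja-build git wget curl
```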
Most of the time I rely on llama.cpp, as it is more broadly compatible, works with more models on more systems, and picks up hardware-acceleration features more quickly than Ollama. For example, llama.cpp supported Vulkan for years before Ollama did. Vulkan enables many AMD and Intel GPUs (as well as other Vulkan-compatible iGPUs) to be used for LLM inference.
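As a rough sketch (the CMake flag names can change between llama.cpp releases, so check the upstream build docs), a Vulkan-enabled llama.cpp build looks something like this:

```
# Configure and build llama.cpp with the Vulkan backend
# (requires the Vulkan SDK / headers to be installed first)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j "$(nproc)"
```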
The repository includes a Pyinfra script to run either llama.cpp or Ollama benchmarks with any given LLM.
If you already have llama.cpp installed, you can run a quick benchmark using the llama-bench tool directly:
```
# Download a model (gguf)
cd models && wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf && cd ..

# Run a benchmark (-p: prompt sizes, -n: tokens to generate, -pg: combined prompt + generation test,
# -ngl: layers to offload to the GPU, -r: repetitions per test)
./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
```
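For a rough CPU-only comparison point (not part of the standard benchmark set in this repo), you can set `-ngl 0` so no layers are offloaded to the GPU:

```
# Same model, but keep all layers on the CPU for a baseline
./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512 -ngl 0 -r 2
```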
But if you want llama.cpp compiled automatically, and to run one or more llama-bench passes with customizable options, you can use Pyinfra to do so:
- Make a copy of the inventory file and edit it to point to your server (or localhost; see the note after this list): `cp example.inventory.py inventory.py`
- Edit the variables inside `group_data/all.py` to your liking.
- Run the benchmarks: `pyinfra inventory.py ai-benchmarks.py -y`
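If you only want to benchmark the machine Pyinfra itself is running on, pyinfra's built-in `@local` connector can stand in for an inventory file. This is a general pyinfra feature rather than something this playbook documents, so if `group_data/all.py` isn't picked up that way, fall back to the inventory approach above:

```
# Run the benchmark playbook against the local machine only
pyinfra @local ai-benchmarks.py -y
```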
On systems where Ollama is supported and runs well, you can run the obench.sh script directly. It can run a predefined benchmark on Ollama one or more times, and generate an average score.
For a quick installation of Ollama, try:
curl -fsSL https://ollama.com/install.sh | sh
If you're not running Linux, download Ollama from the official site.
Verify you can run ollama with a given model:
ollama run llama3.2:3b
Then run this benchmark script, for three runs, summarizing the data in a Markdown table:
./obench.sh -m llama3.2:3b -c 3 --markdown
Uninstall Ollama following the official uninstall instructions.
```
Usage: ./obench.sh [OPTIONS]

Options:
  -h, --help     Display this help message
  -d, --default  Run a benchmark using some default small models
  -m, --model    Specify a model to use
  -c, --count    Number of times to run the benchmark
  --ollama-bin   Point to ollama executable or command (e.g. if using Docker)
  --markdown     Format output as markdown
```
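As an example of the `--ollama-bin` option, if Ollama is running in its official Docker container you can point the script at a `docker exec` wrapper. The container name and the quoting below are assumptions based on Ollama's standard Docker instructions, so verify they match your setup:

```
# Start Ollama in Docker (per Ollama's Docker instructions), pull a model, then benchmark through it
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec ollama ollama pull llama3.2:3b
./obench.sh -m llama3.2:3b -c 3 --markdown --ollama-bin "docker exec ollama ollama"
```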
All results are sorted by token generation rate (tg), listed here as 'Eval Rate', in descending order. Eventually I may find a better way to sort these findings, and include more data. For now, click through the System name to find all the test details.
| System | CPU/GPU | Eval Rate | Power (Peak) |
|---|---|---|---|
| Mac Studio (M3 Ultra 512GB) | GPU | 19.89 Tokens/s | 261.8 W |
| AmpereOne A192-32X (512GB) | CPU | 4.18 Tokens/s | 477 W |
| System | CPU/GPU | Eval Rate | Power (Peak) |
|---|---|---|---|
| Intel 265K Custom PC (Nvidia RTX 4090) | GPU | 279.65 Tokens/s | 509.1 W |
| Pi CM5 - 16GB (Nvidia RTX 4090<sup>1</sup>) | GPU | 189.09 Tokens/s | 322.1 W |
| Intel 265K Custom PC (AMD Radeon AI Pro R9700) | GPU | 163.26 Tokens/s | 332.9 W |
| Intel 265K Custom PC (Nvidia RTX 4070 Ti) | GPU | 163.14 Tokens/s | 298.7 W |
| Pi CM5 - 16GB (Nvidia RTX 3080 Ti<sup>1</sup>) | GPU | 143.73 Tokens/s | 434 W |
| Pi CM5 - 16GB (Nvidia RTX 4070 Ti<sup>1</sup>) | GPU | 132.88 Tokens/s | 235 W |
| Mac Studio (M3 Ultra 512GB) | GPU | 115.29 Tokens/s | 227 W |
| Intel 265K Custom PC (Nvidia RTX A4000) | GPU | 93.55 Tokens/s | 199.4 W |
| Pi CM5 - 16GB (Nvidia RTX A4000<sup>1</sup>) | GPU | 90.68 Tokens/s | 163 W |
| Dell Pro Max with GB10 (Nvidia Spark) | GPU | 87.97 Tokens/s | 122.8 W |
| Intel 265K Custom PC (Nvidia RTX 3060) | GPU | 83.96 Tokens/s | 229.3 W |
| Pi CM5 - 16GB (Nvidia RTX 3060<sup>1</sup>) | GPU | 79.14 Tokens/s | 180.9 W |
| Pi CM5 - 16GB (AMD Radeon AI Pro R9700<sup>1</sup>) | GPU | 19.59 Tokens/s | 318.4 W |
| System | CPU/GPU | Eval Rate | Power (Peak) |
|---|---|---|---|
| Mac Studio (M3 Ultra 512GB) | GPU | 14.08 Tokens/s | 243 W |
| M1 Max Mac Studio (10 core - 64GB) | GPU | 7.25 Tokens/s | N/A |
| Framework Desktop Mainboard (395+) | GPU/CPU | 4.97 Tokens/s | 133 W |
| Dell Pro Max with GB10 (Nvidia Spark) | GPU | 4.71 Tokens/s | 156 W |
| AmpereOne A192-32X (512GB) | CPU | 3.86 Tokens/s | N/A |
| Ryzen 9 7900X (Nvidia 4090) | GPU/CPU | 3.10 Tokens/s | N/A |
| Raspberry Pi CM5 Cluster (10x 16GB) | CPU | 0.85 Tokens/s | 70 W |
| Minisforum MS-R1 | CPU | 0.77 Tokens/s | 38.2 W |
<sup>1</sup> These GPUs were tested using llama.cpp with Vulkan support.
| System | CPU/GPU | Eval Rate | Power (Peak) |
|---|---|---|---|
| Mac Studio (M3 Ultra 512GB) | GPU | 1.96 Tokens/s | 256.4 W |
| AmpereOne A192-32X (512GB) | CPU | 0.90 Tokens/s | N/A |
| Framework Mainboard Cluster (512GB) | GPU | 0.71 Tokens/s | N/A |
These benchmarks are in no way comprehensive, and I normally only compare one aspect of generative AI performance: inference tokens per second. There are many other aspects that are just as important (or more important) which my benchmarking does not cover, though sometimes I get deeper into the weeds in individual issues.
See All about Timing: A quick look at metrics for LLM serving for a good overview of other metrics you may want to compare.
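As one hedged example of collecting richer timing data yourself, Ollama's HTTP API reports prompt and generation token counts and durations (in nanoseconds), which can be turned into separate prompt-processing and generation rates; the jq expression below assumes those standard response fields are present:

```
# Query Ollama's API directly and derive prompt-processing and generation rates
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Why is the sky blue?", "stream": false}' |
  jq '{
    prompt_tokens_per_s: (.prompt_eval_count / .prompt_eval_duration * 1e9),
    eval_tokens_per_s: (.eval_count / .eval_duration * 1e9)
  }'
```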
This benchmark was originally based on tabletuser-blogspot/ollama-benchmark, an upstream project focused only on Ollama. This fork is maintained by Jeff Geerling.