This repository contains AI/LLM benchmarks for single node configurations and benchmarking data compiled by Jeff Geerling, using llama.cpp and Ollama.
For automated AI cluster benchmarking, see Beowulf AI Cluster. Results from that testing are also listed in this README file below.
Benchmarking AI models is daunting, because you have to deal with hardware issues, OS issues, driver issues, stability issues... and that's all before deciding on:
- What models to benchmark (which quantization, what particular gguf, etc.?)
- How to benchmark the models (what context size, with or without features like flash attention, etc.?)
- What results to worry about (prompt processing speed, generated tokens per second, etc.?)
Most Linux distributions should Just Work™. However, this project is most frequently tested against systems running:
- Debian Linux
- Ubuntu Linux
- Fedora Linux
- macOS
For macOS: This project assumes you already have the build dependencies `cmake`, `ninja`, `wget`, and `curl` installed. These can be installed with Homebrew using `brew install cmake ninja wget curl`.
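On the Debian-based distributions above, the equivalent build tooling can usually be installed with apt; the package list below is an assumption about what a typical llama.cpp build needs, so adjust it for your system:

```
# Assumed llama.cpp build dependencies on Debian/Ubuntu (adjust as needed)
sudo apt update
sudo apt install -y build-essential cmake ninja-build git wget curl
```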
Most of the time I rely on llama.cpp, as it is more broadly compatible, works with more models on more systems, and picks up hardware-acceleration features more quickly than Ollama. For example, llama.cpp supported Vulkan for years before Ollama did. Vulkan enables many AMD and Intel GPUs (as well as other Vulkan-compatible iGPUs) to be used for LLM inference.
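As a rough sketch (the CMake flag names can change between llama.cpp releases, so check the upstream build docs), a Vulkan-enabled llama.cpp build looks something like this:

```
# Configure and build llama.cpp with the Vulkan backend
# (requires the Vulkan SDK / headers to be installed first)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j "$(nproc)"
```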
The repository includes a Pyinfra script to run either llama.cpp or Ollama benchmarks with any given LLM.
If you already have llama.cpp installed, you can run a quick benchmark using the llama-bench tool directly:
```
# Download a model (gguf)
cd models && wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf && cd ..

# Run a benchmark (-p: prompt sizes, -n: tokens to generate, -pg: combined prompt + generation test,
# -ngl: layers to offload to the GPU, -r: repetitions per test)
./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
```
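For a rough CPU-only comparison point (not part of the standard benchmark set in this repo), you can set `-ngl 0` so no layers are offloaded to the GPU:

```
# Same model, but keep all layers on the CPU for a baseline
./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512 -ngl 0 -r 2
```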
But if you want llama.cpp compiled automatically, and to run one or more llama-bench passes with customizable options, you can use Pyinfra to do so:
- Make a copy of the inventory file and edit it to point to your server (or localhost; see the note after this list): `cp example.inventory.py inventory.py`
- Edit the variables inside `group_data/all.py` to your liking.
- Run the benchmarks: `pyinfra inventory.py ai-benchmarks.py -y`
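If you only want to benchmark the machine Pyinfra itself is running on, pyinfra's built-in `@local` connector can stand in for an inventory file. This is a general pyinfra feature rather than something this playbook documents, so if `group_data/all.py` isn't picked up that way, fall back to the inventory approach above:

```
# Run the benchmark playbook against the local machine only
pyinfra @local ai-benchmarks.py -y
```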
On systems where Ollama is supported and runs well, you can run the obench.sh script directly. It can run a predefined benchmark on Ollama one or more times, and generate an average score.
For a quick installation of Ollama, try:
curl -fsSL https://ollama.com/install.sh | sh
If you're not running Linux, download Ollama from the official site.
Verify you can run ollama with a given model:
ollama run llama3.2:3b
Then run this benchmark script, for three runs, summarizing the data in a Markdown table:
./obench.sh -m llama3.2:3b -c 3 --markdown
Uninstall Ollama following the official uninstall instructions.
```
Usage: ./obench.sh [OPTIONS]

Options:
  -h, --help     Display this help message
  -d, --default  Run a benchmark using some default small models
  -m, --model    Specify a model to use
  -c, --count    Number of times to run the benchmark
  --ollama-bin   Point to ollama executable or command (e.g. if using Docker)
  --markdown     Format output as markdown
```
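As an example of the `--ollama-bin` option, if Ollama is running in its official Docker container you can point the script at a `docker exec` wrapper. The container name and the quoting below are assumptions based on Ollama's standard Docker instructions, so verify they match your setup:

```
# Start Ollama in Docker (per Ollama's Docker instructions), pull a model, then benchmark through it
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec ollama ollama pull llama3.2:3b
./obench.sh -m llama3.2:3b -c 3 --markdown --ollama-bin "docker exec ollama ollama"
```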
All results are sorted by token generation rate (tg), listed here as 'Eval Rate', in descending order. Eventually I may find a better way to sort these findings, and include more data. For now, click through the System name to find all the test details.
| System | CPU/GPU | Eval Rate | Power (Peak) |
|---|---|---|---|
| Mac Studio (M3 Ultra 512GB) | GPU | 19.89 Tokens/s | 261.8 W |
| AmpereOne A192-32X (512GB) | CPU | 4.18 Tokens/s | 477 W |
| System | CPU/GPU | Eval Rate | Power (Peak) |
|---|---|---|---|
| Intel 265K Custom PC (Nvidia RTX 4090) | GPU | 279.65 Tokens/s | 509.1 W |
| Pi CM5 - 16GB (Nvidia RTX 4090<sup>1</sup>) | GPU | 189.09 Tokens/s | 322.1 W |
| Intel 265K Custom PC (AMD Radeon AI Pro R9700) | GPU | 163.26 Tokens/s | 332.9 W |
| Intel 265K Custom PC (Nvidia RTX 4070 Ti) | GPU | 163.14 Tokens/s | 298.7 W |
| Pi CM5 - 16GB (Nvidia RTX 3080 Ti<sup>1</sup>) | GPU | 143.73 Tokens/s | 434 W |
| Pi CM5 - 16GB (Nvidia RTX 4070 Ti<sup>1</sup>) | GPU | 132.88 Tokens/s | 235 W |
| Mac Studio (M3 Ultra 512GB) | GPU | 115.29 Tokens/s | 227 W |
| Intel 265K Custom PC (Nvidia RTX A4000) | GPU | 93.55 Tokens/s | 199.4 W |
| Pi CM5 - 16GB (Nvidia RTX A4000<sup>1</sup>) | GPU | 90.68 Tokens/s | 163 W |
| Dell Pro Max with GB10 (Nvidia Spark) | GPU | 87.97 Tokens/s | 122.8 W |
| Intel 265K Custom PC (Nvidia RTX 3060) | GPU | 83.96 Tokens/s | 229.3 W |
| Pi CM5 - 16GB (Nvidia RTX 3060<sup>1</sup>) | GPU | 79.14 Tokens/s | 180.9 W |
| Pi CM5 - 16GB (AMD Radeon AI Pro R9700<sup>1</sup>) | GPU | 19.59 Tokens/s | 318.4 W |
| System | CPU/GPU | Eval Rate | Power (Peak) |
|---|---|---|---|
| Mac Studio (M3 Ultra 512GB) | GPU | 14.08 Tokens/s | 243 W |
| M1 Max Mac Studio (10 core - 64GB) | GPU | 7.25 Tokens/s | N/A |
| Framework Desktop Mainboard (395+) | GPU/CPU | 4.97 Tokens/s | 133 W |
| Dell Pro Max with GB10 (Nvidia Spark) | GPU | 4.71 Tokens/s | 156 W |
| AmpereOne A192-32X (512GB) | CPU | 3.86 Tokens/s | N/A |
| Ryzen 9 7900X (Nvidia 4090) | GPU/CPU | 3.10 Tokens/s | N/A |
| Raspberry Pi CM5 Cluster (10x 16GB) | CPU | 0.85 Tokens/s | 70 W |
| Minisforum MS-R1 | CPU | 0.77 Tokens/s | 38.2 W |
<sup>1</sup> These GPUs were tested using llama.cpp with Vulkan support.
| System | CPU/GPU | Eval Rate | Power (Peak) |
|---|---|---|---|
| Mac Studio (M3 Ultra 512GB) | GPU | 1.96 Tokens/s | 256.4 W |
| AmpereOne A192-32X (512GB) | CPU | 0.90 Tokens/s | N/A |
| Framework Mainboard Cluster (512GB) | GPU | 0.71 Tokens/s | N/A |
These benchmarks are in no way comprehensive, and I normally only compare one aspect of generative AI performance: inference tokens per second. There are many other aspects that are just as important (or more important) which my benchmarking does not cover, though sometimes I get deeper into the weeds in individual issues.
See All about Timing: A quick look at metrics for LLM serving for a good overview of other metrics you may want to compare.
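As one hedged example of collecting richer timing data yourself, Ollama's HTTP API reports prompt and generation token counts and durations (in nanoseconds), which can be turned into separate prompt-processing and generation rates; the jq expression below assumes those standard response fields are present:

```
# Query Ollama's API directly and derive prompt-processing and generation rates
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.2:3b", "prompt": "Why is the sky blue?", "stream": false}' |
  jq '{
    prompt_tokens_per_s: (.prompt_eval_count / .prompt_eval_duration * 1e9),
    eval_tokens_per_s: (.eval_count / .eval_duration * 1e9)
  }'
```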
This benchmark was originally based on tabletuser-blogspot/ollama-benchmark, an upstream project focused only on Ollama. This fork is maintained by Jeff Geerling.