AI/LLM Benchmarks (llama.cpp and Ollama)

This repository contains AI/LLM benchmarks for single-node configurations, along with benchmarking data compiled by Jeff Geerling using llama.cpp and Ollama.

For automated AI cluster benchmarking, see Beowulf AI Cluster. Results from that testing are also listed in this README file below.

Benchmarking AI models is daunting: you have to deal with hardware issues, OS issues, driver issues, and stability issues... and that's all before deciding on:

  1. Which models to benchmark (which quantization, which particular GGUF, etc.)?
  2. How to benchmark the models (what context size, with or without features like flash attention, etc.)?
  3. Which results to focus on (prompt processing speed, generated tokens per second, etc.)?

OS / Distro Support

Most Linux distributions should Just Work™. However, this project is most frequently tested against systems running:

  • Debian Linux
  • Ubuntu Linux
  • Fedora Linux
  • macOS

For macOS: This project assumes you already have the build dependencies cmake, ninja, wget, and curl installed. These can be installed with Homebrew:

brew install cmake ninja wget curl

Llama.cpp Benchmarks

Most of the time I rely on llama.cpp, as it is more broadly compatible, works with more models on more systems, and adopts hardware-acceleration features more quickly than Ollama. For example, llama.cpp supported Vulkan for years before Ollama did; Vulkan enables many AMD and Intel GPUs (as well as other Vulkan-compatible iGPUs) to be used for LLM inference.
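If you're unsure whether a given GPU is even visible to Vulkan, one quick optional sanity check (not part of this repository's scripts) is the vulkaninfo tool from the vulkan-tools package:

# List Vulkan-capable devices (assumes vulkan-tools is installed)
vulkaninfo --summary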

The repository includes a Pyinfra script to run either llama.cpp or Ollama benchmarks with any given LLM.

If you already have llama.cpp installed, you can run a quick benchmark using the llama-bench tool directly:

# Download a model (gguf)
cd models && wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf && cd ..

# Run a benchmark
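# Flag reference: -p 512,4096 = prompt-processing test sizes, -n 128 = tokens to
# generate, -pg 4096,128 = combined prompt + generation test, -ngl 99 = offload
# all model layers to the GPU, -r 2 = repetitions per test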
./build/bin/llama-bench -m models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2

But if you want llama.cpp compiled automatically, and to run one or more llama-bench passes with customizable options, you can use Pyinfra to do so:

  1. Make a copy of the inventory file and edit it to point to your server (or localhost): cp example.inventory.py inventory.py (see the sketch after this list)
  2. Edit the variables inside group_data/all.py to your liking.
  3. Run the benchmarks: pyinfra inventory.py ai-benchmarks.py -y
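Pyinfra inventories are plain Python files that define groups as lists of hosts. Here is a minimal sketch; the group name and hostname below are placeholders, and example.inventory.py in this repo is the authoritative starting point:

# inventory.py (sketch only -- copy example.inventory.py for the real template)
ai_benchmarks = [
    "@local",  # run the benchmark on this machine, no SSH needed
    # ("benchmark-host", {"ssh_user": "jeff"}),  # or an SSH-reachable host
]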

Ollama Benchmark

On systems where Ollama is supported and runs well, you can run the obench.sh script directly. It runs a predefined benchmark against Ollama one or more times and generates an average score.

For a quick installation of Ollama, try:

curl -fsSL https://ollama.com/install.sh | sh

If you're not running Linux, download Ollama from the official site.

Verify you can run ollama with a given model:

ollama run llama3.2:3b

Then run this benchmark script, for three runs, summarizing the data in a Markdown table:

./obench.sh -m llama3.2:3b -c 3 --markdown

Uninstall Ollama following the official uninstall instructions.

Ollama Benchmark CLI Options

Usage: ./obench.sh [OPTIONS]
Options:
 -h, --help      Display this help message
 -d, --default   Run a benchmark using some default small models
 -m, --model     Specify a model to use
 -c, --count     Number of times to run the benchmark
 --ollama-bin    Point to the ollama executable or command (e.g. if using Docker)
 --markdown      Format output as markdown
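For example, if Ollama runs inside a Docker container (assuming a container named ollama, which this repo does not set up for you), the --ollama-bin option can point the script at it; exact quoting may vary depending on your shell:

./obench.sh -m llama3.2:3b -c 3 --ollama-bin "docker exec ollama ollama" --markdown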

Findings

All results are sorted by token generation rate (tg), listed here as 'Eval Rate', in descending order. Eventually I may find a better way to sort these findings, and include more data. For now, click through the System name to find all the test details.

DeepSeek R1 14b

| System | CPU/GPU | Eval Rate | Power (Peak) |
|--------|---------|-----------|--------------|
| Intel 265K Custom PC (Nvidia RTX 4090) | GPU | 92.49 Tokens/s | 464.2 W |
| Pi CM5 - 16GB (Nvidia RTX 4090 ¹) | GPU | 83.31 Tokens/s | 397.5 W |
| Intel 265K Custom PC (Nvidia RTX 3080 Ti) | GPU | 70.74 Tokens/s | 495 W |
| Pi CM5 - 16GB (Nvidia RTX 3080 Ti ¹) | GPU | 67.74 Tokens/s | 476 W |
| Radxa Orion O6 - 16GB (Nvidia RTX 3080 Ti) | GPU | 64.58 Tokens/s | 465 W |
| Intel 265K Custom PC (AMD Radeon AI Pro R9700) | GPU | 53.72 Tokens/s | 330.8 W |
| Mac Studio (M3 Ultra 512GB) | GPU | 51.85 Tokens/s | 227 W |
| Intel 265K Custom PC (Nvidia RTX 4070 Ti) | GPU | 49.74 Tokens/s | 311.1 W |
| Pi CM5 - 16GB (Nvidia RTX 4070 Ti ¹) | GPU | 48.24 Tokens/s | 253 W |
| Pi CM5 - 16GB (Nvidia RTX A4000 ¹) | GPU | 36.21 Tokens/s | 162.8 W |
| M1 Ultra (48 GPU Core) 64GB | GPU | 35.89 Tokens/s | N/A |
| Intel 265K Custom PC (Nvidia RTX 3060) | GPU | 29.77 Tokens/s | 224 W |
| Pi CM5 - 16GB (Nvidia RTX 3060 ¹) | GPU | 29.40 Tokens/s | 193.7 W |
| Dell Pro Max with GB10 (Nvidia Spark) | GPU | 23.92 Tokens/s | 138.4 W |
| Pi 5 - 16GB (AMD Pro W7700 ¹) | GPU | 19.90 Tokens/s | 164 W |
| Framework Mainboard (128GB) | CPU | 11.37 Tokens/s | 140 W |
| Pi CM5 - 16GB (AMD AI Pro R9700 ¹) | GPU | 7.98 Tokens/s | 279.7 W |
| Framework 13 (Ryzen AI 5 340 16GB) | CPU | 5.83 Tokens/s | 50.4 W |
| Radxa Orion O6 - 16GB | CPU | 4.33 Tokens/s | 34.7 W |
| Minisforum MS-R1 | CPU | 3.39 Tokens/s | 38.4 W |
| GMKtec G3 Plus (Intel N150) - 16GB | CPU | 2.13 Tokens/s | 30.3 W |
| Pi 5 - 16GB | CPU | 1.20 Tokens/s | 13.0 W |

DeepSeek R1 671b

| System | CPU/GPU | Eval Rate | Power (Peak) |
|--------|---------|-----------|--------------|
| Mac Studio (M3 Ultra 512GB) | GPU | 19.89 Tokens/s | 261.8 W |
| AmpereOne A192-32X - 512GB | CPU | 4.18 Tokens/s | 477 W |

gpt-oss 20b

| System | CPU/GPU | Eval Rate | Power (Peak) |
|--------|---------|-----------|--------------|
| Intel 265K Custom PC (Nvidia RTX 4090) | GPU | 279.65 Tokens/s | 509.1 W |
| Pi CM5 - 16GB (Nvidia RTX 4090 ¹) | GPU | 189.09 Tokens/s | 322.1 W |
| Intel 265K Custom PC (AMD Radeon AI Pro R9700) | GPU | 163.26 Tokens/s | 332.9 W |
| Intel 265K Custom PC (Nvidia RTX 4070 Ti) | GPU | 163.14 Tokens/s | 298.7 W |
| Pi CM5 - 16GB (Nvidia RTX 3080 Ti ¹) | GPU | 143.73 Tokens/s | 434 W |
| Pi CM5 - 16GB (Nvidia RTX 4070 Ti ¹) | GPU | 132.88 Tokens/s | 235 W |
| Mac Studio (M3 Ultra 512GB) | GPU | 115.29 Tokens/s | 227 W |
| Intel 265K Custom PC (Nvidia RTX A4000) | GPU | 93.55 Tokens/s | 199.4 W |
| Pi CM5 - 16GB (Nvidia RTX A4000 ¹) | GPU | 90.68 Tokens/s | 163 W |
| Dell Pro Max with GB10 (Nvidia Spark) | GPU | 87.97 Tokens/s | 122.8 W |
| Intel 265K Custom PC (Nvidia RTX 3060) | GPU | 83.96 Tokens/s | 229.3 W |
| Pi CM5 - 16GB (Nvidia RTX 3060 ¹) | GPU | 79.14 Tokens/s | 180.9 W |
| Pi CM5 - 16GB (AMD Radeon AI Pro R9700 ¹) | GPU | 19.59 Tokens/s | 318.4 W |

Llama 3.2:3b

| System | CPU/GPU | Eval Rate | Power (Peak) |
|--------|---------|-----------|--------------|
| Intel 265K Custom PC (Nvidia RTX 4090) | GPU | 334.67 Tokens/s | 261.9 W |
| Intel 13900K (Nvidia 5090) | GPU | 271.40 Tokens/s | N/A |
| Intel 265K Custom PC (Nvidia RTX 3080 Ti) | GPU | 263.24 Tokens/s | 495 W |
| Ryzen 9 7900X (Nvidia 4090) | GPU | 237.05 Tokens/s | N/A |
| Pi CM5 - 16GB (Nvidia RTX 4090 ¹) | GPU | 235.61 Tokens/s | 290.2 W |
| Intel 13900K (Nvidia 4090) | GPU | 216.48 Tokens/s | N/A |
| Pi CM5 - 16GB (Nvidia RTX 3080 Ti ¹) | GPU | 208.50 Tokens/s | 394 W |
| Ryzen 9 7950X (Nvidia 4080) | GPU | 204.45 Tokens/s | N/A |
| Ryzen 9 7950X (Nvidia 4070 Ti Super) | GPU | 198.95 Tokens/s | N/A |
| Intel 265K Custom PC (Nvidia RTX 4070 Ti) | GPU | 196.15 Tokens/s | 266.9 W |
| Intel 265K Custom PC (AMD Radeon AI Pro R9700) | GPU | 195.91 Tokens/s | 333.8 W |
| Pi CM5 - 16GB (Nvidia RTX 4070 Ti ¹) | GPU | 166.06 Tokens/s | 228 W |
| Ryzen 9 5950X (Nvidia 4070) | GPU | 160.72 Tokens/s | N/A |
| Mac Studio (M3 Ultra 512GB) | GPU | 154.60 Tokens/s | 223 W |
| Intel 265K Custom PC (Nvidia RTX A4000) | GPU | 149.44 Tokens/s | 269.2 W |
| Pi CM5 - 16GB (Nvidia RTX A4000 ¹) | GPU | 134.47 Tokens/s | 162.7 W |
| Ryzen 9 9950X (AMD 7900 XT) | GPU | 131.2 Tokens/s | N/A |
| Intel 265K Custom PC (Nvidia RTX 3060) | GPU | 122.85 Tokens/s | 214 W |
| Pi CM5 - 16GB (Nvidia RTX 3060 ¹) | GPU | 112.77 Tokens/s | 192.3 W |
| M1 Ultra (48 GPU Core) 64GB | GPU | 108.67 Tokens/s | N/A |
| Pi 500+ - 16GB (AMD RX 7900 XT ¹) | GPU | 108.58 Tokens/s | 315 W |
| Dell Pro Max with GB10 (Nvidia Spark) | GPU | 97.93 Tokens/s | 128.5 W |
| System76 Thelio Astra (Nvidia A4000) | GPU | 90.92 Tokens/s | 244 W |
| Pi 500+ - 16GB (AMD RX 9070 XT ¹) | GPU | 89.63 Tokens/s | 304 W |
| System76 Thelio Astra (AMD Pro W7700 ¹) | GPU | 89.31 Tokens/s | 261 W |
| Framework Mainboard (128GB) | GPU | 88.14 Tokens/s | 133 W |
| Pi CM5 - 16GB (AMD AI Pro R9700 ¹) | GPU | 77.54 Tokens/s | 281 W |
| Minisforum MS-R1 (Nvidia RTX A2000) | GPU | 71.36 Tokens/s | 94.3 W |
| M1 Max Mac Studio (10 core - 64GB) | GPU | 59.38 Tokens/s | N/A |
| Pi 5 - 8GB (AMD Pro W7700 ¹) | GPU | 56.14 Tokens/s | 145 W |
| Pi 5 - 8GB (AMD RX 6700 XT ¹ 12GB) | GPU | 49.01 Tokens/s | 94 W |
| Pi 5 - 8GB (AMD RX 7600 ¹) | GPU | 48.47 Tokens/s | 156 W |
| Pi 500+ - 16GB (Intel Arc B580 ¹) | GPU | 47.38 Tokens/s | 146 W |
| M4 Mac mini (10 core - 32GB) | GPU | 41.31 Tokens/s | 30.1 W |
| Pi 5 - 8GB (AMD RX 6500 XT ¹) | GPU | 39.82 Tokens/s | 88 W |
| HiFive Premier P550 (AMD RX 580) | GPU | 36.23 Tokens/s | 150 W |
| System76 Thelio Astra (Nvidia A400) | GPU | 35.51 Tokens/s | 167 W |
| Pi 500+ - 16GB (Intel Arc Pro B50 ¹) | GPU | 29.80 Tokens/s | 78.5 W |
| Framework 13 (Ryzen AI 5 340) | CPU | 23.81 Tokens/s | 51.1 W |
| AmpereOne A192-32X (512GB) | CPU | 23.52 Tokens/s | N/A |
| Pi 500+ - 16GB (Intel Arc A310 ECO ¹) | GPU | 13.36 Tokens/s | 50 W |
| Minisforum MS-R1 | CPU | 12.12 Tokens/s | 35 W |
| GMKtec G3 Plus (Intel N150) - 16GB | CPU | 9.06 Tokens/s | 26.4 W |
| Pi 500+ - 16GB | CPU | 5.55 Tokens/s | 13 W |
| Pi 5 - 16GB | CPU | 4.88 Tokens/s | 11.9 W |
| Pi 5 - 8GB | CPU | 4.61 Tokens/s | 13.9 W |
| Pi 400 - 4GB | CPU | 1.60 Tokens/s | 6 W |
| Dell Optiplex 780 (C2Q Q8400) | CPU | 1.09 Tokens/s | 146 W |
| DC-ROMA Mainboard II (8-core RISC-V) | CPU | 0.31 Tokens/s | 30.6 W |
| HiFive Premier P550 (4-core RISC-V) | CPU | 0.24 Tokens/s | 13.5 W |

Llama 3.1:70b

| System | CPU/GPU | Eval Rate | Power (Peak) |
|--------|---------|-----------|--------------|
| Mac Studio (M3 Ultra 512GB) | GPU | 14.08 Tokens/s | 243 W |
| M1 Max Mac Studio (10 core - 64GB) | GPU | 7.25 Tokens/s | N/A |
| Framework Desktop Mainboard (395+) | GPU/CPU | 4.97 Tokens/s | 133 W |
| Dell Pro Max with GB10 (Nvidia Spark) | GPU | 4.71 Tokens/s | 156 W |
| AmpereOne A192-32X (512GB) | CPU | 3.86 Tokens/s | N/A |
| Ryzen 9 7900X (Nvidia 4090) | GPU/CPU | 3.10 Tokens/s | N/A |
| Raspberry Pi CM5 Cluster (10x 16GB) | CPU | 0.85 Tokens/s | 70 W |
| Minisforum MS-R1 | CPU | 0.77 Tokens/s | 38.2 W |

¹ These GPUs were tested using llama.cpp with Vulkan support.

Llama 3.1:405b

| System | CPU/GPU | Eval Rate | Power (Peak) |
|--------|---------|-----------|--------------|
| Mac Studio (M3 Ultra 512GB) | GPU | 1.96 Tokens/s | 256.4 W |
| AmpereOne A192-32X (512GB) | CPU | 0.90 Tokens/s | N/A |
| Framework Mainboard Cluster (512GB) | GPU | 0.71 Tokens/s | N/A |

Further Reading

These benchmarks are in no way comprehensive, and I normally only compare one aspect of generative AI performance: inference tokens per second. There are many other aspects that are just as important (or more important) that my benchmarking does not cover, though sometimes I get deeper into the weeds in individual issues.

See All about Timing: A quick look at metrics for LLM serving for a good overview of other metrics you may want to compare.

Author

This benchmark was originally based on an upstream project focused only on Ollama, tabletuser-blogspot/ollama-benchmark. This fork is maintained by Jeff Geerling.
