Llama3.1 benchmarking

Configuration

Hardware:

Apple M2 Ultra
192 GB unified memory
76-core GPU
24-core CPU

Models:

Main model: Llama 3.1 70B Instruct, FP16
Draft model: Llama 3.1 8B Instruct, various quantization levels

Context: 4096 tokens

Summary

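Speculative decoding roughly doubles generation throughput (~2x): from 4.75 tokens per second without speculation to 9.46 tokens per second with the Q4_K draft model and 9 draft tokens.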
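
The ~2x figure can be checked against the runs recorded on this page. A minimal Python sketch, assuming the `tps` values printed by `duo` and the eval rate printed by `llama-cli` are comparable generation-throughput numbers; the dictionary layout and function names are illustrative and not part of either tool:

```python
# Throughput (tokens per second) copied from the runs on this page.
BASELINE_TPS = 4.75  # llama-cli eval rate, no speculation

# (draft quantization, --draft value) -> tps reported by duo.
# The FP16 run with --draft 5 has no tps recorded, so it is omitted.
RESULTS = {
    ("FP16", 4): 6.51197,
    ("Q6_K", 5): 8.07657,
    ("Q6_K", 6): 8.4809,
    ("Q6_K", 7): 8.41811,
    ("Q4_K", 6): 8.94025,
    ("Q4_K", 7): 9.25817,
    ("Q4_K", 8): 9.36503,
    ("Q4_K", 9): 9.4619,
    ("Q4_K", 10): 9.37436,
}


def speedups(results, baseline):
    """Return {config: speedup over baseline} for every recorded run."""
    return {cfg: tps / baseline for cfg, tps in results.items()}


def best_config(results):
    """Return the (quant, draft) pair with the highest throughput."""
    return max(results, key=results.get)


if __name__ == "__main__":
    for (quant, draft), s in sorted(speedups(RESULTS, BASELINE_TPS).items()):
        print(f"{quant:5s} draft={draft:2d}  {s:.2f}x")
    print("best:", best_config(RESULTS))
```

Each configuration was run once, so differences of a few hundredths of a token per second between neighbouring draft lengths may well be noise.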

No speculation

time ./llama-cli -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -n 512 -c 4096 -f ../llama_duo/test_prompt.txt -ngl 99

llama_print_timings:        load time =    5704.32 ms
llama_print_timings:      sample time =      22.17 ms /   512 runs   (    0.04 ms per token, 23090.11 tokens per second)
llama_print_timings: prompt eval time =     547.92 ms /    35 tokens (   15.65 ms per token,    63.88 tokens per second)
llama_print_timings:        eval time =  107556.41 ms /   511 runs   (  210.48 ms per token,     4.75 tokens per second)
llama_print_timings:       total time =  108173.03 ms /   546 tokens

1.56s user 5.04s system 5% cpu 1:54.20 total
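
The baseline of 4.75 tokens per second follows from the eval line above: 511 runs in 107556.41 ms. A small sketch of pulling that number out of a log, assuming the `llama_print_timings` line format shown here; the helper name `eval_tps` is hypothetical:

```python
import re

# Matches e.g. "llama_print_timings:        eval time =  107556.41 ms /   511 runs".
# Requiring only whitespace between the colon and "eval" keeps the
# "prompt eval time" line from matching.
EVAL_RE = re.compile(
    r"llama_print_timings:\s+eval time =\s*([\d.]+) ms /\s*(\d+) runs"
)


def eval_tps(log_text):
    """Return generation tokens per second computed from the eval-time line."""
    m = EVAL_RE.search(log_text)
    if m is None:
        raise ValueError("no eval time line found")
    elapsed_ms, runs = float(m.group(1)), int(m.group(2))
    return runs / (elapsed_ms / 1000.0)


LOG = """\
llama_print_timings: prompt eval time =     547.92 ms /    35 tokens (   15.65 ms per token,    63.88 tokens per second)
llama_print_timings:        eval time =  107556.41 ms /   511 runs   (  210.48 ms per token,     4.75 tokens per second)
"""

if __name__ == "__main__":
    print(f"{eval_tps(LOG):.2f} tokens per second")
```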

FP16 Draft

4 tokens speculation

time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.fp16.inst.gguf -f ./test_prompt.txt -n 512 --draft 4 -ngl 99 -ngld 0 -c 4096

tokens: 514 tps: 6.51197
ggml_metal_free: deallocating
./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md  -f  -n 512  4  9  1015.31s user 26.31s system 1136% cpu 1:31.65 total

5 tokens speculation

time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.fp16.inst.gguf -f ./test_prompt.txt -n 512 --draft 5 -ngl 99 -ngld 0 -c 4096
1138.39s user 26.48s system 1300% cpu 1:29.59 total

Q6_K Draft

5 token speculation

time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q6_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 5 -ngl 99 -ngld 0 -c 4096
tokens: 516 tps: 8.07657
ggml_metal_free: deallocating
./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md  -f  -n 512  5  9  595.52s user 14.90s system 773% cpu 1:18.91 total

6 token speculation

time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q6_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 6 -ngl 99 -ngld 0 -c 4096
tokens: 514 tps: 8.4809
ggml_metal_free: deallocating
./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md  -f  -n 512  5  9  662.96s user 12.44s system 992% cpu 1:08.08 total

7 token speculation

time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q6_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 7 -ngl 99 -ngld 0 -c 4096

tokens: 518 tps: 8.41811
ggml_metal_free: deallocating
./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md  -f  -n 512  7  9  746.57s user 13.10s system 1100% cpu 1:09.04 total

Q4_K Draft

6 tokens

time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q4_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 6 -ngl 99 -ngld 0 -c 4096
tokens: 515 tps: 8.94025
ggml_metal_free: deallocating
./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md  -f  -n 512  6  9  522.33s user 10.42s system 820% cpu 1:04.90 total

7 tokens

time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q4_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 7 -ngl 99 -ngld 0 -c 4096

tokens: 512 tps: 9.25817
ggml_metal_free: deallocating
./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md  -f  -n 512  7  9  587.84s user 11.65s system 953% cpu 1:02.90 total

8 tokens

time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q4_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 8 -ngl 99 -ngld 0 -c 4096
...
tokens: 513 tps: 9.36503
ggml_metal_free: deallocating
./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md  -f  -n 512  8  9  637.24s user 11.56s system 1042% cpu 1:02.25 total

9 tokens


time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q4_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 9 -ngl 99 -ngld 0 -c 4096

tokens: 512 tps: 9.4619
ggml_metal_free: deallocating
./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md  -f  -n 512  9  9  679.92s user 12.23s system 1120% cpu 1:01.76 total

10 tokens

time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q4_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 10 -ngl 99 -ngld 0 -c 4096

tokens: 519 tps: 9.37436
ggml_metal_free: deallocating
./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md  -f  -n 512  10    737.46s user 12.27s system 1188% cpu 1:03.07 total
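
The wall-clock totals reported by `time` give a second view of the Q4_K runs. A rough sketch, assuming the `m:ss.ss total` stamp format shown in the outputs above; the names are illustrative. Wall-clock time also covers model loading and prompt processing, so the ratio is expected to come out lower than the tps ratio:

```python
def wall_seconds(stamp):
    """Convert a `time` stamp like '1:54.20' (minutes:seconds) to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


BASELINE_WALL = "1:54.20"  # llama-cli, no speculation

# --draft value -> wall-clock total for the Q4_K draft runs on this page
Q4_K_WALL = {
    6: "1:04.90",
    7: "1:02.90",
    8: "1:02.25",
    9: "1:01.76",
    10: "1:03.07",
}


def wall_speedups(runs, baseline):
    """Return {draft: baseline wall time / run wall time}."""
    base = wall_seconds(baseline)
    return {draft: base / wall_seconds(t) for draft, t in runs.items()}


if __name__ == "__main__":
    for draft, s in sorted(wall_speedups(Q4_K_WALL, BASELINE_WALL).items()):
        print(f"draft={draft:2d}  {s:.2f}x wall-clock")
```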
