Llama3.1 benchmarking
Hardware:
- Apple M2 Ultra
- 192 GB
- 76-core GPU
- 24-core CPU
Models:
- Main model: llama3.1 70B instruct, fp16
- Draft model: llama3.1 8B instruct, various quantization levels
Context: 4096 tokens
With duo (the fp16 70B main model on GPU plus a smaller draft model kept on CPU via -ngld 0), generation runs roughly 2x faster than llama-cli with the main model alone: ~4.75 tokens/s baseline vs ~9.5 tokens/s for the best configuration below.
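The duo runs below pair the two models through speculative decoding: the cheap draft model proposes a block of --draft tokens, and the expensive main model verifies the whole block at once, so accepted tokens cost far less than one full 70B decode step each. A minimal sketch of that idea, with greedy acceptance and hypothetical draft_next / main_predict_all stand-ins for the two models (the general technique, not duo's actual implementation):

```python
# Sketch of one speculative-decoding step (greedy acceptance).
# draft_next(tokens) -> next token from the small draft model.
# main_predict_all(prefix, proposed) -> the main model's greedy token
# for each proposed position, computable in a single forward pass.

def speculative_step(prefix, draft_next, main_predict_all, n_draft):
    # 1. The draft model proposes n_draft tokens, one at a time.
    proposed = []
    for _ in range(n_draft):
        proposed.append(draft_next(prefix + proposed))

    # 2. The main model checks every proposed position in one pass.
    predictions = main_predict_all(prefix, proposed)

    # 3. Keep the proposal up to the first disagreement; at that point
    #    emit the main model's own token instead. The output matches
    #    what the main model would have generated greedily, only
    #    produced with fewer main-model decode steps.
    accepted = []
    for tok, pred in zip(proposed, predictions):
        if tok != pred:
            accepted.append(pred)
            break
        accepted.append(tok)
    return accepted
```

A larger --draft only helps while the draft model keeps guessing the main model's tokens, which is why the runs below sweep both the draft quantization and the --draft value.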
Baseline: the main model alone, via llama-cli:

```
time ./llama-cli -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -n 512 -c 4096 -f ../llama_duo/test_prompt.txt -ngl 99

llama_print_timings: load time = 5704.32 ms
llama_print_timings: sample time = 22.17 ms / 512 runs ( 0.04 ms per token, 23090.11 tokens per second)
llama_print_timings: prompt eval time = 547.92 ms / 35 tokens ( 15.65 ms per token, 63.88 tokens per second)
llama_print_timings: eval time = 107556.41 ms / 511 runs ( 210.48 ms per token, 4.75 tokens per second)
llama_print_timings: total time = 108173.03 ms / 546 tokens

1.56s user 5.04s system 5% cpu 1:54.20 total
```
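The eval time line above is the baseline the ~2x comparison uses; the reported rate follows directly from the run count and the elapsed eval time:

```python
# Baseline rate, recomputed from the llama-cli eval line above.
eval_ms, n_tokens = 107556.41, 511
print(n_tokens / (eval_ms / 1000.0))  # ~4.75 tokens per second
```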
Draft 8B fp16, --draft 4:

```
time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.fp16.inst.gguf -f ./test_prompt.txt -n 512 --draft 4 -ngl 99 -ngld 0 -c 4096

tokens: 514 tps: 6.51197
ggml_metal_free: deallocating
1015.31s user 26.31s system 1136% cpu 1:31.65 total
```

Draft 8B fp16, --draft 5:

```
time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.fp16.inst.gguf -f ./test_prompt.txt -n 512 --draft 5 -ngl 99 -ngld 0 -c 4096

1138.39s user 26.48s system 1300% cpu 1:29.59 total
```
Draft 8B Q6_K, --draft 5:

```
time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q6_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 5 -ngl 99 -ngld 0 -c 4096

tokens: 516 tps: 8.07657
ggml_metal_free: deallocating
595.52s user 14.90s system 773% cpu 1:18.91 total
```

Draft 8B Q6_K, --draft 6:

```
time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q6_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 6 -ngl 99 -ngld 0 -c 4096

tokens: 514 tps: 8.4809
ggml_metal_free: deallocating
662.96s user 12.44s system 992% cpu 1:08.08 total
```

Draft 8B Q6_K, --draft 7:

```
time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q6_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 7 -ngl 99 -ngld 0 -c 4096

tokens: 518 tps: 8.41811
ggml_metal_free: deallocating
746.57s user 13.10s system 1100% cpu 1:09.04 total
```
Draft 8B Q4_K, --draft 6:

```
time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q4_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 6 -ngl 99 -ngld 0 -c 4096

tokens: 515 tps: 8.94025
ggml_metal_free: deallocating
522.33s user 10.42s system 820% cpu 1:04.90 total
```

Draft 8B Q4_K, --draft 7:

```
time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q4_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 7 -ngl 99 -ngld 0 -c 4096

tokens: 512 tps: 9.25817
ggml_metal_free: deallocating
587.84s user 11.65s system 953% cpu 1:02.90 total
```

Draft 8B Q4_K, --draft 8:

```
time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q4_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 8 -ngl 99 -ngld 0 -c 4096

...
tokens: 513 tps: 9.36503
ggml_metal_free: deallocating
637.24s user 11.56s system 1042% cpu 1:02.25 total
```

Draft 8B Q4_K, --draft 9:

```
./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q4_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 9 -ngl 99 -ngld 0 -c 4096

tokens: 512 tps: 9.4619
ggml_metal_free: deallocating
679.92s user 12.23s system 1120% cpu 1:01.76 total
```

Draft 8B Q4_K, --draft 10:

```
time ./_build/duo -m ../llms/gguf/llama3.1.70b.f16.inst.gguf -md ../llms/gguf/llama8b.Q4_K.inst.gguf -f ./test_prompt.txt -n 512 --draft 10 -ngl 99 -ngld 0 -c 4096

tokens: 519 tps: 9.37436
ggml_metal_free: deallocating
737.46s user 12.27s system 1188% cpu 1:03.07 total
```
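Collected in one place, the throughput of each duo configuration above and its speedup over the 4.75 tokens/s llama-cli baseline (the fp16 --draft 5 run is left out because its tokens/tps line was not captured):

```python
# tps values copied from the duo runs above; baseline from llama-cli.
BASELINE_TPS = 4.75

runs = [  # (draft quantization, --draft, tokens per second)
    ("fp16", 4, 6.51197),
    ("Q6_K", 5, 8.07657),
    ("Q6_K", 6, 8.4809),
    ("Q6_K", 7, 8.41811),
    ("Q4_K", 6, 8.94025),
    ("Q4_K", 7, 9.25817),
    ("Q4_K", 8, 9.36503),
    ("Q4_K", 9, 9.4619),
    ("Q4_K", 10, 9.37436),
]

print(f"{'quant':>6} {'draft':>5} {'tps':>7} {'speedup':>8}")
for quant, draft, tps in runs:
    print(f"{quant:>6} {draft:>5} {tps:>7.2f} {tps / BASELINE_TPS:>7.2f}x")
```

The best run here is the Q4_K draft with --draft 9 at ~9.46 tokens/s, just about 2x the baseline; --draft 10 no longer improves on it.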