Bug: The token generation speed is slower compared to the upstream llama.cpp project #533

Open
BIGPPWONG opened this issue Aug 13, 2024 · 0 comments

Contact Details

No response

What happened?

Issue Description:

Token generation in llamafile is slower than in the upstream llama.cpp project.

Details:

  • llamafile version 0.8.12: ggml-cuda built with the command:
nvcc -arch=all -DIGNORE123 -O3 --shared --use_fast_math --forward-unknown-to-host-compiler --compiler-options "/nologo /EHsc /O2 /GR /MT" -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_CUBLAS -DTEHFLASH -o ggml-cuda.dll.all ggml-cuda.cu -lcublas -lcuda
  • llama.cpp versions tested:
    • b3567 (latest version)
    • b2968 (from May 22nd, so llamafile 0.8.12 should already include all upstream updates up to that version)

In comparison:

  • llamafile only achieves 26 tokens/s.
  • Both versions of llama.cpp achieve 51 tokens/s.
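
For reference, a rough sketch of one way to reproduce the comparison (not necessarily my exact invocations; -m, -ngl, -p, and -n are the standard llama.cpp options, which llamafile also accepts):

    # llama.cpp: benchmark generation speed with full GPU offload
    llama-bench -m qwen2-7b-instruct-q3_k_m.gguf -ngl 99 -p 512 -n 128

    # llamafile: run a short generation and read tokens/s from the printed timings
    llamafile -m qwen2-7b-instruct-q3_k_m.gguf -ngl 999 -p "Hello, my name is" -n 128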

GPU Utilization:

  • Using nvidia-smi, GPU utilization with llamafile is observed at around 41%, whereas with llama.cpp it reaches 80%.
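
(Utilization can be sampled with a simple polling query, e.g. the following, which prints GPU load and memory use once per second:)

    nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1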

Model Used for Testing:

  • Model: Qwen/Qwen2-7B-Instruct-GGUF
  • Specific file: qwen2-7b-instruct-q3_k_m.gguf
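
  (One way to fetch the exact file, if huggingface-cli is installed:)

    huggingface-cli download Qwen/Qwen2-7B-Instruct-GGUF qwen2-7b-instruct-q3_k_m.gguf --local-dir .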

Test Environment:

  • Operating System: Windows 10
  • GPU: RTX 2080
  • CUDA Version: 12.6

Version

llamafile v0.8.12

What operating system are you seeing the problem on?

Windows

Relevant log output

Logs Comparison:

  • llama.cpp Log (b3567):

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
    ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
    llm_load_tensors: ggml ctx size =    0.32 MiB
    llm_load_tensors: offloading 28 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 29/29 layers to GPU
    llm_load_tensors:        CPU buffer size =   223.33 MiB
    llm_load_tensors:      CUDA0 buffer size =  3402.96 MiB
  • llamafile Log:

    ggml_cuda_link: welcome to CUDA SDK with cuBLAS
    ...
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
    ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
    llm_load_tensors: ggml ctx size =    0.38 MiB
    llm_load_tensors: offloading 28 repeating layers to GPU
    llm_load_tensors: offloaded 28/29 layers to GPU
    llm_load_tensors:        CPU buffer size =  3626.29 MiB
    llm_load_tensors:      CUDA0 buffer size =  2976.59 MiB

The llamafile log is missing the line offloading non-repeating layers to GPU: only 28 of 29 layers are offloaded (versus 29/29 for llama.cpp), and the CPU buffer is much larger (3626.29 MiB vs. 223.33 MiB). I’m wondering if this could be the reason for the performance issue.
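
One way to test this (assuming llamafile honors the standard -ngl / --n-gpu-layers flag the same way llama.cpp does) would be to explicitly request all 29 layers and check whether the non-repeating layer gets offloaded and the speed recovers:

    # sketch only: force all 29 layers (including the non-repeating output layer) onto the GPU
    llamafile -m qwen2-7b-instruct-q3_k_m.gguf -ngl 29 -p "Hello" -n 128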
