
Bug: low CPU usage on AWS Graviton4 compared to ollama #503

Open
nlothian opened this issue Jul 24, 2024 · 0 comments

Contact Details

No response

What happened?

I'm trying llamafile on an ARM-based AWS Graviton4 machine, specifically an r8g.12xlarge instance (48 vCPU, 384 GiB RAM).
Using the llama3.1:70b model, I'm seeing much lower performance with llamafile than with ollama.

For ollama I'm seeing 4.62 TPS with 4734% CPU usage measured via top, whereas for llamafile I'm seeing 3.7 TPS with 1987% CPU usage.

I've tried the -t argument with llamafile to increase the number of threads, but I see no measurable improvement.

Is this expected at the moment, or are there other configuration options I should explore?
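For anyone reproducing, an invocation along these lines is what I'm describing (the model filename is a placeholder, not my exact file; -t is llamafile's thread-count flag):

```shell
# Run llamafile with an explicit thread count matching the 48 vCPUs.
# -t / --threads sets the number of threads used for generation.
./llamafile -m Meta-Llama-3.1-70B.Q4_0.gguf -t 48
```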

Version

llamafile v0.8.11

What operating system are you seeing the problem on?

Linux

Relevant log output

{"function":"print_timings","level":"INFO","line":327,"msg":"generation eval time =   88129.03 ms /   319 runs   (  276.27 ms per token,     3.62 tokens per second)","n_decoded":319,"n_tokens_second":3.619692707832883,"slot_id":0,"t_token":276.26654545454545,"t_token_generation":88129.028,"task_id":321,"tid":"34364393952","timestamp":1721802661}