
Bug: low CPU usage on AWS Graviton4 compared to ollama #503

Open
nlothian opened this issue Jul 24, 2024 · 0 comments

Contact Details

No response

What happened?

I'm trying llamafile on an ARM-based AWS Graviton4 machine, specifically an r8g.12xlarge instance (48 vCPU, 384 GiB RAM).
Using the llama3.1:70b model, I'm seeing much lower performance with llamafile than with ollama.

For ollama I'm seeing 4.62 TPS with 4734% CPU usage measured via top, whereas for llamafile I'm seeing 3.7 TPS with 1987% CPU usage.

I've tried the -t argument with llamafile to increase the number of threads, but I see no measurable improvement.

Is this expected at the moment, or are there other configuration options I should explore?
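For anyone reproducing, an invocation along these lines is what I'm describing (the model filename is a placeholder, not my exact file; -t is llamafile's thread-count flag):

```shell
# Run llamafile with an explicit thread count matching the 48 vCPUs.
# -t / --threads sets the number of threads used for generation.
./llamafile -m Meta-Llama-3.1-70B.Q4_0.gguf -t 48
```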

Version

llamafile v0.8.11

What operating system are you seeing the problem on?

Linux

Relevant log output

{"function":"print_timings","level":"INFO","line":327,"msg":"generation eval time =   88129.03 ms /   319 runs   (  276.27 ms per token,     3.62 tokens per second)","n_decoded":319,"n_tokens_second":3.619692707832883,"slot_id":0,"t_token":276.26654545454545,"t_token_generation":88129.028,"task_id":321,"tid":"34364393952","timestamp":1721802661}