Bug: NUMA support on Windows #502

Open
BernoulliBox opened this issue Jul 23, 2024 · 0 comments

Contact Details

git@nic.cix.co.uk

What happened?

Improved NUMA support on Windows

When running Llamafile on a NUMA (Non-Uniform Memory Access) Windows system, it is crucial for optimal performance that users can control which node Llamafile loads the model into, especially when loading multiple models simultaneously. Currently, Llamafile appears to ignore the /NODE option specified in the Windows start command: it always fills node 0's memory before utilizing node 1, regardless of the node specified. Assigning CPU cores from other nodes works fine; the problem is limited to loading the model into the RAM of the specified node.
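
For reference, start /NODE is documented to assign the launched process a preferred NUMA node. A minimal sketch of the programmatic equivalent, using the PROC_THREAD_ATTRIBUTE_PREFERRED_NODE process attribute (the helper name launch_on_node is made up for illustration, and the node number is passed as a USHORT per common usage):

#define _WIN32_WINNT 0x0601
#include <windows.h>

// Launch cmdline with a preferred NUMA node; roughly what "start /NODE n" does.
static BOOL launch_on_node(wchar_t *cmdline, USHORT node) {
    SIZE_T size = 0;
    BOOL ok = FALSE;
    InitializeProcThreadAttributeList(NULL, 1, 0, &size);  // query required size
    LPPROC_THREAD_ATTRIBUTE_LIST attrs = HeapAlloc(GetProcessHeap(), 0, size);
    if (!attrs) return FALSE;
    if (InitializeProcThreadAttributeList(attrs, 1, 0, &size) &&
        // Ask the kernel to prefer scheduling and page allocation on `node`.
        UpdateProcThreadAttribute(attrs, 0, PROC_THREAD_ATTRIBUTE_PREFERRED_NODE,
                                  &node, sizeof(node), NULL, NULL)) {
        STARTUPINFOEXW si = {0};
        si.StartupInfo.cb = sizeof(si);
        si.lpAttributeList = attrs;
        PROCESS_INFORMATION pi;
        ok = CreateProcessW(NULL, cmdline, NULL, NULL, FALSE,
                            EXTENDED_STARTUPINFO_PRESENT, NULL, NULL,
                            &si.StartupInfo, &pi);
        if (ok) { CloseHandle(pi.hThread); CloseHandle(pi.hProcess); }
    }
    DeleteProcThreadAttributeList(attrs);
    HeapFree(GetProcessHeap(), 0, attrs);
    return ok;
}

The preferred node is only a soft hint for page placement, which may be part of why the loading path described above does not end up honoring it.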

Current Behavior

Llamafile ignores the /NODE option specified in the Windows start command.
Memory is always filled on node 0 first, then node 1, regardless of the specified node.
This behavior causes performance issues when running multiple models on NUMA systems.

Expected Behavior

Llamafile should respect the /NODE option specified in the Windows start command.
Memory allocation should prioritize the specified node.
This would allow users to effectively distribute model loads across NUMA nodes for optimal performance.

Example

Currently, when using these Windows commands:

start /NODE 0 llamafile.exe ... llm_1.gguf
start /NODE 1 llamafile.exe ... llm_2.gguf

Both instances load into node 0's memory, which is not the expected behavior.
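
One way to confirm where the mapped pages actually land is QueryWorkingSetEx, which reports the NUMA node backing each resident page. A sketch (the sampling stride and the assumption that the pages have already been touched are ours):

#define _WIN32_WINNT 0x0601
#include <windows.h>
#include <psapi.h>   // link with psapi.lib
#include <stdio.h>

// Print the NUMA node backing each sampled page of [base, base+len).
// Pages must already be resident (touched) for Valid to be set.
static void print_page_nodes(void *base, size_t len, size_t stride) {
    for (size_t off = 0; off < len; off += stride) {
        PSAPI_WORKING_SET_EX_INFORMATION info = {0};
        info.VirtualAddress = (char *)base + off;
        if (QueryWorkingSetEx(GetCurrentProcess(), &info, sizeof(info)) &&
            info.VirtualAttributes.Valid) {
            printf("%p -> node %u\n", info.VirtualAddress,
                   (unsigned)info.VirtualAttributes.Node);
        }
    }
}

If the report above is accurate, running this over the model's address range in each instance would show node 0 for both.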

Impact

This issue significantly impacts performance when running multiple models on NUMA systems. It prevents full utilization of available cores due to the relatively slow interconnect between nodes when memory is not local to the executing cores.

Proposed Solution

Either:
(1) modify Llamafile to respect the Windows start command's /NODE option, or
(2) implement a command-line option for Llamafile (e.g., --numa-node) to specify the preferred NUMA node directly.

In either case, ensure that memory allocation prioritizes the specified node before utilizing other nodes.
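
A minimal sketch of how option (2) might work, assuming a hypothetical --numa-node flag parsed into node: instead of memory-mapping the .gguf, commit an anonymous buffer on the requested node with VirtualAllocExNuma and read the file into it.

#define _WIN32_WINNT 0x0601
#include <windows.h>

// Commit `size` bytes preferring NUMA node `node`; returns NULL on failure.
static void *alloc_weights_on_node(SIZE_T size, DWORD node) {
    ULONG highest = 0;
    if (!GetNumaHighestNodeNumber(&highest) || node > highest)
        return NULL;  // reject nodes the machine doesn't have
    // nndPreferred is a preference, not a guarantee: pages fault in on
    // `node` only while that node has free memory.
    return VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, node);
}

The trade-off is losing the benefits of a file-backed mapping (lazy loading, page-cache sharing between instances) in exchange for deterministic placement.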

Additional Context

This would be a significant improvement for users running multiple LLM instances on multi-socket Windows workstations or servers, enabling optimal distribution of workload and utilization of hardware.

It aligns with Llamafile's goal of providing efficient, flexible LLM deployment options.

Environment

OS: Windows 10 LTSC
Hardware: Dual-socket workstation with Intel Xeon processors (Broadwell), 256GB RAM per node

Version

Llamafile version: 0.8.9

What operating system are you seeing the problem on?

No response

Relevant log output

No response
