Add New VRAM Example #853
Conversation
This commit provides a VRAM transfer example in Python. The block-based transfer model is inspired by vLLM and shows transfer bandwidth with the specific configurations. Signed-off-by: Kota Tsuyuzaki <bloodeagle40234@gmail.com>
👋 Hi bloodeagle40234! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution and then trigger the CI to test your changes. 🚀
Because the copyright check apparently requires NVIDIA's copyright notice: https://github.com/ai-dynamo/nixl/blob/main/.github/workflows/copyright-check.sh#L32-L35
For reviewers: To pass the failed copyright check, I added commit bcba809, which changes the copyright notice from my own to the NVIDIA template. However, I'm not sure it's appropriate to use the NVIDIA copyright template, as I'm not an NVIDIA employee. If it's acceptable to update the check to allow more general copyright notices, including non-NVIDIA ones, I can remove the above commit.
Hi Kota, Thanks a lot for your contribution, this is really valuable. I have a few comments and noticed some issues; let's clarify/improve them before we proceed with merging the PR:
I have some more nitpicking comments, such as syntax details (for example, "VRAM" is not needed if xfer_desc_list is passed) and the name of the example: calling it a VRAM example downplays it, since it has more important capabilities (and examples like blocking_send_recv.py already use VRAM). Moein
Thanks Moein (@mkhazraee ) for your valuable comments and for taking a deep look at my code.
Perhaps I’m misunderstanding your suggestion regarding the NIXL API.
For reference, I reviewed the vLLM KV cache allocation code, and from that, I assumed vLLM allocates multiple tensors—each corresponding to a layer.
However, I may also be missing your point about how tensors are actually structured, such as whether you meant “one more dimension for multiple tensors” or “a single-dimensional tensor with slices,” or something else.
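To make sure we're talking about the same layout, here is a minimal sketch of the per-layer allocation pattern I had in mind (shapes and parameter names are illustrative only, not taken from vLLM's actual code):

```python
import torch

num_layers = 32           # assumed model depth
num_blocks = 1024         # assumed number of KV cache blocks
block_size = 16           # tokens per block
num_heads, head_dim = 8, 128

# One tensor per layer; each tensor holds the K and V blocks for that layer.
kv_caches = [
    torch.empty(2, num_blocks, block_size, num_heads, head_dim,
                dtype=torch.float16, device="cuda")
    for _ in range(num_layers)
]
```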
OK, this was due to my misunderstanding of the NIXL API.
Ah, does this mean we can use "KEEPALIVE" or "COMPLETE" directly as the notification to confirm the transfer is done, instead of using a "UUID"?
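For example, is something like the following what you mean? This is only a sketch assuming the nixl._api surface used in blocking_send_recv.py (check_remote_xfer_done / get_new_notifs); the agent names and the b"COMPLETE" tag are placeholders, and the exact str/bytes type of the tag may differ:

```python
from nixl._api import nixl_agent

target = nixl_agent("target")
# ... memory registration and metadata exchange with the peer omitted ...

# Option A: block until the peer's transfer tagged b"COMPLETE" has landed.
while not target.check_remote_xfer_done("initiator", b"COMPLETE"):
    pass

# Option B: drain raw notifications and match the tag ourselves.
notifs = target.get_new_notifs()
done = b"COMPLETE" in notifs.get("initiator", [])
```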
At this point, I’d like to clarify your suggestion further.
This comment seems to be related to the same context I responded to earlier, so I'm wondering if it would be acceptable to simply say:
I really appreciate your comprehensive review. I'd like to do my best to improve my work based on your feedback. Thanks!
This commit provides a Python example for VRAM (GPU memory) transfer. The block-based transfer model is inspired by vLLM and demonstrates transfer bandwidth under specific configurations.
What?
This commit adds a new Python example for VRAM (GPU memory) transfer. The example follows vLLM's block-based memory management and shows how to configure NIXL using the make_prepped_xfer API.
Why?
The current NIXL upstream does not include a VRAM-based example, which may make it unclear for users how to configure and debug data transfers between GPUs. To address this, this PR provides a working example. It also demonstrates index-based transfer, as used in vLLM, along with bandwidth performance results. This helps users understand KV cache transfer in disaggregated inference scenarios.
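A rough sketch of this index-based, block-granular pattern is shown below. It is illustrative only, not the example's actual code: the prep_xfer_dlist arguments, the "server" peer name, and the block sizes are assumptions.

```python
from nixl._api import nixl_agent
import torch

client = nixl_agent("client")

num_blocks, block_bytes = 1024, 2 * 1024 * 1024
pool = torch.empty(num_blocks, block_bytes, dtype=torch.uint8, device="cuda")
client.register_memory(client.get_reg_descs(pool))
# ... metadata exchange with the remote "server" agent omitted ...

# One descriptor per block, so transfers can address blocks by index.
local_descs = client.get_xfer_descs([pool[i] for i in range(num_blocks)])
local_prep = client.prep_xfer_dlist("", local_descs)           # local side (sentinel may differ)
# remote_descs would be the server's block descriptor list, received out of band.
remote_prep = client.prep_xfer_dlist("server", remote_descs)

indices = [3, 17, 42]  # blocks requested in this round
handle = client.make_prepped_xfer("READ", local_prep, indices,
                                  remote_prep, indices, b"DONE")
client.transfer(handle)
```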
How?
This PR includes three components and one documentation file. The example is based on a server-client model: the client reads specific tensor blocks from the running server and outputs performance results to the console.
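For reference, the bandwidth number is computed along these lines (a simplified sketch, not the exact output code; it assumes the transfer()/check_xfer_state() polling pattern and state strings used in the other NIXL Python examples):

```python
import time

def report_bandwidth(agent, handle, nbytes):
    """Time one prepared NIXL transfer and print the effective bandwidth."""
    start = time.perf_counter()
    agent.transfer(handle)
    while agent.check_xfer_state(handle) not in ("DONE", "ERR"):
        pass  # poll until the transfer completes or fails
    elapsed = time.perf_counter() - start
    print(f"moved {nbytes} bytes in {elapsed:.4f} s "
          f"({nbytes / elapsed / 1e9:.2f} GB/s)")
```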