
Conversation

bloodeagle40234

This commit provides a Python example for VRAM (GPU memory) transfer. The block-based transfer model is inspired by vLLM and demonstrates transfer bandwidth under specific configurations.

What?

This commit adds a new Python example for VRAM (GPU memory) transfer. The example follows vLLM's block-based memory management and shows how to configure NIXL using the make_prepped_xfer API.

Why?

The current NIXL upstream does not include a VRAM-based example, which can leave users unsure how to configure and debug data transfers between GPUs. To address this, this PR provides a working example. It also demonstrates index-based transfer, as used in vLLM, along with bandwidth results, which helps users understand KV cache transfer in disaggregated inference scenarios.

How?

This PR includes three components and one documentation file. The example is based on a server-client model: the client reads specific tensor blocks from the running server and outputs performance results to the console.
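For orientation, here is a minimal sketch of the client-side flow described above. It assumes the nixl Python bindings expose nixl_agent, register_memory, get_xfer_descs, prep_xfer_dlist, make_prepped_xfer, transfer, and check_xfer_state roughly as in the upstream Python examples; the exact signatures, the "server" agent name, the block shape, and the status strings are assumptions, not code taken from this PR:

```python
import torch
from nixl._api import nixl_agent, nixl_agent_config  # import path assumed


def run_client(server_ip: str, server_port: int, remote_descs) -> None:
    """Sketch: read a few GPU blocks from a running server agent."""
    agent = nixl_agent("client", nixl_agent_config(backends=["UCX"]))

    # vLLM-style pool: many fixed-size blocks on the local GPU (sizes illustrative).
    blocks = torch.zeros(64, 4096, dtype=torch.uint8, device="cuda")
    agent.register_memory(blocks)

    # Pull the server's metadata so its registered memory can be addressed.
    agent.fetch_remote_metadata("server", server_ip, server_port)

    # Prepare descriptor lists once; afterwards transfers are purely index-based.
    # "NIXL_INIT_AGENT" as the local-side name is an assumption of this sketch.
    local_descs = agent.get_xfer_descs([blocks[i] for i in range(blocks.shape[0])])
    local_side = agent.prep_xfer_dlist("NIXL_INIT_AGENT", local_descs)
    remote_side = agent.prep_xfer_dlist("server", remote_descs)  # descs advertised by the server

    indices = [0, 1, 2]  # block indices to move on each side
    handle = agent.make_prepped_xfer("READ", local_side, indices,
                                     remote_side, indices, "COMPLETE")
    agent.transfer(handle)
    while agent.check_xfer_state(handle) not in ("DONE", "ERR"):  # status strings assumed
        pass
    agent.release_xfer_handle(handle)
```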

This commit provides a VRAM transfer example in Python.
The block-based transfer model is inspired by vLLM and shows
transfer bandwidth under specific configurations.

Signed-off-by: Kota Tsuyuzaki <bloodeagle40234@gmail.com>
@bloodeagle40234 bloodeagle40234 requested a review from a team as a code owner October 2, 2025 08:14

copy-pr-bot bot commented Oct 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


github-actions bot commented Oct 2, 2025

👋 Hi bloodeagle40234! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution and then trigger the CI to test your changes.

🚀

@bloodeagle40234
Author

For reviewers:

To pass the failed copyright check, I added commit bcba809, which modifies the copyright notice from my own to the NVIDIA template.

However, I'm not sure if it's appropriate to use the NVIDIA copyright template, as I'm not an NVIDIA employee. If it's acceptable to update the check to allow more general copyright notices, including non-NVIDIA ones, I can remove the above commit.

@mkhazraee
Contributor

mkhazraee commented Oct 11, 2025

Hi Kota,

Thanks a lot for your contribution, this is really valuable. I have a few comments and noticed some issues; let's clarify and improve them before we proceed with merging the PR:

  • I see you create the tensors, then for each tensor get the address and length to create its descriptor. Why don't you create a tensor with one more dimension, register that full tensor, and then use slices of that tensor for xfer_descs? The NIXL API already supports tensors directly. Also, if your registrations are back to back, it's better to register once (IIRC similar to vLLM, which has a big pool of KV blocks).

  • In the client code, you call agent.fetch_remote_metadata("server", args.ip, args.port) in main without verifying the metadata was received; instead you call agent.check_remote_metadata("server") in each transfer, and actually check it twice there. As long as you don't call remove_remote_agent (which you never do), the metadata will not be removed. So I suggest adding that check once in main instead of in each transfer; it can come a few operations later, before the transfer calls.

  • I see you are practically doing blocking operations, always with the fixed notification "UUID", and after the blocking transfer is done you send another notification, which can be "KEEPALIVE" or "COMPLETE". It would make more sense to use that message as the notification of the transfer itself, which would be sent automatically when the transfer completes, as opposed to waiting for it to finish and then sending a second notification.

  • I see you're doing blocking transfers one after another. One great feature of NIXL is that all of the APIs are non-blocking, so you can start all the transfers together. To be closer to what vLLM does, you can start all the transfers, add the handles to a list, and remove them from the list as they complete (again, similar to the vLLM implementation; see the sketch after this list). When they're all done, you can use send_msg to indicate that all of them are complete. I also suggest putting something more informative in the notifications instead of KEEPALIVE or COMPLETE, say the block ID. That said, there is one point to consider here: we don't provide guarantees on transfer ordering, so you should check that all of them are done before sending that final notification. Or, if you really don't care, you can skip assigning any notifications to the transfers and just have one at the end, after checking that all the transfers are complete.

  • One point of having multiple transfers per agent in vLLM is that they go to different agents, but here all of them go to the same agent. So you could have made a single larger transfer, and the underlying backend might have optimized that transfer better. It's good to have this as an example, but let's clarify that with a comment.
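To make the registration and notification suggestions above concrete, here is a rough sketch of the non-blocking pattern: one pooled tensor registered once, one handle per outstanding transfer, the block ID carried in each notification, and a single final notification once everything has completed. The API names follow the nixl Python bindings discussed in this thread, but the exact signatures, the "server" agent name, and the final message are assumptions rather than code from this PR, and send_notif stands in for the send_msg helper mentioned above:

```python
import torch
from nixl._api import nixl_agent  # import path assumed


def read_blocks_nonblocking(agent: nixl_agent, remote_side, block_ids,
                            num_blocks: int = 256, block_len: int = 4096) -> torch.Tensor:
    """Sketch: issue all block reads at once, then notify when every one is done."""
    # One pool tensor with an extra leading "block" dimension: a single
    # registration covers every block, and each block is just a slice of it.
    pool = torch.zeros(num_blocks, block_len, dtype=torch.uint8, device="cuda")
    agent.register_memory(pool)
    local_descs = agent.get_xfer_descs([pool[i] for i in range(num_blocks)])
    local_side = agent.prep_xfer_dlist("NIXL_INIT_AGENT", local_descs)  # local-side name assumed

    # Start every transfer up front; the calls are non-blocking.
    pending = []
    for block_id in block_ids:
        handle = agent.make_prepped_xfer("READ", local_side, [block_id],
                                         remote_side, [block_id],
                                         f"block-{block_id}")  # informative notif instead of KEEPALIVE
        agent.transfer(handle)
        pending.append(handle)

    # Completion order is not guaranteed, so wait until every handle reports done.
    while pending:
        pending = [h for h in pending
                   if agent.check_xfer_state(h) not in ("DONE", "ERR")]  # status strings assumed

    # Only now tell the server that the whole batch has finished.
    agent.send_notif("server", "ALL_BLOCKS_DONE")
    return pool
```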

I have some more nitpicking comments, such as some syntax points (for example, "VRAM" is not needed if xfer_desc_list is passed), and calling this the "VRAM example" downplays it, since it demonstrates more important capabilities (existing examples such as blocking_send_recv.py already use VRAM).

Moein

@bloodeagle40234
Author

bloodeagle40234 commented Oct 14, 2025

Thanks Moein (@mkhazraee) for your valuable comments and for taking a deep look at my code.
Before I start making changes, I’d like to make sure I fully understand your suggestions.
I’ve added responses to each of your comments, so I’d appreciate it if you could check whether my understanding aligns with your intent.

  • I see you create the tensors, then for each tensor get the address and length to create its descriptor. Why don't you create a tensor with one more dimension, register that full tensor, and then use slices of that tensor for xfer_descs? The NIXL API already supports tensors directly. Also, if your registrations are back to back, it's better to register once (IIRC similar to vLLM, which has a big pool of KV blocks).

Perhaps I’m misunderstanding your suggestion regarding the NIXL API.
For reference, I reviewed the vLLM KV cache allocation code, and from that I assumed vLLM allocates multiple tensors, each corresponding to a layer.
However, I may also be missing your point about how the tensors are actually structured, such as whether you meant “one more dimension for multiple tensors,” “a single-dimensional tensor with slices,” or something else.
I agree with your point that vLLM manages a large pool of KV blocks. That said, in the example I referred to, the blocks are also managed across multiple tensors, with a single index list used by make_prepped_xfer. If you could point me to an example or a documentation reference, that would help me understand your idea.

  • In the client code, you call agent.fetch_remote_metadata("server", args.ip, args.port) in main without verifying the metadata was received; instead you call agent.check_remote_metadata("server") in each transfer, and actually check it twice there. As long as you don't call remove_remote_agent (which you never do), the metadata will not be removed. So I suggest adding that check once in main instead of in each transfer; it can come a few operations later, before the transfer calls.

OK, this was due to my misunderstanding of the NIXL API.
It seems preferable to call check_remote_metadata only once after fetch_remote_metadata, rather than calling it repeatedly inside the transfer loop.
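A small sketch of that ordering, assuming check_remote_metadata returns a truthy value once the metadata has arrived (the helper name and polling interval are illustrative). It would be called once from main, before the transfer loop, replacing the per-transfer checks:

```python
import time


def wait_for_server_metadata(agent, ip: str, port: int, poll_s: float = 0.1) -> None:
    """Fetch the server's metadata once and block until it is available locally."""
    agent.fetch_remote_metadata("server", ip, port)
    # check_remote_metadata is assumed to return True once the metadata is loaded.
    while not agent.check_remote_metadata("server"):
        time.sleep(poll_s)
```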

  • I see you are practically doing blocking operations, always with the fixed notification "UUID", and after the blocking transfer is done you send another notification, which can be "KEEPALIVE" or "COMPLETE". It would make more sense to use that message as the notification of the transfer itself, which would be sent automatically when the transfer completes, as opposed to waiting for it to finish and then sending a second notification.

Ah, does this mean we can use "KEEPALIVE" or "COMPLETE" directly as the notification to confirm the transfer is done, instead of using a "UUID"?
This was the part I struggled with during implementation, and it’s one of the reasons I was motivated to write this example.
If that’s the case, let me take some time to verify whether your recommended status synchronization logic between the server and client works well without relying on UUID-based notifications.

  • I see you're doing blocking transfers one after another. One great feature of NIXL is that all of the APIs are non-blocking, so you can start all the transfers together. To be closer to what vLLM does, you can start all the transfers, add the handles to a list, and remove them from the list as they complete (again, similar to the vLLM implementation). When they're all done, you can use send_msg to indicate that all of them are complete. I also suggest putting something more informative in the notifications instead of KEEPALIVE or COMPLETE, say the block ID. That said, there is one point to consider here: we don't provide guarantees on transfer ordering, so you should check that all of them are done before sending that final notification. Or, if you really don't care, you can skip assigning any notifications to the transfers and just have one at the end, after checking that all the transfers are complete.

At this point, I’d like to clarify your suggestion further.
In the current example, the code emulates transferring multiple blocks across tensors at once using make_prepped_xfer, and I assume this follows your suggestion for asynchronous transfer.
The reason I placed a for loop in the main function is simply to allow users to perform multiple attempts for throughput experiments. That’s why I believe each step should run in a blocking manner.
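Roughly, the structure being described, with hypothetical helper and argument names (transfer_all_blocks and num_attempts are placeholders, not names from the PR):

```python
import time
from typing import Callable


def benchmark(transfer_all_blocks: Callable[[], None],
              num_attempts: int, total_bytes: int) -> None:
    """Each attempt is one blocking step that moves many blocks in a single prepped xfer."""
    for attempt in range(num_attempts):
        start = time.perf_counter()
        transfer_all_blocks()  # e.g. one make_prepped_xfer over all block indices, waited on
        elapsed = time.perf_counter() - start
        print(f"attempt {attempt}: {total_bytes / elapsed / 1e9:.2f} GB/s")
```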

  • One point of having multiple transfers per agent in vLLM is that they go to different agents, but here all of them go to the same agent. So you could have made a single larger transfer, and the underlying backend might have optimized that transfer better. It's good to have this as an example, but let's clarify that with a comment.

This comment seems to be related to the same context I responded to earlier, so I’m wondering if it would be acceptable to simply add a comment such as:
“This example loops over multiple attempts to evaluate the transfer rate of each step. Each step transfers multiple blocks asynchronously.”
Does this align with your thinking?

I really appreciate your comprehensive review. I'd like to do my best to improve my work based on your feedback. Thanks!
