CUDAGraphs as executor/transform/fusion pass #656


Merged: 3 commits into main from nikitaved/cudagraph_ex on Jun 27, 2024
Conversation

nikitaved (Contributor) commented Jun 26, 2024

As per title. Fixes #635.

Also, it fixes the following subtle bugs:

  • CUDAGraphExecutor does not properly update its static input buffers when a cached graph is fetched for inputs whose metadata matches but whose underlying storage differs. The area of concern is training and the backward pass (see the capture/replay sketch after this list).
  • horizontal_merge in the fusion logic does not respect horizontal precedence between ops when grouping bound symbols. This is not an issue for nvFuser regions, but it could cause problems when a custom FusionExecutor has to decide whether to place something like del x after op(x). The fix sorts the bsyms in each group by their trace position (the trace is expected to be toposorted prior to any fusions), which restores the inter-/intra-group topological order (see the sorting sketch after this list).
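
For context, here is a minimal capture/replay sketch of the pattern the first fix restores; it is not Thunder's actual CUDAGraphExecutor code, and names such as make_graphed_callable, static_inputs, and replay are illustrative:

    import torch

    def make_graphed_callable(fn, example_inputs):
        # Allocate the static buffers the graph will be captured against.
        static_inputs = [x.clone() for x in example_inputs]

        # Warm up on a side stream so capture starts from an initialized state.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            fn(*static_inputs)
        torch.cuda.current_stream().wait_stream(s)

        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_outputs = fn(*static_inputs)

        def replay(*inputs):
            # The fixed bug: without this copy, a cached graph fetched for
            # metadata-identical inputs replays against stale input storage.
            for dst, src in zip(static_inputs, inputs):
                dst.copy_(src)
            graph.replay()
            return static_outputs

        return replay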
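
Similarly, a minimal sketch of the ordering fix (a hypothetical helper; restore_group_order, trace_bsyms, and groups are illustrative names, not the actual horizontal_merge code): each merged group is re-sorted by each bound symbol's position in the already-toposorted trace, so a consumer such as del x can never land before op(x):

    def restore_group_order(trace_bsyms, groups):
        # Position of every bound symbol in the original, toposorted trace.
        bsym_to_pos = {id(bsym): pos for pos, bsym in enumerate(trace_bsyms)}

        # Sorting each horizontally merged group by trace position restores
        # the inter-/intra-group topological order.
        return [
            sorted(group, key=lambda bsym: bsym_to_pos[id(bsym)])
            for group in groups
        ]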

@nikitaved force-pushed the nikitaved/cudagraph_ex branch from 4151315 to 1ee2fb5 on June 26, 2024 10:46
@nikitaved requested a review from IvanYashchuk on June 26, 2024 10:48
@nikitaved force-pushed the nikitaved/cudagraph_ex branch 2 times, most recently from e4c0efb to bd68121 on June 26, 2024 16:07
@nikitaved force-pushed the nikitaved/cudagraph_ex branch from 0defe05 to 791dab8 on June 26, 2024 16:11
t-vi (Collaborator) left a comment:

Looks great overall, thank you @nikitaved.
I have one question in the comments and then I'd be all for merging it.

# Bound symbols whose symbol id is in the do-not-fuse set are excluded from
# the fused (CUDA-graph-captured) region.
if bsym.sym.id in do_not_fuse_sym_set:
    return False

return True
Collaborator (comment on the diff above):

Would we need to check for fixed sizes of the proxies in input and output, or is this handled?

nikitaved (Contributor, Author):

I wonder whether the future API has been agreed upon. Does it mean that dynamic-shape Tensors will contain Integer proxies in their metadata?

IvanYashchuk (Collaborator) commented:
Nice! This change is needed to make my PR #214 work correctly with CUDA Graphs, because there I put torch.autograd.Function.apply into the forward trace, and it should be executed outside of the CUDA-Graph-captured region.

lantiga (Collaborator) left a comment:

Looks great @nikitaved!
I'm imagining that in the future we could consider not special-casing use_cudagraphs but keeping it as a transform, but maybe that's overkill, and it's certainly not for now.

@t-vi merged commit 390d8e3 into main on Jun 27, 2024 (27 of 31 checks passed)
@t-vi deleted the nikitaved/cudagraph_ex branch on June 27, 2024 10:18
t-vi (Collaborator) commented Jun 27, 2024

Let's merge and fix anything that needs fixing later.

Development
Successfully merging this pull request may close this issue: Make CudaGraph wrapping a transform (but call it explicitly)
4 participants