conv backward in thunder #799
The backward for conv is conv again, both for the input and for the weight. But let's indeed investigate the decomp performance. Maybe, again, we can replace it with the PyTorch kernel... And maybe some things are better upcast in the decomp (for example, the reduction for the bias grad), unless such things are taken care of by NVFuser...
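For context, a minimal sketch of that claim in plain PyTorch (not Thunder's actual decomposition): the input and weight gradients of conv2d can be computed as convolutions, which `torch.nn.grad` exposes directly.

```python
# Sketch: conv's backward is conv again, both for the input and the weight.
# torch.nn.grad.conv2d_input / conv2d_weight compute the gradients as
# convolutions; we check them against autograd.
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 8, 8, requires_grad=True)
w = torch.randn(4, 3, 3, 3, requires_grad=True)

y = F.conv2d(x, w, padding=1)
go = torch.randn_like(y)  # an arbitrary incoming gradient
gx, gw = torch.autograd.grad(y, (x, w), grad_outputs=go)

# The same gradients, computed as convolutions:
gx_conv = torch.nn.grad.conv2d_input(x.shape, w, go, padding=1)
gw_conv = torch.nn.grad.conv2d_weight(x, w.shape, go, padding=1)

torch.testing.assert_close(gx, gx_conv)
torch.testing.assert_close(gw, gw_conv)
```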
The decomp working is also what we see with #655 in fp6. But I think we would want to look into whether the decomp has perf or numerical accuracy impacts.
The grad tests are solid and super comprehensive, but they run in high-precision modes. In lower-precision modes we might lose precision between the calls to convolutions (I expect PyTorch to behave well there in forward, but I am not sure), unless NVFuser handles these things with grace. And I am not 100% sure that it does...
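To make that concern measurable, here is a hedged sketch in plain PyTorch (no Thunder) that compares low-precision conv gradients against an fp64 reference. bfloat16 is an assumption chosen for broad CPU support; float16 on CUDA would be the analogous check.

```python
# Sketch: quantify how much conv gradients drift in a low-precision dtype
# relative to an fp64 reference run with identical inputs.
import torch
import torch.nn.functional as F

def conv_grads(dtype):
    torch.manual_seed(0)
    x = torch.randn(2, 3, 16, 16, dtype=torch.float64).to(dtype).requires_grad_()
    w = torch.randn(4, 3, 3, 3, dtype=torch.float64).to(dtype).requires_grad_()
    F.conv2d(x, w, padding=1).sum().backward()
    return x.grad.double(), w.grad.double()

gx_lo, gw_lo = conv_grads(torch.bfloat16)   # low-precision run
gx_ref, gw_ref = conv_grads(torch.float64)  # reference run
print("max grad_input error: ", (gx_lo - gx_ref).abs().max().item())
print("max grad_weight error:", (gw_lo - gw_ref).abs().max().item())
```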
I wonder, maybe conv.backward in PyTorch does update the grads in a single kernel? The best perf improvement I see is for NVFuser to get its own native convolution support. Best in the sense of not introducing changes to Thunder, I mean... cc @tfogal
Thanks for tagging me. I am sure we will want convs in nvFuser eventually, but it'll be quite some time; it's not even on the roadmap as of today. Plus, convs are hard :-). I generally trust nvFuser to get the best perf for memory-bound workloads today, but the situation is spotty for compute-bound workloads, and convs are generally compute-bound. I think we should pursue other approaches, e.g., as you mention, seeing if PyTorch will have a single kernel here, or maybe we can just directly call cuDNN. cc @vedaanta
Note that this is more for investigation and knowing what's going on than any set conclusion. Comparing to fp64 in this trivial example, it seems that we're not worse than eager autocast in accuracy, so it might just be different.
triage review:
While testing #797, it seems that Thunder's backward might lead to things not being optimal (accuracy, speed?):

gives (note the backward running conv2d forward twice again(?))
Edit: I don't have the output for the accuracy that I edited into the script here, but my impression is that the accuracy of the Thunder backward is not worse than the eager one in this example, and this example is not terribly relevant for perf. It's just that we should develop insight into what's going on, because we will bump into the question.
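The original script is not preserved above; as a stand-in, a minimal sketch of this kind of repro (shapes and the small Conv2d model are assumptions) would jit the module with Thunder and inspect the backward trace for repeated conv calls.

```python
# Sketch (assumed, not the original script): run a small conv model
# under Thunder and print the backward trace to see whether conv ops
# show up again in the backward.
import torch
import thunder

model = torch.nn.Conv2d(3, 4, kernel_size=3, padding=1)
jmodel = thunder.jit(model)

x = torch.randn(2, 3, 8, 8, requires_grad=True)
jmodel(x).sum().backward()

# The last backward trace is the one that actually executed; a conv-based
# decomposition should surface convolution ops here.
print(thunder.last_backward_traces(jmodel)[-1])
```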
I wonder if #655 is related; maybe the same method could provide a 0th-order analysis of what's going on.
cc @tfogal