reverting special handling of copy_ in nvfuser executor #806
I'll wait until the nvfuser patch is merged before marking this one as ready for review.
quick "i didn't really look at the code" thought (i'll look tomorrow)---do we need to update |
I'd say it'd be better to have an if-else of nvfuser version to make use of the linked nvfuser pr when available
That's a good question. I think we need the required nvfuser fix to pass CI, so it would be good to have it there. But I do want to point out that thunder main as-is clears CI, yet it would still fail the Python test in this PR. Regarding the requirements: I don't see where we use them to choose the nvfuser package installation. Would you mind pointing me to where that is?
😂 looks like we don't actually require nvFuser right now. is thunder usable without nvFuser, though? that's arguably a bug... |
It is usable. It will fall back to PyTorch eager CUDA. But some imports will probably fail?
we can add a requirement for tests. |
Do you mean to have the if-else in the executor logic? or just to skip tests? |
To my mind, it would be great to have an assert in the executor that the nvfuser version is >= the first version containing the fix, and to have a wheel available for that version; then we can drop the special handling. Upgrading nvFuser is not a big burden (with a wheel available), but failing silently would be bad.
LGTM! Thanks for fixing this.
Could you link to the nvFuser PR that is required here? Just for my curiosity + posterity.
NVIDIA/Fuser#2638 |
If we absolutely don't want thunder to run with older nvfuser versions, I think we can add this as an entry in requirements.txt.
@xwang233 QQ: is pypi publish of nvfuser against stable torch automatic at this point? |
No |
We don't want to require a version of nvfuser (i.e. you could run thunder without it, starting with our non-GPU CI), but we want to avoid versions which have a particular bug. I don't have an opinion about the nvfuser view of the bug severity, but from the thunder side, this is "silent bad results", which I think is good to guard against by failing loudly (assert or runtime error, I am not particular about which).
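A rough sketch of the kind of guard described here, for illustration only. The names `MIN_FIXED_NVFUSER_VERSION`, `parse_version` and `check_nvfuser_version` are hypothetical, and the minimum version is a placeholder because this thread does not say which nvfuser release first contains NVIDIA/Fuser#2638. How thunder actually looks up the installed nvfuser version is not shown, so the version string is passed in as a plain argument.

```python
import re
from typing import Optional, Tuple

# Hypothetical placeholder: the thread does not name the first fixed nvfuser release.
MIN_FIXED_NVFUSER_VERSION = "0.0.0"


def parse_version(version: str) -> Tuple[int, ...]:
    """Keep only the leading numeric components, e.g. "0.2.8+gitabc" -> (0, 2, 8)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def _is_older(installed: Tuple[int, ...], minimum: Tuple[int, ...]) -> bool:
    """Compare after zero-padding so that (0, 2) and (0, 2, 0) count as equal."""
    width = max(len(installed), len(minimum))
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(installed) < pad(minimum)


def check_nvfuser_version(
    installed: Optional[str], minimum: str = MIN_FIXED_NVFUSER_VERSION
) -> bool:
    """Return False if nvfuser is absent, True if new enough, raise if too old."""
    if installed is None:
        # nvfuser stays optional: the caller can fall back to another executor.
        return False
    if _is_older(parse_version(installed), parse_version(minimum)):
        # Fail loudly instead of risking silently wrong results.
        raise RuntimeError(
            f"nvfuser {installed} is older than {minimum}, "
            "which is needed for correct copy_ handling; please upgrade nvfuser."
        )
    return True
```

The three outcomes mirror the points made above: a missing nvfuser is not an error, a new enough nvfuser is accepted, and an old nvfuser raises rather than being used.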
LGTM! Thank you, @jjsjann123 !
Note that the CI with stable PyTorch does not find updated packages with the fix yet, apparently. |
Sounds reasonable. I'll have the nvfuser version bumped and assert on that.
lightning-thunder/thunder/executors/nvfuserex_impl.py Lines 240 to 243 in f04a88f
I can probably do that in a separate clean up PR.
Thanks for pointing it out. I'll work with Xiao to have it updated. |
The follow-up cleanup on the nvfuser version is #856.
The nvfuser bits are OK; the CI failure is caused by unrelated Windows issues, so I'll merge this and #856 shortly after.
Fixes: #791
The WAR on nvfuser's copy_ among fusion inputs cannot cover all use cases (as shown in the repro, where the test fails even with the patch).
The nvfuser behavior is being patched in NVIDIA/Fuser#2638, so the WAR in the nvfuser executor is no longer necessary.