Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reverting special handling of copy_ in nvfuser executor #806

Merged
merged 9 commits into from
Jul 25, 2024

Conversation

jjsjann123
Copy link
Collaborator

@jjsjann123 jjsjann123 commented Jul 19, 2024

Fixes: #791

The WAR on nvfuser's copy_ among fusion inputs cannot cover all use cases (as shown in the repro, where the test fails even with the patch)

nvfuser behavior is being patched in NVIDIA/Fuser#2638, so the WAR in nvfuser executor is no longer necessary.

@jjsjann123 jjsjann123 marked this pull request as draft July 19, 2024 00:02
@jjsjann123
Copy link
Collaborator Author

I'll wait until nvfuser patch is merged before marking this one as ready for review.

@tfogal
Copy link
Collaborator

tfogal commented Jul 19, 2024

quick "i didn't really look at the code" thought (i'll look tomorrow)---do we need to update requirements.txt if there's now a required nvfuser fix?

Copy link
Collaborator

@crcrpar crcrpar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say it'd be better to have an if-else of nvfuser version to make use of the linked nvfuser pr when available

@jjsjann123
Copy link
Collaborator Author

Re: do we need to update requirements.txt if there's now a required nvfuser fix?

That's a good question. We need that required nvfuser fix to pass CI I think, so it would be a good thing to have there. But I do want to point out that thunder main as-is does clear CI but it would still fail the python test in this PR.

Regarding the requirements. I don't see where we use that to choose the nvfuser package installation. Would you guys mind pointing me where it is?

@tfogal
Copy link
Collaborator

tfogal commented Jul 19, 2024

Regarding the requirements. I don't see where we use that to choose the nvfuser package installation. Would you guys mind pointing me where it is?

😂 looks like we don't actually require nvFuser right now.

is thunder usable without nvFuser, though? that's arguably a bug...

@nikitaved
Copy link
Contributor

nikitaved commented Jul 19, 2024

It is usable. It will fallback to PyTorch eager Cuda. But some imports will probably fail?

@jjsjann123
Copy link
Collaborator Author

Regarding the requirements. I don't see where we use that to choose the nvfuser package installation. Would you guys mind pointing me where it is?

😂 looks like we don't actually require nvFuser right now.

is thunder usable without nvFuser, though? that's arguably a bug...

we can add a requirement for tests.
I think alternatively we can just skip the nvfuser test for versions where we do not have the fix.

@jjsjann123
Copy link
Collaborator Author

I'd say it'd be better to have an if-else of nvfuser version to make use of the linked nvfuser pr when available

Do you mean to have the if-else in the executor logic? or just to skip tests?

@t-vi
Copy link
Collaborator

t-vi commented Jul 19, 2024

To my mind, having an assert on the nvfuser version being >= first fixed in the executor and having a wheel available for that would be great and then we can drop the special handling. Upgrading NVFuser is not a big burden (with a wheel available) but failing silently would be bad.

Copy link
Collaborator

@tfogal tfogal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for fixing this.

Could you link to the nvFuser PR that is required here? Just for my curiosity + posterity.

@jjsjann123
Copy link
Collaborator Author

Could you link to the nvFuser PR that is required here? Just for my curiosity + posterity.

NVIDIA/Fuser#2638
I'm just awaiting CI to finish before merging it.

@jjsjann123
Copy link
Collaborator Author

having an assert on the nvfuser version being >= first fixed in the executor

If we absolutely don't want thunder to run with older nvfuser version, I think we can add this as an entry in requirement.txt.
The nvfuser side patch is a bug fix. Not sure if it justifies as an assert on that version of nvfuser shouldn't be used.

@jjsjann123 jjsjann123 marked this pull request as ready for review July 19, 2024 23:14
@jjsjann123
Copy link
Collaborator Author

@xwang233 QQ: is pypi publish of nvfuser against stable torch automatic at this point?

@xwang233
Copy link
Collaborator

@xwang233 QQ: is pypi publish of nvfuser against stable torch automatic at this point?

No

@t-vi
Copy link
Collaborator

t-vi commented Jul 20, 2024

If we absolutely don't want thunder to run with older nvfuser version, I think we can add this as an entry in requirement.txt.
The nvfuser side patch is a bug fix. Not sure if it justifies as an assert on that version of nvfuser shouldn't be used.

We don't want to require a version of nvfuser (i.e. you could run thunder without it, starting with our nongpu CI), but we want to avoid versions which have a particular bug. I don't have an opinion about the nvfuser view of the bug severity, but from the thunder side, this is "silent bad results", which I think is good to guard against by failing loudly (assert or runtime error, I am not particular about).

Copy link
Contributor

@nikitaved nikitaved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you, @jjsjann123 !

@t-vi
Copy link
Collaborator

t-vi commented Jul 22, 2024

Note that the CI with stable PyTorch does not find updated packages with the fix yet, apparently.

@jjsjann123
Copy link
Collaborator Author

I don't have an opinion about the nvfuser view of the bug severity, but from the thunder side, this is "silent bad results", which I think is good to guard against by failing loudly (assert or runtime error, I am not particular about).

Sounds reasonable. I'll have nvfuser version bumped and assert on that.
Does that mean we can drop these logics on older nvfuser versions, since we are going to be asserting on those?

elif nv_version >= LooseVersion("0.0.17"):
nv = fd.define_tensor(shape=symbolic_shape, contiguity=contiguity, dtype=nvdtype)
elif nv_version >= LooseVersion("0.0.9"):
nv = fd.define_tensor(symbolic_sizes=symbolic_shape, contiguity=contiguity, dtype=nvdtype)

I can probably do that in a separate clean up PR.

Note that the CI with stable PyTorch does not find updated packages with the fix yet, apparently.

Thanks for pointing it out. I'll work with Xiao to have it updated.

@t-vi
Copy link
Collaborator

t-vi commented Jul 23, 2024

Does that mean we can drop these logics on older nvfuser versions, since we are going to be asserting on those?
If the assert is "global", yes.

@jjsjann123
Copy link
Collaborator Author

follow up clean up on nvfuser version is #856

@t-vi
Copy link
Collaborator

t-vi commented Jul 25, 2024

The nvfuser bits are OK, CI failure is unrelated windows things, so I'll merge this and #856 shortly after.
Thank you @jjsjann123 @xwang233 @tfogal @nikitaved @crcrpar

@t-vi t-vi merged commit c734e99 into main Jul 25, 2024
34 of 37 checks passed
@t-vi t-vi deleted the copy__nvfuser_patch branch July 25, 2024 05:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

prims.copy_ with NVFuser: sometimes it has to be a kernel.
6 participants