Add an option to print nvFuser repros. #1362
for more information, see https://pre-commit.ci
We didn't always accept that kwarg.
This looks to have been added to nvFuser on Oct 8th, in 7f0bd0d0f9ba3271c2c340da134e6fd44da53838.
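Since older nvFuser builds did not accept the kwarg, one way to stay compatible is to pass it only when the callee advertises it. The following is a minimal sketch of that idea, not the guard the PR actually uses (which is not shown in this thread and may well be a version check instead). The kwarg name `print_repro` and the two `execute_*` functions are hypothetical stand-ins for illustration.

```python
import inspect


def supports_kwarg(func, name):
    """Return True if `func` can be called with a keyword argument `name`."""
    try:
        params = inspect.signature(func).parameters
    except (TypeError, ValueError):
        # Some extension callables expose no signature; treat as unsupported.
        return False
    keyword_kinds = (
        inspect.Parameter.KEYWORD_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs catch-all also accepts the name.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-in for an older executor entry point without the kwarg.
def execute_old(inputs):
    return list(inputs)


# Hypothetical stand-in for a newer entry point that accepts the kwarg.
def execute_new(inputs, *, print_repro=False):
    if print_repro:
        print("# repro would be printed here")
    return list(inputs)


def call_execute(execute, inputs, want_repro):
    """Forward the (assumed) `print_repro` kwarg only when it is supported."""
    kwargs = {}
    if want_repro and supports_kwarg(execute, "print_repro"):
        kwargs["print_repro"] = True
    return execute(inputs, **kwargs)
```

With this shape, `call_execute(execute_old, data, True)` silently falls back to a plain call, while `call_execute(execute_new, data, True)` turns the printing on.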
LGTM
So that we don't need a copy in every object.
How does it differ from the util here: lightning-thunder/thunder/examine/__init__.py, line 249 in 9c916d9?
It's pretty similar, just a more user-accessible way to get at the same functionality really. Inside nvFuser it calls the same code. The intent is to turn this on when nvFuser is segfaulting and then just grab the last repro printed out for a bug report. To satisfy that case with the existing entry points, one would need to first dump the thunder trace to get all the
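The "grab the last repro printed" flow described above relies on each repro being emitted, and flushed, before its fusion runs, so that the final one on the stream belongs to the fusion that crashed. A minimal sketch of that ordering, assuming nothing about Thunder's or nvFuser's real internals: `run_fusions`, the `(name, repro_text, fn)` triples, and `crash` are all hypothetical, and a Python exception stands in for the segfault.

```python
import io
import sys


def run_fusions(fusions, print_repro=False, stream=sys.stderr):
    """Run (name, repro_text, fn) triples in order.

    When `print_repro` is set, each repro is written and flushed *before*
    the fusion executes, so it survives even if the process dies in `fn`.
    """
    results = []
    for name, repro_text, fn in fusions:
        if print_repro:
            stream.write(f"# repro for {name}\n{repro_text}\n")
            stream.flush()  # make sure the text is out before a possible crash
        results.append(fn())
    return results


def crash():
    # Stand-in for a segfault; a real one would kill the process outright.
    raise RuntimeError("stand-in for a segfault")


def demo():
    """Return everything printed up to and including the crashing fusion."""
    buf = io.StringIO()
    fusions = [
        ("fusion0", "repro text 0", lambda: 0),
        ("fusion1", "repro text 1", crash),
    ]
    try:
        run_fusions(fusions, print_repro=True, stream=buf)
    except RuntimeError:
        pass
    return buf.getvalue()
```

In `demo()`, the last repro in the captured output is the one for `fusion1`, the fusion that failed, which is the piece one would paste into a bug report.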
I have three issues with this
Thanks for taking a look!
Yes, indeed, segfaults are the main reason one would want this. I'm happy to look at other options but it's not clear (to me) what other ones exist. Or maybe your intent is that we shouldn't do anything, and instead just manually edit thunder to print these repros when we hit such cases?
A lazy search for "segfault" in the nvFuser repo issues gives 8 from this year, not all of which would apply here; maybe 3-4 affected thunder CI. So my gut feeling is that it is exceedingly rare to hit those.
Yes, segfaults are rare (thankfully!). Sorry, it's not clear what behavior or action you're recommending, or what debugging flow you'd like developers to use when they hit a segfault. Could you clarify?
It's good to have as few options as possible for a general user to use Thunder. Going to the README of the project we would see "Thunder aims to be usable, understandable, and extensible." This option is not random; it aids usability and ease of reporting a problem for debugging. Unfortunately, due to the nature of active development, nvFuser segfaults, while rare, are inevitable, and there can be other problems besides segfaults where help from the nvFuser team would be needed.
The scope of the change is nvFuser files. nvFuser should be able to introduce any options they think are necessary for a better experience. If any other executor wants to reuse the same option for its reproducer generation, it can do so. @t-vi, what would you recommend doing here?
I would not consider this option specific to segfaults. This option enables you to generally dump nvFuser repros. I would just consider it a productivity improvement as it takes less effort to add a single knob. In addition, this would be necessary to catch a segfault.
Of course, it is not random to the people implementing it, but it seems relatively random to the user. Part of the problem is the option mechanism that Thunder provides here, which basically makes options very non-discoverable. To my mind
@tfogal is there still interest in this? If you could update to use the debug option facility, I'd be glad to merge if you want to.
Yes, thanks for the reminder. I will update it to use the DebugOption mechanism.
What does this PR do?
This adds a minor NVIDIA-specific debug option to print out reproduction information whenever possible.
This could be useful if there is a segfault, or if an nvFuser developer wants to drop all the fusions in a network for importing into their CI.
PR review
Anyone in the community is free to review the PR once the tests have passed.