Skip to content

Add an option to print nvFuser repros. #1362

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Conversation

tfogal
Copy link
Collaborator

@tfogal tfogal commented Oct 29, 2024

What does this PR do?

This adds a minor nvidia-specific debug option to print out reproduction information whenever possible.

This could be useful if there is a segfault, or if an nvFuser developer wants to drop all the fusions in a network for importing into their CI.

PR review

Anyone in the community is free to review the PR once the tests have passed.

tfogal and others added 6 commits October 29, 2024 11:28
@tfogal tfogal marked this pull request as ready for review October 29, 2024 23:49
Copy link
Collaborator

@jjsjann123 jjsjann123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

tfogal and others added 2 commits October 30, 2024 12:42
@riccardofelluga
Copy link
Collaborator

riccardofelluga commented Oct 30, 2024

How does it differ from the util here?

def get_nvfuser_repro(trace: TraceCtx, fusion_name: str, /) -> str:

@tfogal
Copy link
Collaborator Author

tfogal commented Oct 30, 2024

How does it differ from the util here?

def get_nvfuser_repro(trace: TraceCtx, fusion_name: str, /) -> str:

It's pretty similar, just a more user-accessible way to get at the same functionality really. Inside nvFuser it calls the same code.

The intent is to turn this on when nvFuser is segfaulting and then just grab the last repro printed out for a bug report. To satisfy that case with the existing entry points, one would need to first dump the thunder trace to get all the nvFusion<X> names, and then iterate through and print each one.

@t-vi
Copy link
Collaborator

t-vi commented Oct 31, 2024

I have three issues with this

  • I don't like random options added to the .jit call it just turns into a giant mess over time.
  • It special-cases nvfuser a lot and pretty much duplicates the existing functionality. - We print on errors and we can print after the fact.
  • So the remaining use-case is segfaults. In my experience, these are really rare.

@tfogal
Copy link
Collaborator Author

tfogal commented Oct 31, 2024

Thanks for taking a look!

  • So the remaining use-case is segfaults. In my experience, these are really rare.

Yes, indeed, segfaults are the main reason one would want this.

I'm happy to look at other options but it's not clear (to me) what other ones exist. Or maybe your intent is that we shouldn't do anything, and instead just manually edit thunder to print these repros when we hit such cases?

@t-vi
Copy link
Collaborator

t-vi commented Nov 1, 2024

So lazily search segfault in the NVFuser repo issues gives 8 from this year, not all of which would apply here, maybe 3-4 that affected thunder CI. So my gut feeling is that it is exceedingly rare to hit those.

@tfogal
Copy link
Collaborator Author

tfogal commented Nov 1, 2024

So lazily search segfault in the NVFuser repo issues gives 8 from this year, not all of which would apply here, maybe 3-4 that affected thunder CI. So my gut feeling is that it is exceedingly rare to hit those.

Yes, segfaults are rare (thankfully!).

Sorry, it's not clear what behavior or action you're recommending, or what debugging flow you'd like developers to use when they hit a segfault. Could you clarify?

@IvanYashchuk
Copy link
Collaborator

IvanYashchuk commented Nov 4, 2024

I don't like random options added to the .jit call it just turns into a giant mess over time.

It's good to have as few options as possible for a general user to use Thunder. Going to the README of the project we would see "Thunder aims to be usable, understandable, and extensible.". This option is not random, it aids usability and ease of reporting a problem for debugging. Unfortunately, due to the nature of active development nvFuser segfaults, while rare, are inevitable, and there can be other problems besides segfaults where help from the nvFuser team would be needed.

It special-cases nvfuser a lot and pretty much duplicates the existing functionality. - We print on errors and we can print after the fact.

The scope of the change is nvFuser files. nvFuser should be able to introduce any options they think are necessary for a better experience. If there would be any other executor willing to reuse the same option for their reproducer generation they can do so.

@t-vi, what would you recommend doing here?

@kevinstephano
Copy link
Collaborator

I would not consider this option specific to segfaults. This option enables you to generally dump nvFuser repros. I would just consider it a productivity improvement as it takes less effort to add a single knob. In addition, this would be necessary to catch a segfault.

@t-vi
Copy link
Collaborator

t-vi commented Nov 18, 2024

Of course, it is not random to the people implementing it, but it seems relatively random to the user. Part of the problem is the option mechanism that Thunder provides here, which basically makes options very non-discoverable.

To my mind

  • options should be discoverable,
  • they should be documented,
  • we should avoid having many options to the jit cluttering that call
    • if it is specific to an executor or to a transform, maybe setting a flag there with a function / property is better and then
      it can be documented in the executor class documentation.
    • for debug options in particular, I created a DebugOptions class that you can register options with in add extensible DebugOptions class #1447 . Those options will be automatically added to the doc string and to the documentation of the DebugOptions class, so people can interactively find them or look them up in the docs. We will use these for debug verbosity options in the interpreter and for enabling trace checking etc.
  • This adds the third way to get nvfuser repros and to my mind, this is a bit much. I know that nvfuser is very dear to you, but still. If an option like the present one is adopted, will we drop @riccardofelluga 's way of doing it? Printing the repro on error seems to be great value, but why do we need to have two options to print it when no error happens?

@t-vi
Copy link
Collaborator

t-vi commented Jan 10, 2025

@tfogal is there still interest in this? If you could update to use the debug option facility, I'd be glad to merge if you want to.

@tfogal
Copy link
Collaborator Author

tfogal commented Jan 10, 2025

@tfogal is there still interest in this? If you could update to use the debug option facility, I'd be glad to merge if you want to.

Yes, thanks for the reminder. I will update it to use the DebugOption mechanism.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants