-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ThunderFX: Save the reproducer script into files #1380
Conversation
This is really cool! A couple questions, @kiya00:
For this part, can the script instead know how to produce the same compilation as in the original reproduction? Like if the original was compiled like
Do this CUDA-specific calls only appear if one or more of the input tensors is generated on a CUDA device?
Where do the low and high values for this call to randint come from? Should we think about using make_tensor instead? Can we add a comment with the original FX graph, too?
Not a question, I just thought this was really cool and helpful.
Why this final call the the function? |
e3e0451
to
df2c199
Compare
Hi @mruberry , thanks for the advice I'll change it accordingly
I mostly kept the repro script as the one written by Tom, I think it's more like a debug+benchmark script on different backend depending on the environment variable. If we just want to produce the same compilation as in the original reproduction, I can change it to the thunder options actually used.
When we can get the real tensor instead of FakeTensor in the Dynamo graph, the min/max value can be obtained from it. And sometimes it's needed to ensure correctness (e.g. nanogpt input must be in range 0-255). Currently the original torch interface is used to create the inputs, like the nvFuser repro does, maybe it's more user friendly to use the native torch API, but it's easy to change to
I'll modify it to only appear when cuda is used. |
That makes a lot of sense! I think we can let @tfogal comment when he's back, but I think the principal desire behind these reproduction scripts is to create a standalone file that someone interested in reviewing thunder's performance or correctness can quickly run to replicate an issue. That's why I'd suggest making it so that when someone clicks "run" the script executes what thunder did the same way thunder did it. Of course it's great to add notes for how to override/compare that behavior, too!
That's really cool. Querying for the min and max values from the real tensors sounds like a good solution. I would take a look at
Awesome! |
Thanks for the ping, sorry for the delay. The original code conflates two things, and I think there's a good argument to be made that during this cleanup process we rethink things. That is, we've got two use cases to service:
These two are very close but not quite the same. I think/hope that the ThunderCompilerBenchmark class that Yan added recently suffices for the latter goal. Without thinking about how hard it is / if it's appropriate for a single PR, what I'd love to see is:
Does that make sense to both of you? |
Yep. I think providing a default that mimics the actual behavior, and the ability to override/tweak/investigate alternatives would be great! |
Hi @tfogal @mruberry , thank you for the reply, this is the example repro script I generated so far, it would be helpful if we could align on the structure of the repro script
we can create a simple clean repro for this case(
When you mention “providing ‘fiddling’ cases”, does that mean we use the ThunderCompilerBenchmark to provide the benchmarking numbers of the following few types of executors, like the following? When user calls
then run the benchmarking repro script will give the following:
|
The first script looks perfect to me (great if it also has the version information from the original machine, but that can be added later). The benchmarking variant also looks great! |
Your two sample outputs look fantastic, Yan. Thank you for driving this, and for your patience with my initially-incompletely-expressed asks! |
90d9f81
to
34c0c43
Compare
Hi @mruberry @tfogal @IvanYashchuk , it's ready for review, the generated sample script is in #1380 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great overall. I've got a few questions and suggestions inline.
Hi @mruberry @tfogal , it's ready for review, the generated sample script is in #1380 (comment), and adding comments about split reason and submodule structure will be fixed in the issue #1461 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's 11pm and I start a vacation tomorrow. I trust the two of you. If I had any other comments I am sure they would be minor. Please feel free to merge while I'm on break and I'll take a look when I am back. |
Based on the code provided by @tfogal
(https://github.com/tfogal/NeMo/blob/a0c711deae6c6f7342662795425684a342f95b8d/examples/multimodal/multimodal_llm/neva/neva_pretrain.py#L164), I added the
ThunderCompiler.save_reproducer_to_folder
interface to save the reproducer script in an "offline" way.SubgraphInfo.thunder_compiled_fns_example_inputs
is added to record the input tensor metadata and after execution we retrieve the information inSubgraphInfo
and write the reproducer to file.Fixes #1082.
An example of the saved reproducer script using CPU inputs
An example of the saved reproducer script using CUDA inputs
An example of the saved benchmark script
cc @Borda @apaz-cli