Data types mismatch inside Megatron TransformerBlock #1044
Is there no further traceback? (I'm asking if the call to layer itself should not throw the assert.)
I am getting the same error when trying to run with eager (with the following patch):
diff --git a/thunder/tests/test_neva_modules.py b/thunder/tests/test_neva_modules.py
index 25fd4f8..865dc1f 100644
--- a/thunder/tests/test_neva_modules.py
+++ b/thunder/tests/test_neva_modules.py
@@ -165,7 +165,7 @@ def _test_megatron_transformer_block(input_data):
     block = TransformerBlock(transformer_config, get_gpt_layer_with_transformer_engine_spec())
     block.to(device)
-    jblock = thunder.jit(block)
+    jblock = block
     hidden_states = torch.ones((4096, 1, transformer_config.hidden_size))
     hidden_states = hidden_states.cuda()
Snippet of logs
Full logs - logs.txt
Oh wait, I'll check the repro/test script first to see if it's a setup issue.
My bad, it was the wrong input dtype; however, now we get another error. I'll close this issue and open another with the correct description.
The new issue can be found here: #1053
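For readers hitting the same symptom, here is a minimal sketch of one way to rule out an input dtype mismatch before calling a module. It assumes the intended fix is to give the input the same dtype and device as the module's parameters, which the thread does not spell out. `match_input_dtype` is a hypothetical helper, and `torch.nn.Linear` is only a stand-in for the Megatron `TransformerBlock` so the example runs without Megatron or a GPU.

```python
import torch


def match_input_dtype(module: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: cast x to the dtype/device of the module's first parameter."""
    param = next(module.parameters())
    if x.dtype != param.dtype or x.device != param.device:
        x = x.to(device=param.device, dtype=param.dtype)
    return x


# Stand-in module (assumption): the issue uses Megatron's TransformerBlock instead.
layer = torch.nn.Linear(8, 8).to(torch.float64)

# torch.ones defaults to float32, as in the test's hidden_states.
hidden_states = torch.ones((4, 1, 8))

# Align the input with the module's parameters, then run the forward pass.
hidden_states = match_input_dtype(layer, hidden_states)
out = layer(hidden_states)
```

Comparing `hidden_states.dtype` against `next(block.parameters()).dtype` in the test itself is a quick way to tell a setup problem apart from a genuine Thunder bug.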
🚀 Model / language coverage
When running the TransformerBlock module from Megatron-LM with the workaround (WAR) described in the comments in #753, Thunder raises an AssertionError:
Probably similar to what happened in #678.
Minimal Repro
I've created a branch with the setup to test neva modules. To reproduce this issue, pull Thunder inside the PyTorch container and check out the neva-modules-tests branch.
Then install Megatron with
pip install megatron-core
and run the test with:
-p no:warnings prevents warnings from crashing pytest, and -s forwards the prints from the test for inspection.
cc @apaz-cli @tfogal