Data types mismatch inside Megatron TransformerBlock #1044

Closed · riccardofelluga opened this issue Aug 26, 2024 · 5 comments
Labels: high priority · nemo (Issues needed to support NVIDIA NeMo models) · program-coverage (Requests for model and program coverage)

Comments

riccardofelluga (Collaborator) commented Aug 26, 2024

🚀 Model / language coverage

When running the TransformerBlock module from Megatron-LM with the workaround (WAR) described in the comments of #753, Thunder raises an AssertionError:

This is probably similar to what happened in #678.

[... omitted ...]
 File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6407, in _impl
    return fn.__func__(fn.__self__, *args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/megatron/core/transformer/transformer_block.py", line 411, in forward
    hidden_states, context = layer(
 AssertionError: Data types for parameters must match when outside of autocasted region.  Found input dtype: thunder.dtypes.float32 and 'weight' dtype: thunder.dtypes.bfloat16
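
For context, here is a minimal sketch of the kind of check that fires here. This is an illustration only, not the actual Thunder or Transformer Engine code: outside an autocast region, the input dtype is asserted to match every parameter dtype, and torch.ones produces float32 by default.

import torch

# Illustration only: a bfloat16 layer fed a float32 input trips the same
# kind of dtype assertion as in the traceback above.
layer = torch.nn.Linear(8, 8).to(torch.bfloat16)
x = torch.ones(4, 8)  # torch.ones defaults to float32

for name, param in layer.named_parameters():
    assert x.dtype == param.dtype, (
        "Data types for parameters must match when outside of autocasted region. "
        f"Found input dtype: {x.dtype} and '{name}' dtype: {param.dtype}"
    )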

Minimal Repro

I've created a branch with the setup to test the NeVA modules. To reproduce this issue, pull Thunder in the PyTorch container and check out the neva-modules-tests branch.

Then install Megatron with pip install megatron-core and run the test with:

pytest thunder/tests/test_neva_modules.py -p no:warnings -s 

-p no:warnings prevents warnings from crashing pytest, and -s forwards the prints from the test for inspection.
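
Putting the steps together, the repro looks roughly like this (the repo URL and clone path are illustrative; the branch, package, and flags are as stated above):

git clone https://github.com/Lightning-AI/lightning-thunder.git
cd lightning-thunder
git checkout neva-modules-tests
pip install megatron-core
pytest thunder/tests/test_neva_modules.py -p no:warnings -s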

cc @apaz-cli @tfogal

riccardofelluga added the nemo, program-coverage, and high priority labels on Aug 26, 2024
t-vi (Collaborator) commented Aug 26, 2024

Is there no further traceback? (I'm asking whether the call to layer itself shouldn't be the thing throwing the assert.)

kshitij12345 (Collaborator) commented

I get the same error when running in eager mode (with the following patch):

diff --git a/thunder/tests/test_neva_modules.py b/thunder/tests/test_neva_modules.py
index 25fd4f8..865dc1f 100644
--- a/thunder/tests/test_neva_modules.py
+++ b/thunder/tests/test_neva_modules.py
@@ -165,7 +165,7 @@ def _test_megatron_transformer_block(input_data):
     block = TransformerBlock(transformer_config, get_gpt_layer_with_transformer_engine_spec())
 
     block.to(device)
-    jblock = thunder.jit(block)
+    jblock = block
     hidden_states = torch.ones((4096, 1, transformer_config.hidden_size))
     hidden_states = hidden_states.cuda()

Snippet of logs

   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py", line 678, in prepare_forward
    self.set_activation_dtype(inp)
   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py", line 588, in set_activation_dtype
    assert dtype == param.dtype, (
 AssertionError: Data types for parameters must match when outside of autocasted region.  Found input dtype: torch.float32 and 'layer_norm_weight' dtype: torch.bfloat16

Full logs: logs.txt

riccardofelluga (Collaborator, Author) commented

Oh wait, let me first check the repro/test script to see if it's a setup issue.

riccardofelluga (Collaborator, Author) commented

My bad, it was the wrong input dtype; however, we now get a different error. I'll close this issue and open a new one with the correct description.
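
(For reference, a hedged sketch of what the wrong input dtype likely means here: the hidden_states tensor in the test is created with torch.ones, which defaults to float32, while the block's parameters are bfloat16 per the tracebacks above. The hidden_size placeholder below stands in for the value the test reads from transformer_config.)

import torch

# Hypothetical fix sketch: create the input in the parameters' dtype
# (bfloat16) instead of the float32 default of torch.ones.
hidden_size = 4096  # placeholder; the test takes this from transformer_config
hidden_states = torch.ones((4096, 1, hidden_size), dtype=torch.bfloat16)
hidden_states = hidden_states.cuda()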

riccardofelluga (Collaborator, Author) commented

The new issue can be found at #1053.
