Data types mismatch inside Megatron TransformerBlock #1044

Closed · riccardofelluga opened this issue Aug 26, 2024 · 5 comments
Labels: high priority · nemo (Issues needed to support NVIDIA NeMo models) · program-coverage (Requests for model and program coverage)

Comments

riccardofelluga (Collaborator) commented Aug 26, 2024

🚀 Model / language coverage

When running the TransformerBlock module from Megatron-LM with the workaround (WAR) described in the comments of #753, Thunder raises an AssertionError:

This is probably similar to what happened in #678.

[... omitted ...]
 File "/opt/pytorch/lightning-thunder/thunder/core/interpreter.py", line 6407, in _impl
    return fn.__func__(fn.__self__, *args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/megatron/core/transformer/transformer_block.py", line 411, in forward
    hidden_states, context = layer(
 AssertionError: Data types for parameters must match when outside of autocasted region.  Found input dtype: thunder.dtypes.float32 and 'weight' dtype: thunder.dtypes.bfloat16
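
For context, here is a minimal sketch of the kind of check that fires here. This is an illustration only, not the actual Thunder or Transformer Engine code: outside an autocast region, the input dtype is asserted to match every parameter dtype, and torch.ones produces float32 by default.

import torch

# Illustration only: a bfloat16 layer fed a float32 input trips the same
# kind of dtype assertion as in the traceback above.
layer = torch.nn.Linear(8, 8).to(torch.bfloat16)
x = torch.ones(4, 8)  # torch.ones defaults to float32

for name, param in layer.named_parameters():
    assert x.dtype == param.dtype, (
        "Data types for parameters must match when outside of autocasted region. "
        f"Found input dtype: {x.dtype} and '{name}' dtype: {param.dtype}"
    )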

Minimal Repro

I've created a branch with the setup to test the NeVA modules. To reproduce this issue, pull Thunder in the PyTorch container and check out the neva-modules-tests branch.

Then install Megatron with pip install megatron-core and run the test with:

pytest thunder/tests/test_neva_modules.py -p no:warnings -s 

-p no:warnings prevents warnings from crashing pytest, and -s forwards the prints from the test for inspection.
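
Putting the steps together, the repro looks roughly like this (the repo URL and clone path are illustrative; the branch, package, and flags are as stated above):

git clone https://github.com/Lightning-AI/lightning-thunder.git
cd lightning-thunder
git checkout neva-modules-tests
pip install megatron-core
pytest thunder/tests/test_neva_modules.py -p no:warnings -s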

cc @apaz-cli @tfogal

riccardofelluga added the nemo, program-coverage, and high priority labels on Aug 26, 2024
t-vi (Collaborator) commented Aug 26, 2024

Is there no further traceback? (I'm asking whether the call to layer itself shouldn't be the thing throwing the assert.)

kshitij12345 (Collaborator) commented

I get the same error when running in eager mode (with the following patch):

diff --git a/thunder/tests/test_neva_modules.py b/thunder/tests/test_neva_modules.py
index 25fd4f8..865dc1f 100644
--- a/thunder/tests/test_neva_modules.py
+++ b/thunder/tests/test_neva_modules.py
@@ -165,7 +165,7 @@ def _test_megatron_transformer_block(input_data):
     block = TransformerBlock(transformer_config, get_gpt_layer_with_transformer_engine_spec())
 
     block.to(device)
-    jblock = thunder.jit(block)
+    jblock = block
     hidden_states = torch.ones((4096, 1, transformer_config.hidden_size))
     hidden_states = hidden_states.cuda()

Snippet of logs

   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py", line 678, in prepare_forward
    self.set_activation_dtype(inp)
   File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py", line 588, in set_activation_dtype
    assert dtype == param.dtype, (
 AssertionError: Data types for parameters must match when outside of autocasted region.  Found input dtype: torch.float32 and 'layer_norm_weight' dtype: torch.bfloat16

Full logs: logs.txt

riccardofelluga (Collaborator, Author) commented

Oh wait, let me first check the repro/test script to see if it's a setup issue.

riccardofelluga (Collaborator, Author) commented

My bad, it was the wrong input dtype; however, we now get a different error. I'll close this issue and open a new one with the correct description.
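
(For reference, a hedged sketch of what the wrong input dtype likely means here: the hidden_states tensor in the test is created with torch.ones, which defaults to float32, while the block's parameters are bfloat16 per the tracebacks above. The hidden_size placeholder below stands in for the value the test reads from transformer_config.)

import torch

# Hypothetical fix sketch: create the input in the parameters' dtype
# (bfloat16) instead of the float32 default of torch.ones.
hidden_size = 4096  # placeholder; the test takes this from transformer_config
hidden_states = torch.ones((4096, 1, hidden_size), dtype=torch.bfloat16)
hidden_states = hidden_states.cuda()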

riccardofelluga (Collaborator, Author) commented

The new issue can be found at #1053.
