
Add output node if it does not exist in the split module #1480

Merged: 7 commits from wa1476 into main, Dec 2, 2024
Conversation

@kiya00 (Collaborator) commented on Nov 26, 2024

Before submitting
  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #1476 .

Adds a workaround to make ThunderFX work with older versions of PyTorch: it goes through all submodules of split_module and adds an output node if one is missing.
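
A minimal sketch of the idea (the helper name ensure_output_node is illustrative; the actual workaround lives in thunder/dynamo/splitter.py and is applied to the submodules produced by torch.fx.passes.split_module):

```python
from torch.fx import GraphModule


def ensure_output_node(subgm: GraphModule) -> None:
    # Older PyTorch versions can produce split submodules whose graphs lack an
    # ``output`` node, which later trips assertions in ThunderFX.
    nodes = list(subgm.graph.nodes)
    if not nodes or nodes[-1].op != "output":
        subgm.graph.output(None)  # placeholder return value; illustrative only
        subgm.recompile()
```

In the PR this kind of check is applied to every submodule of the split GraphModule.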

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@IvanYashchuk (Collaborator) left a comment:

Thank you for the quick fix!

@IvanYashchuk (Collaborator) commented:

An instance check is missing: currently the code may try to access .graph on a plain nn.Module. Here's the error I see when running the model from #1476:

  File "/workspace/lightning-thunder/thunder/dynamo/splitter.py", line 150, in _splitter
    add_output(original_split_gm)
  File "/workspace/lightning-thunder/thunder/dynamo/splitter.py", line 143, in add_output
    add_output(getattr(m, node.target))
  File "/workspace/lightning-thunder/thunder/dynamo/splitter.py", line 143, in add_output
    add_output(getattr(m, node.target))
  File "/workspace/lightning-thunder/thunder/dynamo/splitter.py", line 141, in add_output
    for node in m.graph.nodes:
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1728, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
torch._dynamo.exc.BackendCompilerFailed: backend='<thunder.dynamo.compiler.ThunderCompiler object at 0x7f6ed5737280>' raised:
AttributeError: 'Embedding' object has no attribute 'graph'
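
For reference, a minimal sketch of the kind of guard being suggested (the name add_output mirrors the traceback above; the body is illustrative, not the actual implementation in thunder/dynamo/splitter.py):

```python
import torch.nn as nn
from torch.fx import GraphModule


def add_output(m: nn.Module) -> None:
    # call_module targets can resolve to plain nn.Modules (e.g. nn.Embedding),
    # which have no .graph attribute, so only touch FX GraphModules.
    if not isinstance(m, GraphModule):
        return
    nodes = list(m.graph.nodes)
    if not nodes or nodes[-1].op != "output":
        m.graph.output(None)  # placeholder return value; illustrative only
        m.recompile()
    for node in m.graph.nodes:
        if node.op == "call_module":
            add_output(getattr(m, node.target))
```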

@IvanYashchuk self-requested a review on November 27, 2024, 08:38
@kiya00 marked this pull request as ready for review on November 27, 2024, 12:24
@kiya00 (Collaborator, Author) commented on Nov 27, 2024:

With this change, the assertion error is fixed:

root@f2a3ac3a1f9b:/workspace# python _nemo.py --model checkpoints/$model --mbs 1 --seq-length 2048 --jit-backend thunder
[NeMo W 2024-11-27 10:02:28 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.
      cm = get_cmap("Set1")

Namespace(model='checkpoints/microsoft/Phi-3.5-mini-instruct', strategy='auto', devices=1, accelerator='gpu', max_steps=100, wandb_project=None, mbs=1, grad_acc_steps=1, seq_length=2048, jit_backend='thunder', output=None, profile_mem=False)
Map: 100%|████████████████████████████████████████████████████████████████████████████| 100/100 [00:10<00:00,  9.88 examples/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo I 2024-11-27 10:02:40 nemo_logger:145] Experiments will be logged at /workspace/nemo_experiments/default/2024-11-27_10-02-40
[NeMo W 2024-11-27 10:02:40 nemo_logger:173] "update_logger_directory" is True. Overwriting tensorboard logger "save_dir" to /workspace/nemo_experiments
[NeMo W 2024-11-27 10:02:40 nemo_logging:361] /usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.

[NeMo I 2024-11-27 10:02:40 model_transform:66] Setting up ModelTransform for stage: fit
[NeMo I 2024-11-27 10:02:40 model_transform:69] Found model_transform attribute on pl_module
[NeMo I 2024-11-27 10:02:40 model_transform:72] Set model_transform to: <function _call_counter.<locals>.wrapper at 0x710f652d3a30>
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.31it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name  | Type            | Params | Mode
--------------------------------------------------
0 | model | Phi3ForCausalLM | 3.8 B  | train
--------------------------------------------------
3.8 B     Trainable params
0         Non-trainable params
3.8 B     Total params
15,284.318Total estimated model params size (MB)
423       Modules in train mode
0         Modules in eval mode
Epoch 0:   0%|                                                                                         | 0/100 [00:00<?, ?it/s]
[NeMo I 2024-11-27 10:02:42 model_transform:90] After applying model_transform:
      | Name  | Type            | Params | Mode
    --------------------------------------------------
    0 | model | Phi3ForCausalLM | 3.8 B  | train
    --------------------------------------------------
    25.2 M    Trainable params
    3.8 B     Non-trainable params
    3.8 B     Total params
    15,384.982Total estimated model params size (MB)
    679       Modules in train mode
    0         Modules in eval mode
[NeMo I 2024-11-27 10:02:42 peft:192] Setting up optimizers
[NeMo W 2024-11-27 10:02:42 peft:213] MegatronOptimizerModule not found in trainer callbacks. finalize_model_grads is not properly set up for PEFT.
Epoch 0:   8%|███▊                                           | 8/100 [01:15<14:22,  0.11it/s, v_num=2-40, train_log_step=0.969]
W1127 10:03:58.072000 124318609024128 torch/_dynamo/convert_frame.py:744] [35/8] torch._dynamo hit config.cache_size_limit (8)
W1127 10:03:58.072000 124318609024128 torch/_dynamo/convert_frame.py:744] [35/8]    function: 'torch_dynamo_resume_in_hook_at_106' (/workspace/_nemo.py:106)
W1127 10:03:58.072000 124318609024128 torch/_dynamo/convert_frame.py:744] [35/8]    last reason: not L['self'].stats
W1127 10:03:58.072000 124318609024128 torch/_dynamo/convert_frame.py:744] [35/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W1127 10:03:58.072000 124318609024128 torch/_dynamo/convert_frame.py:744] [35/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
Epoch 0: 100%|██████████████████████| 100/100 [01:48<00:00,  0.92it/s, v_num=2-40, train_log_step=0.969, train_log_epoch=0.969]
`Trainer.fit` stopped: `max_steps=100` reached.
Epoch 0: 100%|██████████████████████| 100/100 [01:48<00:00,  0.92it/s, v_num=2-40, train_log_step=0.969, train_log_epoch=0.969]
71.24638748168945 32400820224 1 2048 32064
1.7490043640136719 32652466176 1 2048 32064
0.34031224250793457 32711318016 1 2048 32064
0.34903740882873535 32711318016 1 2048 32064
0.3445291519165039 32711318016 1 2048 32064
0.34233975410461426 32711318016 1 2048 32064
0.3496072292327881 32711318016 1 2048 32064
0.3462686538696289 32711318016 1 2048 32064
0.34028124809265137 32711318016 1 2048 32064
0.3493776321411133 32711318016 1 2048 32064
0.34827423095703125 32711318016 1 2048 32064
0.3481743335723877 32711318016 1 2048 32064
0.34900951385498047 32711318016 1 2048 32064
0.3499159812927246 32711318016 1 2048 32064
0.34909939765930176 32711318016 1 2048 32064
0.35044121742248535 32711318016 1 2048 32064
0.3503093719482422 32711318016 1 2048 32064
0.35189032554626465 32711318016 1 2048 32064
0.35217785835266113 32711318016 1 2048 32064
0.35206127166748047 32711318016 1 2048 32064
0.35222792625427246 32711318016 1 2048 32064
0.35178208351135254 32711318016 1 2048 32064
0.3539600372314453 32711318016 1 2048 32064
0.35262608528137207 32711318016 1 2048 32064
0.35372424125671387 32711318016 1 2048 32064
0.3552875518798828 32711318016 1 2048 32064
0.3546924591064453 32711318016 1 2048 32064
0.3549506664276123 32711318016 1 2048 32064
0.35607314109802246 32711318016 1 2048 32064
0.35628795623779297 32711318016 1 2048 32064
0.3566772937774658 32711318016 1 2048 32064
0.3565561771392822 32711318016 1 2048 32064
0.35889339447021484 32711318016 1 2048 32064
0.3573873043060303 32711318016 1 2048 32064
0.35866880416870117 32711318016 1 2048 32064
0.35920119285583496 32711318016 1 2048 32064
0.3578832149505615 32711318016 1 2048 32064
0.3608591556549072 32711318016 1 2048 32064
0.3609433174133301 32711318016 1 2048 32064
0.3593254089355469 32711318016 1 2048 32064
0.36185503005981445 32711318016 1 2048 32064
0.36078333854675293 32711318016 1 2048 32064
0.36107325553894043 32711318016 1 2048 32064
0.3626222610473633 32711318016 1 2048 32064
0.3623805046081543 32711318016 1 2048 32064
0.3635847568511963 32711318016 1 2048 32064
0.3640592098236084 32711318016 1 2048 32064
0.364102840423584 32711318016 1 2048 32064
0.36544227600097656 32711318016 1 2048 32064
0.36483168601989746 32711318016 1 2048 32064
0.3660876750946045 32711318016 1 2048 32064
0.36612462997436523 32711318016 1 2048 32064
0.36690330505371094 32711318016 1 2048 32064
0.36809730529785156 32711318016 1 2048 32064
0.3661482334136963 32711318016 1 2048 32064
0.36955690383911133 32711318016 1 2048 32064
0.368344783782959 32711318016 1 2048 32064
0.3678436279296875 32711318016 1 2048 32064
0.37161684036254883 32711318016 1 2048 32064
0.36888957023620605 32711318016 1 2048 32064
0.3699483871459961 32711318016 1 2048 32064
0.3716769218444824 32711318016 1 2048 32064
0.37128305435180664 32711318016 1 2048 32064
0.37117910385131836 32711318016 1 2048 32064
0.37291693687438965 32711318016 1 2048 32064
0.3728320598602295 32711318016 1 2048 32064
0.37095189094543457 32711318016 1 2048 32064
0.3759768009185791 32711318016 1 2048 32064
0.3717625141143799 32711318016 1 2048 32064
0.37224578857421875 32711318016 1 2048 32064
0.37676262855529785 32711318016 1 2048 32064
0.37230944633483887 32711318016 1 2048 32064
0.3759279251098633 32711318016 1 2048 32064
0.3772103786468506 32711318016 1 2048 32064
0.3745462894439697 32711318016 1 2048 32064
0.37671732902526855 32711318016 1 2048 32064
0.3769521713256836 32711318016 1 2048 32064
0.3758692741394043 32711318016 1 2048 32064
0.3771376609802246 32711318016 1 2048 32064
0.3784637451171875 32711318016 1 2048 32064
0.37657976150512695 32711318016 1 2048 32064
0.3799467086791992 32711318016 1 2048 32064
0.37900495529174805 32711318016 1 2048 32064
0.37968873977661133 32711318016 1 2048 32064
0.38152074813842773 32711318016 1 2048 32064
0.379547119140625 32711318016 1 2048 32064
0.3811225891113281 32711318016 1 2048 32064
0.3815124034881592 32711318016 1 2048 32064
0.38086843490600586 32711318016 1 2048 32064
0.382535457611084 32711318016 1 2048 32064
0.38179779052734375 32711318016 1 2048 32064
0.3838808536529541 32711318016 1 2048 32064
0.3811781406402588 32711318016 1 2048 32064
0.38353872299194336 32711318016 1 2048 32064
0.383298397064209 32711318016 1 2048 32064
0.3834874629974365 32711318016 1 2048 32064
0.38483405113220215 32711318016 1 2048 32064
0.3836386203765869 32711318016 1 2048 32064
0.3848240375518799 32711318016 1 2048 32064
0.3864426612854004 32711318016 1 2048 32064

@kshitij12345 (Collaborator) left a comment:

LGTM, thanks @kiya00

Should we add a test that checks this explicitly, instead of relying on test_thundercompiler_optim_step to test it implicitly?

@IvanYashchuk (Collaborator) commented:

@t-vi, could you please merge this one?

@IvanYashchuk enabled auto-merge (squash) on November 28, 2024, 08:21
@IvanYashchuk added the thunderfx label (for things that could be applicable to the dynamo+thunder frontend) on Nov 28, 2024
@kiya00 (Collaborator, Author) commented on Dec 2, 2024:

Hi @t-vi, I think it's ready to merge.

@t-vi (Collaborator) left a review comment.

@IvanYashchuk merged commit 15c48ef into main on Dec 2, 2024
41 checks passed
@IvanYashchuk deleted the wa1476 branch on December 2, 2024, 16:56
Labels
thunderfx: for things that could be applicable to the dynamo+thunder frontend
Development

Successfully merging this pull request may close these issues.

AssertionError for Phi-3.5-mini-instruct and Qwen2.5-7B-Instruct with NeMo + ThunderFX
4 participants