@allaffa
Owner

@allaffa allaffa commented Feb 9, 2026

Applied the FSDP2 fully_shard wrapper so the runtime uses the new API and the toggles now take effect.

Changes:

  • Switched the global wrapper to FSDP2 fully_shard with a DeviceMesh built from the active process group ranks in [distributed.py:14-399].

  • Updated MultiTaskModelMP to use FSDP2 on encoder/decoder with per-group meshes in [MultiTaskModelMP.py:8-276].

  • Added FSDP2 detection for save/load paths so checkpointing doesn’t call FSDP1-only APIs in [model.py:20-325].

  • Added a per-rank runtime log for FSDP2 activation in [distributed.py:392-399]. It prints "FSDP2 active on rank <rank>: <ModelClass>" when HYDRAGNN_USE_FSDP=1.
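
The detection and logging changes listed above could look roughly like the following. This is a minimal sketch, not the code in this PR: the helper names is_fsdp2_model, fsdp2_activation_message, and fsdp_requested are hypothetical, and it assumes a recent PyTorch that exports FSDPModule from torch.distributed.fsdp, treating the model as "not FSDP2" when that import fails.

```python
import os


def _fsdp2_module_type():
    """Return PyTorch's FSDP2 module class, or None when it is unavailable."""
    try:
        # Assumption: recent PyTorch releases export FSDPModule here.
        from torch.distributed.fsdp import FSDPModule
    except ImportError:
        return None
    return FSDPModule


def is_fsdp2_model(model) -> bool:
    """Hypothetical helper: True if `model` is an FSDP2 (fully_shard) module."""
    fsdp2_type = _fsdp2_module_type()
    return fsdp2_type is not None and isinstance(model, fsdp2_type)


def fsdp2_activation_message(rank: int, model) -> str:
    """Build a per-rank log line in the format quoted in the PR description."""
    return f"FSDP2 active on rank {rank}: {type(model).__name__}"


def fsdp_requested(environ=os.environ) -> bool:
    """Assumption: FSDP is requested only when HYDRAGNN_USE_FSDP is exactly "1"."""
    return environ.get("HYDRAGNN_USE_FSDP", "0") == "1"
```

A save or load path could branch on is_fsdp2_model(model) before touching any FSDP1-only state-dict helpers; how HydraGNN actually parses the environment variable is not shown in this thread.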

Notes:

  • The FSDP2 integration currently supports FULL_SHARD only; the code now warns about and ignores any other HYDRAGNN_FSDP_STRATEGY value.

  • [set_reshard_after_backward(False/True)] toggles in [train_validate_test.py:70-958] will now apply to FSDP2 and should address the [autograd.grad()] storage error.
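
As a rough illustration of the two notes above, the sketch below normalizes the strategy setting and wraps a block of work in a reshard toggle. It is not the PR's implementation: resolve_fsdp2_strategy and keep_unsharded_after_backward are hypothetical names, the default of FULL_SHARD for an unset variable is an assumption, and the only library call relied on is FSDP2's set_reshard_after_backward method, looked up by duck typing so that unwrapped models pass through unchanged.

```python
import contextlib
import os
import warnings


def resolve_fsdp2_strategy(environ=os.environ) -> str:
    """Return the strategy to use, warning when a non-FULL_SHARD value is set."""
    # Assumption: an unset variable means FULL_SHARD.
    requested = environ.get("HYDRAGNN_FSDP_STRATEGY", "FULL_SHARD").upper()
    if requested != "FULL_SHARD":
        warnings.warn(
            f"HYDRAGNN_FSDP_STRATEGY={requested} is ignored; using FULL_SHARD."
        )
    return "FULL_SHARD"


@contextlib.contextmanager
def keep_unsharded_after_backward(model):
    """Turn off resharding after backward inside the block, then turn it on again.

    Models without set_reshard_after_backward (not FSDP2) are left untouched.
    """
    toggle = getattr(model, "set_reshard_after_backward", None)
    if toggle is not None:
        toggle(False)
    try:
        yield model
    finally:
        # Re-enable resharding even if the body raised.
        if toggle is not None:
            toggle(True)
```

Under these assumptions, a training step would run its [autograd.grad()] call inside the with block and call [loss.backward()] after it, so parameters are not resharded between the two gradient passes.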

@allaffa allaffa self-assigned this Feb 9, 2026
@allaffa allaffa added the bug Something isn't working label Feb 9, 2026
@allaffa allaffa requested a review from jychoi-hpc February 9, 2026 16:02
@jychoi-hpc
Collaborator

Just tested. I still have the error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/lustre/orion/world-shared/lrn070/jyc/frontier/HydraGNN/examples/multidataset_hpo_sc26/gfm_mlip_all_mpnn.py", line 692, in <module>
[rank0]:     hydragnn.train.train_validate_test(
[rank0]:   File "/lustre/orion/world-shared/lrn070/jyc/frontier/HydraGNN/hydragnn/train/train_validate_test.py", line 286, in train_validate_test
[rank0]:     train_loss, train_taskserr = train(
[rank0]:                                  ^^^^^^
[rank0]:   File "/lustre/orion/world-shared/lrn070/jyc/frontier/HydraGNN/hydragnn/train/train_validate_test.py", line 687, in train
[rank0]:     loss.backward()
[rank0]:   File "/lustre/orion/world-shared/lrn070/jyc/frontier/HydraGNN/HydraGNN-Installation-Frontier/hydragnn_venv/lib/python3.11/site-packages/torch/_tensor.py", line 625, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/lustre/orion/world-shared/lrn070/jyc/frontier/HydraGNN/HydraGNN-Installation-Frontier/hydragnn_venv/lib/python3.11/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/lustre/orion/world-shared/lrn070/jyc/frontier/HydraGNN/HydraGNN-Installation-Frontier/hydragnn_venv/lib/python3.11/site-packages/torch/autograd/graph.py", line 841, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: setStorage: sizes [1, 10], strides [10, 1], storage offset 4720, and itemsize 8 requiring a storage size of 37840 are out of bounds for storage of size 0
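
The numbers in the final error line can be reproduced by hand, which helps when reading it. The sketch below assumes the usual rule for strided views with non-negative strides (highest addressed element index, plus one, times the item size); required_storage_bytes is a hypothetical helper, not a PyTorch function.

```python
def required_storage_bytes(sizes, strides, storage_offset, itemsize):
    """Bytes a strided view needs: (offset + last element index + 1) * itemsize."""
    if any(size == 0 for size in sizes):
        return 0  # an empty view addresses no elements
    last_index = sum((size - 1) * stride for size, stride in zip(sizes, strides))
    return (storage_offset + last_index + 1) * itemsize


# Values copied from the RuntimeError above.
needed = required_storage_bytes([1, 10], [10, 1], 4720, 8)
print(needed)  # 37840, matching the message
```

The view needs 37840 bytes but the storage holds 0, which suggests the tensor's backing storage had already been released when backward tried to use it. That would be consistent with a sharded parameter being freed early, though the traceback alone does not confirm the cause.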

@allaffa allaffa force-pushed the fsdp_optimizer_compute_grad branch from 0b115ae to 6618cfb Compare February 10, 2026 14:29
@allaffa allaffa force-pushed the fsdp_optimizer_compute_grad branch from 6618cfb to a18aa12 Compare February 10, 2026 14:34
@allaffa allaffa added the enhancement New feature or request label Feb 10, 2026