update use of optimizer to avoid runtime errors on sharded gradients #54

allaffa · 2026-02-09T03:13:57Z

Applied the FSDP2 fully_shard wrapper so the runtime uses the new API and the toggles now take effect.

Changes:

Switched the global wrapper to FSDP2 fully_shard with a DeviceMesh built from the active process group ranks in [distributed.py:14-399].
Updated MultiTaskModelMP to use FSDP2 on encoder/decoder with per-group meshes in [MultiTaskModelMP.py:8-276].
Added FSDP2 detection for save/load paths so checkpointing doesn’t call FSDP1-only APIs in [model.py:20-325].
Added a per-rank runtime log for FSDP2 activation in [distributed.py:392-399]. It prints "FSDP2 active on rank <rank>: <ModelClass>" when HYDRAGNN_USE_FSDP=1.
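
For illustration, the environment gate and the per-rank log line described above might look roughly like the sketch below. The function names are hypothetical, and the real code in distributed.py may parse HYDRAGNN_USE_FSDP or obtain the model class name differently; only the variable name and the message format come from this comment.

```python
import os


def fsdp_enabled(env=None):
    """Return True when HYDRAGNN_USE_FSDP is "1" (assumed parsing rule)."""
    env = os.environ if env is None else env
    return env.get("HYDRAGNN_USE_FSDP", "0") == "1"


def fsdp2_activation_message(rank, model):
    """Build the log line in the format quoted in the PR description."""
    # Assumption: the class name is taken from the model object itself.
    return f"FSDP2 active on rank {rank}: {type(model).__name__}"


def log_fsdp2_activation(rank, model, env=None):
    """Print the activation line on this rank, but only when FSDP is enabled."""
    if not fsdp_enabled(env):
        return None
    message = fsdp2_activation_message(rank, model)
    print(message)
    return message
```

Passing a plain dict as `env` keeps the sketch easy to exercise without touching the real process environment.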

Notes:

FSDP2 currently supports FULL_SHARD only; the code now warns and ignores other HYDRAGNN_FSDP_STRATEGY values.
The [set_reshard_after_backward(False/True)] toggles in [train_validate_test.py:70-958] now apply to FSDP2 and should address the [autograd.grad()] storage error.
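
The two notes above can be sketched as follows. This is a minimal illustration and not the code in this PR: `resolve_fsdp_strategy` and `keep_unsharded_after_backward` are hypothetical names, the warning text is invented, and restoring the flag to True on exit is an assumption. The context manager relies only on the model exposing a `set_reshard_after_backward(bool)` method, as FSDP2-wrapped modules do, and leaves any other model untouched.

```python
import os
import warnings
from contextlib import contextmanager

SUPPORTED_STRATEGY = "FULL_SHARD"


def resolve_fsdp_strategy(env=None):
    """Warn about, then ignore, any strategy other than FULL_SHARD."""
    env = os.environ if env is None else env
    requested = env.get("HYDRAGNN_FSDP_STRATEGY", SUPPORTED_STRATEGY)
    if requested != SUPPORTED_STRATEGY:
        # Hypothetical wording; the PR only says the code warns and ignores.
        warnings.warn(
            f"HYDRAGNN_FSDP_STRATEGY={requested} is ignored; "
            f"the FSDP2 path uses {SUPPORTED_STRATEGY}."
        )
    return SUPPORTED_STRATEGY


@contextmanager
def keep_unsharded_after_backward(model):
    """Disable resharding after backward inside the block, then re-enable it."""
    toggle = getattr(model, "set_reshard_after_backward", None)
    if toggle is not None:
        toggle(False)  # keep parameters unsharded after backward
    try:
        yield model
    finally:
        if toggle is not None:
            toggle(True)  # assumption: True is the setting to restore
```

A training step that needs parameters to stay materialized (for example, around an extra gradient computation) could wrap that region in `with keep_unsharded_after_backward(model): ...`.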

jychoi-hpc · 2026-02-09T16:04:43Z

Just tested. I still have the error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/lustre/orion/world-shared/lrn070/jyc/frontier/HydraGNN/examples/multidataset_hpo_sc26/gfm_mlip_all_mpnn.py", line 692, in <module>
[rank0]:     hydragnn.train.train_validate_test(
[rank0]:   File "/lustre/orion/world-shared/lrn070/jyc/frontier/HydraGNN/hydragnn/train/train_validate_test.py", line 286, in train_validate_test
[rank0]:     train_loss, train_taskserr = train(
[rank0]:                                  ^^^^^^
[rank0]:   File "/lustre/orion/world-shared/lrn070/jyc/frontier/HydraGNN/hydragnn/train/train_validate_test.py", line 687, in train
[rank0]:     loss.backward()
[rank0]:   File "/lustre/orion/world-shared/lrn070/jyc/frontier/HydraGNN/HydraGNN-Installation-Frontier/hydragnn_venv/lib/python3.11/site-packages/torch/_tensor.py", line 625, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/lustre/orion/world-shared/lrn070/jyc/frontier/HydraGNN/HydraGNN-Installation-Frontier/hydragnn_venv/lib/python3.11/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/lustre/orion/world-shared/lrn070/jyc/frontier/HydraGNN/HydraGNN-Installation-Frontier/hydragnn_venv/lib/python3.11/site-packages/torch/autograd/graph.py", line 841, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: setStorage: sizes [1, 10], strides [10, 1], storage offset 4720, and itemsize 8 requiring a storage size of 37840 are out of bounds for storage of size 0
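
As a reading aid, the storage size quoted in the error can be reproduced from the other numbers in the message. The sketch below uses the usual strided-layout rule (last addressed element is the offset plus the sum of (size - 1) * stride over dimensions); the helper name is hypothetical and it assumes non-negative strides. An itemsize of 8 is consistent with a 64-bit dtype, and a backing storage of size 0 is consistent with a buffer that had already been freed when the view was requested, which matches the resharding hypothesis in the PR description but is not a confirmed diagnosis.

```python
def required_storage_bytes(sizes, strides, storage_offset, itemsize):
    """Bytes of storage a strided view needs (assumes non-negative strides)."""
    if any(n == 0 for n in sizes):
        return 0  # an empty view addresses no elements
    # Index of the last element the view can address, in units of elements.
    last_index = storage_offset + sum(
        (n - 1) * s for n, s in zip(sizes, strides)
    )
    return (last_index + 1) * itemsize


# Values copied from the RuntimeError above.
needed = required_storage_bytes([1, 10], [10, 1], 4720, 8)
print(needed)  # 37840, matching the message; the actual storage had size 0
```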


allaffa self-assigned this Feb 9, 2026

allaffa added the bug Something isn't working label Feb 9, 2026

allaffa requested a review from jychoi-hpc February 9, 2026 16:02

allaffa force-pushed the fsdp_optimizer_compute_grad branch from 0b115ae to 6618cfb Compare February 10, 2026 14:29

allaffa added 2 commits February 10, 2026 09:33

FSDP1 replaced with FSDP2 to enable parameter resharding

371cc0d

added MixedPrecisionPolicy handler to use FSDP2 native machine precision casting

a18aa12

allaffa force-pushed the fsdp_optimizer_compute_grad branch from 6618cfb to a18aa12 Compare February 10, 2026 14:34

allaffa added the enhancement New feature or request label Feb 10, 2026
