optim save for distributed filesystem #64

Thaurun · 2025-12-24T05:20:26Z

The new save method avoids the gather step, which saves the extra GPU memory that gathering the weights would otherwise require.

jthomy · 2026-01-09T17:20:38Z

@Thaurun @ISEEKYAN I noticed that this commit breaks saving, at least for DeepSeek.
There is a race condition: we should call torch.distributed.barrier() right after the assertion that the Hugging Face folder doesn't contain safetensors files.
I also observed another issue where w_files[0][4] is None, causing TypeError: '>' not supported between instances of 'NoneType' and 'int'.
I think this happens because tensor_model_parallel can be None for some weights: export_weights_without_gather returns None for them, and that value is then stored in the tuple.

ISEEKYAN · 2026-02-09T07:41:46Z

Thanks @jthomy for pointing out this issue; it has been fixed in #77.

Thaurun added 2 commits December 24, 2025 13:21

save hf ckpt without gather

267fd56

fix

4287c26

Thaurun force-pushed the gyhe/optim-save branch from a3127ad to 4287c26 Compare December 24, 2025 05:21

Thaurun added 2 commits December 24, 2025 17:05

fix ep_size=1 error

e360c18

add raise

f2a3e0a

ISEEKYAN merged commit 715ba9d into main Jan 8, 2026
1 check passed
