Collaborator

@Thaurun Thaurun commented Dec 24, 2025

The new method avoids the gather step, which saves the additional GPU memory that gathering would otherwise require.

@ISEEKYAN ISEEKYAN merged commit 715ba9d into main Jan 8, 2026
1 check passed
jthomy commented Jan 9, 2026

@Thaurun @ISEEKYAN I noticed that this commit breaks saving, at least for Deepseek:
There is a race condition. We should call torch.distributed.barrier() right after the assertion that the huggingface folder doesn't contain safetensors files.
I also observed another issue where w_files[0][4] is None, causing TypeError: '>' not supported between instances of 'NoneType' and 'int'.
I think this happens because tensor_model_parallel can be None for some weights: export_weights_without_gather returns None for them, and that value is then stored in the tuple.
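
For illustration only, the two problems described above could be guarded roughly as in the sketch below. This is not the code from the PR or from the fix: the function names, the five-element tuple layout, and the choice to treat a None parallel size as "not sharded" are all assumptions made for the example. The barrier is passed in as a plain callable so the sketch runs without a distributed setup; in a real job it would be torch.distributed.barrier.

```python
import os
import tempfile
from typing import Callable, Optional, Tuple

# Hypothetical record layout: the thread only says that index 4 can be None.
WeightEntry = Tuple[str, str, int, int, Optional[int]]


def assert_no_safetensors_then_sync(hf_dir: str, barrier: Callable[[], None]) -> None:
    """Check the output folder, then wait for every rank before anyone writes."""
    existing = [f for f in os.listdir(hf_dir) if f.endswith(".safetensors")]
    assert not existing, f"{hf_dir} already contains safetensors files: {existing}"
    # Assumption: in a real job, `barrier` is torch.distributed.barrier, so no
    # rank starts writing while another rank is still running the check above.
    barrier()


def is_tp_sharded(entry: WeightEntry) -> bool:
    """Treat a None parallel size as 'not sharded' instead of evaluating None > int."""
    tp_size = entry[4]
    return tp_size is not None and tp_size > 1


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    # A no-op stands in for the real barrier in this single-process demo.
    assert_no_safetensors_then_sync(out_dir, barrier=lambda: None)
    print(is_tp_sharded(("weight", "file", 0, 0, None)))
    print(is_tp_sharded(("weight", "file", 0, 0, 2)))
```

Whether None should really be read as "not sharded" depends on what export_weights_without_gather means by it, so that default would need to be confirmed against the actual code.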

Owner

ISEEKYAN commented Feb 9, 2026

Thanks @jthomy for pointing out this issue. It has been fixed in #77.
