prevent saving a large number of small files #78
Originally, the `_save_weight_fast` function saved each small weight as an individual file. When the number of weights is large, this produces a concentrated burst of creating and deleting many new files in a short period, which can put pressure on the distributed file system and is also relatively inefficient. I have therefore added new logic here to save weights in batched files.

I have verified the correctness, and testing before and after the modification showed that the `save_weights` time for an 80B MoE model on 16 GPUs was reduced from 250s to 190s, a decrease of 24%.
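To illustrate the batching idea, here is a minimal sketch (not the actual PR code). It assumes weights are given as a name-to-bytes mapping and groups them into a few large batch files instead of one file per weight; the function name, the `.pkl` format, and the 64 MiB threshold are all hypothetical choices for this example.

```python
import os
import pickle

def save_weights_batched(weights, out_dir, batch_size_bytes=64 * 1024 * 1024):
    """Group many small weights into a few batch files instead of one
    file per weight, reducing filesystem metadata pressure.

    `weights` maps weight name -> raw bytes. This is an illustrative
    sketch; the real _save_weight_fast operates on model tensors.
    Returns an index mapping each weight name to the batch file
    that contains it."""
    os.makedirs(out_dir, exist_ok=True)
    batch = {}          # weights accumulated for the current batch file
    batch_bytes = 0     # running size of the current batch
    batch_idx = 0       # sequence number used in batch file names
    index = {}          # weight name -> batch file path

    def flush():
        nonlocal batch, batch_bytes, batch_idx
        if not batch:
            return
        path = os.path.join(out_dir, f"batch_{batch_idx}.pkl")
        with open(path, "wb") as f:
            pickle.dump(batch, f)
        for name in batch:
            index[name] = path
        batch = {}
        batch_bytes = 0
        batch_idx += 1

    for name, blob in weights.items():
        batch[name] = blob
        batch_bytes += len(blob)
        # Write out a batch once it reaches the size threshold, so each
        # file is large enough to amortize per-file overhead.
        if batch_bytes >= batch_size_bytes:
            flush()
    flush()  # write any remaining weights in a final partial batch
    return index
```

With this scheme, saving 100 small weights with a small threshold produces only a handful of files rather than 100, which is the effect the PR is after on the distributed file system.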