First Shard Group Save and Load Checkpoint for HSDP #709
Comments
Why would this lead to data corruption?
DCP saves only one copy of the data when the data is replicated across ranks. It is not necessarily the first rank or the first group that saves the replicated data; DCP decides this during the planning phase.
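To make the planning idea concrete, here is a minimal toy sketch in plain Python. It is not DCP's actual algorithm: the function `plan_writes`, the layout dictionary, and the least-loaded tie-breaking rule are all assumptions made for illustration. It only shows how a planner can assign each replicated item to exactly one writer, and why that writer need not be in the first shard group.

```python
from collections import defaultdict


def plan_writes(items_per_rank):
    """Assign every key to exactly one writer rank (toy model, not DCP's code).

    items_per_rank maps rank -> {key: size_in_bytes}. A key that appears on
    several ranks is treated as replicated across those ranks.
    """
    holders = defaultdict(list)
    sizes = {}
    for rank, items in items_per_rank.items():
        for key, size in items.items():
            holders[key].append(rank)
            sizes[key] = size

    load = {rank: 0 for rank in items_per_rank}
    plan = {rank: [] for rank in items_per_rank}
    # Place large items first, each on the least-loaded rank that holds it.
    for key in sorted(holders, key=lambda k: (-sizes[k], k)):
        writer = min(holders[key], key=lambda r: (load[r], r))
        plan[writer].append(key)
        load[writer] += sizes[key]
    return plan


# Hypothetical HSDP layout: 2 replica groups x 2 shards.
# Ranks 0 and 2 hold the same shard, ranks 1 and 3 hold the other one,
# and a small "step" counter is replicated on every rank.
layout = {
    0: {"w.shard0": 100, "step": 8},
    1: {"w.shard1": 100, "step": 8},
    2: {"w.shard0": 100, "step": 8},
    3: {"w.shard1": 100, "step": 8},
}
plan = plan_writes(layout)
written = [key for keys in plan.values() for key in keys]
```

In this toy run every key is written once, and the replicated `step` value ends up on rank 2, which belongs to the second replica group.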
@fegin
When multiple processes write to the same file, isn’t it common to encounter data corruption without proper file locks or scheduling mechanisms?
Interesting, thank you for clarifying. If there’s a planner coordinating the writes, the file system corruption issue should not occur.
In the meantime, I’ve been exploring the DCP implementation and APIs. However, there is no detailed documentation explaining the coordinator or planner components. I’d like to share what I’ve found so far. Please correct me if I’m mistaken; hopefully this will help others as well:
Based on this logic, it seems that setting process_group=None might be the best approach. Could you confirm whether this should always be the case? When do we need to pass a non-None argument for
Additionally, I have another question:
Yes, but even if DCP's planner decided to save multiple copies, it still would not cause data corruption, because different ranks write to different files. Your understanding is mostly correct. As for the world PG, there may be cases where users want to save among only a subset of ranks. This is uncommon, but some advanced users have their own unusual infrastructure designs. As for loading, there is again a planning phase that coordinates all ranks to load the data correctly without loading redundant data. DCP also assumes a distributed file system such that each rank can access the required files. If no such file system exists, users need to ensure the required files are accessible.
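The two points above (one file per writing rank, and a load step that reads only what is needed) can be sketched with a small single-process simulation. This is a hedged illustration, not DCP's implementation: the helpers `save_rank` and `load_keys`, the `rank_{r}.json` file naming, and the in-memory `index` are hypothetical stand-ins for DCP's storage writer, file layout, and metadata.

```python
import json
import tempfile
from pathlib import Path


def save_rank(ckpt_dir, rank, items):
    """Write one rank's assigned items to a file that only this rank touches."""
    path = Path(ckpt_dir) / f"rank_{rank}.json"  # hypothetical naming scheme
    path.write_text(json.dumps(items))
    # Return index entries recording which file holds each key.
    return {key: path.name for key in items}


def load_keys(ckpt_dir, index, wanted):
    """Read only the files that contain the wanted keys."""
    by_file = {}
    for key in wanted:
        by_file.setdefault(index[key], []).append(key)
    out = {}
    for fname, keys in by_file.items():
        data = json.loads((Path(ckpt_dir) / fname).read_text())
        for key in keys:
            out[key] = data[key]
    return out


with tempfile.TemporaryDirectory() as ckpt_dir:
    # Assumed outcome of a planning phase: each key has exactly one writer.
    write_plan = {
        0: {"w.shard0": [1, 2]},
        1: {"w.shard1": [3, 4]},
        2: {"step": 7},
        3: {},
    }
    index = {}
    for rank, items in write_plan.items():
        if items:  # ranks with nothing assigned write no file
            index.update(save_rank(ckpt_dir, rank, items))
    files = sorted(p.name for p in Path(ckpt_dir).iterdir())
    # A rank that needs only "w.shard1" and "step" opens just two files.
    loaded = load_keys(ckpt_dir, index, ["w.shard1", "step"])
```

Because no two ranks share an output file, the writes cannot collide even without file locks. The simulation uses one local directory, which mirrors the assumption that every rank can reach the files it needs.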
Thank you for your explanation, @fegin! It makes sense to me now after checking the source code.
Based on my understanding, the current strategy is:
1. All ranks currently read and load the checkpoint.
2. All ranks also save and write the checkpoint.
I have a question regarding the HSDP case:
If different shard groups write data to storage, could this lead to data corruption?
Ideally, should only the first shard group read the data, broadcast it, and handle writing to ensure consistency?