Closed
Description
Based on my understanding, the current strategy is:
1. All ranks currently read and load the checkpoint.
2. All ranks also save and write the checkpoint.
I have a question regarding the HSDP case:
If different shard groups write data to storage, could this lead to data corruption?
Ideally, should only the first shard group read the data, broadcast it, and handle writing to ensure consistency?
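For concreteness, here is a minimal sketch of that pattern using the public DCP API (the model and path are hypothetical; in practice the state dict would come from the HSDP-wrapped model and optimizer):

```python
import torch
import torch.distributed.checkpoint as dcp

# Hypothetical stand-in for the HSDP-wrapped model.
model = torch.nn.Linear(4, 4)
state_dict = {"model": model.state_dict()}

# Every rank in every shard group makes the same collective calls; DCP's
# planner decides which rank actually writes which shard.
dcp.save(state_dict, checkpoint_id="/tmp/checkpoint")

# Loading is also called on all ranks; tensors are loaded in place.
dcp.load(state_dict, checkpoint_id="/tmp/checkpoint")
```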
qsh-zh commented on Dec 2, 2024
Maybe @tianyu-l @fegin know?
fegin commented on Dec 2, 2024
Why would this lead to data corruption?
DCP will only save one copy of the data if the data is replicated across ranks. It is not necessarily the first rank/first group that will save the replicated data; DCP decides this during the planning phase.
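For intuition, here is a toy illustration of that dedup idea; this is not DCP's actual planner code (the real planner also balances write load across ranks), just a sketch of assigning each replicated tensor to exactly one writer:

```python
from typing import Dict, List

def dedup_plans(per_rank_items: Dict[int, List[str]]) -> Dict[int, List[str]]:
    """Keep each fully-qualified tensor name on only one rank."""
    seen = set()
    deduped = {}
    for rank in sorted(per_rank_items):
        kept = [fqn for fqn in per_rank_items[rank] if fqn not in seen]
        seen.update(kept)
        deduped[rank] = kept
    return deduped

# Two replica groups propose the same parameters; only one keeps them.
plans = {0: ["layer1.weight", "layer1.bias"], 1: ["layer1.weight", "layer1.bias"]}
print(dedup_plans(plans))  # {0: ['layer1.weight', 'layer1.bias'], 1: []}
```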
qsh-zh commented on Dec 2, 2024
@fegin
Thank you for your explanation.
When multiple processes write to the same file, isn’t it common to encounter data corruption without proper file locks or scheduling mechanisms?
Interesting—thank you for clarifying. If there’s a planner coordinating the writes, the file system corruption issue should not occur.
In the meantime, I’ve been exploring the DCP implementation and APIs. However, there is no detailed documentation explaining the coordinator or planner components.
I’d like to share what I’ve found so far. Please correct me if I’m mistaken, and hopefully, this will help others as well:
• dcp.save has an argument called process_group.
• The _DistWrapper class accepts the process_group.
• In this code snippet, central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step) seems to coordinate the saving process.
• If we pass process_group=None, this code handles deduplication for the world PG.
Based on this logic, it seems that setting process_group=None might be the best approach. Could you confirm whether this should always be the case? When do we need to pass a non-None argument for process_group?
Additionally, I have another question:
Does the logic of dcp.load work similarly to dcp.save, or do all ranks operate independently without synchronization? For replicated groups, do they read the same data? It seems there is no deduplication or broadcasting of states.
fegin commented on Dec 3, 2024
Yes, but even if DCP's planner decides to save multiple copies, it still won't cause data corruption because different ranks write to different files.
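One way to see this is to inspect the checkpoint's metadata after a save. The sketch below assumes the default FileSystemWriter layout and a hypothetical checkpoint path; storage_data is an internal detail and may change between releases:

```python
import torch.distributed.checkpoint as dcp

reader = dcp.FileSystemReader("/tmp/checkpoint")  # hypothetical path
metadata = reader.read_metadata()

# Each entry maps a tensor shard to the file that holds it; replicated
# tensors appear once, and shards from different ranks land in different
# per-rank files, so writers never contend on the same file.
for index, info in metadata.storage_data.items():
    print(index.fqn, getattr(index, "offset", None), "->", info.relative_path)
```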
Your understanding is mostly correct. As for the world PG, there are cases where users want to save among only a subset of ranks. This is not common, but some advanced users have their own infrastructure designs that are also uncommon.
As for loading, there is again a planning phase that coordinates all ranks to load the data correctly without loading redundant data. DCP also assumes a distributed file system, so that each rank can access the required files. If such a file system does not exist, users need to ensure the required files are accessible.
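For the subset-of-ranks case, a hedged sketch (the ranks, model, and path are hypothetical; most users should leave process_group=None so the default world group is used):

```python
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

model = torch.nn.Linear(4, 4)                  # hypothetical stand-in for the real model
state_dict = {"model": model.state_dict()}

subset_ranks = [0, 1, 2, 3]                    # e.g. only the first shard group
subset_pg = dist.new_group(ranks=subset_ranks) # new_group must be called by all ranks

if dist.get_rank() in subset_ranks:
    # Only ranks in the subset participate in planning and writing.
    dcp.save(state_dict, checkpoint_id="/tmp/checkpoint", process_group=subset_pg)
```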
qsh-zh commented on Jan 8, 2025
Thank you for your explanation @fegin! It makes sense to me now after checking the source code.