First Shard Group Save and Load Checkpoint for HSDP #709

Closed
qsh-zh opened this issue Nov 29, 2024 · 5 comments
Labels
question Further information is requested

Comments

@qsh-zh

qsh-zh commented Nov 29, 2024

Based on my understanding, the current strategy is:
1. All ranks currently read and load the checkpoint.
2. All ranks also save and write the checkpoint.

I have a question regarding the HSDP case:
If different shard groups write data to storage, could this lead to data corruption?
Ideally, should only the first shard group read the data, broadcast it, and handle writing to ensure consistency?

@qsh-zh
Author

qsh-zh commented Dec 2, 2024

Maybe @tianyu-l or @fegin know?

@tianyu-l tianyu-l added the question Further information is requested label Dec 2, 2024
@fegin
Contributor

fegin commented Dec 2, 2024

If different shard groups write data to storage, could this lead to data corruption?

Why would this lead to data corruption?

Ideally, should only the first shard group read the data, broadcast it, and handle writing to ensure consistency?

DCP will save only one copy of the data if the data is replicated across ranks. It is not necessarily the first rank or the first group that saves the replicated data. DCP decides this during the planning phase.
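
To make the planning idea concrete, here is a minimal toy sketch of deduplication in plain Python. It is not DCP's actual algorithm: the function name `toy_dedup`, the item keys, and the "largest item to the least-loaded holder" rule are all assumptions chosen for illustration. It only shows that each replicated item can be assigned to exactly one writer, and that the writer need not be in the first shard group.

```python
from collections import defaultdict


def toy_dedup(local_plans):
    """Assign every item to exactly one of the ranks that hold it.

    local_plans maps rank -> list of (item_key, nbytes).
    Returns rank -> list of item keys that rank should write.
    Hypothetical helper for illustration, not part of DCP.
    """
    holders = defaultdict(list)  # item_key -> ranks holding a copy
    sizes = {}
    for rank, items in local_plans.items():
        for key, nbytes in items:
            holders[key].append(rank)
            sizes[key] = nbytes

    load = {rank: 0 for rank in local_plans}      # bytes assigned so far
    final = {rank: [] for rank in local_plans}
    # Largest items first, each going to the least-loaded holder.
    for key in sorted(holders, key=lambda k: (-sizes[k], k)):
        writer = min(holders[key], key=lambda r: (load[r], r))
        final[writer].append(key)
        load[writer] += sizes[key]
    return final


# Assumed HSDP layout: 2 replicas x 2 shards.
# Ranks 0 and 2 hold shard 0; ranks 1 and 3 hold shard 1.
local_plans = {
    0: [("w.shard0", 100), ("b.shard0", 10)],
    1: [("w.shard1", 100), ("b.shard1", 10)],
    2: [("w.shard0", 100), ("b.shard0", 10)],
    3: [("w.shard1", 100), ("b.shard1", 10)],
}
final_plan = toy_dedup(local_plans)
```

In this toy run, ranks 2 and 3 (the second replica group) end up writing the bias shards, so the work is spread across groups rather than pinned to the first one.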

@qsh-zh
Author

qsh-zh commented Dec 2, 2024

@fegin
Thank you for your explanation.

Why would this lead to data corruption?

When multiple processes write to the same file, isn’t it common to encounter data corruption without proper file locks or scheduling mechanisms?

DCP will only save one copy of the data if the data is replicated across ranks.

Interesting—thank you for clarifying. If there’s a planner coordinating the writes, the file system corruption issue should not occur.

In the meantime, I’ve been exploring the DCP implementation and APIs. However, there is no detailed documentation explaining the coordinator or planner components.

I’d like to share what I’ve found so far. Please correct me if I’m mistaken, and hopefully, this will help others as well:
• dcp.save has an argument called process_group.
• The _DistWrapper class accepts the process_group.
• In this code snippet, central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step) seems to coordinate the saving process.
• If we pass process_group=None, this code handles deduplication across the world process group (PG).

Based on this logic, it seems that setting process_group=None might be the best approach. Could you confirm whether this should always be the case? When would we need to pass a non-None process_group?

Additionally, I have another question:
Does the logic of dcp.load work similarly to dcp.save, or do all ranks operate independently without synchronization? For replicated groups, do they read the same data? It seems there are no deduplication and broadcast states.

@fegin
Contributor

fegin commented Dec 3, 2024

When multiple processes write to the same file, isn’t it common to encounter data corruption without proper file locks or scheduling mechanisms?

Yes, but even if DCP's planner decides to save multiple copies, it still won't cause data corruption because different ranks write to different files.
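
As a rough illustration of why per-rank files avoid the problem, the sketch below simulates several "ranks" with threads, each writing only to its own file. The `rank_{rank}.bin` naming and the helper names are hypothetical and are not DCP's real file layout; the point is only that concurrent writers never share a file, so no locking is needed.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_rank_file(checkpoint_dir, rank, payload):
    # Hypothetical naming scheme: one file per rank, never shared.
    path = os.path.join(checkpoint_dir, f"rank_{rank}.bin")
    with open(path, "wb") as f:
        f.write(payload)
    return path


def simulate_save(world_size):
    """Write one payload per simulated rank, concurrently, into a temp dir."""
    checkpoint_dir = tempfile.mkdtemp()
    payloads = {rank: bytes([rank]) * 1024 for rank in range(world_size)}
    with ThreadPoolExecutor(max_workers=world_size) as pool:
        # Each thread stands in for one rank writing its own file.
        list(pool.map(
            lambda rank: write_rank_file(checkpoint_dir, rank, payloads[rank]),
            range(world_size),
        ))
    return checkpoint_dir, payloads
```

Reading each file back after `simulate_save(4)` should return exactly the bytes that rank wrote, regardless of how the threads interleave.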

Your understanding is mostly correct. As for the world PG, there may be cases where users want to save among only a subset of ranks. This is uncommon, but some advanced users have their own infrastructure designs that are equally uncommon.

As for loading, there is again a planning phase that coordinates all the ranks so that they load the data correctly without loading redundant data. DCP also assumes a distributed file system in which each rank can access the files it requires. If no such file system exists, users need to ensure that the required files can be accessed.
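
A toy sketch of what a per-rank read plan could look like, assuming a checkpoint metadata table that maps each saved item to a file, offset, and length. The metadata layout, file names, and `plan_reads` helper are assumptions for illustration and are not DCP's API. It shows why every rank needs access to files written by other ranks: a rank reads only the items it needs, wherever they were saved.

```python
def plan_reads(metadata, needed_keys):
    """Group the items a rank needs by the file that stores them.

    metadata maps item_key -> (file_name, offset, length).
    Returns file_name -> sorted list of (offset, length, item_key).
    Hypothetical helper for illustration, not part of DCP.
    """
    plan = {}
    for key in needed_keys:
        file_name, offset, length = metadata[key]
        plan.setdefault(file_name, []).append((offset, length, key))
    for reads in plan.values():
        reads.sort()  # read each file in offset order
    return plan


# Assumed metadata: each item was saved once, by some rank.
metadata = {
    "w.shard0": ("rank_0.bin", 0, 100),
    "w.shard1": ("rank_1.bin", 0, 100),
    "b.shard0": ("rank_2.bin", 0, 10),
    "b.shard1": ("rank_3.bin", 0, 10),
}

# Ranks 0 and 2 are replicas holding shard 0, so they need the same items.
plan_rank0 = plan_reads(metadata, ["w.shard0", "b.shard0"])
plan_rank2 = plan_reads(metadata, ["w.shard0", "b.shard0"])
```

Here rank 0 must open `rank_2.bin` even though it did not write that file, and neither replica touches the files that hold only shard 1.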

@fegin fegin closed this as completed Jan 8, 2025
@qsh-zh
Author

qsh-zh commented Jan 8, 2025

Thank you for your explanation, @fegin! It makes sense to me now after checking the source code.
