
First Shard Group Save and Load Checkpoint for HSDP #709

Closed

Description

@qsh-zh

Based on my understanding, current strategy:
1. All ranks currently read and load the checkpoint.
2. All ranks also save and write the checkpoint.

I have a question regarding the HSDP case:
If different shard groups write data to storage, could this lead to data corruption?
Ideally, should only the first shard group read the data, broadcast it, and handle writing to ensure consistency?

Activity

qsh-zh (Author) commented on Dec 2, 2024

Maybe @tianyu-l @fegin know?

fegin (Contributor) commented on Dec 2, 2024

If different shard groups write data to storage, could this lead to data corruption?

Why would this lead to data corruption?

Ideally, should only the first shard group read the data, broadcast it, and handle writing to ensure consistency?

DCP will only save one copy of the data if the data is replicated across ranks. It is not necessarily the first rank or the first shard group that saves the replicated data; DCP decides this during the planning phase.
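
For readers following along, here is a minimal sketch of what this looks like in user code. It is not taken from this thread; the model, the 2x4 mesh shape, and the checkpoint path are illustrative. Every rank calls dcp.save, and the default SavePlanner deduplicates replicated shards during planning, so only one replica group persists each tensor and each writing rank writes its own file.

```python
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_model_state_dict
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FSDP, ShardingStrategy

# Assumes launch via torchrun on 8 ranks (2-way replicate x 4-way shard HSDP).
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))

model = FSDP(
    torch.nn.Linear(1024, 1024).cuda(),
    device_mesh=mesh,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)

# Every rank participates in the collective save. During the planning phase
# the default SavePlanner drops duplicate (replicated) shards, so only one
# replica group persists each tensor, and each writing rank writes to its
# own file; no two ranks ever write the same file.
state_dict = {"model": get_model_state_dict(model)}
dcp.save(state_dict, storage_writer=dcp.FileSystemWriter("checkpoint/step-100"))
```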

qsh-zh (Author) commented on Dec 2, 2024

@fegin
Thank you for your explanation.

Why would this lead to data corruption?

When multiple processes write to the same file, isn’t it common to encounter data corruption without proper file locks or scheduling mechanisms?

DCP will only save one copy of the data if the data is replicated across ranks.

Interesting—thank you for clarifying. If there’s a planner coordinating the writes, the file system corruption issue should not occur.

In the meantime, I’ve been exploring the DCP implementation and APIs. However, there is no detailed documentation explaining the coordinator or planner components.

I’d like to share what I’ve found so far. Please correct me if I’m mistaken, and hopefully, this will help others as well:
• dcp.save has an argument called process_group.
• The _DistWrapper class accepts the process_group.
• In this code snippet, central_plan: SavePlan = distW.reduce_scatter("plan", local_step, global_step) seems to coordinate the saving process.
• If we pass process_group=None, this code handles deduplication across the world process group.

Based on this logic, it seems that setting process_group=None might be the best approach. Could you confirm whether this should always be the case? When do we need to pass a non-None argument for process_group?
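
To make the two cases concrete, here is a hedged sketch; the paths, placeholder tensor, and rank subset are made up for illustration. With process_group=None, the planning collectives run over the default (world) process group mentioned above; passing an explicit group restricts planning, deduplication, and writing to its member ranks.

```python
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

if not dist.is_initialized():
    dist.init_process_group("gloo")  # assumes launch via torchrun

state_dict = {"weight": torch.randn(4, 4)}  # placeholder state for the sketch

# Common case: process_group=None. The planning collectives (e.g. the
# reduce_scatter of SavePlans) and deduplication run over the world group.
dcp.save(state_dict, storage_writer=dcp.FileSystemWriter("ckpt-world"))

# Advanced case (illustrative): checkpoint over a subset of ranks only.
# new_group must be called by all ranks, but only the member ranks should
# then call dcp.save with it; the other ranks sit out this save.
subgroup_ranks = [0, 1, 2, 3]
subgroup = dist.new_group(ranks=subgroup_ranks)
if dist.get_rank() in subgroup_ranks:
    dcp.save(
        state_dict,
        storage_writer=dcp.FileSystemWriter("ckpt-subset"),
        process_group=subgroup,
    )
```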

Additionally, I have another question:
Does the logic of dcp.load work similarly to dcp.save, or do all ranks operate independently without synchronization? For replicated groups, do they read the same data? It seems there is no deduplication or broadcasting of states.

fegin (Contributor) commented on Dec 3, 2024

When multiple processes write to the same file, isn’t it common to encounter data corruption without proper file locks or scheduling mechanisms?

Yes, but even if DCP's planner decides to save multiple copies, it still won't cause data corruption because different ranks write to different files.
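
As a concrete illustration of the per-rank files (approximate naming; the path reuses the hypothetical save sketch above):

```python
import os

ckpt_dir = "checkpoint/step-100"  # hypothetical path from the save sketch

# A directory written by dcp.FileSystemWriter typically contains one data
# file per writing rank plus a metadata file, roughly:
#
#   .metadata
#   __0_0.distcp
#   __1_0.distcp
#   ...
#
# Since each rank only writes its own file, there is no shared-file write
# that could be corrupted.
if os.path.isdir(ckpt_dir):
    print(sorted(os.listdir(ckpt_dir)))
```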

Your understanding is mostly correct. As for the world PG: there may be cases where users want to save among only a subset of ranks. This is not common, but some advanced users have their own infrastructure designs that are likewise uncommon.

As for loading, there is again a planning phase that coordinates all ranks to load the data correctly without reading redundant data. DCP assumes a distributed file system so that each rank can access the required files; if such a file system does not exist, users need to ensure the required files are accessible.
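
For completeness, a matching load sketch, again illustrative and reusing the hypothetical HSDP model and path from the save sketch above. Every rank calls dcp.load, a load-planning round assigns the reads, and each rank reads only the slices it needs, assuming the checkpoint directory is reachable from every rank (e.g. on a shared filesystem).

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import (
    get_model_state_dict,
    set_model_state_dict,
)

# `model` is the HSDP-wrapped module from the save sketch above.
# The load plan is coordinated across ranks so no rank reads data it does
# not need; the directory must be visible to all ranks.
state_dict = {"model": get_model_state_dict(model)}
dcp.load(state_dict, storage_reader=dcp.FileSystemReader("checkpoint/step-100"))
set_model_state_dict(model, state_dict["model"])
```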

qsh-zh (Author) commented on Jan 8, 2025

Thank you for your explanation @fegin! It makes sense to me now after checking the source code.


Metadata

Assignees: no one assigned
Labels: question (further information is requested)

Participants: @fegin, @qsh-zh, @tianyu-l