How to convert parallel state_dict to normal state_dict? #122

JinchaoLove · 2023-09-18T13:54:29Z

Hi there! I saved a parallel state_dict (only parameters with requires_grad=True) using 8 GPUs on a remote machine. How can I load these state_dicts and save them as a single one locally? Thanks in advance.

collie_dp0_pp0_tp0.pt  collie_zero_dp0_pp0_tp0.pt  collie_zero_dp2_pp0_tp0.pt  collie_zero_dp4_pp0_tp0.pt  collie_zero_dp6_pp0_tp0.pt
collie.json            collie_zero_dp1_pp0_tp0.pt  collie_zero_dp3_pp0_tp0.pt  collie_zero_dp5_pp0_tp0.pt  collie_zero_dp7_pp0_tp0.pt


KaiLv69 · 2023-09-18T14:07:13Z

Hi, the model weights should be saved in files like pytorch_model.bin when you use the CheckpointCallback below.

callbacks = [CheckpointCallback(your_path, every_n_batches=1600, model_only=False, peft_only=False)]

BTW, are you using the main branch or the dev branch? We recommend using dev for now.

JinchaoLove · 2023-09-18T14:31:39Z

> Hi, the model weights should be saved in files like pytorch_model.bin with CheckpointCallback below.
> callbacks = [CheckpointCallback(your_path, every_n_batches=1600, model_only=False,peft_only=False)]
> BTW, are you using the main branch or dev branch? Recommend using dev now.

Got it. I'm using the dev branch. So the files listed above are all trainer state (not model weights), as defined in the Trainer. The issue was caused by my filter, if p.requires_grad, which is always False for the tensors returned by state_dict().

self.checkpoint_file = "collie_dp{}_pp{}_tp{}.pt".format(env.dp_rank, env.pp_rank, env.tp_rank)  # Trainer state
state_dict = {n: p.detach().cpu() for n, p in model.state_dict().items() if p.requires_grad}  # always empty
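For reference, a minimal sketch of two ways to keep only the trainable entries, assuming a plain, unsharded torch.nn.Module. The toy nn.Sequential model and the variable names here are illustrative only and are not part of CoLLiE; with ZeRO or tensor parallelism the parameters may be sharded and this alone would not be enough.

```python
import torch
from torch import nn

# Toy stand-in for the real model: freeze the first layer, train the second.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
for p in model[0].parameters():
    p.requires_grad = False

# state_dict() returns detached tensors by default, so requires_grad is
# False for every entry and this filter drops everything.
empty = {n: p for n, p in model.state_dict().items() if p.requires_grad}

# Option 1: collect the trainable names from named_parameters(),
# then select those entries from the state_dict.
trainable_names = {n for n, p in model.named_parameters() if p.requires_grad}
trainable_state = {
    n: t.detach().cpu()
    for n, t in model.state_dict().items()
    if n in trainable_names
}

# Option 2: keep_vars=True returns the original Parameters,
# so requires_grad is preserved and the filter works as intended.
trainable_state_alt = {
    n: p.detach().cpu()
    for n, p in model.state_dict(keep_vars=True).items()
    if p.requires_grad
}

print(sorted(trainable_state))  # ['1.bias', '1.weight']
```

Both options give the same keys; option 1 leaves the default state_dict() behaviour untouched, while option 2 is shorter.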

JinchaoLove · 2023-09-19T03:42:51Z

The topk argument of CheckpointCallback defaults to 0, which means the model is not saved... I think it would be better to default it to 1 or -1, or to raise a warning, to guard against misconfiguration.

JinchaoLove closed this as completed Sep 18, 2023

JinchaoLove reopened this Sep 19, 2023
