
Resuming from Megatron CKPT fails with optimizer length mismatch #8240

@Superskyyy

Description


Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/cli/_megatron/sft.py", line 7, in <module>                                                                                                                                                                  
[rank15]:     megatron_sft_main()                                                                                                                                                                                                                                           
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 88, in megatron_sft_main                                                                                                                                             
[rank15]:     return MegatronSft(args).main()                                                                                                                                                                                                                               
[rank15]:            ^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                               
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/pipelines/base.py", line 52, in main                                                                                                                                                                        
[rank15]:     result = self.run()                                                                                                                                                                                                                                           
[rank15]:              ^^^^^^^^^^                                                                                                                                                                                                                                           
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 65, in run                                                                                                                                                           
[rank15]:     trainer = self.prepare_trainer()                                                                                                                                                                                                                              
[rank15]:               ^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                              
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 33, in prepare_trainer                                                                                                                                               
[rank15]:     return MegatronTrainer(self.args, self.template)                                                                                                                                                                                                              
[rank15]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                              
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 73, in __init__                                                                                                                                                            
[rank15]:     self._load_checkpoint()                                                                                                                                                                                                                                       
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 106, in _load_checkpoint                                                                                                                                                   
[rank15]:     self.state.iteration = load_mcore_checkpoint(                                                                                                                                                                                                                 
[rank15]:                            ^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                 
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/utils/megatron_lm_utils.py", line 468, in load_mcore_checkpoint                                                                                                                                    
[rank15]:     state_dict = dist_checkpointing.load(sharded_state_dict, checkpoint_dir, load_strategy)                                                                                                                                                                       
[rank15]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                       
[rank15]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/serialization.py", line 125, in load                                                                                                                                             
[rank15]:     merge(common_state_dict, nonpersistent_state_dict)                                                                                                                                                                                                            
[rank15]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/dict_utils.py", line 227, in merge                                                                                                                                               
[rank15]:     x1[k] = merge(x1[k], v2, key=key + (k,))                                                                                                                                                                                                                      
[rank15]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                      
[rank15]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/dict_utils.py", line 227, in merge                                                                                                                                               
[rank15]:     x1[k] = merge(x1[k], v2, key=key + (k,))
[rank15]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank15]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/dict_utils.py", line 227, in merge
[rank15]:     x1[k] = merge(x1[k], v2, key=key + (k,))
[rank15]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank15]:   [Previous line repeated 2 more times]
[rank15]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/dict_utils.py", line 235, in merge
[rank15]:     x1[i] = merge(x1[i], v2, key=key + (i,))
[rank15]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank15]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/dict_utils.py", line 230, in merge
[rank15]:     raise ValueError(
[rank15]: ValueError: Cannot merge two lists with different lengths (194 and 192, encountered at level ('optimizer', 0, 'param_state', 0, (torch.bfloat16, torch.bfloat16), 0))
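For context, `dist_checkpointing.dict_utils.merge` recursively merges the checkpoint's common (non-sharded) state dict into the run's nonpersistent state dict and raises as soon as two lists disagree in length. A self-contained sketch of that logic (simplified from the mcore source; not the exact implementation):

```python
def merge(x1, x2, key=()):
    """Recursively merge x2 into x1, mirroring the mcore behavior above:
    dicts merge key-by-key, lists must match in length, leaves take x2."""
    if isinstance(x1, dict) and isinstance(x2, dict):
        for k, v2 in x2.items():
            x1[k] = merge(x1[k], v2, key=key + (k,)) if k in x1 else v2
        return x1
    if isinstance(x1, list) and isinstance(x2, list):
        if len(x1) != len(x2):
            raise ValueError(
                f'Cannot merge two lists with different lengths '
                f'({len(x1)} and {len(x2)}, encountered at level {key})')
        return [merge(v1, v2, key=key + (i,))
                for i, (v1, v2) in enumerate(zip(x1, x2))]
    return x2  # leaf values: x2 wins
```

In this failure the checkpoint holds 194 entries at `('optimizer', 0, 'param_state', 0, (torch.bfloat16, torch.bfloat16), 0)` while the freshly built optimizer expects 192, i.e. the number of parameters tracked in that distributed-optimizer bucket group differs between save and resume.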

How to Reproduce

I'm resuming training from an existing checkpoint with the exact same command:

[megatron cmd] megatron sft --model /shared_workspace_mfs/original_models/GLM-4.7-Flash --save_safetensors true --cached_dataset /shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/cached_dataset/model-glm-4__tmpl-glm4_7__data-fi-mindforge-1024-strict-glm47-v2-terminal-filte-495ed3cb__len-65536/train --split_dataset_ratio 0 --load_from_cache_file true --template glm4_7 --agent_template glm4_7 --loss_scale hermes --tuner_type full --finetune false --tensor_model_parallel_size 4 --expert_model_parallel_size 8 --pipeline_model_parallel_size 2 --context_parallel_size 1 --sequence_parallel true --micro_batch_size 1 --global_batch_size 32 --packing true --max_length 65536 --packing_length 65536 --recompute_granularity full --recompute_method uniform --recompute_num_layers 1 --gradient_accumulation_fusion false --cross_entropy_loss_fusion true --moe_permute_fusion true --moe_grouped_gemm true --moe_shared_expert_overlap true --moe_router_dtype fp32 --attention_backend flash --lr 1e-5 --weight_decay 0.01 --lr_warmup_fraction 0.1 --min_lr 1e-6 --num_train_epochs 6 --output_dir /shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138 --save_steps 25 --logging_steps 1 --dataloader_num_workers 4 --dataset_num_proc 100 --report_to wandb --wandb_project fi-pilot-study --wandb_exp_name glm47flash-conservative-megatron-65536-20260308-154332 --no_save_optim false --no_save_rng false --use_precision_aware_optimizer true --optimizer_cpu_offload true --optimizer_offload_fraction 1.0 --decoder_first_pipeline_num_layers 23 --mcore_model /shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138/checkpoint-75 --no_add_version
                                                                                                                                                                                                                                                                            
/usr/local/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
run sh: `/usr/local/bin/python -m torch.distributed.run --nproc_per_node 8 --master_port 29500 --nnodes 2 --node_rank 1 --master_addr 192.168.243.73 /usr/local/lib/python3.11/site-packages/swift/cli/_megatron/sft.py --model /shared_workspace_mfs/original_models/GLM-4.7-Flash --save_safetensors true --cached_dataset /shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/cached_dataset/model-glm-4__tmpl-glm4_7__data-fi-mindforge-1024-strict-glm47-v2-terminal-filte-495ed3cb__len-65536/train --split_dataset_ratio 0 --load_from_cache_file true --template glm4_7 --agent_template glm4_7 --loss_scale hermes --tuner_type full --finetune false --tensor_model_parallel_size 4 --expert_model_parallel_size 8 --pipeline_model_parallel_size 2 --context_parallel_size 1 --sequence_parallel true --micro_batch_size 1 --global_batch_size 32 --packing true --max_length 65536 --packing_length 65536 --recompute_granularity full --recompute_method uniform --recompute_num_layers 1 --gradient_accumulation_fusion false --cross_entropy_loss_fusion true --moe_permute_fusion true --moe_grouped_gemm true --moe_shared_expert_overlap true --moe_router_dtype fp32 --attention_backend flash --lr 1e-5 --weight_decay 0.01 --lr_warmup_fraction 0.1 --min_lr 1e-6 --num_train_epochs 6 --output_dir /shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138 --save_steps 25 --logging_steps 1 --dataloader_num_workers 4 --dataset_num_proc 100 --report_to wandb --wandb_project fi-pilot-study --wandb_exp_name glm47flash-conservative-megatron-65536-20260308-154332 --no_save_optim false --no_save_rng false --use_precision_aware_optimizer true --optimizer_cpu_offload true --optimizer_offload_fraction 1.0 --decoder_first_pipeline_num_layers 23 --mcore_model /shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138/checkpoint-75 --no_add_version`
[INFO:swift] args: MegatronSftArguments(use_ray=False, ray_exp_name=None, device_groups=None, model='/shared_workspace_mfs/original_models/GLM-4.7-Flash', model_type='glm4_moe_lite', model_revision=None, task_type='causal_lm', torch_dtype=torch.bfloat16, attn_impl=None, experts_impl=None, new_special_tokens=[], num_labels=None, problem_type=None, rope_scaling=None, device_map=None, max_memory={}, max_model_len=None, local_repo_path=None, init_strategy=None, template='glm4_7', system=None, max_length=65536, truncation_strategy='delete', max_pixels=None, agent_template='glm4_7', norm_bbox=None, use_chat_template=True, padding_side='right', padding_free=True, loss_scale='hermes', sequence_parallel_size=1, template_backend='swift', response_prefix=None, enable_thinking=None, add_non_thinking_prefix=True, dataset=[], val_dataset=[], cached_dataset=['/shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/cached_dataset/model-glm-4__tmpl-glm4_7__data-fi-mindforge-1024-strict-glm47-v2-terminal-filte-495ed3cb__len-65536/train'], cached_val_dataset=[], split_dataset_ratio=0.0, data_seed=42, dataset_num_proc=100, load_from_cache_file=True, dataset_shuffle=True, val_dataset_shuffle=False, streaming=False, interleave_prob=None, stopping_strategy='first_exhausted', shuffle_buffer_size=1000, download_mode='reuse_dataset_if_exists', columns={}, strict=False, remove_unused_columns=True, model_name=None, model_author=None, custom_dataset_info=[], quant_method=None, quant_bits=None, hqq_axis=None, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, max_new_tokens=None, temperature=0.9, top_k=50, top_p=0.9, repetition_penalty=1.0, num_beams=1, stream=False, stop_words=[], logprobs=False, top_logprobs=None, structured_outputs_regex=None, tuner_backend='peft', tuner_type='full', train_type=None, adapters=[], external_plugins=[], custom_register_path=[], seed=42, model_kwargs={}, load_args=False, load_data_args=False, packing=True, packing_length=65536, packing_num_proc=1, lazy_tokenize=False, use_hf=False, hub_token=None, ddp_timeout=18000000, ddp_backend='nccl', ignore_args_error=False, use_swift_lora=False, freeze_llm=False, freeze_vit=True, freeze_aligner=True, freeze_parameters=[], freeze_parameters_regex=None, freeze_parameters_ratio=0.0, trainable_parameters=[], trainable_parameters_regex=None, target_modules=['all-linear'], target_regex=None, modules_to_save=[], lora_rank=8, lora_alpha=32, lora_dropout=0.05, lora_bias='none', lora_dtype=None, use_rslora=False, rlhf_type=None, loss_type=None, mcore_ref_model=None, mcore_ref_adapter=None, beta=None, rpo_alpha=None, reference_free=False, label_smoothing=0.0, f_divergence_type='reverse_kl', desirable_weight=1.0, undesirable_weight=1.0, calculate_KL=None, center_rewards_coefficient=None, teacher_model=None, teacher_model_type=None, teacher_model_revision=None, teacher_model_server=None, gkd_logits_topk=None, lmbda=0.5, seq_kd=False, offload_teacher_model=False, sft_alpha=0.0, generation_batch_size=None, steps_per_generation=None, num_generations=8, num_generations_eval=None, max_completion_length=512, importance_sampling_level='token', tau_pos=1.0, tau_neg=1.05, epsilon=0.2, epsilon_high=None, delta=None, use_vllm=True, vllm_mode=None, vllm_enable_prefix_caching=True, vllm_gpu_memory_utilization=0.9, vllm_tensor_parallel_size=1, vllm_max_model_len=None, vllm_enforce_eager=False, vllm_limit_mm_per_prompt=None, vllm_disable_cascade_attn=False, vllm_max_num_seqs=None, vllm_mm_processor_cache_gb=None, vllm_engine_kwargs=None, sleep_level=0, offload_optimizer=False, offload_model=False, offload_bridge=False, vllm_server_base_url=None, vllm_server_host=None, vllm_server_port=[8000], vllm_server_timeout=240.0, vllm_server_group_port=None, reward_funcs=[], reward_weights=None, cosine_min_len_value_wrong=-0.5, cosine_max_len_value_wrong=0.0, cosine_min_len_value_correct=1.0, cosine_max_len_value_correct=0.5, cosine_max_len=None, repetition_n_grams=3, repetition_max_penalty=-1.0, soft_max_length=None, soft_cache_length=None, dynamic_sample=False, max_resample_times=3, overlong_filter=False, scale_rewards='group', advantage_estimator='grpo', kl_in_reward=False, wandb_log_unique_prompts=None, log_completions=False, rollout_importance_sampling_mode=None, rollout_importance_sampling_threshold=2.0, log_rollout_offpolicy_metrics=False, off_policy_sequence_mask_delta=None, log_entropy=False, top_entropy_quantile=1.0, reward_model=None, reward_model_plugin=None, sync_ref_model=False, ref_model_sync_steps=512, ref_model_mixup_alpha=0.6, async_generate=False, move_model_batches=None, multi_turn_scheduler=None, max_turns=None, completion_length_limit_scope='per_round', vllm_server_pass_dataset=False, num_iterations=1, micro_batch_size=1, global_batch_size=32, recompute_granularity='full', recompute_method='uniform', recompute_num_layers=1, recompute_modules=['core_attn'], train_iters=None, num_train_epochs=6, masked_softmax_fusion=True, bias_dropout_fusion=True, bias_activation_fusion=True, apply_rope_fusion=False, gradient_accumulation_fusion=False, cross_entropy_loss_fusion=True, cross_entropy_fusion_impl='native', calculate_per_token_loss=True, attention_backend=<AttnBackend.flash: 1>, optimizer='adam', optimizer_cpu_offload=True, optimizer_offload_fraction=1.0, use_precision_aware_optimizer=True, main_grads_dtype=torch.float32, main_params_dtype=torch.float32, exp_avg_dtype=torch.float32, exp_avg_sq_dtype=torch.float32, manual_gc=False, manual_gc_steps=0, manual_gc_eval=True, train_dataloader_shuffle=True, dataloader_num_workers=4, dataloader_pin_memory=True, dataloader_persistent_workers=True, dataloader_prefetch_factor=2, data_sharding=False, group_by_length=False, te_rng_tracker=False, data_parallel_random_init=False, mlp_padding_free=False, lr_warmup_init=0.0, lr=1e-05, lr_decay_style='cosine', lr_decay_iters=None, lr_warmup_iters=0, lr_warmup_fraction=0.1, min_lr=1e-06, lr_wsd_decay_style='exponential', lr_wsd_decay_iters=None, weight_decay=0.01, weight_decay_incr_style='constant', start_weight_decay=0.01, end_weight_decay=0.01, clip_grad=1.0, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-08, sgd_momentum=0.9, output_dir='/shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138', save_steps=25, no_save_optim=False, no_save_rng=False, mcore_model='/shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138/checkpoint-75', mcore_adapter=None, no_load_optim=False, no_load_rng=False, finetune=False, perform_initialization=False, use_cpu_initialization=False, async_save=False, save_total_limit=None, metric_for_best_model='loss', greater_is_better=False, use_persistent_ckpt_worker=False, dist_ckpt_save_pre_mcore_014=False, dist_ckpt_optim_fully_reshardable=False, distrib_optim_fully_reshardable_mem_efficient=False, local_rank=0, use_distributed_optimizer=True, tensor_model_parallel_size=4, pipeline_model_parallel_size=2, decoder_first_pipeline_num_layers=23, decoder_last_pipeline_num_layers=None, account_for_embedding_in_pipeline_split=False, account_for_loss_in_pipeline_split=False, overlap_p2p_comm=False, align_param_gather=False, sequence_parallel=True, context_parallel_size=1, tp_comm_overlap=False, overlap_grad_reduce=False, overlap_param_gather=False, overlap_param_gather_with_optimizer_step=False, align_grad_reduce=True, virtual_pipeline_model_parallel_size=None, microbatch_group_size_per_vp_stage=None, pipeline_model_parallel_layout=None, expert_model_parallel_size=8, expert_tensor_parallel_size=1, report_to=['wandb'], logging_steps=1, tensorboard_dir='/shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138/runs', tensorboard_queue_size=50, wandb_project='fi-pilot-study', wandb_exp_name='glm47flash-conservative-megatron-65536-20260308-154332', swanlab_project='megatron-swift', swanlab_exp_name=None, eval_iters=-1, eval_steps=25, fp8_format=None, fp8_recipe='delayed', fp8_amax_history_len=1024, fp8_amax_compute_algo='max', fp8_param_gather=False, fp16=False, bf16=True, apply_query_key_layer_scaling=False, attention_softmax_in_fp32=True, accumulate_allreduce_grads_in_fp32=True, moe_router_load_balancing_type=None, moe_router_dtype='fp32', moe_token_dispatcher_type='alltoall', moe_enable_deepep=False, moe_grouped_gemm=True, moe_permute_fusion=True, moe_aux_loss_coeff=0.0, moe_z_loss_coeff=None, moe_shared_expert_overlap=True, moe_layer_recompute=False, moe_expert_capacity_factor=None, moe_pad_expert_input_to_capacity=False, moe_token_drop_policy='probs', mtp_num_layers=None, mtp_loss_scaling_factor=0.1, save_safetensors=True, ref_model=None, ref_adapters=[], merge_lora=True, max_shard_size='5GB', vit_gradient_checkpointing=False, vit_lr=None, aligner_lr=None, gradient_checkpointing_kwargs=None, check_model=True, apply_wd_to_qk_layernorm=False, enable_dft_loss=False, enable_channel_loss=False, save_strategy='steps', callbacks=['print', 'default_flow', 'wandb'], add_version=False, create_checkpoint_symlink=False)

Additional Information

[INFO:swift] model_kwargs: {'device_map': 'cuda:0', 'dtype': torch.bfloat16}
[INFO:swift] [rank9] model_parameter_info: GPTModel: 2131.2307M Params (2131.2307M Trainable [100.0000%]), 0.0031M Buffers.
[INFO:swift] [rank10] model_parameter_info: GPTModel: 2131.2307M Params (2131.2307M Trainable [100.0000%]), 0.0031M Buffers.
[INFO:swift] model: GPTModel(
  (decoder): TransformerBlock(
    (layers): ModuleList(
      (0-23): 24 x TransformerLayer(
        (input_layernorm): RMSNorm()
        (self_attention): MLASelfAttention(
          (core_attention): TEDotProductAttention(
            (flash_attention): FlashAttention()
            (fused_attention): FusedAttention()
            (unfused_attention): UnfusedDotProductAttention(
              (scale_mask_softmax): FusedScaleMaskSoftmax()
              (attention_dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (linear_proj): TERowParallelLinear(in_features=1280, out_features=2048, bias=False, TP=4)
          (linear_q_down_proj): TELinear(in_features=2048, out_features=768, bias=False, TP=1)
          (linear_q_up_proj): TELayerNormColumnParallelLinear(in_features=768, out_features=1280, bias=False, TP=4)
          (linear_kv_down_proj): TELinear(in_features=2048, out_features=576, bias=False, TP=1)
          (linear_kv_up_proj): TELayerNormColumnParallelLinear(in_features=512, out_features=2240, bias=False, TP=4)
          (q_layernorm): IdentityOp()
          (kv_layernorm): IdentityOp()
        )
        (pre_cross_attn_layernorm): IdentityOp()
        (cross_attention): IdentityOp()
        (cross_attn_bda): IdentityFuncOp()
        (pre_mlp_layernorm): RMSNorm()
        (mlp): MoELayer(
          (router): TopKRouter()
          (experts): TEGroupedMLP(
            (linear_fc1): TEColumnParallelGroupedLinear()
            (linear_fc2): TERowParallelGroupedLinear()
          )
          (shared_experts): SharedExpertMLP(
            (linear_fc1): TEColumnParallelLinear(in_features=2048, out_features=768, bias=False, TP=4)
            (linear_fc2): TERowParallelLinear(in_features=384, out_features=2048, bias=False, TP=4)
          )
        )
      )
    )
    (final_layernorm): RMSNorm()
  )
  (output_layer): ColumnParallelLinear(in_features=2048, out_features=154880, bias=False, TP=4)
  (rotary_pos_emb): RotaryEmbedding()
)
[INFO:swift] [rank8] model_parameter_info: GPTModel: 2131.2307M Params (2131.2307M Trainable [100.0000%]), 0.0031M Buffers.
[INFO:swift] [rank11] model_parameter_info: GPTModel: 2131.2307M Params (2131.2307M Trainable [100.0000%]), 0.0031M Buffers.
[INFO:swift] padding_to: 4
[INFO:swift] checkpoint_dir: /shared_workspace_mfs/original_models/GLM-4.7-Flash
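To pinpoint which optimizer entry diverged before retrying the resume, a small hypothetical helper (not part of swift or Megatron-LM) can walk two nested dict/list structures and report every list-length divergence rather than stopping at the first, as mcore's merge does:

```python
from typing import Any, Iterator, Tuple

def length_mismatches(a: Any, b: Any, key: Tuple = ()) -> Iterator[Tuple[Tuple, int, int]]:
    """Yield (path, len_a, len_b) for every list-length divergence
    between two nested dict/list structures."""
    if isinstance(a, dict) and isinstance(b, dict):
        for k in a.keys() & b.keys():  # only compare shared keys
            yield from length_mismatches(a[k], b[k], key + (k,))
    elif isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            yield (key, len(a), len(b))
        else:
            for i, (va, vb) in enumerate(zip(a, b)):
                yield from length_mismatches(va, vb, key + (i,))
```

Applied to the saved common state (e.g. a `torch.load` of the common `.pt` file inside checkpoint-75; the exact file layout is an assumption about the mcore dist-ckpt format) versus the freshly built sharded state dict, this would show whether only the `(torch.bfloat16, torch.bfloat16)` bucket group changed size (194 vs 192) or other groups shifted too.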

Labels: bug