
Resuming from Megatron CKPT fails with optimizer length mismatch #8240

@Superskyyy

Description


Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/cli/_megatron/sft.py", line 7, in <module>                                                                                                                                                                  
[rank15]:     megatron_sft_main()                                                                                                                                                                                                                                           
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 88, in megatron_sft_main                                                                                                                                             
[rank15]:     return MegatronSft(args).main()                                                                                                                                                                                                                               
[rank15]:            ^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                               
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/pipelines/base.py", line 52, in main                                                                                                                                                                        
[rank15]:     result = self.run()                                                                                                                                                                                                                                           
[rank15]:              ^^^^^^^^^^                                                                                                                                                                                                                                           
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 65, in run                                                                                                                                                           
[rank15]:     trainer = self.prepare_trainer()                                                                                                                                                                                                                              
[rank15]:               ^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                              
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/pipelines/train/sft.py", line 33, in prepare_trainer                                                                                                                                               
[rank15]:     return MegatronTrainer(self.args, self.template)                                                                                                                                                                                                              
[rank15]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                              
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 73, in __init__                                                                                                                                                            
[rank15]:     self._load_checkpoint()                                                                                                                                                                                                                                       
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/trainers/base.py", line 106, in _load_checkpoint                                                                                                                                                   
[rank15]:     self.state.iteration = load_mcore_checkpoint(                                                                                                                                                                                                                 
[rank15]:                            ^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                 
[rank15]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/utils/megatron_lm_utils.py", line 468, in load_mcore_checkpoint                                                                                                                                    
[rank15]:     state_dict = dist_checkpointing.load(sharded_state_dict, checkpoint_dir, load_strategy)                                                                                                                                                                       
[rank15]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                       
[rank15]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/serialization.py", line 125, in load                                                                                                                                             
[rank15]:     merge(common_state_dict, nonpersistent_state_dict)                                                                                                                                                                                                            
[rank15]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/dict_utils.py", line 227, in merge                                                                                                                                               
[rank15]:     x1[k] = merge(x1[k], v2, key=key + (k,))                                                                                                                                                                                                                      
[rank15]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                                                      
[rank15]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/dict_utils.py", line 227, in merge                                                                                                                                               
[rank15]:     x1[k] = merge(x1[k], v2, key=key + (k,))
[rank15]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank15]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/dict_utils.py", line 227, in merge
[rank15]:     x1[k] = merge(x1[k], v2, key=key + (k,))
[rank15]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank15]:   [Previous line repeated 2 more times]
[rank15]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/dict_utils.py", line 235, in merge
[rank15]:     x1[i] = merge(x1[i], v2, key=key + (i,))
[rank15]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank15]:   File "/usr/local/lib/python3.11/site-packages/megatron/core/dist_checkpointing/dict_utils.py", line 230, in merge
[rank15]:     raise ValueError(
[rank15]: ValueError: Cannot merge two lists with different lengths (194 and 192, encountered at level ('optimizer', 0, 'param_state', 0, (torch.bfloat16, torch.bfloat16), 0))
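For context, `dist_checkpointing.dict_utils.merge` recursively merges the checkpoint's common (non-sharded) state dict into the run's nonpersistent state dict and raises as soon as two lists disagree in length. A self-contained sketch of that logic (simplified from the mcore source; not the exact implementation):

```python
def merge(x1, x2, key=()):
    """Recursively merge x2 into x1, mirroring the mcore behavior above:
    dicts merge key-by-key, lists must match in length, leaves take x2."""
    if isinstance(x1, dict) and isinstance(x2, dict):
        for k, v2 in x2.items():
            x1[k] = merge(x1[k], v2, key=key + (k,)) if k in x1 else v2
        return x1
    if isinstance(x1, list) and isinstance(x2, list):
        if len(x1) != len(x2):
            raise ValueError(
                f'Cannot merge two lists with different lengths '
                f'({len(x1)} and {len(x2)}, encountered at level {key})')
        return [merge(v1, v2, key=key + (i,))
                for i, (v1, v2) in enumerate(zip(x1, x2))]
    return x2  # leaf values: x2 wins
```

In this failure the checkpoint holds 194 entries at `('optimizer', 0, 'param_state', 0, (torch.bfloat16, torch.bfloat16), 0)` while the freshly built optimizer expects 192, i.e. the number of parameters tracked in that distributed-optimizer bucket group differs between save and resume.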

How to Reproduce

I'm resuming training from an existing checkpoint with the exact same command:

[megatron cmd] megatron sft --model /shared_workspace_mfs/original_models/GLM-4.7-Flash --save_safetensors true --cached_dataset /shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/cached_dataset/model-glm-4__tmpl-glm4_7__data-fi-mindforge-1024-strict-glm47-v2-terminal-filte-495ed3cb__len-65536/train --split_dataset_ratio 0 --load_from_cache_file true --template glm4_7 --agent_template glm4_7 --loss_scale hermes --tuner_type full --finetune false --tensor_model_parallel_size 4 --expert_model_parallel_size 8 --pipeline_model_parallel_size 2 --context_parallel_size 1 --sequence_parallel true --micro_batch_size 1 --global_batch_size 32 --packing true --max_length 65536 --packing_length 65536 --recompute_granularity full --recompute_method uniform --recompute_num_layers 1 --gradient_accumulation_fusion false --cross_entropy_loss_fusion true --moe_permute_fusion true --moe_grouped_gemm true --moe_shared_expert_overlap true --moe_router_dtype fp32 --attention_backend flash --lr 1e-5 --weight_decay 0.01 --lr_warmup_fraction 0.1 --min_lr 1e-6 --num_train_epochs 6 --output_dir /shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138 --save_steps 25 --logging_steps 1 --dataloader_num_workers 4 --dataset_num_proc 100 --report_to wandb --wandb_project fi-pilot-study --wandb_exp_name glm47flash-conservative-megatron-65536-20260308-154332 --no_save_optim false --no_save_rng false --use_precision_aware_optimizer true --optimizer_cpu_offload true --optimizer_offload_fraction 1.0 --decoder_first_pipeline_num_layers 23 --mcore_model /shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138/checkpoint-75 --no_add_version
                                                                                                                                                                                                                                                                            
/usr/local/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
run sh: `/usr/local/bin/python -m torch.distributed.run --nproc_per_node 8 --master_port 29500 --nnodes 2 --node_rank 1 --master_addr 192.168.243.73 /usr/local/lib/python3.11/site-packages/swift/cli/_megatron/sft.py --model /shared_workspace_mfs/original_models/GLM-4.7-Flash --save_safetensors true --cached_dataset /shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/cached_dataset/model-glm-4__tmpl-glm4_7__data-fi-mindforge-1024-strict-glm47-v2-terminal-filte-495ed3cb__len-65536/train --split_dataset_ratio 0 --load_from_cache_file true --template glm4_7 --agent_template glm4_7 --loss_scale hermes --tuner_type full --finetune false --tensor_model_parallel_size 4 --expert_model_parallel_size 8 --pipeline_model_parallel_size 2 --context_parallel_size 1 --sequence_parallel true --micro_batch_size 1 --global_batch_size 32 --packing true --max_length 65536 --packing_length 65536 --recompute_granularity full --recompute_method uniform --recompute_num_layers 1 --gradient_accumulation_fusion false --cross_entropy_loss_fusion true --moe_permute_fusion true --moe_grouped_gemm true --moe_shared_expert_overlap true --moe_router_dtype fp32 --attention_backend flash --lr 1e-5 --weight_decay 0.01 --lr_warmup_fraction 0.1 --min_lr 1e-6 --num_train_epochs 6 --output_dir /shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138 --save_steps 25 --logging_steps 1 --dataloader_num_workers 4 --dataset_num_proc 100 --report_to wandb --wandb_project fi-pilot-study --wandb_exp_name glm47flash-conservative-megatron-65536-20260308-154332 --no_save_optim false --no_save_rng false --use_precision_aware_optimizer true --optimizer_cpu_offload true --optimizer_offload_fraction 1.0 --decoder_first_pipeline_num_layers 23 --mcore_model /shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138/checkpoint-75 --no_add_version`
[INFO:swift] args: MegatronSftArguments(use_ray=False, ray_exp_name=None, device_groups=None, model='/shared_workspace_mfs/original_models/GLM-4.7-Flash', model_type='glm4_moe_lite', model_revision=None, task_type='causal_lm', torch_dtype=torch.bfloat16, attn_impl=None, experts_impl=None, new_special_tokens=[], num_labels=None, problem_type=None, rope_scaling=None, device_map=None, max_memory={}, max_model_len=None, local_repo_path=None, init_strategy=None, template='glm4_7', system=None, max_length=65536, truncation_strategy='delete', max_pixels=None, agent_template='glm4_7', norm_bbox=None, use_chat_template=True, padding_side='right', padding_free=True, loss_scale='hermes', sequence_parallel_size=1, template_backend='swift', response_prefix=None, enable_thinking=None, add_non_thinking_prefix=True, dataset=[], val_dataset=[], cached_dataset=['/shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/cached_dataset/model-glm-4__tmpl-glm4_7__data-fi-mindforge-1024-strict-glm47-v2-terminal-filte-495ed3cb__len-65536/train'], cached_val_dataset=[], split_dataset_ratio=0.0, data_seed=42, dataset_num_proc=100, load_from_cache_file=True, dataset_shuffle=True, val_dataset_shuffle=False, streaming=False, interleave_prob=None, stopping_strategy='first_exhausted', shuffle_buffer_size=1000, download_mode='reuse_dataset_if_exists', columns={}, strict=False, remove_unused_columns=True, model_name=None, model_author=None, custom_dataset_info=[], quant_method=None, quant_bits=None, hqq_axis=None, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, max_new_tokens=None, temperature=0.9, top_k=50, top_p=0.9, repetition_penalty=1.0, num_beams=1, stream=False, stop_words=[], logprobs=False, top_logprobs=None, structured_outputs_regex=None, tuner_backend='peft', tuner_type='full', train_type=None, adapters=[], external_plugins=[], custom_register_path=[], seed=42, model_kwargs={}, load_args=False, load_data_args=False, packing=True, packing_length=65536, packing_num_proc=1, lazy_tokenize=False, use_hf=False, hub_token=None, ddp_timeout=18000000, ddp_backend='nccl', ignore_args_error=False, use_swift_lora=False, freeze_llm=False, freeze_vit=True, freeze_aligner=True, freeze_parameters=[], freeze_parameters_regex=None, freeze_parameters_ratio=0.0, trainable_parameters=[], trainable_parameters_regex=None, target_modules=['all-linear'], target_regex=None, modules_to_save=[], lora_rank=8, lora_alpha=32, lora_dropout=0.05, lora_bias='none', lora_dtype=None, use_rslora=False, rlhf_type=None, loss_type=None, mcore_ref_model=None, mcore_ref_adapter=None, beta=None, rpo_alpha=None, reference_free=False, label_smoothing=0.0, f_divergence_type='reverse_kl', desirable_weight=1.0, undesirable_weight=1.0, calculate_KL=None, center_rewards_coefficient=None, teacher_model=None, teacher_model_type=None, teacher_model_revision=None, teacher_model_server=None, gkd_logits_topk=None, lmbda=0.5, seq_kd=False, offload_teacher_model=False, sft_alpha=0.0, generation_batch_size=None, steps_per_generation=None, num_generations=8, num_generations_eval=None, max_completion_length=512, importance_sampling_level='token', tau_pos=1.0, tau_neg=1.05, epsilon=0.2, epsilon_high=None, delta=None, use_vllm=True, vllm_mode=None, vllm_enable_prefix_caching=True, vllm_gpu_memory_utilization=0.9, vllm_tensor_parallel_size=1, vllm_max_model_len=None, vllm_enforce_eager=False, vllm_limit_mm_per_prompt=None, vllm_disable_cascade_attn=False, vllm_max_num_seqs=None, vllm_mm_processor_cache_gb=None, vllm_engine_kwargs=None, sleep_level=0, offload_optimizer=False, offload_model=False, offload_bridge=False, vllm_server_base_url=None, vllm_server_host=None, vllm_server_port=[8000], vllm_server_timeout=240.0, vllm_server_group_port=None, reward_funcs=[], reward_weights=None, cosine_min_len_value_wrong=-0.5, cosine_max_len_value_wrong=0.0, cosine_min_len_value_correct=1.0, cosine_max_len_value_correct=0.5, cosine_max_len=None, repetition_n_grams=3, repetition_max_penalty=-1.0, soft_max_length=None, soft_cache_length=None, dynamic_sample=False, max_resample_times=3, overlong_filter=False, scale_rewards='group', advantage_estimator='grpo', kl_in_reward=False, wandb_log_unique_prompts=None, log_completions=False, rollout_importance_sampling_mode=None, rollout_importance_sampling_threshold=2.0, log_rollout_offpolicy_metrics=False, off_policy_sequence_mask_delta=None, log_entropy=False, top_entropy_quantile=1.0, reward_model=None, reward_model_plugin=None, sync_ref_model=False, ref_model_sync_steps=512, ref_model_mixup_alpha=0.6, async_generate=False, move_model_batches=None, multi_turn_scheduler=None, max_turns=None, completion_length_limit_scope='per_round', vllm_server_pass_dataset=False, num_iterations=1, micro_batch_size=1, global_batch_size=32, recompute_granularity='full', recompute_method='uniform', recompute_num_layers=1, recompute_modules=['core_attn'], train_iters=None, num_train_epochs=6, masked_softmax_fusion=True, bias_dropout_fusion=True, bias_activation_fusion=True, apply_rope_fusion=False, gradient_accumulation_fusion=False, cross_entropy_loss_fusion=True, cross_entropy_fusion_impl='native', calculate_per_token_loss=True, attention_backend=<AttnBackend.flash: 1>, optimizer='adam', optimizer_cpu_offload=True, optimizer_offload_fraction=1.0, use_precision_aware_optimizer=True, main_grads_dtype=torch.float32, main_params_dtype=torch.float32, exp_avg_dtype=torch.float32, exp_avg_sq_dtype=torch.float32, manual_gc=False, manual_gc_steps=0, manual_gc_eval=True, train_dataloader_shuffle=True, dataloader_num_workers=4, dataloader_pin_memory=True, dataloader_persistent_workers=True, dataloader_prefetch_factor=2, data_sharding=False, group_by_length=False, te_rng_tracker=False, data_parallel_random_init=False, mlp_padding_free=False, lr_warmup_init=0.0, lr=1e-05, lr_decay_style='cosine', lr_decay_iters=None, lr_warmup_iters=0, lr_warmup_fraction=0.1, min_lr=1e-06, lr_wsd_decay_style='exponential', lr_wsd_decay_iters=None, weight_decay=0.01, weight_decay_incr_style='constant', start_weight_decay=0.01, end_weight_decay=0.01, clip_grad=1.0, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-08, sgd_momentum=0.9, output_dir='/shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138', save_steps=25, no_save_optim=False, no_save_rng=False, mcore_model='/shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138/checkpoint-75', mcore_adapter=None, no_load_optim=False, no_load_rng=False, finetune=False, perform_initialization=False, use_cpu_initialization=False, async_save=False, save_total_limit=None, metric_for_best_model='loss', greater_is_better=False, use_persistent_ckpt_worker=False, dist_ckpt_save_pre_mcore_014=False, dist_ckpt_optim_fully_reshardable=False, distrib_optim_fully_reshardable_mem_efficient=False, local_rank=0, use_distributed_optimizer=True, tensor_model_parallel_size=4, pipeline_model_parallel_size=2, decoder_first_pipeline_num_layers=23, decoder_last_pipeline_num_layers=None, account_for_embedding_in_pipeline_split=False, account_for_loss_in_pipeline_split=False, overlap_p2p_comm=False, align_param_gather=False, sequence_parallel=True, context_parallel_size=1, tp_comm_overlap=False, overlap_grad_reduce=False, overlap_param_gather=False, overlap_param_gather_with_optimizer_step=False, align_grad_reduce=True, virtual_pipeline_model_parallel_size=None, microbatch_group_size_per_vp_stage=None, pipeline_model_parallel_layout=None, expert_model_parallel_size=8, expert_tensor_parallel_size=1, report_to=['wandb'], logging_steps=1, tensorboard_dir='/shared_workspace_mfs/yihao/FI-training-data/SFT_PILOT_STUDY/checkpoints/mcore/glm47flash_conservative_65536_20260308_063101/v0-20260308-143138/runs', tensorboard_queue_size=50, wandb_project='fi-pilot-study', wandb_exp_name='glm47flash-conservative-megatron-65536-20260308-154332', swanlab_project='megatron-swift', swanlab_exp_name=None, eval_iters=-1, eval_steps=25, fp8_format=None, fp8_recipe='delayed', fp8_amax_history_len=1024, fp8_amax_compute_algo='max', fp8_param_gather=False, fp16=False, bf16=True, apply_query_key_layer_scaling=False, attention_softmax_in_fp32=True, accumulate_allreduce_grads_in_fp32=True, moe_router_load_balancing_type=None, moe_router_dtype='fp32', moe_token_dispatcher_type='alltoall', moe_enable_deepep=False, moe_grouped_gemm=True, moe_permute_fusion=True, moe_aux_loss_coeff=0.0, moe_z_loss_coeff=None, moe_shared_expert_overlap=True, moe_layer_recompute=False, moe_expert_capacity_factor=None, moe_pad_expert_input_to_capacity=False, moe_token_drop_policy='probs', mtp_num_layers=None, mtp_loss_scaling_factor=0.1, save_safetensors=True, ref_model=None, ref_adapters=[], merge_lora=True, max_shard_size='5GB', vit_gradient_checkpointing=False, vit_lr=None, aligner_lr=None, gradient_checkpointing_kwargs=None, check_model=True, apply_wd_to_qk_layernorm=False, enable_dft_loss=False, enable_channel_loss=False, save_strategy='steps', callbacks=['print', 'default_flow', 'wandb'], add_version=False, create_checkpoint_symlink=False)

Additional Information

[INFO:swift] model_kwargs: {'device_map': 'cuda:0', 'dtype': torch.bfloat16}
[INFO:swift] [rank9] model_parameter_info: GPTModel: 2131.2307M Params (2131.2307M Trainable [100.0000%]), 0.0031M Buffers.
[INFO:swift] [rank10] model_parameter_info: GPTModel: 2131.2307M Params (2131.2307M Trainable [100.0000%]), 0.0031M Buffers.
[INFO:swift] model: GPTModel(
  (decoder): TransformerBlock(
    (layers): ModuleList(
      (0-23): 24 x TransformerLayer(
        (input_layernorm): RMSNorm()
        (self_attention): MLASelfAttention(
          (core_attention): TEDotProductAttention(
            (flash_attention): FlashAttention()
            (fused_attention): FusedAttention()
            (unfused_attention): UnfusedDotProductAttention(
              (scale_mask_softmax): FusedScaleMaskSoftmax()
              (attention_dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (linear_proj): TERowParallelLinear(in_features=1280, out_features=2048, bias=False, TP=4)
          (linear_q_down_proj): TELinear(in_features=2048, out_features=768, bias=False, TP=1)
          (linear_q_up_proj): TELayerNormColumnParallelLinear(in_features=768, out_features=1280, bias=False, TP=4)
          (linear_kv_down_proj): TELinear(in_features=2048, out_features=576, bias=False, TP=1)
          (linear_kv_up_proj): TELayerNormColumnParallelLinear(in_features=512, out_features=2240, bias=False, TP=4)
          (q_layernorm): IdentityOp()
          (kv_layernorm): IdentityOp()
        )
        (pre_cross_attn_layernorm): IdentityOp()
        (cross_attention): IdentityOp()
        (cross_attn_bda): IdentityFuncOp()
        (pre_mlp_layernorm): RMSNorm()
        (mlp): MoELayer(
          (router): TopKRouter()
          (experts): TEGroupedMLP(
            (linear_fc1): TEColumnParallelGroupedLinear()
            (linear_fc2): TERowParallelGroupedLinear()
          )
          (shared_experts): SharedExpertMLP(
            (linear_fc1): TEColumnParallelLinear(in_features=2048, out_features=768, bias=False, TP=4)
            (linear_fc2): TERowParallelLinear(in_features=384, out_features=2048, bias=False, TP=4)
          )
        )
      )
    )
    (final_layernorm): RMSNorm()
  )
  (output_layer): ColumnParallelLinear(in_features=2048, out_features=154880, bias=False, TP=4)
  (rotary_pos_emb): RotaryEmbedding()
)
[INFO:swift] [rank8] model_parameter_info: GPTModel: 2131.2307M Params (2131.2307M Trainable [100.0000%]), 0.0031M Buffers.
[INFO:swift] [rank11] model_parameter_info: GPTModel: 2131.2307M Params (2131.2307M Trainable [100.0000%]), 0.0031M Buffers.
[INFO:swift] padding_to: 4
[INFO:swift] checkpoint_dir: /shared_workspace_mfs/original_models/GLM-4.7-Flash
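To pinpoint which optimizer entry diverged before retrying the resume, a small hypothetical helper (not part of swift or Megatron-LM) can walk two nested dict/list structures and report every list-length divergence rather than stopping at the first, as mcore's merge does:

```python
from typing import Any, Iterator, Tuple

def length_mismatches(a: Any, b: Any, key: Tuple = ()) -> Iterator[Tuple[Tuple, int, int]]:
    """Yield (path, len_a, len_b) for every list-length divergence
    between two nested dict/list structures."""
    if isinstance(a, dict) and isinstance(b, dict):
        for k in a.keys() & b.keys():  # only compare shared keys
            yield from length_mismatches(a[k], b[k], key + (k,))
    elif isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            yield (key, len(a), len(b))
        else:
            for i, (va, vb) in enumerate(zip(a, b)):
                yield from length_mismatches(va, vb, key + (i,))
```

Applied to the saved common state (e.g. a `torch.load` of the common `.pt` file inside checkpoint-75; the exact file layout is an assumption about the mcore dist-ckpt format) versus the freshly built sharded state dict, this would show whether only the `(torch.bfloat16, torch.bfloat16)` bucket group changed size (194 vs 192) or other groups shifted too.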

Labels: bug