
tensor parallel + zero3 error #99

Open

LZY-the-boys opened this issue Jul 31, 2023 · 1 comment

Labels: help wanted (Extra attention is needed)

LZY-the-boys commented Jul 31, 2023

Can ZeRO-3 be used together with model parallelism? In my attempt I used:

config.use_flash = False
config.tp_size = 4
config.ds_config = {
        "fp16": {
            "enabled": True
        },
        "zero_allow_untested_optimizer": True,
        "zero_force_ds_cpu_optimizer": False,
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": False
            }
        },
        "monitor_config": {
            "enabled": True,
            "tag": "adan",
            "csv_monitor": {
                "enabled": True,
                "output_path": "./ds_logs/"
            }
        }
}

and got the following error:

Traceback (most recent call last):
  File "examples/alpaca/train.py", line 97, in <module>
    model.load_state_dict(state_dict)
  File "/home/xx/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        size mismatch for layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for layers.0.mlp.gate_proj.weight: copying a param with shape torch.Size([2752, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
Carol-gutianle (Collaborator) commented:

Hi, ZeRO-3 can be used together with model parallelism, but manually loading weights via model.load_state_dict() is not recommended. Two approaches are suggested instead:
Approach 1 (preferred): load the weights directly with LlamaForCausalLM.from_pretrained;
Approach 2: see collie.models.base.py line 322, and wrap the loading in the deepspeed.zero.GatheredParameters() context shown there.
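As a rough sketch of approach 2 (the actual helper in collie.models.base.py may differ): under ZeRO-3 every parameter is partitioned across ranks, so the local tensor on each rank has shape torch.Size([0]), which is exactly why load_state_dict reports the size mismatches above. Gathering the full parameters inside deepspeed.zero.GatheredParameters before copying avoids this. The function name load_state_dict_zero3 is hypothetical; this assumes a distributed DeepSpeed run is already initialized.

```python
import torch
import deepspeed


def load_state_dict_zero3(model, state_dict):
    """Copy a full (unsharded) state_dict into a ZeRO-3 partitioned model.

    Under ZeRO stage 3 each rank holds only a shard of every parameter
    (param.data has shape torch.Size([0]) locally), so a plain
    model.load_state_dict() fails with size mismatches. GatheredParameters
    temporarily reassembles the full tensors so the copy is valid.
    """
    for name, param in model.named_parameters():
        if name not in state_dict:
            continue
        # modifier_rank=0: only rank 0 writes into the gathered tensor;
        # the modified values are re-partitioned to all ranks on exit.
        with deepspeed.zero.GatheredParameters(param, modifier_rank=0):
            if torch.distributed.get_rank() == 0:
                param.data.copy_(state_dict[name])
```

Approach 1 sidesteps all of this: when the HF/DeepSpeed integration detects ZeRO-3, from_pretrained handles the partitioned loading internally.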

00INDEX added the "help wanted" (Extra attention is needed) label on Aug 1, 2023