
tensor parallel + zero3 error #99

Open

LZY-the-boys opened this issue Jul 31, 2023 · 1 comment

Labels: help wanted (Extra attention is needed)

LZY-the-boys commented Jul 31, 2023

Can ZeRO-3 be used together with model parallelism? In my attempt I used:

config.use_flash = False
config.tp_size = 4
config.ds_config = {
        "fp16": {
            "enabled": True
        },
        "zero_allow_untested_optimizer": True,
        "zero_force_ds_cpu_optimizer": False,
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": False
            }
        },
        "monitor_config": {
            "enabled": True,
            "tag": "adan",
            "csv_monitor": {
                "enabled": True,
                "output_path": "./ds_logs/"
            }
        }
}

and got the following error:

Traceback (most recent call last):
  File "examples/alpaca/train.py", line 97, in <module>
    model.load_state_dict(state_dict)
  File "/home/xx/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
        size mismatch for layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for layers.0.mlp.gate_proj.weight: copying a param with shape torch.Size([2752, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
Carol-gutianle (Collaborator) commented:

Hi, ZeRO-3 can be used together with model parallelism, but manually loading weights via model.load_state_dict() is not recommended. Two approaches are suggested instead:
Approach 1 (preferred): load the weights directly with LlamaForCausalLM.from_pretrained;
Approach 2: see collie.models.base.py line 322, and wrap the loading in the deepspeed.zero.GatheredParameters() context shown there.
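As a rough sketch of approach 2 (the actual helper in collie.models.base.py may differ): under ZeRO-3 every parameter is partitioned across ranks, so the local tensor on each rank has shape torch.Size([0]), which is exactly why load_state_dict reports the size mismatches above. Gathering the full parameters inside deepspeed.zero.GatheredParameters before copying avoids this. The function name load_state_dict_zero3 is hypothetical; this assumes a distributed DeepSpeed run is already initialized.

```python
import torch
import deepspeed


def load_state_dict_zero3(model, state_dict):
    """Copy a full (unsharded) state_dict into a ZeRO-3 partitioned model.

    Under ZeRO stage 3 each rank holds only a shard of every parameter
    (param.data has shape torch.Size([0]) locally), so a plain
    model.load_state_dict() fails with size mismatches. GatheredParameters
    temporarily reassembles the full tensors so the copy is valid.
    """
    for name, param in model.named_parameters():
        if name not in state_dict:
            continue
        # modifier_rank=0: only rank 0 writes into the gathered tensor;
        # the modified values are re-partitioned to all ranks on exit.
        with deepspeed.zero.GatheredParameters(param, modifier_rank=0):
            if torch.distributed.get_rank() == 0:
                param.data.copy_(state_dict[name])
```

Approach 1 sidesteps all of this: when the HF/DeepSpeed integration detects ZeRO-3, from_pretrained handles the partitioned loading internally.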

00INDEX added the "help wanted" (Extra attention is needed) label on Aug 1, 2023