
LOMO training a 65B LLaMA in practice: Lomo is incompatible with pipeline parallelism #152

Open
zlh1992 opened this issue Feb 4, 2024 · 1 comment
zlh1992 commented Feb 4, 2024

Configuration:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8 train.py
config.tp_size = 1
config.dp_size = 1 # 8 doesn't matter either
config.pp_size = 1
config.train_epochs = 1
config.eval_per_n_steps = 0
config.eval_per_n_epochs = 1
config.train_micro_batch_size = 1
config.eval_batch_size = 1
config.ds_config = {
    "fp16": {
        "enabled": True
    },
    "zero_allow_untested_optimizer": True,
    "zero_force_ds_cpu_optimizer": False,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": False
        }
    }
}
8x A100, each consuming around 30 GB of GPU memory; host memory usage is about 130 GB, and loading the model with offload peaks at roughly 400 GB of host memory.
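As a rough cross-check of those numbers, here is a back-of-the-envelope sketch (assumptions, not measurements from this run): 65B parameters in fp16 alone come to about 121 GiB, close to the ~130 GB host-memory figure, and ZeRO stage 3 shards them across the 8 ranks.

```python
# Rough memory estimate for a 65B-parameter model in fp16.
# These are assumptions for illustration, not measured values.
params = 65e9
bytes_per_param_fp16 = 2

model_fp16_gib = params * bytes_per_param_fp16 / 1024**3  # full fp16 weights
per_gpu_shard_gib = model_fp16_gib / 8                    # ZeRO-3 shard over 8 GPUs

print(f"full fp16 weights: {model_fp16_gib:.0f} GiB")
print(f"per-GPU parameter shard: {per_gpu_shard_gib:.1f} GiB")
```

The remaining per-GPU usage beyond the parameter shard would come from activations, communication buffers, and runtime overhead.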

I wanted to try pipeline parallelism (pp), so I changed the config as follows:
config.tp_size = 4
config.dp_size = 1
config.pp_size = 2
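As a quick sanity check on the layout (an assumption on my part: like most 3D-parallel frameworks, the effective layout should satisfy world_size == tp_size * pp_size * dp_size, with dp filled in from the remaining ranks, which would explain why dp_size "doesn't matter" in the first config):

```python
# Sanity check of the 3D-parallel layout over 8 processes.
# Assumption: dp is derived from the ranks left over after tp and pp.
world_size = 8            # --nproc_per_node=8
tp_size, pp_size = 4, 2   # the modified config above

dp_size = world_size // (tp_size * pp_size)
assert tp_size * pp_size * dp_size == world_size
print(f"effective dp_size: {dp_size}")
```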

The following change is needed in collie/module.py:
self.parts = [int(i) for i in self.parts]
os.environ["COLLIE_PP_PARTS"] = json.dumps(self.parts)
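Presumably the int cast is needed because the computed partition boundaries arrive as non-native integer types (e.g. numpy.int64), which json.dumps refuses to serialize. A stdlib-only sketch of the failure mode, where FakeInt is a hypothetical stand-in for numpy.int64:

```python
import json

class FakeInt:
    """Stand-in for a non-native integer type such as numpy.int64."""
    def __init__(self, v):
        self.v = v
    def __int__(self):
        return self.v

parts = [FakeInt(0), FakeInt(20), FakeInt(40)]  # hypothetical layer boundaries

try:
    json.dumps(parts)                   # fails: unknown type for the json module
except TypeError:
    print("raw boundary values are not JSON-serializable")

serialized = json.dumps([int(i) for i in parts])  # the cast fixes it
print(serialized)
```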

It turns out this is still unsupported; the run fails with: Lomo is incompatible with pipeline parallelism

KaiLv69 (Collaborator) commented Feb 5, 2024

Hi, because of the fused_backward process, LOMO does not support pipeline parallelism with the 1F1B schedule.
Around 30 GB per card across 8 A100s is indeed a bit high; DeepSpeed's communication buffers may be taking up a lot of GPU memory. Compatibility between LOMO and DeepSpeed's offload hasn't been tested yet, so we don't know how well it works, or whether anything is actually offloaded to the CPU.
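For intuition on the incompatibility, here is a minimal sketch of a LOMO-style fused backward (an illustration of the general idea, not collie's actual code): a per-parameter hook applies the update the moment that parameter's gradient is computed, then discards the gradient. Because the optimizer step is fused into backward itself, it cannot be separated out and interleaved the way the 1F1B pipeline schedule requires.

```python
import torch

def attach_fused_sgd(model, lr=1e-2):
    # Register a hook per parameter: as soon as its gradient is ready,
    # update the parameter in place and drop the gradient, so no full
    # gradient buffer is ever held and no separate optimizer.step() runs.
    def make_hook(p):
        def hook(grad):
            with torch.no_grad():
                p.add_(grad, alpha=-lr)   # immediate SGD update
            return torch.zeros_like(grad)  # nothing left to accumulate
        return hook
    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook(p))

model = torch.nn.Linear(4, 2)
before = model.weight.detach().clone()
attach_fused_sgd(model)
model(torch.randn(3, 4)).sum().backward()
# The weights changed during backward itself:
print("updated during backward:", not torch.equal(model.weight.detach(), before))
```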
