Configuration:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8 train.py
```

```python
config.tp_size = 1
config.dp_size = 1  # or 8, doesn't matter
config.pp_size = 1
config.train_epochs = 1
config.eval_per_n_steps = 0
config.eval_per_n_epochs = 1
config.train_micro_batch_size = 1
config.eval_batch_size = 1
config.ds_config = {
    "fp16": {
        "enabled": True
    },
    "zero_allow_untested_optimizer": True,
    "zero_force_ds_cpu_optimizer": False,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": False
        }
    }
}
```

8× A100, each using around 30 GB of GPU memory; host RAM usage is about 130 GB, peaking at roughly 400 GB while loading the model with offload.
I wanted to try pipeline parallelism, so I changed the config as follows:

```python
config.tp_size = 4
config.dp_size = 1
config.pp_size = 2
```
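As a sanity check on the config above, the parallel degrees must multiply to the world size launched by torchrun (8 processes here):

```python
# Values taken from the config and torchrun command above.
tp_size, dp_size, pp_size = 4, 1, 2
nproc_per_node = 8

# tensor-parallel x data-parallel x pipeline-parallel must cover all ranks
assert tp_size * dp_size * pp_size == nproc_per_node
print(tp_size * dp_size * pp_size)  # 8
```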
This also required a change in collie/module.py:

```python
self.parts = [int(i) for i in self.parts]
os.environ["COLLIE_PP_PARTS"] = json.dumps(self.parts)
```
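The `int()` cast matters because `json.dumps` rejects non-native integer types (e.g. NumPy integers, which can show up in a partition list). A minimal stand-alone check, using a hypothetical partition list:

```python
import json
import os

# Hypothetical pipeline partition boundaries (the real values come from
# CoLLiE's layer partitioner and may not be plain Python ints).
parts = [0, 8, 16, 24, 32]

parts = [int(i) for i in parts]                    # coerce to plain Python ints
os.environ["COLLIE_PP_PARTS"] = json.dumps(parts)  # env var values must be strings

# Workers can recover the partition from the environment:
recovered = json.loads(os.environ["COLLIE_PP_PARTS"])
print(recovered)  # [0, 8, 16, 24, 32]
```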
It turns out this is not supported yet: `Lomo is incompatible with pipeline parallelism`.
Hi, because of the fused backward pass (fused_backward), LOMO does not support pipeline parallelism with the 1F1B schedule. Around 30 GB per card on 8× A100 is indeed a bit high; DeepSpeed's communication buffers may be taking up a lot of GPU memory. Compatibility between LOMO and DeepSpeed offload has not been tested yet, so we don't know how it behaves, nor whether anything is actually offloaded to the CPU.
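The conflict can be sketched as a toy scheduling model (pure Python, illustrative micro-batch count and event order, no real framework): under 1F1B, forward passes for later micro-batches run before earlier backwards finish, so an optimizer that updates weights inside the backward pass, as LOMO's fused backward does, would make those in-flight micro-batches run their backward against different weights than their forward saw.

```python
# Hypothetical 1F1B event order for 4 micro-batches on one pipeline stage:
# F = forward, B = backward (a fused update would happen inside each B).
schedule = ["F1", "F2", "B1", "F3", "B2", "F4", "B3", "B4"]

weights_version = 0
forward_version = {}   # micro-batch -> weight version its forward used
inconsistent = []

for ev in schedule:
    kind, mb = ev[0], int(ev[1])
    if kind == "F":
        forward_version[mb] = weights_version
    else:
        # With a fused backward, the update happens here, so any micro-batch
        # whose forward ran earlier now sees a newer weight version.
        if forward_version[mb] != weights_version:
            inconsistent.append(mb)
        weights_version += 1

print(inconsistent)  # [2, 3, 4] -- their backwards run against updated weights
```

With a conventional optimizer the update is deferred until all micro-batch gradients are accumulated, so every backward in the schedule sees the same weight version as its forward.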