With 8 V100 GPUs, ZeRO-3 enabled, TP=1, PP=1, DP=8, loading the llama 70B model via LlamaForCausalLM.from_pretrained hits OOM (host RAM, not GPU memory) on a machine with 512 GB of physical memory. The cause is around line 304 of base.py on the dev branch:

state_dict = {}
if (not is_zero3_enabled(config) or env.dp_rank == 0
        or config.low_cpu_mem_usage or config.quantization_config.load_in_8bit
        or getattr(config.quantization_config, "load_in_4bit", False)):
    state_dict = cls.load_parallel_state_dict(
        path=model_path_or_name, config=config,
        process_exclusion=process_exclusion, **kwargs
    )

This makes all 8 processes each load the full state_dict, which consumes a huge amount of host memory and causes the OOM.
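For illustration only, here is a minimal standalone sketch of why the chained "or" ends up loading the checkpoint on every rank. It assumes one of the extra flags (here config.low_cpu_mem_usage) evaluates to True in this setup; it is plain Python rather than collie code, and it is not the actual patch applied in 1871bcb:

# Hypothetical standalone demo of the condition in base.py (not collie code).
# Assumption: ZeRO-3 is enabled and config.low_cpu_mem_usage is True.
zero3_enabled = True
low_cpu_mem_usage = True
world_dp_size = 8

for dp_rank in range(world_dp_size):
    # Original condition: the trailing "or low_cpu_mem_usage" makes this True
    # on every rank, so all 8 processes load the full 70B state_dict into RAM.
    loads_full_state_dict = (not zero3_enabled) or dp_rank == 0 or low_cpu_mem_usage
    # Rank-0-only variant: under ZeRO-3, only data-parallel rank 0 would load.
    rank0_only = (not zero3_enabled) or dp_rank == 0
    print(f"dp_rank={dp_rank}: original={loads_full_state_dict}, rank0_only={rank0_only}")

With 8 ranks each materializing a full 70B-parameter fp16 checkpoint (roughly 140 GB per copy), 512 GB of host RAM is exhausted well before all ranks finish loading.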
@KaiLv69 Could you take a look at this?
@xiaopqr Hi, sorry for the inconvenience this has caused. The issue has been fixed in 1871bcb; please use the dev branch, or wait for the next release to be merged into the main branch.
I have tested this branch on 4 × A100 80GB GPUs. Training runs, but I get an OOM while saving the checkpoint.
Hi, that bug is fixed in the dev branch; please give it a try.
FYI: 82869ee ac6eed4
Could you please share the training script for the 70B model?