With 8 V100 GPUs, ZeRO-3 enabled, TP=1, PP=1, DP=8, loading the llama 70B model via LlamaForCausalLM.from_pretrained hits OOM (host RAM, not GPU memory) on a machine with 512 GB of physical memory. The cause is around line 304 of base.py on the dev branch:

state_dict = {}
if (not is_zero3_enabled(config) or env.dp_rank == 0
        or config.low_cpu_mem_usage or config.quantization_config.load_in_8bit
        or getattr(config.quantization_config, "load_in_4bit", False)):
    state_dict = cls.load_parallel_state_dict(
        path=model_path_or_name, config=config,
        process_exclusion=process_exclusion, **kwargs
    )

This makes all 8 processes each load the full state_dict, which consumes a huge amount of host memory and causes the OOM.
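For illustration only, here is a minimal standalone sketch of why the chained "or" ends up loading the checkpoint on every rank. It assumes one of the extra flags (here config.low_cpu_mem_usage) evaluates to True in this setup; it is plain Python rather than collie code, and it is not the actual patch applied in 1871bcb:

# Hypothetical standalone demo of the condition in base.py (not collie code).
# Assumption: ZeRO-3 is enabled and config.low_cpu_mem_usage is True.
zero3_enabled = True
low_cpu_mem_usage = True
world_dp_size = 8

for dp_rank in range(world_dp_size):
    # Original condition: the trailing "or low_cpu_mem_usage" makes this True
    # on every rank, so all 8 processes load the full 70B state_dict into RAM.
    loads_full_state_dict = (not zero3_enabled) or dp_rank == 0 or low_cpu_mem_usage
    # Rank-0-only variant: under ZeRO-3, only data-parallel rank 0 would load.
    rank0_only = (not zero3_enabled) or dp_rank == 0
    print(f"dp_rank={dp_rank}: original={loads_full_state_dict}, rank0_only={rank0_only}")

With 8 ranks each materializing a full 70B-parameter fp16 checkpoint (roughly 140 GB per copy), 512 GB of host RAM is exhausted well before all ranks finish loading.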
@KaiLv69 Could you take a look at this?
@xiaopqr Hi, sorry for the inconvenience this has caused. The issue has been fixed in 1871bcb; please use the dev branch, or wait for the next release to be merged into the main branch.
I have tested this branch on 4 × A100 80GB GPUs. Training runs, but I get an OOM while saving the checkpoint.
Hi, that bug is fixed in the dev branch; please give it a try.
FYI: 82869ee ac6eed4
Could you please share the training script for the 70B model?