Llama2 70B training error #104

Open
xiaopqr opened this issue Aug 14, 2023 · 3 comments

Comments

xiaopqr commented Aug 14, 2023

Training llama2 70B with the latest dev branch code, I run into the following problem:
collie/collie/models/llama/model.py:203 in _forward

   200                             .permute(0, 2, 1, 4, 3) \
   201                             .reshape(batch_size, self.num_key_value_heads,
   202                                      seq_len + start_pos, -1)
 ❱ 203             new_layer_past = torch.stack((present_key, value.permute([0, 2, 1, 3])), dim
   204         attention_mask = attention_mask if attention_mask is not None else torch.ones((q
   205         if self.config.use_flash:
   206             output = flash_attention(query, key, value, attention_mask)

RuntimeError: stack expects each tensor to be equal size, but got [1, 8, 2048, 1024] at entry 0 and [1, 64, 2048, 128] at entry 1
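For reference, the two shapes in the error line up with Llama2 70B's grouped-query attention (64 query heads, 8 key/value heads, head_dim 128): the cached key has been folded down to 8 KV heads of 8 × 128 = 1024 features, while the value is still laid out as 64 heads of 128, and torch.stack requires identical shapes. A minimal sketch of the mismatch (shapes copied from the error message; the fold order used for the value below is an assumption for illustration, not collie's actual layout):

```python
import torch

batch_size, seq_len = 1, 2048
num_kv_heads, num_groups, head_dim = 8, 8, 128  # 8 * 8 = 64 query heads

# Shapes taken from the error: [1, 8, 2048, 1024] vs [1, 64, 2048, 128]
present_key = torch.empty(batch_size, num_kv_heads, seq_len, num_groups * head_dim)
value = torch.empty(batch_size, num_kv_heads * num_groups, seq_len, head_dim)

try:
    torch.stack((present_key, value), dim=1)  # reproduces the RuntimeError above
except RuntimeError as e:
    print(e)

# Stacking only works once key and value share one layout, e.g. folding the
# value down to the 8 KV heads as well (group ordering here is illustrative):
present_value = (value.reshape(batch_size, num_kv_heads, num_groups, seq_len, head_dim)
                      .permute(0, 1, 3, 2, 4)
                      .reshape(batch_size, num_kv_heads, seq_len, num_groups * head_dim))
new_layer_past = torch.stack((present_key, present_value), dim=1)
print(new_layer_past.shape)  # torch.Size([1, 2, 8, 2048, 1024])
```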

That is one problem. Another is that with the dev branch code from a few days ago, trainer.save_model on llama2 70B (8 V100s, on which training itself runs fine) hits a GPU OOM. Since training fits in memory, saving should not run out of it. The latest dev code may still have this issue; it just errors out earlier, before reaching that point.
/opt/conda/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py:1553 in _allgather_params_coalesced

   1550         allgather_params = []
   1551         for psize in partition_sizes:
   1552             tensor_size = psize * self.num_partitions
 ❱ 1553             flat_tensor = torch.empty(tensor_size, dtype=param_list[0].dtype, device=sel
   1554             flat_tensor.requires_grad = False
   1555             allgather_params.append(flat_tensor)
   1556

OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (GPU 7; 31.75 GiB total capacity; 29.60 GiB already allocated; 312.75 MiB free; 29.63 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
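For context, _allgather_params_coalesced is ZeRO-3 gathering the partitioned parameters back together for the save, which is what pushes GPU 7 past its 31.75 GiB. The allocator hint at the end of the error points at one fragmentation workaround; a minimal sketch (the 128 MiB split size is an arbitrary example value, and this only helps fragmentation, not a gather that is genuinely too large for one card):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when the caching allocator initializes, so it
# must be set before the first CUDA allocation (or exported in the shell before
# launching training). 128 MiB here is an example value, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402  imported after setting the env var on purpose
```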

@00INDEX Could you take a look?

xiaopqr (Author) commented Aug 16, 2023

@KaiLv69 @QipengGuo Could you take a look?

KaiLv69 (Collaborator) commented Aug 21, 2023

Hi, the bug with saving the model under ZeRO-3 is currently being worked on.

KaiLv69 (Collaborator) commented Aug 23, 2023

Hi, please update to the latest dev branch and give it a try.

FYI: 82869ee ac6eed4
