
[OOM Error] Out of Memory with 32k tokens #5

Open
JL-Cheng opened this issue Aug 30, 2023 · 4 comments

@JL-Cheng

Thank you for your valuable contribution! I have been experimenting with your evaluation code on the LongChat-Lines dataset. However, I encountered an out-of-memory error when the token length reached 32k.

I am fortunate to have multiple 80G A100 GPUs at my disposal. However, I noticed that your evaluation code does not incorporate parallel processing, and only one GPU is utilized during evaluation.

I would greatly appreciate it if you could provide more information about the resources used in the experimental section of your paper. Additionally, I am curious if you implemented any form of parallelization to enhance the evaluation process.

Thank you once again for your assistance!

@JL-Cheng
Author

JL-Cheng commented Aug 31, 2023

You can consider using DeepSpeed-Inference to solve this problem, which may also speed up the inference process. It only requires slight modifications to the model, the input data, and the launch method.

You can refer to the file ./python/eval/longeval/utils.py and modify the model and the input data as follows:

## ./python/eval/longeval/utils.py

import os  # needed for os.getenv below (if not already imported)
import deepspeed

# AT LINE 82
-  model = model.cuda()
-  model.eval()
+  # shard the model across all launched GPUs via tensor parallelism
+  model = deepspeed.init_inference(
+    model=model,
+    mp_size=int(os.getenv("WORLD_SIZE", "1")),  # set by the deepspeed launcher
+    dtype=torch.float16,
+    replace_with_kernel_inject=False,
+    max_out_tokens=35,
+  )

# AT LINE 106
-  input = tokenizer(prompt, return_tensors="pt")
-  prompt_length = input.input_ids.shape[-1]
-
-  output = model.generate(input_ids=input.input_ids.to(model.device), min_new_tokens=5, max_new_tokens=35, use_cache=False)[0]

+  # each process tokenizes the prompt and places it on its own GPU (cuda:LOCAL_RANK)
+  local_rank = int(os.getenv("LOCAL_RANK", "0"))
+  inputs = tokenizer.encode(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
+  prompt_length = inputs.shape[-1]
+
+  output = model.generate(inputs, min_new_tokens=5, max_new_tokens=35, use_cache=False)[0]

Then you may need to add one line to the file ./python/eval/longeval/eval.py, because the DeepSpeed launcher passes a --local_rank argument to each process:

## ./python/eval/longeval/eval.py

# AT LINE 58
+  parser.add_argument("--local_rank", type=int, default=0, help="local rank")

Finally, you can use the DeepSpeed launcher, deepspeed, to launch inference on multiple GPUs (the launcher sets the WORLD_SIZE and LOCAL_RANK environment variables that the code above reads):

## run.sh
#!/bin/bash

PRETRAINED_MODEL_DIR=xxx  # local path to the pretrained model weights

deepspeed --num_gpus 2 ./eval/longeval/eval.py \
    --model-name-or-path $PRETRAINED_MODEL_DIR \
    --scale-context 1.0 \
    --base-model 

@mces89

mces89 commented Sep 7, 2023

Hi, for this line:
inputs = tokenizer.encode(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
does it mean that every GPU (local_rank) will encode the same inputs?
Also, which model-name-or-path do you use?
Thanks.

@JL-Cheng
Author

> hi, for this line: inputs = tokenizer.encode(prompt, return_tensors="pt").to(f"cuda:{local_rank}") does it mean for every gpu(local_rank), it will encode the same inputs? also which model-name-or-path do you use? Thanks.

There is a blog post about DeepSpeed-Inference that may help you understand how DeepSpeed accelerates inference.

For the first question: DeepSpeed uses tensor parallelism, which shards the model weights across the GPUs and produces the results through inter-GPU communication, so each GPU is not simply running the full model on the same inputs. For more details, you can refer to the issue: microsoft/DeepSpeed#4154.
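
As a quick sanity check (a hypothetical snippet, not code from this repo), you can print the per-rank parameter count right after the deepspeed.init_inference() call in utils.py; with tensor parallelism each rank should hold only roughly 1/mp_size of the parameters of the sharded layers:

## hypothetical check, placed after deepspeed.init_inference() in ./python/eval/longeval/utils.py

import os

local_rank = int(os.getenv("LOCAL_RANK", "0"))
# count the parameters actually resident on this rank; with mp_size > 1 this
# should be noticeably smaller than the full model's parameter count
n_params = sum(p.numel() for p in model.module.parameters())
print(f"rank {local_rank}: {n_params / 1e9:.2f}B parameters on this GPU")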

For the second question: model-name-or-path refers to the local path where the llama2-7b model is stored. You can download it from Hugging Face.
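
For example (a minimal sketch, not from the repo; meta-llama/Llama-2-7b-hf is a gated repository, so this assumes your Hugging Face account has been granted access and you are logged in):

## download_llama2.py (hypothetical helper)

from huggingface_hub import snapshot_download

# downloads the weights into the local Hugging Face cache and returns the path
local_dir = snapshot_download("meta-llama/Llama-2-7b-hf")
print(local_dir)  # pass this directory as --model-name-or-path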

@wutong4012

Hi @JL-Cheng, how did you get 32k-token data? As far as I know, in https://huggingface.co/datasets/abacusai/LongChat-Lines/viewer/default/100 the maximum data length is 26k.
