
[OOM Error] Out of Memory with 32k tokens #5

Open
JL-Cheng opened this issue Aug 30, 2023 · 4 comments

@JL-Cheng

Thank you for your valuable contribution! I have been experimenting with your evaluation code on the LongChat-Lines dataset. However, I encountered an out-of-memory error when the token length reached 32k.

I am fortunate to have multiple 80G A100 GPUs at my disposal. However, I noticed that your evaluation code does not incorporate parallel processing, and only one GPU is utilized during evaluation.

I would greatly appreciate it if you could provide more information about the resources used in the experimental section of your paper. Additionally, I am curious if you implemented any form of parallelization to enhance the evaluation process.

Thank you once again for your assistance!

@JL-Cheng
Author

JL-Cheng commented Aug 31, 2023

You can consider using DeepSpeed-Inference to solve this problem, which may also speed up the inference process. It only requires slight modifications to the model, the input data, and the launch method.

You can refer to the file ./python/eval/longeval/utils.py and modify the model and the input data as follows:

## ./python/eval/longeval/utils.py

import os  # needed for os.getenv below (if not already imported)
import deepspeed

# AT LINE 82
-  model = model.cuda()
-  model.eval()
+  # shard the model across all launched GPUs via tensor parallelism
+  model = deepspeed.init_inference(
+    model=model,
+    mp_size=int(os.getenv("WORLD_SIZE", "1")),  # set by the deepspeed launcher
+    dtype=torch.float16,
+    replace_with_kernel_inject=False,
+    max_out_tokens=35,
+  )

# AT LINE 106
-  input = tokenizer(prompt, return_tensors="pt")
-  prompt_length = input.input_ids.shape[-1]
-
-  output = model.generate(input_ids=input.input_ids.to(model.device), min_new_tokens=5, max_new_tokens=35, use_cache=False)[0]

+  # each process tokenizes the prompt and places it on its own GPU (cuda:LOCAL_RANK)
+  local_rank = int(os.getenv("LOCAL_RANK", "0"))
+  inputs = tokenizer.encode(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
+  prompt_length = inputs.shape[-1]
+
+  output = model.generate(inputs, min_new_tokens=5, max_new_tokens=35, use_cache=False)[0]

Then you may need to add one line to the file ./python/eval/longeval/eval.py, because the DeepSpeed launcher passes a --local_rank argument to each process:

## ./python/eval/longeval/eval.py

# AT LINE 58
+  parser.add_argument("--local_rank", type=int, default=0, help="local rank")

Finally, you can use the DeepSpeed launcher, deepspeed, to launch inference on multiple GPUs (the launcher sets the WORLD_SIZE and LOCAL_RANK environment variables that the code above reads):

## run.sh
#!/bin/bash

PRETRAINED_MODEL_DIR=xxx  # local path to the pretrained model weights

deepspeed --num_gpus 2 ./eval/longeval/eval.py \
    --model-name-or-path $PRETRAINED_MODEL_DIR \
    --scale-context 1.0 \
    --base-model 

@mces89

mces89 commented Sep 7, 2023

Hi, for this line:
inputs = tokenizer.encode(prompt, return_tensors="pt").to(f"cuda:{local_rank}")
does it mean that every GPU (local_rank) will encode the same inputs?
Also, which model-name-or-path do you use?
Thanks.

@JL-Cheng
Author

> hi, for this line: inputs = tokenizer.encode(prompt, return_tensors="pt").to(f"cuda:{local_rank}") does it mean for every gpu(local_rank), it will encode the same inputs? also which model-name-or-path do you use? Thanks.

There is a blog post about DeepSpeed-Inference that may help you understand how DeepSpeed accelerates inference.

For the first question: DeepSpeed uses tensor parallelism, which shards the model weights across the GPUs and produces the results through inter-GPU communication, so each GPU is not simply running the full model on the same inputs. For more details, you can refer to the issue: microsoft/DeepSpeed#4154.
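
As a quick sanity check (a hypothetical snippet, not code from this repo), you can print the per-rank parameter count right after the deepspeed.init_inference() call in utils.py; with tensor parallelism each rank should hold only roughly 1/mp_size of the parameters of the sharded layers:

## hypothetical check, placed after deepspeed.init_inference() in ./python/eval/longeval/utils.py

import os

local_rank = int(os.getenv("LOCAL_RANK", "0"))
# count the parameters actually resident on this rank; with mp_size > 1 this
# should be noticeably smaller than the full model's parameter count
n_params = sum(p.numel() for p in model.module.parameters())
print(f"rank {local_rank}: {n_params / 1e9:.2f}B parameters on this GPU")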

For the second question: model-name-or-path refers to the local path where the llama2-7b model is stored. You can download it from Hugging Face.
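
For example (a minimal sketch, not from the repo; meta-llama/Llama-2-7b-hf is a gated repository, so this assumes your Hugging Face account has been granted access and you are logged in):

## download_llama2.py (hypothetical helper)

from huggingface_hub import snapshot_download

# downloads the weights into the local Hugging Face cache and returns the path
local_dir = snapshot_download("meta-llama/Llama-2-7b-hf")
print(local_dir)  # pass this directory as --model-name-or-path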

@wutong4012

Hi @JL-Cheng, how did you get 32k-token data? As far as I know, in https://huggingface.co/datasets/abacusai/LongChat-Lines/viewer/default/100 the maximum data length is 26k.
