Are there any ways to reduce GPU memory usage? Running 128k NeedleBench inference still OOMs on 8×A100 [Feature] #1131
Unanswered
1518630367 asked this question in Q&A
Replies: 2 comments
- You can try LMDeploy. (A rough config sketch follows.)
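For reference, a minimal sketch of what an LMDeploy-backed model entry might look like in an OpenCompass config. The TurboMindModel class and the engine_config field names (session_len, tp, max_batch_size) are assumptions based on OpenCompass's LMDeploy integration and may differ across versions:

# A hedged sketch: engine_config field names are assumptions and may
# vary between OpenCompass/LMDeploy versions.
from opencompass.models import TurboMindModel

models = [
    dict(
        type=TurboMindModel,
        abbr="llama-3-8b-instruct-turbomind",
        path="/opt/218/models/Meta-Llama-3-8B-Instruct-NTK",
        engine_config=dict(
            session_len=131072,  # context length the engine reserves per session
            tp=8,                # shard weights and KV cache across all 8 A100s
            max_batch_size=1,
        ),
        max_out_len=128,
        max_seq_len=122880,
        batch_size=1,
        run_cfg=dict(num_gpus=8, num_procs=1),
    )
]

LMDeploy's paged KV cache avoids materializing the full attention workspace at once, which is the usual reason it fits long contexts that a plain HuggingFace forward pass cannot.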
- You can try adding model_kwargs=dict(tensor_parallel_size=2, gpu_memory_utilization=0.7) to the config file; after I added it, it ran normally. (Applied to this question's setup, it might look like the sketch below.)
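A sketch of the suggestion applied to the question's config, assuming the VLLM wrapper already imported there passes model_kwargs through to vLLM's engine (tensor_parallel_size and gpu_memory_utilization are standard vLLM engine arguments; the abbr and the choice of 2 GPUs follow the reply, not the original 8-GPU setup):

from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr="llama-3-8b-instruct-vllm",
        path="/opt/218/models/Meta-Llama-3-8B-Instruct-NTK",
        # tensor_parallel_size shards weights and KV cache across GPUs;
        # gpu_memory_utilization caps the fraction of each GPU vLLM pre-allocates.
        # Raising tensor_parallel_size toward 8 frees more headroom per GPU.
        model_kwargs=dict(tensor_parallel_size=2, gpu_memory_utilization=0.7),
        max_out_len=128,
        max_seq_len=122880,
        batch_size=1,
        run_cfg=dict(num_gpus=2, num_procs=1),
    )
]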
Describe the feature
Here is part of my parameter configuration:
from opencompass.models import HuggingFaceCausalLM
from opencompass.models import VLLM

# Llama-3 chat template markers for user/assistant turns.
_meta_template = dict(
    round=[
        dict(role="HUMAN", begin="<|start_header_id|>user<|end_header_id|>\n\n", end="<|eot_id|>"),
        dict(role="BOT", begin="<|start_header_id|>assistant<|end_header_id|>\n\n", end="<|eot_id|>", generate=True),
    ],
)

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr="llama-3-8b-instruct-hf",
        path="/opt/218/models/Meta-Llama-3-8B-Instruct-NTK",
        model_kwargs=dict(device_map="auto"),
        tokenizer_kwargs=dict(
            padding_side="left",
            truncation_side="left",
            use_fast=False,
        ),
        meta_template=_meta_template,
        max_out_len=128,
        max_seq_len=122880,  # ~120k-token context window
        batch_size=1,
        run_cfg=dict(num_gpus=7, num_procs=1),
        generation_kwargs={"eos_token_id": [128001, 128009]},
        # batch_padding=True,
    )
]
Will you implement it?
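As background on why 128k contexts exhaust memory even across 8×A100: the KV cache alone is substantial. A back-of-the-envelope estimate, assuming the published Llama-3-8B shape (32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16); treat the result as an order-of-magnitude figure, not an exact allocation:

# Rough KV-cache estimate for Llama-3-8B at a 128k context (fp16).
num_layers = 32
num_kv_heads = 8      # grouped-query attention
head_dim = 128
bytes_per_elem = 2    # fp16
seq_len = 131072      # 128k tokens

# K and V each store num_kv_heads * head_dim values per layer per token.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
total_gib = bytes_per_token * seq_len / 1024**3
print(f"{bytes_per_token} bytes/token, ~{total_gib:.0f} GiB per 128k sequence")
# -> 131072 bytes/token, ~16 GiB per 128k sequence

On top of roughly 16 GB of fp16 weights, a naive HuggingFace forward pass also allocates large attention activations at this length, which is why tensor parallelism or a paged-KV backend (vLLM, LMDeploy) is usually what makes 128k inference fit.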