
Latest version of Xinference crashes easily when loading models at startup #1847

Closed
worm128 opened this issue Jul 11, 2024 · 1 comment

worm128 commented Jul 11, 2024

Error screenshot:
[screenshot attached in the original issue]
Model: Qwen1.5-14B-Chat-GPTQ-int4
Loading engine: vllm
Error message:
torch.cuda.OutOfMemoryError: [address=0.0.0.0:43411, pid=101] CUDA out of memory. Tried to allocate 70.00 MiB. GPU
2024-07-11 21:46:49 INFO 07-11 13:46:49 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-07-11 21:46:49 INFO 07-11 13:46:49 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.

Description:
My GPU has 24 GB of VRAM. With about 10 GB still free it reports out-of-memory; I am not sure whether this is VRAM fragmentation or something else, but the problem occurs frequently on the new version 0.12.3, especially after loading the m3e-base embedding model: loading an LLM model afterwards easily triggers this error. The old version 0.8.5 loaded models much more stably and never produced these errors.

Everyone is welcome to join QQ group 27831318 to discuss.
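
The vLLM log above already names the relevant knobs: lower gpu_memory_utilization, set enforce_eager=True, or reduce max_num_seqs. Below is a minimal sketch of passing those options through the Xinference Python client when launching the model; it assumes extra keyword arguments are forwarded to the vLLM engine (names follow vLLM's engine arguments), so the exact values and the pass-through behavior are illustrative rather than a verified fix for this crash.

```python
# Sketch: launch Qwen1.5-14B-Chat (GPTQ Int4) through Xinference with the
# vLLM memory knobs mentioned in the log above. Assumes extra kwargs are
# forwarded to the vLLM engine; adjust values to your GPU.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")  # default Xinference endpoint

model_uid = client.launch_model(
    model_name="qwen1.5-chat",
    model_engine="vllm",
    model_format="gptq",
    model_size_in_billions=14,
    quantization="Int4",
    # vLLM engine arguments (assumed to be passed through unchanged):
    gpu_memory_utilization=0.7,  # leave headroom instead of the ~0.9 default
    enforce_eager=True,          # skip CUDA graph capture (saves 1-3 GiB)
    max_num_seqs=64,             # smaller batch -> smaller KV-cache footprint
)
print("launched:", model_uid)
```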

@XprobeBot XprobeBot added gpu and removed feature labels Jul 11, 2024
@XprobeBot XprobeBot added this to the v0.13.1 milestone Jul 11, 2024
ChengjieLi28 (Contributor) commented

With the vLLM engine, a single LLM will take up almost all of the VRAM on a card, so it is not recommended to deploy an embedding model and an LLM (vLLM) on the same card.
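
If both models have to live on the same machine, one way to follow this advice is to pin them to different GPUs. A rough sketch via the Python client, assuming a two-GPU host and that the installed Xinference version accepts a gpu_idx argument for device placement (that argument is an assumption; check the launch options of your version):

```python
# Sketch: keep the embedding model and the vLLM-backed LLM on separate GPUs
# so vLLM's near-total VRAM reservation does not squeeze the embedding model.
# Assumes a 2-GPU host and that launch_model accepts gpu_idx for placement.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")

# Embedding model on GPU 0.
embed_uid = client.launch_model(
    model_name="m3e-base",
    model_type="embedding",
    gpu_idx=[0],
)

# LLM (vLLM engine) on GPU 1, where it can claim most of the card.
llm_uid = client.launch_model(
    model_name="qwen1.5-chat",
    model_engine="vllm",
    model_format="gptq",
    model_size_in_billions=14,
    quantization="Int4",
    gpu_idx=[1],
)
```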

@ChengjieLi28 ChengjieLi28 closed this as not planned Won't fix, can't repro, duplicate, stale Jul 12, 2024