
Latest version of Xinference crashes easily when loading models at startup #1847

Closed
worm128 opened this issue Jul 11, 2024 · 1 comment

worm128 commented Jul 11, 2024

Error screenshot:
[screenshot attached in the original issue]
Model: Qwen1.5-14B-Chat-GPTQ-int4
Loading engine: vllm
Error message:
torch.cuda.OutOfMemoryError: [address=0.0.0.0:43411, pid=101] CUDA out of memory. Tried to allocate 70.00 MiB. GPU
2024-07-11 21:46:49 INFO 07-11 13:46:49 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-07-11 21:46:49 INFO 07-11 13:46:49 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.

Description:
My GPU has 24 GB of VRAM. With about 10 GB still free it reports out-of-memory; I am not sure whether this is VRAM fragmentation or something else, but the problem occurs frequently on the new version 0.12.3, especially after loading the m3e-base embedding model: loading an LLM model afterwards easily triggers this error. The old version 0.8.5 loaded models much more stably and never produced these errors.

Everyone is welcome to join QQ group 27831318 to discuss.
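
The vLLM log above already names the relevant knobs: lower gpu_memory_utilization, set enforce_eager=True, or reduce max_num_seqs. Below is a minimal sketch of passing those options through the Xinference Python client when launching the model; it assumes extra keyword arguments are forwarded to the vLLM engine (names follow vLLM's engine arguments), so the exact values and the pass-through behavior are illustrative rather than a verified fix for this crash.

```python
# Sketch: launch Qwen1.5-14B-Chat (GPTQ Int4) through Xinference with the
# vLLM memory knobs mentioned in the log above. Assumes extra kwargs are
# forwarded to the vLLM engine; adjust values to your GPU.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")  # default Xinference endpoint

model_uid = client.launch_model(
    model_name="qwen1.5-chat",
    model_engine="vllm",
    model_format="gptq",
    model_size_in_billions=14,
    quantization="Int4",
    # vLLM engine arguments (assumed to be passed through unchanged):
    gpu_memory_utilization=0.7,  # leave headroom instead of the ~0.9 default
    enforce_eager=True,          # skip CUDA graph capture (saves 1-3 GiB)
    max_num_seqs=64,             # smaller batch -> smaller KV-cache footprint
)
print("launched:", model_uid)
```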

@XprobeBot XprobeBot added gpu and removed feature labels Jul 11, 2024
@XprobeBot XprobeBot added this to the v0.13.1 milestone Jul 11, 2024
ChengjieLi28 (Contributor) commented

With the vLLM engine, a single LLM will take up almost all of the VRAM on a card, so it is not recommended to deploy an embedding model and an LLM (vLLM) on the same card.
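
If both models have to live on the same machine, one way to follow this advice is to pin them to different GPUs. A rough sketch via the Python client, assuming a two-GPU host and that the installed Xinference version accepts a gpu_idx argument for device placement (that argument is an assumption; check the launch options of your version):

```python
# Sketch: keep the embedding model and the vLLM-backed LLM on separate GPUs
# so vLLM's near-total VRAM reservation does not squeeze the embedding model.
# Assumes a 2-GPU host and that launch_model accepts gpu_idx for placement.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")

# Embedding model on GPU 0.
embed_uid = client.launch_model(
    model_name="m3e-base",
    model_type="embedding",
    gpu_idx=[0],
)

# LLM (vLLM engine) on GPU 1, where it can claim most of the card.
llm_uid = client.launch_model(
    model_name="qwen1.5-chat",
    model_engine="vllm",
    model_format="gptq",
    model_size_in_billions=14,
    quantization="Int4",
    gpu_idx=[1],
)
```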

@ChengjieLi28 ChengjieLi28 closed this as not planned Won't fix, can't repro, duplicate, stale Jul 12, 2024