Zero inference is too slow #5418
Unanswered
garyyang85 asked this question in Q&A
Replies: 1 comment · 4 replies
garyyang85:
I am using DeepSpeed ZeRO-Inference for inference. The model is 13B float16, running on a single V100 32GB GPU. With standard inference, once the input goes beyond about 2000 tokens (the model should support up to 4096), it fails with "CUDA out of memory", so I turned to the ZeRO-Inference solution in DeepSpeed. But the inference speed is too slow, and GPU memory usage is only about 8GB. Is there a way to speed up the inference and make better use of the GPU? Thanks.
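
For reference, a minimal sketch of what a single-GPU ZeRO-Inference setup along these lines typically looks like (the question does not include its script, so the model name, prompt, and generation settings below are placeholders, not taken from the post):

```python
# Minimal single-GPU ZeRO-Inference sketch: ZeRO-3 with CPU parameter offload.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

model_name = "path/to/your-13b-model"  # placeholder

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                       # ZeRO-3: parameters are partitioned
        "offload_param": {                # and kept in CPU memory, then
            "device": "cpu",              # streamed to the GPU on demand
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,  # required key; unused at inference
}

# Create this before from_pretrained so the weights load directly into the
# partitioned/offloaded form instead of materializing fully on one device.
dschf = HfDeepSpeedConfig(ds_config)  # keep a reference alive

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Wrap the model; parameters now live on the host and are fetched over PCIe
# per layer, which is why GPU memory stays low (~8GB) while latency rises.
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

inputs = tokenizer("An example prompt", return_tensors="pt").to("cuda")
with torch.no_grad():
    output = engine.module.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```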
Reply:
@garyyang85, ZeRO-Inference is expected to be slower because it streams the weights over the slower PCIe link. Here are a couple of things to do.
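
The concrete suggestions are in the follow-up replies, which are not reproduced above. As general context only: ZeRO-Inference throughput is usually bound by weight streaming, so the knobs most often tuned are the batch size (the PCIe transfer cost of each forward pass is amortized over the batch) and the ZeRO-3 prefetch/persistence settings. The sketch below is illustrative; the values are examples, not recommendations from this thread.

```python
# Illustrative ZeRO-3 settings often adjusted for ZeRO-Inference throughput;
# the numbers are examples only, not values suggested in this discussion.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        # Fetch more parameter elements ahead of use to overlap PCIe
        # transfers with GPU compute.
        "stage3_prefetch_bucket_size": 500000000,
        # Parameters smaller than this stay resident instead of being
        # re-fetched for every forward pass.
        "stage3_param_persistence_threshold": 100000,
        # Upper bound on parameter elements held on the GPU at once.
        "stage3_max_live_parameters": 1000000000,
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Since every forward pass streams the full set of weights over PCIe,
# batching several prompts together amortizes that cost and raises throughput.
```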