I think this was discussed in the past, but I can't recall what the conclusion was. I suppose that requesting more logits requires more memory to be copied from the GPU, so it's normal for it to cause a slowdown. Though I am not sure the current implementation is optimal. I will likely revisit this logic soon within the context of #11213.
Hi everyone,
I'm experimenting with llama.cpp for a project, and I'd like to get the logits from the GGUF model. However, when I pass logits=True, inference takes almost double the time compared to generating only the tokens.
If anyone could provide suggestions or guidance on optimizing this process or retrieving logits efficiently, I'd greatly appreciate it!
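A back-of-the-envelope sketch of why requesting logits for every position is expensive: each logit row is `n_vocab` floats, so asking for logits at all positions multiplies the amount of data copied off the GPU by the number of tokens in the batch. The vocabulary size and prompt length below are hypothetical examples, not values from this thread:

```python
# Illustrative only: estimates the device-to-host copy volume when
# requesting logits for every token position vs. only the last one.
n_vocab = 32000        # hypothetical LLaMA-style vocabulary size
n_prompt = 512         # hypothetical number of tokens evaluated in one batch
bytes_per_float = 4    # fp32 logits

# Logits for every position: n_prompt rows of n_vocab floats are copied.
all_logits_bytes = n_prompt * n_vocab * bytes_per_float

# Logits for the last position only (the usual default): a single row.
last_logits_bytes = n_vocab * bytes_per_float

print(all_logits_bytes // last_logits_bytes)   # 512x more data moved
print(all_logits_bytes / 1e6)                  # ~65.5 MB per batch
```

If you only need logits for a few positions (e.g. the final token of each sequence), requesting just those rows avoids most of this transfer cost, which matches the slowdown described above.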