Improving server inference/TTFT via prompt/input streaming #11348
-
Hello llama.cpp community, for my use case I want to summarize a transcript of a longer conversation (60-90 min) after it is finished. This requires a model with a longer context length, since the transcript corresponds to 10-20k tokens (I am currently using Qwen2.5, either the 14B or the 32B version, on a 36GB M3 MacBook Pro). The issue is that prompt evaluation takes 2-3 minutes before the model can actually start streaming its response (TTFT), which is not great for the user.

I was wondering if there is a way to perform the prompt evaluation in a streaming manner while the transcript is being recorded. I would assume something similar happens in OpenAI's advanced voice mode, but I didn't find any information on it. Since I have barely any experience with C++ and the llama.cpp project, I wanted to ask if any of you know something about this before trying to hack my own solution. Do you think this is possible with the current code version, and if yes, do you have any pointers and how complex do you think it is? Overall it seems like a general improvement for user-facing apps to start processing while the user is still creating the prompt, in order to reduce the time to first token. Thanks a lot for any help!
-
From a technical standpoint this is definitely possible. The TTFT is due to the need to populate the KV cache before a new token can be generated. With the current code you should be able to just periodically send the current transcript to the server. If prompt caching is enabled, the KV cache values are kept and only the newly appended tokens need to be processed. I don't know whether the server allows you to request the generation of zero new tokens, but in terms of performance requesting one new token (which can then simply be discarded) is virtually the same.
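
For illustration, here is a minimal sketch of that approach against the llama.cpp HTTP server. It assumes a `llama-server` instance reachable at `http://localhost:8080` and its `/completion` endpoint with the `cache_prompt` and `n_predict` parameters; the transcript chunks and the summarization prompt are placeholders, so adapt them to your pipeline:

```python
# Pre-warm the server's KV cache while the transcript is still being recorded,
# so the final summarization request only has to evaluate the newest tokens.
import requests

SERVER_URL = "http://localhost:8080/completion"  # assumed llama-server address


def prewarm(prompt_so_far: str) -> None:
    """Send the current transcript so the server evaluates and caches it.

    We request a single token and simply discard it; with `cache_prompt`
    enabled, the next request sharing this prefix only processes the tokens
    appended since the previous call.
    """
    resp = requests.post(
        SERVER_URL,
        json={
            "prompt": prompt_so_far,
            "n_predict": 1,        # one throwaway token
            "cache_prompt": True,  # keep the evaluated prefix in the KV cache
        },
        timeout=600,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    # Stand-in for transcript chunks arriving from the transcription step.
    chunks = [
        "Alice: Let's review the roadmap. ",
        "Bob: The Q3 milestones slipped by two weeks. ",
        "Alice: Then we should re-plan the release. ",
    ]
    transcript = ""
    for chunk in chunks:
        transcript += chunk
        prewarm(transcript)  # cache grows incrementally as the meeting goes on

    # Final request: most of the prompt is already cached, so TTFT is short.
    final = requests.post(
        SERVER_URL,
        json={
            "prompt": transcript + "\n\nSummarize the conversation above.",
            "n_predict": 256,
            "cache_prompt": True,
        },
        timeout=600,
    )
    print(final.json().get("content", ""))
```

Note that for the cache to be reused across calls the requests have to hit the same server slot, so with multiple slots it may help to pin them (e.g. via a slot id parameter) or run with a single slot.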