Improving server inference/TTFT via prompt/input streaming #11348
-
Hello llama.cpp community, for my use case I want to summarize a transcript of a longer conversation (60-90 min) after it is finished. This requires a model with a longer context length, since the transcript corresponds to 10-20k tokens (I am currently using Qwen2.5, either the 14B or the 32B version, on a 36GB M3 MacBook Pro). The issue is that prompt evaluation takes 2-3 minutes before the model can actually start streaming its response (TTFT), which is not great for the user.

I was wondering if there is a way to perform the prompt evaluation in a streaming manner while the transcript is being recorded. I would assume something similar happens in OpenAI's advanced voice mode, but I didn't find any information on it. Since I have barely any experience with C++ and the llama.cpp project, I wanted to ask if any of you know something about this before trying to hack my own solution. Do you think this is possible with the current code version, and if yes, do you have any pointers and how complex do you think it is? Overall it seems like a general improvement for user-facing apps to start processing while the user is still creating the prompt, in order to reduce the time to first token. Thanks a lot for any help!
-
From a technical standpoint this is definitely possible. The TTFT is due to the need to populate the KV cache before a new token can be generated. With the current code you should be able to just periodically send the current transcript to the server. If prompt caching is enabled, the KV cache values are kept and only the newly appended tokens need to be processed. I don't know whether the server allows you to request the generation of zero new tokens, but in terms of performance requesting one new token (which can then simply be discarded) is virtually the same.
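
For illustration, here is a minimal sketch of that approach against the llama.cpp HTTP server. It assumes a `llama-server` instance reachable at `http://localhost:8080` and its `/completion` endpoint with the `cache_prompt` and `n_predict` parameters; the transcript chunks and the summarization prompt are placeholders, so adapt them to your pipeline:

```python
# Pre-warm the server's KV cache while the transcript is still being recorded,
# so the final summarization request only has to evaluate the newest tokens.
import requests

SERVER_URL = "http://localhost:8080/completion"  # assumed llama-server address


def prewarm(prompt_so_far: str) -> None:
    """Send the current transcript so the server evaluates and caches it.

    We request a single token and simply discard it; with `cache_prompt`
    enabled, the next request sharing this prefix only processes the tokens
    appended since the previous call.
    """
    resp = requests.post(
        SERVER_URL,
        json={
            "prompt": prompt_so_far,
            "n_predict": 1,        # one throwaway token
            "cache_prompt": True,  # keep the evaluated prefix in the KV cache
        },
        timeout=600,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    # Stand-in for transcript chunks arriving from the transcription step.
    chunks = [
        "Alice: Let's review the roadmap. ",
        "Bob: The Q3 milestones slipped by two weeks. ",
        "Alice: Then we should re-plan the release. ",
    ]
    transcript = ""
    for chunk in chunks:
        transcript += chunk
        prewarm(transcript)  # cache grows incrementally as the meeting goes on

    # Final request: most of the prompt is already cached, so TTFT is short.
    final = requests.post(
        SERVER_URL,
        json={
            "prompt": transcript + "\n\nSummarize the conversation above.",
            "n_predict": 256,
            "cache_prompt": True,
        },
        timeout=600,
    )
    print(final.json().get("content", ""))
```

Note that for the cache to be reused across calls the requests have to hit the same server slot, so with multiple slots it may help to pin them (e.g. via a slot id parameter) or run with a single slot.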