Improving server inference/TTFT via prompt/input streaming #11348

Answered by JohannesGaessler
Akkretion asked this question in Q&A

From a technical standpoint this is definitely possible. The TTFT comes from the need to populate the KV cache before the first new token can be generated. With the current code you should be able to simply send the current transcript to the server periodically. If prompt caching is enabled, the KV cache entries for the shared prefix are kept, and only the newly appended tokens need to be processed. I don't know whether the server allows requesting zero new tokens, but in terms of performance, requesting a single new token (which can simply be discarded) is virtually the same.
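
A minimal client-side sketch of this pattern, assuming a llama.cpp server listening on `http://localhost:8080`; `cache_prompt` and `n_predict` are the server's completion parameters, while the URL, the example chunks, and the pacing are placeholder assumptions:

```python
import time
import requests

SERVER_URL = "http://localhost:8080/completion"  # assumed local llama.cpp server

def warm_cache(transcript: str) -> None:
    """Send the current transcript so the server extends its KV cache.

    cache_prompt=True asks the server to reuse the cached prefix, so only
    the newly appended tokens are evaluated; n_predict=1 generates a single
    throwaway token, since requesting zero may not be supported.
    """
    resp = requests.post(
        SERVER_URL,
        json={
            "prompt": transcript,
            "cache_prompt": True,  # keep/reuse KV cache across requests
            "n_predict": 1,        # one token, discarded below
        },
        timeout=60,
    )
    resp.raise_for_status()
    _ = resp.json()  # discard the generated token

# Periodically push the growing transcript, e.g. from live speech-to-text.
transcript = ""
for chunk in ["Hello, ", "how are ", "you today?"]:  # stand-in for real input
    transcript += chunk
    warm_cache(transcript)
    time.sleep(1.0)  # pacing is illustrative; tune to your input rate
```

Because each request shares a prefix with the previous one, only the appended tokens are processed, so the final real completion request starts from an almost fully populated KV cache.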

Replies: 1 comment · 3 replies (@ggerganov, @Akkretion, @JohannesGaessler)

Answer selected by Akkretion