Conversation

@remi-or (Collaborator) commented on Nov 7, 2025

This PR adds a prefix sharing mechanism to the continuous batching API, similar to the one in vLLM.
It only activates when the model is a full-attention model, matching vLLM's behaviour.

The mechanism has three main components (an illustrative sketch follows the list):

  • block hashing: once a block in the cache is filled up, it is given a hash that depends on all the tokens in the sequence up to and including those in the block
  • prefix detection: when starting prefill for a request, we first look for a prefix whose KV cache has already been computed; if such a prefix is found, we skip the KV computation for it and reference the completed blocks instead, saving compute
  • block de-reference: when a block is given a hash, we check that no other block already carries the same hash, so each piece of shared content is stored in a single block; this keeps the cache size under control

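To make these three steps concrete, here is a minimal, self-contained Python sketch of prefix sharing over a block-based KV cache. All names (`Block`, `PrefixCache`, `BLOCK_SIZE`, `match_prefix`, `register_full_block`) are illustrative only and do not correspond to the classes or methods introduced in this PR; the hashing scheme (a SHA-256 chain over block tokens) is likewise an assumption, not the PR's actual implementation.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)


def block_hash(parent_hash: bytes, block_tokens: list[int]) -> bytes:
    """Chained hash of a full block: it depends on the parent block's hash,
    and therefore on every token in the sequence up to and including this block."""
    h = hashlib.sha256(parent_hash)
    h.update(b"".join(t.to_bytes(4, "little") for t in block_tokens))
    return h.digest()


@dataclass
class Block:
    block_id: int
    ref_count: int = 0  # number of sequences currently referencing this block


@dataclass
class PrefixCache:
    # chained block hash -> physical block holding the corresponding KV states
    hash_to_block: dict[bytes, Block] = field(default_factory=dict)

    def match_prefix(self, tokens: list[int]) -> tuple[list[Block], int]:
        """Prefix detection: walk full blocks from the start of the prompt and
        reuse every block whose chained hash is already in the cache.
        Returns the reused blocks and how many prompt tokens can skip prefill."""
        reused: list[Block] = []
        parent, matched = b"", 0
        full_len = len(tokens) - len(tokens) % BLOCK_SIZE
        for start in range(0, full_len, BLOCK_SIZE):
            parent = block_hash(parent, tokens[start:start + BLOCK_SIZE])
            block = self.hash_to_block.get(parent)
            if block is None:
                break
            block.ref_count += 1
            reused.append(block)
            matched += BLOCK_SIZE
        return reused, matched

    def register_full_block(
        self, parent_hash: bytes, block_tokens: list[int], block: Block
    ) -> tuple[bytes, Block]:
        """Block hashing + de-reference: once a block is filled, compute its hash.
        If another block already holds the same content, drop the reference to the
        new block and point the sequence at the existing one instead."""
        h = block_hash(parent_hash, block_tokens)
        existing = self.hash_to_block.get(h)
        if existing is not None and existing is not block:
            existing.ref_count += 1
            block.ref_count -= 1  # caller can free `block` once unreferenced
            return h, existing
        self.hash_to_block[h] = block
        return h, block


# Usage: a second request sharing a 32-token prefix skips KV computation for it.
cache = PrefixCache()
prompt_a = list(range(40))
parent = b""
for i, start in enumerate(range(0, 32, BLOCK_SIZE)):  # pretend prefill filled two blocks
    blk = Block(block_id=i, ref_count=1)
    parent, _ = cache.register_full_block(parent, prompt_a[start:start + BLOCK_SIZE], blk)

prompt_b = list(range(32)) + [99, 100, 101]
reused_blocks, matched = cache.match_prefix(prompt_b)
print(f"reused {len(reused_blocks)} blocks; {matched} prompt tokens skip prefill")
```

In the real continuous batching cache, blocks also carry the actual KV tensors along with allocation and eviction logic, all of which is omitted in this sketch.
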
What is missing from this PR:

  • more documentation
  • another pass over the code
  • edge case: if the prefix is the entire initial request, we still need to run a forward pass on the last token of the request
  • addressing the remaining TODOs

The PR will remain a draft until these are resolved, but early comments are welcome.

@remi-or requested a review from McPatate on November 7, 2025, 15:57
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
