
Conversation

@zucchini-nlp
Member

What does this PR do?

Fixes #41863 and fixes #40910

We have always had an imperfect way to infer whether we're in the prefill or decoding stage, which has caused us many bugs in the past. The most reliable way is to check the cache position values, but that is not compile-compatible and also has an edge case.

Recently Manuel merged a PR that split prefill into its own function, so now we can benefit from it and know with 100% certainty which stage we're in. This PR adds an is_prefill flag to generation input preparation and replaces the existing logic with the flag.
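
Roughly, the change looks like this (a minimal sketch, not the PR's actual diff; the `prepare_inputs_for_generation` signature and surrounding logic here are illustrative assumptions):

```python
import torch

# Hypothetical input-preparation helper; names and shapes are illustrative.
def prepare_inputs_for_generation(input_ids, cache_position=None, is_prefill=False):
    # Old heuristic: infer the stage from cache positions.
    # `cache_position[0] == 0` is a data-dependent check that breaks under
    # torch.compile and misfires in edge cases (e.g. generating from a cached prefix).
    # is_prefill = cache_position is not None and cache_position[0] == 0

    if is_prefill:
        # Prefill: feed the whole prompt.
        return {"input_ids": input_ids, "cache_position": cache_position}
    # Decoding: only the newest token is needed; earlier ones live in the KV cache.
    return {"input_ids": input_ids[:, -1:], "cache_position": cache_position}

prompt = torch.tensor([[101, 2009, 2003]])
print(prepare_inputs_for_generation(prompt, torch.arange(3), is_prefill=True))
print(prepare_inputs_for_generation(prompt, torch.tensor([3]), is_prefill=False))
```

With an explicit flag set by the caller (the generation loop knows when it is prefilling), the stage check is no longer data-dependent, so it stays compile-friendly.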

It also adds a test case for the linked issues above.

@github-actions
Contributor

github-actions bot commented Nov 7, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, aya_vision, chameleon, clvp, cohere2_vision, deepseek_vl, deepseek_vl_hybrid, emu3, florence2, fuyu, gemma3, gemma3n, glm4v, glm4v_moe, got_ocr2, granite_speech

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp
Member Author

Another can of worms: assisted decoding has no prefill separated out and is causing issues now 😢
