-
Notifications
You must be signed in to change notification settings - Fork 199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LLM][NPU] Ported sampler from Stateless to Stateful pipeline #1507
base: master
Are you sure you want to change the base?
[LLM][NPU] Ported sampler from Stateless to Stateful pipeline #1507
Conversation
AsyaPronina
commented
Jan 8, 2025
- Ported sampler functionality from Stateless to Stateful pipeline
if (streamer_ptr && streamer_ptr->put(last_token)) { | ||
return results; | ||
} | ||
// Swap max_new_token to get_max_new_token() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment is valid? here we don't play with max_new_tokens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not valid, sorry
// Swap max_new_token to get_max_new_token() | ||
auto sequence_group = std::make_shared<SequenceGroup>( | ||
0 /* request_id */, input_ids, config, 1 /* block_size */); | ||
sequence_group->update_processed_tokens_num(input_ids.get_size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sequence_group->update_processed_tokens_num(input_ids.get_size()); | |
sequence_group->update_processed_tokens_num(sequence_group->get_prompt_len() - output_sequence_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it work w/o SLICE_OUT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose yes.
w/o Slice output len is the same as prompt len and, hence, number of processed tokens is 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, however, for first input_ids we have length == length of prompt, for example, 18
. But output_sequence_len
is equal to 1024
for first logits
as it is output from prefill model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, could you also enable testing of StatefulLLMPipeline
in test_llm_pipeline_static.py
?
// Swap max_new_token to get_max_new_token() | ||
auto sequence_group = std::make_shared<SequenceGroup>( | ||
0 /* request_id */, input_ids, config, 1 /* block_size */); | ||
sequence_group->update_processed_tokens_num(input_ids.get_size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it work w/o SLICE_OUT?