Add chat model sample for gemma models #343

Closed
olpipi wants to merge 3 commits from the chat_model branch

Conversation

Collaborator

@olpipi olpipi commented Apr 3, 2024

CVS-133845

@olpipi olpipi force-pushed the chat_model branch 2 times, most recently from 2e65321 to 51225d7 on April 3, 2024 at 16:04
find_package(OpenVINO REQUIRED COMPONENTS Runtime)
target_link_libraries(chat_model_lm PRIVATE openvino::runtime)
set_target_properties(chat_model_lm PROPERTIES CXX_STANDARD 17)
set_target_properties(chat_model_lm PROPERTIES CXX_STANDARD_REQUIRED ON)
Contributor

Could you please also add a GHA CI pipeline for this sample?

Contributor

Maybe we can also move this sample to a dedicated llm-chatbot folder with its own README.md?

Collaborator Author

Can we add it in the next PR?

Contributor

CI should be added in the current PR.

text_generation/causal_lm/cpp/chat_model_lm.cpp: 7 resolved (outdated) review threads
ov::InferRequest tokenizer =
core.compile_model(std::string{argv[1]} + "/openvino_tokenizer.xml", "CPU").create_infer_request();
ov::InferRequest detokenizer =
core.compile_model(std::string{argv[1]} + "/openvino_detokenizer.xml", "CPU").create_infer_request();
// The model can be compiled for GPU as well
Contributor

Have you tried it on GPU?

Collaborator Author

Yes, I tried it on the integrated GPU. It works, but even slower than on CPU.
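
As a side note, here is a minimal sketch of a device-agnostic variant; the helper name and the idea of passing the device as a parameter are assumptions for illustration, not part of this PR:

// Hypothetical helper: compile the detokenizer for a caller-chosen device.
// "CPU" is the sample's default; "GPU" targets the integrated GPU, which was
// reported above to be slower than CPU for this model.
#include <openvino/openvino.hpp>
#include <string>

ov::InferRequest compile_detokenizer(ov::Core& core,
                                     const std::string& model_dir,
                                     const std::string& device = "CPU") {
    return core.compile_model(model_dir + "/openvino_detokenizer.xml", device)
        .create_infer_request();
}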


@ilya-lavrenov ilya-lavrenov self-assigned this Apr 5, 2024
@olpipi olpipi requested a review from ilya-lavrenov April 10, 2024 12:02
std::cout << "Answer: " << detokenize(detokenizer, answer.tokens) << "\n_______\n";

auto answer_str = detokenize(detokenizer, answer.tokens);
answer_str = answer_str.substr(0, answer_str.find("<eos>"));
Contributor

Why do we need this hack?
CC @apaniukov

Collaborator Author

@olpipi olpipi Apr 12, 2024

Due to beam search: some beams may be much shorter than others, and they look like this:
"The answer is 4. 2 + 2 = 4.<eos><eos><eos><eos><eos><eos><eos><eos>..."
I added this just to make the output more readable.

Contributor

@apaniukov apaniukov Apr 12, 2024

No, adding --skip-special-tokens to the convert_tokenizer command should remove these tokens during detokenization.

Why don't we end beam generation when the <eos> token is produced anyway? It should speed up generation of the other beams in that case, and it might also affect the beam scores.
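
For illustration, a minimal sketch of trimming a beam's tokens at the first EOS id before detokenization; the helper name is hypothetical, and the EOS id would come from the tokenizer configuration rather than being hard-coded:

// Hypothetical sketch: cut a beam's token sequence at the first EOS token id
// so the detokenized text never contains trailing "<eos>" repetitions.
#include <cstdint>
#include <vector>

std::vector<int64_t> trim_at_eos(std::vector<int64_t> tokens, int64_t eos_token_id) {
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] == eos_token_id) {
            tokens.resize(i);  // drop the EOS token and everything after it
            break;
        }
    }
    return tokens;
}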

Contributor

@ilya-lavrenov ilya-lavrenov Apr 12, 2024

Why don't we end beam generation when the <eos> token is produced anyway?

It should work this way; I wonder why EOS does not work...
CC @as-suvorov

Contributor

It should indeed; I'll check this sample.

Contributor

I cannot reproduce it locally; I'm getting results without <eos> tokens from the detokenizer. I'll sync with @olpipi on that.

}

Beam answer;
float highest_score = std::numeric_limits<float>().min();
Contributor

@as-suvorov as-suvorov Apr 16, 2024

Beam search uses log_softmax scoring, which gives negative floats. The sample started working for me only after:

Suggested change
float highest_score = std::numeric_limits<float>().min();
float highest_score = -std::numeric_limits<float>().infinity();

Tested on google/gemma-2b-it
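
For context, a short sketch of the numeric point behind this suggestion:

// std::numeric_limits<float>::min() is the smallest *positive* normal float
// (about 1.18e-38), so every log_softmax score (which is <= 0) compares below
// it and the best beam is never selected. lowest() or -infinity() are safe
// initial values for a running maximum over negative scores.
#include <limits>

static_assert(std::numeric_limits<float>::min() > 0.0f,
              "min() is positive, not the most negative float");
constexpr float safe_initial_score = -std::numeric_limits<float>::infinity();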

lm.set_tensor("beam_idx", ov::Tensor{ov::element::i32, {batch_size}, next_beams.data()});
// Set auxiliary inputs
ov::Tensor attention_mask = lm.get_tensor("attention_mask");
ov::Shape mask_shape{batch_size, attention_mask.get_shape().at(seq_len_dim_idx) + 1};
Contributor

It seems the mask size should not be increased at the last step, since on the next inference the ids of the whole new prompt will be passed instead of a single token.
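
A minimal sketch of the mask-growing step this comment refers to, with the guard for the last step made explicit; the function name, the seq_len_dim_idx parameter, and the all-ones fill are assumptions about how the sample handles the mask, not a quote of it:

// Hypothetical sketch: grow the attention mask by one along the sequence
// dimension after each generated token, but skip the growth on the last step,
// since the next inference starts from a whole new prompt rather than a
// single appended token.
#include <openvino/openvino.hpp>
#include <algorithm>
#include <cstdint>

void extend_attention_mask(ov::InferRequest& lm, size_t seq_len_dim_idx, bool is_last_step) {
    if (is_last_step) {
        return;
    }
    ov::Shape new_shape = lm.get_tensor("attention_mask").get_shape();
    new_shape.at(seq_len_dim_idx) += 1;
    ov::Tensor new_mask{ov::element::i64, new_shape};
    std::fill_n(new_mask.data<int64_t>(), new_mask.get_size(), 1);  // all positions attended
    lm.set_tensor("attention_mask", new_mask);
}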

@@ -72,6 +72,7 @@ cmake -S .\ -B .\build\ && cmake --build .\build\ --config Release -j
### Download and convert the model and tokenizers

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.
For Gemma models, `transformers` version >= 4.38 is required, so use 4.38.1.
Collaborator

Please extend the list of supported models with the corresponding Gemma models. Modify the `python -m pip install --upgrade-strategy eager` commands for Windows and Linux below instead of describing the required version.

@olpipi olpipi closed this Jun 7, 2024