Add chat model sample for gemma models #343

Closed
olpipi wants to merge 3 commits from the chat_model branch

Conversation

Collaborator

@olpipi olpipi commented Apr 3, 2024

CVS-133845

@olpipi olpipi force-pushed the chat_model branch 2 times, most recently from 2e65321 to 51225d7 on April 3, 2024 at 16:04
find_package(OpenVINO REQUIRED COMPONENTS Runtime)
target_link_libraries(chat_model_lm PRIVATE openvino::runtime)
set_target_properties(chat_model_lm PROPERTIES CXX_STANDARD 17)
set_target_properties(chat_model_lm PROPERTIES CXX_STANDARD_REQUIRED ON)
Contributor

Could you please also add a GHA CI pipeline for this sample?

Contributor

Maybe we can also move this sample to a dedicated llm-chatbot folder with its own README.md?

Collaborator Author

Can we add it in the next PR?

Contributor

CI should be added in the current PR.

text_generation/causal_lm/cpp/chat_model_lm.cpp: 7 resolved (outdated) review threads
ov::InferRequest tokenizer =
core.compile_model(std::string{argv[1]} + "/openvino_tokenizer.xml", "CPU").create_infer_request();
ov::InferRequest detokenizer =
core.compile_model(std::string{argv[1]} + "/openvino_detokenizer.xml", "CPU").create_infer_request();
// The model can be compiled for GPU as well
Contributor

Have you tried it on GPU?

Collaborator Author

Yes, I tried it on the integrated GPU. It works, but even slower than on CPU.
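
As a side note, here is a minimal sketch of a device-agnostic variant; the helper name and the idea of passing the device as a parameter are assumptions for illustration, not part of this PR:

// Hypothetical helper: compile the detokenizer for a caller-chosen device.
// "CPU" is the sample's default; "GPU" targets the integrated GPU, which was
// reported above to be slower than CPU for this model.
#include <openvino/openvino.hpp>
#include <string>

ov::InferRequest compile_detokenizer(ov::Core& core,
                                     const std::string& model_dir,
                                     const std::string& device = "CPU") {
    return core.compile_model(model_dir + "/openvino_detokenizer.xml", device)
        .create_infer_request();
}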


@ilya-lavrenov ilya-lavrenov self-assigned this Apr 5, 2024
@olpipi olpipi requested a review from ilya-lavrenov April 10, 2024 12:02
std::cout << "Answer: " << detokenize(detokenizer, answer.tokens) << "\n_______\n";

auto answer_str = detokenize(detokenizer, answer.tokens);
answer_str = answer_str.substr(0, answer_str.find("<eos>"));
Contributor

Why do we need this hack?
CC @apaniukov

Collaborator Author

@olpipi olpipi Apr 12, 2024

Due to beam search: some beams may be much shorter than others, and they look like this:
"The answer is 4. 2 + 2 = 4.<eos><eos><eos><eos><eos><eos><eos><eos>..."
I added this just to make the output more readable.

Contributor

@apaniukov apaniukov Apr 12, 2024

No, adding --skip-special-tokens to the convert_tokenizer command should remove these tokens during detokenization.

Why don't we end beam generation when the <eos> token is produced anyway? It should speed up generation of the other beams in that case, and it might also affect the beam scores.
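
For illustration, a minimal sketch of trimming a beam's tokens at the first EOS id before detokenization; the helper name is hypothetical, and the EOS id would come from the tokenizer configuration rather than being hard-coded:

// Hypothetical sketch: cut a beam's token sequence at the first EOS token id
// so the detokenized text never contains trailing "<eos>" repetitions.
#include <cstdint>
#include <vector>

std::vector<int64_t> trim_at_eos(std::vector<int64_t> tokens, int64_t eos_token_id) {
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] == eos_token_id) {
            tokens.resize(i);  // drop the EOS token and everything after it
            break;
        }
    }
    return tokens;
}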

Contributor

@ilya-lavrenov ilya-lavrenov Apr 12, 2024

Why don't we end beam generation when the <eos> token is produced anyway?

It should work this way; I wonder why EOS does not work...
CC @as-suvorov

Contributor

It should indeed; I'll check this sample.

Contributor

I cannot reproduce it locally; I'm getting results without <eos> tokens from the detokenizer. I'll sync with @olpipi on that.

}

Beam answer;
float highest_score = std::numeric_limits<float>().min();
Contributor

@as-suvorov as-suvorov Apr 16, 2024

Beam search uses log_softmax scoring, which gives negative floats. The sample started working for me only after:

Suggested change
float highest_score = std::numeric_limits<float>().min();
float highest_score = -std::numeric_limits<float>().infinity();

Tested on google/gemma-2b-it
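
For context, a short sketch of the numeric point behind this suggestion:

// std::numeric_limits<float>::min() is the smallest *positive* normal float
// (about 1.18e-38), so every log_softmax score (which is <= 0) compares below
// it and the best beam is never selected. lowest() or -infinity() are safe
// initial values for a running maximum over negative scores.
#include <limits>

static_assert(std::numeric_limits<float>::min() > 0.0f,
              "min() is positive, not the most negative float");
constexpr float safe_initial_score = -std::numeric_limits<float>::infinity();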

lm.set_tensor("beam_idx", ov::Tensor{ov::element::i32, {batch_size}, next_beams.data()});
// Set auxiliary inputs
ov::Tensor attention_mask = lm.get_tensor("attention_mask");
ov::Shape mask_shape{batch_size, attention_mask.get_shape().at(seq_len_dim_idx) + 1};
Contributor

It seems the mask size should not be increased at the last step, since on the next inference the ids of the whole new prompt will be passed instead of a single token.
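
A minimal sketch of the mask-growing step this comment refers to, with the guard for the last step made explicit; the function name, the seq_len_dim_idx parameter, and the all-ones fill are assumptions about how the sample handles the mask, not a quote of it:

// Hypothetical sketch: grow the attention mask by one along the sequence
// dimension after each generated token, but skip the growth on the last step,
// since the next inference starts from a whole new prompt rather than a
// single appended token.
#include <openvino/openvino.hpp>
#include <algorithm>
#include <cstdint>

void extend_attention_mask(ov::InferRequest& lm, size_t seq_len_dim_idx, bool is_last_step) {
    if (is_last_step) {
        return;
    }
    ov::Shape new_shape = lm.get_tensor("attention_mask").get_shape();
    new_shape.at(seq_len_dim_idx) += 1;
    ov::Tensor new_mask{ov::element::i64, new_shape};
    std::fill_n(new_mask.data<int64_t>(), new_mask.get_size(), 1);  // all positions attended
    lm.set_tensor("attention_mask", new_mask);
}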

@@ -72,6 +72,7 @@ cmake -S .\ -B .\build\ && cmake --build .\build\ --config Release -j
### Download and convert the model and tokenizers

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.
For Gemma models, `transformers` version >= 4.38 is required, so use 4.38.1.
Collaborator

Please extend the list of supported models with the corresponding Gemma models. Modify the `python -m pip install --upgrade-strategy eager` commands for Windows and Linux below instead of describing the required version.

@olpipi olpipi closed this Jun 7, 2024