python/llm/example/NPU/HF-Transformers-AutoModels/LLM/CPP_Examples/README.md (16 additions & 17 deletions)
@@ -81,9 +81,9 @@ Arguments info:
- `--max-prompt-len MAX_PROMPT_LEN`: argument defining the maximum number of tokens that the input prompt can contain. The default value is `512`.
- `--low-bit LOW_BIT`: argument defining the low-bit optimizations that will be applied to the model. Currently available options are `"sym_int4"`, `"asym_int4"` and `"sym_int8"`, with `"sym_int4"` as the default (see the usage sketch below).
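For illustration, here is a hypothetical way these two arguments could be passed on the command line; the script name `convert.py` and the model path are placeholders assumed for this sketch, not names confirmed by the diff:

```cmd
:: hypothetical invocation -- script name and model path are placeholders
python convert.py --repo-id-or-model-path meta-llama/Llama-2-7b-chat-hf ^
    --max-prompt-len 512 --low-bit sym_int4
```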
-## 3. Build C++ Example `llm-npu-cli`
+## 3. Build C++ Example `llama-cli-npu` (Optional)
-You can run the cmake script below in cmd to build `llm-npu-cli`; don't forget to replace the conda env dir below with your own path.
+You can run the cmake script below in cmd to build `llama-cli-npu` yourself; don't forget to replace `<CONDA_ENV_DIR>` below with your own path.
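The script itself is not shown in this hunk; as a minimal sketch, assuming a standard out-of-source CMake build and that `<CONDA_ENV_DIR>` points at the conda environment's `Lib\site-packages` directory, it would look roughly like:

```cmd
:: minimal sketch of the build steps -- paths and flags are assumptions,
:: not copied from the repository's actual script
set CONDA_ENV_DIR=C:\Users\<USER>\miniforge3\envs\llm\Lib\site-packages
mkdir build
cd build
cmake ..
cmake --build . --config Release
```

The "(Optional)" added to the heading suggests a prebuilt `llama-cli-npu` binary is also available, so this step is only needed when building from source.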
AI stands for Artificial Intelligence, which is the field of study focused on creating and developing intelligent machines that can perform tasks that typically require human intelligence, such as visual and auditory recognition, speech recognition, and decision-making. AI is a broad and diverse field that includes a wide range
-Decode 63 tokens cost xxxx ms (avg xxxx ms each token).
-Output:
-AI stands for Artificial Intelligence, which is the field of study focused on creating and developing intelligent machines that can perform tasks that typically require human intelligence, such as visual and auditory recognition, speech recognition, and decision-making. AI is a broad and diverse field that includes a wide range
+llm_perf_print:        load time = xxxx.xx ms
+llm_perf_print: prompt eval time = xxxx.xx ms / 26 tokens ( xx.xx ms per token, xx.xx tokens per second)
+llm_perf_print:        eval time = xxxx.xx ms / 63 runs ( xx.xx ms per token, xx.xx tokens per second)
+llm_perf_print:       total time = xxxx.xx ms / 89 tokens
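The new `llm_perf_print` summary is self-consistent: the 26 prompt tokens plus the 63 generated tokens account for the 89 tokens on the `total time` line, and each tokens-per-second figure is the corresponding token count divided by the elapsed time.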