
Commit 02621d0

backported miscellaneous fixes and features from continuous_batching branch
1 parent 5905d4c commit 02621d0

File tree

3 files changed: +14 −12 lines


README.md (+7 −8)

@@ -1,16 +1,15 @@
-# Intel-NPU-LLM
-
-A simple Python script for running LLMs on Intel’s Neural Processing Units (NPUs). 🔧🌌
 ## Features 🌟
 
-- **Currently Supported Models**:
+- **Default models list**:
   - meta-llama/Meta-Llama-3.1-8B-Instruct
   - microsoft/Phi-3-mini-4k-instruct
   - Qwen/Qwen2-7B
   - mistralai/Mistral-7B-Instruct-v0.2
   - openbmb/MiniCPM-1B-sft-bf16
   - TinyLlama/TinyLlama-1.1B-Chat-v1.0
+- **User can input any model they like**
+  - No guarantee that every model will compile for the NPU, though
+  - [here is a list of models likely to run on NPU](https://docs.openvino.ai/2024/about-openvino/performance-benchmarks/generative-ai-performance.html)
 - **One-Time Setup**: The script downloads the model, quantizes it, converts it to OpenVINO IR format, compiles it for the NPU, and caches the result for future use. 💡⌛
 - **Performance**: Surprisingly fast inference speeds, even on devices with modest computational power (e.g., my Meteor Lake's 13 TOPS NPU). ⚡⏳
 - **Power Efficiency**: While inference might be faster on a CPU or GPU for some devices, the NPU is significantly more energy-efficient, making it ideal for laptops. 🔋🌐
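The **One-Time Setup** bullet kept in this hunk summarizes the whole preparation flow: download, weight quantization, conversion to OpenVINO IR, NPU compilation, and caching. As a rough sketch (not the repository's own code), the export-and-cache part can look like this with optimum-intel; the `exported_models` directory name and the chosen model id are illustrative only:

```python
# Rough sketch of the "one-time setup" described above, NOT the script's actual
# code: export a Hub model to an INT4 OpenVINO IR once, then reuse the cached copy.
# Assumes optimum-intel is installed; "exported_models" is an invented directory name.
import os
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"          # any Hub id the user enters
export_dir = os.path.join("exported_models", *model_id.split("/"))

if not os.path.isdir(export_dir):
    # First run only: download, compress weights to INT4, convert to OpenVINO IR.
    model = OVModelForCausalLM.from_pretrained(
        model_id,
        export=True,
        quantization_config=OVWeightQuantizationConfig(bits=4),
    )
    model.save_pretrained(export_dir)
```

After the IR exists on disk, later runs skip the export entirely; the NPU compilation result is cached separately via the NPUW_CACHE_DIR option shown in the intel_npu_llm.py diff below.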
@@ -30,8 +29,8 @@ As you can see, It's using NPU for text generation.
 - Python **3.9** to **3.12**
 - An Intel processor with an NPU:
   - Meteor Lake (Core Ultra Series 1, i.e., 1XX chips)
-  - Lunar Lake (Core Ultra Series 2, i.e., 2XX chips)
   - Arrow Lake (Core Ultra Series 2, i.e., 2XX chips)
+  - Lunar Lake (Core Ultra Series 2, i.e., 2XX chips)
 - Newest Intel NPU driver
   - [Windows](https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html)
   - [Linux](https://github.com/intel/linux-npu-driver/releases/)
@@ -77,7 +76,7 @@ python intel_npu_llm.py
 ## Notes ℹ️
 
 - **Resource-Intensive Compilation**: The quantization and compilation steps can be time-consuming, taking up to tens of minutes depending on your hardware. However, these steps are performed only once per model and are cached for future use. ⌛⚙️
-- **Performance**: Despite the resource-intensive setup, inference is optimized for NPUs and provides excellent performance, even on modest hardware. ✨⏳
+- **But wait, why does context fill up and then reset?**: [Continuous batching](https://docs.openvino.ai/2024/api/genai_api/_autosummary/openvino_genai.ContinuousBatchingPipeline.html) has not yet been implemented for NPU's by Intel's OpenVINO engineers. You can check API [coverage % here](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html). 🚧🛠️
 
 ## Contributing ⭐
 
@@ -89,4 +88,4 @@ This project is licensed under the [MIT License](LICENSE). 🔒✨
 
 ---
 
-Enjoy using `intel-npu-llm`! For any questions or feedback, please reach out or open an issue on GitHub. ✨🔧
+Enjoy using `intel-npu-llm` ! For any questions or feedback, please reach out or open an issue on GitHub. ✨🔧
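The Notes entry added in the hunk above explains the chat behavior on NPU: without continuous batching, the pipeline works with a static prompt window, so once it fills up the context has to be cleared. A hedged sketch of one way a chat loop can handle that reset with openvino_genai; the word-count token estimate, the placeholder `model_dir`, and the reset policy are simplifying assumptions, not the script's exact logic:

```python
# Illustrative only: coping with the fixed NPU prompt window while continuous
# batching is unavailable. The word-count token estimate and the reset policy
# are simplified assumptions, not the script's exact logic.
import openvino_genai

MAX_PROMPT_LEN = 1024                     # should match the pipeline_config value
used_tokens = 0

pipe = openvino_genai.LLMPipeline("model_dir", "NPU")    # "model_dir" is a placeholder
pipe.start_chat()
while True:
    prompt = input("> ")
    if prompt == "exit":
        break
    # Very rough estimate; a real implementation would use the pipeline's tokenizer.
    if used_tokens + len(prompt.split()) > MAX_PROMPT_LEN:
        pipe.finish_chat()                # drop the accumulated history...
        pipe.start_chat()                 # ...and start a fresh context
        used_tokens = 0
    reply = str(pipe.generate(prompt))
    print(reply)
    used_tokens += len(prompt.split()) + len(reply.split())
pipe.finish_chat()
```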

intel_npu_llm.py (+7 −4)

@@ -182,7 +182,10 @@ def load(model_name, model_path, prompt_length):
     warnings.filterwarnings("ignore", category=DeprecationWarning)
     print(Fore.GREEN + loading_text + Fore.RESET, flush=True)
     model_load_start = time.time()
-    pipeline_config = { "NPUW_CACHE_DIR": os.path.join(script_dir, "npu_cache", model_name), "GENERATE_HINT": "BEST_PERF", "MAX_PROMPT_LEN": prompt_length }
+    pipeline_config = {
+        "NPUW_CACHE_DIR": os.path.join(script_dir, "npu_cache", model_name.split("/")[0], model_name.split("/")[1]),
+        "GENERATE_HINT": "BEST_PERF",
+        "MAX_PROMPT_LEN": prompt_length }
     pipe = openvino_genai.LLMPipeline(os.path.join(script_dir, model_path), 'NPU', pipeline_config)
     model_load_stop = time.time()
     print(Fore.GREEN + loaded_text + str(round(model_load_stop - model_load_start,1)) + " seconds. \n" + Fore.RESET, flush=True)
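For clarity, the `NPUW_CACHE_DIR` change can be looked at in isolation: a Hugging Face model id such as `meta-llama/Meta-Llama-3.1-8B-Instruct` contains a `/`, and the new code passes its two halves to `os.path.join` as separate components, producing an org/model cache tree. A standalone sketch with example paths (values invented for illustration):

```python
# The NPUW_CACHE_DIR change in isolation: the model id's "/" is split so the
# cache path is built from clean org and model components. Example values only.
import os

script_dir = "/opt/intel-npu-llm"
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

old_dir = os.path.join(script_dir, "npu_cache", model_name)
new_dir = os.path.join(script_dir, "npu_cache",
                       model_name.split("/")[0],          # "meta-llama"
                       model_name.split("/")[1])          # "Meta-Llama-3.1-8B-Instruct"

print(old_dir)  # /opt/intel-npu-llm/npu_cache/meta-llama/Meta-Llama-3.1-8B-Instruct
print(new_dir)  # same string on POSIX, but built from explicit per-directory components
```

One caveat worth noting: a model name without an org prefix would make `model_name.split("/")[1]` raise an `IndexError`, so this form assumes the usual `org/model` id.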
@@ -229,9 +232,9 @@ def main():
     config.do_sample = False
     config.top_k = 50
     config.top_p = 0.9
-    config.repetition_penalty = 1.3
-    config.no_repeat_ngram_size = 2
-    config.temperature = 0.7
+    config.repetition_penalty = 1.2
+    config.no_repeat_ngram_size = 3
+    config.temperature = 0.75
 
     print(Fore.GREEN + "Chat commands: \nexit - unload the model and exit the script \nreset - resets the chat context manually\n" + Fore.RESET, flush=True)
     generate(pipe, config)
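The second hunk retunes the decoding parameters. A minimal sketch of how these values plug into an openvino_genai generation call (the `model_dir` path and the prompt are placeholders); note that with `do_sample` left at `False` the pipeline decodes greedily, so the temperature/top-k/top-p values only take effect if sampling is enabled later:

```python
# Minimal sketch of applying the retuned decoding parameters; mirrors the
# assignments in the diff but is not the script's full main().
import openvino_genai

pipe = openvino_genai.LLMPipeline("model_dir", "NPU")    # "model_dir" is a placeholder
config = pipe.get_generation_config()

config.do_sample = False             # greedy decoding
config.top_k = 50                    # sampling knobs, only used when do_sample=True
config.top_p = 0.9
config.temperature = 0.75
config.repetition_penalty = 1.2      # lowered from 1.3: milder penalty on repeated tokens
config.no_repeat_ngram_size = 3      # raised from 2: only exact 3-gram repeats are blocked

print(pipe.generate("Explain what an NPU is in one sentence.", config))
```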

resources/screenshot_npu_usage.png (18.2 KB)
