# Intel-NPU-LLM

A simple Python script for running LLMs on Intel’s Neural Processing Units (NPUs). 🔧🌌

## Features 🌟
- **Default models list**:
  - meta-llama/Meta-Llama-3.1-8B-Instruct
  - microsoft/Phi-3-mini-4k-instruct
  - Qwen/Qwen2-7B
  - mistralai/Mistral-7B-Instruct-v0.2
  - openbmb/MiniCPM-1B-sft-bf16
  - TinyLlama/TinyLlama-1.1B-Chat-v1.0
- **Users can input any model they like**:
  - There is no guarantee that every model will compile for the NPU, though.
  - [Here is a list of models likely to run on the NPU.](https://docs.openvino.ai/2024/about-openvino/performance-benchmarks/generative-ai-performance.html)
- **One-Time Setup**: The script downloads the model, quantizes it, converts it to OpenVINO IR format, compiles it for the NPU, and caches the result for future use (a minimal sketch of this flow follows this list). 💡⌛
- **Performance**: Surprisingly fast inference speeds, even on devices with modest computational power (e.g., my Meteor Lake's 13 TOPS NPU). ⚡⏳
- **Power Efficiency**: While inference might be faster on a CPU or GPU for some devices, the NPU is significantly more energy-efficient, making it ideal for laptops. 🔋🌐
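To make that flow concrete, here is a minimal sketch of loading an already-exported model on the NPU with OpenVINO GenAI. The folder name, prompt, and generation settings are illustrative, and the project's actual script may differ:

```python
# Minimal sketch (not the project's actual code): run an OpenVINO IR export
# of an LLM on the Intel NPU with OpenVINO GenAI.
import openvino_genai as ov_genai

# Folder holding an INT4 OpenVINO IR export of a model, e.g. one produced
# beforehand with `optimum-cli export openvino` (the folder name is made up).
model_dir = "TinyLlama-1.1B-Chat-v1.0-int4-ov"

# Requesting the "NPU" device is what triggers the one-time compilation step;
# OpenVINO can cache the compiled result so later runs start much faster.
pipe = ov_genai.LLMPipeline(model_dir, "NPU")

print(pipe.generate("What is an NPU good for?", max_new_tokens=128))
```

The first run pays the compilation cost described in the notes further down; later runs reuse the cached result.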
As you can see, it's using the NPU for text generation.

- Python **3.9** to **3.12**
- An Intel processor with an NPU (the snippet after this list shows one way to confirm OpenVINO can see it):
  - Meteor Lake (Core Ultra Series 1, i.e., 1XX chips)
  - Arrow Lake (Core Ultra Series 2, i.e., 2XX chips)
  - Lunar Lake (Core Ultra Series 2, i.e., 2XX chips)
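Not sure whether your machine qualifies? A quick check like this (illustrative, not part of the project script) shows whether OpenVINO can see an NPU:

```python
# Quick check: list the devices OpenVINO can use on this machine.
import openvino as ov

devices = ov.Core().available_devices
print(devices)  # e.g. ['CPU', 'GPU', 'NPU'] on a supported Core Ultra system

if "NPU" not in devices:
    print("No NPU detected - check your NPU driver and processor generation.")
```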
- **Resource-Intensive Compilation**: The quantization and compilation steps can be time-consuming, taking up to tens of minutes depending on your hardware. However, these steps are performed only once per model and are cached for future use. ⌛⚙️
- **Performance**: Despite the resource-intensive setup, inference is optimized for the NPU and provides excellent performance, even on modest hardware. ✨⏳
- **But wait, why does the context fill up and then reset?**: [Continuous batching](https://docs.openvino.ai/2024/api/genai_api/_autosummary/openvino_genai.ContinuousBatchingPipeline.html) has not yet been implemented for NPUs by Intel's OpenVINO engineers. You can check the API [coverage percentages here](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html). A sketch of what this reset looks like follows below. 🚧🛠️
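Because the NPU pipeline is compiled for a fixed context size, a chat loop eventually has to clear its history and start over. The sketch below is an assumption about how such a reset might look, not the project's actual implementation; the model folder, token budget, and token-counting shortcut are all made up for illustration:

```python
# Illustrative only: once the conversation approaches the fixed context length
# compiled for the NPU, the chat history is dropped and a fresh chat begins.
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("TinyLlama-1.1B-Chat-v1.0-int4-ov", "NPU")

token_budget = 1024  # assumed static context length (illustrative value)
used = 0

pipe.start_chat()
while True:
    prompt = input("You: ")
    reply = str(pipe.generate(prompt, max_new_tokens=256))
    print("Model:", reply)

    used += len(prompt.split()) + len(reply.split())  # rough token estimate
    if used > token_budget:       # context is effectively "full"
        pipe.finish_chat()        # drop the accumulated history...
        pipe.start_chat()         # ...and begin a fresh conversation
        used = 0
```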
## Contributing ⭐
This project is licensed under the [MIT License](LICENSE). 🔒✨
---
Enjoy using `intel-npu-llm`! For any questions or feedback, please reach out or open an issue on GitHub. ✨🔧