
Commit 02621d0

backported miscellaneous fixes and features from continuous_batching branch
1 parent 5905d4c commit 02621d0

File tree

3 files changed: +14 −12 lines


README.md (+7 −8)

@@ -1,16 +1,15 @@
-# Intel-NPU-LLM
-
-A simple Python script for running LLMs on Intel’s Neural Processing Units (NPUs). 🔧🌌
 ## Features 🌟
 
-- **Currently Supported Models**:
+- **Default models list**:
   - meta-llama/Meta-Llama-3.1-8B-Instruct
   - microsoft/Phi-3-mini-4k-instruct
   - Qwen/Qwen2-7B
   - mistralai/Mistral-7B-Instruct-v0.2
   - openbmb/MiniCPM-1B-sft-bf16
   - TinyLlama/TinyLlama-1.1B-Chat-v1.0
+- **User can input any model they like**
+  - No guarantee that every model will compile for the NPU, though
+  - [here is a list of models likely to run on NPU](https://docs.openvino.ai/2024/about-openvino/performance-benchmarks/generative-ai-performance.html)
 - **One-Time Setup**: The script downloads the model, quantizes it, converts it to OpenVINO IR format, compiles it for the NPU, and caches the result for future use. 💡⌛
 - **Performance**: Surprisingly fast inference speeds, even on devices with modest computational power (e.g., my Meteor Lake's 13 TOPS NPU). ⚡⏳
 - **Power Efficiency**: While inference might be faster on a CPU or GPU for some devices, the NPU is significantly more energy-efficient, making it ideal for laptops. 🔋🌐
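The **One-Time Setup** bullet kept in this hunk summarizes the whole preparation flow: download, weight quantization, conversion to OpenVINO IR, NPU compilation, and caching. As a rough sketch (not the repository's own code), the export-and-cache part can look like this with optimum-intel; the `exported_models` directory name and the chosen model id are illustrative only:

```python
# Rough sketch of the "one-time setup" described above, NOT the script's actual
# code: export a Hub model to an INT4 OpenVINO IR once, then reuse the cached copy.
# Assumes optimum-intel is installed; "exported_models" is an invented directory name.
import os
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"          # any Hub id the user enters
export_dir = os.path.join("exported_models", *model_id.split("/"))

if not os.path.isdir(export_dir):
    # First run only: download, compress weights to INT4, convert to OpenVINO IR.
    model = OVModelForCausalLM.from_pretrained(
        model_id,
        export=True,
        quantization_config=OVWeightQuantizationConfig(bits=4),
    )
    model.save_pretrained(export_dir)
```

After the IR exists on disk, later runs skip the export entirely; the NPU compilation result is cached separately via the NPUW_CACHE_DIR option shown in the intel_npu_llm.py diff below.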
@@ -30,8 +29,8 @@ As you can see, It's using NPU for text generation.
 - Python **3.9** to **3.12**
 - An Intel processor with an NPU:
   - Meteor Lake (Core Ultra Series 1, i.e., 1XX chips)
-  - Lunar Lake (Core Ultra Series 2, i.e., 2XX chips)
   - Arrow Lake (Core Ultra Series 2, i.e., 2XX chips)
+  - Lunar Lake (Core Ultra Series 2, i.e., 2XX chips)
 - Newest Intel NPU driver
   - [Windows](https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html)
   - [Linux](https://github.com/intel/linux-npu-driver/releases/)
@@ -77,7 +76,7 @@ python intel_npu_llm.py
 ## Notes ℹ️
 
 - **Resource-Intensive Compilation**: The quantization and compilation steps can be time-consuming, taking up to tens of minutes depending on your hardware. However, these steps are performed only once per model and are cached for future use. ⌛⚙️
-- **Performance**: Despite the resource-intensive setup, inference is optimized for NPUs and provides excellent performance, even on modest hardware. ✨⏳
+- **But wait, why does context fill up and then reset?**: [Continuous batching](https://docs.openvino.ai/2024/api/genai_api/_autosummary/openvino_genai.ContinuousBatchingPipeline.html) has not yet been implemented for NPU's by Intel's OpenVINO engineers. You can check API [coverage % here](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html). 🚧🛠️
 
 ## Contributing ⭐
 
@@ -89,4 +88,4 @@ This project is licensed under the [MIT License](LICENSE). 🔒✨
 
 ---
 
-Enjoy using `intel-npu-llm`! For any questions or feedback, please reach out or open an issue on GitHub. ✨🔧
+Enjoy using `intel-npu-llm` ! For any questions or feedback, please reach out or open an issue on GitHub. ✨🔧
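The Notes entry added in the hunk above explains the chat behavior on NPU: without continuous batching, the pipeline works with a static prompt window, so once it fills up the context has to be cleared. A hedged sketch of one way a chat loop can handle that reset with openvino_genai; the word-count token estimate, the placeholder `model_dir`, and the reset policy are simplifying assumptions, not the script's exact logic:

```python
# Illustrative only: coping with the fixed NPU prompt window while continuous
# batching is unavailable. The word-count token estimate and the reset policy
# are simplified assumptions, not the script's exact logic.
import openvino_genai

MAX_PROMPT_LEN = 1024                     # should match the pipeline_config value
used_tokens = 0

pipe = openvino_genai.LLMPipeline("model_dir", "NPU")    # "model_dir" is a placeholder
pipe.start_chat()
while True:
    prompt = input("> ")
    if prompt == "exit":
        break
    # Very rough estimate; a real implementation would use the pipeline's tokenizer.
    if used_tokens + len(prompt.split()) > MAX_PROMPT_LEN:
        pipe.finish_chat()                # drop the accumulated history...
        pipe.start_chat()                 # ...and start a fresh context
        used_tokens = 0
    reply = str(pipe.generate(prompt))
    print(reply)
    used_tokens += len(prompt.split()) + len(reply.split())
pipe.finish_chat()
```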

intel_npu_llm.py (+7 −4)

@@ -182,7 +182,10 @@ def load(model_name, model_path, prompt_length):
     warnings.filterwarnings("ignore", category=DeprecationWarning)
     print(Fore.GREEN + loading_text + Fore.RESET, flush=True)
     model_load_start = time.time()
-    pipeline_config = { "NPUW_CACHE_DIR": os.path.join(script_dir, "npu_cache", model_name), "GENERATE_HINT": "BEST_PERF", "MAX_PROMPT_LEN": prompt_length }
+    pipeline_config = {
+        "NPUW_CACHE_DIR": os.path.join(script_dir, "npu_cache", model_name.split("/")[0], model_name.split("/")[1]),
+        "GENERATE_HINT": "BEST_PERF",
+        "MAX_PROMPT_LEN": prompt_length }
     pipe = openvino_genai.LLMPipeline(os.path.join(script_dir, model_path), 'NPU', pipeline_config)
     model_load_stop = time.time()
     print(Fore.GREEN + loaded_text + str(round(model_load_stop - model_load_start,1)) + " seconds. \n" + Fore.RESET, flush=True)
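For clarity, the `NPUW_CACHE_DIR` change can be looked at in isolation: a Hugging Face model id such as `meta-llama/Meta-Llama-3.1-8B-Instruct` contains a `/`, and the new code passes its two halves to `os.path.join` as separate components, producing an org/model cache tree. A standalone sketch with example paths (values invented for illustration):

```python
# The NPUW_CACHE_DIR change in isolation: the model id's "/" is split so the
# cache path is built from clean org and model components. Example values only.
import os

script_dir = "/opt/intel-npu-llm"
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

old_dir = os.path.join(script_dir, "npu_cache", model_name)
new_dir = os.path.join(script_dir, "npu_cache",
                       model_name.split("/")[0],          # "meta-llama"
                       model_name.split("/")[1])          # "Meta-Llama-3.1-8B-Instruct"

print(old_dir)  # /opt/intel-npu-llm/npu_cache/meta-llama/Meta-Llama-3.1-8B-Instruct
print(new_dir)  # same string on POSIX, but built from explicit per-directory components
```

One caveat worth noting: a model name without an org prefix would make `model_name.split("/")[1]` raise an `IndexError`, so this form assumes the usual `org/model` id.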
@@ -229,9 +232,9 @@ def main():
     config.do_sample = False
     config.top_k = 50
     config.top_p = 0.9
-    config.repetition_penalty = 1.3
-    config.no_repeat_ngram_size = 2
-    config.temperature = 0.7
+    config.repetition_penalty = 1.2
+    config.no_repeat_ngram_size = 3
+    config.temperature = 0.75
 
     print(Fore.GREEN + "Chat commands: \nexit - unload the model and exit the script \nreset - resets the chat context manually\n" + Fore.RESET, flush=True)
     generate(pipe, config)
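The second hunk retunes the decoding parameters. A minimal sketch of how these values plug into an openvino_genai generation call (the `model_dir` path and the prompt are placeholders); note that with `do_sample` left at `False` the pipeline decodes greedily, so the temperature/top-k/top-p values only take effect if sampling is enabled later:

```python
# Minimal sketch of applying the retuned decoding parameters; mirrors the
# assignments in the diff but is not the script's full main().
import openvino_genai

pipe = openvino_genai.LLMPipeline("model_dir", "NPU")    # "model_dir" is a placeholder
config = pipe.get_generation_config()

config.do_sample = False             # greedy decoding
config.top_k = 50                    # sampling knobs, only used when do_sample=True
config.top_p = 0.9
config.temperature = 0.75
config.repetition_penalty = 1.2      # lowered from 1.3: milder penalty on repeated tokens
config.no_repeat_ngram_size = 3      # raised from 2: only exact 3-gram repeats are blocked

print(pipe.generate("Explain what an NPU is in one sentence.", config))
```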

resources/screenshot_npu_usage.png (18.2 KB)
