[Bug]: Run LLMs with OpenVINO GenAI Flavor on NPU #1216

Open
taikai-zz opened this issue Nov 15, 2024 · 12 comments
Labels: bug (Something isn't working), category: LLM (LLM pipeline (stateful, static)), category: NPU, PSE, support_request (Support team)

Comments

@taikai-zz

taikai-zz commented Nov 15, 2024

OpenVINO Version

Name: openvino
Version: 2024.4.0
Summary: OpenVINO(TM) Runtime
Home-page: https://docs.openvino.ai/2023.0/index.html
Author: Intel(R) Corporation
Author-email: openvino@intel.com
License: OSI Approved :: Apache Software License
Location: /root/openvino_env/lib/python3.12/site-packages
Requires: numpy, openvino-telemetry, packaging
Required-by: openvino-tokenizers

Operating System

Ubuntu 24.04 LTS  Linux ubuntu 6.8.0-48-generic

Device used for inference

NPU

Framework

None

Model used

TinyLlama

Issue description

Refer to official documentation:
https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html

This is my hardware information
[Screenshot: 1731632595149]

import openvino_genai as ov_genai
help(ov_genai.LLMPipeline)
[Screenshot: 1731632229545]
As shown in the figure above, the device does not appear to support NPU. I followed the instructions in the document and changed the device to NPU, but the result was empty. Changing it to CPU or GPU restored normal operation. Where did I go wrong?
[Screenshots: 1731632595149, 1731633075481]

Another question: how can I check NPU utilization in an Ubuntu environment? Is there a tool similar to nvidia-smi?
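One way to check whether the OpenVINO runtime sees the NPU at all is to list the devices it reports. The following is a minimal sketch, not an official diagnostic: `pick_device` and `report_devices` are hypothetical helper names, and the only OpenVINO call used is `Core().available_devices`.

```python
def pick_device(available, preferred="NPU", fallback="CPU"):
    """Return `preferred` if the runtime lists it, otherwise `fallback`."""
    # Device names can carry a suffix such as "GPU.0", so compare the prefix.
    if any(name.split(".")[0] == preferred for name in available):
        return preferred
    return fallback


def report_devices():
    """Print the devices OpenVINO reports and the one this sketch would pick."""
    import openvino as ov  # lazy import: requires the openvino package

    available = ov.Core().available_devices
    print("Available devices:", available)
    print("Selected device:", pick_device(available))
```

Calling `report_devices()` on the target machine shows whether "NPU" is in the list. If it is missing, the NPU driver is the likely place to look before changing anything in the pipeline code.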

@taikai-zz taikai-zz added bug Something isn't working support_request Support team labels Nov 15, 2024
@ilya-lavrenov ilya-lavrenov transferred this issue from openvinotoolkit/openvino Nov 15, 2024
@ilya-lavrenov ilya-lavrenov assigned l-bat and TolyaTalamanov and unassigned l-bat Nov 15, 2024
@ilya-lavrenov ilya-lavrenov added category: LLM LLM pipeline (stateful, static) category: NPU labels Nov 15, 2024
@Wan-Intel

Did you encounter the issue when using the latest version of OpenVINO™ GenAI?

Please try the latest OpenVINO™ GenAI and run on NPU by following the steps in "Run LLMs with OpenVINO GenAI Flavor on NPU".

@helena-intel
Collaborator

In addition to using the latest OpenVINO GenAI, if you haven't exported the model with --sym, please try exporting the model with that option. This should work:

optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0

Also note that for NPU, you should add do_sample=False to the pipe.generate() call. See the documentation for more limitations/recommendations.
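The suggestion above could look roughly like this in Python. This is a sketch under assumptions: the helper names and the `max_new_tokens` default are illustrative, and the model directory is expected to be the output folder of the export command above.

```python
def npu_generate_kwargs(max_new_tokens=100):
    """Generation arguments for NPU with sampling disabled, as advised above."""
    return {"max_new_tokens": max_new_tokens, "do_sample": False}


def run_on_npu(model_dir, prompt):
    """Load an exported model from `model_dir` on NPU and generate a reply."""
    import openvino_genai as ov_genai  # lazy import: requires openvino-genai

    pipe = ov_genai.LLMPipeline(model_dir, "NPU")
    return pipe.generate(prompt, **npu_generate_kwargs())
```

For example, `print(run_on_npu("TinyLlama-1.1B-Chat-v1.0", "What is OpenVINO?"))` would use the directory produced by the export command; the prompt is only a placeholder.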

@taikai-zz
Author

taikai-zz commented Dec 4, 2024

optimum-cli export openvino -m Qwen/Qwen2-7B --weight-format int4 --sym --ratio 1.0 --group-size 128 Qwen2-7B
[Screenshots: 1733297624039, 1733297866528, 1733297995752]

@helena-intel
Collaborator

Unfortunately, there is an issue with using NPU for LLMs on Ubuntu. The NPU team is working on it; the issue is not with OpenVINO, but on the kernel/driver level. I am sorry you're running into this. We will keep you informed.

@dmatveev
Contributor

dmatveev commented Dec 4, 2024

This is a 32GB MTL. The ticket was opened for TinyLlama, but the logs mention group-quantized Qwen2-7B, which is in a completely different league.

@Wan-Intel

Wan-Intel commented Dec 10, 2024

I also encountered the issue, and only when using NPU.
[Screenshot: npu failed]

I'll escalate the case to the relevant team and we'll provide an update as soon as possible.

@Wan-Intel Wan-Intel added the PSE label Dec 10, 2024
@helena-intel
Collaborator

@taikai-zz a new NPU driver was released today with a fix for LLM on LNL: https://github.com/intel/linux-npu-driver/releases/tag/v1.10.1 Could you check if that fixes the issue for you? We also had a new openvino-genai release this week, 2024.6, with performance improvements on NPU, so please upgrade with pip install --upgrade openvino-genai.

Also note that for running larger LLMs (>4B) you should use per-channel quantization. This note will be added to the docs too; I'm mentioning it here because I see you're using a 7B model. Instead of group-size 128, you should specify group-size -1 (note the minus sign). This is an example from the docs for Llama-2-7b:

optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset=wikitext2 Llama-2-7b-chat-hf
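The group-size rule above can be written down as a small helper that builds the export command. This is only a sketch: `export_command` is a hypothetical name, the parameter count is supplied by the caller, and the extra `--awq --scale-estimation --dataset` options from the Llama-2 example are left out for brevity.

```python
import shlex


def export_command(model_id, output_dir, params_billion):
    """Build an optimum-cli INT4 export command for NPU.

    Models above 4B parameters get per-channel quantization
    (--group-size -1); smaller ones keep --group-size 128.
    """
    group_size = "-1" if params_billion > 4 else "128"
    return [
        "optimum-cli", "export", "openvino",
        "-m", model_id,
        "--weight-format", "int4",
        "--sym",
        "--ratio", "1.0",
        "--group-size", group_size,
        output_dir,
    ]


# Print the commands for the two models discussed in this thread.
print(shlex.join(export_command("Qwen/Qwen2-7B", "Qwen2-7B", 7)))
print(shlex.join(export_command(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "TinyLlama-1.1B-Chat-v1.0", 1.1)))
```

The list form can be passed directly to `subprocess.run` if you prefer to launch the export from Python instead of copying the printed command.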

@taikai-zz
Author

Thank you for your help. It has now returned to normal, but the speed is too slow.
[Screenshot: 1735005386604]

There is an error in the document; please fix it.
[Screenshot: 1735004820289]

@helena-intel
Collaborator

I'm glad to hear the issue is fixed! For faster speed, please see the section on model caching in the document (the same one you screenshotted). That will reduce model loading time. Since model loading only happens once, it's also useful to measure inference time separately, by adding start = time.perf_counter() and end = time.perf_counter() before and after pipe.generate() and then showing the duration as print(end-start).
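A minimal sketch of that timing pattern follows. `timed` and `fake_generate` are hypothetical names; `fake_generate` only stands in for `pipe.generate` so the snippet runs without a model or an NPU.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    end = time.perf_counter()
    return result, end - start


# Stand-in for pipe.generate; replace with the real call on your machine.
def fake_generate(prompt, max_new_tokens=100, do_sample=False):
    time.sleep(0.01)
    return prompt.upper()


text, seconds = timed(fake_generate, "hello", max_new_tokens=16)
print(f"generate() took {seconds:.3f} s")
```

With a real pipeline this becomes `timed(pipe.generate, prompt, max_new_tokens=100, do_sample=False)`, which measures generation only and leaves model loading out of the number.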

Setting the group size to -1 enables channel-wise quantization; your screenshot is from the group quantization tab. Also note that I recommended this for larger models. For the 1.1B model, group quantization will work fine too. As I mentioned, this will be clarified in the docs.

@kmaki565

kmaki565 commented Jan 11, 2025

I am trying to run the chat sample with NPU on Windows.
The command-line option mentioned above must be --group-size, not --group_size. With this correction (and I needed pip install nncf) I can convert the model, but creating the pipeline on NPU fails (GPU worked well with the same model).
It looks like the app crashes in npu_level_zero_umd.dll.
I use the latest NPU driver 32.0.100.3104, and confirmed the same crash occurred with MTL and LNL.
[Screenshot]

@TolyaTalamanov
Collaborator

Hi @kmaki565! Could you please clarify the following:

  • What model are you trying to run? (HF tag)
  • How did you convert the model? (optimum-cli command line)
  • What are the versions of the following components: nncf, onnx, optimum-intel
  • What are the versions of openvino, openvino-tokenizers, openvino-genai

Thanks!
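If it helps, the versions requested above can be collected in one go with the standard library. This is just a convenience sketch (`collect_versions` is a hypothetical helper), and `pip freeze` gives the same information.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = [
    "nncf", "onnx", "optimum-intel",
    "openvino", "openvino-tokenizers", "openvino-genai",
]


def collect_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package in a pip-like format.
for name, ver in collect_versions().items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```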

@kmaki565

@TolyaTalamanov

  • TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0 (I followed this doc)
  • Output of pip freeze:
    pip.freeze.txt
  • App crash dump analysis:
    PythonChatSampleCrash.log

Thank you for your support.

8 participants