[OpenVINO] support ai-sage/GigaChat3-10B-A1.8B-bf16 #1626
Mohamed-Ashraf273 wants to merge 24 commits into huggingface:main from
Conversation
Hi @popovaan,

Thanks for the PR! Please add tests for this model. For now, use a locally generated tiny model. I'm currently investigating whether we're allowed to invite GSoC contributors to the

Got it, thanks!
Hi @popovaan, @rkazants,
rkazants left a comment:
Please also add export tests: the same test set that you added for the previous model.
Update documentation.
Pull request overview
This PR aims to add OpenVINO export/inference support coverage for the ai-sage/GigaChat3-10B-A1.8B-bf16 family by extending OpenVINO test fixtures and adjusting DeepSeek patching logic used during export.
Changes:
- Add a `gigachat3` tiny-random model fixture and include it in OpenVINO decoder integration coverage.
- Update decoder tests for `gigachat3` (expected SDPA count, relaxed logits tolerance, and skip conditions for incompatible Transformers versions).
- Refactor DeepSeek attention patching to use a versioned factory function and extend MoE patching to handle MLP blocks exposing `experts` but not `moe_infer`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `tests/openvino/utils_tests.py` | Adds the `gigachat3` test model mapping; adjusts which models are treated as remote-code in tests. |
| `tests/openvino/test_decoder.py` | Adds `gigachat3` to tested architectures and config expectations; tweaks tolerance/skip logic; adds debug output. |
| `optimum/exporters/openvino/model_patcher.py` | Updates the DeepSeek patcher to use a unified attention forward factory and broadens MoE patching behavior. |
@rkazants
Hi @popovaan, I've finished adding the tests and temporarily published the tiny model. Would it be possible to invite me to the group so I can publish it there, or would you prefer to handle the publishing? Please let me know if any changes are needed.
Hi. Can I help test the model?

Hi!
Thank you, @Mohamed-Ashraf273 ❤️ Working with CPU.
Not working with GPU yet. It loads endlessly on GPU, but it works on CPU in the OpenArc tool (OVGenAI engine).
Hi @savvadesogle, I ran a demo test with a tiny GigaChat3 model on GPU and it worked correctly. I was able to successfully:
All steps completed without issues and the GPU execution finished successfully. From your description, it sounds like the GPU loading/compilation for the full model may simply require more time and RAM. The tiny model finishes quickly, but the real GigaChat3 model is significantly larger, so it would be expected that:
Since the same pipeline works correctly with the tiny model on GPU, the real model should also work; it may just need more time and memory for the GPU compilation step. For reference, here is the script I used for the GPU test:

```python
import torch
from transformers import AutoTokenizer
from optimum.intel.openvino import OVModelForCausalLM
import openvino as ov

# ── 0. Check available devices ────────────────────────────────────────────────
core = ov.Core()
print("Available devices:", core.available_devices)
assert "GPU" in " ".join(core.available_devices), "No Intel GPU found!"

MODEL_DIR = "./tiny-random-gigachat3"

# ── 1. Export (CPU export, then load on GPU) ──────────────────────────────────
print("\n[1] Exporting tiny-random-gigachat3 to OpenVINO...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
ov_model = OVModelForCausalLM.from_pretrained(
    MODEL_DIR,
    export=True,
    device="GPU",  # compile directly on GPU after export
)
print("  Export + GPU compile: OK")

# ── 2. Basic forward pass ─────────────────────────────────────────────────────
print("\n[2] Running forward pass on GPU...")
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = ov_model(**inputs)
logits = outputs.logits
print(f"  Logits shape : {logits.shape}")
print(f"  Logits dtype : {logits.dtype}")
print(f"  Logits sample: {logits[0, -1, :5].tolist()}")
assert logits.shape[0] == 1, "Batch size mismatch"
print("  Forward pass : OK")

# ── 3. Generation ─────────────────────────────────────────────────────────────
print("\n[3] Running generate() on GPU...")
ov_model.generation_config.eos_token_id = None  # avoid early stop on tiny model
output_ids = ov_model.generate(**inputs, max_new_tokens=10, do_sample=False)
decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(f"  Output : {decoded!r}")
assert output_ids.shape[1] > inputs["input_ids"].shape[1], "No tokens generated"
print("  Generate : OK")

# ── 4. Batch generation ───────────────────────────────────────────────────────
print("\n[4] Running batched generate() on GPU...")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
prompts = ["Hello world", "The sky is"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
output_ids = ov_model.generate(**batch, max_new_tokens=5, do_sample=False)
for i, ids in enumerate(output_ids):
    print(f"  Batch[{i}]: {tokenizer.decode(ids, skip_special_tokens=True)!r}")
print("  Batched generate: OK")

print("\n✅ All GPU tests passed!")
```

Output:
I didn't expect the process to take so long; I'll have to wait and see. The conversion itself is very quick, up to 3 minutes for a regular int4 conversion without any additional parameters. I'll definitely give it a try. I have 128 GB of RAM, so that should be enough. Other models load much faster on the GPU. I'll try waiting longer.
Hi @popovaan, @rkazants, @IlyasMoutawwakil, I've fixed the remaining issues. Could you please take a look when you have time?

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hi @popovaan, @rkazants, @IlyasMoutawwakil, all tests are now passing. I'd really appreciate it if you could take a final look. Thanks!
```python
if orig_transformers_version is not None:
    import json as _json
    from pathlib import Path as _Path

    gen_cfg_path = _Path(output) / "generation_config.json"
    if gen_cfg_path.exists():
        with open(gen_cfg_path, "r", encoding="utf-8") as _f:
            _cfg = _json.load(_f)
        if _cfg.get("transformers_version") != orig_transformers_version:
            _cfg["transformers_version"] = orig_transformers_version
            with open(gen_cfg_path, "w", encoding="utf-8") as _f:
                _json.dump(_cfg, _f, indent=2)
```
Can we avoid this change? This modifies the common code for all models, which is undesirable.
Thanks for your feedback!
I reverted the changes and set `gen_config.do_sample = False` specifically for deepseek in `test_decoder.py`.
Could you please take another look and let me know if anything else should be adjusted?
Thanks!
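The test-side change can be sketched like this (a hypothetical stand-in, not the PR's exact diff; `GenConfig` here mimics `transformers.GenerationConfig`):

```python
from dataclasses import dataclass


# Stand-in for transformers.GenerationConfig, just for illustration.
@dataclass
class GenConfig:
    do_sample: bool = True


def configure_for_test(model_arch: str, gen_config: GenConfig) -> GenConfig:
    # Instead of rewriting generation_config.json in the shared export code,
    # force greedy decoding for the deepseek architecture only, inside the test.
    if model_arch == "deepseek":
        gen_config.do_sample = False  # deterministic generation for comparison
    return gen_config


cfg = configure_for_test("deepseek", GenConfig())
print(cfg.do_sample)  # → False
```

This keeps the workaround scoped to the one architecture that needs it, leaving the common export path untouched.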
Hi @popovaan, @rkazants, @IlyasMoutawwakil, I'd really appreciate it if you could take a look. Thanks!
```python
expert_mask = torch.nn.functional.one_hot(topk_indices, num_classes=len(self.experts))
expert_mask = expert_mask.permute(2, 0, 1)

for expert_idx in range(len(self.experts)):
```
that's kinda inefficient, especially during decoding. do we replace this with some optimized MoE operator in OpenVINO later?
Correct, this is inefficient for decoding. The current implementation intentionally runs all experts to avoid the data-dependent control flow in the original MoE (skipping experts with no tokens), which breaks torch.jit.trace required for OpenVINO export. So this change mainly serves as a temporary tracing workaround to produce a static, exportable graph.
Yes, the plan is to replace this with a custom OpenVINO MoE operator (similar to convert_recurrent_attention_cell used in Qwen3Next) so we can restore sparse execution and avoid loading all expert weights during decoding. This PR just unblocks model export for now, and the optimized operator is intended as a follow-up improvement.
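The tracing constraint can be illustrated with a toy MoE (a minimal sketch with assumed shapes, not the PR's code): because every expert runs on every token, there is no data-dependent branching, and `torch.jit.trace` records one static graph that is valid for any routing decision.

```python
import torch
import torch.nn as nn


class TraceableMoE(nn.Module):
    """Toy MoE that runs ALL experts so torch.jit.trace sees a static graph."""

    def __init__(self, hidden=8, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))
        self.gate = nn.Linear(hidden, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, hidden)
        scores = self.gate(x).softmax(dim=-1)
        topk_w, topk_idx = scores.topk(self.top_k, dim=-1)
        # one-hot routing mask, permuted to (num_experts, tokens, top_k)
        mask = nn.functional.one_hot(topk_idx, num_classes=len(self.experts)).permute(2, 0, 1)
        out = torch.zeros_like(x)
        # Run every expert unconditionally: no "skip experts with no tokens"
        # branch, so the traced graph is identical for all inputs.
        for e, expert in enumerate(self.experts):
            weight = (mask[e] * topk_w).sum(dim=-1, keepdim=True)  # (tokens, 1)
            out = out + weight * expert(x)
        return out


moe = TraceableMoE().eval()
x = torch.randn(5, 8)
traced = torch.jit.trace(moe, x)  # succeeds: the graph is input-independent
print(traced(x).shape)  # torch.Size([5, 8])
```

The price is the inefficiency noted above: all expert weights participate in every step, which is exactly what the planned custom MoE operator would avoid.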
then maybe minimize and standardize the experts forward at least; in some models we use the batching trick to make experts exportable with a minimal graph layout (the loop results in very long graphs), see http://github.com/huggingface/optimum-intel/blob/439b6319368c1667f3119ef508812ef167b0fef5/optimum/exporters/openvino/model_patcher.py#L7562 @rkazants
Thanks for the suggestion!
I refactored the DeepSeek MoE patch to follow the same batching pattern used in the AFMoE implementation. The expert projections are now pre-stacked in the patcher, and the forward pass uses vectorized bmm operations instead of looping over experts, which helps keep the exported graph compact.
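The batching pattern can be sketched as follows (assumed shapes and a SwiGLU-style expert, not the PR's exact code): per-expert weight matrices are pre-stacked along a leading expert dimension, and every expert processes every token via `torch.bmm`, so the exported graph contains a few large matmuls instead of a long unrolled per-expert loop.

```python
import torch


def batched_experts_forward(x, w_gate, w_up, w_down, routing_weights):
    # x:               (tokens, hidden)
    # w_gate, w_up:    (num_experts, hidden, intermediate)  pre-stacked weights
    # w_down:          (num_experts, intermediate, hidden)
    # routing_weights: (num_experts, tokens), zero where a token is not routed
    num_experts = w_gate.shape[0]
    xe = x.unsqueeze(0).expand(num_experts, -1, -1)        # (E, tokens, hidden)
    # SwiGLU-style expert MLP, computed for all experts in three bmm calls
    hidden = torch.nn.functional.silu(torch.bmm(xe, w_gate)) * torch.bmm(xe, w_up)
    out = torch.bmm(hidden, w_down)                        # (E, tokens, hidden)
    # Scale each expert's output by its routing coefficient, then reduce.
    return (routing_weights.unsqueeze(-1) * out).sum(dim=0)


x = torch.randn(5, 8)
w_gate = torch.randn(3, 8, 16)
w_up = torch.randn(3, 8, 16)
w_down = torch.randn(3, 16, 8)
routing = torch.softmax(torch.randn(3, 5), dim=0)
y = batched_experts_forward(x, w_gate, w_up, w_down, routing)
print(y.shape)  # torch.Size([5, 8])
```

Compared with the per-expert loop, the graph size is constant in the number of experts, at the cost of computing every expert densely.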
IlyasMoutawwakil left a comment:
lgtm! Approved to merge if all tests are passing. However, we still don't solve the real perf issue in exported MoEs, which is having an efficient implementation of either:
- a `torch.grouped_mm` operator
- an entire MoE operator
Could you please locally run the OpenVINO GenAI WhoWhatBenchmark tool to check the accuracy of the full model (not the tiny one) and share the results? Here is the instruction: https://github.com/openvinotoolkit/openvino.genai/blob/master/tools/who_what_benchmark/README.md#compare-text-generation-models-llms
Hello Anastasiia @popovaan. Sorry, I won't be able to run the full model (BF16 converted to full OpenVINO) on the GPU; I only have 16 GB of memory, and I'm afraid it won't all fit on a single GPU. And there are some issues with HETERO (openvinotoolkit/openvino#33012 (comment)) when using two or more GPUs. Or is it enough to test the converted models in int8 and int4?
Thanks for the suggestion! I'm running the WhoWhatBenchmark tool locally now to check the full model's accuracy. I will share the results once it's done.
What does this PR do?
Conversion cmd-line for ai-sage/GigaChat3-10B-A1.8B-bf16:

```
optimum-cli export openvino -m ai-sage/GigaChat3-10B-A1.8B-bf16 ./output_dir --task text-generation-with-past
```

Inference of `ai-sage/GigaChat3-10B-A1.8B-bf16` using OpenVINO backend:

Solving Issue: #1608
Before submitting