Feature/qwen3.5 export support#1638
Feature/qwen3.5 export support#1638jpatrickiles-dev wants to merge 3 commits intohuggingface:mainfrom
Conversation
transformers 5.x removed/moved several APIs that optimum-intel imports: - transformers.onnx module removed: inline ParameterFormat and compute_serialized_parameters_size (trivial enum + multiply) - transformers.utils.is_offline_mode moved to huggingface_hub.constants.is_offline_mode - AutoModelForVision2Seq renamed to AutoModelForImageTextToText - huggingface_hub.HfFolder removed: add shim delegating to get_token() All fixes use try/except fallbacks to maintain backward compatibility with transformers 4.x. Also adds missing Qwen3_5MoeModelPatcher import and registers qwen3_5, qwen3_5_text, and qwen3_5_moe model types for OpenVINO export. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement full OpenVINO export patcher for Qwen3.5 hybrid (GatedDeltaNet + Attention) models, enabling export of all Qwen3.5 variants: 0.8B, 4B, 9B dense and 35B-A3B MoE. Model patcher (model_patcher.py): - qwen3_5_gated_delta_net_forward: patched forward for GatedDeltaNet layers that eliminates dynamic control flow for OV tracing. Uses ov_causal_conv1d for unified prefill/decode conv handling, and patched_recurrent_gated_delta_rule + RecurrentAttentionCell for the recurrent attention computation. - Qwen3_5TextModelPatcher: handles cache decomposition (flat tensor list <-> per-layer conv_states/recurrent_states/key_cache/value_cache), GatedDeltaNet forward patching, and OV conversion extensions. - Qwen3_5MoeTextModelPatcher: extends above with MoE expert forward patching for the 35B-A3B model. Config registration (model_configs.py): - Register qwen3_5, qwen3_5_text, qwen3_5_moe, qwen3_5_moe_text as hybrid SSM model types inheriting from Qwen3NextOpenVINOConfig. Stateful transformation (stateful.py): - Normalize cache state dtypes to f32 before make_stateful to prevent bf16/f32 type mismatch in CPU plugin's Assign ops. - Convert bf16 logits output to f32 for openvino_genai compatibility. Other fixes: - Add Qwen3.5 model types to SSM_MODELS list for correct OVModelWithMambaForCausalLM class selection. - Handle AttributeError in _get_non_default_generation_parameters for Qwen3_5TextConfig which lacks this method. Tested: Qwen3.5-0.8B exports to INT4 and generates coherent text via openvino_genai.LLMPipeline on CPU. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The GPU plugin cannot handle 3D bf16 VariadicSplit/StridedSlice ops (no bfyx layout format available for rank-3 bf16 tensors). Fix: reshape the post-conv mixed_qkv tensor to 4D by inserting a dummy head dimension before slicing QKV components. The unsqueeze makes all slice operations 4D, which the GPU plugin handles correctly. The per-head reshape happens after slicing, preserving the correct contiguous [key_dim, key_dim, value_dim] layout from in_proj_qkv. Also adds _convert_bf16_to_f16 utility in stateful.py for future use with GPU devices that support f16 but not bf16 (Intel Arc Xe-LPG). Currently disabled as bf16→f16 conversion causes precision loss for Qwen3.5 models. The exported model runs correctly on CPU. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
We are already on this model enabling: #1634. |
|
No problem, my fork is working fine for me, so I'm happy. |
Did you have to edit Qwen3.5-27B/config.json at all? |
|
No, but I didn't test any models beyond 9B due to my own needs. There were
a number of problems, but I was able to correct everything and move on to
my next goal of implementing the NPU speculator pipeline.
…On Mon, Mar 16, 2026, 2:46 AM sund00bie ***@***.***> wrote:
*sund00bie* left a comment (huggingface/optimum-intel#1638)
<#1638 (comment)>
No problem, my fork is working fine for me, so I'm happy.
Did you have to edit Qwen3.5-27B/config.json at all?
—
Reply to this email directly, view it on GitHub
<#1638 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/B3X7JWBR37AAEDLXX6ZKWUL4Q6PKZAVCNFSM6AAAAACWSZWMNSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DANRVGQ2DSNJVGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
|
Hey @jpatrickiles-dev great work on this. Feel free to join us at openarc discord. I am also doing a lot work with openvino in my project, which you might find useful. Since you are deep diving NPU, I have an end to end from scratch reimplementation of qwen tts and qwen asr built into OpenArc, releasing soon. However I do not have NPU device and cannot optimize. My from scratch code has been easier to study outside of the library abstractions and if you are motivated could be a fantastic contribution/learning adventure. |
Title:
[OpenVINO] Add Qwen3.5 hybrid model export support with transformers 5.x compatibility
Architecture:
Qwen3.5 is a hybrid VLM architecture combining GatedDeltaNet (linear attention/SSM) layers (~75%) with full attention layers (~25%). All model sizes (0.8B–35B-A3B MoE) use this hybrid design with Qwen3_5DynamicCache storing both KV cache and recurrent/conv states.
Fixes:
Fixes transformers 5.x compatibility (removed transformers.onnx, is_offline_mode, HfFolder, AutoModelForVision2Seq)
Adds full Qwen3.5 GatedDeltaNet hybrid model export support (0.8B, 4B, 9B, 35B-A3B MoE)
Implements Qwen3_5TextModelPatcher and Qwen3_5MoeTextModelPatcher with static GatedDeltaNet forward patching and hybrid cache handling
Fixes GPU VariadicSplit layout error via 4D reshape in the patcher
Tickets:
N/A (no existing issue — this is new feature + compatibility fix)
Tested on:
Intel Arc Xe-LPG (Meteor Lake), Ubuntu 24.04, transformers 5.3.0, OpenVINO 2026.1. Validated Qwen3.5 0.8B/4B/9B export and inference on both CPU and GPU.
AI Assistance:
yes — Claude used for architecture analysis and patcher implementation. Human validation: exported and ran inference on all model sizes, verified coherent output on CPU and GPU.